# Internet Subscribers in the 30 OECD Countries (2000-2003)

 Please do not cite work from this wiki, since these are mainly student theses, which may contain errors!

## Introduction

The data for this work come from the Telecommunications Database 2005 for the thirty OECD countries. The paper analyzes yearly data on the different types of Internet subscribers between 2000 and 2003 using SPSS 17.0 and methods discussed in the course “Computer Aided Statistics” (WS 09/10) at Humboldt University in Berlin. The investigated variables are Internet Subscribers, ISDN Subscribers Primary Rate, ISDN Subscribers Basic Rate, Cable Modem Subscribers, Dial-up Internet Subscribers, DSL Lines and Other Broadband Access Technologies to Internet.

Explanation of the variables:

• The number of Internet Subscribers is described as “[…] the number of active registered Internet accounts including all fixed network access technologies to internet.” (Statistics OECD) Mobile phone access to the Internet is excluded.
• Integrated Services Digital Network (ISDN) allows simultaneous digital transmission of voice, video, data and other network services over telephone wires. There are two types of ISDN service: the Basic Rate Interface, which is intended for homes and small enterprises, and the Primary Rate Interface, which is aimed at larger users.
• Cable modems provide broadband Internet access by using the high bandwidth of a cable television network.
• Digital Subscriber Line (DSL) is a technology that allows digital data transmission over a local telephone network. The DSL service is provided alongside the regular telephone service on the same telephone line. As the Telecommunications Database 2005 gives no additional information about the number of DSL Lines, it is assumed that the variable covers all telephone as well as Internet subscribers that use DSL lines. A DSL subscriber can use one and the same DSL line only for Internet access, only for telephony, or for both. As DSL was one of the most widely used technologies for Internet access in most OECD countries between 2000 and 2003, the number of DSL Lines is included in the investigation.
• Dial-up Internet is a form of Internet access that uses a telephone line. In contrast to DSL, Internet subscribers can easily be distinguished from telephone subscribers here.
• "Other" broadband technologies are types of Internet access that include “[…] satellite broadband Internet, fibre-to-the-home Internet access, ethernet LANs, and fixed wireless subscribers (at downstream speeds greater than 256 kbps)”. (Statistics OECD)

In the first section of the work, the variables are analyzed for outliers. Box plots are used to identify extreme values. Normality tests (the Kolmogorov-Smirnov Test and the Shapiro-Wilk Test) are then conducted on the variables. Finally, the Grubbs-Test is used to check whether the extreme values identified for the normally distributed variables are outliers. In the second section, linear regression models are built. For this purpose, the dependencies between the chosen dependent variable Internet Subscribers and the other variables are examined with the help of scatter plots. Furthermore, the bivariate correlations between all variables are calculated and analyzed. Finally, the linear regression results for each year are discussed.

It should be noted that this work considers the whole population (all thirty OECD countries) rather than a random sample. Thus, the test results cannot be interpreted as inference about a larger population.

## Analysis

The first step of the analysis is the computation of new variables, with the goal of making the existing variables comparable to each other. For this purpose, the number of Internet subscribers of each type (or of user lines in the case of the variable “DSL Lines”) is divided by the population of the relevant country and year, and the result is multiplied by 10 000. This yields the number of subscribers per 10 000 inhabitants for each type of Internet access and for the total number of Internet subscribers. The population of each country in each year is also included in the investigated data set. All variables included in the analysis are metric and continuous.
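The normalization described above can be sketched in a few lines of Python. This is only an illustration: the function name `per_10000` and the example figures are hypothetical and are not taken from the OECD data set or from the SPSS workflow used in this work.

```python
def per_10000(subscribers: float, population: float) -> float:
    """Return the number of subscribers per 10 000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return subscribers / population * 10_000


# Hypothetical country-year: 1.25 million subscribers, 10 million inhabitants.
example_rate = per_10000(subscribers=1_250_000, population=10_000_000)
print(example_rate)
```

The same transformation would be applied to every subscriber variable, country and year before any comparison is made.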

### Outlier Identification

#### Box Plot

The figures below show the box plots for each of the seven investigated variables between 2000 and 2003, which are used to identify potential outliers.
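How a box plot flags individual points can be sketched as follows. The sketch assumes the common convention that values more than 1.5 box lengths (interquartile ranges) beyond the box are marked as outliers and values more than 3 box lengths beyond it as extremes. The quartile method (`method="inclusive"`) is an assumption and may differ from the one SPSS uses, and the data are hypothetical.

```python
import statistics


def boxplot_flags(values, inner=1.5, outer=3.0):
    """Split values lying beyond the box into outliers and extremes."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1  # interquartile range
    outliers, extremes = [], []
    for v in values:
        # Distance of v from the nearest edge of the box (negative if inside).
        distance = max(q1 - v, v - q3)
        if distance > outer * box_length:
            extremes.append(v)
        elif distance > inner * box_length:
            outliers.append(v)
    return outliers, extremes


# Hypothetical subscriber rates per 10 000 inhabitants.
rates = [800, 950, 1000, 1100, 1200, 1300, 1400, 2100, 5000]
outliers, extremes = boxplot_flags(rates)
print(outliers, extremes)
```

A point flagged in this way is only a candidate; whether it is treated as an outlier depends on a subsequent formal test.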

The box plots for Internet Subscribers for the years 2002 and 2003 show two extreme values, both of which belong to Portugal. According to the information accompanying the data, the number of dial-up customers for Portugal might be overestimated for these years. As the number of Internet Subscribers includes the number of dial-up customers, this might lead to an overestimation of the number of Internet Subscribers as well. This assumption is supported by the extreme value for Portugal in the 2003 box plot for Dial-up Internet Subscribers. The box plots for Internet Subscribers also show that the median increases steadily over the four years. The median is approximately 1000 Internet subscribers per 10 000 inhabitants in 2000 and approximately 2000 in 2003, i.e. twice as many. In 2000, the middle 50 percent of the data for Internet Subscribers lies between 300 and 2000 subscribers per 10 000 inhabitants. In 2003, the middle 50 percent lies between 1500 and 3000 subscribers per 10 000 inhabitants. This means that the number of Internet subscribers has increased considerably in many OECD countries.

The box plots for ISDN Subscribers Primary Rate do not show any extreme values, in contrast to the box plots for ISDN Subscribers Basic Rate, where extreme values are identified for each of the observed years. In all four years (2000 to 2003) the extreme values belong to Norway, and in 2003 there is one more extreme value for Luxembourg. According to the website of Telenor, a leading provider of telecommunication services in Norway, “Internet services, such as ISDN and ADSL were first introduced in Norway through the upgraded telephone cables.” (http://www.telenor.com/en/about-us/our-history/norwegian-history/p/2009-2005) That might be a reason why the number of ISDN Subscribers Basic Rate per 10 000 inhabitants in Norway is so high during these years.

The box plots for Cable Modem Subscribers show two extreme values for each of the years 2000 to 2002, which belong to Canada and Korea, and one for 2003, which belongs to Finland. According to a comparative analysis of broadband Internet access in OECD countries, “Cable companies […] have been leaders in introducing broadband access services […]” (Sherille Ismail, Irene Wu (2003)) between 2000 and 2002 in markets such as Canada and Korea. This might explain the high number of Cable Modem Subscribers in Canada and Korea during this period.

The extreme values in the box plots for DSL Lines for the years 2000, 2001 and 2002 belong to Korea. According to an Internet Case Study on Korea, “Korea leads the world in broadband Internet access penetration. At December 2002, Korea’s penetration of Digital Subscriber Line (DSL) and cable modem Internet access was first in the world […]” (International Telecommunication Union (2003)) In 2000, one more extreme value, belonging to Canada, can be identified in the figure. As pointed out in the comparative analysis of broadband Internet access in OECD countries mentioned above, incumbent telecommunications carriers in Canada competed strongly with cable modem services by introducing DSL services. This might have resulted in the high number of DSL Lines in Canada in 2000.

The box plots for Other Broadband Access Technologies to Internet show two extreme values for the years 2001, 2002 and 2003, which belong to Sweden and Korea. According to the comparative analysis of broadband Internet access in OECD countries, the leading technology for broadband access in Sweden in 2000 was neither DSL nor cable modem but Ethernet LANs. The high value for Korea can be explained by the fact that “Korea’s international Internet connectivity has expanded tremendously and stood at 5.2 Gbps at December 2001 […] It has benefited from its proximity to the sea and hence fibre-optic submarine cables […]” (International Telecommunication Union (2003)), as pointed out in the Internet Case Study on Korea mentioned above. Both Ethernet LANs and broadband network architectures based on optical fibre belong to the Other Broadband Access Technologies according to the description of the variable given in this work.

To investigate whether the extreme values in the box plots are outliers, an appropriate outlier test is applied. A prerequisite for such outlier tests is that the investigated variables are normally distributed. The distribution of the variables is therefore checked in the next section.

#### Normality Tests

The nonparametric Kolmogorov-Smirnov Test and the Shapiro-Wilk Test are used here to assess the goodness of fit to the normal distribution. The Kolmogorov-Smirnov Test checks the null hypothesis that the empirical distribution function equals the theoretical normal distribution function. For this purpose, the K-S test measures the maximum distance between these two distribution functions. The K-S test is conducted here at the predetermined 5% significance level. If the asymptotic significance is > 0.05, the null hypothesis is not rejected. If the asymptotic significance is < 0.05, the null hypothesis is rejected in favour of the alternative hypothesis (the observed distribution does not correspond to the normal distribution).
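The "maximum distance" measured by the K-S test can be illustrated with a short Python sketch. It computes only the test statistic D against a normal distribution whose mean and standard deviation are estimated from the sample. It does not compute a significance value, which depends on the correction a statistics package applies when the parameters are estimated. The function name and the two small samples are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x,
        # so both sides of the step are compared with the normal CDF.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


symmetric_sample = [-2, -1, -0.5, 0, 0.5, 1, 2]
skewed_sample = [1, 1, 1, 1, 1, 1, 100]
print(ks_statistic_normal(symmetric_sample))
print(ks_statistic_normal(skewed_sample))
```

The strongly skewed sample produces a much larger D than the symmetric one, which is the kind of deviation that leads to a rejection of the null hypothesis.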

The Shapiro-Wilk Test takes into account the symmetrically positioned data values and measures their distance around the middle value. This test provides better results when the sample size is small. The observed significance is compared to the predetermined significance level of 0.05. If the observed significance is > 0.05, the null hypothesis is not rejected, which means that the empirical distribution can be assumed to be normal. If the observed significance is < 0.05, the null hypothesis is rejected.

The results of the K-S test and the Shapiro-Wilk test are shown in the tables below. According to them, Internet Subscribers and ISDN Subscribers Primary Rate are normally distributed for the years 2000, 2001, 2002 and 2003. For ISDN Subscribers Basic Rate and Cable Modem Subscribers the null hypothesis is rejected for all years. Dial-up Internet Subscribers is normally distributed according to both tests for the years 2000, 2001 and 2002, and according to the K-S test only for 2003. As the Shapiro-Wilk test is more suitable for a small sample (n<50) and rejects the null hypothesis for 2003, it is assumed that Dial-up Internet Subscribers is not normally distributed in 2003. For DSL Lines the null hypothesis is rejected by both tests for the years 2000, 2001 and 2002. The Shapiro-Wilk test also rejects the null hypothesis for 2003, contrary to the K-S test. As with Dial-up Internet Subscribers in 2003, it is therefore assumed that DSL Lines is not normally distributed in 2003. For Other Broadband Access Technologies to Internet the null hypothesis is rejected by both tests for the years 2001, 2002 and 2003. For 2000 the Shapiro-Wilk test does not reject the null hypothesis.

Tables (normality tests): Internet Subscribers; ISDN Primary Rate; ISDN Basic Rate; Cable Modem Subscribers; Dial-up Internet Subscribers; DSL Lines; Other Broadband Access Technologies to Internet.

#### Outlier Test

Here, the Grubbs-Test is used to check whether the smallest or the largest observed value is an outlier. The test assumes that the population is normally distributed. According to the normality test results, an outlier test can only be conducted for Internet Subscribers for 2002 and 2003. The other variables are not normally distributed and/or show no extreme values in their box plots. The test statistic T is the difference between the greatest value and the mean, divided by the standard deviation: $T = (x_{\max} - \bar{x})/s$. If T exceeds the critical value of the Grubbs-Test, the null hypothesis (the greatest value is not an outlier) is rejected in favour of the alternative hypothesis (the greatest value is an outlier).

For Internet Subscribers in 2002, the Grubbs-Test is conducted as follows: $T = (4981.56 - 2117.4921)/1296.94217 = 2.208$. The critical value of the Grubbs-Test is T(30;0.05) = 2.75 > T = 2.208. The null hypothesis is not rejected, which means that the greatest value is not an outlier.

For Internet Subscribers in 2003, the Grubbs-Test is conducted as follows: $T = (6904.91 - 2467.1945)/1509.82078 = 2.9392$. The critical value of the Grubbs-Test is T(30;0.05) = 2.75 < T = 2.9392. The null hypothesis is rejected, which means that the greatest value is an outlier. As mentioned above, this outlier might result from an overestimation of the number of Dial-up Internet subscribers in Portugal, which is included in the total number of Internet Subscribers in the country. For further data analysis, this value should be excluded, corrected, or replaced by an approximation in order to avoid distortions.
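The two Grubbs computations above can be reproduced with a few lines of Python. The maxima, means and standard deviations are the values quoted in the text, and the critical value 2.75 is likewise taken from the text rather than computed. The function and variable names are illustrative only.

```python
def grubbs_statistic(extreme_value, mean, std_dev):
    """Distance of the extreme value from the mean in standard deviations."""
    return abs(extreme_value - mean) / std_dev


CRITICAL_VALUE = 2.75  # T(30; 0.05) as quoted in the text

# year: (largest value, mean, standard deviation), all per 10 000 inhabitants
cases = {
    2002: (4981.56, 2117.4921, 1296.94217),
    2003: (6904.91, 2467.1945, 1509.82078),
}

results = {}
for year, (x_max, mean, std_dev) in cases.items():
    t = grubbs_statistic(x_max, mean, std_dev)
    is_outlier = t > CRITICAL_VALUE
    results[year] = (t, is_outlier)
    print(f"{year}: T = {t:.4f}, outlier: {is_outlier}")
```

Only the 2003 value exceeds the critical value, in line with the conclusion drawn above.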

### Linear Regression

First, it is determined which of the variables ISDN Subscribers Primary Rate, ISDN Subscribers Basic Rate, Cable Modem Subscribers, Dial-up Internet Subscribers, DSL Lines and Other Broadband Access Technologies to Internet (independent variables) are correlated with Internet Subscribers (dependent variable). Moreover, the variables are checked for multicollinearity. Second, a linear regression analysis is carried out for each year in order to quantify the strength of the relationship between the dependent variable and the selected independent variables. However, the regression models cannot be used here to predict values of the dependent variable, as the whole population and not a sample is considered.

#### Dependencies

Before a linear regression is conducted, the dependencies between Internet Subscribers and the variables ISDN Subscribers Primary Rate, ISDN Subscribers Basic Rate, Cable Modem Subscribers, Dial-up Internet Subscribers, DSL Lines and Other Broadband Access Technologies to Internet must be checked. The goal is to identify the independent variables that will later be included in the linear regression models. Scatter plots, which display the relationship between two variables, are used for this purpose. The dependent variable Internet Subscribers is on the X-axis and the respective independent variable is on the Y-axis. The scatter plots show a clear, strong, positive linear dependency between Internet Subscribers and Dial-up Internet Subscribers for 2000, 2001, 2002 and 2003. The dependency between Other Broadband Access Technologies to Internet and Internet Subscribers is almost zero. The relationship between Cable Modem Subscribers and Internet Subscribers is weakly linear and positive for 2001 and 2002. The dependency between Internet Subscribers and ISDN Subscribers Primary Rate is also positive, but not as strongly linear as that between Dial-up Internet Subscribers and Internet Subscribers. The scatter plots for Internet Subscribers and DSL Lines show that in 2000 and 2001 the values for DSL Lines hardly change as Internet Subscribers varies. For 2002 and 2003 a positive, but not strongly linear, dependency between Internet Subscribers and DSL Lines can be observed.

Scatter plots: DSL Lines and ISDN Subscribers Basic Rate (2000 and 2001; 2002 and 2003); ISDN Subscribers Primary Rate and Dial-up Internet Subscribers (2000 and 2001; 2002 and 2003); Cable Modem Subscribers and Other Broadband Access Technologies (2000 and 2001; 2002 and 2003).

The correlation analysis is used to test whether an X-variable is significantly related to the Y-variable. Moreover, it shows to what extent multicollinearity exists between the X-variables, which affects the estimation of the regression parameters. Multicollinearity means that these variables no longer vary independently of each other. With increasing multicollinearity, the estimates of the regression coefficients become unreliable. Multicollinearity can be reduced, among other ways, by eliminating variables or by transforming them. The figures below show the calculated Pearson correlations for each year.
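A Pearson correlation and a simple pairwise check among the independent variables can be sketched as follows. The data are hypothetical, and the threshold of 0.8 is an assumption made for illustration only: the analysis in this work judges correlations by their significance in the SPSS correlation tables, not by a fixed cut-off.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def strongly_correlated_pairs(columns, threshold=0.8):
    """Return pairs of variables whose |r| reaches the threshold."""
    flagged = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson_r(columns[name_a], columns[name_b])
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical values for three independent variables in five countries.
predictors = {
    "dial_up": [1, 2, 3, 4, 5],
    "isdn_primary": [2, 4, 6, 8, 11],
    "cable_modem": [5, 1, 4, 2, 3],
}
flagged = strongly_correlated_pairs(predictors)
print(flagged)
```

In this toy example only the pair `dial_up` and `isdn_primary` is flagged, so one of the two would be a candidate for exclusion from a joint regression model.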

According to the correlation tables for 2000 and 2001, ISDN Subscribers Primary Rate and Dial-up Internet Subscribers are significantly correlated with Internet Subscribers at the 0.01 level, and ISDN Subscribers Basic Rate is significantly correlated with Internet Subscribers at the 0.05 level. However, significant multicollinearity exists between ISDN Subscribers Primary Rate and ISDN Subscribers Basic Rate, between ISDN Subscribers Primary Rate and Dial-up Internet Subscribers, and between ISDN Subscribers Basic Rate and Dial-up Internet Subscribers. The scatter plot for Internet Subscribers and Dial-up Internet Subscribers, as well as the correlation tables, show a strong linear dependency between these two variables. Thus, only Dial-up Internet Subscribers is included as an independent variable in the linear regressions for 2000 and 2001. Although ISDN Subscribers Primary Rate and ISDN Subscribers Basic Rate are both significantly correlated with Internet Subscribers, including them in the linear regressions might lead to unreliable estimates of the regression coefficients.

According to the correlation table for 2002, ISDN Subscribers Primary Rate and Dial-up Internet Subscribers are significantly correlated with Internet Subscribers at the 0.01 level, and Cable Modem Subscribers is significantly correlated at the 0.05 level. However, significant multicollinearity is again observed between ISDN Subscribers Primary Rate and Dial-up Internet Subscribers. As the correlation between Dial-up Internet Subscribers and Internet Subscribers (0.930) is much higher than the correlation between ISDN Subscribers Primary Rate and Internet Subscribers (0.570), only Dial-up Internet Subscribers and Cable Modem Subscribers are included in the linear regression for 2002.

According to the correlation table for 2003, ISDN Subscribers Primary Rate and DSL Lines are significantly correlated with Internet Subscribers at the 0.05 level, and Dial-up Internet Subscribers is significantly correlated with Internet Subscribers at the 0.01 level. Significant multicollinearity is again observed between ISDN Subscribers Primary Rate and Dial-up Internet Subscribers. As Dial-up Internet Subscribers and Internet Subscribers show a much higher correlation (0.923) than ISDN Subscribers Primary Rate and Internet Subscribers (0.499), ISDN Subscribers Primary Rate is excluded from the linear regression for 2003 in order to avoid unreliable estimates of the regression coefficients. Dial-up Internet Subscribers and DSL Lines are included as independent variables in the same linear regression.

#### Linear Regression Models

For the linear regressions the method “Enter” is used. For 2000, the regression model includes one independent variable: Dial-up Internet Subscribers. In the ANOVA table the significance of the regression is 0.00 and thus < 0.05, which means that the null hypothesis ($H_{0}$: $R^2$=0) is rejected. The coefficient of determination $R^2$ indicates how well the regression model explains y (Internet Subscribers). It takes values between 0 and 1, where 1 means that the model predicts y perfectly. For 2000 the results show that $R^2$=0.901. The adjusted $R^2$ takes the number of independent variables into account and thus allows a better comparison between regression models with different numbers of independent variables. Here the adjusted $R^2$=0.898. The coefficients table shows that the Dial-up Internet Subscribers coefficient has a significant influence on Internet Subscribers, as Sig. < 0.05. The Dial-up Internet Subscribers regression coefficient is equal to 0.999 and is the change in y (Internet Subscribers) per unit change in the predictor. Standardized regression coefficients are obtained after standardizing the variables to a mean of zero and a standard deviation of 1, which makes the coefficients of different independent variables comparable to each other. Here the Dial-up Internet Subscribers standardized coefficient is 0.949. However, as this regression model contains only one independent variable, no comparison with other regression coefficients is made.

One of the assumptions of linear regression is that the residuals are normally distributed. A histogram and a P-P plot are used to check whether the standardized residuals are normally distributed. A histogram is a graphical representation of the frequency distribution, and the P-P plot graphically checks whether the empirical distribution of a continuous variable corresponds to the normal distribution. In the P-P plot, the empirical distribution function is plotted on the abscissa against the distribution function of the normal distribution on the ordinate. For a better assessment of the distribution of the standardized residuals, the K-S test and the Shapiro-Wilk test are used. According to the results, the standardized residuals are not normally distributed (at the predetermined significance level of 0.05). Thus, this assumption of linear regression is not fulfilled.
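The quantities discussed for the 2000 model (regression coefficient, $R^2$ and standardized coefficient) can be illustrated for a regression with one independent variable. The following Python sketch uses ordinary least squares with an intercept on hypothetical data. It is not the SPSS procedure used in this work, and all names are illustrative.

```python
from math import sqrt


def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x with summary measures."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    r_squared = 1 - ss_res / syy
    # Standardized coefficient: slope rescaled by the spread of x and y.
    beta_std = slope * sqrt(sxx / syy)
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_squared,
        "beta_std": beta_std,
    }


# Hypothetical dial-up and total subscriber rates per 10 000 inhabitants.
dial_up = [100, 200, 300, 400, 500]
internet = [130, 210, 330, 390, 520]
fit = simple_ols(dial_up, internet)
print(fit)
```

With a single independent variable, the square of the standardized coefficient equals $R^2$. The values reported for 2000 agree with this up to rounding, since 0.949 squared is approximately 0.901.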

For the 2001 linear regression model, Dial-up Internet Subscribers is again the only independent variable included. The null hypothesis $H_{0}$: $R^2$=0 and the null hypothesis $H_{0}$: the Dial-up Internet Subscribers regression coefficient is equal to 0 are both rejected. The tables for 2001 provide the following results: $R^2$=0.888, adjusted $R^2$=0.884, the Dial-up Internet Subscribers regression coefficient is equal to 1.030 and the standardized regression coefficient is equal to 0.994. According to the normality tests, the standardized residuals for 2001 are not normally distributed.

In the regression model for 2002, one more independent variable is included in addition to Dial-up Internet Subscribers: Cable Modem Subscribers. The null hypothesis $H_{0}$: $R^2$=0 is rejected. $R^2$=0.959 and adjusted $R^2$=0.956, which means that the model explains y very well. The Dial-up Internet Subscribers and Cable Modem Subscribers coefficients significantly influence Internet Subscribers. The standardized coefficients show that Dial-up Internet Subscribers has a greater effect on the dependent variable than Cable Modem Subscribers (the standardized regression coefficient is 0.890 for Dial-up Internet Subscribers and 0.316 for Cable Modem Subscribers). According to the normality test results, the standardized residuals of the regression for 2002 are normally distributed.

In the regression model for 2003, Dial-up Internet Subscribers and DSL Lines are included as independent variables. The null hypothesis $H_{0}$: $R^2$=0 is rejected. $R^2$=0.968 and adjusted $R^2$=0.966, which means that the model explains y very well. The Dial-up Internet Subscribers and DSL Lines coefficients significantly influence Internet Subscribers. The standardized coefficients show that Dial-up Internet Subscribers has a greater effect on the dependent variable than DSL Lines (the standardized regression coefficient is 0.894 for Dial-up Internet Subscribers and 0.341 for DSL Lines). No concrete statement about the distribution of the standardized residuals can be derived from the two graphs (histogram and P-P plot). The K-S test results indicate that the standardized residuals are normally distributed (sig. > 0.05), whereas the Shapiro-Wilk test indicates that they are not (sig. < 0.05). As the Shapiro-Wilk test is more appropriate for a small sample than the K-S test, it can be assumed that the standardized residuals for 2003 are not normally distributed.
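The reported adjusted $R^2$ values can be cross-checked against the reported $R^2$ values with the usual formula $1 - (1 - R^2)(n - 1)/(n - k - 1)$, where k is the number of independent variables. The sketch below assumes that all n = 30 countries enter each regression. This is an assumption, as the number of valid cases per model is not stated in the text.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


N_COUNTRIES = 30  # assumption: no country is dropped from any model

# year: (reported R^2, number of independent variables, reported adjusted R^2)
reported = {
    2000: (0.901, 1, 0.898),
    2001: (0.888, 1, 0.884),
    2002: (0.959, 2, 0.956),
    2003: (0.968, 2, 0.966),
}

recomputed = {}
for year, (r2, k, adj_reported) in reported.items():
    recomputed[year] = adjusted_r_squared(r2, N_COUNTRIES, k)
    print(f"{year}: recomputed {recomputed[year]:.4f}, reported {adj_reported}")
```

Under this assumption the recomputed values agree with the reported ones to within rounding of the published $R^2$ figures.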

## Summary

This work analyzes Internet subscribers in the OECD countries between 2000 and 2003, taking different types of Internet access technologies into account. The results show that the total number of Internet Subscribers in the OECD countries increased strongly during this period. The extreme values for Portugal in the box plots for Internet Subscribers and Dial-up Internet Subscribers, one of which is an outlier, might result from an overestimation of the data on Dial-up Internet Subscribers for Portugal. The extreme values for Norway in the box plots for ISDN Subscribers Basic Rate might be explained by the fact that Internet services in Norway were first introduced through ISDN and ADSL over upgraded telephone cables. The extreme values for Korea in Cable Modem Subscribers and DSL Lines can be attributed to Korea's leading position in broadband Internet access worldwide. The extreme values for Canada in Cable Modem Subscribers might be due to the leading position of the cable companies in introducing broadband access services in the country. The extreme values for Sweden and Korea in Other Broadband Access Technologies to Internet might be explained by the facts that the leading broadband technology in Sweden in 2000 was Ethernet LANs, and that the international Internet connectivity of Korea expanded tremendously thanks to fibre-optic cables.

In the linear regressions with the dependent variable Internet Subscribers, Dial-up Internet Subscribers is included as an independent variable for all investigated years. For 2002, Cable Modem Subscribers is also included as an independent variable, and for 2003, DSL Lines. However, the results show that the assumption of normally distributed residuals is fulfilled only for the 2002 regression model. In that model, Dial-up Internet Subscribers has a greater effect on Internet Subscribers than Cable Modem Subscribers.

Data Sources: