OLYMPIC DATA ANALYSIS
Abstract
In this thesis we analyse the women's track records from the 1984 Olympic Games, employing descriptive statistics, outlier tests, and distribution estimation and testing, and we try to find the relationship between a country's performance in short-distance and long-distance running. Using regression methods implemented in XploRe, the analysis confirms our assumption. The detailed methods and data analysis are presented in the following thesis.
Key Words
running performance, outlier test, density test, correlation, regression, variable transformation
Background and Data Set
We retrieved our data set from the 1984 Olympic records collected by Johnson & Wichern ^{[1]}. The data contains the national track records for women from 55 countries, covering the 100m, 200m, 400m, 800m, 1500m, 3000m and marathon events. Originally it contains 55 observations and 8 variables.
The data is given in the following table (100m–400m in seconds; 800m and longer in minutes):

National Track Records for Women

OBS  COUNTRY   100m   200m   400m   800m  1500m  3000m  Marathon
  1  argentin  11.61  22.94  54.50  2.15   4.43   9.79    178.52
  2  australi  11.20  22.35  51.08  1.98   4.13   9.08    152.37
  3  austria   11.43  23.09  50.62  1.99   4.22   9.34    159.37
  4  belgium   11.41  23.04  52.00  2.00   4.14   8.88    157.85
  5  bermuda   11.46  23.05  53.30  2.16   4.58   9.81    169.98
  6  brazil    11.31  23.17  52.80  2.10   4.49   9.77    168.75
  7  burma     12.14  24.47  55.00  2.18   4.45   9.51    191.02
  8  canada    11.00  22.25  50.06  2.00   4.06   8.81    149.45
  9  chile     12.00  24.52  54.90  2.05   4.23   9.37    171.38
 10  china     11.95  24.41  54.97  2.08   4.33   9.31    168.48
 11  columbia  11.60  24.00  53.26  2.11   4.35   9.46    165.42
 12  cookis    12.90  27.10  60.40  2.30   4.84  11.10    233.22
 13  costa     11.96  24.60  58.25  2.21   4.68  10.43    171.80
 14  czech     11.09  21.97  47.99  1.89   4.14   8.92    158.85
 ...  ......    ...    ...    ...    ...    ...    ...      ...
 36  mauritiu  11.76  25.08  58.10  2.27   4.79  10.90    261.13
 ...  ......    ...    ...    ...    ...    ...    ...      ...
 55  wsamoa    12.74  25.85  58.73  2.33   5.81  13.04    306.00
Analysis Target
Our analysis target is to find the relationship between a country's performance in short-distance running and in long-distance running. The result should be given in the form of a regression formula.
Analytical Method
Data modification
For comparability, we convert each result into the time spent per 100 metres, which puts all events on the same scale. The calculation formula is given by

Time per 100 m = (total time in seconds) / (distance in metres / 100)

Results for the 100m, 200m and 400m are recorded in seconds; the 800m, 1500m, 3000m and marathon results are recorded in minutes and are first multiplied by 60 (the marathon distance is approximated as 422 hundred-metre segments). Now we obtain the data in a uniform unit, ready for the next step's calculations.
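The conversion can be sketched in Python (an illustrative stand-in for the XploRe code used later in the thesis; the function and variable names are ours):

```python
# Convert each record to "seconds per 100 m" so all seven events share
# one scale. 100m-400m are recorded in seconds; 800m and longer in
# minutes. The marathon is approximated as 422 hundred-metre segments.

def per_100m(time, distance_m, in_minutes=False):
    """Average time (seconds) spent per 100 m of the given event."""
    seconds = time * 60.0 if in_minutes else time
    return seconds / (distance_m / 100.0)

# Argentina's row from the table above:
row = {"100m": 11.61, "200m": 22.94, "400m": 54.50,
       "800m": 2.15, "1500m": 4.43, "3000m": 9.79, "marathon": 178.52}

converted = {
    "100m": per_100m(row["100m"], 100),
    "200m": per_100m(row["200m"], 200),
    "400m": per_100m(row["400m"], 400),
    "800m": per_100m(row["800m"], 800, in_minutes=True),
    "1500m": per_100m(row["1500m"], 1500, in_minutes=True),
    "3000m": per_100m(row["3000m"], 3000, in_minutes=True),
    "marathon": row["marathon"] * 60.0 / 422,  # 422 segments, as above
}
```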
Another modification is made when the explanatory and dependent variables are defined. The 100m, 200m and 400m results are chosen as explanatory variables, and the most representative long-distance event, the marathon, as the dependent variable. Because the 800m is ambiguous between short- and long-distance running, it is excluded from our analysis.
Descriptive Graphing
Univariate Graphing
In order to check the distributions and the existence of outliers, we generate box plots ^{[2]} of the transformed data, shown below. (Note that the x axis of the box plot has no meaning.) From the plot we can see that a few outliers exist. They all appear above the upper whisker, pulling the mean above the median in every variable except the 800m. Although there are no outliers in the 800m, that variable is still skewed. Before deciding to discard any outliers, we still need to find out whether some special reason explains them, and whether the points that are outliers in one dimension remain outliers in the other dimensions. If not, the data may carry important information, and wrongly excluding them could lead to unsatisfactory results in the later analysis.
Figure 11: Box plot of the running data
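The whisker rule behind a box plot can be mimicked numerically: flag values beyond 1.5 times the interquartile range from the quartiles. The following Python sketch (not part of the thesis) applies it to the marathon column of the rows visible in the table above:

```python
# Flag values beyond the usual 1.5*IQR box-plot whiskers.
import statistics

def iqr_outliers(values):
    """Values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(sorted(values), n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# Marathon times (minutes) of the 16 rows shown in the data table:
marathon = [178.52, 152.37, 159.37, 157.85, 169.98, 168.75, 191.02,
            149.45, 171.38, 168.48, 165.42, 233.22, 171.80, 158.85,
            261.13, 306.00]
print(iqr_outliers(marathon))
```

Consistent with the box plot discussion, the flagged values all lie on the upper side (the slowest marathon times).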
Multivariate Graphing
Star Diagram
A star diagram shows multidimensional data in the plane: each axis of a star represents one variable, and the data set is standardized to a common interval. Multidimensional outliers are then easy to identify, as their stars tend to be rounder, forming a larger heptagon (one ray per variable).
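Standardizing each variable to a common interval, as the star diagram requires, can be sketched as follows (an illustrative Python snippet, not the XploRe internals):

```python
# Min-max standardization: map each column to the interval [0, 1].
def minmax(column):
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# Example: three 100m times from the data table.
print(minmax([11.61, 11.20, 12.90]))
```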
In the diagram we can identify several possible outliers: the 12th, 13th, 36th and 55th countries. The 12th is the Cook Islands in the South Pacific, the 13th is Costa Rica in Central America, the 36th is Mauritius in the Indian Ocean, and the 55th is Western Samoa, again in the South Pacific.
At the same time, we find that some stars have contracted almost to a point. Referring back to the data set, these are the 19th (German Democratic Republic), 53rd (USA) and 54th (USSR) countries. The 20th country, the Federal Republic of Germany, lags a little behind in track performance. Up to now it is still not clear whether subgroups exist in our data; this also requires further work.
PCP
Plotting the data set in parallel coordinates yields a similar result. In this diagram we partition the countries according to their marathon performance: countries with a marathon result above the median are colored blue, the others red. The two curves that clearly drift away from the rest represent the 12th and 55th countries.
As the PCP is drawn so that the largest value of each variable is mapped to 1 and the remaining coordinates are proportions of it, we can see that in short-distance running the two slowest countries differ from the others far less than they do in long-distance running.
A strong correlation can also be observed between the 4th and 5th variables. However, we did not see anything that clearly shows a subgroup in our data set.
Andrews' curves
The Andrews' curves are partitioned in the same fashion as the PCP. Again a few curves drift away from the rest, the most obvious being the 12th, 13th, 36th and 55th countries. From this diagram we can also see that the blue and red curves are separated quite well, with only a few still entangled. Hence partitioning the data set by the 7th variable (marathon) is a reasonable choice.
Face plot
Figure 15: Flury faces plot of the running data
100m = mouth size, 200m = pupil size, 400m = eye slant, 800m = upper hair line, 1500m = lower hair line, 3000m = face line, marathon = darkness of hair.
Flury faces visualize the data set in a more vivid way. Because the most distinctive features, the face line and the darkness of the hair, are assigned to the 3000m and the marathon, a "squared" face with dark hair suggests poor performance in both events. The 55th country again stands out; the 12th is less unique, as some similar faces appear as well: the 23rd (Guatemala), 36th (Mauritius) and 41st (Papua New Guinea).
Scatter Plot
To illustrate the overall relation between the explanatory variables and the dependent variable, a series of scatter plots is drawn. From here on we begin to select variables: as the 800m is ambiguous between short and long distance, we exclude it from the scatter plots. In the plots, the relation of the marathon to the 400m result appears to follow a quadratic curve, so we add the squared 400m term to our model (the XploRe results also show that after adding it, the model explains more of the variation in the original data).
Figure 16: Scatter plots of the running data
Outliers Detection and Test
For simplicity and better focus, in the following we restrict attention to the four variables related to our model: the 100m, 200m and 400m results as explanatory variables and the marathon as the dependent variable.
Although we have identified some potential outliers graphically, they are not yet mathematically confirmed as outliers; more careful work is required. In statistics, an outlier is defined as a single observation "far away" from the rest of the data, and it can be detected with appropriate techniques. Once potential outliers are located, we should not simply delete them: if an outlier stems from misrecording, deleting it is reasonable; otherwise it may simply indicate a different type of response. ^{[3]}
Outliers can be located through linear model fitting. First we use a forward-selection linear regression to approximate the marathon performance (Y) by the short-distance results and analyse the residuals: we plot the standardized residuals against the fitted values of Y as follows.
rundata = read("running")
rundatao = rundata[,1]~rundata[,2]./2~rundata[,3]./4~rundata[,4]*60./8~rundata[,5]*60./15~rundata[,6]*60./30~rundata[,7]*60./422
x1 = rundatao[,1]
x2 = rundatao[,2]
x3 = rundatao[,3]
X = x1~x2~x3
Y = rundatao[,7]
{xfs,bfs} = linregfs(X, Y, 0.05)
{res,out} = linregres(xfs, Y, xfs*bfs)
std = res[,3]
stdsqr = res[,3]^2
stdsqr
d = createdisplay(1,1)
show(d, 1, 1, xfs*bfs~std)
setgopt(d, 1, 1, "xlabel", "Fitted Value", "ylabel", "Standard Residuals")
sqrt(51*0.1545)
Figure 17: Data fitting
At first glance we can find some influential observations (outliers) in the plot; however, a more detailed data analysis and a formal outlier test should be done before any final decision is made.
We begin our outlier test, which is based on standardized residuals.
First we assume normally distributed, i.i.d. errors. Concerning a potential outlier, the t-th observation, we have two hypotheses:

H0: E(epsilon) = 0
H1: E(epsilon) = delta * e_t, where e_t is a T-vector of zeros except for a one in the t-th component.

We test H0 using the standardized residual of the t-th observation. By this alpha-test, H0 is rejected if the test statistic exceeds the quantile of the corresponding beta distribution; following Barnett & Lewis and Cran et al., the critical value is sqrt((n - p) * B), where B is the relevant beta quantile (in this test one usually selects a larger alpha than usual). As XploRe has no generator for beta-distribution critical values, we used Matlab and obtained B = 0.1545. With n - p = 51, the critical value is sqrt(51 * 0.1545) = 2.807, and we reject the null hypothesis whenever the test statistic exceeds it. The outcomes are shown in the following table.
OBS   Test statistic
  1     0.026947
  2     0.011173
  3     0.15936
...     ........
 13     3.7855
...     ........
 36     5.8698
...     ........
 55    19.196

Table 2: Outlier test outcome
After checking, observations 13, 36 and 55 have test statistics above the critical value, which drives us to reject the null hypothesis for them. For observations 36 and 55 the high test statistic mainly results from their abnormal marathon performance, which is almost twice as slow as the best country's; this is clearly abnormal, and we delete them from the further analysis. For the 13th country, Costa Rica, the high value results from its taking less time than expected in almost every event, which should be desirable in the Olympic Games, so we should not delete it. Although the 12th country, the Cook Islands, behaved like a potential outlier in the earlier graphs, its abnormality comes from variables other than those we are interested in here. After the outlier test we therefore delete only observations 36 and 55; the remaining ones are used for the further analysis.
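The logic of this residual-based check can be sketched in Python for the simple-regression case (an illustration only; the thesis fits a multivariate model in XploRe, and the data below are synthetic with one planted outlier):

```python
# Internally studentized residuals of an OLS fit, flagged against the
# thesis's critical value 2.807.
import math

def studentized_residuals(x, y):
    """Studentized residuals of the OLS fit of y on a single x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    b = sum((x[i] - mx) * (y[i] - my) for i in range(n)) / sxx
    a = my - b * mx
    e = [y[i] - (a + b * x[i]) for i in range(n)]
    s2 = sum(v ** 2 for v in e) / (n - 2)          # residual variance
    r = []
    for i in range(n):
        h = 1.0 / n + (x[i] - mx) ** 2 / sxx       # leverage of point i
        r.append(e[i] / math.sqrt(s2 * (1.0 - h)))
    return r

# Synthetic data on a line, with observation 15 shifted upwards.
x = [float(v) for v in range(30)]
y = [2.0 * v for v in x]
y[15] += 20.0
flagged = [i for i, r in enumerate(studentized_residuals(x, y))
           if abs(r) > 2.807]
print(flagged)
```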
Distribution Test
In this part we use the averaged shifted histogram to get a general feeling for, and an estimate of, our explanatory variables' distributions, which are needed for the further analysis.
First we plot the averaged shifted histograms with the default bandwidth (left column) for the x1 and x2 variables. At first sight they seem roughly normal, but there are three peaks in each plot. Theory shows that the optimal bandwidth minimizes the AMISE of the density estimator. If the true density is normal (which in practice cannot be known, so this is a rule-of-thumb bandwidth ^{[4]}), the optimal bandwidth is 0.9203 for our data. As the right-hand column shows, the histograms become much smoother and look normal. However, a formal test should be carefully constructed before any conclusion is drawn.
rundata = read("running1")
rundatao = rundata[,1]~rundata[,2]./2~rundata[,3]./4~rundata[,4]*60./8~rundata[,5]*60./15~rundata[,6]*60./30~rundata[,7]*60./422
t1 = rundatao[,1]
t2 = rundatao[,2]
bp11 = grash(t1)
bp12 = grash(t1, 0.9203)
bp21 = grash(t2)
bp22 = grash(t2, 0.9203)
d = createdisplay(2,2)
show(d, 1, 1, bp11)
show(d, 1, 2, bp12)
show(d, 2, 1, bp21)
show(d, 2, 2, bp22)
setgopt(d, 1, 1, "xlabel", "100M (Default Bandwidth)")
setgopt(d, 1, 2, "xlabel", "100M (Optimal Bandwidth)")
setgopt(d, 2, 1, "xlabel", "200M (Default Bandwidth)")
setgopt(d, 2, 2, "xlabel", "200M (Optimal Bandwidth)")
Figure 18: Density estimation of the running data
Similarly, we can plot x3 and y to check their distributions. Note that y appears clearly skewed, which suggests it is not normally distributed; the following test leads to the same conclusion.
As our sample size exceeds 50, the test is based on skewness and kurtosis rather than the Shapiro-Wilk test (which applies for n ≤ 50).
Let m_p denote the p-th central moment of a random variable y. Then

  skewness: gamma_1 = m_3 / m_2^(3/2),   kurtosis: gamma_2 = m_4 / m_2^2.

For a sample y_1, ..., y_n, the empirical central moments are defined by

  m̂_p = (1/n) * sum_{i=1..n} (y_i - ȳ)^p,   p = 2, 3, ...

so the sample skewness is ĝ_1 = m̂_3 / m̂_2^(3/2) and the sample kurtosis is ĝ_2 = m̂_4 / m̂_2^2.

For a normal distribution both the skewness and the excess kurtosis (gamma_2 - 3) are zero. Using the Jarque-Bera test, we compare the skewness and kurtosis of the data with those of a normal distribution. The hypotheses are H0: gamma_1 = 0 and gamma_2 = 3, and the test statistic is

  JB = (n/6) * ( ĝ_1^2 + (ĝ_2 - 3)^2 / 4 ),

which asymptotically follows a chi-squared distribution with 2 degrees of freedom.
After running the program in XploRe, the test statistics for x1, x2 and x3 are 1.25, 4.29 and 0.26, all smaller than the critical value 5.9915, the 95% quantile of the chi-squared distribution with 2 degrees of freedom. For y, however, the statistic is 19.427, far above 5.9915. This confirms our previous assumption: we cannot reject normality for x1, x2 and x3, while we must reject the assumption that y is normally distributed.
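The Jarque-Bera statistic described above is easy to reproduce (a Python sketch; the thesis computed it in XploRe):

```python
# Jarque-Bera statistic: JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is
# the sample skewness and K the sample kurtosis. Compare with the
# chi-squared(2) 95% quantile, 5.9915.

def jarque_bera(y):
    n = len(y)
    m = sum(y) / n
    m2 = sum((v - m) ** 2 for v in y) / n   # empirical central moments
    m3 = sum((v - m) ** 3 for v in y) / n
    m4 = sum((v - m) ** 4 for v in y) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
```

For a perfectly symmetric sample the skewness term vanishes and only the excess-kurtosis term contributes.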
Correlation
The correlation coefficients computed in XploRe are:
          100m     200m     400m     800m     1500m    3000m    Marathon
100m      1        0.95942  0.83548  0.70676  0.71094  0.73618  0.71441
200m      0.95942  1        0.83954  0.68462  0.68498  0.69678  0.68757
400m      0.83548  0.83954  1        0.88193  0.80358  0.78087  0.68912
800m      0.70676  0.68462  0.88193  1        0.93746  0.87528  0.7693
1500m     0.71094  0.68498  0.80358  0.93746  1        0.94919  0.81305
3000m     0.73618  0.69678  0.78087  0.87528  0.94919  1        0.83641
Marathon  0.71441  0.68757  0.68912  0.7693   0.81305  0.83641  1
In accordance with the scatter plots, a large correlation coefficient (0.95942) is observed between the 100m and 200m results. Involving both of them directly as regressors would cause a serious multicollinearity problem, making the parameter estimates very unstable. We therefore first transform the variables to eliminate this effect.
Variable Transform
By regressing x2 (200m) on x1 (100m), we obtain a fitted slope of 0.9824, which makes the transformation of the explanatory variables possible. After adding the squared 400m term, the adjusted R² increases from 0.51158 to 0.56684. The new explanatory variables are given as follows.
- the residuals from regressing x2 on x1;
- the original x1 plus the fitted value of x2 from that regression;
- x3, which is unchanged.
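This transformation can be sketched as follows (a Python illustration; `combined` and `resid` are our names for the two new regressors, under our reading of the construction above). By the OLS properties, the residual is uncorrelated with x1 by construction, which removes the near-collinearity between the 100m and 200m results.

```python
# Orthogonalize two highly correlated regressors: regress x2 on x1,
# keep the residual as one new regressor and x1 plus the fitted x2
# as the other.

def simple_ols(x, y):
    """Intercept and slope of the OLS fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((x[i] - mx) * (y[i] - my) for i in range(n))
         / sum((v - mx) ** 2 for v in x))
    return my - b * mx, b

def transform(x1, x2):
    a, b = simple_ols(x1, x2)
    fitted = [a + b * v for v in x1]
    resid = [x2[i] - fitted[i] for i in range(len(x2))]
    combined = [x1[i] + fitted[i] for i in range(len(x1))]
    return combined, resid

# 100m times and per-100m 200m times of the first four countries:
x1 = [11.61, 11.20, 11.43, 11.41]
x2 = [11.470, 11.175, 11.545, 11.520]
combined, resid = transform(x1, x2)
```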
XploRe Modelling
When the data is transformed, further analysis can be carried out.
Regression of Y on all four regressors
Our model explains about 56% of the variation of the original data (the marathon results). At the 5% significance level, the p-values of b2, b3 and b4 are below 5%, showing they are significant, while b1 is not (p = 0.1670). This drives us to remove the corresponding regressor from the model, which is done next.
A N O V A            SS      df    MSS     F-test   P-value
___________________________________________________________
Regression         288.012    4   72.003   18.012   0.0000
Residuals          191.882   48    3.998
Total Variation    479.894   52    9.229

Multiple R     = 0.77470
R^2            = 0.60016
Adjusted R^2   = 0.56684
Standard Error = 1.99938

PARAMETERS        Beta        SE       StandB    t-test   P-value
_________________________________________________________________
b[ 0,]=        202.8759   88.4055     0.0000     2.295    0.0262
b[ 1,]=          2.8905    2.0601     0.1567     1.403    0.1670
b[ 2,]=          1.5008    0.5861     0.4687     2.561    0.0136
b[ 3,]=         33.8212   13.1570     7.0654     2.571    0.0133
b[ 4,]=          1.3322    0.4947     7.4381     2.693    0.0097
Regression of Y on the remaining regressors

Secondly, we remove the insignificant regressor from our model; the remaining coefficients are all significant at the 5% significance level.
A N O V A            SS      df    MSS     F-test   P-value
___________________________________________________________
Regression         280.143    3   93.381   22.907   0.0000
Residuals          199.751   49    4.077
Total Variation    479.894   52    9.229

Multiple R     = 0.76404
R^2            = 0.58376
Adjusted R^2   = 0.55828
Standard Error = 2.01905

PARAMETERS        Beta        SE       StandB    t-test   P-value
_________________________________________________________________
b[ 0,]=        197.5929   89.1939     0.0000     2.215    0.0314
b[ 1,]=          1.2263    0.5579     0.3830     2.198    0.0327
b[ 2,]=         32.0543   13.2254     6.6963     2.424    0.0191
b[ 3,]=          1.2654    0.4973     7.0653     2.545    0.0141
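The least-squares fit behind these tables can be mimicked with an ordinary normal-equations solver (a Python sketch, not the thesis code; the data are synthetic, recovering known coefficients for an intercept, a linear regressor and a quadratic term as in the model above):

```python
# Multiple OLS via the normal equations (X'X) beta = X'y, solved by
# Gauss-Jordan elimination with partial pivoting.

def ols(X, y):
    """X: list of observation rows (including an intercept column)."""
    n, p = len(X), len(X[0])
    # augmented matrix [X'X | X'y]
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
         + [sum(X[k][i] * y[k] for k in range(n))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))  # pivot row
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:                       # eliminate column c
                f = A[r][c] / A[c][c]
                A[r] = [A[r][k] - f * A[c][k] for k in range(p + 1)]
    return [A[i][p] / A[i][i] for i in range(p)]

# Synthetic data: y = 5 - 3*x + 0.5*x^2 exactly.
xs = [float(v) for v in range(1, 9)]
X = [[1.0, v, v * v] for v in xs]
y = [5.0 - 3.0 * v + 0.5 * v * v for v in xs]
beta = ols(X, y)
```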
Final Model and prediction
Finally we have our model, with the coefficients given in the output above, which illustrates the relation between the explanatory variables and the dependent variable.
To check the model, we also generate predictions from it and compare them with the true data; the graphs are shown below. We find that our model fits the data well except in some extreme situations.
Figure 19: Prediction of the running data
Conclusion and Explanation
Based on the exploration above and the final linear regression model, we can now draw our conclusion. A quadratic polynomial relation exists between the performance in the 100m, 200m and 400m and the marathon. If the corresponding regressor increases by one unit, Y increases by 1.23 and 1.2054 units respectively. The marathon speed first increases as the 400m speed goes up, but after reaching a maximum it decreases as the 400m speed continues to increase. This may result from the physical condition of the women athletes in different countries, or from differing government support for long- and short-distance running.
References

1. Johnson, R. A. and Wichern, D. W. (1998). Applied Multivariate Statistical Analysis. Prentice-Hall International, USA.
2. Härdle, W., Klinke, S. and Müller, M. XploRe Learning Guide. Springer, Berlin.
3. Droge, B. Selected Topics in Economics, lecture notes, Humboldt-Universität zu Berlin.
4. Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, New York.