RAND is a repository containing an analysis of a sports dataset, written in R. The analysis lives in Projet_final_Lemonnier_Alexandre_Simonin_Victor.ipynb
and covers a description of the dataset, some statistical analysis, visualisation, a PCA (Principal Component Analysis) and finally a logistic regression.
The dataset is located at /data/decathlon.csv. It combines several quite different kinds of data: the results obtained by athletes in the decathlon at the Olympic Games and at the Decastar. For each athlete we have the scores in the 10 events, as well as their ranking and their points.
In the notebook we perform a PCA to analyze and visualize the decathlon dataset, which contains individuals described by several quantitative variables.
PCA is a method for exploring data with many variables. Each variable can be considered a separate dimension, which makes the raw data very difficult to visualize directly in a multidimensional "hyper-space". PCA projects the data onto a small number of axes that retain as much of the variance as possible.
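As a rough illustration of the idea, here is a minimal sketch using base R's `prcomp`. It assumes nothing about how the notebook itself runs the PCA (it may well use a dedicated package), and the data frame `scores` and its column names are made up so that the snippet runs on its own.

```r
# Synthetic stand-in for the decathlon data: 30 "athletes", 4 numeric "events".
# Column names and values are hypothetical, for illustration only.
set.seed(42)
speed <- rnorm(30)
scores <- data.frame(
  sprint_a = speed + rnorm(30, sd = 0.3),
  sprint_b = speed + rnorm(30, sd = 0.3),
  throw_a  = rnorm(30),
  throw_b  = rnorm(30)
)

# Centre and scale each variable so that differing units do not dominate.
pca <- prcomp(scores, center = TRUE, scale. = TRUE)

# Share of the total variance carried by each principal component.
explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(explained, 3))

# Coordinates of the individuals on the first two components (a 2-D view).
coords_2d <- pca$x[, 1:2]
print(head(coords_2d))
```

Scaling is used here because decathlon events are measured in different units (seconds, metres), which is an assumption about the preprocessing rather than something stated above.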
We also visualize our data with a correlation graph:
We made a few observations from it:
- The results of the short races, namely `100m`, `400m` and `110m.H`, are strongly correlated with each other. Quite surprisingly, however, they are negatively correlated with the final ranking, unlike the `Length`, `Weight` and `Height` events.
- The `Length` event and the `400m` event are quite strongly negatively correlated.
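The correlations discussed above can be computed with base R's `cor`. The sketch below uses a small synthetic data frame (`toy`, with hypothetical columns `run_short`, `run_long` and `jump`) rather than the real dataset. It also shows one plausible reason for negative correlations between races and jumps: race results are times, where lower is better, while jump results are distances, where higher is better.

```r
set.seed(1)
n <- 40
speed <- rnorm(n)  # hidden "speed" factor shared by all three events

# Hypothetical columns: two race times that shrink as speed grows,
# and a jump distance that grows with speed.
toy <- data.frame(
  run_short = 11 - 0.3 * speed + rnorm(n, sd = 0.1),
  run_long  = 50 - 1.0 * speed + rnorm(n, sd = 0.4),
  jump      = 7  + 0.3 * speed + rnorm(n, sd = 0.1)
)

# Pearson correlation matrix of all numeric variables.
corr <- cor(toy)
print(round(corr, 2))
```

In this toy setup the two race times come out positively correlated with each other and negatively correlated with the jump, even though all three reflect the same underlying ability.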
Finally, the variables are plotted on the factorial axes, which lets us assess how well each variable is represented. Here, the axes must be interpreted independently.
Several elements can be analysed:
- Positively correlated variables are grouped together.
- Negatively correlated variables are positioned on opposite sides of the origin of the graph (opposing quadrants).
- The distance between a variable and the origin measures its quality of representation. Variables that are far from the origin are well represented by the PCA. For example, the throwing events, namely `Shot Put`, `Discus` and `Javelin`, are strongly correlated because they point in the same direction. However, `Javelin` is less well represented than `Shot Put` and `Discus` by the first two principal components.
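The quality of representation described in the list above is often quantified as the squared cosine (cos2) of each variable on the retained axes. The following is a minimal sketch with base R's `prcomp` on synthetic data. The data frame `toy` and its columns are hypothetical, and the notebook may compute this differently.

```r
set.seed(7)
n <- 50
strength <- rnorm(n)
# Hypothetical variables: three noisy measures of one factor, plus one unrelated.
toy <- data.frame(
  throw_a = strength + rnorm(n, sd = 0.3),
  throw_b = strength + rnorm(n, sd = 0.3),
  throw_c = strength + rnorm(n, sd = 1.0),
  other   = rnorm(n)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Variable coordinates on the correlation circle: each loading column is
# multiplied by the standard deviation of its component. With standardised
# data this equals the correlation between the variable and the component.
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2 is the squared coordinate. Summed over the first two components it is
# the squared distance to the origin in the PC1-PC2 plane (between 0 and 1).
cos2 <- var_coord^2
quality_2d <- rowSums(cos2[, 1:2])
print(round(quality_2d, 3))
```

A value of `quality_2d` close to 1 means the variable lies near the edge of the correlation circle and is well captured by the first two components. Summed over all components, cos2 equals 1 for every variable.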