Variability is inherent and fundamental in sports. Imagine a world with no variability between athletes: every race would end in a dead heat, every game in a draw. No two athletes perform exactly alike, and physiological measurements taken on athletes vary. One of the great challenges in sports science is to analyse and model data correctly and objectively. Without an objective approach to the analysis of data, all that is left is personal opinion.
Statistics, the (data) science of uncertainty, provides such an objective framework. It deals with the collection, analysis, interpretation and presentation of data. It provides the logical framework to understand variability.
The proper application of statistics will provide answers to many open questions in the rapidly evolving field of big data in sports science, such as soft tissue injury prediction, training load optimisation, evidence-based recovery strategies and the avoidance of excessive fatigue. Ultimately, this can reduce games lost to injury and extend athletes' careers.
The first step in any such analysis is dynamic visualisation, which enables scientists to get a feel for their data, to highlight unsuspected patterns and to provide a suitable means of assessing the validity of potential modelling strategies. A picture is worth a thousand numbers.
The next step involves using the data to build a statistical model that successfully captures the signal of interest but is no more complex than necessary. For example, suppose we are interested in why some athletes experience a soft tissue injury in training while other, similar athletes undergoing the same training regime do not. The outcome of interest is soft tissue injury, recorded as yes or no. It is known that soft tissue injury occurrence is related to, amongst other things, the age of the athlete, the number of previous injuries, nutritional status and acute training loads. Other sources of potentially useful numeric data include an athlete's sleep log, muscle soreness rating and mood state, as well as non-numeric (unstructured) data such as a sport scientist's training diary and a clinician's case notes.
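A binary outcome of this kind is commonly modelled with logistic regression, which maps a weighted combination of predictors onto a probability between 0 and 1. The sketch below is purely illustrative: the three predictors, the coefficients and the intercept are all invented for the example, not fitted to any real data.

```python
import math

def injury_probability(age, previous_injuries, acute_load, weights, intercept):
    """Logistic model: map predictors to a probability of soft tissue injury.

    The coefficients are hypothetical, chosen for illustration only.
    """
    linear_score = (intercept
                    + weights["age"] * age
                    + weights["previous_injuries"] * previous_injuries
                    + weights["acute_load"] * acute_load)
    # Logistic (sigmoid) function squashes the score into (0, 1).
    return 1.0 / (1.0 + math.exp(-linear_score))

# Hypothetical coefficients: older athletes, more previous injuries and
# higher acute loads all push the injury probability upwards.
weights = {"age": 0.04, "previous_injuries": 0.6, "acute_load": 0.002}

p = injury_probability(age=28, previous_injuries=2, acute_load=450,
                       weights=weights, intercept=-4.0)
print(round(p, 3))  # ≈ 0.314 with these illustrative numbers
```

In a real analysis the coefficients would be estimated from data, but the structure is the same: each data source contributes a term to the score, and the model outputs a probability rather than a hard yes/no.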
Such data sources are the building blocks of a useful, fit-for-purpose statistical model. Each partially explains the occurrence of soft tissue injury and contributes to the signal the model aims to capture. That similar athletes can tolerate different training loads, that some athletes appear ageless, that some have acute biomarker responses, and that training conditions often change may all help explain soft tissue injury. Such factors are not easily captured as data sources, and as such they constitute athlete-to-athlete variability (i.e. noise) that the model cannot explain.
The goal of the model is to separate the signal of interest from unexplainable athlete-to-athlete noise. Models that are too simple fail to identify important signals in the data. Models that are too complex, perhaps including unnecessary data sources, 'overfit' the data; that is, they mistakenly identify unexplainable athlete-to-athlete noise as useful signal. The predictive power of an overfitted model will appear excellent on the sample of athletes from which the data were collected, but it will be disappointingly poor when the model is applied to new (unseen) data.
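The gap between performance on the training sample and on unseen athletes can be demonstrated with a deliberately extreme toy example: a model that simply memorises each training athlete's outcome, applied to outcomes that are pure noise. All data here are simulated solely for illustration.

```python
import random

random.seed(42)

# Toy data: each athlete's injury outcome is pure noise (a coin flip),
# so there is no genuine signal to find.
train = [(i, random.randint(0, 1)) for i in range(100)]
test = [(i + 100, random.randint(0, 1)) for i in range(100)]

# "Overfit" model: memorises every training athlete's outcome exactly.
memory = dict(train)
def overfit_predict(athlete_id):
    return memory.get(athlete_id, 0)

# Parsimonious model: always predicts the majority class seen in training.
majority = round(sum(y for _, y in train) / len(train))
def simple_predict(athlete_id):
    return majority

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

print(accuracy(overfit_predict, train))  # perfect on the training sample
print(accuracy(overfit_predict, test))   # near chance on unseen athletes
print(accuracy(simple_predict, test))    # the simple model loses nothing
```

The memorising model scores 100% on the athletes it was built from, yet does no better than guessing on new athletes, because everything it "learned" was noise.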
The trade-off lies in identifying, amongst all the available data sources, the most parsimonious subset that best captures the signal, excluding unnecessary noise variables while retaining enough complexity to maximise predictive power on future unseen data. This is the game that statisticians play.
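One simple way to play this game is greedy forward selection against held-out data: add a candidate variable to the model only if it improves accuracy on athletes not used to build it. The sketch below uses simulated athletes in which only one of three hypothetical variables (here called acute_load) genuinely carries signal; the other two are noise, and the procedure should leave them out.

```python
import random

random.seed(1)

# Simulated athletes (illustrative only): injury risk is driven by
# acute_load, while sleep_score and mood are pure noise variables.
def make_athlete():
    load = random.uniform(0, 1)
    features = {"acute_load": load,
                "sleep_score": random.uniform(0, 1),
                "mood": random.uniform(0, 1)}
    injured = 1 if load + random.gauss(0, 0.2) > 0.5 else 0
    return features, injured

held_out = [make_athlete() for _ in range(200)]

def score(feature_subset):
    """Held-out accuracy of a simple threshold rule on the chosen features."""
    if not feature_subset:
        return 0.5  # no features: no better than a coin flip
    def predict(f):
        mean = sum(f[k] for k in feature_subset) / len(feature_subset)
        return 1 if mean > 0.5 else 0
    return sum(predict(f) == y for f, y in held_out) / len(held_out)

# Greedy forward selection: keep a variable only if it improves accuracy.
selected, best = [], score([])
for name in ["acute_load", "sleep_score", "mood"]:
    trial = score(selected + [name])
    if trial > best:
        selected, best = selected + [name], trial

print(selected, round(best, 2))
```

With this simulated data the procedure retains acute_load and discards the noise variables: the most parsimonious subset that captures the available signal.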