degrees of freedom (statistics)

Degrees of freedom

The number of degrees of freedom is the number of values in the final calculation of a statistic (that is, a single quantity calculated from the values in a sample) that are free to vary. The statistic can be the mean of a sample, or something as simple as the number of heads in a series of coin flips.

To demonstrate what the phrase "free to vary" means, suppose a friend tosses a coin and tells us that the result is NOT heads. Logically, we conclude that the outcome must be tails and nothing else. There is therefore only one independent observation: one of the two sides of the coin can be chosen freely, but once it is chosen, the other side is completely determined. Thus, in this experiment there is one degree of freedom.

Now suppose we made three observations, \[\left\{ x,y,z \right\}\], but due to our horrible handwriting we cannot read their values. However, we do know that their mean is \[6\], so their sum must be \[18\]. There is no way to guess the value of each observation, so we study our (horrible) handwriting carefully. After a closer inspection, we find that \[x=6\]. It's a great start, but we still can't say for sure what the remaining two values are. After another round of inspection, we find that \[z=9\]. Then we realise that, without looking back at our squiggles, we can say with confidence that \[y\] must equal \[3\], since \[y=18-6-9=3\]. In this example there are two degrees of freedom. In general, the number of degrees of freedom equals the number of observations minus the number of parameters estimated from them. Here we estimated one parameter, the mean, which pins down exactly one observation once the others are known, so there are \[3-1=2\] degrees of freedom.
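The handwriting example can be sketched in a few lines of Python: once the mean is fixed, the last observation is forced by the others.

```python
# With n observations and a known mean, only n - 1 values are free to
# vary; the last one is determined by the constraint that they sum to
# n * mean.
n = 3
mean = 6
x, z = 6, 9           # the two values we managed to decipher
y = n * mean - x - z  # the third is forced: 3 * 6 - 6 - 9
print(y)              # → 3
```

Changing \[x\] or \[z\] simply shifts \[y\] to whatever value keeps the sum at \[18\] — those two are the "free" values here.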

Now consider the formula for the sample standard deviation, \[s=\sqrt{\frac{\sum(x-\overline{x})^{2}}{n-1}}\]. You may wonder: why bother dividing by the degrees of freedom? Suppose we took a sample of seven women living in a town. Their heights are measured to be \[\left\{ 164,173,158,179,168,187,167 \right\}\], with mean height \[\overline{x}\approx170.86\]. If we were to divide only by the sample size \[n\], i.e. \[s=\sqrt{\frac{\sum(x-\overline{x})^{2}}{n}}\], then \[s\approx9\], whereas dividing by the degrees of freedom \[n-1\] gives \[s\approx9.72\]. So why do we want the larger estimate? Because the deviations are measured from the sample mean \[\overline{x}\] rather than from the true population mean, they are systematically a little too small; indeed, the deviations must sum to zero, so only \[n-1\] of them are free to vary and the last one carries no new information. Dividing by \[n\] would therefore underestimate the population standard deviation, and dividing by \[n-1\] corrects for this.
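The height calculation above can be checked directly. This sketch computes both versions of the estimate, dividing the sum of squared deviations by \[n\] and by \[n-1\]:

```python
import math

heights = [164, 173, 158, 179, 168, 187, 167]
n = len(heights)
mean = sum(heights) / n                     # 1196 / 7 ≈ 170.86

# Sum of squared deviations from the sample mean
ss = sum((x - mean) ** 2 for x in heights)

s_biased = math.sqrt(ss / n)        # divide by n      → ≈ 9.00
s = math.sqrt(ss / (n - 1))         # divide by n - 1  → ≈ 9.72
print(round(s_biased, 2), round(s, 2))
```

As claimed, dividing by the sample size gives roughly 9, while dividing by the degrees of freedom gives the larger, less biased estimate of roughly 9.72.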

Additionally, notice that degrees of freedom enter the calculation of standard deviation only when it is estimated from sample data. When the standard deviation of an entire population is computed, the population mean is known exactly rather than estimated, so no degree of freedom is lost and we divide by \[n\]. (In practice, for a large population the difference between dividing by \[n\] and \[n-1\] makes no practical difference anyway.)
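The sample/population distinction is baked into Python's standard library: `statistics.stdev` divides by \[n-1\] while `statistics.pstdev` divides by \[n\] (the height data from the previous example is reused here purely for illustration).

```python
import statistics

data = [164, 173, 158, 179, 168, 187, 167]

pop_sd = statistics.pstdev(data)   # population: divides by n
samp_sd = statistics.stdev(data)   # sample: divides by n - 1
print(round(pop_sd, 2), round(samp_sd, 2))
```

Which function to call depends on whether the data is the whole population or only a sample from it.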
