Overview
In the previous article, you learned how to calculate the mean, or expected value (\(\mu\)), of a discrete random variable. The mean, median, and mode are measures of central tendency: each gives a single number that works as a reasonably good substitute for all the numbers in a dataset.
There is another way to look at a dataset: how spread out the data are. You have probably met the range before, which is the difference between the largest and the smallest number in a dataset. It is one measure of ‘spread’. We also use variance and standard deviation to measure spread. Together, these are called measures of dispersion. All these concepts are part of your Year 11 HSC Advanced Maths syllabus. Be sure to have a good grasp of them.
Learning Outcomes
After reading this article, you should be able to:
- Understand the difference between range and variance (or standard deviation)
- Calculate the variance and standard deviation of a discrete random variable
The Idea of Variance and Standard Deviation
Consider the following datasets:
Set A: 10, 29, 30, 31, 50
Set B: 10, 11, 30, 49, 50
Both sets have a mean of 30 and a range of 40. However, there is a big difference between the two sets of numbers. Three of the five numbers in set A (29, 30, 31) are clustered around the mean. On the other hand, apart from 30 (which equals the mean), every number in set B lies far from the mean. We say set B has a larger standard deviation (and variance) than set A. Now, the question is: how do we quantify this? How do we assign a number to this spread?
One way to do it is to take the difference between each number and the mean of the set and add those up. But there’s a problem with that approach. Let’s see what it is.
If we took the approach above, the ‘spread’ of set A would be \((10-30)+(29-30)+(30-30)+(31-30)+(50-30) = -20-1+0+1+20 = 0\)
Likewise, the ‘spread’ of set B would be \((10-30)+(11-30)+(30-30)+(49-30)+(50-30) = -20-19+0+19+20 = 0\)
So if we calculate this way, we can’t tell that the numbers in set B are more spread out than those in set A. The reason is that the negative deviations cancel out the positive ones. In fact, the deviations from the mean add up to exactly zero for any dataset, so this sum tells us nothing about the spread.
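The cancellation is easy to check numerically. The following Python sketch adds up the deviations from the mean for both sets; the function name `sum_of_deviations` is only an illustrative choice for this example.

```python
def sum_of_deviations(data):
    """Add up (value - mean) for every value in the dataset."""
    mean = sum(data) / len(data)
    return sum(x - mean for x in data)


set_a = [10, 29, 30, 31, 50]
set_b = [10, 11, 30, 49, 50]

# Both totals are zero, even though set B is clearly more spread out.
print(sum_of_deviations(set_a))  # 0.0
print(sum_of_deviations(set_b))  # 0.0
```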
To counter this effect, instead of adding up the differences between the values and the mean, we square each difference and add up the squares. Because the square of any non-zero number is positive, the positives and negatives can no longer cancel out. As a final step, we divide the sum by the number of data points in the set to find the average squared difference.
Using the approach mentioned above, the average ‘spread’ for set A is:
\([(10-30)^{2}+(29-30)^{2}+(30-30)^{2}+(31-30)^{2}+(50-30)^{2}]/5 = 802/5 = 160.4\).
And the average ‘spread’ for set B is:
\([(10-30)^{2}+(11-30)^{2}+(30-30)^{2}+(49-30)^{2}+(50-30)^{2}]/5 = 1522/5 = 304.4\)
This makes a lot more sense, right?
The ‘spread’ calculated above is called variance (\(\sigma ^{2}\)), and its square root is called the standard deviation (\(\sigma\)). Why square root? Squaring the differences (which was done to stop the positives and negatives cancelling out) also squares the units of the data. Taking the square root undoes this and brings the measure back to the same units as the original numbers.
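As a quick check of the arithmetic for sets A and B, here is a minimal Python sketch. It assumes the sets are treated as whole populations, so the sum of squared differences is divided by the number of data points, as in the working above. The function name `population_variance` is an illustrative choice.

```python
import math


def population_variance(data):
    """Average of the squared differences from the mean (divides by n)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)


set_a = [10, 29, 30, 31, 50]
set_b = [10, 11, 30, 49, 50]

variance_a = population_variance(set_a)
variance_b = population_variance(set_b)

# The standard deviation is the square root of the variance.
print(f"Set A: variance = {variance_a:.1f}, sd = {math.sqrt(variance_a):.2f}")
# Set A: variance = 160.4, sd = 12.66
print(f"Set B: variance = {variance_b:.1f}, sd = {math.sqrt(variance_b):.2f}")
# Set B: variance = 304.4, sd = 17.45
```

Python's built-in `statistics.pvariance` and `statistics.pstdev` compute the same population quantities, so they can be used to cross-check the result.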
Variance and Standard Deviation of Discrete Random Variables
We have already learned about discrete random variables and their probability distributions. Now that you know variance and standard deviation, let’s see how to find them for discrete random variables.
Consider a discrete random variable X with the following values and corresponding probabilities:
X | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
P(X) | 0.1 | 0.2 | 0.3 | 0.3 | 0.1 |
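First, find the mean: \(\mu = E(X) = 1\times 0.1 + 2\times 0.2 + 3\times 0.3 + 4\times 0.3 + 5\times 0.1 = 3.1\)
The variance is the expected squared difference from the mean, which can be rewritten in a form that is easier to calculate: \(\sigma ^{2} = E((X-\mu )^{2})=E(X^{2})-\mu^{2}\)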
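The second equality comes from expanding the square and using \(E(X)=\mu\). A short sketch of the steps, treating \(\mu\) as a constant:
\(E((X-\mu )^{2}) = E(X^{2}-2\mu X+\mu ^{2})\)
\(= E(X^{2})-2\mu E(X)+\mu ^{2}\)
\(= E(X^{2})-2\mu ^{2}+\mu ^{2}\)
\(= E(X^{2})-\mu ^{2}\)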
Therefore, variance \(\sigma ^{2} = \sum x^{2}\cdot P(x)-\mu ^{2}\)
\(\sum x^{2}\cdot P(x) = 1^{2}\times 0.1+2^{2}\times 0.2+3^{2}\times 0.3+4^{2}\times 0.3+5^{2}\times 0.1 = 0.1+0.8+2.7+4.8+2.5 = 10.9\)
\(\mu ^{2} = 3.1^{2} = 9.61\)
Variance \(\sigma ^{2} = 10.9 - 9.61 = 1.29\)
Standard deviation \(\sigma = \sqrt{1.29} \approx 1.14\)
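The same calculation can be sketched in Python. The helper names `discrete_mean` and `discrete_variance` are illustrative choices for this example, and the results are rounded to two decimal places when printed because decimal probabilities are not stored exactly in floating point.

```python
import math

values = [1, 2, 3, 4, 5]
probabilities = [0.1, 0.2, 0.3, 0.3, 0.1]

# A valid probability distribution must add up to 1.
assert math.isclose(sum(probabilities), 1.0)


def discrete_mean(values, probabilities):
    """E(X): add up each value times its probability."""
    return sum(x * p for x, p in zip(values, probabilities))


def discrete_variance(values, probabilities):
    """Var(X) = E(X^2) - mu^2."""
    mu = discrete_mean(values, probabilities)
    expected_square = sum(x ** 2 * p for x, p in zip(values, probabilities))
    return expected_square - mu ** 2


mu = discrete_mean(values, probabilities)
variance = discrete_variance(values, probabilities)
sigma = math.sqrt(variance)

print(f"mean = {mu:.2f}")                  # mean = 3.10
print(f"variance = {variance:.2f}")        # variance = 1.29
print(f"standard deviation = {sigma:.2f}") # standard deviation = 1.14
```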
Wrap Up
Standard deviation and variance tell us how spread out the data are. Low values mean the data are clustered close to the mean, which usually makes outcomes easier to predict, so in many situations low values are what you want. However, there are some contexts where high values for these measures may be desirable (for example, if you’re trying to show that a lot has changed over time). Contact us today if you need more help understanding these concepts or with any other Year 11 HSC Maths concepts. Our friendly tutors are here to help!