Z-score on a sample set


In the previous post about Z-score, we looked at how likely it is to draw a sample data point x from a given Normal distribution.

Today, we take not just 1 data point but a set of n data points. We call the mean value of this set $\overline{x}$ . The question is: how likely is it to get a sample set of size n whose mean is at least as extreme as $\overline{x}$ ?

Let’s return to the example of IQ tests. Human IQ is normally distributed with mean $\mu = 100$ and std $\sigma = 15$ (which means variance $\sigma^2 = 225$ ). This time, say you are not taking the test alone, but with all your classmates. Your class has 20 students, and their average result on the IQ test is 106. The question is: how does your class’s IQ compare to that of other groups of 20 people?

Remember that in the original version of this example, you were the only one to take the IQ test, and we compared your result to the full population on Earth. In this modified version, we change the size from only you (size = 1) to your whole class (size = 20).

In fact, we can reduce the current problem to the original one, because the average value of a group of n independent draws (n > 1) from a Normal distribution also follows a Normal distribution, known as the sampling distribution of the mean. What we need are the mean and std of this distribution; let’s call them $\overline{\mu}$ and $\overline{\sigma}$ , respectively.

Compute $\overline{\mu}$ :

$\begin{aligned}\overline{\mu} &= E[\overline{x}] \\ &= E[\frac{1}{n}\sum_{1 \leq i \leq n}{x_i}] \\ &= \frac{1}{n} E[\sum_{1 \leq i \leq n}{x_i}] \\ &= \frac{1}{n}n \mu \\ &= \mu\end{aligned}$

Before computing $\overline{\sigma}$ , let’s review some properties of variance:

$Var[aX] = a^2 Var[X]$ ,
$Var[X + Y] = Var[X] + Var[Y]$ if X and Y are independent.
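As a quick sanity check, the two properties can be tried out numerically. The snippet below is only a sketch: the two distributions, the sample size, and the scale factor `a = 3` are arbitrary choices made for illustration and are not part of the IQ example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical independent samples; distributions and sizes are arbitrary.
xs = [random.gauss(100, 15) for _ in range(50_000)]
ys = [random.gauss(50, 10) for _ in range(50_000)]

a = 3
var_x = statistics.pvariance(xs)
var_y = statistics.pvariance(ys)

# Var[aX] = a^2 Var[X]: holds up to floating-point rounding.
var_ax = statistics.pvariance([a * x for x in xs])

# Var[X + Y] = Var[X] + Var[Y]: only approximate on a finite sample,
# because the sample covariance of xs and ys is small but not exactly zero.
var_sum = statistics.pvariance([x + y for x, y in zip(xs, ys)])

print(var_ax, a**2 * var_x)
print(var_sum, var_x + var_y)
```

The first pair of printed values should agree almost exactly, while the second pair should agree only roughly, since independence makes the covariance zero in expectation rather than in every finite sample.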

Ok, let’s continue (treating the $x_i$ as independent draws, so that the second property applies):

$\begin{aligned}\overline{\sigma}^2 &= Var[\overline{x}] \\ &= Var[\frac{1}{n} \sum_{1 \leq i \leq n}{x_i}] \\ &= \frac{1}{n^2} Var[\sum_{1 \leq i \leq n}{x_i}] \\ &= \frac{1}{n^2} n Var[x_i] \\ &= \frac{1}{n^2} n \sigma^2 \\ &= \frac{\sigma^2}{n}\end{aligned}$

So:

$\overline{\sigma} = \frac{\sigma}{\sqrt{n}}$

(It is worth mentioning that $\overline{\sigma}$ has its own name: the standard error of the mean.)

Well done! Now we know that:

The distribution of the sample mean is a normal distribution with $\overline{\mu} = \mu$ and $\overline{\sigma} = \frac{\sigma}{\sqrt{n}}$ .
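If you would like to see this result empirically, a small simulation can help. The sketch below draws many groups of 20 IQ scores from N(100, 15) and looks at the spread of their averages. The number of simulated groups (`num_groups`) and the random seed are assumptions picked only to keep the run short and repeatable.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

mu, sigma, n = 100, 15, 20
num_groups = 20_000  # assumed number of simulated groups

# Average IQ of each simulated group of n people.
sample_means = [
    statistics.fmean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(num_groups)
]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.pstdev(sample_means)
standard_error = sigma / math.sqrt(n)  # theoretical value, sigma / sqrt(n)

print(mean_of_means, mu)
print(std_of_means, standard_error)
```

The simulated mean of the group averages should land close to $\mu$ , and their standard deviation close to $\frac{\sigma}{\sqrt{n}}$ , though neither will match exactly because the simulation is finite.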

To get the z-score of your class’s IQ, we should use this distribution. Do you remember the formula of Z-score? Apply it here:

Z-score = $\frac{\overline{x} - \overline{\mu}}{\overline{\sigma}}$

or we can say in general:

Z-score = $\frac{\overline{x} - \mu}{\sigma / \sqrt{n}}$

For our case, $\overline{x} = 106, \mu = 100$ and $\sigma = 15$ . Thus,

Z-score[class IQ] = $\frac{106 - 100}{15 / \sqrt{20}} \approx 1.79$

Looking this up in the z-table gives a percentile of 96.3%, which means your class is in the top 3.7% of the world (compared to any other group of 20 people). That’s impressive!
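The same calculation can be sketched in Python instead of using a z-table. The helper name `z_score_of_mean` is made up for this example; `NormalDist` comes from Python's standard `statistics` module (Python 3.8 or later), and its `cdf` method plays the role of the table lookup.

```python
import math
from statistics import NormalDist


def z_score_of_mean(sample_mean, mu, sigma, n):
    """Z-score of a sample mean, using the standard error sigma / sqrt(n)."""
    return (sample_mean - mu) / (sigma / math.sqrt(n))


# Numbers from the class IQ example.
z = z_score_of_mean(sample_mean=106, mu=100, sigma=15, n=20)

# Share of groups of 20 whose average falls below the class average.
percentile = NormalDist().cdf(z)

print(round(z, 2))
print(round(percentile, 3))
```

This should reproduce the figures above: a z-score of about 1.79 and a percentile of about 96.3%.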


Conclusion

In this blog, we introduced the distribution of the sample mean, which is also a normal distribution with the same mean ( $\overline{\mu} = \mu$ ) but a different standard deviation ( $\overline{\sigma} = \frac{\sigma}{\sqrt{n}}$ ).

4 thoughts on “Z-score on a sample set”

tdang says:

April 5, 2020 at 12:55 pm

Very clear explanation. Any recommendation on the sample size needed for its mean to be close to a normal distribution?

1. Tung.M.Phung says:
  
  April 5, 2020 at 5:34 pm
  Thanks for your question. Let me try solving it!
  
  First, we reformulate the problem:
  You have:
  - a normal distribution ( $\mu$ and $\sigma$ are known).
  - a generator that can generate data samples (you believe that this generator generates data following the above normal distribution).
  You define: a confidence level and a confidence interval.
  You ask for: the smallest sample size such that with the confidence level, the sample mean will fall into the confidence interval.
  
  This seems complicated. Let’s make it concrete by plugging in real numbers:
  You have a normal distribution N(10, 2).
  Suppose your generator also outputs values following N(10, 2).
  Suppose “close” means the actual mean and the sample mean differ by no more than 0.1 $\sigma$ (i.e. 0.2).
  You ask for the smallest sample size n such that 95% of the random samples of size n have their means fall into the range [9.8, 10.2].
  
  Solve:
  For a 2-tailed z-test with 95% confidence level, the |z-score| needed is 1.96.
  Substitute all those values into the equation Z-score = $\frac{\overline{x} - \mu}{\sigma / \sqrt{n}}$ , that is, $1.96 = \frac{0.2}{2 / \sqrt{n}}$ .
  This gives $\sqrt{n} = 19.6$ , so n = 384.16. Since n must be a whole number, the smallest sample size that satisfies the above requirements is 385.
  
  Best,
  1. tdang says:
    
    April 5, 2020 at 11:14 pm
    
    Thanks for your detailed explanation.
    Well, I should have asked in a more specific way.
    My question is about cases where we do not know exact information about the population (because if we already know the mean/std of the population, collecting statistics on a sample seems unnecessary).
    In these cases, we can conduct a survey/measurement on a sample and then infer the population’s statistics from it. This raises a question: how large should the sample be?
    
    Regards,
    Truong Dang.
    
    1. Tung.M.Phung says:
      
      April 6, 2020 at 6:40 am
      
      Thanks for the details. Actually, I was not sure if my answer matched your intended question.
      However, the example above still seems to come in handy.
      
      A normal distribution’s signature is its $\mu$ and $\sigma$ . Thus, if we want our sample to match the population distribution, it is sufficient to have the $\mu$ and $\sigma$ of the sample close to those of the population.
      
      As a rule-of-thumb to determine whether a Z-test or a T-test is suitable for testing a hypothesis, the sample’s standard deviation (often denoted as s) is considered to be a good representative of the population’s $\sigma$ if the sample size > 30.
      
      The only problem is how to get the sample mean close to the population mean. In the example above, to compute the value of n, we actually did not assume anything about the mean of the population; we only assumed that the population’s $\sigma$ is 2. The formula is: Z-score = $\frac{0.1*\sigma}{\sigma / \sqrt{n}}$ , in which $\sigma$ cancels out. With a Z-score of 1.96, this gives a required sample size of $\approx$ 384.
      
      In conclusion, with a sample size of $\approx$ 384, we have both the mean and standard deviation of the sample “close” to the population. Thus, a sample of this size can well represent the whole population.
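The sample-size calculation discussed in this thread can be sketched in a few lines of Python. The function name `min_sample_size` and its parameters are made up for illustration, and the sketch assumes the population $\sigma$ is known and a two-tailed interval is wanted, as in the N(10, 2) example.

```python
import math
from statistics import NormalDist


def min_sample_size(sigma, margin, confidence=0.95):
    """Smallest whole n such that the sample mean lies within +/- margin
    of the population mean with the given confidence (sigma assumed known)."""
    # Two-tailed critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Solve z = margin / (sigma / sqrt(n)) for n, then round up.
    return math.ceil((z * sigma / margin) ** 2)


# Numbers from the thread: sigma = 2, margin = 0.1 * sigma = 0.2.
n = min_sample_size(sigma=2, margin=0.2)
print(n)
```

Because the result is rounded up to a whole number, this yields 385 rather than the unrounded value of roughly 384. When the margin is expressed as a fraction of $\sigma$ , the same n comes out for any $\sigma$ , since $\sigma$ cancels in the formula.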
      

Tung M Phung's Blog
