Introduction
The goal of statistical inference is to make generalizations about the population when only a sample is available.
Sampling
We want to study the probability distribution of a population variable: how likely each of its possible values is to occur. The data we collect must be randomly generated from this probability distribution in order to make valid inferences about the whole population.
Methods
There are several ways to sample a population:
- Simple random sample – each subject in the population has an equal chance of being selected. Some demographics might be missed.
- Stratified random sample – the population is divided into groups based on some characteristic (e.g. sex, geographic region). Then simple random sampling is done for each group based on its size in the actual population.
- Cluster sample – a random cluster of subjects is selected from the population (e.g. certain neighborhoods instead of the entire city).
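As a sketch, stratified random sampling can be implemented with Python's standard library `random` module. The population, the strata names ("urban"/"rural"), and the sizes below are made up for illustration:

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, n):
    """Draw a sample of size n, keeping each stratum's share of the population."""
    strata = defaultdict(list)
    for subject in population:
        strata[stratum_of(subject)].append(subject)
    sample = []
    for members in strata.values():
        # each stratum contributes in proportion to its size in the population
        k = round(n * len(members) / len(population))
        sample.extend(random.sample(members, k))
    return sample

# hypothetical population: 100 urban and 50 rural subjects
population = [("urban", i) for i in range(100)] + [("rural", i) for i in range(50)]
sample = stratified_sample(population, stratum_of=lambda s: s[0], n=30)
```

Note that `round` may make the strata sizes sum to slightly more or less than `n` in general; a production implementation would reconcile the rounding.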
Bias
There are several forms of sampling bias that can lead to incorrect inference:
- selection bias: the sample is not fully representative of the entire population. Examples:
  - only people who agree to answer surveys are included.
  - people are sampled from a specific segment of the population (e.g. polling about health at a fruit stand).
- survivorship bias: the population appears to improve over time because its weakest members leave it, e.g. through death. Examples:
  - recorded head injuries increased when metal helmets replaced cloth caps, because fewer blows were lethal.
  - damage on WWII planes that came back was not uniformly distributed, but concentrated in non-critical areas.
Note: other criteria can also impact the representativeness of our sample.
Representativeness
Due to the random nature of sampling, some samples are not representative of the population and will produce incorrect inference. This uncertainty is reflected in the confidence level of statistical conclusions:
- a small proportion of samples, typically noted $\alpha$, will produce incorrect inferences.
- for the remaining proportion $1 - \alpha$ of all samples, the conclusions will be correct.
- the confidence level is therefore expressed as $1 - \alpha$.
Note: 0.01 and 0.05 are the most common values of $\alpha$, translating to 99% and 95% confidence levels.
Probability Distribution
Point Estimate
It is often interesting to summarize the probability distribution with a single numerical feature of interest: the population parameter. We draw our conclusions about the parameter from the sample statistic.
A few important limitations:
- a sample is only part of the population; the numerical value of its statistic will not be the exact value of the parameter.
- the observed value of the statistic depends on the selected sample.
- some variability in the values of a statistic, over different samples, is unavoidable.
The Maximum Likelihood Estimator is the value of the parameter space (i.e. the set of all values the parameter can take) that is the most likely to have generated our sample. As the sample size increases, the MLE converges towards the true value of the population parameter.
- for a binomial distribution (discrete), the MLE of the probability of success is the number of successes divided by the total number of trials.
- for continuous distributions:
- the MLE of the population mean is the sample mean.
- the MLE of the population variance is the sample variance.
Note1: the sample variance needs a slight adjustment (Bessel's correction: dividing by $n - 1$ instead of $n$) to become unbiased.
Note2: in more complex problems, the MLE can only be found via numerical optimization.
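These closed-form estimators are easy to verify numerically with the standard library's `statistics` module; the data values below are made up for illustration:

```python
import statistics

# Bernoulli trials: the MLE of the probability of success is successes / trials
trials = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
p_hat = sum(trials) / len(trials)   # 7 successes out of 10 trials

# continuous data: the MLE of the mean is the sample mean
data = [2.0, 4.0, 6.0]
mean_mle = statistics.fmean(data)

# the MLE of the variance divides by n (biased);
# Bessel's correction divides by n - 1 to make it unbiased
var_mle = statistics.pvariance(data)       # divides by n
var_unbiased = statistics.variance(data)   # divides by n - 1
```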
Hypothesis Testing
Experimental Design
Hypothesis testing is used to make decisions about a population using sample data.
- We start with a null hypothesis $H_0$ that we assume to be true:
  - the population parameter is equal to a given value.
  - samples with different characteristics are drawn from the same population.
- We run an experiment to test this hypothesis:
  - collect data from a sample of predetermined size (see Statistical Power below).
  - perform the appropriate statistical test.
- Based on the experimental results, we can either reject or fail to reject this null hypothesis.
- If we reject it, we say that the data supports another, mutually exclusive, alternate hypothesis.
P-Value
We reject the null hypothesis if the probability of observing the experimental results, called the p-value, is very small assuming $H_0$ is true. The cutoff probability is called the level of significance $\alpha$ and is typically 5%.
More specifically, we measure the probability that our sample(s) produce such a test statistic or one more extreme under the $H_0$ probability distribution. A low p-value means that $H_0$ is unlikely to actually describe the population: we reject the null hypothesis.
- $P\leq\alpha$: we reject the null hypothesis. The observed effect is statistically significant.
- $P\gt\alpha$: we fail to reject the null hypothesis. The observed effect is not statistically significant.
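As an illustration, an exact one-sided p-value for a coin-fairness test ($H_0$: the probability of heads is 0.5) can be computed directly from the binomial distribution; the counts below are made up:

```python
from math import comb

n, heads = 100, 60          # hypothetical experiment: 60 heads in 100 flips
p0 = 0.5                    # probability of heads under H0

# right-tailed p-value: P(X >= heads) under the H0 binomial distribution
p_value = sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(heads, n + 1))

alpha = 0.05
reject_h0 = p_value <= alpha   # here the observed effect is significant
```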
Types of Errors
There are four possible outcomes for our hypothesis testing, with two types of errors:
Decision | $H_0$ is True | $H_0$ is False
---|---|---
Reject $H_0$ | Type I error: False Positive | Correct inference: True Positive
Fail to reject $H_0$ | Correct inference: True Negative | Type II error: False Negative
The Type I error is the probability of incorrectly rejecting a true null hypothesis (the sample belongs to the population but has extreme values); this probability is equal to the level of significance $\alpha$. It is also called a False Positive: falsely stating that the alternate hypothesis is true.
The Type II error $\beta$ is the probability of incorrectly failing to reject a false null hypothesis; it is also called a False Negative.
Note: The probabilities of making these two kinds of errors are related. Decreasing the Type I error increases the probability of the Type II error.
Statistical Power
Power, also called the sensitivity, is the probability of correctly rejecting a false $H_0$; it is equal to $1 - \beta$.
Two key things impact statistical power:
- the effect size: a large difference between groups is easier to detect.
- the sample size: it directly impacts the test statistic and the p-value.
Given the variance of data $\sigma$ and the minimum difference to detect $\delta$, a typical formula to assess sample size is:
\[N = (z_\alpha + z_\beta)^2 \times \frac{\sigma^2}{\delta^2}\]
where $z_\alpha$ and $z_\beta$ are the z-scores corresponding to $\alpha$ and $\beta$, respectively.
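This formula can be evaluated with the standard library's `statistics.NormalDist`; the values below assume a two-sided test (so $\alpha/2$ goes in each tail), 80% power, and an illustrative effect size and variance:

```python
from math import ceil
from statistics import NormalDist

alpha, beta = 0.05, 0.20    # 5% significance, 80% power
sigma, delta = 1.0, 0.5     # population SD and minimum difference to detect

z = NormalDist()                     # standard normal distribution
z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided test -> alpha / 2 per tail
z_beta = z.inv_cdf(1 - beta)

n = (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
n_required = ceil(n)        # round up to a whole number of subjects
```

This reproduces the classic rule of thumb of roughly 32 subjects per group to detect a half-standard-deviation difference with 80% power.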
Choosing a Test
Choosing a statistical test depends on:
- what hypothesis is tested.
- the type of the variable of interest & its probability distribution.
Note: relationship modelling will be covered in another article.
Population Inference
We can infer the value of the population parameter based on the sample statistics. Which parameter best represents the population depends on its probability distribution.
Difference Between Samples
Comparing samples aims to determine if some characteristics of the population have an impact on the variable of interest. More specifically, we check if different values of some categorical variable(s) lead to different probability distributions for the variable of interest.
Correlation Coefficients
A correlation coefficient quantifies the strength and direction of the association between two continuous or ordinal variables.
Assumptions of Parametric Tests
Both t-tests and ANOVA compare means between samples. They require specific assumptions for their conclusions to be statistically sound.
T-Tests
In its most common form, a t-test compares means.
- one-sample null hypothesis: the mean of a population has a specific value.
- two-sample null hypothesis: the means of two populations are equal.
T-tests make the following assumptions:
- the sample mean(s) follow a normal distribution (this is always the case for large samples under the CLT).
- the sample variance(s) follow a $\chi^2$ distribution (this is always the case for normally distributed data).
In practice, t-tests can be used when:
- the sample size is large (30+ observations), OR
- the population is roughly normal (for very small samples, use normal probability plots to assess normality).
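For a large sample (30+ observations, as noted above), the one-sample t statistic can be computed by hand and its p-value approximated with the normal distribution; the data below is artificial:

```python
from statistics import mean, stdev, NormalDist

# artificial sample of 36 observations
data = [1.0] * 18 + [3.0] * 18
mu0 = 1.5                        # H0: the population mean equals 1.5

n = len(data)
se = stdev(data) / n ** 0.5      # estimated standard error of the mean
t = (mean(data) - mu0) / se      # one-sample t statistic

# two-sided p-value, using the normal approximation (reasonable for large n;
# a proper t-test would use the t distribution with n - 1 degrees of freedom)
p_value = 2 * (1 - NormalDist().cdf(abs(t)))
```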
ANOVA
In its most common form, an ANalysis Of VAriance (ANOVA) compares means.
- one-way ANOVA null hypothesis: the means of three or more populations are equal.
- repeated measures ANOVA null hypothesis: the mean difference between paired, repeated measurements is zero.
ANOVA is mathematically a linear model, where the levels of all the categorical variables have been one-hot encoded. In particular, factorial ANOVA includes interaction terms between categorical factors and should therefore be interpreted like traditional linear models.
ANOVA being a linear model, its assumptions are the same as for linear regression:
- Normality
- Homogeneity of variance
- Independent observations
Note: If group sizes are equal, the F-statistic is robust to violations of normality and homogeneity of variance.
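A one-way ANOVA's F-statistic can be computed by hand as the ratio of between-group to within-group variance. The groups below are made up, and the p-value (which requires the F distribution, not available in the standard library) is omitted:

```python
from statistics import fmean

# made-up samples for three groups
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
k = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total number of observations
grand_mean = fmean(x for g in groups for x in g)

# between-group sum of squares (df = k - 1)
ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
# within-group sum of squares (df = n - k)
ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)

f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
```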
Non-Parametric Alternatives
Non-parametric tests should be used when:
- the assumptions are not met.
- the mean is not the most appropriate parameter to describe the population.
Appendix - Central Limit Theorem
Definition
A group of samples of the same size $N$ will have mean values normally distributed around the population mean $\mu$, regardless of the original distribution. This normal distribution has:
- the same mean $\mu$ as the population.
- a standard deviation, called the standard error, equal to $\sigma / \sqrt{N}$, where $\sigma$ is the standard deviation of the population.
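The theorem is easy to check by simulation: below, means of uniform samples (a distribution far from normal) have an observed spread close to the theoretical standard error. The sample size and number of repetitions are arbitrary:

```python
import random
from statistics import fmean, pstdev

random.seed(42)
N = 25                                   # size of each sample
sigma = (1 / 12) ** 0.5                  # SD of a Uniform(0, 1) population

# means of many samples drawn from a non-normal (uniform) distribution
means = [fmean(random.random() for _ in range(N)) for _ in range(5000)]

standard_error = sigma / N ** 0.5        # theoretical: sigma / sqrt(N)
empirical_se = pstdev(means)             # observed spread of the sample means
```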
Confidence Intervals
Because the sampling distribution of the sample statistic is normally distributed, 95% of all sample means fall within about two standard errors (1.96, more precisely) of the actual population mean. In other words, we can say with a 95% confidence level that the population parameter lies within a confidence interval of plus-or-minus two standard errors around the sample statistic.
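A sketch of this two-standard-error interval, with made-up data and the 1.96 critical value taken from the normal distribution:

```python
import random
from statistics import fmean, stdev, NormalDist

random.seed(1)
# hypothetical measurements from a population with mean 10 and SD 2
sample = [random.gauss(10, 2) for _ in range(100)]

x_bar = fmean(sample)
se = stdev(sample) / len(sample) ** 0.5      # estimated standard error
z = NormalDist().inv_cdf(0.975)              # ~1.96 for a 95% level

lower, upper = x_bar - z * se, x_bar + z * se   # 95% confidence interval
```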
Given the population mean $\mu$ and a hypothesized value $\mu_0$, there are three possible alternate hypotheses:
Left-tailed | Two-sided | Right-tailed
---|---|---
$\mu \lt \mu_0$ | $\mu \neq \mu_0$ | $\mu \gt \mu_0$
A p-value smaller than $\alpha$ means that the sample statistic falls in the rejection region (the tail areas) of the sampling distribution under $H_0$, with the relevant tail(s) depending on the alternate hypothesis.
Note: for two-tailed tests, we use $\alpha/2$ for each tail. This ensures the total probability of extreme values is $\alpha$.
Z-Scores
We can use two factors to assess the probability of observing the experimental results under the null hypothesis:
- The Z-score represents the number of standard deviations an observation is from the mean.
- The sampling distribution of the sample statistic is centered around the population parameter and has a standard error linked to the population variance.
This means that we can convert our sample statistic into a z-score and derive its p-value.
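Putting this together, the z-score of a sample mean and its p-values under the three alternate hypotheses above can be computed directly; the numbers are illustrative:

```python
from statistics import NormalDist

mu0, sigma, n = 50, 10, 25      # hypothesized mean, population SD, sample size
x_bar = 52                      # observed sample mean

se = sigma / n ** 0.5           # standard error of the mean
z = (x_bar - mu0) / se          # z-score of the sample mean

phi = NormalDist().cdf          # standard normal CDF
p_left = phi(z)                 # left-tailed:  H_a: mu < mu0
p_right = 1 - phi(z)            # right-tailed: H_a: mu > mu0
p_two = 2 * (1 - phi(abs(z)))   # two-sided:    H_a: mu != mu0
```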
Appendix - Further Reads
A few interesting Wikipedia articles:
Generalities
- https://en.wikipedia.org/wiki/Sampling_distribution
- https://en.wikipedia.org/wiki/Statistical_hypothesis_testing
Probabilities
- https://en.wikipedia.org/wiki/Probability_interpretations
- https://en.wikipedia.org/wiki/Frequentist_probability
- https://en.wikipedia.org/wiki/Bayesian_probability
Inference paradigms
- https://en.wikipedia.org/wiki/Frequentist_inference
- https://en.wikipedia.org/wiki/Bayesian_inference
- https://en.wikipedia.org/wiki/Lindley%27s_paradox
- https://www.stat.berkeley.edu/~stark/Preprints/611.pdf