Point Estimation and Sampling Methods
The purpose of point estimation is to estimate the parameters of a population using some sample data. Parameters of a population could include:
Mean
Median
Mode
Variance
Standard Deviation
Proportion
Range
A population refers to the entire group of interest. For example, the entire population of a country, or the entire population of students in a school. Population parameters are ‘fixed’ - meaning, they do not change unless the population itself changes (e.g., a new person is born, or a new student enters the school).
Sampling Techniques:
Typically, we cannot collect all of the data about all of the observations that belong in a population. This is because data collection is expensive, and populations are ever-changing. As such, we will typically collect a sample of data - in other words, a smaller group of observations which we think represents the population appropriately.
There are many ways to sample data, each of which can help us derive estimates of a population:
Simple Random Sampling: this method involves randomly selecting observations from a population; each observation has an equal chance of being selected. This method is useful for large populations, where sampling bias risk is low.
Stratified Sampling: this method divides a population into homogeneous groups (called strata) based on some characteristic (e.g., sex, age), and then random samples from each group are taken. This ensures that all groups are represented in the sample.
Cluster Sampling: this method is used when sampling is expensive, or stratification is difficult. The population is divided into groups (clusters) based on a non-personal characteristic (e.g., location, time), a random subset of clusters is selected, and observations within the selected clusters are included in the sample.
Systematic Sampling: this method is used when we want to select every nth element. For example, completing quality assurance tests on every 10th widget in a manufacturing line. It requires a homogeneous population (i.e., one with no periodic pattern in how observations are ordered).
Quota Sampling: in this method, the population is stratified and the number of samples selected from each group is based on the proportion of the entire population. Observations do not have an equal likelihood of being selected.
The problem we are trying to solve, or question we are trying to answer with data, will typically dictate the sampling methodology. For example, if we want a general understanding of the population, a random sample should suffice. However, if we want to understand differences in behaviour between distinct groups, perhaps we need stratified or cluster sampling. Remember to clearly think through the problem you are solving before determining the sampling technique.
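To make these techniques concrete, below is a minimal sketch of a few of them in Python using pandas. The population DataFrame, its column names, and the sample sizes are all hypothetical, chosen purely for illustration:

import pandas as pd

# Hypothetical population of 1,000 students (columns are made up for illustration)
population = pd.DataFrame({
    "student_id": range(1, 1001),
    "grade_level": ["freshman", "sophomore", "junior", "senior"] * 250,
})

# Simple random sampling: every student has an equal chance of selection
simple_sample = population.sample(n=100, random_state=42)

# Stratified sampling: draw 25 students from each grade level (stratum)
stratified_sample = population.groupby("grade_level").sample(n=25, random_state=42)

# Systematic sampling: select every 10th student
systematic_sample = population.iloc[::10]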
Point Estimation Techniques:
Once we have decided on an appropriate sampling methodology, we need to determine an appropriate technique for point estimation. Remember: the purpose of point estimation is to find the parameters of a population. The population parameters are fixed, and would only change if the population itself changed.
Method of Moments:
A simple technique for estimating population parameters is the Method of Moments. In this method, we calculate ‘moments’ from the sample distribution and set them equal to the theoretical moments of the population. For example:
μ = (1/n) * Σ x_i and σ^2 = (1/n) * Σ (x_i - μ)^2
Where:
n = number of observations
μ = mean
σ^2 = variance (σ = standard deviation)
For example, suppose you’re estimating the mean and standard deviation of class grades, where the distribution is normal. Consider the data set below. First, we would calculate the First Moment, which is the sample mean:
[50,60,70,75,78]
μ = sum([50,60,70,75,78]) / 5
μ = 66.6
Then we would calculate the Second Moment, which is the sample variance:
σ^2 = [(50-66.6)^2 + (60-66.6)^2 + (70-66.6)^2 + (75-66.6)^2 + (78-66.6)^2] / 5
σ^2 = 106.24
Taking the square root gives the standard deviation: σ = sqrt(106.24) ≈ 10.31. (Dividing by n-1 instead of n gives the unbiased sample variance, but the method of moments uses the 1/n form shown above.)
Finally, we would estimate that the population parameters are equal to these sample moments. This is a complicated name for a very simple estimate.
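A minimal sketch of this calculation in Python (using numpy and the five grades above):

import numpy as np

grades = np.array([50, 60, 70, 75, 78])
n = len(grades)

# First moment: the sample mean
mu = grades.sum() / n                       # 66.6

# Second central moment: the sample variance (1/n denominator)
sigma_sq = ((grades - mu) ** 2).sum() / n   # 106.24

# The standard deviation is the square root of the variance
sigma = np.sqrt(sigma_sq)                   # ~10.31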
If we have a large sample, this method is usually fairly accurate, because the Law of Large Numbers tells us that the sample mean and variance converge to the population mean and variance as the sample size grows.
Maximum Likelihood Estimation:
In a more complex scenario, for example predicting the potential annual value of a brand new customer, we could build a predictive model. We may use sample data to train our model, and based on the sample data, we would like to estimate the parameters of the model such that we maximize the likelihood that the sample data itself came from that model. In other words, we are working backwards from the sample data to find the parameters of the model which would have produced the sample data.
For example, suppose you flipped a coin 100 times and you observed 60 heads and 40 tails. We may want to model the likelihood of observing exactly 60 heads. We know that this example follows a binomial distribution, and the likelihood of observing exactly ‘y’ successes in n trials, given a success probability p, is:
L(p | y, n) = (n choose y) * p^y * (1 - p)^(n - y)
Where:
n = trials
y = successes
p = the probability of success in a given trial
In our example, this can be written as:
L(p | data) = C(100,60) * p^60 * (1-p)^40
Where:
L(p | data) - the likelihood of observing exactly the data we saw
C(100, 60) - the binomial coefficient (the number of distinct orderings of 60 heads and 40 tails)
p^60 * (1-p)^40 - the probability of one specific sequence of 60 heads and 40 tails
The goal of Maximum Likelihood Estimation is to find the value of p that maximizes the likelihood, since we are working backwards and trying to find the parameters of the function which most plausibly produced the observed data. In practice we maximize the log likelihood, which is easier to differentiate and has the same maximizer. Since the binomial coefficient C(100, 60) does not depend on p, we can drop it:
log(L(p | data)) = log(p^60 * (1 - p)^(100-60))
Which simplifies to:
log(L(p | data)) = 60*log(p) + (100-60)*log(1 - p)
Now we need to solve for ‘p’. We can do this by setting the derivative of the function, with respect to ‘p’, to 0:
(d / dp) log(L(p | data)) = 60 / p - (100-60)/ (1 - p)
Setting equal to 0, and solving for p:
60 / p = (100-60)/ (1 - p)
60 * (1-p) = p * (100-60)
60 - 60p = 40p
60 = 100p
p = 60 / 100
p = 0.6
So p = 0.6 is the value of the coin’s probability of heads that makes the observed data (60 heads in 100 flips) most likely - in other words, 0.6 is the maximum likelihood estimate of p.
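We can sanity-check this result numerically. The sketch below evaluates the log likelihood over a fine grid of candidate values of p and picks the best one; the grid resolution is an arbitrary choice:

import numpy as np

heads, n = 60, 100

# Candidate values of p (excluding 0 and 1, where log is undefined)
p = np.linspace(0.001, 0.999, 999)

# Log likelihood, dropping the constant binomial coefficient
log_likelihood = heads * np.log(p) + (n - heads) * np.log(1 - p)

# The grid value with the highest log likelihood is the MLE
p_hat = p[np.argmax(log_likelihood)]
print(p_hat)  # ~0.6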
This is a very simple example, but the intent is to generally demonstrate how MLE works. Once we discuss more complex models, we will revisit this concept.
The Least Squares Method
This is another method of estimating parameters in a statistical model. We will revisit this method when we cover Regression Modelling, but generally it estimates the parameters of a function by minimizing the sum of squared differences between the observed data points and the values the function predicts. For example, in a linear regression model, we would use a ‘line of best fit’ to represent a set of data points. The line of best fit is generally represented by:
y_i = β_0 + β_1x_i + ε_i
Where:
y_i = observed value of the dependent variable
β_0 = y-intercept
β_1 = coefficient
x_i = independent variable
ε_i = error term
The y-intercept can be interpreted as the predicted value when the independent variable is 0. The coefficient can be interpreted as the amount of change in the predicted value driven by a single unit of change in the independent variable. The objective of least squares estimation is to find the y-intercept and coefficient values which result in predicted values that are as close as possible to the observed values.
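As a preview of Regression Modelling, here is a minimal sketch of least squares estimation in Python; the data points are hypothetical, and numpy’s polyfit is used to solve for the intercept and coefficient:

import numpy as np

# Hypothetical observed data points
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.0, 9.9])

# polyfit with degree 1 finds the slope (β_1) and intercept (β_0)
# that minimize the sum of squared residuals
beta_1, beta_0 = np.polyfit(x, y, deg=1)

predicted = beta_0 + beta_1 * x
residuals = y - predicted  # estimates of the error terms ε_i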
Bayesian Estimation
We can also estimate parameters using Bayesian estimation. Bayesian statistics takes a different approach than frequentist statistics, in that we update our beliefs as new information arrives. We first specify a prior distribution - our initial beliefs about the parameters (e.g., mean, standard deviation), which may come from an earlier sample or from domain knowledge. Then, once we collect new data, we combine the prior with the likelihood of the new data (via Bayes’ theorem) to produce the posterior distribution - our updated estimate of the parameters.
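As a simple illustration, consider estimating a coin’s probability of heads (as in the MLE example). A Beta distribution is a common prior for a proportion, and with binomial data the posterior is also a Beta distribution, so the update is just arithmetic. The prior parameters below are a hypothetical starting belief, not a recommendation:

# Prior: Beta(10, 10) - a mild initial belief that the coin is roughly fair
alpha_prior, beta_prior = 10, 10

# New data: 60 heads and 40 tails
heads, tails = 60, 40

# Posterior: Beta(alpha + heads, beta + tails)
alpha_post = alpha_prior + heads
beta_post = beta_prior + tails

# Posterior mean estimate of p: the sample proportion (0.6) is pulled
# toward the prior mean (0.5) by the prior
p_estimate = alpha_post / (alpha_post + beta_post)
print(p_estimate)  # ~0.583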
In later modules we will discuss how these methods are applied in various modelling techniques, but for now, this should provide a general overview of some methods used to estimate population parameters.

