An explainer by FactCheck.lk
In Sri Lanka, it is common for the validity of sample surveys to be questioned, on the basis that the samples are “small” and cannot reasonably reflect the characteristics of the entire national population.
Examples of skeptical perceptions abound. A prominent Sinhala media personality on a popular TV channel questioned the validity of the findings of the Centre for Policy Alternatives’ (CPA) Economic Reform Index, stating “The sample size for this survey is 1,000…… it is wrong if the mindset of 22 million people is gauged by using 1,000 people”. Similarly, when Verité Research shares the results of its quarterly Mood of the Nation poll, some question how a sample size of approximately 1,000 people can truly reflect the ‘mood of the nation’.
Sri Lanka has an adult population of approximately 14 million. It is generally not feasible to speak to every single member of the population to learn about important characteristics of the population, such as the unemployment rate or sentiments on the government. A census of the population is carried out not more than once in 10 years due to the vast amount of time and human resources required.
This makes sample surveys a useful tool. The statistical principle behind a sample survey is that a reasonable estimate of the characteristics of an entire population can be obtained by looking at the characteristics of a much smaller, randomly selected sample of that population.
But how large should a randomized sample survey be, to have a reasonable estimate of the characteristics of the entire population? That is a question best answered by statistical mathematics, not subjective perception. This FactCheck.lk Explainer sets out the objectively defined statistical approach to provide a scientific answer to the question.
How do statisticians define a reasonable estimation from a sample survey?
Statisticians use two criteria to mathematically determine the level of reasonableness of an estimation derived from a sample survey: (1) margin of error and (2) confidence level. These two criteria depend on the sample size and how the sample is chosen (sample selection). Assessing whether these criteria are met allows for an objective evaluation of the survey’s reliability.
(1) The margin of error
This is a measurement of how much the average answer to a particular question in the survey sample may differ from the average answer of the entire population. The smaller the margin of error, the closer the sample result is to that of the entire population.
For example, suppose a survey shows that 50% of the sample supports a certain political candidate, with a margin of error of (+/-) 5 percentage points. This means the actual percentage of people in the population who support that candidate is expected to be anywhere between 45% and 55%. If the margin of error were only (+/-) 3 percentage points, the expected range for support in the population would be narrower, between 47% and 53%.
(2) The confidence level
This is a measurement of the confidence we can have that the result of asking the entire population would be within the “result range” we get from the sample. The “result range” is the range from the sample average minus the margin of error to the sample average plus the margin of error.
The confidence level can be described as the probability that the result one would get from surveying the entire population is within the “result range” of the limited sample that is selected.
For example, consider a survey that has 50% of the sample supporting a certain political candidate, with a margin of error of (+/-) 3 percentage points and a confidence level of 95%. Here, a confidence level of 95% means that if the survey is conducted 100 times, using the same sample size and selection method, about 95 out of the 100 surveys would be expected to produce results that fall within (+/-) 3 percentage points of the result for the entire population. Put simply, there is a 95% probability that the actual support for the candidate in the above example is between 47% and 53%.
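The "conducted 100 times" idea can be tried out on a computer. The sketch below is only an illustration under stated assumptions: it invents a population in which exactly 50% support the candidate, runs 2,000 simulated surveys of 1,000 randomly chosen respondents each, and counts how often the "result range" of a survey contains the true 50%. The function name, the number of simulated surveys and the multiplier 1.96 (the conventional value for a 95% confidence level) are choices made for this example, not part of the original explainer.

```python
import math
import random

def simulate_coverage(true_share=0.5, n=1000, surveys=2000, z=1.96, seed=1):
    """Share of simulated surveys whose result range contains the true value."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(surveys):
        # Each respondent independently supports the candidate with
        # probability true_share (an assumed, very large population).
        supporters = sum(rng.random() < true_share for _ in range(n))
        p_hat = supporters / n
        # Margin of error for a proportion at the chosen confidence level.
        moe = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - moe <= true_share <= p_hat + moe:
            hits += 1
    return hits / surveys

coverage = simulate_coverage()
print(f"Surveys whose result range contains the true value: {coverage:.1%}")
```

The printed share should come out close to 95%, which is what the 95% confidence level describes.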
What is the statistical standard for margin of error and confidence level?
Both the margin of error and the confidence level are ways of specifying how close the survey result is likely to be to the actual results of the entire population. Low margins of error and high confidence levels make survey results more likely to be close to those of the entire population. If Sri Lanka’s entire population were surveyed, the margin of error would be zero and the confidence level would be 100%. Therefore, the larger the randomly selected sample, the lower the margin of error and the higher the confidence level that can be achieved.
In this context, what is an acceptable survey sample in terms of these two metrics? Sample surveys around the world, especially in relation to opinion polling and understanding population characteristics, have tended to carry a margin of error of (+/-) 3 percentage points and a confidence level of 95%.
Achieving a particular statistical standard in terms of the margin of error and the confidence level is determined both by the size of the sample, and the method by which the sample is chosen.
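The link between sample size and margin of error can be made concrete with the usual textbook approximation for a proportion from a simple random sample. The sketch below is a minimal illustration, not a full survey methodology: the function name is invented for this example, 1.96 is the conventional multiplier for a 95% confidence level, and a 50% response share is assumed because it gives the widest (most cautious) margin.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error (as a proportion) for a simple random sample.

    p: share of the sample giving a particular answer (0.5 is the worst case)
    n: sample size
    z: multiplier for the confidence level (1.96 for 95%)
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 400, 1000, 2400, 10000):
    moe = margin_of_error(0.5, n)
    print(f"sample of {n:>6}: +/- {moe * 100:.1f} percentage points")
```

Running it shows the margin shrinking quickly at first and then more slowly: each further reduction in the margin requires a disproportionately larger sample.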
Method of Random Selection
The margin of error and confidence level calculations are based on the sample being randomly selected from the population. Random selection techniques provide everyone in the population an approximately equal chance of being selected, and form samples that are likely to be a representative microcosm of the entire population.
What happens if this expectation of random selection is violated and there is a bias towards some group in the population in terms of sample selection? The results of the sample survey will also be biased towards the results of that group, which could be different from those of the entire population. For instance, if the survey was conducted only online, the sample would be biased towards those who have the capacity (devices and data) to access the internet. If only those able to speak English were selected, the sample results may be less representative of Sri Lanka’s entire population, which mostly speaks Sinhala or Tamil.
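The effect of a biased selection can be illustrated with a small simulation. Everything in the sketch below is invented for illustration: a made-up population of 200,000 people in which 40% have internet access, and in which support for some proposal is assumed to be 60% among those with access and 40% among those without. It compares a random sample of 1,000 from the whole population with a sample of 1,000 drawn only from people who are online.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population; all shares below are invented for this example.
population = []
for _ in range(200_000):
    online = rng.random() < 0.40
    supports = rng.random() < (0.60 if online else 0.40)
    population.append((online, supports))

def share_supporting(people):
    """Fraction of the given (online, supports) records that support."""
    return sum(supports for _, supports in people) / len(people)

true_share = share_supporting(population)

# Random selection: everyone has the same chance of being picked.
random_sample = rng.sample(population, 1000)

# Biased selection: only people with internet access can be picked.
online_only = rng.sample([person for person in population if person[0]], 1000)

print(f"Entire population:  {true_share:.1%}")
print(f"Random sample:      {share_supporting(random_sample):.1%}")
print(f"Online-only sample: {share_supporting(online_only):.1%}")
```

Both samples have the same size, but only the randomly selected one should land near the population figure; the online-only sample reflects the online group rather than the population as a whole.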
There are many methods to implement random selection. One method is placing the names of the entire population into a bin, mixing them up, and drawing out names, as is done to pick the winners of a lottery. A more advanced method is to begin by stratifying the population. Stratification involves dividing the entire population into non-overlapping groups (imagine having a different bin for each group) and then randomly selecting a certain number from each bin.
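The two methods described above can be sketched in a few lines. The population, the three groups "A", "B" and "C" and their sizes are all hypothetical, and the stratified version assumes one common design choice: the number drawn from each bin is proportional to the size of that group.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population of (person_id, group) pairs; sizes are invented.
group_sizes = {"A": 50_000, "B": 30_000, "C": 20_000}
population = [(f"{g}-{i}", g) for g, size in group_sizes.items() for i in range(size)]

def simple_random_sample(people, n):
    """One bin: draw n people, each equally likely to be picked."""
    return rng.sample(people, n)

def stratified_sample(people, n):
    """One bin per group: draw from each bin in proportion to its size."""
    strata = {}
    for person in people:
        strata.setdefault(person[1], []).append(person)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(people))
        sample.extend(rng.sample(members, k))
    return sample

simple = simple_random_sample(population, 1000)
stratified = stratified_sample(population, 1000)
print("Simple random:", Counter(group for _, group in simple))
print("Stratified:   ", Counter(group for _, group in stratified))
```

In the simple random sample the group counts vary a little from draw to draw, while the stratified sample reproduces the group proportions by construction.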
If there are a very large number of groups (imagine 5,000 groups/bins), one might also conduct a multi-stage stratified sample selection. This involves, first, grouping those numerous bins into a smaller number of larger groups/bins, say 50 bins. Each of those 50 larger bins would then have 100 smaller bins. In the first stage, one would randomly select, say, 10 of the smaller bins (out of the 100) from each of the 50 larger bins; and in the second stage, randomly select a certain number of people from each of the resulting 500 bins rather than from all 5,000.
In national sample surveys, multi-stage stratified random sampling is used to enhance the effectiveness of random selection and get as good a representative sample as practically possible.
For example, Sri Lanka has 25 districts with different population characteristics. The first stage of a multi-stage stratified random sampling could be stratifying the population into these 25 districts. Each district in turn has separate administrative areas called Grama Niladari (GN) divisions, which are smaller pockets of more homogeneous groups. There are 14,022 such GN divisions. The second stage could be selecting, from within each district, a certain number of GN divisions in proportion to the population of the district (more GN divisions from districts with larger populations). Finally, a certain number of people from each selected GN division could be chosen for the survey sample in proportion to the population of the GN division (more people from GN divisions with larger populations).
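The allocation logic of such a design can be sketched as follows. This is a simplified illustration, not the procedure of any particular survey: the three districts, their GN divisions and all population figures are invented, as are the choices of 20 GN divisions and 1,000 respondents. The output is a plan of how many people to interview in each selected GN division; the people themselves would then be picked at random within each division.

```python
import random

rng = random.Random(2023)  # fixed seed so the run is repeatable

# Hypothetical sampling frame: district -> {GN division: adult population}.
frame = {
    name: {f"{name}-GN{i}": rng.randint(500, 3000) for i in range(count)}
    for name, count in {"District X": 60, "District Y": 30, "District Z": 10}.items()
}

def proportional_allocation(sizes, total):
    """Split `total` across the keys of `sizes` in proportion to their values."""
    grand = sum(sizes.values())
    quotas = {key: total * value / grand for key, value in sizes.items()}
    alloc = {key: int(quota) for key, quota in quotas.items()}  # round down
    leftover = total - sum(alloc.values())
    # Hand the remaining units to the keys with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda key: quotas[key] - alloc[key], reverse=True)
    for key in by_remainder[:leftover]:
        alloc[key] += 1
    return alloc

def multistage_plan(frame, n_divisions=20, n_people=1000):
    # Stage 1: every district is a stratum.
    district_pop = {d: sum(gns.values()) for d, gns in frame.items()}
    # Stage 2: more GN divisions from districts with larger populations.
    divisions_per_district = proportional_allocation(district_pop, n_divisions)
    chosen = {}
    for district, k in divisions_per_district.items():
        for gn in rng.sample(sorted(frame[district]), k):
            chosen[gn] = frame[district][gn]
    # Stage 3: more people from GN divisions with larger populations.
    return proportional_allocation(chosen, n_people)

plan = multistage_plan(frame)
for gn, people in sorted(plan.items()):
    print(f"{gn}: interview {people} people")
```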
Minimum acceptable size of a randomly selected sample
The above is a basic explanation of sample surveys, to help readers objectively calculate the sample size that would be adequate for statistically valid results. It is possible to calculate the required sample size using the relevant mathematical equations; free online calculators can generate the result as well.
Sri Lanka has an adult population of c. 14 million persons and c. 5.7 million households. Both the equation and the calculators, when applied, give the following result for Sri Lanka: a 95% confidence level and an error margin of 3 percentage points on a nationwide survey are achieved with a randomly selected sample of c. 1,000.
Thus, the claim that very large sample sizes are necessary to achieve reasonably accurate survey results is a popular misconception. Statistical science and math lead us to an objective conclusion: c. 1,000 is enough. Larger samples can be used, but the additional benefit can be small relative to the additional cost. For instance, more than doubling the sample to c. 2,400 will only reduce the maximum error margin from 3 to 2 percentage points. You can use this online calculator to test it for yourself: https://www.surveymonkey.com/mp/sample-size-calculator/.
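The same check can be done without an online calculator. The sketch below uses a standard textbook formula for estimating a proportion from a simple random sample, with an adjustment for the size of the population; it is an assumption that this matches what any particular calculator does internally. The function name is invented for this example, 1.96 is the conventional multiplier for a 95% confidence level, and a 50% response share is assumed because it requires the largest sample.

```python
import math

def required_sample_size(margin, population=None, z=1.96, p=0.5):
    """Smallest simple random sample that achieves the given margin of error.

    margin:     desired margin of error as a proportion (0.03 = 3 points)
    population: population size, or None to treat it as effectively unlimited
    z:          multiplier for the confidence level (1.96 for 95%)
    p:          assumed response share (0.5 is the most cautious choice)
    """
    n = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population adjustment: matters only for small populations.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

adults = 14_000_000  # approximate adult population of Sri Lanka
for margin in (0.05, 0.03, 0.02):
    n = required_sample_size(margin, adults)
    print(f"+/- {margin * 100:.0f} points at 95% confidence: sample of {n}")

# The required sample barely changes with the size of the population.
for population in (100_000, 1_000_000, adults):
    n = required_sample_size(0.03, population)
    print(f"population of {population:>10,}: sample of {n}")
```

For a margin of 3 percentage points the result is a little under 1,100, and for 2 percentage points it is about 2,400, in line with the figures quoted above. The second loop shows why a sample of about 1,000 works for a country of millions: once the population is large, its exact size has almost no effect on the sample required.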
There are further nuances and assumptions, such as the shape of the mathematical distribution of responses, and tests that can be conducted to check whether these criteria are met, which are not discussed in this article.
This post was last updated with minor clarifying edits on 7 November 2023.