Today, most data collection systems are designed to provide snapshots or estimates of the “big picture”. For example, survey sample sizes are computed so that parameter estimates are generalizable to the larger population from which the sample was selected. Although survey samples provide rich, granular data, a common challenge in survey sampling is the “small sample size” problem, which limits the range of information that can be reliably extracted. The ideal solution would be to increase the sample size, but for reasons of practicality and resource availability this is often not feasible. It is therefore important to reconcile the need for granular data with the need to keep the cost of data collection bearable. Enter small area estimation (SAE)!
In this blog, I will discuss the concept of SAE, provide a gentle introduction to basic SAE techniques, and describe the implementation of these techniques in the R programming language.
Small area estimation (SAE) is an umbrella term that refers to a set of statistical techniques used to produce reliable estimates for small geographic or demographic subpopulations even when only small samples are available for these areas. These subpopulations, often called “small areas,” may be defined by geographic boundaries (e.g., counties, districts) or demographic characteristics (e.g., age, gender and ethnic groups). Typically, they have insufficient sample sizes for traditional survey methods to provide accurate direct estimates.
SAE methods improve the precision of estimates by “borrowing strength” from related areas or populations through statistical modeling. They rest on a simple principle: if a sample survey cannot by itself provide an adequately precise and reliable granular estimate, SAE combines the survey data with auxiliary data sources that have wider coverage to enhance the survey estimator. Typically, the sample in a small area is much smaller than the sample size at the survey domain level. For example, suppose a survey was designed to provide reliable estimates at the regional level; the region then serves as the survey’s domain, and the sample size is calculated for each region in the target population. However, if a region contains multiple provinces, as is usually the case, then a province can be considered a small area because the survey may not have a large enough sample to yield reliable estimates at the provincial level. Many SAE techniques borrow strength from multiple data sources, and the choice of data depends on the parameter being estimated and the data sources available.
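To make the idea of “borrowing strength” concrete, here is a minimal sketch of an area-level (Fay–Herriot) model fitted with the eblupFH function from the sae package. The data are simulated purely for illustration; nothing here comes from the survey example used later in this post.

```r
# Illustrative sketch of "borrowing strength" with an area-level
# (Fay-Herriot) model, using simulated data and the sae package.
library(sae)

set.seed(123)
m      <- 20                                   # number of small areas
x      <- runif(m, 0, 10)                      # auxiliary covariate (e.g., from a census)
vardir <- runif(m, 0.5, 2)                     # known sampling variances of the direct estimates
theta  <- 5 + 2 * x + rnorm(m)                 # true area means (regression + area effect)
direct <- theta + rnorm(m, sd = sqrt(vardir))  # noisy direct survey estimates

# The EBLUP shrinks each direct estimate toward the regression
# prediction; areas with larger vardir are shrunk more.
fh <- eblupFH(direct ~ x, vardir = vardir)
head(cbind(direct = direct, eblup = fh$eblup))
```

The eblup column typically lies between each direct estimate and the regression fit; this shrinkage toward the auxiliary information is precisely what reduces the variance of small-area estimates.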
With this background, let’s look at some of the methodology and data requirements of some SAE techniques:
Applications – Small Area Estimation techniques have wide applicability in various fields of human endeavor. They include:
Steps in Conducting Small Area Estimation – There are at least six steps involved in SAE. They include:
Challenges in Small Area Estimation:
SAE in R: In this section, I provide an example of direct estimation in R for one indicator, using the laeken package. I assume some level of familiarity with R programming. Some content was adapted from the “Computer Workshop: Introduction to Small Area Estimation”, University of Bristol.
First, I load the necessary packages.
library(pacman) # You may need to install the pacman package first
## Warning: package 'pacman' was built under R version 4.0.5
pacman::p_load(laeken, simFrame)
Next, I load the data set for the estimation. The data set is synthetically generated from real Austrian EU-SILC (European Union Statistics on Income and Living Conditions) data.
data("eusilc")
SAE – Direct estimation: I start by computing the at-risk-of-poverty indicator at the national level. I use the arpr command from the laeken package. The arpr command requires that I specify the income variable, eqIncome, the survey weights, rb050, and the data set I am using.
hcr_national <- arpr("eqIncome", weights = "rb050", data = eusilc)
Print the results
hcr_national
## Value:
## [1] 14.44422
##
## Threshold:
## [1] 10859.24
As shown, the percentage of people below the poverty line (value) is 14.4% and the poverty line (threshold) is 10859.24.
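As a quick sanity check on where that threshold comes from: by default, arpr sets the poverty line at 60% of the weighted median equivalized income (the standard EU at-risk-of-poverty line), which can be reproduced with laeken's weightedMedian function. The snippet reloads the data so it runs on its own.

```r
library(laeken)
data("eusilc")

# By default, arpr() sets the poverty line at 60% of the
# weighted median equivalized income; reproduce it by hand.
hcr_national <- arpr("eqIncome", weights = "rb050", data = eusilc)
thr <- 0.6 * weightedMedian(eusilc$eqIncome, weights = eusilc$rb050)

# Compare the hand-computed line with the threshold stored in the object
all.equal(as.numeric(thr), as.numeric(hcr_national$threshold))
```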
Next, let’s learn how to produce estimates of the at-risk-of-poverty indicator at the subnational levels. To achieve this, I still use the previous command but this time I have added the breakdown argument, which specifies the geographical level that I am interested in.
hcr_subnational <- arpr("eqIncome", weights = "rb050", breakdown = "db040", data = eusilc)
Print the results
hcr_subnational
## Value:
## [1] 14.44422
##
## Value by domain:
## stratum value
## 1 Burgenland 19.53984
## 2 Carinthia 13.08627
## 3 Lower Austria 13.84362
## 4 Salzburg 13.78734
## 5 Styria 14.37464
## 6 Tyrol 15.30819
## 7 Upper Austria 10.88977
## 8 Vienna 17.23468
## 9 Vorarlberg 16.53731
##
## Threshold:
## [1] 10859.24
Using the variance command, I can estimate the variance of the at-risk-of-poverty indicator for each geographical unit. This command requires that I specify the variable of interest for my analysis (eqIncome), the survey weights (rb050), the variable that specifies the geographical breakdown (db040), and the variable(s) that define the survey design. For the current example, the design is assumed to be stratified by geography, so I again use the variable db040. Finally, I need to specify the indicator for which I want to estimate the variance, which I previously estimated and saved as an object (hcr_subnational), the type of bootstrap variance estimation (here I use the naive bootstrap, although other types exist), and the number of bootstrap replications.
hcr_variance <- variance("eqIncome", weights = "rb050", breakdown = "db040", design = "db040", data = eusilc, indicator = hcr_subnational, bootType = "naive", seed = 123, R = 500)
Print the results
hcr_variance
## Value:
## [1] 14.44422
##
## Variance:
## [1] 0.08168033
##
## Confidence interval:
## lower upper
## 13.92968 14.98392
##
## Value by domain:
## stratum value
## 1 Burgenland 19.53984
## 2 Carinthia 13.08627
## 3 Lower Austria 13.84362
## 4 Salzburg 13.78734
## 5 Styria 14.37464
## 6 Tyrol 15.30819
## 7 Upper Austria 10.88977
## 8 Vienna 17.23468
## 9 Vorarlberg 16.53731
##
## Variance by domain:
## stratum var
## 1 Burgenland 2.8599386
## 2 Carinthia 1.3806124
## 3 Lower Austria 0.4508332
## 4 Salzburg 1.4093275
## 5 Styria 0.5591159
## 6 Tyrol 1.0308416
## 7 Upper Austria 0.3306110
## 8 Vienna 0.5629735
## 9 Vorarlberg 1.9744992
##
## Confidence interval by domain:
## stratum lower upper
## 1 Burgenland 16.203852 22.82716
## 2 Carinthia 10.720872 15.37713
## 3 Lower Austria 12.501693 15.14681
## 4 Salzburg 11.589741 16.28962
## 5 Styria 12.982625 16.00579
## 6 Tyrol 13.487444 17.46962
## 7 Upper Austria 9.777863 12.01431
## 8 Vienna 15.728719 18.75112
## 9 Vorarlberg 13.854050 19.37823
##
## Threshold:
## [1] 10859.24
As seen above, the output gives estimates of the variance and confidence intervals.
In some cases it may be necessary to specify domains by cross-classifying a geographic and a demographic variable.
For this, I first create a domain variable (gender_region).
eusilc$gender_region <- interaction(eusilc$db040, eusilc$rb090)
Finally, I compute the at-risk-of-poverty indicator for the newly created variable.
hcr_geography_gender <- arpr("eqIncome", weights = "rb050", data = eusilc, breakdown = "gender_region")
Print results
hcr_geography_gender
## Value:
## [1] 14.44422
##
## Value by domain:
## stratum value
## 1 Burgenland.male 17.414524
## 2 Carinthia.male 10.552149
## 3 Lower Austria.male 11.348283
## 4 Salzburg.male 9.156964
## 5 Styria.male 11.671247
## 6 Tyrol.male 12.857542
## 7 Upper Austria.male 9.074690
## 8 Vienna.male 15.590616
## 9 Vorarlberg.male 12.973259
## 10 Burgenland.female 21.432598
## 11 Carinthia.female 15.392924
## 12 Lower Austria.female 16.372949
## 13 Salzburg.female 17.939382
## 14 Styria.female 16.964539
## 15 Tyrol.female 17.604861
## 16 Upper Austria.female 12.574206
## 17 Vienna.female 18.778813
## 18 Vorarlberg.female 19.883637
##
## Threshold:
## [1] 10859.24
Please note that the laeken package can be used to compute other indicators. For example, using the qsr command, I can compute the income quintile share ratio, which can be used to quantify income inequality. In the current example, I estimated the at-risk-of-poverty indicator at both national and subnational levels. I also applied the estimator to geographic regions cross-classified by gender (auxiliary information). Other small area estimates that might provide meaningful insights to researchers and policy makers include the number of poor people in a municipality, under-five mortality rates, etc. It is worth mentioning, too, that besides the laeken package, several other R packages are available for small area estimation and related tasks, including the sae, survey, and SUMMER packages.
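For completeness, here is what the qsr call looks like; it uses the same interface as arpr, and the breakdown argument again gives subnational results. The snippet reloads the data so it runs on its own (output omitted).

```r
library(laeken)
data("eusilc")

# Income quintile share ratio (S80/S20): total income received by the
# richest 20% divided by that of the poorest 20%, nationally and by region
qsr_national    <- qsr("eqIncome", weights = "rb050", data = eusilc)
qsr_subnational <- qsr("eqIncome", weights = "rb050", breakdown = "db040", data = eusilc)
qsr_national$value
```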
Conclusion
Small area estimation is a powerful tool that enables detailed and accurate analysis at granular levels, providing valuable insights for decision-making and resource allocation in various fields. There is a wide array of SAE techniques: some are easy to apply, while others are more challenging. Other considerations prior to actual estimation include the purpose of the SAE exercise, the type of indicator of interest, and the availability of data.
Karo has been writing scientific articles for over a decade. He has been a college professor, data scientist, journal editor, reviewer and research consultant. At oores Analytics, he leads biostatistical and research consulting activities and ensures that services and information published are tailored to meet client needs as much as possible.