Today, most data collection systems are designed to provide snapshots or estimates of the “big picture”. For example, survey sample sizes are computed so that parameter estimates are generalizable to the larger population from which the sample was selected. Although survey samples provide rich, granular data, a common challenge in survey sampling is the “small sample size” problem, which limits the range of information that can be reliably extracted. The ideal solution would be to increase the sample size, but for reasons of practicality and resource availability this is often not feasible. It is therefore important to reconcile the need for granular data with the need to keep the cost of data collection bearable. Enter small area estimation (SAE)!
In this blog, I will discuss the concept of SAE, provide a gentle introduction to basic SAE techniques, and describe the implementation of these techniques in the R programming language.
Small area estimation (SAE) is an umbrella term that refers to a set of statistical techniques used to produce reliable estimates for small geographic or demographic subpopulations even when only small samples are available for these areas. These subpopulations, often called “small areas,” may be defined by geographic boundaries (e.g., counties, districts) or demographic characteristics (e.g., age, gender and ethnic groups). Typically, they have insufficient sample sizes for traditional survey methods to provide accurate direct estimates.
SAE methods improve the precision of estimates by “borrowing strength” from related areas or populations through statistical modeling. They rest on a simple principle: if a sample survey cannot by itself provide an adequately precise and reliable granular estimate, SAE combines the survey data with auxiliary data sources that have wider coverage to enhance the survey estimator. Typically, the sample in a small area is much smaller than the sample size at the survey domain level. For example, suppose a survey was designed to provide reliable estimates at the regional level; the region then serves as the survey’s domain, and the sample size is calculated for each region in the target population. However, if a region contains multiple provinces, as is usually the case, then a province can be considered a small area because the survey may not have a large enough sample to yield reliable estimates at the provincial level. Many SAE techniques borrow strength from multiple data sources, and the choice of data depends on the parameter being estimated and the data sources available.
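To make the idea of “borrowing strength” concrete, here is a minimal sketch of an area-level (Fay–Herriot) model fitted with the eblupFH function from the sae package. The data are simulated purely for illustration; nothing here comes from the survey example used later in this post.

```r
# Illustrative sketch of "borrowing strength" with an area-level
# (Fay-Herriot) model, using simulated data and the sae package.
library(sae)

set.seed(123)
m      <- 20                                   # number of small areas
x      <- runif(m, 0, 10)                      # auxiliary covariate (e.g., from a census)
vardir <- runif(m, 0.5, 2)                     # known sampling variances of the direct estimates
theta  <- 5 + 2 * x + rnorm(m)                 # true area means (regression + area effect)
direct <- theta + rnorm(m, sd = sqrt(vardir))  # noisy direct survey estimates

# The EBLUP shrinks each direct estimate toward the regression
# prediction; areas with larger vardir are shrunk more.
fh <- eblupFH(direct ~ x, vardir = vardir)
head(cbind(direct = direct, eblup = fh$eblup))
```

The eblup column typically lies between each direct estimate and the regression fit; this shrinkage toward the auxiliary information is precisely what reduces the variance of small-area estimates.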
With this background, let’s look at some of the methodology and data requirements of some SAE techniques:
Applications – Small Area Estimation techniques have wide applicability in various fields of human endeavor. They include:
Steps in Conducting Small Area Estimation – There are at least six steps involved in SAE. They include:
Challenges in Small Area Estimation:
SAE in R: In this section, I provide an example of direct estimation in R for one indicator, using the laeken package. I assume some level of familiarity with R programming. Some content was adapted from the “Computer Workshop: Introduction to Small Area Estimation”, University of Bristol.
First, I load the necessary packages.
library(pacman) # You may need to install the pacman package first
## Warning: package 'pacman' was built under R version 4.0.5
pacman::p_load(laeken, simFrame)
Next, I load the data set for the estimation. The data set is synthetically generated from real Austrian EU-SILC (European Union Statistics on Income and Living Conditions) data.
data("eusilc")
SAE – Direct estimation: I start by computing the at-risk-of-poverty indicator at the national level. I use the arpr command from the laeken package. The arpr command requires that I specify the income variable, eqIncome, the survey weights, rb050, and the data set I am using.
hcr_national <- arpr("eqIncome", weights = "rb050", data = eusilc)
Print the results
hcr_national
## Value:
## [1] 14.44422
##
## Threshold:
## [1] 10859.24
As shown, the percentage of people below the poverty line (value) is 14.4% and the poverty line (threshold) is 10859.24.
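As a quick sanity check on where that threshold comes from: by default, arpr sets the poverty line at 60% of the weighted median equivalized income (the standard EU at-risk-of-poverty line), which can be reproduced with laeken's weightedMedian function. The snippet reloads the data so it runs on its own.

```r
library(laeken)
data("eusilc")

# By default, arpr() sets the poverty line at 60% of the
# weighted median equivalized income; reproduce it by hand.
hcr_national <- arpr("eqIncome", weights = "rb050", data = eusilc)
thr <- 0.6 * weightedMedian(eusilc$eqIncome, weights = eusilc$rb050)

# Compare the hand-computed line with the threshold stored in the object
all.equal(as.numeric(thr), as.numeric(hcr_national$threshold))
```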
Next, let’s learn how to produce estimates of the at-risk-of-poverty indicator at the subnational levels. To achieve this, I still use the previous command but this time I have added the breakdown argument, which specifies the geographical level that I am interested in.
hcr_subnational <- arpr("eqIncome", weights = "rb050", breakdown = "db040", data = eusilc)
Print the results
hcr_subnational
## Value:
## [1] 14.44422
##
## Value by domain:
## stratum value
## 1 Burgenland 19.53984
## 2 Carinthia 13.08627
## 3 Lower Austria 13.84362
## 4 Salzburg 13.78734
## 5 Styria 14.37464
## 6 Tyrol 15.30819
## 7 Upper Austria 10.88977
## 8 Vienna 17.23468
## 9 Vorarlberg 16.53731
##
## Threshold:
## [1] 10859.24
Using the variance command, I can estimate the variance of the at-risk-of-poverty indicator for each geographical unit. This command requires that I specify the variable of interest for my analysis (eqIncome), the survey weights (rb050), the variable that specifies the geographical breakdown (db040), and the variable(s) that define the survey design. For the current example, the design is assumed to be stratified by geography, so I again use the variable db040. Finally, I need to specify the indicator for which I want to estimate the variance, which I previously estimated and saved as an object (hcr_subnational), the type of bootstrap variance estimation (here I use the naive bootstrap, although other types exist), and the number of bootstrap replications.
hcr_variance <- variance("eqIncome", weights = "rb050", breakdown = "db040", design = "db040", data = eusilc, indicator = hcr_subnational, bootType = "naive", seed = 123, R = 500)
Print the results
hcr_variance
## Value:
## [1] 14.44422
##
## Variance:
## [1] 0.08168033
##
## Confidence interval:
## lower upper
## 13.92968 14.98392
##
## Value by domain:
## stratum value
## 1 Burgenland 19.53984
## 2 Carinthia 13.08627
## 3 Lower Austria 13.84362
## 4 Salzburg 13.78734
## 5 Styria 14.37464
## 6 Tyrol 15.30819
## 7 Upper Austria 10.88977
## 8 Vienna 17.23468
## 9 Vorarlberg 16.53731
##
## Variance by domain:
## stratum var
## 1 Burgenland 2.8599386
## 2 Carinthia 1.3806124
## 3 Lower Austria 0.4508332
## 4 Salzburg 1.4093275
## 5 Styria 0.5591159
## 6 Tyrol 1.0308416
## 7 Upper Austria 0.3306110
## 8 Vienna 0.5629735
## 9 Vorarlberg 1.9744992
##
## Confidence interval by domain:
## stratum lower upper
## 1 Burgenland 16.203852 22.82716
## 2 Carinthia 10.720872 15.37713
## 3 Lower Austria 12.501693 15.14681
## 4 Salzburg 11.589741 16.28962
## 5 Styria 12.982625 16.00579
## 6 Tyrol 13.487444 17.46962
## 7 Upper Austria 9.777863 12.01431
## 8 Vienna 15.728719 18.75112
## 9 Vorarlberg 13.854050 19.37823
##
## Threshold:
## [1] 10859.24
As seen above, the output gives estimates of the variance and confidence intervals.
In some cases it may be necessary to specify domains by cross-classifying a geographic and a demographic variable.
For this, I first create a domain variable (gender_region).
eusilc$gender_region <- interaction(eusilc$db040, eusilc$rb090)
Finally, I compute the at-risk-of-poverty indicator for the newly created variable.
hcr_geography_gender <- arpr("eqIncome", weights = "rb050", data = eusilc, breakdown = "gender_region")
Print results
hcr_geography_gender
## Value:
## [1] 14.44422
##
## Value by domain:
## stratum value
## 1 Burgenland.male 17.414524
## 2 Carinthia.male 10.552149
## 3 Lower Austria.male 11.348283
## 4 Salzburg.male 9.156964
## 5 Styria.male 11.671247
## 6 Tyrol.male 12.857542
## 7 Upper Austria.male 9.074690
## 8 Vienna.male 15.590616
## 9 Vorarlberg.male 12.973259
## 10 Burgenland.female 21.432598
## 11 Carinthia.female 15.392924
## 12 Lower Austria.female 16.372949
## 13 Salzburg.female 17.939382
## 14 Styria.female 16.964539
## 15 Tyrol.female 17.604861
## 16 Upper Austria.female 12.574206
## 17 Vienna.female 18.778813
## 18 Vorarlberg.female 19.883637
##
## Threshold:
## [1] 10859.24
Please note that the laeken package can be used to compute other indicators. For example, using the qsr command, I can compute the income quintile share ratio, which can be used to quantify income inequality. In the current example, I estimated the at-risk-of-poverty indicator at both national and subnational levels. I also applied the estimator to geographic regions cross-classified by gender (auxiliary information). Other small area estimates that might provide meaningful insights to researchers and policy makers include the number of poor people in a municipality, under-five mortality rates, etc. It is worth mentioning, too, that besides the laeken package, several other R packages are available for small area estimation and related tasks, including the sae, survey, and SUMMER packages.
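For completeness, here is what the qsr call looks like; it uses the same interface as arpr, and the breakdown argument again gives subnational results. The snippet reloads the data so it runs on its own (output omitted).

```r
library(laeken)
data("eusilc")

# Income quintile share ratio (S80/S20): total income received by the
# richest 20% divided by that of the poorest 20%, nationally and by region
qsr_national    <- qsr("eqIncome", weights = "rb050", data = eusilc)
qsr_subnational <- qsr("eqIncome", weights = "rb050", breakdown = "db040", data = eusilc)
qsr_national$value
```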
Conclusion
Small area estimation is a powerful tool that enables detailed and accurate analysis at granular levels, providing valuable insights for decision-making and resource allocation in various fields. There is a wide array of SAE techniques: some are easy to apply, while others are more challenging. Other considerations prior to actual estimation include the purpose of the SAE exercise, the type of indicator of interest, and the availability of data.
Karo has been writing scientific articles for over a decade. He has been a college professor, data scientist, journal editor, reviewer and research consultant. At oores Analytics, he leads biostatistical and research consulting activities and ensures that services and information published are tailored to meet client needs as much as possible.