A dataset of harmonized DHS districts

Create longitudinal data from Demographic and Health Surveys

In this post, I present a dataset that I created, containing the harmonized district names and regions of 254 Demographic and Health Surveys (DHS). I explain why I created it, how I built it and how to use it.

1. The need for a harmonization of DHS districts

The DHS data are micro-datasets monitored by the DHS Program (ICF 2017), which is mainly funded by USAID, since 1984. They comprise more than 400 surveys in more than 90 countries. The DHS dataset is a rich source of data for a large set of research topics such as health, epidemiology, economics, demography or sociology. Local authorities are responsible for implementing the survey in their own countries, but the methodology of DHS surveys is standardized, which makes them even more useful for cross-country analyses.

However, differences in the encoding of variables can still occur. In particular, the reporting of region/district names is far from standardized. This can be due to differences in reporting languages or simply to typos. For example, a given region of a country can be reported as “Center” (US English) with the index 5 in 1990, and as “Centre” (British English, French) with the index 3 in 1995. Region names and indices can also change because of a political change of administrative sub-divisions (merging or splitting of regions, or a complete redefinition of the sub-divisions).

These discontinuities are a problem when one wants to create a panel dataset for one or several countries, with the region/district as the unit of the panel. For my PhD project, I needed to build a panel dataset at the district level for as many DHS surveys (and therefore countries) as I could. I therefore spent some time harmonizing the region names and indices. I decided to share the resulting dataset and to explain in this document how I built it and how to use it.

2. How did I create the dataset?

To create this dataset, I first loaded all the DHS datasets, kept only the DHS survey ID, the region name and the region index, and put everything together in one single file. I then used four different harmonization methods, and added a column “method_harmonization” that tells which method was used for each DHS survey. The four methods are the following:

1- The simplest (but most laborious) thing to do was to correct typos and wrongly encoded region indices. For example, a region could have been encoded “ATLANTique” in 2000 and “Atlantique” in 2005, or the language could differ between surveys (for example North in English and Nord in French).

2- Where it was available, I followed the region harmonization of IPUMS-DHS (Boyle, King, and Sobek 2019).

3- When no information was available from the IPUMS-DHS harmonization, I used the DHS spatial repository, which maps the boundaries of sub-national regions in DHS surveys over time. It makes it possible to visualize how regions changed, merged or split over time. Some specific notes on this harmonization can be found in the document “notes_harmonization.md”.

4- For a few surveys, I did the harmonization myself based on my own research. For example, for Uganda 2016, the regions North Buganda and South Buganda were put in the region Central 1 + Central 2 + Kampala, because the Central region is coterminous with the Kingdom of Buganda, one of the ancient African monarchies that are constitutionally recognised in Uganda (source). Further notes are available in the document “notes_harmonization.md”.
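
As an illustration of the first method, here is a minimal base R sketch (not necessarily the exact procedure used for the dataset). The helper clean_region_name and its translation table are hypothetical names introduced for this example, and the helper only handles letter case and a list of known translations.

```r
# Hypothetical helper: normalise letter case and translate known region names
clean_region_name <- function(x, translations = c(Nord = "North")) {
  x <- trimws(x)
  # First letter in upper case, the rest in lower case
  x <- paste0(toupper(substr(x, 1, 1)), tolower(substr(x, 2, nchar(x))))
  # Translate the names listed in the (hypothetical) translation table
  hit <- x %in% names(translations)
  x[hit] <- unname(translations[x[hit]])
  x
}

raw_names <- c("ATLANTique", "Atlantique", "NORD", "North")
clean_names <- clean_region_name(raw_names)
print(clean_names)
```

Note that this sketch capitalises only the first letter of the whole name, so multi-word region names and accented characters would need extra rules.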

Some surveys were removed in the process:

  • The surveys of countries that had only one DHS wave, because there was nothing to harmonize.

  • Some surveys for which no harmonization was possible. For example, when a country has two waves with very different administrative sub-divisions, I did not know how to harmonize them.

  • Some specific countries for which the number of regions varies too much across the waves (for example the Dominican Republic).

For the region name harmonization, I created two columns:

  • region_name_clean: here, I only corrected the typos and the changes of language of the region name (e.g. Nord to North). I did not merge any regions in this column.

  • region_name_harmonized: this column contains the cleaned region names, and the regions are also harmonized, sometimes by changing the administrative borders (following the IPUMS-DHS harmonization).
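
To make the difference between the two columns concrete, here is a small base R sketch. The lookup table is filled with four Benin 2017 rows of the dataset; the object benin_lookup and the function harmonize_region are hypothetical names used only for this illustration.

```r
# Hypothetical lookup: cleaned region name -> harmonized region (Benin 2017 rows)
benin_lookup <- data.frame(
  region_name_clean      = c("Alibori", "Borgou", "Atacora", "Donga"),
  region_name_harmonized = c("Borgou + Alibori", "Borgou + Alibori",
                             "Atacora + Donga", "Atacora + Donga"),
  region_num_harmonized  = c(3, 3, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map cleaned names to their harmonized region.
# Names that are absent from the lookup get NA, which makes gaps easy to spot.
harmonize_region <- function(clean_names, lookup) {
  idx <- match(clean_names, lookup$region_name_clean)
  data.frame(
    region_name_clean      = clean_names,
    region_name_harmonized = lookup$region_name_harmonized[idx],
    region_num_harmonized  = lookup$region_num_harmonized[idx],
    stringsAsFactors = FALSE
  )
}

harmonized <- harmonize_region(c("Donga", "Alibori", "Unknown"), benin_lookup)
print(harmonized)
```

Two distinct cleaned names (for example Alibori and Borgou) end up in the same harmonized region, which is what keeps the regional units comparable across waves.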

3. The dataset

The dataset can be found on my git repository here. Here is an extract of the dataset for Benin, for the years 2012 and 2017:

| country_name | survey_year | survey_id | region_num_raw | region_num_harmonized | region_name_raw | region_name_clean | region_name_harmonized |
|---|---|---|---|---|---|---|---|
| Benin | 2017 | BJ2017DHS | 1 | 3 | ALIBORI | Alibori | Borgou + Alibori |
| Benin | 2017 | BJ2017DHS | 2 | 1 | ATACORA | Atacora | Atacora + Donga |
| Benin | 2017 | BJ2017DHS | 3 | 2 | ATLANTic | Atlantique | Atlantique + Littoral |
| Benin | 2017 | BJ2017DHS | 4 | 3 | BORGOU | Borgou | Borgou + Alibori |
| Benin | 2017 | BJ2017DHS | 5 | 6 | COLLINES | Collines | Zou + Collines |
| Benin | 2017 | BJ2017DHS | 6 | 4 | COUFFO | Couffo | Mono + Couffo |
| Benin | 2017 | BJ2017DHS | 7 | 1 | DONGA | Donga | Atacora + Donga |
| Benin | 2017 | BJ2017DHS | 8 | 2 | LITTORAL | Littoral | Atlantique + Littoral |
| Benin | 2017 | BJ2017DHS | 9 | 4 | MONO | Mono | Mono + Couffo |
| Benin | 2017 | BJ2017DHS | 10 | 5 | OUÉMÉ | Oueme | Oueme + Plateau |
| Benin | 2017 | BJ2017DHS | 11 | 5 | PLATEAU | Plateau | Oueme + Plateau |
| Benin | 2017 | BJ2017DHS | 12 | 6 | ZOU | Zou | Zou + Collines |

Here is the description of the columns of the dataset:

| Column | Description | Note |
|---|---|---|
| country_name | Name of the country of the DHS survey | |
| country_code | Country code of the country | alpha-3 code |
| survey_year | Year of the DHS survey | |
| file_name | File name of the survey I used for the harmonization | A survey comes in several files, depending on the recode. I used the Individual Recode in SPSS file format (.sav) to create this dataset |
| survey_id | Unique ID of the DHS survey | |
| region_num_raw | Index of the DHS region as stated originally in the dataset | |
| region_num_harmonized | Harmonized index of the DHS region | |
| region_name_raw | Name of the DHS region as stated originally in the dataset | |
| region_name_clean | Name of the DHS region as stated originally in the dataset, but without typos, language differences or special characters | |
| region_name_harmonized | Harmonized name of the DHS region. Typos were removed, the language of the region names was harmonized, and special characters were removed. In addition, the administrative borders have been harmonized from one year to another (i.e. regions were merged when they had been split), thanks to the IPUMS-DHS region harmonization. | |

4. How to use it?

When you are working on a study that involves several DHS waves of a country and requires aggregating the data at the sub-national level, you can harmonize the region names and indices with my concordance dataset as described below. The code works with rdhs, a package for the management and analysis of DHS data. To make this code work, you will have to download the DHS data through the rdhs package. All the steps to do so are described here (in “3. Download survey datasets”). To ask for access to DHS data, go to the DHS Program website.

Here are the steps to use my concordance dataset to harmonize the region names of the surveys in which you are interested:

1- First of all, download the git repository, and check in dhs_harmonization_dataset.csv whether the DHS surveys in which you are interested have been harmonized.

2- Load the meta-information on the DHS surveys for which you have been granted access (for information, the correspondence between DHS country codes and country names can be found here):


library(dplyr)
library(readr)     # provides read_csv()
library(reactable)

# Set the path of the git repository and of the data folder
path_git_repo <- getwd()
path_toy_dhs_data <- paste0(path_git_repo, "/DHS_region_harmonization_files/toy_DHS_data")

# Select the DHS country code of the country on which you wish to work
country_code <- "BJ" # In this example we mimic the use of Benin data (although I am only using toy data)

# If you have access to the DHS data, do this step:
## Load the dataset with information on all the DHS surveys for which you have been granted access
# library(rdhs)
# info_dhs_datasets <-
#   rdhs::dhs_datasets(fileFormat = "spss", fileType = "IR", surveyType = "DHS") %>%
#   dplyr::select(SurveyId, SurveyYear, DHS_CountryCode, CountryName, FileName) %>%
#   dplyr::rename(zip_file_name = FileName) %>%
#   # create a column with the RDS file name instead of the zip file name
#   dplyr::mutate(rds_file_name = paste0(substr(zip_file_name, start = 1, stop = 8), ".rds"))

# If you don't have access to the DHS data, run this instead:
info_dhs_datasets <- read_csv(file = paste0(path_git_repo, "/DHS_region_harmonization_files/info_dhs_dataset.csv"))
## Parsed with column specification:
## cols(
##   SurveyId = col_character(),
##   SurveyYear = col_double(),
##   DHS_CountryCode = col_character(),
##   CountryName = col_character(),
##   zip_file_name = col_character(),
##   rds_file_name = col_character()
## )

# Keep the information on only the surveys you are interested in
file_names_one_country <- info_dhs_datasets %>%
  dplyr::filter(DHS_CountryCode == country_code) %>%
  dplyr::select(rds_file_name, SurveyYear, SurveyId)
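
Before loading any survey file, it can help to check which of the selected surveys are actually covered by the concordance dataset (step 1). The sketch below uses small toy stand-ins with hypothetical values rather than the real files; with the objects of this post, the same comparison would be made between file_names_one_country$SurveyId and dhs_region_harmo$survey_id.

```r
# Toy stand-in for the concordance dataset (hypothetical rows)
concordance <- data.frame(
  survey_id      = c("BJ2012DHS", "BJ2017DHS", "BJ2017DHS"),
  region_num_raw = c(1, 1, 2),
  stringsAsFactors = FALSE
)

# Toy stand-in for the surveys you want to use (hypothetical selection)
surveys_of_interest <- c("BJ2001DHS", "BJ2012DHS", "BJ2017DHS")

# Which surveys have at least one row in the concordance dataset?
covered <- surveys_of_interest %in% concordance$survey_id
missing_surveys <- surveys_of_interest[!covered]

if (length(missing_surveys) > 0) {
  message("Not harmonized: ", paste(missing_surveys, collapse = ", "))
}
```

Surveys listed in missing_surveys would get only NA values in the harmonized columns after the join, so it is safer to drop them or to harmonize them by hand first.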

3- Load all of the survey datasets you need and select your variables of interest. Make sure your selection includes the region variable (v024 if the survey is after 1989, v101 otherwise). Again, I recommend using the package rdhs so that you can load your DHS data in the RDS format. Please note that in this document I am using toy data and not the real DHS dataset of Benin.


# Load the DHS surveys stored as RDS files
country_dhs_survey_data <- data.frame()

for (i in seq_len(nrow(file_names_one_country))) { # this loops over all the DHS surveys available for ONE country

  current_dhs_survey_data <-
    readRDS(file = paste0(path_toy_dhs_data, "/", file_names_one_country$rds_file_name[i])) %>%
    dplyr::select(
      v024, # region variable /!\ if the survey is from 1989 or before, the region variable is v101, not v024
      v119, v106, v012) %>% # select your variables of interest. Here: electricity, highest education level and age
    dplyr::rename(region_num_raw = v024) %>%
    dplyr::mutate(survey_id = file_names_one_country$SurveyId[i]) # add the corresponding survey id

  # Get the correspondence between region labels and region indices
  region_labels <- stack(attr(current_dhs_survey_data$region_num_raw, "labels")) %>%
    as.data.frame() %>%
    dplyr::rename(region_num_raw = values, # rename them to match the DHS region harmonization dataset
                  region_name_raw = ind)

  # Join the region_labels dataframe with our DHS survey, to get the raw region name
  current_dhs_survey_data <- current_dhs_survey_data %>% left_join(region_labels, by = "region_num_raw")
  current_dhs_survey_data$region_num_raw <- as.numeric(as.character(current_dhs_survey_data$region_num_raw))

  # Bind the current DHS data to the previous ones
  country_dhs_survey_data <- rbind(country_dhs_survey_data, current_dhs_survey_data)
}
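
The least obvious part of the loop above is the extraction of the region labels. Here is a minimal base R sketch of that step on a toy vector; the values and labels are made up for the illustration, and a plain numeric vector with a "labels" attribute stands in for the labelled region variable.

```r
# Toy region variable: numeric codes carrying a "labels" attribute (made-up values)
v024 <- c(3, 3, 1)
attr(v024, "labels") <- c(ALIBORI = 1, ATACORA = 2, ATLANTic = 3)

# stack() turns the named vector of labels into a two-column data frame:
# 'values' holds the codes and 'ind' holds the label names (as a factor)
region_labels <- stack(attr(v024, "labels"))
names(region_labels) <- c("region_num_raw", "region_name_raw")
region_labels$region_name_raw <- as.character(region_labels$region_name_raw)

# Look up the raw region name of each observation
region_name_raw <- region_labels$region_name_raw[match(v024, region_labels$region_num_raw)]
print(region_name_raw)
```

The loop performs the same lookup with left_join() instead of match(), which adds the raw region name as a new column of the survey data.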

Here is what the DHS data look like BEFORE the region harmonization, for only one region of Benin 2017:

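| country_name | survey_year | survey_id | region_num_raw | region_name_raw | v119 | v106 | v012 |
|---|---|---|---|---|---|---|---|
| Benin | 2017 | BJ2017DHS | 3 | ATLANTic | 1 | 2 | 23 |
| Benin | 2017 | BJ2017DHS | 3 | ATLANTic | 0 | 3 | 37 |
| Benin | 2017 | BJ2017DHS | 3 | ATLANTic | 0 | 0 | 38 |
| Benin | 2017 | BJ2017DHS | 3 | ATLANTic | 0 | 0 | 31 |
| Benin | 2017 | BJ2017DHS | 3 | ATLANTic | 0 | 0 | 40 |
| Benin | 2017 | BJ2017DHS | 3 | ATLANTic | 0 | 1 | 32 |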

4- Join the DHS surveys you are interested in with the concordance dataset (called dhs_region_harmo in the code).


# Load the concordance dataset
# (the path below is an assumption: point it to dhs_harmonization_dataset.csv in your copy of the repository)
dhs_region_harmo <- read_csv(file = paste0(path_git_repo, "/dhs_harmonization_dataset.csv"))

# Then join your DHS surveys with the concordance dataset
country_dhs_survey_data <-
  country_dhs_survey_data %>%
  # drop the columns that also exist in the concordance dataset (if present) to avoid duplicates
  select(-any_of(c("region_name_raw", "country_name", "survey_year"))) %>%
  left_join(dhs_region_harmo, by = c("survey_id", "region_num_raw"))

# Re-order the columns: survey and region columns first, then your variables of interest
country_dhs_survey_data <-
  country_dhs_survey_data %>%
  select(country_name, country_code, survey_year, survey_id, file_name,
         region_num_raw, region_num_harmonized, region_name_raw,
         region_name_clean, region_name_harmonized,
         everything())

Here is what the DHS data look like AFTER the region harmonization, for only one region of Benin 2017:

| country_name | survey_year | survey_id | region_num_raw | region_num_harmonized | region_name_raw | region_name_clean | region_name_harmonized | v119 | v106 | v012 |
|---|---|---|---|---|---|---|---|---|---|---|
| Benin | 2017 | BJ2017DHS | 3 | 2 | ATLANTic | Atlantique | Atlantique + Littoral | 1 | 2 | 23 |
| Benin | 2017 | BJ2017DHS | 3 | 2 | ATLANTic | Atlantique | Atlantique + Littoral | 0 | 3 | 37 |
| Benin | 2017 | BJ2017DHS | 3 | 2 | ATLANTic | Atlantique | Atlantique + Littoral | 0 | 0 | 38 |
| Benin | 2017 | BJ2017DHS | 3 | 2 | ATLANTic | Atlantique | Atlantique + Littoral | 0 | 0 | 31 |
| Benin | 2017 | BJ2017DHS | 3 | 2 | ATLANTic | Atlantique | Atlantique + Littoral | 0 | 0 | 40 |
| Benin | 2017 | BJ2017DHS | 3 | 2 | ATLANTic | Atlantique | Atlantique + Littoral | 0 | 1 | 32 |

There you go! You now have all the DHS datasets in which you are interested, with their region names harmonized from one year to another. This allows you, for example, to create a panel dataset easily: you just have to aggregate the data at the regional (i.e. district) level, and you will have repeated measures over time!
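
As a rough sketch of that last aggregation step, the base R code below computes the share of respondents with electricity (v119) per harmonized region and survey year. The toy data frame and its values are invented for the illustration, and the mean is unweighted for simplicity; a real analysis should take the DHS sample weights into account.

```r
# Invented toy data: one row per respondent, after the region harmonization
toy <- data.frame(
  survey_year = c(2012, 2012, 2012, 2017, 2017, 2017, 2017),
  region_name_harmonized = c("Atlantique + Littoral", "Atlantique + Littoral",
                             "Zou + Collines",
                             "Atlantique + Littoral", "Atlantique + Littoral",
                             "Zou + Collines", "Zou + Collines"),
  v119 = c(1, 0, 0, 1, 1, 0, 1), # 1 = has electricity, 0 = no electricity
  stringsAsFactors = FALSE
)

# One row per harmonized region and survey year: this is the panel structure
panel <- aggregate(v119 ~ region_name_harmonized + survey_year, data = toy, FUN = mean)
names(panel)[names(panel) == "v119"] <- "share_electricity"
print(panel)
```

Because the grouping uses region_name_harmonized rather than the raw names, each region appears once per wave even when its raw name or borders changed between surveys.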

Disclaimer: This dataset is “home made” and I do not guarantee that it is flawless. If you have any questions or want to report errors, please contact me or make a pull request!

Licensing for the dataset

License: CC BY-NC-ND 4.0

References

Elizabeth Heger Boyle, Miriam King, and Matthew Sobek. 2019. “IPUMS-Demographic and Health Surveys: Version 7 [Dataset].” The Journal of Development Studies. Minnesota Population Center; ICF International, 2019. https://doi.org/10.18128/D080.V7.

ICF, Funded by USAID. 2017. “Demographic and Health Surveys (Various) [Datasets].” https://doi.org/10.18128/D080.V7.

Camille Belmin
PhD Candidate

I am a PhD candidate at PIK. My research interests include demography, energy, sustainability and gender.
