source(here::here("dataset", "load_and_clean_data.R"))
Data Used
Most of the data used in our analysis was sourced from the United Nations Habitat Urban Indicators Database (https://data.unhabitat.org/pages/datasets), provided by the UN-Habitat Data and Analytics Section. We used datasets related to housing, urban population trends, access to basic services, and quality of living in urban areas. Population data was sourced from the UN Population division (https://population.un.org/wpp/Download/Standard/Population/), and country codes for regional and subregional grouping is found at (https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv).
The Data and Analytics Section coordinates monitoring of urban indicators. The group aims to collect consistent data to improve information access and guide data-driven policy making. The data is additionally used to monitor progress toward the UN sustainable development goals (SDGs) and new urban agenda (NUA, a plan for a sustainable urban future).
Datasets
services.RData:
This dataset contains data collected between 1990-2019, broken down by country, and provides the percentage of the urban population in each country with access to various basic services. Cleaning this dataset required consolidating multiple header rows, and selecting only the services we were interested in: Total Improved water (d), Total Improved Sanitation (f), Basic hand-washing facility (h), Access to basic drinking water services (e), Access to basic sanitation services (g), and Sufficient living area (j). Columns were renamed for simplicity, and the values in these columns were converted to numeric values from character strings.
total_pops.RData
This dataset contains the total population for country and regional groups from 1950-2020. In this file, the data was organized using a column for each year containing population values for that year. To clean the data, we reorganized the data into a Year column and a TotalPop column, so the dataset could easily be joined with other datasets using the Year column. The TotalPop column was multiplied by 1000 since the data was given in units of thousands.
disbursements.RData
This dataset contains the amount of disbursement of development assistance to country and regional groups from 1985 - 2019. The disbursement is given in terms of several metrics, and we chose to use disbursement in units of millions of US dollars and as a percentage of GNI of the recipient. Aid disbursement data is available at regional and country levels. The values for total aid disbursement in US $ were given as character strings containing commas and numbers and had to be converted to numeric data and then multiplied by 1000000 to convert to the raw value of aid disbursed.
population.RData
This dataset contains the urban population of country and regional groups, with values available since 2000 and projected values until 2050. The population values were read in as character columns because they contained spaces, and were converted to numeric values. The data originally had an individual column for each year, and was pivoted to have a Year column and an UrbanPop column, so the dataset could easily be joined with other datasets using the Year column.
country_info.RData
This dataset maps country names to region and subregional groupings. We used this dataset to join regional information to country level datsets for data visualization and analysis of regional trends.
slum_pop_clean.RData
This dataset contains the slum population for regional and country groups for selected years between 1990-2018. Like other datasets, the population data had to be converted to numeric type from character strings containing commas. The columns for each year were pivoted into a Year column and a SlumPop column, so the dataset could easily be joined with other datasets using the Year column. Slum population values were multiplied by 1000 because the data was given in units of thousands.
air_quality_country.RData
This dataset gives the air quality index for countries for the years 2015-2019. The data is the mean population exposure to fine particulate matter (PM2.5) in units \(\mu g / m^3\). The columns for each year were pivoted into a Year column and an AirQuality column, so the dataset could easily be joined with other datasets using the Year column. Air quality was given for cities within each country, so the mean value for all the cities in a country was taken for each year to create a country level dataset.
load_and_clean_data.R
This data cleaning script relies heavily on tidyverse functions, as well as the read_xlsx() function from the readxl library (used to read in excel file formats as dataframes), and the year() function from the lubridate library, to convert the year values to date format.
Combining Datasets
joined.RData
This dataset is a combination of the datasets described above, joined by country and year. Two additional columns are created, PercUrban takes the percentage of a country or regional group’s population which is urban (UrbanPop / TotalPop), and PercSlum takes the percentage slum population of a country or regional group’s total population (SlumPop / TotalPop). The columns containing basic services data are pivoted into a Services column and a PercAccess column, for ease in data visualization.
joined <- disbursements %>%
left_join(total_pops %>% select(c("Region", "Year", "TotalPop")),
by = c("Year", "Area" = "Region")) %>%
left_join(urban_pop, by = c("Year", "Area" = "Region")) %>%
left_join(slum_population %>% select(c("SlumPop", "Country or area", "Year")),
by = c("Year", "Area" = "Country or area")) %>%
left_join(air_quality, by = c("Year", "Area" = "Country")) %>%
left_join(services %>% select(-c("Region", "M49 classification", "Cases", "Source")),
by = c("Year", "Area" = "Country")) %>%
left_join(country_info, by = c("Area" = "name")) %>%
left_join(gini %>%
select(-`City/region`),
by = c("Year", "Area" = "Country", "Region", "SubRegion")) %>%
mutate(Year = year(as.Date(Year, "%Y")), Area = as.factor(Area),
PercUrban = UrbanPop / TotalPop, PercSlum = SlumPop/TotalPop)
joined_reg.RData
This dataset is similar to the joined.RData dataset as described above, but selects regional level information for datasets where it is available and calculates regional estimates from datasets which do not provide regional level data. The regions used are: “Africa”, “Americas”, “Asia”, “Europe”, and “Oceania”, and the subregions are “Australia and New Zealand”, “Central Asia”, “Eastern Europe”, “Latin America and the Caribbean”, “Melanesia”, “Micronesia”, “Northern Africa”, “Northern Europe”, “Polynesia”, “South-eastern Asia”, “Southern Asia”, “Southern Europe”, “Sub-Saharan Africa”, “Western Asia”, and “Western Europe”. The data for the Americas does not include North America, because aid disbursement data is not available for North America, and aid disbursement is the paramater we aimed to predict based on other values. When calculating estimates for the Americas, all American regions aside from North America were used. First, disbursement data was calculated for regional and subregional groups. This dataset was then joined to each other dataset, with regional and subregional information selected only or calculated when unavailable. For data such as population, urban population, and slum population, a sum was taken within regional and subregional groups where this data was unavailable. For air quality and % access to basic services, the mean was taken within regional and subregional groups. The columns containing basic services data are pivoted into a Services column and a PercAccess column, for ease in data visualization.
joined_reg <- merge(disb_reg, disb_subreg, all = TRUE) %>%
left_join(total_pops %>%
mutate(Region = str_replace_all(Region, "And The", "and the")),
by = c("Year", "Area" = "Region")) %>%
left_join(urban_pop %>%
mutate(Region = str_replace_all(Region, "And The", "and the")),
by = c("Year", "Area" = "Region")) %>%
left_join(slum_pop_grouped, by = c("Year", "Area", "Grouping")) %>%
left_join(serv_grouped, by = c("Year", "Area", "Grouping")) %>%
left_join(air_grouped, by = c("Year", "Area", "Grouping")) %>%
mutate(Year = year(as.Date(Year, "%Y")), Area = as.factor(Area),
PercUrban = UrbanPop / TotalPop, PercSlum = SlumPop/TotalPop)
joined_reg_wide.RData
This dataset is exactly like the joined_reg.RData dataset, but instead of the Services column, each basic service has its own column. This dataset was created so the basic services could individually be used as model parameters.
joined_model <- merge(disb_reg, disb_subreg, all = TRUE) %>%
left_join(total_pops %>%
mutate(Region = str_replace_all(Region, "And The", "and the")),
by = c("Year", "Area" = "Region")) %>%
left_join(urban_pop %>%
mutate(Region = str_replace_all(Region, "And The", "and the")),
by = c("Year", "Area" = "Region")) %>%
left_join(slum_pop_grouped, by = c("Year", "Area", "Grouping")) %>%
left_join(air_grouped, by = c("Year", "Area", "Grouping")) %>%
left_join(serv_wide, by = c("Year", "Area", "Grouping")) %>%
mutate(Year = year(as.Date(Year, "%Y")), Area = as.factor(Area),
PercUrban = UrbanPop / TotalPop, PercSlum = SlumPop/TotalPop)