Dataset Descriptions

2022-03-02

Dataset 1

Pew Research American Trends Panel Datasets

https://www.pewresearch.org/american-trends-panel-datasets/

Data Collection, Organization, and Usability

This dataset includes 85 waves, with data collected from an online survey panel of over 10,000 adults in the US. Each wave contains panel members' responses to questions on a particular topic (for example, Wave 1 covers media consumption and Wave 27 covers automation and driverless vehicles). The data for each wave are saved in a .sav file, which can be read with the read.spss() function from the foreign package by specifying the path to the file. The Wave 8.5 dataset (US smartphone use in 2015), after being loaded as a data.frame with read.spss() and converted to a tibble with as_tibble(), has 562 columns and 1,635 rows. The first column, QKEY, is a unique identifier for the panel participant. This key can be used to join data from one wave with another and compare the same participant's responses across topics. It can also be used to link participants with their demographic data, including age, race, education, and religion. The second through fifth columns contain session data about the current survey response. The remaining columns are named after the question asked, and their values are the responses, stored as factors with string labels.
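The loading and joining steps described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `load_wave()` is a hypothetical helper, the question columns `SMARTPHONE` and `NEWS_SOURCE` are made-up names, and toy data frames stand in for real waves so the join runs without the .sav files. Only `QKEY` and `read.spss()` come from the description.

```r
# Hypothetical helper: read one wave's .sav file into a data.frame.
# Requires the 'foreign' package; the caller supplies the path.
load_wave <- function(path) {
  foreign::read.spss(path, to.data.frame = TRUE)
}

# Toy stand-ins for two waves (column names other than QKEY are made up).
wave_a <- data.frame(
  QKEY = c(101, 102, 103),
  SMARTPHONE = factor(c("Yes", "No", "Yes"))
)
wave_b <- data.frame(
  QKEY = c(102, 103, 104),
  NEWS_SOURCE = factor(c("TV", "Online", "Print"))
)

# Inner join on QKEY keeps only participants who appear in both waves.
both_waves <- merge(wave_a, wave_b, by = "QKEY")
print(both_waves)
```

An inner join is used here because comparing responses across topics only makes sense for participants who answered both waves. Passing `all = TRUE` to `merge()` would keep everyone instead and fill the gaps with `NA`.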

What Are the Main Questions We Hope to Address?

These data could be used to answer interesting questions about how age, education, urban/suburban residence, or other subgroup membership in the US correlates with trends in beliefs about politics, religion, technology, and science. One likely difficulty is missing values in question responses. These topics are also extremely complicated, so there is a risk of oversimplifying based on what is available and drawing incorrect conclusions. The panel of roughly 10,000 is also small relative to the US population, especially since not all participants respond in each wave.

Dataset 2

Link to dataset

https://data.unhabitat.org/pages/datasets

Data Collection, Organization, and Usability

Depending on the research done by the sources, there are slight differences in the number of cities in each dataset. We plan to join these datasets into one compiled dataset that uses only the cities present in all of them. The number of columns also varies across the sub-datasets this source provides; columns may hold population figures by year or the City Prosperity Index for a specific year. Compiling everything into one dataset will let us clean it in one place.

The data were originally collected by UN Habitat, which gathers data in over 90 countries to compare quality-of-life indicators around the world, with the aim of promoting safe, resilient, and sustainable cities and communities in an urbanizing world. Data from governments, other UN agencies, civil society organizations, academic institutions, and the private sector are compiled so that differences between cities, and their growth on certain indicators, can be visualized easily. Because UN Habitat receives official data from different sources, some cities may be missing from some indicators, which we plan to address during cleaning for a more thorough analysis.

We are able to load the data as simple .csv files, and we hope to clean the data with some code that combines the city locations into one.
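One way the compiled dataset could be built is with successive inner joins, sketched below under stated assumptions. The three toy tables, the `city` key column, and every value in them are invented for illustration; the real files would be read with `read.csv()` and may name their city column differently.

```r
# Toy stand-ins for three indicator files (all names and values are made up).
population <- data.frame(city = c("Alpha", "Beta", "Gamma"),
                         pop_2015 = c(2.3, 9.9, 7.4))
prosperity <- data.frame(city = c("Beta", "Gamma", "Delta"),
                         cpi = c(55.1, 58.3, 60.2))
open_space <- data.frame(city = c("Gamma", "Beta"),
                         open_space_pct = c(12.0, 9.5))

# Successive inner joins on 'city' keep only cities present in every table.
tables <- list(population, prosperity, open_space)
compiled <- Reduce(function(x, y) merge(x, y, by = "city"), tables)
print(compiled)
```

Because `merge()` matches on exact values, inconsistent spellings of the same city across files would silently drop that city, so city names would likely need to be standardized before joining.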

What Are the Main Questions We Hope to Address?

With 10 indicators ranging from open spaces to social inclusion, which together cover many aspects of a city, we hope to identify which factors play the biggest role in making one city a better place to live than another. We also hope to use the City Prosperity Index (CPI) as one indicator of whether a city is successful, and to model the trends in the data to predict which cities will outperform others in the future.

Dataset 3

https://catalog.data.gov/dataset/u-s-chronic-disease-indicators-cdi

Data Collection, Organization, and Usability

This dataset is the U.S. Chronic Disease Indicators, containing 124 indicators of chronic disease, which “allows states and territories and large metropolitan areas to uniformly define, collect, and report chronic disease data that are important to public health practice and available for states, territories and large metropolitan areas.” The dataset is a CSV file with 956,639 rows and 34 columns, and includes the following indicator groups: alcohol; arthritis; asthma; cancer; cardiovascular disease; chronic kidney disease; chronic obstructive pulmonary disease; diabetes; disability; immunization; mental health; nutrition, physical activity, and weight status; age; oral health; overarching conditions; reproductive health; school health; and tobacco. Columns include year, location, source of data, specific chronic disease, race, and gender. The data are taken from a wide variety of sources, including the Behavioral Risk Factor Surveillance System, American Community Survey, and National Vital Statistics System. The data loaded into R without issue, though they contain multiple empty or mostly empty columns, which can be removed to leave us with more usable data.
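Removing the empty or mostly empty columns could look something like the sketch below. The toy data frame, its column names, and the 50% threshold are all assumptions made for illustration; the real file would be read with `read.csv()` and the cutoff chosen after inspecting the actual missingness.

```r
# Toy data frame standing in for the CDI file (column names are illustrative).
cdi <- data.frame(
  year = c(2015, 2016, 2017, 2018),
  location = c("A", "B", "C", "D"),
  response = c(NA, NA, NA, NA),        # entirely empty
  footnote = c(NA, "", "note", NA),    # mostly empty
  stringsAsFactors = FALSE
)

# Share of missing cells per column, treating NA and "" as missing.
missing_share <- vapply(
  cdi,
  function(col) mean(is.na(col) | as.character(col) == ""),
  numeric(1)
)

# Keep columns with at most half their values missing (assumed threshold).
cdi_clean <- cdi[, missing_share <= 0.5, drop = FALSE]
print(names(cdi_clean))
```

Empty strings are counted as missing alongside `NA` because CSV exports often encode blank cells that way, depending on how the file is read.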

What Are the Main Questions We Hope to Address?

This dataset would let us explore important questions in public health, and potentially develop answers to them, particularly as they relate to disease burden and risk factors for medical complications.

Previous Data Loading and Cleaning