Blog Post 2: Updates From My Research on Correlations Between Variables in Census Data

In the last two weeks, I’ve made progress with my project. I have created a data set of historical federal census data and have worked to interpret, investigate, and analyze it in Stata, a statistical analysis software program. After I generated a data set in IPUMS (an online database of census microdata used by social and economic researchers) using the variables and samples I had selected, I downloaded this data and tried to import it into Stata, at which point I encountered a minor obstacle: the file was much larger than I had originally expected. I discovered that I needed to increase the maximum memory capacity in Stata from its default setting before I could import the data and begin the real work.

Once my data set was successfully imported into Stata, I began by browsing through the data to see and understand the observations I had collected. It became clear that I needed to generate new variables based on some of the original ones I had chosen from IPUMS. I needed to recode the variables for race, urban vs. rural geographic location, domestic vs. foreign birthplace, and literacy to make them more mathematically meaningful. I did this by coding them as binary or “dummy” variables, so that each takes a value of 0 or 1. It took me a bit of trial and error to generate the new dummy variables by typing into Stata’s command line, since my commands needed to be specific and I needed to become familiar with the codes and symbols that Stata recognizes and accepts. After several attempts, as well as consultations with Stata manual pages, I was successful in generating new variables that coded for each of the original ones in a binary fashion. For example, my dummy variable “racenum” codes 0 for white and 1 for non-white; my dummy variable “urbannum” codes 0 for urban and 1 for rural.
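The recoding logic can be sketched in a few lines of Python. This is only an illustration of the idea, not the Stata commands used in the project: the text labels ("white", "urban", and so on) and the helper name to_dummy are made up for the example, and a real IPUMS extract stores numeric codes that would have to be looked up in the codebook.

```python
# Illustrative sketch of binary ("dummy") recoding.
# The category labels below are placeholders, not actual IPUMS codes.
records = [
    {"race": "white", "urban": "urban"},
    {"race": "black", "urban": "rural"},
    {"race": "white", "urban": "rural"},
    {"race": None, "urban": "urban"},  # a missing value stays missing
]


def to_dummy(value, reference):
    """Return 0 for the reference category, 1 for any other, None if missing."""
    if value is None:
        return None
    return 0 if value == reference else 1


for row in records:
    row["racenum"] = to_dummy(row["race"], "white")    # 0 = white, 1 = non-white
    row["urbannum"] = to_dummy(row["urban"], "urban")  # 0 = urban, 1 = rural

for row in records:
    print(row["racenum"], row["urbannum"])
```

Keeping missing values as missing, rather than letting them fall into the "1" group, is a design choice in this sketch; otherwise unrecorded observations would be counted as non-white or rural.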

I also researched the coding schemes for each of the other variables in my data set, paying particular attention to OCCSCORE, a key variable in my analysis. This variable is a 2-digit numeric variable representing the median total income in hundreds of 1950 dollars of all people in a specific occupation. In my data set, the OCCSCORE values run from 0 to 80, and I noticed that many of them were 0, which codes for “N/A.” I realized that having so many scores of 0, which provide no meaningful information, would negatively affect any statistical analysis I would do that involved the OCCSCORE variable. To work around this problem, I generated a new variable, “occscorenum,” which retained all of the relevant OCCSCORE values and simply excluded all values of 0.
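To see why the zeros matter, here is a small Python sketch with invented OCCSCORE values (not taken from the actual data set). It treats 0 as missing, which is one plausible way to "exclude" those values, and compares the mean before and after.

```python
# Invented example values; 0 stands for "N/A" as described above.
occscore = [0, 25, 0, 42, 80, 18]

# Assumption: excluded values are represented as None (missing).
occscorenum = [value if value != 0 else None for value in occscore]

# Keep only the observations that carry real information.
valid = [value for value in occscorenum if value is not None]

mean_with_zeros = sum(occscore) / len(occscore)
mean_without_zeros = sum(valid) / len(valid)

print("mean including N/A zeros:", mean_with_zeros)
print("mean excluding N/A zeros:", mean_without_zeros)
```

With these made-up numbers, leaving the zeros in pulls the mean down noticeably, which is the kind of distortion the new variable is meant to avoid.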

After studying the data and making new variables to aid my analysis, it was time to focus on specific questions about the data, and to use these questions as a basis for statistical tests and graphs of relationships between the variables. In the early stages of my project, I had generated a list of research questions about the relationships between variables and trends over time. I now used these inquiries to inform my analysis and graphics. Because my data set was so large, containing more than 5 million observations in total across the five samples, graphs that plotted every data point yielded little useful information. I “collapsed” my full data set by generating the mean, or average, value for each variable in each year, and I used these means to plot two variables against each other on two-way scatterplot graphs. In these sparser scatterplot graphs, I was able to see clear relationships between variables, and the graphs provided good preliminary information that allowed me to delve deeper into analysis. For example, in one of my graphs, I could see that higher mean values of occupational income corresponded to higher mean literacy; in another, I could see that the mean number of children decreased steadily from 1900 to 1940; and in a third, I could see that the mean occupational income value increased steadily during that period. None of these or other findings from my graphs was particularly surprising or groundbreaking, but the graphs helped to confirm my own predictions about the data and presented the data points and relationships between variables well.
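The "collapse" step amounts to grouping observations by year and averaging each variable within the group. A minimal Python sketch of that idea follows; the rows are invented, "litnum" is a hypothetical name for a literacy dummy (assumed here to be 1 for literate), and collapse_means is a made-up helper, not a Stata or library function.

```python
from collections import defaultdict

# Invented rows for illustration only; None marks a missing value.
rows = [
    {"year": 1900, "occscorenum": 20, "litnum": 1},
    {"year": 1900, "occscorenum": 24, "litnum": 0},
    {"year": 1910, "occscorenum": 26, "litnum": 1},
    {"year": 1910, "occscorenum": None, "litnum": 1},
]


def collapse_means(rows, by, variables):
    """Group rows by the `by` key and average each variable, skipping missing values."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)

    collapsed = {}
    for key in sorted(groups):
        means = {}
        for var in variables:
            values = [r[var] for r in groups[key] if r[var] is not None]
            means[var] = sum(values) / len(values) if values else None
        collapsed[key] = means
    return collapsed


yearly_means = collapse_means(rows, "year", ["occscorenum", "litnum"])
for year, means in yearly_means.items():
    print(year, means)
```

Each year is reduced to a single point per variable, which is why the resulting scatterplots are so much sparser than plots of the raw observations.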

It is also important to note that these graphs show only associations between two variables, not causal relationships. Many confounding variables likely influence each of the variables in my data set, and my work has only begun to explore simple correlations between them.

My next step in this project is to analyze the data using linear regressions between pairs of variables to understand and interpret quantitative correlations. I have begun to run regressions between variables and to generate new scatterplots fitted with linear predictions. In the upcoming weeks, I will continue to work with regressions, and will expand my analysis to other relevant statistical tests. In my next post, I will summarize my findings based on these inferential statistics and describe any conclusions that I may be able to draw from my results.
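For readers unfamiliar with linear regression between a pair of variables, the fitted line can be computed by ordinary least squares. The sketch below does this by hand in Python on invented numbers; it is not output from the project, and fit_line is a hypothetical helper name.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data points, chosen only to make the arithmetic easy to follow.
x = [1, 2, 3, 4, 5]
y = [3, 5, 8, 9, 11]

slope, intercept = fit_line(x, y)
predictions = [intercept + slope * xi for xi in x]  # the "linear prediction" line

print("slope:", slope)
print("intercept:", intercept)
```

The slope describes how much the second variable changes, on average, for a one-unit change in the first; as with the scatterplots, it measures association rather than causation.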