Blog Post 3: Summary of My Research on Correlations Between Variables in Census Data

I have completed my Freshman Monroe research project in Economics. Throughout the process, I gained a deeper understanding of data, statistics, and the software Stata, and I learned a lot about the research process in general. Because I needed to adapt my initial plans to fit the data available, my area of focus shifted significantly from the plans I discussed in my abstract blog post. The new variables I chose to study piqued my interest in an evolved set of questions, which led me, through statistical and graphical techniques, toward a meaningful set of findings.

After working on my project each weekday during the month of June, I continued to work periodically throughout July and early August. As I became more familiar with the software program Stata and the statistical tools and tests I needed to use to analyze my data, my advisor encouraged me to move beyond my basic findings to research certain variables that I found particularly interesting. As I ran basic statistical tests and made a range of graphs from my large assortment of variables, I became interested in the associations between race, number of children, education levels, wealth, and region of the country, and how these variables and associations changed over time. I decided to explore the relationships between these variables, with a particular focus on the American South. We also discussed possible contributing factors of occupation, particularly farmers vs. non-farmers, as well as ownership vs. tenancy and mortgage holding vs. non-mortgage-holding ownership characteristics.

I worked to create research questions that would direct my analysis of these variables, narrowing my focus but still allowing me to explore relationships through a variety of tests and graphical techniques. After correspondence with my advisor, I settled on the following two inquiries as my guiding research questions:

  • As a result of rural-to-urban migration and macroeconomic shocks, including the stock market crash of 1929 and the Great Depression of the 1930s, did the number of farm owners decrease as the number of tenants increased from 1920 to 1940? I predicted that the proportion of farmers in the overall population decreased during this period, and that, as migration to urban areas increased, average family sizes declined over time across both white and non-white racial groups.
  • Did white farm owners have more children and more years of education than white farm tenants, and did both groups have higher education than non-white farmers, between 1900 and 1940, as a result of disparities in wealth and access to medical care between whites and non-whites?

To answer these questions, I returned to where I began with this research project: the website and database IPUMS, where historical federal census data is stored. I generated a new data set in IPUMS that contained only the variables in which I was interested: REGION, STATE, COUNT, URBAN, FARM, OWNERSHP, MORTGAGE, NCHILD, RACE, BPL (birthplace), HIGRADE, LIT, and OCCSCORE. I wanted to use the same time frame that I had used for my preliminary analysis, so I chose the same five samples as I had used in my first data set: the 1% samples from the 1900, 1910, 1920, 1930, and 1940 federal censuses.

After creating and downloading this data set, I imported it into Stata and began my investigations. The data was not particularly useful in the form in which IPUMS provided it; I needed to create dummy variables to distinguish more clearly between groups. I created dummy variables for farmers (where “farmer” = 1 if the variable FARM matched the IPUMS code for farmers, and “farmer” = 0 otherwise), for blacks, whites, owners, renters, and mortgage-holders. I also created interaction dummy variables, such as “farmowner”, equal to farmer*owner. This was necessary for my analysis because farmowner = 1 only if both farmer = 1 and owner = 1, and farmowner = 0 in all other cases, so the variable isolates exactly the group of farmers who own their farms.
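
A minimal sketch of the dummy-variable step described above, written in Python rather than Stata purely for illustration. The numeric codes (FARM = 2 for a farm household, OWNERSHP = 1 for owned and 2 for rented) and the sample records are assumptions made for this example and should be checked against the IPUMS codebook for the actual extract; the name "farmrenter" is also hypothetical, added by analogy with "farmowner".

```python
# Assumed IPUMS-style codes; verify against the codebook for the real extract.
FARM_CODE = 2      # assumed: FARM == 2 marks a farm household
OWNED_CODE = 1     # assumed: OWNERSHP == 1 marks an owned dwelling
RENTED_CODE = 2    # assumed: OWNERSHP == 2 marks a rented dwelling


def add_dummies(record):
    """Return a copy of one census record with 0/1 indicator variables added."""
    out = dict(record)
    out["farmer"] = 1 if record["FARM"] == FARM_CODE else 0
    out["owner"] = 1 if record["OWNERSHP"] == OWNED_CODE else 0
    out["renter"] = 1 if record["OWNERSHP"] == RENTED_CODE else 0
    # Interaction dummies: 1 only when both component dummies are 1.
    out["farmowner"] = out["farmer"] * out["owner"]
    out["farmrenter"] = out["farmer"] * out["renter"]
    return out


# Made-up records, not taken from the census samples.
sample = [
    {"YEAR": 1900, "FARM": 2, "OWNERSHP": 1},  # farmer who owns
    {"YEAR": 1900, "FARM": 2, "OWNERSHP": 2},  # farmer who rents
    {"YEAR": 1900, "FARM": 1, "OWNERSHP": 1},  # non-farmer who owns
    {"YEAR": 1900, "FARM": 1, "OWNERSHP": 2},  # non-farmer who rents
]

rows = [add_dummies(r) for r in sample]

# The mean of a 0/1 dummy is the proportion of records in that group.
farmowner_share = sum(r["farmowner"] for r in rows) / len(rows)
print(farmowner_share)
```

Because each dummy takes only the values 0 and 1, its mean is the share of records in that group, which is how the proportions of farm owners and farm renters in each census year can be read directly from descriptive statistics.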

I made use of these dummy variables in scatterplots and histograms, and in descriptive statistics and t-tests. Through these various methods, I succeeded in finding answers to my research questions. To summarize, my overall results showed that there were clear differences in family sizes and education levels between white and non-white farmers, and between farm owners and farm renters, in the American South in the early twentieth century. While these results might sound unsurprising or self-evident, I still found them exciting, especially as I uncovered numerous details that contributed to the results as a whole. My research, like much modern research, serves to reinforce and support (or sometimes provide evidence against) past findings and thus to contribute to an ever-increasing body of knowledge on a topic. I found that the most interesting aspects of my results became evident in the minute details:

  • The proportion of farmers in the population of the South decreased across each census from 1900 to 1940.
  • The average proportion of farm owners in the population decreased between each census year across the overall population. This is also true when focusing only on the South: from 1900 to 1940, mean proportions of farm owners fell in the South each census year.
  • The proportions of both white and black farm owners decreased between each census. Proportions of farm renters did not increase steadily between censuses; the proportion of white farm renters decreased between 1900 and 1910, increased between 1910 and 1920 and between 1920 and 1930, then decreased dramatically between 1930 and 1940. The proportion of black farm renters steadily decreased between each census year.
  • Average number of children (nchild) decreased between each census year across the overall population. The proportion of blacks in the population of the South also decreased between each census.
  • White farm owners had, on average, fewer children than white farm renters: mean nchild for white farm renters was 1.0333, compared to 0.9934 for owners.
  • Black farm owners had, on average, fewer children than black farm renters: mean nchild for black renters was 0.9940, compared to 0.9825 for black owners.
  • White farm owners had, on average, more children than black farm owners.
  • White farm renters had, on average, more children than black farm renters.
  • White farm owners had, on average, more years of education than white renters: mean higrade for owners was 9.5595, and for renters was 8.1086.
  • Black farm owners had, on average, more years of education than black renters: mean higrade for owners was 6.7847, and for renters was 5.4342.
  • There was a significant disparity in education levels between black and white farmers: white farm renters, who had fewer years of education on average than white owners, still had more years of education than black farm owners (using the above numbers, white renters had an average higrade of 8.1086, while for black owners it was 6.7847).
  • To substantiate the above results, I ran t-tests to investigate whether the differences in nchild and higrade between the groups (white owners, white renters, black owners, black renters) were statistically significant, and the tests gave very strong evidence that they were.
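
As an illustration of the kind of comparison behind those t-tests, here is a small two-sample sketch in Python. It uses Welch's unequal-variance form, which is an assumption: the analysis above was run in Stata, and this post does not say which variant of the test was used. The function name welch_t and the two short lists of higrade values are made up for the example and are not the census data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two samples.

    Uses Welch's unequal-variance t-test with the Welch-Satterthwaite
    approximation for the degrees of freedom.
    """
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical years-of-schooling values for two groups of farmers.
owners_higrade = [10, 9, 11, 8, 10, 9]
renters_higrade = [8, 7, 9, 8, 7, 9]

t_stat, df = welch_t(owners_higrade, renters_higrade)
print(round(t_stat, 3), round(df, 2))
```

A large t statistic in absolute value, relative to the degrees of freedom, is evidence that the two group means differ. With samples as large as the 1% census extracts, even small differences in means such as those in nchild can come out as statistically significant, so the size of each difference is worth reading alongside the test result.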

I understand that my research remains relatively basic, and I could continue to investigate further once I learn more advanced techniques for analyzing data. Yet my research does provide some illumination on specific racial and time-series trends in relation to occupation, family size, and education levels for Southern black and white farm owners and renters in the early twentieth century. I am happy to have contributed in some small way to the expansion of knowledge and understanding of historical socio-economic conditions in our nation.


  1. iechevarria says:

    I think your project is very interesting, and it seems that you accomplished so much in your time researching this summer. I really appreciate the fact that you give some technical insight about how you actually accomplished your analysis – for instance, there’s the bit about making dummy variables. I was wondering if you had considered the possibility of using some machine learning algorithms to try to find some more interesting details. In particular, have you thought about using some kind of unsupervised learning technique that’s able to find patterns in your data without actually needing to direct it?

    Again, I want to say how interesting your project is, and I’m really in awe of how much substantive analysis you did this summer.