Post #1: Data Collection and Beginnings

The goal of my project is to analyze the correlations between scoring and success in college basketball, and then to narrow my focus to teams that have had coaching changes in the past two years in order to determine whether those trends hold up and can be used to predict team success for new coaches.

For the first part of this project, I have to compile the data-set from which I am going to draw my correlations. I reached out to Sports-Reference because their “College-Basketball Reference” is the most comprehensive database I could find for the data I need. Collecting the data could be as simple as copying the points-per-game-for and points-per-game-against averages listed next to each team for each year, but doing so has two fundamental issues. The first issue is that these averages are given to only one decimal place. While this is adequate for most purposes, I believe that going to two or more decimal places will allow me to distinguish further between teams, since most teams will likely fall within a very small range. The second issue is that the given averages generally do not account for playoff and tournament games. In my opinion, these games should be accounted for in any discussion of team success, and so they must be accounted for in my data-set.
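To make the two issues concrete, here is a minimal sketch of recomputing both averages from individual game scores instead of copying the published one-decimal figures. The function name `scoring_averages` and the scores are hypothetical, chosen only for illustration; they are not real results and not necessarily how the final program works.

```python
from statistics import mean


def scoring_averages(games, decimals=2):
    """Return (points per game for, points per game against), rounded.

    `games` is a list of (points_for, points_against) pairs, one per game.
    """
    points_for = [pf for pf, _ in games]
    points_against = [pa for _, pa in games]
    return (
        round(mean(points_for), decimals),
        round(mean(points_against), decimals),
    )


# Hypothetical scores, not real data.
regular_season = [(70, 65), (82, 79), (64, 71)]
tournament = [(75, 68)]

# Tournament games are simply appended, so they count toward the averages.
ppg_for, ppg_against = scoring_averages(regular_season + tournament)
print(ppg_for, ppg_against)
```

Because the averages are computed from the raw scores, the number of decimal places becomes a choice (`decimals`) rather than a limit of the source table.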

To compile my data, I am in the process of downloading the CSV files for each team and each season in my research. In total, this will produce over 1000 individual CSV files, each of which has 30-40 lines. I had originally intended to use a web scraper to do this more efficiently, but when I reached out to Sports-Reference to ask about the feasibility of this, I was asked not to. Instead, I wrote a short program in Python to process the files once they are downloaded. This program opens each of the CSV files, transforms the data into information that is usable in my project, and adds the new information to a data-frame that will be saved as its own CSV file. I will then be able to use the new CSV file to construct graphs and draw conclusions for the first part of my project.
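The program itself is not shown here, but a pipeline of this kind could look roughly like the sketch below. Several things in it are assumptions made for illustration: the column names `team_pts` and `opp_pts`, the idea that each file is named after its team and season, and the function names. It also uses Python's standard `csv` module in place of a data-frame library, purely to keep the example self-contained.

```python
import csv
from pathlib import Path


def summarize_file(path):
    """Return scoring averages for one team-season CSV (one row per game)."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError(f"{path} contains no games")
    games = len(rows)
    # "team_pts" and "opp_pts" are assumed column names, not the real ones.
    total_for = sum(int(r["team_pts"]) for r in rows)
    total_against = sum(int(r["opp_pts"]) for r in rows)
    return {
        "team_season": Path(path).stem,  # file name without ".csv"
        "games": games,
        "ppg_for": round(total_for / games, 2),
        "ppg_against": round(total_against / games, 2),
    }


def build_summary(in_dir, out_path):
    """Summarize every CSV in in_dir and write one combined CSV to out_path."""
    summaries = [summarize_file(p) for p in sorted(Path(in_dir).glob("*.csv"))]
    fields = ["team_season", "games", "ppg_for", "ppg_against"]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(summaries)
    return summaries
```

A call such as `build_summary("downloads", "summary.csv")` would then produce one row per team-season. The output file should live outside the input folder so that it is not picked up as a game log on a later run.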
