Update 1: Learning machine learning

As originally stated in my abstract proposal (https://freshmanmonroe.blogs.wm.edu/2018/04/10/assessing-antibiotic-sentiments-online-social-media/), I will be using artificial intelligence/machine learning to help detect depression in social media. I’m very excited to work on this project because it combines my two passions, helping people and computer science, into one.

Currently my project is divided into three phases. Phase one is reading through the necessary literature and courses. Phase two is collecting the data and programming the model. Phase three is testing and debugging the model.

Phase one: For the last two months I have been researching and learning through an array of courses and literature. Since machine learning and natural language processing are such theory-heavy fields, I used Google’s Machine Learning Crash Course and NLP Python Book. The difficulty with each source was picking out what was important and making sense of abstract and ambiguous concepts. For instance, I had to understand the role that cross-validation plays in testing a machine learning model.
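To make the cross-validation idea concrete, here is a minimal sketch of k-fold splitting in plain Python. It is not taken from either course; the function name k_fold_splits and the toy sizes are my own assumptions for illustration. The data is cut into k folds, and each fold takes one turn as the held-out test set while the model would be trained on the rest.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; libraries such as scikit-learn
    ship their own, more complete versions of this idea.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are not ordered
    # Slicing with a stride of k gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        # Every fold except the i-th one goes into the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

splits = list(k_fold_splits(n_samples=10, k=5))
for train, test in splits:
    print(len(train), "training samples,", len(test), "held out:", sorted(test))
```

Every sample lands in the test set exactly once, so averaging the k test scores gives a steadier estimate of how a model handles unseen data than a single train/test split would.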

Afterwards I began looking through the academic resources on my topic, starting with sentiment analysis on depression and public health issues. Although I was able to digest the concepts and data, there was no code present for me to really learn the fine details. From there I started looking at video and text tutorials on how to program machine learning and sentiment analysis models. Kaggle’s Titanic dataset proved to be the most insightful, clarifying all the concepts I had learned over the past month.

Phase two: Now that I have gone through the necessary prerequisite material, I am finally able to start the nitty-gritty of the project. The first issue I tackled was figuring out what dataset to use. All the academic literature on this subject used Twitter and other micro-blogging sites. As vast as this information is, I quickly ran into two problems: filtering out individuals who are depressed from those who are simply talking about depression, and maintaining confidentiality, since exposing users is quite dangerous for a sensitive and stigmatized issue such as depression. So instead I am using Reddit data. Not only are the comments confidential, but subreddits provide an easy, accessible way to collect topic-based comments. So far I’ve scraped about 1200 comments using the PRAW Python API and am cleaning them of unnecessary words and dividing them into individual words.
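The cleaning and splitting step can be sketched roughly like this. This is a simplified illustration rather than my actual pipeline: the stop-word list, the function name clean_and_tokenize, and the sample comment are all made up for the example, and the scraping itself is left out because it needs Reddit API credentials.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "i", "to", "of", "is", "it", "in", "so", "my"}

def clean_and_tokenize(comment):
    """Lowercase a comment, drop URLs and punctuation, and remove stop words."""
    text = comment.lower()
    text = re.sub(r"https?://\S+", " ", text)  # strip links
    tokens = re.findall(r"[a-z']+", text)      # keep runs of letters and apostrophes
    return [t for t in tokens if t not in STOP_WORDS]

# Made-up sample text, not a real Reddit comment.
sample = "I can't sleep, and the days feel so long. See https://example.com"
tokens = clean_and_tokenize(sample)
print(tokens)
```

Which words count as "unnecessary" is a judgment call. Pronouns such as "I" may turn out to carry signal for this topic, so the stop-word list is worth revisiting once the model is running.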

Some things I learned: Machine learning is a broad and widely used field of artificial intelligence. It is built on the idea of using massive amounts of data to tackle real-world problems like voice recognition, self-driving cars, and music recommendations. For my project, I am dealing with natural language processing, or using machine learning on text. Although this is a niche field, the applications of NLP are tremendous.

Machine learning is built on two components: data pre-processing and classification/prediction. Most data that is initially collected is error-riddled and incompatible with algorithms, which is why it needs to be converted into numerical values that calculations can be performed on. This phase, called feature engineering, turns data features, like the characteristics of a flower or the words in a sentence, into something more usable. For my project, for instance, one feature I can create is to count the frequencies of words in the dataset and use the most common words. This procedure is called ‘bag of words’ and is used often in NLP.
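Here is a small sketch of the bag-of-words idea using only Python’s standard library. The toy documents and the helper names build_vocabulary and to_vector are assumptions made for illustration, not my project code. The vocabulary is the most common words across all comments, and each comment becomes a list of counts, one per vocabulary word.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, size):
    """Return the `size` most frequent words across all documents."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return [word for word, _ in counts.most_common(size)]

def to_vector(tokens, vocabulary):
    """Turn one tokenized document into word counts, ordered like the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]  # words not present count as 0

# Toy, made-up documents that have already been cleaned and split into words.
docs = [
    ["feel", "tired", "tired"],
    ["feel", "fine", "today"],
    ["tired", "today"],
]
vocab = build_vocabulary(docs, size=3)
vectors = [to_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each comment is now a fixed-length list of numbers, which is the form a classifier can work with. Word order is thrown away, which is the main trade-off of this representation.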
