Final Update and Summary for Detecting Depression in Social Media

This will be the final update and summary of my summer research project.

My project was detecting depression in social media using machine learning and AI techniques. It involved collecting 50,000 comments from Reddit, a social media site made up of subreddits, or subforums, each built around a theme. I collected comments from /r/depression, /r/depressed, /r/suicidewatch, and /r/lonely for my experimental group, and from /r/frugal, /r/casualconversation, /r/relationships, /r/legaladvice, /r/AcomplishedToday, and /r/wholesomememes for my control group. The data collection was done with PRAW, a Python library for the Reddit API, and took about 2 hours to fully scrape.
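To make the collection step concrete, here is a minimal sketch of pulling comments with PRAW. The function names, the per-subreddit limit, and the credential placeholders are assumptions for illustration, not the exact script used for this project. It relies only on PRAW's `reddit.subreddit(name).comments(limit=...)` listing, which returns recent comments and is capped by Reddit, so gathering 50,000 comments would need more than this one call per subreddit.

```python
def collect_comments(reddit, subreddit_names, label, limit_per_subreddit=100):
    """Return (comment_text, label) pairs from each subreddit's newest comments.

    `reddit` is expected to be a praw.Reddit instance, or any object that
    exposes the same subreddit(name).comments(limit=...) interface.
    """
    rows = []
    for name in subreddit_names:
        for comment in reddit.subreddit(name).comments(limit=limit_per_subreddit):
            rows.append((comment.body, label))
    return rows


def make_reddit_client():
    """Build a PRAW client. The credential strings are placeholders (assumptions)."""
    import praw  # third-party: pip install praw

    return praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="depression-research-sketch",
    )
```

With real credentials filled in, `collect_comments(make_reddit_client(), ["depression", "lonely"], label=1)` would return the experimental-group rows, and the same call with the other subreddit names and `label=0` would return the comparison rows.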

After the comments were collected, I cleaned the data by removing unnecessary words, called stop-words, lower-casing all words, and removing all punctuation. This was done to ensure consistency and keep the model focused on the meaningful words. Then I applied a step called lemmatizing, which reduces each word to its base form. For instance, “running” and “ran” become the basic form “run”. Finally, I split the text into either one-word units (uni-grams) or two-word combinations (bi-grams). Once preprocessing was finished, I built a frequency table to show the most common words.
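The preprocessing steps above can be sketched in a few lines of plain Python. This is only an illustration: the tiny `STOP_WORDS` set and the `LEMMAS` lookup table are made-up stand-ins for the full stop-word list and lemmatizer an NLP library would provide, and the two example comments are invented.

```python
import string
from collections import Counter

# Assumption: small hand-written stand-ins for a real stop-word list and lemmatizer.
STOP_WORDS = {"i", "a", "an", "the", "and", "to", "was", "am", "is", "of", "in", "all", "then"}
LEMMAS = {"running": "run", "ran": "run", "runs": "run"}


def preprocess(comment):
    """Lower-case, strip ASCII punctuation, drop stop-words, then lemmatize."""
    lowered = comment.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in no_punct.split() if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]


def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented example comments, not taken from the collected data.
comments = ["I was running all day.", "Then I ran home, tired and alone."]

unigram_counts = Counter()
bigram_counts = Counter()
for comment in comments:
    tokens = preprocess(comment)
    unigram_counts.update(ngrams(tokens, 1))
    bigram_counts.update(ngrams(tokens, 2))

print(unigram_counts.most_common(1))  # [('run', 2)]
```

Note that `string.punctuation` only covers ASCII characters, so a curly apostrophe would survive this cleaning step and show up in the counts as its own token.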


[Figures: frequency charts of the most common words]


The most common words in the depressed comments (besides the apostrophe) were personal nouns and negatively charged words. I also built a word cloud, which shows the frequencies in a more visual format.

[Figures: word clouds of the most common words]

After the cleaning and graphing were done, I ran the words through a Naive Bayes model and a Logistic Regression model to classify depressed comments versus non-depressed comments. These models use statistical techniques to help “sort” and “divide” comments according to whether they are depressed or not depressed.
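To give a feel for how a Naive Bayes model “sorts” comments, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is not the implementation used in this project: the class name `TinyNaiveBayes` and the toy training data are assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, token_lists, labels):
        self.class_counts = Counter(labels)      # number of comments per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for tokens, label in zip(token_lists, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, tokens):
        """Log prior plus summed log likelihoods for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)
            for token in tokens:
                if token in self.vocab:  # skip words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy, invented training data (not from the project's dataset).
train_tokens = [
    ["feel", "alone", "tired"],
    ["hate", "feel", "empty"],
    ["save", "money", "coupon"],
    ["lawyer", "contract", "money"],
]
train_labels = ["depressed", "depressed", "not_depressed", "not_depressed"]

model = TinyNaiveBayes().fit(train_tokens, train_labels)
print(model.predict(["feel", "empty", "alone"]))  # depressed
print(model.predict(["money", "contract"]))       # not_depressed
```

The model simply asks which class makes the observed words more likely, given how often each word appeared in that class during training.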

I first split the data into training data and test data. Machine learning algorithms need training data to “teach” the models the difference between the categories. I then applied the test dataset to measure the accuracy. The whole premise of the training/test split is to hold back a set of data with known correct answers and compare the model’s answers against them.
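The split-and-score idea can be sketched as follows. The 20% test fraction, the fixed seed, and the helper names are assumptions chosen for the example, since the split ratio used in the project is not stated here; the rows and predictions are made up.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of predictions that match the known correct labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Invented labelled rows: (comment_id, label). Real rows would hold comment text.
rows = [(i, "depressed" if i % 2 == 0 else "not_depressed") for i in range(10)]
train, test = train_test_split(rows)
print(len(train), len(test))  # 8 2

# Comparing a model's answers with the given answers: 3 of 4 match.
predicted = ["depressed", "depressed", "not_depressed", "depressed"]
actual = ["depressed", "not_depressed", "not_depressed", "depressed"]
print(accuracy(predicted, actual))  # 0.75
```

The model never sees the test rows during training, so the accuracy on them is an estimate of how it would do on new comments.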

Using this methodology, I was able to achieve 79% accuracy for unigrams using Naive Bayes and 78% using Logistic Regression. For bigrams, I was able to achieve 73% accuracy with both models.

The graph below is called a ROC curve, which plots the model’s true positive rate against its false positive rate. The area under the ROC curve (AUC) measures how well the model separates the two classes, and in this case it corresponds to each model’s accuracy percentage.

[Figures: ROC curves for the models]
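For readers who want to see how a ROC curve and its area are computed, here is a small worked sketch. The scores and labels are invented, and the helper names `roc_points` and `auc` are assumptions for the example, not the plotting code behind the figures. Strictly speaking, AUC and accuracy are different measures, so the two numbers do not have to match in general.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    The threshold sweeps from the highest score to the lowest.
    Labels are 1 for the positive class (depressed) and 0 otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Only emit a point once every item sharing this score is counted.
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Invented model scores (higher means "more likely depressed") and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(round(auc(points), 3))  # 0.889
```

An AUC of 1.0 would mean every depressed comment scored higher than every non-depressed one, while 0.5 is what random guessing would give.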

Using tried and tested data collection and preprocessing, I was able to achieve modest accuracy for each respective model. A common issue in machine learning is overfitting and underfitting. Overfitting is when the model fits the training data too closely, noise included, and fails to generalize to new data. Underfitting is when the model is too simple to capture the patterns in the data. For the purpose of my research, I decided the models should err toward underfitting rather than overfitting. Due to the dangers of false diagnosis in depression and mental disorders, it is best to have a clear margin of error and avoid over-analysis. With this in mind, the unigram models, with the highest relative accuracy, are shown to have the best results.