Update 2: Building the model

As of this update, I have finished most of the model building and metric analysis. Using the initial 2,000 comments (1,000 non-depressed, 1,000 depressed), I cleaned out all unnecessary words (stop words) and punctuation, and removed all suffixes. Then I built a frequency table of the most common two-word n-grams within the depressed class. N-grams are sequences of adjacent words; for example, in “I like bees” the two-word n-grams, or bigrams, would be “I like” and “like bees”.
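The cleaning and bigram-counting pipeline can be sketched roughly like this. The post doesn't name its tools, so this is a minimal stand-in: a tiny hand-picked stop-word list and a crude suffix-stripping rule take the place of whatever stop-word list and stemmer were actually used.

```python
import re
from collections import Counter

# Hand-picked stand-in for a real stop-word list (e.g. NLTK's)
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it"}

def clean(comment):
    # lowercase, keep only word characters, drop stop words,
    # then strip a few common suffixes as a crude stemmer
    words = re.findall(r"[a-z']+", comment.lower())
    words = [w for w in words if w not in STOP_WORDS]
    return [re.sub(r"(ing|ed|s)$", "", w) for w in words]

def bigrams(tokens):
    # pair each token with its right-hand neighbour
    return list(zip(tokens, tokens[1:]))

comments = ["I feel like nothing is working", "I want to sleep all day"]
counts = Counter(bg for c in comments for bg in bigrams(clean(c)))
print(counts.most_common(3))
```

Counting bigrams over the whole depressed class in this way produces the frequency table described above.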



What’s interesting is that the words “depression” and “depressed” are not used frequently, but very personal, pronoun-heavy phrases like “I feel” or “I want” appear at much higher rates.

I then compiled this data into a word cloud visualization. A word cloud is similar to a frequency graph, but instead of a numerical axis, it uses the relative size of each word as a measure of its frequency.
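The core idea behind a word cloud is just a mapping from frequency to font size. The post doesn't say which library was used (the `wordcloud` Python package is a common choice); here is a minimal sketch of the scaling step alone, with made-up frequencies:

```python
def cloud_sizes(freqs, min_pt=10, max_pt=60):
    # linearly scale each phrase's frequency into a font-size range,
    # so the most frequent phrase renders largest
    lo, hi = min(freqs.values()), max(freqs.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts match
    return {w: min_pt + (f - lo) * (max_pt - min_pt) / span
            for w, f in freqs.items()}

# hypothetical counts for illustration only
sizes = cloud_sizes({"i feel": 120, "i want": 90, "not": 60})
print(sizes)
```

A real word cloud library also handles layout and collision avoidance, but the frequency-to-size scaling is the part that makes it equivalent to a frequency graph.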



After compiling and processing the data, I trained three separate models to distinguish depressed from non-depressed comments and output their performance metrics. The best-performing model scored 73%, while the worst scored around 66%. Interestingly, when I used the most common single words instead of two-word combinations, the performance of all three models jumped by almost 10%.
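The post doesn't say which three models were trained, but a Naive Bayes classifier is a typical baseline for this kind of text task. The sketch below implements a toy multinomial Naive Bayes over unigram features from scratch; the training comments and labels are made up for illustration.

```python
import math
from collections import Counter

def train(docs):
    # docs: list of (token_list, label); tally word counts and priors per class
    counts = {"dep": Counter(), "non": Counter()}
    priors = Counter()
    for tokens, label in docs:
        counts[label].update(tokens)
        priors[label] += 1
    vocab = set(counts["dep"]) | set(counts["non"])
    return counts, priors, vocab

def predict(tokens, counts, priors, vocab):
    total = sum(priors.values())
    best, best_lp = None, -math.inf
    for label in counts:
        lp = math.log(priors[label] / total)
        n = sum(counts[label].values())
        for t in tokens:
            # add-one smoothing keeps unseen words from zeroing the score
            lp += math.log((counts[label][t] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# toy training data, invented for illustration
docs = [("i feel empty inside".split(), "dep"),
        ("i want it to stop".split(), "dep"),
        ("great game last night".split(), "non"),
        ("happy birthday have fun".split(), "non")]
model = train(docs)
print(predict("i feel sad".split(), *model))
```

Swapping the unigram tokens for the bigrams counted earlier gives the two-word-feature variant the post compares against.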

Looking at the relative frequencies of the words, pronoun-heavy phrases like “I feel” and negatively connoted words like “not” or “sorry” were used most frequently. Although this makes sense logically, when the model was tested on single phrases like “I am not depressed”, the results wavered between depressed and non-depressed.

For the next step, I will collect more data (around 200,000-300,000 comments), add different features such as average comment length, and test the model on other data.
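Length-based features like the one mentioned above are cheap to compute alongside n-gram counts. This is a hypothetical sketch of what such a feature extractor might look like; the feature names are invented:

```python
def length_features(comment):
    # word count and average word length as simple numeric features,
    # to be appended to whatever n-gram features the model already uses
    words = comment.split()
    return {"n_words": len(words),
            "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0}

print(length_features("i feel like nothing matters"))
```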



  1. ajfantine says:

    Hey Alex,

    This is such a fascinating project and with some fine-tuning I could see it having some seriously beneficial effects in many online communities. I love the idea of using n-gram analysis to see what words are most likely to appear together in the ‘depressed’ class, and I’m interested as to why “happy birthday” occurs so frequently. Is it because those two words are very common in online forums and not particularly indicative of depression? I think they might be skewing your data some. In my own research of machine learning, I found a source that talked about splitting data into three groups: the training data, a development set data sample, and the test data. With the dev-set data, you can actually have the NLTK run some error analysis so that you can see where the classifier goes wrong. I’m super excited to see where you go with this and how your analysis of a larger data set changes the accuracy of the classifier. It seems like adding a few more features like average comment length might be helpful, seeing as n-gram analysis isn’t entirely foolproof. Great work so far!
    Alex F