Update 2: Building the model

As of this update, I have finished most of the model building and metric analysis. Using the initial 2,000 comments (1,000 non-depressed, 1,000 depressed), I cleaned out unnecessary words (stop-words) and punctuation, and stripped suffixes from the remaining words. Then I built a frequency table of the most common two-word n-grams within the depressed class. An n-gram is a sequence of adjacent words; for example, in “I like bees” the two-word n-grams, or bigrams, are “I like” and “like bees”.
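The post doesn’t include the actual preprocessing code, but the pipeline described above (lowercasing, dropping stop-words and punctuation, crude suffix stripping, then counting bigrams) can be sketched in plain Python. The stop-word list and suffix rule here are illustrative stand-ins, not the ones actually used:

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a full list
# (e.g. NLTK's English stop-words).
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "so"}

def clean(text):
    # Lowercase, keep only word characters, drop stop-words,
    # and crudely strip common suffixes from longer words.
    words = re.findall(r"[a-z']+", text.lower())
    words = [w for w in words if w not in STOP_WORDS]
    return [re.sub(r"(ing|ed|s)$", "", w) if len(w) > 4 else w for w in words]

def bigram_counts(comments):
    # Count adjacent word pairs (bigrams) across all comments.
    counts = Counter()
    for comment in comments:
        tokens = clean(comment)
        counts.update(zip(tokens, tokens[1:]))
    return counts

top = bigram_counts(["I feel so alone", "I feel like nothing matters"]).most_common(3)
# ("i", "feel") comes out on top, matching the pronoun-heavy pattern noted below
```

In practice this is usually done with a library such as scikit-learn’s `CountVectorizer` with `ngram_range=(2, 2)`, which handles tokenization and counting in one step.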

 

[Figure: frequency table of the most common bigrams in the depressed class]

What’s interesting is that the words “depression” and “depressed” are not used frequently; instead, personal-pronoun-heavy phrases like “I feel” or “I want” occur far more often.

I then compiled this data into a word cloud visualization. A word cloud is similar to a frequency graph, but instead of a numerical axis it uses the relative size of each word to indicate its frequency.
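The core idea behind a word cloud, scaling each word’s display size in proportion to its frequency, is simple enough to sketch directly. The frequencies and size bounds below are made up for illustration; actual rendering is typically delegated to a library such as the `wordcloud` Python package:

```python
from collections import Counter

def cloud_sizes(freqs, min_size=10, max_size=60):
    # Map each word's count to a font size proportional to its share of
    # the most frequent word -- the scaling a word-cloud layout applies.
    top = max(freqs.values())
    return {word: round(min_size + (max_size - min_size) * n / top)
            for word, n in freqs.items()}

sizes = cloud_sizes(Counter({"feel": 40, "want": 20, "sorry": 10}))
# "feel" gets the largest size, "sorry" the smallest
```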

 

[Figure: word cloud of the most frequent words in the depressed class]

After compiling and processing the data, I trained three separate models to classify comments as depressed vs. non-depressed and recorded their performance metrics. The best-performing model scored 73%, while the worst scored around 66%. Interestingly, when I used the most common one-word combinations (unigrams) instead of bigrams, the performance of all three models jumped by almost 10%.
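The post doesn’t name the three models, but a common baseline for this kind of text classification is multinomial Naive Bayes over token counts. As a self-contained illustration of the technique (not the author’s actual code, and with toy data in place of the real comments), a minimal version looks like this:

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial Naive Bayes over lists of tokens."""

    def fit(self, docs, labels):
        self.priors = Counter(labels)                     # class frequencies
        self.counts = {c: Counter() for c in self.priors} # token counts per class
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc)
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        total_docs = sum(self.priors.values())
        for c in self.priors:
            total = sum(self.counts[c].values())
            lp = math.log(self.priors[c] / total_docs)
            for token in doc:
                # Laplace smoothing so unseen tokens don't zero out the class
                lp += math.log((self.counts[c][token] + 1)
                               / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

# Toy training data (already tokenized); real runs used 2,000 comments
docs = [["i", "feel", "alone"], ["i", "want", "die"],
        ["great", "day"], ["love", "this"]]
labels = ["depressed", "depressed", "non-depressed", "non-depressed"]
model = NaiveBayes().fit(docs, labels)
```

Feeding the model unigram versus bigram token lists is a one-line change at the tokenization step, which makes the 10% unigram/bigram comparison above easy to reproduce.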

Looking at the relative frequencies of the words, pronoun-heavy phrases like “I feel” and negatively connoted words like “not” or “sorry” were used most frequently. Although this makes sense logically, when the model was tested on individual phrases like “I am not depressed”, its results wavered between depressed and non-depressed.
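The wavering on “I am not depressed” is a known weakness of bag-of-words features: word order is discarded, so negation is lost. A quick sketch shows the problem (the depressed-class vocabulary here is a made-up stand-in, and “depressed” appears as “depress” after suffix stripping):

```python
# Illustrative depressed-class vocabulary, not the real learned one
depressed_vocab = {"i", "am", "depress", "feel", "alone"}

# "I am not depressed" after cleaning and suffix stripping
tokens = ["i", "am", "not", "depress"]

# Three of the four unigrams match the depressed class; the "not" that
# flips the meaning is just one more independent feature to the model.
overlap = [t for t in tokens if t in depressed_vocab]
```

Bigrams partially recover negation (“not depress” is its own feature), which is one reason mixing n-gram sizes or adding explicit negation handling is a common next step.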

For the next step I will collect more data (around 200,000-300,000 comments), add different features such as average comment length, and test the model on other data.
