Update 2: Model building

As of this update, I have finished model building and result analysis. Using the initial 2,000 comments (1,000 depressed and 1,000 non-depressed) collected in the last update, I preprocessed the text by removing unnecessary words, special characters (such as #), and punctuation. Then I built a frequency table showing the most common two-word combinations in the text.
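As a rough illustration of this step, here is a minimal sketch of cleaning comments and counting two-word combinations using only the Python standard library. The stop-word list, the function names `clean` and `bigram_counts`, and the sample comments are all made up for the example; the actual cleaning rules used in the project may differ.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the real list of "unnecessary words" is not given.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "is", "it"}

def clean(text):
    """Lowercase, replace anything that is not a letter, apostrophe or space, drop stop words."""
    text = re.sub(r"[^a-z' ]+", " ", text.lower())
    return [word for word in text.split() if word not in STOP_WORDS]

def bigram_counts(comments):
    """Count adjacent word pairs across all comments."""
    counts = Counter()
    for comment in comments:
        tokens = clean(comment)
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Made-up sample comments, not taken from the collected dataset.
sample = ["I feel like #nothing matters!!", "I feel like sleeping all day."]
table = bigram_counts(sample)

# Print the three most common two-word combinations with their counts.
for pair, count in table.most_common(3):
    print(" ".join(pair), count)
```

The `Counter` returned here is the kind of frequency table that a word cloud could be drawn from.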

I then compiled this data into a word cloud. A word cloud is a visualization of word frequencies: the bigger a word appears, the more often it occurs in the text.

Certain two-word combinations proved insightful. Pronoun-heavy phrases like “I feel like” and displays of emotion like “happy” and “sad” appeared to be the most frequent in the depressed text.

After cleaning and preprocessing the text, I converted it into a numerical matrix for the machine learning algorithms to use. Since these algorithms rely on numerical features, the text has to be converted into frequencies first. Then, using the resulting feature vectors, I trained three separate models and compared their performance. The worst scored 67% accuracy, 80% precision, and 44% recall; the best scored 72% accuracy, 83% precision, and 58% recall.
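For readers unfamiliar with the three metrics, the sketch below computes them from true and predicted labels. It assumes that "depressed" is treated as the positive class (label 1), which the post does not state explicitly, and the labels are toy values rather than real model output.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted or labeled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Toy labels: 1 = depressed (assumed positive class), 0 = not depressed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Under that assumption, high precision combined with low recall means a model is usually right when it flags a comment as depressed, but it misses many of the depressed comments.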

Interestingly, when I used the most frequent single words rather than two-word combinations as features, the overall accuracy of the models increased substantially, with the highest-scoring model reaching around 85% accuracy.
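To make the difference between the two feature sets concrete, here is a small sketch that builds a count matrix from either single words (n = 1) or two-word combinations (n = 2). The helper names `ngrams` and `count_matrix` and the two sample comments are hypothetical, and the tokenization is a plain whitespace split rather than the project's actual preprocessing.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-word sequences in a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_matrix(docs, n):
    """Build a vocabulary of n-grams and one row of counts per document."""
    tokenized = [ngrams(doc.lower().split(), n) for doc in docs]
    vocab = sorted({gram for doc in tokenized for gram in doc})
    index = {gram: j for j, gram in enumerate(vocab)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for gram, count in Counter(doc).items():
            row[index[gram]] = count
        matrix.append(row)
    return vocab, matrix

# Made-up comments for illustration only.
docs = ["i feel like crying", "i feel happy today"]
uni_vocab, uni_matrix = count_matrix(docs, 1)
bi_vocab, bi_matrix = count_matrix(docs, 2)
print(uni_vocab, uni_matrix)
print(bi_vocab, bi_matrix)
```

Each row is the feature vector for one comment. On a real corpus the two-word vocabulary is usually much larger and each combination is seen less often than a single word, which may be part of why the one-word features did better here, though I have not tested that.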

Although I have been able to extract interesting results, there is still a lot left to be desired. My next steps are to collect more data (around 200,000 comments), include other indicators such as time and comment length, and test the model on other datasets.