Twitter Sentiment Analysis

Rahul Gupta
5 min read · Dec 20, 2020

Introduction

In the recent past there has been a sharp rise in the use of Twitter as a social media platform where people share opinions about various walks of life. As of May 2020, around 350,000 tweets were sent per minute on average. In this blog we propose a technique for text sentiment classification using term frequency-inverse document frequency (TF-IDF) and compare different classification models trained on different word embeddings: Word2Vec and pretrained GloVe.

Dataset

The dataset used is the Sentiment140 dataset, available on Kaggle: Sentiment140 dataset with 1.6 million tweets | Kaggle

It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 1 = positive) and can be used to detect sentiment.
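
As a minimal sketch, the dataset can be loaded with pandas; the file name and column layout below follow the Kaggle distribution, and the remapping of the raw positive label 4 to 1 is an assumed preprocessing step to match the 0/1 annotation above.

```python
import pandas as pd

# Column layout documented for Sentiment140; the file name is the one
# distributed on Kaggle.
cols = ["target", "id", "date", "flag", "user", "text"]
df = pd.read_csv(
    "training.1600000.processed.noemoticon.csv",
    encoding="latin-1",
    names=cols,
)

# The raw file labels positives as 4; remap to 1 to get 0/1 labels
# (an assumed step, since the post reports 0 = negative, 1 = positive).
df["target"] = df["target"].replace(4, 1)
print(df["target"].value_counts())  # should show a balanced 0/1 split
```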

1.6 million tweets and their corresponding sentiment labels
Balanced dataset
Most positive words found in the tweets

Approach 1 - Feature extraction with Word2Vec:

Word2Vec creates distributed numerical representations of words by learning from the contexts in which they appear, so words used in similar contexts end up with similar vectors.
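
As a minimal sketch, such vectors can be learned with gensim, assuming the tweets have already been tokenized; the toy corpus and the vector size here are illustrative, not the exact training setup.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus standing in for the 1.6M preprocessed tweets.
tokenized_tweets = [
    ["what", "a", "great", "day"],
    ["traffic", "was", "awful", "today"],
]

# gensim 4.x API; older versions call this parameter `size`.
w2v = Word2Vec(
    sentences=tokenized_tweets,
    vector_size=100,  # dimensionality of each word vector
    window=5,         # context window around each target word
    min_count=1,      # keep every word in this tiny example
    workers=4,
)

print(w2v.wv["great"].shape)  # (100,)
```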

Machine Learning Models:

Different machine learning models were trained on the features extracted by Word2Vec. The accuracy of the XGBClassifier was found to be better than that of the other machine learning models, as shown in the table below.
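
A common way to turn word vectors into tweet-level features is to average them; the sketch below feeds such averaged vectors to an XGBClassifier. The averaging step is an assumption, since the post does not state how word vectors were pooled per tweet.

```python
import numpy as np
from xgboost import XGBClassifier

def tweet_vector(tokens, w2v, dim=100):
    """Average the vectors of in-vocabulary words; zeros if none match."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# `tokenized_tweets` and `w2v` come from the Word2Vec sketch above.
X = np.array([tweet_vector(t, w2v) for t in tokenized_tweets])
y = np.array([1, 0])  # toy labels: 1 = positive, 0 = negative

clf = XGBClassifier(n_estimators=100, max_depth=6)
clf.fit(X, y)
```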

Deep Learning Models:

Different neural network models were trained on the features extracted by Word2Vec. The CNN + bidirectional LSTM reached an accuracy of 0.76, performing better than the classic machine learning models.

Performance of models on Word2Vec features
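
For readers who want to reproduce the architecture, here is a hedged Keras sketch of a CNN + bidirectional LSTM; the layer sizes, vocabulary size, and filter counts are illustrative assumptions, not the exact configuration behind the numbers above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Embedding, Conv1D, MaxPooling1D, Bidirectional, LSTM, Dense,
)

VOCAB_SIZE, EMB_DIM = 20000, 100

model = Sequential([
    # Word2Vec vectors can be loaded into this layer via `weights=`;
    # here it is randomly initialized for brevity.
    Embedding(VOCAB_SIZE, EMB_DIM),
    Conv1D(64, kernel_size=3, activation="relu"),  # local n-gram features
    MaxPooling1D(pool_size=2),
    Bidirectional(LSTM(64)),         # context in both directions
    Dense(1, activation="sigmoid"),  # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```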

Approach 2 - Feature extraction with TF-IDF:

Using unigrams: The basic feature considered was unigrams, which take one word at a time into account when creating the feature vector.

Using bigrams: A bigram is a sequence of two words, so the TF-IDF vector is constructed by taking two words at a time to create the feature vector.

Using unigrams and bigrams: In this approach both unigrams and bigrams are used to construct the TF-IDF vector, and the model is then trained on this vector. All three variants are shown in the sketch below.
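
In scikit-learn the three variants differ only in the ngram_range argument of TfidfVectorizer, as this minimal sketch shows (the toy corpus is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "not good at all"]  # toy corpus

unigram = TfidfVectorizer(ngram_range=(1, 1))  # unigrams only
bigram = TfidfVectorizer(ngram_range=(2, 2))   # bigrams only
uni_bi = TfidfVectorizer(ngram_range=(1, 2))   # unigrams + bigrams

X = uni_bi.fit_transform(docs)
print(uni_bi.get_feature_names_out())  # both single words and word pairs
```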

Comparison between the above variations: Training models on unigram and bigram features together performed better than training on unigram or bigram features alone. The accuracy of SVM and Logistic Regression reached 0.79 on the test set.
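
As a minimal sketch, the two best performers can be trained in a scikit-learn pipeline on the combined unigram + bigram features; the toy data and default hyperparameters are assumptions, not the exact setup used here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["good movie", "not good at all", "loved it", "terrible"]
labels = [1, 0, 1, 0]  # toy data: 1 = positive, 0 = negative

for clf in (LogisticRegression(max_iter=1000), LinearSVC()):
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    pipe.fit(texts, labels)
    print(type(clf).__name__, pipe.score(texts, labels))
```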

Performance of models using TF-IDF features

Approach 3 - Feature extraction with pretrained GloVe:

GloVe stands for Global Vectors for Word Representation. It is an alternative method for creating word embeddings.
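
A hedged sketch of loading pretrained GloVe vectors into an embedding matrix for a Keras model follows. The file glove.twitter.27B.100d.txt is one of the published Twitter GloVe files, though whether the authors used this particular one is an assumption, and word_index here is a toy stand-in for a fitted Keras Tokenizer's vocabulary.

```python
import numpy as np

EMB_DIM, VOCAB_SIZE = 100, 20000
word_index = {"good": 1, "bad": 2}  # toy stand-in for a Tokenizer index

# Parse the published GloVe text format: a word followed by its vector.
embeddings = {}
with open("glove.twitter.27B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Rows of this matrix plug into a Keras Embedding layer via `weights=`;
# words missing from GloVe are left as zero vectors.
emb_matrix = np.zeros((VOCAB_SIZE, EMB_DIM))
for word, i in word_index.items():
    vec = embeddings.get(word)
    if vec is not None and i < VOCAB_SIZE:
        emb_matrix[i] = vec
```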

Comparison of models trained on Word2Vec and GloVe word embeddings: Word2Vec performs better than the pretrained GloVe embeddings in the neural network models. The CNN + bidirectional LSTM achieved 0.76 accuracy when Word2Vec features were used.

Performance of models using pretrained GloVe features

Comparing accuracies of different models

Models and their accuracies
Confusion matrix of the SVM model trained on TF-IDF unigram and bigram features

The baseline model, Logistic Regression trained on TF-IDF features with unigrams only, is colored brown; the orange bars mark the best models, which reached 0.79 accuracy on TF-IDF features when the n-gram range covered both unigrams and bigrams.

Conclusion

As this is a classification task, we implemented Linear SVC, Logistic Regression, Naïve Bayes, MLP, XGBoost, and neural network models. The different n-gram settings of the TF-IDF vector helped us analyze the differences between the unigram, bigram, and unigram + bigram approaches. Unigrams and bigrams alone give reasonable accuracies, but the unigram + bigram model achieved the best accuracy of 79%. The Word2Vec features give better results for neural networks than for traditional machine learning classifiers like Naïve Bayes and Logistic Regression, with the CNN + bidirectional LSTM model achieving 76% accuracy. The models trained on GloVe word embeddings reached a maximum accuracy of 73%, obtained with an LSTM recurrent neural network.

We thus conclude that the unigram + bigram TF-IDF implementation with Logistic Regression and SVM performed best among all the models we trained, reaching 79% accuracy.

Blog Authors and Their Contributions

Rahul Gupta (linkedin.com/in/rahul-gupta-68700a114): Extracted features using Word2Vec and trained traditional machine learning models, including Naïve Bayes, SVM, Logistic Regression, XGBClassifier, and MLP; extracted features using pretrained GloVe and trained recurrent neural network models on them, including SimpleRNN, LSTM, and CNN + bidirectional LSTM.

Anjali (linkedin.com/in/anjali-b3169a1ab): Text preprocessing, which included removing user references, converting emojis to their corresponding sentiment, and removing hyperlinks; extracted features using the TF-IDF vectorizer and trained traditional machine learning models, including Naïve Bayes, SVM, Logistic Regression, XGBClassifier, and MLP.
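
A hedged sketch of these preprocessing steps is given below; the exact regular expressions and the emoji-to-sentiment mapping the authors used are not shown in the post, so the patterns and the tiny emoticon table here are illustrative.

```python
import re

EMOTICONS = {":)": " happy ", ":(": " sad "}  # illustrative mapping

def clean_tweet(text):
    text = re.sub(r"@\w+", "", text)          # remove user references
    text = re.sub(r"https?://\S+", "", text)  # remove hyperlinks
    for emo, word in EMOTICONS.items():       # emoticons -> sentiment words
        text = text.replace(emo, word)
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("@user loved it :) https://t.co/xyz"))  # -> "loved it happy"
```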

Under the guidance of

  1. Professor: Tanmoy Chakraborty (linkedin.com/in/tanmoy-chakraborty-89553324)
  2. Prof. Website: faculty.iiitd.ac.in/~tanmoy/
  3. Teaching Fellow: Ishita Bajaj
  4. Teaching Assistants: Shiv Kumar Gehlot, Vivek Reddy, Pragya Srivastava, Chhavi Jain, Shikha Singh and Nirav Diwan.

