Sentiment Analysis, building word clouds and more…
Ever heard of Twint?
Twint (short for Twitter Intelligence Tool) is an advanced scraping tool written in Python. Unlike tweepy, which collects data through the Twitter API, Twint scrapes the web directly. You can install it using:
pip3 install twint
The twint documentation can be found here.
In this article, we will use Donald Trump’s tweets since the start of the year 2019. We can download tweets of a given user with this simple command in the command line:
twint -u realDonaldTrump --since 2019-01-01 -o trump.csv --csv
This will download all the tweets of @realDonaldTrump since the start of 2019 into a single CSV file.
Here I have converted the CSV file to XLS format for convenience. Let’s dive in!
import pandas as pd
df=pd.read_excel('trump.xls')
***added columns mentions, hashtags and length***
***added month, year and hour columns***
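The code that adds these columns is not shown. A minimal sketch of how they could be derived for a single tweet record follows, using only the standard library. The input keys `date`, `time` and `tweet`, their timestamp layout, and the helper name `add_features` are assumptions for illustration, not necessarily what the original notebook used.

```python
import re
from datetime import datetime

MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")

def add_features(row):
    """Return a copy of `row` with mentions, hashtags, length and date parts.

    Hypothetical helper: assumes `row` has 'date', 'time' and 'tweet' keys.
    """
    text = row["tweet"]
    # Assumed layout: 'YYYY-MM-DD' in `date` and 'HH:MM:SS' in `time`.
    stamp = datetime.strptime(f"{row['date']} {row['time']}", "%Y-%m-%d %H:%M:%S")
    out = dict(row)
    out["mentions"] = MENTION_RE.findall(text)   # user names without '@'
    out["hashtags"] = HASHTAG_RE.findall(text)   # tags without '#'
    out["length"] = len(text)                    # length in characters
    out["month"] = stamp.month
    out["year"] = stamp.year
    out["hour"] = stamp.hour
    return out

example = add_features({
    "date": "2019-03-05",
    "time": "07:42:10",
    "tweet": "Great meeting with @VP today! #MAGA",
})
print(example["mentions"], example["hashtags"], example["length"], example["hour"])
```

The same logic can be applied to every row of the DataFrame to produce the extra columns.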
- Let’s look at the average tweet length by hour.
Looks like the president’s tweets are lengthy early in the morning (3am to 10am).
- Average number of mentions by hour.
How about when coupled with the sentiment of those tweets? (calculation of sentiment shown later.)
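The hourly averages above boil down to a group-by on the hour followed by a mean (in pandas, something like `df.groupby('hour')['length'].mean()`). The sketch below spells out the same computation in plain Python; the field name `n_mentions` and the sample rows are made up for illustration.

```python
from collections import defaultdict

def average_by_hour(rows, field):
    """Average `field` over all rows that share the same 'hour' value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["hour"]] += row[field]
        counts[row["hour"]] += 1
    return {hour: totals[hour] / counts[hour] for hour in sorted(totals)}

# Made-up sample rows, not real tweet data.
rows = [
    {"hour": 6, "length": 200, "n_mentions": 1},
    {"hour": 6, "length": 100, "n_mentions": 3},
    {"hour": 21, "length": 80, "n_mentions": 0},
]
print(average_by_hour(rows, "length"))
print(average_by_hour(rows, "n_mentions"))
```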
First, let’s clean the tweets. For this, we will create two functions: one for removing the content (URLs, mentions and hashtags, which are stored in separate columns) and the other for cleaning the remaining text (removing stop words and punctuation).
The cleaned_tweets column holds the tweets with content, stop words and punctuation stripped; the tweet column has just the content removed and is used to calculate sentiment and subjectivity.
df['cleaned_tweets']=df['tweet'].apply(lambda x: process_text(x))
df['tweet']=df['tweet'].apply(lambda x: remove_content(x))
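The bodies of `remove_content` and `process_text` are not shown here. A rough sketch of what such helpers might look like follows, using only the standard library. The tiny stop-word set, the crude suffix stripper standing in for a real stemmer, and the default value of `stem` are all assumptions for illustration; a real run would more likely rely on a full stop-word list and a proper stemmer from an NLP library.

```python
import re
import string

# Tiny illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of",
              "in", "on", "for", "with", "will", "be"}

def remove_content(text):
    """Strip URLs, @mentions and #hashtags, then collapse whitespace."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    return " ".join(text.split())

def _crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def process_text(text, stem=False):
    """Remove content, lowercase, drop punctuation and stop words."""
    text = remove_content(text).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in STOP_WORDS]
    if stem:
        words = [_crude_stem(w) for w in words]
    return " ".join(words)

raw = "The Wall is being built! https://t.co/abc @FoxNews #MAGA"
print(remove_content(raw))
print(process_text(raw))
```

Keeping the two steps separate makes it possible to feed lightly cleaned text to the sentiment step while using the heavily cleaned text for word counts.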
Now let’s build a word cloud to get an idea of frequent phrases.
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
temp=' '.join(df['cleaned_tweets'].tolist())
wordcloud = WordCloud(width = 800, height = 500,
                      min_font_size = 10).generate(temp)
plt.figure(figsize = (8, 8), facecolor = None)
# draw the generated cloud and hide the axes
plt.imshow(wordcloud)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()
More frequent words/phrases appear in larger font.
Now let’s define a function to plot the top n occurrences of phrases in the given ngram range. For this, we will use the
With most of the work done, let’s plot the frequent phrases.
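The counting function itself is not shown. As a hedged illustration of the idea, the sketch below counts word n-grams with `collections.Counter` and returns the top n; this is a stand-in written for this example, not necessarily the tool the article used, and `top_ngrams` is a hypothetical name. Plotting is left out: the returned pairs can be passed to any bar-chart routine.

```python
from collections import Counter

def top_ngrams(texts, ngram_range=(1, 1), n=10):
    """Return the n most common word n-grams across `texts`.

    `ngram_range` is (min_size, max_size), both inclusive.
    """
    low, high = ngram_range
    counts = Counter()
    for text in texts:
        tokens = text.split()
        for size in range(low, high + 1):
            # Slide a window of `size` tokens over the text.
            for start in range(len(tokens) - size + 1):
                counts[" ".join(tokens[start:start + size])] += 1
    return counts.most_common(n)

# Made-up cleaned tweets for illustration.
tweets = ["sleepy joe biden", "sleepy joe", "fake news media", "fake news"]
print(top_ngrams(tweets, ngram_range=(2, 2), n=2))
```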
Sleepy Joe Biden? Seriously?
We use the tweet column to analyze the sentiment and subjectivity of the tweets. For this, we will use TextBlob.
Given an input sentence, TextBlob outputs a tuple of two elements: polarity, a float in the range [-1, 1], and subjectivity, a float in the range [0, 1].
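from textblob import TextBlob
# polarity in [-1, 1] and subjectivity in [0, 1], stored as separate columns
df['sentiment']=df['tweet'].apply(lambda x: TextBlob(x).sentiment.polarity)
df['subject']=df['tweet'].apply(lambda x: TextBlob(x).sentiment.subjectivity)
df['polarity']=df['sentiment'].apply(lambda x: 'pos' if x>=0 else 'neg')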
Let’s look at the sentiment distribution of tweets.
Many of the tweets might not be subjective: a tweet might simply state a fact, like bad news. Let’s find the sentiment distribution of the tweets that were subjective. For this, let’s keep only the tweets with subjectivity greater than 0.5 and plot the distribution.
fig=px.histogram(df[df['subject']>0.5], x='polarity', color='polarity')
Looks like the proportion of negative sentiment increased when only subjective tweets were analyzed.
Now let’s look at the polarity of subjective tweets of the 20 most mentioned users.
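The aggregation behind this view is not shown. One way to sketch it in plain Python is given below, under two assumptions: the 20 most mentioned users are ranked over all tweets, and polarity is then counted only over tweets with subjectivity above 0.5. The function name and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def polarity_by_top_mentions(rows, top_n=20, min_subjectivity=0.5):
    """Count 'pos'/'neg' subjective tweets for the most mentioned users."""
    # Rank users by how often they are mentioned across all tweets (assumption).
    mention_counts = Counter(m for r in rows for m in r["mentions"])
    top_users = [user for user, _ in mention_counts.most_common(top_n)]

    table = defaultdict(Counter)
    for r in rows:
        if r["subject"] <= min_subjectivity:
            continue  # keep subjective tweets only
        for user in r["mentions"]:
            if user in top_users:
                table[user][r["polarity"]] += 1
    return {user: dict(table[user]) for user in top_users}

# Made-up rows for illustration.
rows = [
    {"mentions": ["foxandfriends"], "subject": 0.8, "polarity": "pos"},
    {"mentions": ["foxandfriends", "cnn"], "subject": 0.9, "polarity": "neg"},
    {"mentions": ["cnn"], "subject": 0.2, "polarity": "neg"},
    {"mentions": [], "subject": 0.7, "polarity": "pos"},
]
print(polarity_by_top_mentions(rows, top_n=2))
```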
Topic modeling is a machine learning technique that automatically analyzes text data to determine cluster words for a set of documents. This is known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that’s been previously classified by humans.
We will use the gensim LDA model for topic modeling.
#pre-process tweets to BOW
from gensim import corpora
r = [process_text(x,stem=False).split() for x in df['tweet'].tolist()]
dictionary = corpora.Dictionary(r)
corpus = [dictionary.doc2bow(rev) for rev in r]
#initialize model and print topics
from gensim import models
model = models.ldamodel.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=15)
topics = model.print_topics(num_words=5)
for topic in topics:
    print(topic)
There are some clear topics, like topic 5 from the early stages of the impeachment trial, topic 8 containing phrases related to the China trade deal, and topic 6 regarding his plans to build the wall.
labels=[]
for x in model[corpus]:
    labels.append(sorted(x,key=lambda x: x[1],reverse=True)[0][0])  # id of the most probable topic
df['topic']=pd.Series(labels)
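Assigning one topic per tweet amounts to picking, from the (topic_id, probability) pairs that gensim's LDA model returns for a document, the pair with the largest probability. A small standalone sketch of that selection step follows; `dominant_topic` is a hypothetical helper and the probabilities are invented.

```python
def dominant_topic(topic_probs):
    """Return the topic id with the highest probability.

    `topic_probs` is a list of (topic_id, probability) pairs.
    """
    topic_id, _ = max(topic_probs, key=lambda pair: pair[1])
    return topic_id

# Invented distribution for a single document.
doc_topics = [(0, 0.05), (5, 0.72), (8, 0.23)]
print(dominant_topic(doc_topics))
```

Using `max` with a key on the probability avoids sorting the whole list when only the top entry is needed.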
Let’s look at the topic distribution.
Let’s look at the distribution of topics 5 and 6.