
Analyzing #WhenTrumpIsOutOfOffice tweets
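
The charts in this section assume a tidy_tweets data frame with one word per row. A minimal sketch of how such a frame is typically built with tidytext follows; the tweets_df object and the stop-word handling are assumptions, not necessarily the exact preprocessing used here.

#Building the one-word-per-row data frame (assumed preprocessing)
library(dplyr)
library(tidytext)
library(ggplot2)
# tweets_df is assumed to hold the collected tweets with a 'text' column
tidy_tweets <- tweets_df %>%
  # split each tweet into individual words
  unnest_tokens(word, text) %>%
  # drop common stop words such as 'the' and 'and'
  anti_join(stop_words, by = "word")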

#Unigram bar chart
tidy_tweets %>%
  # count each word and keep only the most frequent ones
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  filter(n > 150) %>%
  ggplot(aes(word, n)) +
  # show counts as horizontal bars with value labels
  geom_col(fill = "red") +
  xlab(NULL) +
  coord_flip() +
  ggtitle("#WhenTrumpIsOutOfOffice - 1-Word Frequency") +
  geom_text(aes(x = word, label = n), vjust = 0, hjust = -0.3, size = 4)
#Bigram bar chart (showing code for the bar chart only)
...
bigrams_united %>%
  # count each two-word phrase and keep only the most frequent ones
  count(bigram, sort = TRUE) %>%
  mutate(bigram = reorder(bigram, n)) %>%
  filter(n > 20) %>%
  ggplot(aes(bigram, n)) +
  # show counts as horizontal bars with value labels
  geom_col(fill = "blue") +
  xlab(NULL) +
  coord_flip() +
  ggtitle("#WhenTrumpIsOutOfOffice - 2-Word Frequency") +
  geom_text(aes(x = bigram, label = n), vjust = 0, hjust = -0.3, size = 4)
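
The bigram chart relies on a bigrams_united data frame whose construction is elided above. A common way to build it with tidytext looks roughly like the sketch below; the column and object names are assumptions rather than the original code.

#Building the bigram data frame (assumed preprocessing)
library(tidyr)
bigrams_united <- tweets_df %>%
  # split each tweet into overlapping two-word sequences
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # split each bigram so stop words can be dropped from either position
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  # glue the remaining word pairs back together
  unite(bigram, word1, word2, sep = " ")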

Besides word frequencies, we can also quickly extract keywords using pre-built algorithms and packages. In our case, I extracted keywords with RAKE (Rapid Automatic Keyword Extraction), available in the udpipe R package.
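
The RAKE and POS-tagging examples below operate on a tidy_text data frame with doc_id, token, lemma, and upos columns, as produced by udpipe annotation. A sketch of how that annotation is typically run (the model choice and object names are assumptions):

#Annotating the tweets with udpipe (assumed preprocessing)
library(udpipe)
# download and load an English model
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
# annotate the raw tweet text: tokens, lemmas, and POS tags
tidy_text <- as.data.frame(udpipe_annotate(ud_model, x = tweets_df$text))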

Based on the RAKE keywords bar chart below, we can draw the following insights:

  • The two keywords with the highest scores are ‘white house’ and ‘new president.’ People are concerned about whether Trump will continue with a second term or be replaced by a new president.
  • Similar to the word-frequency results, people plan to throw the biggest party when Trump is out of office.
  • Interestingly, ‘better place’ is also near the top of the RAKE keyword list.
Keywords identified by RAKE method bar chart
#Keywords identified by RAKE method bar chart
stats <- keywords_rake(x = tidy_text, term = "lemma", group = "doc_id",
                       relevant = tidy_text$upos %in% c("NOUN", "ADJ"))
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
stats %>%
  # keep multi-word keywords that appear often enough
  filter(freq > 10 & ngram > 1) %>%
  # order the keywords by their RAKE score
  ggplot(aes(x = reorder(keyword, rake), y = rake)) +
  # show in bars
  geom_col(fill = "red") +
  # flip the bars to be horizontal
  coord_flip() +
  # show the RAKE score as a value label
  geom_text(aes(label = round(rake, digits = 2)), vjust = 0, hjust = -0.3) +
  # label the keyword axis (shown vertically after the flip)
  xlab("keywords") +
  # add title
  ggtitle("Keywords identified by RAKE method") +
  # hide legend
  theme(legend.position = "none")

Another interesting text mining technique is to give each word a grammatical tag, known as part-of-speech (POS) tagging. In this case, we specify a pattern to extract simple noun phrases from the tweets.
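
Before extracting phrases, it can help to glance at how the universal POS tags are distributed in the annotated data; a quick check, assuming the tidy_text object from the udpipe annotation above:

#Quick look at the POS tag distribution
# count how many tokens fall under each universal POS tag (NOUN, VERB, ADJ, ...)
sort(table(tidy_text$upos), decreasing = TRUE)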

Some takeaways from the POS tagging bar chart:

  • ‘Mike Pence’/‘President Pence’ are heavily mentioned phrases, compared to ‘Joe Biden’, when Trump is out of office.
  • When Trump is out of office, people heave a sigh of relief.
Keywords identified by POS tags — Simple noun phrases bar chart
#Keywords identified by POS tags - Simple noun phrases bar chart
# recode universal POS tags into one-letter codes (A, N, P, D, ...) for phrase matching
tidy_text$phrase_tag <- as_phrasemachine(tidy_text$upos, type = "upos")
# extract token sequences that match a simple noun-phrase pattern
stats <- keywords_phrases(x = tidy_text$phrase_tag, term = tolower(tidy_text$token),
                          pattern = "(A|N)*N(P+D*(A|N)*N)*",
                          is_regex = TRUE, detailed = FALSE)
# keep multi-word phrases that appear more than 100 times
stats <- subset(stats, ngram > 1 & freq > 100)
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
stats %>%
  # order the phrases by frequency
  ggplot(aes(x = reorder(keyword, freq), y = freq)) +
  # show in bars
  geom_col(fill = "red") +
  # flip the bars to be horizontal
  coord_flip() +
  # show the frequency as a value label
  geom_text(aes(label = freq), vjust = 0, hjust = -0.3) +
  # label the keyword axis (shown vertically after the flip)
  xlab("keywords") +
  # add title
  ggtitle("Keywords identified by POS tags - Simple noun phrases") +
  # hide legend
  theme(legend.position = "none")

Sentiment analysis is a field in Natural Language Processing (NLP) that tries to recognize emotions within text data. It allows us to quickly understand public sentiment on a specific topic or individual, and Twitter, where you can find tons of publicly expressed opinions, is an excellent data source for this kind of text mining.

Using the ‘NRC’ lexicon, we can tag words with eight basic emotions (trust, anticipation, fear, etc.) and two sentiments (positive and negative).
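
To see those categories before joining, the lexicon itself can be inspected directly. This uses the same get_sentiments() call as the code further below; the first call may prompt you to download the lexicon via the textdata package.

#Inspect the NRC lexicon categories
library(tidytext)   # get_sentiments()
library(dplyr)      # count()
get_sentiments("nrc") %>%
  # count how many lexicon entries fall under each emotion/sentiment
  count(sentiment, sort = TRUE)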

In our case, we see an almost equal distribution of positive and negative words in the tweets. Note how ‘trust’ has the highest percentage among the emotions, while ‘surprise’ has the fewest words. In the next section, we will look in detail at which words are tagged to each emotion and sentiment.

Sentiment ranking list
#Sentiment ranking list
nrc_words <- tidy_tweets %>%
  # keep only words found in the NRC lexicon, tagged with their emotion/sentiment
  inner_join(get_sentiments("nrc"), by = "word")
#Rank of sentiments
sentiments_rank <- nrc_words %>%
  group_by(sentiment) %>%
  tally() %>%
  arrange(desc(n))
#Find percentage of each category (8169 = total word-sentiment matches in this dataset)
sentiments_rank %>%
  mutate(percent = (n/8169)*100)

We could also visualize the sentiment rankings in pie chart format:

Word frequency pie chart categorized by sentiments
#Word frequency pie chart categorized by sentiments
sentiments_rank_clean <- sentiments_rank %>%
  # keep only the eight emotions, dropping the overall positive/negative sentiments
  filter(sentiment != "positive") %>%
  filter(sentiment != "negative")
# Create a bar chart first
bp <- ggplot(sentiments_rank_clean, aes(x = reorder(sentiment, -n), y = n, fill = sentiment)) +
  geom_bar(width = 1, stat = "identity")
# Turn the bar chart into a pie chart by switching to polar coordinates
pie <- bp + coord_polar("x", start = 0) +
  ggtitle("#WhenTrumpIsOutOfOffice - Sentiment pie chart") +
  xlab("Word frequency - Sentiment")
# display the chart
pie

Let’s look at the words that appear in the ‘Trust’ category, which has the highest word frequency. Notice that the word ‘President’ appears more than 400 times.
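
A sketch of how these per-emotion word lists can be produced from the nrc_words object built earlier (keeping the top ten words per category is my own choice):

#Top words within each emotion/sentiment category
nrc_words %>%
  # count how often each word appears under each sentiment
  count(sentiment, word, sort = TRUE) %>%
  # keep the ten most frequent words per category
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup()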

Other interesting insights:

  • The word ‘hell’ tops the list for the anger, fear, and sadness emotion categories.
  • The word ‘finally’ tops the list for disgust, joy, and surprise.
  • If you take a closer look at the words under the negative sentiment, you will find that most are related to hatred, jail, and the like, rather than to sadness or disappointment.