
Automate Entity Extraction of Reddit Subgroup using BERT Model

Named-entity recognition (NER) is a process for extracting information from unstructured text. It is also known as entity extraction. This method extracts information such as times, places, currencies, organizations, medical codes, person names, etc. We can attach these extracted entities to articles or documents as tags.

But what do we achieve by extracting entities from text? Do these tags help us reduce the time spent searching for articles?

Tags on articles or documents can save a lot of time by improving the search process. Tags help us categorize text documents. This is one of the use cases of NER.

Some other use cases of NER are listed below.

1. Categorizing news agencies' articles into world, sports, fashion, entertainment, and other categories.

2. Improving product search on online shopping websites.

3. Categorizing online courses based on relevant tags.

We will use a pre-trained BERT model. Learn more about the BERT model here. The BERT model will extract person names, organizations, and location names from the Reddit subgroup.

This article is divided into three parts.

Part 1. Data collection and Data preparation

A Python program connects to the Reddit API and fetches information from a subreddit. Then we format the data according to the BERT model's input format.

Part 2. Information Extraction

We will extract entity information from the data prepared in the first part.

Part 3. Data Analysis and Data Visualization

In this part, we will analyze the information extracted in the second part via graphs and charts.

Now, let's get started.

Part 1. Data collection and Data preparation

We will be using data from the Reddit subgroup r/worldnews. Reddit provides API access to fetch titles, comments, and other data related to posts. PRAW is a Python library that helps us connect to this API. Learn more about the PRAW library here (https://praw.readthedocs.io/en/latest/). You need to create a Reddit account to access the required information from the API.

This is the required API information.

import praw

reddit = praw.Reddit(client_id='my_client_id',
                     client_secret='my_client_secret',
                     user_agent='my user agent name')

Follow the steps mentioned in this article to get the required API access information.

Once you have access, we will fetch the title and comments from an r/worldnews post. We will use the top weekly post of r/worldnews. You can fetch data from a subgroup based on different timelines and popularity.

def replies_of(top_level_comment, comment_list):
    # Recursively collect the body of every reply under a comment.
    if len(top_level_comment.replies) == 0:
        return
    else:
        for num, comment in enumerate(top_level_comment.replies):
            try:
                comment_list.append(str(comment.body))
            except:
                continue
            replies_of(comment, comment_list)

list_of_subreddit = ['worldnews']
for j in list_of_subreddit:
    # get the top post of the week from the subreddit
    top_posts = reddit.subreddit(j).top('week', limit=1)
    comment_list = []
    # save the post title and all comments in comment_list
    for submission in top_posts:
        print('\n\n')
        print("Title :", submission.title)
        submission_comm = reddit.submission(id=submission.id)
        comment_list.append(str(submission.title))
        for count, top_level_comment in enumerate(submission_comm.comments):
            try:
                replies_of(top_level_comment, comment_list)
            except:
                continue
    print(comment_list)

This code fetches the entire comment section of the post using a recursive function. The data is stored in the comment_list variable.
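The bare try/except blocks are likely there because PRAW's comment trees contain MoreComments placeholder objects, which lack the attributes of regular comments. A cleaner sketch (my own, not the author's code) expands them before traversal:

# Expand every "load more comments" placeholder so that the tree
# contains only real Comment objects before we recurse over it.
submission_comm.comments.replace_more(limit=None)
for top_level_comment in submission_comm.comments:
    replies_of(top_level_comment, comment_list)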

Part 2. Information Extraction

The data prepared in the first part is in the input format of the BERT model. The output generated by the model is saved in different variables.

The transformers Python library from Hugging Face gives us access to a BERT model fine-tuned for NER by DBMDZ. The BERT tokenizer's vocabulary contains around 30k tokens. If the input text contains words that are not present in this vocabulary, the tokenizer breaks those words into known subword pieces.

For example, the word Hugging will be split into Hu and ##gging. If an unrecognized word is considered an entity, then each subword piece is assigned the same tag.

For example, ('Hu', 'I-ORG'), ('##gging', 'I-ORG').
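You can check this subword behaviour directly. A minimal sketch, assuming the same bert-base-cased tokenizer used in the snippet below:

from transformers import AutoTokenizer

# The tokenizer splits out-of-vocabulary words into known subword pieces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Hugging"))  # ['Hu', '##gging']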

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForTokenClassification

# Load the DBMDZ NER model (fine-tuned on CoNLL-2003), a cased BERT
# tokenizer, and the label set the model predicts.
model = TFAutoModelForTokenClassification.from_pretrained(
    "dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
label_list = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER",
              "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

for sequence in comment_list:
    if len(sequence) > 512:
        continue
    tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
    inputs = tokenizer.encode(sequence, return_tensors="tf")
    outputs = model(inputs)[0]
    predictions = tf.argmax(outputs, axis=2)
    list_bert = [(token, label_list[prediction])
                 for token, prediction in zip(tokens, predictions[0].numpy())]

I have limited the input length to 512 because of BERT's 512-token input limit.
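Note that len(sequence) > 512 counts characters, not tokens, so it is only a rough guard. An alternative sketch lets the tokenizer truncate to BERT's actual limit:

# Truncate to BERT's maximum of 512 tokens instead of skipping long comments.
inputs = tokenizer.encode(sequence, return_tensors="tf",
                          truncation=True, max_length=512)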

I have combined these subword pieces and assigned each word its respective entity. The model is not 100% accurate, so wrong tags may be assigned to some words. We will try to avoid these unrelated words in our analysis. A minimal version of the merging step is sketched below.
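The article does not show the merging code; here is one way it could look, assuming list_bert holds the (token, label) pairs produced above. The helper name and the grouping are my own.

def merge_subwords(list_bert):
    # Recombine '##' subword pieces into whole words and group them by entity type.
    tag_map = {"B-LOC": "Location", "I-LOC": "Location",
               "B-PER": "Person Name", "I-PER": "Person Name",
               "B-ORG": "Organisation", "I-ORG": "Organisation"}
    entities = {"Location": [], "Person Name": [], "Organisation": []}
    word, label = "", "O"
    for token, tag in list_bert:
        if token.startswith("##"):
            word += token[2:]   # continuation of the previous word
            continue
        if label in tag_map:    # flush the completed word under its entity type
            entities[tag_map[label]].append(word.upper())
        word, label = token, tag
    if label in tag_map:        # flush the final word
        entities[tag_map[label]].append(word.upper())
    return entities

This simple version keeps each word separate; consecutive words with the same tag (e.g. NEW and ZEALAND) could additionally be joined to form multi-word names like those in the output below.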

Part 3. Data Analysis and Data Visualization

We got three categories for analysis: location, person name, and organization.

Below are the title of the post and the entities extracted from the data.

Title: Research finds that New Zealand beat Covid-19 by trusting leaders and following advice. Citizens had a high level of knowledge about coronavirus and how it spread, and compliance with basic hygiene practices and trust in authorities was at nearly 100%.

{'Location': ['UNITED STATES', 'ILLINOIS', 'GREECE', 'TAIWAN', 'NEW Z', 'ISLAND', 'PORTLAND', 'NSW', 'CANADA', 'QUEENSLAND', 'VIETNAM', 'CHRISTCHURCH', 'HAWAII', 'VICTORIA', 'UK', 'RUSSIA', 'WELLINGTON', 'INDIANA', 'CHUR', 'NZ CHINA', 'STATES', 'ARGENTINA', 'CALIFORNIA', 'IETNAM', 'TRUMPTOWN', 'TEXAS', 'FRANCE', 'AUS', 'NZ', 'NEW YORK', 'JAPAN', 'FLORIDA', 'QLD', 'AUCKLAND', 'KE', 'USA', 'THE', 'CHINA', 'ITALY', 'SWEDEN', 'JONESTOWN', 'MELBOURNE', 'AMERICA', 'NEW ZEALAND', 'IRAQ', 'US', 'AFGHANISTAN', 'AUSTRALIA'],
'Organisation': ['YOUTUBE', 'FED', 'FACEBOOK', 'ALLPRESS', 'GNELL', 'VODAFONE', 'IRON', 'LIB', 'RESERVE BANK', 'LANEWAY', 'DEMS', 'ALJAZEERA', 'RVA', 'JACINDAS', 'CIA', 'LABOR', 'TREASURY', 'SMD', 'WHO', 'SENATE', 'LIBERALS', 'LIBERAL', 'IIRC', 'COVID', 'HS', 'PRC', 'NATIONAL', 'TIL', 'SHITREDDITSAYS', 'COM', 'FOX', 'EZZANZ', 'QLD', 'FAMILY FIRST', 'NATIONALS', 'NIN', 'DEFENCE FORCE', 'ZZAN', 'ACINDA', 'FOX NEWS', 'LABOUR', 'FEDERAL', 'HOUSE OF REPS', 'WORLDNEWS', 'MURDOCH', 'GREENS'],
'Person Name': ['KEVIN', 'FATHE', 'KAREN', 'MACRON', 'WINSTON', 'LES', 'BUCKLEY', 'CHLÖE SWARBRICK', 'COLLINS', 'CLINTON', 'JUDITH COLLINS', 'TO', 'KYLER', 'ASHLEY', 'BILL GATES', 'THE P', 'SCOTTY', 'HITLER', 'TRUMP', 'RUPERT MURDOCH', 'GATES', 'HGO', 'WILLIAM CASEY', 'OAK', 'TOVA', 'JIM JONES', 'KEZZA', 'ENN', 'MERICA', 'ROF', 'BLOOMFIELD', 'GOD', 'KIF', 'CLIVE PALMER', 'DAVE GROHL', 'SHER', 'BLAIR', 'JACINDA ARDERN', 'DAD', 'JACINDA', 'WINS TON PETERS', 'LERON', 'BLOOMFIELDS', 'MURDOCH']}

Here are my observations.

1. New Zealand got many mentions in the comments. This location name is also mentioned in the title. Reddit users may prefer the short form of a country name to the full form.

For example, the short forms of countries such as the United States, New Zealand, and the United Kingdom are the US, NZ, and the UK. These aliases can be folded together before counting mentions (see the sketch after this list).

2. Users mention a country name when they know about that country or belong to it. So, we can say that most users who commented on this post are from NZ, the US, Australia, or the UK.

3. Jacinda Ardern is the prime minister of NZ, which explains the mention of her name in many of the comments. As the topic's sentiment is positive, I would expect the comments mentioning Jacinda Ardern to be positive as well.

4. We can also see the names of Trump (President of the US) and Bill Gates (founder of Microsoft). But the sentiments of comments that mention these names are not conclusive. You can analyze those comments separately.

5. Jacinda Ardern belongs to the ruling Labour party, and the opposition is the National party. Both the Labour and National organization names are present in the comments.

6. You can also see mentions of COVID and WHO. The mention of Facebook in the organization tag is inconclusive unless you look at the comments that mention it.
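As noted in the first observation, short forms and full names refer to the same place. A hypothetical alias table (my own, not from the article) can fold them together before counting:

# Hypothetical alias table mapping short forms to one canonical name.
ALIASES = {"US": "UNITED STATES", "USA": "UNITED STATES",
           "AMERICA": "UNITED STATES", "STATES": "UNITED STATES",
           "NZ": "NEW ZEALAND", "UK": "UNITED KINGDOM",
           "AUS": "AUSTRALIA"}

locations = [ALIASES.get(loc, loc) for loc in entities["Location"]]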

With these entities, you can infer what type of data this is. You can categorize this data under world news. These tags will help you filter reading materials.

The same topic may not appear if you run the same Python program later, so observations and tags may vary.

Here is a bar graph of the locations extracted from the post.
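The graph image is not reproduced in this text version. A minimal matplotlib sketch that draws an equivalent bar graph from the location counts (reusing the locations list from the alias sketch above):

from collections import Counter
import matplotlib.pyplot as plt

# Count each location's mentions and plot the ten most common.
names, counts = zip(*Counter(locations).most_common(10))

plt.figure(figsize=(10, 5))
plt.bar(names, counts)
plt.xticks(rotation=45, ha="right")
plt.ylabel("Mentions")
plt.title("Locations mentioned in the r/worldnews post")
plt.tight_layout()
plt.show()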