Ever wondered which of your friends swears the most in texts?
While thinking about what to do for my next project, an idea popped into my head. Why not analyze a WhatsApp group chat of college students and dig out insights such as the most used emoji, the sentiment score of each person, who swears the most, the most active times of the day, and whether the group uses phones during college teaching hours? These would be interesting insights for sure, more for me than for you, since the people in this chat are people I know personally.
Note: To maintain anonymity, I represent each of my friends with two letters, an abbreviation of their name.
The first step was to gather the data. WhatsApp lets you export a chat as a .txt file. Opening this file up, you get messages in a format that looks like this:
Since WhatsApp texts can span multiple lines, you cannot just read the file line by line and treat each line as a message. Instead, you need a way to tell whether a line starts a new message or continues the previous one. You could do this using regular expressions, but I went with a simpler method. I created a function called vali_date(), which returns True if the argument passed is a valid date.
While reading each line, I split it on a comma and take the first item returned by the split() function. If the line starts a new message, the first item is a valid date, and the line is appended to the list of messages as a new message. If it's not, the line is part of the previous message, so it is appended to the end of that message to form one continuous message.
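The grouping step described above can be sketched roughly as follows. This is a minimal sketch, not the exact code from my project: the date format "%d/%m/%y" is an assumption (WhatsApp exports differ depending on the phone's locale), and group_messages() is a hypothetical helper name used only for illustration.

```python
from datetime import datetime

# Assumed date format; adjust it to match your own export.
DATE_FORMAT = "%d/%m/%y"


def vali_date(text, date_format=DATE_FORMAT):
    """Return True if `text` parses as a date in the assumed format."""
    try:
        datetime.strptime(text.strip(), date_format)
        return True
    except ValueError:
        return False


def group_messages(lines):
    """Merge raw lines into whole messages (hypothetical helper)."""
    messages = []
    for line in lines:
        line = line.rstrip("\n")
        first_item = line.split(",")[0]
        if vali_date(first_item):
            # The line starts with a date, so it begins a new message.
            messages.append(line)
        elif messages:
            # Otherwise it continues the previous message.
            messages[-1] += " " + line
    return messages
```

Since a file object yields its lines one by one, you can pass an open file straight to group_messages(). Lines that appear before the first dated line are dropped, because there is no earlier message to attach them to.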
From here it's just a matter of extracting the necessary details from each message: the date sent, the time sent, the sender, and the message itself. Some simple string processing functions, mainly split(), are enough for this, and the results go into a DataFrame. If you're using the same method, make sure to specify the date and time formats correctly, since they depend on how your phone formats dates.
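As a rough illustration of the extraction step, here is a minimal sketch that assumes each message looks like "date, time - sender: text". That layout, the two format strings, and the names parse_message() and parse_messages() are assumptions for the example, so check them against your own export. It uses str.partition(), a close relative of split() that cuts only at the first separator, so commas and colons inside the message text are left alone.

```python
from datetime import datetime

# Assumed formats; these must match your own export exactly.
DATE_FORMAT = "%d/%m/%y"
TIME_FORMAT = "%I:%M %p"


def parse_message(message):
    """Split one message into date, time, sender and text.

    Returns None for lines without a sender, such as system notices.
    """
    date_part, _, rest = message.partition(", ")
    time_part, _, rest = rest.partition(" - ")
    sender, separator, text = rest.partition(": ")
    if not separator:
        return None
    sent_at = datetime.strptime(
        f"{date_part} {time_part}", f"{DATE_FORMAT} {TIME_FORMAT}"
    )
    return {
        "date": sent_at.date(),
        "time": sent_at.time(),
        "sender": sender,
        "message": text,
    }


def parse_messages(messages):
    """Parse every message and drop the ones without a sender."""
    records = (parse_message(m) for m in messages)
    return [r for r in records if r is not None]
```

The list of dictionaries returned by parse_messages() can be passed directly to pandas.DataFrame() to get one row per message. If the formats are wrong, strptime() raises a ValueError, which is a useful early warning that the date and time formats need adjusting.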