Before we start looking at sentence embedding, let’s fire up some simple plots using Plotly.
I generally prefer Plotly visualizations over anything else. They look nicer, are interactive and you can easily curate and share them using Plotly Chart Studio. The only downside is that it can take quite a lot of code (and a lot of nested dictionaries!) to get something useful out of Plotly which is annoying when all you want to do is to quickly explore the data.
One thing I would recommend is updating Plotly to at least 4.8 as Pandas can now use Plotly as the library for backend plotting as opposed to the ‘aesthetically impaired’ Matplotlib.
In the code block below — we can fire up a simple bar chart from Pandas DataFrame just by adding
.plot.bar() to a DataFrame object.
Simple as 123,abc!
You can tweak the look and feel of your figure if you want by using the same parameters you’d normally use with a Plotly Graph Objects, so there is that added familiar flexibility.
Ok back to the project…
I decided to get lyrics from a range of genres, some more extreme than others in order to see if we can see differences but I also chose very overlapping genres to see if there are commonalities.
Let’s look at how many songs in each genre we have.
fig_genre = pd.DataFrame(df.genre.value_counts()).plot.bar(template='ggplot2')#title parameters
title_param = dict(text='<b>Count of Genre</b><br></b>',
xaxis = dict(title='Genre'),
yaxis = dict(title='Count'))#change the colour
fig_genre.update_traces(marker_color='rgb(148, 103, 189)')#show plot
Each genre is reasonably well represented — enough to make meaningful comparisons at least.
What about artists by genre?
Let’s go to town with some visualization and create a sunburst map of genre and artist. If you want to interact with the plot, the link to the original plot is in the footnote.
I’ve got a range of bands and artists who I feel represent those genres. You might disagree with me. That’s fine, genre definition can be contentious.
Let’s explore lyric length by genre.
#get lyric frequencies for each song
df['lyric_count'] = df['lyrics'].map(lambda x: len(x.split()))
Hip-hop is a very lyric heavy genre — many of the key descriptive metrics that describe lyric count for hip-hop are different from all other genres.
It will be interesting to see if the content of those lyrics are also different when we analyze the content of those lyrics. Which conveniently leads me to the next part of the EDA…Part of Speech (PoS) analysis.