What can we find in lyrics? Can we see history? With this question in mind, we carried out a study on the trend of words used in lyrics, and tried to find relations between the resulting words and history.
To start the research, we need the proper data. In this study, we collected the full billboard THE HOT 100 charts as a basis, and got all available lyrics using the Musixmatch API.
The HOT 100 chart data spans from Aug 09, 1958 to July 25, 2015, and the chart is released every week.
Here are some statistics:
In this study, we analyze trending words at 3 levels: week, year and decade. This gives us the most important words for every week, every year and every decade.
If you're interested in how the data is processed, please see the Technique Details section below.
These are the top words in different time periods. Click on a circle to see the top words on the right.
To get a better idea, you can fill in specific words, and get the total number of times they appear in each year.
Some top words in the 21st century were not that popular in the good old days.
What other findings do you get? Please leave us a message!
As we mentioned, we first got a list of songs that have been on the billboard THE HOT 100 chart, and collected their lyrics using Musixmatch.
The following chart shows statistics on the values returned by the Musixmatch API.
We can only analyze songs with English lyrics, which are shown in the top chart.
Unfortunately, we couldn't get lyrics for all the songs, because of authorization problems, foreign-language lyrics, and songs not covered by the database (we noticed that cover songs, such as those sung by Glee or Pitch Perfect, are unlikely to be found). Most of the not-covered-by-database cases occurred before 2000, so we at least get better results for the years after 2000.
It is natural to weight songs by their rank, so that higher-ranked songs carry more weight and their lyrics become more important. That's what we do: we collect all songs in one time period, and weight the words in their lyrics by the highest rank each song ever reached.
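As a rough illustration of this rank weighting, here is a minimal Python sketch. The exact weighting function is not given above, so the linear weight (rank 1 gets 100, rank 100 gets 1), the whitespace tokenizer, and the names rank_weight and weighted_word_counts are all assumptions made for the example.

```python
from collections import Counter


def rank_weight(peak_rank, chart_size=100):
    """Hypothetical linear weight: peak rank 1 -> 100, peak rank 100 -> 1."""
    return chart_size + 1 - peak_rank


def weighted_word_counts(songs):
    """Sum rank-weighted word counts over all songs in one time period.

    songs: iterable of (lyrics, peak_rank) pairs, where peak_rank is the
    highest chart position the song ever reached (1 is best).
    """
    counts = Counter()
    for lyrics, peak_rank in songs:
        weight = rank_weight(peak_rank)
        # Naive tokenization for illustration only: lowercase and split on whitespace.
        for word in lyrics.lower().split():
            counts[word] += weight
    return counts


# Toy data, not real chart entries.
period = [("love me do", 1), ("love is blue", 51)]
counts = weighted_word_counts(period)
print(counts.most_common(3))
```

With this assumed weight, a word in a song that peaked at number 1 counts twice as much as the same word in a song that peaked at number 51.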
We used the numerical statistic called tf-idf (term frequency-inverse document frequency). It measures how important a word is to a document, relative to the full collection of documents.
In our case, to get the top words in a week (such as the week starting 1958-Aug-09), we treat the lyrics of that week's songs as one document, and all lyrics in that year's charts as the collection.
To get the top words in a year, the lyrics of that year are the document, and the lyrics of the surrounding decade are the collection.
Further, to get the top words in a decade, the lyrics of that decade are treated as one document, and all 58 years' lyrics are the collection.
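The document/collection setup described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses one common tf-idf variant (relative term frequency times log(N / df)), treats the collection as a list of documents that includes the target document, and the function names tf_idf and top_words are hypothetical. The variant and preprocessing actually used in the study may differ.

```python
import math
from collections import Counter


def tf_idf(document, collection):
    """Score each word of `document` against `collection`.

    document: list of words (e.g. all lyrics of one week).
    collection: list of documents (e.g. every week of that year);
    it is assumed to contain `document`, so every df is at least 1.
    """
    tf = Counter(document)
    doc_sets = [set(doc) for doc in collection]
    n_docs = len(collection)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for words in doc_sets if word in words)  # document frequency
        idf = math.log(n_docs / df)
        scores[word] = (count / len(document)) * idf
    return scores


def top_words(document, collection, k=3):
    """Return the k highest-scoring words of `document`."""
    scores = tf_idf(document, collection)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy "weeks" of one year, not real lyrics.
weeks = [
    "love love baby twist".split(),
    "love baby heart".split(),
    "love summer heart".split(),
]
scores = tf_idf(weeks[0], weeks)
print(top_words(weeks[0], weeks, k=1))
```

In this toy data, "love" appears in every week, so its idf is log(1) = 0 and it scores zero no matter how often it occurs, while "twist", which appears in only one week, ranks first. The same function applies at the year and decade levels by swapping in the corresponding document and collection.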
To our surprise, some words that are closely related to history, like 'war' and 'peace', do not appear in the final top words. That's because these words are so commonly used in lyrics that our algorithm doesn't see them as unique or important.
One possible remedy is to reduce the count of a word in a document to 0 when that count is less than a tenth of the word's maximum occurrence across all documents. The word then appears in fewer documents, so its idf will increase and its tf-idf will increase with it. Your opinions on better ways to select top words are welcome.
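To make the proposed thresholding concrete, here is a small Python sketch. It is an untested idea rather than something used in the study; the function name suppress_rare_occurrences and the toy counts are hypothetical, and ratio=0.1 encodes the "tenth of the maximum" rule.

```python
from collections import Counter


def suppress_rare_occurrences(doc_counts, ratio=0.1):
    """Drop a word from a document when its count there is below
    `ratio` times the word's maximum count over all documents.

    doc_counts: list of Counter objects, one per document.
    Returns new Counters; the input is left unchanged.
    """
    max_count = Counter()
    for counts in doc_counts:
        for word, n in counts.items():
            max_count[word] = max(max_count[word], n)
    return [
        Counter({w: n for w, n in counts.items() if n >= ratio * max_count[w]})
        for counts in doc_counts
    ]


# Toy counts, not real data: 'war' is frequent in one document and rare in the others.
docs = [
    Counter(war=50, love=30),
    Counter(war=3, love=25),
    Counter(war=2, love=40),
]
filtered = suppress_rare_occurrences(docs)

# Number of documents containing 'war' before and after filtering.
df_before = sum(1 for c in docs if c["war"] > 0)
df_after = sum(1 for c in filtered if c["war"] > 0)
print(df_before, df_after)
```

Here 'war' drops from three documents to one, which raises its idf in the document where it is actually prominent, while 'love' is untouched because its counts are similar everywhere.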