A web crawler that collects lyrics and their corresponding music genres. Several baseline classifiers, including KNN, Naive Bayes, and LSTM, are applied to identify the genre of a song from its lyrics.
We start from a list of artists. We use the iTunes Search API to request metadata for each artist, including song name, genre, album/collection name, and release year. A web crawler then fetches the lyrics from Genius, using the artist and song names. Finally, the collected lyrics are cleaned with some basic processing, such as removing too-short lyrics and deduplication.
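The metadata request and the cleanup step could look roughly like the sketch below. It is an illustration, not the code in this repository: the function names, the `lyrics` record key, the `limit` value, and the 20-word minimum length are all assumptions, and the Genius crawling step is not shown. The response field names (`trackName`, `primaryGenreName`, `collectionName`, `releaseDate`) are those returned by the iTunes Search API for song results.

```python
import json
import urllib.parse
import urllib.request

ITUNES_SEARCH_URL = "https://itunes.apple.com/search"


def build_search_url(artist, limit=50):
    """Build an iTunes Search API URL asking for songs that match an artist name."""
    query = urllib.parse.urlencode(
        {"term": artist, "media": "music", "entity": "song", "limit": limit}
    )
    return f"{ITUNES_SEARCH_URL}?{query}"


def parse_songs(payload):
    """Keep only the metadata fields we care about from a decoded JSON response."""
    songs = []
    for item in payload.get("results", []):
        songs.append(
            {
                "artist": item.get("artistName"),
                "song": item.get("trackName"),
                "genre": item.get("primaryGenreName"),
                "album": item.get("collectionName"),
                # releaseDate is an ISO 8601 timestamp; its first four characters are the year.
                "year": (item.get("releaseDate") or "")[:4],
            }
        )
    return songs


def fetch_songs(artist, limit=50):
    """Send the request and parse the result (requires network access)."""
    with urllib.request.urlopen(build_search_url(artist, limit), timeout=10) as resp:
        return parse_songs(json.load(resp))


def clean_lyrics(records, min_words=20):
    """Drop too-short lyrics and duplicates (compared after case/whitespace normalization).

    min_words is an assumed threshold, not the value used for the dataset.
    """
    seen = set()
    cleaned = []
    for rec in records:
        words = (rec.get("lyrics") or "").split()
        if len(words) < min_words:
            continue
        key = " ".join(words).lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```

Calling `fetch_songs("Some Artist")` would return a list of dictionaries, one per song. Once lyrics have been attached to them under the `lyrics` key, the list can be passed to `clean_lyrics`.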
We applied the following classifiers to compare their performances:
For more details, take a look at our report.
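As an illustration of one of the baselines named above, here is a minimal bag-of-words multinomial Naive Bayes classifier with Laplace smoothing, written with the standard library only. This is a sketch under assumptions, not the implementation evaluated in the report: the class name, the whitespace tokenizer, and the smoothing constant `alpha=1.0` are all hypothetical choices.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesGenreClassifier:
    """Multinomial Naive Bayes over whitespace-separated, lowercased tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, texts, genres):
        genres = list(genres)
        self.word_counts = defaultdict(Counter)  # genre -> token -> count
        self.doc_counts = Counter(genres)        # genre -> number of lyrics
        self.vocab = set()
        for text, genre in zip(texts, genres):
            tokens = text.lower().split()
            self.word_counts[genre].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = sum(self.doc_counts.values())
        return self

    def predict(self, text):
        # Tokens never seen in training carry no information, so they are skipped.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        best_genre, best_score = None, -math.inf
        for genre, n_docs in self.doc_counts.items():
            counts = self.word_counts[genre]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / self.total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_genre, best_score = genre, score
        return best_genre
```

A typical use is `NaiveBayesGenreClassifier().fit(train_lyrics, train_genres).predict(some_lyrics)`, where the argument names are placeholders for lists of strings.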
Lyrics Collection: We obtained a lyrics-genre dataset with 30649 lyrics from 8 genres.
Classifiers: We train the classifiers on a balanced set of 390 lyrics per genre, from which we hold out a balanced subset as the test set. Accuracy on the held-out test set is presented below.
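One way to build such a balanced train/test split is sketched below. Only the figure of 390 lyrics per genre comes from the text; the size of the held-out portion (`test_per_genre=40`), the random seed, the `genre` record key, and the function name are assumptions made for illustration.

```python
import random
from collections import defaultdict


def balanced_split(records, per_genre=390, test_per_genre=40, seed=0):
    """Sample per_genre records from every genre, then hold out test_per_genre of them.

    Both returned lists contain the same number of records for each genre.
    """
    by_genre = defaultdict(list)
    for rec in records:
        by_genre[rec["genre"]].append(rec)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for genre in sorted(by_genre):
        items = by_genre[genre]
        if len(items) < per_genre:
            raise ValueError(f"genre {genre!r} has only {len(items)} lyrics")
        sample = rng.sample(items, per_genre)
        test.extend(sample[:test_per_genre])
        train.extend(sample[test_per_genre:])
    return train, test
```

Genres with fewer than `per_genre` lyrics raise an error rather than being silently under-sampled, so the split stays balanced by construction.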
Interestingly, the Naive Bayes classifier works quite well, outperforming all the others. That said, I have to acknowledge that the LSTM model has several hyperparameters (e.g. hidden layer dimension, learning rate) that need tuning, and due to limited time and computational resources this tuning was not adequately performed. It will be interesting to see whether thorough hyperparameter tuning yields a further performance boost.