Collection and Classification of Lyrics


A web crawler that collects song lyrics together with their music genres. Several baseline classifiers, including KNN, Naive Bayes, SVM, and LSTM, are then applied to identify the genre of a song from its lyrics.

USC course CSCI544: Applied Natural Language Processing


Lyrics Collection

We start from a list of artists. For each artist, we query the iTunes Search API for metadata, including song name, genre, album/collection name, release year, etc. A web crawler then fetches the lyrics from Genius, using the artist names and song names. Finally, the collected lyrics are cleaned up with some basic processing: removing lyrics that are too short, deduplication, etc.
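The metadata and cleanup steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual crawler: the function names, the `min_words` threshold, and the choice of whitespace/case normalization for deduplication are all assumptions, and the Genius crawling step is left out because the page structure it relies on is not described here. The response fields (`artistName`, `trackName`, `primaryGenreName`, `collectionName`, `releaseDate`) are those returned by the iTunes Search API for songs.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ITUNES_SEARCH_URL = "https://itunes.apple.com/search"


def build_search_url(artist, limit=200):
    """Build an iTunes Search API URL asking for songs that match an artist name."""
    params = {"term": artist, "media": "music", "entity": "song", "limit": limit}
    return ITUNES_SEARCH_URL + "?" + urlencode(params)


def parse_tracks(payload):
    """Keep only the metadata fields we care about from a decoded JSON response."""
    tracks = []
    for item in payload.get("results", []):
        tracks.append({
            "artist": item.get("artistName"),
            "song": item.get("trackName"),
            "genre": item.get("primaryGenreName"),
            "album": item.get("collectionName"),
            # releaseDate is an ISO timestamp; the first four characters are the year.
            "year": (item.get("releaseDate") or "")[:4],
        })
    return tracks


def fetch_tracks(artist):
    """Query the API over the network and return the parsed track metadata."""
    with urlopen(build_search_url(artist)) as response:
        return parse_tracks(json.load(response))


def clean_lyrics(records, min_words=50):
    """Drop too-short lyrics and duplicates (hypothetical threshold and dedup rule)."""
    seen = set()
    cleaned = []
    for record in records:
        words = record["lyrics"].split()
        key = " ".join(words).lower()  # normalize whitespace and case before comparing
        if len(words) < min_words or key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned
```

Each record passed to `clean_lyrics` is assumed to be a dict with a `"lyrics"` string; the first copy of a duplicated lyric is kept and later copies are dropped.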


We applied the following classifiers and compared their performance:

  • KNN, with TF-IDF as features
  • Naive Bayes, with TF-IDF as features
  • SVM, with TF-IDF as features
  • LSTM, with GloVe word embeddings as features
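To make the "TF-IDF as features" idea concrete, here is a small standard-library sketch of TF-IDF vectors combined with a KNN vote over cosine similarity. It is only an illustration under stated assumptions, not the code used in this project: the smoothed IDF formula, the helper names, the value of `k`, and the four toy training texts are all made up for the example.

```python
import math
from collections import Counter


def fit_idf(tokenized_docs):
    """Smoothed inverse document frequency for every term in the training set."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """L2-normalized sparse TF-IDF vector; terms unseen in training are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented purely for illustration.
train_texts = [
    "guitar riff loud drums",
    "loud guitar solo stage",
    "beat rhyme flow street",
    "rhyme flow mic beat",
]
train_labels = ["Rock", "Rock", "Hip-Hop", "Hip-Hop"]

tokenized = [text.split() for text in train_texts]
idf = fit_idf(tokenized)
train_vectors = [tfidf_vector(doc, idf) for doc in tokenized]

query = tfidf_vector("loud guitar drums".split(), idf)
print(knn_predict(query, train_vectors, train_labels))
```

The same TF-IDF vectors could be fed to a Naive Bayes or SVM classifier instead of the KNN vote; only the final prediction step changes.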

For more details, take a look at our report.


Lyrics Collection: We obtained a lyrics-genre dataset of 30,649 lyrics across 8 genres.

Classifiers: We draw a balanced set of 390 lyrics per genre and hold out a balanced subset of it as the test set; the classifiers are trained on the rest. Accuracy on the held-out test set is shown below.
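A balanced sample with a balanced hold-out set could be drawn along these lines. This is a hedged sketch rather than the project's own split: the function names are hypothetical, and `test_per_genre=39` (10% of each genre) is an assumption, since the actual test-set size is not stated here.

```python
import random
from collections import defaultdict


def balanced_split(records, per_genre=390, test_per_genre=39, seed=0):
    """Sample `per_genre` records from each genre, holding out `test_per_genre` of them.

    `test_per_genre` is an assumed value, not the size used in the experiments.
    """
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for record in records:
        by_genre[record["genre"]].append(record)

    train, test = [], []
    for genre in sorted(by_genre):
        items = by_genre[genre]
        if len(items) < per_genre:
            raise ValueError(f"genre {genre!r} has only {len(items)} lyrics")
        sample = rng.sample(items, per_genre)  # sampling without replacement
        test.extend(sample[:test_per_genre])
        train.extend(sample[test_per_genre:])
    return train, test


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

Because every genre contributes the same number of training and test examples, a classifier that always guesses a single genre would score 1/8 = 0.125 on 8 genres, which gives a useful floor when reading the accuracies.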

Model         Accuracy
KNN           0.426
Naive Bayes   0.597
SVM           0.588
LSTM          0.563

Interestingly, the Naive Bayes classifier works quite well and outperforms all the others. That said, the LSTM model has several hyperparameters (e.g. hidden layer dimension, learning rate) that were not adequately tuned because of limited time and computational resources. It would be interesting to see whether thorough hyperparameter tuning yields a further performance boost.