Friday, 9 September 2016

Analyzing Pitchfork Reviews with Data Science


Pitchfork.com has long been the web’s premier site for indie music news and analysis. Their album reviews are famous for their overt detail, astute prose, and cutting wit, not to mention their 0.0 to 10.0 rating scale.

They are often credited for the popularity of indie music in the 00s and 10s and for “breaking” bands such as Animal Collective, Bon Iver, and Grizzly Bear. Whenever they assign an album a perfect 10.0 rating, it sends shockwaves throughout the online music community. They are the gold standard of music criticism in the internet age, and writing for them is the pinnacle of a career in music writing.

I currently possess a clean dataset of around 17,000 album reviews from November 1999 to June 2016 that includes the following features: date of review, author of review, genre, artist, label, and the review text. I gathered the data through www.import.io, an online tool that allows users to easily scrape data from websites.

This project analyzing Pitchfork album reviews will be split into two parts:

1. Data Analysis: This first part will focus on charts and tables showing things like a histogram of album review scores, the average score by genre, label, and artist, and which writers give out the highest and lowest scores.

In addition, I will be using the Python library TextStat to “grade” the reviews. TextStat applies readability formulas to measure the complexity of a piece of text. I’ll use this library to see which writers are the hardest and easiest to read and to chart the changes in Pitchfork’s writing over the years.
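To give a rough idea of what this first part involves, here is a minimal sketch using only the Python standard library. It is not the actual analysis code: the field names ("genre", "author", "score", "text") and the sample reviews are made up for illustration, and the readability function is a hand-rolled approximation of the Flesch Reading Ease formula with a crude syllable counter, not TextStat's own implementation.

```python
import re
from collections import defaultdict
from statistics import mean


def mean_score_by(reviews, key):
    """Average the 'score' field of the reviews, grouped by `key`."""
    groups = defaultdict(list)
    for review in reviews:
        groups[review[key]].append(review["score"])
    return {name: round(mean(scores), 2) for name, scores in groups.items()}


def count_syllables(word):
    """Crude estimate: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Approximate Flesch Reading Ease; higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


# Hypothetical sample rows, standing in for the real dataset.
sample_reviews = [
    {"genre": "Rock", "author": "Writer A", "score": 9.0,
     "text": "The cat sat. The dog ran."},
    {"genre": "Rock", "author": "Writer B", "score": 7.0,
     "text": "Unquestionably idiosyncratic instrumentation permeates everything."},
    {"genre": "Electronic", "author": "Writer A", "score": 6.5,
     "text": "It is a good record. It is fun."},
]

scores_by_genre = mean_score_by(sample_reviews, "genre")
readability = [flesch_reading_ease(r["text"]) for r in sample_reviews]
```

The same `mean_score_by` helper works for labels, artists, or writers by changing the `key` argument. For the real project, a library such as TextStat would replace the hand-rolled readability function.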

2. Machine Learning / NLP:

Using the Natural Language Toolkit (NLTK), I’ll try to predict album review scores with a variety of NLP tools such as a count vectorizer, a TF-IDF matrix, PCA, and more.

Discover the most commonly used words, bigrams, and trigrams by year, genre, artist, and writer.

Use unsupervised learning to cluster the albums into groups.
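As an illustration of the n-gram step, the sketch below counts the most common bigrams and trigrams across a handful of texts using only the standard library. The tokenizer is deliberately simple (lowercase letters and apostrophes only) and the sample sentences are invented. A real run would likely use NLTK or a similar library and would filter out stop words first.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(texts, n, k=3):
    """Count n-grams within each text and return the k most common."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(tokenize(text), n))
    return counts.most_common(k)


# Invented sample sentences, standing in for review text.
sample_texts = [
    "A lush wall of sound",
    "Another wall of sound record",
]

top_bigrams = top_ngrams(sample_texts, 2, k=2)
top_trigrams = top_ngrams(sample_texts, 3, k=1)
```

Grouping the input texts by year, genre, artist, or writer before calling `top_ngrams` would give the per-group rankings described above.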

The first part of the series will be turned in on Friday, September 9, and the second part will be due Monday, September 26. There’s a chance I may split up the second part because of how extensive it is. For example, I could make the bigrams/trigrams and the clustering their own articles.

Overall, I think this will be an exciting NLP project, and it has a good chance of attracting a lot of attention outside of the data community.

Keep an eye on opendatascience.com for these coming articles.
