Exago BI vs Public Data: Video Game Remixes
Oct 19, 2017
As someone who has spent the last five years helping to build a broadly-scoped and highly-customizable Business Intelligence product, it’s amazing how easy it is to forget to be a user. As software developers, too often do we focus on narrow silos of the product in order to fix a bug or improve a part of the product without slowing down enough to actually put the product through its paces as a real user would. I’m not the best report builder at Exago, but I love when I get an exciting idea for a “real-world” use-case of our product as a change of pace from the myriad contrived examples we use every day for development and testing. So it was just before our annual Client Advisory Board conference, and I needed to come up with an example to use when demoing the new “code completion” feature in one of our report designers.
For some reason—maybe I was browsing the site; I don’t quite remember—I thought of Ocremix.org, an online website dedicated to video game music arrangements produced by game music fans around the world. This site has its archive of music and artists, called “remixers,” available on public-facing web pages. Surely there was some data here ripe for visualizing!
The Data Model
After spending some time on the website, I started to see a simple relational data model. Remixers create one or more Remixes. This implied that I would want a Remixers table, a Remixes table, and a join table to associate remixes their respective remixers—not rocket science, but enough to get my feet wet.
So I had an idea of what data I wanted, and next was to write a script to gather it from Ocremix.org. The website presents basically all its public data in a convenient tabular format, so writing a scraper in Python was straightforward. The three models I wanted were represented by three different pages:
Using Beautiful Soup, I wrote several Python scripts to scrape the relevant info from each page and dump to CSV files that I could import into ExagoBI. Some of the information was gathered from paginated tables on the website. One such example was the information about each remixer. In this case the scraping algorithm was as such:
1. Load the page that contains the paginated alphabetical list of artists
2. For each table row
a. Get the link to the artist’s profile page
b. Download the profile page and, using Beautiful Soup, examine the DOM to get the desired info about the artist.
3. Request the next page using a programmatically-generated URL
4. If more results are returned, repeat from step 2
One of the downsides to this method is that it is slow. While there are only about 2,300 remixers on the site at the time of this writing, there is on the order of 14,000 remixes. Since each page in the list of remixes only displays 50 results, that means my scraper had to do almost 300 page loads just to get the list of remixes. Since I also wanted bio information on each remixer, that also meant doing one page load for each of the 2,300 remixers in the system. This is just one of the downsides to using a web scraper, but sometimes it is necessary to resort to these somewhat “uglier” means when we aren’t provided with nice, clean APIs.
After pulling down all the data, I popped into ExagoBI and made a report, starting with the ExpressView designer then moving into the Advanced Report Designer when I wanted more specific control. Using Top/Bottom filtering, I was able to find the top ten games with the most remixed songs on the site. I also did some simple demographic reporting using available information about remixer age and gender. The youngest published remixer was fourteen!
To learn more about using Exago BI, check out our Training and Support Lab videos. Have you done any interesting data scrapes? Tell us what you learned in the comments!