AppStore scraping – the front door method
As I found out last week, the browse method of scraping the iTunes AppStore is limited by the fact that a maximum of 2500 apps are listed per category, or per sub-category in the case of games. And given that many apps are listed under multiple categories, it should be apparent that that method will not expose all the apps in the store. Therefore a different approach is required. The method I chose mimics selecting a category on the main AppStore front page in iTunes and then clicking ‘Next page’ until the last page is reached. This gives you twenty apps per page, so there are a lot of pages to click. For example, there are currently 301 pages of Books apps. In practice this doesn’t seem to be too slow, although it takes a lot more pages than the browse method.
So to get to the top page per category, the URL is:
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGenre?id=id
where id is the category id (e.g. 6018 for books). This returns the first page with the first 20 apps. Unfortunately it also includes a lot of extra bumph – mainly the top paid apps and top free apps sidebars. The first thing we need to find is the URL of the next page – it is hidden in a bunch of horrible-looking XML, something like this:
<HBoxView topInset="5" bottomInset="10" leftInset="0">
<TextView topInset="0" truncation="right" leftInset="0" stretchiness="0" styleSet="normal11Align" textJust="left" maxLines="1">
<SetFontStyle normalStyle="matrixTextFontStyle">
<B>Page 1 of 301</B>
</SetFontStyle>
</TextView>
<VBoxView alt="">
<View alt="" stretchiness="1"/>
<GotoURL target="main" url="http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGenre?sortMode=2&id=6018&batchNumber=1">
<PictureButtonView leftInset="3" width="12" topInset="1" picts="plain,pressed,rollover" transparentClicks="1" alt="next page" url="/images/arrowoutline/arrow_000000_r.png" height="12"/>
</GotoURL>
<View alt="" stretchiness="1"/>
</VBoxView>
</HBoxView>
Got that? The next pages are the same as the root URL with a batchNumber parameter (0-indexed, apparently). So you could just increment the index until you hit an error, or you could use an xpath query to read the next page URL from each page. I thought the latter approach was less messy, so I did it that way. The xpath I used is:
/*[name()='Document']/*[name()='View']/*[name()='ScrollView']/*[name()='VBoxView']/*[name()='View']/*[name()='MatrixView']/*[name()='VBoxView']/*[name()='MatrixView']/*[name()='VBoxView']/*[name()='VBoxView']/*[name()='View']/*[name()='VBoxView']/*[name()='VBoxView']/*[name()='VBoxView']/*[name()='VBoxView']/*[name()='HBoxView']/*[name()='VBoxView']/*[name()='HBoxView']/*[name()='VBoxView']/*[name()='GotoURL']
which returns the GotoURL node, with the URL in its “url” attribute and the text “next page” in the alt attribute of the button inside it. You have to check that alt because on all pages but the first there is a similar node containing the link to the previous page before the link to the next one.
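For anyone who wants to try this in code, here is a minimal Python sketch of the alt check. It is not the exact query above: the standard library's ElementTree does not evaluate the name() predicates, so this version simply walks the whole tree and matches on the local tag name, which is looser than an absolute path. The function names and the trimmed sample XML are my own for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_next_page_url(xml_text):
    """Return the url of the GotoURL node wrapping the 'next page' button, or None."""
    root = ET.fromstring(xml_text)
    for node in root.iter():
        if local_name(node.tag) != "GotoURL":
            continue
        # The alt text sits on the button inside the link, so check the children.
        if any(child.get("alt") == "next page" for child in node):
            return node.get("url")
    return None


# Trimmed-down version of the snippet shown earlier (ampersands escaped so it parses).
SAMPLE = """<VBoxView alt="">
  <GotoURL target="main" url="http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGenre?sortMode=2&amp;id=6018&amp;batchNumber=1">
    <PictureButtonView alt="next page" url="/images/arrowoutline/arrow_000000_r.png"/>
  </GotoURL>
</VBoxView>"""

print(find_next_page_url(SAMPLE))
```

Because it ignores the path, this would also pick up a “next page” button anywhere else in the document, so treat it as a starting point rather than a drop-in replacement for the full xpath.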
To read the apps on the page, the xpath
/*[name()='Document']/*[name()='View']/*[name()='ScrollView']/*[name()='VBoxView']/*[name()='View']/*[name()='MatrixView']/*[name()='VBoxView']/*[name()='MatrixView']/*[name()='VBoxView']/*[name()='VBoxView']/*[name()='View']/*[name()='VBoxView']/*[name()='VBoxView']/*[name()='VBoxView']/*[name()='MatrixView']/*[name()='HBoxView']/*[name()='VBoxView']/*[name()='MatrixView']
returns 20 nodes which contain links to the product detail pages, which can be read as before.
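The overall crawl is then just a loop: fetch a page, read it, follow the next page link, stop when there isn't one. Below is a rough Python sketch of that loop. The names genre_url and iter_pages are hypothetical, and fetch and find_next are stand-ins you would supply yourself (an HTTP download and a next-link lookup). The batch_number argument covers the "just increment the index" alternative; I am assuming the id and batchNumber parameters alone are enough, which I haven't verified.

```python
from urllib.parse import urlencode

BASE = "http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGenre"


def genre_url(genre_id, batch_number=None):
    """Build a category URL; batch_number is the 0-indexed page (assumed optional)."""
    params = {"id": genre_id}
    if batch_number is not None:
        params["batchNumber"] = batch_number
    return BASE + "?" + urlencode(params)


def iter_pages(first_url, fetch, find_next, max_pages=1000):
    """Yield each page in turn, following next-page links until none is found.

    fetch(url) returns a page; find_next(page) returns the next URL or None.
    Already-seen URLs and max_pages guard against looping forever.
    """
    url = first_url
    seen = set()
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch(url)
        yield page
        url = find_next(page)


# Dry run against two fake pages instead of the real store.
FAKE = {
    genre_url(6018): ("page 1", genre_url(6018, 1)),
    genre_url(6018, 1): ("page 2", None),
}
pages = list(iter_pages(genre_url(6018), fetch=lambda u: FAKE[u], find_next=lambda p: p[1]))
print(len(pages))
```

Passing the fetch and lookup functions in keeps the loop testable without touching the network, and makes it easy to add a delay between requests.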
And that’s how to read all the apps on the AppStore. The big drawback of this method is that you don’t get category information – if you need that you’re going to have to traverse the browse links as well and create separate links to the apps you found on the front page browse.
Enjoy!