iTunes AppStore scraping – decoding the browse URL

Further to my recent posts on scraping the iTunes AppStore, I have made some progress towards decoding the browse URL that returns the list of apps by category. There is a slight wrinkle with categories that have sub-categories (currently only games) and a potential work-around to the 3500-per-page limit.

The browse URL breaks down to this:

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/category/subcategory/page

The top-level browse URL, i.e.

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse

on its own gives a list of top-level categories and their associated ids, e.g. TV shows is 32, Music videos is 31, Music is 34 and AppStore is 36.

So to browse a category from the root, you append the query string path=/id to the URL. For example, the AppStore URL is

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36

which returns a list of AppStore categories and their ids – Weather = 6001, Travel = 6003, Games = 6014, etc.
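
As a minimal sketch of the pattern so far, the URL can be assembled from a sequence of path segments. This assumes Python; the helper name browse_url is hypothetical and not part of anything Apple provides.

```python
BASE_URL = "http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse"

def browse_url(*segments):
    """Return the browse URL for the given path segments (ids, then page)."""
    if not segments:
        # No path at all: the list of top-level categories.
        return BASE_URL
    return BASE_URL + "?path=/" + "/".join(str(s) for s in segments)

print(browse_url())    # top-level categories
print(browse_url(36))  # AppStore categories
```

Any further segments (a category id, a trailing page number) are simply passed as extra arguments.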

Then, to browse all weather apps the URL is

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36/6001/1

where the final 1 seems to be a paging control, so where there are more than 3500 apps you can increment the last number to retrieve the next set of app details.
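
Since the paging behaviour is only a guess, a cautious loop is worth sketching. The version below assumes Python and a hypothetical caller-supplied function fetch_page(n) that downloads page n and returns the app ids on it. It stops as soon as a page adds nothing new, so it terminates whether or not the server really honours the page number.

```python
def fetch_all_pages(fetch_page, max_pages=50):
    """Collect app ids page by page until a page contributes nothing new.

    fetch_page(n) is a hypothetical, caller-supplied function returning
    the list of app ids found on page n. max_pages is a safety cap.
    """
    seen = set()
    ordered = []
    for page in range(1, max_pages + 1):
        added = 0
        for app_id in fetch_page(page):
            if app_id not in seen:
                seen.add(app_id)
                ordered.append(app_id)
                added += 1
        if added == 0:
            # Empty page, or a repeat of what we already have: stop.
            break
    return ordered
```

Counting the ids that are actually new, rather than the ids returned, is the design choice here: it catches the case where every page comes back identical.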

Where there are subcategories, they can be accessed by replacing the top level id with that of the category – so to browse all games subcategories the URL is

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/6014

which returns the names and ids of the games subcategories (Action = 7001, Adventure = 7002, and so on). Then to browse the action games the URL becomes

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/6014/7001/1

It looks to me as though, if the tree is traversed from the root until the list of subcategories comes back empty, and the leaf node is then used to retrieve the apps, there is currently no need for paging with a value greater than 1. This is also the only method I can see for determining which subcategory an app is listed under: the apps themselves link to the category and a genre but not a subcategory. I also don’t know right now whether this will produce multiple instances of the same app, i.e. whether an app can appear under multiple subcategories.
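
The traversal described above can be sketched as a small recursive walk. This assumes Python, and both callbacks are hypothetical: list_children(node) stands for fetching path=/node and returning the child ids (empty at a leaf), and list_apps(parent, node) stands for fetching path=/parent/node/1 and returning the app ids. Recording every leaf an app is found under also answers the duplicates question for whatever data comes back.

```python
def collect_apps(node, parent, list_children, list_apps, found=None):
    """Walk down from `node` and map each app id to the leaves listing it.

    list_children(node)     -> child ids of node (empty list at a leaf)
    list_apps(parent, node) -> app ids listed under the leaf `node`
    Both callbacks are hypothetical and supplied by the caller.
    """
    if found is None:
        found = {}
    children = list_children(node)
    if children:
        for child in children:
            collect_apps(child, node, list_children, list_apps, found)
    else:
        # Leaf reached: record which leaf each app was listed under.
        for app_id in list_apps(parent, node):
            found.setdefault(app_id, set()).add(node)
    return found
```

Starting the walk with collect_apps(36, None, ...) would cover the AppStore branch; any app whose set holds more than one id was listed under several leaves.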


6 Responses to “iTunes AppStore scraping – decoding the browse URL”

  1. Peter B says:

    Paul,
    I’m finding your iTunes appstore articles incredibly useful for a project I’m currently working on. Many thanks for sharing your findings in this area.
    Cheers, Peter.

  2. paul says:

    Hi Peter – glad they’re useful. I know there’s not much information out there so I wanted to record everything I found. I’ll continue to update as I find out more.

    Cheers.

    –Paul

  3. Tim says:

    I suppose these could produce duplicates. I haven’t gone through the XML files to check, but if you go to the browse function in iTunes (which is what we’re doing with curl, I suppose), go to Books and find “ABC Book”, the genre for this one is actually “Education”.
    If you now go to the “Education” category, you’ll find it there too.
    There is no reason for the ID to be different though, so it should be easy to check.

  4. paul says:

    Hi Tim – Yes I have confirmed that you do get duplicates, for just the reason you say – there are several apps that appear in multiple categories. They do have the same ID though so it’s no big deal. And if it’s important to track category as well as genre it’s the only way to find it because the apps don’t link back to all sub-categories (at least as far as I can see).

  5. Cyrillus says:

    Hi Paul,
    It seems that the page browsing does not work — here’s the output of my in-project AppStore crawler :
    Got category Business (#6000)
    Downloading page 1… done, got 2500 items
    Downloading page 2… done, got 2500 items (2500 duplicates)
    Downloading page 3… done, got 2500 items (2500 duplicates)
    Downloading page 4… done, got 2500 items (2500 duplicates)
    Downloading page 5… done, got 2500 items (2500 duplicates)

    And it goes on and on… Could it be that Apple disabled the page browsing ? (Also, the limit is set to 2500 and not 3500 as you say in the article)

  6. paul says:

    Hi Cyrille – yes, both your points are correct. The 3500 was a finger slip (sorry) and the paging either stopped working or never worked. It just looked so logical – why else is that ‘/1’ at the end there? There is a further posting about an alternative approach – see here. You’ll still need to use this method if you want to associate categories with an app as well as a genre, though, so it’s not a total loss. Good luck!
    –Paul
