Scraping the iTunes AppStore, Part 1
There have been quite a few sites set up recently that provide lists of apps on the Apple iPhone AppStore – apptism.com is probably the best known. Obviously they are accessing the AppStore and pulling the data down by masquerading as iTunes. I decided to find out how they are doing it.
To monitor the network connection between iTunes and the AppStore I needed a packet sniffer. After some googling I found that OS X ships with tcpdump, which does a good job of tracking network traffic. The command I used was:
sudo tcpdump -s 0 -A -i en0 port 80
which gave me enough details to see what was happening.
I then started up iTunes and went to the AppStore page, and I was able to see all the messages passing between the two. The AppStore feature in iTunes seems to work much like a web browser, although it doesn't use HTML: it is based on a proprietary, and quite complex, XML format.
There are two ways of finding apps in iTunes. One is to click a category and page through it, which is painfully slow, although it shows details of each app, including the price, rating and a description. The alternative is the browse-by-category list, which lists only the app name, price and genre. Since I plan on pulling down all the app details, I thought I'd start with the browse list and then drill down through the categories to the individual apps.
The URL iTunes uses for browsing is:
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36
This returns a list of categories, each paired with its ID. The meat of the XML response looks like this:
<key>infoURL</key><string>http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGenre?id=</string>
<key>items</key>
<array>
<dict>
<key>itemName</key><string>Books</string>
<key>itemId</key><integer>6018</integer>
</dict>
<dict>
<key>itemName</key><string>Business</string>
<key>itemId</key><integer>6000</integer>
</dict>
<dict>
<key>itemName</key><string>Education</string>
<key>itemId</key><integer>6017</integer>
</dict>
...
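The first item, infoURL, is a URL which can be used together with a category ID to generate a list of apps in that category; since it ends in "id=", the ID presumably goes on the end. The categories follow, one <dict> each, holding the name (itemName) and ID (itemId) as a pair.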
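As a rough sketch of how that response could be turned into per-category URLs, here is some Python using the standard plistlib module. It rests on a few assumptions: that the response is an Apple-style property list (the key/dict/array layout suggests so, but I haven't confirmed it for the full response), that the category ID is simply appended to infoURL, and that the fragment above sits inside a single top-level dict. SAMPLE and category_urls are names made up for this illustration.

```python
import plistlib

# Hypothetical sample: the fragment shown above, wrapped in a minimal
# <plist><dict> so it parses on its own. The real response contains more.
SAMPLE = b"""<plist version="1.0">
<dict>
<key>infoURL</key><string>http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGenre?id=</string>
<key>items</key>
<array>
<dict>
<key>itemName</key><string>Books</string>
<key>itemId</key><integer>6018</integer>
</dict>
<dict>
<key>itemName</key><string>Business</string>
<key>itemId</key><integer>6000</integer>
</dict>
<dict>
<key>itemName</key><string>Education</string>
<key>itemId</key><integer>6017</integer>
</dict>
</array>
</dict>
</plist>"""


def category_urls(plist_bytes):
    """Map each category name to infoURL + itemId (assumed URL scheme)."""
    data = plistlib.loads(plist_bytes, fmt=plistlib.FMT_XML)
    base = data["infoURL"]
    return {item["itemName"]: base + str(item["itemId"]) for item in data["items"]}


if __name__ == "__main__":
    for name, url in category_urls(SAMPLE).items():
        print(name, url)
```

If the real response nests these keys more deeply, the two dictionary lookups in category_urls would need to be adjusted to walk down to them first.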
These URLs can be tested with curl or in a web browser. I’ll cover getting category entries and individual app details in a later posting.
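For anyone who would rather script the request than use curl, a minimal Python sketch follows. The User-Agent value is only a placeholder I made up, not something taken from a capture; if the server turns out to care about it, substitute whatever tcpdump shows iTunes sending. build_browse_request and fetch are hypothetical helper names.

```python
import urllib.request

BROWSE_URL = "http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36"

# Placeholder only: replace with the User-Agent seen in the tcpdump output.
ITUNES_USER_AGENT = "iTunes/8.0 (Macintosh)"


def build_browse_request(url=BROWSE_URL, user_agent=ITUNES_USER_AGENT):
    """Build (but do not send) a GET request for the browse list."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(request, timeout=10):
    """Send the request and return the raw response body as bytes."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling fetch(build_browse_request()) should return the raw XML, assuming the URL still responds the way it did when I captured it.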