Posts Tagged ‘itunes’

AppStore scraping – back to the drawing board

Tuesday, June 9th, 2009

I had assumed that in a browse URL such as http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/6014/7001/1, the last number was some kind of paging option, with each page returning up to 2500 apps. iTunes only seems to display that number of items per category, and that’s the format of the URL it uses, so it made sense. But having actually tried it, I find that the xml returned is the same regardless of the number at the end. It returns an error if you don’t put a number, but put anything from 0 to 99 and you get the same list of apps. That is kind of a pain, because it leaves a lot of apps unreachable: I can get to around 35000 using the browse method, but according to apptism there are currently around 49000 apps. The only way round this that I can see is to abandon the browse approach, scrape from the front page link for each category, and page through 20 at a time. It’s probably going to be slow but I don’t see any choice at the moment. Of course I’ll report my findings here.
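One way to check whether the trailing number makes any difference is to fetch the same path with several final values and compare a digest of each response. This is only a rough sketch: the `fetch` argument is a hypothetical caller-supplied function that takes a URL and returns the body as bytes, and `distinct_responses` is a name made up for illustration.

```python
import hashlib

BROWSE_URL = "http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse"


def distinct_responses(fetch, category_path, last_numbers):
    """Group trailing numbers by the SHA-1 digest of the body they return.

    fetch is a caller-supplied function (hypothetical): URL -> bytes.
    category_path is e.g. "/6014/7001"; last_numbers is e.g. range(0, 100).
    """
    groups = {}
    for n in last_numbers:
        url = f"{BROWSE_URL}?path={category_path}/{n}"
        digest = hashlib.sha1(fetch(url)).hexdigest()
        groups.setdefault(digest, []).append(n)
    return groups
```

If the returned dict has a single key, every number tried produced the same xml.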

Apple Blocking Access to the AppStore?

Wednesday, May 27th, 2009

I got a chance to investigate the strange access problems I found last week trying to scrape the iPhone AppStore. It looks to me like something has definitely changed on the server, but it’s hard to see what. My original script used curl with its default user agent string, and that seemed to suffer timeout problems on every page. So I changed the agent string to the Firefox one, which seemed to result in an immediate improvement, but it too started to suffer slowdowns as it progressed through the store. Finally I changed all timeouts to 5 minutes and it looks like every call returned successfully. So all I can think of is that requests that aren’t from the iPhone or iTunes are being served, but they’re being sent to the back of the queue. I don’t see much logic in all this, but Apple does move in mysterious ways sometimes, and it’s always possible it’s a quirk rather than a policy decision. The good news is that access is still being permitted at some level and we can go on slicing and dicing AppStore content into something useful.
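For anyone doing the same thing from Python rather than curl, the two changes described above (a browser-style agent string and a five minute timeout) might look roughly like this. The agent string below is a placeholder and not the one I actually used, and `build_request` and `fetch` are names made up for this sketch.

```python
import urllib.request

# Placeholder agent string (an assumption); substitute a real browser's value.
USER_AGENT = "Mozilla/5.0 (Macintosh; rv:1.9) Gecko Firefox/3.0"
TIMEOUT_SECONDS = 300  # five minutes, as described above


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that sends a browser-style User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=TIMEOUT_SECONDS):
    """Fetch url and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

The long timeout only helps if slow responses do eventually arrive, which is what I saw, but that behaviour could change at any time.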

Apple blocking curl from the Appstore?

Tuesday, May 19th, 2009

Not quite sure what’s going on with the AppStore. I just resumed my experiments and it appears that a couple of things have changed. Firstly, calls from curl seem to be blocked, although changing the user agent seems to get round that. Why they would impose such a trivially bypassed hurdle is a bit of a mystery – surely if they want to block someone in particular there are better ways to keep them out, like IP address blocking. It is interesting that they aren’t moving to impose a total block on non-iTunes clients though; that looks like a tacit admission that they are allowing store scraping at some level. More seriously, some of the browse URLs I was using previously don’t appear to work any more. I’m sure I can figure out what’s going on but I’m going to need more time than I have now to investigate. I’ll post back as soon as I figure it out.

iTunes AppStore scraping – decoding the browse URL

Monday, May 11th, 2009

Further to my recent posts covering scraping the iTunes AppStore – I have made some progress towards decoding the browse URL that returns the list of apps by category. There is a slight wrinkle with categories that have sub-categories (currently only games) and a potential work-around for the 3500-per-page limit.

The browse URL breaks down to this:

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/category/subcategory/page

The top level browse URL, ie

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse

on its own gives a list of top level categories and their associated ids, e.g. TV shows is 32, Music videos is 31, Music is 34 and AppStore is 36.

So to browse a category from the root, you append the query string path=/id to the URL, i.e. the AppStore URL is

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36

which returns a list of AppStore categories and their ids – Weather = 6001, Travel = 6003, Games = 6014, etc.

Then, to browse all weather apps the URL is

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36/6001/1

where the final 1 seems to be a paging control – so where there are > 3500 apps you can increment the last number to retrieve the next set of app details.
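The URL forms described so far can be generated by a small helper. This is just a sketch: `browse_url` and its `page` keyword are names invented here, and the function does nothing more than join the ids into the path query string.

```python
BROWSE_URL = "http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse"


def browse_url(*ids, page=None):
    """Build a browse URL from a sequence of ids and an optional trailing number."""
    if not ids and page is None:
        return BROWSE_URL  # top level: list of stores and their ids
    parts = [str(i) for i in ids]
    if page is not None:
        parts.append(str(page))
    return f"{BROWSE_URL}?path=/" + "/".join(parts)
```

For example, `browse_url(36)` gives the AppStore category list and `browse_url(36, 6001, page=1)` gives the weather apps URL shown above.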

Where there are subcategories, they can be accessed by replacing the top level id with that of the category – so to browse all games subcategories the URL is

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/6014

which returns the names and ids of the games subcategories (Action = 7001, Adventure = 7002, and so on). Then to browse the action games the URL becomes

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/6014/7001/1

It looks to me that currently, if the tree is traversed from the root until the list of subcategories comes back empty, and the leaf node is then used to retrieve the apps, there is no need for paging with a value greater than 1. This is also the only method I can see for determining which subcategory an app is listed under – the apps themselves link to the category and a genre but not a subcategory. I also don’t know right now if this will produce multiple instances of the same app, i.e. if an app can appear under multiple subcategories.
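The traversal described above can be sketched as a short recursive generator. It assumes a hypothetical `list_children` function, supplied by the caller, that takes an id and returns a dict of {name: id} for the next level down (empty when there are no subcategories); how that function fetches and parses the xml is left out.

```python
def leaf_nodes(list_children, parent_id):
    """Yield (parent_id, leaf_id) pairs for every node with no children.

    list_children is caller-supplied (hypothetical): id -> {name: id}.
    Each child is queried once to test for subcategories and again when
    recursing, so a real fetcher would probably want to cache results.
    """
    for child_id in list_children(parent_id).values():
        if list_children(child_id):
            yield from leaf_nodes(list_children, child_id)
        else:
            yield (parent_id, child_id)


def leaf_path(parent_id, leaf_id):
    """Path component for a leaf, using 1 as the trailing number."""
    return f"/{parent_id}/{leaf_id}/1"
```

Starting from 36, a category without subcategories yields a pair like (36, 6001), while games subcategories yield pairs like (6014, 7001), matching the URLs above.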

Scraping iTunes App Store part iv – reading application details

Tuesday, April 28th, 2009

So we now have the page containing all (or the first 3500) applications for each category. To read details of individual apps I used the following XPath query -


/*[name()='Document']/*[name()='View']/*[name()='ScrollView']/*[name()='VBoxView']/*[name()='View']/*[name()='MatrixView']/*[name()='MatrixView']/*[name()='VBoxView']/*[name()='VBoxView']/*[name()='TextView']
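Python’s standard ElementTree does not support the name() function used in this query, so the same walk has to be done by hand there. The sketch below follows the element names from the query one level at a time, ignoring any namespace prefix on the tags; `select` and `local_name` are names made up for illustration, and whether the real response uses a namespace is an assumption.

```python
import xml.etree.ElementTree as ET

# Element names taken from the XPath query above, after the root element.
PATH = ["View", "ScrollView", "VBoxView", "View", "MatrixView",
        "MatrixView", "VBoxView", "VBoxView", "TextView"]


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def select(root, path=PATH, root_name="Document"):
    """Return the elements reached from root by following local names in path."""
    if local_name(root.tag) != root_name:
        return []
    current = [root]
    for name in path:
        # Keep only the direct children whose local name matches this step.
        current = [child for node in current for child in node
                   if local_name(child.tag) == name]
    return current
```

Calling `select(ET.fromstring(xml_text))` on a downloaded page would return the TextView nodes, if the page has the structure the query expects.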

Each node of this contains a long list of name/value pairs as shown in my previous post. Some of the fields are:

  • artistId – The unique id of the app developer
  • artistName – A string containing the name of the developer.
  • genre – the name of the genre to which the app is assigned
  • genreId – the numeric id of the genre
  • itemId – the unique id used to identify the app throughout iTunes
  • itemName – the name of the app
  • kind – always “software” as far as I can see
  • popularity – a ranking indicator. Not quite sure how this is calculated right now
  • price – the price in tenths of a cent
  • priceDisplay – the price as a formatted string
  • releaseDate
  • softwareIcon57x57URL – the URL of the app’s icon
  • url – the URL to view the app in iTunes
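Once the name/value pairs are in a dict, a few of the fields above can be pulled into a small record. This is a sketch only: `AppSummary` and `summary_from_fields` are invented names, and the price conversion simply takes the “tenths of a cent” reading above at face value (1000 units to one whole currency unit).

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class AppSummary:
    item_id: int
    item_name: str
    artist_id: int
    artist_name: str
    genre_id: int
    price: int  # raw value; tenths of a cent according to the list above
    url: str

    @property
    def price_in_currency(self):
        """Convert the raw price to whole currency units (assumes 1000 per unit)."""
        return Decimal(self.price) / Decimal(1000)


def summary_from_fields(fields):
    """Build an AppSummary from a dict keyed by the field names listed above."""
    return AppSummary(
        item_id=int(fields["itemId"]),
        item_name=fields["itemName"],
        artist_id=int(fields["artistId"]),
        artist_name=fields["artistName"],
        genre_id=int(fields["genreId"]),
        price=int(fields["price"]),
        url=fields["url"].strip(),
    )
```

Decimal is used rather than float so that prices divide exactly.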

Scraping iTunes part iii – reading the categories list

Thursday, April 23rd, 2009

My last post on scraping the iTunes store showed how to read the categories page. We had a URL something like this:
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36/6018/1/
(for books). Sending that in produces a large xml response – the interesting parts look like this:

<dict>
<key>artistId</key>
<integer>293260414</integer>
<key>artistName</key>
<string>Saxorama.net</string>
<key>buy-only</key>
<true/>
<key>buyParams</key>
<string>
productType=C&salableAdamId=294770918&pricingParameters=STDQ&price=0&ct-id=14
</string>
<key>genre</key>
<string>Productivity</string>
<key>genreId</key>
<integer>6007</integer>
<key>itemId</key>
<integer>294770918</integer>
<key>itemName</key>
<string>EasyWriter</string>
<key>kind</key>
<string>software</string>
<key>playlistName</key>
<string>EasyWriter</string>
<key>popularity</key>
<string>0.13890815</string>
<key>price</key>
<integer>0</integer>
<key>priceDisplay</key>
<string>Free</string>
<key>releaseDate</key>
<string>2009-04-06T07:00:00Z</string>
<key>s</key>
<integer>143441</integer>
<key>softwareIcon57x57URL</key>
<string>

http://a1.phobos.apple.com/us/r30/Purple/40/ce/15/mzl.dtewfrse.png

</string>
<key>softwareIconNeedsShine</key>
<false/>
<key>softwareSupportedDeviceIds</key>
<array>
<integer>1</integer>
</array>
<key>softwareVersionBundleId</key>
<string>net.sax.easywriter</string>
<key>softwareVersionExternalIdentifier</key>
<integer>1589121</integer>
<key>softwareVersionExternalIdentifiers</key>
<array>
<integer>875361</integer>
<integer>1472572</integer>
<integer>1486886</integer>
<integer>1589121</integer>
</array>
<key>url</key>
<string>

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=294770918&mt=8

</string>
</dict>

This is a set of summary details for one product, and there will be up to 3500 of them on the page that we just downloaded. While the information here is useful, we need to go to another page to get the complete set of information for the product. We do this by extracting a URL using an XPath statement, which I will post next time.
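The key/string/integer/array elements are the same vocabulary as Apple’s xml property lists, so one of these dict fragments can probably be read with Python’s plistlib. The sketch below wraps a single fragment in a plist element before parsing, since plistlib expects a whole document; `parse_dict_fragment` is a made-up name, and treating the fragment as a property list is my assumption rather than anything Apple documents for this response.

```python
import plistlib


def parse_dict_fragment(fragment):
    """Parse one <dict>...</dict> fragment into a Python dict.

    The fragment is wrapped in <plist> because plistlib parses complete
    property-list documents. String values are stripped, since the url and
    icon fields are padded with blank lines in the response.
    """
    wrapped = '<plist version="1.0">' + fragment.strip() + "</plist>"
    record = plistlib.loads(wrapped.encode("utf-8"), fmt=plistlib.FMT_XML)
    return {key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()}
```

Integers, booleans and arrays come back as native Python values, so `record["price"]` is an int and `record["softwareSupportedDeviceIds"]` is a list. Note that the fragment has to be well-formed xml, so any literal ampersand in a value must appear as an entity.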

Scraping the iTunes AppStore part i

Friday, April 17th, 2009

There have been quite a few sites set up recently that provide lists of apps on the Apple iPhone AppStore – apptism.com is probably the best known. Obviously they are accessing the AppStore and pulling the data down by masquerading as iTunes. I decided to find out how they are doing it.
In order to monitor the network connection between iTunes and the AppStore I needed a network packet sniffer. I googled the best solution and found that OSX ships with tcpdump, which does a good job of tracking network traffic – the command I used was

sudo tcpdump -s 0 -A -i en0 port 80

which gave me enough details to see what was happening.
I then started up iTunes and went to the AppStore page, and I was able to see all the messages passing between the two. It looks like the iTunes AppStore feature functions very similarly to a web browser, although it doesn’t use HTML – it is based on a proprietary, and quite complex, xml format. There are two ways of finding the apps in iTunes. One is to click a category and then page through, which is painfully slow although it shows details of each app, including the price, rating and a description. The alternative is to use the browse by category list, which just lists the app name, price and genre. Since I plan on pulling down all the app details, I thought I’d start with the browse list and then drill down through the categories to the individual apps.

The url iTunes uses for browsing is
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36
This returns a list of categories and associated IDs in pairs – the meat of the xml response looks like this:
<key>infoURL</key><string>http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGenre?id=</string>
<key>items</key>
<array>
<dict>
<key>itemName</key><string>Books</string>
<key>itemId</key><integer>6018</integer>
</dict>
<dict>
<key>itemName</key><string>Business</string>
<key>itemId</key><integer>6000</integer>
</dict>
<dict>
<key>itemName</key><string>Education</string>
<key>itemId</key><integer>6017</integer>
</dict>
...

The first item is a URL which can be used together with the category id to generate a list of apps in that category. The categories and their ids follow in <dict> pairs.
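As a rough illustration, the fragment above can be turned into a name-to-id mapping with Python’s plistlib. The sketch assumes the keys shown sit inside an enclosing dict (the response as captured is only an excerpt, so the wrapper here is my guess), and `parse_categories` and `genre_url` are made-up names.

```python
import plistlib


def parse_categories(fragment):
    """Return (info_url, {name: id}) from a key/value fragment like the one above.

    The fragment is wrapped in <plist><dict> because only an excerpt of the
    response is available; the real enclosing structure is an assumption.
    """
    wrapped = '<plist version="1.0"><dict>' + fragment + "</dict></plist>"
    data = plistlib.loads(wrapped.encode("utf-8"), fmt=plistlib.FMT_XML)
    categories = {item["itemName"]: item["itemId"] for item in data["items"]}
    return data["infoURL"].strip(), categories


def genre_url(info_url, category_id):
    """Append a category id to the infoURL value."""
    return f"{info_url}{category_id}"
```

With the excerpt above, `genre_url(info_url, categories["Books"])` would end in `viewGenre?id=6018`.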

These URLs can be tested with curl or in a web browser. I’ll cover getting category entries and individual app details in a later posting.