Finding new iPhone apps
Friday, July 31st, 2009
To monitor new releases on the iPhone, use this URL:
Here’s a tip: if you’re as into free iPhone apps as I am, use the following link to find apps that have recently been made free:
http://www.macworld.com/appguide/browse.html#prices=Free&sort=onSale&dir=desc
And here’s one to find recently released free apps:
http://www.macworld.com/appguide/browse.html#prices=Free&sort=new&dir=desc
The AppStore doesn’t make this kind of search very easy, so this is a good work-around.
MacWorld have launched their appguide site for iPhone apps. I think this raises the bar on AppStore scrapers: it displays a lot of information and makes it simple to find cool apps. The editorial content adds a lot of value too, since there are just too many apps out there to navigate without some kind of guidance. They’re also giving away iTunes gift cards for reviews. Of course they can’t restrict reviews to purchasers only, as Apple can, so the scheme is open to abuse, but historically they’ve been pretty hard on spammers and abusers, so we’ll have to give them the benefit of the doubt. Check it out.
I got a chance to investigate the strange access problems I found last week while trying to scrape the iPhone AppStore. It looks to me like something has definitely changed on the server, but it’s hard to see what. My original script used curl with its default user agent string, and that seemed to suffer timeout problems on every page. So I changed the agent string to the Firefox one, which seemed to result in an immediate improvement, but it too started to suffer slowdowns as it progressed through the store. Finally I changed all timeouts to 5 minutes, and it looks like every call returned successfully. So all I can think of is that requests that aren’t from the iPhone or iTunes are being served, but they’re being sent to the back of the queue. I don’t see much logic in all this, but Apple does move in mysterious ways sometimes, and it’s always possible it’s a quirk rather than a policy decision. But the good news is that access is still being permitted at some level, and we can go on slicing and dicing AppStore content into something useful.
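As a rough illustration of the two changes described above (a browser-like user agent and a 5-minute timeout), here is a minimal Python sketch using only the standard library. The original script used curl, so the function names and the exact Firefox user agent string below are assumptions made for illustration.

```python
import urllib.request

# Assumption: any browser-like User-Agent value; this string is only a placeholder.
FIREFOX_UA = (
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1) "
    "Gecko/20090624 Firefox/3.5"
)
TIMEOUT_SECONDS = 5 * 60  # the 5-minute timeout mentioned above


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": FIREFOX_UA})


def fetch(url):
    """Download url and return the raw response bytes (performs network I/O)."""
    request = build_request(url)
    with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
        return response.read()
```

Keeping the request construction separate from the download makes it easy to check the headers without touching the network.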
Not quite sure what’s going on with the AppStore. I just resumed my experiments and it appears that a couple of things have changed. Firstly, calls from curl seem to be blocked, although changing the user agent seems to get round that. Why they would impose such a trivially bypassed hurdle is a bit of a mystery. Surely if they want to block someone there are better ways to keep them out, like IP address blocking. It is interesting that they aren’t moving to impose a total block on non-iTunes clients though; that looks like a tacit admission that they are allowing store scraping at some level. More seriously, some of the browse URLs I was using previously don’t appear to work any more. I’m sure I can figure out what’s going on but I’m going to need more time than I have now to investigate. I’ll post back as soon as I figure it out.
Further to my recent posts covering scraping the iTunes AppStore, I have made some progress towards decoding the browse URL that returns the list of apps by category. There is a slight wrinkle with categories that have sub-categories (currently only games) and a potential work-around to the 3500-per-page limit.
The browse URL breaks down to this:
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/category/subcategory/page
The top level browse URL, ie
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse
on its own gives a list of top level categories and their associated ids, e.g. TV shows is 32, Music videos is 31, Music is 34 and AppStore is 36.
So to browse a category from the root, you append the query string path=/id to the URL, i.e. the AppStore URL is
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36
which returns a list of AppStore categories and their ids: Weather = 6001, Travel = 6003, Games = 6014, etc.
Then, to browse all weather apps the URL is
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36/6001/1
where the final 1 seems to be a paging control, so where there are more than 3500 apps you can increment the last number to retrieve the next set of app details.
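The URL pattern described so far can be sketched as a small helper. This is only an illustration: the function name browse_url and its page parameter are my own labels, and the paging behaviour is the guess stated above rather than anything documented.

```python
BROWSE_URL = "http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse"


def browse_url(*ids, page=None):
    """Build a browse URL from a chain of ids, optionally ending in a page number."""
    parts = [str(i) for i in ids]
    if page is not None:
        parts.append(str(page))
    if not parts:
        return BROWSE_URL  # bare URL: the list of top level categories
    return BROWSE_URL + "?path=/" + "/".join(parts)


# The examples from the text above.
appstore = browse_url(36)                        # AppStore categories
weather_page_1 = browse_url(36, 6001, page=1)    # first set of weather apps
weather_page_2 = browse_url(36, 6001, page=2)    # next set, if paging works as guessed
```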
Where there are subcategories, they can be accessed by replacing the top level id with that of the category, so to browse all games subcategories the URL is
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/6014
which returns the names and ids of the games subcategories (Action = 7001, Adventure = 7002, and so on). Then to browse the action games the URL becomes
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/6014/7001/1
It looks to me that currently, if the tree is traversed from the root until the list of subcategories comes back empty, and then the leaf node is used to retrieve the apps, there is no need for paging with a value greater than 1. This is also the only method I can see for determining which subcategory an app is listed under: the apps themselves link to the category and a genre but not a subcategory. I also don’t know right now if this will produce multiple instances of the same app, i.e. if an app can appear under multiple subcategories.
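The traversal described above can be sketched as follows. This is a hedged illustration, not my actual script: fetch_children is a hypothetical callable that takes a category id and returns a list of (name, id) pairs (empty when there are no subcategories), and FAKE_TREE is stand-in data built from the ids quoted in this post so the example runs without network access.

```python
def find_leaves(fetch_children, root_id):
    """Yield (parent_id, leaf_id) pairs for every category with no subcategories.

    fetch_children is a hypothetical callable: given a category id it returns
    a list of (name, id) pairs, or an empty list for a leaf category.
    """
    stack = [(None, root_id)]
    while stack:
        parent_id, node_id = stack.pop()
        children = fetch_children(node_id)
        if not children:
            if parent_id is not None:
                yield parent_id, node_id
            continue
        for _name, child_id in children:
            stack.append((node_id, child_id))


# Stand-in data using the ids quoted above, instead of real network calls.
FAKE_TREE = {
    36: [("Weather", 6001), ("Games", 6014)],
    6014: [("Action", 7001), ("Adventure", 7002)],
}

leaves = sorted(find_leaves(lambda cid: FAKE_TREE.get(cid, []), 36))
# Each leaf is requested as /parent/leaf/1, matching the URLs shown above.
paths = ["/%d/%d/1" % pair for pair in leaves]
```

If an app can appear under several subcategories, the results would need de-duplicating by item id after the leaf pages are fetched.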
So we now have the page containing all (or the first 3500) applications for each category. To read details of individual apps I used the following XPath query:
/*[name()='Document']/*[name()='View']/*[name()='ScrollView']/*[name()='VBoxView']/*[name()='View']/*[name()='MatrixView']/*[name()='MatrixView']/*[name()='VBoxView']/*[name()='VBoxView']/*[name()='TextView']
Each node of this contains a long list of name/value pairs as shown in my previous post. Some of the fields are:
My last post on scraping the iTunes store showed how to read the categories page. We had a URL something like this:
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36/6081/1/
(for books). Sending that in produces a large XML response; the interesting parts look like this:
<dict>
<key>artistId</key>
<integer>293260414</integer>
<key>artistName</key>
<string>Saxorama.net</string>
<key>buy-only</key>
<true/>
<key>buyParams</key>
<string>
productType=C&salableAdamId=294770918&pricingParameters=STDQ&price=0&ct-id=14
</string>
<key>genre</key>
<string>Productivity</string>
<key>genreId</key>
<integer>6007</integer>
<key>itemId</key>
<integer>294770918</integer>
<key>itemName</key>
<string>EasyWriter</string>
<key>kind</key>
<string>software</string>
<key>playlistName</key>
<string>EasyWriter</string>
<key>popularity</key>
<string>0.13890815</string>
<key>price</key>
<integer>0</integer>
<key>priceDisplay</key>
<string>Free</string>
<key>releaseDate</key>
<string>2009-04-06T07:00:00Z</string>
<key>s</key>
<integer>143441</integer>
<key>softwareIcon57x57URL</key>
<string>
http://a1.phobos.apple.com/us/r30/Purple/40/ce/15/mzl.dtewfrse.png
</string>
<key>softwareIconNeedsShine</key>
<false/>
<key>softwareSupportedDeviceIds</key>
<array>
<integer>1</integer>
</array>
<key>softwareVersionBundleId</key>
<string>net.sax.easywriter</string>
<key>softwareVersionExternalIdentifier</key>
<integer>1589121</integer>
<key>softwareVersionExternalIdentifiers</key>
<array>
<integer>875361</integer>
<integer>1472572</integer>
<integer>1486886</integer>
<integer>1589121</integer>
</array>
<key>url</key>
<string>
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=294770918&mt=8
</string>
</dict>
This is a set of summary details for one product, and there will be up to 3500 of them on the page that we just downloaded. While the information here is useful, we need to go to another page to get the complete set of information for the product. We do this by extracting a URL using an XPath statement, which I will post next time.
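Since each summary entry is a key/value dict in the same style as an Apple property list, one way to read it is Python's standard plistlib module. The sketch below is an assumption-laden illustration: it works on a shortened copy of the entry above, wraps the fragment in a minimal plist element so plistlib will accept it, and the name parse_item is mine. It does not show how to cut the fragments out of the full response.

```python
import plistlib

# Shortened copy of one <dict> entry from the response shown above.
FRAGMENT = """
<dict>
<key>artistName</key>
<string>Saxorama.net</string>
<key>buy-only</key>
<true/>
<key>itemId</key>
<integer>294770918</integer>
<key>itemName</key>
<string>EasyWriter</string>
<key>price</key>
<integer>0</integer>
<key>softwareVersionExternalIdentifiers</key>
<array>
<integer>875361</integer>
<integer>1472572</integer>
</array>
</dict>
"""


def parse_item(fragment):
    """Parse one <dict> fragment by wrapping it in a minimal plist document."""
    document = '<plist version="1.0">' + fragment.strip() + "</plist>"
    return plistlib.loads(document.encode("utf-8"), fmt=plistlib.FMT_XML)


item = parse_item(FRAGMENT)
print(item["itemName"], item["price"], item["buy-only"])
```

The result is an ordinary Python dict, so integers, booleans and arrays come back as int, bool and list values.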
There have been quite a few sites set up recently that provide lists of apps on the Apple iPhone AppStore – apptism.com is probably the best known. Obviously they are accessing the AppStore and pulling the data down by masquerading as iTunes. I decided to find out how they are doing it.
In order to monitor the network connection between iTunes and the AppStore I needed a network packet sniffer. I googled for the best solution and found that OS X ships with tcpdump, which does a good job of tracking network traffic. The command I used was
sudo tcpdump -s 0 -A -i en0 port 80
which gave me enough details to see what was happening.
I then started up iTunes and went to the AppStore page, and I was able to see all the messages passing between the two. It looks like the iTunes AppStore feature functions very similarly to a web browser, although it doesn’t use HTML; it is based on a proprietary, and quite complex, XML format. There are different ways of finding the apps in iTunes. One is to click a category and then page through, which is painfully slow, although it shows details of each app, including the price, rating and a description. The alternative is to use the browse by category list, which just lists the app name, price and genre. Since I plan on pulling down all the app details, I thought I’d start with the browse list and then drill down through the categories to the individual apps.
The URL iTunes uses for browsing is
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36
This returns a list of categories and associated IDs in pairs. The meat of the XML response looks like this:
<key>infoURL</key><string>http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGenre?id=</string>
<key>items</key>
<array>
<dict>
<key>itemName</key><string>Books</string>
<key>itemId</key><integer>6018</integer>
</dict>
<dict>
<key>itemName</key><string>Business</string>
<key>itemId</key><integer>6000</integer>
</dict>
<dict>
<key>itemName</key><string>Education</string>
<key>itemId</key><integer>6017</integer>
</dict>
...
The first item is a URL which can be used together with the category id to generate a list of apps in that category. The categories and their ids follow in <dict> pairs.
These URLs can be tested with curl or in a web browser. I’ll cover getting category entries and individual app details in a later posting.
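To show how the infoURL and the category ids fit together, here is a minimal Python sketch using the standard plistlib module. It rests on assumptions: the excerpt above is taken to sit inside a single enclosing dict (which I have added by hand in SAMPLE), only the three categories shown are included, and category_urls is a name I made up for the example.

```python
import plistlib

# Assumption: the excerpt above sits inside one enclosing <dict>.
SAMPLE = """
<dict>
<key>infoURL</key><string>http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGenre?id=</string>
<key>items</key>
<array>
<dict>
<key>itemName</key><string>Books</string>
<key>itemId</key><integer>6018</integer>
</dict>
<dict>
<key>itemName</key><string>Business</string>
<key>itemId</key><integer>6000</integer>
</dict>
<dict>
<key>itemName</key><string>Education</string>
<key>itemId</key><integer>6017</integer>
</dict>
</array>
</dict>
"""


def category_urls(fragment):
    """Map each category name to infoURL + itemId."""
    document = '<plist version="1.0">' + fragment.strip() + "</plist>"
    data = plistlib.loads(document.encode("utf-8"), fmt=plistlib.FMT_XML)
    return {
        entry["itemName"]: data["infoURL"] + str(entry["itemId"])
        for entry in data["items"]
    }


urls = category_urls(SAMPLE)
for name, url in urls.items():
    print(name, url)
```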