Archive for April, 2009

Scraping iTunes App Store part iv – reading application details

Tuesday, April 28th, 2009

So we now have the page containing all (or the first 3500) applications for each category. To read the details of individual apps I used the following XPath query:


/*[name()='Document']/*[name()='View']/*[name()='ScrollView']/*[name()='VBoxView']/*[name()='View']/*[name()='MatrixView']/*[name()='MatrixView']/*[name()='VBoxView']/*[name()='VBoxView']/*[name()='TextView']

Each node of this contains a long list of name/value pairs as shown in my previous post. Some of the fields are:

  • artistId – The unique id of the app developer
  • artistName – A string containing the name of the developer.
  • genre – the name of the genre to which the app is assigned
  • genreId – the numeric id of the genre
  • itemId – the unique id used to identify the app throughout iTunes
  • itemName – the name of the app
  • kind – always “software” as far as I can see
  • popularity – a ranking indicator. Not quite sure how this is calculated right now
  • price – the price in tenths of a cent
  • priceDisplay – the price as a formatted string
  • releaseDate – the release date as a timestamp string
  • softwareIcon57x57URL – the URL of the app’s icon
  • url – the URL to view the app in iTunes
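As a rough illustration of what the XPath query does, here is a minimal Python sketch that walks the same chain of element names using only the standard library. It is a swapped-in technique rather than a real XPath evaluation: ElementTree does not support `name()` predicates, so the hypothetical helper `select_by_names` compares local tag names step by step instead. The sample document, its namespace and the price value 990 are made up for the example, and `price_to_dollars` simply applies the "tenths of a cent" reading of the price field.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# The element names from the XPath query, in order, starting at the root.
STEPS = ["Document", "View", "ScrollView", "VBoxView", "View",
         "MatrixView", "MatrixView", "VBoxView", "VBoxView", "TextView"]


def local_name(element):
    """Return the tag without any '{namespace}' prefix ElementTree adds."""
    return element.tag.rsplit("}", 1)[-1]


def select_by_names(root, steps):
    """Follow a chain of child-element names, ignoring namespaces."""
    current = [root] if local_name(root) == steps[0] else []
    for name in steps[1:]:
        current = [child for node in current for child in node
                   if local_name(child) == name]
    return current


def price_to_dollars(price):
    """Convert a price in tenths of a cent (e.g. 990) to dollars."""
    return Decimal(price) / Decimal(1000)


# Tiny made-up document: far shallower than the real response.
SAMPLE = """<Document xmlns="http://example.invalid/itms">
  <View><TextView>first</TextView></View>
  <Other><TextView>ignored</TextView></Other>
  <View><TextView>second</TextView></View>
</Document>"""

root = ET.fromstring(SAMPLE)
nodes = select_by_names(root, ["Document", "View", "TextView"])
texts = [node.text for node in nodes]
```

Against a real response you would pass `STEPS` instead of the shortened three-step path used for the sample.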


I'm a Mac

Monday, April 27th, 2009

Had a discussion today with one of those people who gets all defensive when you mention that you use a Mac. I haven’t had one of those conversations in a while – most people don’t care, or they already know that the Mac is a better choice. I’ve been using a Mac for the best part of a decade now (since soon after the release of OSX in 2001) and the arguments, which started out frequent and kind of fun, quite soon became less frequent but rather tedious. I’m not going to re-hash the arguments here – the perceived higher cost (or “Apple tax” – cf. “Microsoft tax”) and software compatibility issues, and even the use of Parallels as a case *against* a Mac (“you’ll only use it to run Windows anyway”) are incredibly boring and totally beside the point. I was a Windows developer for years. I know all about Windows. I still have to work on Windows for clients once in a while – and no, I don’t try to convert them, unless there’s a pretty good reason to. I use a Mac because I like it – it does everything I need (and really well to boot) and I like the way it works.
What really frustrates me is the implication that the Mac is trendy and therefore only people who can’t think for themselves would possibly use one. This kind of argument really annoys me, not least because I have met it lots of times over the years for different reasons – for example, when I first started working on computers, PCs were regarded as a fad (yes, I’m old) and I was regarded with suspicion for advocating them over the minis and mainframes that the companies I worked in were using. I was also a strong advocate of Windows when that was released. One thing I will say is that back when they were considered subversive, PCs were adopted by a small subset of leading-edge opinion formers (mainly the more tech-savvy users and developers) and they gradually spread as people cottoned on to their advantages.
I see a similar movement now with regard to people adopting Macs – many software developers, especially those on the leading edge of web development, use Macs as standard. I thought the argument was pretty much over, to be honest. I think the PC user now is like the Vax user I upset years ago when I said his beloved machine would be dead in five years’ time. Not that the PC will necessarily be dead in 5 years, but it’s certainly no longer the future of the industry. But as I said, all this is boring. The next time someone questions my choice of OS I shall merely smile politely and let it pass.

Scraping iTunes part iii – reading the categories list

Thursday, April 23rd, 2009

My last post on scraping the iTunes store showed how to read the categories page. We had a URL something like this:
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36/6081/1/
(for books). Requesting that URL produces a large xml response – the interesting parts look like this:

<dict>
<key>artistId</key>
<integer>293260414</integer>
<key>artistName</key>
<string>Saxorama.net</string>
<key>buy-only</key>
<true/>
<key>buyParams</key>
<string>productType=C&amp;salableAdamId=294770918&amp;pricingParameters=STDQ&amp;price=0&amp;ct-id=14</string>
<key>genre</key>
<string>Productivity</string>
<key>genreId</key>
<integer>6007</integer>
<key>itemId</key>
<integer>294770918</integer>
<key>itemName</key>
<string>EasyWriter</string>
<key>kind</key>
<string>software</string>
<key>playlistName</key>
<string>EasyWriter</string>
<key>popularity</key>
<string>0.13890815</string>
<key>price</key>
<integer>0</integer>
<key>priceDisplay</key>
<string>Free</string>
<key>releaseDate</key>
<string>2009-04-06T07:00:00Z</string>
<key>s</key>
<integer>143441</integer>
<key>softwareIcon57x57URL</key>
<string>http://a1.phobos.apple.com/us/r30/Purple/40/ce/15/mzl.dtewfrse.png</string>
<key>softwareIconNeedsShine</key>
<false/>
<key>softwareSupportedDeviceIds</key>
<array>
<integer>1</integer>
</array>
<key>softwareVersionBundleId</key>
<string>net.sax.easywriter</string>
<key>softwareVersionExternalIdentifier</key>
<integer>1589121</integer>
<key>softwareVersionExternalIdentifiers</key>
<array>
<integer>875361</integer>
<integer>1472572</integer>
<integer>1486886</integer>
<integer>1589121</integer>
</array>
<key>url</key>
<string>http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=294770918&amp;mt=8</string>
</dict>

This is a set of summary details for one product, and there will be up to 3500 of them on the page that we just downloaded. While the information here is useful, we need to go to another page to get the complete set of information for the product. We do this by extracting a URL using an XPath expression, which I will post next time.
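For anyone who wants to work with these summary records in code, here is a minimal Python sketch that turns one such `<dict>` into an ordinary dictionary. It assumes the keys and values strictly alternate and that the elements carry no XML namespace; neither is confirmed for the full response. The helper name `parse_value` is hypothetical, and the sample is a shortened copy of the record above.

```python
import xml.etree.ElementTree as ET


def parse_value(element):
    """Convert one plist-style element into a Python value."""
    tag = element.tag
    if tag == "dict":
        children = list(element)
        # Children alternate: <key>name</key> followed by a value element.
        return {key.text: parse_value(value)
                for key, value in zip(children[0::2], children[1::2])}
    if tag == "array":
        return [parse_value(child) for child in element]
    if tag == "integer":
        return int(element.text)
    if tag == "true":
        return True
    if tag == "false":
        return False
    # <string> and anything unrecognised: keep the text, trimmed.
    return (element.text or "").strip()


# Shortened copy of the record shown above.
SAMPLE = """<dict>
<key>artistName</key><string>Saxorama.net</string>
<key>buy-only</key><true/>
<key>itemId</key><integer>294770918</integer>
<key>itemName</key><string>EasyWriter</string>
<key>price</key><integer>0</integer>
<key>priceDisplay</key><string>Free</string>
<key>softwareSupportedDeviceIds</key>
<array><integer>1</integer></array>
</dict>"""

app = parse_value(ET.fromstring(SAMPLE))
```

Trimming the string values matters here because the URLs in the response are surrounded by whitespace inside their `<string>` elements.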

Scraping the iTunes AppStore part ii – categories

Tuesday, April 21st, 2009

My previous post on scraping the iTunes AppStore showed how to retrieve the top level page for iPhone apps – basically a list of the available categories and their associated IDs. The next step is to get a list of the applications in each category.
The previous call gave us a URL:
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGenre?id=
and a list of name/value pairs like this:

<dict>
<key>itemName</key><string>Books</string>
<key>itemId</key><integer>6018</integer>
</dict>

It looks like we need to combine the URL we just received with the ids to get a page containing links to all the apps in that category, so for all books we would use:
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGenre?id=6081
but unfortunately I couldn’t get this to work. I have no idea what that URL is doing there. Instead I saw that iTunes was using a variant of the original URL, and sending this:
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36/6081/1/
This indeed returns a list of applications belonging to the requested category, although it seems to be limited to 3500 results per category. That is enough for most categories right now. As the store gets bigger more categories will exceed this – I would then expect new categories to be added, or the limit to be raised. If you really need all the apps in the store, you’ll have to use one of the other approaches I mentioned earlier.
Again, the response is an xml message. It’s quite a complicated one, but my next post on the subject will show you how to get details of individual applications.
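A minimal Python sketch of building and requesting the browse URL might look like the following. The path segments `36` and the trailing `1` are copied verbatim from the observed URL without knowing what they mean. The helper names `browse_url` and `fetch` are hypothetical, as is the optional `user_agent` parameter: whether the store requires an iTunes-like User-Agent header is an assumption to check, not something established here.

```python
from urllib.request import Request, urlopen

BASE = "http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa"


def browse_url(genre_id):
    """Build the browse URL for one category id.

    The leading "36" and trailing "1" are copied verbatim from the
    observed URL; what they mean is an open question.
    """
    return f"{BASE}/browse?path=/36/{genre_id}/1/"


def fetch(url, user_agent=None, timeout=30):
    """Download a URL and return the raw response bytes.

    user_agent is optional: whether the store insists on an iTunes-like
    User-Agent header is an assumption to verify, not a known fact.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    with urlopen(Request(url, headers=headers), timeout=timeout) as response:
        return response.read()


# Only builds the URL; call fetch(url) to actually hit the network.
url = browse_url(6081)
```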

Java – printing the contents of a bean

Monday, April 20th, 2009

I just discovered this trick, and I’m recording it here so other people don’t have to search as much as I did to find it. To dump the contents of a bean b use this code:

org.apache.commons.beanutils.BeanUtils.describe(b).toString()

This uses reflection to dump the bean’s property names and values – very useful in debugging. describe returns a Map of property names to their values (as strings) – also useful if you want to check that specific fields have been set properly.


Scraping the iTunes AppStore part i

Friday, April 17th, 2009

There have been quite a few sites set up recently that provide lists of apps on the Apple iPhone AppStore – apptism.com is probably the best known. Obviously they are accessing the AppStore and pulling the data down by masquerading as iTunes. I decided to find out how they are doing it.
In order to monitor the network connection between iTunes and the AppStore I needed a network packet sniffer. I googled the best solution and found that OSX ships with tcpdump, which does a good job of tracking network traffic – the command I used was

sudo tcpdump -s 0 -A -i en0 port 80

which gave me enough details to see what was happening.
I then started up iTunes and went to the AppStore page, and I was able to see all the messages passing between the two – it looks like the iTunes AppStore feature functions very similarly to a web browser, although it doesn’t use HTML – it is based on a proprietary, and quite complex, xml format. There are two ways of finding the apps in iTunes. One is to click a category and then page through, which is painfully slow although it shows details of each app, including the price, rating and a description. The alternative is to use the browse by category list, which just lists the app name, price and genre. Since I plan on pulling down all the app details, I thought I’d start with the browse list and then drill down through the categories to the individual apps.

The url iTunes uses for browsing is
http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/browse?path=/36
This returns a list of categories and associated IDs in pairs – the meat of the xml response looks like this:
<key>infoURL</key><string>http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGenre?id=</string>
<key>items</key>
<array>
<dict>
<key>itemName</key><string>Books</string>
<key>itemId</key><integer>6018</integer>
</dict>
<dict>
<key>itemName</key><string>Business</string>
<key>itemId</key><integer>6000</integer>
</dict>
<dict>
<key>itemName</key><string>Education</string>
<key>itemId</key><integer>6017</integer>
</dict>
...

The first item is a URL which can be used together with the category id to generate a list of apps in that category. The categories and their ids follow in <dict> pairs.
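To make the pairing concrete, here is a minimal Python sketch that collects the category names and ids from a response fragment like the one above. The outer `<dict>` wrapper in the sample is an assumption added so the fragment parses on its own, and the helper name `read_categories` is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_categories(root):
    """Map category name -> id from every <dict> holding both keys."""
    categories = {}
    for d in root.iter("dict"):
        children = list(d)
        # Pair each <key> with the value element that follows it.
        pairs = {k.text: v.text for k, v in zip(children[0::2], children[1::2])}
        if "itemName" in pairs and "itemId" in pairs:
            categories[pairs["itemName"]] = int(pairs["itemId"])
    return categories


# Fragment from the response, wrapped in an assumed outer <dict>.
SAMPLE = """<dict>
<key>infoURL</key><string>http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGenre?id=</string>
<key>items</key>
<array>
<dict><key>itemName</key><string>Books</string><key>itemId</key><integer>6018</integer></dict>
<dict><key>itemName</key><string>Business</string><key>itemId</key><integer>6000</integer></dict>
<dict><key>itemName</key><string>Education</string><key>itemId</key><integer>6017</integer></dict>
</array>
</dict>"""

categories = read_categories(ET.fromstring(SAMPLE))
```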

These URLs can be tested with curl or in a web browser. I’ll cover getting category entries and individual app details in a later posting.

Scrum, or the Prayer Meeting

Wednesday, April 15th, 2009

I re-read my recent post on agile development and I think it sounded rather more negative than I intended. I don’t wish to give the impression that I am opposed to agile methodologies – quite the contrary. I don’t think I’ve ever worked in a development environment that I would not describe as agile – I think that kind of approach is the norm in Silicon Valley, where I have worked almost my entire career. I think the large corporate IT departments are the ones that embraced the waterfall, highly structured and inflexible approaches that agile methodologies (henceforward referred to as simply ‘agile’ or ‘am’) are advocated to replace, and I would never advocate that approach. Having said that, I think there are different ways to implement am and different reasons for doing so. I have seen them introduced in four companies, and only in one was it done for the right reason and in none was it a classic replacement of waterfall methods.
In the first case, the company was failing and in an attempt to find a scapegoat the development team was deemed to be “out of control” and so a project manager attempted to rein it in by imposing agile on them. This was a complete failure because the development team was in fact doing a great job, and resented being bullied by someone who turned out to be too controlling and ultimately incompetent. The company eventually failed, certainly hastened by the departure of most of the development team who refused to buckle under.
In another case a new manager was hired to run an existing team, and in order to make his mark he also tried to impose agile methodologies, again in a way that increased the reporting and meeting burden on the developers and caused resentment and dissatisfaction. That team also started to disintegrate but thankfully the new manager was replaced before any real damage was done.
In the one successful case, the development team did not really change its approach at all – they had been used to working on short-term projects, and were releasing daily updates to their product. The change really applied to the project management and executive teams who learned whole new ways to communicate. They used the story approach to better define their requirements, and were able to understand the scheduling impact of their projects and to gain insight into the development process by becoming more involved in the planning and estimating processes. And of course they got to see exactly how things were progressing by turning up at scrum meetings.
So I think the moral is that a successful agile process allows everybody space to do what they do best, without any power struggles or imposition of one group’s will onto another’s. And mostly, I think it often requires a change in mindset of the customers more than it does the developers.
Oh, and the prayer meeting thing? It’s probably just me, but when I see a bunch of people standing in a room round a table, often with heads bowed, it’s not a rugby match that comes to mind.

Go-GoGrid

Wednesday, April 15th, 2009

Interesting news from GoGrid today – they are adding two new features which actually go some way towards addressing the cloud computing issues I have mentioned before. Firstly they are allowing user-defined images to be uploaded, from which multiple virtual servers with varying memory allotments can be created; this means that I can clone my entire application stack and deploy multiple copies of it in minutes. They are also connecting their Colocation servers to their cloud servers, so it will be possible to use the standard hosting service and expand into the cloud as and when required, just as I said I wanted to here. This issue is also potentially addressed by Dynect’s solution, although I have yet to fully investigate that service. Anyhow, it’s good to see that somebody is thinking through the cloud offering and what it takes to make it a practical solution. As soon as I’ve fully investigated Dynect and kicked GoGrid’s tyres for deploying my new project I’ll post the details here for people to throw stones at.

More cloud computing thoughts

Thursday, April 9th, 2009

Further to my recent comments about cloud computing – I am currently helping a new site set itself up – we want to use a traditional hosting service but with the option of dynamically expanding into the cloud if bandwidth or other capacity limits are approached. Somebody had a bright idea – set up a free starter account at gogrid, configure a loadbalancer as the public address of the site and point the loadbalancer back at our server. Then when we need to we can start up a gogrid server and have the loadbalancer distribute load between the two servers. Brilliant – just the kind of scenario I wrote about, and a logical use of cloud capacity. Unfortunately gogrid don’t let you configure a loadbalancer to point to anything but their own servers. Rats. Something is going to have to give here if this concept is ever going to take flight.

Firefox and Javascript

Thursday, April 9th, 2009

I have a window with a variable number of sets of radio buttons whose values I need to read, so I have some javascript that iterates over a few ids and uses eval to read the values, since I don’t know in advance what the radio buttons are called. Sometimes it works flawlessly, sometimes Firefox just disappears, and sometimes this happens:

Firefox telling me there's something wrong with my javascript.


Safari seems to work ok, and of course all the javascript debugging tools I have are part of Firefox. Oh well.