Crawling and Scraping App Store and/or Android Market

After finishing a project on this subject, I want to write down what I learned about these two platforms. If you need a solution for a similar project, contact me: I know the problems you will run into.

Let’s go.

Apple App Store

The EPF Importer solution provided by Apple seems to be the easiest way to get the platform's data. The problem is that you must be registered and accepted as a partner. Depending on your project, it is not easy to meet the App Store terms and conditions.

The other way is to crawl the App Store public pages. You can crawl by category or by country without problems. Categories have the same IDs in every country/store. Note that NOT ALL APPS are available to download from EVERY country/store.

The other tool you have at your disposal is Apple's RSS generators, where you can get fresh data.
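As a rough illustration of consuming such a feed, here is a minimal sketch that pulls titles and links out of an RSS 2.0 document using only the Python standard library. The sample feed, its URLs and the function name `parse_feed_items` are made up for the example; a real feed from Apple's generator may use different element names or namespaces, so treat the paths as assumptions to adapt.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed used only for illustration. A real feed will
# have its own element names and namespaces, so adjust the lookups below.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example top apps</title>
    <item><title>Example App One</title><link>https://example.com/app/1</link></item>
    <item><title>Example App Two</title><link>https://example.com/app/2</link></item>
  </channel>
</rss>"""


def parse_feed_items(xml_text):
    """Return a list of (title, link) tuples from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items
```

In a real crawler you would download the feed text first and then pass it to `parse_feed_items`.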

Public information available:

– Downloads: NO
– User rating count: YES
– App details (screenshots, app name, bundle ID, developer, …): YES
– Price: YES
– AverageRating: YES

Android Market

With Google's platform there is just one way: crawling and scraping the Android Market public pages. There are a lot of implementations on GitHub, but they all do the same thing (crawl public pages). ALL APPS ARE AVAILABLE TO DOWNLOAD from every country/store, but you need an IP address from a given country to crawl and scrape that country's store. Google detects your country from your IP and shows you only the apps matching it.
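One way to handle the per-country IP requirement is to keep one proxy per country and route requests through it. The sketch below uses the standard library's `urllib.request.ProxyHandler`; the proxy addresses, the `COUNTRY_PROXIES` table and the function name `opener_for_country` are hypothetical placeholders, not real endpoints.

```python
import urllib.request

# Hypothetical proxy endpoints, one per country you want to crawl.
# Replace these with proxies you actually control or rent.
COUNTRY_PROXIES = {
    "es": "http://proxy-es.example.com:8080",
    "us": "http://proxy-us.example.com:8080",
}


def opener_for_country(country_code):
    """Build a urllib opener that sends HTTP and HTTPS traffic through
    the proxy configured for the given country code."""
    proxy_url = COUNTRY_PROXIES[country_code]  # KeyError if not configured
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

With a working proxy configured, something like `opener_for_country("es").open(url, timeout=10)` would fetch the page as seen from that country.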

Public information available:
– Downloads: YES (only as a range, and a big one)
– User rating count: NO
– App details (screenshots, app name, bundle ID, developer, …): YES
– Price: YES
– AverageRating: YES
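Since the two stores expose different fields, one option is to store scraped apps in a single record type where the platform-specific fields are optional. This is only a sketch: the class name `AppRecord`, the field names and the helper `parse_downloads_range` are my own, and the range text format ("1,000 - 5,000") is an assumption about how the page might show it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AppRecord:
    store: str                 # "apple" or "android"
    app_name: str
    bundle_id: str
    developer: str
    price: float
    average_rating: float
    # Available on the Apple side only, per the lists above.
    rating_count: Optional[int] = None
    # Available on the Android side only, and only as a (low, high) range.
    downloads_range: Optional[Tuple[int, int]] = None


def parse_downloads_range(text):
    """Parse a range such as "1,000 - 5,000" into (1000, 5000).
    The text format is an assumption, adapt it to what you scrape."""
    low, high = text.split("-")
    return int(low.replace(",", "").strip()), int(high.replace(",", "").strip())
```

Leaving the missing fields as `None` keeps it explicit that the value was not available, rather than zero.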

Both
– On both platforms, app revenue data is only available from the developer dashboard, so you need a login and password to scrape this information. That is how App Annie works. There is no other way.
– The problem on both platforms is that they promote the top apps very heavily, but it is impossible to find all apps, because the unpopular ones do not even show up on the public pages.
– The number of apps in each category varies a lot.
– The problem with crawling big sites like these is that you will need some proxies and some engineering to get past their abuse controls. The solution is easy: do not abuse their servers. Use randomized proxies and delays between calls and all will be OK.
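The last point (random proxies plus delays between calls) could look roughly like the sketch below. The function `polite_crawl` and its parameters are hypothetical, and `fetch(url, proxy)` is a function you supply yourself to do the actual HTTP request; the delay values are arbitrary defaults, not recommended limits.

```python
import random
import time


def polite_crawl(urls, fetch, proxies, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch each URL through a randomly chosen proxy, pausing for a
    random delay between consecutive calls.

    `fetch(url, proxy)` is a caller-supplied function (hypothetical here)
    that performs the HTTP request and returns the response body.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Random wait so that requests are not evenly spaced.
            sleep(random.uniform(min_delay, max_delay))
        proxy = random.choice(proxies)
        results[url] = fetch(url, proxy)
    return results
```

The `sleep` parameter exists only so the delay can be swapped out when testing; in normal use you leave it at the default.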

I hope these 200 words help clarify something. In any case, you can contact me by email.

contacto@phpninja.info
 

 

Beto López
"Full stack" web developer focused on maintenance and bug fixing. WordPress, PrestaShop, HTML, CSS, JavaScript, PHP and MySQL. Also an open source collaborator. LinkedIn and Twitter.

