Mike Ellis from Eduserv, talking about getting data out of web pages.
Scraping basically allows you to extract data from web pages – and then you can do stuff with it! Some helpful tools for scraping:
- Yahoo! Pipes
- Google Docs – use the importHTML() function to bring data in, then manipulate it (a rough Python equivalent is sketched just after this list)
- dapper.net (also mentioned by Brendan Dawes)
- YQL
- HTTrack – copy an entire website so you can do local processing
- hacked search – use Yahoo! search restricted to a single domain – essentially lets you crawl that domain and extract data via the search results
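If you'd rather script the importHTML() step than stay inside a spreadsheet, here is a minimal Python sketch of the same idea – the URL is a placeholder, and pandas is just one of several libraries that can do this:

```python
# Rough Python equivalent of Google Docs' importHTML(): pull the HTML tables
# from a page into structures you can manipulate. The URL is a placeholder --
# point it at whatever page you are scraping. pandas.read_html() needs
# lxml or html5lib installed to do the parsing.
import pandas as pd

url = "http://example.com/some-page-with-a-table.html"   # placeholder URL
tables = pd.read_html(url)    # one DataFrame per <table> found on the page
print(tables[0].head())       # first table, first few rows
```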
So, once you’ve scraped your data, you need some tools to ‘mung’ it (i.e. manipulate it):
- regex – regular expressions are hugely powerful, although they can be complex – see some examples at http://mashedlibrary.ning.com/forum/topics/extracting-isbns-from-rss and the sketch after this list
- find/replace – you can use any scripting language, or even Word (I like to use TextPad)
- mail merge (!) – if you have data in Excel, Access, CSV etc. you can use mail merge to output it wrapped in other text – e.g. HTML markup
- HTML removal – various functions are available for stripping tags out of text
- HTML Tidy – http://tidy.sourceforge.net – you can chuck in ‘dirty’ HTML – e.g. cut and pasted from Word – and it will tidy it up
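To make the regex point concrete, here's a small Python sketch in the spirit of the ISBN thread linked above (the sample text and the pattern are my own illustrations, deliberately loose – not code from that thread):

```python
# A small munging sketch: strip HTML tags with a regex, then pull out anything
# that looks like an ISBN-10 or ISBN-13. The sample text is made up and the
# pattern is deliberately loose -- tighten it before using it on real data.
import re

raw = '<item><title>Example book</title><description>ISBN 9780316066525, also 0-316-06652-6</description></item>'

text = re.sub(r'<[^>]+>', ' ', raw)                            # crude tag removal
isbns = re.findall(r'\b(?:97[89][-\s]?)?(?:\d[-\s]?){9}[\dXx]\b', text)
print(isbns)                                                    # ['9780316066525', '0-316-06652-6']
```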
Processing data:
- Open Calais – a service from Reuters that analyses a block of text for ‘meaning’ – e.g. if it recognises the name of a city it can return information about that city such as latitude/longitude
- Yahoo! Term Extraction – similar to Open Calais – submit text/data and get back various terms – also allows tuning so that you can get more relevant results back (the basic submit-text-get-terms pattern is sketched after this list)
- Yahoo! Geo – a set of Yahoo! tools for processing geographic data – http://developer.yahoo.com/geo
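Both Open Calais and Yahoo! Term Extraction work on the same basic pattern: POST a block of text, get structured terms or entities back. Here's a minimal Python sketch of that pattern against a purely hypothetical endpoint – the URL, parameter names and JSON response shape below are assumptions, so substitute the real API details of whichever service you end up using:

```python
# Generic submit-text-get-terms sketch. ENDPOINT, the parameter names and the
# JSON response shape are all made up for illustration -- replace them with the
# real details of whichever extraction service (Open Calais, Yahoo!, etc.) you use.
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://example.com/term-extraction"   # hypothetical service URL

def extract_terms(text, api_key="YOUR_KEY"):
    payload = urllib.parse.urlencode({"context": text, "api_key": api_key}).encode()
    request = urllib.request.Request(ENDPOINT, data=payload)   # POST because data is supplied
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode()).get("terms", [])

print(extract_terms("Madonna played a gig in Bath last night"))
```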
The ugly sisters:
- Access and Excel – don’t dismiss these! They are actually pretty powerful
Last resorts:
- Use Freedom of Information – for data you can’t get any other way, submit FoI requests via WhatDoTheyKnow (whatdotheyknow.com)
- OCR stuff (Mike has used http://www.softi.co.uk/freeocr.htm)
- Re-key data – or use Mechanical Turk to get people to do it for you?