Mike Ellis from Eduserv, talking about getting data out of web pages.
Scraping basically allows you to extract data from web pages – and then you can do stuff with it! Some helpful tools for scraping:
- Yahoo! Pipes
- Google Docs – use the importHTML() function to bring data in, then manipulate it (a rough Python equivalent is sketched just after this list)
- dapper.net (also mentioned by Brendan Dawes)
- YQL
- HTTrack – copy an entire website so you can do local processing
- hacked search – use Yahoo! search restricted to a single domain – essentially lets you crawl that domain and extract data via the search results
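If you'd rather script the importHTML() step than stay inside a spreadsheet, here is a minimal Python sketch of the same idea – the URL is a placeholder, and pandas is just one of several libraries that can do this:

```python
# Rough Python equivalent of Google Docs' importHTML(): pull the HTML tables
# from a page into structures you can manipulate. The URL is a placeholder --
# point it at whatever page you are scraping. pandas.read_html() needs
# lxml or html5lib installed to do the parsing.
import pandas as pd

url = "http://example.com/some-page-with-a-table.html"   # placeholder URL
tables = pd.read_html(url)    # one DataFrame per <table> found on the page
print(tables[0].head())       # first table, first few rows
```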
So, once you’ve scraped your data, you need some tools to ‘mung’ it (i.e. manipulate it):
- regex – regular expressions are hugely powerful, although they can be complex – see some examples at http://mashedlibrary.ning.com/forum/topics/extracting-isbns-from-rss and the sketch after this list
- find/replace – you can use any scripting language, or even Word (I like to use TextPad)
- mail merge (!) – if you have data in Excel, Access, CSV etc. you can use mail merge to output it wrapped in other text – e.g. HTML markup
- HTML removal – various functions are available for stripping tags out of text
- HTML Tidy – http://tidy.sourceforge.net – you can chuck in ‘dirty’ HTML – e.g. cut and pasted from Word – and it will tidy it up
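To make the regex point concrete, here's a small Python sketch in the spirit of the ISBN thread linked above (the sample text and the pattern are my own illustrations, deliberately loose – not code from that thread):

```python
# A small munging sketch: strip HTML tags with a regex, then pull out anything
# that looks like an ISBN-10 or ISBN-13. The sample text is made up and the
# pattern is deliberately loose -- tighten it before using it on real data.
import re

raw = '<item><title>Example book</title><description>ISBN 9780316066525, also 0-316-06652-6</description></item>'

text = re.sub(r'<[^>]+>', ' ', raw)                            # crude tag removal
isbns = re.findall(r'\b(?:97[89][-\s]?)?(?:\d[-\s]?){9}[\dXx]\b', text)
print(isbns)                                                    # ['9780316066525', '0-316-06652-6']
```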
Processing data:
- Open Calais – a service from Reuters that analyses a block of text for ‘meaning’ – e.g. if it recognises the name of a city it can return information about that city such as latitude/longitude
- Yahoo! Term Extraction – similar to Open Calais – submit text/data and get back various terms – also allows tuning so that you can get more relevant results back (the basic submit-text-get-terms pattern is sketched after this list)
- Yahoo! Geo – a set of Yahoo! tools for processing geographic data – http://developer.yahoo.com/geo
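Both Open Calais and Yahoo! Term Extraction work on the same basic pattern: POST a block of text, get structured terms or entities back. Here's a minimal Python sketch of that pattern against a purely hypothetical endpoint – the URL, parameter names and JSON response shape below are assumptions, so substitute the real API details of whichever service you end up using:

```python
# Generic submit-text-get-terms sketch. ENDPOINT, the parameter names and the
# JSON response shape are all made up for illustration -- replace them with the
# real details of whichever extraction service (Open Calais, Yahoo!, etc.) you use.
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://example.com/term-extraction"   # hypothetical service URL

def extract_terms(text, api_key="YOUR_KEY"):
    payload = urllib.parse.urlencode({"context": text, "api_key": api_key}).encode()
    request = urllib.request.Request(ENDPOINT, data=payload)   # POST because data is supplied
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode()).get("terms", [])

print(extract_terms("Madonna played a gig in Bath last night"))
```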
The ugly sisters:
- Access and Excel – don’t dismiss these! They are actually pretty powerful
Last resorts:
- Use Freedom of Information – for data you can’t get any other way, submit FoI requests via WhatDoTheyKnow (whatdotheyknow.com)
- OCR stuff (Mike has used http://www.softi.co.uk/freeocr.htm)
- Re-key data – or use Mechanical Turk to get people to do it for you?