IWMW10: HTML5 and friends

The second day of IWMW10 kicks off with Patrick Lauke – he is currently ‘Web Evangelist’ at Opera, and was previously web manager at University of Salford. Slides at http://www.slideshare.net/redux/html5-and-friends-institutional-web-management-workshop-2010

HTML5 is a huge topic – Patrick wants to try to answer the question ‘should I use HTML5 today?’

HTML5 is a ‘woolly’ term – people use it to encompass lots of technologies. However, Patrick is going to concentrate on ‘core’ HTML5, not the other technologies that often get lumped into the HTML5 bucket (the example he gives is geolocation – not part of HTML5, but often referred to when people talk about HTML5) – HTML5 without the hype.

Why are we back talking about HTML – weren’t we all going to be using XHTML? XHTML 1.0 came out in 2000 – the idea was to move to an XML base. We started to see the development of XML-based technologies related to the web – e.g. XForms. At Opera they liked the functionality of XForms, but wanted to be able to introduce the same ideas to older sites still using non-XML markup – and came up with Web Forms 2.0.

In 2004 the W3C started to focus on XHTML 2.0 – but this was not backwards compatible, and the browser companies were not happy. So Mozilla (Firefox), Opera and Apple (Safari) worked together as the ‘Web Hypertext Application Technology Working Group’ (WHATWG). Eventually the W3C proposed bringing the work done by WHATWG back into W3C development – stepping back from XHTML. So in 2007 a W3C working group for HTML5 was set up (including browser vendors).

Ian Hickson (Google), the editor of HTML5, says HTML5 is about “extending the language to better support Web applications […] This puts HTML in direct competition with other technologies […], in particular Flash and Silverlight.”

Patrick says HTML5 does not replace HTML 4.01 (or XHTML 1.0) – it just extends those languages. In general, if you have a valid HTML 4.01 website, simply changing the doctype will result in a valid HTML5 webpage (perhaps with some minor tweaks).

The HTML5 specification is aimed at browser developers – so if you aren’t a browser developer you don’t want to look at it. Authors should instead look at ‘HTML5 differences from HTML4’. HTML5 standardises current browser and authoring behaviour – previously there was a lot of inconsistency in how different browsers dealt with the same code.

The HTML5 doctype just says the document is HTML – it doesn’t include a version etc. So just use <!DOCTYPE html> – in reality this is all browsers generally look at anyway.
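
As a minimal illustration (my example, not from the slides), a complete, valid HTML5 page can be as small as:

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <title>Minimal HTML5 page</title>
    </head>
    <body>
      <p>Hello, HTML5.</p>
    </body>
    </html>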

HTML5 doesn’t care about some of the XML conventions – lowercase tags, double quotes around attribute values, closing empty tags. Patrick emphasises that some of these things can still be good practice – but HTML5 doesn’t care about them.
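
So, for instance, all of these are valid HTML5 (my sketch – the stricter forms remain good practice):

    <!-- uppercase tags and unquoted attributes: allowed -->
    <INPUT TYPE=TEXT NAME=q>
    <!-- lowercase, quoted, XML-style self-closing: also allowed -->
    <input type="text" name="q" />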

HTML5 looked at the most commonly used class names on <div> elements and turned them into elements in their own right – so a <nav> element can now be used instead of <div class="nav"> for navigation – likewise <header> and <article>.
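
So markup along these lines (my sketch of the idea):

    <!-- HTML 4.01 style -->
    <div class="header">...</div>
    <div class="nav">...</div>
    <div class="article">...</div>

    <!-- HTML5 equivalents -->
    <header>...</header>
    <nav>...</nav>
    <article>...</article>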

Lots of new input types and attributes in forms give built-in validation – e.g. <input type="date">. The browser can both validate the value and automatically offer a date picker, rather than this having to be built into the site with javascript. I guess this moves us much more towards the browser as platform – more stuff ‘baked in’ to the browser. One nice example (I think): you can specify a regular expression as a ‘pattern’ for an input – so you can do simple validation of input (plus the pre-defined types like email etc.)
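
For example (a sketch – the field names are mine, and the exact UI, e.g. the date picker, depends on the browser):

    <form>
      <!-- browser can offer a native date picker and validate the value -->
      <input type="date" name="start">
      <!-- built-in validation for pre-defined types like email -->
      <input type="email" name="contact" required>
      <!-- pattern takes a regular expression (this one is illustrative only) -->
      <input type="text" name="code" pattern="[A-Za-z0-9]{4,8}">
      <input type="submit">
    </form>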

HTML5 introduces the <video> element for embedding video – it allows specification of basic player information, like size and whether controls are displayed.

Bringing video in as a native object is important – it ‘plays nice’ with the rest of the page, has keyboard accessibility built in, and has a javascript API for controls. It also means you can style the video player using CSS etc.
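
A small sketch of what ‘native’ buys you – the element can be scripted like any other (the markup and file name here are my assumptions, not from the talk):

    <video id="clip" src="talk.webm" width="480" height="270" controls></video>
    <button onclick="toggle()">Play/pause</button>
    <script>
      // the HTML5 media API exposes play(), pause() and a paused property
      function toggle() {
        var v = document.getElementById('clip');
        if (v.paused) { v.play(); } else { v.pause(); }
      }
    </script>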

However, there is still a big debate about video formats. H.264/MP4 is supported by Chrome, Safari and IE9 – but there are patent issues, and worries that this could lead to royalties down the line. Firefox and Opera support Ogg Theora – no patent/licensing issues, but not very many tools for it – still very geeky.

WebM is a new video format started by Google and released free of patent/licensing issues – it is supported by most major browsers, though IE needs a codec installed, as would Safari (unspoken, but the implication is that Apple are the barrier to agreement on WebM adoption?). However, you can specify a cascade of different video formats in the <video> element – basically ‘use WebM if the browser can, otherwise Ogg, otherwise H.264’, etc.
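
The cascade looks something like this (file names hypothetical):

    <video width="480" height="270" controls>
      <!-- the browser uses the first source it can play -->
      <source src="clip.webm" type="video/webm">
      <source src="clip.ogv" type="video/ogg">
      <source src="clip.mp4" type="video/mp4">
      <!-- fallback content for browsers with no <video> support at all -->
      <a href="clip.mp4">Download the video</a>
    </video>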

Is HTML5 a ‘Flash killer’? Patrick says it is too early to talk about HTML5 replacing Flash – but HTML5 introduces choice – look at the tools, and at what you want to do.

Should you use HTML5 today? Patrick says: if you want to make use of the new features – yes; otherwise you don’t need to rush. However, you could just try changing the doctype – you might already have a valid HTML5 site.

Q & A

Q: How does HTML5 work with/relate to RDFa?

A: A working group is looking at this. Patrick also mentioned ‘microdata’.
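
For context (my illustration – Patrick only mentioned it in passing): microdata is HTML5’s own lightweight way of embedding machine-readable properties in ordinary markup, e.g.:

    <div itemscope>
      <span itemprop="name">Patrick Lauke</span> works at
      <span itemprop="affiliation">Opera</span>.
    </div>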

Q: Why is there not a <content> tag?

A: Good question – Patrick says some of the decisions on elements are slightly odd – e.g. <article> is not going to be relevant in all cases.

Q: How secure is HTML5 for copyright material?

A: It is an issue – for example, it is easier to grab video because it is just referenced in the HTML code. YouTube are experimenting with HTML5 but have said they won’t use it for some types of video – e.g. those with adverts in them, because people could easily write something to skip the ads and go straight to the content – so loss of control is an issue. But then, there is no good way of protecting the content generally.

IWMW10: Are web managers still needed…

Second plenary today is from Susan Farrell – asking ‘are web managers still needed when everyone is a web “expert”?’. Slides at http://www.slideshare.net/iwmw/farrell

Susan asks: why are web managers not valued? Who are these ‘web experts’? Should we be looking at recognised qualifications? How do we show the value we add to the institution? She is thinking of the ‘softer’ skills side – writing for the web, metadata, search, user interface design etc.

Susan asks what is the perception of ‘web people’ – techies in a cellar? Are staff and students aware of what you do?

Web professionals are similar to librarians, Susan suggests – but less respected. Both are engaged in:

  • Organising information
  • Classification
  • Cataloguing
  • Updating users on new resources

Some issues Susan outlines:

  • User research & usability testing – difficult to get funding for this
  • Writing for the web – devolved responsibility can mean the web team can’t affect what is written
  • SEO – not appreciated, yet lots is spent on ‘marketing’ – but SEO is marketing, so why the mismatch?
  • Metadata – no-one is interested (welcome to the library world)
  • Information architecture – expectation that you can just change this to suit specific needs
  • Search – expectation of a good search experience – but…

The experience is that ‘consultants know best’ – institutions often listen to consultants over local expertise.

What is a ‘web professional’?

  • Need broad range of skills and experience
  • Softer skills less recognised
  • No set qualifications
  • Skills being absorbed by other roles

[hmm – I struggle with this in the way that I struggle with the professional status of librarians. For me the key thing about the professional status of librarians is not skills or qualifications – these have changed over time and continue to change. However, the professional ethics are (and should be) more persistent – ‘Concern for the public good’ and other principles listed at http://www.cilip.org.uk/get-involved/policy/ethics/pages/principles.aspx should be true no matter what the technology. So what do ‘web professionals’ stand for?]

Susan saying web professionals have to promote themselves to key audiences…

Susan asking do web professionals need a professional body?

So – are web managers still needed when everyone is a web ‘expert’? Yes – but we need to promote ourselves and be part of the solution in our institutions, not part of the overhead…

Q & A

Q: Martin Moyne – is it really that difficult to justify? There is a survey that says the web site is the No. 1 factor for overseas students making decisions.

A: But perhaps senior management don’t know that?

Q: Careers advice – came across careers advice on a government site for web managers that said this was going to disappear as a career and suggested a move to finance!

A: Perhaps this is where a professional body is needed – make a representation to government etc.

Q: Brian Kelly – what do we need to do?

A: Perhaps not a generic thing – it needs tuning to the institution?

Q: (from me) librarians have a set of professional ethics – do web managers ‘stand’ for anything?

A: Perhaps there are too many routes to being a web manager to say there is a standard approach.

Q: Jeremy Speller – status can’t be gained overnight – it takes a long time.

A: True – but got to start somewhere

IWMW10: The Web in Turbulent Times keynote

Now Chris Sexton, director of IT at the University of Sheffield (where IWMW is being hosted). Chris blogs at http://cicsdir.blogspot.com/ and tweets as cloggingchris.

It is a certainty that ‘we’ (Universities I guess) are going to get less money – the question is how much less. Cuts are going to bite next year, and the year after – not this year.

A letter from David Willetts and Vince Cable to university VCs included a line describing IT projects as “discretionary” – and the suggestion that they should be cut.

Why IT projects? Vince Cable, as shadow chancellor, identified many failing and very expensive public sector IT projects. Web sites are also a target for cuts – the example given is the Business Link website, which cost an incredible £105 million!

The Government is committed to getting the government web back under control. Martha Lane Fox (Digital Champion) is looking at how resources can be shared and how the use of open source software etc. can save money.

It is very easy to see IT as a cost. Chris gets frustrated by the view of ‘IT’ as something separate from the ‘business’ – she says, we shouldn’t have IT projects, we should only have business projects.

Shared services being pushed by government and HEFCE – there are examples of massive savings in parts of public sector especially the NHS. Part of the shared services agenda is around back office systems – e.g. Finance, HR, Payroll.

Chris suggests we already have very good examples of shared services – JANET, UCAS, HESA.

Chris highlights an example from the charity sector – the ‘Just Giving’ website. All charities need people to give money – and Just Giving provides a shared service they can all use to achieve that.

Chris describing how things have changed for IT departments – have to provide access to services on any device – IT don’t control the user platform anymore (if they ever did). Need to provide services to multiple devices/browsers/platforms etc.

User expectations are changing – increasing demand for services and rising student expectations – especially if the student fee cap is lifted. Students are used to easy access to services (e.g. Dropbox) via high-quality interfaces (you don’t need training to use the Tesco website?).

This is a generation of students who grew up with the internet. Not interested in ‘software’ but services. Also a big contrast between the attitude of students and that of senior staff, who often get PAs to print off emails for them to read. Students often describe university systems/services as ‘clunky’.

Lots of overlapping services – everything does everything. Count how many services/pieces of software in your institution let you store a document – Sheffield got into double figures looking at institutional services.

24/7 expectations – the Information Commons at Sheffield operates 24/7 – and if the printers go down on a Saturday afternoon, users expect them to be fixed. How can you support services 24/7 when staff are generally employed on 9–5 contracts? Resilience is key.

A survey by the University of Edinburgh found 100% of students had phones, and 50% had what they would call a ‘smartphone’. The biggest challenge for mobile is the diversity of devices and platforms – an anecdote that to develop an app achieving 70% penetration of the mobile market, they had to test on over 300 devices. Starting to see universities developing apps… the University of Sheffield app had 2000 downloads in the weeks after launch – two thirds of them to the iPod Touch…

Chris relating how they are having to increase their wireless provision to cope with the profusion of devices – many students now connecting with 2 devices – phone and laptop – when they use the network.

Chris now talking about data security – it is only a matter of time before a laptop with large amounts of student data or university finance data is left on a train/cab/bus.

Moving on to legislation – Chris believes the Digital Economy Act is full of problems – and worries that unless the OFCOM consultation clarifies some of this, there are going to be huge problems.

Green IT – IT accounts for 2% of global carbon emissions (possibly – Chris isn’t sure how accurate this is) – the same as the airline industry. Lots can and needs to be done in universities – in areas like:

  • Printing
  • Data centres
  • Video conferencing
  • Power reduction
  • Virtualisation

Sheffield has dropped from 130 servers to 4 – with an approximately 75% reduction in the energy bill!

Need to be much more flexible and agile. The days of 2-year projects are gone – if we can’t deliver in 6 months, we shouldn’t be doing it. Chris quotes ‘Keep It Simple, Stupid’. On ‘shared services’ – we haven’t even got consistency within institutions, never mind between them. In some institutions IT services are distributed – lots of servers in departments/labs/offices; people don’t get best value because they don’t use central procurement; there are security issues on servers that aren’t centrally managed; etc.

Chris stresses – it’s about processes not technology – technology does not solve anything. If the process doesn’t work, no amount of IT will make it work. Responsibility has to be taken by individuals – people have to take responsibility for their own (efficient) use of IT.

Different delivery models:

  • Self service
  • Managed services
  • Outsourcing
  • Out hosting
  • Cloud

Have to focus IT department resources on the key tasks of the university – teaching and research. There are going to be hard choices: outsourced services may not be as good as in-house ones – but which is more important, a good calendar or a good VLE? If Google Docs offers a collaborative environment, why should the university provide one? These are the hard decisions that will need to be made.

The IT department will no longer be the ‘gatekeeper’ – it is going to be a facilitator and educator instead, helping people use systems.

Chris does not believe we can afford to ‘just keep the lights on’ – we have to keep innovating, otherwise we will die as IT departments. Innovation carries a risk – but it is a risk that you need to take. Need to get the balance right – need to get resourcing right. Chris is very clear on the need to continue to invest in innovation.

Q & A

Q: Ben Coulthard – University of Leicester. Lots of changes at Leicester – but not sure they are feeling the benefit. This year no money for innovation – and no money for the web team. What about at Sheffield?

A: May not have funded more – but teams have been protected. Money tends to go into projects as opposed to teams – so it stays flexible. Some comments on the split between marketing and IT – at Sheffield, Marketing and IT work together on the web team (2 from marketing, 2 from IT and 1 across the two).

Sorry – missed the other Q & A, but an interesting stat from Chris: a review at Sheffield suggests only 2% of the IT budget goes specifically to support research – that needs to change in Chris’s view. On a question about how much the iPhone app cost to develop, Chris says she can’t give a proper figure as it was the first time the company they worked with had done one, but she’d estimate £10k.

IWMW 2010: The Web in Turbulent Times

I’m at the ‘Institutional Web Managers Workshop’ in Sheffield for the next few days. I’m running a workshop later today, but I’m really grateful to Statistics into Decisions, who have provided sponsorship for a number of attendees, including me – without this I would have just been turning up for my workshop, then heading home – instead I can attend the whole event, and get a chance to hear the other speakers and talk to the other attendees.

The theme of the workshop this year (it’s the 14th IWMW event) is ‘the web in turbulent times’, and opening the conference Marieke Guy and Brian Kelly (the Chair and Co-Chair of the event) are setting the scene.

Brian reflects on the fact that we’ve enjoyed growth and development in institutional web investment and activity over the last 10 or so years – up until last year, when the future started to look a lot more uncertain. Now we are starting to see cuts being introduced – and these are going to impact on education and the related web community.

During the workshop there is going to be use of various technologies to try things out and build a community. Examples:

  • Use of Twitter – can use the hashtag #iwmw10, as well as hashtags for sessions (this session is #p0), and the #eureka hashtag for moments of realisation
  • Use of QR codes to play a game
  • Use of various technologies including video streams, twitter streams etc. to ‘amplify’ the event and enable participation from those not able to attend in person

What is the purpose of IWMW? Brian and Marieke started to draw up a list – posted at http://iwmw.ukoln.ac.uk/blog/2010/2010/07/the-role-of-iwmw/ and looking for feedback.

SORT – Closing Keynote

From Bill Thompson, Partnership Manager at the BBC Archive

The BBC has an enormous archive – 1 million hours of TV/radio, photographs, over 7 miles of shelves of printed material, artefacts (Tardises, Daleks) – and the job of the archive is to serve the needs of the BBC. But there is growing awareness that there is public value in this material – this is the area where Bill works, in a small team.

The first idea was to put digital material online. However, thinking about how to expose content and add value started to infect thinking across the corporation. They started to think about the online aspects of public space:

  • About the creation of material
  • Exploiting new technology capabilities (e.g. cameras have embedded GPS)
  • Use of content – not just about making programmes, and putting stuff on TV – lots of stuff that never gets broadcast
  • Preservation – complex – wide variety of physical formats

All these things need to happen – even just for internal purposes.

Want to make it as open as possible – but there are issues – may contain personal data, need to think about regulatory regime, commercial issues. These are constraints, although not unreasonable constraints.

Want to think beyond the BBC – think at webscale. Would be foolish to do some of these things just at the scale of the BBC – e.g.

  • location based stuff – need to make sure what the BBC does fits into wider frameworks.
  • Have naming conventions at the BBC, need to look at how this fits with other data outside the BBC.
  • Time – knowing when stuff was done – BBC have said they want to publish a time axis of all programmes ever broadcast – so need to decide how to do this – again not something just the BBC interested in.
  • Want to bring in users – e.g. university academics

BBC is very creative environment. BBC has huge engineering expertise. Engaging seriously with standards efforts etc. For Bill semantic web is a way of building sustainable approaches.

While we’ve been waiting for the semantic web for a long time, it does seem to be getting closer.

We have more processing power than we know what to do with. We have connectivity. We can store data.

What we don’t have are the tools and intelligent agents that allow us to apply reasoning across data. We are talking about Artificial Intelligence (AI) – and that comes with very big problems. We haven’t made much progress with AI.

—————-

OK – I admit, at this point I kind of lost track – Bill delved into some of the challenges for creating Artificial Intelligence, and while I felt I was getting the general points, much of the detail washed over me.

I think that one of the key aspects Bill was highlighting was that some AI researchers believe that you can’t have intelligence without a surrounding environment – it lies in the ability of an entity to interact with its environment, especially in the sense of taking in, and pushing out, information.

I think Bill’s argument was that when you simply develop software-based AI, you don’t have these kinds of interactions, and so you aren’t going to get intelligence. Bill quoted a book on this topic whose details I didn’t manage to get, but it reminded me very much of the arguments put forward by Steven Grand in his book “Growing up with Lucy: How to Build an Android in Twenty Easy Steps”. You can read more about Steven Grand’s approach to AI at http://www.scienceagogo.com/news/cyber_life.shtml. I think a quote from this page that relates to what Bill was saying is:

“Central to the Lucy project is researching how an organism gains and uses knowledge, or more appropriately how the process of acquiring and using data are interconnected attributes.”

However, whereas ‘Lucy’ (the Steven Grand project) takes the approach of building a physical entity that can interact with the physical world, I got the impression that Bill was arguing that the semantic web could create an environment in which a more purely software based AI could interact.

Hope that makes sense. It was an interesting talk, and a mind stretching way of closing a very interesting and challenging conference.

SORT – Panel discussion

Q: What are the business models – how do we make this sustainable?

A: (Mike Ellis) Some of this activity can reduce costs – so not a revenue stream, but cheaper ways to do stuff. Requires creative thinking – need to talk to marketeers and communications specialists. E.g. the National Gallery partnered with a commercial company to produce an iPhone app – which was sold.

A: (Dan Greenstein) We are going to have to take money away from existing activities – e.g. University of California is now boycotting Nature due to price increases. Need to make sure those things that don’t work go away.

A: (Jo Pugh) Some of this stuff just ‘has to be done’ – freeing our data might be like preservation – doesn’t make us money, but we do it.

A: (Andy Neale) Some of this can be done as part of ‘business as usual’ – tacks on to existing activity

A: (Mike Ellis) Income from protecting some of this stuff (e.g. picture libraries selling use of pictures) is not that great – and there are costs with things like chasing copyright etc.

A: (Jo Pugh) V&A changed rules over what they could do – became more permissive and revenue went up, and they were able to reduce staff

A: (Dan Greenstein) Some publishers are protecting backlists in anticipation of a revenue stream that isn’t available – yet there are revenue streams they could realise.

A: (Stuart Dempster) Look at the Ithaka case studies – real-world financials on how much it costs to operate different types of digital services – to be updated this year, so we will be able to see the impact of the downturn. Also recommends looking at government technology policy – it will be a source of innovative practice. We are now seeing funders requiring exit strategies for projects from day 1.

Q: (Sally Rumsden) Interested in metadata. What is ‘good enough’ metadata? What should we be doing to make sure people can find stuff reliably?

A: (Andy Neale) DigitalNZ took any metadata – some items only have a title, and even some of those titles are ‘title unknown’ – but even this is a hook. When you start pushing this into resource discovery systems with faceting etc. you start to expose the quality of the metadata – this can highlight to contributors problems they didn’t appreciate, and result in improvements over time. Even if you only have a title – it can still be useful…

A: (Tom Heath) Good enough is in the eye of the beholder. We can’t anticipate. However, you could flag stuff you aren’t happy with so it is clear to users

A: (Dan Greenstein) Metadata enhancement adds to cost. Specialist materials have a particularly high cost, and deliver value to a small number of people. We aren’t good at saying ‘no’ to stuff. We have to be clear about what we can afford – we have to model the costs of projects more effectively.

A: (Liz Lyon) Trove project (in Australia) using crowdsourcing to improve metadata

A: (Balviar Notay) Some projects already looking at how text mining tools can enhance metadata for digital repositories – although perhaps unlikely to solve all problems so likely to see mixed manual and automatic approaches

A: (David Kay) If you put stuff out, the evidence from the Internet Archive and others is that authors become motivated to improve it and add more. But you have to get it out there first.

A: (Mike Ellis) Look at Powerhouse Museum collection – using OpenCalais to generate tags. Picasa starting to add object recognition – some automated tools improving, but still some issues. Also look at Google tagging game, V&A also doing user engagement to generate content from humans

A: (Peter Burnhill) Think about what metadata already exists, and reuse or leverage that data.

SORT – Getting your attention

David Kay is going to talk about ‘attention data’ – what users are looking at or showing interest in – and also how it relates to user-generated content, as he is starting to believe that attention data is key to getting user engagement.

The TILE project looked at library attention data – could it inform recommendations for students? David mentions well-known ‘recommendation’ services – e.g. Amazon – and the physical-world equivalents – Clubcard and Nectar card data informing marketing etc.

David Pattern at University of Huddersfield – “Libraries could gain valuable insights into user behaviour by data mining borrowing data and uncovering usage trends.”

Types of attention data:

  • Attention – behaviour indicating interest/connections – such as queries, navigation, details display, save for later
  • Activity – formal transactions such as requesting, borrowing, downloading
  • Appeal – formal and informal lists – types of recommendations – such as reading lists – can be a proxy for activity
  • And …

We could concentrate and contextualise the intelligence (patterns of user activity) existing in HE systems at institutional level whilst protecting anonymity – we know which institution a user is in, what course they are on, what modules they are doing. This contextual data is a mix of HE ‘controlled’ (e.g. studies, book borrowing), user controlled (e.g. social networks) and some automatically generated data.

The possibility of a critical mass of activity data from ‘day 1’ brings to life the opportunity and motivation to embrace and curate user contributions – such as ratings, reviews, bookmarks, lists. To achieve this we need to make the barriers to contribution as low as possible.

What types of questions might be asked? Did anyone highly rate this textbook? What did last year’s students download most?

At what level is this type of information useful – institutional, consortial, national, international?

MOSAIC ran a developer competition based on usage data from the University of Huddersfield (see Read to Learn). 6 entries fell into three areas:

  • Improving Resource Discovery
  • Supporting learning choices
  • Supporting decision making (in terms of collection management and development)

However – some dangers – does web-scale aggregation of data to provide personalised services threaten the privacy of the individual? David says they believe not, as long as good practice is followed. We need to be careful but not scared off. There are already examples:

  • California State University show how you can responsibly use recommendation data
  • MESUR project – contains 1 billion usage events (2002-2007) to drive recommendations

In MOSAIC project, CERLIM did some research with MMU (Manchester Metropolitan University) students – 90% of students said they would like to be able to find out what other people are using:

  • To provide a bigger picture of what is available
  • To aid retrieval of relevant resources
  • To help with course work

CERLIM found students were very familiar and happy with the idea of recommendations – from Amazon, eBay etc.

University of Huddersfield have done this:

  • suggestions based on circ data – people who borrowed this also borrowed…
  • suggestions for what to borrow next – most people who read book x, go on to read book y next

Impact on borrowing – when recommendations were introduced into the catalogue, there was an increase in the range of books borrowed by students, and the average number of books borrowed went up – really striking correlations here.

Also done analysis of usage data by faculty – so can see which faculties have well used collections. Also identify low usage material.

Not only done this for themselves – released data publicly.

Conclusion/thoughts from a recent presentation by Dave Pattern:

  • serendipity is helping change borrowing habits
  • analysis of usage data allows greater insights into how our services are used (or not)
  • would national aggregation of usage data be even more powerful?

Now David (Kay) moving onto some thoughts from Paul Walk – should we be looking at aggregating usage data, or engaging with people more directly? Paul asks the question “will people demand more control over their own attention data?”

Paul suggests that automated recommendation systems might work at undergraduate level, but in academia we need to look beyond automatic recommendations – because it is ‘long tail all the way’. Recommendations from peers/colleagues are going to work much, much better.

David relates how user recommendations appear on BitTorrent sites and drive decisions about which torrents to download. Often very small numbers – but sometimes one recommendation (from the right person) can be enough. You don’t necessarily need huge numbers – quality is powerful.

Q & A

Comment: (Dan Greenstein) At Los Alamos they use usage data to help researchers moving between disciplines (interdisciplinary studies) – a fast way of getting up to speed.

Comment: (Liz Lyon) Flips peer-review on its head – post review recommendation – and if you know who is making that recommendation allows you to make better judgements about the review…

Comment: (Chris Lintott) Not all ‘long tail’ – if you aggregate globally – there are more academic astronomers than there are undergraduates at the University of Huddersfield.

Comment: (Andy Ramsden) The motivation of undergraduates changes over time – assessment changes across years of study – and groups of common study become smaller. Need to consider this when thinking about how we make recommendations.

Q: (Peter Burnhill) Attention data is about ‘the now’ – what about the historical perspective on this?

A: There are examples on BitTorrent sites of older reviews being preserved – in some cases the material posted is 30 years old – so yes, it is important.

SORT – Working the crowd: Galaxy Zoo & the rise of the citizen scientist

I’ve been looking forward to this session by Chris Lintott on Galaxy Zoo.

As our ability to get information about the universe has increased, we are challenged to deal with larger and larger amounts of data. In astronomy this is driven by the availability of high-resolution digital imaging etc. – whereas 20–30 years ago you could get collections of hundreds of galaxies, now you can get collections of millions.

Analysis of galaxy images is about looking at the shape of the galaxy. While machine approaches have been developed, they typically have only 80% accuracy. Humans, however, are very good at this type of task. This used to be a task students would do – but the amount of data far outstripped the students’ ability to keep up.

In astronomy there is a long tradition of ‘amateurs’ taking part and spotting things that may not be spotted by professionals. However, contributions have generally been around data collection – with the data then passed to experts for analysis. Galaxy Zoo reverses this – the data collection has been done, and the public are asked to analyse the data.

Galaxy Zoo was meant to be a side project – but it was picked up by the media, specifically the BBC News website, and the sudden burst of publicity gave it a huge boost. However, the first thing that happened was the server went down – and 30,000 emails arrived telling them that the server had gone down. Luckily they were able to get it back up and running quickly.

After 48 hours they were classifying as many galaxies in 1 hour as a student had previously done in a month.

They found that getting many people to do the classification improves accuracy – even over professional astronomers. They took away all barriers to participating, to get as many people involved as possible – originally there was a ‘test’ for users, but they took this away.

The huge side effect is that humans can spot unexpected stuff without being told – much better than machines.

They also built a community around the people participating – and this community is now starting to solve problems. E.g. the discovery of small green galaxies – the community started to analyse them, recruited a programmer to interrogate the data, and this eventually resulted in a published paper – these objects had been known since the 1960s but never analysed. None of the people in the group were scientists.

When they’ve talked to users of the site the overwhelming reason for taking part is that they want to do something useful – want to contribute.

We have a responsibility not to waste people’s time – the collective manpower on Galaxy Zoo 2 was equivalent to employing a single person for 200 years – we cannot take this lightly.

Don’t make promises you can’t keep – e.g. don’t offer ‘free response’ that you then can’t actually read – Galaxy Zoo handles this via the online community forums.

Chris describes three strands of engagement with users

  • Known knowns
  • Unknown unknowns
  • Known unknowns

There is now a JISC-funded project to convert information from old ships’ logs – because they contain climate data.

Chris shows pages of ships’ logs, highlighting:

  • key data you should extract (known knowns – the stuff the researchers know they want from the logs, like weather reports)
  • unexpected things you might spot (unknown unknowns – stuff you might spot in the logs – pictures, unexpected information)
  • expected things, but not known how much (known unknowns – events you know will be in there, but not how often, e.g. encounters with other ships)

These strands are generalisable to many projects

Zooniverse takes the generalisable stuff from these projects and provides it as a platform for citizen science.

Can no longer rely on the media to get the message out and drive engagement – ‘it’s on the internet, isn’t it amazing’ is no longer a story – need to work out how we get the next 300,000 people involved [my first thought – games – look at Farmville…]

SORT – Open Science at Genome Scale

Second session of the second day – Liz Lyon from UKOLN

The Open Science at Web-scale report – a consultative document – is, Liz says, now available on WriteToReply (I couldn’t find it at first, but thanks to Kevin Ashley I now have a link: http://writetoreply.org/openscience/)

OK – Liz is talking about the amount of data being generated by genome sequencing machines – we are now into the second generation of genome sequencing, and the next generation being worked on will produce orders of magnitude larger volumes of data.

This type of huge data production brings challenges. Need large-scale data storage that is:

  • Cost effective
  • Secure
  • Robust and resilient
  • Low entry barrier
  • Has data-handling/transfer/analysis capability

Looking at ‘cloud services’ that could offer this – e.g. Nature Biotechnology 10.1038/nbt0110-13 details use of cloud services in biotechnology.

Starting to see data sets as new instruments for science.

Cost of genome sequencing dropping, while number of sequenced genomes rises.

Leroy Hood says “medicine is going to become an information science”. P4 medicine:

  • Predictive
  • Personalised
  • Preventive
  • Participatory

Stephen Friend – chief exec of Sage Bionetworks – wants to develop an open data repository (Sage Commons) to start to develop predictive models of disease – liver/breast/colon cancer, diabetes, obesity.

Paraphrasing a quote Liz read out: To Cultural forces encourage sharing – the way people handle personal data will impact on how researchers deal with data and mean they have not choice to share.

Need to think about ways to incentivise researchers to share data – through mechanisms that allow credit and attribution which will then mean researchers benefit from sharing data.

Need to think about:

  • Scaleable data infrastructure
  • Personal genomics – share your data?
  • Transform 21st Century medicine/bioscience
  • Credit and attribution for data and models

SORT – Digital New Zealand: Implementing a national vision

The second day of ‘Survive or Thrive’ starts with Andy Neale, Programme Manager for DigitalNZ

Going to cover:

  1. Getting started and getting stuff done
  2. Issues and opportunities
  3. Ongoing development and iteration
  4. Strategic drivers and reflection
  5. Things that worked

Andy stresses New Zealand is a different environment to the UK. He thinks its small size may be an advantage, despite smaller budgets (I wonder if there is a lesson here – perhaps trying to do things in the UK at a ‘New Zealand’ scale?).

The first pitch to collaborators didn’t push a ‘national vision’ very hard. It was an invitation to contribute to a ‘Coming Home’ programme focussed on content relevant to Armistice Day (World War I). Collaborators were asked to sign up to a series of 4 projects:

  • Search widget
  • Mashup
  • Remix experience
  • Tagging demonstration

Search widget – pulls together a simple search across relevant material in New Zealand archives etc. and makes it possible to embed that search into any web page. As well as the search widget, they built a fuller ‘search’ experience (it sounds like it is based on Solr) – simply a different presentation layer on the same service/content as the widget.
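
I didn’t capture the actual embed code, but widgets like this typically boil down to a placeholder element plus a script include – something along these lines (entirely hypothetical markup, not DigitalNZ’s real snippet):

    <!-- hypothetical embed snippet – the real DigitalNZ code will differ -->
    <div id="digitalnz-search"></div>
    <script src="http://example.org/search-widget.js"></script>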

As a demonstrator of what opening up the data enabled, they added an API to the material, and used it to build a timeline display using the timeline tool from MIT (Simile).

Memory Maker – made it possible for users to remix content as a video and submit it back to the site – although people were excited by the technology, it was making the content available that made it possible.

Finally joined Flickr Commons to try out user tagging of content.

These four small projects laid the foundations for something bigger. When building the search widget, although it focussed on World War I material, they asked ‘would you mind us taking everything?’ – and in most cases there was no objection, and no extra cost.

The infrastructure was built to be scalable and extensible, so the move from ‘Coming Home’ to DigitalNZ was a small step – really mainly about presentation, because they already had the content (metadata).

They made a form for ‘collection creation’ – allowing a search to be built over a subset of the whole collection, based on various criteria – e.g. a keyword search. The application then set up a widget that you could paste into your web page, plus a pointer to the fuller ‘search page’ for the subcollection. This was used to create the Coming Home subcollection – but the tool was also opened up to anyone, so any member of the public could build their own subcollection, complete with search widget and web page (I think this is genius!)

The API that was used for the timeline mashup described above was documented and opened up to anyone – and people started to build stuff, although Andy feels it is early days for this, and more could be done (and he hopes it will be).

The next stage was to start talking about this as a national initiative – but all the pieces were in place.

Timeline for project was:

  • May 08 – governance approval for concepts
  • July 08 – Began s/w development activity
  • Nov 11th 08 – Launch of Coming Home projects (Armistice Day provided an absolute deadline for the project!)
  • Dec 1st 08 – Launch of full aggregation, custom search and API service

It took a long time to negotiate and discuss before May 2008 – but the Armistice Day deadline focussed minds – they had to get agreement, move forward and do something.

There is the ‘vision’ – but Andy says even as programme manager the vision feels like something used to secure funding – so the vision was used to inform the team mission:

Helping people find, share and use New Zealand digital content

This mission was how the team described it – so they had ownership over the concept.

DigitalNZ wasn’t the first initiative to do something like this in New Zealand – so you need to say how you are different – there is always a lot of history. Andy compares a previous aggregation in NZ to Europeana – with agreed standards etc. that contributors need to sign up to. DigitalNZ decided not to have standards – they would take what they could get! It is important to bring other initiatives, and those involved in them, along with you, and get their support as far as possible.

DigitalNZ had limited development capacity – so they teamed up with 3 vendors whose expertise overlapped, covering:

  • User experience design
  • Front-end development
  • Search infrastructure

Because of overlap between vendors, each could lead in an area, but could backfill in other areas where necessary.

DigitalNZ depended completely on an agile development methodology – specifically Scrum – this approach means you deliver real working software every two weeks, which makes it clear to collaborators that you are getting stuff done – they can see real progress.

They knew from the beginning that branding would be an issue – but perhaps underestimated how much of one. Andy says all organisations have egos – and others see them in specific ways. So although the initiative was led by the National Library of New Zealand, this is not the up-front branding. This means it is seen more as a true collaboration of equals.

They had to make barriers to entry low, as there was no money for collaborators to take part. One of the things they did was to accept metadata in any format and of any quality. That could mean scraping websites etc., then dealing with the issues arising later – not making it the collaborators’ problem – otherwise many simply wouldn’t be able to participate.

In some cases they got an enthusiastic initial response from partners, but things could then get bogged down in local internal discussions. Again, the initial Armistice Day deadline meant decisions were made more quickly.

After the launch of the services, they have stayed in project mode – Andy says the phrase “business as usual” is unacceptable! So they are still doing 2-week development cycles using the same Scrum methodology etc. You need to think about this as you plan projects on a national scale.

The next step after getting the metadata and search going was to look at how to create digital content – digitising content (unavailability of content in digital format is usually the biggest barrier to access). They set up the ‘Make it Digital’ toolkit – advice for those wanting to digitise, which also includes a voting tool for the public to suggest material for digitisation.

They took the search widget/search page creation tool and are starting to apply it to richer content – they can launch a new collection instance in 2 hours (not including graphic design) – wow!

They are now also running a Fedora instance to host content that hasn’t got anywhere else to live – e.g. for organisations that can’t run their own repository.

Now grappling with:

  • The focus of content to be included in DigitalNZ Search – should it start to pull in relevant content from bodies outside NZ?
  • The balance of effort between central and distributed tools – focus is more on distributed approach – then local organisations can do marketing etc.
  • The balance of effort split between maintenance of existing solutions and development of new solutions – challenged to grow services without any more money
  • The availability of resource to fund digitisation consultancy, workshops and events – often what is needed is money but this is not available

What has worked for DigitalNZ?

  • Have the team articulate vision
  • Start with small exciting projects
  • Be clear about your points of difference (to other projects in same space)
  • Lower barriers to participation
  • Use branding and design to inspire commitment – a lot of effort goes into making what they do look good
  • Invest time building strong relationships with collaborators
  • Have deadlines that can’t be moved
  • Once you are in the door you can up-sell the initiative
  • Build for reuse and extensibility
  • Iterate in small fast cycles – Andy says he can’t recommend this enough – better to do 2 days of requirements analysis and then deliver something, then iterate again
  • Have lightweight governance and a team of experts
  • Get on with it and refactor in response to change

Q & A

Q: (Jill Griffiths, CERLIM) How many people have responded and engaged with the ability to suggest content for digitisation?

A: About 100 items have been nominated – and the most popular item has about 600 votes. They also have a weighted scorecard to help organisations decide on priorities – one aspect of the scorecard is user demand, which is where this tool informs decisions. They are also doing work on microfunding digitisation – e.g. NZ$10k.

Q: (?) How did you cope with jealousy from the big collections (e.g. Turnbull)?

A: Not a problem. They were building on previous initiatives, so many of the concepts were not new and had already been agreed – e.g. exposing metadata through other platforms. Education is part of the process.