INF11 – Activity Data incubation workshop 2

This blog post is written on behalf of JISC.

David Kay from Sero Consulting kicking off this second part of this workshop, reporting on the JISC Activity Data Workshop which was held on 14th July 2010 (http://www.jisc.ac.uk/events/2010/07/businessintelligence.aspx). David says notes from the day are available at http://ie-repository.jisc.ac.uk/486/

David skimming over what was presented on the day including:

  • the differences between HE and the commercial sector use of Activity Data – and what HE can learn from the commercial sector
  • that sometimes things seem ‘too difficult’ – but look at what you can achieve
  • that even small amounts of activity data can show interesting results
  • that where we have concerns about ethical issues around activity data – we don’t have to do things we feel are unethical
  • cultural fear is endemic so we need to demystify the subject matter – there is a lack of case law in this area, and online activity data falls under ‘personal data’ in terms of the Data Protection Act (1998) – however, use of personal data can be defended in some uses for corporate benefit, and anonymised data may not count

The event had a series of 5 debates:

  • Why should WE do it?
  • Love Data, Hate Silos
  • Local, National or Global
  • Appropriate & Inappropriate Use
  • Attention, Activity, Rating, Review – Where to stop?

Full notes from these debates are available in the event write-up.

David notes that while bids under the Activity Data strand may focus on specific types of activity data, they may need to draw on data from several systems – for example, not all the data (such as the course of study a student is on) will be in the library system, so bidders need to think about all the systems that will contribute to the activity data picture.

Finally in this session, a panel session made up of David Pattern (University of Huddersfield), Graham Stone (University of Huddersfield), Joy Palmer (MIMAS), Ross MacIntyre (MIMAS), Mark Stubbs (MMU). I’ve recorded (hopefully) the spirit of the questions and answers – not verbatim, and answers may have come from more than one person on the panel:

Q: Interested in idea of involving a PhD student in the project (at MMU) – how much data do you need to do this kind of stats analysis?

A: PhD student able to do analysis – looking at showing that VLE/MLE was actually worth using – to convince lecturers etc. Notes that going to start tagging things with module code – reading lists, exam papers etc.

Q: What are the issues around exposing some of this data to potential embarrassment of some people – e.g. Huddersfield showing that for some courses the students don’t use library

A: Some data removed because might result in identification of individuals – e.g. for very small courses don’t want to publish information about attainment and borrowing. But for other cases courses have welcomed data – way of getting students to engage with library
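The 'leave out very small courses' idea can be sketched in a few lines of Python. This is only an illustration and not how Huddersfield did it: the threshold of 5 students, the function name and the (course code, loans) record layout are all assumptions of mine.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 5  # hypothetical threshold; no figure was given in the session


def publishable_course_stats(records, min_group_size=MIN_GROUP_SIZE):
    """Return average loans per course, omitting courses that are too small.

    `records` is an iterable of (course_code, loans) pairs, one per student.
    """
    loans_by_course = defaultdict(list)
    for course_code, loans in records:
        loans_by_course[course_code].append(loans)

    stats = {}
    for course_code, loans in loans_by_course.items():
        if len(loans) < min_group_size:
            continue  # too few students: publishing could identify individuals
        stats[course_code] = sum(loans) / len(loans)
    return stats
```

Courses below the threshold simply do not appear in the output, so nothing about them can be published by accident.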

Q: How much data is needed to do good recommendations

A: David Pattern relating that, using an OpenURL resolver, even a couple of months’ worth of data was enough to start doing reasonable recommendations – so not necessary to have data collected over a long period of time

Q: Is relating the type of data we can capture (e.g. loans, e-resource download) to actual usage a problem? Especially for e-journals?

A: All of this data is ‘access data’ rather than ‘usage data’. But still useful. Big issue interpreting a ‘zero access’ statistic – all zeroes merit closer inspection! E.g. lack of use of library could mean larger use of bookshop

A: Need to look for explanations for unusual activity data – use the activity data to find anomalies – and then investigate, run focus groups, etc.

Q: Can we go beyond collecting data from a single system – and start ‘scrobbling’ in the way last.fm does? Is information from a single system interesting in its own right?

A: Definitely lots of interesting data outside central systems – so have to be clear about this. However can still be interesting, but keep in mind it is only a partial picture. Can see for example bringing together information from primary sources such as might be in a national aggregation of library circulation data, with secondary sources in journals from Mendeley.

Q: Quote from Ken Chad ‘Search is dead … welcome to the age of recommendation’ – true?

A: Recommendation is one way in – but just one way of discovering things – brings serendipity into systems. False dichotomy between ‘search’ and ‘recommendation’ – both of these activities can be either passive or active – so just different pathways to finding content.

Q: What will this add to user experience

A: Better degree possibly!

A: Business case includes better collection management; fostering academic excellence; helping people find stuff; no deadends; better use of existing (free or already paid for) resources.

Q: Who needs to have buy-in to this for the sector?

A: HESA stats would really aid benchmarking – identifying similar institutions for example

Q: What is the nature of the investment that is needed? Is it money? Or just expertise/leadership/risk?

A: Not necessarily expensive to do at a small scale, but possibly at larger scale. Again need buy in (prioritisation) in the institution to do anything at all though – this is perhaps one of the reasons we have not seen lots of institutions following the example of Huddersfield. But there is real tangible payback. Pressures on space are key in many institutions – if you can show impact on space saving – e.g. by enabling disposal of library stock effectively.

Q: Are there serious legal risks?

A: This question posed to the room at large – generally the feeling was that there weren’t serious risks related to use within an institution. Noted that the use to which data is put is an important part of the legal picture – if used for corporate purpose.

A: Noted that University of Minnesota did work on use of ‘affinity strings’ to avoid identification of users (there is high sensitivity around the possibility of being asked to give up user data under the Patriot Act, so pressure to anonymise data, and not keep data unnecessarily) – this was written up in the Code4Lib Journal at http://journal.code4lib.org/articles/501
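As I understand the affinity string idea, usage is recorded against a label describing the kind of user (campus, status, department and so on) rather than against the individual. A very rough Python sketch of that, where the field names and the dotted format are my own assumptions (see the Code4Lib article for the real scheme):

```python
def affinity_string(campus, status, college, programme):
    """Build a dotted group label from non-identifying attributes."""
    parts = [campus, status, college, programme]
    return ".".join(p.strip().lower() for p in parts if p)


def anonymise_event(event):
    """Return a copy of a usage event with the user ID replaced by a group label.

    The keys used here (campus, status, college, programme, resource)
    are hypothetical, chosen only for this illustration.
    """
    return {
        "affinity": affinity_string(
            event["campus"], event["status"], event["college"], event["programme"]
        ),
        "resource": event["resource"],
    }
```

The stored event keeps enough context to drive recommendations for 'people like you' but no longer carries a user ID. A label shared by very few people could still identify someone, so group size would need checking as well.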

David Kay recommends a starting point from a technical perspective is ‘Scaling and productising MOSAIC search and recommendation services’ at http://hedtek.com/?p=371

David emphasises that this is not just about library data, that local institutional value needs to be shown in the bid – even if you are bidding as a consortium or partnership.

INF11 – Activity Data incubation workshop 1

This blog post is written on behalf of JISC.

There are a number of afternoon sessions, and I’m attending perhaps the most structured, which is around the Activity Data strand. This is kicking off with a number of short talks.

Mendeley

This afternoon session is kicking off with Ian Mulvany from Mendeley talking about the work they have done. Mendeley comes out of the tradition of ‘reference management’ in some ways, but adds ‘activity data’ to it. Users of Mendeley can download a desktop client and also host an online library – either of citations, or citations with papers attached.

Mendeley uses the ‘Last.fm’ model of activity data, ‘scrobbling’ information about usage from the desktop client – so as users read, use, annotate papers, this information is captured.

This ‘activity data’ can then be used to build recommendations – so based on papers you have used, or have in your library, Mendeley can start to make recommendations on other papers that may be of interest.

Mendeley has an API which developers are encouraged to use to build applications on.

Now Mark Stubbs from MMU talking about lessons from Activity Data Analysis:

Top tips

  • Common ids essential
  • Snapshotting helpful – things change over time (e.g. data may be removed)
  • Visualisations can be very helpful to understand the data
  • Look at both Quantitative and Qualitative aspects of the data – discuss processes underlying patterns

Experience from two projects:

PhD on evaluating MLEs

Combining information from Student Records system (Agresso/Unit4) and Blackboard WebCT Vista – but had to work around the fact they didn’t use a common identifier
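Mark didn't say how MMU worked around the missing common identifier. One generic approach is a lookup ('crosswalk') table that maps IDs in one system to IDs in the other. A minimal Python sketch, with every name and data shape being my assumption rather than anything from the talk:

```python
def join_on_crosswalk(student_records, vle_activity, crosswalk):
    """Attach VLE activity to student records via an ID lookup table.

    student_records: dict of student_id -> record dict
    vle_activity:    dict of vle_username -> activity dict
    crosswalk:       dict of student_id -> vle_username
    Returns (joined, unmatched_ids).
    """
    joined, unmatched = {}, []
    for student_id, record in student_records.items():
        username = crosswalk.get(student_id)
        if username is None or username not in vle_activity:
            unmatched.append(student_id)  # keep track rather than drop silently
            continue
        joined[student_id] = {**record, **vle_activity[username]}
    return joined, unmatched
```

Returning the unmatched IDs matters: if a lot of students fall out of the join, any pattern found in the remainder may be misleading.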

Random Forest algorithm used to do analysis.

Found interesting results – e.g.

  • Correlation between late night usage of VLE and failure to progress
  • Stopping VLE use early might be a sign of dropping out
  • Diminishing returns for staff input to VLE – stop fiddling
  • Found that ‘document download’ was more important to progression than participation in chat/discussion

MMU undergoing an Institutional Transformation – re-writing undergraduate curriculum – and using information from activity data to help.

David Kay notes that if you are looking at the Activity Data strand of the call, you aren’t limited to library data – VLE/MLE and other activity data is in scope.

MIMAS and Activity Data

Joy Palmer from MIMAS talking about how MIMAS would like to see activity data used in relation to bibliographic data and services. Joy asking why we are still talking about the potential of activity data in libraries – especially after the TILE and MOSAIC projects and further work at the University of Huddersfield demonstrated the value activity data could add. Identifying some barriers:

  • Technical barriers
  • Getting ‘buy in’ across the library & institution

Some questions:

  • Where are the ‘quick wins’?
  • What don’t we know about exploiting Activity Data?

Need to articulate:

  • user demand
  • benefits
  • value
  • sustainability (Joy notes how often sustainability was mentioned this morning during the briefings)

Making a Business Case (Joy says) is key.

In arts and humanities book usage still very high – and will continue. Even Google accept there will be books that will not be digitised in the foreseeable future.

MIMAS conducted some market research and found:

  • Centrifugal searchers [think this meant working out from a central place]
  • Berry-picking from various trails
  • possible Information Literacy issues resulting in dead-ends

Researchers are suspicious about User Generated Content – especially ratings and reviews – but could immediately see the benefits of ‘tacit’ recommendation systems – and are very used to this type of recommendation from Amazon. Joy uses example of how BookGalaxy – the winning entry for the MOSAIC competition – can surface relationships that wouldn’t come from simple searching.

Joy asks ‘What if’?

  • These patterns represented a national aggregation of activity data
  • Users could search the long tail of data

“In humanities research it’s the long tail all the way” – Joy attributes this quote to Paul Walk.

What can this mean?

  • Surfacing and increasing usage of hidden collections (and demonstrating value)
  • Providing new routes to discovery based on user and disciplinary context (not traditional classification)
  • Powering ‘centrifugal searching’ and discovery through serendipity
  • Enabling new, original research – academic excellence

We can make data work harder to solve other problems – e.g. what you can let go from your collections (collections management)

Could this be a ‘virtuous circle’? Can ‘activity data’ be ‘open’ in the same way we might aspire to for bibliographic data?

Publisher and Institutional Repository Usage stats

Paul Needham from Cranfield University presenting on this – PIRUS2 which was a continuation of the original PIRUS project (originally led by COUNTER).

Paul notes the rise in interest in article-level usage – more journal articles hosted by institutional and other repositories, and online usage becoming an accepted measure of article and journal value – and technical and standards development (e.g. COUNTER) make it possible to track usage at article level.

However COUNTER had focussed on usage stats at the journal title level – so PIRUS2 aims to develop COUNTER compliant usage data and stats at the individual article level. Also to create guidelines to enable the sharing or production of standardised usage data and reports.

PIRUS2 is developing a model for a real-world article-level publisher/repository usage statistics service, and aims to develop a suite of free, open access programmes to support the generation and sharing of COUNTER-compliant usage data and stats.

PIRUS2 has three scenarios for gathering data:

  • ‘tracker’ code – a server-side ‘Google Analytics’ for full-text article downloads
  • OAI-PMH harvesting – as this is supported as standard by major repository software
  • SUSHI – Standardized Usage Statistics Harvesting Initiative Protocol – which publishers already use – however, doesn’t currently support article level stats, although it will do in the future
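For anyone who hasn't seen OAI-PMH, a harvest is just an HTTP request with a `verb` and a few parameters. The Python sketch below builds a standard `ListRecords` request. The repository address is made up, and Paul didn't say which metadata format PIRUS2 harvests, so `oai_dc` here is only the protocol's baseline format and not necessarily what the project uses.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    `from_date` (YYYY-MM-DD) and `set_spec` are the optional 'from' and 'set'
    arguments defined by the protocol for selective harvesting.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)
```

A harvester would fetch this URL, parse the XML response, and keep following any `resumptionToken` it contains until the list is complete.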

PIRUS2 has already developed plugins/extensions for DSpace, EPrints and Fedora. Currently gathering data via tracker from 6 repositories.

Paul notes that the technical side is relatively easy, but the ‘political’ side more challenging. This involves getting involvement and agreement from publishers, institutions and other stakeholders. Clearly there are sensitivities around, for example, the promotion of institutional repositories when working with some publishers.

More information on PIRUS2 at http://www.cranfieldlibrary.cranfield.ac.uk/pirus2/tiki-index.php

Activity data at University of Huddersfield Library

David Pattern and Graham Stone talking about the work carried out at University of Huddersfield using library system activity data to drive a number of services:

  • Recommendations (e.g. people who borrowed this also borrowed)
  • Personalised Recommendations (e.g. what to borrow next based on your loan history)
  • Keyword search cloud – based on what people were searching for – found originally approximately a quarter of searches found zero results – so implemented spellchecker!
  • Guided keyword searches – if someone searches ‘law’ get thousands of results – so highlight the words often combined with law in searches
  • Click stream data – currently collecting this, although not sure the best way to make use of this
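The 'people who borrowed this also borrowed' service in the list above boils down to counting which items share borrowers. A toy Python sketch of that counting follows. It is not Huddersfield's actual implementation, and the (borrower, item) input format is my assumption.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_also_borrowed(loans):
    """Count how often each pair of items was borrowed by the same person.

    `loans` is an iterable of (borrower_id, item_id) pairs.
    """
    items_by_borrower = defaultdict(set)
    for borrower_id, item_id in loans:
        items_by_borrower[borrower_id].add(item_id)

    co_counts = defaultdict(Counter)
    for items in items_by_borrower.values():
        for a, b in combinations(sorted(items), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return co_counts


def recommend(co_counts, item_id, limit=3):
    """Return up to `limit` items most often borrowed alongside `item_id`."""
    return [other for other, _ in co_counts.get(item_id, Counter()).most_common(limit)]
```

Once the counts are built the borrower IDs are no longer needed, which fits with the point about being able to release this kind of data openly. A real service would probably also ignore pairs seen only once or twice.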

University of Huddersfield released the circulation and recommendation data under an open license – Dave says this makes him feel good and he recommends it!

Looking at the impact, found that once borrowing suggestions were added to the catalogue (in 2005), there was a change in the borrowing habits of students – increase in range of stock circulating – quite a marked correlation.

Another correlation is an increase in the average books borrowed per year – increased after the implementation of borrowing suggestions.

Graham Stone now talking about activity data for the School of Human and Health Sciences – found that a reasonably large proportion of students were not using the library – either borrowing books or logging in to online resources. So started to look at final outcomes for students in terms of attainment, and found that the more books a student borrowed, the higher classification of degree they tended to get – a strong correlation.
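Graham didn't say how the correlation was measured. One simple way to put a number on 'more loans, higher degree class' is a Pearson correlation coefficient, sketched below in plain Python. Coding degree classes as numbers (e.g. third = 1 up to first = 4) is my assumption, and a rank-based measure might suit ordinal data like this better.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)
```

A value near +1 would mean borrowing and attainment rise together, but it says nothing about cause, and it is not a significance test.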

Graham noting that they haven’t looked at whether certain effects are statistically significant, but are using data to help inform thinking. Finding some differences in terms of book vs electronic resource usage across different courses of study.

Now starting to look at the reasons for unexpected non/low use. Looking at

  • Course profiling
  • Targeted promotion
  • Raise tutor awareness

Need to benchmark findings with potential partner, as well as test for statistical significance, and would like to develop a toolkit.

INF11 – Sustaining ‘At Risk’ Online Resources

This blog post is written on behalf of JISC.

Neil Grindley presenting this on behalf of Amber Thomas. The aim of this strand is to provide small amounts of transitional funding to place valuable but ‘at risk’ online resources into new stewardship environments where they will be safe, accessible and sustainable.

A detailed description of this strand is at http://infrastructurecalloct2010.jiscpress.org/appendix-j-sustaining-%E2%80%98at-risk%E2%80%99-online-resources/ and a briefing paper at http://inf11briefingoct2010.jiscpress.org/sustaining-%E2%80%9Cat-risk%E2%80%9D-online-resources/

This strand could cover resources such as websites, resource banks, repositories and other collections of online materials. However, open source software is not the target of this funding. ‘At risk’ means that the owning organisation will no longer exist, that the server infrastructure will be switched off, and/or that there is no technical maintenance available.

This is NOT to support organisations shifting resources around internally. Bidders will need to provide evidence that without action the resource will be abandoned. There should be clear evidence that the material has value.

  • Transfer of ownership must be demonstrably feasible and involve named staff
  • It should be apparent that transfer of ownership is strategically and sustainably sensible
  • Funding cannot be used to pay license fees to current owner

The project must produce:

  • The online resource/collection itself in its new home(s)
  • Full ‘handover’ documentation
  • A case study of the process

Q: Does the lead need to be an HEI

A: The lead partner must be one that is eligible to bid as outlined at http://infrastructurecalloct2010.jiscpress.org/eligibility/

INF11 – Preservation of complex visual digital materials and environments

This blog post is written on behalf of JISC.

Neil Grindley covering this strand, which is described in detail at http://infrastructurecalloct2010.jiscpress.org/appendix-i-preservation-of-complex-visual-digital-materials-and-environments/. Also see briefing paper http://inf11briefingoct2010.jiscpress.org/preservation-of-complex-visual-digital-materials-and-environments/

Three areas of focus for this strand:

Note that there is a focus on visual materials, and this strand is aimed at bigger projects with longer timescales (looking to fund one project for up to £130,000, to be completed by 31st March 2012)

The scope of this work is not designed to try and tackle the entire spectrum of digital material that is often referred to as ‘complex digital objects’, for example the following are NOT in scope:

  • Digital literary editions
  • General websites containing embedded multimedia components
  • Database driven business and factual information systems

The objectives that must be addressed by the project are listed at http://infrastructurecalloct2010.jiscpress.org/appendix-i-preservation-of-complex-visual-digital-materials-and-environments/?paragraph=12#12

Neil outlines a ‘plausible model’ for designing the collaborative structure of this project (although this is just one possible model):

Have a lead partner (one eligible to bid of course), working with one or more ‘domain’ partners who might run symposiums on specific areas – e.g. one for each of the three areas noted above. These domain partners feed back to the lead partner, who synthesises and reports.

INF11 – Preservation Tools

This blog post is written on behalf of JISC.

Neil Grindley covering the Preservation Tools strand, a detailed description is available at http://infrastructurecalloct2010.jiscpress.org/appendix-h-preservation-tools/, and a briefing paper is available at http://inf11briefingoct2010.jiscpress.org/repositories-take-up-and-embedding/

This strand is not about supporting technical tools development – but about using current, existing tools, and embedding the use of these tools into normal working practices. However Neil notes that projects may identify that more development is needed, or that existing tools have limitations etc. but that this needs to come out of using preservation tools in live environments.

Emphasis on solving real and pressing problems.

There are a huge range of tools – too many for me to capture here – but note that this is not limited to tools developed specifically for ‘preservation’ – anything that includes aspects of preservation or preservation workflow could fall into this – an example mentioned is the image editing package GIMP. Tools developed with preservation specifically in mind are obviously also in scope.

Very important to consider how the work the projects propose to carry out will impact on the broader community – this needs to be addressed by proposals.

Markers of the proposals will also be looking for (amongst other things):

  • Proposal involving new users of tools developed elsewhere
  • Described procedures and processes that represent new practice for the bidder
  • Rigorous and plausible descriptions of proposed engagement with toolsets
  • Intelligent and complementary partnerships between tools providers and tool users

INF11 – Access and Identity management

This blog post is written on behalf of JISC.

This session being covered by Neil in the absence of Chris Brown who will be managing the strand. Detailed description of the strand is available at http://infrastructurecalloct2010.jiscpress.org/appendix-a-identity-management-pilots/, and briefing paper at http://inf11briefingoct2010.jiscpress.org/identity-management-pilots/

This strand looking at using the existing Identity Management Toolkit – Neil mentions specifically:

  • Case studies implementing the toolkit
  • Show its effectiveness at improving institutional procedures/policies
  • Show cost savings through implementation of the toolkit

Neil notes that projects may want to focus on specific areas of the toolkit – not necessarily the whole thing which is very wide ranging. Match bids to resources and time available.

Further information via:

INF11 – Research Information Management

This blog post has been written on behalf of JISC.

Detailed description of this strand is at http://infrastructurecalloct2010.jiscpress.org/appendix-b-research-information-management/, and briefing paper at http://inf11briefingoct2010.jiscpress.org/research-information-management/

The aim of this strand is “to expand the community of higher education institutions and organisations who are using CERIF” – the proposal must show how this will be achieved.

Highly recommended that you read the briefing papers and references therein

Looking for projects using CERIF to support the interoperability of systems and processes within one or more institutions – e.g. developing a CERIF ‘wrapper’, data warehousing for importing data from external systems

Project must develop outputs that are demonstrably applicable in other institutions and have been piloted in at least two such institutions

Reports on issues of data quality, costs/benefits, CERIF gaps

Links to (inter)national projects

Contacts are Neil Jacobs (n.jacobs@jisc.ac.uk) and Josh Brown (j.brown@jisc.ac.uk)

Q: Can you say more about piloting with other institutions

A: Look at examples from Scotland where institutions have shared research information using CERIF – that is in a systematic way.

INF11 – Repositories take up and embedding strand

This blog post is written on behalf of JISC.

The programme manager for this strand is Balviar Notay and detailed description at http://infrastructurecalloct2010.jiscpress.org/appendix-g-repositories-take-up-and-embedding/. A briefing paper is available at http://inf11briefingoct2010.jiscpress.org/repositories-take-up-and-embedding/

This strand is about embedding proven good practice to develop an existing repository – not about new development. But not just about ‘JISC good practice’ – good practice from other places/communities is welcome – but need to be clear about what good practice you are going to implement – and perhaps ideally be in contact with those who originated the relevant good practice.

Need to see service improvement – not just about ‘better technology’. Also need to show how a wide range of institutions will benefit from this work – and need to see this – including working with the RSP. This also has to be sustainable – about sustainable innovation – needs to go beyond the lifetime of the funding available from JISC.

Q: Would there be an expectation that the originators of good practice you are going to use would become a partner in the bid?

A: No – especially given the size of the bid – so a consultancy role may be appropriate

Q: Is partnering/good practice limited to the UK?

A: No – good practice from around the world wherever appropriate – but have to ensure you can manage this within the funding and time available (e.g. think about travel etc.)

Q: Are there hashtags for this?

A: No hashtag mentioned for this particular strand (I’ll clarify this if I can)

INF11 – Activity Data

This blog post is written on behalf of JISC

This strand looking for projects that explore user activity data to improve services to institutional staff and students – also there will be a single ‘synthesis’ project in this strand. The detailed description of this strand is at http://infrastructurecalloct2010.jiscpress.org/appendix-f-activity-data/ and a briefing paper is available at http://inf11briefingoct2010.jiscpress.org/infrastructure-for-resource-discovery/

All about identifying tools and techniques that can work for the sector. Looking for very practical projects – looking at how services will be improved, who will be affected, and how they will be affected. Each project should start with a hypothesis (see http://infrastructurecalloct2010.jiscpress.org/appendix-f-activity-data/?paragraph=27#27 and http://infrastructurecalloct2010.jiscpress.org/appendix-f-activity-data/?paragraph=33#33) and projects are expected to look at proving/exploring the hypothesis.

Expect project to release datasets using an open licence wherever possible – but to be clear about any legal or moral problems with this within the bid.

Activity data related to all institutional systems is in scope.

There is a related call out at the moment – 12/10, the JISC Business Intelligence Programme – Andy highly recommends that anyone thinking of bidding under the #inf11 Activity data strand should also read this call. Note JISC does not want duplicate bids to both of these calls.

INF11 – Infrastructure for Resource Discovery

This blog post is written on behalf of JISC.

Projects to release open metadata about the collections and resources of HE libraries, museums and archives – details in Appendix E of the call at http://infrastructurecalloct2010.jiscpress.org/appendix-e-infrastructure-for-resource-discovery/ – Andy McGregor (giving this briefing) suggests this is a good place to ask questions via the commenting system, and also may be a way of finding possible partners for bids through the comments. Also see the briefing paper at http://inf11briefingoct2010.jiscpress.org/infrastructure-for-resource-discovery/

Projects in this strand are part of a wider vision: they should consider how they contribute to it, and make sure they have the time/resource to do so.

Look very carefully at the strict methodology in place – bids that don’t adhere to this won’t get funded. ‘Linked data’ is encouraged but not compulsory – see http://infrastructurecalloct2010.jiscpress.org/appendix-e-infrastructure-for-resource-discovery/?paragraph=15#15 and http://infrastructurecalloct2010.jiscpress.org/appendix-e-infrastructure-for-resource-discovery/?paragraph=18#18

Funding is focussed on HE institutions – but partnerships with institutions outside HE are welcome.

Projects are about establishing practices that can be adopted by other institutions to spread the benefits around the sector – looking for projects that have ways of doing this embedded into them – not just lip-service to the concept.

Data and process must be sustainable – looking for more than just a simple declaration in the bid here but clear ideas of how projects will tackle this.