Monday 20 May 2013

Text and Data Mining: International Perspectives, Licensing and Legal Aspects

Graham Taylor welcomes delegates
Licensing for text and data mining is a minefield for publishers. How do you use the technology? What are the implications of policy development in Europe and internationally? How do you ensure that your licenses are fair and practical?

Text and Data Mining: international perspectives, licensing and legal aspects, a Journals Publishers Forum seminar organised in conjunction with ALPSP, the Publishers Association and STM, gathered together a range of speakers to try and answer these questions. This is the first in a series of posts from the afternoon that provides a summary of the discussion.

Graham Taylor, founder of The Long Game consultancy and director of both the CLA and PLS, kicked off proceedings with a summary of text and data mining. It has been something of a political hot potato, but has recently settled down. He introduced Jonathan Clark from Jonathan Clark & Partners B.V., author of the PRC report Text Mining and Scholarly Publishing, who provided an overview of the 'What, Why and How?' of text and data mining.

Text mining is about extracting the meaning from one or many articles. Sounds a lot like reading, right? Yes, but it's about a machine doing it. If you imagine that you were teaching a machine to read, how would you do it? One way is rule based: you provide formalised rules of grammar and language and teach the machine to read. The other way is statistical: you give the machine as much text as possible and tell it to read. Amazingly, machines do make up the rules themselves and start to make sense of it. The best example is Google Translate. It achieves this by sitting on a vast amount of translated content that it searches to match particular phrases. Why is this important? Think of the way that scholarly communication is done and how it is structured. Its purpose is essentially to share facts and to shape opinions from them. Data mining, where you look for patterns and trends, is pretty much the domain of machines.
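
To make the statistical idea concrete, here is a minimal Python sketch. It is not how Google Translate works and was not shown at the seminar. It only illustrates how a machine can infer word correspondences from already-translated text without being given any grammar rules. The toy corpus and the names PAIRS, learn_translations and translate are all invented for illustration, and the Dice coefficient is simply one convenient way of scoring how often two words appear together.

```python
from collections import Counter
from itertools import product

# Toy parallel corpus, invented for illustration only.
PAIRS = [
    ("the cat", "le chat"),
    ("the dog", "le chien"),
    ("a cat", "un chat"),
    ("a dog", "un chien"),
]


def learn_translations(pairs):
    """Infer a word-to-word table from aligned sentence pairs.

    No grammar rules are supplied. The only evidence used is how often
    a source word and a target word occur in the same sentence pair.
    """
    src_counts, tgt_counts, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_counts.update(s_words)
        tgt_counts.update(t_words)
        joint.update(product(s_words, t_words))

    table = {}
    for s in src_counts:
        # Dice coefficient: 2 * together / (count of s + count of t).
        # The target word itself is the tie-breaker, so the result is deterministic.
        table[s] = max(
            tgt_counts,
            key=lambda t: (2 * joint[(s, t)] / (src_counts[s] + tgt_counts[t]), t),
        )
    return table


def translate(sentence, table):
    """Translate word by word; unknown words are passed through unchanged."""
    return " ".join(table.get(word, word) for word in sentence.split())
```

With this toy corpus, translate("the dog", learn_translations(PAIRS)) returns "le chien". A word that never appeared in the corpus is left as it is, which hints at why the statistical approach depends on having as much text as possible.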

Why do text mining?

  1. Getting the facts out of the article, making them sensible and enhancing the text.
  2. Systematic literature review: machines can read faster, and read more, than humans ever could, and probably more accurately as well.
  3. Discovery: he referenced brainmap.org, which was compiled manually and has since become a very important resource for researchers on brain scanning.
  4. Computational linguistics research: the new rules about making research available
Eefke Smit is Director of Standards and Technology for STM and co-authored a study on Journal Article Mining on behalf of the Publishing Research Consortium. Historically there has been a mix of optimists and pessimists in text and data mining (TDM). 

The sceptics claim:
  • TDM has always over-promised
  • It is only in specialised fields
  • The tools are still complicated
  • It needs manual curation
  • There are high investments
  • It is domain dependent
  • There is no common dictionary
  • It is subject to over-ambition in the promise of knowledge discovery.
However, the optimists counter that:
  • There is a vast digital corpus available and growing
  • It has more and more application areas (business, legal, social, etc)
  • The tools are improving fast
  • Manual work is reduced
  • It can be public domain or domain precision
  • Processing power is less of a problem, analytical tools are better, visualisation adds to analysis.
The report offers some interesting insights into exactly how publishers approach text and data mining, and into what drives the requests. The third part of the report focused on cross-sector solutions that would better facilitate content mining. Suggestions made by experts during the interviews included:
  1. standardization of content formats
  2. one content mining platform
  3. commonly agreed access terms
  4. one window for mining permissions
  5. collaboration with national libraries
It was interesting to note that most of the interviewed experts did not see open access as a related issue; access issues relate to datafile delivery or mining on the platform itself.

Richard Mollet on the latest policy
Richard Mollet, Chief Executive of the Publishers Association, provided what he described as an 'aide-memoire' of how the policy is tracking in the UK and the EU. Since the Hargreaves report in 2011, and the UK Government's subsequent acceptance of all the recommendations, the Intellectual Property Office has been tasked with taking this forward. There is a proposed 'three step test' for text and data mining which will allow copying for the purpose of analytic techniques. The caveats are:
  1. the person already has a right to access under an existing agreement, NOT the ability to access;
  2. the copying is for the sole purpose of non-commercial research;
  3. the license may impose conditions of access to licenses or third-party systems (this allows the publisher to impose some restrictions to avoid degrading the whole system and to maintain some form of control).
There is a tension when something that is legal under copyright law is then prevented by a contract. This has made the translation from policy document to parliamentary language even more difficult, and the process has been delayed as a result. It is likely that this won't be UK legislation until October 2014.

In parallel, the European Commission has come to that view itself. There are stakeholder dialogue working groups that are trying to identify short term wins. One of these is on data and text mining: it is trying to ascertain whether anyone wants to do it and, if so, how they want to do it. However, there are real tensions within the Commission, with different positions held by the rights-holder communities, who feel they can fix this and have significant work already underway, and the research community, which believes the system is broken and needs a complete overhaul. The risk here is that it will move from dialogue to monologue, as researchers have indicated that any licensing solution - as opposed to the reopening of the Copyright Directive - will be insufficient for their purposes.
