Wednesday 22 May 2013

Text and Data Mining: practical aspects of licensing and legal aspects

Alistair Tebbit
Text and Data Mining: international perspectives, licensing and legal aspects was a Journals Publisher Forum seminar organised in conjunction with ALPSP, the Publishers Association and STM held earlier this week in London. This is the second in a series of posts summarising discussions.

Alistair Tebbit, Government Affairs Manager at Reed Elsevier, outlined his company's view on evolving publishers’ solutions in the STM sector. Elsevier has text mining licences with higher education and research institutions, corporates and app developers. Text miners get access through transfer of content to the user’s system, enabled by one of two delivery mechanisms under a separate licence: via API or using ConSyn. The delivery mechanisms have been set up and there are running costs. Elsevier's policy to date has been to charge commercial organisations, but not academic users, for the value these services add.

Why is content delivery managed this way? Platform stability is a critical reason. Data miners want content at scale (they generally don’t do TDM on a couple of articles), but delivering scale via the main platform is sub-optimal. APIs or ConSyn are the solution as they leave ScienceDirect untouched. Effectively they separate machine-to-machine traffic from traffic created by real users going to ScienceDirect. Content security is another key issue. Free-for-all access to miners on ScienceDirect would not allow bona fide users to be checked. XML versions are less susceptible to piracy than PDFs. It’s also more efficient for genuine text miners. Most miners prefer to work off XML, not from article versions on ScienceDirect. The delivery mechanisms put the content into data miners' hands fast.

For text and data mining outputs, Elsevier uses a CC BY-NC licence when results of text mining are redistributed in a research support or other non-commercial tool. They require a DOI link back to the mined article, whenever feasible, when extracted content is displayed. They grant permission to use snippets around extracted entities to allow context when presenting results, up to a maximum of 200 characters or one complete sentence.
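As a rough illustration of how a miner's tool might apply the snippet and DOI terms described above, here is a minimal Python sketch. It is not Elsevier's implementation. The function names, the naive sentence splitting, the example DOI and the reading of the rule (use the enclosing sentence if it fits within 200 characters, otherwise a 200-character window around the entity) are all assumptions made for illustration.

```python
import re

MAX_SNIPPET_CHARS = 200  # limit quoted in the talk; how it is enforced here is an assumption


def enclosing_sentence(text, start, end):
    """Return (left, right) bounds of the sentence containing text[start:end].

    Naive rule (assumption): a sentence ends at '.', '!' or '?' followed by
    whitespace or the end of the text.
    """
    left = 0
    for match in re.finditer(r"[.!?]\s+", text[:start]):
        left = match.end()
    match = re.search(r"[.!?](\s|$)", text[end:])
    right = end + match.start() + 1 if match else len(text)
    return left, right


def context_snippet(text, start, end, max_chars=MAX_SNIPPET_CHARS):
    """Return context around the entity at text[start:end]."""
    left, right = enclosing_sentence(text, start, end)
    if right - left <= max_chars:
        return text[left:right]  # one complete sentence
    if end - start >= max_chars:
        return text[start:start + max_chars]  # entity alone fills the limit
    # Otherwise take a window of at most max_chars, roughly centred on the entity
    pad = (max_chars - (end - start)) // 2
    win_start = max(left, start - pad)
    win_end = min(right, win_start + max_chars)
    win_start = max(left, win_end - max_chars)
    return text[win_start:win_end]


def format_result(doi, snippet):
    """Pair a snippet with a link back to the mined article via the DOI resolver."""
    return {"snippet": snippet, "link": "https://doi.org/" + doi}


if __name__ == "__main__":
    article = "Aspirin inhibits COX-1. It is widely used! Other drugs differ."
    pos = article.index("COX-1")
    # "10.1000/example" is a made-up DOI used only for this demonstration
    print(format_result("10.1000/example", context_snippet(article, pos, pos + 5)))
```

A production tool would need a proper sentence tokenizer, since abbreviations and decimal numbers defeat the simple punctuation rule used here.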

Licensing is working well at Elsevier and will improve further. The demand to mine is being met and there are no extra charges in the vast majority of cases. Additional services to support mining will likely be offered as they improve. However, it’s early days. Mining demand is embryonic, with low numbers at the moment. Copyright exceptions are a big cause for concern: there is a major risk of a spike in unauthorised redistribution, platform stability may be threatened, and there is a risk of a chilling effect on future service innovation.

Duncan Campbell
Duncan Campbell, Associate Director for Journal Digital Licensing at Wiley-Blackwell, provided an overview of emerging solutions in text and data mining with a publisher perspective on intermediary solutions. Text and data mining is important to publishers as it enriches published content, adds value for customers and aids development of new products. For researchers it helps identify new hypotheses, discover new patterns, facts and knowledge. For corporate research and development, the benefits are as above and in addition it accelerates drug discovery and development and maximises the value of information spend.

There are a number of barriers to text and data mining:

  • Access: how can users get hold of content for text mining
  • Content formats: there is no standard cross-publisher format
  • Evaluation: understanding user needs and use cases
  • Uncertainty: what is allowed by law, what is the use of text and data mining output
  • Business models: lack of business pricing models e.g. access to unsubscribed content
  • Scale: define and manage demand, bilateral licensing unlikely to be scalable.

There is a potential role for an intermediary to help with the publisher/end user relationship. This could include acting as a single point of access and delivery, and providing standard licensing terms as well as speed and ease of access. The intermediary may make mining extensible and scalable, and can cover the long tail of publishers and end-users. It also enables confidential access, which matters especially in pharma.

Andrew Hughes
Andrew Hughes, Commercial Director at the Newspaper Licensing Agency (NLA), provided a different perspective on text and data mining. Text mining requires copying of all data to establish data patterns and connections, and computers need to index data. Every word on every page has to be copied. Once the copy exists, it needs to be managed. Copying requires access to data, so indexing can happen in only one of two places: on the publisher's database, where there is expense and a risk of damage and disruption unless managed; or on a copy provided to the text miner's database, where there are costs and control risks for publishers. He believes that you also need to bear in mind that third party licence partners aren’t always as careful with your data as you are.

In the newspaper sector, press packs are produced by text mining. The NLA eClips is a service where the proprietary way of mining content is withheld and a PDF is supplied of the relevant articles. There are substantial risks for publishers in text mining including the potential for technical errors by miners, challenges around data integrity and commercial malpractice. There are also cost implications including the technical loads on systems, management of copies and uses and opportunity costs.

Hughes cited the Meltwater case, where the industry had to tackle the unauthorised use of text and data mining for commercial use. It took a lot of time and litigation, but Meltwater is now thriving within the NLA rules. It is licensed by the NLA and so are its users. This means it operates on fair and equal terms with competitors, and it is an example of how licences can work to the benefit of all parties.
