Alistair Tebbit, Government Affairs Manager at Reed Elsevier, outlined his company's view on evolving publishers’ solutions in the STM sector. Elsevier has text mining licences with higher education and research institutions, corporates and app developers. It gives text miners access by transferring content to the user’s system through one of two delivery mechanisms provided under a separate licence: an API or ConSyn. These delivery mechanisms have been set up and carry running costs. The policy to date has been to charge commercial organisations, but not academic users, for the value these services add.
Why is content delivery managed this way? Platform stability is a critical reason. Data miners want content at scale (they generally don’t do TDM on a couple of articles), but delivering at scale via Elsevier's main platform, ScienceDirect.com, is sub-optimal. APIs or ConSyn are the solution as they leave ScienceDirect untouched: they effectively separate machine-to-machine traffic from traffic created by real users going to ScienceDirect.com. Content security is another key issue. Free-for-all access for miners on ScienceDirect would not allow bona fide users to be checked, and XML versions are less susceptible to piracy than PDFs. This approach is also more efficient for genuine text miners: most prefer to work from XML rather than from article versions on ScienceDirect, and the delivery mechanisms put the content into data miners' hands fast.
For text and data mining outputs, Elsevier applies a CC BY-NC licence when the results of text mining are redistributed in a research support or other non-commercial tool. It requires that, whenever feasible, displayed extracted content carries a DOI link back to the mined article. It also grants permission to use snippets around extracted entities to give context when presenting results, up to a maximum of 200 characters or one complete sentence.
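As a rough illustration of the snippet rule, the sketch below builds a context snippet around an extracted entity and attaches a DOI link. It is not Elsevier's implementation. It takes a conservative reading of the rule, keeping the snippet within both limits at once (no more than 200 characters and no more than one sentence). The function names, the naive sentence splitter and the example DOI are all assumptions made for illustration.

```python
import re

MAX_SNIPPET_CHARS = 200  # limit quoted in the talk


def sentence_bounds(text, start, end):
    """Return (left, right) offsets of the sentence containing text[start:end].

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    Real text (abbreviations, decimals) needs a proper sentence splitter.
    """
    left = 0
    for match in re.finditer(r"[.!?]\s+", text):
        if match.end() <= start:
            left = match.end()            # boundary before the entity
        elif match.start() >= end:
            return left, match.start() + 1  # keep the closing punctuation
    return left, len(text)


def context_snippet(text, start, end, max_chars=MAX_SNIPPET_CHARS):
    """Snippet around the entity: one sentence at most, max_chars at most."""
    left, right = sentence_bounds(text, start, end)
    if right - left <= max_chars:
        return text[left:right]
    # Sentence is too long: centre a window on the entity, inside the sentence.
    spare = max(max_chars - (end - start), 0)
    win_left = max(left, start - spare // 2)
    win_right = min(right, win_left + max_chars)
    win_left = max(left, win_right - max_chars)
    return text[win_left:win_right]


def result_record(text, start, end, doi):
    """Bundle the entity, its snippet and a DOI link back to the article."""
    return {
        "entity": text[start:end],
        "snippet": context_snippet(text, start, end),
        "source": "https://doi.org/" + doi,
    }


if __name__ == "__main__":
    article = "Intro sentence. The protein BRCA1 was studied here. Closing sentence."
    pos = article.index("BRCA1")
    # "10.1234/example" is a hypothetical placeholder, not a real DOI.
    print(result_record(article, pos, pos + len("BRCA1"), "10.1234/example"))
```

A tool following the other reading of the rule (a full sentence is allowed even when it exceeds 200 characters) would simply return the sentence from `sentence_bounds` without the windowing step.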
In Tebbit's view, licensing is working well at Elsevier and will improve further. The demand to mine is being met and there are no extra charges in the vast majority of cases. Additional services to support mining will likely be offered as they improve. However, it’s early days. Mining demand is embryonic, with low numbers at the moment. Copyright exceptions are a big cause for concern and there is a major risk of a spike in unauthorised redistribution. Platform stability may be threatened and there is a risk of a chilling effect on future service innovation.
There are a number of barriers to text and data mining:
- Access: how can users get hold of content for text mining
- Content formats: there is no standard cross-publisher format
- Evaluation: understanding user needs and use cases
- Uncertainty: what is allowed by law, what is the use of text and data mining output
- Business models: lack of business pricing models, e.g. for access to unsubscribed content
- Scale: defining and managing demand; bilateral licensing is unlikely to be scalable.
In the newspaper sector, press packs are produced by text mining. NLA eClips is a service in which the proprietary method of mining content is withheld and a PDF of the relevant articles is supplied. There are substantial risks for publishers in text mining, including the potential for technical errors by miners, challenges around data integrity, and commercial malpractice. There are also cost implications, including the technical load on systems, the management of copies and uses, and opportunity costs.
Hughes cited the Meltwater case, where the industry had to tackle the unauthorised use of text and data mining for commercial purposes. It took a lot of time and litigation, but Meltwater is now thriving within the NLA rules: both the company and its users are licensed by the NLA. This means it is operating on fair and equal terms with competitors, and the case is an example of how licences can work to the benefit of all parties.