Sunday 26 May 2013

Text and Data Mining: rights holder licensing tools

Text and Data Mining: international perspectives, licensing and legal aspects was a Journals Publisher Forum seminar organised in conjunction with ALPSP, the Publishers Association and STM, held last week in London. This is the last in a series of posts summarising the discussions.

Sarah Faulder, Chief Executive of the Publishers Licensing Society (PLS), announced that PLS is developing PLS Clear, the PLS clearing house: a central window for handling license requests that will also act as a rights holder search and discovery service.

Text and data mining involves access to, and usage of, articles in bulk. Researchers need to track and contact potentially hundreds of publishers for permission to mine their text. The PLS service will connect researchers to rights owners for search and discovery.

Publishers already entrust licensing their secondary rights to PLS on a non-exclusive basis. As a result PLS has built arguably the most comprehensive database in the UK of publishers and their content (by ISBN/ISSN and, in due course, by DOI). This is a natural role for PLS and the network of Reproduction Rights Organizations all over the world.

They are testing a single discovery portal through which researchers can both find the appropriate publisher(s) and route their permissions requests to the relevant person in the publishing house. The plans are for a generic clearing house. The first application is text and data mining, but it will have wider usage over time.

Text and data mining presents a technical infrastructure problem first and foremost. Licensing is a necessary means of managing access to content where the scale of access increases the risk of leakage, and therefore piracy, and puts an unacceptable strain on publisher platforms that were not designed for systematic crawling and scraping.

Carlo Scollo Lavizzari is legal advisor to STM on copyright law, policy and legal affairs. He argued that structuring a license is easy once you leave the rhetoric aside and look to the business opportunities. It is about defining the terms: what the sources are, what content goes in, what the user does with that content, where it is stored, and whether it can compete or not. Consider the mechanical clause on delivery mechanisms. The license should also deal with the end of the project: always have an exit strategy! That is the legal skeleton of a license.

There are calls for cooperation between those who hold content in the public domain, those who hold open access content, those who hold content that is subscribed to or purchased, those who already hold a lot of purchased content, and researchers who might want to access or mine it. The question that has not yet been resolved with any community is how to combine an open access environment with a license for copyright-protected content. Lavizzari believes licensing can provide a solution here, but it is an area they are still trying to tackle.

John Billington works in corporate products and services at Copyright Clearance Center (CCC), which has been working on its Text and Data Mining Pilot Service. The pilot provides licensed, compliant access to, and retrieval of, full-text XML and metadata from multiple scientific publishers for the purposes of text mining.

CCC’s role is to provide an authorized means to access and retrieve published content in a standard format. The initial pilot is focused on corporate users and offers an access, retrieval and licensing layer. Future markets may include corporate marketing users or academic users.

He reflected on how challenging it has been to extract full text from different publishers and convert it to a normalized format that is usable in text mining technologies. There is a lack of federation, and existing tools are still difficult to use. CCC is trying to provide a one-stop shop for users and publishers that incorporates standardization, a license and business model, and an access method that works for both sides.

He noted that a researcher wouldn’t want to be limited to what their library subscribes to, so the tool will also show them the metadata for content they aren’t subscribed to. Filters will help them understand what they have subscribed to and what they have not. CCC intends to include a purchase mechanism for the full text of unsubscribed articles. Users will be able to download results in a normalized XML format. The service currently has a web interface, but they are working on an API so that access can be systematized.

Ed Pentz from CrossRef closed the day by outlining their latest beta application, Prospect. CrossRef works on the assumption that researchers aren’t using the service for search or discovery: they will already know what they want, or will have used another tool to find it. The service relies on DOI content negotiation. CrossRef is now collecting ORCID IDs for researchers, because in text and data mining it is important to have a unique ID for a researcher so you can see who is doing the mining. They are also including funding information.

DOI content negotiation can serve as a cross-publisher API for accessing full text for TDM purposes. To make use of this, publishers merely need to register URIs to the full text. NISO is working on a fuller specification for some of the metadata. In the meantime, CrossRef is focusing on an interim solution that at least records URIs to well-known licenses. They think it will also be possible to extend this to handle embargoes.
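As a rough illustration of what DOI content negotiation looks like from the researcher's side, the sketch below builds (but does not send) an HTTP request that asks the public DOI resolver for a particular representation via the `Accept` header. It is not CrossRef's Prospect implementation. The function name `build_negotiation_request` and the sample DOI are placeholders chosen for this example, and the media type shown requests citation metadata; which media types return full text depends on what each publisher has registered.

```python
from urllib.parse import quote
from urllib.request import Request

# Public DOI resolver; content negotiation is driven by the Accept header.
DOI_RESOLVER = "https://doi.org/"


def build_negotiation_request(doi: str, media_type: str) -> Request:
    """Build, without sending, a content-negotiated GET request for a DOI.

    Hypothetical helper for illustration only; not part of any CrossRef API.
    """
    # Percent-encode unusual characters but keep the prefix/suffix slash.
    url = DOI_RESOLVER + quote(doi, safe="/")
    return Request(url, headers={"Accept": media_type})


# Placeholder DOI used purely for illustration.
request = build_negotiation_request(
    "10.5555/12345678",
    "application/vnd.citationstyles.csl+json",  # citation metadata as JSON
)
print(request.full_url)
print(request.get_header("Accept"))
```

Passing the resulting object to `urllib.request.urlopen` would perform the lookup. A mining tool would repeat this for each DOI it already holds, which fits the assumption that discovery has happened elsewhere.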

He observed that there’s potential to coordinate across initiatives, but only once each organization has figured things out individually during its own trial period. CrossRef are testing the system over the summer and will then assess whether it is workable as a production system.

