Sunday 26 May 2013

Text and Data Mining: rights holder licensing tools

Text and Data Mining: international perspectives, licensing and legal aspects was a Journals Publisher Forum seminar organised in conjunction with ALPSP, the Publishers Association and STM, held last week in London. This is the last in a series of posts summarising discussions.

Sarah Faulder, Chief Executive of the Publishers Licensing Society, announced that they are developing PLS Clear – the PLS clearing house – a central window for handling licence requests that will act as a rights holder search and discovery service.

Text and data mining involves access to, and usage of, articles in bulk. Researchers need to track and contact potentially hundreds of publishers for permission to mine their text. The PLS service will connect researchers to rights owners for search and discovery.

Publishers already entrust licensing their secondary rights to PLS on a non-exclusive basis. As a result PLS has built arguably the most comprehensive database in the UK of publishers and their content (by ISBN/ISSN and, in due course, by DOI). This is a natural role for PLS and the network of Reproduction Rights Organizations all over the world.

They are testing a single discovery portal through which researchers can both find the appropriate publisher(s) and route their permissions requests to the relevant person in the publishing house. The plans are for a generic clearing house. The first application is text and data mining, but it will have wider usage over time.

Text and data mining presents a technical infrastructure problem first and foremost. Licensing is a necessary means of managing access to content where the scale of access increases risk of leakage and therefore piracy, and puts an unacceptable strain on publisher platforms not designed for systematic crawling and scraping.

Carlo Scollo Lavizzari is legal advisor to STM on copyright law, policy and legal affairs. Lavizzari outlined how structuring a licence is straightforward: leave the rhetoric aside and look to the business opportunity. It is about defining the terms: what are the sources and the input of content? What does the user do with that content? Where is it stored, what is done with it, can it compete with the original or not? Consider the mechanical clause on delivery mechanisms, and always deal with the end of the project – have an exit strategy! That is the skeleton of a legal licence.

There are calls for cooperation between those who hold content in the public domain, those who hold open access content, those who hold content that is subscribed to or purchased, those who already hold a lot of purchased content, and researchers who might want to access and mine it. The question no community has yet resolved is how to combine an open access environment with copyright-protected licensing. It is an area where he believes licensing can provide a solution, but one they are still trying to tackle.

John Billington works in corporate products and services at Copyright Clearance Center, which has been developing a Text and Data Mining Pilot Service: licensed, compliant access and retrieval of full-text XML and metadata from multiple scientific publishers for the purposes of text mining.

CCC’s role is to provide an authorized means to access and retrieve published content in a standard format. The initial pilot is focused on corporate users with an access, retrieval and licensing layer. Future markets may include corporate marketing users, or academic uses.

He reflected on how challenging it has been to extract full text from different publishers and convert it to a normalized format usable in text mining technologies. There is a lack of federation, and existing tools are still difficult to use. They are trying to provide a one-stop shop for users and publishers that incorporates a standard format, a licence and business model, and an access method that works for both sides.

He noted that a researcher wouldn't want to be limited to what the library has subscribed to, so the tool will show the metadata for unsubscribed content and filter results to help researchers understand what they do and don't have access to. They intend to include a purchase mechanism for the full text of unsubscribed articles. Results can be downloaded in a normalized XML format. The service currently has a web interface, but they are working on an API so that access can be systematized.

Ed Pentz from CrossRef closed the day by outlining their latest beta application, Prospect. They work on the assumption that researchers aren't doing search or discovery – researchers will already know what they want to mine or will have found it using another tool. The service relies on DOI content negotiation. They are now collecting ORCID iDs for researchers: in text and data mining it is important to have a unique ID for a researcher so you can see who is doing the mining. They are also including funding information.

DOI content negotiation can serve as a cross-publisher API for accessing full text for TDM purposes. To make use of this, publishers merely need to register URIs pointing to the full text. NISO is working on a fuller metadata specification; in the interim, CrossRef is focusing on at least recording URIs to well-known licences. They think it will also be possible to extend this to handle embargoes.
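In practice, DOI content negotiation means sending an ordinary HTTP GET to the DOI resolver with an Accept header naming the metadata format wanted. A minimal sketch in Python (the DOI here is invented for illustration, and the exact media types a given publisher supports are assumptions, not details from the talk):

```python
import urllib.request

def tdm_request(doi, accept="application/vnd.crossref.unixsd+xml"):
    """Build a content-negotiation request against the DOI resolver.

    When sent, the response would carry publisher-registered metadata,
    which under CrossRef's scheme can include links to the full text
    and to the licence governing text and data mining.
    """
    return urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": accept},
    )

# An illustrative (non-existent) DOI; urllib.request.urlopen(req)
# would perform the actual negotiation.
req = tdm_request("10.5555/12345678")
```

A TDM client would then follow the full-text URI from the returned metadata rather than scraping the publisher platform – exactly the separation of machine traffic from reader traffic that the speakers described.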

He observed that there's potential to coordinate across these initiatives, but only once each organization has worked through its own trial period. CrossRef are testing the system over the summer and will then assess whether it is workable as a production system.

Friday 24 May 2013

The ALPSP International Conference debate ‘Who is the Publisher?’

Jane Tappini from Publishing Technology will be chairing a panel session later this year at academic and scholarly publishing event the ALPSP International Conference.  

Titled “Who is the Publisher?”, the session, which takes place from 14.00 – 15.30 on Thursday 12 September, will focus on the changing role of publishers in a dynamic content world and discuss the challenges facing them in their battle to stay relevant.

The panel will take a look at how skillsets are set to continue to evolve and how publishers can deal with flexibility in content and pricing mechanisms, the implications they have on rights and royalties and the impact on licensing, author relations and reader expectations. The expert panel will include some leading lights in academic and research publishing.

Ziyad Marar
Deputy Managing Director and Executive Vice President of Global Publishing at SAGE Publications, Ziyad Marar has a BSc in Psychology and an MA in the philosophy and psychology of language. Marar is also a successful author. Having had three books published over the last decade, he has become an authority on how philosophy and psychology can create a better understanding of modern identity, as well as having extensive knowledge of SAGE’s reference works, humanities, social sciences, technology and medicine publishing programs.

Dr Timo Hannay
Managing director of Digital Science, a division of Macmillan and a spin-off from the Nature Publishing Group, Dr Timo Hannay specialises in providing software and data solutions for scientists and others involved in the research process. With a BSc in Biochemistry from Imperial College and a DPhil in Neurophysiology from Oxford University, Hannay has worked on sophisticated data management platforms like SureChem, BioData and Symplectic. Hannay is also involved with the Science Foo Camp, co-organising the collaborative event between Nature, O’Reilly Media and Google. Bringing together between 200 and 300 leading scientists, technologists, writers and other thought-leaders, the Science Foo Camp is a recurring weekend of unbridled discussion, demonstration and debate.

Dr Victor Henning
Co-Founder and CEO of Mendeley, as well as Vice President for Strategy at Elsevier, Dr Victor Henning understands the role of emotion in consumer decision making, intertemporal choice and the theory of reasoned action. Holding a PhD from the Bauhaus University of Weimar, Henning has had research published in Psychology & Marketing and Culture & Society. He won the Overall Best Conference Paper Award at the AMA’s 2005 conference and was granted a doctoral dissertation scholarship in 2006 by the Foundation of the German Economy. In 2011 he was elected a fellow of the Royal Society of Arts, and he has since gone on to join academic social network tool Mendeley with academic publisher Elsevier in the heavily discussed merger earlier this year.

Louise Russell
Former COO at Publishing Technology, Louise Russell held the Senior Vice President position at the company’s Scholarly Division and was Head of Client Management at Ingenta, before leaving the company earlier this year to become a Consultant. Having graduated from the University of Liverpool, Russell specialises in publishing, strategic planning and product management.

With such a high calibre of speaker on board and a fascinating topic to explore, the panel is set to be one of the ALPSP Conference’s signature events. To book your place at the conference register here and we look forward to seeing you there.  

Wednesday 22 May 2013

Text and Data Mining: practical aspects of licensing and legal aspects

Alistair Tebbit
Text and Data Mining: international perspectives, licensing and legal aspects was a Journals Publisher Forum seminar organised in conjunction with ALPSP, the Publishers Association and STM, held earlier this week in London. This is the second in a series of posts summarising discussions.

Alistair Tebbit, Government Affairs Manager at Reed Elsevier, outlined his company's view on evolving publishers’ solutions in the STM sector. Elsevier have text mining licences with higher education and research institutions, corporates and app developers. Text miners receive content transferred to their own systems through one of two delivery mechanisms under a separate licence: an API or ConSyn. These delivery mechanisms have set-up and running costs, and Elsevier's policy to date has been to charge commercial organisations for the value-added services, but not academic users.

Why is content delivery managed this way? Platform stability is a critical reason. Data miners want content at scale – they generally don’t do TDM on a couple of articles – but delivering at scale via their main platform is sub-optimal. APIs or ConSyn are the solution, as they leave ScienceDirect untouched: effectively they separate machine-to-machine traffic from traffic created by real users on ScienceDirect. Content security is another key issue. Free-for-all access for miners on ScienceDirect would not allow bona fide users to be checked, and XML versions are less susceptible to piracy than PDFs. It is also more efficient for genuine text miners: most prefer to work from XML, not from article versions on ScienceDirect, and these delivery mechanisms put the content into data miners' hands fast.

With text and data mining outputs, they use a CC BY-NC licence when redistributing results of text mining in a research support or other non-commercial tool. They require that the DOI link back to the mined article whenever feasible when displaying extracted content. They grant permission to use snippets around extracted entities to allow context when presenting results, up to a maximum of 200 characters or one complete sentence.
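The snippet rule described above – context around an extracted entity, capped at 200 characters or one complete sentence – is simple to implement. A minimal sketch (illustrative only, not Elsevier's actual implementation; the sentence detection here is deliberately naive, splitting on ". "):

```python
def snippet(text, entity, max_chars=200):
    """Return a context snippet around an extracted entity, limited to
    the enclosing sentence or max_chars, whichever is shorter."""
    pos = text.find(entity)
    if pos < 0:
        return ""
    # Enclosing sentence: from the sentence break before the entity
    # to the sentence break after it (naive ". " detection).
    start = text.rfind(". ", 0, pos)
    start = 0 if start < 0 else start + 2
    end = text.find(". ", pos + len(entity))
    end = len(text) if end < 0 else end + 1
    sentence = text[start:end].strip()
    if len(sentence) <= max_chars:
        return sentence
    # Sentence too long: fall back to a max_chars window
    # centred on the entity.
    centre = pos + len(entity) // 2
    left = max(0, centre - max_chars // 2)
    return text[left:left + max_chars].strip()

text = "First sentence here. The gene BRCA1 is studied widely. Last one."
print(snippet(text, "BRCA1"))  # → "The gene BRCA1 is studied widely."
```

For an entity in a short sentence the function returns the whole enclosing sentence; for over-long sentences it trims a 200-character window centred on the entity, keeping the output within the stated limit.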

Licensing is working well at Elsevier and will improve further. The demand to mine is being met and there are no extra charges in the vast majority of cases. Additional services to support mining will likely be offered as they improve. However, it’s early days. Mining demand is embryonic with low numbers at the moment. Copyright exceptions are a big cause for concern and there is a major risk of spike in unauthorized redistribution. Platform stability may be threatened and there is a risk of a chilling effect on future service innovation.

Duncan Campbell
Duncan Campbell, Associate Director for Journal Digital Licensing at Wiley-Blackwell, provided an overview of emerging solutions in text and data mining with a publisher perspective on intermediary solutions. Text and data mining is important to publishers as it enriches published content, adds value for customers and aids development of new products. For researchers it helps identify new hypotheses, discover new patterns, facts and knowledge. For corporate research and development, the benefits are as above and in addition it accelerates drug discovery and development and maximises the value of information spend.

There are a number of barriers to text and data mining:

  • Access: how can users get hold of content for text mining
  • Content formats: there is no standard cross-publisher format
  • Evaluation: understanding user needs and use cases
  • Uncertainty: what is allowed by law, what is the use of text and data mining output
  • Business models: lack of business pricing models e.g. access to unsubscribed content
  • Scale: define and manage demand; bilateral licensing is unlikely to be scalable.
There is a potential role for an intermediary to help with the publisher/end-user relationship. This could include acting as a single point of access and delivery, providing standard licensing terms, and offering speed and ease of access. The intermediary could make mining extensible and scalable, cover the long tail of publishers and end-users, and enable confidential access, which is especially important in pharma.

Andrew Hughes
Andrew Hughes, Commercial Director at the Newspaper Licensing Agency (NLA), provided a different perspective on text and data mining. Text mining requires copying of all the data to establish patterns and connections, because computers need to index the data: every word on every page has to be copied, and once the copy exists it needs to be managed. Copying requires access to the data, so indexing can happen either on the publisher's database – with a risk of damage, disruption and expense unless managed – or on a copy provided to the text miner's database, where there are cost and control risks for publishers. He believes you also need to bear in mind that third-party licence partners aren't always as careful with your data as you are.

In the newspaper sector, press packs are produced by text mining. The NLA eClips is a service where the proprietary way of mining content is withheld and a PDF is supplied of the relevant articles. There are substantial risks for publishers in text mining including the potential for technical errors by miners, challenges around data integrity and commercial malpractice. There are also cost implications including the technical loads on systems, management of copies and uses and opportunity costs.

Hughes cited the Meltwater case, where the industry had to tackle the unauthorised use of text and data mining for commercial purposes. It took a lot of time and litigation, but Meltwater is now thriving within the NLA rules: they are licensed by the NLA and their users are licensed. It means they are operating on fair and equal terms with competitors, and it is an example of how licences can work to the benefit of all parties.

Monday 20 May 2013

Text and Data Mining: International Perspectives, Licensing and Legal Aspects

Graham Taylor welcomes delegates
Licensing for text and data mining is a minefield for publishers. How do you use the technology? What are the implications of policy development in Europe and internationally? How do you ensure that your licenses are fair and practical?

Text and Data Mining: international perspectives, licensing and legal aspects, a Journals Publishers Forum seminar organised in conjunction with ALPSP, the Publishers Association and STM, gathered together a range of speakers to try and answer these questions. This is the first in a series of posts from the afternoon that provides a summary of the discussion.

Graham Taylor, founder of The Long Game consultancy and director of both the CLA and PLS, kicked off proceedings with a summary of text and data mining. It has been something of a political hot potato, but has recently settled down. He introduced Jonathan Clark from Jonathan Clark & Partners B.V., author of Text Mining and Scholarly Publishing report from PRC, who provided an overview of the 'What, Why and How?' of text and data mining.

Text mining is about mining or extracting the meaning from one or many articles. Sounds a lot like reading, right? Yes, but it's about a machine doing it. If you imagine that you were teaching a machine to read, how would you do it? One way is to provide formalised rules of grammar and language and teach the machine to read. The other way is a statistical approach: you take as much text as possible and tell the machine to read it. Amazingly, machines do derive the rules and start to make sense of it. The best example is Google Translate, which achieves this by sitting on a vast amount of translated content that it searches to match particular phrases. Why is this important? Think of the way that scholarly communication is done and how it is structured: it is essentially to share facts and to shape opinions from them. Data mining is pretty much the domain of machines, where you look for patterns and trends.

Why do text mining?

  1. Getting the facts out of the article, making them sensible and enhancing the text.
  2. Systematic literature review: machine reading faster and more of it than humans could ever do, and probably more accurately as well.
  3. Discovery: he referenced an example that was completed manually and has since become a very important resource for researchers in brain scanning.
  4. Computational linguistics research: the new rules about making research available
Eefke Smit is Director of Standards and Technology for STM and co-authored a study on Journal Article Mining on behalf of the Publishing Research Consortium. Historically there has been a mix of optimists and pessimists in text and data mining (TDM). 

The sceptics claim:
  • TDM has always over-promised
  • It is only in specialised fields
  • The tools are still complicated
  • It needs manual curation
  • There are high investments
  • It is domain dependent
  • There is no common dictionary
  • It is subject to over-ambition in the promise of knowledge discovery.
However, the optimists counter that:
  • There is a vast digital corpus available and growing
  • It has more and more application areas (business, legal, social, etc)
  • The tools are improving fast
  • Manual work is reduced
  • It can be applied across the public domain or with domain precision
  • Processing power is less of a problem, analytical tools are better, visualisation adds to analysis.
There are some interesting insights into exactly how publishers approach text and data mining in the report as well as insight into what drives the requests. The third part of the report focused on cross-sector solutions to facilitate content mining better. Suggestions made by experts during the interviews included:
  1. standardization of content formats
  2. one content mining platform
  3. commonly agreed access terms
  4. one window for mining permissions
  5. collaboration with national libraries
It was interesting to note that most of the interviewed experts did not see open access as a related issue; access issues relate to datafile delivery or mining on the platform itself.

Richard Mollet on the latest policy
Richard Mollet, Chief Executive of the Publishers Association, provided what he described as an 'aide-mémoire' of how the policy is tracking in the UK and the EU. Since the Hargreaves report in 2011, and the UK Government's subsequent acceptance of all its recommendations, the Intellectual Property Office has been tasked with taking this forward. There is a proposed 'three step test' for text and data mining which will allow copying for the purpose of analytic techniques. The caveats are: 
  1. the person already has a right to access under an existing agreement, NOT the ability to access;
  2. for sole purpose of non-commercial research; 
  3. the licence may impose conditions of access to the licensor's or third-party systems (this allows the publisher to impose some restrictions to avoid degrading the whole system and to maintain some form of control).
There is a tension between being able to do what is legal under copyright law and being prevented from doing so by a contract. This has made the translation from policy document to parliamentary language even more difficult and has hence been delayed. As a result, it's likely that this won't become UK legislation until October 2014.

In parallel, the European Commission has come to the same view itself. There are stakeholder dialogue working groups trying to identify short-term wins, one of which is on text and data mining: trying to ascertain whether anyone wants to do it and, if so, how they want to do it. However, there are real tensions within the Commission, with different positions between the rights-holder communities, who feel they can fix this and have significant work already underway, and the research community, who believe the system is broken and needs a complete overhaul. The risk here is that it will move from dialogue to monologue, as researchers have indicated that any licensing solution - as opposed to the reopening of the Copyright Directive - will be insufficient for their purposes. 

Thursday 16 May 2013

Kathy Law on Outsourcing: The Good, The Bad and The Ugly

Kathy Law is a publishing professional with over 30 years' experience in both sales and distribution roles, most recently in business development and publication management at MPS and HighWire Press. She is a member of ALPSP's Professional Development Committee and is a co-opted member of the main Council.

Here, she reflects on the challenges publishers face when outsourcing all or part of their activity.

"For many organisations, outsourcing is not a daily event, but is a major shift for the organisation that can be fraught with potholes for the unwary or unprepared. There is the sense of losing control, where work is not done the way you normally do it. Sometimes, it can seem difficult to get your message or instructions across. And what do you do when you aren't getting the right results?

Poorly defined work specifications and unrealistic expectations about who does what and how much often lead to vendors not delivering what you expected. Confusion over how the outsourced activity will be managed can contribute to an unsatisfactory, and potentially costly outcome.

Sometimes it pays to take a step back and look at the challenges, good practices and pitfalls around outsourcing. There are many functions that can be outsourced. Ask yourself the following:

  • Are you going to just outsource the ubiquitous typesetting scenarios? This is a hugely important area and probably the first thing to get outsourced by a publisher. 
  • Are you interested in outsourcing sales and marketing functions? 
  • What about hosting, content enhancement and conversion? 
  • And let's not forget copy editing, proofreading and other editorial functions. 

There is much to learn from talking through your outsourcing with a range of potential vendors and other publishers. The more insight you gain into what can go right or wrong - the good, the bad, and the ugly - the more likely you'll be able to be make sound decisions when selecting and working with your supplier.

Don't forget that it's not just a straight transactional relationship, there are also valuable insights to be had on handling cultural differences. Crucially, think about how outsourcing can be turned into a benefit for your publishing activity by allowing you to re-focus affected staff into positive channels of other activity.

In my experience, the more thought that goes into these areas, the more positive and successful the outsourcing relationship will be."

Kathy will be sharing her experience at the ALPSP seminar Outsourcing: the good, the bad and the ugly on 12 June in London. Book your ticket now.

Thursday 2 May 2013

SAGE sponsors the ALPSP International Conference Travel Grant

This month SAGE has announced they will again sponsor a librarian place at this year's Association of Learned and Professional Society Publishers (ALPSP) International conference. The annual event takes place this year from Wednesday 11 to Friday 13 September 2013 in Birmingham, United Kingdom.

SAGE has supported the ALPSP conference as a sponsor since its launch in 2008. This is the third year SAGE has supported a librarian travel grant at the event, which provides a free place at the conference for a librarian or information professional, including entry to the ALPSP awards dinner, travel within the UK, and accommodation for one person.

The ALPSP Conference has become a pivotal conference for understanding changes faced by both librarians and publishers and how we should work together, providing an engaging environment for open dialogue. The winner of the 2012 travel grant, Stephen Buck, E-Resources and Periodicals Librarian at Dublin City University Library said, "I was delighted to be given the ALPSP award last year. It was a great opportunity to broaden my exposure to, and awareness of, the issues affecting publishers and their relationships with libraries and to provide an enhanced perspective on relevant themes that have helped facilitate the generation of ideas and building of expertise moving forwards.” His full post on his experience of the conference can be found here.

SAGE is running a competition to win the sponsored place, with entries submitted either via email or via Twitter to @SAGELibraryNews. To enter, librarians must answer the following question in 140 characters or less:

“What would be your top tip to give students about conducting research?”

In an environment where both the education and research landscapes are rapidly evolving, it is increasingly important that both librarians and publishers work together to support the dissemination of knowledge.

A selection of the responses received will be posted on SAGE Connection later in the year.

Completed answers should be sent to @SAGElibraryNews tagged with #ALPSP, or by email. The closing date is Friday 24th May.

See here for further details and for more information about the conference.