Thursday 26 November 2015

Standards: chaos minimization, credibility and the human factor

Standards, standards, standards. One is born, one conforms to standards, one dies. Or so Edmund Blackadder might have said.

And yet, as David Sommer and his panel of experts demonstrated earlier this month, standards underpin our scholarly publishing infrastructure. Without them, we could not appoint editorial teams, enable the review process, tag or typeset articles, publish in print or online, catalogue, discover, or even assess the quality of what we published – assuming, that is, we had been allowed through the office door by our standards-compliant HR departments. We couldn’t determine the citation rates of our publications, sell said publications to libraries (all of them naturally sceptical of our unstandardized claims for high usage) or even contact our high-profile UCL author (is this the UCL in London, Belgium, Ecuador, Denmark, Brazil or the USA?). Resolution, disambiguation, standardization is the order of the day.

‘We are’, as Tim Devenport of EDItEUR said, ‘in the chaos minimization business’.

Speakers at the seminar offered overviews of the roles played by CrossRef, Ringgold, ORCID, COUNTER, Thomson Reuters, EDItEUR, libraries (in the guise of the University of Birmingham) and JISC, considering content, institutional and individual identifiers, plus usage, citation, metadata and library standards.

Audio of all talks is available via the ALPSP site, but here are some broader conclusions from issues discussed on the day.

Humans make standards

But we’re remarkably good at breaking them too. The most foolproof systems are those that don’t allow much human intervention at all (ever tried to accurately type a sixteen-character alphanumeric code on fewer than eight cups of coffee?). Vendors should build systems that not only pre-populate identifier fields, but actively discourage users from guessing, ignoring or simply making up numbers.

Be the difference

Publishers, funders and institutions need to actively assert their need for standards at every stage of their workflows. Break one part of the article supply chain and something, somewhere, is bound to be lost. (And the worst part? We don’t know where.) That means that the entire supply chain must inform and develop standards, not just 'free ride' on existing ones.

Standards help authors find their voice

If an article can be found by DOI, funding source, award number or ORCID iD – in other words, if one or more of the key standards is applied to a particular publication – then research gets heard above the online ‘noise’. Authors can help themselves by claiming their own iDs, but it’s up to publishers and institutions to show them why it matters.

Identifiers enforce uniqueness

They not only help with functionality (disambiguating data and eradicating duplication), but they ensure correct access rights, help understand a customer base and build stronger client relationships. All of this adds immense value to your data.

Standards build credibility everywhere

We tend to think of publishing standards as being the building blocks of the standard workflows – and they are. But the latest development from ORCID encourages involvement in peer review, with journals and funders now collecting reviewers’ iDs to track review activities. That’s a startling contribution to tenure decisions and research assessments. And what about the prospect of using iDs in job applications to verify your publications?

The Impact Factor is a number, not a standard

OK, so we knew that. And we probably had an opinion on it. But coming on a day when Thomson Reuters announced they were ‘exploring strategic options’ for the Intellectual Property & Science businesses, it was good to hear from the horse’s mouth.

Even the ‘standard’ standards need, well, standardizing

Given the significance of COUNTER usage statistics for library negotiations, the possibility for inaccuracy seems startlingly high. Over 90% of users still require some form of manual intervention, and that means greater likelihood of error. There is a role for standardizing and checking IP information to improve the accuracy of COUNTER data - but for now, no one seems to be claiming that ground.

Slow is good

If a publisher/funder/institution is a late standards adopter, that’s OK. Better to start slow and get it right than to implement poorly and leave a (data) trail of tears. But start. Organizations such as ORCID make available plenty of information about integrating identifiers into publisher and repository workflows.

Standards are not anti-innovation

On the contrary, they facilitate innovation. And they provide the information architecture for innovation to flourish in more than one place.

Share it

Since we can't predict when/where (meta)data will be used, let’s make sure everyone knows as much as possible. Make it open source, or at the very least, make it trustworthy.

And finally…

The mobile charging area at the British Dental Association front desk is a perfect example of the need for rational standards. How many wires?

Martyn Lawrence (@martynlawrence) is Publisher at Emerald Group Publishing and attended the recent ALPSP Setting the Standard seminar in London. He can be contacted at

Monday 9 November 2015

Why Publishers Need to Know the Difference between Search and Text Mining

Haralambos “Babis” Marmanis, CTO and VP, Engineering & Product Development at the Copyright Clearance Center, looks at the concepts behind search and text mining and highlights why publishers need to understand the differences in order to make the best use of each.

As the author of works on search and the lead architect of a product which enables text mining of scientific journal articles, I am often asked about the difference between Search and Text Mining, and have observed that the two are sometimes conflated. Unless you work with technology every day, this confusion is certainly understandable. Knowing the differences, however, can open new business opportunities for publishers. Both functions deal with the application of algorithms to natural language text, and both need to cope with the fact that, as compared with “pure data,” text is messy. Text is unstructured, amorphous, and difficult to deal with algorithmically.

While the challenges associated with text are common to both search and text mining, the details with respect to inputs, analytical techniques, outputs, and use cases differ greatly. For years, publishers have been engaged in search engine optimization, designed to make their works more discoverable to users. As publishers are increasingly asked to enable text mining of their content, they enter new territory – a territory that is different from that of public search engines. Thus, it is more important than ever to understand the difference between these two distinct mechanisms of processing content, so that optimal business and licensing strategies are chosen for each.

To begin with, let me describe the key concepts for each area. "Search" means the retrieval of documents based on certain search terms. Think, for example, of your usual web search on well-known search engines such as Google, Yahoo or Bing. In search, the typical actions performed by a software system are index-based and designed for the retrieval of documents. The indexing process therefore aims to build a look-up table that organizes the documents based on the words they contain. The output is typically a hyper-link to text/information residing elsewhere, along with a small amount of text which describes what is to be found at the other end of the link. In these systems, no “net new” information is derived from the documents through the processes that are employed to create the search index. The purpose is to find the existing work so that its content can be used.
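The look-up table described above is commonly called an inverted index. The following is a minimal sketch of the idea, not a description of how any particular search engine works; the function names (`tokenize`, `build_index`, `search`) and the three sample documents are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND search)."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


# Hypothetical sample documents, made up for this sketch.
documents = {
    "doc1": "Avian flu outbreaks in poultry farms",
    "doc2": "Seasonal flu vaccination rates",
    "doc3": "Poultry farming economics",
}

index = build_index(documents)
results = search(index, "avian flu")
print(results)
```

Note that the index only records where words occur. The result is a list of pointers to existing documents, which matches the point that search derives no “net new” information from the content.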

On the other hand, "text mining" is a less widely understood but well-developed field that deals with analyzing (not finding) text. That is, while text mining can sometimes look at meta-textual issues – for example, tracking the history of science by counting the instances of a specific phrase (e.g., “avian flu”) in articles – more often the goal is to extract expressed information that is useful for particular purposes, not just to find, link to, and retrieve documents that contain specific facts.

Text mining tools accomplish this by allowing computers to rapidly process thousands of articles and integrate a wealth of information. Some tools rely on parsing the text contained in the documents and apply simple algorithms that effectively count the words of interest. Other tools dig deeper and extract basic language structure and meaning (such as identifying noun phrases or genes) or even analyze the complete grammatical structure of millions of sentences in order to gain insights from the textual expression of the authors. By extracting facts along with authors’ interpretations and opinions over a broad corpus of text, this more sophisticated approach can deliver precise and comprehensive information, and in the commercial setting, provides more value than simple word counts.
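The simplest kind of tool mentioned above, one that effectively counts words of interest, can be sketched in a few lines. This is an illustrative assumption about how such a count might be done, using the “avian flu” phrase from earlier; the function name `count_phrase_by_year` and the sample articles are invented and do not represent real data.

```python
import re
from collections import Counter


def count_phrase_by_year(articles, phrase):
    """Count case-insensitive occurrences of a phrase, grouped by year.

    `articles` is an iterable of (year, text) pairs.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    counts = Counter()
    for year, text in articles:
        counts[year] += len(pattern.findall(text))
    return dict(sorted(counts.items()))


# Hypothetical (year, text) pairs, made up for this sketch.
articles = [
    (2004, "Avian flu was detected. The avian flu strain spread."),
    (2005, "Researchers studied avian flu transmission."),
    (2005, "A report on seasonal flu."),
]

phrase_counts = count_phrase_by_year(articles, "avian flu")
print(phrase_counts)
```

Unlike the search index, the output here is a new piece of information (a trend over time) rather than a link back to a document. The more sophisticated tools described above would replace the regular expression with linguistic parsing.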

Unlike with search, the output of text mining will vary depending on the use to which the researcher wishes to apply the results. In some contexts, the output is digital and designed for machines to process. In other examples, such as using text mining to drive marketing of products and services, the ultimate output will be human-readable text. In other words, even when text mining is performed, sometimes the user needs and receives the full article.

Although both search and text mining involve the parsing and lexical analysis of documents, there are important differences that should drive a publisher’s decisions about investments in text mining and search.

  1. In text mining, the processing and analysis are often done on a project-by-project basis. Unlike the search functionality provided by search engines, the “how, why, and what” are infinitely variable, and it is difficult to accurately anticipate the inputs, processes, and outputs required. For example, depending on a text miner’s use case, the output may be facts, data, links, or full expression, as opposed to the simple links that are the output of search.
  2. Search is about finding a set of relevant documents, each of which is considered independently by the algorithm; if applied to a single document the process will yield the same result for that document. On the other hand, text mining is mostly about discovering and using information that lives in the fabric of a corpus of documents. Change one document and the fabric of the corpus changes. Mining is usually (but not always) consumptive of the content. So, the “search” process is document-by-document specific, while the “mining” process involves sets of documents and how these documents relate to each other.
  3. Lastly, the mining process aims at extracting “higher-order” information that involves first-, second-, and higher-order correlations that may occur among any combination of the terms, data, or expressions appearing in the corpus of documents that is processed.
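The second and third points can be illustrated with a small sketch that counts how often pairs of terms appear together in the same document. This is only one simple reading of “second-order correlation”, chosen for illustration; the function name `cooccurrence`, the term list and the sample corpus are all invented.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence(corpus, terms):
    """Count, for each pair of terms, the documents that contain both."""
    counts = Counter()
    for text in corpus:
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        present = sorted(term for term in terms if term in words)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Hypothetical corpus and terms, made up for this sketch.
corpus = [
    "Avian flu spreads among poultry",
    "Poultry farms report avian flu cases",
    "Vaccine trials for seasonal flu",
]
terms = ["avian", "poultry", "vaccine"]

pairs = cooccurrence(corpus, terms)

# Adding a single document changes the corpus-level picture.
updated = cooccurrence(corpus + ["Vaccine for avian flu in poultry"], terms)

print(pairs)
print(updated)
```

No single document holds these pair counts; they exist only at the level of the corpus, which is why adding or changing one document alters the result.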

In summary, search and text mining should be considered as two quite distinct processing mechanisms, often with different inputs and outputs. Publishers need to engage with both, but conflating them means losing the unique opportunities and strengths that each provides. With search, it’s all about helping users find the specific content that they are looking for. Text mining goes well beyond search, to find multiple meanings in a publisher’s content in order to derive new value therefrom. Hence, one would expect that, just as the processes themselves differ, publishers’ licenses for the search and text mining processes will differ too.