Monday, 9 November 2015

Why Publishers Need to Know the Difference between Search and Text Mining

Haralambos “Babis” Marmanis, CTO and VP, Engineering & Product Development at the Copyright Clearance Center, looks at the concepts behind search and text mining and highlights why publishers need to understand the differences in order to make the best use of each.

As the author of works on search and the lead architect of a product which enables text mining of scientific journal articles, I am often asked about the difference between Search and Text Mining, and have observed that the two are sometimes conflated. Unless you work with technology every day, this confusion is certainly understandable. Knowing the differences, however, can open new business opportunities for publishers. Both functions deal with the application of algorithms to natural language text, and both need to cope with the fact that, as compared with “pure data,” text is messy. Text is unstructured, amorphous, and difficult to deal with algorithmically.

While the challenges associated with text are common to both search and text mining, the details with respect to inputs, analytical techniques, outputs, and use cases differ greatly. For years, publishers have been engaged in search engine optimization, designed to make their works more discoverable to users. As publishers are increasingly asked to enable text mining of their content, they enter new territory – a territory that is different from that of public search engines. Thus, it is more important than ever to understand the difference between these two distinct mechanisms of processing content, so that optimal business and licensing strategies are chosen for each.

To begin with, let me describe the key concepts for each area. "Search" means the retrieval of documents based on certain search terms. Think, for example, of your usual web search on well-known search engines such as Google, Yahoo or Bing. In search, the typical actions performed by a software system are index-based and designed for the retrieval of documents. The indexing process therefore aims to build a look-up table that organizes the documents based on the words they contain. The output is typically a hyper-link to text/information residing elsewhere, along with a small amount of text which describes what is to be found at the other end of the link. In these systems, no “net new” information is derived from the documents through the processes that are employed to create the search index. The purpose is to find the existing work so that its content can be used.
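As a rough illustration of the look-up table described above, here is a minimal sketch of an inverted index in Python. The function names (`tokenize`, `build_index`, `search`) and the sample documents are hypothetical and chosen only for this example; real search engines add ranking, stemming, and much more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical sample documents, for illustration only.
documents = {
    "doc1": "Avian flu outbreak reported in poultry farms",
    "doc2": "Seasonal flu vaccine guidance",
    "doc3": "New study on avian flu transmission",
}
index = build_index(documents)
print(sorted(search(index, "avian flu")))  # ['doc1', 'doc3']
```

Note that the index only records where words occur. The output is a set of pointers to existing documents, and nothing new is derived from their content.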

On the other hand, "text mining" is a less widely understood but well-developed field that deals with analyzing (not finding) text. That is, while text mining can sometimes look at meta-textual issues – for example, tracking the history of science by counting the instances of a specific phrase (e.g., “avian flu”) in articles – more often the goal is to extract expressed information that is useful for particular purposes, not just to find, link to, and retrieve documents that contain specific facts.

Text mining tools accomplish this by allowing computers to rapidly process thousands of articles and integrate a wealth of information. Some tools rely on parsing the text contained in the documents and apply simple algorithms that effectively count the words of interest. Other tools dig deeper and extract basic language structure and meaning (such as identifying noun phrases or genes) or even analyze the complete grammatical structure of millions of sentences in order to gain insights from the textual expression of the authors. By extracting facts along with authors’ interpretations and opinions over a broad corpus of text, this more sophisticated approach can deliver precise and comprehensive information, and in the commercial setting, provides more value than simple word counts.
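The simplest kind of tool mentioned above, one that effectively counts phrases of interest, might look something like the following sketch. It tallies mentions of a phrase by publication year, along the lines of the “avian flu” example. The function name `count_phrase_by_year` and the sample articles are assumptions made for illustration, and plain substring matching stands in for proper linguistic processing.

```python
from collections import Counter


def count_phrase_by_year(articles, phrase):
    """Count case-insensitive occurrences of a phrase, grouped by year.

    `articles` is an iterable of (year, text) pairs.
    """
    counts = Counter()
    needle = phrase.lower()
    for year, text in articles:
        counts[year] += text.lower().count(needle)
    return dict(sorted(counts.items()))


# Hypothetical sample corpus, for illustration only.
articles = [
    (2003, "Early reports of avian flu in poultry."),
    (2005, "Avian flu spreads; avian flu vaccines are studied."),
    (2005, "A review of seasonal influenza."),
]
print(count_phrase_by_year(articles, "avian flu"))  # {2003: 1, 2005: 2}
```

Unlike a search result, the output here is new, derived data about the corpus rather than a link back to any one article.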

Unlike with search, the output of text mining will vary depending on the use to which the researcher wishes to apply the results. In some contexts, the output is digital and designed for machines to process. In other examples, such as using text mining to drive marketing of products and services, the ultimate output will be human-readable text. In other words, even when text mining is performed, sometimes the user needs and receives the full article.

Although both search and text mining involve the parsing and lexical analysis of documents, there are important differences that should drive a publisher’s decisions about investments in text mining and search.

  1. In text mining, the processing and analysis are often done on a project-by-project basis. Unlike the search functionality provided by search engines, the “how, why, and what” are infinitely variable, and it is difficult to accurately anticipate the inputs, processes, and outputs required. For example, depending on a text miner’s use case, the output may be facts, data, links, or full expression, as opposed to the simple links that are the output of search.
  2. Search is about finding a set of relevant documents, each of which is considered independently by the algorithm; if applied to a single document the process will yield the same result for that document. On the other hand, text mining is mostly about discovering and using information that lives in the fabric of a corpus of documents. Change one document and the fabric of the corpus changes. Mining is usually (but not always) consumptive of the content. So, the “search” process is document-by-document specific, while the “mining” process involves sets of documents and how these documents relate to each other.
  3. Lastly, the mining process aims at extracting “higher-order” information that involves first-, second-, and higher-order correlations that may occur among any combination of the terms, data, or expressions appearing in the corpus of documents that is processed.
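To make the corpus-level point in the list above a little more concrete, the following sketch counts how often pairs of terms appear together in the same document, a simple example of a second-order correlation. The function name `cooccurrence_counts`, the term list, and the sample corpus are all hypothetical, and naive substring matching is assumed in place of real entity recognition.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, terms):
    """Count, for each pair of terms, the documents mentioning both."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        present = sorted(t for t in terms if t in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Hypothetical sample corpus and terms, for illustration only.
corpus = [
    "Gene BRCA1 is linked to breast cancer.",
    "BRCA1 and breast cancer screening guidelines.",
    "BRCA1 mutations and ovarian cancer risk.",
    "Aspirin and headache.",
]
terms = ["brca1", "breast cancer", "ovarian cancer"]
pairs = cooccurrence_counts(corpus, terms)
print(pairs[("brca1", "breast cancer")])  # 2
```

The counts belong to the corpus as a whole rather than to any single document: removing or changing one article changes the result, which is the sense in which mining works on the fabric of the corpus.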

In summary, search and text mining should be considered as two quite distinct processing mechanisms, with often different inputs and outputs. While publishers need to engage with both, by conflating them, one loses the unique opportunities and strengths that each provides. With search, it’s all about helping users find the specific content that they are looking for. Text mining goes well beyond search, to find multiple meanings in a publisher’s content in order to derive new value therefrom. Hence, one would expect that, just as the processes themselves differ, publishers’ licenses for the search and text mining processes will differ too.

Tuesday, 13 October 2015

Standard Identifiers, Metrics and Processes in Journal Publishing: Mark Hester asks 'Aren't they a bit...dull?'

Why should we use standards? Identifiers, transaction processes, schemas, metrics and many other things in scholarly publishing have standards, or are developing them. Isn’t this a rather arduous and bureaucratic way of handling things? Are these things really there to make life easier or just another way of overcomplicating an already complex market, taking time away from the efforts of actually producing high quality content?

Here Mark Hester of Aries Systems delves into why we should care.

Aren’t standards a bit… dull?

Standards? Just a bunch of numbers, right? With tedious documentation on how and where to use them? Why would I bother with those?

It’s not hard to see why you might think that, but it’s also easy to see how this is misguided. Jumping straight into a document to read about standards is a little bit like reading the telephone directory when you have no intention of calling someone, or leafing through a Haynes manual when you’re not repairing a car.

An example of a standard from outside publishing might help – EAN-13. What is EAN-13, you might ask? You see examples of it daily – it is the standard for the barcodes we see on everything we buy in the supermarket. Retail staff don’t need to know how EAN-13 works, and it is unlikely that they’ve read the documentation on it, but they are all grateful that it does work when checking stock, pricing items and working on the till and, in turn, so are their customers.

So I ignore standards: what’s the worst that can happen?

When I was a student in the early nineties, the departmental librarian had been using his own classification system for many years. Back then, it didn’t matter much – students got used to its quirks, visitors from other departments were rare, from other universities much rarer still. The people using the service understood it, and that was enough.

Imagine taking this approach in the online world - it would mean that your content would be less discoverable and also less usable. Online library catalogues wouldn’t work if everyone took the approach of the librarian from my alma mater! Not using DOIs means frustration for researchers who can’t click on the references and go straight to the articles, and a simple change to a URL means a broken link. If your content isn’t seen it affects your reputation and, in the case of a commercial publisher, your profits.

The benefit of standards will only increase as the ‘digital natives’ used to touch screen technology enter academia and the workplace – having to click more than once or search for more than a minute will lead them to go elsewhere.

How can standards enhance my working life and be good for my organization?

Rapid changes in scholarly publishing mean that new applications are found for standards once they are in place. Adopting standards can ‘future proof’ your content and processes against changes yet to come.

A great example of this is the relentless adoption of gold open access. The publishing standards which enable Copyright Clearance Center’s RightsLink for OA to display different article processing charge policies to different users on the fly developed separately from one another – Ringgold for institutions, ORCID for identifying authors, and FundRef for funder identification. Brought together, however, their machine readability allows flexible APC pricing models and automated billing and payment processing, making life easier and saving time and money for both publishers and institutions.

The advantages can be psychological as well as practical – if authors, researchers and librarians see the ORCID or CrossRef logos displayed on your website, they will know that your organization is a serious player, one which will help them, one they can trust.

So what's next?

By now, I hope I’ve convinced you of the importance of standards. But if the prospect of researching the topic still fills you with a sense of dread, there's an upcoming seminar from ALPSP I'm helping to coordinate called Setting the Standard. It's being held in London on Wednesday 11 November and includes speakers from CrossRef, Ringgold, ORCID, COUNTER, Thomson Reuters, EDItEUR, Jisc and an institution. Everything you ever wanted to know about standards, but were too scared to ask.

I hope to see you there.

Tuesday, 22 September 2015

Reflections on #alpsp15 - Digital Science's Phill Jones explores the key issues

Phill Jones, Head of Publisher Outreach at Digital Science, reflects on the 2015 ALPSP Conference in this blog post, including the duelling keynote talks from Anurag Acharya, co-founder of Google Scholar, and Kuansan Wang, Director, Internet Service Research Center at Microsoft Research. He noted their very different views on academic discovery on the open web.

"Citing the difference between "general" search for say a local business, and the geographically global nature of "academic" search, Acharya suggested that personalizing Google scholar wouldn’t yield much additional value. Conversely, Wang described a very different philosophy of highly monitored, highly personalized search through Bing and Cortana that would adapt to individual users needs."

He reflects on the shift in customer base from library to researcher and the resulting revelations as publishers try to better understand their needs:

"Google is so firmly embedded in young researcher’s routines that they don’t even think about the fact that they use it. You wouldn’t expect somebody to tell you that they opened an internet browser, would you?" 

The panel on peer review provoked the following thought:

"One reoccurring theme that emerged from the discussions: the fact too much is currently being asked of the peer-review process. With the mantra of "publish or perish" being truer now than it’s ever been, it can be argued that publishers find themselves unwittingly in the position of administering the process that decides whose career advances and whose doesn’t."

With that position comes great responsibility, something that will no doubt be considered in more detail during Peer Review Week to be held 28 September to 2 October 2015, a collaboration between ORCID, OpenScience and Wiley announced by Alice Meadows during the conference.

Read Phill's full post here on the Digital Science blog.

Monday, 14 September 2015

The Academic Book of the Future?

Much discussion of scholarly communication is dominated by scientific and (especially) serials concerns. This session aimed to redress the balance. Richard Fisher chaired a distinguished panel of academics to discuss the recent trends and data on monographs and the current AHRC project on the Academic Book of the Future. These are natural starting points for an extended discussion of what still remains the major currency of both communication and esteem in many academic subjects in the humanities and social sciences.

Simon Tanner from King’s College London and the AHRC Academic Book of the Future project provided an overview of the work completed to date and some highlights from the research data. The first stage of the research project has focused on finding out what the roles and purposes of academic books are in serving scholarship and wider learning, for all groups involved in this area, and then sense-checking the findings with those groups.

The REF2014 submissions provided a rich data set as a means of learning more about the academic books created and deemed worthy of submission in the last REF cycle (2009-2014). The project team focused on Main Panel D for Arts and Humanities. Within this Panel the data can be investigated by Unit of Assessment Subject Area and by Research Output Type. They hope to look at various areas including author gender, book format or length, books per submitting institution and open access books. Tanner shared some initial analysis that threw up surprising findings, including how chapters still feature in REF submissions and how few publishers submit more than 10 titles.

Michael Jubb has been working with the Academic Book of the Future and initial findings from the research suggest books remain a critical part of the scholarly infrastructure in analogue form, but we haven't yet articulated how to present the broad range of scholarly resources in the humanities in an effective and user-friendly way. More will be discussed during Academic Book Week in the UK 9-16 November 2015.

There seem to be powerful incentives to write and to publish books, even as volumes of sales of individual titles fall. Are we publishing too many books?

Professor Peter Mandler from the University of Cambridge and President of the Royal Historical Society observed that technology is making more of an impact on publishing and it's right to reflect on its effects on monograph publishing, for good or evil. The high cost is a barrier, but it's not practical to remove the price entirely, as a good deal of work goes into a monograph, including remunerated peer review. However, he believes that the sooner we can reduce cost through use of ebooks, the better. New initiatives, including shorter monographs of 30,000 to 60,000 words, are to be welcomed.

It's interesting to note that funders don't discuss the fact that the average score for a monograph was much higher than for a chapter or article. There are implications for the future of the academic book driven by changes to productivity in output, measurement and metrics around research, and generational changes (the younger generation often prefers chapters or articles to long-form research).

Professor John Holmwood, University of Nottingham and Past President of the British Sociological Association, noted there are some social sciences that hardly submit any books. He reflected on a decline in cultural scholarship in some social science disciplines. He believes the move across to a linear, cumulative form of journal output perhaps lacks the reflection and transformative impact. The commercialisation of higher education and the development of publishing business models suggest a link to government actions. On the one hand there is a radical ambition to create a democratic open online library, but how does that fit with the commercialisation (and underlying privatisation) of universities?

Holmwood observed there is disruption of the curriculum and the book as a result. Publishers are disrupting both as well, with innovation in education delivery and digital provision of learning materials. He feels monographs and journals are moving at different speeds and this is now becoming a problem. Article-based disciplines have citation patterns that show a short life for an article, compared to disciplines that tend towards long-form research, which can be cited for decades.

There is no doubt the debate will continue. Follow #AcBookWeek and @AcBookFuture for more details.

Friday, 11 September 2015

Digital developments and new revenue streams

Timo Hannay introduced the final panel at the 2015 ALPSP Conference that focused on digital developments and new revenue streams.

Mary Ging has worked as a consultant and as MD for international at Infotrieve. In traditional access models, the annual license model still dominates. Publishers have been experimenting with value-add options. In the medium to long term the consensus is that the model will change. Given millennials' short attention spans and time crunch, is the traditional 12-15 page article the best way to disseminate research information? Is there a better alternative?

Pay Per View document delivery is bigger than you think. Publishers including CUP, Wiley, Nature and others offer rental on their websites as well as through third parties such as DeepDyve and ReadCube. For article enhancement, sharing and collaboration tools there are publisher and third-party options as well.

There has been rapid growth in OA publications. In September 2015, DOAJ listed 10,555 journals - a 30% increase in three years. Hybrid journals are increasing and there's a significant improvement in quality. Most articles are covered by a Creative Commons license, increasingly CC-BY. Text and data mining is a new area where there's a lot of interest, but not a lot of revenue. Another issue is the lack of expertise. The biggest challenge is collecting the corpus in a consistent way. There is an opportunity to provide a corpus creation solution for those who wish to do text and data mining. Is there a market for an eBay for datasets? That could work as an incentive to work with it.

Other opportunities include the importance of the patent literature and helping academics stay in touch with what's happening in this area. There could be tools to meet regulatory requirements (e.g. Quosa for pharmacovigilance). What about the cloud? Are there opportunities for one solution that publishers can buy into, with standard data structures? The best opportunities will be found by publishers who define their remit more broadly than just the paper.

Mat Pfleger, Managing Director at the Copyright Licensing Agency, shared the challenges that CLA is considering as it develops its services. These relate to policy changes in education and HE as well as cuts in funding. Automatic renewals and inflationary pricing are symptoms of complacency. The challenge is to think beyond the short-term deal. The current focus on cost masks the broader challenges we face today. We need to focus on those.

Another challenge is a range of new, disruptive services that deliver content as part of a service, each providing data that can be used in many different ways. Each creates value across multiple touch points across an institution. Some examples include:
  • The Mobile Learner's Library from Pearson
  • Kortext - beyond content, it's a collaboration and analytics tool
  • Article Galaxy Widget

How do we as a community engage with multiple open sources? Some interesting examples include Open Stack Space, which is funded by a number of foundations and provides student access to peer-reviewed textbooks. This year alone they served 200,000 students and claim $25 million in savings. Lumen's mission is to provide open education resources to eliminate textbook costs. 4.9 million resources are downloaded each week from Tes. They recently hired a former eBay contact and have created a marketplace for teachers. It is a significant platform for open educational resources. When you combine this with the challenge to the budget, it is a significant game-changer.

What does the collective licensing and streaming of content mean for collective licensing organizations? Netflix, Spotify and EPIC! are all subscription services that have the potential to disrupt. They all have a growing catalogue of content which is presented at a granular level. Royalties are linked at this level through a micropayment system. Every content industry engaging with these services has had to have a serious rethink about its business model.

Chris Graf, Business Development Director for the Society Services Team at Wiley, pondered what societies really want from publishers. Primarily it is financial, and particularly around new revenue. This new revenue can come from new markets, new adjacent markets and new products. Surprisingly, the biggest growth in content is in Latin America, so it is an area they focus on. With adjacent markets, transactional income such as rental and advertising can be considered. You can think about the user as potential adjacent revenue, but a user-pays model can be risky. With new breakout products you need technical insight, lower costs and usability. They consider this when looking at developing author services.

Graf closed by reflecting on revenue streams. What we have right now is a complex eco-system that publishers and societies benefit from, but it has taken hundreds of years to develop. It's worth bearing that in mind.

Tanya Field, Director of Mobile Value Partners and self-proclaimed outsider, had a simple message. All the other industries are having to learn that working as individuals will enable them to overcome the hurdles they face to remain profitable. Whatever you deliver to your consumers, the actual delivery needs to be simple and incredibly easy to use. That means presentation levels for every single format. That's a technical challenge as there are so many formats. You really need very clear signposting and intuitive flows for the users to get to the content. It's not just about delivering flat information. Younger users want to engage with content.

Your distribution strategy needs to be at the top of the access point. Last, but not least, the most important thing you need to consider is that the whole world is driven by data. Context and relevance are key to success. Know who your customer is, what they like, when they like it and deliver it to them. Your customer data strategy is key. If data isn't at the heart of your strategy it will be a problem in the future.

Peer review: evolution, experiment and debate

John Sack, Founding Director of HighWire Press introduced the morning panel on peer review at the 2015 ALPSP Conference.

Dr Aileen Fyfe, PI of the ‘Publishing the Philosophical Transactions’ project at the University of St Andrews reflected on how Henry Oldenburg used an editorial driven model for Philosophical Transactions. The Royal Society approved all issues for publication. But it wasn't at the article level, it was to confirm there were no threats to the country: it was ratification of a sort.

In the 1760s the Society considered papers by looking at the abstracts and would take a vote. This was to protect the reputation of the Society and did not check facts or reasoning. Meanwhile in France, at the Académie royale des sciences, academicians might be asked to report jointly on the truth claims being made in submitted papers. However, they did not judge each other, only outsiders. This high level of scrutiny was abandoned in the 1780s as it was deemed too difficult. At the Philosophical Transactions in the 1860s, referees who were members of the Society would make recommendations for publication. They provided literary comments about the article. The Fellows were writing about each other's work as well as that of outsiders. This now included a judgment on originality and significance.

The Philosophical Magazine in the 1920s:
Nature in the 1950s-70s would publish papers if they weren't actually wrong, and refereeing was erratic. They relied on papers that came from good institutions and/or known labs. In summary, the history of peer review is not as simple as you might imagine. It is much better to understand this before we move on to revise and update peer review.

Dr John R Inglis, Executive Director and Publisher at Cold Spring Harbor Laboratory Press, wryly noted there are many critics of modern peer review, but the prevailing view tends to be that it may not be perfect, but it's better than nothing. What do scientists think about peer review? Most are satisfied with it, think it helps scientific communication and think it has improved their papers. However, many think it can be improved, think it holds back science and believe it is now unsustainable given the growing number of journals and volume of science.

Most scientists think peer review should improve the quality of a paper, and determine its originality and the importance of its findings. It helps to ensure previous work is acknowledged, helps select the best papers for the journal and detects falsehoods. There are many ways that peer review is changing, including double blinding, transparency, publishing reviews alongside papers, checking figures for manipulation, using specialised data, validating authors and reviewers and forbidding author-offered reviewers.

There are also changes to where peer review is done. Often it is outsourced to peer review platforms like Rubriq, Peerage of Science, Editage and Publons. There are also changes to when peer review is done. After publication there are a range of options to comment on papers: journal-specific commenting functions, PubMed Commons, PubPeer, ResearchGate, ScienceOpen and

Cold Spring Harbor Press launched their pre-print server for biology, bioRxiv. It's a not-for-profit free service that distributes draft papers for open comment. Posting is quick. Papers get a date stamp and a DOI. There is a commenting function and they link to the history of the evolution of the paper. It results in rapid transmission of results for community consideration. They have more than 2000 manuscripts posted from over 40 countries and more than 800 institutions. There are rising rates of submission and usage. Every subject category is represented and most manuscripts eventually appear in journals.

They know that 30% of papers have been revised and 33% of all papers have been published in more than 190 journals. They have extensive feedback via social media including 25,000 tweets. There are plans to make submission easier for authors. The use of pre-prints is changing. The behaviour of biologists is changing and journals' policies are changing. Inglis closed by quoting the warnings contained in the Research Information Network report on Peer Review.

Dr Simon Kerridge is Director of Research Services, University of Kent and Chair of the Board of Directors at the Association of Research Managers and Administrators. Peer review is generally for journal articles, monographs and other long form research outputs, research data and other forms of scholarly output. From his point of view, it also includes research project proposals, research environment and strategy, and research impact. There are many purely academic reasons for doing peer review, but recognition is a part of it too. Promotion, esteem, time and money are all factors. Many universities have 'citizenship' as criteria for promotion where peer review is a factor. There may be internal or external mentoring or structured support. Becoming a journal editor or reviewer looks good on a CV.

By raising your profile you gain more recognition, but how is this recorded or advertised? Very few journals list reviewers. Some funders list their 'peer review college' and most conferences list reviewers. Most academics list their own reviewing and universities try to keep full lists. There are internal work allocation models that provide recognition for peer review. Some journals reward reviewers with a reduction in charges, 'peer review miles' to offset future fees and other waivers. Some conferences offer reductions and some funders do pay reviewers or their institution. There are some universities that pay bonuses for peer review (e.g. REF peer review panel). With internal peer review it is unlikely you will get paid.

Dr Kirsty Edgar, Leverhulme Early Career Fellow at the University of Bristol, provided an early career researcher (ECR) point of view on peer review. She reflected that it always seems to be the third review that's bad! ECRs want to improve research, get an academic seal of approval and improve the dissemination of the research. But most importantly, they want to get through the process, publish in the highest impact journal possible, improve their CV and get a job.

There are several issues. There is little in the way of support or training, although this is improving. Are you getting a fair deal as a reviewee? Will you get promotion or a fellowship? Will people read your work and will you be able to afford to publish it?

There are some solutions: improve training or change the system in a small way, such as peer choice, cascading reviews and open peer reviews. You can also fundamentally change the system by getting rid of journals or, on a slightly less radical agenda, introduce pre-publication, data and post-publication review. Edgar cited eLife as a model for the support it provides to early career researchers. Edgar closed with some recommended reading: Sense About Science, the Voice of Young Science blog by James Steele and the BioMed Central Blog by Sarah Hayes.

Thursday, 10 September 2015

What does content and behavioural data mean for publishing? Microsoft's Kuansan Wang considers.

The availability of large amounts of content and behavioural data has also instigated new interdisciplinary research activities in the areas of information retrieval, natural language processing, machine learning, behavioural studies, social computing and data mining.

Kuansan Wang, Director of the Internet Service Research Centre at Microsoft Research considered the impact for the publishing and consumption of content, drawing on observations derived from a web scale data set, newly released to the public.

If you think about the web as a gigantic library of the future, then you should think about the semantic web as the librarian. It involves trust, proof, logic, ontology vocabulary, RDF Schema, XML Schema, Unicode and URI.

A central theme for the semantic web is trying to help a machine read and make sense of content: human-readable versus machine-readable contents. The semantic web requires humans to define a standard for data formats and models. It has an explicit and precise specification of knowledge representation that everyone has to agree upon.

The knowledge web is where a machine reads human readable contents. With the knowledge web, the machine learns to conflate different formats of the same thing. It involves latent and fuzzy representation of knowledge learned by mining big data.

There has been a paradigm shift in discovery. Traditional web search indexes keywords in documents, matches keywords in queries and presents relevance as "10 blue links". Knowledge web search digests the world's knowledge, matches user intent and offers a dialogue experience.

The dialogue acts in Bing and Cortana are:
  1. answer 
  2. confirmation 
  3. disambiguation 
  4. suggestion 
  5. progress: refinement.

In Bing, you get answers, there is an element of confirmation/correction, refinement dialogue and digressive suggestion. The interface is designed for naturally spoken language with context, confirmation and answer. You don't have to go to the search page, the disambiguation starts as you type. They train the system to try to summarise what it has to learn.

Some of the issues that bug the academic community are:
  • How to recommend completions for seldom observed or never foreseen queries?
  • How to rank these suggestions?
  • How to avoid making suggestions leading to no or bad results?
For finding researchers and potential collaborators they train a machine to go through and aggregate all the information.

Cortana provides proactive suggestions on Windows, Android and iOS. The concept is based on the successful personal assistants to the stars, who write down the interests and activities of the people they serve to gain better insight. They have built in a lot of switches you can turn on or off for personalisation if you have privacy concerns, and have now trained Cortana to do this for academics. One of the pain points you hit as a researcher is the paywall. Cortana tries to help by showing not only the academic article, but also related news stories.

The latest Microsoft vision is about empowering every person and every business to achieve more. They intend to do this through re-imagined productivity, more personal computing and the most intelligent cloud. This translates to academic search, Cortana Academic and Project Oxford.