ALPSP blog: at the heart of scholarly publishing: Why Publishers Need to Know the Difference between Search and Text Mining

Haralambos “Babis” Marmanis, CTO and VP, Engineering & Product Development at the Copyright Clearance Center, looks at the concepts behind search and text mining and highlights why publishers need to understand the differences in order to make the best use of each.

As the author of works on search and the lead architect of a product that enables text mining of scientific journal articles, I am often asked about the difference between search and text mining, and have observed that the two are sometimes conflated. Unless you work with technology every day, this confusion is certainly understandable. Knowing the differences, however, can open new business opportunities for publishers. Both functions apply algorithms to natural language text, and both need to cope with the fact that, compared with “pure data,” text is messy. Text is unstructured, amorphous, and difficult to deal with algorithmically.

While the challenges associated with text are common to both search and text mining, the details with respect to inputs, analytical techniques, outputs, and use cases differ greatly. For years, publishers have been engaged in search engine optimization, designed to make their works more discoverable to users. As publishers are increasingly asked to enable text mining of their content, they enter new territory – a territory that is different from that of public search engines. Thus, it is more important than ever to understand the difference between these two distinct mechanisms of processing content, so that optimal business and licensing strategies are chosen for each.

To begin with, let me describe the key concepts for each area. "Search" means the retrieval of documents based on certain search terms. Think, for example, of your usual web search on well-known search engines such as Google, Yahoo or Bing. In search, the typical actions performed by a software system are index-based and designed for the retrieval of documents. The indexing process therefore aims to build a look-up table that organizes the documents based on the words they contain. The output is typically a hyperlink to text or information residing elsewhere, along with a small amount of text that describes what is to be found at the other end of the link. In these systems, no “net new” information is derived from the documents through the processes that are employed to create the search index. The purpose is to find the existing work so that its content can be used.
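
To make the look-up table idea concrete, here is a minimal sketch of such an index in Python. It is an illustration only, not how any particular search engine works: the three sample documents, their ids, and the function names are all made up for this example, and real engines add ranking, stemming, and snippets on top.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build the look-up table: each word maps to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


# Hypothetical three-document collection, invented for illustration.
documents = {
    "doc1": "Avian flu outbreaks in poultry farms",
    "doc2": "Seasonal flu vaccination rates",
    "doc3": "Poultry farm management practices",
}
index = build_index(documents)
results = search(index, "avian flu")  # ids of documents containing both words
```

Note that the index only records where words occur. The result of a query is a list of pointers back to existing documents, which is the sense in which nothing “net new” is derived.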

On the other hand, "text mining" is a less widely understood but well-developed field that deals with analyzing (not finding) text. That is, while text mining can sometimes look at meta-textual issues – for example, tracking the history of science by counting the instances of a specific phrase (e.g., “avian flu”) in articles – more often the goal is to extract expressed information that is useful for particular purposes, not just to find, link to, and retrieve documents that contain specific facts.

Text mining tools accomplish this by allowing computers to rapidly process thousands of articles and integrate a wealth of information. Some tools rely on parsing the text contained in the documents and apply simple algorithms that effectively count the words of interest. Other tools dig deeper and extract basic language structure and meaning (such as identifying noun phrases or genes) or even analyze the complete grammatical structure of millions of sentences in order to gain insights from the textual expression of the authors. By extracting facts along with authors’ interpretations and opinions over a broad corpus of text, this more sophisticated approach can deliver precise and comprehensive information and, in the commercial setting, provides more value than simple word counts.
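
The simplest kind of tool described above, one that counts a phrase of interest across many articles, can be sketched in a few lines of Python. This is an assumption-laden illustration: the sample articles, their years, and the function names are invented here, and a real pipeline would read full texts from a licensed corpus.

```python
import re
from collections import Counter


def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of a phrase in one text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def phrase_counts_by_year(articles, phrase):
    """Sum phrase occurrences over all articles, grouped by publication year."""
    totals = Counter()
    for year, text in articles:
        totals[year] += count_phrase(text, phrase)
    return dict(totals)


# Hypothetical (year, text) pairs, invented for illustration.
articles = [
    (2004, "Avian flu was detected. The avian flu strain spread quickly."),
    (2004, "A report on seasonal influenza."),
    (2005, "New avian flu cases were confirmed."),
]
counts = phrase_counts_by_year(articles, "avian flu")
```

Unlike a search result, the output here is a new table of numbers about the corpus rather than a set of links to the articles themselves.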

Unlike with search, the output of text mining will vary depending on the use to which the researcher wishes to apply the results. In some contexts, the output is digital and designed for machines to process. In other examples, such as using text mining to drive marketing of products and services, the ultimate output will be human-readable text. In other words, even when text mining is performed, sometimes the user needs and receives the full article.

Although both search and text mining involve the parsing and lexical analysis of documents, there are important differences that should drive a publisher’s decisions about investments in text mining and search.

In text mining, the processing and analysis are often done on a project-by-project basis. Unlike the search functionality provided by search engines, the “how, why, and what” are infinitely variable, and it is difficult to anticipate accurately the inputs, processes, and outputs required. For example, depending on a text miner’s use case, the output may be facts, data, links, or full expression, as opposed to the simple links that are the output of search.

Search is about finding a set of relevant documents, each of which is considered independently by the algorithm; if applied to a single document, the process will yield the same result for that document. Text mining, on the other hand, is mostly about discovering and using information that lives in the fabric of a corpus of documents. Change one document and the fabric of the corpus changes. Mining is usually (but not always) consumptive of the content. So, the “search” process is document-by-document specific, while the “mining” process involves sets of documents and how these documents relate to each other.

Lastly, the mining process aims at extracting “higher-order” information that involves first-, second-, and higher-order correlations that may occur among any combination of the terms, data, or expressions appearing in the corpus of documents that is processed.
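
One very simple stand-in for a second-order correlation is a co-occurrence count: how many documents in the corpus mention two terms together. The Python sketch below assumes a made-up corpus with placeholder terms (`gene_x`, `disease_y`, `drug_z`) and uses plain substring matching; it is meant only to show that the result belongs to the corpus as a whole and shifts when a single document changes.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(corpus, terms):
    """Count, for each pair of terms, how many documents mention both."""
    counts = Counter()
    for text in corpus:
        lowered = text.lower()
        # Naive substring matching; a real tool would tokenize and normalize.
        present = sorted(term for term in terms if term in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Placeholder terms and documents, invented for illustration.
terms = ["gene_x", "disease_y", "drug_z"]
corpus = [
    "gene_x is associated with disease_y",
    "drug_z was tested against disease_y",
    "gene_x expression changed under drug_z",
]
counts = cooccurrence_counts(corpus, terms)

# Replace only the last document and recompute over the whole corpus.
changed_corpus = corpus[:2] + ["gene_x and disease_y were both studied"]
changed = cooccurrence_counts(changed_corpus, terms)
```

In the original corpus each pair of terms co-occurs in one document. After the single replacement, the `gene_x` and `disease_y` pair appears in two documents and the `gene_x` and `drug_z` pair disappears, even though the other two documents are untouched.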

In summary, search and text mining should be considered as two quite distinct processing mechanisms, often with different inputs and outputs. Publishers need to engage with both, but conflating them means losing the unique opportunities and strengths that each provides. With search, it’s all about helping users find the specific content that they are looking for. Text mining goes well beyond search, to find multiple meanings in a publisher’s content in order to derive new value from it. Hence, one would expect that, just as the processes themselves differ, publishers’ licenses for the search and text mining processes will differ too.

Monday, 9 November 2015
