Text and Data Mining Member Briefing (member login required). As part of the update, Roy Kaufman, Managing Director of New Ventures at Copyright Clearance Center, provided an overview of the potential of TDM, outlined below.
"Big data may be making headlines, but numbers don’t always tell the whole story. Experts estimate that at least 80 percent of all data in any organization—not to mention in the World Wide Web at large— is what’s known as unstructured data. Examples include email, blogs, journals, Power Point presentations, and social media, all of which are primarily made up of text. It’s no surprise, then, that data mining, the computerized process of identifying relationships in huge sets of numbers to uncover new information, is rapidly morphing into text and data mining (TDM), which is creating novel uses for old- fashioned content and bringing new value to it. Why? Text-based resources like news feeds or scientific journals provide crucial information that can guide predictions about whether the stock market will rise or fall, can gauge consumers’ feelings about a particular product or company, or can uncover connections between various protein interactions that lead to the development of a new drug.
For example, a 2010 study at Indiana University in Bloomington found a correlation between the overall mood of the 500 million tweets released on a given day and the trending of the Dow Jones Industrial Average. Specifically, measurements of the collective public mood derived from millions of tweets predicted the rise and fall of the Dow Jones Industrial Average up to a week in advance with an accuracy approaching 90 percent, according to study author Johan Bollen, Ph.D., an associate professor in the School of Informatics and Computing. At the time, Dr. Bollen predicted, with uncanny accuracy, where he felt TDM was going, from the imprecise, quirky world of Facebook and Twitter to high-value content. He said, "We are hopeful to find equal or better improvements for more sophisticated market models that may in fact include other information derived from news sources and a variety of relevant economic indicators."
In other words, structured data alone is not enough, nor is text mined from the wilds of social media. Wall Street and marketers, eager to predict the right moment to hit buy or sell or to launch an ad campaign, have already moved from mining Facebook and Twitter to licensing high-value content, such as raw newsfeeds from Thomson Reuters and the Associated Press, as well as scientific journal articles reformatted in machine-readable XML. In fact, a 2014 study by Seth Grimes of Alta Plana concludes that the text mining market already exceeds 2 billion dollars per year, with a compound annual growth rate (CAGR) of at least 25%.
Far from being irrelevant in our digital age, high-value content is about to have its moment, and not just to improve the odds in the financial world or help marketers sell soap. It represents a new revenue stream for publishers and their thousands of scientific journals as well. For example, in 2003, immunologist Marc Weeber and his associates used text mining tools to search for scientific papers on thalidomide and then targeted those papers that contained concepts related to immunology. They ultimately discovered three possible new uses for the banned drug. “Type in thalidomide and you get between 2,000 and 3,000 hits. Type in disease and you get 40,000 hits,” writes Weeber in his report in the Journal of the American Medical Informatics Association. “With automated text mining tools, we only had to read 100-200 abstracts and 20 or 30 full papers to create viable hypotheses that others could follow up on, saving countless steps and years of research.”
The potential of computer-generated, text-driven insight is only increasing. In his 2014 TEDx Talk, Charles Stryker, CEO of the Venture Development Center, points out that the average oncologist, after scouring journals the usual way, reading them one by one, might be able to keep track of six or eight similar cancer cases at a time, recalling details that might help him or her go back, re-read one or two of those articles, and determine the best course of care for a patient with an intractable cancer. The data banks of the two major cancer institutes, on the other hand, hold searchable records of cancer cases that can be reviewed in conjunction with 3 billion DNA base pairs and 20,000 genes contained within each. So using that data would mean a vast improvement in the odds of finding clues to help treat a tricky case or target the best clinical trial for someone with a rare disease. This information might otherwise have been difficult, if not impossible, for even the most plugged-in oncologist to find, let alone read, search for patterns, or retain over time.
Think, then, of the possibilities of improving healthcare outcomes if the best biomedical research were aggregated in just a few easily accessible repositories. That’s about to happen. My employer, Copyright Clearance Center (CCC), is coming to market with a new service designed to make it easier to mine high-value journal content. Scientific, technical and medical publishers are opting into the program, and CCC will aggregate and license content to users in XML for text mining. Although the service has not yet fully launched, CCC already has publishers representing thousands of journals and millions of articles participating.
Consider the difficulties of researchers, doctors, or pharmaceutical companies wishing to use text mining to see if cancer patients on a certain diabetes drug might have a better outcome than patients not on the drug. They must go to each publisher, negotiate a price for the rights, get a feed of the journals, and convert that feed into a single useable format. If the top 20 companies did this with the top 20 publishers, it would take 400 agreements, 400 feeds, and 400 XML conversions. The effort would be overwhelming.
Instead, envision a world where users can avail themselves of an aggregate of all relevant journals in their field of interest. Instead of 400 agreements and feeds to navigate and 400 feeds to convert to XML, there would be maybe 40 agreements: 20 between the publishers and CCC and 20 with users. There would be no need for customers to convert the text. In other words, researchers could get their hands on the high-value information they need to move research and healthcare forward, in less time, with less effort. And that’s only the beginning. As Stryker said about the promise of TDM, “We are in the first inning of a nine-inning game. It’s all coming together at this moment in time.”
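The arithmetic behind the 400-versus-40 comparison can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch assumes every user needs content from every publisher, as in the 20-by-20 example above.

```python
def bilateral_agreements(users: int, publishers: int) -> int:
    """Agreements needed when every user negotiates with every publisher."""
    return users * publishers


def aggregated_agreements(users: int, publishers: int) -> int:
    """Agreements needed when each party signs once with a central aggregator."""
    return users + publishers


if __name__ == "__main__":
    users, publishers = 20, 20  # the figures used in the example above
    print(bilateral_agreements(users, publishers))   # 400
    print(aggregated_agreements(users, publishers))  # 40
```

The direct approach grows with the product of the two groups, while the aggregated approach grows only with their sum, so the gap widens as more users and publishers take part.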
ALPSP members can log in to the website to view the Briefing.
Roy Kaufman is Managing Director of New Ventures at the Copyright Clearance Center. He is responsible for expanding service capabilities as CCC moves into new markets and services. Prior to CCC, Kaufman served as Legal Director, Wiley-Blackwell, John Wiley and Sons, Inc. He is a member of the Bar of the State of New York and a member of, among other things, the Copyright Committee of the International Association of Scientific, Technical and Medical Publishers and the UK's Gold Open Access Infrastructure Program. He formerly chaired the legal working group of CrossRef, which he helped to form, and also worked on the launch of ORCID. He has lectured extensively on the subjects of copyright, licensing, new media, artists' rights, and art law. Roy is Editor-in-Chief of ‘Art Law Handbook: From Antiquities to the Internet’ and author of two books on publishing contract law. He is a graduate of Brandeis University and Columbia Law School.