ALPSP blog: at the heart of scholarly publishing: Richard Kidd

Wednesday, 10 July 2013

The Royal Society of Chemistry's Richard Kidd with some home truths on semantic enrichment

Richard Kidd on semantics

Richard Kidd, Business Development Manager in the Strategic Innovation Group at the Royal Society of Chemistry, outlined semantics for discovery in STM.

He reflected on the challenge of having to learn many different ways to research, as highlighted by Russell Burke from Royal Holloway earlier in the day at the 'It's all about discoverability, stupid!' seminar.

How can semantic enrichment help support user search behaviour? And how can it help to show under-researched content? You can use classification and index terms, identify keywords, and then build your own classification.

The value of topic modelling
When RSC Advances launched, it was a new journal covering all of chemistry (and published 1800 articles in 2012). They needed a sensible way of navigating all this content, so they used topic modelling. Their publishing expertise gave them 12 broad subjects that would be intuitive to users. They worked with Wordle and then sense-checked the results, removing seven topics that were nonsense and one topic that was too general, which left over 120 classified topics. The remaining topics can be used for gap analysis, finding hot topics and assessing competitor weaknesses.

Steer clear of ontologies (unless you really want to do them)
A heavier level of classification is the ontology, which is used in text mining. An ontology is a machine-readable account of what there is in a given domain and how the things there relate to other things. Ontologies tend to be very domain specific and therefore context specific. But things in the real world don't fit easily into hierarchies, which can lead to large gaps in coverage, and ontologies require considerable effort to build and maintain. Kidd's advice? Steer clear of them unless you really want to do them.
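To make the definition concrete, here is a minimal sketch in Python of an ontology held as subject-predicate-object statements. The chemistry terms and the `parents` and `is_a` helpers are illustrative assumptions, not taken from any RSC ontology or from Kidd's talk.

```python
# Toy ontology: (subject, predicate, object) statements.
# All terms below are illustrative assumptions, not a real vocabulary.
TRIPLES = {
    ("ethanol", "is_a", "alcohol"),
    ("alcohol", "is_a", "organic compound"),
    ("organic compound", "is_a", "chemical entity"),
    ("ethanol", "has_role", "solvent"),
}


def parents(term):
    """Return the direct 'is_a' parents of a term."""
    return {o for s, p, o in TRIPLES if s == term and p == "is_a"}


def is_a(term, ancestor):
    """Follow 'is_a' links transitively to test class membership."""
    seen = set()
    frontier = {term}
    while frontier:
        current = frontier.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        frontier |= parents(current)
    return False
```

Here `is_a("ethanol", "chemical entity")` holds, but "solvent" is a role rather than a kind of compound, so it sits outside the hierarchy. Even in a toy example, not everything fits into a single tree.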

Build links on meaning
The semantic web builds links on meaning, and each concept should have URLs or URIs. You have to use standard approaches so resources can connect. Why take a semantic approach? There are only a few successful projects out there. Why is that? There are so many different formats, structures, vocabularies, concepts and meanings. The structure should be data-oriented (not HTML). The meaning of data should be clear (not XML). Reusable mappings between data are needed (not XSLT). You should avoid extensive schema rewriting (not data warehouses). And data should have standard APIs (not Flickr). And to quote or paraphrase Lee Harland from Open PHACTS, you need to 'synergise with many public efforts.'
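As a rough illustration of linking on meaning, the Python sketch below joins two independent datasets on a shared concept URI rather than on a local field name. The example.org URIs, the record fields and the `link_on_meaning` function are all hypothetical; this is not how any particular publisher implements it.

```python
# Two independent datasets that describe the same concept.
# The example.org URIs and all field values are hypothetical.
ARTICLES = [
    {"title": "Article A", "about": "http://example.org/concept/ethanol"},
    {"title": "Article B", "about": "http://example.org/concept/benzene"},
]
COMPOUND_DATA = {
    "http://example.org/concept/ethanol": {"formula": "C2H6O"},
}


def link_on_meaning(articles, compound_data):
    """Join records from both sources on the shared concept URI."""
    linked = []
    for article in articles:
        facts = compound_data.get(article["about"])
        if facts is not None:
            linked.append({**article, **facts})
    return linked
```

Only "Article A" is linked, because only its concept URI appears in both sources. The join works without either side knowing the other's schema, which is the point of agreeing on identifiers.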

Where people have been doing a lot of semantic work, it has tended to be to link their own content together (for example, GeoFacets from Elsevier or Nature.com linked data projects).

Another consideration is text mining. This builds on entity identification (from dictionaries, ontologies, taxonomies) to extract recipes and additional data. You insert the semantic mark-up and conduct sentiment analysis and business intelligence. Examples include NaCTeM tools applied to Europe PMC and SciBite.com.
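The entity identification and mark-up steps can be sketched in Python as a simple dictionary lookup. The dictionary entries, the `concept:` identifiers, the `data-concept` attribute and the `mark_up` function are assumptions made for illustration; real tools such as those named above are far more sophisticated.

```python
import re

# Hypothetical dictionary mapping surface terms to concept identifiers.
DICTIONARY = {
    "ethanol": "concept:ethanol",
    "acetic acid": "concept:acetic-acid",
}


def mark_up(text, dictionary=DICTIONARY):
    """Wrap each dictionary term found in the text in a span carrying its identifier."""
    # Longest terms first so multi-word names win over any shorter overlap.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        concept = dictionary[match.group(1).lower()]
        return f'<span data-concept="{concept}">{match.group(1)}</span>'

    return pattern.sub(replace, text)
```

Calling `mark_up("Ethanol was added to acetic acid.")` wraps both terms in spans while leaving the rest of the sentence untouched. Downstream analysis can then work from the identifiers rather than the raw strings.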

Does a publisher's text mining help text miners? In Kidd's opinion, mostly not.

Semantics won't help with content discovery right now
What should you do? For an immediate return, there isn't much evidence that semantics will help your content discovery right now. Customers aren't paying for enriched data. However, making discovery metadata widely available as linked data seems sensible. Semantic work tends to be about development rather than discovery. Currently, it is used internally for product development and pump priming. A lot of data is being pushed externally via API for linking, while everyone waits for clever ideas to come along.

And in the future? Big issues are likely to focus on authoring tools and data quality.

Friday, 26 April 2013

What next for data analysis? Notes from the London Book Fair 2013

The panel line up for questions

What next for data analysis? A scholarly publisher's guide was a seminar organised by ALPSP at this year's London Book Fair. The panel discussed the importance of researchers sharing data, how it benefits the public as well as advancing disciplines, and how a reward system is needed around publishing and sharing data. Encouragingly, it's clear that publishers have an important role to play.

The problem with not sharing

Lee-Ann Coleman, Head of Scientific, Technical and Medical Information at the British Library, chaired the session. She has particular insight into the use of data by researchers, having worked on the DRYAD project and, currently, DataCite. There are a number of challenges in sharing data amongst researchers. Coleman acknowledged that publishers have been helpful by requiring this, but it is not standard practice. The lack of sharing can be a real problem, particularly in public health or multidisciplinary areas. The current system does not realise the maximum return on sharing data, despite a focus on open data from policy makers and organisations such as the Royal Society.

Lee-Ann Coleman kicks off the session

The lack of a system to store, cite or link research data is the reason why the DataCite project was established in 2009. DataCite comprises full and associate member organisations, enabling them to assign Digital Object Identifiers (DOIs) to submitted data sets to support finding, accessing and reusing the data.

Read more about DataCite here.

What practical challenges do publishers face in making data open?

Phil Hurst is Publisher at The Royal Society, which published the research report Science as an open enterprise in 2012. It highlighted the need to deal with the deluge of data, to exploit it for the benefit of the development of science, and the need to preserve the principle of openness. Hurst asserted that before you can analyse data, you need to open it up. Why bother? A recent outbreak of E. coli was a classic case study of how open, shared data helped to quickly control the spread of a deadly bacterium.

The report highlights the power of opening up data for science and provides a vision of all scientific literature online. The Royal Society makes sharing data a condition of publication. The data should go into a repository where it can be linked to. In practical terms, it is still early days for this. Hurst observed that you need to identify suitable repositories, establish appropriate criteria and share a list to guide authors. One repository they are working with is DRYAD.

Phil Hurst and a nasty strain of E. coli

The Society has amended its licences to allow text and data mining and works with partners to facilitate this. Challenges to take into account include how to manage access control for text and data mining purposes. There are differences between subjects and varying degrees of willingness to share across the spectrum of science. Sharing data allows analysts to conduct meta-analyses, modelling, and data and text mining; and ultimately, enables scientists to get new scientific value from content.

Developing taxonomies to track and map data

Richard Kidd, Business Development Manager for the Strategic Innovation Group at the Royal Society of Chemistry, outlined how they had approached data analysis at the RSC by using topic modelling to determine a set of true topics. They identified/invented 12 broad subjects which then generated 100+ categories. These were narrowed down and then mapped to existing categories.

Richard Kidd from the RSC in action

The 12 general categories and 120 or so sub-categories enable them to map new content. As a result, as their publishing output shifts, they can continue to track and map its evolution. This taxonomy provides a navigation aid for journals. It also works across books, magazines and educational content. This provides sales opportunities with customers who have a subject-specific focus.

They are now looking at data in their publications and patterns in data for sub-domains and hope that this approach will allow them to look at their back list and bring back the original data points.

Chemists work in a laboratory group culture and don't have a community norm about sharing data. There is a lack of available standards, and there are issues about releasing data when patents could be developed. This leads to a more protective culture in relation to research data that can be at odds with open data principles. However, the RSC will be operating the EPSRC National Chemical Database, a domain repository for chemical sciences. Use and reuse are a priority, especially through data availability feeds.

The rise of the 'meta journal'

Brian Hole of open access publisher Ubiquity Press outlined how researchers’ needs drive their publishing efforts. The model they use encourages researchers to share data. Hole is a strong proponent of what he calls the social contract of science and considers not only publication of research but also research data to be an essential part of it. As a result, an author’s conclusions can be validated and their work more efficiently built upon by the research community. Conversely, he considers it effectively scientific malpractice to withhold data from the community. He argues that this principle applies to publishers, librarians and repositories as well as researchers.

Brian Hole from Ubiquity Press

Benefits of sharing data cut across different interest groups. Researchers want recognition in the form of citations and potential for career advancement, and those who share data tend to receive more citations. Sharing in turn makes data easier to find and use in future studies, which is more efficient. Shared data can be used in teaching to improve the learning experience. For the public, if it is easier to find data, it can help build public trust in science. There are also potential economic benefits for the private sector to drive innovation and product development. He believes that there are many disciplines that are yet to benefit, especially in the humanities.

Ubiquity Press are developing 'metajournals' to aid in discovery of research outputs scattered throughout the world in different repository silos, and also to provide incentives for researchers to openly share their data according to best practices. The metajournals provide researchers with citable publications for their data or software, which are then referenced by other researchers in articles and books. The citations are then tracked along with the public impact of papers (using altmetrics). The platform so far includes metajournals in public health, psychology, archaeology and research software, with more to come including economics and history. Read more about Ubiquity Press' metajournals here.

If you are interested in data, join us at the ALPSP Conference this September to hear Fiona Murphy from Wiley and a panel of industry specialists discuss Data: Not the why, but the how (and then what?). Book online by 14 June to secure the early bird rate.