Wednesday 10 July 2013

The Royal Society of Chemistry's Richard Kidd with some home truths on semantic enrichment

Richard Kidd on semantics
Richard Kidd, Business Development Manager in the Strategic Innovation Group at the Royal Society of Chemistry, outlined semantics for discovery in STM.

He reflected on the challenge of having to learn many different ways to research, as highlighted earlier in the day by Russell Burke from Royal Holloway at the 'It's all about discoverability, stupid!' seminar.

How can semantic enrichment help support user search behaviour? And how can it help to surface under-researched content? You can use classification and index terms, identify keywords, and then build your own classification.

The value of topic modelling
When RSC Advances launched, it was a new journal covering all of chemistry (it published 1800 articles in 2012). They needed a sensible way of navigating all this content, so they used topic modelling. Their publishing expertise gave them 12 broad subjects that would be intuitive to users. They worked with Wordle and then sense checked the results, removing seven topics that were nonsense and one that was too general, which left over 120 classified topics. The remaining topics can be used for gap analysis, finding hot topics and assessing competitor weaknesses.
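To make the sense check and gap analysis concrete, here is a minimal Python sketch. It is not the RSC's method, and it does not do the topic modelling itself: it assumes a model has already assigned each article a dominant topic. The article IDs, topic labels and threshold are all invented for illustration.

```python
from collections import Counter

# Hypothetical output of a topic model: article ID -> dominant topic label.
article_topics = {
    "a1": "catalysis",
    "a2": "catalysis",
    "a3": "polymers",
    "a4": "results and discussion",  # an example of a too-general topic
    "a5": "catalysis",
    "a6": "nanomaterials",
}

# Topics an editor threw out during the sense check (assumed labels).
rejected = {"results and discussion"}


def topic_counts(assignments, rejected_topics):
    """Count articles per topic, skipping topics rejected in the sense check."""
    return Counter(t for t in assignments.values() if t not in rejected_topics)


def under_represented(counts, threshold):
    """Return topics with fewer than `threshold` articles, sorted by name."""
    return sorted(t for t, n in counts.items() if n < threshold)


counts = topic_counts(article_topics, rejected)
print(counts.most_common(1))         # the busiest topic
print(under_represented(counts, 2))  # candidate gaps
```

The same counts, compared over time or against a competitor's content, would be one simple way to look for hot topics and weak spots.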

Steer clear of ontologies (unless you really want to do them)
A heavier level of classification is the ontology, which is used in text mining. An ontology is a machine-readable account of what there is in a given domain and how the things there relate to one another. Ontologies tend to be very domain specific and therefore context specific. But real-world concepts don't fit easily into hierarchies, which can lead to large gaps in coverage, and ontologies take considerable effort to build and maintain. Kidd's advice? Steer clear of them unless you really want to do them.
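For readers who haven't met one, a toy version of "what there is and how things relate" can be sketched in a few lines of Python. This is only an illustration: real ontologies are usually written in dedicated languages such as OWL, and the terms and relation names below are made up.

```python
# Toy ontology as (subject, relation, object) triples. All terms are illustrative.
TRIPLES = [
    ("ethanol", "is_a", "alcohol"),
    ("alcohol", "is_a", "organic compound"),
    ("organic compound", "is_a", "chemical compound"),
    ("ethanol", "has_role", "solvent"),
]


def ancestors(term, triples):
    """Return every class reachable from `term` by following is_a links."""
    found = []
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in triples:
            if subject == current and relation == "is_a" and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


print(ancestors("ethanol", TRIPLES))
# ['alcohol', 'organic compound', 'chemical compound']
```

Even here the hierarchy problem shows up: "solvent" is a role rather than a parent class, so it needs its own relation and is invisible to the is_a query.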

Build links on meaning
The semantic web builds links on meaning, and each concept should have a URL or URI. You have to use standard approaches so resources can connect. Why take a semantic approach? There are only a few successful projects out there. Why is that? There are so many different formats, structures, vocabularies, concepts and meanings. The structure should be data-oriented (not HTML). The meaning of data should be clear (not XML). Reusable mappings between data are needed (not XSLT). You should avoid extensive schema rewriting (not data warehouses). And data should have standard APIs (not Flickr). And to quote or paraphrase Lee Harland from Open PHACTS, you need to 'synergise with many public efforts.'
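As a rough idea of what "each concept has a URI" looks like in practice, the Python sketch below writes three linked data statements in N-Triples form using the W3C SKOS vocabulary. The example.org namespace, the concept names and the helper functions are placeholders of mine, not anything Kidd showed.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"  # W3C SKOS vocabulary namespace
BASE = "http://example.org/concept/"           # placeholder namespace (assumption)


def uri(local_name):
    """Build the full URI reference for a concept in the placeholder namespace."""
    return f"<{BASE}{local_name}>"


def label_triple(local_name, label, lang="en"):
    """One N-Triples line giving a concept its preferred label."""
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return f'{uri(local_name)} <{SKOS}prefLabel> "{escaped}"@{lang} .'


def broader_triple(narrow, broad):
    """One N-Triples line linking a narrower concept to a broader one."""
    return f"{uri(narrow)} <{SKOS}broader> {uri(broad)} ."


lines = [
    label_triple("catalysis", "Catalysis"),
    label_triple("photocatalysis", "Photocatalysis"),
    broader_triple("photocatalysis", "catalysis"),
]
print("\n".join(lines))
```

Because the predicates come from a shared standard, another party's data that uses the same concept URIs can link to these statements without any schema rewriting.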

Where people have been doing a lot of semantic work, it has tended to be to link their own content together (for example, GeoFacets from Elsevier or Nature.com linked data projects).

Another consideration is text mining. This builds on entity identification (from dictionaries, ontologies, taxonomies) to extract recipes and additional data. You insert the semantic mark-up and conduct sentiment analysis and business intelligence. Examples include NaCTeM tools applied to Europe PMC and SciBite.com.
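The entity identification and mark-up steps can be sketched with a simple dictionary lookup in Python. This is a deliberately naive illustration: production tools handle synonyms, ambiguity and context, and the identifiers and the data-entity attribute here are invented.

```python
import re

# Tiny illustrative dictionary: surface form -> identifier (identifiers are made up).
DICTIONARY = {
    "ethanol": "chem:0001",
    "sodium chloride": "chem:0002",
}

# Try longer names first so multi-word entries win over shorter overlaps.
_names = sorted(DICTIONARY, key=len, reverse=True)
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in _names) + r")\b",
    re.IGNORECASE,
)


def annotate(text):
    """Wrap each dictionary match in a span that carries its identifier."""
    def mark(match):
        surface = match.group(0)
        identifier = DICTIONARY[surface.lower()]
        return f'<span data-entity="{identifier}">{surface}</span>'
    return PATTERN.sub(mark, text)


print(annotate("Dissolve sodium chloride in Ethanol."))
```

Once the entities are marked up, anything downstream can count, link or filter on the identifiers rather than on raw strings.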

Does text mining by a publisher help text miners? In Kidd's opinion, mostly not.

Semantics won't help with content discovery right now
What should you do? For an immediate return, there isn't much evidence that semantics will help your content discovery right now. Customers aren't paying for enriched data. However, making discovery metadata widely available as linked data seems sensible. Semantic work tends to be about development rather than discovery. Currently, it is used internally for product development and pump priming. A lot of data is being pushed externally via APIs for linking, while everyone waits for clever ideas to come along.

And in the future? Big issues are likely to focus on authoring tools and data quality.
