ALPSP blog: at the heart of scholarly publishing: metadata


Wednesday, 10 August 2016

Spotlight on the Crossref Metadata API - shortlisted for the 2016 ALPSP Awards for Innovation in Publishing

This is the third in a series of interviews with the 2016 ALPSP Awards for Innovation in Publishing shortlist. Ginny Hendricks from Crossref tells us more about their Metadata API.

Tell us a bit about your company.

Crossref is a not-for-profit membership organization for scholarly publishing working to make content easy to find, cite, link, assess, and re-use. We do it in five ways: rallying the community; tagging metadata; running a shared infrastructure; playing with new technology; and making tools and services to improve research communications.

We’ve been around for sixteen years primarily for storing and registering identifiers that enable persistent linking between research articles. We’ve since grown to almost 6000 publisher members. This makes us not so much a Start-up as a ‘Scale-up’. We are seeing over 150 new publishers joining every month, international in scope and location, and many of these are library publishers, scholar publishers, and organizations exploring new publishing models.

What is the project you submitted for the Awards?

It’s the Crossref Metadata API, which is becoming a significant focus for us. We always describe APIs as machine-to-machine interfaces, but as more of us, including researchers, grow our developer mind-set, more of the services that we and others build need a dynamic way to integrate and use the cross-publisher metadata registered by Crossref. Put simply, the API lets anyone search, filter, facet and sample Crossref metadata related to over 80 million content items with unique Digital Object Identifiers (DOIs).
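As a rough illustration of what "search, filter, facet and sample" can look like in practice, the sketch below builds request URLs for the `/works` route of the public Crossref REST API. It only constructs URLs and makes no network calls. The helper name `works_url` and the example values are assumptions made for illustration, and the parameter details should be checked against Crossref's current API documentation.

```python
from urllib.parse import urlencode

# Base route for works in the Crossref REST API.
BASE = "https://api.crossref.org/works"


def works_url(**params):
    """Build a /works request URL from keyword parameters (hypothetical helper)."""
    # Drop unset parameters so the query string only contains what was asked for.
    query = {k: v for k, v in params.items() if v is not None}
    return f"{BASE}?{urlencode(query)}"


# Search: free-text query, limited to five results.
search_url = works_url(query="metadata quality", rows=5)

# Filter: only records that carry funder information.
filter_url = works_url(filter="has-funder:true", rows=5)

# Facet: ask for counts by content type, with no result rows returned.
facet_url = works_url(facet="type-name:*", rows=0)

# Sample: request a random sample of ten records.
sample_url = works_url(sample=10)

for url in (search_url, filter_url, facet_url, sample_url):
    print(url)
```

Each URL can be pasted into a browser or passed to any HTTP client; the response is JSON.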

Tell us more about how it works and the team behind it.

We’re a small team of fewer than thirty people total, about a third in Oxford, UK and two thirds in Boston, USA. This, like most Crossref initiatives, started with the R&D team led by Geoffrey Bilder and has been extensively developed by Karl Ward.

The API was initially conceived to support funders who wanted to be able to find and report on the outputs of the research they funded. This was information publishers had started to provide to Crossref, but to make best use of it, funders needed access to the most up-to-date information from publishers, and needed to filter and facet their searches to find the specific subsets of information relevant to the KPIs they were reporting on.

Then it grew. With the introduction of funding data we started to see the API being used extensively. At the same time, the breadth of the metadata that publishers can provide to Crossref has been growing, letting it be interrogated and used in lots of interesting ways. As such, the API has been developed to support the different questions that users might ask of the metadata: asking for things like licence information, ORCID iDs, full-text links and clinical trial numbers, and being able to filter on and combine these to get the specific subset of data they’re looking for.
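Combining several of these conditions might look something like the sketch below, which joins multiple filters into a single comma-separated `filter` value. The helper `combined_filter` is hypothetical, and the filter names shown (`has-license`, `has-orcid`, `has-full-text`) are given as I understand them from Crossref's REST API documentation, so they should be verified there before use.

```python
from urllib.parse import urlencode


def combined_filter(**flags):
    """Join several filters into one comma-separated filter value (hypothetical helper)."""
    # Keyword names use underscores; the API's filter names use hyphens.
    parts = [f"{name.replace('_', '-')}:{str(value).lower()}" for name, value in flags.items()]
    return ",".join(parts)


# Ask for records that have licence information, ORCID iDs and full-text links.
filter_value = combined_filter(has_license=True, has_orcid=True, has_full_text=True)
url = "https://api.crossref.org/works?" + urlencode({"filter": filter_value, "rows": 20})

print(filter_value)
print(url)
```

Other conditions would follow the same pattern, each adding one more `name:value` pair to the list.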

Why do you think it demonstrates publishing innovation?

For its openness, its wide applicability, and its growing user base. Also because it’s used solely by developers who are looking to innovate themselves. As a communications person it’s been really interesting to see how the developer community has engaged with the API. The kind of use cases we’re seeing include text-mining, simple reporting and tracking, notification services, search interfaces (including our own) and integration in online editing and blogging tools.

What are your plans for the future?

Robustness! We have plans to scale up the technology to handle the growing usage the API is experiencing and make sure we can wholeheartedly support and grow the community that is using it. Of course, the API is only as valuable as the information that publishers provide Crossref with, and we’ll also be encouraging publishers to deposit the best, most complete metadata they can to improve the discoverability and usability of the research they publish.

Ginny Hendricks is Director of Member & Community Outreach at Crossref.

You can watch Crossref present during the ALPSP Awards for Innovation in Publishing lightning sessions at the Conference in September, where the winners will be announced. Further information and booking available online.

The ALPSP Awards for Innovation in Publishing 2016 are sponsored by MPS Ltd.

Tuesday, 26 February 2013

ASA 2013: Laura Cox on Pulling Together - Information Flow Throughout the Scholarly Supply Chain

Laura Cox with a messy and complex supply chain

Laura Cox, Chief Marketing Officer at Ringgold Inc, talked through the problems of information flow throughout the scholarly supply chain. If only publishers would use the right identifiers with their content, then there is a huge opportunity to improve information, insight and cost efficiencies.

What are the things that go wrong? Records are unconnected through the supply chain. Links fail between entities, between internal systems, and between external systems. Renewals are mishandled. Journal transfers, access and authentication are mishandled. Authors and individuals are not linked to their institution. Open access fees have to be checked manually. Authors are not linked to their research and funders are not linked to the research they fund.

We need to find a path to using standardized data. Identifiers can help. They can provide a proper understanding of customers, whether author, reader or institution. They also provide a simple basis for wider data governance (defined here as the processes, policies, standards, organization and technology required to organize, maintain, access and interpret data) through:

ongoing data maintenance
identifiers enforce uniqueness
enable ongoing data governance
ensure systems work
help with cleansing data for future use.

Cox cited research from Mark Ware and Michael Mabe (The STM Report, 2012) for the wider context:

Journals are increasing by 3.5% per annum
There is an increase in the number of articles by 3% per annum
The number of researchers is increasing by 3% per annum
Growth in China is in double digits
There is increasing demand for anytime, anywhere access
Library budgets are frozen.

There are a number of identifiers available. For people, there is the International Standard Name Identifier (ISNI), a bridge identifier which can apply to authors, playwrights, artists - any creator. The Open Researcher and Contributor ID (ORCID) links research to its authors. It disambiguates names by taking account of the different ways in which they can be recorded, and can help remove problems with name changes. ORCID iDs can be embedded into research workflows and the supply chain, can link to altmetrics and can help integrate systems.

Institutional identifiers include Ringgold and ISNI, which map institutions and link data together. This ensures you can identify institutional customers so you can give correct content, and it disambiguates institutional naming conventions.

If you put institutional and author IDs together you gain genuine market intelligence:

who's working with whom and where
impact of research output on a particular institution - the contribution of their faculty
subscription sales or lack thereof
where research funding is concentrated
ability to track open access charges (APCs) to fee structure.

With internal linking in your systems, you can use identifiers to connect:

customer master file
financial system
CRM/sales database
authentication system
fulfilment
usage statistics
submissions system
author information.

This enables you to access information from multiple systems in one place, reducing time and cost in locating information, and enabling you to use information to make decisions and inform strategy.
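A minimal sketch of this kind of internal linking, assuming each system stores its records against the same institutional identifier. The identifiers, system names and values below are all invented for illustration.

```python
# Hypothetical records from three internal systems, each keyed by the same
# institutional identifier (the IDs and values below are invented).
crm = {"INST-001": {"name": "Example University", "account_manager": "J. Smith"}}
finance = {"INST-001": {"outstanding_invoices": 2}}
usage = {"INST-001": {"downloads_last_year": 15230}, "INST-002": {"downloads_last_year": 480}}


def customer_view(inst_id, **systems):
    """Merge what each system holds about one institution into a single record."""
    view = {"id": inst_id}
    for system_name, records in systems.items():
        # A missing entry shows up as None, exposing a gap in that system.
        view[system_name] = records.get(inst_id)
    return view


view = customer_view("INST-001", crm=crm, finance=finance, usage=usage)
gap = customer_view("INST-002", crm=crm, finance=finance, usage=usage)

print(view)
print(gap)
```

The second call shows the other benefit of a shared key: an institution that appears in usage statistics but not in the CRM or finance system is immediately visible.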

A nice and tidy supply chain

External linking using identifiers will enable you to:

ensure accuracy of information
speed up data transactions
reduce queries
reduce costs
open data up to new uses
provide a seamless supply chain where data flows from one organization to the next
ensure that authors receive credit for the work they produce
provide a good service to the community.

We need a forum to discuss and pull together: to engage with the problems in data transfer, generate an industry wide policy on using identifiers, break down the data silo mentality, and use universal identifiers to enable our systems to communicate with each other accurately on an ongoing basis. This will help serve the author and reader more effectively and strengthen the links in the supply chain.

ASA 2013: Ed Pentz on CrossMark - A New Era for Metadata

Horse burger, anyone?

Ed Pentz, Executive Director at CrossRef, provided an overview of how CrossMark provides information on enhancements and changes to an article - even if it is downloaded as a PDF to your computer.

With a slide showing a horse-shaped burger, Pentz observed that no one knew what was happening in the supply chain and ingredients were mis-labelled. As a consumer it's hard to know what's verified. Third-party certification marks such as Fairtrade or the Soil Association's have arisen to help consumers. This is an important lesson for the scholarly publishing community.

Pentz is not talking about bibliographic metadata. This is about some of the things that are changing in broader descriptive metadata - what are users starting to ask? They are interested in the status of the content. What's been done to this content? And what can I do with this content?

Good quality metadata drives discovery. However, there are problems with metadata and identification. This is a challenge for primary and secondary publishers: the existing bibliographic supply chain hasn't been sorted out, new things are being added in, and this could potentially lead to big problems.

NISO announced two weeks ago standards for open access metadata and indicators. The detail is still to follow which will include things like: licensing; has an APC been paid?; if so, how much and who pays it? These factors will be particularly important to help identify open access articles in hybrid journals.

There are a number of new measures that have to be captured via the workflow. These include:

The FundRef Workflow

CrossRef has launched the FundRef pilot to provide a standard way of reporting funding sources.
Altmetrics allow you to look at what happens after publication: aspects of usage, post-publication peer review, social buzz, and measures that go beyond impact factors.
PLOS has article level metrics - available via APIs.

What about content changes? Historically, the final version of the record has been viewed as something set in stone. We need to get away from this idea because it doesn't recognise the ongoing stewardship publishers have for the content.

Many things happen to the status of content - post-publication - including:

errata
updates
corrigenda
enhancements
retractions
protocol updates.

As we have heard throughout the conference, the number of retractions is on the rise. Pentz referred back to an article in Nature 478 (2011) on the trouble with science publishing retractions. The case is clear: when content changes, readers need to know, but there is no real system to do this.

In a digital world, notification of changes can be done more effectively, and that's what CrossRef is all about. Another challenge is the use of PDF: there is no way of knowing whether the status has changed. When online, the correction is often listed below the fold, even on a Google search. The whole issue of institutional repositories is also a factor.

What is CrossMark? It is a logo that identifies a publisher-maintained copy of a piece of content. When you click on the logo it tells you whether there have been updates, whether the copy is being maintained by the publisher, where the publisher-maintained copy is, what version it is, and other important publication record information.

Take the example of a PDF sitting on a researcher's hard drive, where the document has the CrossMark logo. Click on it for an update on whether the PDF version is current. You can then link through to the clarification if there is one. It includes a status tab and a publication record tab. The record tab is a flexible area where publishers can add lots of non-bibliographic information that is useful to the reader, for example, peer review, copyright and licensing, FundRef data, location of repository, open access standards, etc.

Lots of things can be enabled by this, for example in Mendeley. Pentz showed a demo of how a plugin for Google might be written that flags CrossMark when you search. CrossMark was launched in April 2012 and has been developing slowly, with 50,000 CrossMark pilot deposits and 400+ updates since launch. CrossRef is working with 20+ publishers on CrossMark implementation.

Tuesday, 9 October 2012

Tools of Change Frankfurt: Metadata Futures

Karina Luke from BIC introduced a panel on metadata for the future. Graham Bell from EDIteur began with an observation of how uncomfortable book publishers are with the concept of metadata. He provided an explanation of the fundamentals of metadata for the industry and how it can enable you to begin to discover new metadata within the data.

He went on to describe in more detail what metadata and linked data are. Linked data expresses metadata as a collection of triples. It uses URIs to represent relations and things, and prefers persistent HTTP URIs so they can be 'looked up' to get further details. This lets the data be 'self-describing'. He warned about Linked Open Data: this adds a further requirement that the data be free and accessible, and he counselled bearing this in mind as it may - or may not - be what you want.
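To make the idea of triples concrete, here is a minimal sketch in plain Python. Every URI is a hypothetical placeholder under example.org, not a real identifier or vocabulary, and a production system would use an RDF library rather than tuples in a list.

```python
# Each triple is (subject, predicate, object). All URIs below are hypothetical
# placeholders under example.org, not real identifiers.
BOOK = "http://example.org/book/123"
AUTHOR = "http://example.org/person/456"
HAS_AUTHOR = "http://example.org/vocab/hasAuthor"
HAS_TITLE = "http://example.org/vocab/hasTitle"
HAS_NAME = "http://example.org/vocab/hasName"

triples = [
    (BOOK, HAS_TITLE, "An Example Title"),
    (BOOK, HAS_AUTHOR, AUTHOR),
    (AUTHOR, HAS_NAME, "A. N. Author"),
]


def describe(subject, data):
    """Return every (predicate, object) pair recorded for a subject."""
    return [(p, o) for s, p, o in data if s == subject]


def follow(subject, predicate, data):
    """Return the objects reached from a subject via one predicate."""
    return [o for s, p, o in data if s == subject and p == predicate]


# Start at the book, follow the author link, then read the author's name.
author_uris = follow(BOOK, HAS_AUTHOR, triples)
author_names = [name for uri in author_uris for name in follow(uri, HAS_NAME, triples)]

print(describe(BOOK, triples))
print(author_names)
```

Because the author is itself a URI rather than a text string, the same identifier can be the subject of further triples, which is what allows data from different sources to be joined.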

Linked data is just another way of expressing the same data. Some practitioners have a loose view of semantics, that it's not best suited to the supply chain. You need to be selective about data sources, as the system is based around trust and expectations of persistence. There is a need for common entities, shared vocabulary and a standard approach.

George Lossius' presentation was called 'Navigating the Semantic Web'. He covered definitions of linked data - the semantic web - and why we need it, who is using it now, and the business benefits for the trade. Working in the semantic web isn't a scary thing: it brings you closer to the original, scientific view point, and it's fun.

The semantic web takes the web solution further by providing:

web of linked data vs web of documents
framework of emerging standards (W3C)
structured content - standard way of describing things
ontology
inference / relationship
interoperable
combination of data from diverse sources

'The semantic web is a little bit about us: it uses deductive reasoning and inference to do things you ask it to do.'
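A toy sketch of what inference means here: starting from a few stated facts, simple rules derive facts that nobody entered explicitly. The facts, relation names and rules below are invented for illustration and are not taken from the presentation; real semantic web systems use standard ontology languages and reasoners.

```python
# Hypothetical facts: (subject, relation, object).
facts = {
    ("crime fiction", "is_a", "fiction"),
    ("fiction", "is_a", "literature"),
    ("Example Novel", "has_subject", "crime fiction"),
}


def infer(facts):
    """Apply two simple rules repeatedly until no new facts appear."""
    known = set(facts)
    while True:
        new = set()
        for a, r1, b in known:
            for c, r2, d in known:
                if b != c or r2 != "is_a":
                    continue
                # Rule 1: is_a is transitive.
                if r1 == "is_a":
                    new.add((a, "is_a", d))
                # Rule 2: a subject's broader categories also apply.
                if r1 == "has_subject":
                    new.add((a, "has_subject", d))
        if new <= known:
            return known
        known |= new


inferred = infer(facts)
print(sorted(inferred - facts))
```

A search for "literature" could now find the example novel, even though it was only ever tagged as crime fiction.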

An example of a semantic website is Breathing Space, a pilot project that aims to explore the value to researchers of compiling and mining a critical mass of data within a discipline. Another example is GSE Research, which aims to provide a bridge between scholarly research and practice in the fields of governance, environment and sustainability. It was interesting to hear him note that the BBC website for Olympic athletes was populated by a semantic search.

Is it relevant to the publishing industry or to trade books? Yes. Your consumers are becoming more demanding, time poor and intolerant of waiting. So the job in the publishing supply chain is to make it easy and interesting so you don't lose your readers. What the semantic web gives you is the opportunity to create compelling, relevant and interesting material to create value for them and your business.

'The semantic web is about fulfilment: the fulfilment of books and the fulfilment of the right content to consumers at the right time.'

Beat Barblan from Bowker provided an illustration of how identification can be difficult online and how the ISNI helps. The ISNI is an ISO standard which uniquely and authoritatively identifies Public Identities across multiple fields of creative activity. For a full definition of ISNI read the website.

It will help with discovery, search ranking, identifying rights holders and distribution. It is the tool that can link the unique content to the creator. It is a bridge identifier that will link while showing enough to disambiguate. The rich content will be found elsewhere. There are just under 1.5 million ISNIs assigned and around 15.5 million provisional records.

Valla Vakili from Small Demons focused on the great chain of narrative in his talk. He focused on V for Vendetta as an extreme example of a great way for a narrative to break out into the world. It referenced so many aspects of history and life including Guy Fawkes. Data collected included:

book
character in book
character's role
character's clothing
character's clothing was inspired by historical figure of Guy Fawkes
where to get the mask (which is also the highest selling mask on Amazon)

Howard Willows at Nielsen BookData closed with an overview of moving toward a single subject classification scheme for the global market. Drawing a comparison with the Tower of Babel, he noted that there is still a range of systems designed for local languages and confusion reigns (e.g. BIC, BISAC, SAB, RVM, YSO, etc). This fragmentation undermines the schemes' overarching goal and introduces inefficiency into the supply chain.

There is a gap in the metadata for trading partners across national borders, even between divisions of multinational companies. The traditional fix is mapping and while this works, it only works up to a point. The problem with mapping is:

it's not a complete solution
there are often competing versions of varying quality with different outcomes
they tend to be either simple and inaccurate or complex and accurate.
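One way to see why mapping only works up to a point is the sketch below, where a fine-grained scheme is mapped to a coarser one and back. The codes are hypothetical and are not real BIC, BISAC or other scheme codes.

```python
# Hypothetical subject codes: scheme A is fine-grained, scheme B is coarse.
# None of these are real BIC, BISAC or other scheme codes.
a_to_b = {
    "A-CRIME-HISTORICAL": "B-CRIME",
    "A-CRIME-POLICE": "B-CRIME",
    "A-ROMANCE": "B-ROMANCE",
}

# A reverse mapping has to pick one target per coarse code.
b_to_a = {"B-CRIME": "A-CRIME-POLICE", "B-ROMANCE": "A-ROMANCE"}


def round_trip(code):
    """Map a scheme A code to scheme B and back again."""
    return b_to_a[a_to_b[code]]


# Codes that do not survive the round trip have lost detail along the way.
lost = [code for code in a_to_b if round_trip(code) != code]
print(lost)
```

Here the historical crime title comes back labelled as a police procedural. A simple mapping is inaccurate in exactly this way, and avoiding it requires extra rules for each pair of schemes.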

Overall there is a degradation of quality and loss of discoverability which results in poor experience and degradation of sales. Mapping has been pushed to breaking point by the growth of digital publishing and online trading and has outrun interim solutions.

'A global market needs globally understood metadata.'

The best and only viable long term solution is a single universal subject classification scheme. Who will benefit? Publishers through greater control over product data; aggregators through less data manipulation; as well as retailers and consumers through a clearer and much simpler supply chain.

As a result of this need, a new organisational structure has been put together, independent of BIC and any other existing company. THEMA has been born to ensure the global subject classification scheme stays free to use and truly international.

Monday, 1 October 2012

Sarah Price: Library Technology and Metadata - Measuring Impact

The afternoon session at the To Measure or Not To Measure: Driving Usage seminar included a talk from Sarah Price, who is E-Resources and Serials Coordinator at the University of Birmingham and Co-Chair of KBART.

One of the key things librarians are interested in is ensuring that the content they buy is easy to use, discoverable and accessible for their students. She provided a candid and compelling story of how the University had got to grips with critical feedback from students on the eLibrary provision, and how they instigated a major review and development programme to address the issue.

Traditionally, there were two access points to content: the traditional library catalogue (mainly for print collections) and the eLibrary service. Both were accessed via the home page, but didn't take into account special collections and other services they had. The user interface was very text heavy, old fashioned and not very user friendly. You had to search separately for ejournals and ebooks, making the experience confusing, unattractive and a source of dissatisfaction.

As a result, the University has invested in a Resource Discovery Service which provides:

single search interface and search box (with a Google-like interface)
harvesting of collections across institutions
much faster search and results retrieval
discovery at article and chapter level
post search filtering and refinement.

The service is publicly available - with no (upfront) authentication - as a taster for potential students and academics. However, if you want to access in-depth content you have to sign-in with your university account. It is designed to have no dead ends and is integrated with other web services such as the University portal. They worked with Ex Libris to develop the product and included embedded searching as a function.

They added the Primo Central Index to this product, which is a very important part of the discovery service, delivering article-level searching. A user can also narrow a search from 'everything' to specific collections or use advanced search. You can log in with your own personal account, which then provides access to the full set of content and lifts restrictions. When you use a search term, the results indicate what type of resource each one is (e.g. articles, books, etc.). Where it is a book, it will show stock and location of copies on a site-specific basis, even including a map of the location in the library. Print and electronic resources are listed alongside each other in the discovery tool. You can see where your search terms appear in each result to check relevance, and you can also facet or post-filter (e.g. by article, book, library site, date range, author, language, electronic database, etc.). The system will also attempt to group similar records.

Another interesting feature for scholarly publishers is the link to the in-house reading list management system on each textbook. This is flagged at the foot of the entry and you can click through to see the full reading list and then continue through to other titles and services. Crucially, this will be helpful in checking against your records whether or not an academic has added a title to a reading list after receiving an inspection copy.

The resource is embedded on the university portal my.bham within a MyLibrary tab. This is a primary source of driving usage to the site. It's early days for analytics, but at the start of term they have the same amount of traffic from my.bham university portal as from Google Scholar. In addition, index based searching is generating a lot of traffic from their users.

During the implementation they decided to:

still provide a database-level link to the native interface
provide a library-catalogue-only search, but within FindIt@Bham
set 'everything' as the search default, but allow the scope to be limited
link the SFX component of the Metalib library catalogue to the reading list management system and the University of Birmingham Research Archive (UBIRA).

They dispensed with the A-Z list and pre-search limiters and now rely on post filtering facets. They also dispensed with ebook MARC records as metadata input and now directly harvest from SFX. It was a bold decision, but they have found that it works for them. There has also been integration of the single search in the portal and library services homepage.

Price flagged the importance of metadata for discovery. It supports linking to the appropriate copy; allows an appropriate set of links to be presented in a single place; allows the library to accurately and comprehensively display an entire portfolio; accurately depicts the entitled coverage for that user; and allows users to find keywords in full text - not just abstracts.

As 'Resource Discovery Service' isn't the most exciting or engaging title, they ran a competition amongst staff for the new brand name. There were 80 suggestions, but the winner - FindIt@Bham - was felt to tie in with the overall university brand well. They thought long and hard about integrating the Birmingham brand and used pictures of the distinctive campus to customise the out of the box product. They have integrated with the University portal VLE and embedded in the library Facebook page. Other marketing and promotion included:

social media
lots of work with the Student Guild
postcards/bookmarks
university staff and student newsletters
focus groups, training and briefing sessions
integration and prominent website advertising
university-wide plasma screens.

It's early days in terms of measuring impact, but they are reviewing user feedback post-launch and have a continuous improvement strategy and post-launch authority group in place. They will analyse future quality measures, service and resource usage and benefits realisation. They are expecting to see a big hike in full text usage, are anticipating a massive impact on their ratings and anticipate seeing value added throughout the supply chain.

It has been interesting to compare the results with those from Google Scholar for certain specific searches. Google Scholar only gives generic results, not library entitlement. In one title search, the top Google Scholar result was a PDF from JSTOR of a similar book to the one searched for - their own system is much more precise.

When addressing concerns about wider access to content, a demonstration showed that while Google will present the results, it won't present the full text unless it is free to access or the viewer can log in with an entitlement through the library system. The system doesn't embed authentication without library intervention - the link resolver.

Already, in comparison with Google Scholar searches, the Library discovery is context sensitive to the definition and results are more focused. Library discovery allows added value with resources grouped by subject and scholarly recommender services.

Her advice to publishers on how to integrate titles into the system includes:

send your title-level metadata to link resolvers (KBART)
keep it up-to-date (cessations, title changes, etc)
provide your deep linking algorithm
allow discovery platforms to harvest your metadata
don't be exclusive, be promiscuous!
assess usage patterns following integration.
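As a rough sketch of the first recommendation, title-level metadata in the KBART style is a tab-delimited text file with one row per title. The example below writes such a file in memory using only a subset of the KBART field names, given here as I recall them from the recommended practice, so the full field list and formatting rules should be checked against the current KBART documentation. The journal, ISSN, URL and dates are placeholders.

```python
import csv
import io

# A subset of KBART field names; a full file carries more columns.
FIELDS = [
    "publication_title",
    "print_identifier",
    "online_identifier",
    "date_first_issue_online",
    "date_last_issue_online",
    "title_url",
    "title_id",
    "publisher_name",
]

# Hypothetical journal; the ISSN, URL and dates are placeholders.
titles = [
    {
        "publication_title": "Journal of Example Studies",
        "print_identifier": "1234-5679",
        "online_identifier": "",
        "date_first_issue_online": "2001-01-01",
        "date_last_issue_online": "",  # blank: coverage continues to the present
        "title_url": "https://journals.example.org/jes",
        "title_id": "JES",
        "publisher_name": "Example Press",
    }
]

# Write a header row and one row per title, separated by tabs.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(titles)
kbart_text = buffer.getvalue()

print(kbart_text)
```

Keeping the file up to date then becomes a matter of regenerating it whenever a title ceases, changes name or moves, which is the second item on the list.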

She concluded by saying that integration with library discovery tools is essential to drive usage. This needs to be based on industry good practice and there is a growing body of evidence supporting usage increase (and decrease) dependent on RDS integration.