ALPSP blog: at the heart of scholarly publishing: January 2014

Wednesday, 29 January 2014

Data linking systems: publishers’ experiences

Three publishers - Royal Society of Chemistry, Taylor & Francis, and the British Editorial Society of Bone and Joint Surgery - shared their experiences of data linking systems at last week's Data, the Universe and Everything seminar.

Sarah Day, Royal Society of Chemistry

Sarah Day, Senior Marketing Manager, CRM and Customer Systems at the Royal Society of Chemistry, outlined how they integrated Ringgold into Salesforce, their cloud-based CRM system.

The RSC data model includes: activity, contact, account, opportunity, campaign, and campaign member. Salesforce is customisable so they integrated Ringgold into their Account function. They use Ringgold for initial aggregation and import of data into Salesforce, for improving data quality, for external links (to related systems) and for bringing SCV data back into Salesforce.

Before they implemented Salesforce they had to do manual and fuzzy checks on a range of spreadsheets used for sales leads. One challenge was that they hadn’t fully integrated Ringgold, so they had to copy and paste to get the hierarchy of institutions. The sales team now has to apply the Ringgold ID; otherwise they can’t close or apply revenue to an account. This has proved to be an effective way to drive compliance.

Ringgold is the central identifier source for Salesforce, MasterVision (SCV), THINK (Subscription Management) and the authentication engine. There are, however, some challenges. The data entry team have to understand the data (e.g. understand phonetic spelling for Japanese or Chinese English pronunciation, etc.). You may also have good reasons for inconsistent identifiers in your systems (Salesforce rolls up to a parent organisation, access/permissions may be different, etc.).

Sarah Wright, Taylor & Francis

Sarah Wright, Customer Services Director for Taylor & Francis, outlined the benefits of linking systems from a customer service point of view.

Customers expect answers and service instantly. Automated systems help. With traditional systems, a customer order is taken, the payment is processed and you send the issue of the journal. But how can you use the same system to deliver access to online content?

Print copies are straightforward: you post one to a particular address (e.g. Christ Church College). But if an institutional subscription has been purchased for online access, this needs to go to Christ Church College, Queens College, the Department of Economics, in fact the whole university.

Once you factor in duplications in the system due to different parts of the supply chain having different forms of data for the same place, it's a complex picture.

As a result, they chose Ringgold ID. They have two levels, so they can see that the lower-level records both link to the university-level one. Although institutional identifiers are great, they don’t always reflect how they sell to their customers (e.g. global corporate companies, consortia, etc.). They can tell the system that online access should be granted at parent level. This has been a big success, with benefits including increased usage, a reduction in complaints, improved service, visibility and reporting, as well as project transfer. It is now ingrained in daily processes and reviewed all the time, so it is not a one-off project. Keeping data clean allows them to get the right content to the right customer at the right time.
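
As a rough illustration of the parent-level roll-up described above, the sketch below walks an institution up to its top-level parent and then lists every institution that should share online access. The identifiers and the `PARENT_OF` mapping are invented for the example; they are not real Ringgold IDs, and this is not a description of the Taylor & Francis implementation.

```python
# Hypothetical institution hierarchy: child ID -> parent ID (None = top level).
# These IDs are made up for illustration; they are not real Ringgold IDs.
PARENT_OF = {
    "U-1": None,     # the university
    "C-10": "U-1",   # a college
    "C-11": "U-1",   # another college
    "D-20": "C-10",  # a department within a college
}


def top_level_parent(institution_id: str) -> str:
    """Follow parent links until reaching the institution with no parent."""
    seen = set()
    current = institution_id
    while PARENT_OF[current] is not None:
        if current in seen:  # guard against a cycle in bad data
            raise ValueError(f"cycle in hierarchy at {current}")
        seen.add(current)
        current = PARENT_OF[current]
    return current


def institutions_with_access(subscriber_id: str) -> set:
    """Every institution that shares the subscriber's top-level parent."""
    root = top_level_parent(subscriber_id)
    return {i for i in PARENT_OF if top_level_parent(i) == root}
```

With this mapping, a subscription recorded against college `C-10` resolves to the university `U-1`, so access would extend to all four institutions in the hierarchy.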

Peter Richardson, British Editorial Society of Bone and Joint Surgery

Peter Richardson, Managing Director at the British Editorial Society of Bone and Joint Surgery, outlined the problem they faced: legacy systems - such as old and inflexible subscription systems and the author database - that do not communicate.

They have data ‘black holes’, such as new leads stored in their email marketing client Adestra that aren't updated if or when the lead is converted. There might also be poor data management by individuals and an over-reliance on external subscription data, which may sometimes be of poor quality.

Peter Richardson: plugging data black holes

There were also external factors, such as opportunities to link data together using identifiers such as Ringgold or ORCID. Users’ expectations drive the need for better data linking systems, as does the drive for better customer service and greater efficiency.

They have tackled these challenges with a brand new subscription fulfilment system - Myriad - which went live in August 2013. Data about customers is held separately from individual subscription records, with an improved data intake.

DataSalon will pull everything together, including data previously poured into ‘black holes’. DataSalon will also pull together customer and subscription information with the Ringgold identifier, leads, authors, OVID customers, etc. They have a greatly improved ‘dashboard’ and enhanced marketing opportunities.

They hope to achieve better capture of marketing data (demographics, campaign codes, etc) in Myriad, have accurate addresses and data (end user info, mailing address) and better targeted campaigns as a result. Myriad is up and running and the project with DataSalon is just getting under way. They will know more in a couple of months, but believe they are on the right track.

Tuesday, 28 January 2014

Kirsty Meddings on New Metadata, New Identifiers: CrossMark and FundRef

Kirsty Meddings from CrossRef

Kirsty Meddings, Product Manager at CrossRef, updated the delegates at ALPSP's recent data seminar on 'New Metadata, New Identifiers: CrossMark and FundRef.'

CrossRef is a non-profit membership association with over 4,000 publishers and organisations who are members. Traditional metadata includes title, volume, contributors, issue, publication date, ISSN, ISBN, URL, DOI and first page information. New metadata elements include: funder name, correction, award number, license reference, ORCID, publication history and retraction.

Funder information is increasingly important. Often it is a free text definition and there are issues of consistency even when marked up with tagging and names (e.g. NIH, N.I.H., etc). Why does this matter? Funding bodies can’t easily track published output of funded work. It isn’t easy to report which articles result from research supported by specific funders, making it really difficult to report and analyse. This is why FundRef was launched. They are exploring tying together funder IDs with Ringgold and an overlaying ISNI ID.
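
To make the consistency problem concrete, here is a minimal sketch of mapping free-text funder names to a single identifier. The `FUNDER_IDS` table and the `funder:0001` value are placeholders invented for this example; they are not real FundRef identifiers, and the normalisation rule is only one simple assumption about how variants might be matched.

```python
import re
from typing import Optional

# Hypothetical lookup table: normalised funder name -> funder ID.
# The ID is a placeholder, not a real FundRef identifier.
FUNDER_IDS = {
    "nih": "funder:0001",
    "national institutes of health": "funder:0001",
}


def normalise(name: str) -> str:
    """Lower-case, drop punctuation and collapse runs of whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def funder_id(free_text: str) -> Optional[str]:
    """Return the ID for a free-text funder name, or None if unknown."""
    return FUNDER_IDS.get(normalise(free_text))
```

Here "NIH", "N.I.H." and "National Institutes of Health" all resolve to the same placeholder ID, while an unrecognised name returns `None` and can be queued for manual review.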

They are looking at the Licence_Ref element, and at updates and changes such as errata, corrigenda, enhancements, withdrawals, retractions, new editions and policy updates. As Meddings noted, we’ve come a long way from the days of product recall, but there is still some way to go before we get it right.

Thursday, 23 January 2014

Laurel L Haak on ORCID Author Identifiers

Laurel L. Haak is Executive Director of ORCID. She provided an outline of what they do and new developments for 2014 at the Data, the universe and everything seminar.

ORCID is an independent non-profit organisation supported by member fees. They run an open registry of unique identifiers for researchers and APIs for the community to embed identifiers in research systems and workflows. Data marked public by researchers is published annually by ORCID under a CC0 waiver. ORCID code is available on their GitHub open source repository and they support community efforts to develop tools and services.

ORCID is for anyone who contributes to scholarly communication – not just academics or researchers. They capture and make more public what these contributions are, particularly for peer review.

There are multiple contribution types including:

Funding
Publications
Service Activities
Affiliations
People
Datasets
Impacts.

ORCID is a unique identifier that will go with you through your career. It is integrated in standard workflows and embedded in works metadata, independent of platform. You can link in with websites and other identifiers (e.g. ResearcherID, Scopus Author ID, ISNI). It tackles the issue of different spellings of names and is great for thinking beyond the paper.

ORCID has broad international usage with 34 countries with over 10,000 unique visitors and 82 countries with over 1,000 unique visitors. The registry supports multiple character sets. They have content in Spanish, French, English and Chinese (adding Portuguese, Korean, Japanese, and Russian in 2014).

They have issued over 500,000 identifiers since the launch in October 2012, with registrations growing steadily. The majority (about two thirds) come through trusted parties such as publishers. They are also beginning to see universities creating ORCID identifiers for research staff.

ORCID works collaboratively with the research community to ensure use and adoption of research information exchange standards (e.g. ISNI, ODIN, CASRAI, Ringgold, CERIF-XML, CrossRef, etc.). They link to and include identifiers from other systems including DOIs, ISBNs, ISNIs, etc.

New features in 2014 include:

Funding
New languages
Account delegation
Third party assertions
New search and link wizards
Continuing to harmonize metadata

Who is integrating ORCID and how? They are working with research funders, professional associations, research institutions and metrics sites to incorporate ORCID IDs.

Wednesday, 22 January 2014

Laura Cox on institutional identifiers internally and throughout the supply chain

Laura Cox: do a data health check

Laura Cox, Chief Financial and Operating Officer at Ringgold, kicked off her talk with a focus on what constitutes healthy data. Good quality, reliable and consistent data helps you make good decisions. You gain insight into customers and business relationships, and it supports strategic planning, decision making and ongoing business operations.

Poor data has real consequences. It makes it hard to get a true picture of relationships with institutions, and can lead to a lack of quality author (and affiliation) data and an inability to see overlap between authors, members and customers. It can drive inaccurate holdings and revenue reports, leading to protracted effort which can cost your business time and money. Healthy records are complete, accurate, free of duplications, current, consistent and conform with standard identifiers.

What are unique identifiers and how can they help?

They are numeric or alpha-numeric designations which are associated with a single entity. Entities can be institutions, persons or pieces of content. They enable the disambiguation of each entity and provide a proper understanding of customer, author, reader or institution as well as a proper identification of content object, article, product or package. They can also be used internally or in conjunction with external partners.

Why should we worry about data now?

Cox cited the 2012 STM Report (Ware, M and Mabe, M. The STM Report, 2012), which stated that the number of researchers and the number of articles are both increasing by 3% per annum. The number of journals is increasing by 3.5% per annum and growth in China has been in double digits for over 15 years. At the same time there is increased demand for anytime/anywhere access while library budgets are frozen or being cut: less money for more content.

Institutional identifiers can be used for disambiguation (e.g. which UCL?) and for consolidating different versions (there are many ways of describing the University of Oxford and its institutions). They provide a hierarchy view (institute within an institution) and reinforce uniqueness. This means you can use them for a gap analysis.
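
Once every record carries the same identifier, a gap analysis can be little more than a set comparison. The sketch below assumes two hypothetical lists of institution IDs (all values invented): the institutions in a target market and the institutions that currently subscribe.

```python
# Hypothetical identifiers, invented for illustration.
target_market = {"1001", "1002", "1003", "1004"}  # institutions you would like to reach
current_customers = {"1002", "1004", "2000"}      # institutions that already subscribe

# In the target market but not yet customers: the sales gap.
gaps = sorted(target_market - current_customers)

# Customers that are not on the target list: worth checking the list itself.
unexpected = sorted(current_customers - target_market)
```

Without a shared identifier the same comparison would need fuzzy matching on institution names, which is where the many ways of describing one university start to cause errors.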

The Kafka-esque 'Identifiers identified' slide

The main challenge is around multiple data sources. There are system data silos, multiple locations (geographic data silos), data entered by different people for different purposes, data from third parties in the supply chain and data from bought-in sources. These things aren’t integrated. Typical publisher systems include financial, CRM or sales databases, authentication systems, fulfilment, usage statistics, submissions systems and so on.

Cox advised that the first thing to do is think about your data and implement a data governance plan. What data is held, where, and how is it accessed? How can it be used to benefit the business and work across silos? But always bear in mind where you are now and where you want to go.

Another recommendation was to improve data capture. If you can, use web forms as they minimise variance in data input. Implement required fields, use date validation and at a minimum use naming conventions. There are a number of tools such as address validation, postcode look-up, institution validation/lookup. Avoid free-text fields and make institutional identifiers a requirement.
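
The capture rules above (required fields, date validation, a mandatory institutional identifier) could be checked on the server side along these lines. This is a minimal sketch: the field names in `REQUIRED_FIELDS` and the YYYY-MM-DD date format are assumptions made for the example, not something specified in the talk.

```python
from datetime import date

# Assumed field names for a hypothetical sign-up form.
REQUIRED_FIELDS = ("name", "institution_id", "start_date")


def validate_submission(form: dict) -> list:
    """Return a list of problems; an empty list means the form is acceptable."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = form.get(field)
        if value is None or not str(value).strip():
            errors.append(f"missing required field: {field}")
    start = str(form.get("start_date") or "").strip()
    if start:
        try:
            date.fromisoformat(start)  # raises ValueError unless the date is valid
        except ValueError:
            errors.append("start_date is not a valid YYYY-MM-DD date")
    return errors
```

A form with a blank `institution_id` and a date typed as `29/01/2014` would come back with two errors, so the record can be rejected before it reaches downstream systems.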

You can use an institutional identifier as a linchpin to link internal systems for better data integration. It can prevent duplicate account creation and help keep data up to date and systems synchronized. It also enables staff to use data more effectively, break down silos, simplify data transmission and provide more insight and power to analyse and understand the business.
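
As a sketch of the linchpin idea, the example below groups records from two hypothetical internal systems under their shared institution ID. The record layouts, names and IDs are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical records from two internal systems; all values are invented.
crm = [
    {"institution_id": "1001", "name": "Univ. of Example"},
    {"institution_id": "1002", "name": "Sample College"},
]
fulfilment = [
    {"institution_id": "1001", "name": "University of Example", "subscriptions": 3},
    {"institution_id": "1001", "name": "Example University", "subscriptions": 1},
]


def link_by_identifier(*systems):
    """Group records from every system under their shared institution ID."""
    linked = defaultdict(list)
    for records in systems:
        for record in records:
            linked[record["institution_id"]].append(record)
    return dict(linked)


linked = link_by_identifier(crm, fulfilment)
```

Three differently named records end up under ID `1001`, which shows both the benefit (one view of the institution) and a likely duplicate account in the fulfilment system.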

So what can you do now?

Engage with the problems
Think about resources. Time? Money? Systems?
How do you want it to work – look at priorities
Have a data governance policy
Appoint a data champion and document everything
Create some basic rules for data entry
Use universal identifiers to clean and link your data
Work with suppliers and customers to use institutional identifiers to strengthen the supply chain.

Data, the universe and everything. How data can drive your business

Melinda Kenneway: loves data

Melinda Kenneway, Chair of the Data, the universe and everything: How data can drive your business seminar, opened the day by emphasizing how data underpins modern business. She summed up the premise for the day by quoting Jeff Weiner from LinkedIn:

‘Data really powers everything that we do.’

If we’re not selling content, what are we selling? Data is also becoming a product or service itself now. An era of big data can help us understand and deliver a much better experience for customers. In 2012, 90% of all the data that existed in our entire history had been created in the previous 2 years. We need to get the basics of data right, but also have a strategy.

It’s knowing what you need to know rather than trying to find out about everything. Have clear objectives and stick to them. Don’t forget to consider privacy and legal issues about data. Deliver useful services that build on what the market thinks is acceptable. And if you don’t like data now? You should get out of marketing.

Colin Meddings from DataSalon focused on why you should care about cleaning up your data. To start with it can embarrass or annoy your customers when it is wrong and it’s open to error when users don’t put in info correctly.

Colin Meddings: not vague about data

Continuing with quotes from the great and the good, he cited an observation from 2013’s UKSG conference by Liam Earney from JISC:

‘Publishers can’t even tell a library what they subscribe to.’

We’re too vague about absolute fundamental data. The basic process of publishing involves: submit, review, copy edit, publish, purchase, read, cite, licence. All of these aspects generate data. The volume of data is huge and all these areas are where errors or mistakes can happen and where data can conflict.

Reflect on your efficiency with dealing in data. Do you struggle with data wrangling where you have to continually review and refresh datasets? See data as an asset in your business. It is worth investing in. You will get a return. Don’t forget that there’s a legal requirement around data to ensure it is accurate and fit for purpose. He recommended the DQM Data Governance Maturity Model. Be aware. Become reactive. Turn this into proactive. Then make it a managed process. And finally, it becomes optimal.

Meddings outlined the seven deadly sins of data quality:

Missing data
‘Siloed’ data
Invalid data
Out of date info (people move around)
Inconsistencies
Right information, wrong field
Duplicate and conflicting data.

Put some effort and resource into sorting data. Assign the task to someone specifically, create champions, set targets. It’s great if you can employ a team, but if you can’t, get senior management buy-in and work with those who have aptitude/passion for data. Start with one problem; don’t try to fix everything at once.

Do a data audit. It can be manual. People who work with data every day will know where the problems are. Sometimes an automated audit is better: it is good at finding information about your data, such as identifying empty fields.
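
An automated audit of this kind can start very small. The sketch below counts empty values per field and flags key values that appear more than once, covering two of the "sins" listed earlier (missing data and duplicates). The `audit` function and the sample field names are hypothetical, written for this example only.

```python
from collections import Counter


def audit(records: list, key: str) -> dict:
    """Count empty values per field and list key values that occur more than once."""
    empty = Counter()
    for record in records:
        for field, value in record.items():
            if value is None or not str(value).strip():
                empty[field] += 1
    # Only count records where the key field is actually filled in.
    key_counts = Counter(r.get(key) for r in records if r.get(key))
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    return {"empty_fields": dict(empty), "duplicate_keys": duplicates}
```

Run against a customer export with `key="email"`, the result shows which fields are most often blank and which email addresses occur on more than one record, giving a data champion a concrete first problem to tackle.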