ALPSP blog: at the heart of scholarly publishing: #alpspdata

Thursday, 4 February 2016

All Change in Scholarly Communications: How are the Players – Veterans and Newbies – Adapting?

Fiona Murphy reports from #APE2016

Last month, in characteristically bracing January Berlin weather, around 250 intrepid speakers and delegates attended the 11th Academic Publishing in Europe (APE – pronounced “Ahhhpay”) meeting. Keep an eye on Twitter #ape2016 as all of the presentations were recorded and so should become available in the near future.

A number of familiar characters (large publishers, established platform providers, and so forth), whose language seems to have evolved over the past few years, spoke about ‘openness’ and ‘sharing’ rather than preserving business models. Todd Toler of Wiley, for instance, expressed the “publisher’s value proposition” as having shifted from content provision – basically “moving stuff about” – to “strengthening knowledge connections”. This feels like a real turning of tides; such players are now actively aiding and abetting our efforts to garner significant knowledge from our scholarly ecosystem.

In point of fact, there was a general theme around intelligence rather than simply the power of data. Barend Mons bemoaned the existence of “a Christmas tree of hyperlinks and the malpractice of supplementary material”, instead calling for the training of experts to really understand how machine learning and human interrogation of data can be meshed together to form a powerful whole – “Open Science as a Social Machine” (keep an eye on the IDCC programme in Amsterdam later this month, as he’ll be expanding on the topic there). Meanwhile, Emma Green of Zapnito – a start-up that helps knowledge-based companies to maximise the impact of their associated experts – spoke of growing the ‘knowledge economy’ by reducing the noise and chatter, thereby freeing up the collective intelligence.

John Sack of Highwire’s approach was to examine frictions in the workflow. If workflow is ‘a way of getting things done,’ then instances of friction – with the possible exception of a review stage – largely involve the loss of efficiency. Currently most journal workflows are still based on the original print journal format, but with the version of record shifting online, the resulting misalignments between what is desired and what is produced are causing delays and infringements of established rules (such as copyright). Friction-reducing tools that can support and simplify the generation, finding, and attribution of scholarly outputs are needed. This can be enabled by standards such as ORCID or ResearcherID for people, and by initiatives such as openRIF/VIVO for connecting people and their roles to their works and activities. This connectivity will surely boost quality and productivity, and help meet the need for improved garnering of knowledge from our research landscape, which arose as a theme across APE in general. This connectedness, according to Sack, is about a supported conversation amongst collaborators who are enabled by tools that sift, pre-curate and – potentially – publish their scholarly outputs.

Opportunities for new business models are appearing in a number of points in the workflow – Publons acknowledges and badges peer review activities, Overleaf provides templated support to write journal articles, and Elsevier is leveraging the new Mendeley Data service to enable authors to publish their data and link it immediately with journal articles.

At the same time, policy (=funding) is also moving in the same direction. Stephan Kuster, Head of Policy Affairs for Science Europe, explained its function and mission. Science Europe is a think tank set up to support and advise EU National Research Funding Councils on EU R&D policy issues. Open Access is one of nine key priorities; its principles include enabling authors to hold copyright, supporting sustainable archiving, and recognising that publication and dissemination are an integral part of the research process and should be funded as such.

There was a thoughtful debate about Scholarly Communications Networks and whether they add value, which would not have been possible even a few years ago. Fred Dylla, Emeritus Executive Director of the American Institute of Physics, made the salient point that the reputation of the journal still needs to be fundamentally challenged for the landscape to be really disrupted. Currently, the people and institutions making the key decisions about funding, tenure and promotion are still fixated on journal reputations and impact factors. So, despite feeling as though there has been a lot of progress in the last few years, it also seems there’s still a lot to do.

Luckily there are several opportunities coming up to extend and develop our understanding of and strategies for adapting to this changing landscape. As well as the aforementioned IDCC later this month, look out for the ALPSP Seminar on research data, digital preservation and innovation in March. Standing on the Digits of Giants is co-organised with the Digital Preservation Coalition and is designed to orientate and empower publishers, research managers and researchers to navigate and flourish in the new landscape.

Another key space to continue these discussions is in the context of the Force11 community, which aims to bring together many of the stakeholders needed at the table to effect change: policy makers, funders, researchers, technologists, publishers, informaticists, lawyers, etc. Force16 promises to be an exciting venue where we’ll be pushing scholarly communications into uncharted territory. Hope to see you there too.

Fiona Murphy, February 2016

Now associated with the Maverick Publishing Specialists, Fiona Murphy has held a range of production and editorial roles at Wiley, Oxford University Press, Random House and Bloomsbury Academic. She specializes in emerging scholarly communications (including Open Science and Open Data) and works to raise expertise and activity levels across the wider research and publications communities. Fiona has written and presented extensively on the research landscape, data and publishing. She is Co-Chair of the World Data System—Research Data Alliance Publishing Data Workflows Working Group, an Editorial Board Member of the Data Science Journal and enjoys organizing meetings. orcid.org/0000-0003-1693-1240

This post was written by Fiona Murphy with the support of Melissa Haendel.

Monday, 3 February 2014

Managing the open access data deluge without going grey

Cameron Neylon: the OA data deluge

The final two sessions at ALPSP's Data, the universe and everything seminar reflected on the changing nature of data within an open access context and what needs to be taken into account when trying to cope with data.

Cameron Neylon, Advocacy Director at PLOS, counselled delegates on 'Managing a (Different) Data Deluge'. Publishing is now a different business. Customers may look the same, but they act differently and you have to think differently. Data is core to the value you give.

There's no sign of the growth trajectory of open access publishers slowing down. PLOS One on its own is 11% of the funded research papers output from the Wellcome Trust. PLOS One is 5% of the biomedical literature. PLOS One publishes on average 100 papers per day. All the metadata they have comes from the authors and they don’t necessarily have accurate data on who they are or where they are based, so it gets complicated. This is happening on a large scale across scholarly communication services.

Neylon believes that the business of open access publishing is fundamentally different to subscription publishing. With a traditional subscription business you have a pool of researchers and institutions. Advertising and reprints come from third parties. This is a distribution model and not so much about where the research has come from.

With an APC-funded open access business it is a service or push model. The customer is the author at some level. Increasingly (in the UK, for example) this is coming through the funder. This means that suddenly all these players have an interest (which they didn’t have before). A third model is the funders directly funding infrastructure (e.g. eLife, PDB, Genbank etc).

The customer = institution, the author, the funder. They have questions about how much? How many articles have you published? What's the quality of service? Are there compliance guarantees (this is relatively simple in the UK, but tricky in North America or the EU). They want repository deposit. And all this has to happen at scale. You need to track who funded the research. This means that the market is being commoditized. It also means that the market is smaller, with space to make profit smaller.

Neylon feels that if we do not do this collectively, the whole system will collapse and we’ll be left with one or two big players. Using identifiers, capturing data up front and making it easy for the author to include the correct data up front are key to tackling the issue of the data deluge we face. If we don’t, we will have lost the opportunities. It’s about shared data identifiers and putting them at the core of your systems.

He reflected on the particular challenge that smaller publishers face if they are to survive. They need to share infrastructure across multiple organisations. ALPSP is well placed to support and advise suppliers that smaller publishers need ORCID and FundRef etc up front.

Ann Lawson, Senior Director of Publisher Relations and EBSCOAdvantage Europe, focused on the various challenges for managing open access data without going grey. EBSCO see the impact of data from their own perspective (with 27 million articles in the EBSCO database products) and also from the perspectives of their client publishers and institutions. They have their own ID systems, but also input any partner or publisher IDs which results in 485 data elements per subscription record.

Ann Lawson: trying not to go grey

In a recent research report drawn from their own data, they've noted that large publishers are getting larger: in 1994 the top 10 publishers were responsible for 19% by value. In 2009, the top 10 publishers represented 50% by value. And in 2013, the top 10 publishers accounted for 68% by value.

In the immediate future, EBSCO see a mixed market of Gold, Green and Subscriptions within scholarly communications. However, there will be an impact on transactions from individual journals, to big deals, to small gold open access APCs. The impact on subscription agents is challenging as they have to keep on doing what they do, plus play in the open access area. There is a challenge of scale and transparency for everyone.

What will these market trends mean for data? There is a new cycle for open access which impacts on the need for data. This includes measures of value for money, speed to publication, reach and impact, reporting, funding sources, and the approval process.

There are data issues for the institution: who are active authors? What funding sources are available? Which funders demand what compliance? Which journals are compliant? What happens at school/research group? How much does the APC cost? Who paid what, with what effect? What reporting is needed for whom? Compliance – and deposit in repositories.

The institution workflow is at the heart of the data flow:

Policies
Advocacy
OA request form
Acceptance email
Funding pot
Copy of invoice
Article and DOI
CC licence
VAT
Approvals
Records
Reporting and analysis.

The reality is that many publisher systems do not have the ability to adapt their systems. Current points of tension include: money management, complex workflows, and author involvement. Discovery is key, but can be tricky with hybrid journals so discovery at article level is essential. NISO is helping, but there is more work to be done in this and many other areas of data.

Wednesday, 29 January 2014

Data linking systems: publishers’ experiences

Three publishers - Royal Society of Chemistry, Taylor & Francis, and the British Editorial Society of Bone and Joint Surgery - shared their experiences of data linking systems at last week's Data, the Universe and Everything seminar.

Sarah Day, Royal Society of Chemistry

Sarah Day, Senior Marketing Manager, CRM and Customer Systems at the Royal Society of Chemistry outlined how they integrated Ringgold into Salesforce, their cloud based CRM system.

The RSC data model includes: activity, contact, account, opportunity, campaign, and campaign member. Salesforce is customisable so they integrated Ringgold into their Account function. They use Ringgold for initial aggregation and import of data into Salesforce, for improving data quality, for external links (to related systems) and for bringing SCV data back into Salesforce.

Before they implemented Salesforce they had to do manual and fuzzy checks on a range of spreadsheets used for sales leads. One challenge was that they hadn’t fully integrated Ringgold so they had to copy and paste to get the hierarchy of institutions. The sales team now have to apply the Ringgold ID, otherwise they can’t close or apply revenue to an account. This has proved to be an effective way to drive compliance.

Ringgold is the central identifier source for Salesforce, MasterVision (SCV), THINK (Subscription Management) and the authentication engine. There are, however, some challenges. The data entry team have to understand the data (e.g. understand phonetic spelling for Japanese or Chinese English pronunciation, etc). You may also have good reasons for inconsistent identifiers in your systems (Salesforce rolls up to a parent organisation, access/permissions may be different, etc).

Sarah Wright, Taylor & Francis

Sarah Wright, Customer Services Director for Taylor & Francis outlined the benefits of linking systems from a customer service point of view.

Customers expect answers and service instantly. Automated systems help. With traditional systems, a customer order is taken, the payment is processed and you send the issue of the journal. But how can you use the same system to deliver access to online content?

Print copies are straightforward: you post one to a particular address (e.g. Christ Church College). But if an institutional subscription has been purchased for online access this needs to go to Christ Church College, Queens College, the Department of Economics, in fact the whole university.

Once you factor in duplications in the system due to different parts of the supply chain having different forms of data for the same place, it's a complex picture.

As a result, they chose Ringgold ID. They have two levels so they can see they both link to the university one. Although institutional identifiers are great, they don’t always reflect how they sell to their customers (e.g. global corporate companies, consortia, etc). They can tell the system that online access should go at parent level. This has been a big success with benefits including: increased usage, reduction in complaints, improved service, visibility and reporting as well as project transfer. It is now ingrained in daily processes and reviewed all the time – so not a one-off project. Keeping data clean allows them to get the right content to the right customer at the right time.
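The idea of granting online access at parent level can be sketched in a few lines of Python. This is only an illustration: the identifiers, the PARENT_OF table and both function names are hypothetical, and it does not represent how Ringgold or the Taylor & Francis systems are actually built.

```python
# Hypothetical hierarchy: child identifier -> parent identifier (None = top level).
PARENT_OF = {
    "college-a": "university-x",
    "dept-econ": "university-x",
    "university-x": None,
}


def top_level_parent(institution_id: str) -> str:
    """Walk up the hierarchy until an institution with no parent is reached."""
    seen = set()
    current = institution_id
    while PARENT_OF.get(current) is not None:
        if current in seen:
            # Guard against bad data where two records point at each other.
            raise ValueError(f"cycle in hierarchy at {current}")
        seen.add(current)
        current = PARENT_OF[current]
    return current


def shares_access(subscriber_id: str, requester_id: str) -> bool:
    """True if both identifiers roll up to the same top-level institution."""
    return top_level_parent(subscriber_id) == top_level_parent(requester_id)
```

With this table, a subscription recorded against "college-a" would also cover a user from "dept-econ", because both roll up to "university-x". An identifier that is not in the table is treated as its own top level.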

Peter Richardson, Managing Director at the British Editorial Society of Bone and Joint Surgery, outlined the problem they faced: legacy systems - such as old and inflexible subscription systems and the author database - that do not communicate.

They have data ‘black holes’ such as new leads stored in their email marketing client Adestra that aren't updated if/when the lead was converted. There might also be poor data management by individuals and an over-reliance on external subscription data which may sometimes be of poor quality.

Peter Richardson: plugging data black holes

There were also external factors such as opportunities to link data together using identifiers such as Ringgold or ORCID. User expectations drive the need for better data linking systems, as does the drive for better customer service and more efficiencies.

They have tackled these challenges with a brand new subscription fulfilment system - Myriad - which went live in August 2013. Data about customers is held separately from individual subscription records, with an improved data intake.

DataSalon will pull everything together, including data previously poured into ‘black holes’. DataSalon will also pull together customer and subscription information with the Ringgold identifier, leads, authors, OVID customers etc. They have a greatly improved ‘dashboard’ and enhanced marketing opportunities.

They hope to achieve better capture of marketing data (demographics, campaign codes, etc) in Myriad, have accurate addresses and data (end user info, mailing address) and better targeted campaigns as a result. Myriad is up and running and the project with DataSalon is just getting under way. They will know more in a couple of months, but believe they are on the right track.

Tuesday, 28 January 2014

Kirsty Meddings on New Metadata, New Identifiers: CrossMark and FundRef

Kirsty Meddings from CrossRef

Kirsty Meddings, Product Manager at CrossRef, updated the delegates at ALPSP's recent data seminar on 'New Metadata, New Identifiers: CrossMark and FundRef.'

CrossRef is a non-profit membership association with over 4,000 publishers and organisations who are members. Traditional metadata includes title, volume, contributors, issue, publication date, ISSN, ISBN, URL, DOI and first page information. New metadata elements include: funder name, correction, award number, license reference, ORCID, publication history and retraction.

Funder information is increasingly important. Often it is a free text definition and there are issues of consistency even when marked up with tagging and names (e.g. NIH, N.I.H., etc). Why does this matter? Funding bodies can’t easily track published output of funded work. It isn’t easy to report which articles result from research supported by specific funders, making it really difficult to report and analyse. This is why FundRef was launched. They are exploring tying together funder IDs with Ringgold and an overlaying ISNI ID.
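To make the consistency problem concrete, here is a minimal Python sketch of matching free-text funder names against a controlled list. The registry entries, the identifier values and the function names are all invented for illustration; they are not FundRef data, and a real service would look names up in the actual funder registry.

```python
import re

# Hypothetical mini-registry: canonical funder name -> made-up identifier.
FUNDER_REGISTRY = {
    "national institutes of health": "funder:0001",
    "wellcome trust": "funder:0002",
}

# Hypothetical known abbreviations mapped to canonical names.
ABBREVIATIONS = {
    "nih": "national institutes of health",
}


def normalise(name: str) -> str:
    """Lower-case, drop punctuation and collapse runs of whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def match_funder(free_text: str):
    """Return the registry identifier for a free-text funder name, or None."""
    key = normalise(free_text)
    key = ABBREVIATIONS.get(key, key)
    return FUNDER_REGISTRY.get(key)
```

Here "NIH", "N.I.H." and "National Institutes of Health" all resolve to the same made-up identifier, while an unrecognised name returns None and can be queued for a human to check.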

They are looking at the Licence_Ref element for updates and changes: errata, corrigenda, updates, enhancements, withdrawals, retractions, new editions and policy updates. As Meddings noted, we’ve come a long way from the days of product recall, but there is still some way to go before we get it right.

Wednesday, 22 January 2014

Laura Cox on institutional identifiers internally and throughout the supply chain

Laura Cox: do a data health check

Laura Cox, Chief Financial and Operating Officer at Ringgold kicked off her talk with a focus on what constitutes healthy data. Good quality, reliable and consistent data helps make good decisions. You gain insight into customers and business relationships as well as support strategic planning, decision making and ongoing business operations.

Poor data has real consequences. It makes it hard to get a true picture of relationships with institutions, and can lead to a lack of quality author (and affiliation) data and an inability to see overlap between authors, members and customers. It can drive inaccurate holdings and revenue reports, leading to protracted effort which can cost your business time and money. Healthy records are complete, accurate, free of duplications, current, consistent and conform with standard identifiers.

What are unique identifiers and how can they help?

They are numeric or alpha-numeric designations which are associated with a single entity. Entities can be institutions, persons or pieces of content. They enable the disambiguation of each entity and provide a proper understanding of customer, author, reader or institution as well as a proper identification of content object, article, product or package. They can also be used internally or in conjunction with external partners.

Why should we worry about data now?

Cox cited the 2012 STM Report (Ware, M and Mabe, M. The STM Report, 2012) which stated that the number of researchers and the number of articles are both increasing by 3% per annum. The number of journals is increasing by 3.5% per annum and growth in China has been in double digits for over 15 years. At the same time there is increased demand for anytime/anywhere access while library budgets are frozen or being cut: less money for more content.

Institutional Identifiers can be used for disambiguation (e.g. which UCL?), consolidating different versions (many ways of describing the University of Oxford and its institutions). They provide a hierarchy view (institute within an institution) and reinforce uniqueness. This means you can use them for a gap analysis.

The Kafka-esque 'Identifiers identified' slide

The main challenge is around multiple data sources. There are system data silos, multiple locations - geographic data silos, data entered by different people for different purposes, data from third parties in the supply chain and data from bought in sources. These things aren’t integrated. Typical publisher systems include financial, CRM or sales databases, authentication system, fulfilment, usage statistics, submissions systems and so on.

Cox advised that the first thing to do is think about your data and implement a data governance plan. What data is held, where and how is it accessed? How can it be used to benefit business and work across silos? But always bear in mind where are you now and where do you want to go?

Another recommendation was to improve data capture. If you can, use web forms as they minimise variance in data input. Implement required fields, use date validation and at a minimum use naming conventions. There are a number of tools such as address validation, postcode look-up, institution validation/lookup. Avoid free-text fields and make institutional identifiers a requirement.
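As a rough illustration of these data capture rules, the Python sketch below checks a submitted web form for required fields, a valid date and an institutional identifier. The field names, the date format and the assumption that the identifier is purely numeric are hypothetical choices made for the example, not something specified in the talk.

```python
from datetime import datetime

# Hypothetical set of fields the form insists on.
REQUIRED_FIELDS = ("contact_name", "email", "institution_id", "start_date")


def _text(form: dict, field: str) -> str:
    """Return the field as stripped text, treating missing or None as empty."""
    return str(form.get(field) or "").strip()


def validate_submission(form: dict) -> list:
    """Return a list of problems found in a submitted form (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not _text(form, field):
            problems.append(f"missing required field: {field}")

    start = _text(form, "start_date")
    if start:
        try:
            datetime.strptime(start, "%Y-%m-%d")
        except ValueError:
            problems.append("start_date must be a valid date in YYYY-MM-DD format")

    # Assumption for this sketch only: identifiers are digits, with no free text.
    inst = _text(form, "institution_id")
    if inst and not inst.isdigit():
        problems.append("institution_id must be numeric")
    return problems
```

A form that passes returns an empty list; anything else returns readable messages that can be shown next to the offending fields before the record ever reaches the database.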

You can use an institutional identifier as a linchpin to link internal systems for better data integration. It can prevent duplicate account creations, help keep data up-to-date and systems synchronized. It also enables staff to use data more effectively, break down silos, simplify data transmission and provide more insight and power to analyse and understand the business.
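A small Python sketch shows what linking systems on a shared identifier might look like in practice. The system names, the "institution_id" key and both functions are hypothetical; the point is only that once every record carries the same identifier, grouping across silos and spotting duplicate accounts become simple operations.

```python
from collections import defaultdict


def link_by_identifier(systems: dict) -> dict:
    """Group records from several systems under their shared institutional identifier.

    `systems` maps a system name to a list of record dicts, each expected to
    carry an "institution_id" key. Records without one are grouped under None.
    """
    linked = defaultdict(lambda: defaultdict(list))
    for system_name, records in systems.items():
        for record in records:
            linked[record.get("institution_id")][system_name].append(record)
    return {inst: dict(by_system) for inst, by_system in linked.items()}


def duplicate_accounts(linked: dict, system_name: str) -> list:
    """Identifiers that appear on more than one record within a single system."""
    return sorted(
        inst for inst, by_system in linked.items()
        if inst is not None and len(by_system.get(system_name, [])) > 1
    )
```

Records that end up under None are the ones still missing an identifier, which is a useful to-do list in its own right.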

So what can you do now?

Engage with the problems
Think about resources. Time? Money? Systems?
How do you want it to work – look at priorities
Have a data governance policy
Appoint a data champion and document everything
Create some basic rules for data entry
Use universal identifiers to clean and link your data
Work with suppliers and customers to use institutional identifiers to strengthen the supply chain.

Data, the universe and everything. How data can drive your business

Melinda Kenneway: loves data

Melinda Kenneway, Chair of Data, the universe and everything: How data can drive your business seminar, opened the day by emphasizing how data underpins modern business. She summed up the premise for the day by quoting Jeff Weiner from LinkedIn:

‘Data really powers everything that we do.’

If we’re not selling content, what are we selling? Data is also becoming a product or service itself now. An era of big data can help us understand and deliver a much better experience for customers. By 2012, 90% of all the data that had existed in our entire history had been created in the previous 2 years. We need to get the basics of data right, but also have a strategy.

It’s knowing what you need to know rather than trying to find out about everything. Have clear objectives and stick to them. Don’t forget to consider privacy and legal issues about data. Deliver useful services that build on what the market thinks is acceptable. And if you don’t like data now? You should get out of marketing.

Colin Meddings from DataSalon focused on why you should care about cleaning up your data. To start with it can embarrass or annoy your customers when it is wrong and it’s open to error when users don’t put in info correctly.

Colin Meddings: not vague about data

Continuing with quotes from the great and the good, he cited an observation from 2013’s UKSG conference by Liam Earney from JISC:

‘Publishers can’t even tell a library what they subscribe to.’

We’re too vague about absolute fundamental data. The basic process of publishing involves: submit, review, copy edit, publish, purchase, read, cite, licence. All of these aspects generate data. The volume of data is huge and all these areas are where errors or mistakes can happen and where data can conflict.

Reflect on your efficiency in dealing with data. Do you struggle with data wrangling where you have to continually review and refresh datasets? See data as an asset in your business. It is worth investing in. You will get a return. Don’t forget that there’s a legal requirement around data to ensure it is accurate and fit for purpose. He recommended the DQM Data Governance Maturity Model. Be aware. Become reactive. Turn this into proactive. Then make it a managed process. And finally, it becomes optimal.

Meddings outlined the seven deadly sins of data quality:

Missing data
‘Siloed’ data
Invalid data
Out of date info (people move around)
Inconsistencies
Right information, wrong field
Duplicate and conflicting data.

Put some effort and resource into sorting data. Task someone specifically, create champions, set targets. It’s great if you can employ a team, but if you can’t, get senior management buy-in and work with those who have aptitude/passion for data. Start with one problem, don’t try to fix all at once.

Do a data audit. It can be manual. People who work with data every day will know where the problems are. Sometimes an automated audit is better – good at finding information about your data for identifying empty fields, etc.
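For a flavour of what a very simple automated audit could look like, here is a Python sketch that counts empty fields and flags repeated values in a chosen key field. The record layout, the choice of "email" as the key and the function name are assumptions made for the example; it covers only two of the seven sins above (missing data and duplicates).

```python
from collections import Counter


def audit(records: list, key_field: str = "email") -> dict:
    """Count empty values per field and list values of `key_field` that repeat."""
    # Every field name seen on any record, so absent fields count as empty too.
    fields = sorted({name for record in records for name in record})
    empty = {
        name: sum(1 for r in records if not str(r.get(name) or "").strip())
        for name in fields
    }
    # Compare key values case-insensitively and ignoring stray whitespace.
    keys = Counter(str(r.get(key_field) or "").strip().lower() for r in records)
    duplicates = sorted(k for k, n in keys.items() if k and n > 1)
    return {"total": len(records), "empty": empty, "duplicates": duplicates}
```

Running something like this over an export from each system gives a quick, repeatable picture of where the gaps are, which is a sensible starting point for the "one problem at a time" approach.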

Monday, 16 December 2013

Colin Meddings: Why data quality matters.

Colin Meddings is the Client Director at DataSalon. Colin will be one of the speakers at the forthcoming ALPSP seminar Data, the universe and everything taking place in January.

Here, in a guest post, he reflects on why good quality customer and internal data is important for scholarly publishers.

'Only four types of organisations need to worry about data quality: Those that care about their customers; Those that care about profit and loss; Those that care about their employees; and Those that care about their futures.' – Thomas C. Redman (2006)

Over recent years publishers have had to overcome many hurdles in the digital world, such as making content available online, managing complex consortia deals, creating new packages of content and tracking usage statistics. The result of all this digital activity is vast amounts of data. However, the pace of change can often distract from the careful governance of this data, leading to gaps, inconsistencies and inaccuracies.

But why does the quality of all this data matter so much? Good data is your most valuable asset, and bad data can seriously harm your business and credibility…

What have you missed?
At a management level, poor data quality equates directly to poor visibility of key trends in the growth or decline of certain products or markets. At the contact level, you may miss out on valuable sales opportunities if email address fields aren’t filled out correctly or customer names are wrong. Having good data will help deliver better customer service and enhance your reputation, and it means you can make better selections for targeted prospecting, cross-selling and up-selling.

When things go wrong.
Bad data can lead to ‘accidents’ and wrong decisions or actions which can affect customer confidence. You’ve spent time building up a valuable customer list – so it’s important not to waste this by sending campaigns to the wrong people, or with messages which don’t match their interests, or to out-of-date or deceased contacts. Data quality issues can also cost you money directly – for example if invoices or renewal notices are sent to the wrong recipient, or at the wrong time.

Making confident decisions.
Data quality matters most of all because it enables your staff and management team to really trust the accuracy of the reports and analysis they’re given. Without that confidence, apparent trends or new opportunities will always leave you wondering whether they really present a true picture. But with a complete and accurate view of your customers and prospects, comes the confidence to make well informed business decisions and commit fully to your strategic planning.

So, data quality is a very important foundation for a publisher’s entire business planning process and customer contact strategy. Good data quality will allow your business and its reputation to grow and flourish.

Data quality is just one of the topics in the forthcoming ALPSP seminar Data, the universe and everything. Other areas covered will include the use of institutional and personal identifiers in the scholarly publishing supply chain, publisher metadata, data relating to open access publishing and some case studies from publishers who have tackled data issues.

This post originally appeared on DataSalon’s own blog From the Armchair.