Sunday, 15 September 2013

Data: not the why, but the how (and then, what?)

Simon Hodson introduces CODATA
It is now old news that data - its production, management and re-use potential - is of growing significance to all the key stakeholders within the scholarly communication ecosystem. Publishers need to navigate the emerging landscape of technology, researcher and industry needs, and funders’ and policy-makers’ priorities in order to continue supporting the growth and discoverability of knowledge.

Wiley's Fiona Murphy chaired a panel discussion that provided publishers with insights on how – and by whom – the roadmap is being written, as well as its challenges and potential opportunities from researchers, funders and industry perspectives.

Simon Hodson, Executive Director of CODATA, provided an overview of its work. CODATA's focus is on strengthening international science for the benefit of society by promoting improved scientific and technical data management and use. It is an international community and network of expertise on data issues. Its key areas of activity are policy frameworks for data, frontiers in data science and technology, and data strategies, and it develops data citation standards and practices. September sees the release of its major report, Out of Cite, Out of Mind.

Hodson provided an overview of relevant data policy, including the Royal Society's Science as an Open Enterprise report from 2012. Examples of projects relating to data policy include the Dryad Joint Data Archiving Policy and Dryad Sustainability.

Kerstin Lehnert, Director of Integrated Earth Data Applications (IEDA), provided a view of data from the researcher's perspective. Why open access to data? There are two main reasons: to allow verification of research results and to make data accessible for re-use.

Kerstin Leynert on useful data
Data must be fit for re-use: it must be discoverable, openly accessible, safe and useful. One example of useful data is the Sloan Digital Sky Survey, which has generated some 2,000 articles with over 70,000 citations, and a lot of useful science has come out of it. Another example is EarthChem Synthesis: within two minutes you can explore the whole literature and create a map showing the composition of different areas. It offers seamless integration within the discipline.

There are a number of guiding principles. Data quality: it is useful to include complete documentation of provenance. Domain-specific data stewardship: the development, maintenance and promotion of domain-specific, community-based standards for data and metadata. Domain-specific repositories are best positioned to ensure 'fitness for re-use'; however, they must provide professional data curation services and integrate with the scholarly communication ecosystem.

Lehnert outlined IEDA's development of standards. IEDA had requirements for the reporting of geochemical data, and steps were taken to move from a suite of databases to a repository to ensure better sustainability. It improved policies and procedures using DOIs, IGSNs and long-term archiving agreements with NGDC and the Columbia University libraries. It also sought accreditation through membership of the World Data System and as a publication agent of DataCite.

Lehnert closed with a number of questions. Many data types have no home or standards: where should these go? How can we help other domains to establish repositories and best practices? Who decides which repository data should be submitted to? Are there recommendations from societies? The sustainability of repositories is still not solved, so what are the business models to ensure longevity? How can we streamline the link between repositories and journals? Do we need a centralised solution?

The final speaker was Tony Brookes from the Department of Genetics at the University of Leicester. He urged data sharing as important, while acknowledging that it is problematic. His recommendations: prioritise IDs and risk categorisation, improve data discovery, and consider setting up database journals. In essence, don't just tweak the current model. The elephant in the room is the real reason data sharing isn't happening as quickly as we might like: no one wants to do it, whether they are researchers, institutions or companies.
