In this post, we hear from Tim Vines about DataSeer.
The idea for DataSeer comes from my time as a Managing Editor at the journal Molecular Ecology. In 2010 we adopted the Joint Data Archiving Policy – which mandates data sharing as a condition of publication – and were experimenting with how best to enforce it. We found that the only consistent approach was to check for compliance ourselves by reading every accepted article and listing the datasets the authors needed to share. After about 500 articles it occurred to me that we could get a machine to do the same job, with the advantage that the machine would be quicker, cheaper, and much more scalable.
Fast forward to 2018, when we received a Sloan Foundation grant to develop DataSeer as part of the open-source software developer Coko (aka the Collaborative Knowledge Foundation). We've recently released DataSeer in beta, and we're working with numerous potential users to see how best to fit DataSeer into their workflows.
What is the project/product that you submitted for the Awards?
Our organisation and our product are pretty much the same thing! Our goal with DataSeer is to address one of the biggest obstacles to Open Research Data: there’s no easy way to get from the generally worded data sharing policies to the actions the authors need to take for their particular manuscript.
This issue is pernicious because it both increases the time and effort authors need to devote to data sharing (often to the point that they give up altogether) and prevents the stakeholders (journals and funders) from knowing what should have been done. The stakeholders then struggle to enforce their data sharing policies, so authors face no consequences for non-compliance.
DataSeer uses Natural Language Processing to scan research texts for sentences that describe data collection, infers the type of data being collected, and provides best practice advice on how and where that dataset should be shared. Once the author has shared all of the required datasets (or given a reason why they can’t be shared), DataSeer passes a report back to the journal or funder. This approach saves time and worry for the authors and empowers stakeholders to promote open research data.
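To make the idea concrete, here is a deliberately simplified sketch of the kind of step DataSeer automates: splitting an article into sentences, flagging those that look like data-collection statements, and guessing a coarse data type. This toy version uses keyword cues rather than a trained NLP model (the cue list and function names are illustrative, not DataSeer's actual implementation).

```python
import re

# Hypothetical cue words mapped to coarse data types. The real system
# uses a trained NLP model, not keyword matching.
DATA_CUES = {
    "sequenced": "DNA sequence data",
    "genotyped": "genotype data",
    "surveyed": "survey data",
    "measured": "measurement data",
    "recorded": "observational data",
}

def find_data_sentences(text):
    """Return (sentence, inferred data type) pairs for sentences
    that appear to describe data collection."""
    # Naive sentence splitter: break after ., !, or ? followed by space.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        for cue, data_type in DATA_CUES.items():
            if cue in sentence.lower():
                hits.append((sentence, data_type))
                break  # one label per sentence is enough here
    return hits

article = (
    "We sampled 40 populations across the species range. "
    "Each individual was genotyped at 12 microsatellite loci. "
    "These results support the isolation-by-distance model."
)
for sentence, data_type in find_data_sentences(article):
    print(f"{data_type}: {sentence}")
```

In the toy example only the second sentence is flagged (as genotype data); the real challenge, of course, is that data-collection language is far more varied than a keyword list can capture, which is why DataSeer trains a model on thousands of open access articles instead.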
Tell us a little about how it works and the team behind it
DataSeer has three main parts: the algorithm, the user interface, and our ‘Research Data Wiki’. Our code is open source (here and here). The algorithm has been trained on about 3000 open access articles from a wide range of subject areas. Helpfully, researchers tend to describe data collection with similar language regardless of their field, so training an NLP algorithm to spot data sentences is a fairly manageable problem. As with any AI application, it makes a fair number of mistakes at the moment, but that should change as we process more and more articles.
The wiki hosts our ‘best practice’ advice for sharing many different types of data, and we encourage users to edit our advice if they feel it can be improved. Our vision is that widespread use of DataSeer will eventually lead to a global resource on best practice for data sharing across all areas of research.
As mentioned above, the idea for DataSeer stems from my JDAP enforcement efforts at Molecular Ecology. I started out as a researcher in evolutionary biology before moving into journal management in 2008. In 2014 I founded Axios Review, an independent peer review service that acted as a broker between authors and journals. I've since become a Managing Editor again (this time for the Journal of Sexual Medicine), and rejoined the Scholarly Kitchen blog. I am based in Vancouver, Canada.
Our business lead is Kristen Ratan – Kristen has been involved with developing technology solutions for the academic publishing industry for over 20 years, and has heaps of experience in bringing open science products from idea to marketplace. She has worked at HighWire Press, Atypon, PLOS, and Coko, and now runs her own consultancy on open source solutions for promoting open research. Kristen is based in Santa Cruz, California.
Our lead developer is Patrice Lopez, who has spent the last ten years developing open source NLP tools for research articles. His PDF parser, Grobid, has been applied to over 1.6 million articles and is incorporated into workflows at many large academic publishing organizations. Patrice is based in a small village in France.
In what ways do you think it demonstrates innovation?
DataSeer’s innovation is to use the efficiency of machine learning and Natural Language Processing to automate a really difficult step in enforcing data sharing policies: working out what the authors of a particular article need to do, and helping them do it. At some journals this step is performed by PhD level data curation experts, but as each article can take them between 30 minutes and an hour to process, this approach is only practical for accepted manuscripts at well-resourced publishers. By making this process much cheaper and quicker, DataSeer will enable many more journals to adopt data sharing policies.
Moreover, because DataSeer is cheap and highly scalable, it enables journals to require that all submitted articles share their data, so that the datasets can be scrutinised during peer review. This in turn will prompt researchers to be more rigorous with their data management throughout the research cycle, which should ultimately improve the overall reliability of published work.
DataSeer will also ensure that a much higher proportion of articles share their data, and that those articles do a better job of sharing all of their datasets. Articles will become more reproducible, and many more datasets will be available for testing new hypotheses, conducting powerful meta-analyses, or simply verifying the authors’ results. This is the crux of DataSeer’s innovation: by fixing an apparently minor stumbling block in the peer review process, we can usher in a revolution in open science.
What are your plans for the future?
In the immediate future, we’re focused on working with our current partners to ensure that DataSeer is doing everything that they need it to do. Longer term, we will 1) allow authors to deposit their data in the most suitable repository directly from our User Interface; 2) promote reproducibility by detecting mentions of code and data then helping authors share both correctly; and 3) expand DataSeer to numerous other use cases and workflows, to ensure that we’re helping as many groups and stakeholders as possible.
Tim Vines is a researcher, journal manager, and entrepreneur. His research has motivated and informed many aspects of the open data movement.
You can hear from all of the Finalists at the ALPSP Awards Lightning Session on Tuesday 8 September. Visit the ALPSP website to register and for full details of the ALPSP Virtual Conference and Awards 2020.
The 2020 ALPSP Awards for Innovation in Publishing are sponsored by PLS.