Tuesday, 23 September 2014

Big data: mining or minefield? Kurt Paulus reflects...

Who's Afraid of Big Data? Not this panel...
"Data are the stuff of research: collection, manipulation, analysis, more data… They are also the stuff of contemporary life: surveillance data, customer data, medical data… Some are defined and manageable: a researcher’s base collection from which the paper draws conclusions. Some are big: police data, NHS data, GCHQ data, accumulated published data in particular fields. Two Plenaries and several other papers at ALPSP 2014 addressed the issues, options, opportunities and some threats around them.

There have long been calls for authors’ underlying research data to be made accessible, so as to substantiate research conclusions, suggest further work and so on. The main Plenaries concerned themselves with Big Data: usually unstructured sets of elements of unprecedented scale and scope, such as the whole of Wikipedia, accumulated Google searches, the biomedical literature, or the visible galaxies in the universe. The challenge of ‘mining’ these datasets is to bring structure to them so that new insights emerge beyond those arising from limited or sampled data. This requires automation, big computing resources, appropriately accelerated human intervention and sometimes crowdsourcing.

Gemma Hersh from Elsevier on TDM
Text and data mining has some kinship with information discovery, where usually structured datasets are queried, but goes well beyond it by seeking to add structure to very large sets of data where there is little or none, so that information can be clustered, trends identified and concepts linked together to lead to new hypotheses. The intelligence services provide a prime, albeit hidden, example. So does the functional characterization of proteins, the mining of the UK Biobank for trends and new insights, or the crowd-sourced classification of galaxy types to test cosmological theories.
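The kind of clustering described here, grouping unstructured documents by shared vocabulary so that themes emerge without any predefined schema, can be sketched in plain Python. The toy documents, stop-word list and similarity threshold below are illustrative assumptions, not anything presented in the session:

```python
from collections import Counter
import math

STOPWORDS = {"the", "in", "and", "by", "of", "a"}

def vectorize(text):
    # Bag-of-words term frequencies, lowercased, stop words removed
    words = (w.strip(".,") for w in text.lower().split())
    return Counter(w for w in words if w and w not in STOPWORDS)

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = [
    "protein binding sites in the cell membrane",
    "membrane protein structure and binding",
    "spiral galaxy classification by volunteers",
    "elliptical and spiral galaxies in the survey",
]
vectors = [vectorize(d) for d in docs]

# Greedy single-pass clustering: attach each document to the first
# cluster whose seed document is similar enough, else start a new cluster
clusters = []
for i, v in enumerate(vectors):
    for cluster in clusters:
        if cosine(vectors[cluster[0]], v) > 0.2:
            cluster.append(i)
            break
    else:
        clusters.append([i])

print(clusters)  # two thematic groups emerge: proteins vs galaxies
```

Real TDM pipelines replace the bag of words with entity recognition and ontologies, but the principle, imposing structure on unstructured text so that patterns surface, is the same.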

Inevitably there are barriers and issues. The data themselves are often inadequate: for example, not all drug trials are published, and negative or null results are frequently excluded from papers. Research data are not always structured and standardized, and authors are often untutored in databases and ontologies. The default policy, it was recommended, should be openness in the availability of authors’ published and underlying data, standardized with full metadata and unique identifiers, to make the data usable and reduce the need for sophisticated mining.

CrossRef's Rachael Lammey
Because of copyright and licensing, not all data are easily accessed for retrieval and mining. Increasingly, licensing for ‘non-commercial purposes’ is permitted, but exactly what counts as non-commercial is ill defined, particularly in pharmaceuticals. Organizations such as CrossRef, CCC, PLS and others are beginning to offer services that support the textual, licensing and royalty-gathering processes for both research and commercial data mining.

Rejecting the Cassandra name tag, Paul Uhlir of the National Academies urged a note of caution. Big Data is changing the public and academic landscape, harbouring threats of disintermediation, complexity, Luddism and inequality, and exposing weaknesses in reproducibility, scientific method, data policy, metrics and human resources, amongst others.

Paul F. Uhlir urges caution

Judging by the remainder of these sessions and the audience reaction, excitement was more noticeable than apprehension.

ALPSP of course is on the ball and has just issued a Member Briefing on Text and Data Mining (member login required) and will publish a special online-only issue of Learned Publishing before the end of this month."

Kurt Paulus
