ALPSP blog: at the heart of scholarly publishing: Who's afraid of big data?


Fiona Murphy from Wiley chaired the final panel on day two of the 2014 ALPSP International Conference. She posed the question: how do we skill up on data, take advantage of opportunities and avoid the pitfalls?

Eric T. Meyer, Senior Research Fellow and Associate Professor at the University of Oxford, was the first to try to answer. He observed that a few years ago you would struggle to gain an audience for a big data seminar. Today, it's usually standing room only.

Big data has been around for years. People were quite surprised when Edward Snowden leaked the NSA documents, but data collection on that scale had been going on for a long time. Big data in scholarly research has also been around for a long time in certain disciplines such as physics or astronomy. There was always money to be made in big data, but there's even more now, and everyone is starting to realise it. So much so that you need a big data strategy.

Meyer defines big data as data unprecedented in scale and scope in relation to a given phenomenon. It is about looking at the whole datastore rather than one dataset. Big data for understanding society is often transactional. We're talking really big. If you can use it on your laptop, it won't be big data.

Meyer drew on some entertaining examples of how big data can be used. If you key the same sentence into different country versions of Google, you'll see the responses vary. There are limits to big data approaches: they can come up with misleading results. What happens when bots are involved? Does it skew the results? The challenge will be how to make the data meaningful and more useful.

David Kavanagh from Scrazzl reflected on the challenge researchers face when deciding how to structure and plan their experiments. If you wanted to leverage collective scientific knowledge and identify which products to use for your work, there was no structured way of searching for this. Kavanagh urged publishers to throw computational power at data and content as a way to solve problems, improve how they work and help make sense of unstructured content.

That's what they have tackled with Scrazzl, a practical application of the structured and unstructured data that Eric Meyer mentioned. You need to have a product database. Then you have to cut out as much human intervention as you can. Automation is key. Where they couldn't find a unique identifier or a catalogue match, they had to make it as fast as possible for a human to make an association. Speed is key.

Finally, they built a centralised cloud system in which vendors could update their own records. It's a crowdsourced system for those who have a vested interest in keeping it up to date. The opportunity for them going forward will be through releasing this structured information through unstructured APIs to drive new insights. It also allows semantic enablement of content and offers the opportunity to think about categorisation in new ways.

Publishers running an ad-supported model can use the collection of products from the content search to identify which advert is best suited to each reader.


Paul F. Uhlir from the Board on Research Data and Information at The National Academies observed that even after 20 years, we have yet to deconstruct the print paradigm and reconstruct it on the Net very well. In the 1980s, a gigabyte was considered a lot of data. In the 1990s, a terabyte was a lot of data. In this decade, the exabyte era is not far ahead of us, with a whole lot of others beyond it.

There are huge databases in business, mining marketing information and other data, alongside the Internet of Things and the semantic web. Everything can now be captured, represented and manipulated in a database. It's an issue of quantity. But there is also an issue of quality. There needs to be a social response. There are a series of threats around big data.

Disintermediation

The rise of big data promises a lot more disruption. Think about 3D printing. The consequence could be millions of product designers specifying items. Manufacturing will be affected. Jobs will be lost. What will happen to the workers in a repair and body shop when cars are driverless? What will happen to the insurance industry? Workers will be disintermediated. What is certain is that there will be massive labour shifts and disruptions.

Playing God

Custom organs for body parts. The ability to insert genes into another organism. All these applications are data intensive and will become even more so. They raise profound social and ethical issues and have the potential to do great harm.

End of Privacy

Meyer touched on the NSA files. What about spy satellites? The ubiquity of CCTV? These images are kept in huge databases for future use. Product information is held by private companies and used to identify preferences. There is no such thing as privacy any more.

Inequality

Big data are an increasingly powerful means of strengthening a hold on global power.

Complexity

The more we learn, the less we know. Any scientist will tell you that greater understanding leads to more questions than answers.

Luddite reactions

Some people react to the encroachment of the strange and frightening techniques of the technology age with passive resistance, trying to lead a simple life.

There are also a number of weaknesses that centre around:

Improving the policies of the research community
New or better incentive mechanisms versus mandates
Explicit links of big data to innovation and tech transfer
Changing legal landscape: lag in law, bad law, IP law
Data for policy: communicating with decision makers

Thursday, 11 September 2014
