Wednesday, 14 September 2016

Plenary 1: The Conversation: Research and Scholarly Publishing in the Age of Big Data

Ziyad Marar is Global Publishing Director at SAGE Publishing. Chairing the first plenary session of the ALPSP conference, he engaged his colleague Ian Mulvany, Head of Product Innovation, and Fran Bennett, CEO and co-founder of the big data company Mastodon C, in a conversation about publishing in the age of big data.

Is big data hype and nonsense - just an exciting term that lets agencies sell their services? Fran Bennett believes some fundamental things have changed that make it much more than that. It can help companies open up new insights, generate additional income and lower barriers to technology entry. As the technology improves, it can be put to new applications. There is more data and cheaper processing.

Mastodon C are working with the UK Government department responsible for animals and farming. They are collecting all the data on dead livestock. There aren't enough staff, so patterns sometimes get missed. They use computers to identify these threads for post-mortem analysis. The software can take messy structural data and sort it out so that expert humans can use their time more effectively and in a more targeted way.

Ian Mulvany thinks high-quality content is what we do as an industry, but it's all digitally mediated content. All publishing organizations need to be technologically competent. We're in a mixed world of software solutions that are beginning to be commodified. But the variety of services around them is still living in a handwritten world: a dilemma he is endlessly fascinated by.

Corporate applications of big data can transfer to publishing in market projections, customer retention, internal SWOT analysis and hiring. Mulvany asked how many publishers have tried to re-analyse their entire corpus using big data techniques. Not many hands went up... there are lots of opportunities here. Bennett observed that a good data scientist is a statistician who can code and who understands the context of their data. She warned against tracking things purely because you can: the risk is that you create 'data exhaust' that you can't do anything with.

Mulvany noted that some fields have long worked with big data and have good standards and procedures to deal with it. He is particularly interested in working with researchers who have realised they have a whole load of data and don't know what to do with it. There is a 'data under the desk' problem: data is collected sporadically, is not necessarily kept well, and isn't large scale.

Delegates in the audience and on Twitter called for caution about using algorithms for peer review: they can and will be exploited by researchers. The panellists all agreed that machines can do the dirty work for us, but not all the work.

Marar outlined the work of the Berkeley sociologist Nick Adams, who is using crowdsourcing and algorithms to look at reports on the Occupy movements in nine cities. Analysis that would normally have taken 15 years has taken one year, and is finding interesting patterns. He also cited the work of Gary King, a Harvard social scientist who is developing and applying empirical methods in many areas of social science research, focusing on innovations that span from statistical theory to practical application.

Social researchers are coming more slowly to big data analysis, but are doing some unusual work with it. SAGE Publishing has conducted a massive survey into the area of data and social science with over 13,000 responses. It's something they are focusing on as a priority.

An interesting side issue with social data is that, when you look at it closely, you sometimes find that its quality is not what it might be, with the potential to lead to data protection breaches on a grand scale. There are differences between ethical and legal behaviour concerning datasets. It may be cheap to capture and hold data, but expensive to extract, clean and deliver it.

Mulvany closed with the observation that there are researcher needs and potential development tools, but asked why the industry should care about these things. Because at our heart we are about democratising knowledge and finding the right solutions and people around that knowledge. If we look purely at that purpose, we will realise how to make it happen. Those tools are becoming cheaper to experiment and innovate with. So we should do so.

