Tuesday 8 October 2013

Big Data / Little Data: The practical capture, analysis and integration of data for publishers

Laura Dawson, from Bowker, leans in.
Laura Dawson from Bowker provided the ultimate 'Data 101' for publishers at the Big Data/Little Data session at Frankfurt Book Fair's Contec conference.

She cautioned that data doesn't stop with getting something on Amazon. Bowker has tracked the explosion in the number of books: in the United States there were 900,000 books in print in 1999, and this grew to 28 million in 2013. Information exists on a massive scale. We are swimming in it.

This abundance is both a problem and an opportunity. The problem is fluidity: all this information is out of its container. Abundance, persistence and fluidity lead to issues with discovery.

There are four different types of metadata:

  1. Bibliographic: basic book information, the classic understanding of metadata.
  2. Commercial: tax codes, proprietary fields.
  3. Transactional: inventory, locations, order and billings, royalties, etc.
  4. Merchandising: descriptive content, marketing copy, consumer oriented content.

Part of the challenge of managing metadata is the number of different sources: publisher-prepared files, publisher requests (typically email), data aggregators (e.g. Bowker), social reading sites, online and offline retailers, and libraries (remember them?).

Other complicating factors for digital metadata include differential timing (metadata for physical books is required six months before publication, for digital books upon publication), different attributes and more frequent price changes. Conversions are often outsourced and, in relative terms, this is a whole new process.

In current practice, metadata tends to be created in four primary departments (editorial/managing editorial, marketing, production and creative services). Management responsibility varies by sender. Most publishers treat publication as the end date for updates (although this is changing). Complete does not mean accurate, and inspection is limited. Prepping metadata is also somewhat ad hoc. But it's not all bad news. Many publishing houses are now looking at metadata as a functional map. They are examining the process and putting all data into a metadata repository.

Best practice in organising metadata is emerging. You need a hub - a single source of truth for your data, able to deal with multiple contributors and multiple recipients. Define roles clearly and provide a single source. Identifiers are much more efficient for search engines than thesauri. Text matching doesn't work across character sets, or even across languages that use the same characters.
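To make the point about text matching concrete, here is a minimal Python sketch. The three records, their field names and the `ID-0001` identifier are all invented for illustration; they are not taken from the talk or from any real database. It shows the same author name failing a plain text comparison, partly recovering after Unicode normalisation, and matching cleanly on a shared identifier.

```python
import unicodedata

# Hypothetical records from three sources. "ID-0001" is an invented placeholder,
# not a real identifier from any registry.
publisher = {"author_id": "ID-0001", "author": "Gabriel Garc\u00eda M\u00e1rquez"}    # precomposed accents
retailer = {"author_id": "ID-0001", "author": "Gabriel Garci\u0301a Ma\u0301rquez"}  # combining accents
library = {"author_id": "ID-0001", "author": "Габриэль Гарсиа Маркес"}               # Cyrillic script


def same_by_text(a, b):
    """Naive comparison of the raw name strings."""
    return a["author"] == b["author"]


def same_by_normalised_text(a, b):
    """Comparison after Unicode NFC normalisation (fixes encoding variants only)."""
    return (unicodedata.normalize("NFC", a["author"])
            == unicodedata.normalize("NFC", b["author"]))


def same_by_id(a, b):
    """Comparison on the shared identifier."""
    return a["author_id"] == b["author_id"]


print(same_by_text(publisher, retailer))             # False: same name, different code points
print(same_by_normalised_text(publisher, retailer))  # True: normalisation repairs this case
print(same_by_normalised_text(publisher, library))   # False: a different script never matches
print(same_by_id(publisher, library))                # True: the identifier is script-independent
```

Normalisation helps with encoding differences, but it cannot bridge scripts or spelling conventions, which is where an identifier earns its keep.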

A number of codified representations of a concept should be used, because they act as shortcuts for search engines.

Machine language is key. Codes are easier to process than text, faster and less complex. Codes are unambiguous. Natural language evolves and is more unstable. You can link data sets using ISNI (the International Standard Name Identifier). Content's new vocabulary is based upon:

  • structured content
  • linked data/linked open data
  • the semantic web
  • ontology
  • GoodRelations - an ontology devised specifically for describing products for sale
  • RDF - Resource Description Framework
  • and data visualisation.
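As a rough illustration of why codes are easier for machines than text, and of linking data sets using ISNI, here is a short Python sketch. It assumes the ISNI check character is calculated with ISO 7064 MOD 11-2 over the first 15 digits. The sample value is simply a 16-character string that satisfies that check, and the names, field names and function names are made up for the example. It is a sketch under those assumptions, not a description of Bowker's systems.

```python
def isni_check_character(first_15_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in first_15_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def compact(value: str) -> str:
    """Strip the spaces and hyphens often used for display."""
    return value.replace(" ", "").replace("-", "").upper()


def is_well_formed_isni(value: str) -> bool:
    """Check length, digits and check character; says nothing about registration."""
    code = compact(value)
    if len(code) != 16 or not code[:15].isdigit():
        return False
    return isni_check_character(code[:15]) == code[15]


def link_by_isni(left, right):
    """Pair up rows from two data sets that carry the same ISNI."""
    index = {compact(row["isni"]): row for row in right}
    pairs = []
    for row in left:
        key = compact(row["isni"])
        if key in index:
            pairs.append((row, index[key]))
    return pairs


# Invented rows: the names are spelled differently, but the code is the same.
catalogue = [{"isni": "0000 0002 1825 0097", "name": "Carberry, Josiah"}]
sales = [{"isni": "0000000218250097", "name": "J. S. Carberry"}]

print(is_well_formed_isni("0000 0002 1825 0097"))  # True
print(is_well_formed_isni("0000 0002 1825 0096"))  # False: check character is wrong
print(len(link_by_isni(catalogue, sales)))         # 1
```

A mistyped code fails the check straight away, whereas a mistyped name just looks like another name.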
