Thursday, 18 October 2018

Getting From Word to JATS XML

In this blog Bill Kasdorf, Principal, Kasdorf & Associates, LLC talks us through a perennial problem and the different approaches to addressing this:

It is a truth universally acknowledged that journal articles need to be in JATS XML but they’re almost always authored in Microsoft Word.

This is not news to anybody reading this. This has been an issue since before JATS existed. Good workflows want XML. So for decades (yes, plural) publishers have been trying to get well structured XML from authors’ manuscripts without having to strip them down to plain text and tag them by hand. (This still happens. I’m not going to include that in my list of strategies because nobody thinks that’s a good idea anymore.)

There are four basic strategies for accomplishing this:
• Dedicated, validating XML editors.
• Editors that emulate or alter MS Word.
• Use Word as-is, converting styles to XML.
• Editors that use Word as-is, with plug-ins.
Here are the pros and cons of these four approaches.

Dedicated, Validating XML Editors

This is the “make the authors do it your way” method. The authors are authoring XML from the get-go. And not just any XML. Not even just any JATS (or whatever XML model). Exactly the specification of JATS that the publisher needs, conforming in every way to the publisher’s or journal’s style guide and technical requirements. This strategy works in controlled authoring situations, such as teams developing technical documentation. (They’re probably authoring DITA, not JATS.) Those authors are typically employees of the publisher, and the document structures are exactly the same every day those employees show up to work.

I have never seen this strategy successfully employed in a traditional publishing context, although I have seen it attempted many times. (If anybody knows of a journal publisher doing this successfully, please comment. I’d like to know about it.) This doesn’t work for journals for two main reasons:
1. Authors hate it. They want Word.
2. They have already written the paper before submitting it to the journal. The horse is out of the barn!

Editors that Emulate or Alter MS Word

This always seems like a promising strategy, and it can work when it’s executed well in the right context. The idea is to let authors use Word, or something very much like it, but make it impossible for them to do things you don’t want them to do (like making a line of body text bold when it should be styled as a heading). This is done either by disabling features in Word like local formatting or by creating a separate application that looks and acts a lot like Word.

I have seen this work in some contexts, but for authoring, I’ve seen it fail more often. The reason is No. 1 above. Despite being a lot like Word, it’s not Word, and authors balk at that. These are often Web-based programs, and authors want to write on a plane or the subway. And there’s always No. 2: most journal articles are written before it’s known which journal is going to publish them.

This strategy can work well, though, after authoring. Copyeditors and production staff can use a structured tool like this more successfully than authors can. We’re seeing these kinds of things proliferate in integrated editorial and production systems like Editoria, developed by the Coko Foundation for the University of California Press, and XEditPro, developed by a vendor, diacriTech.

Use Word As-Is, Converting Styles to XML

This is by far the most common way that Word manuscripts get turned into XML today. A well designed set of paragraph and character styles can be created to express virtually all of the structural components that need to be marked up in JATS for a journal article. This is done with a template, a .dotx file in Word, which, when opened, creates a .docx document with all of the required styles built in. And since modern Word files are XML under the hood, you can work with those files to get the JATS XML you need.
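To make the “XML under the hood” point concrete, here is a minimal sketch in Python of reading paragraph styles out of a .docx package and mapping them to JATS element names. The style names (ArticleTitle, Heading1, BodyText) and the mapping table are hypothetical, standing in for whatever a publisher’s template defines. The sample file is built in memory and contains only the main document part, so it is not a complete Word file. A real conversion would also have to nest headings and paragraphs into sections, handle character styles, and build the metadata header.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical mapping from a publisher's paragraph style IDs to JATS elements.
STYLE_TO_JATS = {
    "ArticleTitle": "article-title",
    "Heading1": "title",
    "BodyText": "p",
}

# Tiny stand-in for the main part of a styled manuscript (assumed content).
SAMPLE_DOCUMENT_XML = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="ArticleTitle"/></w:pPr><w:r><w:t>Getting to JATS</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Introduction</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="BodyText"/></w:pPr><w:r><w:t>Body text here.</w:t></w:r></w:p>
<w:p><w:r><w:t>An unstyled paragraph.</w:t></w:r></w:p>
</w:body></w:document>"""


def make_sample_docx():
    """Build an in-memory zip holding only word/document.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", SAMPLE_DOCUMENT_XML)
    return buf.getvalue()


def read_styled_paragraphs(docx_bytes):
    """Return (style_id, text) for every paragraph in the main document part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        paragraphs.append((style_id, text))
    return paragraphs


def to_jats_fragments(paragraphs):
    """Wrap mapped paragraphs in JATS elements; collect the rest for review."""
    fragments, unmapped = [], []
    for style_id, text in paragraphs:
        tag = STYLE_TO_JATS.get(style_id)
        if tag is None:
            unmapped.append((style_id, text))
            continue
        element = ET.Element(tag)
        element.text = text
        fragments.append(ET.tostring(element, encoding="unicode"))
    return fragments, unmapped


if __name__ == "__main__":
    fragments, unmapped = to_jats_fragments(read_styled_paragraphs(make_sample_docx()))
    for fragment in fragments:
        print(fragment)
    print("Needs attention:", unmapped)
```

The unstyled paragraph ends up in the “needs attention” list rather than being guessed at, which is the crux of this whole approach: the conversion is only as good as the styling.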

The question is who does the styling, and how well it gets done.

Publishers are sometimes eager to give these templates to their authors so they can either write or, post-authoring, style their manuscripts according to the publisher’s requirements. Good luck with that. The problem is that it’s too easy to do it wrong. Use the wrong style. Use local formatting (see above). Put in other things that need to be cleaned up, like extra spaces and carriage returns. Somebody downstream has to fix these things.

Those people downstream tend to be trained professionals, and it’s usually best just to let them do the styling in the first place. This is how most JATS XML starts out these days: as professionally styled Word files. Many prepress vendors have trained staff take raw Word manuscripts and style them, often augmented by programmatic processing to reduce the manual work. These systems, which the vendors have usually developed in-house, also typically do a “pre-edit,” cleaning up the manuscript of many of those nasty inconsistencies programmatically to save the copyeditor work.
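As an illustration only, a “pre-edit” pass of the simplest kind might look like the Python sketch below. It is not any vendor’s actual system, which I have not described in detail here. The function name and the rules are assumptions: it collapses runs of spaces and tabs, trims each paragraph, and drops the empty paragraphs left behind by extra carriage returns.

```python
import re


def pre_edit(paragraphs):
    """Normalize whitespace in a list of paragraph strings (hypothetical rules)."""
    cleaned = []
    for text in paragraphs:
        # Collapse runs of spaces and tabs into one space, then trim the ends.
        text = re.sub(r"[ \t]+", " ", text).strip()
        # Paragraphs that are empty after trimming came from stray returns.
        if text:
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    raw = ["Intro  text ", "", "   ", "Second\tparagraph."]
    print(pre_edit(raw))
```

Real pre-edit routines go much further (reference cleanup, punctuation and spelling conventions, and so on), but the principle is the same: fix mechanically what can be fixed mechanically before a person touches the file.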

This is also at the heart of what I would consider the best in class of such programs, Inera’s eXtyles. Typically, a person or people on the publisher’s staff are trained to properly style accepted manuscripts; eXtyles provides features that make this easier to do than just using Word’s Styles menu. Then it goes to town, doing lots of processing of the resulting file based on under-the-hood XML. It’s primarily an editorial tool, not just a convert-to-XML tool.

Use Word As-Is, With Plug-Ins

This is not necessarily the same as the previous category, but there’s an overlap: eXtyles is a plug-in for Word, and the resulting styled Word files can just be opened up in Word without the plug-in by a copyeditor or author. But that approach still depends on somebody having styled the manuscript, and subsequent folks not having messed up the styling. It also presents the copyeditor (and then usually the author, who reviews the copyedits) with a manuscript that doesn’t look like the one the author submitted in the first place.

This tends to make authors suspicious—what else might have been changed?—and suspicious authors are more likely to futz. That’s why in those workflows it’s important to use Tracked Changes, though some authors realize that that can be turned on and off by the copyeditor so as not to track every little punctuation correction that’s non-negotiable anyway.

An approach that I have just recently come to appreciate is what Ictect uses. This approach is not dependent on styles. As much as I’ve been an advocate of styles for years, this is actually a good thing. Styles are the result of human judgment and attention. When done by trained professionals, that’s pretty much okay. But on raw author manuscripts—not.

Ictect uses Artificial Intelligence to derive the XML not from the appearance of the article, which is unreliable, but from the content. Stop and think about that a minute. Whereas authors are sloppy or incompetent in getting the formatting right, they are pretty darn obsessive about getting the content right. That’s their paper.

Speaking of which, in addition to not changing the formatting the author submitted, Ictect doesn’t change the content either. The JATS XML is now embedded in that Word file, but you only see that if you’re using the Ictect software. After processing by Ictect, the document is always a Word document and it is always a JATS document. To an author or a copyeditor it just looks like the original Word file. This inspires trust.

I was initially skeptical about this. But it actually works. Given a publisher’s style requirements and a sufficiently representative set of raw author manuscripts, Ictect can be set up to do a shockingly accurate job of generating JATS from raw author manuscripts. In seconds. Nobody plowing through the manuscripts to style them.

There have been tests done by large STM publishers that have demonstrated that Ictect typically produces fully correct, richly tagged JATS for over half of the raw Word manuscript files submitted by authors, and over 90% of manuscripts can be perfected in less than ten minutes by non-technical staff like production editors. The Ictect software highlights the issues and makes it easy for publishing staff to see what the problem is in the Word file and fix it. That’s because the errors aren’t styling errors, they’re content errors. They have to be fixed no matter what.

In case you think this is simplistic or dumbed-down JATS XML, nope. I’m talking about fully expressed, granular JATS, with its metadata header and all the body markup and even granularly tagged references that enable Crossref and PubMed processing. Not just good-enough JATS.

Microsoft Office 365 is not exactly a new kid on the block now, but journal publishers have not made much use of it. As things evolve naturally, more and more authors are going to use Office 365 for peer review, quick editing, corrections and even for full article writing. Since Ictect software creates a richly tagged Word document that can be edited using Office 365, it opens up some interesting workflow automation and collaboration possibilities, especially for large scale publishing.

And if you need consistently styled Word files, no problem. Because you’ve got that rich JATS markup, a styled file can be generated automatically in seconds. For example, in a consistent format for copyediting (I would strongly recommend that), or a format that’s modeled after the final published article format. Authors also really like to see that at an early stage. It’s an unavoidable psychological truism that when an author sees an article in published form she notices things she hadn’t noticed in her manuscript. So you can do both: return the manuscript in its original form, and provide a PDF from the styled Word file to emulate the final layout.

All of the methods I’ve discussed in this blog have a place in the ecosystem, in the right context. I haven’t mentioned a product that I wouldn’t recommend in the right situation. For example, you might initially view Ictect as a competitor of eXtyles and those home-grown programs the prepress vendors use. It’s not. It belongs upstream of them. It’s a way to get really well tagged JATS from raw author manuscripts to facilitate the use of editorial tools, without requiring manual styling. It’s the beginning of an Intelligent Content Workflow. It’s a very interesting development.

Bill Kasdorf is Principal of Kasdorf & Associates, LLC, a consultancy specializing in accessibility, XML/HTML/EPUB modeling, editorial and production workflows, and standards alignment. He is a founding partner of Publishing Technology Partners.


Twitter: @BillKasdorf

To find out further information on Ictect visit: 

or register for one of their free monthly webinars at:

Monday, 8 October 2018

2018 ALPSP Conference Report - From Adventures in Publishing to #MeToo

In this blog, Alastair Horne, Press Futurist and social media correspondent at this year's ALPSP Conference reports on a packed few days in Windsor hearing from the scholarly publishing community.

This year’s conference once again offered a range of perspectives from across the scholarly publishing ecosystem on the key issues that affect us.

Keynote - Professor Chris Jackson
Thursday’s opening keynote was given by Professor Chris Jackson, who shared his own experiences as a researcher who has engaged deeply with the industry, publishing more than 150 articles, acting as editor for three journals, and co-founding the EarthArXiv preprint server. In a wide-ranging talk, Jackson offered some advice for publishers drawn from his experience: to be transparent about APC pricing; to offer strongly reduced APCs to early career researchers in order to build an affinity with new authors; and to be clear about their views on metrics. On open access, though generally enthusiastic, he suggested that Plan S had caused concerns among academics and might create challenges for societies who relied on income from subscription or hybrid journals to fund their other activities.

Open access was, inevitably, a theme that persisted throughout the conference. The panel that followed Jackson’s talk asked how societies and publishers should ‘accelerate the transition’. Kamran Naim shared details of the ‘Subscribe to Open’ model used by non-profit publisher Annual Reviews, which addressed the twin problems of library policies on ‘donations’ often preventing the support of open access initiatives, and the fact that APCs don’t work for journals that publish invited contributions from scholars, rather than receiving submissions. The model, which bears some similarities to Knowledge Unlatched’s, sees libraries receive a discount on their journal subscriptions if they choose to participate in unlocking initiatives: if enough do so, then that volume’s issues of the journal become available through open access; if not, then only subscribing institutions have access. Naim’s fellow panellist Steven Hill, Director of Research at Research England, and architect of the new REF, insisted that the new requirement for open access monographs would not mandate any particular model. His position was strongly challenged, though, by the panel’s third speaker, Goldsmiths Press’ Sarah Kember, who asked why the transition to open access for monographs was happening at all, and called for a deceleration to allow time for more consideration of differences across the sector. Plan S, she suggested, totally disregarded the humanities and monographs, and posed a considerable threat to academic freedom by restricting where researchers could publish.

Panel debate on Open Access
The following day, a further session considered the impact of open access on library sales, strategies, and solutions, as library directors from Europe and the US shared some insights into their institutions’ recent cancellations of big deals. Wilhelm Widmark, Library Director of Stockholm University, suggested that the Swedish universities’ decision to reject what he described as a ‘good’ proposed deal with Elsevier was because it didn’t offer a sustainable route to full open access; the money saved is being redirected towards fully open access journals. Jean François Lutz, Head of the Digital Library at the University of Lorraine, and Adrian Alexander, Dean of the Library at the University of Tulsa, added that their own institutions’ decisions to cancel some of their big deal contracts were prompted by budget constraints and unsustainable pricing increases.

Friday’s opening session considered another increasingly hot topic: customer data. Chris Leonard from Emerald shared insights from their work in mapping user journeys in accessing their content, and one key finding – that though a high proportion of people who visit their site discover it through Google, the majority of those people don’t have institutional access and so leave; people who come to the site via library discovery services are far more likely to continue their journey further. Lettie Conrad of Maverick Consulting spoke of the wealth of data available to publishers, both internal – customer service records, sales reports, customer data, market research findings, product testing and user studies – and external – competitor analysis, discovery journeys, and usage analytics. Transforming such data into usable information required strategic thinking and some investment, she suggested, but it wasn’t rocket science. The third panel member, David Hutcheson, told how BMJ had developed a strategy for using data to inform their decisions, drive user engagement and deepen user understanding. Working with consultants and stakeholders to create an overall plan, they started by deepening their understanding of their existing technology and resources and testing them to see what worked. Integrating their different platforms to connect their data, and developing partnerships with suppliers, the BMJ set up a small six-person data team to serve as a specialist centre of excellence, supporting the rest of the business, automating processes and delivering self-service reporting to enable and empower colleagues to make use of the data produced.

The parallel sessions offered the usual dilemma of which to attend, and though there’s too little space to describe them all here, a personal highlight was a fascinating panel on the digital humanities. Peter Berkery of the Association of University Presses, Paul Spence of King’s College London, and Etienne Posthumus of Brill all discussed recent experiments in finding modes of publishing that would support the complex needs of this growing sector. Spence spoke of the need to fix a common terminology for the different types of publications produced, while Berkery talked through four marquee digital projects by university presses: Rotunda at Virginia, Manifold at Minnesota, Fulcrum at Michigan, and .supDigital at Stanford; Posthumus spoke on Brill’s own initiatives in labs and data.

Revenues from rights formed the focus of the day’s final session, sponsored by Publishers’ Licensing Services: Rebecca Cook of Wiley emphasised the need for thorough documentation governing what can be done with content, while Clare Hodder urged publishers to invest in metadata.

Code Ocean wins the ALPSP Awards for Innovation 2018
Then, at the evening’s gala dinner, the winners of two prestigious ALPSP Awards were announced: Richard Fisher was honoured for his Contribution to Scholarly Publishing over a long career, both at Cambridge University Press and in a retirement that has been busier than many people’s main careers; then the cloud-based computational reproducibility platform Code Ocean was named the winner of the ALPSP Award for Innovation in Publishing.

The final day of the conference was dominated by ethical questions. Professor Graham Crow of the University of Edinburgh explored issues in research and publishing ethics, before the closing panel session addressed ‘The #MeToo Era in Academic Publishing: Tackling harassment and the roots of gender bias’. Femi Otitoju of the Challenge Consultancy shared some lessons drawn from thirty years of experience working in this area, emphasising the need to create the right working culture by focusing on positive outcomes rather than problems – having a ‘dignity at work’ policy rather than one on harassment, for instance – and prominently highlighting such policies through posters rather than pages buried on the company intranet. Karen Phillips of SAGE spoke of the need for publishers to learn from each other, while Eric Merkel-Sobotta of De Gruyter emphasised the importance of economic arguments in convincing management of the need to address such problems. Dr Afroditi Pina shared the results of her research into sexual harassment and successful strategies for addressing it: the need to agree appropriate sanctions for unacceptable behaviour, the role that public apologies can play in such sanctions, and the importance of listening un-defensively to those reporting harassment.

The Beaumont Estate

If you would like to hear more about this year's ALPSP Conference, you can find video footage, audio and speaker presentations at:

The ALPSP Conference and Awards 2019 will be held at Beaumont Estate, Old Windsor, UK on 11-13 September. Please save the date!