Thursday 18 October 2018

Getting From Word to JATS XML

In this blog post, Bill Kasdorf, Principal of Kasdorf & Associates, LLC, talks us through a perennial problem and the different approaches to addressing it:

It is a truth universally acknowledged that journal articles need to be in JATS XML but they’re almost always authored in Microsoft Word.

This is not news to anybody reading this. This has been an issue since before JATS existed. Good workflows want XML. So for decades (yes, plural) publishers have been trying to get well structured XML from authors’ manuscripts without having to strip them down to plain text and tag them by hand. (This still happens. I’m not going to include that in my list of strategies because nobody thinks that’s a good idea anymore.)

There are four basic strategies for accomplishing this:
• Dedicated, validating XML editors.
• Editors that emulate or alter MS Word.
• Use Word as-is, converting styles to XML.
• Editors that use Word as-is, with plug-ins.
Here are the pros and cons of these four approaches.

Dedicated, Validating XML Editors

This is the “make the authors do it your way” method. The authors are authoring XML from the get-go. And not just any XML. Not even just any JATS (or whatever XML model). Exactly the specification of JATS that the publisher needs, conforming in every way to the publisher’s or journal’s style guide and technical requirements. This strategy works in controlled authoring situations, such as teams developing technical documentation. (They’re probably authoring DITA, not JATS.) They’re typically employees of the publisher, and the document structures are exactly the same every day those employees show up to work.

I have never seen this strategy successfully employed in a traditional publishing context, although I have seen it attempted many times. (If anybody knows of a journal publisher doing this successfully, please comment. I’d like to know about it.) This doesn’t work for journals for two main reasons:
1. Authors hate it. They want Word.
2. They have already written the paper before submitting it to the journal. The horse is out of the barn!

Editors that Emulate or Alter MS Word

This always seems like a promising strategy, and it can work when it’s executed well in the right context. The idea is to make it impossible for authors to do things you don’t want them to do (like making a line of body text bold when it should be styled as a heading), either by disabling features in Word like local formatting or by creating a separate application that looks and acts a lot like Word.

I have seen this work in some contexts, but for authoring, I’ve seen it fail more often. The reason is No. 1 above. Despite being a lot like Word, it’s not Word, and authors balk at that. These are often Web-based programs, and authors want to write on a plane or the subway. And there’s always No. 2: most journal articles are written before it’s known which journal is going to publish them.

This strategy can work well, though, after authoring. Copyeditors and production staff can use a structured tool like this more successfully than authors can. We’re seeing these kinds of things proliferate in integrated editorial and production systems like Editoria, developed by the Coko Foundation for the University of California Press, and XEditPro, developed by a vendor, diacriTech.

Use Word As-Is, Converting Styles to XML

This is by far the most common way that Word manuscripts get turned into XML today. A well designed set of paragraph and character styles can be created to express virtually all of the structural components that need to be marked up in JATS for a journal article. This is done with a template, a .dotx file in Word, which, when opened, creates a .docx document with all of the required styles built in. And since modern Word files are XML under the hood, you can work with those files to get the JATS XML you need.
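To make the "XML under the hood" point concrete, here is a minimal sketch in Python of reading paragraph styles out of a .docx and mapping them to JATS element names. The style IDs in STYLE_TO_JATS (ArticleTitle, Heading1, BodyText) are hypothetical stand-ins for whatever a publisher's template defines, and the output is a flat list of elements only. A real conversion would also have to nest sections, build the metadata header, and handle character styles, so treat this as an illustration rather than how any particular product does it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical mapping from template style IDs to JATS element names.
STYLE_TO_JATS = {
    "ArticleTitle": "article-title",
    "Heading1": "title",
    "BodyText": "p",
}


def read_document_xml(docx_bytes):
    """Return the main document part of a .docx, which is a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read("word/document.xml")


def styled_paragraphs(document_xml):
    """Yield (style_id, text) for each paragraph in document.xml."""
    root = ET.fromstring(document_xml)
    for para in root.iter(f"{{{W}}}p"):
        style = para.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        yield style_id, text


def to_jats_fragments(document_xml):
    """Return (elements, unmapped): JATS elements for known styles, and
    (style_id, text) pairs for paragraphs a person still needs to style."""
    elements, unmapped = [], []
    for style_id, text in styled_paragraphs(document_xml):
        tag = STYLE_TO_JATS.get(style_id)
        if tag is None:
            # Unstyled or unknown style: report it instead of guessing.
            unmapped.append((style_id, text))
            continue
        element = ET.Element(tag)
        element.text = text
        elements.append(element)
    return elements, unmapped
```

Calling to_jats_fragments(read_document_xml(data)) on the bytes of a styled manuscript returns the mapped elements plus a list of paragraphs whose style was missing or unrecognized, which is exactly the pile somebody downstream has to deal with.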

The question is who does the styling, and how well it gets done.

Publishers are sometimes eager to give these templates to their authors so they can either write or, post-authoring, style their manuscripts according to the publisher’s requirements. Good luck with that. The problem is that it’s too easy to do it wrong. Use the wrong style. Use local formatting (see above). Put in other things that need to be cleaned up, like extra spaces and carriage returns. Somebody downstream has to fix these things.

Those people downstream tend to be trained professionals, and it’s usually best just to let them do the styling in the first place. This is how most JATS XML starts out these days: as professionally styled Word files. Many prepress vendors have trained staff take raw Word manuscripts and style them, often augmented by programmatic processing to reduce the manual work. These systems, which the vendors have usually developed in-house, also typically do a “pre-edit,” cleaning up the manuscript of many of those nasty inconsistencies programmatically to save the copyeditor work.
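As a rough idea of what the simplest part of a programmatic "pre-edit" might look like, here is a small Python sketch that collapses runs of spaces and tabs and drops paragraphs that are nothing but extra carriage returns. The function name pre_edit and its two rules are assumptions made for illustration. The vendors' in-house systems are not public and certainly do far more than this.

```python
import re


def pre_edit(paragraphs):
    """Normalize whitespace and drop empty paragraphs.

    Takes a list of paragraph strings and returns
    (cleaned_paragraphs, number_of_paragraphs_changed_or_dropped).
    """
    cleaned = []
    changes = 0
    for text in paragraphs:
        # Collapse runs of spaces and tabs, then trim both ends.
        tidy = re.sub(r"[ \t]+", " ", text).strip()
        if tidy != text or not tidy:
            changes += 1
        if tidy:
            # Keep only paragraphs that still have content.
            cleaned.append(tidy)
    return cleaned, changes
```

Returning a count alongside the cleaned text is a deliberate choice here: it lets the copyeditor see how much was touched without every trivial fix cluttering the manuscript.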

This is also at the heart of what I would consider the best in class of such programs, Inera’s eXtyles. Typically, a person or people on the publisher’s staff are trained to properly style accepted manuscripts; eXtyles provides features that make this easier to do than just using Word’s Styles menu. Then it goes to town, doing lots of processing of the resulting file based on under-the-hood XML. It’s primarily an editorial tool, not just a convert-to-XML tool.

Use Word As-Is, With Plug-Ins

This is not necessarily the same as the previous category, but there’s an overlap: eXtyles is a plug-in for Word, and the resulting styled Word files can just be opened up in Word without the plug-in by a copyeditor or author. But that approach still depends on somebody having styled the manuscript, and subsequent folks not having messed up the styling. It also presents the copyeditor (and then usually the author, who reviews the copyedits) with a manuscript that doesn’t look like the one the author submitted in the first place.

This tends to make authors suspicious—what else might have been changed?—and suspicious authors are more likely to futz. That’s why in those workflows it’s important to use Tracked Changes, though some authors realize that that can be turned on and off by the copyeditor so as not to track every little punctuation correction that’s non-negotiable anyway.

An approach that I have just recently come to appreciate is what Ictect uses. This approach is not dependent on styles. As much as I’ve been an advocate of styles for years, this is actually a good thing. Styles are the result of human judgment and attention. When done by trained professionals, that’s pretty much okay. But on raw author manuscripts—not.

Ictect uses Artificial Intelligence to derive the XML not from the appearance of the article, which is unreliable, but from the content. Stop and think about that a minute. Whereas authors are sloppy or incompetent in getting the formatting right, they are pretty darn obsessive about getting the content right. That’s their paper.

Speaking of which, in addition to not changing the formatting the author submitted, Ictect doesn’t change the content either. The JATS XML is now embedded in that Word file, but you only see that if you’re using the Ictect software. After processing by Ictect, the document is always a Word document and it is always a JATS document. To an author or a copyeditor it just looks like the original Word file. This inspires trust.

I was initially skeptical about this. But it actually works. Given a publisher’s style requirements and a sufficiently representative set of raw author manuscripts, Ictect can be set up to do a shockingly accurate job of generating JATS from raw author manuscripts. In seconds. Nobody plowing through the manuscripts to style them.

There have been tests done by large STM publishers that have demonstrated that Ictect typically produces fully correct, richly tagged JATS for over half of the raw Word manuscript files submitted by authors, and over 90% of manuscripts can be perfected in less than ten minutes by non-technical staff like production editors. The Ictect software highlights the issues and makes it easy for publishing staff to see what the problem is in the Word file and fix it. That’s because the errors aren’t styling errors, they’re content errors. They have to be fixed no matter what.

In case you think this is simplistic or dumbed-down JATS XML, nope. I’m talking about fully expressed, granular JATS, with its metadata header and all the body markup and even granularly tagged references that enable Crossref and PubMed processing. Not just good-enough JATS.

Microsoft Office 365 is not exactly a new kid on the block now, but journal publishers have not made much use of it. As things evolve naturally, more and more authors are going to use Office 365 for peer review, quick editing, corrections and even for full article writing. Since Ictect software creates a richly tagged Word document that can be edited using Office 365, it opens up some interesting workflow automation and collaboration possibilities, especially for large scale publishing.

And if you need consistently styled Word files, no problem. Because you’ve got that rich JATS markup, a styled file can be generated automatically in seconds. For example, in a consistent format for copyediting (I would strongly recommend that), or a format that’s modeled after the final published article format. Authors also really like to see that at an early stage. It’s an unavoidable psychological truism that when an author sees an article in published form she notices things she hadn’t noticed in her manuscript. So you can do both: return the manuscript in its original form, and provide a PDF from the styled Word file to emulate the final layout.

All of the methods I’ve discussed in this blog have a place in the ecosystem, in the right context. I haven’t mentioned a product that I wouldn’t recommend in the right situation. For example, you might initially view Ictect as a competitor of eXtyles and those home-grown programs the prepress vendors use. It’s not. It belongs upstream of them. It’s a way to get really well tagged JATS from raw author manuscripts to facilitate the use of editorial tools, without requiring manual styling. It’s the beginning of an Intelligent Content Workflow. It’s a very interesting development.

Bill Kasdorf is Principal of Kasdorf & Associates, LLC, a consultancy specializing in accessibility, XML/HTML/EPUB modeling, editorial and production workflows, and standards alignment. He is a founding partner of Publishing Technology Partners.


Twitter: @BillKasdorf

To find out further information on Ictect visit: 

or register for one of their free monthly webinars at:
