ALPSP blog: at the heart of scholarly publishing: journal publishing

Thursday, 18 October 2018

Getting From Word to JATS XML

In this blog post, Bill Kasdorf, Principal of Kasdorf & Associates, LLC, talks us through a perennial problem and the different approaches to addressing it:

It is a truth universally acknowledged that journal articles need to be in JATS XML but they’re almost always authored in Microsoft Word.

This is not news to anybody reading this. This has been an issue since before JATS existed. Good workflows want XML. So for decades (yes, plural) publishers have been trying to get well structured XML from authors’ manuscripts without having to strip them down to plain text and tag them by hand. (This still happens. I’m not going to include that in my list of strategies because nobody thinks that’s a good idea anymore.)

There are four basic strategies for accomplishing this:
• Dedicated, validating XML editors.
• Editors that emulate or alter MS Word.
• Use Word as-is, converting styles to XML.
• Editors that use Word as-is, with plug-ins.
Here are the pros and cons of these four approaches.

Dedicated, Validating XML Editors

This is the “make the authors do it your way” method. The authors are authoring XML from the get-go. And not just any XML. Not even just any JATS (or whatever XML model). Exactly the specification of JATS that the publisher needs, conforming in every way to the publisher’s or journal’s style guide and technical requirements. This strategy works in controlled authoring situations like the people developing technical documentation. (They’re probably authoring DITA, not JATS.) They’re typically employees of the publisher, and the document structures are exactly the same every day those employees show up to work.

I have never seen this strategy successfully employed in a traditional publishing context, although I have seen it attempted many times. (If anybody knows of a journal publisher doing this successfully, please comment. I’d like to know about it.) This doesn’t work for journals for two main reasons:
1. Authors hate it. They want Word.
2. They have already written the paper before submitting it to the journal. The horse is out of the barn!

Editors that Emulate or Alter MS Word

This always seems like a promising strategy, and it can work when it’s executed well in the right context. The idea is to let authors use Word, or something very like it, but make it impossible for them to do things you don’t want them to do (like making a line of body text bold when it should be styled as a heading). This is done either by disabling features in Word, like local formatting, or by creating a separate application that looks and acts a lot like Word.

I have seen this work in some contexts, but for authoring, I’ve seen it fail more often. The reason is No. 1 above. Despite being a lot like Word, it’s not Word, and authors balk at that. These are often Web-based programs, and authors want to write on a plane or the subway. And there’s always No. 2: most journal articles are written before it’s known which journal is going to publish them.

This strategy can work well, though, after authoring. Copyeditors and production staff can use a structured tool like this more successfully than authors can. We’re seeing these kinds of things proliferate in integrated editorial and production systems like Editoria, developed by the Coko Foundation for the University of California Press, and XEditPro, developed by a vendor, diacriTech.

Use Word As-Is, Converting Styles to XML

This is by far the most common way that Word manuscripts get turned into XML today. A well designed set of paragraph and character styles can be created to express virtually all of the structural components that need to be marked up in JATS for a journal article. This is done with a template, a .dotx file in Word, which, when opened, creates a .docx document with all of the required styles built in. And since modern Word files are XML under the hood, you can work with those files to get the JATS XML you need.
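To make the "XML under the hood" point concrete, here is a minimal sketch of reading paragraph styles and mapping them to JATS element names. The style names (ArticleTitle, Head1, BodyText) and the mapping are hypothetical, and the XML is a hand-written fragment standing in for the word/document.xml part of a real .docx file. A production converter has to deal with far more (character styles, nesting, references, metadata), so treat this only as an illustration of the idea.

```python
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML, the XML inside a .docx package.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical template style names mapped to JATS element names.
STYLE_TO_JATS = {
    "ArticleTitle": "article-title",
    "Head1": "title",
    "BodyText": "p",
}

# Hand-written stand-in for word/document.xml; the second paragraph
# is a "heading" faked with local bold formatting and no style.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="ArticleTitle"/></w:pPr>
  <w:r><w:t>Getting From Word to JATS</w:t></w:r></w:p>
<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Methods</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="BodyText"/></w:pPr>
  <w:r><w:t>Authors write in Word.</w:t></w:r></w:p>
</w:body></w:document>"""


def styled_paragraphs(document_xml):
    """Return (jats_element, text) pairs, one per Word paragraph."""
    root = ET.fromstring(document_xml)
    pairs = []
    for para in root.iterfind(".//w:p", NS):
        style = para.find("w:pPr/w:pStyle", NS)
        name = style.get(f"{{{W}}}val") if style is not None else None
        text = "".join(t.text or "" for t in para.iterfind(".//w:t", NS))
        # Paragraphs with no known style cannot be mapped automatically.
        pairs.append((STYLE_TO_JATS.get(name, "UNMAPPED"), text))
    return pairs


for element, text in styled_paragraphs(SAMPLE):
    print(element, "|", text)
```

The bold "Methods" line comes back as UNMAPPED: it looks like a heading to a human reader, but nothing in the file says so. That is exactly the kind of local formatting somebody downstream has to fix.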

The question is who does the styling, and how well it gets done.

Publishers are sometimes eager to give these templates to their authors so they can either write or, post-authoring, style their manuscripts according to the publisher’s requirements. Good luck with that. The problem is that it’s too easy to do it wrong. Use the wrong style. Use local formatting (see above). Put in other things that need to be cleaned up, like extra spaces and carriage returns. Somebody downstream has to fix these things.

Those people downstream tend to be trained professionals, and it’s usually best just to let them do the styling in the first place. This is how most JATS XML starts out these days: as professionally styled Word files. Many prepress vendors have trained staff take raw Word manuscripts and style them, often augmented by programmatic processing to reduce the manual work. These systems, which the vendors have usually developed in-house, also typically do a “pre-edit,” cleaning up the manuscript of many of those nasty inconsistencies programmatically to save the copyeditor work.

This is also at the heart of what I would consider the best in class of such programs, Inera’s eXtyles. Typically, a person or people on the publisher’s staff are trained to properly style accepted manuscripts; eXtyles provides features that make this easier to do than just using Word’s Styles menu. Then it goes to town, doing lots of processing of the resulting file based on under-the-hood XML. It’s primarily an editorial tool, not just a convert-to-XML tool.

Use Word As-Is, With Plug-Ins

This is not necessarily the same as the previous category, but there’s an overlap: eXtyles is a plug-in for Word, and the resulting styled Word files can just be opened up in Word without the plug-in by a copyeditor or author. But that approach still depends on somebody having styled the manuscript, and subsequent folks not having messed up the styling. It also presents the copyeditor (and then usually the author, who reviews the copyedits) with a manuscript that doesn’t look like the one the author submitted in the first place.

This tends to make authors suspicious—what else might have been changed?—and suspicious authors are more likely to futz. That’s why in those workflows it’s important to use Tracked Changes, though some authors realize that that can be turned on and off by the copyeditor so as not to track every little punctuation correction that’s non-negotiable anyway.

An approach that I have just recently come to appreciate is what Ictect uses. This approach is not dependent on styles. As much as I’ve been an advocate of styles for years, this is actually a good thing. Styles are the result of human judgment and attention. When done by trained professionals, that’s pretty much okay. But on raw author manuscripts—not.

Ictect uses Artificial Intelligence to derive the XML not from the appearance of the article, which is unreliable, but from the content. Stop and think about that a minute. Whereas authors are sloppy or incompetent in getting the formatting right, they are pretty darn obsessive about getting the content right. That’s their paper.

Speaking of which, in addition to not changing the formatting the author submitted, Ictect doesn’t change the content either. The JATS XML is now embedded in that Word file, but you only see that if you’re using the Ictect software. After processing by Ictect, the document is always a Word document and it is always a JATS document. To an author or a copyeditor it just looks like the original Word file. This inspires trust.

I was initially skeptical about this. But it actually works. Given a publisher’s style requirements and a sufficiently representative set of raw author manuscripts, Ictect can be set up to do a shockingly accurate job of generating JATS from raw author manuscripts. In seconds. Nobody plowing through the manuscripts to style them.

There have been tests done by large STM publishers that have demonstrated that Ictect typically produces fully correct, richly tagged JATS for over half of the raw Word manuscript files submitted by authors, and over 90% of manuscripts can be perfected in less than ten minutes by non-technical staff like production editors. The Ictect software highlights the issues and makes it easy for publishing staff to see what the problem is in the Word file and fix it. That’s because the errors aren’t styling errors, they’re content errors. They have to be fixed no matter what.

In case you think this is simplistic or dumbed-down JATS XML, nope. I’m talking about fully expressed, granular JATS, with its metadata header and all the body markup and even granularly tagged references that enable Crossref and PubMed processing. Not just good-enough JATS. Microsoft Office 365 is not exactly a new kid on the block now, but journal publishers have not made much use of it. As things evolve naturally, more and more authors are going to use Office 365 for peer review, quick editing, corrections and even for full article writing. Since Ictect software creates a richly tagged Word document that can be edited using Office 365, it opens up some interesting workflow automation and collaboration possibilities, especially for large scale publishing.

And if you need consistently styled Word files, no problem. Because you’ve got that rich JATS markup, a styled file can be generated automatically in seconds. For example, in a consistent format for copyediting (I would strongly recommend that), or a format that’s modeled after the final published article format. Authors also really like to see that at an early stage. It’s an unavoidable psychological truism that when an author sees an article in published form she notices things she hadn’t noticed in her manuscript. So you can do both: return the manuscript in its original form, and provide a PDF from the styled Word file to emulate the final layout.

All of the methods I’ve discussed in this blog have a place in the ecosystem, in the right context. I haven’t mentioned a product that I wouldn’t recommend in the right situation. For example, you might initially view Ictect as a competitor of eXtyles and those home-grown programs the prepress vendors use. It’s not. It belongs upstream of them. It’s a way to get really well tagged JATS from raw author manuscripts to facilitate the use of editorial tools, without requiring manual styling. It’s the beginning of an Intelligent Content Workflow. It’s a very interesting development.

Bill Kasdorf is Principal of Kasdorf & Associates, LLC, a consultancy specializing in accessibility, XML/HTML/EPUB modeling, editorial and production workflows, and standards alignment. He is a founding partner of Publishing Technology Partners.

Website: https://pubtechpartners.com/

Twitter: @BillKasdorf

To find out further information on Ictect visit: http://www.ictect.com/

or register for one of their free monthly webinars at: http://www.ictect.com/journal-webinars

Tuesday, 13 March 2018

Understanding how to get your journal article in front of the reader

Understanding how to get your journal article in front of the reader and working out how to navigate the multitude of discovery resources and authentication barriers is essential to the success of a publishing organisation. It is also the topic of one of our most popular training courses. Here, we speak to Tracy Gardner, co-tutor of Online Journal Discovery and Delivery: Working with Libraries and industry intermediaries to maximise readership, about the challenges of keeping up-to-date in this area.

"One of the biggest challenges publishers face is making sure their content can be easily found in the various discovery resources readers use to find journal articles, and then to ensure the steps between the reader finding the content and reading it are seamless and without barrier. There are so many potential pitfalls along the way, and this issue therefore concerns people working in production, IT, editorial, sales, marketing and customer service.

The pace of change is fast, technology is evolving all of the time and the driver for much of it has come from the libraries. Libraries are keen to ensure their patrons find and access content they have selected and purchased and by keeping them in a library intermediated environment they feel they can improve their research experience overall. Ultimately the library would like the user to start at the library website, find content they can read and not be challenged along the way.

Simon Inger and I have been running the Online Journal Discovery and Delivery course two or three times a year for twelve years now and we have never run the same course twice - it constantly needs to be updated.

Those working in customer facing roles such as sales, marketing and customer service may not fully appreciate how much library technology impacts on the way researchers find and access their content. Many people are surprised to learn that poor usage within an institution is often because something has gone wrong with the way the content is indexed within the library discovery layer, how it is set up in the library link resolver, or issues with authentication.

For those in operational or technology roles, the business technology side of journals can seem unnecessarily complex and, especially for those new to the industry, the way the information community works can seem counter to the way many other business sectors operate. What makes sense in classic B2B or B2C environments will not make sense within the academic research community.

Our course helps people who work in publishing houses understand how the technology supporting journal delivery works, and how they can most effectively work with libraries to maximise discovery and use of their content. Many people who have attended our course have not been aware of the impact some of their decisions have had and our course has helped them understand why they need to work in certain ways."

Tracy Gardner will tutor on the upcoming Online Journal Discovery and Delivery course on 20 March. Full course details can be found here

Tuesday, 13 October 2015

Standard Identifiers, Metrics and Processes in Journal Publishing: Mark Hester asks 'Aren't they a bit...dull?'

Why should we use standards? Identifiers, transaction processes, schemas, metrics and many other things in scholarly publishing have standards, or are developing them. Isn’t this a rather arduous and bureaucratic way of handling things? Are these things really there to make life easier or just another way of overcomplicating an already complex market, taking time away from the efforts of actually producing high quality content?

Here Mark Hester of Aries Systems delves into why we should care.

Aren’t standards a bit... dull?

Standards? Just a bunch of numbers, right? With tedious documentation on how and where to use them? Why would I bother with those?

It’s not hard to see why you might think that, but it’s also easy to see how this is misguided. Jumping straight into a document to read about standards is a little bit like reading the telephone directory when you have no intention of calling someone, or leafing through a Haynes manual when you’re not repairing a car.

An example of a standard from outside publishing might help – EAN-13. What is EAN-13, you might ask? You see examples of it daily – it is the standard for the barcodes we see on everything we buy in the supermarket. Retail staff don’t need to know how EAN-13 works, and it is unlikely that they’ve read documentation on it, but they are all grateful that it does work when checking stocks, pricing items and working on the till and, in turn, so are their customers.
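For the curious, part of why EAN-13 "just works" is that the thirteenth digit is a check digit computed from the first twelve, so a mis-scanned or mistyped code is usually caught by the machine. The sketch below shows the standard weighting (1, 3, 1, 3, ...) in Python. The function names are my own, and the sample number is simply a commonly quoted valid code.

```python
def ean13_check_digit(first12: str) -> int:
    """Compute the EAN-13 check digit from the first twelve digits."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 digits")
    # Digits alternate weights 1 and 3, starting with 1 on the left.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    # The check digit brings the weighted sum up to a multiple of 10.
    return (10 - total % 10) % 10


def is_valid_ean13(code: str) -> bool:
    """Return True if a 13-digit string has a correct check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    return ean13_check_digit(code[:12]) == int(code[12])


print(is_valid_ean13("4006381333931"))  # True
print(is_valid_ean13("4006381333932"))  # False: last digit is wrong
```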

So I ignore standards: what’s the worst that can happen?

When I was a student in the early nineties, the departmental librarian had been using his own classification system for many years. Back then, it didn’t matter much – students got used to its quirks, visitors from other departments were rare, from other universities much rarer still. The people using the service understood it, and that was enough.

Imagine taking this approach in the online world - it would mean that your content would be less discoverable and also less usable. Online library catalogues wouldn’t work if everyone took the same approach as the librarian at my alma mater! Not using DOIs means frustration for researchers who can’t click on the references and go straight to the articles, and a simple change to a URL means a broken link. If your content isn’t seen it affects your reputation, and in the case of a commercial publisher, your profits.
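The reason a DOI survives a change of URL is that the link goes through the doi.org resolver rather than pointing at the publisher's site directly. As a small sketch of what that means in practice, the hypothetical helper below turns the various ways a DOI gets written in a reference list into one resolver link. The function name and the sample DOI string are illustrative only, and the validity check is deliberately loose.

```python
from urllib.parse import quote

# Common ways a DOI is written in manuscripts and reference lists.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Normalise a DOI string to a link through the doi.org resolver."""
    doi = doi.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    # Loose sanity check: DOIs start with "10." and contain a slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path.
    return "https://doi.org/" + quote(doi, safe="/")


print(doi_to_url("doi:10.1000/182"))
print(doi_to_url("http://dx.doi.org/10.1000/182"))
```

Both calls produce the same link. If the publisher later moves the article, only the record behind the resolver changes, and the link in the reference list keeps working.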

The benefit of standards will only increase as the ‘digital natives’ used to touch screen technology enter academia and the workplace – having to click more than once or search for more than a minute will lead them to go elsewhere.

How can standards enhance my working life and be good for my organization?

Rapid changes in scholarly publishing mean that new applications are found for standards once they are in place. Adopting standards can ‘future proof’ your content and processes against changes yet to come.

A great example of this is the relentless adoption of gold open access. The publishing standards which enable Copyright Clearance Center’s RightsLink for OA to display different article processing charge policies to different users on the fly developed separately from one another – Ringgold for institutions, ORCID for identifying authors, and FundRef for funder identification. Brought together, however, their machine readability allows flexible APC pricing models and automated billing and payment processing, making life easier and saving time and money for both publishers and institutions.
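"Machine readability" is meant quite literally here. An ORCID iD, for example, carries its own check character (ISO 7064 MOD 11-2), so software can reject a mistyped iD before any billing or matching is attempted. The sketch below illustrates that check in Python. The function names are my own, and it says nothing about how RightsLink or any other product actually uses these identifiers.

```python
def orcid_check_char(base15: str) -> str:
    """Compute the final character of an ORCID iD (ISO 7064 MOD 11-2)."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    # A result of 10 is written as the letter X.
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the format and check character of a hyphenated ORCID iD."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16:
        return False
    base = compact[:15]
    if not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_char(base) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False: wrong check character
```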

The advantages can be psychological as well as practical – if authors, researchers and librarians see the ORCID or CrossRef logos displayed on your website, they will know that your organization is a serious player, one which will help them, one they can trust.

So what's next?

By now, I hope I’ve convinced you of the importance of standards. But if the prospect of researching the topic still fills you with a sense of dread, there's an upcoming seminar from ALPSP I'm helping to coordinate called Setting the Standard. It's being held in London on Wednesday 11 November and includes speakers from CrossRef, Ringgold, ORCID, COUNTER, Thomson Reuters, EDItEUR, Jisc and an institution. Everything you ever wanted to know about standards, but were too scared to ask.

I hope to see you there.