Jeromy Anglim's Blog: Psychology and Statistics


Monday, March 15, 2010

Converting a Microsoft Word Document into a LaTeX Document

This post discusses my experience converting a large MS Word document into a LaTeX document using Word-to-LaTeX. Along the way I encountered several challenges. I thought I'd document them in case it may be of interest to others.

Overview of Options
Having a good conversion method is important when transitioning existing Word documents to LaTeX and when collaborating with others who are not familiar with LaTeX. Wilfred Hennings provides a great page that summarises options for converting documents from PC Wordprocessors to LaTeX. The page lists options and provides recommendations. Wilfred's Quick Comparison List is particularly useful.

My Experience with Word-To-LaTeX:
I started my conversion journey with Michal Kerbt's Word-to-LaTeX (Word-to-XML) Convertor. It's free software but Michal accepts donations. I used it in its stand-alone form. It provides many options for converting documents. While it generally worked well, I ran into various challenges during the conversion process. The following describes each one and what I did in response.

Convert *.docx to *.doc format:
Problem: The convertor did not appear to work with Word 2007 files (*.docx).
Solution: Save As Word 97 (*.doc) format.

Set up a PostScript printer:
Problem: To convert graphics files, a PostScript printer needs to be set up.
Solution: (Warning: I've had some problems with the EPS files generated; I'm not sure if it's related to this printer setup.)
  1. Install a PostScript printer: Adobe sets out one way to set up a virtual printer. I followed these instructions. Here is a direct link to the downloads. In short, it involves installing a PostScript printer driver with a PPD specification.
  2. Specify in Word-to-LaTeX: This printer is then specified in the Word-to-LaTeX configuration: Figures/Eq/Documents - Figures - PostScript printer.

Big complex documents:
Problem: Long and complex documents can take a while to run (e.g., 15 minutes for a 60,000 word document with many styles and tables on a 2007 laptop).
Solution: Hey. Who cares! Just let it run. It's quicker than trying to do it manually.

Security Alert Over Macros:
Problem: The program installs a macro in the Word Startup folder. My version of Word (Word 2007) disabled this by default.
Solution: It is possible to enable all macros. However, this is not particularly safe. I decided to delete the file from: "C:\Program Files\Microsoft Office\Office12\STARTUP" and just run the program through its stand-alone desktop interface.

Tidying Up
Problem: The *.tex document was not exactly what I wanted.
Solution: Several options presented themselves.
  1. Change input: I could alter the Word document. I could remove styles, remove unwanted fonts, and so on.
  2. Change process: Word-to-LaTeX presents many configuration options which I could play with.
  3. Post-process: I could apply various replacement operations on the *.tex created by Word-to-LaTeX.
The solution I adopted combined all three approaches. For example, I converted hidden text in the Word document to a particular style. This meant that it was enclosed in a command in *.tex that was easy to find and replace in post-processing.
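
The post-processing step can be scripted. Here is a minimal sketch in Python, assuming the hidden-text style ended up wrapped in a command called \hiddentext{...}. That command name is hypothetical (use whatever Word-to-LaTeX actually emitted for your style), and the pattern assumes the wrapped text contains no nested braces.

```python
import re

# Hypothetical command name: substitute whatever command the converter
# produced for your custom Word style.
HIDDEN_COMMAND = "hiddentext"


def _command_pattern(command):
    # Matches a backslash, the command name, and one brace group
    # that contains no nested braces.
    return re.compile(r"\\" + re.escape(command) + r"\{([^{}]*)\}")


def strip_command(tex, command=HIDDEN_COMMAND):
    """Delete the command together with its argument."""
    return _command_pattern(command).sub("", tex)


def unwrap_command(tex, command=HIDDEN_COMMAND):
    """Drop the command wrapper but keep the argument text."""
    return _command_pattern(command).sub(r"\1", tex)


sample = r"Visible \hiddentext{note to self} text."
stripped = strip_command(sample)
unwrapped = unwrap_command(sample)
```

The same two functions cover the common cases: removing the marked text entirely, or keeping it and discarding only the wrapper. Text with nested braces would need a proper brace-matching pass rather than a regular expression.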

Start Up Problem
Problem: When I ran Word-to-LaTeX, I obtained the following error:
Conversion started.
Fatal error: Call was rejected by callee.
   at Word.DocumentClass.Activate()
   at WordToLatex.WLConvertor.Convert()
   at WordToLatex.Bin.WLApplication.Main(String[] args)

Solution: I closed Word-to-LaTeX. I closed Word. I then pressed Ctrl+Alt+Delete and ended any WINWORD processes that were running. I then restarted Word-to-LaTeX. As an additional point it was sometimes necessary to close Word-to-LaTeX
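
Ending leftover WINWORD processes can also be done from a script rather than through Task Manager. The following is a rough sketch, not something I used at the time. It assumes Windows, where the standard taskkill command is available, and the function names are made up for illustration. Note that force-killing Word discards any unsaved work.

```python
import subprocess


def build_taskkill_command(image_name="WINWORD.EXE"):
    """Return the Windows command that force-ends every process with this image name."""
    # /F forces termination, /IM selects processes by image (executable) name.
    return ["taskkill", "/F", "/IM", image_name]


def kill_word_processes():
    """Run taskkill on Windows; True if at least one process was ended."""
    result = subprocess.run(
        build_taskkill_command(), capture_output=True, text=True
    )
    return result.returncode == 0

# On a Windows machine, call kill_word_processes() before restarting
# Word-to-LaTeX. taskkill returns a non-zero code when no Word process
# is running, so False here is not necessarily a problem.
```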

Conversion Problem
Problem: I obtained the following error.
Converting document fields.
Unknown error: Object reference not set to an instance of an object.
   at WordToLatex.WLProcessFields.FieldHyperlink(Field field)
   at WordToLatex.WLProcessFields.ProcessField(Field field)
   at WordToLatex.WLProcessFields.ProcessAllFields()
   at WordToLatex.WLConvertor.ConvertInner()
   at WordToLatex.WLConvertor.Convert()

Solution:
  • Divide and conquer: Dividing a long document into smaller parts to identify which parts could be processed was one strategy. If you do this, it may be good to put the files in separate folders, otherwise image files from one subdocument may be overwritten by a later subdocument.
  • Paste into Fresh Document: Another trick that worked for me was to copy and paste the contents of the document into a new document. I'm not sure why this worked. Perhaps it worked because it removed a number of custom styles I had.
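
The separate-folders advice above is easy to automate. A small sketch, assuming the document has already been split by hand into several *.doc files; the function name and the file names in the usage comment are hypothetical.

```python
import shutil
from pathlib import Path


def isolate_parts(part_files, work_dir):
    """Copy each sub-document into its own folder under work_dir.

    Each part gets a folder named after the file (without extension),
    so images generated for one part cannot overwrite another's.
    """
    work_dir = Path(work_dir)
    placed = []
    for part in map(Path, part_files):
        target_dir = work_dir / part.stem
        target_dir.mkdir(parents=True, exist_ok=True)
        placed.append(Path(shutil.copy2(part, target_dir / part.name)))
    return placed

# Example (hypothetical file names):
# isolate_parts(["part1.doc", "part2.doc"], "conversion")
# gives conversion/part1/part1.doc and conversion/part2/part2.doc
```

Each copied file can then be converted in place, and the resulting *.tex files merged afterwards.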

Problems importing EPS files
Problem: I let Word-to-LaTeX convert the images to EPS. I could view these images in an editor and they had indeed been converted. However, when I added them in LaTeX, only white space was shown.
Solution: For pictures derived from R, I just created them again, this time using the postscript driver. For the other pictures, one option was to open the image in Adobe Acrobat Professional and save it as EPS.
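
I never identified why the converted EPS files came out blank. One check that might be worth trying, purely as a guess on my part, is the %%BoundingBox comment in the EPS header, since LaTeX uses it to decide how much space to reserve and which region to show. The sketch below reads that comment and flags boxes that are missing or have no area. The function names are illustrative, and a sensible bounding box does not prove the file is fine.

```python
import re

# The EPS header normally carries a line such as "%%BoundingBox: 0 0 200 100"
# giving lower-left x, lower-left y, upper-right x, upper-right y.
BBOX = re.compile(
    r"^%%BoundingBox:\s*(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)", re.MULTILINE
)


def read_bounding_box(eps_text):
    """Return (llx, lly, urx, ury) as integers, or None if no numeric box is found."""
    match = BBOX.search(eps_text)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def looks_empty(bbox):
    """True when the box is missing or encloses no area."""
    if bbox is None:
        return True
    llx, lly, urx, ury = bbox
    return urx <= llx or ury <= lly


header = "%!PS-Adobe-3.0 EPSF-3.0\n%%BoundingBox: 0 0 200 100\n"
box = read_bounding_box(header)
```

To check a real file, read it with a forgiving encoding such as latin-1 (EPS files can contain binary data) and pass the text to read_bounding_box.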