WordPress to LaTeX with Pandoc and J: LaTeX Directories (Part 2)

WordPress to LaTeX

In this post I will describe the LaTeX directory structure that the J script TeXfrWpxml.ijs expects. To convert WordPress export XML to LaTeX with this script you will have to set up similar directories.

LaTeX documents are built from *.tex[1] files. This makes LaTeX more like a compiled programming language than a word processing program. There are advantages and disadvantages to the LaTeX way. In LaTeX’s favor, the system is enormously adaptable, versatile and powerful. There is very little that LaTeX/TeX and associates cannot do.  Unfortunately, “with great power comes great responsibility.” LaTeX is demanding! You have to study LaTeX like any other programming language. It’s not for everyone but for experienced users it’s the best way to produce documents with the highest typographic standards.

LaTeX directory structure

To use LaTeX efficiently it’s wise to pick a document directory structure and stick with it. I use a simple directory layout. Each document has a root directory. The root directory used by TeXfrWpxml.ijs is:

Windows: c:/pd/blog/wp2latex
Linux: /home/john/pd/blog/wp2latex

I put my document specific *.tex, *.bib, *.sty and other LaTeX/TeX files in the root. To handle graphics I create an immediate subdirectory called inclusions.

c:/pd/blog/wp2latex/inclusions

The inclusions directory holds the document’s *.png, *.jpg, *.pdf, *.eps and other graphics files.  To reference files in the inclusions directory with the standard LaTeX graphicx package insert

\usepackage{color,graphicx,subfigure,sidecap}
\graphicspath{{./inclusions/}}

in your preamble. Finally, to track document changes I create a Git repository in the root directory.

c:/pd/blog/wp2latex/.git
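
With \graphicspath set as above, graphics in the inclusions directory can be referenced by file name alone. Here is a minimal sketch; figure1.png is a hypothetical file sitting in inclusions:

% figure1.png lives in ./inclusions/ -- no path and no extension needed
\begin{figure}
  \centering
  \includegraphics[width=0.5\textwidth]{figure1}
  \caption{A hypothetical figure pulled from the inclusions directory.}
\end{figure}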

Self contained directories

I take care to keep my document directories self-contained. Zipping up the root and inclusions directories collects all of a document’s files. This means that I sometimes have to copy files that are used in more than one document. Many LaTeX users maintain a common directory for such files, but I’ve found that common directories complicate moving documents around. You’re always forgetting something in the damn common directory, or you’re copying a buttload of mostly irrelevant files from one big confusing common directory to another.

TeXfrWpxml.ijs files

The TeXfrWpxml.ijs script searches for these files in the root directory.

bm.tex Main LaTeX root file
bmamble.tex LaTeX preamble

bm.tex references bmtitlepage.tex.  I prefer a separate title page file; simply comment out this file if you create titles in other ways. The zip file wp2latex.zip contains a test directory in the format expected by TeXfrWpxml.ijs.  It also has a subset of my blog posts already converted to LaTeX. To get ready for WordPress to LaTeX with Pandoc and J: Using TeXfrWpxml.ijs (Part 3) download wp2latex.zip and attempt to compile bm.tex.  You might have to download a number of LaTeX packages.  Once you have successfully compiled bm.tex you are ready for the next step.
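
For orientation, here is a rough and purely hypothetical sketch of how a root file in this layout hangs together. The \input names match the files listed above; the document class and the converted post file are illustrative only, and the bm.tex inside wp2latex.zip remains the authoritative version.

% bm.tex -- hypothetical sketch of a root file in this layout
\documentclass[11pt]{article}
\input{bmamble}        % preamble file: packages, \graphicspath, macros
\begin{document}
\input{bmtitlepage}    % separate title page; comment out if unwanted
\input{somepost}       % a converted blog post (illustrative name)
\end{document}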


[1] LaTeX uses many other file types but key files are usually *.tex files.

WordPress to LaTeX with Pandoc and J: Prerequisites (Part 1)

There are no quick WordPress to LaTeX fixes

WordPress to LaTeX

Over the next three posts I will describe how to convert WordPress’s export XML to LaTeX source code.  I know that many of you are looking for a quick WordPress to LaTeX fix; unfortunately there are no quick fixes. The two formats come from different worlds and are used in different ways.  Producing useful LaTeX source from WordPress export XML will require manual edits.  My goal here is to minimize manual edits, produce high-quality LaTeX source, and outline what you will have to contend with. To get an idea of what you can expect, download the LaTeX-compiled version of this post.

Visual and Logical composition

WordPress and LaTeX are examples of the two basic approaches, visual and logical, taken by writing software.  Visual systems value appearance. It matters what things look like and no effort is spared to get the right look. Logical systems value content. What’s said is far more important than what it looks like. Logical systems impose order and structure and typically defer visual elements.  As you might expect there is no such thing as a pure visual or logical writing system. Successful systems use both approaches to a greater or lesser degree. Composing WordPress blog posts is roughly 35% visual and 65% logical.[1]  LaTeX composition is about 10% visual and 90% logical. The numbers do not line up; there is a basic mismatch here.

Many format X to LaTeX converters tackle this mismatch by attempting to maintain visual fidelity. This is a catastrophic error that renders the entire conversion useless.  Here’s a hint. If you’re using a predominantly logical system like LaTeX you don’t give a rodent’s posterior about visual fidelity. This method dispenses with all but the most basic of visual elements. No attempt is made to preserve fonts, type sizes, image scale, justification, hyphenation, text color and so forth.  The goal is to produce working LaTeX source that can be transformed to whatever final layout the author desires.

Prerequisite Software

I use two programs to transform WordPress export XML to LaTeX:  the J programming language and John MacFarlane’s Pandoc.  Pandoc is an excellent text mark-up to mark-up converter.  It wisely avoids attempting to convert entire complex documents and focuses on getting parts of documents right.  It does a particularly good job of converting HTML to LaTeX, which is a crucial part of this process.  I use Pandoc to transform the HTML embedded in WordPress export XML CDATA elements to *.tex files, and I use J to preprocess and postprocess Pandoc inputs and outputs and to stitch everything together into a set of LaTeX-ready files.

Download Pandoc from here. I use the Windows command line version. There are Linux and Mac versions as well. Download J from here.  The easiest J install is the 32 bit Windows J 6.02 version. Other versions require additional steps to configure and deploy. If you are already a J user there is no need to install a particular system but you will need:

  1. The task library require 'task'
  2. The utility program wget.exe

Both of these components are typically part of the J distribution.

Install and check prerequisites

To continue, download and install Pandoc and J and run the following tests; if you succeed your system is ready for WordPress to LaTeX with Pandoc and J: LaTeX Directories (Part 2).

Pandoc Test:

Download the test file: cdata.html and run Pandoc from the command line:

pandoc -o cdata.tex cdata.html

cdata.html is an example of the HTML code you find in WordPress export XML CDATA elements.  Note: required files are also available in the files sidebar in the WordPress to LaTeX directory.

J Test:

Start a J session and enter the following commands:

require 'task'

shell 'wget --help'

shell 'wget http://conceptcontrol.smugmug.com/photos/i-mNK4RHL/0/L/i-mNK4RHL-L.png'

If the task library is loaded and wget.exe is found, the first shell command displays wget’s help text. The second shell command downloads an image file.  Downloading post images is part of the overall conversion process.


[1] Actually this is not bad. Page layout systems are far worse. A typical layout system might be 90% visual and 10% logical making layout systems polar opposites of LaTeX.

Typesetting UTF8 APL code with the LaTeX lstlisting package

UTF8 APL characters within a LaTeX lstlisting environment. Click for *.tex source code

Typesetting APL source code has always been a pain in the ass! In the dark ages (the 1970s) you had to fiddle with APL type-balls and live without luxuries like lower case letters. With the advent of general outline fonts it became technically possible to render APL glyphs on standard display devices, provided you:

  1. Designed your own APL font.
  2. Mapped the atomic vector of your APL to whatever encoding your font demanded.
  3. Wrote WSFULL‘s of junk transliteration functions to dump your APL objects as font encoded text.

It’s a testament to either the talent or the pig-headedness of APL programmers that many actually did this. We all hated it! We still hate it! But, like an abused spouse, we kept going back for more.  It’s our fault; if we loved APL more it would stop hitting us!

When Unicode appeared APL’ers cheered — our long ASCII nightmare was ending. The more politically astute worked to include the APL characters in the Unicode standard. Hey if Klingon is there why not APL? Everyone thought it was just a matter of time until APL vendors abandoned their nonstandard atomic vectors and fully embraced Unicode. With a few notable exceptions we are still waiting. While we wait the problem of typesetting APL source code festers.

My preferred source code listing tool is the \LaTeX lstlisting package. lstlisting works well for standard ANSI source code.  I use it for J, C#, SQL, C, XML, Ocaml, Mathematica, F#, shell scripts and \LaTeX source code, i.e. everything except APL! lstlisting is an eight bit package; it will not handle arbitrary Unicode out of the box.  I didn’t know how to get around this so I handled APL by enclosing UTF8 APL text in plain \begin{verbatim} … \end{verbatim} environments. This works for XeLaTeX and LuaLaTeX but you lose all the lstlisting goodies. Then I saw an interesting tex.stackexchange.com posting about The ‘listings’ package and UTF-8. One solution to the post’s “French ligature problem” showed how to force Unicode down lstlisting‘s throat. I wondered if the same method would work for APL. It turns out that it does!

If you insert the following snippet of TeX code in your document preamble LuaLaTeX and XeLaTeX will properly process UTF8 APL text in lstlisting environments. You will need to download and install the APL385 Unicode font if it’s not on your system.  A test \LaTeX document illustrating this hack is available here. The compiled PDF is available here. As always these files can be accessed in the files sidebar.

% set lstlisting to accept UTF8 APL text
\makeatletter
\lst@InputCatcodes
\def\lst@DefEC{%
 \lst@CCECUse \lst@ProcessLetter
  ^^80^^81^^82^^83^^84^^85^^86^^87^^88^^89^^8a^^8b^^8c^^8d^^8e^^8f%
  ^^90^^91^^92^^93^^94^^95^^96^^97^^98^^99^^9a^^9b^^9c^^9d^^9e^^9f%
  ^^a0^^a1^^a2^^a3^^a4^^a5^^a6^^a7^^a8^^a9^^aa^^ab^^ac^^ad^^ae^^af%
  ^^b0^^b1^^b2^^b3^^b4^^b5^^b6^^b7^^b8^^b9^^ba^^bb^^bc^^bd^^be^^bf%
  ^^c0^^c1^^c2^^c3^^c4^^c5^^c6^^c7^^c8^^c9^^ca^^cb^^cc^^cd^^ce^^cf%
  ^^d0^^d1^^d2^^d3^^d4^^d5^^d6^^d7^^d8^^d9^^da^^db^^dc^^dd^^de^^df%
  ^^e0^^e1^^e2^^e3^^e4^^e5^^e6^^e7^^e8^^e9^^ea^^eb^^ec^^ed^^ee^^ef%
  ^^f0^^f1^^f2^^f3^^f4^^f5^^f6^^f7^^f8^^f9^^fa^^fb^^fc^^fd^^fe^^ff%
  ^^^^20ac^^^^0153^^^^0152%
  ^^^^20a7^^^^2190^^^^2191^^^^2192^^^^2193^^^^2206^^^^2207^^^^220a%
  ^^^^2218^^^^2228^^^^2229^^^^222a^^^^2235^^^^223c^^^^2260^^^^2261%
  ^^^^2262^^^^2264^^^^2265^^^^2282^^^^2283^^^^2296^^^^22a2^^^^22a3%
  ^^^^22a4^^^^22a5^^^^22c4^^^^2308^^^^230a^^^^2336^^^^2337^^^^2339%
  ^^^^233b^^^^233d^^^^233f^^^^2340^^^^2342^^^^2347^^^^2348^^^^2349%
  ^^^^234b^^^^234e^^^^2350^^^^2352^^^^2355^^^^2357^^^^2359^^^^235d%
  ^^^^235e^^^^235f^^^^2361^^^^2362^^^^2363^^^^2364^^^^2365^^^^2368%
  ^^^^236a^^^^236b^^^^236c^^^^2371^^^^2372^^^^2373^^^^2374^^^^2375%
  ^^^^2377^^^^2378^^^^237a^^^^2395^^^^25af^^^^25ca^^^^25cb%
  ^^00}
\lst@RestoreCatcodes
\makeatother
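
With the catcode hack above in the preamble, a matching listings setup might look like the following sketch. It assumes XeLaTeX or LuaLaTeX, the fontspec package and an installed copy of the APL385 Unicode font; the APL line is just sample text.

% preamble: load listings and the APL font
% (the catcode snippet above goes after \usepackage{listings})
\usepackage{fontspec}
\newfontfamily\aplfont{APL385 Unicode}
\usepackage{listings}
\lstset{basicstyle=\aplfont\small}

% document body: UTF8 APL in an ordinary lstlisting environment
\begin{lstlisting}
avg ← {(+/⍵)÷⍴⍵}   ⍝ mean of a numeric vector
\end{lstlisting}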

More on Kindle Oriented LaTeX

I’ve been compiling \LaTeX PDFs for the Kindle. If you like \LaTeX typefaces, especially mathematical fonts, you’ll love how they render on the Kindle. That’s a good thing, because you won’t like the Kindle’s cramped page dimensions. For simple flowable text this isn’t a big deal, but for complex \LaTeX documents it is!

There are two basic \LaTeX \Longrightarrow Kindle  workflows.

  1. Convert your \LaTeX to HTML and then convert the HTML to mobi.
  2. Compile your \LaTeX for Kindle page dimensions.

For simple math and figure free documents mobi is the best choice because it’s a native Kindle format. You will be able to re-flow text and change font sizes on the fly. There are many \LaTeX to HTML converters. This is a good summary of your options. You can also find a variety of HTML to mobi converters. I’ve used Auto Kindle; it’s slow but produces decent results.

Compiling \LaTeX for Kindle page dimensions is more work. First decide what works best for your document: landscape or portrait. Portrait is the Kindle default but I’ve found that landscape is better for math and figure rich documents. You can flip back and forth between landscape and portrait on the Kindle but it will not re-paginate PDFs. Of course with mobi this is no problemo!

After choosing a basic layout, expunge all hard-coded lengths from your source *.tex files. Replace all fixed lengths with relative page lengths. For example, 4in might become 0.75\textwidth. If you have hundreds of figures and images to adjust, write a little program to replace fixed lengths. I did this while preparing a Kindle version of Hilbert’s Foundations of Geometry.
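
As a before-and-after sketch (the file name here is hypothetical), a replacement looks like this:

% before: hard-coded width that overflows the narrow Kindle page
\includegraphics[width=4in]{figure07}

% after: width expressed relative to the current text width
\includegraphics[width=0.75\textwidth]{figure07}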

The next hurdle to overcome is the Kindle’s blasé attitude about length units. \LaTeX is extremely precise: an inch is an inch to six decimals. This is not the case on the Kindle! You will have to load your PDFs on the Kindle and inspect margins for text overflows. Be prepared for a few rounds of page dimension tweaking! For more details about preparing \LaTeX source check out LaTeX Options for Kindle.

Finally, after you have compiled your PDF and loaded it on your Kindle, there are some Kindle options you should set to optimize your PDF reading experience. My next post will walk you through setting these options.

The following *.tex file loads packages that are useful for Kindle sizing. It also shows how to print out \LaTeX dimensions with the printlen package.

% A simple test document that displays some packages and settings
% that are useful when compiling LaTeXe documents for the Kindle.
% Compile with pdflatex or xelatex.
%
% Tested on MikTeX 2.9
% July 22, 2011

\documentclass[12pt]{article}

% included graphics in immediate subdirectory
\usepackage{graphicx}
\graphicspath{{./image/}}

% extended coloring
\usepackage[usenames,dvipsnames]{color}

% hyperref link colors are chosen to display
% well on Kindle monochrome devices
\usepackage[colorlinks, linkcolor=OliveGreen, urlcolor=blue,
            pdfauthor={your name}, pdftitle={your title},
            pdfsubject={your subject},
            pdfcreator={MikTeX+LaTeXe with hyperref package},
            pdfkeywords={your,key,words},
            ]{hyperref}

\usepackage{breqn}         % automatic equation breaking
\usepackage{microtype}     % microtypography, reduces hyphenation

% kindle page geometry (no page numbers)
%\usepackage[papersize={3.6in,4.8in},hmargin=0.1in,vmargin={0.1in,0.1in}]{geometry}

% portrait kindle page geometry space reserved for page numbers
\usepackage[papersize={3.6in,4.8in},hmargin=0.1in,vmargin={0.1in,0.255in}]{geometry}

% landscape geometry
%\usepackage[papersize={4.8in,3.6in},hmargin={0.1in,0.18in},vmargin={0.1in,0.255in}]{geometry}

% headers and footers
\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhead{}            % clear page header
\fancyfoot{}            % clear page footer

\setlength{\abovecaptionskip}{2pt} % space above captions
\setlength{\belowcaptionskip}{0pt} % space below captions
\setlength{\textfloatsep}{2pt}     % space between last top float or first bottom float and the text
\setlength{\floatsep}{2pt}         % space left between floats
\setlength{\intextsep}{2pt}        % space left on top and bottom of an in-text float

% print LaTeX dimensions
\usepackage{printlen}

% reduces footer text separation adjusted for page numbers
\setlength{\footskip}{14pt}

% scales down page number font size if document is at 12pt -> page numbers 10 pt
\renewcommand*{\thepage}{\footnotesize\arabic{page}}

\begin{document}

The \verb|\textwidth| is \printlength{\textwidth} which is also
\uselengthunit{in}\printlength{\textwidth} and
\uselengthunit{mm}\printlength{\textwidth}.

\uselengthunit{pt}
The \verb|\textheight| is \printlength{\textheight} which is also
\uselengthunit{in}\printlength{\textheight} and
\uselengthunit{mm}\printlength{\textheight}.

\end{document}

Open Source Hilbert for the Kindle

David Hilbert

While searching for free Kindle books I found Project Gutenberg. Project Gutenberg offers free Kindle books, but they also have something better! Would you believe \LaTeX source code for some mathematical classics?

The best book I’ve found so far is an English translation of David Hilbert’s Foundations of Geometry. Hilbert’s Foundations exposed some flaws in the ancient treatment of Euclidean geometry and recast the subject with modern axioms. Because it is relatively easy to follow, compared to Hilbert’s more recondite publications, this little book exercised disproportionate influence on 20th century mathematics. We still see its style aped, but rarely matched, in mathematics texts today.

I couldn’t resist the temptation of compiling a mathematical classic so I eagerly downloaded the source and ran it through \LaTeX.  Foundations compiled without problems and generated a nice letter-sized PDF. Letter-size is fine but I was looking for free Kindle books! I decided to invest a little energy modifying the source to produce a Kindle version. Project Gutenberg makes it clear that we are free to modify the source. Isn’t open source wonderful!

Converting Foundations was simple. The main \LaTeX file included 52 *.png illustrations with hard-coded widths in \includegraphics commands. I wrote a J script that converted all these fixed widths to relative \textwidth‘s. This lets \LaTeX automatically resize images for arbitrary page geometries. When compiled with Kindle page dimensions this fixed most of the illustrations. I had to tweak a few wrapfig‘s to better typeset images surrounded by text. The result is a very readable Kindle oriented PDF version of Hilbert’s book. There are still a few problems. The Table of Contents is a plain tabular that does not wrap well and one table rolls off the right Kindle margin. Neither of these deficiencies seriously impairs the readability of the text.  If these defects annoy you, download the Project Gutenberg source with my modifications and build your own version.

This little experiment convinced me that providing free classic books, in source code form, is a service to mankind.  Not only does it allow you to “publish” classics on new media, it also fundamentally changes your attitude toward books. Hilbert was one of the great mathematical geniuses of the 19th and 20th centuries. It’s hard to suppress “we are not worthy” moments and maintain a sharp critical eye when reading his “printed” works.  You don’t get the same vibe when reading raw \LaTeX.  Source code puts you in an “it’s just another bug-infested program” frame of mind. You expect errors in code and you typically find them. This is exactly the hard-nosed attitude you need when reading mathematics.

Soon we will all be Software Archeologists

One of my pet peeves is the ridiculously short lifetimes of digital media.  I remember 9 track mainframe tapes and 5.25 inch floppies: technologies that thrived in an ancient bygone epoch known as the Eighties. Good luck trying to read 9 track tapes or 5.25 inch floppies today! You will have better luck with older paper punch cards. Punch card readers are hard to find these days but you can see the damn card holes with your own eyes! In fact you don’t even need eyes to read punch cards. I once knew a blind mainframe programmer who banged out massive FORTRAN programs by feeling the holes on punch cards. Try that with a USB flash drive.

Of course I appreciate that you can stuff the data from an entire filing cabinet of 5.25 inch floppies onto one modern USB flash drive, but I am disturbed by the fact that all those gigabytes will soon be more unreadable than cuneiform. I am not the first to worry about our distressed digital data. Kevin Kelly considers the word “storage” a dangerous misnomer and advocates the use of “movage” instead. You had better move your data from old to new formats or you will lose it!

Rosetta Ball

Movage is one of the reasons I have not jumped on the eReader bandwagon. Replacing myriagrams of books with one lightweight tablet is appealing but iPads and Kindles are not stable! High quality books have shelf lives measured in centuries.  With digital media you’re lucky to get through a decade.  It’s a good bet you won’t be able to read what’s on your eReaders in ten short years!  You poor dumb suckers will have to repurchase your library just like you repurchased your record and movie collections. It’s not in Amazon’s or Apple’s interest to worry too much about media durability. Fortunately some people do worry about media stability.  Check out The Long Now’s Rosetta project for what I consider a stable medium.

To belabor this point: while I was unpacking boxes of old-fashioned books (we recently moved again), I came across a notebook I put together for a poster I presented at the 1994 APL conference in Antwerp. My notebook contained a paper version, still eminently readable, and four 3.5 inch disks.  My oldest computer has a vestigial 3.5 inch disk drive so I tried copying these sixteen-year-old disks. Some of the disks were unreadable (surprise surprise), but I was able to recover a directory containing my poster’s source. Some of these files were old Microsoft Word documents. Word 2007 could not read them! Even when bits survive, changes in software can render them useless. Fortunately I loathed Word in 1994, a sentiment I still maintain, and wrote my poster in \LaTeX.

\LaTeX source is dull ASCII text. Civilization will collapse before we lose the ability to read it! Of course \LaTeX, like Word, has changed since 1994 so, just for the hell of it, I decided to compile this old document with MikTeX 2.9.  It didn’t compile; I was missing some old graphics macros and a key style file. It didn’t take me long to fix these problems. I replaced the graphics macros with standard \includegraphics{} commands and converted all the Windows *.bmp files to *.png files. Google even found the long-lost style file qqaaelba.sty in arxmliv. After making these trivial changes pdflatex.exe gobbled my poster source and moved Using FoxPro and DDE to Store J Words into the 21st century.

Resume blues partly alleviated by LaTeX

Once again your fearless correspondent is seeking new consulting opportunities.  One of the major drawbacks of consulting is the constant need to keep marketing yourself!  When it comes to self-promotion that old standby, the resume, is still one of your most effective tools.  When I communicate with potential clients their first question is: “Can you send me a resume?”

Resumes are a black art.  There are many, mostly bogus, theories about what constitutes a good resume, and an entire cottage industry has sprung up to support resume creation.  I am sure you have walked down the power resume aisle in your local big-box bookstore marveling at how people can write entire books on composing three-page resumes.  Maybe you have suffered through a corporate outplacement where well-dressed human-resource types earnestly criticize your use of bullets and personal pronouns.  Whenever people go on about resumes I always think of Monty Python’s theory of Brontosauruses.

Here’s the nasty truth: a resume is an advertisement!  Do you honestly think anyone would dare to propose a theory of advertisements? A good ad gets noticed and helps sell the product.  The same holds for resumes.

I have a simple resume style that has worked well.  The only complaints I get relate to file types.  Some clients want plain text, some want Word documents, others want PDFs and most don’t care!

Lately I revised the LaTeX version of my resume.  LaTeX is my preferred document format.  LaTeX source documents are simple text files. You can manipulate them with any text editor on any computer system.  Hence LaTeX documents cannot be held hostage by software vendors that encode your words in version-specific binary formats. If you have ever converted a Word document to an old or new format you will know of what I speak.  Because LaTeX files are simple text it’s easy to share LaTeX on the web.  My current resume borrowed from a number of authors.  When I borrow I try to give back.  The following links point to the LaTeX source of my resume and the final PDF output.  Help yourself, but be courteous and maintain the Creative Commons license block in the LaTeX code.