Cross References in Org Mode


Referring to things in Org

Introduction

This is a companion to my previous post on Org Mode & \(\LaTeX\) that I'm writing after getting serious about managing cross-references in one of my Org Mode documents. I'll start with some background on the problem; if you're already familiar with cross references in the preparation of mathematical or scientific documents, feel free to skip ahead to the section on what Org Mode does for you right out of the box. After that, I'll discuss John Kitchin's org-ref package, followed by a small extension to it that I wrote.

I should note that throughout, I'm going to concern myself with Org export to Latex & HTML, the two formats I use almost exclusively.

Background

Perhaps this is obvious to readers, but it actually took me a while to understand the problem that things like Org Mode names or org-ref were even solving. On the off chance that I'm not alone, here's a brief summary of the problem.

When preparing mathematical, scientific or engineering documents, whatever the format, one typically collects figures, tables and so forth. Conventionally, these are numbered, and one will often refer to them in the text (e.g. "as shown in Fig. 1…"). As soon as the document becomes non-trivial in size, maintaining this becomes real work. Imagine drafting a document with a few dozen figures, then deciding to insert a new figure between figures one & two: you now have to go back & re-number not only all the later figures, but all references thereto.

Authors also find themselves referring to entities in the document not just by number; they may want to say something like "as shown in the table on page…", referring to the page on which a given table appears. They may want to refer to multiple such objects (e.g. "as can be seen in figures 1 through 3"). They may want the entity caption to appear in the reference, rather than its number.

Document preparation systems have developed features for managing such references, generally referred to as cross references. The scheme is typically to have the author assign a mnemonic "tag" to each entity, and then refer to it using an assortment of commands that take as an argument that tag. Then, at publication time, the system figures out numbering, pagination & so forth and replaces the reference commands with the appropriate text (and links, depending on the publication target format).

As far as I know, the undisputed king of these systems is Latex; it offers multiple packages for managing and customizing cross references of various sorts. That said, Org Mode does provide some support for this by default.

What Org Mode Does Out of the Box

Org Mode provides a lot in terms of managing cross references without any extensions, but you'd never know it from the manual. There are the following three sentences in the section on Internal Links: "During export, internal links are used to mark objects and assign them a number. Marked objects are then referenced by links pointing to them. In particular, links without a description appear as the number assigned to the marked object [emphasis added]."

I almost glossed over it the first time I read it, but this actually describes a workable cross reference management system. For instance, if we want to have a figure participate in this system, we need to:

  1. insert it into the document in the form of an image file with a "bare" link; i.e. a link of the form:

    [[file:path-to-file]]
    
  2. name it and caption it; again, this is barely mentioned in the manual: the sentence "When targeting a ‘NAME’ keyword, the ‘CAPTION’ keyword is mandatory in order to get proper numbering" literally appears as a footnote [1]. I've only learned empirically that you have to put the NAME & CAPTION attributes directly above the link; I've found that putting other attributes in between can break this.

For example:

#+ATTR_LATEX: :placement [H]
#+NAME: fig-1
#+CAPTION: This is a test
[[file:../img/lambda-calculus-rewriting-eg-1.png]]

will render (in HTML) as:

[Image: lambda-calculus-rewriting-eg-1.png]

Figure 1: This is a test

Now, we can link to it as per usual as [[fig-1][description]], but a "bare" link such as [[fig-1]] will render as just the number "1", as in: Figure 1.

Similarly for tables & source listings. This

#+NAME: tab-1
#+CAPTION: Test table
|     | A | B | C | D |
|-----+---+---+---+---|
| no. | 1 | 2 | 3 | 4 |

will render as:

Table 1: Test table

       A  B  C  D
  no.  1  2  3  4

and again we can have Org export just the number using a bare link: Table 1.

Embedded Latex equations, while also participating, work a bit differently. Consider this fragment:

#+NAME: eq-1
\begin{equation}
  1+1=2
\end{equation}

It yields:

\begin{equation} \label{orgf8424a7} 1+1=2 \end{equation}

Note that, in the absence of a \nonumber command, Latex/MathJax will assign a number. We can also link to it as per usual (i.e. [[eq-1][description]]), but a bare link will yield the equation number in parentheses: \eqref{orgf8424a7}.

Impressively, if we use the Latex \tag command to assign a different number, or even a textual tag, Org mode respects that:

#+NAME: eq-2
\begin{equation}
  2+2=4 \tag{M}
\end{equation}

yields:

\begin{equation} \label{orgcc3295f} 2+2=4 \tag{M} \end{equation}

and a bare link gives: \eqref{orgcc3295f}.

Finally, special blocks also participate. This is useful because this is generally how we typeset theorems, lemmas & so forth. Let's step back: in order to indicate, say, lemmas, we can do the following:

#+LATEX_HEADER: \usepackage{amsthm}
#+LATEX_HEADER: \newtheorem{lemma}{Lemma}
#+HTML_HEAD: <style type="text/css">.lemma-label { font-weight: bold; } .lemma { font-style: italic; }</style>
# ...
#+BEGIN_EXPORT html
<span class="lemma-label">Lemma 1</span>
#+END_EXPORT
#+ATTR_LATEX: :environment lemma
#+NAME: lem-1
#+CAPTION: required, but not shown
#+BEGIN_lemma
Named lemma here.
#+END_lemma

This will produce:

Lemma 1

Named lemma here.

and similarly with theorems, corollaries, definitions & so on.

Regrettably, the custom HTML needs to be numbered manually, but beyond that, naming the block (and, again, taking care to place the NAME attribute directly above the special block) lets it take part in Org mode numbering as well: Lemma 1.
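
When a bare link stubbornly refuses to turn into a number, it can help to ask Org which elements it actually considers named and captioned. The following is a minimal sketch rather than anything taken from the manual or from org-ref; the function name my/org-named-elements is made up for illustration, and it relies only on org-element-parse-buffer and org-element-map:

```elisp
(require 'org)
(require 'org-element)

(defun my/org-named-elements ()
  "Return a list of (NAME TYPE HAS-CAPTION) for named elements.
Scans the current Org buffer.  NAME is the value of the #+NAME
keyword, TYPE is the element type as a symbol, and HAS-CAPTION is
t when a #+CAPTION keyword is attached to the same element."
  (org-element-map (org-element-parse-buffer) org-element-all-elements
    (lambda (el)
      (let ((name (org-element-property :name el)))
        ;; Returning nil for unnamed elements drops them from the result.
        (when name
          (list name
                (org-element-type el)
                (and (org-element-property :caption el) t)))))))
```

Evaluating (my/org-named-elements) with M-: in an Org buffer lists every target a bare link could point at; a figure or table whose last entry is nil has a NAME but no CAPTION attached to it.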

Document Structure

Section headers do not seem to participate in this facility, which is a small pity. One can label headings at levels one through three using the CUSTOM_ID property; they will be exported as h2 through h4 for HTML, and as sections, subsections & subsubsections in Latex. Beyond that, lower-level sections will be rendered as lists in both formats. Now, we can refer to them by their custom identifiers by prefixing the link with a '#', but they don't get numbers assigned to them.

Where are we?

Tables, figures, listings, equations, embedded latex & even special blocks do get auto-numbering in Org mode. That said, the system is still not as powerful as that of Latex: you just get the raw numbers, with no pagerefs or namerefs, no cleverefs or anything like that. Section headings don't play at all. So, in sum, the cross referencing situation in Org mode is not at all bad, but still: the experienced Latex author can be forgiven for seeking the comforts of home here in Org mode.

Enter Org Ref

John Kitchin's org-ref package aims to provide those comforts. In fact, it provides a great deal more than just cross reference management, but I'm going to stay focused on just that.

org-ref attempts to "lift" many of the cross reference features familiar to Latex authors to Org Mode. Out of the box, it introduces several new link types, with which we can refer to entities by

  - item number, as per usual: ref:fig-1 yields: 1
  - page number: Figure 1 is found on page 1 (the second link was produced by an Org link of the form pageref:fig-1)
  - equation: cf. equation (\eqref{orgf8424a7}) (produced by eqref:eq-1)
  - cleveref: table 1 vs. figure 1 (produced by cref:tab-1 & cref:fig-1)

among others. Each of these corresponds to a particular Latex reference command, which will be produced on export… to Latex. If you want to export to, say, HTML, you're out of luck. Per Kitchin, "org-ref was originally designed for writing scientific papers that would be converted to a PDF via LaTeX, where LaTeX would do all the cross-reference processing." [2]

Disappointing to those of us who export to other formats, but then I discovered the optional org-ref-refproc package, which extends this to other export formats, including HTML (and, indeed, was used to produce the references in the list above).

To enable HTML support, you need to add org-ref-refproc to the org-export-before-parsing-functions variable for the HTML export operation only. If you add it globally, it will stomp all over export to the Latex backend. I worked around this by saying:

(defun sp1ff/org//abort-if-latex (backend)
  "Return non-nil when BACKEND is `latex', so the advised function is skipped."
  (eq backend 'latex))

(advice-add 'org-ref-refproc :before-until #'sp1ff/org//abort-if-latex)

(add-to-list 'org-export-before-parsing-functions #'org-ref-refproc)
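
For anyone unfamiliar with the :before-until advice combinator: the advice runs first, and the advised function is only called if the advice returns nil. Here is a toy sketch of that behavior, with demo/process standing in for org-ref-refproc (both demo/ names are hypothetical and exist only for this illustration):

```elisp
(require 'nadvice)

(defun demo/process (backend)
  "Stand-in for an export preprocessor; report that it ran for BACKEND."
  (format "processed for %s" backend))

(defun demo/abort-if-latex (backend)
  "Return non-nil when BACKEND is `latex'."
  (eq backend 'latex))

;; With :before-until, a non-nil return from the advice becomes the
;; result and the original function is never called.
(advice-add 'demo/process :before-until #'demo/abort-if-latex)

(demo/process 'html)   ; → "processed for html"
(demo/process 'latex)  ; → t, and demo/process itself never ran
```

This is why the processor can sit in org-export-before-parsing-functions globally: the advice turns it into a no-op for the one backend where it would get in the way.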

Alright, at this point, by adding org-ref to Org mode, we have a cross-reference management system on par with Latex. Latex being Latex, I'm sure there's some feature we're missing, but things are looking good.

Indices

org-ref not only manages cross references, but also covers the Latex indexing feature. With org-ref loaded, you can create an index entry anywhere in your document by saying index:tag:

Here is the definition of flapdoodle index:flapdoodle. Here is more prose...

Then, at the end of your document, just say [[printindex:]] to actually generate the index. This, too, was originally intended to be used only with export to Latex. Like org-ref-refproc, an optional export filter, org-ref-idxproc, is offered, but it will short-circuit Latex export if it's installed globally. I played the same game here that I did for org-ref-refproc:

(advice-add 'org-ref-idxproc :before-until #'sp1ff/org//abort-if-latex)

(add-to-list 'org-export-before-parsing-functions #'org-ref-idxproc)

Awesome: now we can get an index in both Latex & HTML export. And yet… I've always been a fan of Texinfo's support for multiple indices (concepts, variables, functions, even user-defined indices). Wouldn't it be nice, I thought, if I could create multiple indices in my Org documents, and have them show up in both Latex and HTML documents?

Adding Multiple Indices

This led to my shooting a perfectly nice summer afternoon hacking in Emacs Lisp. The first step was identifying the Latex package to use for generating multiple indices. I chose the imakeidx package, largely on the basis of discussion here & here (especially the compatibility with AMS document classes).

My solution is highly derivative of org-ref. I began by introducing a new link type: "mindex". The problem is that we now need to encode two pieces of information in the link: the tag itself, as well as which index should contain this particular link. Let's put them both in the link path & separate them with a pipe character (index first, then tag). Give ourselves a utility function org-ref-multi-index--split that will split that path into a cons cell of index and tag, and say:

(org-link-set-parameters
 "mindex"
 :follow
 (lambda (path)
   (occur (cdr (org-ref-multi-index--split path))))
 :export
 (lambda (path _desc format)
   (cond
    ((eq format 'latex)
     (let* ((split (org-ref-multi-index--split path))
            (index (car split))
            (tag (cdr split)))
       (if (eq 'default index)
           (format "\\index{%s}" tag)
         (format "\\index[%s]{%s}" index tag)))))))

Alright: at this point, assuming the document author includes:

#+LATEX_HEADER: \usepackage{imakeidx}
#+LATEX_HEADER: \makeindex[title=Concept Index] % This is the default index
#+LATEX_HEADER: \makeindex[name=fn,title=Function Index]

in their Org document, they can write mindex:foo to make an entry in the default index, or mindex:fn|f to create one in the function index, and on export to Latex the appropriate \index commands will be produced.
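
I haven't reproduced org-ref-multi-index--split here. A function with the behavior the export code above relies on (a cons of index and tag, with the symbol default standing for "no index given") could look roughly like the following. Treat it as a sketch under a different, made-up name, my/mindex-split, and not necessarily what the package does:

```elisp
(defun my/mindex-split (path)
  "Split PATH of the form \"INDEX|TAG\" into the cons cell (INDEX . TAG).
When PATH contains no pipe character, the whole of PATH is the
tag and INDEX is the symbol `default'."
  (let ((pos (string-match-p (regexp-quote "|") path)))
    (if pos
        (cons (substring path 0 pos)
              (substring path (1+ pos)))
      (cons 'default path))))

(my/mindex-split "foo")   ; → (default . "foo")
(my/mindex-split "fn|f")  ; → ("fn" . "f")
```

Returning a symbol for the default case and a string otherwise is what lets the export function test for the default index with eq.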

Now for the actual production of the indices: org-ref handles it as follows:

(org-link-set-parameters "printindex"
                         :follow #'org-ref-index
                         :export (lambda (_path _desc format)
                                   (cond
                                    ((eq format 'latex)
                                     (format "\\printindex")))))

We see that org-ref provides a very nice feature: beyond merely producing the proper Latex command on export, it can produce a "live" index on demand in Emacs when one follows the link.

I replicated org-ref-index, mutatis mutandis, and said:

(org-link-set-parameters
 "printmindex"
 :follow #'org-ref-multi-index-show-index
 :export
 (lambda (path _desc format)
   (cond
    ((eq format 'latex)
     (if (eq 0 (length path))
         (format "\\printindex")
       (format "\\printindex[%s]" path))))))

At this point, we've essentially copied org-ref's index support, with a few changes extending it to multiple indices. I shot the best part of the afternoon replicating the functionality of org-ref-refproc in org-ref-multi-index-proc. This is the function run just before the Org export backend gets its hands on our document, adding support for export formats other than Latex.

It walks through the document, replacing each "mindex" link with a radio target, and building up each index in memory. When it encounters a "printmindex" link, it will update the export buffer (which is still in Org Mode), replacing that link with the text of the index itself (the entries link back to the above-mentioned radio targets).
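
To give a flavor of the "building up each index in memory" step, here is a much-reduced sketch that only collects the entries and does none of the buffer rewriting. It is not the code from org-ref-multi-index; the name my/collect-mindex-entries is hypothetical, and for simplicity it files pipe-less links under the string "default" and registers a bare "mindex" link type with no handlers so that the parser recognizes it:

```elisp
(require 'org)
(require 'org-element)

;; Register the link type so the Org parser recognizes it
;; (no :follow or :export handlers are needed for this sketch).
(org-link-set-parameters "mindex")

(defun my/collect-mindex-entries ()
  "Return an alist of (INDEX TAG...) for mindex links in the current buffer.
Links whose path has no pipe go into the index named \"default\"."
  (let (indices)
    (org-element-map (org-element-parse-buffer) 'link
      (lambda (link)
        (when (equal "mindex" (org-element-property :type link))
          (let* ((parts (split-string (org-element-property :path link) "|"))
                 (index (if (cdr parts) (car parts) "default"))
                 (tag (if (cdr parts) (cadr parts) (car parts)))
                 (cell (assoc index indices)))
            (if cell
                ;; Index already seen: append the tag in document order.
                (setcdr cell (append (cdr cell) (list tag)))
              (push (list index tag) indices))))))
    (nreverse indices)))
```

In a buffer containing the bracket links [[mindex:foo]], [[mindex:fn|car]] and [[mindex:fn|cdr]], this should yield one entry per index, each holding its tags in document order; the real processor additionally has to emit the radio targets and the index text.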

My implementation short-circuits on export to Latex, so we need only do:

(add-to-list 'org-export-before-parsing-functions #'org-ref-multi-index-proc)

I've wrapped this up in a nascent package called org-ref-multi-index. It's first code, and certainly not on MELPA, so you'll have to install it manually if you want to use it.

One side-note: while writing this code, I found that org-ref uses thing-at-point in the formation of its index entries. Specifically, it invokes (thing-at-point 'sentence). If, like me, you follow your sentences with a single space when writing prose, this will return the entire paragraph, unless you've said (setq sentence-end-double-space nil).
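
The effect is easy to reproduce outside of org-ref. The sketch below (my/first-sentence is a made-up helper, and the sample text is arbitrary) asks thing-at-point for the sentence at the start of a single-spaced paragraph under both settings:

```elisp
(require 'thingatpt)

(defun my/first-sentence (double-space)
  "Return the sentence at the start of a single-spaced sample paragraph.
DOUBLE-SPACE is the value bound to `sentence-end-double-space'
while the sentence is looked up."
  (with-temp-buffer
    (insert "Define flapdoodle here. Here is more prose. And a third sentence.")
    (goto-char (point-min))
    (let ((sentence-end-double-space double-space))
      (thing-at-point 'sentence t))))

(my/first-sentence t)    ; the whole paragraph
(my/first-sentence nil)  ; only "Define flapdoodle here."
```

With the default of t, a period followed by a single space does not count as a sentence end, so the "sentence" runs to the end of the paragraph.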

Conclusions

We've seen that while poorly documented, Org Mode provides quite a bit of functionality for managing cross-references "out of the box". Extend it with a few packages, and we're at or near Latex levels of functionality.

For reference, here's my Emacs configuration for all of this:

(use-package org-ref
  :ensure t
  :config
  (add-to-list
   'org-ref-refproc-clever-prefixes
   '(lemma :full "definition" :abbrv "def." :org-element special-block))
  (add-to-list
   'org-ref-refproc-clever-prefixes
   '(align :full "align" :abbrv "eq." :org-element latex-environment))
  (add-to-list 'org-export-before-parsing-functions #'org-ref-refproc)
  (add-to-list 'org-export-before-parsing-functions #'org-ref-idxproc))

(defun sp1ff/org//abort-if-latex (backend)
  (eq backend 'latex))

(advice-add 'org-ref-refproc :before-until #'sp1ff/org//abort-if-latex)
(advice-add 'org-ref-idxproc :before-until #'sp1ff/org//abort-if-latex)

(setq org-latex-pdf-process
      '("%latex -shell-escape -interaction nonstopmode -output-directory %o %f"
        "makeindex %b"
        "%latex -shell-escape -interaction nonstopmode -output-directory %o %f"
        "%latex -shell-escape -interaction nonstopmode -output-directory %o %f"))

(require 'org-ref-multi-index)
(add-to-list 'org-export-before-parsing-functions #'org-ref-multi-index-proc)

06/12/24 15:35