Docutils: Architecture, Extending, and Embedding

David Goodger & Lea Wiemann


We will describe the architecture of Docutils, how to add functionality to Docutils, and how to use Docutils in your own applications. Not necessarily in that order.

What is Docutils?

What is reStructuredText?

Status

Docutils 0.4 released January 9.

Existing Uses

What’s Missing?

Major features:

Please come to the Docutils Sprint and help out!

Component Architecture

components.png

In the component diagram, thick solid lines denote the transfer of standard document tree data. The double line between Reader and Transformer denotes a possibly non-standard document tree.

Data Flow (1)

components-small.png

Docutils components are selected at run time by the client application or front end.

  1. The Publisher calls the Reader.

    The Reader understands the context of the input. For example, the PEP Reader knows that PEPs begin with an RFC-822-style header, that a table of contents should be added after the header, and that all hyperlinks should be collected near the end of the document.

    Typical text files use the Standalone Reader. To extract docstrings & comments from Python source code, you’d use the Python Source Reader (under active development). To reprocess an existing document tree, use the doctree Reader.

  2. The Reader calls an Input object to gather text data.

    The Input classes provide a uniform interface for reading from arbitrary low-level input sources, such as files, strings, and even pre-parsed document trees. Input objects handle the decoding of input text to Unicode. Internally, Docutils uses Unicode exclusively.

  3. The Reader calls the Parser, passing the input text.

    There are currently two parsers installed in Docutils: the reStructuredText Parser, and the "Null" parser (used for reprocessing existing document trees, in conjunction with the doctree Reader and Input class). The parser generates a document tree, a tree of element and Text nodes, and returns it to the Reader.

  4. The Reader returns the doctree(s) to the Publisher.

Data Flow (2)

components-small.png
  1. The Publisher runs the Transformer.

    The Transformer applies various Transforms to the document tree, in a pre-determined order. Transforms modify the document tree in-place: resolving references, numbering sections, creating tables of contents, and performing other functions on the entire document or parts of the document.

  2. The Transformer returns the doctree to the Publisher.

    At this point, the doctree is standard, no matter what Parser was used or Reader context was in place.

  3. The Publisher calls the Writer.

    The Writer translates the document tree to a format like HTML or LaTeX.

  4. The Writer sends the result to an Output object.

    As with Input, the Output object provides a uniform interface for writing to arbitrary low-level destinations, such as files and strings. Output objects also handle text encoding.

The Publisher directly calls only the Reader, the Transformer, and the Writer. However, it manages all objects (Input, Output, Reader, Parser, Transformer, Transform, and Writer instances) and passes them where they are needed. For example, the Input and Parser objects are passed to the Reader.

All of this complexity is encapsulated in the Publisher convenience functions; more on these later.
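To make the two data-flow slides concrete, here is a toy model of the call sequence in plain Python. Every class and function in it is a simplified stand-in written for illustration and is not the Docutils API; the real components carry settings, error reporting, and much more. It only shows who calls whom: the Publisher calls the Reader, the Transformer, and the Writer, and passes the Input, Parser, and Output objects to where they are needed.

```python
# Toy model of the Docutils data flow. All names here are hypothetical
# stand-ins for illustration; they are NOT the real Docutils classes.


class StringInput:
    """Uniform read() interface; decodes bytes to text."""

    def __init__(self, source, encoding="utf-8"):
        self.source = source
        self.encoding = encoding

    def read(self):
        if isinstance(self.source, bytes):
            return self.source.decode(self.encoding)
        return self.source


class StringOutput:
    """Uniform write() interface; encodes text on the way out."""

    def __init__(self, encoding="utf-8"):
        self.encoding = encoding

    def write(self, text):
        return text.encode(self.encoding)


class ParagraphParser:
    """Builds a tiny tree: one paragraph node per blank-line-separated block."""

    def parse(self, text):
        blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
        children = [{"type": "paragraph", "text": b} for b in blocks]
        return {"type": "document", "children": children}


class Reader:
    """Gathers text from the Input object and passes it to the Parser."""

    def read(self, source, parser):
        return parser.parse(source.read())


def number_paragraphs(tree):
    """A transform: modifies the tree in place."""
    for number, node in enumerate(tree["children"], start=1):
        node["number"] = number


class Transformer:
    """Applies transforms to the tree in a fixed order."""

    def __init__(self, transforms):
        self.transforms = transforms

    def apply(self, tree):
        for transform in self.transforms:
            transform(tree)


class HtmlWriter:
    """Translates the tree and sends the result to the Output object."""

    def write(self, tree, destination):
        lines = ['<p id="p{}">{}</p>'.format(node["number"], node["text"])
                 for node in tree["children"]]
        return destination.write("\n".join(lines))


class Publisher:
    """Manages all objects; directly calls only Reader, Transformer, Writer."""

    def __init__(self, reader, parser, transformer, writer,
                 source, destination):
        self.reader = reader
        self.parser = parser
        self.transformer = transformer
        self.writer = writer
        self.source = source
        self.destination = destination

    def publish(self):
        tree = self.reader.read(self.source, self.parser)
        self.transformer.apply(tree)
        return self.writer.write(tree, self.destination)


publisher = Publisher(Reader(), ParagraphParser(),
                      Transformer([number_paragraphs]), HtmlWriter(),
                      StringInput(b"First.\n\nSecond."), StringOutput())
result = publisher.publish()
print(result.decode("utf-8"))
```

Swapping in a different parser or writer only changes the arguments passed to the Publisher, which is the point of selecting components at run time.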

Document Tree

Sample input text:

"""
I like the Python_ language.

.. _Python: http://www.python.org/
"""

Resulting doctree:

<document source="doctree-demo.txt">
    <paragraph>
        I like the
        <reference
         refuri="http://www.python.org/">
            Python
         language.

The document tree data structure is similar to a DOM tree, but with specific node names (classes) instead of DOM’s generic nodes. The schema is documented in an XML Document Type Definition (DTD), which comes in two parts:

The DTD defines a rich set of elements, suitable for many input and output formats. The DTD retains all information necessary to reconstruct the original input text, or a reasonable facsimile thereof.

The document tree holds the components of Docutils together. The document tree is the unifying intermediate data structure used internally throughout Docutils, first created by the Parser and translated by the Writer. The ``docutils.nodes`` module is a class library implementing the nodes of the document tree.
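As a rough illustration of ``docutils.nodes`` as a class library, the sketch below builds the sample doctree by hand instead of parsing it. It assumes Docutils is installed; the choice of ``new_document`` from ``docutils.utils`` to create the root node is ours, not something shown in the slides.

```python
from docutils import nodes
from docutils.utils import new_document

# Create an empty root node; the argument becomes the "source" attribute.
document = new_document("doctree-demo.txt")

# Build the paragraph from Text nodes and a reference element.
paragraph = nodes.paragraph()
paragraph += nodes.Text("I like the ")
paragraph += nodes.reference("", "Python", refuri="http://www.python.org/")
paragraph += nodes.Text(" language.")
document += paragraph

# pformat() renders the tree in the indented pseudo-XML form shown above.
print(document.pformat())
```

Each node class corresponds to an element in the DTD, so the tree can be assembled or inspected with ordinary Python operations.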

Docutils as a Library (1)

How to use Docutils from your own application.

Convenience functions, from docutils.core:

Docutils as a Library (2)

Docutils as a Library (3)

Nabu uses the publish_doctree and publish_from_doctree functions.
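A minimal sketch of that two-step style, assuming Docutils is installed: ``publish_doctree`` parses the input and returns the document tree, and ``publish_from_doctree`` later renders that tree with a Writer. The sample text is the one from the Document Tree slide; what an application does between the two calls is up to it, and the ``output_encoding`` override is just our choice to get a text string back.

```python
from docutils.core import publish_doctree, publish_from_doctree

source = """\
I like the Python_ language.

.. _Python: http://www.python.org/
"""

# Step 1: parse and transform the input into a document tree.
doctree = publish_doctree(source)

# An application could inspect, store, or modify the tree at this point.
print(doctree.pformat())

# Step 2: render the existing tree with the HTML Writer.
# "unicode" asks for a str instead of encoded bytes.
html = publish_from_doctree(
    doctree,
    writer_name="html",
    settings_overrides={"output_encoding": "unicode"},
)
print(html)
```

Splitting the work this way lets the parsing cost be paid once while the rendering happens later or elsewhere.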

Extending Docutils

Docutils is completely modular. New components of all types can be added:

Test-First Development

The Test Suite

  • based on unittest.py, but with significant additions

  • data-driven

  • test modules & test packages

    • test_*.py

    • test_*/

      (requires an __init__.py module; a real package!)

  • 1000 tests!

(DG) I first learned unit testing when I began Docutils. There is absolutely no way I could have developed Docutils without unit testing.
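To show what "data-driven" can mean in practice, here is a small sketch using only the standard ``unittest`` module. It is not the Docutils test framework, which has its own additions; the function under test (``slugify``) and the case table are hypothetical. The idea is that each test case is a row of data, so adding a case means adding a row rather than writing a new method.

```python
import unittest


def slugify(title):
    """Hypothetical function under test: turn a title into an ID."""
    return "-".join(title.lower().split())


# Each row is (input, expected output).
CASES = [
    ("Hello World", "hello-world"),
    ("  Data   Flow ", "data-flow"),
    ("Docutils", "docutils"),
]


class SlugifyTests(unittest.TestCase):
    def test_cases(self):
        for text, expected in CASES:
            # subTest reports each failing row separately.
            with self.subTest(text=text):
                self.assertEqual(slugify(text), expected)


def run_suite():
    """Load and run the test case, returning the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


outcome = run_suite()
```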

Extending reST

reStructuredText has three extension mechanisms:

Language Example

German input text (“bild” is German for “image”):

"""
.. bild:: test.png
"""

Process with this command line:

rst2html.py --language de test.txt test.html
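The same conversion can be sketched from Python, assuming Docutils is installed. The ``language_code`` setting is the one behind the ``--language`` option; reading from a string instead of ``test.txt`` and requesting a text string via ``output_encoding`` are our choices for the example.

```python
from docutils.core import publish_string

# Same input as above: the German name for the "image" directive.
source = ".. bild:: test.png\n"

html = publish_string(
    source,
    writer_name="html",
    settings_overrides={
        "language_code": "de",         # equivalent of --language de
        "output_encoding": "unicode",  # return str instead of bytes
    },
)
print(html)
```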

Write a Transform
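As a rough idea of what a Transform looks like, here is a minimal sketch, assuming Docutils is installed. It is not taken from the talk: the class ``AppendWordCount`` and its priority value are hypothetical. A Transform subclasses ``docutils.transforms.Transform`` and implements ``apply()``, which modifies the document tree in place. Components normally list their transforms via ``get_transforms()`` so the Transformer can run them in priority order; for brevity the sketch applies the transform directly.

```python
from docutils import nodes
from docutils.core import publish_doctree
from docutils.transforms import Transform


class AppendWordCount(Transform):
    """Hypothetical transform: append a paragraph with the word count."""

    # Priority controls ordering relative to other transforms.
    # The value here is an arbitrary choice for the example.
    default_priority = 900

    def apply(self, **kwargs):
        count = len(self.document.astext().split())
        note = nodes.paragraph("", "Word count: {}".format(count))
        # Modify the tree in place by appending to the root node.
        self.document += note


doctree = publish_doctree("I like the Python language.\n")

# Apply the transform by hand instead of going through the Transformer.
AppendWordCount(doctree).apply()
print(doctree.pformat())
```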

Sprint!

Join the Docutils sprint!

We will both be here for all 4 sprint days.

And that’s just the beginning!


docutils-users@lists.sourceforge.net

docutils-develop@lists.sourceforge.net

Did we mention the sprint?

Thanks for listening!

Questions? We’ve got answers!