Syntax Highlight

Author: Günter Milde
Contact: milde@users.berlios.de
Date: 2009-05-06
Copyright: © 2007, 2009 G. Milde, Released without warranties or conditions of any kind under the terms of the Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0

Abstract

Proposal to add syntax highlight of code blocks to the capabilities of Docutils.


Syntax highlighting significantly enhances the readability of code. However, in the current version, docutils does not highlight literal blocks.

This sandbox project aims to add syntax highlight of code blocks to the capabilities of docutils. To find its way into the docutils core, it should meet the requirements laid out in a mail on Questions about writing programming manuals and scientific documents, by docutils main developer David Goodger:

I'd be happy to include Python source colouring support, and other languages would be welcome too. A multi-language solution would be useful, of course. My issue is providing support for all output formats -- HTML and LaTeX and XML and anything in the future -- simultaneously. Just HTML isn't good enough. Until there is a generic-output solution, this will be something users will have to put together themselves.

1   State of the art

There are already docutils extensions providing syntax colouring, e.g.:

SilverCity,
a C++ library and Python extension that can provide lexical analysis for over 20 different programming languages. A recipe for a "code-block" directive provides syntax highlight by SilverCity.
listings,

a LaTeX package providing highly customisable and advanced syntax highlight, though only for LaTeX (and LaTeX derived PS|PDF).

Since Docutils 0.5, the "latex2e" writer supports syntax highlight of literal blocks by listings with the --literal-block-env=lstlisting option. This requires a custom style sheet; the stylesheets repository provides two LaTeX style sheets for highlighting literal blocks with "listings".

Trac
has reStructuredText support and offers syntax highlighting with a "code-block" directive using GNU Enscript, SilverCity, or Pygments.
rest2web,
the "site builder", provides the colorize macro (using the Moin-Moin Python colorizer).
Sphinx

features automatic highlighting using the Pygments highlighter. It introduces the custom directives

code-block: similar to the proposal below,
sourcecode: an alias to "code-block", and
highlight: configure highlight of "literal blocks".

(see http://sphinx.pocoo.org/markup/code.html).

Pygments

is a generic syntax highlighter written completely in Python.

  • Usable as a command-line tool and as a Python package.
  • A wide range of common languages and markup formats is supported.
  • Additionally, OpenOffice's *.odt is supported by the odtwriter.
  • The layout is configurable by style sheets.
  • Several built-in styles and an option for line-numbering.
  • Built-in output formats include HTML, LaTeX, and RTF.
  • Support for new languages, formats, and styles is added easily (modular structure, Python code, existing documentation).
  • Well documented and actively maintained.
  • The web site provides a recipe for using Pygments in ReST documents. It is used in the Pygments enhanced docutils front-ends below.
Odtwriter,
an experimental writer for Docutils OpenOffice export, supports syntax colours using Pygments. (See the (outdated) section Odtwriter syntax.)
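
The use of Pygments as a Python package, listed among its features above, can be sketched as follows. This is a minimal illustration, not part of the proposal: the function name render is hypothetical, and the code assumes Pygments is installed (if it is not, the code is returned unchanged).

```python
try:
    from pygments import highlight
    from pygments.formatters import get_formatter_by_name
    from pygments.lexers import get_lexer_by_name
    HAVE_PYGMENTS = True
except ImportError:  # Pygments is an optional dependency in this sketch
    HAVE_PYGMENTS = False


def render(code, language="python", output_format="html"):
    """Return `code` highlighted by one of Pygments' built-in formatters.

    `output_format` is a Pygments formatter alias such as "html",
    "latex" or "rtf". Without Pygments, the code is returned as-is.
    """
    if not HAVE_PYGMENTS:
        return code
    lexer = get_lexer_by_name(language)
    formatter = get_formatter_by_name(output_format)
    return highlight(code, lexer, formatter)


if __name__ == "__main__":
    print(render('print("hello")\n', "python", "latex"))
```

Each formatter produces markup for exactly one output format, which is why this route alone does not give the generic-output solution asked for in the introduction.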

1.1   Summary

On 2009-02-20, David Goodger wrote on docutils-devel:

I'd like to see the extensions implemented in Bruce and Sphinx etc. folded back into core Docutils eventually. Otherwise we'll end up with incompatible systems.

Pygments seems to be the most promising Docutils highlighter.

For printed output and PDFs via LaTeX, the listings package is a viable alternative.

2   Pygments enhanced docutils front-ends

Syntax highlight can be achieved by front-end scripts combining docutils and pygments.

"something users [will have to] put together themselves"
Advantages:
  • Easy implementation with no changes to the stock docutils.
  • Separation of code blocks and ordinary literal blocks.
Disadvantages:
  1. "code-block" content is formatted by pygments and inserted in the document tree as a "raw" node, making the approach writer-dependent.
  2. documents are incompatible with the standard docutils because of the locally defined directive.
  3. more "invasive" markup distracting from content (no "minimal" code block marker -- three additional lines per code block)

Points 1 and 2 lead to the code-block directive proposal.

Point 3 becomes an issue in literate programming where a code block is the most used block markup. It is addressed in the proposal for a configurable literal block directive.
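
The front-end approach of this section, and the writer dependency named in point 1, can be sketched as follows. This is a minimal sketch in the spirit of the Pygments recipe, not the actual front-end script; it assumes Docutils and Pygments are installed, and the names RawHighlightDirective and convert are hypothetical.

```python
from docutils import nodes
from docutils.core import publish_string
from docutils.parsers.rst import Directive, directives
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import TextLexer, get_lexer_by_name


class RawHighlightDirective(Directive):
    """Format the content with Pygments and insert it as a "raw" node."""

    required_arguments = 1  # the language
    has_content = True

    def run(self):
        self.assert_has_content()
        try:
            lexer = get_lexer_by_name(self.arguments[0])
        except ValueError:  # unknown language: fall back to plain text
            lexer = TextLexer()
        html = highlight("\n".join(self.content), lexer, HtmlFormatter())
        # The node carries ready-made HTML, so only HTML writers can use it.
        return [nodes.raw("", html, format="html")]


# Local registration: the document now depends on this front end.
directives.register_directive("code-block", RawHighlightDirective)


def convert(source, writer_name):
    """Convert reStructuredText `source` with the named writer."""
    return publish_string(
        source,
        writer_name=writer_name,
        settings_overrides={"output_encoding": "unicode"},
    )
```

With the "html" writer the highlighted code appears in the output; with the "latex" writer the raw HTML node is skipped and the code block is lost entirely.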

3   code-block directive proposal

3.1   Syntax

Note

This is the first draft for a reStructuredText definition, analogous to other directives in directives.txt.

Directive Type: "code-block"
Doctree Element: literal_block
Directive Arguments: One (language) or more (class names), optional.
Directive Options: None.
Directive Content: Becomes the body of the literal block.

The "code-block" directive constructs a literal block where the content is parsed as source code and syntax highlight rules for language are applied. If syntax rules for language are not known to Docutils, it is rendered like an ordinary literal block.

Example:

A bit of Python code

.. code-block:: python

 def my_function():
     "just a test"
     print(8/2)

The directive content will be parsed and marked up as Python source code. The actual rendering depends on the style-sheet.

Remarks:

  • If the language argument is missing, a (configurable) default language should be used.

  • Additional arguments might be defined and passed to the pygments parser or the output document (as class arguments), e.g.

    number-lines: let pygments include line numbers
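
The argument handling described in these remarks could look roughly like this. It is only a sketch under stated assumptions: the function name parse_code_block_arguments is hypothetical, "python" is an arbitrary choice for the configurable default, and the first argument (when present) is assumed to always name the language.

```python
def parse_code_block_arguments(arguments, default_language="python"):
    """Split directive arguments into (language, class_names, number_lines).

    Assumption: when any argument is given, the first one is the language.
    All further arguments are class names, except the special
    "number-lines" argument, which is turned into a flag.
    """
    remaining = list(arguments)
    language = remaining.pop(0) if remaining else default_language
    number_lines = "number-lines" in remaining
    class_names = [name for name in remaining if name != "number-lines"]
    return language, class_names, number_lines
```

A call such as parse_code_block_arguments(["c", "number-lines", "small"]) yields the language "c", the class name "small", and the line-numbering flag set.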

3.2   Include directive option

The include directive should get a matching new option:

code: language
The entire included text is inserted into the document as if it were the content of a code-block directive (useful for program listings).
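
The intended equivalence can be illustrated by a small helper that turns included text into the source of a "code-block" directive. This is a sketch only: the function name include_as_code_block is hypothetical, reading the file is left to the caller, and a real implementation would build the node directly instead of generating reStructuredText.

```python
import textwrap


def include_as_code_block(text, language):
    """Return reStructuredText that wraps `text` in a code-block directive."""
    # Directive content must be indented relative to the directive marker;
    # blank lines inside the content are left without trailing whitespace.
    body = textwrap.indent(text.rstrip("\n"), "   ")
    return ".. code-block:: {}\n\n{}\n".format(language, body)


if __name__ == "__main__":
    print(include_as_code_block("a = 1\nb = 2\n", "python"))
```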

3.3   Implementation

3.3.1   Reading

Felix Wiemann provided a proof of concept script that utilizes the pygments parser to parse a source code string and store the result in the document tree.

This concept is used in a pygments_code_block_directive (Source: pygments_code_block_directive.py), to define and register a "code-block" directive.

  • The DocutilsInterface class uses pygments to parse the content of the directive and classify the tokens using short CSS class names identical to pygments HTML output. If pygments is not available, the unparsed code is returned.
  • The code_block_directive function inserts the tokens in a "rich" <literal_block> element with "classified" <inline> nodes.
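
The two steps above can be sketched as follows. This is not the code of pygments_code_block_directive.py; the function names are hypothetical, and the sketch assumes that the short class names are taken from Pygments' STANDARD_TYPES table. As described above, the unparsed code is returned if Pygments is not available.

```python
try:
    from pygments import lex
    from pygments.lexers import get_lexer_by_name
    from pygments.token import STANDARD_TYPES
    from pygments.util import ClassNotFound
    HAVE_PYGMENTS = True
except ImportError:
    HAVE_PYGMENTS = False


def short_class(token_type):
    """Map a Pygments token type to its short CSS class name ('' if none)."""
    # Unknown subtypes inherit the class of their closest known parent.
    while token_type not in STANDARD_TYPES and token_type.parent is not None:
        token_type = token_type.parent
    return STANDARD_TYPES.get(token_type, "")


def classify_tokens(code, language):
    """Return a list of (css_class, text) pairs for `code`."""
    if not HAVE_PYGMENTS:
        return [("", code)]  # no Pygments: pass the code on unparsed
    try:
        lexer = get_lexer_by_name(language)
    except ClassNotFound:
        return [("", code)]  # unknown language: ordinary literal block
    return [(short_class(ttype), text) for ttype, text in lex(code, lexer)]


def build_literal_block(code, language):
    """Build a "rich" literal_block with classified inline children."""
    # Imported here so that classify_tokens stays usable without Docutils.
    from docutils import nodes

    block = nodes.literal_block(code, classes=["code-block", language])
    for css_class, text in classify_tokens(code, language):
        if css_class:
            block += nodes.inline(text, text, classes=[css_class])
        else:
            block += nodes.Text(text)
    return block
```

Because the result is an ordinary document tree node, every writer sees the same structure and can decide for itself what to do with the classes.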

3.3.2   Writing

The writers can use the class information in the <inline> elements to render the tokens. They should ignore the class information if they are unable to use it or to pass it on.
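
As an illustration of this rule, the following sketch renders a list of (class, text) token pairs either with or without the class information. It is a simplified stand-in for a writer, not Docutils code; the function name render_tokens_html and the token-pair representation are assumptions made for this example.

```python
from html import escape


def render_tokens_html(tokens, use_classes=True):
    """Render (css_class, text) pairs as an HTML <pre> element.

    With use_classes=False the class information is ignored and only
    the text is passed on, as a writer without styling support would do.
    """
    parts = []
    for css_class, text in tokens:
        if use_classes and css_class:
            parts.append(
                '<span class="{}">{}</span>'.format(escape(css_class), escape(text))
            )
        else:
            parts.append(escape(text))
    return '<pre class="code-block">{}</pre>'.format("".join(parts))
```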

Running the test script ../tools/test_pygments_code_block_directive.py produces example output for a set of writers.

HTML

The "html" writer works out of the box.

  • The rst2html-highlight front end registers the "code-block" directive and converts an input file to html.
  • Styling is done with the adapted CSS style sheet pygments-default.css based on docutils' default stylesheet and the output of pygmentize -S default -f html.

The conversion of myfunction.py.txt looks like myfunction.py.htm.

The "s5" and "pep" writers are not tested yet.

XML

"xml" and "pseudoxml" work out of the box.

The conversion of myfunction.py.txt looks like myfunction.py.xml and myfunction.py.pseudoxml, respectively.

LaTeX

"latex2e" (SVN version) works out of the box.

  • A style file, e.g. pygments-docutilsroles.sty, is required to actually highlight the code in the output. (As with HTML, the pygments-produced style file will not work with docutils' output.)

  • Alternatively, the latex writer could reconstruct the original content and pass it to a lstlisting environment.

    TODO: This should be the default behaviour with --literal-block-env=lstlisting.
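
The alternative of reconstructing the original content could be sketched like this. The function name tokens_to_lstlisting and the representation of the parsed block as (class, text) pairs are assumptions for this example; the language option is passed through unchanged, so the caller has to supply a name that listings knows.

```python
def tokens_to_lstlisting(tokens, language=None):
    """Rebuild the source from (css_class, text) pairs for listings."""
    # Dropping the class information restores the unparsed code.
    code = "".join(text for _css_class, text in tokens)
    if not code.endswith("\n"):
        code += "\n"
    options = "[language=" + language + "]" if language else ""
    return "\\begin{lstlisting}" + options + "\n" + code + "\\end{lstlisting}\n"
```

In this variant the highlighting itself is left to LaTeX, so no Pygments-specific style file is needed.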

The LaTeX output of myfunction.py.txt looks like myfunction.py.tex and the corresponding PDF like myfunction.py.pdf.

OpenOffice

The sandbox project odtwriter provided syntax highlight with pygments but used a different syntax and implementation.

(What is the status of the odtwriter now included in the standard distribution?)

3.4   TODO

  1. Minimal implementation:
  2. Write functional test case and sample.
  3. Think about an interface for pygments' options (like "encoding" or "linenumbers").

4   Configurable literal block directive

4.1   Goal

A clean and simple syntax for highlighted code blocks -- preserving the space saving feature of the "minimised" literal block marker (:: at the end of a text paragraph). This is especially desirable in documents with many code blocks like tutorials or literate programs.

4.2   Inline analogue

The role of inline interpreted text can be customised with the "default-role" directive. This allows the use of the concise "backtick" syntax for the most often used role, e.g. in a chemical paper, one could use:

.. default-role:: subscript

The triple point of H\ `2`\ O is at 0°C.

to produce

The triple point of H2O is at 0°C.

This customisation is currently not possible for block markup.

4.3   Proposal

  • Define a new "literal-block" directive syntax for an ordinary literal block. This would simply insert the block content into the document tree as a "literal_block" element.
  • Define a "default-literal-block" setting that controls which directive is called on a block following ::. Default would be the "literal-block" directive (backwards compatible).
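
The dispatch implied by these two points can be modelled in a few lines. This is a toy model, not Docutils parser code: the class LiteralBlockDispatcher, its methods, and the tuple used to stand in for a document tree node are all hypothetical.

```python
class LiteralBlockDispatcher:
    """Toy parser state: a setting chooses the handler for '::' blocks."""

    def __init__(self):
        # A handler maps (text, arguments) to (node_name, classes, text).
        self.handlers = {"literal-block": self._literal_block}
        self.default_name = "literal-block"  # backwards-compatible default
        self.default_arguments = []

    @staticmethod
    def _literal_block(text, arguments):
        return ("literal_block", [], text)

    def register(self, name, handler):
        self.handlers[name] = handler

    def set_default(self, name, *arguments):
        """Model of ``.. default-literal-block:: name arguments``."""
        if name not in self.handlers:
            raise KeyError("unknown directive: " + name)
        self.default_name = name
        self.default_arguments = list(arguments)

    def handle_literal_block(self, text):
        """Called for every block that follows '::'."""
        handler = self.handlers[self.default_name]
        return handler(text, self.default_arguments)


def code_block(text, arguments):
    """Stand-in for the "code-block" directive."""
    language = arguments[0] if arguments else "python"
    return ("literal_block", ["code-block", language], text)
```

Until set_default is called, every block is handled as an ordinary literal block, which keeps existing documents unchanged.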

4.4   Motivation

Analogous to customising the default role of "interpreted text" with the "default-role" directive, the concise :: literal-block markup could be used for e.g.

  • a "code-block" directive for syntax highlight
  • the "line-block" directive for poems or addresses
  • the "parsed-literal" directive

Example:

ordinary literal block::

   some text typeset in monospace

.. default-literal-block::  code-block python

this is colourful Python code::

   def hello():
       print("hello world")

Along the same lines, a "default-block-quote" setting or directive could be considered to configure the role of a block quote.

5   Odtwriter syntax

Attention!

The content of this section relates to an old version of the odtwriter. Things changed with the inclusion of the odtwriter into standard Docutils.

This is only kept for historical reasons.

Dave Kuhlman's odtwriter extension can add syntax highlighting to ordinary literal blocks.

The --add-syntax-highlighting command line flag activates syntax highlighting in literal blocks. By default, the "python" lexer is used.

You can change this within your reST document with the sourcecode directive:

.. sourcecode:: off

ordinary literal block::

   content set in teletype

.. sourcecode:: on
.. sourcecode:: python

colourful Python code::

   def hello():
       print("hello world")

The "sourcecode" directive defined by the odtwriter is fundamentally different from the "code-block" directive of rst2html-pygments: as the example shows, it has no content of its own but acts as a switch for the literal blocks that follow.

I.e. the odtwriter implements a configurable literal block directive (but with a slightly different syntax than the proposal above).