Docutils To Do List

Author: David Goodger (with input from many); open to all Docutils developers
Contact: goodger@python.org
Date: 2009-06-29
Revision: 6006
Copyright: This document has been placed in the public domain.

Priority items are marked with "@" symbols. The more @s, the higher the priority. Items in question form (containing "?") are ideas which require more thought and debate; they are potential to-do's.

Many of these items are awaiting champions. If you see something you'd like to tackle, please do! If there's something you'd like to see done but are unable to implement it yourself, please consider donating to Docutils: Support the Docutils project!

Please see also the Bugs document for a list of bugs in Docutils.

Minimum Requirements for Python Standard Library Candidacy

Below are action items that must be added and issues that must be addressed before Docutils can be considered suitable to be proposed for inclusion in the Python standard library.

General

Documentation

User Docs

  • Add a FAQ entry about using Docutils (with reStructuredText) on a server and the fact that it's terribly slow there. See the first paragraphs in <http://article.gmane.org/gmane.text.docutils.user/1584>.
  • Add document about what Docutils has previously been used for (web/use-cases.txt?).
  • Improve index in docs/user/config.txt.

Developer Docs

  • Complete Docutils Runtime Settings.
  • Improve the internal module documentation (docstrings in the code). Specific deficiencies listed below.
    • docutils.parsers.rst.states.State.build_table: data structure required (including StringList).
    • docutils.parsers.rst.states: more complete documentation of parser internals.
  • docs/ref/doctree.txt: DTD element structural relationships, semantics, and attributes. In progress; element descriptions to be completed.
  • Document the pending elements, how they're generated and what they do.
  • Document the transforms (perhaps in docstrings?): how they're used, what they do, dependencies & order considerations.
  • Document the HTML classes used by html4css1.py.
  • Write an overview of the Docutils architecture, as an introduction for developers. What connects to what, why, and how. Either update PEP 258 (see PEPs below) or write it as a separate doc.
  • Give information about unit tests. Maybe as a howto?
  • Document the docutils.nodes APIs.
  • Complete the docs/api/publisher.txt docs.

How-Tos

  • Creating Docutils Writers
  • Creating Docutils Readers
  • Creating Docutils Transforms
  • Creating Docutils Parsers
  • Using Docutils as a Library

PEPs

  • Complete PEP 258 Docutils Design Specification.

    • Fill in the blanks in API details.

    • Specify the nodes.py internal data structure implementation?

      [Tibs:] Eventually we need to have direct documentation in there on how it all hangs together - the DTD is not enough (indeed, is it still meant to be correct? [Yes, it is. --DG]).

  • Rework PEP 257, separating style from spec from tools, wrt Docutils? See Doc-SIG from 2001-06-19/20.

Python Source Reader

General:

Miscellaneous ideas:

reStructuredText Parser

Also see the ... Or Not To Do? list.

Directives

Directives below are often referred to as "module.directive", the directive function. The "module." is not part of the directive name when used in a document.

  • Make the directive interface object-oriented (http://article.gmane.org/gmane.text.docutils.user/1871).

  • Allow for field lists in list tables. See <http://thread.gmane.org/gmane.text.docutils.devel/3392>.

  • Unify table implementations and unify options of table directives (http://article.gmane.org/gmane.text.docutils.user/1857).

  • Allow directives to be added at run-time?

  • Use the language module for directive option names?

  • Add "substitution_only" and "substitution_ok" function attributes, and automate context checking?

  • Change directive functions to directive classes? Superclass' __init__() could handle all the bookkeeping.

  • Implement options or features on existing directives:

    • Add a "name" option to directives, to set an author-supplied identifier?

    • All directives that produce titled elements should grow implicit reference names based on the titles.

    • Allow the :trim: option for all directives when they occur in a substitution definition, not only the unicode directive.

    • Add the "class" option to the unicode directive. For example, you might want to get characters or strings with borders around them.

    • images.figure: "title" and "number", to indicate a formal figure?

    • parts.sectnum: "local"?, "refnum"

      A "local" option could enable numbering for sections from a certain point down, and sections in the rest of the document are not numbered. For example, a reference section of a manual might be numbered, but not the rest. OTOH, an all-or-nothing approach would probably be enough.

      The "sectnum" directive should be usable multiple times in a single document. For example, in a long document with "chapter" and "appendix" sections, there could be a second "sectnum" before the first appendix, changing the sequence used (from 1,2,3... to A,B,C...). This is where the "local" concept comes in. This part of the implementation can be left for later.

      A "refnum" option (better name?) would insert reference names (targets) consisting of the reference number. Then a URL could be of the form http://host/document.html#2.5 (or "2-5"?). Allow internal references by number? Allow name-based and number-based ids at the same time, or only one or the other (which would the table of contents use)? Usage issue: altering the section structure of a document could render hyperlinks invalid.

    • parts.contents: Add a "suppress" or "prune" option? It would suppress contents display for sections in a branch from that point down. Or a new directive, like "prune-contents"?

      Add an option to include topics in the TOC? Another for sidebars? The "topic" directive could have a "contents" option, or the "contents" directive could have an "include-topics" option. See docutils-develop 2003-01-29.

    • parts.header & parts.footer: Support multiple, named headers & footers? For example, separate headers & footers for odd, even, and the first page of a document.

      This may be too specific to output formats which have a notion of "pages".

    • misc.class:

    • misc.include:

      • Option to select a range of lines?

      • Option to label lines?

      • How about an environment variable, say RSTINCLUDEPATH or RSTPATH, for standard includes (as in .. include:: <name>)? This could be combined with a setting/option to allow user-defined include directories.

      • Add support for inclusion by URL?

        .. include::
           :url: http://www.example.org/inclusion.txt
        
    • misc.raw: add a "destination" option to the "raw" directive?

      .. raw:: html
         :destination: head
      
         <link ...>
      

      It needs thought & discussion though, to come up with a consistent set of destination labels and consistent behavior.

      And placing HTML code inside the <head> element of an HTML document is rather the job of a templating system.

    • body.sidebar: Allow internal section structure? Adornment styles would be independent of the main document.

      That is really complicated, however, and the document model greatly benefits from its simplicity.

  • Implement directives. Each of the list items below begins with an identifier of the form, "module_name.directive_function_name". The directive name itself could be the same as the directive_function_name, or it could differ.

    • html.imagemap

      It has the disadvantage that it's only easily implementable for HTML, so it's specific to one output format.

      (For non-HTML writers, the imagemap would have to be replaced with the image only.)

    • parts.endnotes (or "footnotes"): See Footnote & Citation Gathering.

    • parts.citations: See Footnote & Citation Gathering.

    • misc.language: Specify (= change) the language of a document at parse time.

    • misc.settings: Set any(?) Docutils runtime setting from within a document? Needs much thought and discussion.

      Security concerns need to be taken into account (it shouldn't be possible to enable file_insertion_enabled from within a document), and settings that only would have taken effect before the directive (like tab-width) shouldn't be accessible either. (How about changing tab-width before an include directive though? Or should include rather grow a tab-width option?)

      See this sub-thread: <http://thread.gmane.org/gmane.text.docutils.user/3620/focus=3649>

    • misc.gather: Gather (move, or copy) all instances of a specific element. A generalization of the "endnotes" & "citations" ideas.

    • Add a custom "directive" directive, equivalent to "role"? For example:

      .. directive:: incr
      
         .. class:: incremental
      
      .. incr::
      
      "``.. incr::``" above is equivalent to "``.. class:: incremental``".
      

      Another example:

      .. directive:: printed-links
      
         .. topic:: Links
            :class: print-block
      
            .. target-notes::
               :class: print-inline
      

      This acts like macros. The directive contents will have to be evaluated when referenced, not when defined.

      • Needs a better name? "Macro", "substitution"?
      • What to do with directive arguments & options when the macro/directive is referenced?
    • Make the meaning of block quotes overridable? Only a 1-shot though; doesn't solve the general problem.

    • Docutils already has the ability to say "use this content for Writer X" (via the "raw" directive), but it doesn't have the ability to say "use this content for any Writer other than X". It wouldn't be difficult to add this ability though.

      My first idea would be to add a set of conditional directives. Let's call them "writer-is" and "writer-is-not" for discussion purposes (don't worry about implementation details). We might have:

      .. writer-is:: text-only
      
         ::
      
             +----------+
             |   SNMP   |
             +----------+
             |   UDP    |
             +----------+
             |    IP    |
             +----------+
             | Ethernet |
             +----------+
      
      .. writer-is:: pdf
      
         .. figure:: protocol_stack.eps
      
      .. writer-is-not:: text-only pdf
      
         .. figure:: protocol_stack.png
      

      This could be an interface to the Filter transform (docutils.transforms.components.Filter).

      The ideas in adaptable file extensions above may also be applicable here.

      SVG's "switch" statement may provide inspiration.

      Here's an example of a directive that could produce multiple outputs (both raw troff pass-through and a GIF, for example) and allow the Writer to select.

      .. eqn::
      
         .EQ
         delim %%
         .EN
         %sum from i=o to inf c sup i~=~lim from {m -> inf}
         sum from i=0 to m sup i%
         .EQ
         delim off
         .EN
      
    • body.example: Examples; suggested by Simon Hefti. Semantics as per Docbook's "example"; admonition-style, numbered, reference, with a caption/title.

    • body.index: Index targets.

      See Index Entries & Indexes.

    • body.literal: Literal block, possibly "formal" (see object numbering and object references above). Possible options:

      • "highlight" a range of lines

      • include only a specified range of lines

      • "number" or "line-numbers"

      • "styled" could indicate that the directive should check for style comments at the end of lines to indicate styling or markup.

        Specific derivatives (i.e., a "python-interactive" directive) could interpret style based on cues, like the ">>> " prompt and "input()"/"raw_input()" calls.

      See docutils-users 2003-03-03.

    • body.listing: Code listing with title (to be numbered eventually), equivalent of "figure" and "table" directives.

    • colorize.python: Colorize Python code. Fine for HTML output, but what about other formats? Revert to a literal block? Do we need some kind of "alternate" mechanism? Perhaps use a "pending" transform, which could switch its output based on the "format" in use. Use a factory function "transformFF()" which returns either an "HTMLTransform()" instance or a "GenericTransform()" instance?

      If we take a Python-to-HTML pretty-printer and make it output a Docutils internal doctree (as per nodes.py) instead of HTML, then each output format's stylesheet (or equivalent) mechanism could take care of the rest. The pretty-printer code could turn this doctree fragment:

      <literal_block xml:space="preserve">
      print 'This is Python code.'
      for i in range(10):
          print i
      </literal_block>
      

      into something like this ("</>" is end-tag shorthand):

      <literal_block xml:space="preserve" class="python">
      <keyword>print</> <string>'This is Python code.'</>
      <keyword>for</> <identifier>i</> <keyword
      >in</> <expression>range(10)</>:
          <keyword>print</> <expression>i</>
      </literal_block>
      

      But I'm leaning toward adding a single new general-purpose element, "phrase", equivalent to HTML's <span>. Here's the example rewritten using the generic "phrase":

      <literal_block xml:space="preserve" class="python">
      <phrase class="keyword">print</> <phrase
       class="string">'This is Python code.'</>
      <phrase class="keyword">for</> <phrase
       class="identifier">i</> <phrase class="keyword">in</> <phrase
       class="expression">range(10)</>:
          <phrase class="keyword">print</> <phrase
           class="expression">i</>
      </literal_block>
      

      It's more verbose but more easily extensible and more appropriate for the case at hand. It allows us to edit style sheets to add support for new formats, not the Docutils code itself.

      Perhaps a single directive with a format parameter would be better:

      .. colorize:: python
      
         print 'This is Python code.'
         for i in range(10):
             print i
      

      But directives can have synonyms for convenience. "format:: python" was suggested, but "format" seems too generic.

    • pysource.usage: Extract a usage message from the program, either by running it at the command line with a --help option or through an exposed API. [Suggestion for Optik.]
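The "writer-is" / "writer-is-not" idea from the list above comes down to a small selection rule. The following is only a sketch of that rule in plain Python, not an interface to the Filter transform: the `Conditional` record and the `select` function are hypothetical names invented for illustration, and the content strings stand in for the doctree fragments of the example.

```python
from dataclasses import dataclass


@dataclass
class Conditional:
    """Hypothetical record for one "writer-is" / "writer-is-not" block."""
    writers: frozenset  # writer names given as directive arguments
    negate: bool        # True for "writer-is-not"
    content: str        # stand-in for the block's doctree fragment


def select(blocks, active_writer):
    """Return the content of every block that applies to active_writer."""
    chosen = []
    for block in blocks:
        listed = active_writer in block.writers
        # "writer-is" keeps the block when the writer is listed;
        # "writer-is-not" keeps it when the writer is not listed.
        if listed != block.negate:
            chosen.append(block.content)
    return chosen


# The three blocks of the protocol-stack example above.
blocks = [
    Conditional(frozenset({"text-only"}), False, "ASCII-art protocol stack"),
    Conditional(frozenset({"pdf"}), False, "protocol_stack.eps"),
    Conditional(frozenset({"text-only", "pdf"}), True, "protocol_stack.png"),
]
```

With these blocks, any Writer other than "text-only" and "pdf" would receive only the PNG figure, which is the "any Writer other than X" case that the "raw" directive cannot express.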

Interpreted Text

Interpreted text is entirely a reStructuredText markup construct, a way to get around built-in limitations of the medium. Some roles are intended to introduce new doctree elements, such as "title-reference". Others are merely convenience features, like "RFC".

All supported interpreted text roles must already be known to the Parser when they are encountered in a document. Whether pre-defined in core/client code, or in the document, doesn't matter; the roles just need to have already been declared. Adding a new role may involve adding a new element to the DTD and may require extensive support, therefore such additions should be well thought-out. There should be a limited number of roles.

The only place where no limit is placed on variation is at the start, at the Reader/Parser interface. Transforms are inserted by the Reader into the Transformer's queue, where non-standard elements are converted. Once past the Transformer, no variation from the standard Docutils doctree is possible.

An example is the Python Source Reader, which will use interpreted text extensively. The default role will be "Python identifier", which will be further interpreted by namespace context into <class>, <method>, <module>, <attribute>, etc. elements (see pysource.dtd), which will be transformed into standard hyperlink references, which will be processed by the various Writers. No Writer will need to have any knowledge of the Python-Reader origin of these elements.

  • Add explicit interpreted text roles for the rest of the implicit inline markup constructs: named-reference, anonymous-reference, footnote-reference, citation-reference, substitution-reference, target, uri-reference (& synonyms).

  • Add directives for each role as well? This would allow indirect nested markup:

    This text contains |nested inline markup|.
    
    .. |nested inline markup| emphasis::
    
       nested ``inline`` markup
    
  • Implement roles:

    • "raw-wrapped" (or "raw-wrap"): Base role to wrap raw text around role contents.

      For example, the following reStructuredText source ...

      .. role:: red(raw-formatting)
         :prefix:
             :html: <font color="red">
             :latex: {\color{red}
         :suffix:
             :html: </font>
             :latex: }
      
      colored :red:`text`
      

      ... will yield the following document fragment:

      <paragraph>
          colored
          <inline classes="red">
              <raw format="html">
                  <font color="red">
              <raw format="latex">
                  {\color{red}
              <inline classes="red">
                  text
              <raw format="html">
                  </font>
              <raw format="latex">
                  }
      

      Possibly without the intermediate "inline" node.

    • "acronym" and "abbreviation": Associate the full text with a short form. Jason Diamond's description:

      I want to translate `reST`:acronym: into <acronym title='reStructuredText'>reST</acronym>. The value of the title attribute has to be defined out-of-band since you can't parameterize interpreted text. Right now I have them in a separate file but I'm experimenting with creating a directive that will use some form of reST syntax to let you define them.

      Should Docutils complain about undefined acronyms or abbreviations?

      What to do if there are multiple definitions? How to differentiate between CSS (Content Scrambling System) and CSS (Cascading Style Sheets) in a single document? David Priest responds,

      The short answer is: you don't. Anyone who did such a thing would be writing very poor documentation indeed. (Though I note that somewhere else in the docs, there's mention of allowing replacement text to be associated with the abbreviation. That takes care of the duplicate acronyms/abbreviations problem, though a writer would be foolish to ever need it.)

      How to define the full text? Possibilities:

      1. With a directive and a definition list?

        .. acronyms::
        
           reST
               reStructuredText
           DPS
               Docstring Processing System
        

        Would this list remain in the document as a glossary, or would it simply build an internal lookup table? A "glossary" directive could be used to make the intention clear. Acronyms/abbreviations and glossaries could work together.

        Then again, a glossary could be formed by gathering individual definitions from around the document.

      2. Some kind of inline parameter syntax?

        `reST <reStructuredText>`:acronym: is `WYSIWYG <what you
        see is what you get>`:acronym: plaintext markup.
        
      3. A combination of 1 & 2?

        The multiple definitions issue could be handled by establishing rules of priority. For example, directive-based lookup tables have highest priority, followed by the first inline definition. Multiple definitions in directive-based lookup tables would trigger warnings, similar to the rules of implicit hyperlink targets.

      4. Using substitutions?

        .. |reST| acronym:: reST
           :text: reStructuredText
        

      What do we do for other formats than HTML which do not support tool tips? Put the full text in parentheses?

    • "figure", "table", "listing", "chapter", "page", etc: See object numbering and object references above.

    • "glossary-term": This would establish a link to a glossary. It would require an associated "glossary-entry" directive, whose contents could be a definition list:

      .. glossary-entry::
      
         term1
             definition1
         term2
             definition2
      

      This would allow entries to be defined anywhere in the document, and collected (via a "glossary" directive perhaps) at one point.
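The priority rules proposed for acronym definitions above (directive-based lookup tables first, then the first inline definition, with a warning for duplicates inside directive tables) could be prototyped as follows. This is a sketch only: `build_acronym_table` is a hypothetical name, definitions are assumed to arrive as (short form, full text) pairs in document order, and keeping the first of two conflicting directive-based definitions is an assumption, since the text above only says that such duplicates trigger warnings.

```python
import warnings


def build_acronym_table(directive_defs, inline_defs):
    """Merge acronym definitions according to the proposed priorities.

    Both arguments are lists of (short, full) pairs in document order.
    """
    table = {}
    for short, full in directive_defs:
        if short in table:
            # Duplicate in a directive-based table: warn.  Keeping the
            # first definition is an assumption of this sketch.
            warnings.warn("duplicate acronym definition: %r" % short)
            continue
        table[short] = full
    for short, full in inline_defs:
        # setdefault() leaves directive-based entries untouched and keeps
        # only the first inline definition of each short form.
        table.setdefault(short, full)
    return table
```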

Unimplemented Transforms

HTML Writer

PEP/HTML Writer

S5/HTML Writer

LaTeX writer

Also see the Problems section in the latex writer documentation and discussion and proposals in the latex-variants sandbox project.

Bug fixes

  • A multirow cell in a table expects empty cells in the spanned rows while the doctree contains only the remaining cells ("Exchange Table Model", see docs/ref/soextblx.dtd).

    Needs bookkeeping of "open" multirow cells (how many, and for how long) and insertion of additional '&'s.

    See ../../test/functional/input/data/latex.txt

  • Too deeply nested lists fail: generate an error, a warning or provide a workaround?

  • Symbol footnotes fail with "use_latex_footnotes".
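The bookkeeping needed for the multirow bug above can be sketched independently of the writer. The code below assumes a simplified input: each row is a list of (text, morerows) tuples that holds only the cells actually present in the doctree, and column spans are ignored. The function names are hypothetical, and a real fix would also have to emit the multirow markup itself.

```python
def pad_multirow(rows, ncols):
    """Insert empty cells where an earlier multirow cell is still "open".

    rows:  list of rows, each a list of (text, morerows) tuples
    ncols: number of table columns
    Returns a list of rows with exactly ncols strings each.
    """
    open_spans = [0] * ncols  # rows still covered by a multirow cell, per column
    padded = []
    for row in rows:
        cells = iter(row)
        out = []
        for col in range(ncols):
            if open_spans[col] > 0:
                # Column is covered by a cell from a previous row.
                open_spans[col] -= 1
                out.append("")
            else:
                text, morerows = next(cells)
                open_spans[col] = morerows
                out.append(text)
        padded.append(out)
    return padded


def to_latex_rows(rows, ncols):
    """Join the padded cells with '&', one LaTeX table row per line."""
    return "\n".join(" & ".join(r) + r" \\" for r in pad_multirow(rows, ncols))
```

For a two-column table whose first cell spans two rows, the second row gets a leading empty cell and therefore the additional '&' that LaTeX expects.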

Generate clean and configurable LaTeX source

  • Check the generated source with package nag.

LaTeX macros for Docutils-specific objects

subtitle:

Is there a native construct?

section titles:

Custom command or package hypbmsec?

admonitions:

optional "type" argument:

\providecommand{\DUadmonition}[2][]{%
\ifcsname DU#1\endcsname%
  \csname DU#1\endcsname{#2}%
\else
    \begin{center}
      \fbox{\parbox{0.9\textwidth}{#2}}
    \end{center}
\fi%
}

This would allow individual restyling of, for example, notes with pdfcomment.sty or todonotes.sty.

Use a "type" argument also for \DUtopictitle? (would allow red error etc.)

Configurable placement of figure and table floats

  • Special class argument to individually place figures?

    Either:

    placement-<optional arg> -> figure[<optional arg>]{...}

    e.g. .. class::  placement-htb,

    or more verbose:

    H:place-here
    h:place-here-if-possible
    t:place-top
    b:place-bottom
    p:place-on-extra-page

    e.g.: .. class:: place-here-if-possible place-top place-bottom
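Both spellings proposed above, the short "placement-<letters>" form and the verbose class names, can be mapped to the optional argument of the figure environment with a small lookup. This is a sketch under the assumption that the writer sees the class values as a list of strings; `figure_placement` and `begin_figure` are hypothetical names.

```python
# Verbose class names from the proposal above and their placement letters.
VERBOSE_PLACEMENT = {
    "place-here": "H",
    "place-here-if-possible": "h",
    "place-top": "t",
    "place-bottom": "b",
    "place-on-extra-page": "p",
}


def figure_placement(classes):
    """Return the placement letters encoded in a list of class names."""
    letters = []
    for cls in classes:
        if cls.startswith("placement-"):
            # Short form: the letters follow the prefix directly.
            letters.extend(cls[len("placement-"):])
        elif cls in VERBOSE_PLACEMENT:
            letters.append(VERBOSE_PLACEMENT[cls])
    return "".join(letters)


def begin_figure(classes):
    """Build the opening of the figure environment for these classes."""
    placement = figure_placement(classes)
    return r"\begin{figure}" + ("[%s]" % placement if placement else "")
```

Classes that do not describe a placement are ignored, so a figure without any placement class keeps LaTeX's default float behaviour.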

LaTeX constructs and packages instead of re-implementations

Which packages do we want to use?

  • base and "recommended" packages

    (packages that should be in a "reasonably sized and reasonably modern LaTeX installation", like the texlive-latex-recommended Debian package, say):

  • No "fancy" or "exotic" requirements.

    Or provide fallback solutions in case the required packages do not exist on the target system? (Can be done with \@ifpackageloaded{} or \IfFileExists{}.)

  • pointers to advanced packages and their use in the latex writer documentation.

  • alltt environment for literal block.

  • footnotes

    • enable symbolic footnotes (parallel to numeric)

    • document customization (links to how-to and packages)

    • find out how to link and back-link to footnotes with hyperref

      The hyperref manual says:

      hyperfootnotes: Makes the footnote marks into hyperlinks to the footnote text. Easily broken ...

      the hyperref README says:

      The footnote support is rather limited. It is beyond the scope to use footnotemark and footnotetext out of order or reusing footnotemark. Here you can either disable hyperref's footnote

      And provides an example which is built on in this thread.

    • disable or properly support --footnote-references=brackets

      When supplying the command line options --footnote-references=brackets and --use-latex-footnotes with the LaTeX writer (which might very well happen when using configuration files), the spaces in front of footnote references aren't trimmed.

  • enumeration environment, field list

    use mdwlist from texlive-latex-recommended?

  • --use-latex-when-possible »super option« that would set the following:

       --use-latex-toc
       --use-latex-docinfo
       --use-latex-abstract
       --use-latex-footnotes
       --use-latex-citations
    
    ? (My preference is to default to use-latex-* whenever possible [GM])
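A possible behaviour for the proposed super option is sketched below. The setting names are derived from the option names listed above by replacing hyphens with underscores, which is an assumption, as is the rule that an individually specified option is not overridden by the super option. `expand_super_option` is a hypothetical helper, not part of Docutils.

```python
# Settings implied by the proposed --use-latex-when-possible option.
IMPLIED_SETTINGS = (
    "use_latex_toc",
    "use_latex_docinfo",
    "use_latex_abstract",
    "use_latex_footnotes",
    "use_latex_citations",
)


def expand_super_option(settings):
    """Return a copy of settings with the super option expanded.

    Settings that are already present keep their value (an assumption of
    this sketch), so a configuration file can still switch one of them off.
    """
    expanded = dict(settings)
    if expanded.get("use_latex_when_possible"):
        for name in IMPLIED_SETTINGS:
            expanded.setdefault(name, True)
    return expanded
```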
    

Default layout

  • use-latex-doc by default?

  • Which default font should we use for the output?

    Proposal: Use one of the Postscript default fonts supported by standard LaTeX (pages 10 and 11 of the PSNFSS documentation), e.g. Times or Palatino.

    In PDF 1.3 there are 14 standard fonts:

    Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic, Helvetica, Helvetica-Bold, Helvetica-Oblique, Helvetica-BoldOblique, Courier, Courier-Bold, Courier-Oblique, Courier-BoldOblique, Symbol, ZapfDingbats

    The rest you need to embed.

  • Use italic instead of slanted for titlereference?

  • Let meta directive insert comment into header instead of document?

  • More prominent system messages (use admonition?, red?)

  • Start a new paragraph after lists (as currently) or continue (no blank line in source, no parindent in output)?

    Overriding:

    • continue if the compound paragraph directive is used, or
    • force a new paragraph with an empty comment.
  • Indent/handle doctest blocks similar to literal blocks?

  • Sidebar handling (environment with framed, marginnote, wrapfig, ...)?

  • Use optionlist for docinfo? The table only works on a single page.

  • Recognize LaTeX and replace by \LaTeX (or not?).

  • Keep literal-blocks together on a page, avoid pagebreaks.

    Failed experiments up to now: samepage, minipage, pagebreak 1 to 4 before the block.

    Should be possible with --literal-block-env=lstlisting and some configuration...

  • Configurable classifier in description list? emph instead of bold as default?

Tables

  • Improve/simplify logic to set the column width in the output.

    • Assumed reST line length for table width setting configurable.
    • Maybe use ltxtable (a combination of tabularx (auto-width) and longtable (page breaks))?

    Use tabularx column type X and let LaTeX decide width?

  • csv-tables do not have a colwidth.

  • Add more classes or options, e.g. for

    • column width set by latex,
    • horizontal alignment and rules.
  • Use tabular instead of longtable for tables in legends or generally inside a float?

    Alternatively, default to tabular and use longtable only if specified by config setting or class argument (analogous to booktabs)?

  • Table heads and footer for longtable (firstpage lastpage ..)?

  • In the right column of the option tables in tools.txt, there should be more spacing between the description and the following "Default:" paragraph.

  • Paragraph separation in tables is hairy. See http://www.tex.ac.uk/cgi-bin/texfaq2html?label=struttab

    • The strut solution did not work.
    • Setting extrarowheight adds space at the top of a row, not between paragraphs in a cell. It is set to 2pt anyway, because the text is otherwise too close to the top line.
    • baselineskip/stretch does not help.
  • Should there be two hlines after table head and on table end?

  • Place titled tables in a float ('table' environment)?

    The 'table', 'csv-table', and 'list-table' directives support an (optional) table title. In analogy to the 'figure' directive this should map to a table float.

Image and figure directives

Missing features

  • test and document selection of LaTeX fontsize.

    Add font size in points to the document options, e.g. --documentoptions=12; use extsize or some other package for values other than [10,11,12].

  • support "figwidth" argument for figures.

    As the 'figwidth' argument is still ignored and the "natural width" of a figure in LaTeX is 100 % of the text width, setting the 'align' argument currently has no effect on the LaTeX output.

  • better citation support

  • Meta keywords into PDF ?

  • Multiple author entries in docinfo (same thing as in html). (already solved?)

  • Hyperlinks are not hyphenated; this leads to bad spacing. See docs/user/rst/demo.txt 2.14 directives.

    --> Pass the right option to \hypersetup

  • Consider supporting the "compact" option and class argument (from rst2html) as some lists look better compact and others need the space.

  • If use-latex-citations is used, a bibliography is inserted right at the end of the document.

    Put in place of the to-be-implemented "citations" directive (see Footnote & Citation Gathering).

  • Provide a --default-dpi (or better named) option to set the size of a pixel (length unit px)?

    You can do this already with:

    \pdfpxdimen=1in % 1 DPI
    \divide\pdfpxdimen by 92 % 92 DPI
    

    in a style sheet. So which should take precedence (or just leave it)?
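The arithmetic behind the proposed option is small enough to spell out. The sketch below assumes the option value is an integer DPI; both function names are hypothetical, and the generated snippet only reproduces the two style-sheet lines shown above.

```python
TEX_POINTS_PER_INCH = 72.27  # TeX's pt; PostScript "big points" (bp) use 72


def px_to_pt(pixels, dpi):
    """Convert a length in pixels to TeX points at the given resolution."""
    return pixels * TEX_POINTS_PER_INCH / dpi


def pdfpxdimen_snippet(dpi):
    """Return LaTeX code that sets the size of one px for this DPI value."""
    return "\\pdfpxdimen=1in\n\\divide\\pdfpxdimen by %d\n" % dpi
```

At 92 DPI, an image 92 px wide is one inch (72.27pt) wide.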

template file

Using a template file instead of hard-coded string literals for the skeleton of the exported LaTeX file would provide improved configurability. A power user could replace the template with a custom version, e.g. to overcome issues with order of package loading or suppressing some parts.

  • Template Strings, from string import Template, provide a suitable syntax (shell-like $something substitution) but are new in version 2.4.

    For 2.3 compatibility, %-substitution is needed, which is less readable (and LaTeX' comment sign '%' must be escaped).

    -> provide a 2.3 compatibility implementation?
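To make the comparison concrete, here is a minimal sketch of a LaTeX skeleton written with string.Template. The placeholder names (options, documentclass, preamble, body) and the `render` helper are invented for illustration and do not reflect the writer's actual structure. Note that LaTeX's comment sign '%' needs no escaping here, whereas a literal '$' (LaTeX math mode) would have to be written as '$$' in the template.

```python
from string import Template

# Skeleton of the exported file; the $name placeholders are hypothetical.
SKELETON = Template(r"""\documentclass[$options]{$documentclass}
% packages and definitions
$preamble
\begin{document}
$body
\end{document}
""")


def render(**parts):
    """Fill the skeleton; substitute() raises KeyError for a missing part."""
    return SKELETON.substitute(parts)


document = render(
    options="a4paper",
    documentclass="article",
    preamble=r"\usepackage{hyperref}",
    body="Hello, world.",
)
```

A power user could replace SKELETON with the contents of a custom template file, for example to change the order in which packages are loaded.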

Unicode to LaTeX with unicodesymbols file from LyX

The LyX document processor has a comprehensive Unicode to LaTeX conversion feature with a file called unicodesymbols that lists LaTeX counterparts for a wide range of Unicode characters.

Use this in the LaTeXTranslator.unicode_to_latex() method. Think of copyright issues!
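Whatever the source of the mapping, the conversion step itself can be a plain table lookup. The sketch below uses a tiny hand-written table in place of the data that would come from the unicodesymbols file (whose format is not described here, so no parser is shown). The stand-alone function is an illustration only, not the existing method.

```python
# Hand-written sample entries; a real table would be generated from a
# much larger data source such as the unicodesymbols file.
UNICODE_TO_LATEX = {
    0x00A0: "~",                   # no-break space
    0x00A9: r"\textcopyright{}",   # copyright sign
    0x2013: "--",                  # en dash
    0x2014: "---",                 # em dash
    0x2026: r"\ldots{}",           # horizontal ellipsis
}


def unicode_to_latex(text, table=UNICODE_TO_LATEX):
    """Replace every character that has a table entry by its LaTeX code."""
    # str.translate() accepts a dict mapping code points to strings.
    return text.translate(table)
```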

Allow choice between utf8 (standard) and utf8x (extended) encodings

  • Allow the user to select utf8 or utf8x LaTeX encoding. (Docutils' output encoding becomes LaTeX's input encoding.)

The ucs package provides extended support for UTF-8 encoding in LaTeX via the inputenc-option utf8x. It is, however, a non-standard extension and no longer developed.

Ideas:
  • Python has 4 names for the UTF-8 encoding (utf_8, U8, UTF, utf8); give a special meaning to one of the aliases, or
  • scan the "stylesheets" option and use utf8x if it contains ucs.
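The second idea could look roughly like this. The sketch relies on codecs.lookup() to normalise Python's aliases for UTF-8, and it assumes that the "stylesheets" value is a comma-separated string in which the ucs package appears as "ucs" or "ucs.sty". `inputenc_option` is a hypothetical helper.

```python
import codecs


def inputenc_option(output_encoding, stylesheets=""):
    """Choose "utf8" or "utf8x" for a UTF-8 output encoding.

    Returns None for unknown or non-UTF-8 encodings, which this sketch
    leaves to other code.
    """
    try:
        canonical = codecs.lookup(output_encoding).name
    except LookupError:
        return None
    if canonical != "utf-8":
        return None
    names = [name.strip() for name in stylesheets.split(",")]
    uses_ucs = any(name in ("ucs", "ucs.sty") for name in names)
    return "utf8x" if uses_ucs else "utf8"
```

Because all aliases normalise to the same codec name, this variant does not need to give one of them a special meaning.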

Front-End Tools