Docutils: Multiple Input Files

Author: Lea Wiemann <LeWiemann@gmail.com>
Date: 2007-08-21
Revision: 5417

1 Introduction

We would like to support documents whose source text comes from multiple files. For instance, the Docutils documentation could be considered a single large document; parsing all files into one single document tree would enable us to do cross-linking between parts of the documentation (our current way to cross-link between files is to link to HTML files and fragments, which is limited to HTML). Another example is a reference manual for a customized software system. The manual is built from a set of sub-documents that may differ from installation to installation.

Note that this issue is separate from that of output to multiple files; after implementing support for multiple input files, all we will be able to do is to generate a single huge output file.

This is a collection of notes and semi-random thoughts (many of which are credited to David, from IM conversations). Feel free to add yours and/or make changes as you see fit!

You can also discuss this proposal on the Docutils-develop mailing list, or reach us individually via email or Jabber/Google Talk at LeWiemann@gmail.com and dgoodger@gmail.com, respectively.

2 Terminology

Right now, we are using the following terminology: A document which includes other documents (using the subdocument directive described below) is called a super-document. The included documents are called sub-documents. Sub-documents can in turn include other documents and can thus be super-documents themselves. Any super-document at the top of the hierarchy, i.e. one that is not itself included by any other document, is called a compound document.

The set of all documents that can be part of a compound document is the document set (or docset). The directory that is a common ancestor of all documents in the document set is the docset root.

3 The subdocument Directive
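
The directive's syntax and behavior are not specified in these notes yet. As an illustration only, a directive taking the sub-document's path as its single argument and deferring the actual parsing to transform time (as discussed in the Caching section below) might be sketched like this; the node type and all names are hypothetical:

    from docutils import nodes
    from docutils.parsers.rst import Directive, directives

    class subdocument(nodes.General, nodes.Element):
        """Hypothetical placeholder node; a later transform would
        replace it with the parsed sub-document (cf. section 6.3)."""

    class Subdocument(Directive):
        """Illustrative sketch only: record the sub-document's path
        and defer the actual parsing to transform time."""
        required_arguments = 1          # path of the sub-document

        def run(self):
            return [subdocument(path=self.arguments[0])]

    directives.register_directive('subdocument', Subdocument)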

4 Cross-References

A major issue to think about is how to do cross-references (colloquially known as xrefs) between files. Things like local references or substitutions should not be shared between files (their definitions can simply be loaded using the "include" directive). However, sharing external targets and thereby allowing cross-references between files is one of the major features of an architecture that supports multiple input files.

(Implementation note: For this to work, we need to apply transforms to sub-documents; basically, all transforms but the one resolving external references should be applied. This means that a new reader instance must be created. r5266 makes applying transforms to sub-documents possible by pulling the responsibility for applying transforms out of the Publisher.)
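
A rough sketch of how this could look, using Docutils' existing get_transforms/add_transforms machinery; that ExternalTargets (a real Docutils transform) is exactly the one to exclude is an assumption:

    from docutils.transforms.references import ExternalTargets

    def apply_subdocument_transforms(document, components):
        """Apply the components' transforms to a sub-document, leaving
        external-reference resolution to the compound document."""
        transforms = []
        for component in components:        # e.g. (reader, parser)
            transforms.extend(component.get_transforms())
        # Assumption: ExternalTargets is the transform to skip here.
        transforms = [t for t in transforms if t is not ExternalTargets]
        document.transformer.add_transforms(transforms)
        document.transformer.apply_transforms()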

Issues arise once we think about how to group target names into namespaces. Unfortunately, simply putting all targets into a global, document-wide namespace is bound to cause collisions; files that were processable stand-alone are no longer processable when used in conjunction with other files because their target names collide.

Since unqualified links to targets outside the scope of the current sub-document have significant disadvantages (see the rejected proposal in section 6.1), we will need some form of qualifiers.

4.1 Namespace Identifiers

This makes it necessary to add a notion of namespace identifiers.

It is possible to always name the namespace of the current file explicitly (as is done in C++). For instance, ".. namespace:: frob" at the beginning of a file could declare that the namespace of the current file is called "frob". However, this is a little verbose, as it adds a line at the top of each file. Also, it removes the reader's ability to easily look up the referenced files (you might not know which file(s) declare the "frob" namespace).

On the other hand, namespace names could also be derived from paths and file names. (Note though that these two options need not be mutually exclusive.) Since using only the file name would cause ambiguity, it is necessary to include its path in the namespace name. For instance, the file docs/dev/todo.txt could be referenced by the implicit namespace identifier docs/dev/todo.txt; a reference would look like `<docs/dev/todo.txt> large documents`_. Using paths relative to the current file makes it hard to move files or document parts. Therefore, we need to establish the notion of a docset root, relative to which all path names are interpreted.
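
Deriving such an implicit identifier is straightforward; for illustration (function name assumed):

    import os

    def namespace_id(file_path, docset_root):
        """Illustrative: derive a file's implicit namespace identifier
        from its path relative to the docset root."""
        rel = os.path.relpath(os.path.abspath(file_path),
                              os.path.abspath(docset_root))
        return rel.replace(os.sep, '/')

    # e.g. namespace_id('/project/docs/dev/todo.txt', '/project')
    # returns 'docs/dev/todo.txt'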

The docset root could be specified using a "docset-root" directive at the top of each sub-document that uses external named references. On the other hand, perhaps we do not need to know the docset root until we process the compound document (in which case it can be implicitly derived from the location of the master file). So let's hold off on implementing a "docset-root" directive until the need arises.

4.2 Namespace Aliases

A general disadvantage of using paths as namespace identifiers is that changes in the directory structure cause a massive amount of changes in the reStructuredText files, because all the paths need to be updated. This is not any worse than the current situation. However, to improve maintainability it would be desirable to make the namespace of an often-referenced file known under a shorter name. (The shorter namespace identifier should only be valid within the file where it is declared, and possibly its sub-documents.) For instance, one could make "docs/ref/restructuredtext.txt" known as "spec" using syntax like the following:

.. namespaces::

   :spec: docs/ref/restructuredtext.txt

Namespace aliases can also be used to make one namespace refer to different physical files depending on the super-document. Namespace definitions should therefore be inherited from super-documents by sub-documents. The "namespaces" directive overrides namespace definitions inherited from super-documents, unless the :inherit: option is specified. (The :inherit: option thus makes it possible to provide default paths for namespace aliases, which can still be overridden in super-documents.)
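
For illustration, the intended precedence could be modeled with a simple mapping chain (function and option names are assumptions):

    from collections import ChainMap

    def effective_aliases(own, inherited, inherit_option=False):
        """Illustrative: compute the alias mapping in effect for a
        sub-document, given its own "namespaces" definitions and those
        inherited from super-documents."""
        if inherit_option:
            # :inherit: makes the local definitions mere defaults.
            return dict(ChainMap(inherited, own))
        # Default: local definitions override inherited ones.
        return dict(ChainMap(own, inherited))

    # effective_aliases({'spec': 'local.txt'},
    #                   {'spec': 'docs/ref/restructuredtext.txt'})
    # returns {'spec': 'local.txt'}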

4.3 Qualifier Syntax

Angled brackets:

`<namespace> target`_

This is similar to the syntax for embedded URIs (`target <URI>`_). It fits well into the existing syntax.

5 Implementation

In order to parse sub-documents, we need to create new parser instances.

For now, we'll instantiate them by calling parser.__class__(); in the long run the reader, parser, and writer parameters of the Publisher should be turned into classes (or callbacks) instead of instances.

The Parser must know about the Reader (or about the Publisher) and call Reader.parse_[sub]document in order to parse sub-documents.
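
A minimal sketch of this recursion, assuming the reader holds a parser instance and using current Docutils helpers; parse_subdocument echoes the note above and is not an existing API:

    from docutils import frontend, utils

    def parse_subdocument(reader, path):
        """Parse `path` into its own document tree, using a fresh
        instance of the reader's parser class (as proposed above)."""
        parser = reader.parser.__class__()
        settings = frontend.get_default_settings(parser)
        document = utils.new_document(path, settings)
        with open(path, encoding='utf-8') as f:
            parser.parse(f.read(), document)
        return document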

6 Dumpster

You can stop reading now. This section is only to archive sections we're no longer interested in.

6.1 Rejected Proposal: Local and Global Namespace, no Qualifiers

An obvious solution would be to add a notion of a file-local and a global namespace. When trying to resolve a reference, first the target name is looked up in the local namespace of the current file; if no suitable target is found there, the target name is searched for document-wide, in the global namespace; if the target name exists and is unique within the compound document, the reference can be resolved.
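
In pseudo-Python, the lookup described above would be (names illustrative, not Docutils API):

    def resolve(name, local_targets, global_targets):
        """Illustrative two-level lookup: file-local namespace first,
        then the document-wide (global) one."""
        if name in local_targets:
            return local_targets[name]
        candidates = global_targets.get(name, [])
        if len(candidates) == 1:       # unique in the compound document
            return candidates[0]
        return None                    # missing or ambiguous: unresolvable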

If references to the global namespace are not marked up as such, however, the individual files are no longer processable stand-alone because they contain unresolvable references. While it may be acceptable that external named cross-references do not (fully) work any longer when a file is processed stand-alone, it would be nice to be able to handle unresolved external references somehow (at least by marking them as "unresolvable" in the output), rather than simply throwing an error.

This can be solved by marking external references as such, like this:

`local target name`_
`-> global target name`_

where "local target name" must be a unique target name within the current file, and "global target name" must be a unique target name within the current compound document.

(We would need to explicitly establish a notion of "stand-alone" vs. "full document" processing in this case. But since this proposal is being rejected, I'm not going to explore this further.)

6.1.1 Drawbacks

This approach turns out to have a major drawback though: External named references depend on the context of the containing super-document. However, as Joaquim pointed out, files should be expected to be part of several super-documents. This means that once a file is put into the context of a new document, its external named references might point to non-existing or duplicate targets. This seems like a maintenance problem for complex (large) collections of documentation.

Another peculiarity of this system is that, as long as a file is processed stand-alone, external named references are not associated with the file that defines the target. This brings the advantage that renaming and moving files won't invalidate reference names. On the downside, it lacks clarity for the reader because the file containing a target is often not inferable from the target name (try to guess which file `-> html4css1`_ links to) -- this may be significant since reStructuredText should be readable in its source form.

6.2 Importing Namespaces

While namespaces should generally be available without explicitly importing them (in order to avoid lengthy headers), it would probably be handy to have a means of inserting all targets of another namespace into the current one (similar to Python's "from module import *"). The disadvantage is that it may cause confusion.

Contenders for the syntax:

.. import:: namespace   (Pythonic)

.. import-targets:: namespace   (more verbose)

.. using:: namespace    (like C++)

Or, provided that we use ".. namespace:: short-name <- namespace", and "`namespace -> target`_" as reference syntax, this would be a logical fit:

.. namespace:: <- namespace
`-> target`_                   (instead of `namespace -> target`_)

The advantage of this syntax is that it prohibits importing more than one namespace, which might otherwise cause confusion. Importing a single namespace might still be a handy shortcut, though.

6.3 Caching

In order to be able to regenerate the whole compound document in a timely manner after changing a single file, it is necessary to implement a caching system.

Processing a document is done in the following steps:

  1. For each file in the docset, parse it and turn the target names into file-local IDs (this includes error handling for duplicate target names). Cache the parse tree, the name-to-ID mapping, and the list of all files included using the "include" directive. Skip this step for files whose cache entry's date-stamp is newer than the file's mtime and ctime and all included files' mtimes and ctimes (a sketch of this freshness test follows the list).

    This means that the "subdocument" directive must be resolved at transform time (and not at parse time), because otherwise we cannot store the doctree before the sub-document has been inserted.

  2. For each file, run transforms, resolving external named references using the cached name-to-ID mappings of other files.

  3. Write out the resulting document (currently a single file). (The writer needs to turn namespace/ID pairs into output-file-local IDs.)
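
The freshness test from step 1 might, as a rough sketch, look like this (the entry timestamp and the list of included files are assumed to come from the cache entry):

    import os

    def cache_entry_is_fresh(entry_timestamp, source_file, included_files):
        """Illustrative freshness test: the cache entry must be newer
        than the source file's and every included file's mtime and
        ctime."""
        for path in [source_file, *included_files]:
            status = os.stat(path)
            if max(status.st_mtime, status.st_ctime) >= entry_timestamp:
                return False
        return True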

Processing a file stand-alone is done in the same way, except that steps 1 and 2 are only performed for the file being processed, not for each file in the docset. If other files' cached name-to-ID mappings are not up-to-date (when being accessed in step 2), they should be automatically updated.

All cache entries should be stored in a single docset cache file, in order to avoid LaTeX-like creation of many junk files. Possible names include docutils.cache or docutils.aux, or either of them with a leading dot. The file is stored in the docset root and contains a header and a large pickle string (reading and writing even large strings of pickled data is reasonably fast). In the header of the cache file, store sys.version and docutils.__version__; discard cache files that have the wrong version.
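
A minimal sketch of this layout, with the header written as a small leading pickle (the exact on-disk format is left unspecified by this proposal):

    import pickle
    import sys

    import docutils

    def save_cache(path, entries):
        """Write a version header followed by one big pickle."""
        with open(path, 'wb') as f:
            pickle.dump((sys.version, docutils.__version__), f)
            pickle.dump(entries, f)

    def load_cache(path):
        """Return the cached entries, or None if the file is missing,
        corrupt, or written by a different Python/Docutils version."""
        try:
            with open(path, 'rb') as f:
                if pickle.load(f) != (sys.version, docutils.__version__):
                    return None         # discard stale cache
                return pickle.load(f)
        except (OSError, pickle.UnpicklingError, EOFError):
            return None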

Potential security issue: Since unpickling is unsafe, an attacker could provide a carefully crafted cache file, which is then automatically picked up by Docutils. Remedies: Insert some unguessable system-specific key (generate randomly and store in ~/.docutils.cache.key), and automatically discard cache files that have the wrong key. Or simply place a big warning in the documentation not to accept cache files from strangers.
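
One concrete realization of the key idea (an assumption; the proposal only calls for comparing keys) is to store an HMAC of the pickled payload alongside it, keyed with the random per-user key suggested above, and to discard cache files whose MAC does not verify:

    import hashlib
    import hmac
    import os
    import pickle

    KEY_FILE = os.path.expanduser('~/.docutils.cache.key')

    def get_key():
        """Return the per-user random key, creating it on first use."""
        if not os.path.exists(KEY_FILE):
            with open(KEY_FILE, 'wb') as f:
                f.write(os.urandom(32))     # unguessable random key
        with open(KEY_FILE, 'rb') as f:
            return f.read()

    def sign(payload):
        return hmac.new(get_key(), payload, hashlib.sha256).digest()

    def load_signed(payload, mac):
        """Unpickle only if the MAC matches; reject foreign caches."""
        if not hmac.compare_digest(mac, sign(payload)):
            return None
        return pickle.loads(payload)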

No caching is done if no docset-root is defined (which means that the file being processed is independent and not part of any compound document).

6.4 Implementation

As described in the Caching section above, when processing files stand-alone and resolving their external named references, it may be necessary to process (or re-process) referenced files. Since this happens at transform time, the parser instance is no longer available; it is therefore necessary to create a new instance.

All requests for doctree and name-to-ID mappings should go through the caching system. In case of a miss, the caching system instantiates a parser and (re-)parses the requested file.

In fact, all calls by the standalone reader to the reStructuredText parser should go through the cache. In the case of independent files which are not part of a larger docset, the system always assumes a cache miss.
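
A toy sketch of such a cache front end (names illustrative):

    class DoctreeCache:
        """Illustrative cache front end: all doctree requests go
        through get(); independent files use a disabled cache, so every
        request is a miss and triggers a fresh parser run."""

        def __init__(self, enabled=True):
            self.enabled = enabled
            self._entries = {}

        def get(self, path, parse):
            if self.enabled and path in self._entries:
                return self._entries[path]      # cache hit
            doctree = parse(path)               # miss: (re-)parse the file
            if self.enabled:
                self._entries[path] = doctree
            return doctree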