To cite from ``DocBook -- The Definitive Guide'' (see Further Reading at the end of this section), DocBook provides a system for writing structured documents using SGML or XML. In the following, I shall focus on the XML-variant of DocBook, because the SGML-variant is being phased out.
DocBook has been developed with a slightly different mindset than the systems I discussed in the two previous articles (POD article, LaTeX/latex2html article).
By changing the DTD, almost arbitrary constraints can be imposed on a DocBook document. For example, the organizing committee of a conference might adapt the DocBook DTD in such a way that all the articles in the conference's proceedings have a uniform look and contain all the necessary author information.
The particular features of DocBook mentioned above imply uses of DocBook documents that are not possible, at least not easily, with POD or LaTeX documents.
For example, we load the XML::DOM module into Perl to access XML-compliant documents, and Python ships with the xml.dom module, which has been designed for the same purpose.
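To make the idea of programmatic access concrete, here is a minimal sketch that uses xml.dom.minidom from Python's standard library. The document text and the helper name chapter_titles are made up for this illustration; a real tool would read the DocBook file from disk instead.

```python
from xml.dom.minidom import parseString

# A tiny, made-up DocBook fragment (hypothetical content, for illustration only).
DOCUMENT = """\
<book>
  <chapter id="chapter-introduction">
    <title>Introduction</title>
    <para>This chapter provides a quick introduction.</para>
  </chapter>
  <chapter id="chapter-commands">
    <title>Commands</title>
    <para>...</para>
  </chapter>
</book>
"""


def chapter_titles(xml_text):
    """Return (id, title) pairs for every chapter element in the document."""
    dom = parseString(xml_text)
    result = []
    for chapter in dom.getElementsByTagName("chapter"):
        # The first title found below a chapter is the chapter's own title here.
        title = chapter.getElementsByTagName("title")[0]
        # The title text is the concatenation of the element's text nodes.
        text = "".join(node.data for node in title.childNodes
                       if node.nodeType == node.TEXT_NODE)
        result.append((chapter.getAttribute("id"), text))
    return result


if __name__ == "__main__":
    for chapter_id, title in chapter_titles(DOCUMENT):
        print(chapter_id, "->", title)
```

Because the structure of the document is known in advance, a few lines are enough to pull out a table of contents.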
The World Wide Web Consortium (W3C, http://www.w3c.org) has even defined a language for XML translations, called XSLT (see for example http://www.w3.org/TR/xslt and http://www.oasis-open.org/cover/xsl.html). XSLT itself is a language defined within the XML framework, which makes DocBook/XML and XSL look quite similar: loads of angle brackets.
Popular transformation tools are:
Installing either tool together with the necessary DSSSL or XSL stylesheets is quite tricky, so I recommend that beginners install them from .deb or .rpm packages.
Being general-purpose translators, neither tool is restricted to transforming DocBook documents. If you feed them the right style sheets, they will do other translations, too.
The DocBook/XML syntax resembles HTML. The fundamental difference between the two is the strictness with which the syntax is enforced. Many HTML browsers are extremely forgiving about unterminated elements, and they often silently ignore unknown elements or attributes. DocBook/XML translators reject input that does not comply with the DTD, issue detailed error messages, and refuse to produce any output in such cases.
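The strictness is easy to observe with any XML parser. The sketch below uses Python's xml.dom.minidom; the helper name is_well_formed is my own. Note that minidom only checks that the input is well-formed XML and does not validate against the DocBook DTD, so this illustrates just the first of the checks a DocBook translator performs.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False if the parser rejects it."""
    try:
        parseString(xml_text)
        return True
    except ExpatError:
        # Unlike a forgiving HTML browser, the parser refuses the whole input.
        return False


if __name__ == "__main__":
    print(is_well_formed("<para>Hello</para>"))   # properly terminated element
    print(is_well_formed("<para>Hello"))          # unterminated element
```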
DocBook/XML is spoken in several variants, which differ in how they treat the closing tag of an element. The most verbose dialect always closes <tag> with </tag>. Another variant allows abbreviating the closing tag to </>, and yet another allows dropping the closing tag for empty elements altogether. I prefer writing out every end tag, a style that has proven advantageous in deeply nested structures such as nested lists. So, in this article only the form <tag> ... </tag> will appear.
Special characters are written with the ampersand-semicolon convention as they are in HTML. The most frequently used special characters are
Comments are bracketed between ``<!--'' and ``-->''.
As already mentioned, DocBook documents must adhere to the structure that is defined in a DTD. Every document starts with selecting a particular DTD:
<!DOCTYPE                                          (1)
    book                                           (2)
    PUBLIC "-//OASIS//DTD DocBook XML V4.1//EN"    (3)
    "/usr/share/sgml/db41xml/docbookx.dtd"         (4)
    [ ]                                            (5)
>
where I have broken the expression (from ``<'' to ``>'') into several lines for easier analysis, and added numbers in parentheses for reference.
Part (1) tells the system that we are about to choose our DTD. Part (2) defines the element book to be the root element of our document. Part (3), the public identifier, selects the DTD to use. The public identifier is the string in quotes. The system identifier, part (4), tells the translation tools where to find the DTD on the local computer system. Within the square brackets, part (5), we could place so-called entity definitions, but I do not want to go into detail on entities in this introduction, so we leave this space empty.
Now, we start the text with the root element, in our case book. Which elements go into book is defined in the DocBook DTD. One of them is, for example, chapter. For a comprehensive list of allowed elements, consult ``The Definitive Guide''. The elements allowed within chapter are also defined in the DocBook DTD, as are all elements. The only way to construct a valid document is to obey all the rules prescribed by the DTD.
What might look like a drag at first sight -- Rules? Rules suck! -- is the key to opening up the document to programmatic access. As the document complies with the DTD, all post-processing can rely on that very fact. Good for the programmers of the post-processors! I have to admit that the number of elements and their mutual relationships are tough to pick up. However, the relations are logical: a chapter contains one or more (introductory) paragraphs and one or more Level 1 sections. No section, on the other hand, contains a chapter; that would be nonsense. Having a copy of ``The Definitive Guide'' right next to the keyboard also helps to learn DocBook. Further down, there is a short compilation of commonly used tags.
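The flavor of such containment rules can be sketched in a few lines of Python. The table ALLOWED_CHILDREN below is a hypothetical, heavily simplified stand-in for the DocBook DTD: it only covers the elements used in this article, and it ignores the order and number of children, which a real DTD also prescribes. A validating translator does this job properly; the sketch merely shows why ``a section may not contain a chapter'' is mechanically checkable.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified subset of the nesting rules; the real
# DocBook DTD allows many more elements and also constrains their order.
ALLOWED_CHILDREN = {
    "book": {"bookinfo", "chapter"},
    "bookinfo": {"title"},
    "chapter": {"title", "para", "sect1"},
    "sect1": {"title", "para", "sect2"},
    "sect2": {"title", "para"},
    "title": set(),
    "para": set(),
}


def nesting_errors(element):
    """Return a list of messages for children that the table does not allow."""
    errors = []
    allowed = ALLOWED_CHILDREN.get(element.tag, set())
    for child in element:
        if child.tag not in allowed:
            errors.append(
                "<%s> is not allowed inside <%s>" % (child.tag, element.tag))
        # Descend regardless, so that every violation is reported.
        errors.extend(nesting_errors(child))
    return errors


GOOD = ("<book><chapter><title>T</title><para>p</para>"
        "<sect1><title>S</title><para>p</para></sect1></chapter></book>")
BAD = ("<book><chapter><title>T</title><sect1><title>S</title>"
       "<chapter><title>X</title></chapter></sect1></chapter></book>")

if __name__ == "__main__":
    print(nesting_errors(ET.fromstring(GOOD)))
    print(nesting_errors(ET.fromstring(BAD)))
```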
Here comes a very short, but complete DocBook document.
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1//EN"
    "/usr/share/sgml/db41xml/docbookx.dtd" >

<book>
  <bookinfo>
    <title>XYZ (version 0.8.15) User's Manual</title>
  </bookinfo>

  <chapter id = "chapter-introduction">
    <title>Introduction</title>

    <para>
      This chapter provides a quick introduction to XYZ.
    </para>

    <sect1 id = "section-syntax">
      <title>Syntax</title>

      <para>
        In this section we present an outline of the syntax of the
        XYZ language.
      </para>
    </sect1>

    <sect1 id = "section-core-library">
      <title>Core Library</title>

      <para>
        Even if no additional libraries are loaded to a XYZ program,
        it has access to some core library functions.
      </para>
    </sect1>
  </chapter>

  <chapter id = "chapter-commands">
    <title>Commands</title>

    <sect1 id = "section-interactive-commands">
      <title>Interactive Commands</title>

      <para> ... </para>

      <sect2 id = "section-interactive-commands-argumentless">
        <title>Argumentless Commands</title>

        <para> ... </para>
      </sect2>
    </sect1>

    <sect1 id = "section-non-interactive-commands">
      <title>Non-Interactive Commands</title>

      <para> ... </para>

      <sect2 id = "section-non-interactive-commands-argumentless">
        <title>Argumentless Commands</title>

        <para> ... </para>
      </sect2>
    </sect1>
  </chapter>
</book>
To help the aspiring DocBook writer make sense of the many elements that the DocBook standard defines, I have compiled a list of useful, frequently used tags.
Root section tags define the outermost element of any document.
paragraphs or chapters
paragraphs or level 1 sections
Sectioning elements divide the document into logical parts like chapters, sections, paragraphs, and so on.
paragraphs or level N+1 sections
Define a section. Commonly, chapter and section elements carry the id attribute, which allows for referencing the elements with, for example, <xref linkend = "label"></xref>.
Group several lines of text together to form a paragraph. This is the workhorse element in many documents.
Render a longish piece of program text -- preserving the line breaks. The program is assumed to be written in the language specified in the role attribute. Note that within programlisting all special characters retain their meaning! This means in particular that you cannot use the control characters ``<'', ``>'', and ``&'' inside of it. There are several workarounds for this problem. Either you replace all control characters with their mnemonic equivalents (``&lt;'', ``&gt;'', and ``&amp;'' in our example), or you wrap the program code in a CDATA section, like, for example,
<programlisting>
<![CDATA[
cout << "value = <" << &p << ">\n";
]]>
</programlisting>
or, if the program is stored in the file my-program.pl, pull in the whole file with
<programlisting>
  <inlinemediaobject>
    <imageobject>
      <imagedata format = "linespecific" fileref = "my-program.pl"></imagedata>
    </imageobject>
  </inlinemediaobject>
</programlisting>
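If the listings are generated by a script, the first two workarounds can be automated. The following sketch uses escape from Python's xml.sax.saxutils; the function names as_programlisting and as_cdata_listing are my own invention and not part of any DocBook tool.

```python
from xml.sax.saxutils import escape


def as_programlisting(code):
    """Wrap code in a programlisting, replacing &, < and > by entities."""
    return "<programlisting>" + escape(code) + "</programlisting>"


def as_cdata_listing(code):
    """Wrap code in a programlisting using a CDATA section."""
    # A CDATA section ends at the first "]]>", so such code needs escaping instead.
    if "]]>" in code:
        raise ValueError("code contains ']]>' and cannot go into one CDATA section")
    return "<programlisting><![CDATA[" + code + "]]></programlisting>"


# The C++ line from the example above; the raw string keeps the backslash as is.
CODE = r'cout << "value = <" << &p << ">\n";'

if __name__ == "__main__":
    print(as_programlisting(CODE))
    print(as_cdata_listing(CODE))
```

Both variants yield the same text once an XML parser has read them back in; which one to prefer is largely a matter of how readable the source file should remain.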
Generate the three typical types of lists.
The items or definitions are typically formed by one or more paragraphs, but they are allowed to contain program listings, too. The terms usually are one or more words, not paragraphs.
Highlight a short part of the document; usually a single word.
Mark a word as a filename.
<literal role = "classification">literal something</literal>
Mark a word as being a literal expression. Use this tag only as a last resort, when no more specific tag matches. To ease one's conscience, literal often gets decorated with a role attribute, which describes the kind of literal more precisely.
Mark a meta-variable.
Give a name to a section or a formal element, like a table.
Cross references refer to other parts of the same DocBook document or to other documents on the World Wide Web. Targets of the former are all elements that carry an id attribute; targets of the latter are selected with uniform resource locators (URLs).
Install a (hyper-)link to the spot identified via target within the current document.
Install a hyper-link to a WWW-accessible document identified by a complete URL. A complete URL includes the protocol, for example,
Install a (hyper-)link to the spot identified via target within the current document. A translator will add text around an xref element. For example, an xref to a section might be decorated with the text ``
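Since every internal cross reference must point at an existing id attribute, dangling references can be found mechanically. Here is a minimal sketch with Python's xml.etree.ElementTree; the function name dangling_linkends and the sample text are assumptions made for this illustration, and only xref elements are examined.

```python
import xml.etree.ElementTree as ET


def dangling_linkends(xml_text):
    """Return the linkend values of xref elements that match no id attribute."""
    root = ET.fromstring(xml_text)
    # Collect every id that occurs anywhere in the document.
    ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    return [el.get("linkend") for el in root.iter("xref")
            if el.get("linkend") is not None and el.get("linkend") not in ids]


# Made-up sample: one reference resolves, the other one does not.
SAMPLE = """\
<chapter id="chapter-introduction">
  <title>Introduction</title>
  <para>
    See <xref linkend="section-syntax"></xref> and
    <xref linkend="section-missing"></xref>.
  </para>
  <sect1 id="section-syntax">
    <title>Syntax</title>
    <para>...</para>
  </sect1>
</chapter>
"""

if __name__ == "__main__":
    print(dangling_linkends(SAMPLE))
```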
Ugh, I left out tons of stuff, but only to give you a smooth, non-frightening introduction. Some great things DocBook handles that I have not discussed are
Also left out is everything related to changing the DTD or changing the style sheets.
Next month: Texinfo