Declarative and Operational Document Encodings

Document markup languages have slowly evolved from procedural to declarative markup. Powerful formatters such as troff and TeX merely allow the text of the document to be interspersed with formatting commands. If the document has to be formatted by another system, or for a different medium, all of the markup has to be changed. Macro facilities solve part of the problem, because some of the low-level formatting decisions can be localized in the macro definitions. Macro packages such as LaTeX provide high-level markup constructs which allow for a more declarative style by emphasizing the structure of the text instead of the physical document layout.

Content, Structure and Layout

To enhance interoperability, that is, the ability to edit, browse and typeset documents for different media on different platforms, one has to make a clear distinction between a document's contents, how these contents are structured, and the mapping of structure to formatting instructions.

Content data

Encoding mechanisms used to represent the content data largely depend on the medium used. Text encoding schemes, for example, have to be able to deal with multiple (international) character sets. Similarly, mechanisms used to represent graphics, images, audio and video data differ in the resolution and compression technique used. Because continuous media streams are encoded as sequences of digital samples, they have an inherent structure representing the temporal dimension that static media lack.

Declarative markup -- structure

Mechanisms to encode the logical structure of a document should be able to describe several relations between the various (data) parts of the document. Common examples are the part-whole relation, used to build a hierarchical structure (e.g. article, section, subsection, paragraph), and references, used for bibliographical citations, footnotes, annotations and other links. For text-based applications, tagging has proved to be an effective mechanism, but even if other media types are involved (e.g. images in HTML) tagging can still be very useful.

Style sheets -- layout

The information provided by the content data and declarative markup needs to be augmented with information about how the various elements should be mapped onto an output device. Rather than embedding such information in the document or in the processing application, it is often better to store it independently, in a separate style sheet.

While the concept of a style sheet can be found in many of today's word processors, style sheets are only slowly finding their way onto the Web. At the moment, various schemes and style sheet languages are being discussed by a W3C [W3-Style] working group in order to find a mechanism which meets the needs of both authors and readers.

Style sheets have been an issue for SGML-based document processing as well. While the SGML standard was originally published back in 1986, there still is no definitive language addressing the need for standardized style sheets. The Document Style Semantics and Specification Language [DSSSL] is still a draft international standard. As a result, many SGML vendors have developed their own style sheet language, e.g. the one used by SoftQuad's Panorama Pro [Panorama].

SGML -- basic concepts

SGML (Standard Generalized Markup Language) [SGML] is an ISO standard which uses tagging to encode the logical structure of a document. The set of tags needed to describe a specific document structure depends on the type of documents involved. For example, the tags needed to mark up a mathematical article are in general not suitable to describe the structure of a telephone book. As a consequence, SGML does not prescribe a set of tags, but a way to define an appropriate set of tags and the order in which these tags should appear in the document instance. Such a definition is called a document type definition or DTD. Note that the idea of defining a fine-tuned markup language for each document type is the exact opposite of the "one size fits all" approach of current HTML-based Web browsers. Many other advantages of having SGML-aware Web browsers (and servers) are discussed in [SperGold94].

An SGML document essentially consists of two parts: a prolog, containing the document type declaration, followed by the document instance, containing the data interspersed with markup.

Document instance

A document instance is a hierarchical structure of (possibly empty) elements, where each non-empty element contains other elements or character data. Each element has a name (the general identifier), and the start and end of an element are indicated by tags (typically <name>content</name>). Start and end tags may be mandatory or optional. Moreover, elements carry zero or more attributes. Consider the example document below. The root element memo has one attribute, indicating the security level of the document. The memo element contains five other elements: two to elements stating the addressees, a from element specifying the sender, a subject element and a body element. All elements of memo contain character data and no other elements. Note that the tags emphasize the logical structure of the document rather than stating how it should be formatted.
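A memo of the shape described above might look as follows. This is only a minimal sketch: the element names follow the description, but the attribute value, the names of the people and the text are hypothetical.

```sgml
<!-- Root element with one attribute giving the security level -->
<memo security="public">
<!-- Two to elements: one per addressee -->
<to>Alice</to>
<to>Bob</to>
<from>Carol</from>
<subject>Project meeting</subject>
<body>The meeting has been moved to Friday.</body>
</memo>
```

Nothing in this instance says how a subject line or a body should be rendered; that decision is left to a style sheet.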

Document type declaration

The document instance above must be preceded by a doctype declaration. The main part of the document type declaration is the document type definition or DTD. It defines the elements of a document and the required order of their sub-elements. The elements and their contents are defined by element declarations. The second line declares the memo element, and defines its content as a sequence of one or more to elements, followed by a from, a subject and a body element. The two O characters stand for "omit", indicating that both the start and the end tag may be omitted. The third line defines the elements containing character data only. Their start tags are mandatory, as indicated by the "-". The list of attributes of each element is declared by an attlist declaration. Attributes can be of different types, and can be mandatory or optional. The DTD can specify a default value, as is shown in the case of the security attribute. The DTD may be contained within the doctype declaration, but is typically stored in a separate file. In that case, the doctype declaration contains a reference to that file.
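A doctype declaration matching this description can be sketched as follows, to be placed in front of the memo instance. The permitted values of the security attribute, and the choice to let the end tags of the character data elements be omitted, are assumptions made for the sake of illustration.

```sgml
<!DOCTYPE memo [
<!ELEMENT memo                   O O (to+, from, subject, body)>
<!ELEMENT (to|from|subject|body) - O (#PCDATA)>
<!-- Enumerated attribute; "public" is the default value (hypothetical values) -->
<!ATTLIST memo security (public|internal|secret) "public">
]>
```

Because security has a default value, a memo start tag without the attribute is treated as if security="public" had been written.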

Processing instructions

Processing instructions are used to pass system-dependent information to the application, telling it how the document is to be processed. Processing instructions are delimited by "<?" and ">" and can appear at arbitrary places within the document. In the following examples, we employ processing instructions to indicate the URL of a style sheet.
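As a minimal sketch, a processing instruction naming a style sheet could be written as below. SGML does not define the content of a processing instruction, so the keyword stylesheet and the URL are hypothetical conventions that the receiving application would have to understand.

```sgml
<!-- Passed to the application as-is; the SGML parser does not interpret it -->
<?stylesheet "http://www.example.org/styles/memo.style">
<memo security="public">
<to>Alice</to>
<from>Carol</from>
<subject>Project meeting</subject>
<body>The meeting has been moved to Friday.</body>
</memo>
```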

Entities

Fragments of markup and character data can be given a name using an entity declaration. The declaration of an entity is part of the document type declaration, but the entity may be used within the document instance. The contents of an entity may be defined by a string, or may be contained in an external file, in which case the entity declaration contains a reference to the file. External entities may be referenced by a system identifier. Support for these identifiers is system-dependent and may include filenames, URLs and database queries. Consider a variant of the previous example. The first line references an external DTD by means of a system identifier, and the second line defines an entity for later use. Note that the entity definition is enclosed within the square brackets of the doctype declaration. The third line contains a processing instruction to define the location of the style sheet. In contrast to the URL of the DTD, which is resolved by the parser, the URL of the style sheet is passed to the application without further processing by the parser.
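The variant described above might be sketched like this. Both URLs, the entity name sender, its replacement text and the stylesheet keyword are hypothetical and serve only to show where each construct goes.

```sgml
<!DOCTYPE memo SYSTEM "http://www.example.org/dtds/memo.dtd" [
  <!ENTITY sender "Carol Example"> ]>
<?stylesheet "http://www.example.org/styles/memo.style">
<memo>
<to>Alice</to>
<!-- The entity reference is replaced by the declared string when parsed -->
<from>&sender;</from>
<subject>Project meeting</subject>
<body>The meeting has been moved to Friday.</body>
</memo>
```

The parser fetches the DTD named on the first line and expands &sender; itself, whereas the string inside the processing instruction reaches the application untouched.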

To avoid system-dependent identifiers such as filenames, an extra level of indirection is provided by the concept of public identifiers. These identifiers are assumed to be publicly known, and the SGML parser of the target application is expected to be able to resolve them. Typically, a local catalog file is used to map public identifiers onto system-dependent ones. Formal public identifiers have a standardized and meaningful inner structure, to facilitate automatic resolution without the use of catalogs. In the following examples, we refer to the HTML 3.0 DTD by means of a formal public identifier.
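A doctype declaration using a formal public identifier can be sketched as follows. The identifier shown is, to the best of our knowledge, the one suggested in the HTML 3.0 draft DTD; it should be treated as an assumption rather than as an authoritative value.

```sgml
<!-- Fields of the identifier, separated by "//":                      -->
<!--   "-"            the owner is not a registered owner              -->
<!--   "W3O"          owner identifier                                 -->
<!--   "DTD ..."      public text class, followed by a description     -->
<!--   "EN"           language of the public text                      -->
<!DOCTYPE HTML PUBLIC "-//W3O//DTD W3 HTML 3.0//EN">
```

A parser that relies on a catalog would look this string up in an entry that pairs the public identifier with a system identifier, for instance a local filename such as html3.dtd (a hypothetical name).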

Encoding Hypermedia Documents

The separation of content, structure and layout has proven to be effective for text-oriented document processing, but it is not clear whether this approach scales to time-dependent and highly interactive hypermedia documents. Synchronization constraints, user interaction and hyperlink traversal are important characteristics of such documents, but it depends on the application whether these issues belong to the style, the structure or even the contents of the document.

While many of these issues can be defined in the structure of a document, SGML lacks standard mechanisms to do so. The Hypermedia/Time-Based Structuring Language (HyTime) [HyTime] is a relatively new ISO standard defining syntactic SGML conventions that represent semantics for hyperlinking, addressing, aligning and scheduling of multimedia documents. We discuss these issues further in a later section.