Document markup languages have slowly evolved from procedural to declarative markup. Powerful formatters such as troff and TeX merely allow the text of the document to be interspersed with formatting commands. If the document has to be formatted by another system, or for a different medium, essentially all of the markup has to be changed. Macro facilities solve part of the problem, because some of the low-level formatting decisions can be localized in the macro definitions. Macro packages such as LaTeX provide high-level markup constructs that allow for a more declarative style by emphasizing the structure of the text instead of the physical document layout.
To enhance interoperability, that is, the ability to edit, browse and typeset documents for different media on different platforms, one has to make a clear distinction between a document's contents, how these contents are structured, and the mapping of structure to formatting instructions.
The encoding mechanisms used to represent the content data largely depend on the medium used. Text encoding schemes, for example, have to be able to deal with multiple (international) character sets. Similarly, the mechanisms used to represent graphics, images, audio and video data differ in the resolution and compression technique used. Because continuous media streams are encoded as sequences of digital samples, they have an inherent structure representing the temporal dimension that static media lack.
Mechanisms to encode the logical structure of a document should be able to describe several relations between the various (data) parts of the document. Common examples are the part-whole relation, used to build a hierarchical structure (e.g. article, section, subsection, paragraph), and the reference relation, used for bibliographical references, footnotes, annotations and other links. For text-based applications, tagging has proved to be an effective mechanism, and even when other media types are involved (e.g. images in HTML) tagging can still be very useful.
The information provided by the content data and declarative markup needs to be augmented with information about how the various elements should be mapped onto an output device. Rather than embedding such information in the document or in the processing application, it is often better to store it independently, in a separate style sheet.
While the concept of a style sheet can be found in many of today's word processors, style sheets are only slowly finding their way into the Web. At the moment, various schemes and style sheet languages are being discussed by a W3C
Style sheets have been an issue for SGML-based document processing as well. While the SGML standard was originally published back in 1986, there still is no definitive language addressing the needs for standardized style sheets. The Document Style Semantics and Specification Language
SGML (Standard Generalized Markup Language)
An SGML document essentially consists of two parts: a prolog, containing the document type declaration, followed by the document instance, containing the data interspersed with markup.
A document instance is a hierarchical structure of (possibly empty) elements, where each non-empty element contains other elements or character data. Each element has a name (the general identifier), and the start and end of an element are indicated by tags (typically <name>content</name>). Start and end tags may be mandatory or optional. Moreover, elements carry zero or more attributes. Consider an example memo document. The root element memo has one attribute, indicating the security level of the document. The memo element contains five other elements: two to elements stating the addressees, a from element specifying the sender, a subject element and a body element. All elements of memo contain character data and no other elements. Note that the tags emphasize the logical structure of the document rather than stating how it should be formatted.
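The memo described above might be marked up as follows. This is only a sketch reconstructed from the description: the element and attribute names come from the text, while the attribute value and all character data are invented for illustration.

```sgml
<memo security="internal">
<to>A. Reader</to>
<to>B. Writer</to>
<from>C. Editor</from>
<subject>Style sheet draft</subject>
<body>The first draft of the style sheet is ready for review.</body>
</memo>
```

Nothing in this markup says how a to or subject element should look on paper or on screen. That decision is left to whatever maps the structure to formatting instructions.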
The document instance must be preceded by a document type (doctype) declaration. The main part of the doctype declaration is the document type definition, or DTD. It defines the elements of a document and the required order of their sub-elements. The elements and their contents are defined by element declarations. In the DTD for the memo document, the second line declares the memo element and defines its content as a sequence of one or more to elements, a from, a subject and a body element. The two 'O' characters stand for "omit", indicating that both the start tag and the end tag may be omitted. The third line defines the elements containing character data only. Their start tags are mandatory, indicated by the '-'. The list of attributes of each element is declared by an attlist declaration. Attributes can be of different types, and can be mandatory or optional. The DTD can specify a default value, as is shown in the case of the security attribute. The DTD may be contained within the doctype declaration, but is typically defined in a separate file. In that case, the doctype declaration contains a reference to that file.
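A doctype declaration matching this description could look roughly like the sketch below. The content model and the tag omission indicators for memo follow the text. The 'O' given for the end tags of the character data elements and the permitted values of the security attribute are assumptions made for illustration.

```sgml
<!DOCTYPE memo [
<!ELEMENT memo                 O O (to+, from, subject, body)>
<!ELEMENT (to|from|subject|body) - O (#PCDATA)>
<!-- The value list and the default "public" are assumed, not taken from the text -->
<!ATTLIST memo security (public|internal|secret) public>
]>
```

With this attlist declaration, a memo start tag without a security attribute is treated as if it had been written security="public".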
Processing instructions are used to pass system-dependent information to the application, telling it how the document is to be processed. Processing instructions are enclosed within '<?' and '>' and can appear at arbitrary places within the document. In the following examples, we employ processing instructions to indicate the URL of a style sheet.
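As a small illustration, the sketch below places two processing instructions in a memo: one before the root element and one inside character data. SGML does not define the content of a processing instruction, so the names stylesheet and newpage and the URL are hypothetical conventions that a particular application would have to understand.

```sgml
<?stylesheet "http://www.example.org/styles/memo.sty">
<memo security="internal">
<to>A. Reader</to>
<to>B. Writer</to>
<from>C. Editor</from>
<subject>Style sheet draft</subject>
<body>The first draft is ready for review.
<?newpage>
Comments are welcome until Friday.</body>
</memo>
```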
Fragments of markup and character data can be given a name using an entity declaration. The declaration of an entity is part of the document type declaration, but the entity may be used within the document instance. The contents of an entity may be defined by a string, or may be contained in an external file, in which case the entity declaration contains a reference to the file. External entities may be referenced by a system identifier. Support for these identifiers is system-dependent and may include filenames, URLs and database queries. Consider a variant of the previous example: the first line references an external DTD by means of a system identifier, and the second line defines an entity for later use. Note that the entity definition is enclosed within the square brackets of the doctype declaration. The third line contains a processing instruction that defines the location of the style sheet. In contrast to the URL of the DTD, which is resolved by the parser, the URL of the style sheet is passed to the application without further processing by the parser.
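The variant described above might be written as in the following sketch. The three-line layout follows the description. Both URLs, the entity name deadline, its replacement text and the stylesheet processing instruction syntax are hypothetical.

```sgml
<!DOCTYPE memo SYSTEM "http://www.example.org/dtds/memo.dtd" [
  <!ENTITY deadline "Friday noon"> ]>
<?stylesheet "http://www.example.org/styles/memo.sty">
<memo security="internal">
<to>A. Reader</to>
<to>B. Writer</to>
<from>C. Editor</from>
<subject>Style sheet draft</subject>
<body>Comments are welcome until &deadline;.</body>
</memo>
```

The parser replaces the entity reference &deadline; in the body with the declared string, so the application sees the expanded text.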
To avoid system-dependent identifiers such as filenames, an extra level of indirection is provided by the concept of public identifiers. These identifiers are assumed to be publicly known, and the SGML parser of the target application is expected to be able to resolve them. Typically, a local catalog file is used to map public identifiers onto system-dependent ones. Formal public identifiers have a standardized and meaningful inner structure, to facilitate automatic resolution without the use of catalogs. In the following examples, we refer to the HTML 3.0 DTD by means of a formal public identifier.
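To illustrate the inner structure of a formal public identifier, the sketch below uses a hypothetical identifier for the memo DTD rather than the HTML 3.0 identifier mentioned above. The owner name "Example Org" and the description "Memo 1.0" are invented.

```sgml
<!DOCTYPE memo PUBLIC "-//Example Org//DTD Memo 1.0//EN">
```

The fields are separated by double slashes: the leading "-" marks an unregistered owner, "Example Org" names the owner, "DTD" is the public text class, "Memo 1.0" is the description of the text, and "EN" is the language in which it is written.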
The separation of content, structure and layout has proven to be effective for text-oriented document processing, but it is not clear whether this approach scales to time-dependent and highly interactive hypermedia documents. Synchronization constraints, user interaction and hyperlink traversal are important characteristics of such documents, but it depends on the application whether these issues are part of the style, the structure or even the contents of the document.
While many of these issues can be defined in the structure of a document, SGML lacks standard mechanisms to do so. The Hypermedia/Time-Based Structuring Language (HyTime)