Web Publishing Formats
Recent Work, Current Problems, and Proposed Work
R. W. Matzen

Web Publishing Background

Open Problems and Proposed Work

Recent Work
 
 

Web Publishing Background

SGML

SGML (Standard Generalized Markup Language) was adopted as an international standard (ISO 8879) in 1986. It is a meta-language scheme for document definition and interchange, rather than a specific markup scheme for document processing. It was primarily designed as a standard to support electronic publishing, which began seriously in the early 1980s with the introduction of the IBM XT and the development of CD-ROM. Prior to SGML there were no widely accepted document interchange standards, and the concept of defining documents structurally was new. Porting text and image data from word processors, desktop publishing systems, and legacy data to the various publishing system formats was expensive.

The structure of almost any kind of document can be described using SGML. Productions called element declarations define the logical (or structural) elements of documents and the contexts in which they can occur. A finite set of element declarations, called a document type definition (DTD), defines the high-level syntax of a set of documents. DTDs are similar to context-free grammars, but the productions are more complex.
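As a rough illustration of how element declarations constrain context, the sketch below encodes a toy DTD in Python. The element names (memo, to, from, para), the TOY_DTD table, and the children_valid helper are all hypothetical and are not taken from any real DTD. Each content model is written as an ordinary regular expression over child element names, which ignores SGML features such as omitted tags, exceptions, and the & connector.

```python
import re

# Hypothetical toy DTD. Each entry plays the role of one element declaration,
# for example:  <!ELEMENT memo - - (to, from, para+)>
# The content model is a regular expression over child element names,
# where every name is followed by a single space.
TOY_DTD = {
    "memo": r"to from (para )+",
    "to": r"",      # no child elements (character data only)
    "from": r"",
    "para": r"",
}


def children_valid(dtd, element, children):
    """Return True if the child element names fit the element's content model."""
    encoded = "".join(name + " " for name in children)
    return re.fullmatch(dtd[element], encoded) is not None


print(children_valid(TOY_DTD, "memo", ["to", "from", "para", "para"]))
print(children_valid(TOY_DTD, "memo", ["to", "para"]))
```

The first call succeeds because the children match the declared sequence; the second fails because the required from element is missing.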

SGML fell short of initial expectations for the following primary reasons.

1. Although the core of SGML (DTDs) is elegant, there are a number of complex features. It was unclear for some time how DTDs conformed (and did not conform) to known formal language models. Significant progress has been made in this area. Some relevant articles are listed under Recent Work, and other contributions can be found in the bibliographies of these papers, as well as in The SGML and XML Web Page, a thorough and well-organized bibliography created and maintained by Robin Cover.

2. SGML defines only the structure/syntax of documents. Attempts to develop a widely accepted meta-standard for specifying the processing/semantics of SGML documents have not yet been completely successful. Part of this problem is inherent in the complexity of the task. A reasonable subset of the task may be achievable, but there is not yet agreement on what is reasonable and/or necessary.

HTML

Most of the documents published on the World Wide Web are in some version of HTML (Hypertext Markup Language). HTML 4.0, adopted in 1997, is the current World Wide Web Consortium (W3C) Recommendation. The syntax of the first version of HTML was roughly modeled on SGML by Tim Berners-Lee. Subsequently, Berners-Lee, Dave Raggett, and others defined later versions of HTML with specific SGML DTDs. However, most Web browsers still do not rigorously implement the HTML DTD specifications. For simple kinds of publishing this level of incorrectness is acceptable; for more serious publishing it is not. As the number, complexity, and nesting of the elements in HTML grow, more formal browser implementations become imperative.

HTML resolves the semantics problem associated with SGML fairly easily, because it consists of a single DTD (actually, three related DTDs). Therefore, a meta-standard approach to defining semantics is not necessary. The semantics (or processing) of the specific HTML elements such as paragraphs, frames, tables, links, etc. are defined (recommended) in the HTML standard. As with defining semantics for programming languages, this is a difficult and imperfect method, but the results are satisfactory. A more formal approach has also been used: Cascading Style Sheets (CSS) specifications have been developed for the HTML DTD, but it is not clear that they are complete.

Another open problem with HTML is the current (and increasing) complexity of the DTD(s).  The complexity of HTML 4.0 is already the equivalent of about 3000 BNF productions (1), and the demand for more element types to support new kinds of processing is inevitable.  In HTML, the recursion induced by allowing for nested elements is both limited and made more complicated by the use of an SGML feature called exceptions:  inclusions and exclusions that provide for elements to be included anywhere within an element or excluded from anywhere within an element.
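To make the effect of exceptions concrete, here is a minimal sketch of how inclusions and exclusions declared on an ancestor reach down to every depth below it. The DECLS table and the allowed helper are hypothetical, and content models are simplified to unordered sets of permitted children. The anchor-like rule (an element a that may not contain another a at any depth) is only loosely modeled on HTML. The sketch assumes that an exclusion takes precedence over an inclusion.

```python
# Hypothetical toy declarations:
#   element: (directly allowed children, inclusions, exclusions)
DECLS = {
    "body": ({"p"}, {"note"}, set()),   # +(note): note may appear anywhere inside body
    "p":    ({"a", "em"}, set(), set()),
    "a":    ({"em"}, set(), {"a"}),     # -(a): no a anywhere inside a
    "em":   ({"a", "em"}, set(), set()),
    "note": (set(), set(), set()),
}


def allowed(open_elements, candidate):
    """Is `candidate` permitted as a child of the innermost open element?

    `open_elements` lists the currently open elements, outermost first.
    """
    direct = DECLS[open_elements[-1]][0]
    included = any(candidate in DECLS[e][1] for e in open_elements)
    excluded = any(candidate in DECLS[e][2] for e in open_elements)
    # Assumption: an exclusion on any open ancestor overrides everything else.
    return (candidate in direct or included) and not excluded


print(allowed(["body", "p", "em"], "a"))        # em directly allows a
print(allowed(["body", "p", "a", "em"], "a"))   # same em, but inside an a
```

The two calls ask the same question about the same element type (em) and get different answers, because the answer depends on the whole stack of open ancestors and not only on the content model of em. This context dependence is why exceptions complicate static analysis of a DTD.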

XML

XML (Extensible Markup Language) was adopted as a W3C Recommendation in 1998. It is a stripped-down version of SGML; it is still a meta-standard (as compared to the fixed DTD approach of HTML), but it does not use exceptions and some other features. Removing exceptions limits the expressive power available to document designers, but it also eliminates some complexity from SGML. Although XML addresses one of the basic problems of SGML (by removing some complex features), it does not solve the other problem: developing and adopting a meta-standard for document processing for SGML/XML. The Extensible Style Language (XSL) was proposed at the time XML was adopted, but it has since neither been completed nor become a universal standard.

XHTML

As W3C continues the development of XML, it is clear that even if all problems with XML are resolved, including the adoption of a suitable standard for processing, there is still a need for the single-DTD approach that HTML currently serves. It is also clear that any new standard that is not compatible with HTML can have only limited success, given the tremendous volume of existing HTML data.

XHTML is a proposal by W3C to define XML DTDs for HTML that are compatible with the existing HTML DTDs (and perhaps extend them), thus providing a path for HTML to eventually be incorporated into XML. The primary roadblock is the exceptions feature. Methods are shown in (1) for converting the HTML DTD(s) into DTD(s) without exceptions (and/or into a regular expression grammar), but the resulting number of productions is very large. The current XHTML DTD(s) were designed heuristically and are considered to be only a temporary solution.
 
 

 

Open Problems and Proposed Work

Although Web publishing appears to be moving full speed ahead, there are existing problems that are growing. The HTML specification (DTD) keeps adding new features to meet evolving Web publishing expectations, and this growth causes exponential growth in the complexity of the DTD. The current HTML DTD is equivalent to 2887 relatively complex BNF productions (1). There are simple and powerful shortcuts for parsing HTML documents, but these shortcuts are not applicable to static analysis of DTDs. Thus, the HTML DTD is difficult for document authors to understand, and it is difficult to define the processing for HTML elements. In (1), an algorithm and a prototype software tool show that the set of possible contexts for the elements of a DTD (such as HTML) is finite and can be shown in tree form. These methods can be used as the foundation for effectively using the existing semantic meta-standards.

XHTML is an attempt to build a bridge from the existing base of HTML to the meta-standard of XML. Because of the complexity involved, the XHTML DTDs have been derived informally and are considered temporary solutions. The results in (1) show that there is a relatively direct path from HTML to XML, but the results are not optimal. Optimizing this algorithm may significantly reduce the size/complexity of the problem by eliminating redundant productions. The software prototype described in that paper can be expanded to include this optimization.

Bridging the gap between HTML and XML is just one aspect of a larger problem: verifying and maintaining compatibility of evolving Web publishing standards. The compatibility problem reduces to a subset problem. Thus, the following two questions are equivalent for practical purposes: 1. Is HTML 3.2 a subset of HTML 4.0? 2. Is every HTML 3.2 document also an HTML 4.0 document?

The methods developed to date show that SGML DTDs (including HTML) can be expressed as regular expression grammars (a form of context-free grammar). In general, the subset problem is not decidable for context-free grammars. However, if the omitted tag feature is not considered, then all elements (nonterminals) are bounded by a begin tag and an end tag. I believe that this restriction makes the subset problem solvable. Sperberg-McQueen outlines an approach that depends on a solution to the exceptions problem. Given the solution to the exceptions problem in (1), I believe that the prototype software tool described therein can be extended to formally and practically solve the subset problem. This would provide an answer to the HTML compatibility question and could also be used to derive/develop DTDs for XHTML that are subset compatible with the HTML DTDs. (We note that there are some inherent restrictions. The HTML DTDs do not have exactly equivalent XML DTDs, only pseudo-equivalent DTDs in which each element type may have multiple instances.)
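One ingredient of such a check can be sketched for a single element. A content model is a regular expression, and inclusion between two regular languages is decidable by a standard product search over their automata. The code below is a minimal sketch of that textbook technique only. It is not the method of (1) or of Sperberg-McQueen, and it does not address nesting, exceptions, or omitted tags. The dfa_subset helper, the hand-built automata, and the element names title and item are all hypothetical.

```python
from collections import deque


def dfa_subset(a, b, alphabet):
    """Return True if every string accepted by DFA `a` is also accepted by DFA `b`.

    A DFA is a tuple (start, accepting_states, transitions), where transitions
    maps (state, symbol) to the next state. A missing entry means "reject".
    State labels must not be None, which is used here for b's dead state.
    """
    start = (a[0], b[0])
    seen = {start}
    queue = deque([start])
    while queue:
        sa, sb = queue.popleft()
        if sa in a[1] and sb not in b[1]:
            return False  # some string reaches acceptance in a but not in b
        for sym in alphabet:
            na = a[2].get((sa, sym))
            if na is None:
                continue  # a rejects on this branch, so there is nothing to check
            nb = b[2].get((sb, sym)) if sb is not None else None
            if (na, nb) not in seen:
                seen.add((na, nb))
                queue.append((na, nb))
    return True


# Hypothetical content models for one element in an "old" and a "new" DTD.
# old: (item)+            new: (title?, item+)
OLD = (0, {1}, {(0, "item"): 1, (1, "item"): 1})
NEW = (0, {2}, {(0, "title"): 1, (0, "item"): 2, (1, "item"): 2, (2, "item"): 2})
SYMBOLS = ["title", "item"]

print(dfa_subset(OLD, NEW, SYMBOLS))
print(dfa_subset(NEW, OLD, SYMBOLS))
```

Here every child sequence valid under the old model is also valid under the new one, but not the reverse, since the new model admits a leading title. A full DTD comparison would also have to account for how elements nest, which is where the bracketing by begin and end tags matters.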
 
 

 

Recent Work

1. Matzen, R. W.  A New Generation of Tools for SGML,  Markup Languages, Theory and Practice, Winter 99, Volume 1, Issue 1,  (MIT Press, 1999), pp. 47-74.

2. Matzen, R. W. and Hedrick, G. E.,  A New Tool for SGML on the World Wide Web. Proceedings of the ACM Symposium on Applied Computing (February, 1998, Atlanta, GA), ACM.

3. Matzen, R. W.,  Unraveling Exceptions. Conference Proceedings: SGML / XML 97 (December, 1997, Washington DC) pp. 289-295.

4. Matzen, R. W., George, K. M. and Hedrick, G. E. A Formal Language Model for Parsing SGML. Journal of Systems and Software, February, 1997, (36): pp. 147-166.

5. Matzen, R. W., Hedrick, G. E., and George, K. M.,  A Model for Studying Ambiguity in SGML Element Declarations.  Proceedings of the 1993 ACM/SIGAPP Symposium on Applied Computing,  ACM, NY, 1993.

6. Matzen, R. W.,  A Formal Language Model for Detecting Ambiguity in SGML,  Ph.D. dissertation, Oklahoma State University, December, 1993.