by Jim O'Donnell
HTML pages are like annotated bibliographies: they give you the opportunity to expand on an endless variety of topics and present additional factual or thematic resources to further explore a subject.
Of course, HTML pages are also like gossip magazines: sooner or later you'll see just about everything on them.
But regardless of how anyone perceives HTML, everyone who uses it speaks the same language. Elements, tags, anchors, hyperlinks, URLs, and attributes: they're all part of the lexicon of the Web's documents. To create inspired Web pages (and to cast a critical eye on those already on the Web), you need to have an intimate familiarity with the building blocks of HTML.
This chapter answers the following questions:
How is HTML related to SGML?
What is a DTD?
What's the difference between empty and container elements?
What are the basic components of HTML?
HTML is an application of SGML (Standard Generalized Markup Language); that is, HTML is a markup language defined using SGML. SGML documents are more complex and programming-like than HTML documents. Figure 4.1 shows how an SGML document describes the HTML standard (the figure is, in fact, the SGML declaration for HTML, the SGML document that defines HTML).
Figure 4.1 : SGML coding provides machine-level display format and function commands.
HTML resembles simplified SGML. The observation that SGML is to
HTML as HTML is to plain text seems reasonable on the surface. When you take a
look under the hood, though, it's easy to see how HTML shares the advantages of
both systems of marking text.
|
Troubleshooting |
|
In his World Wide Web Research Notebook, Daniel Connolly outlines the advantages and disadvantages of carrying over SGML practices and constructs into the current HTML standard.
These are the benefits of using SGML to define HTML:
Basing HTML on SGML makes it easy to test whether or not an HTML document conforms to the current standard. Document authors can have confidence in their documents that pass automatic verification processes.
The SGML definition for HTML defines a document form called the Element Structure Information Set (ESIS). This form allows a standard interpretation of all HTML documents.
Like HTML, SGML provides a clear and widely supported standard for creating interchangeable documents.
These are the disadvantages of using SGML to define HTML:
SGML coding is meant to be interpreted at the machine level, and SGML documents are difficult for people to read and understand. As a result, an HTML standard based on SGML is hard to learn just by reading it.
Because SGML documents are structurally complex, it's possible to read related SGML documents and draw incorrect conclusions about SGML usage and the standards they define.
SGML is defined at a level of complexity beyond the function and purpose of HTML, and certain modular capabilities that use SGML are too complex for the level of author manageability HTML strives to provide.
HTML's strength comes from its combination of SGML machine-level constructs (the tags and elements that tell a viewer the purpose of document text) and standard English text markup notation.
For example, the <B> container tag is mnemonically correct (it stands for bold), and it signals a format change to the document's viewing software, which changes the display format of the following text. When the viewer comes across the </B> closing tag, which tells it to turn off the bold attribute, it returns to the previous text formatting.
The versatility of SGML and HTML is becoming widely acknowledged
as they are adopted as hypertext document standards by more content managers,
including the federal and many state governments.
|
Creating the Standards |
|
It's debatable who has contributed more to the "acronymization" of our culture. In a world where ATM can have two totally different meanings (one's great for convenience banking and the other for high-speed data networking), you might expect a language like HTML (itself an acronym) to continue the tradition.
And it does. From its elements (UL stands for, appropriately enough, unordered list) to its parent language, SGML, HTML is defined by acronyms. An acronym defines HTML as well: HTML's DTD.
DTD stands for Document Type Definition. It's a document that describes the HTML language, its elements, and their legal uses. The HTML DTD has many levels that pertain to different categories of use or compatibility with the HTML standard. These levels are:
Level 0. Minimal conformance to or use of HTML elements.
Level 1. Conformance to (or use of) HTML with Level 1 extensions.
Level 2. Conformance to (or use of) HTML with Level 2 extensions.
The HTML DTD is written in SGML and can be difficult to interpret. Figure 4.2 shows a portion of the HTML DTD for Level 0 (for the complete DTD, see appendix A, "HTML Tags"). The document coding is complex and difficult to read; it's meant to be read primarily by SGML interpreters, not by people. Don't be surprised if it makes no sense to you; it doesn't to the vast majority of people.
Annotated versions of the HTML DTD make it easier for developers and end users to verify conformity issues. Daniel Connolly maintains one popular version, and you can find it at:
The Web sites listed in appendix D, "WWW Bibliography," collect other descriptions of the various HTML standards.
It is possible to check your HTML documents for conformance with HTML standards. The Webtechs HTML Validation Service can be found at:
As shown in figures 4.3 and 4.4, you can check for conformance at different levels, and supply the HTML document either as a URL to an existing document (see figure 4.3) or by directly inputting the HTML (see fig. 4.4).
After you submit your URL or HTML code, the Webtechs service analyzes it and returns a report such as the one shown in figure 4.5. If the document conforms to the HTML 2.0 standard, you are invited to include the validation icon on your Web pages.
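To give a feel for what automated checking involves, here is a very rough Python sketch. It is not how the Webtechs service works: a real validator parses the document against the DTD with an SGML parser, while this sketch only reports element names that don't appear in this chapter's element tables. The function name unknown_elements and the sample document (which uses the BLINK and CENTER browser extensions) are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Element names collected from tables 4.1 to 4.3 of this chapter, plus HTML itself.
KNOWN_ELEMENTS = {
    "html", "base", "head", "isindex", "link", "meta", "nextid", "title",
    "a", "address", "b", "blockquote", "body", "br", "cite", "code", "dd",
    "dir", "dl", "dt", "em", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "i",
    "img", "kbd", "li", "listing", "menu", "ol", "p", "pre", "samp",
    "select", "strong", "tt", "ul", "var", "xmp",
    "form", "input", "textarea", "option",
}


class UnknownElementFinder(HTMLParser):
    """Collect start-tag names that are not in KNOWN_ELEMENTS."""

    def __init__(self):
        super().__init__()
        self.unknown = set()

    def handle_starttag(self, tag, attrs):
        # The parser reports tag names in lowercase.
        if tag not in KNOWN_ELEMENTS:
            self.unknown.add(tag)


def unknown_elements(markup):
    """Return a sorted list of element names not found in the chapter's tables."""
    finder = UnknownElementFinder()
    finder.feed(markup)
    finder.close()
    return sorted(finder.unknown)


# Hypothetical sample document that uses two non-standard elements.
sample = (
    "<HTML><HEAD><TITLE>Test</TITLE></HEAD>"
    "<BODY><CENTER><BLINK>Hi</BLINK></CENTER><P>Text</BODY></HTML>"
)
print(unknown_elements(sample))  # ['blink', 'center']
```

A check like this catches only misspelled or non-standard element names. It says nothing about whether elements appear in legal positions, which is the part the DTD actually defines.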
HTML is composed of elements: instructions that tell WWW viewers to perform a defined task (make text bold, insert a paragraph break, or format and number a list in a predetermined manner). HTML tags consist of element names enclosed in angle brackets. Figure 4.6 shows a few typical elements and how they are written in tag format.
Figure 4.6 : HTML tags are "invisible" when the WWW viewer displays the document.
|
Troubleshooting |
|
If WWW viewers read HTML tags as instructions, how did you show them in figure 4.6? Why didn't the viewer just mark up the text in the tags?
|
As the name implies, HTML marks up text in a document by defining the specific formatting for sections of the document. HTML is a hybrid, using some elements to define the abstract value of text (such as "emphasized") and others to define the actual on-screen representation in the WWW viewer's window (such as "italicized"). This "split personality" created quite a controversy in the authoring community, spawning two camps of thought that support the different uses of HTML markup.
Unlike the file systems of some operating systems, HTML element
names are case independent. You can write tags with any mixture of upper and
lowercase characters. For example, you can write one tag that defines the
formatting of a section of text as <BLOCKQUOTE>, <blockquote>, <BlockQuote>, or
any capitalization combination. Some authors use unorthodox capitalization
schemes, such as <bLocKquOtE>, but that doesn't make for easy-to-read HTML, and
your site administrator probably discourages this brand of "net.hipness."
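One way to see this case independence in practice is to run differently capitalized tags through a parser. The sketch below uses Python's standard html.parser module, which reports every tag name in lowercase; the class and function names are illustrative assumptions, not part of any HTML standard.

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Record the name of every start tag as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # html.parser converts tag names to lowercase before calling this handler.
        self.names.append(tag)


def start_tag_names(markup):
    """Return the start-tag names found in markup, in document order."""
    collector = TagNameCollector()
    collector.feed(markup)
    collector.close()
    return collector.names


variants = (
    "<BLOCKQUOTE>a</BLOCKQUOTE>"
    "<blockquote>b</blockquote>"
    "<bLocKquOtE>c</bLocKquOtE>"
)
names = start_tag_names(variants)
print(names)  # ['blockquote', 'blockquote', 'blockquote']
```

All three spellings come out as the same element name, which is exactly how a WWW viewer treats them.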
|
Note |
|
HTML uses two types of elements: empty (or open) and container tags. These tags differ because of what they represent. Empty tags represent formatting constructs, such as line breaks and horizontal rules. These tags indicate "one time" instructions that WWW viewers can read and execute without concern for any other HTML construction or document text.
Container tags define a section of text (or of the document itself) and specify the formatting or construction for all of the selected text. A container tag has both a beginning and an ending: the ending tag is identical to the beginning tag, with the addition of a forward slash. Most containers can hold other containers or empty tags, as long as each inner container is closed before the one that surrounds it (see fig. 4.7).
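The difference between the two tag types can be sketched as a small nesting check in Python. This is a simplified illustration under stated assumptions: EMPTY_ELEMENTS holds only a small sample of empty elements (BR, HR, and IMG), every other element is treated as a container that needs an ending tag, and the names NestingChecker and check_nesting are invented for the example.

```python
from html.parser import HTMLParser

# Assumption: a small sample of empty elements; everything else is a container.
EMPTY_ELEMENTS = {"br", "hr", "img"}


class NestingChecker(HTMLParser):
    """Track open containers on a stack and note tags that don't pair up."""

    def __init__(self):
        super().__init__()
        self.open_containers = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # Empty tags are one-time instructions, so there is nothing to track.
        if tag not in EMPTY_ELEMENTS:
            self.open_containers.append(tag)

    def handle_endtag(self, tag):
        if tag in EMPTY_ELEMENTS:
            self.problems.append(f"</{tag}> closes an empty element")
            return
        if tag not in self.open_containers:
            self.problems.append(f"</{tag}> has no matching start tag")
            return
        if self.open_containers[-1] != tag:
            self.problems.append(
                f"</{tag}> closes while <{self.open_containers[-1]}> is still open"
            )
        # Drop the closed container and anything left open inside it.
        while self.open_containers.pop() != tag:
            pass


def check_nesting(markup):
    """Return a list of nesting problems; an empty list means the tags pair up."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    for tag in checker.open_containers:
        checker.problems.append(f"<{tag}> is never closed")
    return checker.problems


print(check_nesting("<B>bold <I>both</I></B><BR>"))  # []
print(check_nesting("<B>bold <I>both</B></I>"))
```

The first document nests the I container fully inside the B container, so nothing is reported. The second closes B while I is still open, and the check reports both the early close and the stray ending tag that follows.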
The "arguments" discussed here aren't disagreements between tags in HTML documents. Like command-line applications, many HTML elements use additional parameters (known as arguments or attributes) to increase their functionality. These arguments are passed on to the client software and affect the way the element is applied to the section of text (or, with empty tags, how the tag's construct is displayed in the viewing software's window).
For example, the anchor element uses arguments to define the function of the anchor (whether it's a marker or a hypertext link to another document or anchor). So, a document can contain links to specific sections of text and named anchors at those text locations (see fig. 4.8). Notice that the parameters are contained in the tag's angle brackets.
In this example, the last line in the list
<LI><A HREF="#Anzio">Battle of Anzio</A>
is an anchor that points to a named anchor somewhere else in the document. The named anchor it points to would be found in a line such as
<H1><A NAME="Anzio">The Battle of Anzio</A></H1>
When the user clicks the list item Battle of Anzio in the viewed document, the WWW browser jumps immediately to the associated named anchor.
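Because the HREF and NAME arguments have to agree for a link like this to work, it can be handy to check them automatically. The following Python sketch, built on the standard html.parser module, lists links of the form HREF="#name" that have no matching named anchor in the same document. The function name dangling_links and the Salerno entry in the sample are hypothetical and appear only for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect in-document link targets and named anchors from A elements."""

    def __init__(self):
        super().__init__()
        self.link_targets = []      # names taken from HREF="#name"
        self.named_anchors = set()  # names taken from NAME="name"

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs with lowercase names.
        for name, value in attrs:
            if value is None:
                continue
            if name == "href" and value.startswith("#"):
                self.link_targets.append(value[1:])
            elif name == "name":
                self.named_anchors.add(value)


def dangling_links(markup):
    """Return link targets that have no matching named anchor."""
    collector = AnchorCollector()
    collector.feed(markup)
    collector.close()
    return [t for t in collector.link_targets if t not in collector.named_anchors]


# Hypothetical sample: the Salerno link has no named anchor to jump to.
sample = """
<UL>
<LI><A HREF="#Anzio">Battle of Anzio</A>
<LI><A HREF="#Salerno">Landing at Salerno</A>
</UL>
<H1><A NAME="Anzio">The Battle of Anzio</A></H1>
"""
print(dangling_links(sample))  # ['Salerno']
```

Note that the comparison here is case sensitive: unlike element names, the values of arguments are kept exactly as written, so a link to "#anzio" would not match an anchor named "Anzio" in this sketch.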
|
Caution |
|
Some WWW viewers, notably Netscape Navigator and Microsoft
Internet Explorer, provide support for non-standard arguments that primarily
affect the display of the HTML text in the viewer's window. WWW viewers that
don't support non-standard elements or arguments just ignore them. Non-standard
usage is noted in chapters 15 and 16.
|
Note |
|
Tables 4.1, 4.2, and 4.3 provide a brief overview of some of the
more common HTML elements found in different sections of HTML documents. These
tables don't include arguments, but they do include each element's tag type. The
entire HTML document should be contained in the HTML container element. For a
complete description of each element and its associated arguments, see appendix A.
Table 4.1 HTML Elements for Head Sections in HTML Documents

| Element | Element Type | Description |
|---|---|---|
| BASE | empty | Base context document |
| HEAD | container | Document head |
| ISINDEX | empty | Document is a searchable index |
| LINK | empty | Link from this document |
| META | empty | Generic meta-information |
| NEXTID | empty | Next ID to use for link name |
| TITLE | container | Title of document |
Table 4.2 HTML Elements for Body Sections in HTML Documents

| Element | Element Type | Description |
|---|---|---|
| A | container | Anchor: source and/or destination of a link |
| ADDRESS | container | Address, signature, or byline for a document or passage |
| B | container | Bold text |
| BLOCKQUOTE | container | Quoted passage |
| BODY | container | Document body |
| BR | empty | Line break |
| CITE | container | Name or title of cited work |
| CODE | container | Source code phrase |
| DD | empty | Definition of term |
| DIR | container | Directory list |
| DL | container | Definition list, or glossary |
| DT | empty | Term in definition list |
| EM | container | Emphasized phrase |
| H1 | container | Heading, level 1 |
| H2 | container | Heading, level 2 |
| H3 | container | Heading, level 3 |
| H4 | container | Heading, level 4 |
| H5 | container | Heading, level 5 |
| H6 | container | Heading, level 6 |
| HR | empty | Horizontal rule |
| I | container | Italic text |
| IMG | empty | Image; icon, glyph, or illustration |
| KBD | container | Keyboard phrase, such as user input |
| LI | empty | List item |
| LISTING | container | Computer listing |
| MENU | container | Menu list |
| OL | container | Ordered or numbered list |
| P | empty | Paragraph |
| PRE | container | Preformatted text |
| SAMP | container | Sample text or characters |
| SELECT | container | Selection of option(s) |
| STRONG | container | Strong emphasis |
| TT | container | Typewriter text |
| UL | container | Unordered list |
| VAR | container | Variable phrase or substitutable |
| XMP | container | Example section |
|
Note |
|
Table 4.3 HTML Elements for Forms in HTML Documents

| Element | Element Type | Description |
|---|---|---|
| FORM | container | Fill-out or data-entry form |
| INPUT | empty | Form input datum |
| TEXTAREA | container | Area for text input |
| OPTION | empty | Selection option |