Introduction to HTML & Markup

This lecture is a more conceptual discussion of HTML.

It is not a tutorial on how to use HTML to make web pages, though some information about design and applications is included.  There is really no need for me to write you a lecture on how to use HTML to produce web pages, since so many terrific HTML tutorials already exist on the web.  I have collected many of these tutorials, plus a couple of good reference sites about HTML and web design, for you; these are available from the link in the "External Links" section of our Blackboard class site (or from the link above).  So, after reading this lecture to get a good fundamental understanding of HTML, you should scan a number of these tutorials, find at least two that you like, and study them to learn how to make web pages with HTML.  You will also notice that one link in the list points to a nice listing of all of the HTML tags and their attributes; this makes a good reference page that you can download to your local disk and use when developing web pages.

What is HTML?

HTML stands for HyperText Markup Language.  The first draft of a standard for HTML was proposed in August 1991 (Happy 10th anniversary to HTML!!) and we are currently up to version 4.  Today HTML is the most commonly used method of displaying information in a nonlinear manner. (Nonlinear information is information that doesn't just flow from the beginning of the page to its end but rather can have branchings and diversions of the flow of information in the document.)  HTML is a derivative of SGML (Standard Generalized Markup Language), which is the grand-daddy of many a markup language and is used extensively in the publishing industry. The bottom line is that just about anything with the "ML" ending (e.g. HTML, WML, MathML, XML) traces back to SGML: XML is itself a simplified form of SGML, and languages such as WML and MathML are in turn built on XML.  What separates one SGML application from another is a document/file called a DTD, or Document Type Definition. This is simply a file that lists all the entities, markup and rules for their interaction and application.  HTML is really then just SGML using a standardized DTD.

Users of the web generally never see the HTML DTD because it is built into the web browser applications that they use to view HTML pages over the web.  (A consequence of this is that a new browser version must be released in order to implement any changes in the HTML standard DTD.  More on browsers and their implementation of HTML standards is given a little later.)  However, if you look at HTML source you should see (if it's good HTML like you will write for me) that the first line is a statement declaring which version of the HTML DTD that page is written for.  The statement will look something like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

which specifies not only the DTD being implemented (in this case the HTML 4.0 Transitional one) but also the language the DTD itself is written in ("EN" for English) and which DTD it is (the public W3C one).  And, as you might have guessed from the previous statement, the standards body for developing the HTML DTD is the World Wide Web Consortium, or W3C.
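
To show where that declaration sits, here is a minimal sketch of a complete page. The title and paragraph text are made up for illustration; only the DOCTYPE line is taken from the example above.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
  <TITLE>A minimal page</TITLE>
</HEAD>
<BODY>
  <P>The DOCTYPE declaration comes before everything else in the file.</P>
</BODY>
</HTML>
```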

Really though, from a working perspective, HTML is a structured language composed of "Tags" which define "Elements" that are used to mark up, or designate, structural information, content and associated, or linked, documents. As specified above, it requires a DTD to define what tags are allowed, where they can go and what they can contain (this is why it is a "structured" language).  (Please don't make the common mistake of calling HTML a programming language. It is in no way, shape, or form a programming language; it is simply a markup language.  Nothing turns me off a web person more than when they list HTML as one of the programming languages they know on their resume. Sorry, this is just one of my pet peeves as a person who has forgotten more real programming languages than most people ever learn.)  The purpose of any markup language is to designate (or mark, hence the name) the structure of a document, i.e. what is a heading, what is a chapter, what is a paragraph, what is a quote, etc.  So in a marked-up document you will find two things: the content (i.e. all the information that the document is trying to convey) and the markup, or structure, designated by the tags and elements of the markup language.  This is very different from what happens when you do word processing or desktop publishing. In those application environments the markup, the content and any information about how you want both the content and structure to be displayed (e.g. choices of fonts, placing items on the page, etc.) are not separated (unless you are an old WordPerfect user and have used its "reveal codes" feature, which shows you the markup being applied to your document).   It is the express purpose of markup languages to separate the content information, the structure information and information about how that content and structure should be stylistically displayed.

When a document is marked up, it is reusable in many different publishing environments because it is up to the viewing device to stylistically display the document based upon the structures designated by the markup. Designating structure also allows for "chunking" of the document, where logical pieces of the document (like chapters of a book, sections of a chapter, or paragraphs of a section) are separated so they can be handled differently.
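
As a rough illustration of markup that carries only content and structure, consider the fragment below. The text is invented for the example. Notice that nothing in it says which font, size or color to use; each heading, paragraph and quotation is simply labeled as what it is, and the display device decides how to present it.

```html
<H1>Field Guide to Garden Birds</H1>
<H2>Robins</H2>
<P>Robins are often the first birds heard at dawn.</P>
<BLOCKQUOTE>
  <P>A quoted passage is marked as a quotation, not as indented text.</P>
</BLOCKQUOTE>
```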

Why Separate Content and Style?

Conveying information is the reason for making web pages, and both the structure and content of your document are inherently part of the information you are trying to convey. But the stylistic display of your content and structure does not significantly contribute to the information being transferred (though it can convey some meaning, e.g. having bold or italicized text could convey meaning to the user in some situations). So, the whole idea of markup is to generate a more flexible, structured presentation of your information without having to worry about how it will look to the end user, because you know the display device (in the case of web pages this is either the browser on screen or a printed form of the page) will handle all of the stylistic display formatting for you. This separation then lets you focus more on the information.

Other significant reasons for separating (or removing altogether) stylistic display information and content are:

So ideally your HTML should only contain content and structural markup, and it should then leave it up to the user's viewing system to determine how both that content and marked-up structure are displayed. In the real world though, in many forms of communication today, style is everything, and as HTML and the web evolved together more and more stylistic tagging was added to HTML. Now the W3C has returned to its roots, and when HTML 4 came out it had stripped out almost all of the previously evolved stylistic tagging. What was put in its place was a method to apply a style sheet to a document. So you can use markup to specify structure and then use a style sheet sent with the document to tell the display device how to stylistically portray it. The current standard for these style sheets is CSS2 (Cascading Style Sheets version 2). It is because of these radical changes to HTML that the HTML 4.0 Transitional DTD was written, so that older legacy pages with stylistic HTML tags would still be usable.
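
A small sketch of that split between structure and style follows. The colors, fonts and text are arbitrary choices for illustration. The style rules live in a STYLE element in the head, and the older FONT-based approach appears only inside a comment for comparison.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
  <TITLE>Structure plus a style sheet</TITLE>
  <STYLE type="text/css">
    /* Presentation lives here, not in the markup */
    H1 { color: navy; font-family: sans-serif; }
    P  { font-size: 12pt; }
  </STYLE>
</HEAD>
<BODY>
  <!-- Older, deprecated approach:
       <H1><FONT color="navy" face="sans-serif">Welcome</FONT></H1> -->
  <H1>Welcome</H1>
  <P>This paragraph carries no styling of its own.</P>
</BODY>
</HTML>
```

Changing the look of every heading then means editing one rule rather than every FONT tag in the document.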

What is an HTML Element?

HTML is composed of elements which are defined by their tags and tag names.  Here is an example of 2 different HTML elements (the "A" or anchor element and the "IMG" or image element):

<A Href="http://www.site.com/filename.html">
   Some textual contents
   <IMG Src="images/image_filename.jpg" width="100" Alt="Just a test Image">
</A>

The example above shows the 2 major types of HTML elements: 

  1. Container Elements (e.g. the "A" tag), and
  2. Empty or Standalone Elements (e.g. the "IMG" tag)

Elements are composed of 3 possible parts: 

  1. Opening Tag (everything between a left angle brace, "<", and a right angle brace, ">"),
  2. Closing Tag (everything between a left angle brace forward slash, "</", and a right angle brace, ">"), and
  3. Contents, all the stuff between the opening and closing tags of a container element (n.b. this is what defines container elements: the fact that they enclose, or mark up, content and information.  Similarly, empty elements have no contents or closing tags).

An opening tag of an element is composed of:

In general, spacing or hard returns between attributes and elements are irrelevant, and everything but the contents of an element and (possibly) the attribute values is case-insensitive.
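
The sketch below reuses the IMG element from the earlier example to illustrate these points. The first comment summarizes the usual anatomy of an opening tag (a tag name followed by attribute and value pairs), which is a general description and not the original list. The second form differs only in case and line breaks and should be treated the same way by a browser. The file name inside the Src value is left untouched, since a web server may treat its case as significant.

```html
<!-- Tag name first, then attribute="value" pairs separated by whitespace -->
<IMG Src="images/image_filename.jpg" width="100" Alt="Just a test Image">

<!-- The same element: different case and line breaks, same meaning -->
<img
   SRC="images/image_filename.jpg"
   WIDTH="100"
   alt="Just a test Image">
```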

Now that we understand all of the terminology we can go back to our original example in this section:

<A Href="http://www.site.com/filename.html">
   Some textual contents
   <IMG Src="images/image_filename.jpg" width="100" Alt="Just a test Image">
</A>

We can describe it in detail as a single Anchor or "A" element which has the following components:
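
One way to annotate the example is shown below. This breakdown is reconstructed from the terminology defined earlier in this section, so read it as a sketch and not as the original component list.

```html
<!-- Opening tag of the A element: tag name "A", one attribute (Href)
     and its value -->
<A Href="http://www.site.com/filename.html">
   <!-- Contents, part 1: plain text -->
   Some textual contents
   <!-- Contents, part 2: an empty IMG element with three attributes
        (Src, width, Alt); it has no contents and no closing tag -->
   <IMG Src="images/image_filename.jpg" width="100" Alt="Just a test Image">
<!-- Closing tag of the A element -->
</A>
```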

HTML 4.0 Elements/Tags

This section lists the tags of many HTML elements and categorizes them based upon their function in marking up a document.  Please refer to the tutorials discussed above or HTML reference material to find out exactly what they do, how they are applied and what attribute values they may take to alter their behavior.   HTML elements have 2 major categories:

  1. Block-Level elements (see listing below) - These elements imply that their contents (with the exception of the <HR> element, which has no contents because it is an empty element) start a new paragraph, and when a browser displays them most of them will cause an extra line break as if a new paragraph were started.
  2. Inline elements - These are character formatting elements that don't imply a new paragraph, i.e. they can be used within or "inline" a block-level element. So, an inline element should be contained within a block-level element, and a block-level element can NEVER be within an inline element.
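
A short sketch of the nesting rule described in the two categories above, using made-up text. The first line is allowed; the second is the kind of nesting to avoid.

```html
<!-- Correct: inline elements (CODE, B) sit inside a block-level element (P) -->
<P>Use the <CODE>BLOCKQUOTE</CODE> element for <B>long</B> quotations.</P>

<!-- Incorrect: a block-level element (P) placed inside an inline element (B) -->
<B><P>This nesting is not allowed by the DTD.</P></B>
```
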
Tags that define the major structural parts of a document:
<HTML>, <HEAD>, <BODY>, <FRAME>, <FRAMESET>, <NOFRAMES>
Tags for marking up text:

Structural Block Elements:

<DIV>, <P>, Heading Tags <Hn> where n={1-6}, <BLOCKQUOTE>, <HR>, <PRE>, <FORM>, <TABLE>
Tags to generate Lists:
<OL>,<UL>,<LI>;<DL>,<DT>,<DD>
  • Ordered List (sequentially listed items designated with numbers or letters) <OL>, <LI>
  • Unordered List (items designated with symbols) <UL>, <LI>
  • Definition List (terms paired with their definitions) <DL>, <DT>, <DD>
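
For orientation, here is a minimal sketch of each of the three list types named above. The item text is invented; the point is only how the list container and its item tags fit together.

```html
<OL>
  <LI>Write the content</LI>
  <LI>Mark up the structure</LI>
</OL>

<UL>
  <LI>Headings</LI>
  <LI>Paragraphs</LI>
</UL>

<DL>
  <DT>DTD</DT>
  <DD>Document Type Definition</DD>
</DL>
```
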
Tags to generate Tables
<TABLE> and its associated sub-elements: <TR>, <TD>, <TH>, <THEAD>, <TBODY>, <TFOOT>, <COL>, <COLGROUP>
Tags to generate Forms
<FORM> and its associated sub-elements: <INPUT>, <SELECT>, <OPTION>, <BUTTON>, <LABEL>, <OPTGROUP>, <TEXTAREA>

Structural Inline Elements:

<SPAN>, <BR>, <CITE>, <CODE>, <SAMP>, <VAR>

Inline Tags for designating stylistic information:

<B>,<BIG>,<SMALL>,<STRONG>,<SUB>,<SUP>,<TT>

New Content Tags

(Not fully implemented yet across browser platforms):
<ABBR>, <ACRONYM>, <DEL>, <INS>, <KBD>, <Q>

Old Text Tags

- Deprecated stylistic tags (i.e. no longer considered standard in HTML 4.0 but still allowed in the HTML 4.0 Transitional DTD):
<BASEFONT>,<FONT>,<S>,<STRIKE>,<U>,<CENTER>

Other Common Tags

<A>, <HR>, and comment tags, written as <!-- Comment goes inside this funny looking tag structure -->

Elements in the Head of a Document.

Material in the document <HEAD> is not displayed; it only provides information about the document.  Markup you will find as the contents of the HEAD element in an HTML file includes:
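
As a hedged illustration, and not a reproduction of the original list, a typical HTML 4 head might look like the sketch below. The description text and the style sheet file name (style.css) are hypothetical.

```html
<HEAD>
  <TITLE>Introduction to HTML &amp; Markup</TITLE>
  <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <META name="description" content="A conceptual lecture on HTML">
  <LINK rel="stylesheet" type="text/css" href="style.css">
</HEAD>
```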

Please be aware that the above list is not an exhaustive list of all the HTML 4 elements; there are more elements than I have listed here. What the listing should do is give you a very good overview of the types of elements out there.

HTML and Browsers
(When standards are thrown out the door)

Most people who use the internet are aware of the 2 major competing browsers, Netscape and Internet Explorer. They are fiercely competitive in seeking users, and one way this competition has expressed itself is by offering fancier display controls and easier design of web documents for web developers, i.e. they both want to encourage web developers to develop pages for their browser to display and not their competitor's. On the surface this may seem a good thing for developers, and in truth many of the innovative things in the modern HTML standards were first implemented in one or the other of the browsers. But what it has also done is to give web developers headaches that no designer has ever had to deal with before. As a web developer you simply cannot guarantee that a web page whose HTML displays great in one browser will even display anything in the other browser. It is to the point today that you need a good HTML reference to tell you which tags and attributes are usable, and with what certainty, in which browser.

Part of this problem is the browser-specific elements and attributes that each browser incorporates into its implementation of the HTML DTD, but part of it also comes from lax enforcement of the specifications of the DTD. I have already mentioned one of these lax implementations when discussing the quoting of attributes. Things have gotten so bad that many web design and HTML books and tutorials now teach the lax implementations of the DTD as if that's the way things were supposed to be done, or at least teach that the incorrect way is an allowed alternative. What follows is a listing of things that are NOT correct HTML. Despite what you may be told, see or have practiced in the past, the items in the following list are not proper HTML. (n.b. the correct explanation and implementation of each of these errors is given in the different font after each item.)

You may ask: if these are actually incorrect, why do browsers support them? Browsers support them because they are such very common errors, and a browser's only alternative would be to display the page improperly or not display it at all. The bottom line is that anybody can try to make a web page, and the browsers try to accommodate all the novices out there who just hack something together without understanding it. Tutorials and books promote these errors, I can only assume, because they see them as cute tricks and shortcuts that they can give to their readers. You should practice good habits not just for aesthetic reasons but because they develop excellent markup habits that will serve you very well if you ever move to the next step up, which is XML (we will discuss XML later in the semester). XML and all other markup languages are NOT forgiving like a web browser's implementation of HTML is, and the other ML's require proper markup.
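
To make the contrast concrete, here is a small sketch. The two mistakes shown (an unquoted attribute value and a missing closing tag) are assumed examples chosen for illustration; only attribute quoting is named in the text above. The HTML 4 specification recommends quoting every attribute value, and XML requires both the quotes and an explicit close for every element, including closing tags such as </P> that HTML 4 lets you omit.

```html
<!-- Tolerated by many browsers, but not proper markup:
     the attribute value is unquoted and the A element is never closed
<P><A Href=http://www.site.com/filename.html>A link</P>
-->

<!-- Proper: value quoted, every container element explicitly closed -->
<P><A Href="http://www.site.com/filename.html">A link</A></P>
```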

Today, no matter what W3C puts out as a standard for HTML or style sheets, it is really what the browsers implement that determines what people put into their web pages. There is nothing wrong with that; as a developer you have to work with the system the way it is. But the more standardized and compatible the HTML you produce, the better you will be as a developer and the better your HTML will last in an ever-changing system of standards support.

(Another common mistake people make is to assume that text in a table data cell (the <TD> element) must be enclosed in a paragraph tag. As long as there is only one block of text in the table data cell, the paragraph element is not required, because the DTD specifies that the <TD> element can contain just text data. This misconception can actually cause some problems. Because the contents of the <TD> element are displayed as text, it is one of the few places in writing HTML where you have to be very careful about where you put spaces, tabs and hard returns when formatting your HTML to make it look pretty. In the 2nd lecture for this week on Web Design I show an example using a table to make a composite image (often this is called a sliced image), and I show how this very issue can cause problems.)
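
A minimal sketch of the whitespace issue, assuming a two-slice image with hypothetical file names. In the first table, the line breaks and indentation inside each cell may be rendered by some browsers as a visible gap between the slices. In the second, each cell's opening tag, contents and closing tag are kept together, so there is no stray whitespace to display.

```html
<!-- Pretty-printed: whitespace inside the cells may show up as gaps -->
<TABLE border="0" cellspacing="0" cellpadding="0">
  <TR>
    <TD>
      <IMG Src="images/slice_left.jpg" Alt="">
    </TD>
    <TD>
      <IMG Src="images/slice_right.jpg" Alt="">
    </TD>
  </TR>
</TABLE>

<!-- Tight: no whitespace between the TD tags and their contents -->
<TABLE border="0" cellspacing="0" cellpadding="0">
  <TR>
    <TD><IMG Src="images/slice_left.jpg" Alt=""></TD>
    <TD><IMG Src="images/slice_right.jpg" Alt=""></TD>
  </TR>
</TABLE>
```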

Here is some good reading on web design issues and browser interoperability; please read it to better understand the issues involved:

HTML Editors

Types:

(See the item in "External Links" to find out where to download some of these items)

Advantages (+'s) & Disadvantages (-'s):

My personal preference is to start laying out and structuring a document (either existing information I'm about to mark up or new content I'm typing in along with the markup) in a good HTML editor and then switch to a good WYSIWYG editor only to edit the content portions I have already marked up.  Some of the more sophisticated WYSIWYG and HTML editors will have nice functionality to help you do web site production and maintenance.

Validating your HTML

My last comment is about ways to check your HTML. There are many validators out there to which you can give the URL of your web page, and they will check your HTML against the DTD you specify in your DOCTYPE declaration. One of the most well-known ones is from W3C at http://validator.w3.org/. You can also download local validators or HTML cleaning programs; the best known of these is HTML Tidy (which is built into the HTML-Kit editor). It is a configurable and powerful tool for producing standardized, clean and formatted HTML.

(Another good way for you to see what good HTML looks like is to simply view source on any of the HTML web pages that I have posted for this class. These show great examples of proper structure and markup of textual information. They are simple web pages with minimal graphics and fancy design elements. This page also shows a very little bit of CSS use. You may have noticed that all the blocks of HTML code are in a different font and a different color. This is done with a quick little bit of CSS applied throughout the document.)

Scott A. Wymer
Copyright 2001
