Introduction to HTML & Markup

This lecture is a more conceptual discussion of HTML:

where it comes from,
what it is,
what the philosophy behind it is,
how it is implemented and used.

It is not a tutorial on how to use HTML to make web pages, though some information about design and applications is included. There is really no need for me to write you a lecture on how to use HTML to produce web pages, since so many terrific HTML tutorials already exist on the web. I have collected many of these tutorials, plus a couple of good reference sites about HTML and web design, for you. These are available from the link in the "External Links" section of our Blackboard class site (or from the link above). So, after reading this lecture to get a good fundamental understanding of HTML, you should scan a number of these tutorials, find at least two that you like, and study them to learn how to make web pages with HTML. You will also notice that one link in the list points to a nice listing of all of the HTML tags and their attributes. This makes a good reference page that you can download to your local disk and use when developing web pages.

What is HTML?

HTML stands for HyperText Markup Language. The first draft of a standard for HTML was proposed in August 1991 (Happy 10th anniversary to HTML!!) and we are currently up to version 4. Today HTML is the most commonly used method of displaying information in a nonlinear manner. (Nonlinear information is information that doesn't just flow from the beginning of the page to its end but rather can have branchings and diversions of the flow of information in the document.) HTML is a derivative of SGML (Standard Generalized Markup Language), which is the grand-daddy of many a markup language and is used extensively in the publishing industry. The bottom line is that just about anything with the "ML" ending is a simplified form of SGML (e.g. HTML, WML, MathML, XML). What separates one SGML implementation from another is a document/file called a DTD, or Document Type Definition. This is simply a file that lists all the entities and markup, and the rules for their interaction and application. HTML, then, is really just SGML using a standardized DTD.

Users of the web generally never really see the HTML DTD because it is built into the web browser applications that they use to view HTML pages over the web. (A consequence of this is that a new browser version must be released in order to implement any changes in the HTML standard DTD. More on browsers and their implementation of HTML standards is given a little later.) However, if you look at HTML source you should see (if it's good HTML like you will write for me) that the first line is a statement declaring which version of the HTML DTD that page is written for. The statement will look something like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

which specifies not only the DTD being implemented (in this case the HTML 4.0 Transitional one) but also which DTD it is (the public W3C one) and the language that DTD is written in ("EN" for English). And, as you might have guessed from the previous statement, the standards body for developing the HTML DTD is the World Wide Web Consortium, or W3C.
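To show where the DOCTYPE statement sits, here is a minimal sketch of a complete page. The title and paragraph text are placeholders I made up for illustration; only the DOCTYPE line is taken from the statement above.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
   <!-- Placeholder title, not from any real page -->
   <TITLE>A Minimal Page</TITLE>
</HEAD>
<BODY>
   <!-- Placeholder content -->
   <P>Hello, web.</P>
</BODY>
</HTML>
```

The DOCTYPE statement comes before the opening HTML tag, so it is the very first thing a browser or validator reads.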

Really though, from a working perspective, HTML is a structured language composed of "Tags" which define "Elements" that are used to mark up, or designate, structural information, content and associated, or linked, documents. As specified above, it requires a DTD to define what tags are allowed, where they can go and what they can contain (this is why it is a "structured" language). (Please don't make the common mistake of calling HTML a programming language. It is in no way, shape, or form a programming language; it is simply a markup language. Nothing turns me off a web person more than when they list HTML as one of the programming languages they know on their resume. Sorry, this is just one of my pet peeves as a person who has forgotten more real programming languages than most people ever learn.)

The purpose of any markup language is to designate (or mark, hence the name) the structure of a document, i.e. what is a heading, what is a chapter, what is a paragraph, what is a quote, etc. So in a marked up document you will find two things: the content (i.e. all the information that the document is trying to convey) and the markup, or structure, designated by the tags and elements of the markup language. This is very different from what happens when you do word processing or desktop publishing. In those application environments the markup, the content and any information about how you want both the content and structure to be displayed (e.g. choices of fonts, placing items on the page, etc.) are not separated (unless you are an old WordPerfect user and have ever used its "reveal codes" feature, which shows you the markup being applied to your document). It is the express purpose of markup languages to separate the content information, the structure information and the information about how that content and structure should be stylistically displayed.

When a document is marked up, it is reusable in many different publishing environments because it is up to the viewing device to stylistically display the document based upon the structures designated by the markup. Designating structure also allows for "chunking" of the document, where logical pieces of the document (like chapters of a book, sections of a chapter, or paragraphs of a section) are separated so that they can be handled differently.
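As a small illustration of marking structure rather than appearance, consider the fragment below. The text in it is invented for the example. Notice that nothing in it says what font, size or color to use; it only says what each piece of content is.

```html
<!-- A heading, a sub-heading, a paragraph and a quote: structure only -->
<H1>Field Notes</H1>
<H2>Day One</H2>
<P>We reached the ridge before noon.</P>
<BLOCKQUOTE>
   <P>The view was worth the climb.</P>
</BLOCKQUOTE>
```

A graphical browser, a printer and a speech browser could each present these same four structures in whatever way suits the device.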

Why Separate Content and Style?

Conveying information is the reason for making web pages and both the Structure and Content of your document are inherently part of the information you are trying to convey. But the stylistic displaying of your content and structure does not significantly contribute to the information being transferred (though it can convey some meaning, e.g. having bold or italicized text could convey meaning to the user in some situations). So, the whole idea of markup is to generate a more flexible, structured presentation of your information without having to worry about how it will look to the end user, because you know the display device (in the case of web pages this is either the browser on screen or a printed form of the page) will handle all of the stylistic display formatting for you. This separation then lets you focus on the information more.

Other significant reasons for separating (or removing altogether) stylistic display information and content are:

Factors outside the author’s control can influence stylistic display. In the case of web pages these factors are many and varied, like:
- which browser a user chooses to use (Internet Explorer and Netscape are the 2 most popular but by no means the only ones out there)
- user's monitor size
- current monitor resolution
- user's installed fonts
- browser window size
- user's viewing hardware (are they viewing from a laptop, a desktop PC, a palm device, a wireless phone, or printing it)
- user's operating system (e.g. Macs and PCs render colors differently on screen).
There is no reasonable way to account for all the different ways a user can view your information. So if you try to include stylistic information so that your page looks the way you like in one environment, you could be making the page totally unviewable in another environment.
To reach the broadest audience successfully with your information, it's best to allow the user's system to display the information optimally, e.g. there are browsers out there for blind people which will render properly marked up HTML into a spoken description of a page's content and structure.

So ideally your HTML should only contain content and structural markup, and it should then leave it up to the user's viewing system to determine how both that content and marked up structure are displayed. In the real world though, in many forms of communication today, style is everything, and as HTML and the web evolved together more and more stylistic tagging was added to HTML. Now the W3C has returned to its roots, and when HTML 4 came out it had stripped out (deprecated) almost all of the previously evolved stylistic tagging. What was put in its place was a method to apply a style sheet to a document. So you can use markup to specify structure and then use a style sheet sent with the document to tell the display device how to stylistically portray it. The current standard for these style sheets is CSS2 (Cascading Style Sheets version 2). It is because of these radical changes to HTML that the HTML 4.0 Transitional DTD was written, so that older legacy pages with stylistic HTML tags would still be usable.
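The difference between the two approaches can be sketched roughly as follows. The class name "warning", the colors and the sentence itself are my own placeholders, chosen only for illustration.

```html
<!-- Older stylistic tagging: the look is wired into the markup (deprecated) -->
<P><FONT Color="red" Face="Arial">The server restarts at noon.</FONT></P>

<!-- HTML 4 approach: the markup gives structure, a style sheet gives the look -->
<HTML>
<HEAD>
   <TITLE>Style Sheet Sketch</TITLE>
   <STYLE Type="text/css">
      /* Every paragraph marked with the (hypothetical) class "warning" */
      P.warning { color: red; font-family: Arial, sans-serif; }
   </STYLE>
</HEAD>
<BODY>
   <P Class="warning">The server restarts at noon.</P>
</BODY>
</HTML>
```

In the second form, changing how every warning looks means editing one style rule rather than every FONT tag on the page.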

What is an HTML Element?

HTML is composed of elements which are defined by their tags and tag names. Here is an example of 2 different HTML elements (the "A" or anchor element and the "IMG" or image element):

<A Href="http://www.site.com/filename.html">
   Some textual contents
   <IMG Src="images/image_filename.jpg" width="100" Alt="Just a test Image">
</A>

The example above shows the 2 major types of HTML elements:

Container Elements (e.g. the "A" tag), and
Empty or Standalone Elements (e.g. the "IMG" tag)

Elements are composed of 3 possible parts:

Opening Tag (everything between a left angle brace, "<", and a right angle brace,">"), and
Closing Tag (everything between a left angle brace forward slash, "</", and a right angle brace,">"), and
Contents, all the stuff between the opening and closing tags of a container element (n.b. this is what defines container elements, the fact that they enclose (or markup) content and information. Similarly Empty elements have no contents or closing tags).

An opening tag of an element is composed of:

Enclosing angle braces ("<" & ">")
Tag Name (a word which is generally a shortened form of the tag's function)
List of Attributes composed of:
- Attribute Name followed by
- an equal sign and an
- Attribute Value (Always in quotes) (You will sometimes see examples of HTML attribute values without quotes around them. This is incorrect and sloppy HTML. (Some books and tutorials even teach this habit.) All attribute values need to be quoted all the time. If you get into this habit from the start, then if you ever make the transition to another markup language (like maybe XML) where the DTD is more strictly enforced than it is by web browsers displaying HTML, you will not have to change your habits.)

In general, spacing and hard returns between attributes and elements are irrelevant, and everything but the contents of an element and (possibly) the attribute values is case-insensitive.
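As a quick sketch of that point, the two IMG elements below should be treated as the same element by a browser. Only the case of the tag and attribute names and the spacing between attributes differ. The attribute values are left untouched, since something like a file name may well be case-sensitive on the web server.

```html
<!-- Form 1: one line, mixed-case names -->
<IMG Src="images/image_filename.jpg" Width="100" Alt="Just a test Image">

<!-- Form 2: lowercase names, one attribute per line -->
<img
   src="images/image_filename.jpg"
   width="100"
   alt="Just a test Image">
```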

Now that we understand all of the terminology we can go back to our original example in this section:

<A Href="http://www.site.com/filename.html">
   Some textual contents
   <IMG Src="images/image_filename.jpg" width="100" Alt="Just a test Image">
</A>

and describe it in detail as a single Anchor or "A" element which has the following components:

one attribute, the "Href" attribute, with the attribute value "http://www.site.com/filename.html", which specifies that this A element should hyperlink its contents and direct the user to the URL specified in the "Href" attribute value,
and contents consisting of the text "Some textual contents" and another HTML element, the IMG element or IMG tag. (This is a tag which displays an image.) The IMG tag has 3 attributes:
1. the "Src" attribute with the attribute value "images/image_filename.jpg", which names the image file to be displayed,
2. the "Width" attribute with the attribute value "100", which specifies that the image is to be displayed 100 pixels wide,
3. the "Alt" attribute with the attribute value "Just a test Image", which specifies the text to be displayed if the image can not display for some reason (e.g. the user has display images turned off in their browser, or the user is using a Braille browser for the blind).

HTML 4.0 Elements/Tags

This section lists the tags of many HTML elements and categorizes them based upon their function in marking up a document. Please refer to the tutorials discussed above or HTML reference material to find out exactly what they do, how they are applied and what possible attribute values they may take to alter their behavior. HTML elements have 2 major categories:

Block-Level elements (see listing below) - These elements imply that their contents (with the exception of the <HR> element, which has no contents because it is an empty element) start a new paragraph, and when a browser displays them most of them will cause an extra line break as if a new paragraph were started.
Inline elements - These are character formatting elements that don't imply a new paragraph, i.e. they can be used within, or "inline" with, a block-level element. So, an inline element should be contained within a block-level element, and a block-level element can NEVER be within an inline element.
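A short sketch of the containment rule, using made-up sentence text. The first fragment nests inline elements inside a block-level element, which is allowed; the second does the reverse, which is not.

```html
<!-- Correct: inline elements (STRONG, A) sit inside a block-level element (P) -->
<P>Read the <STRONG>whole</STRONG> lecture, then try a
   <A Href="http://www.site.com/filename.html">tutorial</A>.</P>

<!-- Incorrect: a block-level element (P) inside an inline element (STRONG) -->
<STRONG><P>This nesting is not allowed.</P></STRONG>
```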

Tags that define the major structural parts of a document:

<HTML>, <HEAD>, <BODY>, <FRAME>, <FRAMESET>, <NOFRAMES>

Tags for marking up text:

Structural Block Elements:

<DIV>, <P>, Heading Tags <Hn> where n={1-6}, <BLOCKQUOTE>, <HR>, <PRE>, <FORM>, <TABLE>

Tags to generate Lists:
<OL>,<UL>,<LI>;<DL>,<DT>,<DD>

Ordered List (sequentially listed items designated with numbers or letters) <OL>, <LI>
Unordered List (items designated with symbols) <UL>, <LI>
Definition List (terms paired with their definitions, sometimes called a dictionary list) <DL>, <DT>, <DD>
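For orientation, here is a minimal sketch of each of the three list types. The item text is invented for the example.

```html
<!-- Ordered list: the browser supplies the numbers or letters -->
<OL>
   <LI>Write the content</LI>
   <LI>Mark up the structure</LI>
</OL>

<!-- Unordered list: the browser supplies the bullet symbols -->
<UL>
   <LI>Headings</LI>
   <LI>Paragraphs</LI>
</UL>

<!-- DL list: each term (DT) is followed by its definition (DD) -->
<DL>
   <DT>HTML</DT>
   <DD>HyperText Markup Language</DD>
   <DT>DTD</DT>
   <DD>Document Type Definition</DD>
</DL>
```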

Tags to generate Tables: <TABLE> and its associated sub-elements: <TR>, <TD>, <TH>, <THEAD>, <TBODY>, <TFOOT>, <COL>, <COLGROUP>
Tags to generate Forms: <FORM> and its associated sub-elements: <INPUT>, <SELECT>, <OPTION>, <BUTTON>, <LABEL>, <OPTGROUP>, <TEXTAREA>
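To show how these sub-elements fit together, here is a rough sketch of a small table and a small form. The table data, the form field names and the Action URL are all hypothetical placeholders, not part of any real site.

```html
<!-- A table: rows (TR) hold header cells (TH) or data cells (TD) -->
<TABLE Border="1" Summary="Browser names and their vendors">
   <THEAD>
      <TR><TH>Browser</TH><TH>Vendor</TH></TR>
   </THEAD>
   <TBODY>
      <TR><TD>Internet Explorer</TD><TD>Microsoft</TD></TR>
      <TR><TD>Navigator</TD><TD>Netscape</TD></TR>
   </TBODY>
</TABLE>

<!-- A form: the Action URL below is a made-up example address -->
<FORM Action="http://www.site.com/cgi-bin/feedback" Method="post">
   <P>
      <LABEL For="visitor">Your name:</LABEL>
      <INPUT Type="text" Name="visitor" Id="visitor">
      <SELECT Name="browser">
         <OPTION Value="ie">Internet Explorer</OPTION>
         <OPTION Value="ns">Netscape</OPTION>
      </SELECT>
      <INPUT Type="submit" Value="Send">
   </P>
</FORM>
```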

Structural Inline Elements:

<SPAN>,<BR>,<CITE>,<CODE>,<SAMP>,<VAR>

Inline Tags for designating stylistic information:

<B>,<BIG>,<SMALL>,<STRONG>,<SUB>,<SUP>,<TT>

New Content Tags (Not fully implemented yet across browser platforms):

<ABBR>,<ACRONYM>,<DEL>,<INS>,<KBD>,<Q>

Old Text Tags - Deprecated Stylistic tags (i.e. no longer considered standard in HTML 4.0 but still allowed in the HTML 4.0 Transitional DTD):

<BASEFONT>,<FONT>,<S>,<STRIKE>,<U>,<CENTER>

Other Common Tags

<A> ,<HR>, Comment tags written as

<!-- 
    Comment goes inside this funny looking tag structure -->

Elements in the Head of a Document.

Material in the document <HEAD> is not displayed and only provides information about the document. Markup you will find as the contents of the HEAD tag in an HTML file includes:

<TITLE> Tag (no attributes) - Title used by search engines and browsers
<META> Tag (Empty Element) - Attributes are:
- HTTP-EQUIV - used to provide HTTP protocol statements as the document loads
- Name - Details read by people and robots looking at document
  - Author
  - Creation-Date
  - Description
  - Last-Modified-Date
  - Keywords
- Content - This is the metadata corresponding to the Name attribute.
e.g.
```
<META CONTENT="Scott A. Wymer" NAME="Author">
<META CONTENT="Mon, 10 Sep 2001 10:13:52 PM" NAME="Creation-Date">
<META CONTENT="Wed, 12 Sep 2001 10:27:05 PM" NAME="Last-Modified-Date">
```
<SCRIPT> - Used to designate client side scripts (i.e. scripts run by the browser) for the entire document.
<STYLE> - Used to designate Document specific Cascading Style Sheet information
<BASE> - Used to give a base URL for all relative URLs used within the document
<LINK> - (not implemented fully) specifies relationships between the document and other documents that it leads to or was called from.
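Putting the pieces above together, a document HEAD might look something like the sketch below. The description, keywords, base URL and file paths are hypothetical values chosen only to show where each element goes.

```html
<HEAD>
   <TITLE>Introduction to HTML &amp; Markup</TITLE>
   <!-- HTTP-EQUIV: an HTTP-style statement delivered with the document -->
   <META Http-Equiv="Content-Type" Content="text/html; charset=ISO-8859-1">
   <!-- Name/Content pairs: metadata for people and robots -->
   <META Name="Description" Content="A conceptual lecture on HTML and markup.">
   <META Name="Keywords" Content="HTML, markup, SGML, DTD">
   <!-- Relative URLs in the document resolve against this (made-up) address -->
   <BASE Href="http://www.site.com/lectures/">
   <!-- One common use of LINK: pointing at an external style sheet -->
   <LINK Rel="stylesheet" Type="text/css" Href="styles/lecture.css">
   <!-- Document-specific style information -->
   <STYLE Type="text/css">
      H1 { text-align: center; }
   </STYLE>
   <!-- A client side script loaded from a (hypothetical) file -->
   <SCRIPT Type="text/javascript" Src="scripts/lecture.js"></SCRIPT>
</HEAD>
```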

Please be aware that the above list is not an exhaustive list of all the HTML 4 elements; there are more elements than I have listed here. What the listing should do is give you a very good overview of the types of elements out there.

HTML and Browsers
(When standards are thrown out the door)

Most people who use the internet are aware of the 2 major competing browsers, Netscape and Internet Explorer. They compete fiercely for users, and one way this competition has expressed itself is by offering fancier display controls and easier design of web documents for web developers, i.e. they both want to encourage web developers to develop pages for their browser to display and not their competitor's. On the surface this may seem a good thing for developers, and in truth many of the innovative things in the modern HTML standards were first implemented in one or the other of the browsers. But what it has also done is to give web developers headaches that no designer has ever had to deal with before. As a web developer you simply can not guarantee that a web page whose HTML displays great in one browser will display anything at all in the other browser. It is to the point today that you need a good HTML reference to tell you which tags and attributes are usable, and how reliably, in which browser.

Part of this problem is the browser-specific elements and attributes that each browser incorporates into its implementation of the HTML DTD, but part of it also comes from lax enforcement of the specifications of the DTD. I have already mentioned one of these lax implementations when discussing quoting of attributes. Things have gotten so bad that many web design and HTML books and tutorials now teach the lax implementations of the DTD as if that's the way things were supposed to be done, or at least teach that the incorrect way is an allowed alternative. What follows is a listing of things that are NOT correct HTML. Despite what you may be told, see or have practiced in the past, the items in the following list are not proper HTML. (n.b. the correct rule is given after each item, marked "correct".)

Quotes are not required around attribute values:
- that are only one word long
- that are only numbers
- correct - Quotes are required around all attribute values
A closing tag is optional for:
- the <P>, Paragraph tag
- the <LI>, List Item tag
- the <TD>, Table Data Item tag
- the <DT>, Definition List Term tag
- the <DD>, Definition List Definition tag
- correct - Closing tags are required for ALL container tags
The ordering of nested tags doesn't matter, e.g.
1. ```
<H1><A Name="Name">A title</A></H1>
```
2. ```
<A Name="Name"><H1>A title</A></H1>
```
3. ```
<H1><A Name="Name">A title</H1></A>
```
4. ```
<A Name="Name"><H1>A title</H1></A>
```
are all equivalent.
- correct - Only #1 is correct, for 2 reasons:
- You must close tags from the innermost element first to the outer most element last (#'s 2 & 3 violate this rule).
- You can not nest a block element (e.g. <H1>) inside an Inline element (e.g. <A>). (#4 violates this rule)
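To make the first two items concrete, here is a side-by-side sketch of the lax form that browsers tolerate and the form that follows the rules above. The list text is made up; the image is the one from the earlier example.

```html
<!-- Lax: unquoted attribute values and missing closing LI tags -->
<UL>
   <LI>First item
   <LI><IMG Src=images/image_filename.jpg Width=100 Alt="Just a test Image">
</UL>

<!-- Proper: every attribute value quoted, every container element closed -->
<UL>
   <LI>First item</LI>
   <LI><IMG Src="images/image_filename.jpg" Width="100" Alt="Just a test Image"></LI>
</UL>
```

IMG gets no closing tag in either version because it is an empty element, not a container.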

You may ask, well, if these are actually incorrect, why do browsers support them? Browsers support them because they are such very common errors, and a browser's only alternative would be to improperly display the page or not display it at all. The bottom line is that anybody can try to make a web page, and the browsers try to accommodate all the novices out there who just hack something together without understanding it. Tutorials and books promote these shortcuts, I can only assume, because they see them as cute tricks that they can give to their readers. You should practice the correct habits not just for aesthetic reasons but because they develop excellent markup habits that will serve you very well if you ever move to the next step up, which is XML (we will discuss XML later in the semester). XML and all other markup languages are NOT forgiving like a web browser's implementation of HTML is, and the other ML's require proper markup.

Today, no matter what W3C puts out as a standard for HTML or style sheets, it is really what the browsers implement that determines what people put into their web pages. There is nothing wrong with that; as a developer you have to work with the system the way it is. But the more standardized and compatible the HTML you produce, the better you will be as a developer and the better your HTML will last in an ever changing system of standards support.

(Another common mistake people make is to assume that text in a table data cell (the <TD> element) must be enclosed in a paragraph tag. As long as there is only one block of text in the table data cell, the paragraph element is not required, because the DTD specifies that the <TD> element can contain just text data. This misconception actually can cause some problems. Because the contents of the <TD> element are displayed as text, it is one of the few places in writing HTML where you have to be very careful about where you put spaces, tabs and hard returns when formatting your HTML to make it look pretty. In the 2nd lecture for this week, on Web Design, I show an example using a table to make a composite image (often this is called a sliced image) and I show how this very issue can cause problems.)
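As a rough sketch of the kind of whitespace problem meant here (this is not the example from the Web Design lecture, and the image file names are hypothetical), compare the two tables below. In the first, the hard returns and indentation inside each TD become part of the cell's contents, and some browsers may show them as small gaps around the images. In the second, nothing but the image sits between the opening and closing TD tags.

```html
<!-- Risky: whitespace inside the TD elements surrounds each image -->
<TABLE Border="0" Cellspacing="0" Cellpadding="0">
   <TR>
      <TD>
         <IMG Src="images/slice_left.jpg" Width="100" Alt="">
      </TD>
      <TD>
         <IMG Src="images/slice_right.jpg" Width="100" Alt="">
      </TD>
   </TR>
</TABLE>

<!-- Safer: no spaces, tabs or hard returns inside the TD elements -->
<TABLE Border="0" Cellspacing="0" Cellpadding="0">
   <TR>
      <TD><IMG Src="images/slice_left.jpg" Width="100" Alt=""></TD>
      <TD><IMG Src="images/slice_right.jpg" Width="100" Alt=""></TD>
   </TR>
</TABLE>
```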

(Here are some good articles talking about web design issues and browser interoperability; please read these to better understand the issues involved:

Browsers, Browsers, Browsers! A Strategic Guide to Browser Interoperability
Effective Cross-Browser Development (Only the first 2 sections or so of this article are relevant now; the rest will be relevant when we start talking about scripting for the web.)
Common Browser Implementation Issues
and a terrific Web Technology Browser Compatability Chart)

HTML Editors

Types:

(See the item in "External Links" to find out where to download some of these items)

General Text Editors - Notepad (the default text editor that comes with MS Windows operating systems), GWD Text Editor
Dedicated HTML Editors - HomeSite, AceHTML, HoTMetaL, HTML-Kit
WYSIWYG (What-You-See-Is-What-You-Get) Page Editors - Dreamweaver, FrontPage, AOLpress

Advantages (+'s) & Disadvantages (-'s):

HTML & Text Editors -
(+) Gives full control over tagging and layout
(+) Gives access to all possible tags and all their possible attributes
(+) Allows for easier production of clean, easy to read and edit HTML
(-) Requires greater knowledge and understanding of the markup language and process
(-) Makes it harder to edit content, since both content and markup are displayed together
WYSIWYG -
(+) Simplifies complex markup tasks (e.g. tables)
(+) Allows easier editing of already marked up content since markup is not shown or at least not emphasized in display
(-) User has less control over structure and layout of markup
(-) Sometimes limits or controls the user's choice of tags & attributes
(-) Will rewrite or at a minimum reformat your markup.

My personal preference is to start laying out and structuring a document (either existing information I'm about to mark up or new content I'm typing in along with the markup) in a good HTML editor and then switch to a good WYSIWYG editor only to edit the content portions I have already marked up. Some of the more sophisticated WYSIWYG and HTML editors will have nice functionality to help you do web site production and maintenance.

Validating your HTML

My last comment is about ways to check your HTML. There are many validators out there that you can give the URL of your web page to, and they will check your HTML against the DTD you specify in your DOCTYPE statement. One of the most well-known ones is from W3C at http://validator.w3.org/. You can also download local validators or HTML cleaning programs; the best known of these is HTML Tidy (which is built into the HTML-Kit editor). It is a configurable and powerful tool for producing standardized, clean and formatted HTML.

(Another good way for you to see what good HTML looks like is to simply view source on any of the HTML web pages that I have posted for this class. These show great examples of proper structure and markup of textual information. They are simple web pages with minimal graphics and fancy design elements. This page also shows a very little bit of CSS in use. You may have noticed that all the blocks of HTML code are in a different font and a different color. This is done with a quick little bit of CSS applied throughout the document.)
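A style rule of roughly the following kind would produce that effect. This is only a sketch under my own assumptions: the elements selected, the font and the color are illustrative guesses, and the actual style sheet used on these pages may differ.

```html
<STYLE Type="text/css">
   /* Show every block of code in a monospaced font and a distinct color */
   PRE, CODE { font-family: "Courier New", monospace; color: #006600; }
</STYLE>
```

Placed in the HEAD of a page, one rule like this restyles every PRE and CODE element in the document without touching the markup of the examples themselves.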
