Getting from Networks to web pages - HTTP

(As an aside: If you didn't peruse it when you saw the link off a page from the last lesson, you might be interested in this tutorial on: COMPUTER NETWORKING BASICS. It is pretty good, and if you are interested in knowing a little more about networking, it's well worth pursuing.)

1 Under the Hood - TCP and IP protocols

1.1 IP addresses and URLs

We all understand, more or less, how Internet applications work (many of you will probably be familiar with only e-mail and the Web):

You tell your client application what resource or location you are looking for on the Internet, i.e. which Internet server or host you want to connect to and what resources you want from the place you are going to.
Your client establishes a connection to the other machine on the Internet that has the resource you want, generally called the "host".
You exchange data with the host.
You terminate your connection to the host.

And indeed, this is how almost all Internet interchanges happen. What we want to look at in this section, though, is the first and second parts of that process. (You will be playing with the 3rd part in the exercise associated with this lesson.) How does one system on the Internet find another one? The answer is something called an IP address: a unique identifying number assigned to a particular host which is connected to the Internet. Under the current scheme, IPv4 (which is gradually being replaced by the next generation of the IP protocol, IPv6, with its much longer addresses), the IP address is simply a 4 part number. The parts are separated by periods and each can take a value between 0 and 255, e.g. the IP address for your MSU Blackboard site is: 147.133.1.23 . These IP addresses are the real way to find your path on the Internet and are, as the name implies, part of the IP protocol.
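To make the "4 part number" concrete, here is a minimal Python sketch using the standard ipaddress module. The function name describe_ipv4 is made up for this illustration; the point is only that the dotted form is a convenient way of writing one 32-bit number.

```python
import ipaddress


def describe_ipv4(text):
    """Return the four parts of a dotted IPv4 address and its 32-bit value."""
    # IPv4Address raises ValueError if the text is not a valid address,
    # for example if one of the parts is larger than 255.
    addr = ipaddress.IPv4Address(text)
    parts = [int(p) for p in text.split(".")]
    return parts, int(addr)


parts, value = describe_ipv4("147.133.1.23")
print(parts)   # the four parts, each between 0 and 255
print(value)   # the same address written as a single integer
```

Each part fills 8 bits of the number, which is why no part can exceed 255.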

1.1.1 Host Names, Domains and DNS

So what about all those funny addresses and names on the Internet? "After all", you might ask, "I don't get to Yahoo by going to some strange 4 part number, but by simply going to www.yahoo.com." (Actually you could use a funny number to get to Yahoo. One of the funny numbers that will get you to Yahoo is 216.115.108.245. We will use it later in your exercise.) The next question, then, is how the Internet names of sites and companies get assigned and, once assigned, how they are related to those IP addresses that are the real addressing system on the Internet.

First of all we should start using proper terminology. The core part of any Internet site name is called the "domain name" for that site, and it is associated with a certain IP address range called the "domain". The whole process of assigning IP addresses and domain names was originally managed by an organization called InterNIC; today it is coordinated by ICANN, working through IANA and accredited registrars. I'm sure as a user of the Internet you will have noticed that most site names end in one of about 5 ways:

.com - Generally for commercial sites
.edu - For Educational institutions
.net - Generally for network or Internet service providers
.org - Generally for non-profit organizations
a funny 2 letter abbreviation - Used to designate country (A list of country abbreviations is available)

These and other possible such endings are called top-level domains. (There have been a couple of pushes over recent years to construct more abbreviations for these top-level domains.) When someone wants to start a new Internet service, the first thing they must do is register a domain name for themselves. There are many companies today that offer domain name registration services. In order to complete the registration process, though, you must have a place to "park", as they say, your domain name. What this means is that every domain must have an associated IP address or address range; after all, domain names are only mnemonics for the underlying IP addresses they point to. (FAQ about domain name registration)

There are many stories about domain name theft, domain name rights and domain name squatting. Such controversies are inevitable since the domain name is the public face of whatever organization it represents, and if an organization has an established word, name or phrase that represents them (or that they might have trademarked) they will want to own any domain name that is associated with this word or phrase.

The domain name is only a part of the address needed to identify a particular host on the Internet; you also need what is called the hostname. Since the domain, and its associated domain name, can describe a whole network and all of the network services available in that network, the hostname is the part which identifies a particular host or Internet service in the domain. Thus, to uniquely identify the particular Internet host, server or service you are looking for, you need what is called a fully qualified domain name (FQDN), with a hostname, domain name and top-level domain name all put together, e.g. "www.yahoo.com". As you can see from the example, the parts of an FQDN are strung together with periods separating them, just like an IP address. (In Internet lingo the whole FQDN is sometimes referred to as the "hostname", which, if you consider the Internet as one big network, would be a correct usage of the word.)
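The parts of an FQDN can be pulled apart with a few lines of Python. This is a deliberately naive sketch: the function name split_fqdn is invented here, and it simply assumes that the first label is the hostname and the last label is the top-level domain, which holds for a name like www.yahoo.com but is not a general rule for every name on the Internet.

```python
def split_fqdn(fqdn):
    """Naively split an FQDN into (hostname, domain name, top-level domain)."""
    labels = fqdn.rstrip(".").split(".")
    if len(labels) < 3:
        raise ValueError("expected at least host.domain.tld")
    host = labels[0]                 # e.g. "www"
    tld = labels[-1]                 # e.g. "com"
    domain = ".".join(labels[1:])    # e.g. "yahoo.com"
    return host, domain, tld


print(split_fqdn("www.yahoo.com"))
```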

Now we are getting close to putting it all together. We know that computers are really identified on the Internet by their IP address, but that users generally reference them using their hostname (we will use the word "hostname" from here forward to refer to an FQDN on the Internet, since the only network we will be talking about from here on out is the Internet). So how are the 2 things related? There obviously must be some system somewhere that translates hostnames into IP addresses. Indeed there is such a system, and it is called the DNS ("Domain Name System"; when referring to a particular machine which does that translation, DNS can also stand for Domain Name Server). So each domain, and all its associated hostnames, must be entered into a domain name server somewhere for it to be accessible on the Internet. Not all domains run their own domain name server, though. For instance, if I rent some space with an Internet Service Provider (ISP) and have this ISP host my domain, they will use their system's DNS to host my domain and many others as well. However, a large organization like MSU runs its own DNS, as do most organizations that run and manage their own network hardware.
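The hostname-to-address translation can be tried from Python, which hands the question to whatever DNS your system is configured to use. This is only a sketch: the function name resolve is an invention for this example, and it asks for IPv4 addresses only.

```python
import socket


def resolve(hostname):
    """Return the IPv4 addresses the system resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr),
    # and for IPv4 the sockaddr is an (ip_address, port) pair.
    return sorted({info[4][0] for info in infos})
```

With a working network connection, calling resolve("www.yahoo.com") should return one or more dotted addresses. A large site often has several, and the answer can change from day to day.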

1.1.2 Ports

We are now only missing one small piece of this network naming and addressing system, and that is the concept of a network port. Just as IP addresses are the built-in addressing scheme for the IP layer of the Internet protocol stack, ports are the addressing scheme for the next protocol layer up, the TCP layer. If an IP address specifies a unique host computer system, then the finer gradations of ports must specify some subsystem of that computer. Specifically, these ports are logical (in computer lingo this just means they are virtual, not physical, ports) access points into the separate network services that the host system makes available through the TCP/IP protocol stack (i.e. the Internet). A port is identified by a number between 0 and 65535. The various network applications on the host system are said to "listen" on a particular port; this means they are waiting for some other host system on the Internet to request whatever type of service that application provides.
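What "listening" on a port looks like from the application's side can be sketched in a few lines of Python. The function name open_listener is made up for this example, and it binds only to the local machine (127.0.0.1), so nothing is exposed to the Internet.

```python
import socket


def open_listener(port=0, host="127.0.0.1"):
    """Create a TCP socket that listens on the given port of this machine."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))   # port 0 asks the operating system for any free port
    server.listen()             # from now on clients can connect to this port
    return server


server = open_listener()
print("listening on port", server.getsockname()[1])
server.close()
```

A real server would go on to call accept() in a loop and answer each client that connects; this sketch stops at the point where the port is open.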

Maybe you are wondering: if these ports are an essential part of how we access TCP/IP (i.e. Internet) applications and services, how come I don't need to specify one when I access a service like www.yahoo.com's web pages? The answer is that everyone has agreed to use a specific port number for each of a large number of Internet services, and most Internet clients, like your web browser, know these ports and use them by default if the user does not specify a different port. So, unbeknownst to you, your Internet client software has been dabbling with the Internet addresses you specify, quietly adding the appropriate port numbers to the hostnames when you forgot or neglected to. Here is a short list of the most common of these accepted port numbers:

Internet Application Protocol	Port Number
FTP - File Transfer Protocol	21
TELNET	23
SMTP - Simple Mail Transfer Protocol	25
Gopher	70
Finger	79
HTTP - Hyper Text Transfer Protocol - Used for WWW	80
POP3 - Post Office Protocol	110
NNTP - Network News Transfer Protocol - Net News	119
HTTPS - Secure HyperText Transfer Protocol - Used for WWW	443

(In a section titled "WELL KNOWN PORT NUMBERS" RFC 1700 has a fairly extensive listing of commonly accepted port numbers.) All these standardized port numbers are below 1024, so system and Internet service administrators should choose numbers of 1024 or above when assigning their own port numbers to services.
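The quiet "adding of the port number" that a client does can be sketched like this. The table DEFAULT_PORTS and the function with_default_port are illustrations written for this lesson, not part of any real browser; the port numbers are the standard registered ones (NNTP's registered port is 119).

```python
# Well-known default ports for a few common application protocols.
DEFAULT_PORTS = {
    "ftp": 21, "telnet": 23, "smtp": 25, "gopher": 70, "finger": 79,
    "http": 80, "pop3": 110, "nntp": 119, "https": 443,
}


def with_default_port(protocol, address):
    """Split 'host' or 'host:port', filling in the protocol's usual port."""
    host, sep, port = address.partition(":")
    if sep:                       # the user named a port explicitly
        return host, int(port)
    return host, DEFAULT_PORTS[protocol.lower()]


print(with_default_port("http", "www.yahoo.com"))
print(with_default_port("http", "www.kyenc.com:8040"))
```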

1.1.3 URLs - the real addressing scheme for the Internet

Now we know how to address a specific host (via hostname or IP address) and any network services that host offers (via the appropriate port number). The next step is to access specific data, information or other resources available through that network service, e.g. we want to access a specific web page from a web server running on a certain host machine. The addressing scheme used to find individual pieces of information or resources is called the URL, Uniform Resource Locator (quite aptly named, wouldn't you say?). So, the URL is not actually a way to address a host but rather a way to locate a particular resource on that host. In fact the hostname is only a small part of what makes up a URL. (You should peruse these two links: 1. A Beginner's Guide to URLs, 2. Web Naming and Addressing Overview (URIs, URLs, ...).)

The most basic structure of a URL is:
Scheme:Source
where the scheme is what we might call the protocol or service being referenced for the resource, and the source is some reference to the location of that resource on the host system. The most common schemes are:

http - for the World-Wide-Web,
https - for secure web transactions,
file - for accessing a file on the local machine,
ftp - for file transfer via FTP,
telnet - for logging into a machine via a telnet client
gopher - an older text-based version of the web,
mailto - for sending e-mails.

The elements that make up the source vary considerably depending upon the scheme type and the resource being sought. For two of the most common protocols the source looks as follows:

For http scheme: http://user:password@host:port/path?searchpart
- user - a user name if one is required for authorization
- password - the password required for authorization
- host - The FQDN or IP address of the host system
- port - the numerical port if the server does not listen to the default HTTP port 80
- path - the complete path and filename necessary to locate the resource
- searchpart - user data being sent to host if GET HTTP method is being used for request
Many of these parts are optional and proper URLs come in a wide variety of shapes and sizes combining any number of these parts. Some examples would be:
- http://www.yahoo.com/ - note that no path or file details are specified here in the source part; only the hostname is specified. This still works because (as we will learn more about later when we discuss web sites and web servers) web servers are normally configured with a default document, most commonly called index.html, that they will try to serve if no other page is specified. So, on a server configured that way, this URL is equivalent to http://www.yahoo.com/index.html
- http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?query=HTTPS - note that this calls a CGI script and provides user data to the script via the searchpart of the URL (Can you lookup, or guess which country this site is hosted in from its URL?)
- http://www.kyenc.com:8040/ - note the use of a different port number other than default port 80
For mailto scheme: mailto:proper-email-address, e.g. mailto:[email protected].
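The parts of an http URL listed above can be picked out with Python's standard urllib.parse module. The helper url_parts and its dictionary keys are just labels chosen to match the names used in this lesson.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Break a URL into the parts named in this lesson."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,      # None when the URL has no user part
        "password": parts.password,  # None when the URL has no password part
        "host": parts.hostname,
        "port": parts.port,          # None when the URL does not name a port
        "path": parts.path,
        "searchpart": parts.query,
    }


print(url_parts("http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?query=HTTPS"))
print(url_parts("http://www.kyenc.com:8040/"))
```

Note that the port comes back as None when it is left out; it is the client that then falls back to the default port for the scheme.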

1.2 Using the Telnet protocol

TCP (the Transmission Control Protocol) has had 2 major application protocols sitting directly on top of it almost from the start. These are Telnet and FTP. What does this mean for us? Well, the Telnet application protocol is a very thin layer over TCP: a telnet client does little more than open a TCP connection to a host and port and pass along, as plain text, whatever you type. Since most other application protocols also stack on top of TCP (see the diagrams of the OSI and Internet network models from last lesson) and many of them are text based, a telnet client can connect to them too. Therefore it is via a telnet client that we can access other server applications (via their appropriate port number, of course) running on a host machine.

We will do just this in the exercise associated with this lesson. We will use a telnet client to issue commands, through a network port, to the application that handles the HTTP protocol, the HTTP server (otherwise known as a web server). We will in essence become an HTTP client (better known as a web browser) and interact directly with the web server over the network. (An exercise for you to use some FTP commands will come next week.)
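What you will be doing by hand in the exercise can be approximated in Python with a plain TCP socket. This is a rough sketch, not a real telnet client: the function names to_wire and talk are made up here, and the code simply sends some lines of text to a port and collects whatever comes back until the server hangs up.

```python
import socket


def to_wire(lines):
    """Encode typed lines as they travel over the network: CRLF after each."""
    return "".join(line + "\r\n" for line in lines).encode("ascii")


def talk(host, port, lines, timeout=10.0):
    """Open a TCP connection, send the lines, and return everything received."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(to_wire(lines))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:          # an empty read means the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

For example, with a working network connection, talk("www.yahoo.com", 80, ["HEAD / HTTP/1.0", ""]) sends the same two lines (a request and a blank line) that you would type into a telnet session connected to port 80. What the server answers will vary.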

If, while you are doing the exercise, you feel a bit like a hacker, you are right in some respects. Being able to talk directly to any application protocol listening on an open port is a HUGE source of vulnerability in networked systems: anyone who can reach the port can send the server whatever they like, and older protocols like Telnet and FTP send everything, passwords included, as unencrypted text. Many attacks on networked systems arrive this way. The recent scare about the viruses attacking Microsoft web servers was a worm which propagated itself by sending requests directly to web server ports in much this manner.

2 On to HTTP

As mentioned above, the protocol that is used to communicate between web client and web server is the Hypertext Transfer Protocol, HTTP (and its more paranoid and secure cousin HTTPS). The version we work with in this lesson is version 1.1, as specified in RFC 2616. (That RFC has since been superseded, most recently by RFCs 9110 and 9112, and newer versions of the protocol exist, but the section numbers below refer to RFC 2616.) Read and understand at least sections 1.4, 3.3.1 (you'll need these standard data formats for making good web pages), 4.1-4.3, 5.1-5.3, 6.1.1 and 9.3-9.6.

2.1 What is HTTP

This protocol specifies the steps that make up a Web transaction between client and server and how they both will exchange information. Most of the information they exchange via this protocol is in the form of statements in what is called the "header" of the transaction. The protocol provides, via these headers, the capabilities for both the client and the server to send information about themselves and the resources they are going to transmit or request.

The basic steps of an HTTP transaction are:

Client opens a connection on the server's HTTP port
Client issues a request for a resource composed of:
- An appropriate HTTP method requesting a resource
- Header information
- A blank line (The blank line is always used to signal the end of the header data in any part of an HTTP transaction)
The server processes the request (which could be as simple as fetching a single file, such as a web page or one of its images, or as complex as running an application script on the server.)
The server returns a response composed of:
- A response code
- Header information
- A blank line (The blank line is always used to signal the end of the header data in any part of an HTTP transaction)
- A body containing any appropriate data to be returned (like a web page)
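The server closes the connection (unless client and server have agreed to keep it open for further requests, which is the default in HTTP version 1.1)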
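The shape of the request and the response described in these steps can be sketched in Python. Both function names (build_request and parse_response) are inventions for this illustration, and the parser is simplified: it assumes the whole response is already in memory and ignores details such as repeated headers.

```python
def build_request(method, host, path="/"):
    """Build the bytes of a minimal HTTP/1.1 request: method line, headers, blank line."""
    lines = [
        f"{method} {path} HTTP/1.1",   # the request method and the resource wanted
        f"Host: {host}",               # header information
        "Connection: close",           # ask the server to close when it is done
        "",                            # the blank line that ends the header
    ]
    return ("\r\n".join(lines) + "\r\n").encode("ascii")


def parse_response(raw):
    """Split a raw HTTP response into (status code, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")      # the blank line ends the header
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    fields = status_line.split(" ", 2)              # e.g. HTTP/1.1, 200, OK
    code = int(fields[1])
    reason = fields[2] if len(fields) > 2 else ""
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return code, reason, headers, body


print(build_request("GET", "www.yahoo.com"))
print(parse_response(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"))
```

Header names are lower-cased here only to make lookups easy; HTTP treats them as case-insensitive.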

The 2 most common HTTP request methods (request methods are the ways a client asks for resources from a server, see section 5.1 of RFC 2616) are GET and POST. (N.b. all web designers specify one of these methods or the other whenever they make a form in an HTML page. We will discuss the difference between these 2 methods later when we discuss forms and server-side scripts.) These are the 2 request methods that return web pages to a web browser. Two other common methods are

HEAD:: used to request only the header from the server and not the body of a resource (This is used, for instance, by caching web proxy servers to check whether a document that they currently hold in their cache is fresh. You will play with this feature of HTTP in your exercise.), and
PUT:: used to allow a client to send a resource to the server. (This is not the same as sending data from a form like you do in your discussion board; this is allowing a client to send a specific file to the server.) (If you use the University provided web publishing system, it uses this PUT feature to allow users to post pages to their campus web sites.)

When the server issues a response, the first thing it does is issue a response code which tells of the success or failure of the request. This is called the Status-Code. The first digit of the Status-Code defines the class of response. There are 5 values for the first digit:

1xx: Informational - Request received, continuing process
2xx: Success - The action was successfully received, understood, and accepted
3xx: Redirection - Further action must be taken in order to complete the request
4xx: Client Error - The request contains bad syntax or cannot be fulfilled
5xx: Server Error - The server failed to fulfill an apparently valid request

You can see all of the possible status codes listed in section 6.1.1 of RFC 2616 and all of section 10 of that RFC is dedicated to defining these status codes in detail.
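Because the class of a response depends only on the first digit, a program can classify any status code with one integer division. The names STATUS_CLASSES and status_class below are made up for this sketch; the class names are the five listed above.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code):
    """Return the class of an HTTP status code, based on its first digit."""
    first_digit = code // 100
    if first_digit not in STATUS_CLASSES:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return STATUS_CLASSES[first_digit]


for code in (200, 301, 404, 500):
    print(code, status_class(code))
```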

Data about the server, the client, the resource the client is requesting, and the resource the server is returning are all embedded in the HTTP header. This data is sent in various HTTP header fields. There are too many possible header fields to list here, but you can see a number of them (both client and server) from a little script I have which returns all of the HTTP header fields that the script sees from both client and server. An important one that is not in this list is the "Last-Modified" header field. All of the possible header fields are discussed in RFC 2616.
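As an illustration of how a header field gets used, here is a sketch of the freshness check a cache might make with the "Last-Modified" value. The function names parse_http_date and cache_is_fresh are made up for this example, and real caches follow more detailed rules than this; the date shown is in the standard HTTP date format.

```python
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Turn an HTTP date such as 'Sun, 06 Nov 1994 08:49:37 GMT' into a datetime."""
    return parsedate_to_datetime(value)


def cache_is_fresh(cached_last_modified, server_last_modified):
    """True if the server's copy is no newer than the copy held in the cache."""
    return parse_http_date(server_last_modified) <= parse_http_date(cached_last_modified)


cached = "Sun, 06 Nov 1994 08:49:37 GMT"
print(parse_http_date(cached))
print(cache_is_fresh(cached, "Mon, 07 Nov 1994 10:00:00 GMT"))
```

The second value stands in for what the server would report in the "Last-Modified" field of its answer to a HEAD request.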

2.2 How is HTTP used

Most Web designers rarely see the HTTP protocol because the web server, which delivers their pages, and the user's web client, which views their pages, handle all of the HTTP transparently in the background to properly deliver the page. However, even a simple web page designer can benefit from knowing it, since there are ways to embed HTTP into a web page to force certain actions from the web server and web client when they handle your page. Web programmers, on the other hand, can't always depend upon the client and server to assist in their transactions, and so they must sometimes send HTTP commands directly as part of their scripts.

A few other useful application protocols

Ping - used to check your connection and network connection speed to a host on the Internet. Try it from a DOS window, e.g. ping www.yahoo.com . I use this all the time if I have network trouble; it's a simple, reliable tool (built on ICMP, a companion protocol to IP) that allows me to check my network connections without a complex client application (like a web browser) between me and the Internet.
nslookup - used to query a system's DNS for data on the hostname and IP address of a network host. (It is a standard tool at a UNIX prompt and also ships with current versions of Windows.)
