This tutorial will help you learn HTML by first introducing you to XML.


Hierarchy

XML arranges data in a hierarchy, a tree-like structure in which objects can branch off into other objects.

<root>
 |
 +-<child1>
 |  |
 |  +-<child3>
 |
 +-<child2>

Each object is a node. There is one node from which all other nodes are descended. This is called the root.

When a node is descended from another node, it is a child of the node it is descended from, which is its parent. In the above example, <child1> is the parent of <child3> and is also the child of <root>.


XML

XML is a language for representing a hierarchy in a plain-text file. Nodes are represented by tags which surround text data, looking like: <tag>data data data</tag>. Before the data, there is a start tag which looks like <tag_name>. After the data, there is an end tag which looks like </tag_name> with a forward slash before the tag name. These are generally called opening and closing tags.

<root>
 <child1>
  <child3>Data in child3</child3>
 </child1>
 <child2>Data in child2</child2>
</root>

There are programmers' tools designed to read XML, so adding an XML structure to a file allows programmers to easily use the data in their programs.

There is also another type of tag, used for representing a node that does not contain any data: <tag/>, with a forward slash before the final >.
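
As a small illustration, here is the earlier tree with <child1> written as an empty tag. This is only a sketch reusing the made-up node names from the example above; <child1/> means the same thing as <child1></child1>.

<root>
 <child1/>
 <child2>Data in child2</child2>
</root>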


Escape Sequences for Special Characters

So what if you want to have > or < characters in your data? An XML reader will get confused when it sees them, thinking they are the start or end of tags. The solution is to use escape sequences: &lt; for < and &gt; for >. All escape sequences start with an ampersand '&' and end with a semicolon ';'.

Okay, so now what if you want to have '&'s in your data? There is an escape sequence for them too: &amp;

For advanced users: If you want to use symbols that are not on your keyboard but you know their Unicode numbers, you can escape them with &#nnnn; where nnnn is the symbol's number in decimal, or with &#xhhhh; where hhhh is the same number in hexadecimal.
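
To see the escape sequences together, here is a sketch of a node whose data contains all three special characters plus a numbered symbol. The <note> tag name is made up for this example, and &#169; assumes you want the copyright sign, which is number 169 in Unicode.

<note>Use &lt;p&gt; for paragraphs &amp; &lt;li&gt; for list items. &#169; 2003</note>

An XML reader would hand the following text to the program: Use <p> for paragraphs & <li> for list items. © 2003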


Comments

An XML file can also contain comments, which are notes that a programmer leaves inside a file to be read by other programmers (or himself six months later). A comment starts with <!-- (less-than, exclamation point, two hyphens) and ends with --> (two hyphens and a greater-than) and otherwise may not contain a -- double-hyphen.

<!-- Section 2 was added 10/04/2003 by Fred -->
<p id="section2">this is section 2...
...
</p>

HTML is XML

XML is little more than the above method of defining a node hierarchy. XML does not actually define any specific, meaningful nodes. You can name the nodes whatever you want, and they mean what you want them to.

This means you have to be somewhat wary of software marketers who proclaim that their file formats are "standard XML". The node structure can be of their own, proprietary design, and the data contained within the nodes can be scrambled or encrypted in any way, and it can still be called "standard XML"!

XML-based formats are subsets of XML. Where XML allows you to do what you want, a subset defines a few specific nodes and gives them meaning. Programs can then be written to use the subset. HTML, in its XML-based form called XHTML, is one such subset of XML, and it is implemented by several web browsers and web development tools.
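
To make the difference concrete, here is a sketch of the same shopping list written twice. The first version uses node names invented for this example, which mean nothing to a web browser. The second uses names that the HTML subset defines (they are introduced in the next section), so a browser knows to draw it as a bulleted list.

<shopping>
 <item>Bread</item>
 <item>Milk</item>
</shopping>

<ul>
 <li>Bread</li>
 <li>Milk</li>
</ul>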


Simple HTML

Paragraphs

The simpler data structures used in HTML are the paragraph and the list. Paragraphs are defined with the <p> tag:

<p>This is a paragraph</p>
<p>This is
another paragraph.</p>

When you wrap the paragraphs in <p> tags, the web browser knows to format them as separate paragraphs automatically. Web browsers ignore carriage returns, tabs, spaces, and other whitespace between tags. Whether you place the two paragraphs on the same line in your HTML file or add several empty lines between them, the browser will format them the same way. You can also have carriage returns or whitespace in the middle of your paragraphs so your HTML looks pretty, and the browser will collapse them and draw only a single space between words.

If you actually want to have a line break in the middle of a paragraph, you would insert a <br/> tag (note the /> shorthand for an empty tag) where you want the line to wrap. In most cases, however, you will want the browser to automatically format your text as paragraphs. People will be viewing your page on any kind of display from monitors to cell phones to the printed page, and manual line wrapping might break lines in the wrong place if the display width is different. If you want to have a space that the browser will not line-wrap, use the escape sequence &nbsp; which stands for non-breaking space.
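
For instance, a short verse and a measurement might be marked up as follows. This is only a sketch with made-up text: the <br/> forces the line break after "red," and the &nbsp; keeps "10" and "kg" from being split across two lines.

<p>Roses are red,<br/>
Violets are blue.</p>
<p>The package weighs 10&nbsp;kg.</p>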

Lists

Lists are made up of two tags. Items in the list are placed inside <li> tags, for "list item". The entire list is placed inside <ul> tags if the list is not supposed to be in any particular order, or <ol> tags for an ordered list.

<ul>
<li>Bread</li>
<li>Butter</li>
<li>Milk</li>
<li>Eggs</li>
</ul>

This produces a list:

  • Bread
  • Butter
  • Milk
  • Eggs
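
An ordered list works the same way, with <ol> in place of <ul>. As a sketch with made-up items:

<ol>
<li>Crack the eggs</li>
<li>Melt the butter</li>
<li>Toast the bread</li>
</ol>

A browser would typically number the items for you:

  1. Crack the eggs
  2. Melt the butter
  3. Toast the bread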

XML Attributes

Now, how does the Web allow for special things like linking to other pages, using fonts, or showing images? The instructions to change appearance or define a link to another page are not part of the page data, the meaningful information that might exist in a printout.

XML allows for nodes to have attributes, a kind of metadata that describes the node structure. Attributes are defined as follows: <tag attribute="value"> where attribute is the name of the attribute and value is the value that it is set to. Note that the value must be surrounded by quote characters! Double quotes are the usual choice, though XML also accepts single quotes.

Like node names, XML does not define attribute names, apart from a few reserved ones which will not be mentioned here. One attribute name matters a great deal in practice, though: XHTML, like many other XML formats, lets nodes have an attribute named id whose value must be unique within the entire document.
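
As a sketch, here are a few nodes with attributes. The <library> and <book> tags and the lang attribute are names made up for this example; the id values simply show that no two nodes in a document share the same id.

<library>
 <book id="b1" lang="en">Moby-Dick</book>
 <book id="b2" lang="fr">Candide</book>
</library>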

Links

To create a link, you use a new tag: the <a> tag. You put the URL of the target in the <a> tag's href attribute:

<a href="http://somewhere">this text would link to somewhere</a>

You can also link to a certain spot within a page. First, give the target tag an id attribute: <p id="here">I want to link to this paragraph</p>, for instance. Then for your link, use an <a> tag whose href is the other tag's id preceded by a hash mark '#': <a href="#here">go to the spot marked "here"</a>. Examples:

<a href="#here">go to the spot marked "here" on this page</a>
<a href="page2.html#here">go to #here on page2.html</a>

Images

For images, you use the <img> tag and place the image's URL in its src attribute. You should also use the alt attribute to supply alternate text, which is shown if the image does not display or if the browser does not display images. For browsers which do support images, you can use the title attribute to make text appear when the user hovers the mouse pointer over the image.

<img src="path/to/image.png" alt="[picture of puppy]" />
<img src="/images/aboutus.png" alt="About Us" />
<img src="smile.png" alt=":-)" title="Hi there!"/>

Note the use of the shorthand /> to end the tags, since <img> tags do not contain any data.

While it is possible to load another site's images in your web page, it is considered very rude to do so. Images can have large file sizes, and bandwidth is expensive! To be polite, it is best to have images on your own server. Also, another server's owner could delete the image or change it to something you don't want your users to see, or could simply block your users' access to his files.

Formatting

Visual formatting is done with Cascading Style Sheets (CSS), which are beyond the scope of this document. HTML is not meant to be a word processor or a page layout editor, so graphical formatting is handled by CSS. However, HTML has tags you can use to describe the content of your page.

There's no guarantee that all browsers will draw a tag in the same way. In fact, they're not supposed to. Browsers can be on computer monitors, cell phones, televisions, or as yet uninvented devices, or can be programs which read the page out loud. When you use descriptive tags, browsers can express the content in whatever way the programmer or browser user thinks is best, no matter what platform the browser runs on.

If you want to give some words a special emphasis, you can wrap them in the <em> tag. If you want to speak strongly, you can wrap the words in a <strong> tag.

<p>To calibrate the XYZ-42 frobnitz, apply the Illudium Q-36 Modulator
to the primary thingamajig interface. It <em>will not work</em>
if you use the secondary interface. And above all, <strong>do not push
the big red "self-destruct" button!</strong>
</p>

If you have an acronym or abbreviation and you want your users to be able to see the full name, you can use the <abbr> tag and its title attribute:

<p>
<abbr title="Hypertext Markup Language">HTML</abbr> is neat!
</p>

If you want to put a chunk of a plain text file in your web page, you can wrap it in <pre> tags (for "pre-formatted"):

<p>Flowers?</p>
<pre>
  * **
  _V/
 * | 
==]3
</pre>

If you want to quote somebody else, you can use the <q> tag if the quote is inside a paragraph, or the <blockquote> tag if the quote is long or spans several paragraphs. If you have a URL for the source of the quote, you can include that in the cite attribute for either tag.


<p>Freedom of the press is nice, but as H.L. Mencken said,
<q cite="http://watchfuleye.com/mencken.html">Freedom of press
is limited to those who own one.</q>
</p>


<blockquote cite="http://www.circus.com/~nodhmo/">
<p>
Dihydrogen monoxide is colorless, odorless, tasteless, and kills
uncounted thousands of people every year. Most of these deaths
are caused by accidental inhalation of DHMO, but the dangers of
dihydrogen monoxide do not end there.
</p>
<p>
Prolonged exposure to its solid form causes severe tissue damage.
Symptoms of DHMO ingestion can include excessive sweating and
urination, and possibly a bloated feeling, nausea, vomiting and
body electrolyte imbalance. For those who have become dependent,
DHMO withdrawal means certain death.
</p>
</blockquote>

HTML Skeleton Page

The tags you've learned so far are for the most common HTML nodes. Some browsers can render simple HTML tags by themselves, but a page is not considered proper XHTML unless it is built on this framework:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>PAGE TITLE GOES HERE</title>
</head>
<body>
</body></html>

This is a skeleton page. You will put the HTML you have written inside the <body> tags. Also, you would put a title where "PAGE TITLE GOES HERE" is. What these tags mean and do is explained in the appendix.
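
As a sketch of how the pieces fit together, here is the skeleton filled in with a few of the tags from this tutorial. The title and text are made up for illustration. Note the xmlns attribute on <html>: it names the XHTML namespace, which the XHTML 1.0 standard asks for on the root node.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>My Shopping List</title>
</head>
<body>
<p id="top">Things to buy <em>today</em>:</p>
<ul>
<li>Bread</li>
<li>Milk</li>
</ul>
<p><a href="#top">Back to the top</a></p>
</body></html>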

Now that you have a simple HTML file, you can look at it in your browser and see if it looks good. If you like it, you can find a web server and put it on the web so other people can see it. Most ISPs let you have space for a website, but you will have to talk to them to find out how you can transfer your files to your web space and what you will need to type into your browser to find the page.


Farewell

This tutorial cannot cover everything that you can do with HTML, so it won't try. There are other tutorials which cover HTML well, but be aware that there were several versions of HTML before XHTML. These older versions were not based on XML, and examples using them will not always conform to XML's requirements. Two specific things to watch for are making sure that your tags are closed and that they are written in lower case (all XHTML tags are lower-case).

Good resources include Maricopa Community College's Writing HTML tutorials, The Web Design Group's HTML 4.0 reference (most HTML 4.0 tags also exist in XHTML 1.0) and Web Authoring FAQ, The "Viewable With Any Browser" Campaign's Design Guide, and the University of Illinois at Urbana-Champaign's venerable Beginner's Guide to HTML. Use your browser's "view source" feature to view the HTML source code of interesting web pages (such as this one). For the final word, read the HTML standard (upcoming version) and CSS standard directly, and use the W3C validator.

The validator can be a bit hard to use. First, your HTML file must have a correct !DOCTYPE tag (like the one in the skeleton page) or you can manually give a doctype to the validator. Otherwise, it will refuse to run. As soon as the validator runs into a single error, it will get confused and report errors throughout the rest of the page even if there was only the one error, so you will want to fix the first errors first. Sometimes the actual error in your code will be several lines above where the validator first notices something is wrong.


Appendix

Now, about that skeleton:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>PAGE TITLE GOES HERE</title>
</head>
<body>
</body></html>

The <?xml?> notation is something special that should go at the start of every XML file so programs can know that it is XML and what character encoding it uses. UTF-8 is an encoding of the Unicode character set, which covers nearly every written language; for plain English text it is identical to ASCII.

<!DOCTYPE is another special kind of notation for XML. It defines the subset of XML that the file uses and gives the URL of a site where a browser can find a computer-readable definition of the format, in this case XHTML 1 Strict. Note that the !DOCTYPE line looks like a tag, but does not close. That's because it's a special type of tag that is a holdover from SGML, the forerunner of XML. Many people don't like the way !DOCTYPE works and would rather define formats with XML Schemas, but XHTML 1.0 is defined with a !DOCTYPE, so that is what this skeleton uses.

<html> is the root node of the HTML document.

<head> is the HTML header. This contains the page title (in a <title> tag), any possible client-side scripts you might run (in a <script> tag), stylesheets for visual formatting (in a <style> tag), and links to external stylesheets or related documents (in a <link> tag). This is essentially the metadata of the HTML document.
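
For example, a header might look like the sketch below. The page title, the file name style.css, and the style rule are all made up for illustration, and the CSS itself is beyond the scope of this document.

<head>
<title>About Us</title>
<link rel="stylesheet" type="text/css" href="style.css" />
<style type="text/css">
p { margin-left: 2em; }
</style>
</head>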

<body> is where you place your HTML data.


Definitions

Plain-text

Computers represent the alphabet as numbers. For instance, the capital letter 'A' is the number 65 for computers using the ASCII character set. A lowercase 'a' is the number 97; since they are different characters, the computer sees them as different numbers. Computers actually represent everything as numbers. A file is just a stream of numbers that the computer must be programmed to recognize as something special.

The ASCII character set is an industry standard that defines only 128 such number-symbol relationships. For the more advanced features that word processors have -- bold text, fonts, paragraph spacing, etc. -- the word processor makers had to create their own data format standard.

A plain-text file is one which only uses ASCII characters, or characters from ASCII's successor Unicode. Likewise, plain-text editors are programs which specialize in opening such files. If you use a plain-text editor to open a word processor document, with all its extra word processor codes, the file will look like garbage as the special word processor codes cannot be understood or are matched to strange symbols.

Widely used text editors include Notepad and the command-line tool Edit for Windows, BBEdit for Macintosh, and vi, emacs, pico, jed, and joe for Unix. Many word processor programs have a feature that will allow you to save any document as plain-text.

Metadata

Data is the information that you want people to read: the words on the page.

Metadata is information that describes the page structure: information about the nodes, the file locations of images and other external objects used in the page.

If you are designing an XML format, you want to be clear on whether your information is data or metadata. Since XML attributes look like leaf nodes, you might feel tempted to put leaf-node data in attributes. Don't do it!
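
As a sketch of the difference, compare these two ways of recording a recipe. All of the names are made up for this example. In the first, the words a person would read are node data and only descriptive information sits in attributes. In the second, the data has been stuffed into attributes, which is the temptation to resist.

<recipe id="r12" lang="en">
 <name>Pancakes</name>
 <ingredient>Flour</ingredient>
 <ingredient>Milk</ingredient>
</recipe>

<recipe id="r12" lang="en" name="Pancakes" ingredient1="Flour" ingredient2="Milk"/>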

URL

Abbreviation for Uniform Resource Locator. The "address" of Internet data which is used in a link or typed into a browser's address bar. URLs generally take the form of:

service://user:password@subdomain.domain.top-level-domain:port/path/to/file

Example URLs are:

  • http://google.com
  • http://news.bbc.co.uk/1/hi/world/default.stm
  • ftp://ftp.winzip.com
  • mailto:user@aol.com
  • gopher://wiretap.area.com/

URLs are also known as URIs (Uniform Resource Identifiers), which is considered the more correct name, but people are used to calling them URLs.


Written by David Turover, September 2003 (revised October 2003, June 2005) and released to the public domain excepting the DHMO quotation which is copyright the Coalition to Ban DHMO (that's water if you hadn't figured it out).

Thanks to attutle for suggestions.