HTML

Goals

Concepts

Lesson

One of the primary drivers of the popularity of the Internet, along with the Hypertext Transfer Protocol (HTTP), is the HyperText Markup Language (HTML), the most popular content format sent over HTTP. HTML is a markup language because it allows you to add meaning to a document by inserting text delimiters to mark certain locations. You already saw a snippet of HTML when you studied the Extensible Markup Language (XML).

Excerpt from an HTML document discussing XML.
…
<p>To process <abbr title="Extensible Markup Language">XML</abbr> a computer must <dfn>parse</dfn>
the XML document, but still the computer may not know what the resulting tags <em>mean</em>
unless it is familiar with the XML vocabulary being used.</p>
…

History

Tim Berners-Lee, generally considered to be the inventor of the World Wide Web (WWW), created HTML in the 1990s as a specialized format for linking documents via hyperlinks. HTML sent over HTTP is the basis for the web that still exists today.

HTML evolved from the Standard Generalized Markup Language (SGML) from the 1980s. Originally HTML was an application of SGML—a restricted syntax with a set vocabulary. The most recent versions of HTML have abandoned full SGML compliance, however.

HTML 3

The browser wars then began, in which companies such as Microsoft and Netscape tried to gain control of the browser market. They introduced extension tags that were only understood by their own web browsers, Microsoft Internet Explorer and Netscape Navigator. This caused many inconsistencies and incompatibilities in the way web pages appeared across browsers.

HTML 4

By the end of the 1990s HTML had advanced to version 4. By then the HTML specification was being maintained and improved by the World Wide Web Consortium (W3C), headed by Tim Berners-Lee. Around 2000 the W3C released HTML 4.01, which was to remain unchanged for many years, becoming the most common form of HTML. In the meantime XML had become very popular, so the W3C decided to stop work on HTML as an unrelated format, and produce later versions of HTML based on XML.

HTML5

Many thought that the W3C was moving too slowly in creating new markup languages, and that the languages they were creating were too large and complicated. For this reason Apple, Mozilla Foundation, and Opera Software formed the Web Hypertext Application Technology Working Group (WHATWG) to more quickly develop an updated simple version of HTML. This effort proved popular, and eventually the W3C reconsidered its abandonment of HTML as a specification independent of XML.

In 2014 the W3C released the official specification of what is now known as HTML5, based on the WHATWG's work. The WHATWG continues to improve HTML5 as a living standard, meaning that it is constantly being updated and improved. The W3C intends to periodically release new updates of HTML5 as snapshots, such as HTML 5.1.

Content

The general syntax of HTML is very similar to that of XML, reflecting HTML's shared evolution from SGML.

Comments

A comment in HTML, as in XML, takes the form <!-- … --> and can span multiple lines. Comment text must not contain a sequence of two hyphen -- characters.
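
As a brief illustration of the comment syntax, the following sketch shows a single-line comment and a comment spanning several lines. The comment and paragraph text are made up for this example.

```html
<!-- A comment on a single line. -->
<p>This paragraph is displayed; the comments are not.</p>
<!--
  A comment may span several lines,
  as long as its text avoids two hyphens in a row.
-->
```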

Elements

Just like in XML, the primary structure for delimiting information is an element, which can consist of a start-tag and an end-tag. While XML simply provides a general syntax for tags, HTML actually provides an element vocabulary, a set of tag names that mean certain things. For example the <p> element indicates that the marked up character data between the tags is a normal paragraph of text, as opposed to a figure or a caption.

HTML elements with no end-tags.
…
<p>Here is a picture of a car.
<p><img src="car.jpg" alt="A car.">
<p>Another type of vehicle is a truck.
…

The biggest difference from XML is that in HTML the end-tag of an element is sometimes optional! For example, the figure shows a valid HTML fragment with no end-tags.
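
For comparison, here is a sketch of the same three paragraphs written with the end-tags spelled out, which is roughly how an HTML parser would interpret the shorter form. The <img> element never takes an end-tag, so none is added for it.

```html
<!-- Each start of a new <p> implicitly ends the previous one;
     here the </p> end-tags are written explicitly instead. -->
<p>Here is a picture of a car.</p>
<p><img src="car.jpg" alt="A car."></p>
<p>Another type of vehicle is a truck.</p>
```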

HTML5 does not provide a DTD. It does however provide a sort of schema by placing elements in different categories. Elements of certain categories should only contain certain other categories, as you will see throughout this lesson. This is called a content model. Some elements may fall into several categories. See HTML 5.1 § 3.2.4.2. Kinds of content.

Attributes

Unlike XML, HTML does not require attribute values to be quoted, and in certain contexts an attribute can even be present without a value. See HTML 5.1 § 8.1.2.3. Attributes. Here are the ways you might see an attribute in HTML:

<html lang="en-US">
The attribute is quoted, just like in XML. Single quotes are also allowed, as in XML.
<html lang=en-US>
If an attribute value consists of a single word composed of a limited set of characters, it does not need to be quoted. It is still a good idea to quote the value for XML compatibility, as discussed in XHTML below.
<button disabled>
Some Boolean attributes allow the empty attribute form; their mere presence acts as a flag. An empty attribute is equivalent to setting the attribute to the empty string, as in <button disabled="">. If you want compatibility with XHTML, you can set the value to the same string as the attribute name, as in <button disabled="disabled">; see XHTML below.
Global Attributes

HTML provides several global attributes, so called because they can be placed on any element in the document. Here are some of the ones you will use frequently. See HTML 5.1 § 3.2.5. Global attributes.

id
Creates an identifier for the element, unique within the document. You can use the ID to create internal hyperlinks; see Links below.
title
Contains general advisory information about the element. Browsers often show the title in a tooltip when the mouse is over an element. Do not depend on browser behavior; don't make the title attribute the only source of essential information, because it may not be shown.
lang
Indicates the language of a section of the document, using a language tag as described in BCP 47. This is most commonly placed on the root <html> element to indicate the language of the entire document. See an example in Structure.
class
"Classifies" the element into one or more categories for applying style information. You will learn more about styles and stylesheets in an upcoming lesson.
style
Includes style information directly in the element. As you will learn in the upcoming lesson on styles, placing literal style definitions in an element's style attribute is usually not a good idea.
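
The global attributes listed above might be combined as in the following sketch. The id, class, and text values are hypothetical and chosen only for illustration.

```html
<!-- id names the element, class categorizes it for styling,
     lang gives its language, and title adds advisory information. -->
<p id="intro" class="summary" lang="en-US" title="Overview of the lesson">
  This lesson introduces HTML.
</p>

<!-- lang can override the document language for a single element. -->
<p lang="fr">Bonjour le monde!</p>

<!-- An inline style works, but a stylesheet is usually the better place for it. -->
<p style="color: gray">This paragraph is styled directly.</p>
```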

Formatting

TODO

TODO nbsp

Document Type Declaration

HTML, like its cousin format XML, can provide a document type declaration (or doctype) at the top of the document. Originally the doctype was intended to indicate to a web browser which version of HTML the document used, just as XML uses the doctype to indicate which XML DTD is being used as the document schema. Eventually the doctype served several functions:

version
The version of HTML.
vocabulary
Whether the document included only HTML or included additional vocabularies such as Scalable Vector Graphics (SVG).
transitional
Whether the document adhered to the latest HTML recommendations, or whether it included older, deprecated elements that were being phased out.
quirks
Whether the browser should display the content as the standards prescribe, or whether it should exhibit old, erroneous behavior for backwards compatibility.
Recent popular HTML doctypes.
HTML 4.01 Strict
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
HTML 4.01 Transitional
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
HTML 5
<!DOCTYPE html>

Structure

HTML skeleton document.
<!DOCTYPE html>
<html lang="en-US">
<head>
  <meta charset="UTF-8"/>
  <title>HTML Skeleton</title>
  <meta name="…" content="…"/>
</head>
<body>
  …
</body>
</html>

HTML documents require an <html> root element, containing a <head> and a <body>. The W3C recommends adding a lang attribute to the root element, using a language tag to indicate the language of the document, as shown in the figure.

Metadata

Inside the <head> element are various elements that provide metadata, data that is not document content but instead information about the content data. The most important of these is the <title> element, which provides a title for the document. Other metadata can be provided in general <meta> elements using the name and content attributes to provide name/value pairs, such as <meta name="author" content="Jane Doe"/>. Document metadata is important because it helps categorize and search for content.

The <meta> element with a charset attribute is special; if an HTML file is encoded in any character set compatible with ASCII, the HTML parser can determine the actual charset when it reaches this <meta> element. For this reason HTML documents encoded in UTF-8 should always include a charset <meta> element as early as possible in the document (before the <title> element, for example). HTML files using XML should declare the charset in the XML declaration, and web servers should indicate the charset in the response Content-Type header as well. See HTML 5.1 § 4.2.5.5. Specifying the document’s character encoding.

Sectioning Content

The main content of the document goes inside the <body> element, and in the early days of the web, the <body> element was all that was needed. HTML content is not as simple as it used to be, and some documents now contain entire books or more. HTML5 introduced a set of sectioning content elements that allow documents to organize complex material within the <body> element. These are optional but highly useful in dividing the document into sections. See HTML 5.1 § 3.2.4.2.3. Sectioning content.

Examples of sections.
…
<body>
  <header><h1>Newspaper</h1></header>
  <section>
    <h1>Sports</h1>
    <h2>Regional Games</h2>
    <article>
      <h1>Last Night's Win</h1>
      …
      <aside>…</aside>
      …
<article>
A self-contained work such as an article or blog post, for example one of many stories appearing on a news site.
<aside>
Information that is related to the main content but tangential, such as the tips and warning boxes appearing in this lesson.
<nav>
A section set apart just for providing navigation links within the document or to other documents.
<section>
A general section of content, such as inside an <article>.

Sections are optional; for simple documents you can simply place your content directly inside <body> as traditionally done.
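
To show how the sectioning elements described above might fit together in a complete <body>, here is a minimal sketch. The section names, id values, and text are hypothetical.

```html
<body>
  <!-- Navigation links to the sections of this same document. -->
  <nav>
    <a href="#sports">Sports</a>
    <a href="#weather">Weather</a>
  </nav>
  <section id="sports">
    <h1>Sports</h1>
    <!-- A self-contained story within the section. -->
    <article>
      <h1>Last Night's Win</h1>
      <p>The home team won.</p>
      <!-- Related but tangential information. -->
      <aside><p>Tickets for the next game go on sale Monday.</p></aside>
    </article>
  </section>
  <section id="weather">
    <h1>Weather</h1>
    <p>Sunny.</p>
  </section>
</body>
```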

Headings

Within a section or directly inside the <body> you may use one of the heading content elements, <h1>, <h2>, <h3>, <h4>, <h5>, and <h6>, to provide a sort of title that appears above a portion of content. The headings are hierarchical: the first heading <h1> represents a top-level heading, while <h2> represents a subordinate heading, and so on down to <h6>. See HTML 5.1 § 3.2.4.2.4. Heading content.
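
The heading hierarchy can be sketched as follows, with each lower-level heading subordinate to the nearest higher-level heading before it. The topics are made up for this example.

```html
<h1>Vehicles</h1>
<p>Vehicles can be classified in several ways.</p>
<h2>Cars</h2>
<h3>Sedans</h3>
<p>A sedan is a subtopic of cars.</p>
<h3>Coupes</h3>
<p>So is a coupe.</p>
<!-- This h2 ends the "Cars" portion and starts a sibling topic. -->
<h2>Trucks</h2>
<p>Trucks are at the same level as cars.</p>
```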

TODO discuss sections, outlines, and headings in the HTML5 era; mention multiple <h1>s

Headers and Footers

If you want to group information at the top or bottom of a section, use the <header> or <footer> element, respectively.
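
As a small sketch, a <header> and <footer> might be used inside an <article> like this. The byline and footer text are placeholders rather than content from any real page.

```html
<article>
  <!-- Introductory information grouped at the top of the article. -->
  <header>
    <h1>Last Night's Win</h1>
    <p>By a hypothetical staff writer</p>
  </header>
  <p>The home team won.</p>
  <!-- Closing information grouped at the bottom of the article. -->
  <footer>
    <p><small>Copyright notice goes here.</small></p>
  </footer>
</article>
```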

Flow Content

HTML5 content Venn diagram (HTML 5.1)

Most of the elements in HTML, the ones traditionally used to mark up text, are those in the flow content category. They include the heading and sectioning elements you've already seen, along with some metadata content. See HTML 5.1 § 3.2.4.2.2. Flow content.

Paragraphs

Content within the <body> or within one of the section elements is grouped into paragraphs. The obvious element for defining a paragraph is the <p> element, but HTML's definition of paragraph is broader than this. Runs of text inside a section that are not enclosed in any element implicitly form paragraphs as well.

In the following example, the group of sentences starting with Vehicles can be classified … and ending with … wheels they have is considered to be a type of paragraph (though not a <p> paragraph) even though they have no surrounding tags. HTML's idea of a paragraph then is closer to what you might think of as a block of text. To avoid confusion, this lesson will use the term block to refer to HTML's general idea of paragraph, while reserving the term paragraph to refer to blocks marked by the <p> element.

Implicit paragraph or block.
…
<body>
  <p>The term “vehicle” is a broad concept. In a program,
  classes can be used to represent vehicles.</p>
  Vehicles can be classified in several ways. One is by the
  number of wheels they have.
  <p>A vehicle that has a single wheel is called a <dfn>unicycle</dfn>.</p>
</body>
…

Groups

HTML technically has no grouping content category in its content model, but there are several elements that are made for grouping content. Most of these elements may include other elements. Some of them, such as the <p> element which you've already seen, are primarily made to create text blocks. Others may even contain other grouping elements. See HTML 5.1 § 4.4. Grouping content.

<blockquote>
A longer quotation from some source. You can use the cite attribute to indicate the URL of the source, such as a blog article.
<div>
A purely grouping element that has no meaning in itself. You can use a <div> as an element on which to add attributes, such as grouping paragraphs written in Hindi using <div lang="hi-IN">…</div>.
<figure>
A way to group self-contained content, such as text, images, and source code. Unlike the content of an <aside>, which is tangential, content inside <figure> is still an essential part of the main flow. You can specify a caption for a <figure> by placing a <figcaption>, another grouping element, inside it. See the section on Images for a full example.
<hr>
A thematic break in the content at the same level as a paragraph. This is often used in novels when there is a break in thought or in time in the narrative. An <hr> element is not meant to contain other elements. The name hr originally stood for horizontal rule, and many browsers still render this element as a horizontal line by default.
<main>
The main content of the document, if there is a need to distinguish it from other parts such as navigation, logos, and copyright information.
<p>
A paragraph of text and other content.
<pre>
Contains text that is preformatted; that is, the text spacing and arrangement has already been determined, such as computer output or ASCII art. A <pre> element is often used to surround a <code> element for computer source code. Be careful with line breaks; because whitespace is significant inside <pre>, you must indent subsequent lines to match the source, not the surrounding HTML content.
Grouping content.
<p>The lesson about indirection mentioned a poem by Lydia Maria Child:</p>
<!-- A <blockquote> is used because only part of the entire poem is being quoted. -->
<blockquote cite="https://www.poetryfoundation.org/poems/43942/the-new-england-boys-song-about-thanksgiving-day">
  <pre>Over the river, and through the wood,
    To grandfather's house we go;
      The horse knows the way,
      To carry the sleigh,
    Through the white and drifted snow. …</pre>
</blockquote>
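
Some of the other grouping elements described above could be combined as in this sketch, which uses <main>, <hr>, and a <div> carrying a lang attribute. The narrative text and the short Hindi phrases are invented for illustration.

```html
<main>
  <p>The travelers reached the river at dusk.</p>
  <!-- A thematic break: the narrative jumps ahead in time. -->
  <hr>
  <p>The next morning the wood was covered in snow.</p>
  <!-- The div has no meaning of its own; it only carries the lang attribute. -->
  <div lang="hi-IN">
    <p>नमस्ते।</p>
    <p>धन्यवाद।</p>
  </div>
</main>
```
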
Lists

There are several types of HTML lists, which are grouping elements that indicate a sequence of items. Most often used are the unordered list <ul> and the ordered list <ol>. Inside either of these elements, each item in the list must be placed inside a list item <li>. Make sure to choose the correct type of list, based upon whether your content is naturally in some order, such as steps to be performed, or in no required order, such as a list of the primary colors. Normally ordered lists are shown with numbers, while unordered lists are shown with bullet points.

Ordered list
<ol>
  <li>Cross the river.</li>
  <li>Go through the woods.</li>
  <li>Arrive at grandfather's house!</li>
</ol>
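
For contrast with the ordered list, an unordered list might look like the following sketch; the items (the traditional primary colors of paint) could be rearranged without changing the meaning.

```html
<!-- No item is "first" in any meaningful sense, so <ul> is the right choice. -->
<ul>
  <li>Red</li>
  <li>Yellow</li>
  <li>Blue</li>
</ul>
```
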
Description list with hostname IP address mappings.
<dl>
  <dt>localhost</dt>
  <dd>127.0.0.1</dd>
  <dt>www.myserver.com</dt>
  <dd>1.2.3.4</dd>
  <dt>www.example.com</dt>
  <dd>198.51.100.27</dd>
</dl>

HTML also offers a description list <dl> element for marking up a list of items and their associated descriptions. The items need not be actual definitions as in a dictionary, but may contain any content that is associated with other content, such as teams and their rankings. The term or the thing being described is placed in a <dt> element, followed by its description in a <dd> element. If you do intend to use a description list to hold dictionary-like definitions, you may additionally indicate that each term is being defined by using <dt><dfn>foobar</dfn></dt>, as explained below.
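
A dictionary-like description list using <dfn> might be sketched as follows, reusing two terms that appear elsewhere in this lesson.

```html
<dl>
  <!-- Wrapping the term in <dfn> marks this as the place where it is defined. -->
  <dt><dfn>unicycle</dfn></dt>
  <dd>A vehicle that has a single wheel.</dd>
  <dt><dfn>quarterly</dfn></dt>
  <dd>A magazine published four times a year.</dd>
</dl>
```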

Phrasing Content

The elements in the phrasing content category help mark up the content inside a block of text. They appear inline and do not create new blocks of text, although the <br> element will create a line break within the current block. TODO move note on <br> to aside The following are some of the most useful phrasing content elements.

| Element | Description | Example | Example Rendering |
|---|---|---|---|
| <abbr> | Indicates that something is an abbreviation or an acronym. You may use the optional title attribute to indicate the non-abbreviated form. | <abbr title="Java API for RESTful Web Services">JAX-RS</abbr> | JAX-RS |
| <cite> | Represents the reference to some creative work such as a book or magazine, indicating the title and/or author. This element is commonly rendered in italics. | <cite>A Tale of Two Cities</cite> | A Tale of Two Cities |
| <del> | Indicates text that has been removed from the document, such as during editing. This element is commonly rendered in strikeout. See <ins>. | The book is on the <del>the</del> table. | The book is on the the table. |
| <dfn> | Indicates a reference to a new word where it is being defined. This element is commonly rendered in italics. | A magazine published four times a year is sometimes called a <dfn>quarterly</dfn>. | A magazine published four times a year is sometimes called a quarterly. |
| <em> | Places emphasis on the contents. This element is commonly rendered in italics. See also <strong>. | The <dfn>penultimate</dfn> is the <em>second</em> to last. | The penultimate is the second to last. |
| <ins> | Indicates text that has been added to the document. This element is commonly rendered in underline. See <del>. | The book is on <ins>the</ins> table. | The book is on the table. |
| <q> | Indicates text quoted from some other source. You may use the optional cite attribute to indicate the URL of the source of the quote. This element usually causes quotation marks to be shown around the content. The <q> element is useful, because a browser will usually show quotation marks appropriate for the content language. Don't use <q> if not referring to an actual quotation, such as in a sarcastic reference. | Donald Knuth said that so-called "premature optimization" is <q>the root of all evil</q>. | Donald Knuth said that so-called "premature optimization" is “the root of all evil”. |
| <s> | Indicates text that is no longer accurate or relevant. This element is commonly rendered in strikeout. The <s> element originally referred to general strikeout rendering, but should now only be used if semantically appropriate. If the content has actually been removed from the document as an edit, use <del> instead. | Final closeout sale: <s>Two</s> Three for the price of one! | Final closeout sale: Two Three for the price of one! |
| <small> | Contains text sometimes referred to as "small print" such as disclaimers and legal restrictions. The <small> element originally was a way to show any text in a smaller font, but now should only be used if the meaning is appropriate. | <small>Offer not valid in all areas.</small> | Offer not valid in all areas. |
| <span> | A general element for grouping phrases that has no meaning in itself. A <span> is useful for adding a class attribute to a section of text, if appropriate styles have been defined. Similar to the <div> element, the <span> element should not be overused; try to find another element that is more semantically appropriate, as explained under Semantics. | | |
| <strong> | Places strong emphasis on the contents to indicate importance or urgency. This element is commonly rendered in bold. See also <em>. | If an electrical outlet is placed near a sink, <strong>it must be protected by GFCI</strong>. | If an electrical outlet is placed near a sink, it must be protected by GFCI. |
| <sub> | Represents a subscript. See also <sup>. | The binary logarithm is log<sub>2</sub>. | The binary logarithm is log2. |
| <sup> | Represents a superscript. See also <sub>. | Go to the 4<sup>th</sup> floor. | Go to the 4th floor. |
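
The table gives no example for <span>, and <br> is only mentioned in passing, so here is a small sketch of both. The class name recipient and the address are hypothetical; the class has an effect only if a stylesheet defines styles for it.

```html
<p>
  Send returns to:<br>
  <!-- The span adds no meaning; it only provides a hook for styling. -->
  <span class="recipient">Example Company</span><br>
  123 Example Street
</p>
```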

Links

Hyperlinks in HTML are created using the <a> element. The “hyperlink reference” attribute href indicates the location to load when the link is activated or clicked. Most commonly the href value is a URL to another HTML page. If it is a relative reference, it will be resolved in the context of the URL of the currently loaded document.

The href URL may contain a fragment identifier, which is marked by the number sign # character. This works for relative references, and may even be used alone to reference a location within the same document. When a fragment identifier is given, it indicates the id attribute of the destination element; if there is no matching id attribute value, the first matching name attribute will be used. The following example shows both an external and an internal link. The Images section contains more examples of links.

Hyperlink.
<p>You can find <a href="#more">more information</a> below.</p>
…
<h3 id="more">More Information</h3>
<p>More information can be found
  on the <a href="https://www.example.com/">example page</a>.</p>

Here are a few useful attributes for <a>. See HTML 5.1 § 4.5.1. The a element for more details.

href
Identifies the destination of the link. Can be a URL or a relative reference, which will be resolved against the URL of the HTML document. If possible you should use a relative reference in href to provide flexibility deploying your web site or document.
rel
The "relationship" of the referenced document to the current one: one or more space-separated tokens such as help or license. See HTML 5.1 § 4.8.6. Link types for the allowed values HTML defines, and microformats: existing rel values for more registered extension link types.
target
Indicates the browsing context (usually a browser window) to use when navigating the link. The most commonly used target value is _blank, which typically causes the browser to open the link in a new window or tab. Use this capability sparingly; most of the time you do not want to force a new window to open when the user navigates a link.
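
The attributes above might be used together as in the following sketch. The file name help.html and the license URL are placeholders, not real resources.

```html
<!-- A relative reference, resolved against the URL of the current document. -->
<p>See the <a href="help.html" rel="help">help page</a> for assistance.</p>

<!-- An absolute URL; target="_blank" typically opens it in a new window or tab. -->
<p>Read the
  <a href="https://www.example.com/license" rel="license" target="_blank">license</a>
  in a separate tab.</p>
```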

TODO: Mention different types of links such as mailto:

Media

HTML has included support for images since early on, but only in HTML5 has support for audio and video become available without the need for browser plugins.

Images

Images are embedded in an HTML document using the <img> element. Similar to a link's href attribute, <img> specifies the image to embed using the src attribute. The alt attribute allows you to provide a short text description as “alternate text” to display in case the image cannot be loaded, or for users with visual disabilities. If possible you should use a relative reference in src to provide flexibility deploying your web site or document. To provide accessibility to the largest number of users, providing an alt attribute is required in almost all circumstances.

<figure>
  <a href="https://upload.wikimedia.org/wikipedia/commons/e/eb/Lafd_ladder_truck.jpg">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/eb/Lafd_ladder_truck.jpg/800px-Lafd_ladder_truck.jpg"
        alt="Tiller truck fire engine." />
  </a>
  <figcaption>
    Tiller or “hook-and-ladder” truck fire engine.
    (<a href="https://commons.wikimedia.org/wiki/File:Lafd_ladder_truck.jpg">Wikimedia Commons</a>)
  </figcaption>
</figure>
Example of a small image, linked to a larger version of the same image, embedded in a figure with a caption.
Audio

TODO

Video

TODO

TODO audio, video

Code

A common need especially for developers is to represent information that is input to or output from a computer. The most commonly used element for representing such information is <code>, usually rendered by browsers in some sort of monospace type. But there are other related elements that represent shades of semantics for describing computer-related information.

<code>
Represents a portion of computer code. This includes source code, a file name, a section of JSON, or even a keyword. You may indicate the computer language the code represents by indicating it, along with a language- prefix, in the class attribute, e.g. class="language-java"; this provides additional information and may allow a syntax highlighter to better format the code.
<kbd>
Represents user input to a computer, such as keyboard commands to enter. This includes not only text but also voice commands, menu items, or keystrokes.
<samp>
Represents the output of a computer program.

When showing blocks of code, you should wrap the entire block in a <pre> element to indicate that the line breaks are predetermined.

Example Code Block Sample Rendering
<pre><code class="language-java">package com.example;

public class HelloWorld {

  public static void main(String[] args) {
    System.out.println("Hello, World!");
  }

}</code></pre>
package com.example;

public class HelloWorld {

  public static void main(String[] args) {
    System.out.println("Hello, World!");
  }

}
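
The related <kbd> and <samp> elements might be combined with <code> as in this sketch. It assumes the HelloWorld class shown above has already been compiled and is on the class path.

```html
<!-- <code> for a file name, <kbd> for what the user types,
     and <samp> for what the program prints. -->
<p>After compiling <code>HelloWorld.java</code>, type
  <kbd>java com.example.HelloWorld</kbd> at the command line.
  The program prints <samp>Hello, World!</samp> to the console.</p>
```
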
Data

TODO <data>, <time>, etc.

TODO mention data-* and how history repeats itself regarding XML namespace prefixes

Tables

HTML has long had the ability to present tabular data, that is, information arranged in rows and columns. The <table> element is one of the most useful but also most abused parts of HTML: its purpose is to present data in cells, not to control the layout of a page.

An HTML table can be divided into an optional header, an optional footer, and the main table body. Each section comprises a series of rows, and each row is composed of several cells. There is no element for a “column” as such—a column simply comprises the cells in each row that are in the same position.

The following are the fundamental table elements. See HTML 5.1 § 4.9. Tabular data for more details.

<table>
  <caption>Command-Line Options</caption>
  <thead>
    <tr>
      <th>Option</th>
      <th>Alias</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>list</code></td>
      <td></td>
      <td>Lists all available widgets.</td>
    </tr>
    <tr>
      <td><code>--help</code></td>
      <td><code>-h</code></td>
      <td>Prints out a help summary.</td>
    </tr>
  </tbody>
</table>
Command-Line Options
Option Alias Description
list Lists all available widgets.
--help -h Prints out a help summary.
Example HTML table with sample rendering.
<table>
The outermost element of an HTML table.
<caption>
(optional) Contains a title for the table. If the <table> is the only content in a <figure>, the W3C recommends you use a <figcaption> for the entire figure instead.
<thead>
(optional) Groups a set of rows that represent the headers of the table. These rows may be repeated after a page break in the middle of the table when printing, for example. A user agent may allow the table body to scroll separately from the header.
<tbody>
Contains the rows that make up the main part of the table. The <tbody> element is technically optional (but still a good idea); the rows may be placed directly within the <table> element. Multiple <tbody> elements are allowed if you want to divide your table into sections.
<tfoot>
(optional) Groups a set of rows that represent the footers of the table.
<tr>
Represents a row of information, and contains the cells to appear in each column.
<td>
Contains a single cell of information within a row.
<th>
Used in place of <td> to represent a header cell, such as at the beginning of a row. Normally header cells that represent column headers are placed in a table header <thead> element.

Normally in each row there appears a single <td> or <th> in the position of each column, containing that column's cell contents (even if empty). HTML allows you to “merge” a cell with several others after and/or below it by specifying column and/or row span attributes on the cell element. If a span greater than one is given, then no additional <td> or <th> elements are provided for the merged cells.

colspan
(optional; defaults to 1) Indicates the number of columns the cell should take up.
rowspan
(optional; defaults to 1) Indicates the number of rows the cell should take up.
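
To illustrate spans, here is a sketch of a three-column table in which some cells are merged. The caption and figures are made up. Note that each row still accounts for exactly three columns once the spans are counted.

```html
<table>
  <caption>Quarterly Sales</caption>
  <thead>
    <tr>
      <!-- "Region" occupies this row and the next, so the second
           header row provides no cell for the first column. -->
      <th rowspan="2">Region</th>
      <!-- "First Half" stretches across the two quarter columns. -->
      <th colspan="2">First Half</th>
    </tr>
    <tr>
      <th>Q1</th>
      <th>Q2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>North</td>
      <td>10</td>
      <td>12</td>
    </tr>
  </tbody>
  <tfoot>
    <tr>
      <td>Total</td>
      <!-- One merged cell replaces the two quarter cells. -->
      <td colspan="2">22</td>
    </tr>
  </tfoot>
</table>
```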

Semantics

When HTML was created, although some of its elements indicated the purpose of the content (such as a <p> to indicate a paragraph), other elements specified how the content should appear to the user. One of the most notorious examples of such presentation-oriented elements was the <font> element, which specified a specific type, size, and color of text. This caused numerous problems. Besides the fact that the specified font was often unavailable in the user's browser, the element did nothing to indicate why the text should be presented in a different way. A screen reader used by someone with visual limitations (see Accessibility below) would be at a loss to convey the significance of the indicated style.

Modern web design stresses using elements that indicate the semantics or the meaning of the content. Rather than indicating that text should be in italics, for example, the <em> element should be used to indicate that the text is emphasized. The browser may indeed show the text in italics, depending on the styles in effect. But the <em> element indicates why the text is italicized (as emphasis rather than a definition, for example). This allows accessibility technology to more appropriately derive the meaning of the document. It allows authors to tweak styles more easily and consistently. And it allows computers to better search, process, and transform documents if they are semantically rich.
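
The difference can be sketched by marking up the same sentence both ways. The sentence is invented for this example, and the first form is shown only as something to avoid.

```html
<!-- Presentational markup: says how the text looks, not why.
     The <font> element is obsolete in HTML5. -->
<p>Do <font color="red">not</font> unplug the server.</p>

<!-- Semantic markup: says that the word is important;
     a stylesheet can then decide how importance should look. -->
<p>Do <strong>not</strong> unplug the server.</p>
```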

Accessibility

TODO

XHTML

During the years of HTML stagnation before the introduction of HTML5, the W3C concentrated on reformulating HTML in terms of XML, which was wildly popular at the time. The W3C created several specifications it referred to as XHTML, using the media type application/xhtml+xml. The first version, XHTML 1.0, included several DTDs for various combinations of HTML and SVG, some including elements for backwards compatibility. The second version, XHTML 1.1, attempted to modularize the DTDs and added XML Schema definitions. The third version was to be XHTML 2.0, a reformulated XHTML abandoning backwards compatibility and integrating several new XML vocabularies; this effort was eventually abandoned.

In theory XML brings several benefits, including simpler parsing via a more predictable tree structure, along with the ability to intersperse elements from other namespaces. Unfortunately the W3C's formulation of XHTML encountered several problems:

HTML5 XHTML

Because of the benefits of a predictable syntax, XHTML has been revived in HTML5, but not with a complicated set of DTDs or XML schemas. Rather the W3C simply allows HTML5 to be stored in either of two syntaxes: the “HTML syntax”, which has been discussed throughout this lesson; and the “XHTML syntax”, which in large part simply means that the document follows the well-formedness rules of XML. See HTML 5.1 § 1.6. HTML vs XHTML.

XML Declaration

An HTML5 document using the XML syntax may include an XML declaration, but this is incompatible with the HTML syntax.
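
A document in the XML syntax that begins with an XML declaration might look like the following sketch. Because of the declaration, it is not a polyglot document and is meant to be processed only as XML; the title and text are placeholders.

```html
<?xml version="1.0" encoding="UTF-8"?>
<!-- The XML declaration above states the encoding for XML parsers. -->
<!DOCTYPE html>
<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>XHTML Syntax Only</title>
</head>
<body>
  <p>This document is meant to be served as application/xhtml+xml.</p>
</body>
</html>
```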

Media Type

Documents using the XML syntax must be transmitted using the application/xhtml+xml media type rather than text/xml. The media type is the primary indicator of which syntax an HTML5 document uses. When loading an HTML5 document from a file system, a browser may infer the XHTML syntax by the use of an xhtml filename extension and/or the presence of an XML declaration.

Namespaces
(X)HTML polyglot skeleton document.
<!DOCTYPE html>
<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta charset="UTF-8"/>
  <title>(X)HTML Polyglot Skeleton</title>
</head>
<body>
  …
</body>
</html>

In the XML syntax the HTML namespace http://www.w3.org/1999/xhtml must be declared. Because HTML does not recognize the colon : character in names as indicating a namespace prefix, for compatibility with the HTML syntax documents should declare the HTML namespace as the default namespace, as shown in the figure.

Character References
XML Predefined Entities
Entity Value
&amp; &
&lt; <
&gt; >
&apos; '
&quot; "

Both the HTML and XML syntaxes support the predefined entity references of XML, shown in the figure. In addition HTML supports over 2,000 named character references, including letters such as &Aacute; for Á (U+00C1), symbols such as &copy; for © (U+00A9), and even icon characters such as &phone; for ☎ (U+260E). See HTML 5.1 § 8.5. Named character references.

The rules of XML, however, indicate that all entities other than the predefined entities must be defined in an internal or external XML DTD. If you try to parse an HTML5 document using the XML syntax, or you try to open an HTML5 document as XML in a browser, the XML parser will refuse to load the document if it encounters one of the HTML named character references.
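
One way to sidestep this problem, sketched below, is to use numeric character references, which an XML parser accepts without any DTD and which the HTML syntax accepts as well. The paragraph text is made up for illustration.

```html
<!-- HTML syntax only: named references that XML does not predefine. -->
<p>&Aacute;vila &copy; &phone;</p>

<!-- Works in both syntaxes: numeric references for the same characters. -->
<p>&#xC1;vila &#xA9; &#x260E;</p>

<!-- Also works in both syntaxes: the predefined entities. -->
<p>1 &lt; 2 &amp;&amp; 3 &gt; 2</p>
```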

Servers

TODO review information about how to deploy to Tomcat; mention index.html; discuss content type mapping

Review

Summary

TODO

Gotchas

In the Real World

Think About It

Self Evaluation

Task

Create a simple web site to serve as the web user interface for Booker. The site will not yet actually list any books.

See Also

References

Resources

Acknowledgments