XML

Goals

Concepts

Library

Preview

Example XML storing address information.
<?xml version="1.0" encoding="UTF-8"?>
<address type="work" since="2016">
  <street>123 Main Street</street>
  <city>Anytown</city>
  <state>OK</state>
  <postalCode>00000</postalCode>
  <country>USA</country>
</address>

Lesson

You have now had first-hand experience of how tricky it can be to work directly with bytes in files. Even when using Java's serialization classes, the resulting file format is inflexible. To humans, a file containing serialized classes appears to be little more than a mass of arbitrary byte values, unless they are trained in the specific format used by Java serialization.

Editing byte-based files by hand is even more difficult. The layout of the values cannot be changed; they rely on a specific order usually hard-coded into the program that reads and writes them. It is complicated to document these binary file formats and harder still to gain compliance across implementations.

For these reasons text file formats, which store information in human-readable files yet have a structure understandable by computers, have almost completely replaced binary formats for general configuration and sharing of small data sets. Their popularity stems from several benefits:

Text file formats usually specify a syntax for how information is arranged in the file. Usually the syntax will use certain characters as delimiters to separate one piece of information from another. Recognizing the delimiters and extracting the relevant information based on the syntax is called parsing.

XML

Historically the most widespread text file format, still in use and ubiquitously supported, is the Extensible Markup Language (XML), which is standardized and maintained by the World Wide Web Consortium (W3C). XML allows data to be separated into tags with identifier names. A parser will be able to extract the data and return a tree of nodes containing the data and the names given to each value.

Example XML file for storing information about a vehicle.
<car vin="123456789">
  <color>blue</color>
  <type>
    <make>Camaro</make>
    <model>Z28</model>
  </type>
</car>

XML Declaration

An XML document should begin with an XML declaration, which indicates the version of XML and optionally the charset of the document. A typical XML declaration looks like this:

Typical XML declaration.
<?xml version="1.0" encoding="UTF-8"?>

Document Type Declaration

After the XML declaration, an XML document may have a document type declaration identifying the schema prescribing the vocabulary and other constraints of some type of XML document. Originally such a grammar was defined in a separate document type definition (DTD), although nowadays other types of definition files may be used.

A document type declaration is placed between <!DOCTYPE and > delimiters. The name after the opening delimiter characters would indicate the name of the outer element, as explained below. Traditionally the doctype would then indicate the public identifier used to identify the DTD across all documents. The doctype would also indicate a system identifier indicating from where the parser could load the DTD if it wanted to validate the document. Most of the time a doctype declaration can be merely copied and pasted from the specification for the XML format you are using.

For example here is a doctype indicating that an XML document contains Scalable Vector Graphics (SVG) markup for SVG 1.1:

Document type declaration for SVG 1.1.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg xmlns="http://www.w3.org/2000/svg">
  …
</svg>

Comments

A comment in XML takes the form <!-- … --> and can span multiple lines. Comment text must not contain a sequence of two hyphen -- characters.

Elements

The primary XML structure for delimiting information is an element, which consists of a start-tag and an end-tag. A start-tag contains an identifier enclosed in angle bracket < and > characters, such as <foo>. The end-tag contains the same identifier, prefixed with a forward slash / character, such as </foo>. The element content between the tags may be character data (text that is not markup), other elements, or both.

Example XML elements for storing address information.
<?xml version="1.0" encoding="UTF-8"?>
<address> <!-- start-tag for element containing mixed content -->
  <street>123 Main Street</street> <!-- element containing character content -->
  <city>Anytown</city>
  <state>OK</state>
  <postalCode>00000</postalCode>
  <country>USA</country>
</address> <!-- end-tag -->

Attributes

Each element start-tag can optionally have several name-value pairs called attributes. Each attribute name is separated from its value by the equals = sign. The value itself must be enclosed in paired quotation mark " characters, such as foo="bar", or alternatively in paired single quote ' characters, such as foo='bar'.

XML element for address information, with attributes.
<?xml version="1.0" encoding="UTF-8"?>
<address type="work" since="2016">
  <street>123 Main Street</street>
  <city>Anytown</city>
  <state>OK</state>
  <postalCode>00000</postalCode>
  <country>USA</country>
</address>

Character References

A character reference is simply a way to include XML character data by indicating a character's Unicode code point. A character reference begins with the characters &# and ends with a semicolon ; character. The Unicode code point is provided either as a decimal value or, if preceded by an x, as a hexadecimal value. When parsed, a character reference produces the character it represents. In other words, including &#x092E; is no different, once the document is parsed, than simply having entered the Devanagari letter म (used in Hindi) from the beginning.
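
The equivalence described above can be checked with a short sketch. It relies on the DOM parser that is introduced later in this lesson; the class name CharacterReferenceDemo and the helper rootText(…) are hypothetical names chosen only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical demo class; the names here are illustrative only. */
final class CharacterReferenceDemo {

	/** Parses an XML string and returns the text content of its root element. */
	static String rootText(final String xml) {
		try {
			final Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
					.parse(new InputSource(new StringReader(xml)));
			return document.getDocumentElement().getTextContent();
		} catch(final ParserConfigurationException | SAXException | IOException exception) {
			throw new IllegalStateException(exception);
		}
	}

	public static void main(final String[] args) {
		final String viaHex = rootText("<letter>&#x092E;</letter>"); //hexadecimal reference
		final String viaDecimal = rootText("<letter>&#2350;</letter>"); //decimal reference (0x092E is 2350)
		final String literal = rootText("<letter>\u092E</letter>"); //the character itself
		//all three forms yield the same parsed character
		System.out.println(viaHex.equals(literal) && viaDecimal.equals(literal)); //prints "true"
	}
}
```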

Entity References

An entity reference looks similar to a character reference except that it is missing the number # sign. An entity reference stands for a character or a sequence of characters, identified by the entity reference name rather than a character's Unicode code point. For example the entity reference &lt; is equivalent to including the less than < character as XML character content.

Normally an entity must be declared in a DTD, indicating which character(s) it represents, before it can be referenced by name. However there are five predefined entities that are guaranteed to be recognized by every XML parser and thus may be used in any XML document, even if it has no DTD:

Predefined Entities
Entity Value
&amp; &
&lt; <
&gt; >
&apos; '
&quot; "
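
As a rough illustration of the predefined entities, the sketch below parses a small document that uses all five and prints the resulting values. The <rule> element, its test attribute, and the class name PredefinedEntityDemo are invented for this example; the parser API is the one covered later in the lesson.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical demo class; the names here are illustrative only. */
final class PredefinedEntityDemo {

	/** Parses an XML string and returns its root element. */
	static Element parseRoot(final String xml) {
		try {
			return DocumentBuilderFactory.newInstance().newDocumentBuilder()
					.parse(new InputSource(new StringReader(xml))).getDocumentElement();
		} catch(final ParserConfigurationException | SAXException | IOException exception) {
			throw new IllegalStateException(exception);
		}
	}

	public static void main(final String[] args) {
		//no DTD is present, yet the five predefined entities are still recognized
		final Element root = parseRoot(
				"<rule test=\"a &lt; b &amp;&amp; c &gt; d\">Say &quot;hi&quot; &amp; &apos;bye&apos;</rule>");
		System.out.println(root.getAttribute("test")); //prints: a < b && c > d
		System.out.println(root.getTextContent()); //prints: Say "hi" & 'bye'
	}
}
```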

XML Namespaces

The element and attribute names available to describe some data in XML are called a vocabulary, and at times it may be useful to use elements from multiple vocabularies in a single XML document. Because different vocabularies may use the same names for different meanings, XML provides a way to separate the vocabularies into different namespaces.

Each namespace is identified by a URI, a broader category of identifier that includes the URLs used when browsing the web. The easiest and most common approach for using namespaces is to set the default namespace for all the elements in the document by using the xmlns attribute on the root element. For example an XHTML document will indicate the default namespace http://www.w3.org/1999/xhtml on the root <html> element.

Declaring a default namespace in an XHTML5 document.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example XHTML Document</title></head>
  <body>
    …

A specific namespace can be explicitly assigned to certain elements and/or attributes by using a namespace prefix. First the prefix is defined using an attribute in the form xmlns:prefix="namespace-uri", where prefix is the prefix to define and namespace-uri is the URI identifying the namespace. Then to reference a namespace, use the prefix with the element or attribute name, separated by a colon : character.

The Maven POM, which you have been using since some of the earliest lessons, provides an example of XML namespaces. The Maven POM namespace itself is declared as the default using xmlns="http://maven.apache.org/POM/4.0.0". A separate namespace for XML Schema instances is associated with the xsi namespace prefix using xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance". Lastly the xsi prefix is applied to the schemaLocation attribute to specify the location of the schema for the POM vocabulary using xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd".

Declaring namespaces in a Maven POM.
<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  …
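
To see how these declarations surface in code, here is a minimal sketch that parses a shortened version of the POM with a namespace-aware parser (the parser API itself is covered later in this lesson). The class name NamespaceDemo and its members are hypothetical; the DOM methods used are part of org.w3c.dom.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical demo class; the names here are illustrative only. */
final class NamespaceDemo {

	static final String XSI_NAMESPACE = "http://www.w3.org/2001/XMLSchema-instance";

	/** A shortened POM containing only the root element and one child. */
	static final String XML = "<project xmlns=\"http://maven.apache.org/POM/4.0.0\""
			+ " xmlns:xsi=\"" + XSI_NAMESPACE + "\""
			+ " xsi:schemaLocation=\"http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd\">"
			+ "<modelVersion>4.0.0</modelVersion></project>";

	/** Parses an XML string with namespace support turned on and returns the root element. */
	static Element parseRoot(final String xml) {
		try {
			final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
			factory.setNamespaceAware(true); //without this, namespace URIs are not reported
			return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml))).getDocumentElement();
		} catch(final ParserConfigurationException | SAXException | IOException exception) {
			throw new IllegalStateException(exception);
		}
	}

	public static void main(final String[] args) {
		final Element project = parseRoot(XML);
		System.out.println(project.getNamespaceURI()); //the default namespace: http://maven.apache.org/POM/4.0.0
		System.out.println(project.getLocalName()); //project
		//attributes in a namespace are looked up by namespace URI and local name
		final Attr schemaLocation = project.getAttributeNodeNS(XSI_NAMESPACE, "schemaLocation");
		System.out.println(schemaLocation.getPrefix()); //xsi
		System.out.println(schemaLocation.getValue());
	}
}
```

Looking up the attribute by namespace URI rather than by the literal name xsi:schemaLocation means the code keeps working even if a document author chooses a different prefix for the same namespace.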

Parsing XML

DOM tree of an HTML document.

To read XML so that it can be processed you must use a parser. There are two broad types of parsers for XML. The Simple API for XML (SAX) defines an event-driven parser; rather than loading the entire file into memory at once, each time the parser encounters elements, attributes, and other markup it calls methods in a handler you specify. Event-driven parsers are very memory efficient, but you are forced to process the file sequentially in the order markup occurs in the document.

A tree-based parser on the other hand reads the entire XML document and creates a tree data structure in memory representing elements, attributes, other markup, and text content. You can then access various portions of the tree in any order as needed, searching for relevant information. The trees returned by most Java XML parsers adhere to the Document Object Model (DOM), a set of interfaces standardized by the W3C for navigating the parsed nodes in an XML tree.
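
For comparison with the tree-based approach used in the rest of this lesson, the event-driven style might be sketched as follows. The handler simply records the name of each element as its start-tag is encountered; the class name SaxDemo and the helper elementNames(…) are hypothetical, while SAXParserFactory, SAXParser, and DefaultHandler are standard JDK classes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class; the names here are illustrative only. */
final class SaxDemo {

	/** Returns the names of all elements in the order their start-tags occur. */
	static List<String> elementNames(final String xml) {
		final List<String> names = new ArrayList<>();
		final DefaultHandler handler = new DefaultHandler() {
			@Override
			public void startElement(final String uri, final String localName, final String qName,
					final Attributes attributes) {
				names.add(qName); //called by the parser each time it encounters a start-tag
			}
		};
		try {
			final SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
			saxParser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
		} catch(final ParserConfigurationException | SAXException | IOException exception) {
			throw new IllegalStateException(exception);
		}
		return names;
	}

	public static void main(final String[] args) {
		final String xml = "<car vin=\"123456789\"><color>blue</color>"
				+ "<type><make>Camaro</make><model>Z28</model></type></car>";
		System.out.println(elementNames(xml)); //prints [car, color, type, make, model]
	}
}
```

Note that the handler never sees the document as a whole; if it needs context, such as which element is the parent of the current one, it has to track that state itself.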

Here we will use a DOM-based parser for processing XML information. The parser is accessed using classes in the javax.xml.parsers package, and Java supports the DOM itself through the built-in org.w3c.dom package.

Getting a DOM Parser

The first step to parsing is retrieving an implementation of a DOM parser, referred to as javax.xml.parsers.DocumentBuilder, via a javax.xml.parsers.DocumentBuilderFactory. The factory retrieved by DocumentBuilderFactory.newInstance() can be configured to produce a validating parser and/or one that supports namespaces. Once the factory is appropriately configured, the parser may be retrieved using DocumentBuilderFactory.newDocumentBuilder().

Retrieving an XML DOM parser implementation.
final DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(false); //optional; only if needed
documentBuilderFactory.setValidating(false); //optional; only if needed
final DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();

Parsing the XML Document

After getting a parser, you must tell it to parse your document. DocumentBuilder has methods for parsing from various sources, including a File, a URI string, and an org.xml.sax.InputSource (which can wrap a Reader), but the most common method is DocumentBuilder.parse(InputStream is), which parses an XML document from an InputStream. All the DocumentBuilder.parse(…) methods return an instance of org.w3c.dom.Document, which represents the DOM of your XML.

Parsing an XML document from a file using a DocumentBuilder.
…
final Document document;
final Path path = Paths.get("/etc/foo/bar.xml");
try(final InputStream inputStream = new BufferedInputStream(Files.newInputStream(path))) {
  document = documentBuilder.parse(inputStream);
}
//no need to keep the input stream open after parsing the document
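
When the XML is already in memory as a String, for example in a unit test, one option is to wrap a java.io.StringReader in an org.xml.sax.InputSource and pass that to DocumentBuilder.parse(InputSource). The following is only a sketch; the class name StringParseDemo and its parse(…) helper are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical demo class; the names here are illustrative only. */
final class StringParseDemo {

	/** Parses XML held in a string rather than in a file. */
	static Document parse(final String xml) {
		try {
			final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
			//an InputSource can wrap a Reader, so no bytes or character encoding are involved
			return documentBuilder.parse(new InputSource(new StringReader(xml)));
		} catch(final ParserConfigurationException | SAXException | IOException exception) {
			throw new IllegalStateException(exception);
		}
	}

	public static void main(final String[] args) {
		final Document document = parse("<address type=\"work\"><city>Anytown</city></address>");
		System.out.println(document.getDocumentElement().getNodeName()); //prints "address"
	}
}
```

Be aware that DocumentBuilder.parse(String uri) treats its argument as the location of a document, not as the XML content itself.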

Traversing the DOM

Every node in the resulting tree produced by the XML parser is represented by an org.w3c.dom.Node. The specific type of node can be retrieved using Node.getNodeType(), which returns a short integer value indicating a node type; the value will be one of the Node defined constants such as Node.ELEMENT_NODE or Node.TEXT_NODE. The base Node type comes with many methods such as Node.getNodeName() that apply to most nodes, as well as methods such as Node.getAttributes() that apply only to certain node types. Although you could work with most nodes using the general Node methods, it is usually easier upon discovering the node type to cast the Node to the appropriate subtype such as org.w3c.dom.Element and access the specialized methods such as Element.getAttribute(String name) it provides.

Text within the DOM requires special care. Each sequence of non-markup text within an element is represented by an org.w3c.dom.Text node (of type Node.TEXT_NODE). This means that even if an element only appears to contain other elements, the end-of-line characters and indentation whitespace will be stored as Text nodes! Consider the following simple XML document:

Example XML document to traverse using the DOM.
<?xml version="1.0" encoding="UTF-8"?>
<foo>
  <bar>test</bar>
</foo>

Assuming the newline "\n" character was used to end each line, and that the tab "\t" character was used for indentation, when the document is parsed the <foo> element will contain three children: a Text node containing "\n\t", the <bar> Element node, and a Text node containing "\n".
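
The child nodes of such a document can be inspected with a small sketch like the one below, which labels each child of the root element by its node type. The class name ChildNodesDemo, the helper describeChildren(…), and the bracketed label format are all invented for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical demo class; the names here are illustrative only. */
final class ChildNodesDemo {

	/** Describes each child node of the root element, making whitespace visible. */
	static List<String> describeChildren(final String xml) {
		final Document document;
		try {
			document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
					.parse(new InputSource(new StringReader(xml)));
		} catch(final ParserConfigurationException | SAXException | IOException exception) {
			throw new IllegalStateException(exception);
		}
		final List<String> descriptions = new ArrayList<>();
		final NodeList childNodes = document.getDocumentElement().getChildNodes();
		for(int i = 0; i < childNodes.getLength(); ++i) {
			final Node childNode = childNodes.item(i);
			switch(childNode.getNodeType()) {
				case Node.ELEMENT_NODE:
					descriptions.add("Element[" + childNode.getNodeName() + "]");
					break;
				case Node.TEXT_NODE: //show newlines and tabs as visible escape sequences
					descriptions.add("Text[" + childNode.getNodeValue().replace("\n", "\\n").replace("\t", "\\t") + "]");
					break;
				default:
					descriptions.add("Other[" + childNode.getNodeName() + "]");
			}
		}
		return descriptions;
	}

	public static void main(final String[] args) {
		System.out.println(describeChildren("<foo>\n\t<bar>test</bar>\n</foo>"));
		//prints [Text[\n\t], Element[bar], Text[\n]]
	}
}
```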

The XML document itself is represented by an org.w3c.dom.Document node. The root element of the XML document structure, however, is retrieved by Document.getDocumentElement(). From there child nodes can be traversed using the org.w3c.dom.NodeList returned by Node.getChildNodes(). Unfortunately NodeList does not implement the standard java.util.List interface and thus must be iterated using NodeList.getLength() and NodeList.item(int index) as shown below.

Traversing a simple XML document using the DOM.
<?xml version="1.0" encoding="UTF-8"?>
<food>
  <vegetable color="green">lettuce</vegetable>
  <fruit color="red">apple</fruit>
  <vegetable color="red">tomato</vegetable>
</food>
…
final Element rootElement = document.getDocumentElement();
if(!rootElement.getNodeName().equals("food")) {
	throw new IOException("Unrecognized root element name.");
}
//print out the red food
final NodeList childNodes = rootElement.getChildNodes();
for(int i = 0; i < childNodes.getLength(); ++i) {
	final Node childNode = childNodes.item(i);
	if(childNode.getNodeType() == Node.ELEMENT_NODE) {
		final Element foodElement=(Element)childNode;
		if(foodElement.getAttribute("color").equals("red")) {
			final String foodType = foodElement.getNodeName();
			final String foodName = foodElement.getTextContent();
			System.out.println(String.format("%s (%s)", foodName, foodType));
		}
	}
}

Modifying the DOM

In addition to traversing the nodes of a DOM, you can also change the structure of the tree. The most useful methods are Element.setAttribute(String name, String value), which sets an attribute value for an element; and Node.appendChild(Node newChild), which adds a child node (such as an Element) to any Node (including an Element). The Document instance itself functions as a factory to produce new nodes, such as Document.createElement(String tagName) to create an Element node.
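
As a minimal sketch of these methods working together, the following adds a new element with an attribute and text content to an existing tree. The <food> vocabulary matches the traversal example earlier in the lesson, while the class name ModifyDomDemo and the helpers parse(…) and addFood(…) are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical demo class; the names here are illustrative only. */
final class ModifyDomDemo {

	/** Parses an XML string into a DOM document. */
	static Document parse(final String xml) {
		try {
			return DocumentBuilderFactory.newInstance().newDocumentBuilder()
					.parse(new InputSource(new StringReader(xml)));
		} catch(final ParserConfigurationException | SAXException | IOException exception) {
			throw new IllegalStateException(exception);
		}
	}

	/** Appends a new food element to the root element and returns the new element. */
	static Element addFood(final Document document, final String type, final String color, final String name) {
		final Element foodElement = document.createElement(type); //the document is the node factory
		foodElement.setAttribute("color", color);
		foodElement.appendChild(document.createTextNode(name));
		document.getDocumentElement().appendChild(foodElement);
		return foodElement;
	}

	public static void main(final String[] args) {
		final Document document = parse("<food><fruit color=\"red\">apple</fruit></food>");
		addFood(document, "vegetable", "green", "lettuce");
		final Element added = (Element)document.getDocumentElement().getLastChild();
		System.out.println(added.getNodeName() + " " + added.getAttribute("color") + " " + added.getTextContent());
		//prints "vegetable green lettuce"
	}
}
```

New nodes should be created by the same Document they will be added to, which is why addFood(…) takes the document as a parameter.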

Creating a DOM Instance

Not only can you change an existing DOM tree, you can create an entire DOM instance from scratch in memory. An org.w3c.dom.DOMImplementation, which represents the specific XML implementation configured in your JVM, acts as a factory to create new documents. You can retrieve an instance of DOMImplementation from the DocumentBuilder you retrieved above by using DocumentBuilder.getDOMImplementation().

Once you have a DOMImplementation, calling DOMImplementation.createDocument(String namespaceURI, String qualifiedName, DocumentType doctype) will create a new document, which you can then modify as described above. The qualified name indicates the name to use for the root element, which you can then retrieve using Document.getDocumentElement() as you would normally do when traversing the tree. You may provide null for both the namespace URI and the doctype if you want to create a simple document without namespaces or a doctype.

Creating a simple XML document <foo>bar</foo> using the DOM.
final DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
final DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
final DOMImplementation domImplementation = documentBuilder.getDOMImplementation();
final Document fooDocument = domImplementation.createDocument(null, "foo", null);  //<foo></foo>
final Text barText = fooDocument.createTextNode("bar");  //"bar"
fooDocument.getDocumentElement().appendChild(barText);  //<foo>bar</foo>

Generating XML

As with generating byte representations of Java objects, producing a text or byte representation of an XML DOM instance is referred to as serialization.

XML Serialization using Transformer

Java provides XML serialization capabilities in the javax.xml.transform package. Use a javax.xml.transform.TransformerFactory to retrieve a javax.xml.transform.Transformer. Use Transformer.setOutputProperty(String name, String value) as needed to configure the transformer, using javax.xml.transform.OutputKeys values such as OutputKeys.ENCODING. Finally use Transformer.transform(Source xmlSource, Result outputTarget) to serialize the XML represented by a javax.xml.transform.dom.DOMSource to a javax.xml.transform.stream.StreamResult.

Serializing a DOM instance using a Transformer.
final TransformerFactory transformerFactory = TransformerFactory.newInstance();
final Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.UTF_8.name());
//optional: transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
//optional: transformer.setOutputProperty(OutputKeys.INDENT, "yes");
//optional: transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
final StreamResult streamResult = new StreamResult(new StringWriter());  //TODO write to actual file
final DOMSource domSource = new DOMSource(document);
transformer.transform(domSource, streamResult);
System.out.println(streamResult.getWriter().toString());
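
To serialize to a file rather than to a StringWriter, a StreamResult can wrap an OutputStream instead. The sketch below writes the <foo>bar</foo> document to a temporary file; the class name TransformerFileDemo, its helper methods, and the use of a temporary file are assumptions made for the sake of a self-contained example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

/** Hypothetical demo class; the names here are illustrative only. */
final class TransformerFileDemo {

	/** Creates the simple document <foo>bar</foo> in memory. */
	static Document fooDocument() {
		try {
			final Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
					.getDOMImplementation().createDocument(null, "foo", null);
			document.getDocumentElement().appendChild(document.createTextNode("bar"));
			return document;
		} catch(final ParserConfigurationException exception) {
			throw new IllegalStateException(exception);
		}
	}

	/** Serializes a document as UTF-8 to the given output stream, which the caller must close. */
	static void serialize(final Document document, final OutputStream outputStream) {
		try {
			final Transformer transformer = TransformerFactory.newInstance().newTransformer();
			transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.UTF_8.name());
			transformer.transform(new DOMSource(document), new StreamResult(outputStream));
		} catch(final TransformerException exception) {
			throw new IllegalStateException(exception);
		}
	}

	public static void main(final String[] args) throws IOException {
		final Path path = Files.createTempFile("foo", ".xml");
		try(final OutputStream outputStream = Files.newOutputStream(path)) {
			serialize(fooDocument(), outputStream);
		}
		//read the file back to confirm what was written
		System.out.println(new String(Files.readAllBytes(path), StandardCharsets.UTF_8));
	}
}
```

Writing bytes through an OutputStream lets the transformer apply the configured encoding itself, so the encoding named in the XML declaration and the bytes in the file stay in agreement.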

XML Serialization using LSSerializer

As with parsing, the DOM Level 3 Load and Save Specification provides a pure DOM approach for serializing XML. First access an org.w3c.dom.bootstrap.DOMImplementationRegistry to retrieve an org.w3c.dom.ls.DOMImplementationLS, which works as a factory to create the actual org.w3c.dom.ls.LSSerializer. You will also need to create a special org.w3c.dom.ls.LSOutput object to represent the output stream or writer.

Serializing a DOM instance using an LSSerializer.
final DOMImplementationRegistry domImplementationRegistry = DOMImplementationRegistry.newInstance();
final DOMImplementationLS domImplementationLS = (DOMImplementationLS)domImplementationRegistry.getDOMImplementation("LS");
final LSSerializer lsSerializer = domImplementationLS.createLSSerializer();
//optional: lsSerializer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
final LSOutput lsOutput = domImplementationLS.createLSOutput();
lsOutput.setCharacterStream(new StringWriter());  //TODO write to actual file
lsOutput.setEncoding(StandardCharsets.UTF_8.name());
lsSerializer.write(document, lsOutput);
System.out.println(lsOutput.getCharacterStream().toString());
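
The same DOMImplementationLS factory can also create a parser, so a complete round trip is possible using only the Load and Save interfaces. The following is a rough sketch rather than a recommended setup; the class name LoadSaveDemo and its helper methods are hypothetical, and it assumes the JVM's registry provides an implementation supporting the "LS" feature, as the built-in one does.

```java
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSSerializer;

/** Hypothetical demo class; the names here are illustrative only. */
final class LoadSaveDemo {

	/** Looks up a DOM implementation supporting the Load and Save feature. */
	static DOMImplementationLS loadSaveImplementation() {
		try {
			return (DOMImplementationLS)DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
		} catch(final ReflectiveOperationException exception) {
			throw new IllegalStateException(exception);
		}
	}

	/** Parses XML held in a string using an LSParser. */
	static Document parse(final String xml) {
		final DOMImplementationLS domImplementationLS = loadSaveImplementation();
		final LSParser lsParser = domImplementationLS.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
		final LSInput lsInput = domImplementationLS.createLSInput();
		lsInput.setStringData(xml);
		return lsParser.parse(lsInput);
	}

	/** Serializes a document to a string using an LSSerializer. */
	static String serialize(final Document document) {
		final LSSerializer lsSerializer = loadSaveImplementation().createLSSerializer();
		return lsSerializer.writeToString(document);
	}

	public static void main(final String[] args) {
		final Document document = parse("<foo>bar</foo>");
		System.out.println(document.getDocumentElement().getTextContent()); //prints "bar"
		System.out.println(serialize(document)); //the XML declaration, if any, precedes <foo>bar</foo>
	}
}
```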

Review

Gotchas

In the Real World

Think About It

Self Evaluation

Task

Upgrade the configuration file used to store the Booker application user name to XML format, using the following template:

<config>
  <user>Jane Doe</user>
</config>

See Also

References

Resources

Acknowledgments