Internet Protocols

Goal

Concepts

Library

Lesson

Rarely nowadays is an application completely isolated from others. Almost every program has at least one feature that requires it to be connected to some larger network, even if only to get updates from the Internet. Knowledge of network communication is essential for modern programmers. Today's network communication uses the Internet protocol suite, most commonly using HTTP over TCP/IP, two protocols you'll learn more about in this and upcoming lessons.

TCP/IP

Just as you've been designing your software by separating concerns into layers, network communication is also conceptually divided into a set of layers called a protocol stack. The layers of a protocol stack help to separate communication into different responsibilities. For instance, an email application can send mail using a high-level email protocol without worrying about whether the computer is connected by a fixed cable or by Wi-Fi (which use different protocols for low-level signalling). The most common protocol stack used on the Internet is named for two of its protocols, the Transmission Control Protocol and the Internet Protocol (TCP/IP).

TCP/IP communication is performed by exchanging packets of information, each of which travels independently. Individual packets may take different routes and may even arrive at the destination out of order, to be reassembled correctly by the receiver. At each layer of communication, the packet of data from the next higher layer is wrapped in a larger packet with a header added in front. In this way data from higher layers can be sent without knowing or caring what the data actually contains. The four TCP/IP layers defined by RFC 1122 are the Application, Transport, Internet, and Link layers.

TCP/IP stack compared with OSI Model. Packet diagram from Wikipedia.
TCP/IP | Description | Packet | Examples | OSI Model
Application | Communication of actual user data. | Data | HTTP, FTP, SMTP | Application, Presentation, Session
Transport | Manages end-to-end communication, on the same or on different computers; indicates port. | UDP Header, Data | TCP, UDP | Transport
Internet | Manages routing packets across network boundaries; indicates IP address. | IP Header, UDP Header, Data | IP | Network
Link | Navigates the specific protocol within each network type the packet passes through. | Frame Header, IP Header, UDP Header, Data, Frame Footer | ARP, PPP (Data Link); Ethernet (Physical) | Data Link, Physical

IP Address

To communicate using TCP/IP, the sender and receiver (referred to as hosts) must each have a unique IP address, which is managed by the Internet Layer. For most of the history of the Internet, hosts have used IP Version 4 (IPv4) addresses, 32-bit values that are usually presented as four decimal values separated by full stop (period) characters, for example 198.51.100.27.

Because of the incredible growth of the Internet, especially with the recent proliferation of small connected devices, the supply of available IPv4 addresses has run out. To address this problem the Internet is (very) slowly migrating to the use of IP Version 6 (IPv6) addresses, which contain 128 bits and are presented as eight groups of hexadecimal values separated by colon characters, for example 2001:0db8:fe09:0000:0000:0000:0000:001b. When representing IPv6 addresses, leading zeros in each group can be removed, and one run of consecutive all-zero groups can be replaced by two colon characters, e.g. 2001:db8:fe09::1b.
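As a minimal sketch of these notations, the standard java.net.InetAddress class can parse a literal IP address from a string; when given a literal, it only checks the format and does not contact a DNS server. The class name IpAddressSketch and the helper method parseLiteral are illustrative assumptions, not part of the lesson.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Illustrative sketch (hypothetical class name): parsing literal IP addresses. */
public class IpAddressSketch {

  /** Parses a literal IPv4 or IPv6 address; only pass literals here, as a host name would trigger a lookup. */
  public static InetAddress parseLiteral(final String literal) throws UnknownHostException {
    return InetAddress.getByName(literal);
  }

  public static void main(final String[] args) throws UnknownHostException {
    final InetAddress ipv4 = parseLiteral("198.51.100.27");
    final InetAddress ipv6Full = parseLiteral("2001:0db8:fe09:0000:0000:0000:0000:001b");
    final InetAddress ipv6Short = parseLiteral("2001:db8:fe09::1b");
    System.out.println(ipv4.getAddress().length * 8);  //prints 32, the number of bits in an IPv4 address
    System.out.println(ipv6Full.getAddress().length * 8);  //prints 128, the number of bits in an IPv6 address
    System.out.println(ipv6Full.equals(ipv6Short));  //prints true; both notations denote the same address
  }
}
```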

Port

TCP/IP does not limit communication to a single connection between hosts. Each host can have multiple communication channels open with other hosts or even between applications on the same host. The endpoint of each of these communication links is a port on the host, identified by a 16-bit number (0 through 65535) and managed by the Transport Layer. Although many ports are free for any application to use, some ports are reserved by convention for specific protocols. For example HTTP, discussed below, by default uses port 80.
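The idea of a default port can be sketched with the standard java.net.URI and java.net.URL classes, both discussed later in this lesson. URI.getPort() returns -1 when no port is given, and URL.getDefaultPort() reports the well-known port for the scheme. The class name PortSketch and the helper effectivePort are hypothetical names chosen for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;

/** Illustrative sketch (hypothetical class name): explicit versus default ports. */
public class PortSketch {

  /** Returns the port given explicitly in the URI, or otherwise the default port for its scheme. */
  public static int effectivePort(final URI uri) throws MalformedURLException {
    final int explicitPort = uri.getPort();  //-1 if the URI does not specify a port
    return explicitPort != -1 ? explicitPort : uri.toURL().getDefaultPort();
  }

  public static void main(final String[] args) throws MalformedURLException {
    System.out.println(effectivePort(URI.create("http://www.example.com/")));  //prints 80
    System.out.println(effectivePort(URI.create("http://www.example.com:8080/")));  //prints 8080
  }
}
```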

DNS

As you learned during the early lesson on indirection, the Internet Domain Name System (DNS) provides a series of hierarchical names of machines across the Internet, such as www.example.com. Although TCP/IP packets are ultimately routed using IP addresses, domain names provide a practical way for humans to exchange addresses as well as a flexible level of indirection. A separate computer called a DNS server contains a mapping of domain names to their corresponding IP addresses.

Conceptual mapping of domain names to IP addresses on a DNS server.
Domain Name | IP Address
www.myserver.com | 1.2.3.4
www.example.com | xxx.xxx.xxx.xxx

When you ask your computer to browse to http://www.myserver.com:

  1. Your computer first goes out to a DNS server and asks for the IP address of www.myserver.com.
  2. The DNS server responds that the IP address of www.myserver.com is 1.2.3.4.
  3. Your computer then makes a connection directly to the computer with IP address 1.2.3.4 (the computer running the web site of www.myserver.com).
DNS lookup diagram for www.myserver.com resolving to IP address 1.2.3.4.
Simplified DNS lookup (Dyn).
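The lookup steps above can be sketched in Java with InetAddress.getAllByName(), which asks the operating system's resolver for the addresses of a host name. The resolver may answer from a local hosts file or cache rather than contacting a DNS server directly, so this is only an approximation of the diagram. The class name DnsLookupSketch and the method lookup are hypothetical, and the default of localhost is used so the sketch runs without network access.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Illustrative sketch (hypothetical class name): resolving a host name to IP addresses. */
public class DnsLookupSketch {

  /** Asks the system resolver for all IP addresses registered for the given host name. */
  public static InetAddress[] lookup(final String hostName) throws UnknownHostException {
    return InetAddress.getAllByName(hostName);
  }

  public static void main(final String[] args) throws UnknownHostException {
    //pass a domain name such as www.example.com as the first argument to query a real DNS server
    final String hostName = args.length > 0 ? args[0] : "localhost";
    for(final InetAddress address : lookup(hostName)) {
      System.out.println(hostName + " -> " + address.getHostAddress());
    }
  }
}
```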

This extra layer of indirection has two main benefits:

URI

URI Euler Diagram

The basis for identifying resources (the general term for identifiable items, including web pages or images) on the Internet is the Uniform Resource Identifier (URI), defined most recently by RFC 3986. URIs begin with a scheme and a colon : character. A common URI scheme is http, found in web addresses.

There are two types of URIs: the Uniform Resource Locator (URL) and the Uniform Resource Name (URN). A URN uses the scheme urn and is followed by a URN-specific namespace indicating some formal identification scheme, followed by another colon : character. For example, the URN urn:isbn:9780486275437, which uses the namespace for ISBN book identifiers, identifies the Dover Thrift Edition of the book Alice's Adventures in Wonderland.

All URIs that are not URNs are considered URLs. Many URLs contain a hierarchical part indicating how to locate a resource. A web address, for example, contains a domain name or IP address, an optional port preceded by a colon : character, and a path. For example http://www.example.com:8080/foo/bar.txt is a typical format of a web address URL, which is also considered in general terms a URI.

scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]

                    hierarchical part
        ┌───────────────────┴─────────────────────┐
                    authority               path
        ┌───────────────┴───────────────┐┌───┴────┐
  abc://username:password@example.com:123/path/data?key=value#fragid1
  └┬┘   └───────┬───────┘ └────┬────┘ └┬┘           └───┬───┘ └──┬──┘
scheme  user information     host     port            query   fragment

  urn:example:mammal:monotreme:echidna
  └┬┘ └──────────────┬───────────────┘
scheme              path
Generic URI components. (Wikipedia)
Java comes with a class java.net.URI for representing a URI. The class comes with many ways to create an instance of a URI, including constructors and static factory methods. The simplest way to create a URI from a string you already know to contain a valid URI is to call URI.create(String str), which throws an IllegalArgumentException if the string is not in valid URI format. You can also resolve a path string to a URI instance, similar to resolving a path string to a Path instance for the file system, using URI.resolve(String str).
final URI exampleBaseURI = URI.create("http://www.example.com/");
final URI indexURI = exampleBaseURI.resolve("index.html");  //yields http://www.example.com/index.html
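
The components in the diagram above can also be read back from a java.net.URI instance using its standard getter methods. The following is a minimal sketch using the two example URIs from the diagram; the class name UriComponentsSketch and the helper describe are hypothetical names for illustration.

```java
import java.net.URI;

/** Illustrative sketch (hypothetical class name): reading the components of a URI. */
public class UriComponentsSketch {

  /** Returns a one-line summary of the components of a hierarchical URI. */
  public static String describe(final URI uri) {
    return String.format("scheme=%s userInfo=%s host=%s port=%d path=%s query=%s fragment=%s",
        uri.getScheme(), uri.getUserInfo(), uri.getHost(), uri.getPort(),
        uri.getPath(), uri.getQuery(), uri.getFragment());
  }

  public static void main(final String[] args) {
    final URI uri = URI.create("abc://username:password@example.com:123/path/data?key=value#fragid1");
    //prints scheme=abc userInfo=username:password host=example.com port=123 path=/path/data query=key=value fragment=fragid1
    System.out.println(describe(uri));

    //a URN has no authority or hierarchical path; Java calls such a URI "opaque"
    final URI urn = URI.create("urn:example:mammal:monotreme:echidna");
    System.out.println(urn.isOpaque());  //prints true
    System.out.println(urn.getSchemeSpecificPart());  //prints example:mammal:monotreme:echidna
  }
}
```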

URI Encoding

The characters that may appear in a URI are limited to a subset of ASCII characters, and some of them have special meaning (e.g. a slash / character is used to separate path segments). Characters that are outside the ASCII range or that are restricted must be encoded if they are to be included in a URI. RFC 3986 specifies that characters may be encoded in a URI by using the following algorithm:

  1. Encode the character in UTF-8.
  2. Encode each resulting byte in two uppercase hexadecimal digits preceded by a percent % sign (e.g. %0F).

Thus the word touché could be encoded in a URI relative to http://example.com/ as follows:

http://example.com/touch%C3%A9
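
The two-step algorithm above can be sketched directly in Java. This is a simplified illustration, not a complete RFC 3986 encoder: it encodes every byte that is not one of the RFC 3986 "unreserved" characters, which is stricter than the specification requires for a path segment. The class name PercentEncodingSketch and the method encodePathSegment are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch (hypothetical class name): percent-encoding a URI path segment. */
public class PercentEncodingSketch {

  /** The characters RFC 3986 calls "unreserved", which never need encoding. */
  private static final String UNRESERVED =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

  /** Encodes the given text as UTF-8 and percent-encodes every byte that is not unreserved. */
  public static String encodePathSegment(final String segment) {
    final StringBuilder builder = new StringBuilder();
    for(final byte b : segment.getBytes(StandardCharsets.UTF_8)) {  //step 1: encode in UTF-8
      final int value = b & 0xFF;  //treat the byte as unsigned
      if(UNRESERVED.indexOf(value) >= 0) {
        builder.append((char)value);
      } else {
        builder.append(String.format("%%%02X", value));  //step 2: percent sign and two uppercase hex digits
      }
    }
    return builder.toString();
  }

  public static void main(final String[] args) {
    //\u00E9 is the character é, written as an escape to avoid source file encoding issues
    System.out.println("http://example.com/" + encodePathSegment("touch\u00E9"));  //prints http://example.com/touch%C3%A9
  }
}
```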

URL

By far the most common type of URI is the URL. Here are some URL schemes you are likely to run into:

Scheme | Description | Default Port | Example
file | File on a file system. The file scheme has many inconsistencies, especially on Windows when used from Java. | N/A | file:///usr/local/foo/bar.txt
ftp | Resource accessible via FTP. | 21 | ftp://ftp.example.com/foo/bar.txt
http | Resource accessible via HTTP. | 80 | http://www.example.com/foo/bar.txt
https | Resource accessible via HTTP over SSL/TLS. | 443 | https://www.example.com/foo/bar.txt
mailto | Email address. | varies | mailto:jdoe@example.com
tel | Telephone number. | N/A | tel:+1-415-555-0123

The java.net.URL class predates the java.net.URI class and represents not only a URL endpoint, but also several implementations of how to retrieve data from the URL.

Media Types

The data you can transfer across the Internet comes in various flavors. An HTTP GET request could retrieve a plain text file, an HTML file, an audio clip, or an image. Rather than relying on filename extension, Internet architecture uses a more robust mechanism called a media type for determining the type of an entity. Most recently specified in RFC 6838, media types have a formal IETF registration process and provide fixed, descriptive identifiers for media content.

A media type has a specially formatted identifier consisting of a type and subtype. Some common media types include:

A media type value may be followed by additional parameters, each separated from what precedes it by a semicolon ; character. The most common media type parameter is charset, which indicates the charset of the content using the form name, equals = character, value. The following media type would indicate a type of plain text with a UTF-8 charset:

text/plain; charset=UTF-8
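
As a rough sketch of how such a value breaks down into a type, a subtype, and parameters, the following class splits a media type string on the slash, semicolon, and equals characters. It is a simplification under stated assumptions: it does not handle quoted parameter values or validate the allowed characters. The class name MediaTypeSketch and its members are hypothetical and are not a standard Java API.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch (hypothetical class, not a standard API): a parsed media type. */
public class MediaTypeSketch {

  public final String type;
  public final String subtype;
  public final Map<String, String> parameters;

  private MediaTypeSketch(final String type, final String subtype, final Map<String, String> parameters) {
    this.type = type;
    this.subtype = subtype;
    this.parameters = parameters;
  }

  /** Parses a value such as "text/plain; charset=UTF-8"; quoted parameter values are not supported. */
  public static MediaTypeSketch parse(final String value) {
    final String[] parts = value.split(";");
    final String[] typeSubtype = parts[0].trim().split("/", 2);
    if(typeSubtype.length != 2) {
      throw new IllegalArgumentException("Missing type/subtype: " + value);
    }
    final Map<String, String> parameters = new LinkedHashMap<>();
    for(int i = 1; i < parts.length; i++) {  //each remaining part is a name=value parameter
      final String[] nameValue = parts[i].trim().split("=", 2);
      if(nameValue.length != 2) {
        throw new IllegalArgumentException("Invalid parameter: " + parts[i]);
      }
      parameters.put(nameValue[0].toLowerCase(Locale.ROOT), nameValue[1]);  //parameter names are case-insensitive
    }
    return new MediaTypeSketch(typeSubtype[0], typeSubtype[1], parameters);
  }

  public static void main(final String[] args) {
    final MediaTypeSketch mediaType = parse("text/plain; charset=UTF-8");
    System.out.println(mediaType.type);  //prints text
    System.out.println(mediaType.subtype);  //prints plain
    System.out.println(mediaType.parameters.get("charset"));  //prints UTF-8
  }
}
```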

HTTP

The Hypertext Transfer Protocol (HTTP) is an application-layer protocol in the TCP/IP stack. It is the basis of browsing the World Wide Web, as well as the foundation for many new Internet-based APIs. HTTP has almost become synonymous with the Internet.

Communication with HTTP is conceptually very simple, which is one reason for its ubiquity. An HTTP command consists of a request and a response. Each HTTP request includes a method (or verb) indicating an action to perform. The response indicates the result of the request.

  1. Request: verb resource-URI [content] (Do something with the identified resource using the given content, if any.)
  2. Response: response-code [content] (Here is the outcome of the request, with content if appropriate.)

The most commonly used HTTP method is GET, which indicates that the user agent (such as a web browser) wants to retrieve a resource (such as a web page). The response code 200 indicates that everything went OK; the requested web page will be returned in the body of the response. At the beginning of each request and response is a series of headers (names and values separated by a colon : character), which provide more information about the message. The conversation may look like this when retrieving the http://www.example.com/index.html home page:

Example HTTP Request (Wikipedia)
GET /index.html HTTP/1.1
Host: www.example.com
Example HTTP Response (Wikipedia)
HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
ETag: "3f80f-1b6-3e1cb03b"
Content-Type: text/html; charset=UTF-8
Content-Length: 138
Accept-Ranges: bytes
Connection: close

<html>
<head>
  <title>An Example Page</title>
</head>
<body>
  Hello World, this is a very simple HTML document.
</body>
</html>
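
To make the structure of the response above concrete, here is a minimal sketch that picks apart a status line and a few header lines held in memory. It is not how a real HTTP client is implemented; it keeps header names exactly as written (real clients treat them case-insensitively) and keeps only the last value of a repeated header. The class name HttpMessageSketch and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch (hypothetical class name): the parts of an HTTP response message. */
public class HttpMessageSketch {

  /** Extracts the numeric response code from a status line such as "HTTP/1.1 200 OK". */
  public static int parseStatusCode(final String statusLine) {
    final String[] parts = statusLine.split(" ", 3);  //protocol version, response code, reason phrase
    if(parts.length < 2) {
      throw new IllegalArgumentException("Invalid status line: " + statusLine);
    }
    return Integer.parseInt(parts[1]);
  }

  /** Parses header lines, splitting each at its first colon, stopping at the blank line before the body. */
  public static Map<String, String> parseHeaders(final List<String> lines) {
    final Map<String, String> headers = new LinkedHashMap<>();
    for(final String line : lines) {
      if(line.isEmpty()) {  //a blank line separates the headers from the body
        break;
      }
      final int colonIndex = line.indexOf(':');
      if(colonIndex < 0) {
        throw new IllegalArgumentException("Invalid header line: " + line);
      }
      headers.put(line.substring(0, colonIndex).trim(), line.substring(colonIndex + 1).trim());
    }
    return headers;
  }

  public static void main(final String[] args) {
    System.out.println(parseStatusCode("HTTP/1.1 200 OK"));  //prints 200
    final Map<String, String> headers = parseHeaders(List.of(
        "Date: Mon, 23 May 2005 22:38:34 GMT",
        "Content-Type: text/html; charset=UTF-8",
        "Content-Length: 138",
        "",
        "<html>"));
    System.out.println(headers.get("Content-Type"));  //prints text/html; charset=UTF-8
    System.out.println(headers.get("Date"));  //prints Mon, 23 May 2005 22:38:34 GMT
  }
}
```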

URLConnection

Java provides an abstract class java.net.URLConnection, which allows reading from and writing to a URL location. Most important in the current context is the abstract subclass java.net.HttpURLConnection, which provides special access to HTTP-specific information. You can specify the HTTP method to use with the HttpURLConnection.setRequestMethod(String method) method, and then ask for the response code sent back in the response by calling HttpURLConnection.getResponseCode().

You can get an instance of URLConnection by calling URL.openConnection(). If the URL you provided has one of the HTTP schemes, you can be sure that the returned value will be an HttpURLConnection:

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

final URI exampleURI = URI.create("http://www.example.com/index.html");
final HttpURLConnection connection = (HttpURLConnection)exampleURI.toURL().openConnection();
connection.setRequestMethod("GET");  //included for clarity; GET is already the default
final int responseCode = connection.getResponseCode();  //sends the request if not yet sent; TODO check response code
try(final InputStream inputStream = new BufferedInputStream(connection.getInputStream())) {
  //TODO read from the input stream
}
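
One possible way to fill in the TODO items above is sketched below as a complete class. It makes several assumptions that are not from the lesson: any response code other than 200 is treated as an error, the body is assumed to be UTF-8 text small enough to hold in memory, and the class name HttpFetchSketch and method fetch are hypothetical. It uses InputStream.readAllBytes(), which requires Java 9 or later.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch (hypothetical class name): retrieving a resource using HTTP GET. */
public class HttpFetchSketch {

  /** Retrieves the body of the resource at the given http or https URI as text, assuming UTF-8 content. */
  public static String fetch(final URI uri) throws IOException {
    //the cast is only safe for the HTTP schemes; other schemes would cause a ClassCastException
    final HttpURLConnection connection = (HttpURLConnection)uri.toURL().openConnection();
    try {
      connection.setRequestMethod("GET");  //included for clarity; GET is already the default
      final int responseCode = connection.getResponseCode();
      if(responseCode != HttpURLConnection.HTTP_OK) {  //anything other than 200 is treated as a failure in this sketch
        throw new IOException("Unexpected response code " + responseCode + " for " + uri);
      }
      try(final InputStream inputStream = new BufferedInputStream(connection.getInputStream())) {
        return new String(inputStream.readAllBytes(), StandardCharsets.UTF_8);
      }
    } finally {
      connection.disconnect();  //indicates that no further requests to this server are expected soon
    }
  }

  public static void main(final String[] args) throws IOException {
    final String address = args.length > 0 ? args[0] : "http://www.example.com/index.html";
    System.out.println(fetch(URI.create(address)));
  }
}
```

A production program would more likely inspect the charset parameter of the Content-Type header rather than assuming UTF-8, and would handle redirects and error responses individually.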

Review

Gotchas

In the Real World

Think About It

Self Evaluation

Task

Add a new info command-line option to the Booker program to provide information on a particular book. Add a new --lookup flag to the Booker program, to be used in conjunction with the list command, indicating that information should be retrieved from the Internet. If the flag is present, retrieve and print information from the Google Books API.

Example usage: booker list --isbn 9780486275437 --lookup

Option | Alias | Description
list | | Lists all available publications.
load-snapshot | | Loads the snapshot list of publications into the current repository.
purchase | | Removes a single copy of the book identified by ISBN from stock.
subscribe | | Subscribes to a year's worth of issues of the periodical identified by ISSN.
--debug | -d | Includes debug information in the logs.
--help | -h | Prints out a help summary of available switches.
--isbn | | Identifies a book, for example for the purchase command.
--issn | | Identifies a periodical, for example for the subscribe command.
--locale | -l | Indicates the locale to use in the program, overriding the system default. The value is in language tag format.
--lookup | | Retrieves from the Internet information on a book identified by its ISBN.
--name | -n | Indicates a filter by name for the list command.
--type | -t | Indicates the type of publication to list, either book or periodical. If not present, all publications will be listed.

See Also

References

Resources

Acknowledgments