Text I/O

Goals

Concepts

Library

Dependencies

Lesson

The basis of file I/O in Java, as you've learned, is formed by the byte-based InputStream and OutputStream classes. But human-readable text, based upon characters represented by the Java char primitive type, must be translated to and from a series of bytes using a charset, which encompasses the concepts of a character set, a character encoding, and a byte order. As you've experienced in the tasks in previous lessons, manually extracting bytes and converting them to characters using the correct character encoding algorithm can be tedious, not to mention complicated.

Java provides a set of reader and writer I/O classes for translating between byte and char automatically based upon a charset. The java.io.Reader class is for decoding characters from the bytes of an underlying InputStream, and the java.io.Writer class is for encoding characters to bytes written to an underlying OutputStream.

Readers

The following reader classes are all in the java.io package.

Reader class diagram.
Reader
Abstract class that forms the basis of all readers.
BufferedReader
Provides buffering of other readers.
CharArrayReader
A reader to an existing array of characters.
FileReader
A direct reader to a file. This class uses the old java.io.File class and should only be used with legacy code.
FilterReader
A simple reader wrapper allowing subclasses to do more processing on data after reading.
InputStreamReader
Central reader for wrapping an input stream and decoding bytes to characters.
LineNumberReader
A reader that keeps track of line numbers.
StringReader
A reader to the characters of an existing string.

Writers

The following writer classes are all in the java.io package.

Writer class diagram.
Writer
Abstract class that forms the basis of all writers.
BufferedWriter
Provides buffering of other writers.
CharArrayWriter
A writer to a dynamically managed internal array of characters.
FileWriter
A writer to a file. This class uses the old java.io.File class and should only be used with legacy code.
FilterWriter
A simple writer wrapper allowing subclasses to do more processing on data before writing.
OutputStreamWriter
Central writer for wrapping an output stream and encoding characters to bytes.
StringWriter
A writer to an internal string buffer, which can later be used to produce a string.

Reading and Writing char

The biggest distinction between readers and writers on the one hand, and input and output streams on the other, is that the former work in terms of characters rather than byte values. A single character can be read using Reader.read(); this method returns an int, just as InputStream.read() does, and both use the value -1 to indicate the end of the stream. However the range of values of InputStream.read() is that of eight bits of information: 0x00–0xFF. The range of values returned by Reader.read() is that of 16 bits of information (i.e. the range of a char): 0x0000–0xFFFF.

If the source of the data is already stored in characters, then no conversion to or from bytes has to take place. Here is how to read the characters from a string, for example, using java.io.StringReader:

Reading individual characters from a StringReader using Reader.read().
final String inputString = "touché";
try(final Reader reader = new StringReader(inputString)) {
  int charValue;
  while((charValue = reader.read()) != -1) {  //read the characters
    System.out.println(String.format("U+%04X", charValue));  //U+XXXX
  }
}

Analogous to InputStream.read(byte[] b) there exists Reader.read(char[] cbuff) which reads multiple characters at a time into an existing buffer. Similarly for writing, analogous to OutputStream.write(byte[] b) and OutputStream.write(byte[] b, int off, int len) there exist Writer.write(char[] cbuf) and Writer.write(char[] cbuf, int off, int len), respectively.

You can therefore read and write buffers of characters, and move them between readers and writers similarly to how you would with byte streams.

Copying from a Reader to a Writer using a buffer.
final char[] inputChars = "abcdefghijklmnopqrstuvwxyz".toCharArray();

//create a buffer array for copying up to 16 characters at a time (an arbitrary value)
final char[] buffer = new char[0x10];

//create a destination writer for the characters
final StringWriter stringWriter = new StringWriter();

//copy a buffer at a time until we reach the end of the reader
try(final Reader reader = new CharArrayReader(inputChars)) {
  int count;
  while((count = reader.read(buffer)) != -1) {  //-1 indicates end of stream
    stringWriter.write(buffer, 0, count);
  }
}

//print out the string of the characters copied to the Writer
System.out.println(stringWriter.toString());

Adapting a Byte Stream

Just as input and output streams can wrap other input and output streams, some readers and writers can wrap other readers and writers; this is the decorator pattern you learned about already. However the important java.io.InputStreamReader and java.io.OutputStreamWriter classes do not wrap other readers and writers; rather they wrap instances of InputStream and OutputStream, respectively. Calling InputStreamReader.read() for example will call the underlying InputStream.read() to read the correct number of bytes, decoding them (according to the specified charset) to the correct character, which is then returned according to the Reader.read() contract. Similarly OutputStreamWriter.write(int c) will convert the input character to the correct number of bytes and write them to the underlying OutputStream based on the specified charset.

In order for InputStreamReader and OutputStreamWriter to convert characters to and from byte streams, they must be provided a charset, usually in the form of a java.nio.charset.Charset instance. Let's take the UTF-8 encoding of the string "touché" introduced in the charsets lesson and read the bytes as characters, letting the InputStreamReader take care of converting the bytes for us:

Reading UTF-8 encoded bytes using an InputStreamReader.
final byte[] inputBytes = new byte[] { 0x74, 0x6F, 0x75, 0x63, 0x68, (byte)0xC3, (byte)0xA9 };
try(final Reader reader = new InputStreamReader(new ByteArrayInputStream(inputBytes), StandardCharsets.UTF_8)) {
  int charValue;
  while((charValue = reader.read()) != -1) {
    System.out.println((char)charValue);
  }
}
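Going the other direction, an OutputStreamWriter encodes characters to bytes for the underlying stream. Here is a minimal sketch (the class and method names are ours) encoding the same string back to UTF-8 bytes, with a ByteArrayOutputStream as the destination:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class EncodeExample {

  /** Encodes a string to UTF-8 bytes by way of an OutputStreamWriter. */
  public static byte[] encodeUtf8(final String string) throws IOException {
    final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    try(final Writer writer = new OutputStreamWriter(byteArrayOutputStream, StandardCharsets.UTF_8)) {
      writer.write(string);
    }  //closing the writer flushes any pending encoded bytes
    return byteArrayOutputStream.toByteArray();
  }

  public static void main(final String[] args) throws IOException {
    for(final byte b : encodeUtf8("touché")) {
      System.out.println(String.format("0x%02X", b & 0xFF));  //e.g. 0x74 … 0xC3 0xA9
    }
  }
}
```

Closing (or at least flushing) the writer before examining the bytes matters, because OutputStreamWriter buffers encoded bytes internally.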

Byte Order Marks

Unfortunately, although Java's readers and writers do a fine job of converting between bytes and characters based upon charsets, they don't handle byte order marks at all. A BOM will be interpreted as bytes to be converted into characters, even though it should really be used to determine the charset and then discarded. Instead you'll need to detect and write any BOM manually.

Reading BOMs

One approach for auto-detecting a charset is to read bytes from the underlying InputStream to see if they constitute a BOM before creating the InputStreamReader that wraps around it. The logic might look like this:

Outline for determining a charset by detecting a BOM when creating a Reader.
public class BOMCharsetDetectInputStreamReader extends InputStreamReader {

  public BOMCharsetDetectInputStreamReader(@Nonnull final InputStream inputStream) throws IOException {
    super(inputStream, detectCharset(inputStream));  //detect charset on the fly during construction
  }

  private static Charset detectCharset(final @Nonnull InputStream inputStream) throws IOException {
    //TODO make sure input stream supports mark/reset
    inputStream.mark(4);  //reserve enough room for the largest BOM we support
    final byte[] buffer;
    //TODO read two bytes into the buffer if possible
    if(…) {  //TODO see if it is one of the UTF-16 BOMs; if so
      return …;  //return StandardCharsets.UTF_16BE or StandardCharsets.UTF_16LE
    }
    //TODO read another byte into the buffer if possible
    if(…) {  //TODO see if it is the UTF-8 BOM; if so
      return StandardCharsets.UTF_8;
    }
    //TODO read one more byte into the buffer if possible
    if(…) {  //TODO see if it is one of the UTF-32 BOMs; if so
      return …;  //return the correct UTF-32BE or UTF-32LE charset
    }
    //TODO if nothing matches, assume that this was UTF-8 with no BOM
    inputStream.reset();    //don't forget to put the bytes back---they weren't a BOM!
    return StandardCharsets.UTF_8;
  }

}
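For reference, the BOM byte sequences such a detector compares against are fixed by the Unicode standard; here they are as Java constants (the class and field names are ours):

```java
import java.util.Arrays;

public class ByteOrderMarks {

  //BOM byte sequences as defined by the Unicode standard
  public static final byte[] UTF_8_BOM = {(byte)0xEF, (byte)0xBB, (byte)0xBF};
  public static final byte[] UTF_16BE_BOM = {(byte)0xFE, (byte)0xFF};
  public static final byte[] UTF_16LE_BOM = {(byte)0xFF, (byte)0xFE};
  public static final byte[] UTF_32BE_BOM = {0x00, 0x00, (byte)0xFE, (byte)0xFF};
  public static final byte[] UTF_32LE_BOM = {(byte)0xFF, (byte)0xFE, 0x00, 0x00};

  public static void main(final String[] args) {
    System.out.println(Arrays.toString(UTF_8_BOM));
  }
}
```

Note that the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM, so a detector that stops after matching two bytes can never recognize UTF-32LE; keep this ambiguity in mind when deciding how many bytes to examine.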

Writing BOMs

Similarly an OutputStreamWriter knows how to translate from characters to the correct byte sequence based on a charset, but it will not prepend the bytes with any byte order mark. Creating a Writer to do this might appear like this:

Outline for prepending an output stream with the correct BOM when creating a Writer.
public class BOMOutputStreamWriter extends OutputStreamWriter {

  public BOMOutputStreamWriter(final @Nonnull OutputStream outputStream,
      final @Nonnull Charset charset) throws IOException {
    super(outputStream, charset);  //the superclass constructor must run first
    outputStream.write(getBOM(charset));  //write the BOM to the underlying output stream
  }

  private static byte[] getBOM(final @Nonnull Charset charset) {
    //TODO return the correct BOM byte sequence for the given charset
  }

}

Buffered Readers and Writers

Analogous to BufferedInputStream and BufferedOutputStream, Java comes with java.io.BufferedReader and java.io.BufferedWriter classes for wrapping an existing reader or writer. These classes provide buffering at the character level rather than the byte level.

Buffered reading from a text file using a BufferedReader.
final Path path = Paths.get("/etc/foo/bar.txt");
try(final Reader reader = new BufferedReader(new InputStreamReader(Files.newInputStream(path), StandardCharsets.UTF_8))) {
  int charValue;
  while((charValue = reader.read()) != -1) {
    System.out.println((char)charValue);
  }
}
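Besides buffering, BufferedReader adds the convenient readLine() method, which returns one line at a time without its line terminator, or null at the end of the stream. A minimal sketch (the helper name is ours), reading via a StringReader so no file is needed:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ReadLinesExample {

  /** Reads all lines from a reader, buffering as it goes. */
  public static List<String> readLines(final Reader reader) throws IOException {
    final List<String> lines = new ArrayList<>();
    try(final BufferedReader bufferedReader = new BufferedReader(reader)) {
      String line;
      while((line = bufferedReader.readLine()) != null) {  //null indicates end of stream
        lines.add(line);  //the line terminator itself is not included
      }
    }
    return lines;
  }

  public static void main(final String[] args) throws IOException {
    System.out.println(readLines(new StringReader("foo\nbar\nbaz")));  //[foo, bar, baz]
  }
}
```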

Review

Gotchas

In the Real World

Think About It

Self Evaluation

Task

Your existing IO.readText(InputStream) method fakes UTF-8 support by making sure all bytes in the input stream are values within the ASCII range. Upgrade the method to have true UTF-8 support using InputStreamReader.

  1. Make sure the input stream supports mark/reset. Remember that if it doesn't you can always wrap it in BufferedInputStream.
  2. Read the first few bytes, and if it is the UTF-8 signature, throw them away. Otherwise, reset the stream so that the actual content bytes aren't lost, and assume the content is encoded in UTF-8.
  3. (optional) If you like you can also check for the various UTF-16 and UTF-32 BOMs. Otherwise, be sure to document that the method only supports UTF-8.
  4. Create a CharsetDecoder for the charset and configure it to use CodingErrorAction.REPORT to report illegal byte sequences or invalid code points.
  5. Create an InputStreamReader with the appropriate charset using the CharsetDecoder.
  6. Read and return the text.
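Steps 4 and 5 hinge on the InputStreamReader constructor that accepts a CharsetDecoder rather than a Charset. A minimal sketch of just that part (the class and method names are ours, not part of the task):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Reader {

  /** Wraps an input stream in a reader that reports (rather than replaces) bad UTF-8. */
  public static Reader newStrictUtf8Reader(final InputStream inputStream) {
    final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)  //report illegal byte sequences
        .onUnmappableCharacter(CodingErrorAction.REPORT);  //report invalid code points
    return new InputStreamReader(inputStream, decoder);
  }

  public static void main(final String[] args) throws IOException {
    final byte[] badBytes = {(byte)0xC3};  //truncated UTF-8 sequence
    try(final Reader reader = newStrictUtf8Reader(new ByteArrayInputStream(badBytes))) {
      reader.read();
    } catch(final MalformedInputException malformedInputException) {
      System.out.println("malformed input detected");
    }
  }
}
```

With CodingErrorAction.REPORT the reader throws rather than silently substituting U+FFFD replacement characters, which is what the task asks for.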

You can use the online Unicode code converter to find the UTF-8 representations of Unicode code points.

Resources

Acknowledgments