Java Programming Notes # 2188
- Preface
- Preview
- Understanding Base64
- Program Code
- Run the Programs
- Summary
- What’s Next?
- Program Listings
Preface
This is the third in a series of lessons designed to teach you how to
write Java programs to protect your email inbox from spam and
email-borne viruses. The first lesson in the series was
entitled
Overview of the BigDog Email Protection Program.
The previous lesson was entitled
Getting Started
with the BigDog Email Protection Program.
In addition, the material in this lesson has broad applicability in
other areas such as Security,
Introduction to Message Digests and Servlets, Session
Tracking Using Basic Authentication.
I have published several earlier lessons that deal
exclusively with spam and email-borne viruses, such as the series that
began with the lesson entitled Enlisting
Java in the War Against SPAM: The Communications Module and the
series that began with
the lesson entitled Enlisting
Java in the War Against Email Viruses. Information in those
lessons serves
as background material for this series.
Viewing tip
You may find it useful to open another copy of this lesson in a
separate browser window. That will make it easier for you to
scroll back and forth among the different listings and figures while
you are reading about them.
Supplementary material
I recommend that you also study the other lessons in my extensive
collection of online Java tutorials. You will find those lessons
published at Gamelan.com.
However, as of the date of this writing, Gamelan doesn’t maintain a
consolidated index of my Java tutorial lessons, and sometimes
they are difficult to locate there. You will find a consolidated
index at www.DickBaldwin.com.
Preview
This lesson explains the
use of base64 encoding and decoding in general,
and illustrates base64 encoding and decoding using sample programs.
A future lesson will
explain how base64 decoding is used in the BigDog program.
Understanding Base64
What is base64 encoding?
As I understand it, the base64 encoding scheme was originally devised
to make it possible to reliably transmit eight-bit data through
transmission systems constrained to handle seven-bit data. The
encoding scheme has been in use for many years.
Among other things, the use of base64 encoding makes it possible to:
- Transmit image data reliably across the Internet.
- Transmit non-English characters reliably across the Internet.
Can be used to hide spam
Unfortunately, the use of base64 encoding also makes it possible for
spammers to hide offensive text from spam blocking programs that are
not equipped to deal effectively with the hiding technique. The
spam screening module used by the BigDog set of programs deals
with the following:
- Encoded subject lines.
- Encoded body text in single-part messages.
- Encoded body text in multipart messages.
This lesson explains base64 in general. Future lessons will
explain the Java code that I have written
to deal
with these issues in the BigDog program.
RFC 1521
One of the best resources that I have found for understanding base64
encoding is the document entitled Mechanisms
for Specifying and Describing the Format of Internet Message Bodies,
otherwise known as Request for Comments (RFC) 1521.
(This is a
rather large document that covers numerous topics in addition to base64
encoding.)
The author of RFC 1521
states:
“STD 11, RFC 822 defines a message representation protocol which specifies
considerable detail about message headers, but which leaves the message
content, or message body, as flat ASCII text. This document redefines the
format of message bodies to allow multi-part textual and non-textual message
bodies to be represented and exchanged without loss of information.”
Support for richer text, audio, video, and
non-English languages
In justifying RFC 1521, the author also states:
“Even in the case of text, however, RFC 822 is inadequate for the needs of
mail users whose languages require the use of character sets richer than
US ASCII [US-ASCII]. Since RFC 822 does not specify mechanisms for mail
containing audio, video, Asian language text, or even text in most European
languages, additional specifications are needed.”
After discussing several
problems that existed prior to RFC 1521, the
author states:
“This document describes several mechanisms that combine to solve most of
these problems without introducing any serious incompatibilities with the
existing world of RFC 822 mail.”
The author of RFC 1521 goes
on to describe several features proposed by
RFC 1521, including the use of base64 encoding in email messages.
The author tells us that base64 as described in RFC 1521 is “…adapted
from RFC
1421, …” RFC 1421 describes
Message Encryption and Authentication Procedures.
(I have
written several previous lessons involving encryption and
authentication that briefly describe the use of base64 encoding.)
The base64 encoding process
Eight-bit data values are mapped into a 65-character
subset of the US-ASCII code, enabling subgroups of 6 bits each to be
represented
by 64 different printable
characters.
(The extra 65th character, ‘=’, is used to signify a
special processing function, which I will describe later.)
The encoding process causes 24-bit groups, each representing three
eight-bit data values, to be
represented as output groups of four encoded characters that are
derived from a base64 alphabet.
Concatenate, subdivide, and translate
Proceeding from left to right, a 24-bit input group is formed by
concatenating three 8-bit input groups. These 24 bits are then treated
as 4 concatenated 6-bit groups, each of which is translated into a
single character in the base64 alphabet.
(Table 1, which I will present later, shows the base64
alphabet.)
The order of the bits is important
The input bit stream must be ordered with the most significant bit
first. The first bit in the stream must be the high order bit in
the first byte, and the eighth bit must be the low order bit in the
first byte, etc.
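The bit ordering described above can be illustrated with a small sketch. This is not one of the listings from this lesson; the class and method names (BitOrderSketch, toBitString, groupBySix) are hypothetical and were chosen only for illustration. The sketch writes each byte with its most significant bit first, concatenates the results, and then regroups the same bits six at a time.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: names are hypothetical, not from the lesson.
class BitOrderSketch {

    // Render each byte as eight bits, most significant bit first,
    // and concatenate the results into one bit string.
    static String toBitString(byte[] data) {
        StringBuilder bits = new StringBuilder();
        for (byte b : data) {
            for (int i = 7; i >= 0; i--) {
                bits.append((b >> i) & 1);
            }
        }
        return bits.toString();
    }

    // Split a bit string into space-separated six-bit groups.
    static String groupBySix(String bits) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < bits.length(); i += 6) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(bits, i, Math.min(i + 6, bits.length()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = "klm".getBytes(StandardCharsets.US_ASCII);
        String bits = toBitString(data);
        System.out.println(bits);             // 011010110110110001101101
        System.out.println(groupBySix(bits)); // 011010 110110 110001 101101
    }
}
```

The four six-bit groups have the decimal values 26, 54, 49, and 45, which are the values that are looked up in the base64 alphabet.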
The base64 alphabet
The base64 alphabet is made up of 64 printable characters plus the
equal ‘=’ character. The ‘=’ character is used as a pad
when the number of input bytes is not evenly divisible by three, and
therefore doesn’t produce a number of output
characters that is evenly divisible by four.
(For
example, the four input bytes represented by the eight-bit characters
klmn produce the following six output characters plus two pad
characters: a2xtbg==. In addition, the output stream of
characters is terminated by a carriage return character and a line feed
character.)
The
base64 alphabet
The base64 alphabet is shown in Table 1. Whenever the value of a
six-bit group matches one of the values in the Value columns in
Table 1, that value is replaced by the seven-bit ASCII value of the
corresponding
character shown in the Char column to the right of the Value
column.
Value | Char | Value | Char | Value | Char | Value | Char |
0 | A | 17 | R | 34 | i | 51 | z |
1 | B | 18 | S | 35 | j | 52 | 0 |
2 | C | 19 | T | 36 | k | 53 | 1 |
3 | D | 20 | U | 37 | l | 54 | 2 |
4 | E | 21 | V | 38 | m | 55 | 3 |
5 | F | 22 | W | 39 | n | 56 | 4 |
6 | G | 23 | X | 40 | o | 57 | 5 |
7 | H | 24 | Y | 41 | p | 58 | 6 |
8 | I | 25 | Z | 42 | q | 59 | 7 |
9 | J | 26 | a | 43 | r | 60 | 8 |
10 | K | 27 | b | 44 | s | 61 | 9 |
11 | L | 28 | c | 45 | t | 62 | + |
12 | M | 29 | d | 46 | u | 63 | / |
13 | N | 30 | e | 47 | v | pad | = |
14 | O | 31 | f | 48 | w | | |
15 | P | 32 | g | 49 | x | | |
16 | Q | 33 | h | 50 | y | | |
Encoding and decoding example from Table 1
For example, when encoding, the six-bit value of zero is replaced in
the base64 output by a value of 65, which is the seven-bit ASCII value
that
represents the character A.
When decoding, the base64 character A is replaced by a six-bit group
of bits with a value of zero.
Line length limitations
According to RFC 1521, the output stream of encoded characters must be
represented in lines of no more than 76 characters each.
As you
will see later, the Sun encoding software accepts input data as an
array of eight-bit bytes. The output stream is always terminated
by a
carriage return and a line feed. If the number of bytes in the
input array produces more than 76 characters in the output stream, each
group of 76 output characters is terminated by a carriage return and a
line feed, and the final partial line, if any, is also terminated by a
carriage return and a line feed.
The final line also includes pad characters, if necessary, to guarantee
that the total number of base64 characters in the output is evenly
divisible by four.
The pad character
Here is part of what the author of RFC 1521 has to say about the use of
the pad character:
“Special
processing is performed if fewer than 24 bits are available at the end
of the data being encoded. A full encoding quantum is always completed
at the end of a body. When fewer than 24 input bits are available in an
input group, zero bits are added (on the right) to form an integral
number of 6-bit groups. Padding at the end of the data is performed
using the ‘=’ character.”
Three possible cases regarding padding
The author of RFC 1521 goes
on to tell us:
“Since all
base64 input is an integral number of octets, only the following cases
can arise:
- the final quantum of encoding input is an integral multiple of 24 bits;
here, the final unit of encoded output will be an integral multiple of 4
characters with no ‘=’ padding,
- the final quantum of encoding input is exactly 8 bits; here, the final
unit of encoded output will be two characters followed by two ‘=’ padding
characters, or
- the final quantum of encoding input is exactly 16 bits; here, the final
unit of encoded output will be three characters followed by one ‘=’ padding
character.”
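The three padding cases can be observed with a short sketch. Note that this sketch uses the documented java.util.Base64 class, which was added in Java 8 and is not the software discussed in this lesson. The class name PaddingSketch and the helper method encode are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative sketch only. Uses java.util.Base64 (Java 8 and later),
// not the sun.misc classes discussed in this lesson.
class PaddingSketch {

    // Encode the eight-bit (US-ASCII) representation of the text as base64.
    static String encode(String text) {
        byte[] raw = text.getBytes(StandardCharsets.US_ASCII);
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        System.out.println(encode("klm"));  // 24 bits, no padding:        a2xt
        System.out.println(encode("k"));    // 8 bits, two pad characters: aw==
        System.out.println(encode("kl"));   // 16 bits, one pad character: a2w=
        System.out.println(encode("klmn")); // 24 bits plus 8 bits:        a2xtbg==
    }
}
```

In every case, the number of output characters is evenly divisible by four.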
Program Code
Two different programs
I am going to present and discuss two different programs in this
lesson. I will begin with a program named Base64_02.java.
The sole purpose of this program is to illustrate the base64 encoding
and decoding algorithms in a very simple setting.
Next, I will present a program that explains the use of encoding and
decoding classes and methods in an undocumented Sun package named sun.misc.
Along with that discussion, I will also point you to alternative
documented
resources for encoding and decoding base64.
In a future lesson, I will explain several methods that are
incorporated
into the BigDog set of programs that is designed to protect
your email inbox from viruses and spam.
The program named Base64_02
This program is not intended for production use. Rather, it is
intended solely to illustrate the encoding and decoding algorithms for
base64. I will point you to programs that are intended for
production use later.
Not fully tested
Note that this program has not been fully tested. Don’t use it
for any significant purpose without first testing the conversion to
base64 for all possible values in a group of three eight-bit bytes.
Documented encoding and decoding classes
For documented software that you can use to encode and decode base64,
see the following encoder
and decoder
classes. I haven’t tested these programs, but I am assuming that
they are correct. They are published on the excellent web site of Professor Douglas Lyon, who provides the source code for dozens of different
algorithms including those used to encode and decode base64.
Undocumented Sun classes
If you are willing to use undocumented Sun classes to encode and
decode base64, you can use the encodeBuffer method of the sun.misc.BASE64Encoder
class and the decodeBuffer method of the sun.misc.BASE64Decoder
class. I will show you how to use these methods in the next
program in this lesson. For now, however, let’s get back to the
discussion of the program named Base64_02.
This program was tested using SDK 1.4.2 under WinXP.
Will discuss in fragments
As usual, I will discuss the
program in fragments. A complete listing is provided in Listing
19 near the end of the lesson.
The first program fragment begins in Listing 1.
class Base64_02 { |
Listing 1 shows the beginning of the main method, which creates
a byte array object containing three eight-bit characters, and
passes the array to a method named showData for display.
(Each of the eight-bit characters in the array consists
of the least significant eight bits of the sixteen-bit Unicode
character contained in the String “klm”.)
The showData method
The showData method displays the data in an incoming byte array
as
character data and also as binary data. The showData
method is shown in its entirety in Listing 2.
(Note that if there are more than four bytes in the
incoming array, the binary
data will not be correct. Bits will have been lost on the most
significant end. Note also that leading zeros are not
displayed in the binary data.)
static void showData(byte[] data){ |
Process using a for loop
The showData method processes the incoming array using a for
loop based on the length of the array. One of the bytes
is displayed during each iteration of the for
loop in Listing 2. The byte is cast to type char to cause
it to be displayed as a character.
Also, a binary shift operation is used to construct an int
value containing shifted versions of each of the bytes in the incoming
array during successive iterations of the for loop.
Shift eight bits during each iteration
During each iteration of the loop, the current contents of the int
variable named save are shifted eight bits to the left, and the
next data byte from the incoming array is placed in the
least significant eight bits of the variable.
(As mentioned above, if there are more than four bytes
in the array, byte data will be shifted off the most significant end of
the variable, and the data will be corrupted.)
Display the binary value
After all of the bytes in the array have been processed, the method
named toBinaryString,
which is a static (class) method of the Integer class, is used to
display the contents of the variable named save as a binary
value.
(As mentioned above, this method does not display
leading zeros on the most significant end of the binary value.)
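The behavior described above can be approximated by the following sketch. This is not the author's Listing 2. The class name ShowDataSketch and the method names asChars and asBinary are hypothetical, and the methods return strings instead of printing so that the results are easy to check. The mask with 0xFF is an addition that I am assuming is desirable so that byte values above 127 are not sign-extended.

```java
// Illustrative sketch only: not the showData method from Listing 2.
class ShowDataSketch {

    // Cast each byte to type char so that it is displayed as a character.
    static String asChars(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            sb.append((char) data[i]);
        }
        return sb.toString();
    }

    // Shift the accumulated value eight bits to the left during each
    // iteration and place the next byte in the least significant eight bits.
    // With more than four bytes, bits are lost on the most significant end.
    static String asBinary(byte[] data) {
        int save = 0;
        for (int i = 0; i < data.length; i++) {
            save = (save << 8) | (data[i] & 0xFF); // mask is an assumption
        }
        // toBinaryString does not display leading zeros.
        return Integer.toBinaryString(save);
    }

    public static void main(String[] args) {
        byte[] data = {'k', 'l', 'm'};
        System.out.println(asChars(data));  // klm
        System.out.println(asBinary(data)); // 11010110110110001101101
    }
}
```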
The output
Figure 1 shows the output produced by this method when called from the
code in Listing 1.
klm |
As you can see, the three letters in the first line of output
correspond to the characters represented by each of the bytes in the
incoming array.
The binary bits represented
by the 1’s and 0’s in the second line correspond to the binary bits in
each of the bytes in the incoming array after the bytes have been
concatenated.
(Note that I manually inserted spaces in the second line
in Figure 1 to separate the bits into eight-bit groups. This
makes it easier to analyze the bits visually.)
What do the bits represent?
The eight bits on the right correspond to the least significant eight
bits in the character
‘m’.
The seven bits on the left correspond to the bits in the
character ‘k’, with the left-most zero bit not being displayed.
The remaining bits in the middle correspond to the bits in the
character ‘l’.
(Note that this is a lower-case L, not a numeric 1.)
We will be working with the binary output in Figure 1 later.
Encode and display the data
Now let’s return to our discussion of the main method.
The first statement in Listing 3 passes the array object containing the
raw data to the method named encoder. The purpose of the encoder
method is to encode the three eight-bit bytes as four characters from the
base64 alphabet, one character for each six-bit group. This method returns a
four-element array containing the four base64 characters, each stored in the
least significant seven bits of an eight-bit byte.
The second statement in Listing 3 passes the array containing the four
base64 characters to the showData method for display in both
character and binary format.
byte[] encodedData = encoder(rawData); |
Do it by hand
Before getting into the details of the encoder method, let’s
walk through our example and perform the encoding from eight bit to
base64 manually.
The first two lines in Figure 2 show the data
from Figure 1. This time, however, I
manually inserted space characters in the second line of Figure 2 to
separate the bits into six-bit
groups (instead of eight-bit groups as before), and manually
added the missing zero bit on the left.
klm |
Mapping into the base64 alphabet
The third line in Figure 2 shows the decimal equivalent value of each
of
the six-bit groups in the second line.
The fourth line in Figure 2 shows the base64 alphabet character
corresponding to each of the decimal equivalent values, taken from
Table 1.
Thus, the four-character base64 encoding of the string “klm”
is “a2xt”. This is what we should expect the encoder
method to return when we pass it an array object containing the
eight-bit characters ‘k’, ‘l’, and ‘m’.
The method named encoder
The beginning of the encoder method is shown in Listing
4. This method is designed to encode a group of three eight-bit
bytes into four six-bit characters from the base64 alphabet.
Because this method is called directly from the static main method,
without first instantiating an object of the class, it must be declared static.
The code in Listing 4 simply confirms that the size of the incoming
array is correct, and aborts the program if it is not correct.
static byte[] encoder(byte[] data){ |
Concatenate the bytes
There are probably many ways to accomplish the encoding. I
elected to begin by concatenating the three bytes contained in the
incoming array object into the least significant 24 bits of a variable
of type int.
int concat = (data[0]<<16) | (data[1]<<8) |
I concatenated the bits using the binary shift
capability of Java in conjunction with the bitwise or
operator. This is essentially the same thing that was done in the
showData method, except that in this case, the number of bytes
is always three and therefore, there is no need to use a loop.
Concatenating the eight-bit bytes into a sequence of twenty-four bits
makes it relatively easy to separate the twenty-four bits into four
groups of six bits each.
Instantiate an output array
Listing 6 instantiates a four-element byte array that will be
populated and returned containing the four base64 characters in the
least significant seven bits of each eight-bit array element.
byte[] output = new byte[4]; |
Separate and map the bits to base64 characters
The method that is actually used to map the values of each group of six
bits is named mapTo. I will discuss the behavior of that
method shortly.
Each of the four statements in Listing 7 extracts one group of six bits
from the sequence of twenty-four bits and passes that group to the
method named mapTo. The return values from mapTo
are used to populate the output array.
output[3] = (byte)(mapTo(concat & '\u003f')); |
Note that the output array is populated in reverse order. In
other words, the right-most six bits in the twenty-four bit sequence
are used to populate the last element in the array, while the left-most
six bits are used to populate the first element in the array.
Shift right and mask
In case you are unfamiliar with the code in Listing 7, each statement (except
the first) shifts a group of six bits into the rightmost six
bits. The first statement doesn’t need to perform a shift because
the six bits of interest are already in the rightmost six-bit
position. Then each statement performs a bitwise and
operation with the following bit mask to convert all bits except the
rightmost six to zeros:
00000000000000000000000000111111
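The shift-and-mask operations can be sketched in isolation as follows. This is not the author's Listing 7. The class name ShiftMaskSketch and the method name splitIntoSixBitValues are hypothetical, and the sketch returns the raw six-bit values instead of passing them to mapTo. The mask with 0xFF on each input byte is an assumption added to guard against sign extension.

```java
// Illustrative sketch only: separates 24 bits into four six-bit values.
class ShiftMaskSketch {

    static int[] splitIntoSixBitValues(byte[] data) {
        if (data.length != 3) {
            throw new IllegalArgumentException("Expected exactly three bytes");
        }
        // Concatenate the three bytes into the least significant 24 bits.
        int concat = ((data[0] & 0xFF) << 16)
                   | ((data[1] & 0xFF) << 8)
                   |  (data[2] & 0xFF);

        int mask = 0x3F; // binary 111111, same value as '\u003f'
        int[] values = new int[4];
        // Populate in reverse order: the rightmost six bits go last.
        values[3] = concat & mask;
        values[2] = (concat >> 6) & mask;
        values[1] = (concat >> 12) & mask;
        values[0] = (concat >> 18) & mask;
        return values;
    }

    public static void main(String[] args) {
        int[] values = splitIntoSixBitValues(new byte[] {'k', 'l', 'm'});
        for (int value : values) {
            System.out.print(value + " "); // 26 54 49 45
        }
        System.out.println();
    }
}
```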
The method named mapTo
The method named mapTo is shown in Listing 8.
This method maps the value of the least significant six bits of an
incoming int value to the corresponding seven-bit character
from the
base64 alphabet shown in Table 1.
static int mapTo(int val){ |
Alternative approaches
One obvious way to accomplish this would have been to create a Vector
object containing the values corresponding to the 64 characters in
Table 1. Then the value of the six-bit group could be used as an
index into the Vector object to retrieve the base64 character
corresponding to that value.
However, that would have required me to populate the Vector
object, which in the worst case would have required me to write 64
statements. I could have reduced the amount of code by breaking
the problem down into the ranges of values shown in Table 2 and using a
for
loop to populate the Vector object for each range, but this
still would have required more code than I wanted to write.
Value Range | Character Range |
0-25 | A through Z |
26-51 | a through z |
52-61 | 0 through 9 |
62 | + |
63 | / |
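The table-lookup alternative described above might look something like the following sketch. For simplicity, the sketch populates a char array instead of a Vector object, so it is a variation on the idea and not the exact approach described. The class name LookupTableSketch and the method name buildAlphabet are hypothetical.

```java
// Illustrative sketch only: a lookup table populated from the ranges in Table 2.
// A char array is used here in place of the Vector object mentioned in the text.
class LookupTableSketch {

    static char[] buildAlphabet() {
        char[] table = new char[64];
        int index = 0;
        for (char c = 'A'; c <= 'Z'; c++) { // values 0-25
            table[index++] = c;
        }
        for (char c = 'a'; c <= 'z'; c++) { // values 26-51
            table[index++] = c;
        }
        for (char c = '0'; c <= '9'; c++) { // values 52-61
            table[index++] = c;
        }
        table[index++] = '+';               // value 62
        table[index] = '/';                 // value 63
        return table;
    }

    public static void main(String[] args) {
        char[] table = buildAlphabet();
        // The six-bit value is used directly as an index into the table.
        System.out.println(table[26]); // a
        System.out.println(table[54]); // 2
        System.out.println(table[49]); // x
        System.out.println(table[45]); // t
    }
}
```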
Conversion on the fly
Therefore, I elected to use a somewhat different approach that computes
the required character on the fly rather than using a table
lookup. My approach is shown in Listing 8, and should not require
a detailed explanation.
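For readers who would like to see one way that such an on-the-fly computation could be written, here is a minimal sketch. It is not a copy of Listing 8, and the author's code may differ in its details. The class name MapToSketch is hypothetical, and the range check at the top is an added assumption.

```java
// Illustrative sketch only: computes the base64 character for a six-bit
// value using the ranges in Table 2. Not a copy of Listing 8.
class MapToSketch {

    static int mapTo(int val) {
        if (val < 0 || val > 63) {
            throw new IllegalArgumentException("Not a six-bit value: " + val);
        }
        if (val <= 25) {
            return 'A' + val;        // 0-25  map to A through Z
        }
        if (val <= 51) {
            return 'a' + (val - 26); // 26-51 map to a through z
        }
        if (val <= 61) {
            return '0' + (val - 52); // 52-61 map to 0 through 9
        }
        return (val == 62) ? '+' : '/';
    }

    public static void main(String[] args) {
        int[] values = {26, 54, 49, 45};
        StringBuilder sb = new StringBuilder();
        for (int value : values) {
            sb.append((char) mapTo(value));
        }
        System.out.println(sb); // a2xt
    }
}
```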
The output
The screen output produced by the two statements in the main
method of Listing 3 is shown in Figure 3. However, I manually inserted
space characters in the binary representation in Figure 3 to
visually
separate the bits into eight-bit groups.
a2xt |
The first line of text in Figure 3 shows the base64 characters returned
to
represent the eight-bit input characters given by ‘k’, ‘l’, and
‘m’. As
you can see, these four base64 characters match the characters that we
identified via manual table lookup in Figure 2.
Figure 3 also shows the binary representation of this sequence of four
characters. Each character is represented by eight consecutive
bits, with a leading zero missing on the left end.
Most significant bit is always zero
If you start on the right and count bits, you will find that the most
significant bit in each group of eight bits has a value of zero.
Therefore, the most significant bit can be discarded in order to
transmit these characters through a transmission system that is limited
to seven bits. No loss of information would result from
discarding the most significant bit.
Decode and display
Listing 9 shows the end of the main method.
byte[] decodedData = decoder(encodedData); |
The code in Listing 9 passes the array of encoded data to the method
named decoder, which returns an array containing decoded
data. The
decoded data is stored in an array object of type byte referred
to by the reference variable named decodedData.
Then the code in Listing 9 passes the array containing decoded data to
the method named showData where it is displayed in both
character and binary form.
The method named decoder
The method named decoder is used to decode a group of four
base64 characters into three eight-bit bytes. The method begins
in Listing 10.
static byte[] decoder(byte[] data){ |
The decoder method begins by confirming that the incoming array
is of the correct length, and terminating the program if it is of the
wrong length.
Steps in the process
This method accomplishes its purpose by performing the following steps:
- Convert the base64 characters back to the original six-bit values
according to the relationship between characters and values given in
Table 1.
- Concatenate the four six-bit values into a 24-bit int
value in a variable named concat.
- Separate the 24-bit int value into three eight-bit values
that represent the decoded data values.
Convert and concatenate
The first two steps in this process are accomplished by the code in
Listing 11.
int concat = ((mapFrom(data[0]))<<18) |
The code in Listing 11 invokes the mapFrom method to convert
each base64 character to the corresponding value from Table 1.
Although the values are returned from the mapFrom method as
eight-bit values, the maximum possible value cannot be greater than
63. Therefore, the two most significant bits in the values
returned from the mapFrom method are guaranteed to be zero.
Why is this important?
This is very important because the two most significant bits of each
eight-bit value overlap the two least significant bits of the value
previously shifted six bits to the left in Listing 11. Because a bitwise
inclusive or is used to combine the values, the two most
significant bits having a value of zero cannot interfere with the
values of the two bits that they overlap.
I will have more to say about the method named mapFrom shortly.
Extract the eight-bit data
The code in Listing 12 accomplishes the third step in the above list of
three steps.
byte[] output = new byte[3]; |
This code extracts the three eight-bit values from the twenty-four bits
stored in the int variable named concat, and uses those
bits to populate the individual bytes in the output byte
array. This code should be self-explanatory.
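The three steps can be sketched together as follows. This is not the author's code from Listings 10 through 12. The class name DecoderSketch and the method name decode4 are hypothetical, and the sketch converts characters to values with a simple string lookup, which is an assumption made to keep the example short. Like the program being discussed, the sketch does not handle pad characters.

```java
// Illustrative sketch only: decodes four base64 characters into three bytes.
class DecoderSketch {

    // The 64 characters of the base64 alphabet in value order (Table 1).
    static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        + "abcdefghijklmnopqrstuvwxyz"
        + "0123456789+/";

    static byte[] decode4(byte[] data) {
        if (data.length != 4) {
            throw new IllegalArgumentException("Expected exactly four characters");
        }
        // Step 1: convert each character back to its six-bit value.
        int[] values = new int[4];
        for (int i = 0; i < 4; i++) {
            values[i] = ALPHABET.indexOf((char) data[i]);
            if (values[i] < 0) {
                throw new IllegalArgumentException("Not in the base64 alphabet");
            }
        }
        // Step 2: concatenate the four six-bit values into 24 bits.
        int concat = (values[0] << 18) | (values[1] << 12)
                   | (values[2] << 6)  |  values[3];

        // Step 3: separate the 24 bits into three eight-bit values.
        byte[] output = new byte[3];
        output[0] = (byte) ((concat >> 16) & 0xFF);
        output[1] = (byte) ((concat >> 8) & 0xFF);
        output[2] = (byte) (concat & 0xFF);
        return output;
    }

    public static void main(String[] args) {
        byte[] decoded = decode4(new byte[] {'a', '2', 'x', 't'});
        for (byte b : decoded) {
            System.out.print((char) b); // klm
        }
        System.out.println();
    }
}
```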
The method named mapFrom
The method named mapFrom is used to convert from a base64
character to a six-bit value using the relationships between characters
and values given in Table 1. The method is shown in Listing 13.
static int mapFrom(int val){ |
Reverses the earlier process
The method named mapFrom shown in Listing 13 essentially
reverses the process provided by the method named mapTo that
was shown
in Listing 8.
The code in Listing 13 should be self-explanatory and should not
require further explanation.
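As a point of comparison, the reverse mapping could be computed on the fly along the following lines. This sketch is not a copy of Listing 13. The class name MapFromSketch is hypothetical, and throwing an exception for characters outside the alphabet is an assumption about how invalid input might be handled.

```java
// Illustrative sketch only: maps a base64 character back to its six-bit
// value using the ranges in Table 2. Not a copy of Listing 13.
class MapFromSketch {

    static int mapFrom(int val) {
        if (val >= 'A' && val <= 'Z') {
            return val - 'A';        // A through Z map to 0-25
        }
        if (val >= 'a' && val <= 'z') {
            return val - 'a' + 26;   // a through z map to 26-51
        }
        if (val >= '0' && val <= '9') {
            return val - '0' + 52;   // 0 through 9 map to 52-61
        }
        if (val == '+') {
            return 62;
        }
        if (val == '/') {
            return 63;
        }
        throw new IllegalArgumentException("Not in the base64 alphabet: " + val);
    }

    public static void main(String[] args) {
        for (char c : "a2xt".toCharArray()) {
            System.out.print(mapFrom(c) + " "); // 26 54 49 45
        }
        System.out.println();
    }
}
```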
The Output
The total output from this program is shown in Figure 4.
(Note that I manually added the missing bits with a zero
value on the left end. I also inserted spaces to separate the
data into eight-bit groups.)
The first four lines of text in Figure 4 repeat what you have already
seen in Figure 1 and Figure 3.
klm |
The decoded output
This program performs the following steps:
- Create and display three eight-bit characters in both character
and binary format.
- Encode the three eight-bit characters into four characters from
the base64 alphabet. Display those characters in both character
and binary format.
- Decode the four base64 characters back into three eight-bit
characters. Display those characters in both character and binary
format.
The last two lines of text in Figure 4 show the result of decoding and
displaying the base64 data according to the code in Listing 9. As
you can see, the final result matches the starting data represented by
the first two lines of text in Figure 4.
The first and fifth lines of text in Figure 4 each represent the same
three eight-bit bytes of data. The third line shows the four
seven-bit base64 characters that represent those three eight-bit bytes
of data.
(Each seven-bit data value is actually stored in an
eight-bit byte. However, the most significant bit is always
zero. Therefore, it could be discarded without loss of
information.)
Documented production software
The program named Base64_02 is provided solely to illustrate
the conversion algorithms to and from base64. It is not suitable
for production use because it doesn’t deal with several of the
issues defined in the document entitled Mechanisms
for Specifying and Describing the Format of Internet Message Bodies.
For example, the program is incapable of dealing with the situation
where the number of eight-bit bytes is not evenly divisible by
three. In that case, the algorithm must append pad characters
consisting of ‘=’ characters to guarantee that the number of characters
in the base64 data is evenly divisible by four.
Also, the code in the program named Base64_02 does not deal
with the issue having to do with a maximum line length of 76 characters
for the base64 data.
As of this writing, documented classes suitable for production use are
available from which you can compile encoder
and decoder
objects.
Undocumented production software
If you are willing to use undocumented software, J2SE SDK
version 1.4.2 contains undocumented classes from Sun that you
can use to compile encoder and decoder objects.
Listing 20 near the end of the lesson presents a program named Base64_03
that illustrates the use of these classes.
The program named Base64_03
This program illustrates the use of the undocumented sun.misc
package for encoding and decoding base64.
This program uses the encodeBuffer method of the sun.misc.BASE64Encoder
class and the decodeBuffer method of the sun.misc.BASE64Decoder
class.
Information based on introspection
Introspection shows that the BASE64Encoder class inherits the
following two methods from the sun.misc.CharacterEncoder class,
possibly overriding one or both:
- encode
- encodeBuffer
This program illustrates the use of the encodeBuffer
method. I have been unable to find any information on the encode
method.
The sun.misc.CharacterEncoder class is a direct subclass of Object.
Introspection also shows that the BASE64Decoder class inherits
the decodeBuffer
method from the sun.misc.CharacterDecoder class, possibly
overriding the method.
The sun.misc.CharacterDecoder class is a direct subclass of Object.
No documentation available
I have never been able to find any documentation on the use of either
of these
classes. I can’t remember how I learned how to use them.
However, I will show you what I know.
This program was tested using SDK 1.4.2 under WinXP.
The main method
The main method begins in Listing 14.
public static void main(String[] args) { |
The code in Listing 14:
- Creates a byte array containing the
eight-bit representations of four characters.
- Displays the length of the array.
- Displays the contents of the array.
This array will be used as the input to the base64 encoding
process. I purposely caused this array to contain a number of
bytes that is not evenly divisible by three to illustrate the use of
the pad character ‘=’ in the base64 encoding process.
The output
As you might expect, the code in Listing 14 produces the screen output
shown in Figure 6.
4 |
Encode the data as base64
Continuing with the main method, the code in Listing 15:
- Invokes the method named encodeBase64 to encode the four
eight-bit bytes into base64 characters, returning the encoded data as a
String object.
- Displays the length of the string of base64 characters.
- Displays the characters in the string.
String encoded = encodeBase64(dataBuffer); |
The output
The code in Listing 15, plus the remaining code in the main
method (which I will discuss shortly) produces the output shown
in Figure 7.
10 |
Two important points
I included all of the output in Figure 7 to illustrate two important
points.
Recall that the previous program converted the eight-bit
representations of the characters in the string “klm” to the
four seven-bit base64 characters represented by “a2xt” (see
Listing 1 and Figure 3).
This program converts the eight-bit representations of the four
characters in the string “klmn” to the eight seven-bit base64
characters represented by “a2xtbg==” as shown in Figure 7.
Note the pad characters
Note in particular the use of the pad character “=” at the end of the
output string to guarantee that the number of base64 characters is
evenly divisible by four.
Note the length of the output string
Also note that the output in Figure 7 reports the number of characters
in the string of base64 characters to be 10 instead of 8. This is
because the Sun encoder used to perform the conversion to base64 always
appends a carriage return and a line feed onto the end of the string of
base64 characters. You can see the evidence of this by the blank
line between the second and third lines of text in the output shown in
Figure 7.
What if the input exceeds 57 bytes?
If the number of eight-bit bytes passed to the encoder exceeds 57,
the encoder will return multiple lines of base64 characters. Each
returned line is 76 characters in length plus a carriage return and a
line feed
appended onto the end of each line.
Thus the actual number of
characters returned for each line other than the last line will be 78
characters. The last line will contain the base64 characters that
represent the leftover eight-bit characters plus a carriage return and
a line feed.
Output for multiple lines of base64 data
This is illustrated by the program output in Figure
8, which shows an input consisting of 58 eight-bit bytes and a base64
output containing a total of 84 bytes. The 84 bytes include two
sets of
carriage return and line feed characters.
58 |
(The output in Figure 8 was produced by modifying the first statement in
Listing 14 to contain the 58-character string shown in the second line of
Figure 8.)
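The 76-character line limit can also be observed with the documented MIME encoder in java.util.Base64 (Java 8 and later). This is a different encoder from the Sun class used in this lesson, and it behaves differently in one respect: it separates lines with a carriage return and a line feed but does not append that pair after the final line. As a result, 58 input bytes produce 82 characters with this encoder instead of the 84 reported above. The class name LineLengthSketch and the method name encodeMime are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

// Illustrative sketch only. Uses java.util.Base64.getMimeEncoder(),
// not the sun.misc.BASE64Encoder class discussed in this lesson.
class LineLengthSketch {

    static String encodeMime(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] data = new byte[58];
        Arrays.fill(data, (byte) 'x');

        String encoded = encodeMime(data);
        String[] lines = encoded.split("\r\n");

        System.out.println(encoded.length());  // 82 (80 base64 characters plus one CR/LF)
        System.out.println(lines[0].length()); // 76
        System.out.println(lines[1].length()); // 4
    }
}
```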
The encodeBase64 method
I am going to set the main method aside while I discuss the
method named encodeBase64. I will return to the main
method later.
The encodeBase64 method, which is used to
encode an array of eight-bit bytes into a string of base64 characters,
is shown in its entirety in Listing 16.
static String encodeBase64(byte[] data){ |
The Sun encodeBuffer method
The code in Listing 16 instantiates an object of the undocumented class
named sun.misc.BASE64Encoder, and invokes the encodeBuffer
method on that object, passing the array of eight-bit bytes as a
parameter.
The encodeBuffer method converts the bytes in the incoming
array to seven-bit base64 characters. Each of the base64
characters is encapsulated in the least significant seven bits of a
character in a Java String object, which is returned by the
method. Thus, the returned base64 characters are encapsulated in
the Unicode characters that comprise a Java String object.
As described earlier, the returned string includes a carriage return
and a line feed at the end. If the number of base64 characters
exceeds 76 characters, the string contains multiple lines with each
line terminated by a carriage return and a line feed.
If the number of input eight-bit characters is not evenly divisible by
three, the encoder appends the base64 pad character ‘=’ at the end to
guarantee that the number of base64 characters is evenly divisible by
four.
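Because sun.misc.BASE64Encoder is an undocumented internal class, it may not be available in every Java runtime. The following is a minimal sketch of an alternative, not the code in Listing 16. It assumes Java 8 or later and swaps in the documented java.util.Base64 MIME encoder, which also breaks its output into 76-character lines separated by a carriage return and a line feed. That encoder does not add a line separator after the last line, so the sketch appends one to approximate the behavior described above. The class name EncodeSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical class; an approximation, not the lesson's Listing 16.
class EncodeSketch {
  static String encodeBase64(byte[] data) {
    // MIME encoder: lines of at most 76 characters separated by CR LF,
    // with '=' padding when the input length is not a multiple of three.
    String lines = Base64.getMimeEncoder().encodeToString(data);
    // Assumption: append a final CR LF to mimic the trailing line
    // terminator that the lesson attributes to the Sun encoder.
    return lines + "\r\n";
  }

  public static void main(String[] args) {
    byte[] input = "klmn".getBytes(StandardCharsets.ISO_8859_1);
    String encoded = encodeBase64(input);
    System.out.println(encoded.length()); // 10
    System.out.print(encoded);            // a2xtbg== followed by CR LF
  }
}
```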
Decode the data
Returning now to the discussion of the main method, the code in
Listing 17:
- Invokes the decodeBase64 method to convert the encoded
base64 data back to eight-bit bytes.
- Displays the number of eight-bit bytes.
- Displays the values of the eight-bit bytes.
String decoded = decodeBase64(encoded); |
Examples of the output produced by the code in Listing 17 are shown in
Figure 7 and Figure 8. As expected, the output produced by
decoding the base64 data matches the input that was encoded into base64
data earlier.
The decodeBase64 method
The decodeBase64 method is shown in Listing 18.
static String decodeBase64(String encoded){ |
The method instantiates an object of the undocumented class named sun.misc.BASE64Decoder.
Then it invokes the decodeBuffer method on that object passing
the encoded data as a parameter.
The decodeBuffer method
The decodeBuffer method converts the base64 characters into the
corresponding set of eight-bit values. Although it isn’t obvious
in Listing 18, the decodeBuffer method returns an array object
of type byte with each element in the array containing one of
the resulting eight-bit values.
The code in Listing 18 encapsulates each of the resulting eight-bit
values in the least significant eight bits of the Unicode characters
that make up a Java String object, and returns that string.
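The decoding step can be sketched in the same spirit. This is a minimal sketch and not the code in Listing 18. It assumes Java 8 or later and uses the documented java.util.Base64 MIME decoder in place of the Sun decoder. The MIME decoder ignores characters outside the base64 alphabet, so the carriage return and line feed characters do not need to be stripped first. Converting the bytes with the ISO-8859-1 charset places each eight-bit value in the least significant eight bits of one character. The class name DecodeSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical class; an approximation, not the lesson's Listing 18.
class DecodeSketch {
  static String decodeBase64(String encoded) {
    // The MIME decoder skips CR, LF, and any other character that is
    // not part of the base64 alphabet.
    byte[] bytes = Base64.getMimeDecoder().decode(encoded);
    // ISO-8859-1 maps each byte value 0..255 to the character with the
    // same numeric value, one character per byte.
    return new String(bytes, StandardCharsets.ISO_8859_1);
  }

  public static void main(String[] args) {
    String decoded = decodeBase64("a2xtbg==\r\n");
    System.out.println(decoded.length()); // 4
    System.out.println(decoded);          // klmn
  }
}
```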
Run the Programs
I encourage you to copy, compile and run the code in Listing 19 and
Listing 20. Modify it and experiment with it until you fully
understand it.
Summary
This lesson explains the
use of base64 encoding and decoding in general,
and illustrates base64 encoding and decoding using sample programs.
What’s Next?
A future lesson will
explain how base64 decoding is used in the BigDog program.
Program Listings
Complete listings of the two programs explained in this lesson are
provided in Listing 19 and Listing 20.
/*File Base64_02.java Copyright 2004, R.G.Baldwin |
/*File Base64_03.java Copyright 2004, R.G.Baldwin |
Copyright 2004, Richard G. Baldwin. Reproduction in whole or in
part in any form or medium without express written permission from
Richard Baldwin is prohibited.
About the author
Richard Baldwin
is a college professor (at Austin Community College in Austin, TX) and
private consultant whose primary focus is a combination of Java, C#,
and XML. In addition to the many platform and/or language independent
benefits of Java and C# applications, he believes that a combination of
Java, C#, and XML will become the primary driving force in the delivery
of structured information on the Web.
Richard has participated in numerous consulting projects, and he
frequently provides onsite training at the high-tech companies located
in and around Austin, Texas. He is the author of Baldwin’s
Programming Tutorials, which
has gained a worldwide following among experienced and aspiring
programmers. He has also published articles in JavaPro magazine.
Richard holds an MSEE degree from Southern Methodist University
and has many years of experience in the application of computer
technology to real-world problems.
-end-