Java Programming Notes # 2150
- Preface
- Preview
- Discussion and Sample Code
- Run the Program
- Summary
- What’s Next?
- Complete Program Listing
Preface
This is the first lesson in a series designed to teach you how to write
a Java program to remove SPAM from your Email server before it is
downloaded into your primary Email client.
The communications module
This lesson explains the communications module used to communicate with
your Email server, and to remove SPAM messages from the
server.
SPAM screening algorithm
The program is designed to allow you to use my SPAM screening
algorithm, or to invent your own. Subsequent lessons will explain
the
inner workings of my SPAM screening algorithm. You can use my
algorithm as a starting point if you decide to invent your own.
Those lessons will
also explain how the system can be trained to do an increasingly better
job of screening SPAM over time.
Viewing tip
You may find it useful to open another copy of this lesson in a
separate browser window. That will make it easier for you to
scroll back and forth among the different listings and figures while
you are reading about them.
Supplementary material
I recommend that you also study the other lessons in my extensive
collection of online Java tutorials. You will find those lessons
published at Gamelan.com.
However, as of the date of this writing, Gamelan doesn’t maintain a
consolidated index of my Java tutorial lessons, and sometimes
they are difficult to locate there. You will find a consolidated
index at www.DickBaldwin.com.
Preview
Can you write better SPAM screening algorithms?
Did you ever think that you might be able to write better SPAM
screening algorithms than those available in the SPAM screening
software that you are now using? If so, this lesson is for you.
Even if that is not the case, like most of us, you are probably
overwhelmed by SPAM
and therefore you may find this lesson interesting.
Remove SPAM from the server
In this lesson, I will show you how to write a Java program
that supplements the SPAM screening software that you are currently
using. This program is used to identify and remove SPAM from your
Email server before it is downloaded into your primary Email client.
Any SPAM that makes it past this program can be further acted upon
by the SPAM screener that is built into your Email client.
The communications module
This series will consist of four lessons. This lesson, which
is the first in the series,
will explain the communications module used to communicate with
your Email server, and to remove SPAM messages from the
server.
As mentioned earlier, the program is designed to allow you to invent
and implement your own
SPAM screening algorithm in addition to, or as an alternative to my
algorithm.
My algorithm and algorithm training programs
The second lesson will explain the inner
workings of my SPAM screening algorithm. My algorithm operates
separately on the Subject line, the From line,
and the body text of each Email message.
The third lesson will explain a companion program designed to make
use of historical data to easily train the algorithm to do a better job
of identifying SPAM based on the Subject of the message.
The fourth lesson will explain another companion program designed to
make use of historical data to easily train the algorithm to do a
better job of identifying SPAM based on the body text of
the message, which includes the From line.
Effectiveness of my algorithm
At this point in time, after about one week of training, my
algorithm reliably identifies about ninety percent of all SPAM and
allows me to delete it
from my Email server before downloading it into my primary Email
client. Only time will tell if that percentage improves in the
future.
Discussion and Sample Code
Stripped down Screen class
The version of the program that I will discuss in this lesson has a
stripped-down version of a class named Screen. This version of
the program
allows for testing the communications module on your system with your
Email server
without doing any actual screening for SPAM and without deleting any
messages from
the server.
I will explain the full version of the class named Screen in
the next lesson when I explain my algorithm for identifying SPAM.
Purpose of the program
The purpose of this program is to read messages from a POP3 (Post
Office Protocol – Version 3) server, to
analyze the messages according to a set of screening rules, and to
delete those messages from the server that fail the screening test.
(As written, the program asks the user to confirm the
deletion of each message from the server, but this confirmation step
could easily be
removed if you decide to do so.)
Key words and phrases
This version of the program screens for SPAM on the basis of key words
or
phrases in the From line, key words or phrases in the Subject
line, and
key words or phrases in the body text.
Friendly Email addresses and subjects
A list of friendly Email addresses and friendly subjects is used to
screen the From
line and the Subject line. Messages that are from
friendly Email addresses, and messages that have known good Subject
lines are not
deleted from the server and no information about those messages is
saved on the local disk. They are simply ignored after determining that
they are friendly.
Different lists for Subject and body text
Different lists of words and phrases are used for screening Subject
lines and body text for SPAM. This is important because the same set of
words
and phrases can’t always be used for both cases.
For example, the word ANTIVIRUS is appropriate for screening the
Subject line, but is not appropriate for screening the
body text. The
word ANTIVIRUS often appears legitimately in the header of Email messages
that have been scanned for viruses by the server, but also often
appears in the Subject line of SPAM messages.
Common spammer tricks are defeated
Several common spammer tricks are defeated by my SPAM screening
algorithm.
For example, the common spammer trick of inserting extra characters
between the
characters in an offending word or phrase is defeated. The
common trick of mixing the case of the
characters in an offending word or phrase is also defeated.
As a specific example, my algorithm will recommend deletion of any
message having
any of the following in its Subject line or its body
text if the word VIAGRA is included in the lists used to screen for
SPAM:
vIaGrA
V.IagRA
V.I.A.G.R.A
These two characteristics alone have a significantly positive
impact on the effectiveness of training the algorithm to do a better
job of
identifying SPAM in the future.
My algorithm also defeats the common trick of appending random
characters to the end of the Subject line, because it
doesn’t require a
match for the entire Subject line. Rather, it
searches for words
or phrases internal to the text of the Subject line.
The user interface
Figure 1 shows the GUI through which the user controls the program.
Figure 1 Graphical User Interface
(Note that this GUI was purposely made narrow in order
to cause it to fit into this narrow publication format. I
recommend that you increase the width of the Frame to at least 750
pixels, and increase the width of the TextField and TextArea objects to
at least 100 characters each.)
The Offending Phrase
When the program identifies a message that is a candidate for deletion,
the reason for that recommendation is shown in the third text field
from the top in Figure 1.
(An actual SPAM message is being displayed in the
GUI in Figure 1, but the stripped-down version of the class
named Screen was being used, so no Offending Phrase is
shown in Figure 1.)
Deleting a message from the server
The user confirms that the message should be deleted from the Server by
clicking the Delete button
in Figure 1. If the user doesn’t want to delete the message, she should
click the Start/Next button instead.
(Note that the capability to actually delete messages
from the server was disabled in the program shown in Listing 42 near
the end of this lesson. Make certain that you are ready to
actually delete messages from the server before re-enabling that
capability.)
The Netscape approach to SPAM screening
I currently use Netscape version 7.1 as my Email client.
Basically, it provides two forms of SPAM screening. One form,
which is referred to as Junk Mail Controls, is
apparently based on some sort of artificial intelligence. This
capability can be
trained over time to identify the kinds of messages that you consider
to be junk mail. This capability is very easy to train.
However, it
produces lots of false positives and is very difficult to un-train when
that happens. (I will have more to say
about false positives later.)
The other form of SPAM screening used by Netscape 7.1 is referred to as
Message Filters. This approach depends on exact
character matching in the subject, the body, or in other parts
of the message, such as sender, date, priority, etc.
In this case, you must enter the exact words or phrases
into a form that will be used for matching purposes. This
approach is
practically useless for SPAM screening due to the tendency of spammers
to insert random
characters into the offending words and phrases and to randomly modify
the case of the characters in offending words and phrases. Also,
the process of entering the words and phrases into the form is very
tedious and time consuming. I long ago gave up on using
Netscape’s Message Filters for SPAM filtering.
False positives
All SPAM screening algorithms are subject to reporting false
positives to some degree. That is to say, a message may be
erroneously
identified as SPAM when it is actually a good message.
One of the
major problems with my Netscape 7.1 system results from false
positives. Because of the high rate of false positives produced
by Junk Mail Controls, whenever
a message is identified as SPAM, I must confirm that it is SPAM before
deleting it. At that point in time, unless I am willing to
actually open the message and to be confronted with a variety of
offensive images and other offensive material, I must make my decision
solely on the basis of the subject and
the from address information. Often this is not sufficient
information to
make an informed decision and I have no choice but to open the message.
Also, as I mentioned earlier, when Junk Mail Controls does
report a false positive, there is no definitive way to make certain
that it doesn’t happen again in the future. It is necessary to un-train
the algorithm regarding messages of that type, which can be a long
process, possibly involving many similar occurrences in the future.
More information is available with my system
When a user of my system is required to confirm deletion of a message
from the server, the following information is available to assist in
the making
of the decision:
- From line
- Subject line
- Offending line of text, which may or may not be the subject
- Offending word or phrase in the offending line of text
- Entire raw text of the message down to and including the
offending line
No images are rendered in my system, so it is not necessary for the
user to view offending images in order to make the decision to delete.
Having viewed the above information, if the user is still unable to
make
an informed decision to delete the message, the user still has the
option to let the message pass through and be downloaded into the
primary Email client. Once having viewed the message later in the
primary Email client, the user still has the option of updating the
offending word lists in my system with IP addresses, URLs, etc, so that
deletion decisions on future similar messages will be easier to make.
Saved in local archive folder
The raw text of every message that is identified as a candidate for
deletion from the
server is saved in an archive folder on the local disk, regardless of
whether the user elects to delete it from the server or not. Thus if
a message is deleted from the server and it is later determined that
the deletion was a mistake, a raw text copy of the deleted message is available
locally in the archive folder.
(You should probably empty this folder periodically so
that it won’t fill up your disk.)
Saved in history folder
Except for messages from friendly Email addresses or messages with
friendly Subject lines, all messages that
are not identified as candidates for deletion from the server are saved
in a history
folder on the local disk. These messages are used later to train
the algorithm to do a better job of identifying SPAM in the
future. I will explain this process in Part 3 and Part 4 of this
series of lessons.
Protection against viruses
Before any message is saved in a local file, asterisks are inserted
into the text on ten-character intervals in an attempt to destroy any
virus code that may be embedded in the message.
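The asterisk step can be pictured with a minimal sketch. The class and method names (MessageDefanger, insertAsterisks) are hypothetical, and the exact placement of the asterisks in the real program may differ. This version assumes one asterisk is placed between consecutive runs of ten characters.

```java
class MessageDefanger {
    // Hypothetical helper: place '*' between consecutive runs of
    // `interval` characters so that any embedded byte sequence longer
    // than `interval` characters is broken up before it reaches the disk.
    static String insertAsterisks(String text, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        StringBuilder sb =
            new StringBuilder(text.length() + text.length() / interval);
        for (int i = 0; i < text.length(); i++) {
            if (i > 0 && i % interval == 0) {
                sb.append('*');
            }
            sb.append(text.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints abcdefghij*klmnopqrst*uvwxy
        System.out.println(insertAsterisks("abcdefghijklmnopqrstuvwxy", 10));
    }
}
```

Text shorter than the interval passes through unchanged, so short header lines remain readable in the saved file.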
If a message makes it through the screen and is later identified as
having a virus as an attachment, a series of ten or more bytes can be
extracted from the virus code and added to the word list as an
offending phrase. This will cause any future messages having that
same virus code as an attachment to be identified as a candidate for
deletion from the server.
Possible upgrades
Numerous upgrades to my system are possible and I’m confident that you
will have ideas that I haven’t thought of. If so, I would like to
hear about them.
One possible upgrade would be to create a premium list of words and
phrases that will always result in automatic deletion of the message
from the server without prior confirmation by the user. For example,
the user might want any message containing the word VIAGRA to
be deleted automatically.
Be careful with this
However, care is urged in this regard. Certain words such as SPAM
and PORN occasionally occur in a message with the letters separated by
only a few characters. Depending on the degree of separation, my
algorithm may identify those messages as
being candidates for deletion.
For example, the offending word PORN occurs in the non-offending word
imPORtaNt with the letters R and N separated by only two characters.
The word SLUT appears in the word SoLUTion with only one character
between the S and the L. The word SPAM often occurs in different
variations of body text.
If such a premium word list is used for automatic deletion, it should
probably be restricted to only those situations where the characters in
the word exactly match (except for case) a word in the subject
or the body of the message with no intervening characters separating
the characters in the message. Experience shows, however, that
very few matches would be made on this basis, so it may not be worth
the effort.
Number of separation characters
Another possible upgrade would be to allow the user to specify the
number of characters that may occur between the letters of an offending
word or phrase in the message.
That value is currently hard-coded into the
program. As of this writing, that value is set to one for
screening against offending words or phrases. The value is set to
zero when testing for friendly Email
addresses in the From line and known good data in the Subject
line.
If the number of characters is set to zero, many spam messages with
offending words or phrases will avoid detection. If that value is set
to a large number, many false positives will occur. Therefore, care
should be taken when adjusting this value.
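The effect of the separation value can be illustrated with a small, self-contained sketch. This is not the screening code from the Screen class, which is explained in the next lesson. The names GapMatcher, containsWithGaps, and maxGap are hypothetical, and the matching strategy (case-insensitive, letters in order, at most maxGap other characters between consecutive letters) is an assumption based on the behavior described above.

```java
import java.util.Locale;

class GapMatcher {
    // Returns true if the characters of `phrase` occur in order somewhere
    // in `text`, ignoring case, with at most `maxGap` other characters
    // between any two consecutive characters of the phrase.
    static boolean containsWithGaps(String text, String phrase, int maxGap) {
        String t = text.toUpperCase(Locale.ROOT);
        String p = phrase.toUpperCase(Locale.ROOT);
        if (p.isEmpty()) {
            return false;
        }
        for (int start = 0; start < t.length(); start++) {
            if (t.charAt(start) == p.charAt(0)
                    && matchRest(t, start + 1, p, 1, maxGap)) {
                return true;
            }
        }
        return false;
    }

    // Tries to match p[next..] beginning at t[from], skipping up to
    // maxGap characters before each phrase character.
    private static boolean matchRest(String t, int from, String p,
                                     int next, int maxGap) {
        if (next == p.length()) {
            return true;
        }
        for (int skip = 0; skip <= maxGap && from + skip < t.length(); skip++) {
            if (t.charAt(from + skip) == p.charAt(next)
                    && matchRest(t, from + skip + 1, p, next + 1, maxGap)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsWithGaps("V.I.A.G.R.A", "VIAGRA", 1)); // true
        System.out.println(containsWithGaps("V.I.A.G.R.A", "VIAGRA", 0)); // false
        System.out.println(containsWithGaps("imPORtaNt", "PORN", 1));     // false
        System.out.println(containsWithGaps("imPORtaNt", "PORN", 2));     // true
    }
}
```

With a gap of one, V.I.A.G.R.A is caught while imPORtaNt passes. Raising the gap to two turns imPORtaNt into a false positive, which is the trade-off described above.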
Automatic deletion of all SPAM candidates
For the brave among us, another possible modification would be to allow
the program to automatically delete all messages that are determined to
be candidates for deletion.
Since a text version of each of these messages is saved locally in an
archive folder, a separate program could be written to allow the user
to review those messages locally at her convenience, just in case a
valid message was inadvertently deleted from the server.
Training programs
Companion programs that I have written provide for maintaining and
upgrading the offending word and phrase lists. These lists are
saved in local text files.
These
training programs are used to analyze the non-deleted message files
saved
locally in the history folder in order to train the algorithm to do a
better job of identifying SPAM messages in the future.
These programs
are designed for extreme ease of use to encourage the user to train the
algorithm frequently. The better the algorithm is trained, the
better it will perform.
I will explain these training programs in Part 3 and Part 4 of this
series of
lessons.
Simple text files
All three word lists are maintained in local text files, which can be
created and edited with an ordinary text editor if need be. Thus,
if some
corruption gets into one of the word lists, it is easy to correct the
situation using an ordinary text editor.
Technical information on POP3 protocol
For technical information on the POP3 protocol, see http://www.cis.ohio-state.edu/htbin/rfc/rfc1725.html.
I will frequently refer to this document as the technical
document in the discussion that follows.
Command summary
A POP3 command summary based on the technical
document is
shown in Figure 2.
Minimal POP3 Commands: Figure 2 |
This program uses the commands that are highlighted in red in Figure
2. I will explain those commands in conjunction with the code
that
uses them.
File names
The following file names are hard-coded into the program. You may
want to change these file names for your version of the program.
- Local copy – the file name for a local copy of each message is
based on the unique identifier for that message (UIDL) obtained from the
mail server.
- Pop302a.txt – contains a word list for screening the Subject
lines for offensive words and phrases.
- Pop302b.txt – contains a word list for screening the body text
lines for offensive words and phrases.
- Pop302c.txt – contains a list of friendly Email addresses and
subjects for screening the From and Subject lines to
identify friendly messages.
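Because the word lists are ordinary text files, reading one takes only a few lines. The sketch below assumes a layout of one word or phrase per line, with blank lines ignored and entries converted to upper case for case-insensitive matching. That layout, and the names WordListFile, normalize, and load, are assumptions made for illustration. The real file format is not described in this lesson.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordListFile {
    // Assumed format: one word or phrase per line. Blank lines are
    // dropped, surrounding whitespace is trimmed, and each entry is
    // upper-cased so later comparisons can ignore case.
    static List<String> normalize(List<String> rawLines) {
        List<String> phrases = new ArrayList<>();
        for (String line : rawLines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                phrases.add(trimmed.toUpperCase(Locale.ROOT));
            }
        }
        return phrases;
    }

    // Reads a word list file from disk and normalizes its entries.
    static List<String> load(Path file) throws IOException {
        return normalize(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Pop302a.txt is the Subject word list named above; it must
        // exist in the current directory for this demonstration to run.
        for (String phrase : load(Paths.get("Pop302a.txt"))) {
            System.out.println(phrase);
        }
    }
}
```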
Three classes
This program consists of two main classes and one minor class. An
object of the class named
Pop302 handles all communications with the POP3 server.
A method belonging to an object of the class named Screen is
used to screen each message in an
attempt to identify SPAM.
This class can be totally replaced by Java programmers who wish to
design their own screening algorithm provided that they maintain the
interface with the object of the class named Pop302.
An object of a very simple class named ScreenResult is used as
a wrapper to return several items of information from the screening
method.
Testing
The program was tested using SDK 1.4.2 under WinXP in conjunction with
two different POP3 Email servers.
The class named Pop302
As mentioned earlier, an object of the class named Pop302
handles all communications with the Email server, including the
deletion of messages from the server. An object of the class
named Screen applies screening rules in an attempt to identify
SPAM.
Stripped-down version of the Screen class
I will explain the class named Pop302 in this lesson, and will
explain the class named Screen in the next lesson.
However, I will provide a stripped-down version of the Screen
class in this lesson. You can use the stripped-down version to
test Pop302 on your
system with your Email server, but no actual screening for SPAM will
take place.
Will discuss in fragments
I will discuss the program in fragments. A complete listing of
the program is provided in Listing 42 near the end of the lesson.
You should be able to copy and paste that listing into your Java IDE to
compile and test the program on your system.
Instance variables
The Pop302 class begins in Listing 1 with the declaration of
several instance variables. The purpose of these variables will
become clear when I discuss them in conjunction with their use.
class Pop302 extends Frame{ |
As you can see, Pop302
extends Frame. Therefore, an object of the class Pop302
is a GUI.
The main method
The main method is shown in its entirety in Listing 2.
public static void main(String[] args){ |
When you start this program running, you need to provide the following
information regarding your Email server as command line parameters in
the order shown:
- server
- user name
- password
The main method then instantiates an object of the Pop302
class, passing this information as parameters to the constructor.
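A minimal sketch of handling the three command line parameters follows. The class Pop3Args and its parse method are hypothetical, and the argument-count check is an addition for illustration. Whether the main method in the complete listing performs such a check is not stated here.

```java
class Pop3Args {
    final String server;
    final String userName;
    final String password;

    private Pop3Args(String server, String userName, String password) {
        this.server = server;
        this.userName = userName;
        this.password = password;
    }

    // Expects exactly three parameters in the order: server, user name,
    // password. Anything else is rejected with a usage message.
    static Pop3Args parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                "Usage: java Pop302 server userName password");
        }
        return new Pop3Args(args[0], args[1], args[2]);
    }

    public static void main(String[] args) {
        try {
            Pop3Args parsed = parse(args);
            System.out.println("Server: " + parsed.server
                + ", user: " + parsed.userName);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
        }
    }
}
```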
The constructor
The Pop302 class consists mainly of the constructor plus a
couple of helper methods. The constructor code begins in Listing
3.
Pop302(String server,String userName, |
The constructor begins by instantiating an object of the class named Screen,
passing a reference to the Pop302 object as a parameter.
Code in the Screen class uses this reference later to display a
progress indicator in the third text field in Figure 1.
(Note that the stripped-down version of the Screen
class discussed in this lesson doesn’t display the progress
indicator. You will have to wait until the next lesson to see
that code.)
Get a socket
The code in Listing 4 instantiates a new Socket object on the
standard port for POP3 servers.
int port = 110; //pop3 mail port |
When the constructor for the Socket class returns successfully,
a TCP/IP connection will have been made with port 110 on the Email
server identified as server.
If the attempt to make the connection fails, the program will throw an
exception. For example, if the value of server is
invalid, the program will throw an UnknownHostException.
(If you are unfamiliar with socket programming in Java,
see the lessons beginning with number 550 at www.DickBaldwin.com.)
Ready to communicate
At this point, the Email server is ready to communicate using the POP3
protocol. In order to communicate, the program must be able to
send messages to the server and read messages that are sent from the
server.
Input and output streams
The code in Listing 5 gets input and output streams on the Socket
object that make it possible to send messages to the server and to read
messages sent from the server.
inputStream = new BufferedReader( |
The code in Listing 5 is straightforward and shouldn’t require
further explanation. If you are unfamiliar with this code, see
the lessons on socket programming and input/output at www.DickBaldwin.com.
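For readers who want to see the socket and stream steps in one place, here is a minimal standalone sketch. It is not the code from Listing 4 and Listing 5. The class name Pop3Streams and the helper methods readerFor and writerFor are hypothetical, and the choice of the US-ASCII character set is an assumption based on the protocol's use of printable ASCII.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

class Pop3Streams {
    static final int POP3_PORT = 110; // standard POP3 port

    // Wraps the socket's input side so that responses can be read
    // one line at a time.
    static BufferedReader readerFor(InputStream in) {
        return new BufferedReader(
            new InputStreamReader(in, StandardCharsets.US_ASCII));
    }

    // Wraps the socket's output side. Callers should use print() plus
    // an explicit "\r\n" and flush(), because println() would use the
    // platform line separator rather than the CRLF that POP3 requires.
    static PrintWriter writerFor(OutputStream out) {
        return new PrintWriter(
            new OutputStreamWriter(out, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("Usage: java Pop3Streams server");
            return;
        }
        try (Socket socket = new Socket(args[0], POP3_PORT)) {
            BufferedReader in = readerFor(socket.getInputStream());
            PrintWriter out = writerFor(socket.getOutputStream());
            System.out.println(in.readLine()); // the greeting
            out.print("QUIT\r\n");
            out.flush();
            System.out.println(in.readLine()); // the goodbye
        } catch (UnknownHostException e) {
            System.err.println("Unknown server: " + e.getMessage());
        } catch (IOException e) {
            System.err.println("Connection failed: " + e.getMessage());
        }
    }
}
```

Run against a real server, the sketch connects, prints the server's first line, and quits from the AUTHORIZATION state without logging in.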
Basic POP3 operation
The following is a quotation from the technical
document referred to earlier:
“Initially, the server host starts the POP3 service by
listening on TCP port 110. When a client host wishes to make use of the
service, it establishes a TCP connection with the server host. When the
connection is established, the POP3 server sends a greeting.
The client and POP3 server then exchange commands and responses (respectively)
until the connection is closed or aborted.”
The document goes on to explain:
“Commands in the POP3 consist of a keyword, possibly
followed by one or more arguments. All commands are terminated by a
CRLF pair. Keywords and arguments consist of printable ASCII
characters. Keywords and arguments are each separated by a single SPACE
character. Keywords are three or four characters long. Each argument
may be up to 40 characters long.”
Finally, the document tells us:
“Responses in the POP3 consist of a status indicator
and a keyword possibly followed by additional information. All
responses are terminated by a CRLF pair. There are currently two status
indicators: positive (“+OK”) and negative (“-ERR”).”
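The command and response rules quoted above translate directly into code. The following sketch is illustrative only. The class Pop3Syntax and its methods are hypothetical names, while the limits it enforces (keywords of three or four characters, arguments of up to 40 characters, single spaces, CRLF termination) come from the quoted text.

```java
class Pop3Syntax {
    static final String CRLF = "\r\n";

    // Builds one command line: the keyword, then each argument preceded
    // by a single space, terminated by a CRLF pair.
    static String formatCommand(String keyword, String... arguments) {
        if (keyword.length() < 3 || keyword.length() > 4) {
            throw new IllegalArgumentException(
                "Keyword must be three or four characters: " + keyword);
        }
        StringBuilder line = new StringBuilder(keyword);
        for (String argument : arguments) {
            if (argument.isEmpty() || argument.length() > 40) {
                throw new IllegalArgumentException(
                    "Argument must be 1 to 40 characters long");
            }
            line.append(' ').append(argument);
        }
        return line.append(CRLF).toString();
    }

    // A response is positive if it begins with the +OK status indicator.
    static boolean isPositive(String responseLine) {
        return responseLine != null && responseLine.startsWith("+OK");
    }

    public static void main(String[] args) {
        System.out.print(formatCommand("USER", "bob"));          // USER bob
        System.out.println(isPositive("+OK 2 320"));             // true
        System.out.println(isPositive("-ERR no such message"));  // false
    }
}
```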
The greeting
That brings us to the greeting mentioned above.
The code in Listing 6 gets and displays the greeting received from the
Email server. In the process, the code in Listing 6 invokes the
method named validateOneLine to confirm that the message
received from the Email server begins with +OK,
and not with -ERR.
String connectMsg = validateOneLine(); |
(If the
response begins with -ERR, the program terminates the communication
session with the server, prints an error message, and terminates.)
The validateOneLine method
The code in Listing 6 invokes the method named validateOneLine
to get and validate the message sent by the server. At this
point, I am going to set the discussion of the
constructor aside for a moment and discuss the method named validateOneLine.
The validateOneLine method begins in Listing 7.
private String validateOneLine(){ |
The method begins by reading a line of text sent by the server and
confirming that the text begins with +OK. If so, the
method simply returns that line of text as a String object,
where it is displayed by the second statement in Listing 6.
If -ERR is received
If the received line of text does not begin with +OK, it
must begin with -ERR, which is the only other possibility
allowed by the protocol.
Listing 8 shows the behavior of the validateOneLine method when
the received line of text does not begin with +OK.
else{ |
In this case, the method:
- Displays the line of text that was received.
- Sends a QUIT command to the server to terminate the
session.
- Closes the socket.
- Prints an error message.
- Terminates the program.
As you will see later, this method is invoked at numerous places in the
program to get and validate a server response to commands sent to the
server by the
program.
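The read-and-validate pattern described above can be sketched as follows. This is not the validateOneLine method from Listing 7 and Listing 8. The class ResponseValidator is hypothetical, and the sketch differs from the behavior described above in one deliberate way: instead of closing the socket and terminating the program on -ERR, it sends QUIT and throws an exception so that the caller can decide what to do next.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;

class ResponseValidator {
    private final BufferedReader in;
    private final PrintWriter out;

    ResponseValidator(BufferedReader in, PrintWriter out) {
        this.in = in;
        this.out = out;
    }

    // Reads one response line. Returns it if it begins with +OK.
    // Otherwise displays the line, sends QUIT, and throws.
    String validateOneLine() {
        String line;
        try {
            line = in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (line != null && line.startsWith("+OK")) {
            return line;
        }
        System.out.println("Server response: " + line);
        out.print("QUIT\r\n");
        out.flush();
        throw new IllegalStateException(
            "POP3 server reported an error: " + line);
    }

    public static void main(String[] args) {
        // Canned server reply so the sketch runs without a mail server.
        BufferedReader in = new BufferedReader(
            new StringReader("+OK POP3 server ready\r\n"));
        PrintWriter out = new PrintWriter(new StringWriter());
        System.out.println(new ResponseValidator(in, out).validateOneLine());
    }
}
```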
The greeting
The greeting sent by one of my Email servers is shown in Figure
3.
+OK POP3 server1.yohance.com v2001.78rh |
(The actual text in the greeting will vary from
one Email
server to the next.Note that I manually inserted a line break immediately
following
78rh in Figure 3 to force the greeting to fit in this narrow
publication format.)
The AUTHORIZATION state
The following is a quotation from the technical
document mentioned earlier:
“A POP3 session progresses through a number of states
during its lifetime. Once the TCP connection has been opened and the
POP3 server has sent the greeting, the session enters the AUTHORIZATION
state. In this state, the client must identify itself to the POP3
server.”
Returning to the constructor
At this point, the greeting has been received, and the POP3 session is
in the AUTHORIZATION state. It is now time for the program to
send the user name and the password to the server.
Commands are sent to the server as plain text, in upper case. Some
commands require an argument following the command, as is the case with
the USER command shown in Listing 9.
//Send the command |
The code in Listing 9
produces the output shown in Figure 4 on my system with my Email server.
(The response from your Email server may differ.)
USER +OK User name accepted, password please Figure 4 |
The APOP command
There is an optional APOP command, which allows the user name
and password to be encrypted before being sent to the server. The
use of the APOP command would be more secure than the approach
shown in Listing 9 and Listing 10. However, this command is not
supported by all Email servers, and apparently is not supported by my
server.
Send the password
The code in Listing 10 sends the password, validates the response, and
displays the response.
//Send the password to the server |
The code in Listing 10 produces the output shown in Figure 5.
PASS +OK Mailbox open, 7 messages Figure 5 |
(Obviously
the number of messages available will vary from one run to the next.)
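The USER and PASS exchange can be summarized in a short sketch. It is not the code from Listing 9 and Listing 10. The class Pop3Login and its methods are hypothetical, and the demonstration feeds in canned replies copied from Figure 4 and Figure 5 rather than talking to a real server. Note that both the user name and the password travel in plain text with this approach.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;

class Pop3Login {
    // Sends USER and then PASS, checking for +OK after each command.
    // Returns the server's reply to PASS.
    static String login(BufferedReader in, PrintWriter out,
                        String userName, String password) {
        send(out, "USER " + userName);
        expectOk(in);
        send(out, "PASS " + password);
        return expectOk(in);
    }

    private static void send(PrintWriter out, String command) {
        out.print(command + "\r\n"); // POP3 commands end with CRLF
        out.flush();
    }

    private static String expectOk(BufferedReader in) {
        try {
            String line = in.readLine();
            if (line == null || !line.startsWith("+OK")) {
                throw new IllegalStateException("Login rejected: " + line);
            }
            return line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(new StringReader(
            "+OK User name accepted, password please\r\n"
            + "+OK Mailbox open, 7 messages\r\n"));
        PrintWriter out = new PrintWriter(new StringWriter());
        // prints +OK Mailbox open, 7 messages
        System.out.println(login(in, out, "bob", "secret"));
    }
}
```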
The TRANSACTION state
Returning now to the technical
document, we find:
“… the client must identify itself to the POP3 server.
Once the client has successfully done this, the server acquires
resources associated with the client’s maildrop, and the session enters
the TRANSACTION state. In this state, the client requests actions on
the part of the POP3 server.”
Having received the +OK response shown in Figure 5, our POP3
session is now in the TRANSACTION state.
The QUIT command and the UPDATE state
We find the following information in the technical
document:
“When the client has issued the QUIT command, the
session enters the UPDATE state. In this state, the POP3 server
releases any resources acquired during the TRANSACTION state and says
goodbye. The TCP connection is then closed.”
Terminating the POP3 session
We are still discussing the constructor. Listing 11 shows the
code used to register a WindowListener object on the close
button on the Frame. The purpose of this listener
is to terminate the POP3 session and to terminate the program when the
user presses the close button.
this.addWindowListener( |
(Note that the code in Listing 11 is an anonymous class
definition. If you are unfamiliar with anonymous class
definitions in Java, you can learn about them by studying the tutorial
lessons at www.DickBaldwin.com.)
The windowClosing method
By defining the windowClosing method in the anonymous class,
the code in Listing 11:
- Sends a QUIT command to the server.
- Validates and displays the response.
- Closes the socket.
- Terminates the program.
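The steps above can be sketched with a WindowAdapter. This is not the code from Listing 11. The class QuitOnClose, its listener method, and the shutdown parameter are hypothetical. In this sketch the final step (closing the socket and terminating the program) is passed in as a Runnable so that the example can run without a mail server or a visible window.

```java
import java.awt.event.WindowAdapter;
import java.awt.event.WindowEvent;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;

class QuitOnClose {
    // Builds a listener that ends the POP3 session when the window's
    // close button is pressed, then runs the supplied shutdown action.
    static WindowAdapter listener(final BufferedReader in,
                                  final PrintWriter out,
                                  final Runnable shutdown) {
        return new WindowAdapter() {
            @Override
            public void windowClosing(WindowEvent e) {
                out.print("QUIT\r\n");
                out.flush();
                try {
                    System.out.println(in.readLine()); // goodbye message
                } catch (IOException ex) {
                    System.err.println("No reply to QUIT: " + ex.getMessage());
                }
                shutdown.run(); // e.g. close the socket, then exit
            }
        };
    }

    public static void main(String[] args) {
        // Canned goodbye copied from Figure 6; no window is opened here.
        BufferedReader in = new BufferedReader(
            new StringReader("+OK Sayonara\r\n"));
        PrintWriter out = new PrintWriter(new StringWriter());
        WindowAdapter adapter = listener(in, out,
            () -> System.out.println("shutdown"));
        adapter.windowClosing(null);
    }
}
```

In a real program the listener would be registered with addWindowListener on the Frame, and the shutdown action would close the socket and call System.exit(0).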
The goodbye message from the server
In addition to displaying the response on the command-line screen, the
code in Listing 11 also displays it in the large text area in Figure
1. However, you will have to look very quickly to see it there
before the GUI disappears.
The response provided by my server is shown in Figure 6.
QUIT +OK Sayonara Figure 6 |
The UPDATE state
At this point, the POP3 session is in the UPDATE state.
Among other things, this means that the server will delete all of the
messages that were marked for deletion by the DELE command
while the session was in the TRANSACTION state.
Here is some of what the technical
document has to say about the UPDATE state:
“When the client issues the QUIT command from the
TRANSACTION state, the POP3 session enters the UPDATE state. (Note that
if the client issues the QUIT command from the AUTHORIZATION state, the
POP3 session terminates but does NOT enter the UPDATE state.)
If a session terminates for some reason other than a
client-issued QUIT command, the POP3 session does NOT enter the UPDATE
state and MUST not remove any messages from the
maildrop.
The POP3 server removes all messages marked as deleted from the
maildrop. It then releases any exclusive-access lock on the maildrop
and replies as to the status of these operations. The TCP connection is
then closed.”
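The difference between marking a message with DELE and actually removing it in the UPDATE state can be illustrated with a toy model of the server's maildrop. This is a teaching aid only, not part of the program. The class MaildropModel and its methods are hypothetical, and the model relies on the POP3 rule that messages marked as deleted are not counted by STAT.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MaildropModel {
    private final List<String> messages;
    private final Set<Integer> marked = new HashSet<>();

    MaildropModel(List<String> messages) {
        this.messages = new ArrayList<>(messages);
    }

    // DELE n: marks message n (1-based) as deleted. Nothing is removed yet.
    void dele(int messageNumber) {
        if (messageNumber < 1 || messageNumber > messages.size()
                || marked.contains(messageNumber)) {
            throw new IllegalArgumentException("-ERR no such message");
        }
        marked.add(messageNumber);
    }

    // STAT: messages marked as deleted are not counted.
    int stat() {
        return messages.size() - marked.size();
    }

    // QUIT from the TRANSACTION state: the UPDATE state removes
    // every marked message.
    void quit() {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < messages.size(); i++) {
            if (!marked.contains(i + 1)) {
                kept.add(messages.get(i));
            }
        }
        messages.clear();
        messages.addAll(kept);
        marked.clear();
    }

    // Session ends without QUIT: no UPDATE state, so nothing is removed.
    void abort() {
        marked.clear();
    }

    // Number of messages physically stored in the maildrop.
    int stored() {
        return messages.size();
    }

    public static void main(String[] args) {
        MaildropModel drop = new MaildropModel(Arrays.asList("a", "b", "c"));
        drop.dele(2);
        System.out.println(drop.stat() + " counted, " + drop.stored() + " stored");
        drop.quit();
        System.out.println(drop.stat() + " counted, " + drop.stored() + " stored");
    }
}
```

The first line printed is "2 counted, 3 stored" and the second is "2 counted, 2 stored", showing that the message disappears from the maildrop only after QUIT.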
Defining the GUI
Note that the GUI shown in Figure 1 was purposely made narrow so that
it would fit into this narrow publication format. However, it is
much
more useful if it is wide enough to display each text line in
the message in its entirety without a requirement for horizontal
scrolling. Therefore, I recommend that you resize the GUI to make
it
at least 750 pixels wide. I also recommend that you make each of
the text
fields and the text area at least 100 characters wide.
Set the layout
Listing 12 sets the GUI layout to FlowLayout. Although
this isn’t very fancy, it works pretty well in this case.
setLayout(new FlowLayout()); |
Construct GUI components
Listing 13 constructs the two buttons, the three text fields, and the
text area shown in Figure 1.
final Button startButton = |
No labels are provided
In order to preserve real estate on the screen, I
did not provide labels to identify the text fields in Figure 1.
Rather, when the text fields are instantiated, the initial text
showing in each text field indicates its purpose. For example,
the initial text that appears in the topmost text field is “Display
From line here.”
The last statement in Listing 13 also displays the purpose of the text
area in the text area when it first appears on the screen.
Not yet added to the GUI
Note that at this point, the GUI components have been constructed, but
have not yet been placed in the GUI. This will be taken care of
later.
References to buttons are final
Note also that it is necessary to declare the references to the two Button
objects to be final, because they are accessed later from
within an anonymous class definition. In the Java version used here (SDK 1.4.2), local and anonymous classes
can access local variables only if they are declared final. (Java 8 and later also accept variables that are effectively final.)
ActionListener on the Start/Next button
Listing 14 shows the beginning of the registration of an anonymous ActionListener
object on the Start/Next button shown in Figure 1.
startButton.addActionListener( |
Listing 14 clears the third text field from the top in Figure 1 by
storing an empty string in that text field.
Retrieve and screen messages for SPAM
As mentioned earlier, the POP3 session is now in the TRANSACTION
state. The code in Listing 15 begins the process of retrieving
all the messages currently on the server and screening those messages
for SPAM.
The number of messages on the server
One of the first things that we need to know is how many messages are
currently in the dropbox on the server. The code in Listing 15
sends a STAT command to the server to get this information.
try{ |
Get number of messages only at beginning of session
As the session progresses and DELE commands are sent to the
server, messages are marked for deletion. Once a message is
marked for deletion, it is no longer included in the count of messages
on the server. Therefore, we must make certain that we obtain the
number of messages on the server only at the beginning of the session.
As you will see later, the variable numberMsgs is used by the
program to determine when all of the messages have been
processed. Since we must
retrieve the number of messages on the server only once at the
beginning of the session, we execute this code only when the value of numberMsgs
is zero.
Issue a STAT command
The code in Listing 15 begins by issuing a STAT command, and
then getting, validating, and saving the response. Here is part
of what the technical
document has to say about the response to the STAT command.
“The POP3 server issues a positive response with a line
containing information for the maildrop. This line is called a “drop
listing” for that maildrop.
In order to simplify parsing, all POP3 servers are required to use
a certain format for drop listings. The positive response consists of
“+OK” followed by a single space, the number of messages in the
maildrop, a single space, and the size of the maildrop in octets.”
Get number of messages as a String
Having saved the response to the STAT command, the code in
Listing 15 extracts a substring from that string containing the number
of messages as a String.
Convert the String to an int
Then the code in Listing 15 invokes the parseInt method of the Integer
class to convert the string representing the number of messages to an int.
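The two steps just described can be sketched as follows. This is an illustration under stated assumptions, not the code from Listing 15: the class and method names are hypothetical, and the sketch splits the response on spaces rather than extracting a substring by index. The sample response "+OK 2 320" follows the drop listing format quoted above.

```java
// Hypothetical illustration, not the code from Listing 15.
class StatParseSketch {
    // Parses a drop listing such as "+OK 2 320" and returns the message count.
    static int parseMessageCount(String response) {
        // Validate: only a positive response carries a drop listing.
        if (response == null || !response.startsWith("+OK ")) {
            throw new IllegalArgumentException(
                "Not a positive STAT response: " + response);
        }
        // Fields are separated by single spaces: "+OK", count, size in octets.
        String[] fields = response.trim().split(" ");
        if (fields.length < 3) {
            throw new IllegalArgumentException(
                "Malformed drop listing: " + response);
        }
        // Convert the String holding the count to an int.
        return Integer.parseInt(fields[1]);
    }

    public static void main(String[] args) {
        System.out.println(parseMessageCount("+OK 2 320"));
    }
}
```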
Referring to a message by its number
Later we will see that messages can be referred to by their message
number.
(Note that message numbers begin with 1 and not
with 0.)
Retrieve and screen each message
The next step is to retrieve each message from the server and to screen
it for SPAM. Basically this consists of:
- Retrieving each message from the server
- Writing that message into a local disk file
- Passing the disk file
to a method belonging to an object of the Screen class where it
is screened for SPAM
The screening method returns a boolean value
indicating whether or not the message is a candidate for deletion from
the server due to a failure to satisfy one of the SPAM rules.
Get the unique ID
Each message is stored on the server with a unique ID. The unique
ID for the message is retrieved first and is used to create a unique
file name for
storing the message in a local disk file.
Note that the msgCounter variable was initialized to 0 when it
was declared in Listing 1. We will see later that this value is
incremented each time a new message is processed. Because the
message numbers start with 1 instead of 0, msgNumber must
always
be one greater than msgCounter.
The unique ID for a message is obtained from the server by issuing a UIDL
command and saving the response. Listing 16 shows the code used
to get, validate, and save the unique ID for the next message.
msgNumber = msgCounter + 1; |
The UIDL command
Here is some of what the technical
document has to say about the UIDL command:
“Arguments: a message-number (optionally) If
a message-number is given, it may
NOT refer to a message marked as deleted.
Restrictions: may only be given in the TRANSACTION state.
Discussion: If an argument was given and the POP3 server issues a
positive response
with a line containing information for that message. This line is
called a “unique-id listing” for that message. … A unique-id
listing consists
of the message-number of the message, followed by a single space and
the unique-id of the message.”
No need to parse the response
In this case, I will use the entire response string as a file
name and therefore I won’t be concerned about parsing the response.
(I’m also not interested in the response produced when
the UIDL command is issued without a message number because
this program never
issues the command without a message number.)
While writing this lesson, it has occurred to me that a useful safety
upgrade would be to:
- Parse the response to the UIDL command
- Extract and save the message number
- Compare that value with the value of msgNumber being
maintained internally by this program before sending a DELE
command to the server
That would ensure that this program is properly synchronized with the
server’s view of message numbers before a command is given to delete a
message.
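A minimal sketch of such a check is shown below. It is not part of the program in Listing 42. The class and method names are hypothetical, and the sketch assumes that the saved response has the form “+OK”, a space, the message number, a space, and the unique ID, as in the unique-id listing quoted above.

```java
// Hypothetical safety check, not part of Pop302.
class UidlCheckSketch {
    // Returns true only if the UIDL response is positive and reports the
    // same message number that the client believes it is working on.
    static boolean matchesMessageNumber(String response, int msgNumber) {
        if (response == null || !response.startsWith("+OK ")) {
            return false;
        }
        // Expected fields: "+OK", message number, unique ID.
        String[] fields = response.split(" ");
        if (fields.length < 3) {
            return false;
        }
        try {
            return Integer.parseInt(fields[1]) == msgNumber;
        } catch (NumberFormatException e) {
            // The second field was not a number, so do not trust the response.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(matchesMessageNumber("+OK 2 QhdPYR:00WBw1Ph7x7", 2));
    }
}
```

A DELE command would then be sent only when the method returns true for the current value of msgNumber.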
Open an output file
The code in Listing 17 uses the unique ID to open an output file in
which to save the message.
String fileName = |
(You may want to modify this code to cause the messages
to be stored in a different location on the disk. If so, modify
the string shown in blue in Listing 17. Make certain
that the folder where you plan to save the files exists before running
the program.)
The code in Listing 17 is straightforward and shouldn’t require further
explanation. If you are unfamiliar with code like this, see the
tutorials on file I/O at www.DickBaldwin.com.
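If you would rather have the program confirm that the folder exists than check by hand, something like the following sketch could be called before the output file is opened. It is an optional addition, not part of Listing 17, and the class and method names are hypothetical.

```java
import java.io.File;

// Hypothetical helper, not part of Pop302.
class MailFolderSketch {
    // Returns true if the folder already exists or could be created.
    static boolean ensureFolder(String path) {
        File folder = new File(path);
        // mkdirs also creates any missing parent folders.
        return folder.isDirectory() || folder.mkdirs();
    }

    public static void main(String[] args) {
        // The default location used by the program; change it to suit.
        System.out.println(ensureFolder("c:/MailFiles/"));
    }
}
```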
Begin the message retrieval process
Listing 18 issues a RETR command to begin the message retrieval
process, and then validates the response.
outputStream.println( |
Note that the RETR command specifies a particular message based
on its message number.
Response to the RETR command
Figure 7 shows a typical response produced by my Email server to the
receipt of a RETR command.
+OK 1818 octets
Figure 7
The RETR command
Here is some of what the technical
document has to say about the RETR command:
“Arguments:
a message-number (required) which may not refer to a message marked as
deleted.
Discussion:
If the POP3 server issues a positive response, then the response given
is multi-line. After the initial +OK, the POP3 server sends the message
corresponding to the given message-number, being careful to byte-stuff
the termination character (as with all multi-line responses).”
What is meant by byte-stuffing?
Here is part of what the technical
document has to say about multi-line responses and byte-stuffing.
“Responses to certain commands are multi-line. In these
cases, … after sending the first line of the response and a CRLF, any
additional lines are sent, each terminated by a CRLF pair. When all
lines of the response have been sent, a final line is sent,
consisting of a termination octet (decimal code 046, “.”) and a CRLF
pair. If any line of the multi-line response begins with the
termination octet, the line is “byte-stuffed” by pre-pending the
termination octet to that line of the response.”
In other words, a message is terminated by a line that has a period as
the first character followed immediately by a CRLF pair. If a
normal line of the message begins with a period, byte-stuffing is
used to distinguish that line from the terminating line.
Didn’t strip any bytes
In the event that a line in the message begins with a period, then it
will begin with two periods after byte-stuffing takes place on the
server.
Since having two periods at the beginning of the line is unlikely to
have a detrimental impact on the screening process, I didn’t bother to
strip any bytes
that may have been prepended onto the line by the server during
byte-stuffing.
However, you may
want to upgrade the program to cause it to deal more correctly with
this situation if you consider it to be a problem.
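Should you decide to make that upgrade, a minimal sketch of the client side of byte-stuffing follows. It is not part of Pop302, and the class and method names are hypothetical. A line consisting of a single period ends the message. Any other line that begins with a period has had one period prepended by the server, so the first character is dropped.

```java
// Hypothetical upgrade, not part of Pop302.
class ByteStuffSketch {
    // True for the line that terminates a multi-line response.
    static boolean isTerminator(String line) {
        return ".".equals(line);
    }

    // Removes the period that the server prepended to a line that
    // originally began with a period. Other lines are returned unchanged.
    static String unstuff(String line) {
        if (line.length() > 1 && line.charAt(0) == '.') {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(unstuff("..a line that began with a period"));
    }
}
```

The test for the terminating line must be made before unstuff is called, since the two cases differ only in the length of the line.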
Clear the text area
The code in Listing 19 clears the text area at the beginning of each
message. If you don’t do this, the string contained in the text
area will become very long and the program will run slowly as a result.
textArea.setText(""); |
Read first line and insert stars
The code in Listing 20 reads the first line of the message from the
server. Then it invokes the method named insertStars to
insert asterisks on ten-character intervals in the text.
//Read first line of message |
There is a possibility of retrieving a message that contains executable
virus code. My purpose in inserting an asterisk every ten
characters is to break up the byte pattern and hopefully to
corrupt any executable virus code that may be contained in the byte
stream before writing those bytes to a local disk file.
The insertStars method
At this point, I will set the discussion of the constructor aside and
present the method named insertStars, which is shown in Listing
21.
The code in this method is straightforward and should not require
further explanation.
private String insertStars(String stringIn){ |
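For readers who want to see the idea in a standalone form, here is a minimal sketch of such a method. It is not a copy of Listing 21. The class name is hypothetical, and the sketch assumes that an asterisk is placed after every tenth character of the incoming string, which is one reasonable reading of “on ten-character intervals.”

```java
// Hypothetical stand-in for the method in Listing 21.
class InsertStarsSketch {
    // Returns a copy of stringIn with '*' inserted after every tenth character.
    static String insertStars(String stringIn) {
        StringBuilder out =
            new StringBuilder(stringIn.length() + stringIn.length() / 10);
        for (int i = 0; i < stringIn.length(); i++) {
            out.append(stringIn.charAt(i));
            // Positions 10, 20, 30, ... (counting from 1) get an asterisk.
            if ((i + 1) % 10 == 0) {
                out.append('*');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(insertStars("abcdefghijklmnopqrstuvwxy"));
    }
}
```

Under this assumption a line shorter than ten characters, including the terminating line that contains a single period, passes through unchanged.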
Read and save all lines of message
Returning now to the discussion of the constructor, the code in Listing
22 continues reading lines of text from the server, inserting stars,
and writing those lines of text into the output file until a line is
received that contains a single period.
while(!(msgLine.equals("."))){ |
Newline characters are written at the end of each line of text when it
is written into the output file.
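The loop can be sketched in a self-contained form as shown below. This is an illustration, not the code from Listing 22: the class and method names are hypothetical, a StringReader stands in for the socket, a StringWriter stands in for the output file, and the call to insertStars is omitted for brevity. The sketch also stops if the stream ends before the terminating line arrives, which is an added precaution.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical illustration of the read loop, not the code from Listing 22.
class MessageCopySketch {
    // Copies lines from in to out until a line containing a single period
    // is read. Returns the number of lines copied.
    static int copyMessage(BufferedReader in, Writer out) throws IOException {
        int count = 0;
        String line = in.readLine();
        // A null line means the stream ended without a terminating line.
        while (line != null && !line.equals(".")) {
            out.write(line);
            out.write('\n'); // newline at the end of each line written
            count++;
            line = in.readLine();
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
            new StringReader("Subject: hi\r\n\r\nbody\r\n.\r\n"));
        StringWriter out = new StringWriter();
        System.out.println(copyMessage(in, out) + " lines copied");
    }
}
```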
Display messages for the user
It is almost time to pass the file containing the message to the
screening method to allow it to screen for SPAM. Before doing
that, however, the code in Listing 23 writes messages in the text
fields and text area of Figure 1 to let the user know what is happening.
fromField.setText("Call screener"); |
The progress indicator
Occasionally a very long message is received that requires a
perceptible amount of time for screening. When that happens (with
the version of the Screen class that will be discussed in the
next lesson), the screening method writes a stream of periods into
the
text area to let the user know that the system is actually working on a
message and isn’t simply hung up. Hence the words “Progress
Meter” are placed in the text area in
Listing 23 to tell the user what that stream of periods indicates.
(The stripped-down version of the Screen class
that I will discuss in this lesson does not provide this type of visual
feedback.)
Information from the screening method
Several different pieces of information need to be returned from the
screening method. However, in Java, a method can return only one
value. To accommodate this, an empty object instantiated from the
ScreenResult class is passed as a parameter to the screening
method. The code in the screening method populates the fields in
that object so as to make the information available upon return.
The ScreenResult class
At this point, I will set the discussion of the constructor aside and
show you the ScreenResult class in Listing 24.
class ScreenResult{ |
As you can see, this is a very simple class, an object of which exists
solely as a place to store four strings that are populated by the
screening method for later use by the calling method.
Screen the file for SPAM
Returning now to the constructor, the code in Listing 25:
- Declares a local variable named match and initializes it
to false.
- Instantiates a new empty object of the ScreenResult class.
- Invokes the screenMsg method belonging to an object of
the Screen class, passing the name of the disk file containing
the message, the unique identifier for the message, and the empty ScreenResult
object as parameters, and storing the returned value in the variable
named match.
boolean match = false; |
Upon further reflection
Frequently when I write a lesson explaining
code that I have written, I realize that there are sections of code
that I would write differently if I had it to do over again. That
is the case here.
In this case, if I were to rewrite this program, I would upgrade the
definition of the ScreenResult class to include an additional
field of type boolean named match.
Then I would require the screenMsg method of the Screen
class to return a reference to a populated object of type ScreenResult
instead of returning type boolean. I would eliminate the ScreenResult
parameter from the parameter list of the screenMsg method.
Then I would cause the code in the calling method to accommodate those
changes and to extract the value of match from the object
returned by the screenMsg instead of dealing with match
separately as is the case in Listing 25.
In my opinion, this would result in a somewhat cleaner user
interface. However, at this point, I am too far down the road to
turn back, so I will just leave the program as it is. I may
upgrade it sometime in the future to implement this improvement.
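A rough sketch of what that redesign might look like follows. It is not part of the program in Listing 42. The class names ScreenResultSketch and ScreenSketch are hypothetical, and the screening rule, which simply looks for a phrase supplied to the constructor, is a placeholder. To keep the example self-contained, the method receives the message text directly instead of a file name and unique ID.

```java
// Hypothetical redesign, not part of Pop302.
class ScreenResultSketch {
    String from = "";
    String subject = "";
    String text = "";
    String thePhrase = "";
    boolean match = false; // the additional field described above
}

class ScreenSketch {
    private final String phrase; // placeholder for a real SPAM rule

    ScreenSketch(String phrase) {
        this.phrase = phrase;
    }

    // Returns a populated result object instead of a boolean.
    ScreenResultSketch screenMsg(String messageText) {
        ScreenResultSketch result = new ScreenResultSketch();
        result.text = messageText;
        result.match = messageText.contains(phrase);
        if (result.match) {
            result.thePhrase = phrase;
        }
        return result;
    }

    public static void main(String[] args) {
        ScreenResultSketch r = new ScreenSketch("xyzzy").screenMsg("abc xyzzy");
        System.out.println(r.match + " " + r.thePhrase);
    }
}
```

The calling code would then test the match field of the returned object rather than a separate local variable.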
Designing your own SPAM screening algorithm
Should you decide to design your own screening algorithm, this is where
you would connect your algorithm to the communication module. In
other words, your version of the method named screenMsg should
return true if it is recommending that the message be deleted from the
server. Also, the object of type ScreenResult passed as a
parameter to the method should be populated with information to be
displayed in the text fields and the text area of the GUI shown in
Figure 1.
You may or may not decide to make callbacks on the communication module
to support the progress indicator while your method is working.
Display the results of the screening process
Listing 26 displays the information that was encapsulated in the ScreenResult
object by the screening method in the text fields and text area of
Figure 1.
fromField.setText(theResult.from); |
The code in Listing 26 is straightforward and shouldn’t require further
explanation.
Information available to the user
At this point, the user can view:
- The contents of the From line of the message
- The contents of the Subject line of the message
- The complete raw text of the message down to the line containing
the offending word or phrase, if any
- The offending word or phrase, if any
If the screening method returned true, this information will remain on
the screen for the user to ponder. However, if the screening
method returned false, it will disappear from the screen very quickly,
and probably won’t even be seen by the user.
Increment the message counter
Listing 27 increments the message counter in preparation for processing
the next message.
msgCounter++; |
A candidate for deletion from the server
A return value of true from the screenMsg method means that the
screening method is recommending
that the message be deleted from the server.
Listing 28 shows the behavior of the actionPerformed method
registered on
the Start/Next button under this circumstance.
if(match == true){ |
Wait for further action by the user
The message has been identified as a candidate for deletion from the
server. The actionPerformed method simply returns with
the information described above showing in the text fields and text
area of Figure 1. The user can view this information while
deciding what to do next. Nothing further will happen in the
program until the user presses either the Delete button
or the Start/Next button.
Pressing the Delete button
If the user presses the Delete button in Figure 1, the
message will
be deleted from the server. I will explain exactly how this
happens later when I discuss the ActionListener object that
will be registered on the Delete button.
Pressing the Start/Next button
If the user presses the Start/Next button in Figure 1,
the message
will not be deleted from the server, the actionPerformed method
belonging to the ActionListener object registered on that
button will be executed, and the next message on the server
will be retrieved and screened for SPAM.
Message is not a candidate for deletion
If the screenMsg method returns false, the message has not been
identified as a candidate for deletion, and control reaches the point
in the actionPerformed
method shown in Listing 29.
Toolkit.getDefaultToolkit(). |
At this point, we could require the user to press the Start/Next
button to retrieve and screen the next message. However, in the
interest of convenience, we will relieve the user of that
responsibility.
Firing a synthetic event
The code in Listing 29 fires an ActionEvent identical to that
which would be fired if the user were to press the Start/Next
button. This causes the program to retrieve the next message on
the server and to begin the screening process immediately.
(If you are unfamiliar with the concept of posting
events in the system event queue, you can learn about that in the
tutorial lessons at www.DickBaldwin.com.)
When all messages have been screened …
Listing 30 shows the completion of the registration of an anonymous ActionListener
object on the Start/Next button that was begun in
Listing 14.
else{//msgNumber > numberMsgs |
The code in Listing 30 is executed when all of the messages on the
server have been
screened.
This code disables the Start/Next button and posts
messages instructing the user to press the close button to terminate
the program.
Beyond that, the code in Listing 30 simply completes a try/catch
block,
and wraps up the cryptic code required for the definition of an
anonymous
class.
An ActionListener on the Delete button
The Delete button shown in Figure 1 is used to cause
messages to be deleted from the server. Listing 31 shows the
beginning of the registration of an anonymous ActionListener object
on the Delete button.
deleteButton.addActionListener( |
The code in Listing 31 simply clears the third text field from the top
in Figure 1 when the user presses the Delete button.
Marking messages for deletion from the server
Deletion of a message from the server is accomplished by marking the
message for deletion while in the TRANSACTION state. The
message is actually deleted later when the client sends a QUIT command
to the server causing the server to enter the UPDATE state.
(If the program aborts prematurely before sending a QUIT
command,
marked messages are not deleted from the server.)
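The following toy model may help to make the mark-then-commit behavior concrete. It is not a POP3 client and is not part of Pop302. The class MaildropModel and its methods are hypothetical names that imitate, in memory, how a server treats DELE, STAT, and QUIT, plus a dropped connection.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a maildrop, not part of Pop302.
class MaildropModel {
    private final List<String> messages = new ArrayList<>();
    private final List<Boolean> marked = new ArrayList<>();

    MaildropModel(String... msgs) {
        for (String m : msgs) {
            messages.add(m);
            marked.add(false);
        }
    }

    // DELE: marks message n (numbering starts at 1). Returns false if the
    // number is out of range or the message is already marked.
    boolean dele(int n) {
        if (n < 1 || n > messages.size() || marked.get(n - 1)) {
            return false;
        }
        marked.set(n - 1, true);
        return true;
    }

    // STAT: counts only the messages that are not marked for deletion.
    int stat() {
        int count = 0;
        for (boolean isMarked : marked) {
            if (!isMarked) {
                count++;
            }
        }
        return count;
    }

    // QUIT: the UPDATE state removes the marked messages for good.
    // Returns the number of messages left in the maildrop.
    int quit() {
        for (int i = messages.size() - 1; i >= 0; i--) {
            if (marked.get(i)) {
                messages.remove(i);
                marked.remove(i);
            }
        }
        return messages.size();
    }

    // Connection lost before QUIT: the marks are discarded, nothing is deleted.
    int abort() {
        for (int i = 0; i < marked.size(); i++) {
            marked.set(i, false);
        }
        return messages.size();
    }

    public static void main(String[] args) {
        MaildropModel drop = new MaildropModel("a", "b", "c");
        drop.dele(2);
        System.out.println(drop.stat() + " counted, " + drop.quit() + " left after QUIT");
    }
}
```

The model also shows why the program reads the message count only once: after a DELE, a second STAT would report a smaller number even though the remaining messages keep their original message numbers during the session.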
The deletion code
Listing 32 shows the code used to
- Mark the message for deletion
- Validate the response
- Display a deletion message
outputStream.println( |
(See the earlier section entitled A possible safety upgrade for a suggestion
related to upgrading this program.)
The DELE code has been temporarily disabled
Note that the three corresponding statements in Listing 42
near the end of the lesson have been disabled by marking them as
comments. I did this to keep you from accidentally deleting
messages from your server during your early stages of testing this
program with your Email server.
You can enable the three statements in Listing 42 by removing the
comment indicators. However, you should not enable them until you
are confident
that you really do want to delete messages from the server.
(Once a message is deleted from the server, it cannot be
recovered from the server.)
A synthetic ActionEvent
The code in Listing 33 fires a synthetic ActionEvent identical
to that which would be fired if the user presses the Start/Next
button.
Toolkit.getDefaultToolkit(). |
Thus, when the user presses the Delete button, the
message is marked for deletion on the server and the next message on
the server is retrieved immediately for SPAM screening without a
requirement for the user to request the next message.
Finish configuring the GUI
The code in Listing 34 finishes configuring the GUI by placing the
various components in the Frame, setting its size, and making
it visible.
add(startButton); |
As I mentioned earlier, you will probably find the program to be more
useful if you increase the width of the Frame to at least 750
pixels and increase the size of the text fields and text area in
Listing 13 to be at least 100 characters wide.
That completes the discussion of the class named Pop302.
Stripped-down Screen class
The following sections provide a brief discussion of a stripped-down
version of the class named Screen, which you can use to test
this program on your system with your Email server.
This stripped-down version of the Screen class doesn’t actually
do any SPAM screening. Rather, it populates the ScreenResult
object with information from the message and toggles its return value
between true and false for each successive message.
My full version of the Screen class implements my SPAM
screening algorithm. I will explain the details of my full Screen
class in the next lesson in this series.
A dummy constructor
The definition of the stripped-down Screen class begins in
Listing 35.
class Screen{ |
A dummy constructor is required to satisfy the instantiation of the Screen
object in Listing 3.
The screenMsg method
The code in Listing 25 invokes the screenMsg method of an
object of the Screen class for the purpose of applying SPAM
screening rules to a message stored in a disk file.
The definition of the stripped-down screenMsg method begins in
Listing 36.
public boolean screenMsg(String fileName, |
The code in Listing 36 gets a BufferedReader object that will
be used to read the raw text of the message stored in the file whose
name was received as a parameter.
Initialize the ScreenResult object
The code in Listing 37 populates three of the fields in the ScreenResult
object received as an incoming parameter. Two of these fields are
populated with messages that will be overwritten later if Subject
and
From data is successfully extracted from the file
containing the
message.
theResult.subject = "No Subj line found"; |
The text that is stored in the field named thePhrase will
not be overwritten later because this stripped-down version knows
nothing about offending SPAM words or phrases.
Get the Subject data
Without getting into the details, the code in Listing 38 attempts to
extract a text line from the message that begins with “Subject:”.
If successful, the data is used to overwrite the contents of the subject
field of the ScreenResult object.
String data; |
Get the From data
Similarly, the code in Listing 39 attempts to extract a text line from
an upper-case version of the message that begins with “From:”.
If successful, the data is used to overwrite the contents of the from
field of the ScreenResult object.
inData.reset(); |
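The general technique of scanning for a header line and then rewinding the reader can be sketched as follows. This is not the code from Listing 38 or Listing 39. The class and method names are hypothetical, and the sketch uses a case-insensitive comparison in place of an upper-case copy of the message. It also assumes that the message is shorter than the read-ahead limit passed to mark. If more characters than that are read, the call to reset will throw an IOException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical illustration, not the code from Listings 38 and 39.
class HeaderScanSketch {
    // Returns the text following the first line that starts with prefix,
    // or null if no such line exists. The reader is rewound before return.
    static String findHeader(BufferedReader in, String prefix,
                             int readAheadLimit) throws IOException {
        in.mark(readAheadLimit); // remember the current position
        String found = null;
        String line;
        while ((line = in.readLine()) != null) {
            // Compare the start of the line with the prefix, ignoring case.
            if (line.regionMatches(true, 0, prefix, 0, prefix.length())) {
                found = line.substring(prefix.length()).trim();
                break;
            }
        }
        in.reset(); // rewind so that the next scan starts at the mark
        return found;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader(
            "From: someone@example.com\r\nSubject: Hello\r\n\r\nbody\r\n"));
        System.out.println(findHeader(in, "Subject:", 8192));
        System.out.println(findHeader(in, "From:", 8192));
    }
}
```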
Get the entire message
The code in Listing 40 attempts to read the entire message and deposit
it in the text field of the ScreenResult object.
inData.reset(); |
Return a boolean value
Finally, the code in Listing 41 returns a boolean value. This
value toggles between true and false as each successive message is
processed. Therefore, it has no meaning insofar as SPAM is
concerned.
Notice: A true return value should not be used to
indicate that you should delete a message from the server.
if(returnValue == false){ |
This boolean value will be stored in the variable named match
in Listing 25, and will be tested in the if statement of
Listing 28.
If the return value is true
If the return value is true, the actionPerformed method will
return immediately in Listing 28, allowing the user to ponder the data
returned by the screenMsg method in deciding whether or not to
delete the message from the server.
Once again, let me caution you not to enable the DELE
code in Listing 42 near the end of the lesson until you are certain
that you actually want to delete messages from the server. If you
do enable it, do not press the Delete button just because this
stripped-down version of the screenMsg method returns true.
If the return value is false
If the screenMsg method returns false, the code in Listing 29
immediately fires a synthetic ActionEvent, attributable to the Start/Next
button, which causes the next message to be retrieved from the server.
Run the Program
I encourage you to copy the code from Listing 42 into
your text editor. Compile and execute the
program. Experiment with it, making changes, and observing the
results
of your
changes.
You may want to modify this code to cause the messages to be stored
in a different location on your disk. If so, modify the string in
the statement
in Listing 17 that reads “c:/MailFiles/”
+ uidl + “.txt” to
specify a different folder. Make certain that the folder
where you plan to save the files exists before running the program.
(Once again, let me caution you not to enable the
DELE
code in Listing 42 until you are certain
that you actually want to delete messages from the server. Once a
message is deleted from the server, there is no way to recover it from
the server.)
Summary
This lesson explains the communications module used to communicate with
your Email server, and to remove SPAM messages from the
server before they are downloaded into your primary Email client.
The program is designed to allow you to use my SPAM screening
algorithm, or to invent your own. I will present the details of
my SPAM screening algorithm in the next lesson in the series.
The version of the program discussed in this lesson has a
stripped-down version of a class named Screen. This version of
the program
makes it possible for you to test the communications module on your
system with your
Email server
without doing any actual screening for SPAM.
The capability to actually delete messages from the server is disabled
in the version of the program shown in Listing 42. You should not
enable that capability until you fully understand what you are doing
and you are certain that you really do want to delete messages from the
server. Once a message is deleted from the server, it cannot be
recovered from the server.
What’s Next?
In the next lesson, I will present and explain my version of the
class named Screen. This class contains my version of a
SPAM screening algorithm. You may want to use my version, replace
my version with an algorithm of your own, or do some combination of the
two.
Complete Program Listing
A complete listing of the program follows in Listing 42. Note
that this listing contains a stripped-down version of the class named Screen.
The full version of the class named Screen will be provided in
the next lesson in this series.
Also, the three DELE statements shown in red in Listing 42
have been purposely disabled to prevent you from accidentally deleting
messages from your server while testing this program.
Do not enable these three statements until you are ready
to actually delete messages from the server. Once a message is
deleted from the server, it cannot be recovered from the server.
Disclaimer of responsibility: If you elect to use this program
you use it at your own risk. Make absolutely certain that you
understand what you are doing before you execute the program. The
author of this program, Richard G. Baldwin, accepts no responsibility
for any losses that you may incur as a result of using this program.
/*File Pop302.java Copyright 2004, R.G.Baldwin |
Copyright 2004, Richard G. Baldwin. Reproduction in whole or
in
part in any form or medium without express written permission from
Richard
Baldwin is prohibited.
About the author
Richard Baldwin
is a college professor (at Austin Community College in Austin, TX) and
private consultant whose primary focus is a combination of Java, C#,
and XML. In addition to the many platform and/or language independent
benefits of Java and C# applications, he believes that a combination of
Java, C#, and XML will become the primary driving force in the delivery
of structured information on the Web.
Richard has participated in numerous consulting projects, and he
frequently provides onsite training at the high-tech companies located
in and around Austin, Texas. He is the author of Baldwin’s
Programming Tutorials, which
has gained a worldwide following among experienced and aspiring
programmers. He has also published articles in JavaPro magazine.
Richard holds an MSEE degree from Southern Methodist University
and has many years of experience in the application of computer
technology to real-world problems.
-end-