Enlisting Java in the War Against SPAM: The Screening Module

Java Programming Notes # 2152


Preface


This is the second lesson in a series designed to teach you how to
write
a Java program to remove SPAM from your Email server before you
download it into your primary Email client.  The first lesson was
entitled Enlisting Java in the War Against SPAM, Part 1, The
Communications Module.

The communications module

The first lesson explained the communications module used to
communicate with
your Email server, and to remove SPAM messages from the
server.

SPAM screening algorithm

The program is designed to allow you to use my SPAM screening
algorithm, or to invent your own.  This lesson explains the
inner workings of my SPAM screening algorithm.  You can use my
algorithm as a starting point if you decide to invent your own.

Training the algorithm

The next two lessons will
explain how my algorithm can be trained to do an increasingly better
job of screening SPAM over time.

Viewing tip

You may find it useful to open another copy of this lesson in a
separate browser window.  That will make it easier for you to
scroll back and forth among the different listings and figures while
you are reading about them.

Supplementary material

I recommend that you also study the other lessons in my extensive
collection of online Java tutorials.  You will find those lessons
published at Gamelan.com.
However, as of the date of this writing, Gamelan doesn’t maintain a
consolidated index of my Java tutorial lessons, and sometimes
they are difficult to locate there.  You will find a consolidated
index at www.DickBaldwin.com.

Preview

Can you write better SPAM screening algorithms?

Did you ever think that you might be able to write better SPAM
screening algorithms than those available in the SPAM screening
software that you are now using?  If so, this series of lessons is
for you.

Even if that is not the case, like most of us, you are probably
overwhelmed by SPAM
and therefore you may find this lesson interesting.

Remove SPAM from the server

In this and the previous lesson, I am showing you how to write a
Java program
that supplements the SPAM screening software that you are currently
using.  This program is used to identify and remove SPAM from your
Email server before it is downloaded into your primary Email client.

Any SPAM that makes it past this program can be further acted upon
by the SPAM screener that is built into your Email client.

The communications module

This series consists of (at least) four lessons.  The
first lesson in the
series explained the communications module used to communicate with
your Email server, and to remove SPAM messages from the
server.

My SPAM screening algorithm

As mentioned above, this program is designed to allow you to invent
and implement your own
SPAM screening algorithm in addition to, or as an alternative to my
algorithm.

This lesson explains the inner
workings of my SPAM screening algorithm.  My algorithm operates
separately on the Subject line, the From line,
and the body text of each Email message.

Algorithm training programs

The third lesson will explain a companion program named Pop302d,
designed to make
use of historical data to train the algorithm to do a better job
of identifying SPAM in future messages based on the Subject
of the message.

The fourth lesson will explain another companion program named Pop302e,
designed to
make use of historical data to train the algorithm to do a
better job of identifying SPAM based on the body text of
the message (which includes the From line).

Because of the need to train the algorithm, and the ease with which
these companion programs make that possible, the companion programs are
as important as the main program.

Operational sequence

Here is the typical operational sequence that I go through each
morning to remove SPAM from my Email server before downloading the
remaining messages into my primary Email client, and to train the
algorithm to recognize future SPAM messages like those that made it
through the screen that morning.

  1. Run the main program named Pop302 (explained in this and the
     previous lesson) to identify SPAM and remove it from the server.
     This normally allows a few SPAM messages (stragglers, typically
     about ten percent) to get through. The stragglers are stored in a
     history folder on my local disk.
  2. Run the program named Pop302d (explained in the next lesson) to
     train the algorithm to recognize the stragglers as SPAM based on
     information in the Subject line.
  3. Run the program named Pop302e (explained in Part 4) to train the
     algorithm to recognize the stragglers as SPAM based on information
     in the body text.
  4. Run the main program named Pop302 again to remove the straggler
     SPAM messages from the server.
  5. Run my primary Email client to download the remaining good
     messages into my local Email inbox.

When I am in a hurry …

However, it isn’t necessary to perform all of these steps every
day.  On those mornings when I am in a hurry, I skip steps
2, 3, and 4, leaving the straggler messages in the local history folder
for use later.

(The straggler messages will, of course, end up in my
local Email inbox when I run my primary Email client without purposely
removing them from the server beforehand.)

Sometime later (perhaps the next day or several days later)
I will perform steps 2 and 3 to train the algorithm to recognize future
SPAM
messages represented by the characteristics of the messages that have
been saved in the local
history folder.

Effectiveness of my algorithm

After about one week of training, my
algorithm was reliably identifying about ninety percent of all SPAM
messages, allowing me to delete them from my Email server before
downloading them into my primary Email
client.  By executing steps 2, 3, and 4 above, I am able to also
eliminate the remaining ten percent of the SPAM messages before
downloading them into my primary Email client.

Discussion and Sample Code


The full Screen class

The version of the program that I discussed in the previous lesson
contained a stripped-down version of a class named Screen. That
version of the program made it possible to test the communications
module on your system with your Email server without doing any actual
screening for SPAM.

I will explain the full version of the class named Screen in
this lesson.  In so doing, I will explain my algorithm for identifying
SPAM.

Purpose of the program

The purpose of this program is to read messages from a POP3 (Post
Office Protocol – Version 3)
server, to
analyze the messages according to a set of screening rules, and to
delete the messages that fail the screening test from the server.

(As written, the program asks the user to confirm the
deletion of each message from the server, but this confirmation step
could easily be
removed if you decide to do so.)

Key words and phrases

My SPAM screening algorithm screens for SPAM on the basis of words
or
phrases in the From line, words or phrases in the Subject
line, and
words or phrases in the body text.

Friendly Email addresses and subjects

A list of friendly Email addresses and friendly subjects is used to
screen the From
line and the Subject line.  Messages that are from
friendly Email addresses, and messages that have known good Subject
lines are preserved on the server and no information about those
messages is
saved on the local disk. They are simply ignored after determining that
they are friendly.

Different lists for Subject and body text

Different lists of words and phrases are used for screening Subject
lines and body text for SPAM. This is important because the same set of
words
and phrases can’t always be used for both cases.

For example, the word ANTIVIRUS is appropriate for screening the
Subject line, but is not appropriate for screening the
body text. The word ANTIVIRUS often appears legitimately in the header
of Email messages that have been scanned for viruses by the server, but
it also often appears in the Subject line of SPAM messages.

Common spammer tricks are defeated

Several common spammer tricks are defeated by my SPAM screening
algorithm.

For example, the common spammer trick of inserting extra characters
between the
characters in an offending word or phrase is defeated.  Also, the
common trick of mixing the case of the
characters in an offending word or phrase is defeated.

As a specific example, my algorithm will recommend deletion of any
message having
any of the following in its Subject line or its body
text if the word VIAGRA is included in the lists used to screen for
SPAM:

vIaGrA
V.IagRA
V.I.A.G.R.A

Very important characteristics

These two characteristics of the algorithm alone have a significantly
positive
impact on the effectiveness of training the algorithm to do a better
job of
identifying SPAM in the future.

(You don’t have to identify all of the variations of a
word or phrase commonly used by spammers to fool the system.  The
program does that for you automatically.)

My algorithm also defeats the common trick of appending random
characters to the end of the Subject line, because it
doesn’t require a
match for the entire Subject line.  Rather, it
searches for words
or phrases internal to the text of the Subject line.
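The tricks described above can be illustrated with a minimal sketch. This is not the matching code from Listing 34. It is an independent illustration built on java.util.regex, and the names toPattern, containsPhrase, and maxGap are hypothetical. It assumes that a fixed number of extra characters may be tolerated between the characters of an offending phrase, which may differ from the rule that my algorithm actually applies.

```java
import java.util.regex.Pattern;

class ObfuscationSketch {

    // Builds a case-insensitive pattern that tolerates up to maxGap
    // arbitrary characters between consecutive characters of the phrase.
    // maxGap is an assumed tuning parameter, not a value from the lesson.
    static Pattern toPattern(String phrase, int maxGap) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < phrase.length(); i++) {
            if (i > 0) {
                regex.append(".{0,").append(maxGap).append("}");
            }
            // Quote each character so that '.', '*', etc. are taken literally.
            regex.append(Pattern.quote(String.valueOf(phrase.charAt(i))));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // find() searches anywhere in the line, so random characters appended
    // to the end of a Subject line do not prevent a match.
    static boolean containsPhrase(String line, String phrase, int maxGap) {
        return toPattern(phrase, maxGap).matcher(line).find();
    }

    public static void main(String[] args) {
        String[] subjects = {"vIaGrA", "V.IagRA", "V.I.A.G.R.A xqzt"};
        for (String subject : subjects) {
            System.out.println(
                subject + " -> " + containsPhrase(subject, "VIAGRA", 1));
        }
    }
}
```

A tolerant match of this kind also explains why some short words cause trouble: with a gap of one character, a search for SLUT will find a hit inside the word SOLUTION.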

The user interface

Figure 1 shows the GUI through which the user controls the program.

Figure 1 Graphical User Interface

(Note that this GUI was purposely made narrow to cause
it to fit into this narrow publication format.  I
recommend that you increase the width of the Frame to at least 750
pixels, and increase the width of the TextField and TextArea objects to
at least 100 characters each.

Note also that this is an actual SPAM message, from which I purposely
removed the Email address of the sender prior to publication.  The
message may not have actually been sent by the individual whose Email
address appeared on the From line.)

The Offending Phrase

When the program identifies a message that is a candidate for deletion,
the reason for that recommendation is shown in the third text field
from the top in Figure 1.

Deleting a message from the server

The user confirms that the message should be deleted from the server by
clicking the Delete button
in Figure 1. If the user doesn’t want to delete the message, the user
should click the Start/Next button instead.

(Note that the capability to actually delete messages
from the server was disabled in the program shown in Listing 34 near
the end of this lesson.  Make certain that you are ready to
actually delete messages from the server before you enable that
capability.)


Information available at decision time

As currently written, this program requires the user to confirm the
actual deletion of each SPAM message from the server before that
message is actually deleted.

At the point in time that the user is required to confirm deletion of a
message
from the server, the following information is available to assist the
user in
making
the decision:

  • From line
  • Subject line
  • Offending line of text, which may or may not be the subject
  • Offending word or phrase in the offending line of text
  • Entire raw text of the message down to and including the
    offending line

No images are rendered

No images are rendered by the program, so it is not necessary for the
user to view offending images in order to make the decision to delete.

Deletion is not required

Having viewed the above information, if the user is still unable to
make
an informed decision to delete the message, the user has the
option to let the message pass through and to be downloaded into his
primary Email client.  Once having viewed the message in the
primary Email client, the user still has the option of updating the
offending word lists with IP addresses, URLs, etc., so that
deletion decisions on future similar messages will be easier to make.

Saved in local archive folder

The raw text of every message that is identified as a candidate for
deletion from the server is saved in an archive folder on the local
disk, regardless of whether the user elects to delete the message from
the server. Thus, if a message is deleted from the server and it is
later determined that the deletion was a mistake, a raw text copy of
the deleted message is available locally in the archive folder.
(You should probably empty this folder periodically so
that it won’t fill up your disk.)

In addition, I have plans to write several additional programs that
will analyze large numbers of SPAM messages in the archive folder for
at least two purposes:

  • To remove from the word lists those words and phrases that occur
    in only a very small percentage of SPAM messages, and that therefore
    increase run time without contributing significantly to the desired
    result.
  • To search for common characteristics among SPAM messages that can
    be used to improve the effectiveness of the screening algorithm.

Saved in history folder

Except for messages from friendly Email addresses and messages with
friendly Subject lines, all messages that
are not identified as candidates for deletion from the server are saved
in a history
folder on the local disk.  These messages are used later to train
the algorithm to do a better job of identifying future SPAM
messages.  I will explain this training process in Part 3 and Part
4 of this
series of lessons.

Protection against viruses

Before any message is saved in a local file, asterisks are inserted
into the text on ten-character intervals in an attempt to destroy any
virus code that may be embedded in the message.
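The asterisk insertion described above can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that one asterisk is inserted after every ten characters of the original text. The exact placement used by the program in Listing 34 may differ.

```java
class DefangSketch {

    // Returns a copy of text with an asterisk inserted after every
    // `interval` characters, so that no run of the original text longer
    // than `interval` characters survives intact in the saved file.
    static String insertAsterisks(String text, int interval) {
        StringBuilder out =
            new StringBuilder(text.length() + text.length() / interval);
        for (int i = 0; i < text.length(); i++) {
            if (i > 0 && i % interval == 0) {
                out.append('*');
            }
            out.append(text.charAt(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints abcdefghij*klmnopqrst*uvwxy
        System.out.println(insertAsterisks("abcdefghijklmnopqrstuvwxy", 10));
    }
}
```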

If a message makes it through the screen and is later identified as
having a virus as an attachment, a series of ten or more bytes can be
extracted from the virus code and added to the word list as an
offending phrase.  This should cause any future messages having
that
same virus code as an attachment to be identified as a candidate for
deletion from the server.

Training programs

Companion programs that I have written are used to analyze the
non-deleted message files
saved
locally in the history folder in order to train the algorithm to do a
better job of identifying SPAM messages in the future.

These programs
are designed for extreme ease of use to encourage the user to train the
algorithm frequently.  The better the algorithm is trained, the
better it will perform.

I will explain these training programs in detail in Part 3 and Part 4
of this
series of
lessons.  A brief preview
of the training programs is provided below.

Simple text files

All three word lists are maintained in local text files, which can be
created and edited with an ordinary text editor if need be.  Thus,
if one of the lists becomes corrupted, it is easy to correct the
situation using an ordinary text editor.

File names

The following file names are hard-coded into the program.  You may
want to change these file names for your version of the program.

  • Local copy – the unique file name for a local copy of each
    message is
    based on the
    unique identifier for that message (UIDL) obtained from the
    mail server.
  • Pop302a.txt – contains a word list for screening the Subject
    line for offensive words and phrases.
  • Pop302b.txt – contains a word list for screening the body
    text
    for offensive words and phrases.
  • Pop302c.txt – contains a list of friendly Email addresses
    and
    friendly subjects for
    screening the From and Subject lines to
    identify friendly messages.

Location of the text files

As written, the program requires the three .txt files to
be in the same folder as the compiled .class files for
the programs named Pop302, Pop302d, and Pop302e.
However, you can easily modify the programs to change the location of
the .txt files if
you choose to do so.  Just be sure to change the location in all
three programs.

The local copies of the messages are stored in two different
folders.  Some of the
local copies are stored in a history folder while the remainder are
stored in an archive folder.  The locations of these folders on
the disk are hard-coded into the three programs.  You can change
the locations if you like, but be sure to make appropriate changes to
all three programs.

Three classes

This program consists of two main classes and one minor class. As
discussed in the previous lesson, an
object of the class named
Pop302 handles all communications with the POP3 server.

A method belonging to an object of the class named Screen is
used to screen each message in an
attempt to identify SPAM.  This is the class that I will explain
in this lesson.

This class can be totally replaced by Java programmers who choose to
design their own screening algorithm provided that they maintain the
interface with the object of the class named Pop302.

An object of a very simple class named ScreenResult is used as
a wrapper to return several items of information from the screening
method to the calling method.

Testing

The program was tested using SDK 1.4.2 under WinXP in conjunction with
two different POP3 Email servers.

Will discuss in fragments

I will discuss the class named Screen in fragments.  A
complete listing of
the program is provided in Listing 34 near the end of the lesson. 
You should be able to copy and paste that listing into your Java IDE to
compile and test the program on your system.

Improvements in the class named Pop302

Before getting into the details of the class named Screen, I
want to mention that the program shown in Listing 34 contains a couple
of improvements relative to the version explained in the previous
lesson.

One of the improvements involves displaying the message number in the
bottom of the text area of Figure 1.

The other improvement involves making a safety check to confirm that
the message number being maintained locally is in synchronization with
the message number on the server (in the UIDL) before deleting
a message from the server.

If you understand the rest of the program, these two modifications
should not require a detailed explanation.
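For readers who want to see the idea behind the safety check in isolation, here is a minimal sketch. It is not the code from Listing 34. It assumes that the program sends the POP3 command UIDL followed by the message number, and that the server answers with a single line of the form "+OK", the message number, and the unique identifier, as POP3 servers do for that command. The class and method names are hypothetical.

```java
class UidlCheckSketch {

    // Returns true only if the server's one-line reply to "UIDL <msgNumber>"
    // is a positive response that names the same message number and the
    // same unique identifier that was recorded locally.
    static boolean sameMessage(
            String reply, int msgNumber, String expectedUidl) {
        if (reply == null) {
            return false;
        }
        String[] parts = reply.trim().split("\\s+");
        return parts.length == 3
            && parts[0].equals("+OK")
            && parts[1].equals(String.valueOf(msgNumber))
            && parts[2].equals(expectedUidl);
    }

    public static void main(String[] args) {
        String reply = "+OK 2 QhdPYR:00WBw1Ph7x7";
        // Issue the DELE command only when this check succeeds.
        System.out.println(sameMessage(reply, 2, "QhdPYR:00WBw1Ph7x7"));
    }
}
```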

Deletion of messages is disabled

Also before getting into the details of the Screen class I
want to show you a fragment containing three statements that are
disabled in the Pop302 class in Listing 34.  The three
disabled statements are shown in Listing 1.  (Note that the
statements are separated by comments in Listing 34.)

/*Begin comment block
outputStream.println(
    "DELE " + msgNumber);
textArea.append(
    "DELE " + validateOneLine() + "\n");
textArea.append(
    "Deleted:" + msgNumber + "\n");

End comment block*/
Listing 1

The three statements shown in Listing 1 were purposely disabled (by
including them in a comment block)
to prevent you from accidentally
deleting messages from the server during your early testing of the
program. Do not enable these three statements until you are ready to
actually delete messages from the server.  At that point in time,
you can enable the three statements by removing the comment indicators
that surround them.

The Screen class

The Screen class implements a set of rules for identifying SPAM
messages and for recommending whether or not a message should be
deleted from the server.

If you have a better way to identify SPAM, you can replace this class
by a completely different class definition, so long as you maintain the
interface with the object of the class named Pop302.

An object of this class has one entry point and one exit point, which
is the public instance method of the Screen class named screenMsg.

A callback to the GUI

However, there is an additional linkage between the two objects that
you need to
consider.  The constructor for the Screen class receives
a reference to the GUI object
created by instantiating the class named Pop302. A method in
the object of the Screen class uses that reference to display
progress on the text area belonging to the GUI.

This display of progress is comforting on those occasions when a very
long message is encountered and the user needs assurance that the
system is still working, and isn’t hung up.

This callback link could easily be eliminated by deleting code from
several locations in the Screen class and removing the
callback parameter from the constructor.
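As an alternative to removing the callback altogether, the link can be loosened by having the screening object report progress through a small interface rather than through a reference to the GUI class itself. The following is only a sketch of that idea. The names ProgressListener and ScreenerSketch are hypothetical, and the Screen class in this lesson receives a Pop302 reference instead.

```java
import java.util.List;

// Hypothetical callback interface. Any object that can display a line of
// text (a GUI, a logger, a test harness) can implement it.
interface ProgressListener {
    void progress(String note);
}

class ScreenerSketch {
    private final ProgressListener listener;

    // The constructor saves the callback reference for later use.
    ScreenerSketch(ProgressListener listener) {
        this.listener = listener;
    }

    // Walks through the lines of a message and reports progress once
    // every `every` lines, which reassures the user on long messages.
    int scan(List<String> lines, int every) {
        int count = 0;
        for (String line : lines) {
            count++;
            if (count % every == 0) {
                listener.progress("Processed " + count + " lines");
            }
        }
        return count;
    }
}
```

With this arrangement, the GUI class would implement ProgressListener and append each note to its text area.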

Beginning of the Screen class

The Screen class begins in Listing 2, which declares several
instance variables.

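class Screen{

  TreeSet subjWordList;     //phrases for screening the Subject line
  TreeSet bodyWordList;     //phrases for screening the body text
  TreeSet friendlyWordList; //friendly addresses and subjects
  Pop302 theGui;//save callback reference here
  String phrase;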

Listing 2

The purpose of these
instance variables will become clear as I discuss the code in which
they are used.

The constructor

The constructor for the Screen class is shown in its entirety
in Listing 3.

  Screen(Pop302 theGui){//constructor
    this.theGui = theGui;

    makeSubjWordList();
    makeBodyWordList();
    makeFriendlyWordList();
  }//end constructor

Listing 3

As you can see, the
constructor receives and saves a reference to the GUI.  This
reference is used later to display progress as discussed above.

Make word lists as TreeSet objects

The last three statements in the constructor invoke methods that read
text files containing
lists of words or phrases, and create TreeSet objects
containing those words and phrases.  These TreeSet
objects are used later to test for the occurrence of the words or
phrases in raw text versions of Email messages.

The TreeSet objects are created and populated by invoking three
very similar methods:

  • makeSubjWordList
  • makeBodyWordList
  • makeFriendlyWordList


I will discuss each of these methods in the sections that follow.

The makeSubjWordList method

The purpose of the makeSubjWordList method is to
create a TreeSet object containing words and phrases used
later to screen the message Subject lines.

The makeSubjWordList method is shown in Listing 4.
 
This method reads strings from a text file named Pop302a.txt
and creates the list as a TreeSet object. 

  private void makeSubjWordList(){
    subjWordList = new TreeSet();

    try{
      BufferedReader inData
             = new BufferedReader(new FileReader(
                                  "Pop302a.txt"));
      String data; //temp holding area

      while((data = inData.readLine()) != null){
        subjWordList.add(data);

      }//end while loop
      inData.close();//Close file
    }catch(Exception e){e.printStackTrace();}
  }//end makeSubjWordList

Listing 4

Why use the TreeSet class?

The TreeSet class was chosen for this purpose because it
eliminates duplicates.

(Duplicates in the list are bad because they increase
runtime with no beneficial effect. 
One of the major problems with the message filter in the commercial
Email client
program that I use is that
there is no way to avoid duplicates other than simply remembering that
an item was previously placed in the filter.)

With my screening algorithm, even if the user creates duplicates in the
text
file while training the algorithm, duplicates are eliminated from the TreeSet
object and also from the text file before
actual processing begins.

The code in Listing 4 is
straightforward and shouldn’t require further explanation.
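The same idea can be expressed with language features that were not available in SDK 1.4.2. The following sketch assumes Java 7 or later (generics and try-with-resources). The class and method names are hypothetical, and the method reads from any Reader so that it can be tried without a file on disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.TreeSet;

class WordListSketch {

    // Reads one word or phrase per line into a TreeSet. The set silently
    // drops exact duplicates and keeps the entries in sorted order.
    static TreeSet<String> readWordList(Reader source) throws IOException {
        TreeSet<String> words = new TreeSet<>();
        // try-with-resources closes the reader even if readLine fails.
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                words.add(line);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        TreeSet<String> list =
            readWordList(new StringReader("VIAGRA\nMORTGAGE\nVIAGRA\n"));
        System.out.println(list); // prints [MORTGAGE, VIAGRA]
    }
}
```

To read the actual list, pass new FileReader("Pop302a.txt") as the source. Note that a TreeSet of String objects compares entries exactly, so VIAGRA and Viagra would be kept as two separate entries.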

The makeBodyWordList method

The purpose of the makeBodyWordList method is to create a TreeSet
object containing words and phrases used later to screen the text
in the body of the message.

Separation of lists is important

It is important to maintain separate lists for screening the Subject
line and the body text.  Because of the larger number of
characters in the body text, false positives are more likely when
screening the body text.

(A false positive arises when a message that is not SPAM
fails one of the SPAM screening rules and is identified as SPAM by the
screening algorithm.)

Some words work well and some don’t

Therefore, some words and phrases that work well when screening the Subject
line may produce false positives when screening the body text.
For example, the common spammer word SLUT appears in the word SoLUTion
with only one character separating the S and the L.  It is much
more likely that the word SOLUTION will appear somewhere in
the body text than in the Subject line (although it
may appear in the Subject line as well, thus producing a false
positive in either case).

On a more definitive note, the word ANTIVIRUS works well when
screening the Subject line, but cannot be used to
screen the body text.  Many servers insert the word ANTIVIRUS
into the message header after they test the message for viruses. On the
other hand, the word ANTIVIRUS often appears in the Subject
line of SPAM messages.

IP addresses and URLs

IP addresses and URLs can be very useful in identifying SPAM during the
screening of the body text.  However, they rarely occur in the Subject
line. Therefore, testing the Subject line
against a long list of IP addresses and URLs simply wastes computer
time.

Some words to avoid

The following words (among others) probably should not be
included in the list used to screen body text for the reasons
given.  Undoubtedly you will identify other words and phrases that
should be excluded from the list as you gain experience with the system.

  • PORN may be confused with IMPORTANT.
  • SPAM causes lots of false positives. As a remedy, I inserted a
    space following the M as in “SPAM ” to decrease false positives. (This
    may also decrease valid hits as well.)
  • ANTIVIRUS appears in some valid message headers.
  • WEIGHT often appears in messages regarding HTML fonts.
  • SLUT may be confused with SOLUTION.

The makeBodyWordList code

The code for the makeBodyWordList method is shown in Listing
5. 
This method reads strings from a text file named Pop302b.txt
and creates the list as a TreeSet object.

  private void makeBodyWordList(){
    bodyWordList = new TreeSet();

    try{
      BufferedReader inData
             = new BufferedReader(new FileReader(
                                  "Pop302b.txt"));
      String data; //temp holding area

      while((data = inData.readLine()) != null){
        bodyWordList.add(data);

      }//end while loop
      inData.close();//Close file
    }catch(Exception e){e.printStackTrace();}
  }//end makeBodyWordList

Listing 5

This code is
straightforward and should not require explanation.

The makeFriendlyWordList method


The purpose of the makeFriendlyWordList method is
to
create a TreeSet object containing words and phrases used to
pre-screen the message From and Subject lines
before screening against the SPAM lists.  The objective of this
pre-screening step is to identify messages that claim to be from an
approved list of senders (often referred to as a white list),
or messages with a known good Subject line.

Messages that have From or Subject lines
matching the words or phrases in this list are not deleted from the
server, and are not subjected to screening for SPAM.

Format for Email addresses

When adding Email addresses to the list contained in the text file,
only the primary portion of the friendly Email address should be
included.  For example, you will often see an Email address
presented as follows:

Mary Smith <msmith@somewhere.com>

In this case, only the following portion should be included in the
friendly list:

msmith@somewhere.com

This is the portion that is most likely to remain stable over
time. 
The remaining portion is simply window dressing added to the primary
Email address by the program used to compose and address the message.
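The pre-screening step and the address format described above can be sketched as follows. This is an illustration only. The names are hypothetical, and the sketch assumes that a friendly entry counts as a match when it appears anywhere in the From line or the Subject line, ignoring case. The matching rule in Listing 34 may differ.

```java
import java.util.Locale;
import java.util.Set;

class FriendlySketch {

    // Extracts the primary address from a value such as
    // "Mary Smith <msmith@somewhere.com>". If there are no angle
    // brackets, the trimmed input is returned unchanged.
    static String primaryAddress(String from) {
        int open = from.indexOf('<');
        int close = from.indexOf('>', open + 1);
        if (open >= 0 && close > open) {
            return from.substring(open + 1, close).trim();
        }
        return from.trim();
    }

    // Returns true if any entry in the friendly list occurs in the From
    // line or in the Subject line (assumed case-insensitive substring test).
    static boolean isFriendly(
            String fromLine, String subjectLine, Set<String> friendlyList) {
        String from = fromLine.toUpperCase(Locale.ROOT);
        String subject = subjectLine.toUpperCase(Locale.ROOT);
        for (String entry : friendlyList) {
            String key = entry.toUpperCase(Locale.ROOT);
            if (key.isEmpty()) {
                continue; // a blank line in the file must not match everything
            }
            if (from.contains(key) || subject.contains(key)) {
                return true;
            }
        }
        return false;
    }
}
```

Because only the primary address is stored in the list, the test still succeeds if the sender's display name changes from one message to the next.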

Code for the makeFriendlyWordList method

The makeFriendlyWordList method is shown in Listing 6. 
This method reads strings from a text file named Pop302c.txt
and creates the list as a TreeSet object.

  private void makeFriendlyWordList(){
    friendlyWordList = new TreeSet();

    try{
      BufferedReader inData
             = new BufferedReader(new FileReader(
                                  "Pop302c.txt"));
      String data; //temp holding area

      while((data = inData.readLine()) != null){
        friendlyWordList.add(data);

      }//end while loop
      inData.close();//Close file
    }catch(Exception e){e.printStackTrace();}
  }//end makeFriendlyWordList

Listing 6

This code is completely
straightforward and therefore shouldn’t require further explanation.

The screenMsg method

Up to this point, the code that I have presented has been rather
mundane.  However, things should start getting a little more
interesting at this point.

Setting the stage

The statement in Listing 7 was extracted from the Pop302 class
in Listing 34.  This is the statement that ties the communication
module (an object of the Pop302 class) to the
screening module (an object of the Screen class).

  match = screener.screenMsg(
                     fileName,uidl,theResult);

Listing 7

Where are we at this point?

At this point in the execution of the program, the communication module
has retrieved a message from the server and has written it into a file
on the local disk with the path and file name given by fileName.

(The file
name is based on the server’s unique identifier for the message, given
by uidl in Listing 7.)

Pass the file to the screenMsg method

The communication module passes the file name for the disk file
containing a raw text
copy of the message to the method named screenMsg where it will
be screened for SPAM.

The screenMsg method needs fileName in order to read the file from the
disk to perform the screen.

Why does the screenMsg method need uidl?


If the file is identified as SPAM, it will be moved from the history
folder to an archive folder.  In order to do that, the
screenMsg method needs to create a path
and file name pointing to the archive folder.  For this, it needs
the unique identifier, uidl.

(Obviously I
could have parsed fileName inside the
screenMsg method to get uidl, but I
found it easier to simply pass it as a parameter to the
screenMsg method.)
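The move from the history folder to the archive folder can be sketched with the java.nio.file API, which requires Java 7 or later and is therefore not what the program in Listing 34 uses. The class name, the method name, and the .txt extension are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ArchiveSketch {

    // Moves a message file into the archive folder, naming the target
    // after the server's unique identifier for the message.
    // Caution: a UIDL may contain characters that are not legal in file
    // names on some systems. This sketch does not sanitize it.
    static Path moveToArchive(Path historyFile, Path archiveDir, String uidl)
            throws IOException {
        Files.createDirectories(archiveDir); // no-op if it already exists
        Path target = archiveDir.resolve(uidl + ".txt");
        return Files.move(
            historyFile, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```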

What is theResult?

In addition to fileName and uidl, the method needs a
reference to an empty object of the class ScreenResult.  That object’s
reference is represented by theResult in Listing 7.  The
screenMsg method populates this object with several pieces of data in
order to send information back to the calling method in the
communications module.

The screenMsg method returns boolean


The screenMsg method returns a boolean value, which is assigned
to match in Listing 7.

If the return value is true, the screenMsg method has concluded that the
message is SPAM and is a
candidate for deletion from the server.

(Recall,
however, that the
communications module allows the user to make the final decision
regarding deletion of the message from the server.)


If the return value is false

If the return value is false, the screenMsg method has found nothing to indicate
that the message is SPAM.

(The message
might not be SPAM, or it might be a form of SPAM that the algorithm
doesn’t yet know how to identify.  If it is deemed by the user
that the message is SPAM, the message will be used later to train the
algorithm to recognize that form of SPAM in future messages.)

About 90 percent effective

As of this writing, I am finding that the algorithm is able to identify
about 90 percent of SPAM messages on the average.  The remaining
ten percent of the SPAM messages are used to further train the
algorithm to recognize spam of that type in the future.  I am
hopeful that this performance will improve in the future as the
algorithm becomes better trained.

Beginning of the screenMsg method

The code for the screenMsg method begins in Listing 8.

  public boolean screenMsg(String fileName,
               String uidl,ScreenResult theResult){

    //Initialize return value to false
    boolean match = false;

Listing 8

As you can see, the method
signature matches my earlier description with respect to the statement
in
Listing 7 that calls the screenMsg method.

The code in Listing 8 initializes the variable named match to
false.  This is the value that will be
returned from the method if it is not overwritten later by the
discovery of a match between a test phrase and a text line in the
message.

Purpose of the screenMsg method

The screenMsg method is used to identify messages that are candidates
for deletion from the server. Such identification is based on analyzing
the file in which the message is stored locally, and comparing that
file with the contents of TreeSet objects populated earlier with the
contents of the files named Pop302a.txt, Pop302b.txt, and Pop302c.txt.

A return value of true means that the
message is identified as SPAM and should be deleted from the server.

Returned String values

In addition to the boolean return value, references to four String
objects
are encapsulated in the incoming object of type ScreenResult.
The populated object is used later by the calling method in the
communication module.  These String objects
represent:

  • The text of the message’s Subject line.
  • The text of the message’s From line.
  • The offending word or phrase (if any) that was found in
    the Subject line or in the body of the message, which
    includes the From line.
  • The raw text of the message down to the line that includes the
    offending word or phrase, or the entire raw text of the message if no
    offending words or phrases were found.
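
As a rough illustration of how such an output object can carry these four strings back to the caller, here is a minimal sketch.  Only the field names (subject, from, thePhrase, and text) come from the listings in this lesson.  The initial values, the class named ScreenResultDemo, and the method named fakeScreen are assumptions made for the example, and the real ScreenResult class may be defined differently.

```java
// Hypothetical sketch of the output object. Only the four field names
// are taken from the listings; the initial values are assumptions.
class ScreenResult {
    String subject = "";   // text of the Subject line
    String from = "";      // text of the From line
    String thePhrase = ""; // matching word or phrase, if any
    String text = "";      // raw message text down to the matching line
}

class ScreenResultDemo {
    // Stand-in for screenMsg (hypothetical): populates the object that
    // the caller passed in and returns the boolean verdict.
    static boolean fakeScreen(ScreenResult theResult) {
        theResult.subject = "SUBJECT: V1AGRA";
        theResult.thePhrase = "V1AGRA";
        theResult.text = theResult.text + theResult.subject + "\n";
        return true;
    }

    public static void main(String[] args) {
        ScreenResult theResult = new ScreenResult();
        boolean match = fakeScreen(theResult);
        // The caller reads the populated fields after the method returns.
        System.out.println(match + " / " + theResult.thePhrase);
    }
}
```

Because the caller and the screening method share a reference to the same object, the method can return a simple boolean and still hand back several strings.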

Display information on the GUI for benefit of the user

This information is displayed in the GUI in Figure 1 by the
communication module, which is an object of the class named Pop302.
This information is presented to help the user make an informed
decision regarding deletion of the message from the server.

(If the screenMsg method returns false, the
program doesn’t pause for the user to make such a decision, and
processing of the next message on the server begins immediately, with
all of the above information having been removed from the GUI.)

Refer back to Figure 1

The list of items placed in the ScreenResult
object includes the offending word or phrase that was found in
the Subject line or in the body of the message.  Referring back to Figure
1, the Offending Phrase for the message shown
in Figure 1 was V1AGRA.

(This was an
easy one because there were no extraneous characters inserted in the
offending word, although the spammer did use a numeral 1
character in place of an I.)

The Subject line

Also in Figure 1, the Offending Phrase was found in the Subject
line, which means that the program made the decision very quickly
(it didn’t have to examine a large amount of body text in order to make
a decision).

Data in the text area

As shown in the large text area in Figure 1, this was Message Number 5
in the
dropbox on the server.

The raw message text is displayed in the text area down to the line
that contained the Offending Phrase, which in this case was the Subject
line.

The From line

The top-most text field in Figure 1 originally contained an Email
address that
purportedly was the address of the sender of the message. 
However, I suspect that the person identified by that Email address
wasn’t actually the sender of the message, so I deleted the Email
address before publishing this image.

(On the
basis of the earliest RECEIVED: FROM line in the raw message text in
Figure 1, the
message appears to have been sent by a computer having an IP address of
204.85.84.207.  However, given the identity of the organization to
which that IP address is assigned, (according to a WHOIS Database
Search at http://www.arin.net/whois/)
this seems somewhat unlikely as well.  But, one never knows who
may be spamming.  Maybe that computer is infected with a Trojan
horse that is broadcasting SPAM messages without the knowledge of its
owner.)

Open the disk file for reading

The code in Listing 9 opens the file containing the local copy of the
raw message text for reading.  Listing 9 also declares a local
variable named data that will be used in the file reading
process.

    try{
      BufferedReader inData
          = new BufferedReader(new FileReader(
                                       fileName));
      String data;//temp holding area

Listing 9

(Recall from
the previous
lesson that asterisks were inserted into the data on ten-character
intervals in an attempt to destroy any executable virus code that may
be included in the byte stream.)

Prepare to process the file

The code in Listing 10 is executed in preparation for processing the
file containing the raw message data.

      inData.mark(10000);

      theResult.subject = "No Subj line found";
      theResult.from = "No From line found";

Listing 10

Mark the beginning of the input stream

The first statement in Listing 10 marks the beginning
position in the input stream. Subsequent calls to reset will
attempt to reposition the stream to this point.

This mark will be used later to rewind the stream to the beginning.
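
The following minimal sketch shows the mark and reset behavior in isolation.  It reads from a StringReader instead of a disk file so that it is self-contained, and the class and method names are hypothetical.  Note that the argument to mark is a read-ahead limit: according to the BufferedReader documentation, an attempt to reset the stream after reading that many characters or more may fail.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class MarkResetDemo {
    // Reads the first line, rewinds to the mark, and reads it again.
    static String[] readTwice(String text) {
        try (BufferedReader inData =
                 new BufferedReader(new StringReader(text))) {
            inData.mark(10000);           // remember the current position
            String first = inData.readLine();
            inData.reset();               // rewind to the marked position
            String again = inData.readLine();
            return new String[] {first, again};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] lines = readTwice("Subject: hello\nFrom: someone\n");
        System.out.println(lines[0].equals(lines[1])); // prints true
    }
}
```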

Populate the ScreenResult object


The last two statements in Listing 10 populate two of the fields in the
ScreenResult object with default values, just in case one or the
other of the corresponding lines are not found in the message.  If
the lines are found in the message, these default values will be
overwritten with the actual data from the message.

The removeStars method

Before going any further, I am going to put the discussion of the screenMsg
method on hold for a moment and discuss the removeStars method
shown in Listing 11.

  private String removeStars(String stringIn){
    StringBuffer stringBuf =
                      new StringBuffer(stringIn);
    int index = 0;
    while(index > -1){
      index = stringBuf.lastIndexOf("*");
      if(index > -1){
        stringBuf.delete(index,index+1);
      }//end if
    }//end while
    stringBuf.append("**");
    return new String(stringBuf);
  }//end removeStars()

Listing 11

The purpose of this method is to remove the asterisks that were
inserted into the data by the method named insertStars before
the file was written (see the previous lesson).  The
method also appends two asterisks at
the end of each line.

(Note that this method removes all asterisks, not just
those inserted earlier.  If this proves to be a problem, this
method should be modified to remove only those asterisks that occur on
ten-character intervals.)

The code in this method is straightforward and shouldn’t require
further explanation.
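
For comparison, here is a shorter sketch that should produce the same result as Listing 11 for any input string.  It relies on the replace method of the String class, which replaces every literal occurrence of the target sequence.  The class name is hypothetical, and this is an alternative formulation rather than the code used in the program.

```java
class RemoveStarsSketch {
    // Drops every asterisk, then appends two asterisks at the end,
    // which is the same observable behavior as Listing 11.
    static String removeStars(String stringIn) {
        return stringIn.replace("*", "") + "**";
    }

    public static void main(String[] args) {
        // prints Subject: Viagra**
        System.out.println(removeStars("Subject: V*i*a*gra"));
    }
}
```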

Return to discussion of the screenMsg method

Returning to the discussion of the screenMsg method, we are
ready to examine the code used to screen the Subject line. 
That code begins in Listing 12.

      while((data = inData.readLine()) != null){
        data = removeStars(data).toUpperCase();
        if(data.startsWith("SUBJECT:")){

Listing 12

The code in Listing 12 is
the beginning of a while loop that reads successive lines of
text from
the disk file until it either runs out of lines (null) or
encounters a line that starts with SUBJECT: (see Figure 1
for the format of the Subject line in the message).

If it runs out of lines before finding the Subject line,
the loop will terminate.  If it finds the Subject line
before running out of lines, the line will be processed and a break
will be executed to terminate the loop.  Thus, the code in this
loop will process only the Subject line.

(Note that
the removeStars method is called to remove the asterisks from the data
and the data is converted to upper case before testing for SUBJECT:.)
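
The read-until-found pattern used by this loop can be sketched on its own as follows.  The helper named findHeader is hypothetical and is not part of the program.  It simply returns the first line that starts with a given upper-case prefix, or null if the input runs out first.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class HeaderScanSketch {
    // Returns the first line (converted to upper case) that starts
    // with prefix, or null if the stream ends before one is found.
    static String findHeader(String text, String prefix) {
        try (BufferedReader inData =
                 new BufferedReader(new StringReader(text))) {
            String data;
            while ((data = inData.readLine()) != null) {
                data = data.toUpperCase();
                if (data.startsWith(prefix)) {
                    return data; // stop reading at the first match
                }
            }
            return null; // ran out of lines
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String msg = "Received: from somewhere\nSubject: Hi\nbody text\n";
        System.out.println(findHeader(msg, "SUBJECT:")); // prints SUBJECT: HI
    }
}
```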

Populate the output object

We are now in the body of the if statement begun in Listing
12.  A match for SUBJECT: has been found.  The code
in Listing 13 populates the output object with the Subject line
data, overwriting the default value put there by the code in Listing 10.

          theResult.subject = data.toUpperCase();

Listing 13

This will result in the Subject line
data being displayed in the second text field of the GUI of Figure 1
when the screenMsg method returns.

Screen against friendly words and phrases

The next step is to screen the Subject line data
against the words and phrases in the friendly list.  If a word or
phrase from the friendly list appears in the Subject line
data, the message will be preserved on the server and will not be
subjected to SPAM screening.

In addition, the local copy of the message currently located in the
history folder will be deleted, because it is not considered to be
SPAM.  That way, the message will not be used later when training
the algorithm to do a better job of identifying SPAM.

Executing the screen

The code that screens the Subject line data against the
friendly list begins in Listing 14.

          Iterator iterator =
                     friendlyWordList.iterator();
          while(iterator.hasNext()){
            String friendlyWord =
                     ((String)(iterator.next())).
                                   toUpperCase();
            match = false;
            if(!(friendlyWord.equals(""))){
              match = screenOnPhrase(
                           data,friendlyWord,0);
            }//end if

Listing 14

The friendly list is stored
in a TreeSet object referred to by the reference variable named
friendlyWordList.

An Iterator loop

The code in Listing 14 shows the beginning of an Iterator loop,
used to iteratively extract each word or phrase from the friendly
list and compare it with the Subject line data. 
The comparison is actually rather complex and is performed in a method
named screenOnPhrase, which I will discuss later.

As each friendly word or phrase is extracted from the friendly list, it
is passed, along with the Subject line data, to the method
named screenOnPhrase.

That method will return true if a match is found, and
will return false
if no match is found.

Extraneous characters

The third parameter in the call to the screenOnPhrase method
specifies the
number of extraneous characters allowed to occur between the characters
in the data and still declare a match to be true.  In this
case, a value of zero is passed for this parameter, meaning that no
extraneous characters are allowed.

(As it turns
out, the value of zero results in a trivial case, and I could have
accomplished this more simply than by invoking the rather complex code
in the screenOnPhrase method.  However, when I wrote the
prototype for the program, I hadn’t decided that I was going to use a
value of zero here.)

Behavior of the screenOnPhrase method

Basically, in this case, the screenOnPhrase method is testing
to see if the friendly word or phrase occurs anywhere within the Subject
data line, and if so, it will return true.  Otherwise, it
will return false.

(Note that
everything has been converted to upper case at this point, so matching
the case is not an issue.)
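
Because no extraneous characters are allowed in this case, the friendly screen reduces to a plain substring test.  The following sketch illustrates that simpler formulation using the contains method of the String class.  The class name, the method name firstFriendlyMatch, and the sample list entries are all hypothetical; the program itself calls screenOnPhrase as shown in Listing 14.

```java
import java.util.TreeSet;

class FriendlyCheckSketch {
    // Returns the first friendly entry (in upper case) that occurs in
    // the line, or null if no entry matches. Empty entries are skipped.
    static String firstFriendlyMatch(String line,
                                     TreeSet<String> friendlyWordList) {
        String data = line.toUpperCase();
        for (String entry : friendlyWordList) {
            String friendlyWord = entry.toUpperCase();
            if (!friendlyWord.isEmpty() && data.contains(friendlyWord)) {
                return friendlyWord;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TreeSet<String> friendly = new TreeSet<>();
        friendly.add("");
        friendly.add("friend@example.com");
        friendly.add("java");
        // prints JAVA
        System.out.println(
            firstFriendlyMatch("Subject: Java question", friendly));
    }
}
```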

At this point, I can either
branch off and discuss the screenOnPhrase method, or
continue discussing the code in the screenMsg method.  I
have decided to do the latter, and explain the inner
workings of the screenOnPhrase method later.

If a match was found

Listing 15 shows what happens if the current friendly word or phrase
was found in the Subject data line (the current
word or phrase is that word or phrase most recently extracted from the friendlyWordList
by the iterator).

            if(match == true){
              inData.close();
              new File(fileName).delete();

              theResult.thePhrase = phrase;

              return false;
            }//end if match = true

          }//end while iterator has next

Listing 15

Delete the file from the SPAM history folder

The first thing that happens when a match is found is that the input
stream is closed and the file containing the raw message is deleted
from the folder in which SPAM history messages are stored.  The
rationale is that this is not a SPAM message, and should not be used
later when training the algorithm to do a better job of identifying
SPAM.

Populate the output object

The next thing that happens is that the matching phrase is stored in
one of the fields of the ScreenResult object.

(In this case, that String
isn’t currently used in any significant way by the communications
module, but it is available to be used in the future if needed.)


Return a false value

Perhaps the most significant thing that happens in Listing 15 is that
the screenMsg method
terminates and returns a false value to the calling
method in the communications module (see Listing 7).  That
essentially terminates the processing of this message.  It is
preserved on the server and is not subjected to further screening for
SPAM.

If no match was found

If none of the words or phrases in the friendlyWordList match
the Subject data line, control will fall out of the
loop at the bottom of Listing 15 when the data in the friendlyWordList
is exhausted.  This will transfer control to the code in Listing
16.

          break;
        }//end if data.startsWithSUBJECT:
      }//end while input is not null

Listing 16

The break statement

The break statement in Listing 16 is inside the body of the if
statement that began in Listing 12.  This code is being executed
because a data line that starts with SUBJECT: was found.

If no match was found, the return statement in Listing 15 would
not be executed, and control would reach this point.  Since this
part of the code deals exclusively with the Subject line,
and a Subject line
was found, there is no point in reading any more input lines. 
Hence the break statement in Listing 16 terminates the read loop
that began in Listing 12.  No more data will be read from the file in this
part of the code.


Rewind the input stream to the beginning

The code in Listing 17 resets the stream back to the mark that was set
on the stream in Listing 10.  Since that mark was set at the
beginning of the file, this code rewinds the data file back
to the
beginning.

      inData.reset();

Listing 17

Contents of ScreenResult object

At this point, the subject
field in the output ScreenResult object contains Subject
line data if a Subject line was found, or
contains a message to the effect that no Subject line
was found.  This latter message was put there by default in
Listing 10, and was not overwritten if no Subject line
was found.

Process the From line

The next step in the process is to process the From line
for the purposes of:

  • Returning the From
    data in one of the fields of the ScreenResult
    object if a From line exists.
  • Determining if the
    message was sent from a friendly Email address.  If so, return
    false causing this message to be preserved on the server and exempt
    from SPAM screening.

Except for the fact that
the code in Listing 18 extracts and processes a text line that starts
with From:, the code in Listing 18 is essentially the same as
the code used to process the Subject line beginning in
Listing 12 and ending in Listing 17.  Therefore, it should not be
necessary for me to explain that code again.

      while((data = inData.readLine()) != null){
        data = removeStars(data.toUpperCase());
        if(data.startsWith("FROM:")){
          theResult.from = data;
          Iterator iterator =
                     friendlyWordList.iterator();
          while(iterator.hasNext()){
            String friendlyWord =
                     ((String)(iterator.next())).
                                   toUpperCase();
            match = false;
            if(!(friendlyWord.equals(""))){
              match = screenOnPhrase(
                           data,friendlyWord,0);
            }//end if

            if(match == true){
              inData.close();
              new File(fileName).delete();
              theResult.thePhrase = phrase;
              return false;
            }//end if match = true
          }//end while iterator has next
          break;
        }//end if data starts with From
      }//end while input is not null

      inData.reset();

Listing 18

Rewind the input stream

Assuming that the method
did not return false as the result of a match on a friendly Email
address, the code in Listing 18 also resets the input stream in
preparation for screening the entire message for SPAM.

Under that same assumption, at this point, the from field in the output ScreenResult
object contains From line data if a From line
was found, or
contains a message to the effect that no From line
was found.  This latter message was put there by default in
Listing 10, and was not overwritten if no From line
was found.


Missing Subject line and From line is rare

Experience indicates that it is very rare to receive an Email message
that doesn’t contain both a FROM: line and a SUBJECT:
line in the header, although either or both may not contain any
characters to the right of the space following the colon.  In
fact, it is very common for the Subject line of SPAM
messages
to be completely blank.  (Perhaps that causes people to read
the messages out of curiosity.)

The screenOnPhrase method

While discussing both the From line and the Subject
line, I asked that you simply accept that the screenOnPhrase
method can determine if one upper-case String object is
contained as a substring within another upper-case String object. 
I did that because there is no great challenge to programming such an
operation when there are no extraneous characters.  In
fact, one of the indexOf methods of the String
class, which searches for a substring within a String,
can accomplish this very handily.

No extraneous characters allowed

This assumes, of course, that extraneous characters are not allowed
between the characters of the substring within the String.  
That was the case in processing the From line and
the Subject line above because I set the third
parameter value to zero in the method call.  However, that is not
the case in searching for offending words and phrases in a SPAM screen.

A SPAM example

For example, here is a typical Subject line taken from
a SPAM message in my archive folder.

Subject: T@ke 5O% off Ge|neric V*i*a*g*r*a 0nline t:0day

In order to recognize that the Subject line contains
the word Viagra, it is necessary that the program be able to
ignore the asterisks that separate the letters of the word
V*i*a*g*r*a.

In order to recognize that the line contains the word Generic,
it is necessary that the program be able to ignore the vertical bar
that separates the e and the n in
Ge|neric.

In order to recognize that the line contains the word t0day, it
is necessary that the program be able to ignore the colon that
separates the t and the 0 in
t:0day.

Other common spammer tricks

This example illustrates
another common spammer trick of switching zero characters with
alphabetic O characters, replacing the lower-case a
character with the @ character, etc.

I haven’t attempted to automate the resolution of substitution issues
such as this,
and probably won’t.  Given the training programs that I will
explain in the next two lessons, it is easy to use historical SPAM data
to train the algorithm to recognize these variations.  If such a
mangled word occurs only once in the SPAM history folder, the algorithm
can be trained in a single training session to recognize it as SPAM in
all future messages.

Ignoring extraneous characters

Now getting back to the
issue of ignoring extraneous characters in the offending words and
phrases, that is what the method named
screenOnPhrase knows how to do very well.

(It is also
what the Email filtering capability of the commercial Email client that
I use doesn’t
know how to do at all.)

The capability to ignore
extraneous characters is one of
the keys to a successful SPAM screening program.

(The
spammers never make it easy to identify and block their Email
messages.  That is why a successful SPAM screening program must
have the capability to learn each new spammer trick as soon as it
appears through an ongoing, simple to use algorithm training effort.)

The screenOnPhrase method

The screenOnPhrase method requires an incoming parameter of
type int named spanLim.  This is the parameter
by which the programmer specifies how many extraneous characters
will be allowed between letters in the offending word or phrase and
still have it be recognized as an offending word or phrase.

When the screenOnPhrase
method was used to process the From and Subject data,
the value of this parameter was set to zero.  Thus no extraneous
characters were allowed in the matching friendly Email address data or
the matching friendly Subject line material.

In order for the screenOnPhrase method to recognize the words Viagra, Generic, and t0day in the above
example, the value of the spanLim parameter would have to be 1
or greater.

As you will see later, the current version of this program uses a spanLim
value of 1 to screen for SPAM.  Experience shows that this is
successful in identifying most of the offensive words and phrases
without
unduly increasing the occurrence of false positives.

At this point, I will put the discussion of the screenMsg
method on hold for a short while and explain the inner workings of the
screenOnPhrase method.

Description of the screenOnPhrase method


This method tests a String to see if it contains
a word or phrase that may have extraneous characters inserted into it,
such as VI*A-GRA.

The method requires an incoming parameter of type int named spanLim.
If the String contains the sequence of characters, in the
correct order, that make up
the word or phrase, with spanLim or fewer extraneous
characters between any two of the matching characters, the
method returns true.  Otherwise, it returns false.

A spanLim example

For example, if spanLim = 1, the spammer can insert one
character between any two of the characters that make up the offending
word in
the String and the offending word will still be detected.

However, if the spammer inserts two or more extraneous characters, the
offending word will not be detected.

Be careful of false positives

You should be careful and avoid making spanLim too large. 
Large values of spanLim result in more false positives due
to the fact that widely-separated characters can be considered to be
part of the word or phrase. For example, if spanLim = 2 or
greater, the word PORN will be found in the word IMPORTANT.
However, if spanLim = 1, the word PORN will not be
found in IMPORTANT.
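
To check the arithmetic behind this example, the following sketch counts the characters skipped between consecutive letters of a phrase as they are located from left to right in a string.  This is a simplified greedy scan written only for illustration.  It is not the method used by the program, the names GapDemo and maxGap are hypothetical, and the phrase is assumed to be non-empty.

```java
class GapDemo {
    // Largest number of characters skipped between consecutive letters
    // of phrase when each letter is taken at its next occurrence in
    // data. Returns -1 if some letter cannot be found in order.
    static int maxGap(String data, String phrase) {
        int prev = data.indexOf(phrase.charAt(0));
        if (prev < 0) {
            return -1;
        }
        int max = 0;
        for (int k = 1; k < phrase.length(); k++) {
            int next = data.indexOf(phrase.charAt(k), prev + 1);
            if (next < 0) {
                return -1;
            }
            max = Math.max(max, next - prev - 1);
            prev = next;
        }
        return max;
    }

    public static void main(String[] args) {
        // P, O, and R are adjacent, but T and A sit between R and N.
        System.out.println(maxGap("IMPORTANT", "PORN")); // prints 2
    }
}
```

The result of 2 is why a spanLim value of 1 rejects this match while a value of 2 or greater accepts it.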

Operation of the screenOnPhrase method

Basically this is how the screenOnPhrase method does what it
does.  The method
receives incoming String parameters named data and phrase,
along with an int parameter named spanLim.  The
objective is to determine if phrase is contained in data
with no more than spanLim extraneous characters separating the
matching characters.

Search for matching characters

First the method searches data for characters that match the
characters in phrase discarding all other characters. 
While doing this, however, it keeps track of the original positions of
the
matching characters in data.

A new compressed string

The result is a new string containing only the characters that match
the characters in phrase, all in their original order. 
Let’s refer to this as str.  All extraneous characters
have been discarded from data producing a new string named str.

For example, if the phrase is SPAM, str might look like
the following after all extraneous characters have been discarded:

SMSPMASPAMMPAS

Does str contain phrase?

A test is made to determine if str contains a sequence of
characters that matches phrase.  (In the above example, str
does contain a sequence of characters matching SPAM, in its
seventh through tenth characters.)

How many extraneous characters were discarded?

If a match is found, this means that the original data did
contain phrase with the possibility of extraneous characters in
between the characters of phrase.  However, there is still
the issue of how many
extraneous characters were discarded in order to get the positive
match. 
This is determined by examining the original position information that
was saved while extraneous characters were being discarded.

If the number of characters discarded between any two of the characters
matching the sequence was less than or equal to spanLim, the
method returns true.  Otherwise it returns false.

Let’s see this in code

The code to accomplish this is a little bit complex.  The screenOnPhrase
method begins in Listing 19.

  private boolean screenOnPhrase(String data,
                                 String phrase,
                                 int spanLim){
    this.phrase = phrase;
    StringBuffer str = new StringBuffer();
    ArrayList locationData = new ArrayList();

Listing 19

The code in Listing 19
saves phrase in an instance variable, and declares a couple of
new local variables that will be used later.  Note that str
is a StringBuffer object and is not a String
object. 

(An object
of
the StringBuffer class can have its contents modified, while an
object of the String class is immutable.)


Compare data with phrase

The next step is to compare the characters in data with the
unique characters in phrase, saving only the matching
characters in str, and saving the original locations of the
matching characters in locationData, which refers to an object
of type ArrayList.

First, however, it is necessary to eliminate duplicate characters from phrase.

Eliminate duplicate characters from phrase

This is accomplished by the code in Listing 20 by storing the
characters from phrase into a TreeSet object. 
Storing the characters in a TreeSet object eliminates
duplicates.

(It also
sorts the characters, but that doesn’t matter
one way or the other in this case.)

    TreeSet treeSet = new TreeSet();
    for(int cnt = 0; cnt < phrase.length();
                                         cnt++){
      treeSet.add(
           new Character(phrase.charAt(cnt)));
    }//end for loop

    Iterator iter = treeSet.iterator();
    StringBuffer tempPhrase = new StringBuffer();
    while(iter.hasNext()){
      tempPhrase.append(
        ((Character)(iter.next())).charValue());
    }//end while

Listing 20

Having stored the
characters from phrase in the TreeSet object, the code
in Listing 20 goes on to use an Iterator to extract the unique
characters from the TreeSet and store them in a new StringBuffer
object named tempPhrase.

(The
characters are stored in a StringBuffer object because it is
possible to build such an object one character at a time.  This is
not possible with a String object.)
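
Here is a small sketch of the same de-duplication step written with generics and autoboxing, which were not yet part of the language when Listing 20 was written.  The class and method names are hypothetical.

```java
import java.util.TreeSet;

class UniqueCharsSketch {
    // Returns the distinct characters of phrase in sorted order.
    static String uniqueChars(String phrase) {
        TreeSet<Character> treeSet = new TreeSet<>();
        for (int cnt = 0; cnt < phrase.length(); cnt++) {
            treeSet.add(phrase.charAt(cnt)); // duplicates are ignored
        }
        StringBuilder tempPhrase = new StringBuilder();
        for (char c : treeSet) {
            tempPhrase.append(c); // built one character at a time
        }
        return tempPhrase.toString();
    }

    public static void main(String[] args) {
        // The second A is dropped and the rest are sorted.
        System.out.println(uniqueChars("VIAGRA")); // prints AGIRV
    }
}
```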

Extract matching characters from data

Listing 21 uses a pair of nested for loops along with tempPhrase
to extract matching characters from data and to store them, in
their original order,
in str.  The original position
of each matching character is stored in locationData.

    for(int i = 0; i < data.length(); i++){
      for(int j = 0; j < tempPhrase.length();
                                           j++){
        if(data.charAt(i) ==
                         tempPhrase.charAt(j)){
          str.append(data.charAt(i));
          locationData.add(new Integer(i));
        }//end if
      }//end for on tempPhrase
    }//end for on data

Listing 21

This converts the original data into a compressed string of
characters, each of which matches a character in phrase. All
other
characters have been discarded. Thus, if data contains phrase,
it will occur somewhere in str with no extraneous characters
separating the characters in phrase.
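
To make the compression step concrete, the following sketch applies it to a sample string and records the original positions.  It tests membership with the indexOf method of the String class instead of the de-duplicated tempPhrase, which should give the same result.  The class name, the method name compress, and the sample data are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class CompressSketch {
    // Keeps only the characters of data that occur somewhere in
    // phrase, and records the original index of each kept character.
    static String compress(String data, String phrase,
                           List<Integer> locationData) {
        StringBuilder str = new StringBuilder();
        for (int i = 0; i < data.length(); i++) {
            if (phrase.indexOf(data.charAt(i)) >= 0) {
                str.append(data.charAt(i));
                locationData.add(i);
            }
        }
        return str.toString();
    }

    public static void main(String[] args) {
        List<Integer> locationData = new ArrayList<>();
        String str = compress("V-I*AGRA 4 U", "VIAGRA", locationData);
        System.out.println(str);          // prints VIAGRA
        System.out.println(locationData); // prints [0, 2, 4, 5, 6, 7]
    }
}
```

The differences between neighboring positions (2, 2, 1, 1, 1 in this example) are what the span test examines later.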

Does str contain phrase?

The next step is the easy one.  Listing 22 tests to see if the new
compressed string named str contains the original phrase.

    int match = str.indexOf(phrase);
    if(match == -1){
      return false;//no match
    }//end if

Listing 22

This is accomplished by
invoking the indexOf method of the StringBuffer class,
which returns the index within str of the first
occurrence of phrase.

Behavior of the indexOf method

The indexOf method returns -1 if phrase does not occur
within str, in which case the screenOnPhrase method
simply returns false.  Otherwise, the method goes on to test for
the maximum number of extraneous characters that separated the matching
characters in the original data.

(While writing this explanation, I realized that phrase
might have occurred more than once in data, with too many
extraneous characters in the first occurrence and an acceptable number
of extraneous characters in a later occurrence.  In this case,
however, the algorithm would return false.

Given the reason for the extraneous characters in the first place,
it is probably unlikely that this will happen.  However, this is a
logic error that would be worth fixing.)

Check number of extraneous characters

When there is a match, we need to confirm that the span between
matching characters does not exceed the number allowed by the incoming
parameter
spanLim.  This is accomplished by the code in Listing 23.

    int maxSpan = 0;
    int locA = ((Integer)locationData.
                         get(match)).intValue();
    int locB = 0;
    for(int cnt = 1; cnt < phrase.length();
                                         cnt++){
      locB = ((Integer)locationData.get(
                     match + cnt)).intValue();
      int span = locB - locA;
      if(span > maxSpan){
        maxSpan = span;
      }//end if
      locA = locB;
    }//end for loop

    if(maxSpan > spanLim+1){
      return false;//span too large
    }else{
      return true;//made a match
    }//end else

  }//end screenOnPhrase

Listing 23

What happened before was …

As each matching character was extracted from data in Listing
21, the original position of that character in data was
encapsulated in
an object of type Integer.  That Integer object’s
reference was
appended to a list of such references in the object of type ArrayList
referred to by locationData.

Thus the elements in the ArrayList refer to objects containing
the original positions of successive matching characters in data.
This information will be used to calculate the maximum number of
characters separating those matching characters.

The code in Listing 22 found the position of the first character of phrase
in the compressed string referred to by str and saved that
value in a local int variable
named match.

What happens now is …

The code in Listing 23 uses the value of match to extract an Integer
object from the list containing the original position of the first
matching character of phrase in data.  The
contents of the Integer
object are extracted and saved as locA.

Then the code in Listing 23 enters a for loop, extracting
successive references from the list and using the information
encapsulated therein to calculate the difference in original positions
of the successive matching characters.  The maximum value of that
difference is saved in maxSpan.  This process continues until a number
of original positions equal to the number of characters in phrase
have been examined.  Then the maximum difference in position
is compared with spanLim+1.  (Adjacent characters differ in position
by one, so a difference of spanLim+1 corresponds to spanLim
extraneous characters.)


If it is determined that the number of characters between the
original positions of the matching characters in data exceeds spanLim,
the method returns false.  Otherwise, it returns true.
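
Pulling the pieces together, here is a self-contained sketch of the approach described above, written as a static method with generics.  It also shows one possible way to address the logic error mentioned earlier, by testing every occurrence of phrase in the compressed string rather than only the first.  That change, the class name, and the early return for an empty phrase are my assumptions for this example and are not part of the program as listed.

```java
import java.util.ArrayList;
import java.util.List;

class ScreenOnPhraseSketch {
    static boolean screenOnPhrase(String data, String phrase,
                                  int spanLim) {
        if (phrase.isEmpty()) {
            return false; // assumption: nothing to screen for
        }
        // Keep only characters that occur in phrase, with positions.
        StringBuilder str = new StringBuilder();
        List<Integer> locationData = new ArrayList<>();
        for (int i = 0; i < data.length(); i++) {
            if (phrase.indexOf(data.charAt(i)) >= 0) {
                str.append(data.charAt(i));
                locationData.add(i);
            }
        }
        // Test each occurrence of phrase in the compressed string.
        int match = str.indexOf(phrase);
        while (match != -1) {
            int maxSpan = 0;
            for (int cnt = 1; cnt < phrase.length(); cnt++) {
                int span = locationData.get(match + cnt)
                         - locationData.get(match + cnt - 1);
                maxSpan = Math.max(maxSpan, span);
            }
            if (maxSpan <= spanLim + 1) {
                return true; // made a match
            }
            match = str.indexOf(phrase, match + 1); // try the next one
        }
        return false; // no occurrence with an acceptable span
    }

    public static void main(String[] args) {
        System.out.println(
            screenOnPhrase("IMPORTANT", "PORN", 1));    // prints false
        System.out.println(
            screenOnPhrase("IMPORTANT", "PORN", 2));    // prints true
        // The first occurrence is spread too widely, the second is not.
        System.out.println(
            screenOnPhrase("S..PAM SPAM", "SPAM", 1));  // prints true
    }
}
```

With the code in Listing 22 and Listing 23, the third call would be expected to return false because only the first occurrence is examined.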

Return to the discussion of the screenMsg method


Now that we have an understanding of how the method named screenOnPhrase
works, it is time to return to the discussion of the method named screenMsg.

Up to this point, the method has examined the Subject and
From lines for two purposes:

  • To determine if they
    are friendly, and if so, to terminate SPAM screening for the current
    message.
  • To provide the
    contents
    of the two lines for later display in the GUI.

Screening for SPAM

If control still resides in the screenMsg method at this point (meaning
that the message wasn’t declared to be friendly on the basis of either
the From line or the Subject line),
it is time to
screen the entire message looking for indications that the message is
SPAM.

This is accomplished in two parts.  The Subject line
is screened for SPAM using one word list and the body text
is screened for SPAM using a different word list.  If either the Subject
line or the body text is
determined to contain SPAM, the method terminates returning true.

(A typical
message contains a few lines of body text before the Subject
line and potentially many lines of body text following
the Subject line.

The lines are screened in the order that they occur in the message.
Therefore, if the Subject line is determined to contain SPAM,
the screening process will often require much less time than will be
required to locate SPAM in the body text.)

If
any line is determined to contain SPAM, the method terminates at that
point returning true.

If no SPAM is identified in any line of the message, the method returns
false.

The screening process

The screening process begins in Listing 24.

      int progressCounter = 0;
      while((data = inData.readLine()) != null){
        data = removeStars(data);
        if(data.startsWith("Subject")){

          data = data.toUpperCase();

          theResult.text =
                    theResult.text + data + "\n";

          //Display progress on the GUI.
          if(++progressCounter < 50){
            theGui.textArea.append(".");
          }else{//Display progress on a new line
            progressCounter = 0;
            theGui.textArea.append(".\n");
          }//end else

Listing 24

The code in Listing 24
reads a line of text from the input file, removes asterisks from the
line and tests to see if the line starts with Subject: (note
that the line hasn’t been converted to upper case yet at this point).

If the line does not start with Subject:, an else
clause (to be discussed later) is executed to screen the line
as body text.

If the text line is the Subject line …

For the case where the line does start with Subject:

  • The line is converted
    to upper case.
  • The line is appended
    to the contents of the field named text in the output object of
    type ScreenResult.
  • A single period is
    appended to the string currently residing in the text area of the GUI
    to be displayed as a progress indicator.  (Each set of 50
    periods appears on a new line in the progress indicator.)
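The progress indicator logic can be tried in isolation. The sketch below is an illustration only: the class name ProgressDots is hypothetical, and a StringBuilder stands in for the text area of the GUI so that the example runs without a window.

```java
final class ProgressDots {
    private final StringBuilder out = new StringBuilder();
    private int counter = 0;

    // Call once for each line of the message that is screened.
    void tick() {
        if (++counter < 50) {
            out.append(".");
        } else {
            // The 50th period ends the current row of the indicator.
            counter = 0;
            out.append(".\n");
        }
    }

    // Text that would be shown in the progress area.
    String text() {
        return out.toString();
    }
}
```

Each completed row therefore holds exactly 50 periods followed by a newline.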

Screen the Subject line for SPAM

The code beginning in Listing 25 uses an Iterator to screen an
upper-case version of the Subject line against
upper-case versions of each of the offensive words and phrases stored
in
a TreeSet object referred to by subjWordList.


          match = false;
          Iterator iterator =
                         subjWordList.iterator();
          while(iterator.hasNext()){
            String subjWord =
                    ((String)(iterator.next())).
                                   toUpperCase();
            if(!(subjWord.equals(""))){
              match = screenOnPhrase(
                                data,subjWord,1);

            }//end if

Listing 25

Invoking screenOnPhrase

The actual process of screening against each offensive word or phrase
in
the TreeSet object occurs as
a result of invoking the screenOnPhrase method (discussed
earlier)
to determine if the Subject line contains
the offensive word or phrase.  A one-character separation is
allowed between the characters in the offensive phrase in the Subject
line.  The
boolean value returned by
screenOnPhrase is stored in the variable named match.
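To make the idea of an allowed separation concrete, here is a minimal sketch of a gap-tolerant match. It is not the screenOnPhrase method from this program: the class name GapMatch, the method names, and the backtracking approach are all assumptions chosen for illustration. The third argument plays the role of the value 1 passed in Listing 25.

```java
final class GapMatch {
    // True if the characters of phrase occur in order somewhere in
    // line with at most maxGap extra characters between neighbors.
    static boolean matches(String line, String phrase, int maxGap) {
        if (phrase.isEmpty()) {
            return false; // empty list entries never match
        }
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == phrase.charAt(0)
                    && matchRest(line, i, phrase, 1, maxGap)) {
                return true;
            }
        }
        return false;
    }

    // line[pos] matched phrase[k-1]; try to match phrase[k] onward.
    private static boolean matchRest(String line, int pos,
                                     String phrase, int k, int maxGap) {
        if (k == phrase.length()) {
            return true; // every character of the phrase was found
        }
        for (int skip = 0; skip <= maxGap; skip++) {
            int next = pos + 1 + skip;
            if (next >= line.length()) {
                break;
            }
            // Backtrack if a nearer candidate leads to a dead end.
            if (line.charAt(next) == phrase.charAt(k)
                    && matchRest(line, next, phrase, k + 1, maxGap)) {
                return true;
            }
        }
        return false;
    }
}
```

With a gap of one, this sketch matches VIAGRA in V.I.A.G.R.A and also matches SLUT in SOLUTION, while PORN in IMPORTANT is not matched until the gap is raised to two. That illustrates why a larger gap produces more false alarms.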

The value of match will eventually be returned to indicate
whether or not the screenMsg method found a match between a
text line in the message and an offensive word or phrase from one of
the word lists.

If the returned value is false …

If the returned value is false, the Iterator loop
continues looping, attempting to match offensive words or phrases from subjWordList
with the Subject line until there are no more
offensive words or phrases stored in subjWordList.

At that point, it is concluded that the Subject line
doesn’t contain SPAM.  Control is transferred back to the top of
the while loop in Listing 24, where another line of text is
read and screened.

(There can be only one Subject line in a
properly
formatted message, so the remaining lines will probably all be screened
as body
text.)

If the returned value is true …

If the screenOnPhrase method returns true, the body of the if
statement in Listing 26 is executed.

            if(match == true){
              //Move local message file
              inData.close();
              boolean moved =
                    new File(fileName).renameTo(
                      new File(
                        "c:/MailFiles/Archives/"
                                  +uidl+".txt"));
              if(!moved)System.out.println(
                  "Unable to move file " + uidl);

              //Break out of Iterator loop
              break;
            }//end if match == true
          }//end while iterator has next
        }//end if data starts with Subject

Listing 26

Basically two things happen
in the body of this if statement:

  • The local copy of the
    message is moved from a history folder to an archive folder.  (The
    message won’t be needed for training the algorithm later because the
    algorithm already knows how to identify the message as SPAM.)
  • Control breaks out of
    the Iterator loop.  (Only one match against an
    offensive word or phrase is required to declare that a message is SPAM.)

I won’t try to explain the
process of moving the file.  If you don’t understand that code,
look it up in the Java API documentation.
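For readers who want a starting point, the sketch below isolates the move operation. The first method uses File.renameTo, the same call used in Listing 26, which reports failure by returning false rather than by throwing an exception. The second uses Files.move, which was added in Java 7 and is therefore newer than the SDK used to test this program; it is shown only as an alternative. The class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class MoveSketch {
    // Same call as Listing 26: returns false if the move fails,
    // so the caller must check the result.
    static boolean moveWithRenameTo(File source, File archiveDir) {
        return source.renameTo(
            new File(archiveDir, source.getName()));
    }

    // Newer alternative: failure is reported with an IOException
    // that describes what went wrong.
    static Path moveWithNio(Path source, Path archiveDir)
            throws IOException {
        return Files.move(source,
            archiveDir.resolve(source.getFileName()),
            StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Note that Listing 26 closes the input file before moving it. On some platforms a file that is still open cannot be renamed, so the order of those two steps matters.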

Transfer of control

After breaking out of the Iterator loop, control transfers to
the return statement in Listing 33 below with match
containing true.  A true value for match indicates that
the Subject line identifies the message as SPAM.

If the text line is not the Subject line …


A line of text was read in Listing 24, and a test was made to see if
that line was the Subject line for the message. 
If not, control transfers to the code that begins in Listing 27.

Since the line of text is not the Subject line, it
needs to be screened against the offensive words and phrases in a
different list designed for screening body text.

Listing 27 executes several steps in preparation for that screening
process.

        else{
          data = data.toUpperCase();
          theResult.text =
                     theResult.text + data + "\n";
          if(++progressCounter < 50){
            theGui.textArea.append(".");
          }else{
            progressCounter = 0;
            theGui.textArea.append(".\n");
          }//end else

Listing 27

The code in Listing 27:

  • Converts the text
    line to upper case.
  • Appends the text line
    to the contents of the field named text in the output object of
    type ScreenResult.
  • Causes a single
    period to be displayed in the progress indicator on the GUI of Figure 1.

Screen the message body text line


The line of message body text is actually screened by invoking the screenOnPhrase
method in Listing 28.  The third parameter in the invocation of
this method allows for one extraneous character to separate the
characters of the offending phrase in the line of body text.

          Iterator iterator = bodyWordList.
                                      iterator();
          match = false;
          while(iterator.hasNext()){
            String bodyWord =
                    ((String)(iterator.next())).
                                   toUpperCase();
            if(!(bodyWord.equals(""))){
              match = screenOnPhrase(
                                data,bodyWord,1);
            }//end if

Listing 28

Loop on an Iterator

An upper-case version of the message text line is screened against
upper-case versions of each of the offending words and phrases in a TreeSet
object referred to by bodyWordList.

An Iterator is used to cause this process to continue until
either a match is found, or the items in the list are exhausted.

The boolean value returned by screenOnPhrase is stored
in the variable named match.

If the returned value is true …

If the value returned by screenOnPhrase is true, the code in
the body of the if statement of Listing 29 is executed.

            if(match == true){
              inData.close();
              boolean moved =
                    new File(fileName).renameTo(
                      new File(
                        "c:/MailFiles/Archives/"
                                  +uidl+".txt"));
              if(!moved)System.out.println(
                  "Unable to move file " + uidl);

              break;
            }//end if match == true
          }//end while iterator has next
        }//end else for line not Subject line

Listing 29

This code performs the
following operations:

  • Move the local copy
    of the message from the history folder to an archive folder for the
    same reasons given with respect to the Subject line
    earlier.
  • Break out of the Iterator
    loop because there is no need to test against any additional offensive
    words or phrases.

At this point, the value of
match is true meaning that a match has been found.  Control
is transferred to the if statement at the top of Listing 30.

If screenOnPhrase returned false …

On the other hand, if no matches were found for any of the words and
phrases in bodyWordList, control reaches the if
statement at the top of Listing 30 with match containing a
value of false.

        if(match == true)break;
}//end while loop on read until null

Listing 30

If match is false,
the code in Listing 30 loops back to the top of Listing 24, reads the
next line of text from the message, and begins the screening process
all
over again.

Close the file


If match is true in Listing 30, there is no need to do any
further testing so
the code in Listing 30 breaks out of the while loop responsible
for reading lines of text from the file, transferring control to the
top of Listing
31.

Control can also transfer to the top of Listing 31 when the end of the
text file has been reached.  In that case, the value of match
will be false, indicating that no match was found.

      inData.close();//Close file if still open
}catch(Exception e){e.printStackTrace();}

Listing 31

The code in Listing 31
closes the file and finishes off the obligatory code for a try/catch
block.

Store the final phrase

In the event that a match was found, the variable phrase
contains the offending phrase that identifies the message as
spam.  If a match was not found, the contents of phrase
are of no significant value.  In either case, however, the value
of phrase is stored in the field named thePhrase in the
output object of type ScreenResult in Listing 32.

    theResult.thePhrase = phrase;

Listing 32

Return the value of match

The code in Listing 33 returns the value of the variable match.
This will either be the initial value of false if no match was found (see
Listing 8)
or will be true if a match was found and the initial
value was overwritten with true.

    return match;
}//end screenMsg method

}//end class Screen

Listing 33

Return points for the screenMsg method

The screenMsg method contains three return statements.

The first occurs in Listing 15 where the code explicitly returns false,
indicating that a friendly phrase was found in the Subject line,
and that the message should not be deleted from the server.

The second occurs in Listing 18 where the code explicitly returns
false, indicating that the message was sent by a friend and therefore
shouldn’t be deleted from the server.

The third occurs in Listing 33.  A match value of false
at this point indicates that the message was not identified as SPAM and
should not be deleted from the server.  A match value of
true at this point indicates that the message is believed to be SPAM
and probably should be deleted from the
server.

Preview of Future Lessons

This program is most useful
when you have well-developed lists of offending words and
phrases.  Although it is possible to create those lists with a
text editor, you can be much more productive, and you are much more
likely to update the lists using the programs that I will present in
the next two lessons.

Therefore, I will give you a preview of those two
programs.  I will show you three images that partially illustrate
the
capabilities of the two programs.

The first image (Figure 2) shows the GUI used to train the
algorithm to do a
better job of identifying SPAM in the Subject line of a
message. 

The second and third images (Figures 3 and 4) show two
different aspects of the GUI used to train the algorithm to do a better
job of identifying SPAM in
the body text of a message.

(In all
three cases, the width of the GUI was reduced to make it fit into this
narrow publication format.  The version that I routinely use is
much wider, and can therefore display much more information.)

Training the algorithm on the Subject line

Figure 2 illustrates the procedure that I use to train the algorithm to
do a better job of identifying SPAM in the Subject line
of future messages.

User interface

Figure 2 User interface for
training on Subject line


In Figure 2, a message previously stored in the history folder has been
loaded into the GUI.  The complete raw text of that message is
available for viewing in the large text area if desired.  The From
line and the Subject line are displayed in the
top two text fields in the GUI.  (I purposely deleted the
Email address of the sender in all three of these images.)
 
User instructions are displayed in the fourth text field.

Offensive text in the Subject line

In this case, the user has identified offensive text (XANAX) in
the Subject
line and has selected that text with the mouse.  Two
additional steps are required to add that text to the
word list used to screen the Subject
line of future messages.


The first step is to press the button labeled Copy Selected Text. 
This will cause the selected text to be copied into the third text
field from the top where it can be edited if desired.

(In the
event that the spammer inserted extra characters into the offensive
text,
such as in X-ANAX, the extra characters should be deleted before
proceeding to the second step.)

The second step is to press
the button labeled Post Text. This will cause the selected and (possibly)
edited text to be automatically added to the word list.

That is all that is required to cause the program to identify this
offensive text in the Subject lines of all future
messages.
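The training program itself is presented in the next lesson, so the following is only a guess at what the Post Text step amounts to. It assumes that the word list is held in a TreeSet, that the list file stores one entry per line, and that entries are trimmed and converted to upper case before being stored. The class name WordListPoster and the method name post are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.TreeSet;

final class WordListPoster {
    // Adds the (possibly edited) text to the list and rewrites the
    // list file, one entry per line. Returns false if the text is
    // blank or already present, in which case nothing is written.
    static boolean post(String editedText,
                        TreeSet<String> wordList,
                        Path listFile) throws IOException {
        String entry = editedText.trim().toUpperCase();
        if (entry.isEmpty() || !wordList.add(entry)) {
            return false;
        }
        Files.write(listFile, wordList, StandardCharsets.UTF_8);
        return true;
    }
}
```

Because the screening code converts each list entry to upper case before comparing, the case used in the stored entry does not affect matching; normalizing it here only avoids duplicate entries that differ in case.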

Process the next message

If the user then presses the Next button, the next message in
the history folder will be loaded into the GUI.  The current
message will
not be deleted from the history folder.

(This is
what you would normally do if you are going to use the same message
later to train the algorithm to better identify SPAM on the basis of
the body text.)

If the user presses the Delete
Local File
button, the current file will be deleted from the
history folder and the next message in the history folder will be
loaded into the GUI.

(This is
what you would normally do if you have determined that the message is
not SPAM, or should not be used for further training of the algorithm
for some other reason.  Perhaps the message was received from a
friend whose Email address has not yet been added to the list of
friendly Email addresses discussed earlier in this lesson.  Note
that deleting the message file from the local disk does not delete the
message from the server.)

A very simple process

As you can
see, the process of training the algorithm on the Subject line
consists simply of selecting text with the mouse and pressing buttons
to cause the selected text to be added to the word list.  This can
be accomplished very quickly with very little effort.  Except for
the possible requirement to delete extra characters, no actual typing
is required.

(As an
alternative, the user can type anything into the third text field and
press the Post Text button to cause it to be added to the word
list.  Any number of items can be added to the word list before
moving on to the next message.)

Training the algorithm on the body text

Figure 3 illustrates one aspect of the procedure for training the
algorithm to do a better job of identifying SPAM on the basis of the
body text of future messages.

User interface

Figure 3 User interface for
training on body text and IP address


Once again, a
message previously stored in the history folder has been
loaded into the GUI.  The complete raw text of that message is
available for viewing in the large text area.  The From
line and the Subject line are displayed in the
top two text fields in the GUI.  User instructions are displayed
in the fourth text field from the top.


Add originating IP address to the list

At this point, the user has pressed the button labeled Select IP.
This caused the program to search out the IP address of the computer
that originally sent this message, and to copy that IP address into the
third text field from the top.  All that is required to add that
IP address to the list of offending phrases is to press the button
labeled Post Word.

As you can see, getting the originating IP address of a SPAM message
and adding it to the word list is very simple.  As before, once it
is in the third text field, you can edit if you like before adding it
to the list.
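How the program decides which address belongs to the originating computer is not explained in this lesson. As a rough illustration of the searching step only, the sketch below collects every dotted-quad IPv4 address that appears in the raw message text, in order of appearance, and leaves the choice among them to the caller. The class name IpFinder and the regular expression are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IpFinder {
    // Four groups of one to three digits separated by periods,
    // not immediately preceded or followed by another digit.
    private static final Pattern DOTTED_QUAD = Pattern.compile(
        "(?<!\\d)(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})(?!\\d)");

    // Returns the IPv4-looking tokens found in the text.
    static List<String> find(String rawText) {
        List<String> found = new ArrayList<String>();
        Matcher m = DOTTED_QUAD.matcher(rawText);
        while (m.find()) {
            boolean valid = true;
            for (int g = 1; g <= 4; g++) {
                // Reject groups such as 999 that cannot be octets.
                if (Integer.parseInt(m.group(g)) > 255) {
                    valid = false;
                }
            }
            if (valid) {
                found.add(m.group());
            }
        }
        return found;
    }
}
```

Keep in mind that header lines can be forged by the sender, so an address found this way is worth a look before it is added to the list.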

Getting offending text from the body text

Also at this point, you can scroll the text area.  If you visually
identify something in that text that you believe will uniquely identify
messages from this spammer in the future, you can copy and paste that
text into the third text field, and then add the text to the list
by pressing the Post Word button.

Adding URLs to the list

One of the best ways to identify SPAM is to identify URLs referenced in
the SPAM messages.  This is something that is difficult,
or at least expensive for the spammer to change frequently.  (Sometimes
the identification of one critical URL will cause hundreds and perhaps
thousands of future messages to be identified as SPAM.)

Adding URLs is very easy

Figure 4 illustrates a special feature of the program designed to let
you capitalize on that weakness.  Once a message is loaded into
the GUI, each time you press the button labeled Select URL, the
program will search down through the message until it finds the next
block of text that begins with HTTP://.  (This is
normally an indication of a URL.)

The program will select that URL, beginning with HTTP:// and including
everything out to the character before the next / character.  That
/ character normally separates the domain name from a directory or file
name.  Then the program copies the selected text into the third
text field.

(The spammer
can much more easily change directory and file names than domain names,
so they are excluded from the text that is selected and copied.)
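The selection rule just described is simple enough to sketch. The code below is an illustration and not the code from the training program: the class name UrlSelector and the method name nextUrl are hypothetical, the search ignores case, and stopping at white space when no / character follows is an added safeguard that the description above does not mention.

```java
final class UrlSelector {
    private static final String SCHEME = "HTTP://";

    // Finds the next URL prefix at or after fromIndex. Returns
    // {start, end} so that text.substring(start, end) is the scheme
    // plus the domain name, or null when no more URLs are found.
    static int[] nextUrl(String text, int fromIndex) {
        for (int start = Math.max(0, fromIndex);
             start + SCHEME.length() <= text.length(); start++) {
            if (text.regionMatches(
                    true, start, SCHEME, 0, SCHEME.length())) {
                int end = start + SCHEME.length();
                // Stop before the / that separates the domain name
                // from any directory or file name.
                while (end < text.length()
                        && text.charAt(end) != '/'
                        && !Character.isWhitespace(text.charAt(end))) {
                    end++;
                }
                return new int[] {start, end};
            }
        }
        return null;
    }
}
```

Passing the end index of one result as the fromIndex of the next call walks through the message one URL at a time, in the manner of repeated presses of the Select URL button.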

User interface

Figure 4 User interface for
training on body text and URL


A URL has been identified

In Figure 4, the program has selected the URL being used by the spammer
and has copied it into the text field.  At this point, the user
can edit the URL if appropriate, and can add it to the list by
pressing the Post Word button.

Each time the user presses the Select URL button the next URL
in the message is copied into the text field.  When no more URLs
can be found, a message to that effect is displayed in the text field.

Thus, it is very easy for the user to identify all the URLs being used
by the spammer and to add some or all of them to the list.

The next message

The behavior of the Next button and the Delete Local File/Next
button is the same as discussed relative to Figure 2.  I typically
delete the file from the history folder after I
have used it to train the algorithm on the basis of body text.

Stay tuned

So, stay tuned.  I will explain the programs that provide this
training capability in the next two lessons in this series.

Run the Program

I encourage you to copy the code from Listing 34 and the three
starter text files in Listing 35, Listing 36, and Listing 37 into
your text editor.  Compile and execute the
program.  Experiment with it, making changes, and observing the
results
of your
changes.

You may want to modify this code to cause the message files to be
stored
in a different location on your disk.  If so, modify the strings
in Listing 34 that read “c:/MailFiles/”
+ uidl + “.txt”
and “c:/MailFiles/Archives/” + uidl +”.txt”
to
specify a different folder. Make certain that the folder
where you plan to save the files exists before running the program.

Before running the program, you will need to create three text files
having the following names and purposes and store them in the folder
containing your compiled Java class files for this program:

  • Pop302a.txt – contains offensive Subject line
    words and phrases
  • Pop302b.txt – contains offensive body text words and phrases
  • Pop302c.txt – contains friendly Email addresses and friendly Subject
    line material

Eventually you will need to populate these files with words
and phrases that work well for you.  (The algorithm training
programs that I will present in the next two lessons will be extremely
helpful in this regard.)

In the meantime, I have provided sample files in Listing 35,
Listing 36, and Listing 37 that you can use as starter lists.  If
you receive the same kinds of SPAM that I receive, the words in these
lists should make it possible for you to test the program and get a few
hits on SPAM messages.

These are simply text files so feel free to add other words and
phrases as appropriate.
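If you edit the lists by hand, it may help to see how such a file could be read back into a sorted set. This is a sketch under assumptions rather than the loading code from Listing 34: it assumes one entry per line, and the class name WordListLoader is hypothetical. Blank lines are skipped here, although the screening code in Listing 25 and Listing 28 also ignores empty entries.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.TreeSet;

final class WordListLoader {
    // Reads a word list file into a sorted set, one entry per line,
    // trimming surrounding white space and skipping blank lines.
    static TreeSet<String> load(String fileName) throws IOException {
        TreeSet<String> words = new TreeSet<String>();
        // try-with-resources closes the file even if reading fails.
        try (BufferedReader in =
                 new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = in.readLine()) != null) {
                String entry = line.trim();
                if (!entry.isEmpty()) {
                    words.add(entry);
                }
            }
        }
        return words;
    }
}
```

A TreeSet discards duplicates, so an entry typed twice into the text file is stored only once.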

(Let me caution you not to enable the
DELE
code in Listing 34 until you are certain
that you actually want to delete messages from the server.  Once a
message is deleted from the server, there is no way to recover it from
the server.)

Summary

The previous lesson explained the communications module of a program
used to remove SPAM from your Email server before it is
downloaded into your primary Email client.

This lesson explains my algorithm used to identify SPAM.  You can
use the algorithm as is, or modify it to better suit your needs.

After about one week of training, my algorithm was reliably identifying
about ninety percent of all SPAM messages.  I expect this
performance to improve over time as the algorithm becomes better
trained.

This program is most useful
when you have well-developed lists of offending words and
phrases.  Although it is possible to create those lists with a
text editor, you can be much more productive, and you are much more
likely to update the lists using the programs that I will present in
the next two lessons.

What’s Next?

In the next lesson in this series, I will present and explain my
program named Pop302d, which provides an easy way to
train my screening algorithm to do a better job of identifying SPAM in
the Subject line of a message.

Complete Program Listing


A complete listing of the program is provided in Listing 34.  In
addition, starter text files are provided in Listing 35, Listing 36,
and Listing 37.

The three DELE statements shown in red in Listing 34
have been purposely disabled to prevent you from accidentally deleting
messages from your server while testing this program.

Do not enable these three statements until you are ready
to actually delete messages from the server.  Once a message is
deleted from the server, it cannot be recovered from the server.

Disclaimer of responsibility:  If you elect to use this
program
you use it at your own risk.  Make absolutely certain that you
understand what you are doing before you execute the program.  The
author of this program, Richard G. Baldwin, and the websites Developer.com and Gamelan.com
accept no responsibility
for any losses that you may incur as a result of using this program.

/*File Pop302.java Copyright 2004, R.G.Baldwin
Rev 01/02/04

Upgraded on 01/02/04 to do the following:
-Display msgNumber in text area while awaiting
decision to delete or not to delete.
-Confirm that local msgNumber is in synch with
message number on server (in UIDL) before
deletion of a message from the server.

The purpose of this program is to read messages
from a POP3 server, analyze the messages
according to screening rules, and delete those
messages from the server that fail the screening
test. (As written, the program asks the user
to confirm the deletion of each message, but
this confirmation step could easily be removed.)

This version of the program screens on the basis
of key words or phrases in the From line, key
words or phrases in the Subject line, and key
words or phrases in the body text.

A list of friendly Email addresses is used to
screen the From line. Messages that are from
friendly Email addresses are not deleted from
the server and no information about those
messages is saved on the local disk. They are
totally ignored after determining that they were
sent from a friendly Email address.

Different lists of words are used for screening
Subject lines and body text. For example,
ANTIVIRUS is appropriate for screening the
Subject line, but is not appropriate for
screening the body text. The word ANTIVIRUS
often appears legally in the header of Email
messages that have been scanned for viruses by
the server, but also often appears in the Subject
line of SPAM messages.

The common spammer tricks of inserting extra
characters between the characters in the
offending word and mixing the case of the
characters in the offending word are defeated by
this program.

For example, this program will flag for deletion
a message having any of the following in its
Subject line or its body text:

vIaGrA
V.IagRA
V.I.A.G.R.A

This program also defeats the common trick of
appending random characters to the end of the
Subject line, because it doesn't require a match
for the entire Subject line.

When the program detects a message that is a
candidate for deletion, the user is asked to
verify the deletion by clicking the Delete
button. If the user doesn't want to delete the
message, she should click the Start/Next
button.

The following information is available to the
user for making that decision:
- From
- Subject
- Offending line, which may also be the subject
- Offending word or phrase
- Entire raw text of the message up to and
including the offending line

All messages that are candidates for deletion
from the server are saved in an archive folder
on the local disk, regardless of whether the
user elects to delete them from the server. Thus
if a message is deleted from the server and it is
later determined that was a mistake, a raw text
copy of the deleted message is available locally
in the archive folder. You should probably empty
this folder periodically so that it won't fill
up your disk.

Except for friendly messages, all messages that
are not candidates for deletion from the server
are saved in a history folder on the local
disk. These messages can be used later to train
the program to do a better job of recognizing
SPAM.

Before any message is saved in a local file,
asterisks are inserted into the text on
ten-character intervals in an attempt to destroy
any virus code that may be embedded in the
message.

Numerous upgrades are possible. One possible
upgrade is to create a premium list of words and
phrases that will always result in deletion of
the message from the server without prior
approval by the user. For example, the user
might want to have any message containing
VIAGRA to be automatically deleted. However,
great care is urged in this regard. Certain
words such as SPAM and PORN occasionally occur
in a message with the letters separated by only
a few characters. This program would identify
those messages as being candidates for deletion.
For example, the offending word PORN occurs in
the non-offending word imPORtaNt with the letters
R and N separated by only two characters. The
word SLUT appears in the word SoLUTion with only
one character between the S and the L. The word
SPAM often occurs in different variations of
body text.

Another possible upgrade would be to allow the
user to specify the number of characters that may
occur between the letters of an offending word
or phrase. As programmed, that value is
hard-coded into the program, and as of this
writing, that value is one.

If the number of characters is set to zero, many
spam messages will avoid detection. If that
value is set to a large number, many false alarms
will occur. Therefore, care should be taken when
adjusting this value.

Another possible modification would be to allow
the program to automatically delete all
messages that are determined to be candidates
for deletion. Since these messages are saved
locally in an archive folder, a separate program
could be written to allow the user to review
those messages locally at her convenience just
in case a valid message was inadvertently
deleted from the server.

Companion programs that I have written provide
for creating and maintaining the word lists
discussed above in disk files. These programs
are used to analyze the non-deleted message files
saved locally in the history folder in order to
train this program to do a better job of
identifying SPAM messages in the future. These
programs are designed for ease of use to
encourage the user to train the program
frequently.

All three word lists are maintained in simple
text files, which can be edited with an
ordinary text editor if need be.

For technical information on POP3, see RFC 1725
at
http://www.cis.ohio-state.edu/htbin/rfc/rfc1725.
html

A POP3 Command Summary follows based on the
information at that web site.

Minimal POP3 Commands:
USER name
PASS string
QUIT
STAT
LIST [msg]
RETR msg
DELE msg
NOOP
RSET
QUIT

Optional POP3 Commands:
APOP name digest
TOP msg n
UIDL [msg]

POP3 Replies:
+OK
-ERR

File names: The following file names are hard-
coded into the program:

The file name for a local copy of a message is
the unique identifier for that message obtained
from the mail server.

Pop302a.txt - contains a word list for screening
the Subject lines.

Pop302b.txt - contains a word list for screening
the body text lines.

Pop302c.txt - contains a list of friendly Email
addresses for screening the From lines to
identify friendly messages.

This program consists of two main classes. An
object of the class named Pop302 handles all
communications with the Pop3 server.

An object of the class named Screen screens each
message in an attempt to identify SPAM. This
class can be totally replaced by Java programmers
who wish to design their own screening algorithm
provided they maintain the interface with the
object of the class named Pop302.

Tested using SDK 1.4.2 under WinXP
************************************************/

import java.net.*;
import java.io.*;
import java.util.*;
import java.awt.*;
import java.awt.event.*;

class Pop302 extends Frame{
int msgCounter = 0;
int msgNumber;
TextArea textArea;
TextField subjField;
TextField fromField;
TextField operMsgField;
int numberMsgs = 0;
String uidl = "";//unique msg ID
BufferedReader inputStream;
PrintWriter outputStream;
Socket socket;
Screen screener;
String fileName;

public static void main(String[] args){
if(args.length != 3){
System.out.println("Usage: java Pop302 "
+ "server userName password");
System.exit(0);
}//end if

new Pop302(args[0],args[1],args[2]);
}//end main
//===========================================//

Pop302(String server,String userName,
String password){
//Instantiate a new Screen object and pass
// this to allow for the object to call back
// and update the progress indicator.
screener = new Screen(this);

int port = 110; //pop3 mail port
try{
//Get a socket, connected to the
// specified server on the specified
// port.
socket = new Socket(server,port);

//Get an input stream from the socket
inputStream = new BufferedReader(
new InputStreamReader(
socket.getInputStream()));

//Get an output stream to the socket.
// Note that this stream will autoflush.
outputStream = new PrintWriter(
new OutputStreamWriter(
socket.getOutputStream()),true);

//Display the msg received from the
// server on the command-line screen
// immediately following connection.
String connectMsg = validateOneLine();
System.out.println("Connected to server "
+ connectMsg);

//The communication process is now in the
// AUTHORIZATION state. Send the user
// name and password to the server. Note
// that the use of an APOP command
// for sending user name and password
// would probably be more secure
// if it is supported by the server.
// However, my server apparently doesn't
// support APOP.
//Commands are sent in plain text, upper
// case to the server. Some commands
// require an argument following the
// command, as is the case with USER.
//Send the command.
outputStream.println("USER " + userName);
//Get response and confirm that the
// response was +OK and was not -ERR.
String userResponse = validateOneLine();
//Display the response on the command-
// line screen. Cannot display in the
// GUI at this point in time because the
// GUI object is not ready for use at
// this point in the execution of the
// constructor.
System.out.println("USER " + userResponse);
//Send the password to the server
outputStream.println("PASS " + password);
//Validate the server's response as +OK.
// Display the response in the process.
System.out.println(
"PASS " + validateOneLine());
}catch(Exception e){e.printStackTrace();}

//Register a window listener to service
// the close button on the Frame. This is
// an anonymous class definition.
this.addWindowListener(
new WindowAdapter(){
public void windowClosing(WindowEvent e){

//Terminate the session with the
// server.
outputStream.println("QUIT");
String quitResponse =
validateOneLine();
//Display the response on the
// command-line screen.
System.out.println(
"QUIT " + quitResponse);
//Also display the response on the
// GUI. However, you probably won't
// see it because the GUI is
// closing.
textArea.append(quitResponse + "\n");

//Server is now in the UPDATE mode.
// It will delete all files marked
// with the DELE command earlier
// in the execution of the program.
//Close the socket
try{
socket.close();
}catch(Exception ex){
ex.printStackTrace();}

System.exit(0);
}//end windowClosing
}//end WindowAdapter()
);//end addWindowListener

//Note, this GUI was purposely made narrow
// in order to make it fit into the
// publication format. You should make
// it wider and also increase the width of
// the text fields and the TextArea defined
// below to make it more useful.
setLayout(new FlowLayout());
//Note that the compiler requires the
// references to the following buttons to
// be final because they are accessed from
// within an anonymous class definition.
final Button startButton =
new Button("Start/Next");
final Button deleteButton =
new Button("Delete");
subjField = new TextField(
"Display Subj here",50);
fromField = new TextField(
"Display From line here",50);
operMsgField = new TextField(
"Display operator messages here",50);
textArea = new TextArea(15,50);
textArea.append("Display raw data here\n");

//Register an ActionListener on the
// startButton. This is an anonymous
// class definition.
startButton.addActionListener(
new ActionListener(){
public void actionPerformed(
ActionEvent e){
//Clear the operator message field
operMsgField.setText("");

try{
//The communication process is now
// in the TRANSACTION state.
//Retrieve and screen messages
if(numberMsgs == 0){
//Calculate numberMsgs only at
// the beginning of the run,
// because it changes when
// messages are deleted.
outputStream.println("STAT");
String stat = validateOneLine();
//Get the number of messages as
// a String.
String numberMsgsStr =
stat.substring(
4,stat.indexOf(" ",5));
//Convert the String to an int.
numberMsgs = Integer.parseInt(
numberMsgsStr);
}//end if numberMsgs == 0
//NOTE: Msg numbers begin with 1,
// not 0.
//Retrieve and screen each
// message. Each msg ends with a
// period on a new line.
msgNumber = msgCounter + 1;

if(msgNumber <= numberMsgs){
//Process the next message.

//Get and save a unique identifier
// for the message from the server
// and validate the response.
outputStream.println(
"UIDL " + msgNumber);
uidl = validateOneLine();

//Open an output file to save
// the message. Use the UIDL
// as the file name. Others
// may need to modify the
// following code to identify
// a folder for local storage of
// the messages.
fileName =
"c:/MailFiles/" + uidl +".txt";
DataOutputStream dataOut =
new DataOutputStream(
new FileOutputStream(
fileName));

//Send a RETR command to begin
// the message retrieval process
outputStream.println(
"RETR " + msgNumber);
//Validate the response.
String retrResponse =
validateOneLine();

//Clear the text in the TextArea
// at the beginning of each new
// message. If you don't do
// this, the String being
// displayed will become very
// long and the program will run
// very slowly for large numbers
// of messages.
textArea.setText("");

//Read the first line in the
// message from the server.
String msgLine =
inputStream.readLine();
//Insert asterisks in the text
// in an attempt to destroy
// viruses before the file is
// stored locally.
msgLine = insertStars(msgLine);

//Continue reading lines until
// a "." is encountered as the
// first char in a line. That
// signals the end of the msg.
while(!(msgLine.equals("."))){
//Write the line to the output
// file and read the next
// line. Insert newline
// characters when writing the
// output to the file.
dataOut.writeBytes(
msgLine + "\n");
msgLine = inputStream.readLine();
//Insert asterisks to destroy
// virus code.
msgLine = insertStars(msgLine);
}//end while
//Close the output file. The
// message is now stored in a
// local file with a file name
// based on the unique ID
// provided by the server. Note
// that a unique ID provided by
// one server may duplicate a
// unique ID provided by a
// different server.
dataOut.close();

//Now screen the file testing
// for reasons to delete the
// message from the server.
//First initialize the text showing
// in the various components in the
// GUI.
fromField.setText("Call screener");
subjField.setText("Call screener");
operMsgField.setText(
"Call screener");
textArea.setText(
"Progress Meter: ");
//Initialize the match flag
// to false.
boolean match = false;

//Now cause the message file to be
// screened. In the event that you
// decide to design your own
// screening algorithm, this is
// where you would probably
// make the first modification to
// the program. Your version of
// the method named screenMsg
// should return true if it is
// recommending that the message be
// deleted from the server. Also,
// the object of type ScreenResult
// passed as a parameter to the
// method should be populated with
// information to be displayed in
// the text fields and text area of
// the GUI.
ScreenResult theResult =
new ScreenResult();
match = screener.screenMsg(
fileName,uidl,theResult);

//Now display the information
// encapsulated in the ScreenResult
// object by the screenMsg method.
fromField.setText(theResult.from);
subjField.setText(
theResult.subject);
operMsgField.setText(
"Offending Phrase: "
+ theResult.thePhrase);
textArea.setText(theResult.text);
textArea.append("Msg Number: "
+ msgNumber);

//At this point, the user can
// view the From line and the
// Subject line for the message,
// the complete text of the message
// down to the line containing the
// offending word or phrase, as
// well as that word or phrase.

//Increment the message counter
// in preparation for
// processing the next message.
msgCounter++;

//A return value of true means that
// the screener is recommending
// deletion of the message from the
// Email server.
if(match == true){
//The message has been flagged
// as a candidate for deletion
// from the server. Return
// from the actionPerformed
// method and take no further
// action until the user
// presses the Delete button
// or the Start/Next button.
//Pressing the Delete button
// causes the message to be
// deleted from the server.
//Pressing the Start/Next
// button causes it to be
// preserved.
return;
}//end if match == true

//Control reaches this point only
// if match is not true.
//The message is not a
// candidate for deletion from
// the server.
//At this point, we could
// require the user to press
// the Start/Next button to
// process the next message.
//However, we won't do that. The
// following code fires an event
// identical to that which would
// be fired if the user pressed
// the Start/Next button.
Toolkit.getDefaultToolkit().
getSystemEventQueue().
postEvent(new ActionEvent(
startButton,
ActionEvent.
ACTION_PERFORMED,
"Start/Next"));
}//end if msgNumber <= numberMsgs
else{//msgNumber > numberMsgs
//No more messages. Disable the
//Start/Next button.
startButton.setEnabled(false);
//Instruct the user to terminate
// the program.
subjField.setText(
"No more messages, press Close");
fromField.setText(
"No more messages, press Close");
operMsgField.setText(
"No more messages, press Close");
textArea.setText(
"No more messages, press Close");
}//end else
}//end try
catch(Exception ex){
ex.printStackTrace();}
}//end actionPerformed
}//end ActionListener
);//end addActionListener

//Register an ActionListener on the Delete
// button to make it possible for the
// user to remove a message from the
// server.
deleteButton.addActionListener(
new ActionListener(){
public void actionPerformed(
ActionEvent e){
//Clear the operator message field
operMsgField.setText("");

//Confirm that local msgNumber is in
// synch with message number on server
int firstSpace = fileName.indexOf(" ");
int secondSpace = fileName.indexOf(
" ",firstSpace + 1);
String chunk = fileName.substring(
firstSpace + 1,secondSpace);
if(Integer.parseInt(chunk)
!= msgNumber){
System.out.println(
"msgNumber synch error");
System.exit(0);//terminate
}//end if

//Deletion of a message from the
// server is accomplished by marking
// the message for deletion while in
// the TRANSACTION state. The
// message is actually deleted when
// the client sends a QUIT command
// to the server causing the server
// to enter the UPDATE state. If the
// program aborts prematurely before
// sending a QUIT command, marked
// messages are not deleted from the
// server.
//Mark the message for deletion.
//Note that the following three statements have
// been purposely disabled to prevent you from
// accidentally deleting messages from the server
// during your early testing of the program. Do
// not enable these three statements until you
// are certain that you really do want to delete
// messages from the server. At that point in
// time, you can enable the three statements by
// removing the comment indicators.

/*
outputStream.println(
"DELE " + msgNumber);
//Validate the response and display
// it on the GUI. You probably won't
// see it on the GUI because of what
// happens next. The program
// immediately clears the display
// and begins processing the
// next message. If you modify the
// program to eliminate the clearing
// of the display between messages,
// you will see this response.

textArea.append(
"DELE "+validateOneLine()+"\n");
textArea.append(
"Deleted:" + msgNumber + "\n");

*/

//Create and fire a synthetic event
// that simulates the user pressing
// the Start/Next button. This
// initiates the processing of the
// next message.
Toolkit.getDefaultToolkit().
getSystemEventQueue().
postEvent(new ActionEvent(
startButton,
ActionEvent.
ACTION_PERFORMED,
"Start/Next"));
}//end actionPerformed
}//end ActionListener
);//end addActionListener

//Configure the GUI by placing the
// various components on it, setting the size
// and making it visible.
add(startButton);
add(deleteButton);
add(fromField);
add(subjField);
add(operMsgField);
add(textArea);
setTitle("Copyright 2004, R.G.Baldwin");
//Increase the following parameters and
// modify the construction parameters for
// the text fields and the text area to
// increase the size of the GUI.
setSize(400,400);
//Make the GUI visible.
setVisible(true);
}//end constructor
//===========================================//

//Validate a one-line response.
//The purpose of this method is to confirm that
// the server returned +OK and not -ERR to the
// previous command.
//If +OK, the method returns the string
// returned by the server.
//If -ERR, the method displays the string
// returned by the server and terminates the
// session.
private String validateOneLine(){
try{
String response = inputStream.readLine();
if(response.startsWith("+OK")){
return response;
}else{
System.out.println(response);
//Terminate the session.
outputStream.println("QUIT");
socket.close();
System.out.println(
"Premature QUIT on -ERR");
System.exit(0);
}//end else
}catch(IOException e){e.printStackTrace();}
//The following return statement is required
// to satisfy the compiler.
return "Make compiler happy";
}//end validateOneLine()
//===========================================//

//Purpose of this method is to insert an
// asterisk (star) every tenth character in
// order to destroy virus code before it is
// written into the output file. While this
// makes the local version of the message
// harder to read, it does little to reduce its
// usefulness for computer analysis.
private String insertStars(String stringIn){
StringBuffer stringBuffer =
new StringBuffer(stringIn);
int length = stringBuffer.length();
for(int cnt = 9; cnt < length; cnt+=10){
stringBuffer.insert(cnt,'*');
}//end for loop
return new String(stringBuffer);
}//end insertStars
//===========================================//
}//end class Pop302
//=============================================//

//Class to encapsulate screening results. An
// object of this type is passed to the screenMsg
// method where it is populated with the results
// of the screen.
class ScreenResult{
public String subject = "";
public String from = "";
public String thePhrase = "";
public String text = "";
}//end ScreenResult
//=============================================//

//This class implements a set of rules for
// detecting SPAM messages and for recommending
// whether or not a message should be deleted
// from the server.
//If you have a better way to detect SPAM, you
// can replace this class by a completely
// different class definition, so long as you
// maintain the user interface.
//
//An object of this class has one entry point and
// one exit point, which is the public method
// named screenMsg. However, the constructor
// receives a reference to the GUI object created
// by instantiating the class named Pop302. The
// object of this class uses that reference to
// display progress on the text area belonging
// to the GUI. This is comforting on those
// occasions when a very long message is
// encountered. This callback link could easily
// be eliminated by deleting code from two
// locations in this class and removing the
// callback constructor parameter.
class Screen{

TreeSet subjWordList;
TreeSet bodyWordList;
TreeSet friendlyWordList;
Pop302 theGui;//save callback reference here
String phrase;

Screen(Pop302 theGui){//constructor
this.theGui = theGui;
//Read the files containing word lists and
// create TreeSet objects containing those
// words or phrases in alphabetical order.
makeSubjWordList();
makeBodyWordList();
makeFriendlyWordList();
}//end constructor

//This method is used to identify messages that
// are candidates for being deleted from the
// server. Such identification is based on
// analyzing the file in which the message is
// stored locally. A return value of true means
// that the message is believed to be SPAM, and
// that the method recommends deleting the
// message from the server. In addition to
// the return value, certain strings are
// encapsulated in the incoming object of type
// ScreenResult. This information is posted
// on the GUI by the object of the class named
// Pop302.
public boolean screenMsg(String fileName,
String uidl,ScreenResult theResult){
//Initialize match to false
boolean match = false;
try{
//Open the file containing a local copy of
// the message. Note that the message has
// been modified by inserting asterisks
// in an attempt to protect against
// viruses.
BufferedReader inData
= new BufferedReader(new FileReader(
fileName));
String data;//temp holding area

//Get the Subject line by skipping header
// lines prior to the Subject line. Mark
// the beginning of the file to make it
// easy to rewind later. Set the readAhead
// limit to 10000 characters. Note that a
// later call to reset may fail if that
// many characters or more are read before
// reset is called.
inData.mark(10000);
//Populate the ScreenResult object just in
// case a Subject line isn't found and a
// From line isn't found later.
theResult.subject = "No Subj line found";
theResult.from = "No From line found";
while((data = inData.readLine()) != null){
//Remove the asterisks from the data that
// were inserted earlier in an attempt
// to defeat viruses.
data = removeStars(data).toUpperCase();
if(data.startsWith("SUBJECT:")){
//Put the Subject line in the
// ScreenResult object.
theResult.subject = data.toUpperCase();
//Screen against an upper-case version
// of the words and phrases in a
// TreeSet object containing friendly
// email addresses and subjects
Iterator iterator =
friendlyWordList.iterator();
while(iterator.hasNext()){
String friendlyWord =
((String)(iterator.next())).
toUpperCase();
match = false;
if(!(friendlyWord.equals(""))){
//The screenOnPhrase method is used
// to search for a match between
// the entries in the friendlyWord
// list and the text in the Subject
// line of the message. In this
// case, no extra characters are
// allowed in the phrase as
// it appears in the Subject line.
match = screenOnPhrase(
data,friendlyWord,0);
}//end if

if(match == true){
//The screenOnPhrase method found a
// match with a phrase in
// the friendlyWord list.
//The message should not
// be deleted from the server and
// the local copy should be
// deleted from the disk to
// prevent later analysis to
// extract target words or phrases
// for SPAM. This algorithm has no
// way of blocking SPAM with a
// Subject that matches a phrase
// in the friendlyWord list.
//Close the file and delete it from
// the local disk.
inData.close();
new File(fileName).delete();
//Store the phrase in the
// ScreenResult object.
theResult.thePhrase = phrase;
//Terminate the execution of this
// method, telling the program not
// to delete the message from the
// server by returning false. This
// is one of three return points
// from this method.
return false;
}//end if match = true
}//end while iterator has next
//Don't attempt to read any more lines
// from this file in this while loop.
break;
}//end if
}//end while loop

//Reset back to beginning of file. The
// Subject for this message is now showing
// in the GUI if the message contained a
// Subject line. Otherwise the GUI
// contains a message indicating that
// a Subject line wasn't found.
inData.reset();

//Get the From line by skipping header
// lines prior to the From line.
//Also test the From line against a list of
// friendly email addresses.
//If a match is found, do not delete the
// message from the server, but do delete
// the local file containing the message
// from the disk to prevent it from being
// analyzed later in an attempt to find
// target words or phrases for SPAM.
while((data = inData.readLine()) != null){
//Convert the data to all upper case and
// remove all asterisks from the data.
// Note that this will remove naturally
// occurring asterisks in addition to
// those that were inserted in an attempt
// to protect against virus code. If
// this proves to be a problem later, the
// removeStars method can be modified to
// remove only those asterisks that were
// inserted on ten-character intervals.
data = removeStars(data.toUpperCase());
if(data.startsWith("FROM:")){
//Put the From line in the ScreenResult
// object.
theResult.from = data;

//Screen against an upper-case version
// of the words and phrases in a
// TreeSet object containing friendly
// email addresses and subjects
Iterator iterator =
friendlyWordList.iterator();
while(iterator.hasNext()){
String friendlyWord =
((String)(iterator.next())).
toUpperCase();
match = false;
if(!(friendlyWord.equals(""))){
//The screenOnPhrase method is used
// to search for a match between
// the entries in the friendlyWord
// list and the text in the From
// line of the message. In this
// case, no extra characters are
// allowed in the Email address as
// it appears in the From line.
match = screenOnPhrase(
data,friendlyWord,0);
}//end if

if(match == true){
//The screenOnPhrase method found a
// match with an Email address in
// the friendlyWord list.
//Either this msg was sent by a
// friend, or a spammer sent it
// claiming to be from a friend.
// In either case, it should not
// be deleted from the server and
// the local copy should be
// deleted from the disk to
// prevent later analysis to
// extract target words or phrases
// for SPAM. This algorithm has no
// way of blocking SPAM that claims
// to have been sent by a friendly
// Email address.
//Close the file and delete it from
// the local disk.
inData.close();
new File(fileName).delete();
//Store the Email address in the
// ScreenResult object.
theResult.thePhrase = phrase;
//Terminate the execution of this
// method, telling the program not
// to delete the message from the
// server by returning false. This
// is one of three return points from
// this method.
return false;
}//end if match = true
}//end while iterator has next
//Don't attempt to read any more lines
// from this file in this while loop.
break;
}//end if data starts with From
}//end while loop on null

//Reset back to beginning of the file. The
// From line for this message is now
// showing in the GUI. Read and process
// the entire file.
inData.reset();

//Read and process strings until eof is
// indicated by null. Provide separate
// processing for the Subject line and all
// other lines in the message described
// herein as body text, although this also
// includes header data in the message.
//Note that more sophisticated forms of
// screening can be inserted at this point
// in the program.
int progressCounter = 0;
while((data = inData.readLine()) != null){
data = removeStars(data);
if(data.startsWith("Subject")){
//Process the Subject line.
//Process all data as upper case data
// to keep the spammer from hiding
// behind random case conversions.
data = data.toUpperCase();

//Append each line of data to the
// String stored in the ScreenResult
// object.
theResult.text =
theResult.text + data + "\n";
//Display progress on the GUI. Remove
// this code to break this link with
// the GUI object if desired.
if(++progressCounter < 50){
theGui.textArea.append(".");
}else{//Display progress on a new line
progressCounter = 0;
theGui.textArea.append(".\n");
}//end else

//Screen against an upper-case version
// of the words and phrases in a
// TreeSet object containing target
// words and phrases in the Subject
// line of the message.
match = false;
Iterator iterator =
subjWordList.iterator();
while(iterator.hasNext()){
String subjWord =
((String)(iterator.next())).
toUpperCase();
if(!(subjWord.equals(""))){
//Search for a match between the
// words or phrases in the subjWord
// list and the line of msg data.
// Allow one extra character to
// occur between the characters in
// the data and still make a match.
match = screenOnPhrase(
data,subjWord,1);
}//end if

if(match == true){
//Msg is a candidate for deletion.
//Don't need this local file for
// statistical analysis. Move it
// to a local archive folder.
inData.close();
boolean moved =
new File(fileName).renameTo(
new File(
"c:/MailFiles/Archives/"
+uidl+".txt"));
if(!moved)System.out.println(
"Unable to move file " + uidl);
//There is no need to test against
// any more words in this iterator
// loop.
break;
}//end if
}//end while iterator has next
}//end if data starts with Subject
//Data line does not start with Subject.
// Process the line as body text. Note
// that some body text occurs before the
// Subject line in the format of a
// typical message. Therefore, the code
// in the else clause will typically be
// executed several times before the
// code in the if clause discussed above
// will be executed.
else{
//Screen on an upper-case version of
// the message.
data = data.toUpperCase();
//Append the data line to the String
// value being stored in the
// ScreenResult object.
theResult.text =
theResult.text + data + "\n";
//Display progress in the GUI. Remove
// this code to break the callback link
// with the GUI if desired.
if(++progressCounter < 50){
theGui.textArea.append(".");
}else{
progressCounter = 0;
theGui.textArea.append(".\n");
}//end else
//Screen against an upper-case version
// of the words and phrases in a
// TreeSet object designed
// specifically for screening body
// text.
Iterator iterator = bodyWordList.
iterator();
match = false;
while(iterator.hasNext()){
String bodyWord =
((String)(iterator.next())).
toUpperCase();
if(!(bodyWord.equals(""))){
//Allow one character to occur
// between the characters in the
// data line and still make a
// match.
match = screenOnPhrase(
data,bodyWord,1);
}//end if

if(match == true){
//Msg is a candidate for deletion.
//Don't need this local file for
// statistical analysis. Move it
// to a local archive folder.
inData.close();
boolean moved =
new File(fileName).renameTo(
new File(
"c:/MailFiles/Archives/"
+uidl+".txt"));
if(!moved)System.out.println(
"Unable to move file " + uidl);
//There is no need to test against
// any more words in this iterator
// loop.
break;
}//end if
}//end while iterator has next
}//end else for line not Subject line
//A match has been found. No need to
// read any more data lines in this while
// loop.
if(match == true)break;
}//end while loop on read until null
inData.close();//Close file if still open
}catch(Exception e){e.printStackTrace();}
//Store the matching phrase (or the last
// phrase processed) in the ScreenResult
// object.
theResult.thePhrase = phrase;
//Return the value of match indicating
// whether or not a match was found. Note that
// this return statement is one of three return
// points in this method. This return will
// not be reached if a friendly phrase or Email
// address was found earlier when processing
// the Subject line or the From line.
return match;
}//end screenMsg
//===========================================//

//This method tests a string to see if it
// contains a word or phrase that may have
// extraneous characters inserted into it,
// such as VI*A-GRA.
//If the string contains the sequence of
// characters making up the word or phrase,
// with spanLim or fewer extraneous characters
// between any two of the word's characters,
// the method returns true. For example, if
// spanLim = 1, the spammer can insert one
// character between any two of the characters
// that make up the word and the word will
// still be detected. However, if the
// spammer inserts two or more characters,
// the offending word will not be detected.
//Need to be careful to avoid making spanLim
// too large. Large values of spanLim result
// in false alarms due to the fact that
// widely-separated characters can be
// considered to be part of the word or
// phrase. For example, if spanLim = 2 or
// greater, the word PORN will be found in
// the word imPORtaNt.
private boolean screenOnPhrase(String data,
String phrase,
int spanLim){
this.phrase = phrase;
StringBuffer str = new StringBuffer();
ArrayList locationData = new ArrayList();

//Compare each char in the data with each
// unique char in the word or phrase. If
// there is a match, append the char to str
// and save the location of the char in
// the ArrayList referred to by locationData.

//Eliminate duplicate char in the word or
// phrase by storing in a TreeSet. Note that
// this will also sort the char, but that
// doesn't matter.
TreeSet treeSet = new TreeSet();
for(int cnt = 0; cnt < phrase.length();
cnt++){
treeSet.add(
new Character(phrase.charAt(cnt)));
}//end for loop

//Get the unique characters from the set and
// save them in a StringBuffer
Iterator iter = treeSet.iterator();
StringBuffer tempPhrase = new StringBuffer();
while(iter.hasNext()){
tempPhrase.append(
((Character)(iter.next())).charValue());
}//end while

//Use the StringBuffer of unique characters
// to test the string and extract matching
// characters from the string. Discard all
// non-matching characters. This converts
// the original data into a string of
// characters, each of which is a character
// in the word or phrase. All other
// characters have been removed. Thus, if
// the data contains the word or phrase, it
// will occur somewhere in the compressed
// string with no extra characters in
// between. An example might be as follows:
// SMSPMASPAMMPAS
for(int i = 0; i < data.length(); i++){
for(int j = 0; j < tempPhrase.length();
j++){
if(data.charAt(i) ==
tempPhrase.charAt(j)){
str.append(data.charAt(i));
locationData.add(new Integer(i));
}//end if
}//end for on tempPhrase
}//end for on data

//Test to see if the extracted char sequence
// contains the word or phrase.
int match = str.indexOf(phrase);
if(match == -1){
return false;//no match
}//end if

//There is a match. Confirm that the span
// between target characters in data is not
// greater than allowed by the incoming
// spanLim parameter.
int maxSpan = 0;
int locA = ((Integer)locationData.
get(match)).intValue();
int locB = 0;
for(int cnt = 1; cnt < phrase.length();
cnt++){
locB = ((Integer)locationData.get(
match + cnt)).intValue();
int span = locB - locA;
if(span > maxSpan){
maxSpan = span;
}//end if
locA = locB;
}//end for loop

if(maxSpan > spanLim+1){
return false;//span too large
}else{
return true;//made a match
}//end else

}//end screenOnPhrase

//===========================================//
//Purpose of this method is to remove the
// asterisks inserted into the data by the
// method named insertStars, and to append two
// asterisks at the end of the line. Note that
// this method removes all asterisks, not just
// those inserted earlier. If this proves to
// be a problem, this method should be modified
// to remove only those asterisks that occur on
// ten-character intervals.
private String removeStars(String stringIn){
StringBuffer stringBuf =
new StringBuffer(stringIn);
int index = 0;
while(index > -1){
index = stringBuf.lastIndexOf("*");
if(index > -1){
stringBuf.delete(index,index+1);
}//end if
}//end while
stringBuf.append("**");
return new String(stringBuf);
}//end removeStars()
//===========================================//

//Purpose: To create a TreeSet object
// containing words used to screen the message
// subject lines.
//This method reads strings from a text file
// named Pop302a.txt and creates the list as
// a TreeSet object with no duplicates.
//See additional comments in the later section
// regarding the makeBodyWordList method.

private void makeSubjWordList(){
subjWordList = new TreeSet();

//Read word list from text file and populate
// the TreeSet object.
try{
BufferedReader inData
= new BufferedReader(new FileReader(
"Pop302a.txt"));
String data; //temp holding area

while((data = inData.readLine()) != null){
subjWordList.add(data);

}//end while loop
inData.close();//Close file
}catch(Exception e){e.printStackTrace();}
}//end makeSubjWordList
//===========================================//

//Purpose: To create a TreeSet object
// containing words and phrases used to screen
// the message BODY lines. See notes above
// regarding the list used to screen the
// Subject line of each message in the method
// named makeSubjWordList.

//It is important to maintain these two lists
// as separate lists. Because of the much
// larger number of characters in the body than
// in the Subject, false alarms are much more
// likely in the body. Therefore, individual
// words that work well when screening the
// Subject line may produce false alarms when
// screening the body. For example, the word
// PORN appears in the word IMPORTANT. It is
// much more likely that the word IMPORTANT
// will appear somewhere in the body than in
// the Subject line (although it may appear in
// the Subject line as well, thus producing a
// false alarm in both cases). Also, the word
// ANTIVIRUS works well in the Subject, but
// cannot be used to screen the body because
// many servers insert that word into the
// message header after they test the message
// for viruses. Also, IP addresses and URLs
// work well in the body, but rarely appear in
// the Subject. Therefore, testing the Subject
// against a long list of URLs simply wastes
// time.

//The following words (among others) should not
// be added to the list for the reasons given:

//PORN may be confused with IMPORTANT
//SPAM causes lots of false alarms. I inserted
// a space as in "SPAM " to decrease false
// alarms. Will probably also decrease valid
// hits.
//ANTIVIRUS appears in some valid message hdrs
//WEIGHT often appears in messages regarding
// html fonts
//SLUT may be confused with SOLUTION
//==End of prohibited list==


private void makeBodyWordList(){
bodyWordList = new TreeSet();

//Read word list from text file and populate
// the TreeSet object.
try{
BufferedReader inData
= new BufferedReader(new FileReader(
"Pop302b.txt"));
String data; //temp holding area

while((data = inData.readLine()) != null){
bodyWordList.add(data);

}//end while loop
inData.close();//Close file
}catch(Exception e){e.printStackTrace();}
}//end makeBodyWordList
//===========================================//

//Purpose: To create a TreeSet object
// containing words used to screen the message
// From lines.
//This method reads strings from a text file
// named Pop302c.txt and creates the list as
// a TreeSet object with no duplicates.
//Only the primary portion of the friend's
// Email address should be included in the
// file used to create the list. This would
// be x@y.z

private void makeFriendlyWordList(){
friendlyWordList = new TreeSet();

//Read word list from text file and populate
// the TreeSet object.
try{
BufferedReader inData
= new BufferedReader(new FileReader(
"Pop302c.txt"));
String data; //temp holding area

while((data = inData.readLine()) != null){
friendlyWordList.add(data);

}//end while loop
inData.close();//Close file
}catch(Exception e){e.printStackTrace();}
}//end makeFriendlyWordList
//===========================================//
}//end class Screen

Listing 34
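
The comments on the removeStars method in the program above point out that the method removes every asterisk in a line, including asterisks that were part of the original message, and suggest modifying it to remove only the asterisks that were inserted at ten-character intervals. The following is a minimal sketch of one way that might be done. It is not part of the program. The class name StarSketch and the method name removeInsertedStars are hypothetical, the sketch assumes that each line was produced by insertStars exactly as shown in the program, and, unlike removeStars, it does not append two asterisks to the end of the line.

```java
class StarSketch {

  // Same logic as insertStars in the program: the loop limit is the
  // original length, so one asterisk is inserted for every ten
  // characters of the original line, at indices 9, 19, 29, ...
  static String insertStars(String stringIn) {
    StringBuilder buf = new StringBuilder(stringIn);
    int length = buf.length();
    for (int cnt = 9; cnt < length; cnt += 10) {
      buf.insert(cnt, '*');
    }
    return buf.toString();
  }

  // Hypothetical inverse that removes only the inserted asterisks.
  static String removeInsertedStars(String starred) {
    StringBuilder buf = new StringBuilder(starred);
    // An original line of 10k + r characters (r from 0 to 9)
    // becomes 11k + r characters, so the number of inserted
    // asterisks is the starred length divided by 11.
    int inserted = buf.length() / 11;
    // Delete from right to left so that the indices of the
    // remaining inserted asterisks do not shift.
    for (int i = inserted - 1; i >= 0; i--) {
      buf.deleteCharAt(9 + 10 * i);
    }
    return buf.toString();
  }

  public static void main(String[] args) {
    String original = "Subject: 2 * 3 = 6, *really*";
    String starred = insertStars(original);
    System.out.println(starred);
    System.out.println(removeInsertedStars(starred));
  }
}
```

Because the sketch relies only on positions, it restores a line that contained asterisks of its own. It would give wrong results for a line that was not produced by insertStars.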

Sample file Pop302a.txt

AGE REVERSING PRODUCT
ANNUAL FEE
AT THE PUMP
AUTO BONANZA
AUTO WARRANTY
BECAREFUL DOWNLOADING MUSIC FILES
BOTOX
BOTTLES SOLD DAILY
C1ALIS
CIALIS
CODEINE
LEVITRA
NA1L-FUNGUS
NEVER REPAY YOUR CREDIT CARD DEBT
ORDER YOUR DRUGS
PATCH
PILLS
PILLZ
SEXUA1
SEXUAL
SILDENAFIL CITRATE
SIZE DOES MATTER
SLZE MATTERS
SOLVE THE PROBLEM DOWNSTAIRS
TERMINATE DEBT
TONER CARTRIDGES
V1@GRA
V1AGRA
VALIUM
VA|IUM
VI@GRA
VIAGRA
VIC0DIN
VICODAN
VICODIN
VIIAGRA
VLAGRA
VÍAGRA
XANAAX
XANAX
Z0LOFT
/IAGR@

Listing 35

Sample file Pop302b.txt

123.456.789.123
HTTP://SOMESITE.ABC
AGE REVERSING PRODUCT
ANNUAL FEE
AT THE PUMP
AUTO BONANZA
AUTO WARRANTY
BECAREFUL DOWNLOADING MUSIC FILES
BOTOX
BOTTLES SOLD DAILY
C1ALIS
CIALIS
CODEINE
LEVITRA
NA1L-FUNGUS
NEVER REPAY YOUR CREDIT CARD DEBT
ORDER YOUR DRUGS
PATCH
PILLS
PILLZ
SEXUA1
SEXUAL
SILDENAFIL CITRATE
SIZE DOES MATTER
SLZE MATTERS
SOLVE THE PROBLEM DOWNSTAIRS
TERMINATE DEBT
TONER CARTRIDGES
V1@GRA
V1AGRA
VALIUM
VA|IUM
VI@GRA
VIAGRA
VIC0DIN
VICODAN
VICODIN
VIIAGRA
VLAGRA
VÍAGRA
XANAAX
XANAX
Z0LOFT
/IAGR@

Listing 36
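
The entries in the sample files are upper-case fragments such as V1AGRA, and the comments near the end of Listing 34 note that WEIGHT often appears in messages regarding html fonts. The sketch below illustrates one way such entries could be tested against message text, and why an entry like WEIGHT would cause false alarms. It assumes that the text is converted to upper case and that each entry is then looked for as a substring. That is an assumption made for illustration, not a restatement of the screening code in the Screen class. The class name SubstringScreenSketch and the method name firstHit are hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration; not part of the Screen class.
class SubstringScreenSketch {

  // Returns the first list entry found anywhere in
  // the text, or null if no entry is found.
  // Assumption: entries are stored in upper case.
  static String firstHit(String text,
                         SortedSet<String> wordList) {
    String upper = text.toUpperCase(Locale.ROOT);
    for (String word : wordList) {
      if (upper.contains(word)) {
        return word;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    SortedSet<String> list = new TreeSet<>(
        Arrays.asList("V1AGRA", "VIAGRA", "WEIGHT"));

    // A deliberate misspelling is caught only because
    // the misspelled form is in the list.
    System.out.println(
        firstHit("Cheap v1agra today", list));
    // prints V1AGRA

    // A valid html message triggers a false alarm.
    System.out.println(firstHit(
        "<b style=\"font-weight: bold\">Agenda</b>",
        list));
    // prints WEIGHT

    System.out.println(
        firstHit("Lunch on Friday?", list));
    // prints null
  }
}
```

Under this assumption, every spelling variant that a spammer invents needs its own entry in the list, which is consistent with the many variants of the same drug names in the sample files.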

Sample file Pop302c.txt

BALDWIN@DICKBALDWIN.COM
MSNBC_BREAKINGNEWS_NEWSMAIL@MSNBC.COM
BOOKSTORE@INFORMIT.COM
ENews@SSA.GOV
Developer.com Update
MSNBC_DAILYMARKETCLOSE_NEWSMAIL@MSNBC.COM
ITSC 1313
ITSC1313

Listing 37
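
The sample friendly list in Listing 37 mixes upper-case entries with mixed-case entries such as ENews@SSA.GOV, and it includes entries that are not Email addresses, such as ITSC 1313. The following sketch shows one way a From line could be compared against such a list without regard to case. It assumes that a message counts as friendly when its From line contains any entry, and it converts both the From line and the entry to upper case before comparing. Whether the Screen class converts the list entries is not shown in this part of the listing, so treat that step as an assumption. The class name FriendlySketch and the method name isFriendly are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; not part of the Screen class.
class FriendlySketch {

  // Returns true if the From line contains any entry
  // in the friendly list, ignoring case.
  static boolean isFriendly(String fromLine,
                            Collection<String> friends) {
    String upper = fromLine.toUpperCase(Locale.ROOT);
    for (String entry : friends) {
      // Convert the entry too, so that a mixed-case
      // entry in the file can still match.
      if (upper.contains(
              entry.toUpperCase(Locale.ROOT))) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> friends = Arrays.asList(
        "BALDWIN@DICKBALDWIN.COM", "ENews@SSA.GOV");

    System.out.println(isFriendly(
        "From: \"SSA\" <enews@ssa.gov>", friends));
    // prints true

    System.out.println(isFriendly(
        "From: stranger@example.com", friends));
    // prints false
  }
}
```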

  


Copyright 2004, Richard G. Baldwin.  Reproduction in whole or in
part in any form or medium without express written permission from
Richard Baldwin is prohibited.

About the author

Richard Baldwin is a college professor (at Austin Community College in
Austin, TX) and private consultant whose primary focus is a combination
of Java, C#, and XML. In addition to the many platform and/or language
independent benefits of Java and C# applications, he believes that a
combination of Java, C#, and XML will become the primary driving force
in the delivery of structured information on the Web.

Richard has participated in numerous consulting projects, and he
frequently provides onsite training at the high-tech companies located
in and around Austin, Texas.  He is the author of Baldwin’s Programming
Tutorials, which has gained a worldwide following among experienced and
aspiring programmers. He has also published articles in JavaPro
magazine.

Richard holds an MSEE degree from Southern Methodist University
and has many years of experience in the application of computer
technology to real-world problems.

Baldwin@DickBaldwin.com

-end-
 
