Enlisting Java in the War Against SPAM: Training the Subject Line Screener

Java Programming Notes # 2154


Preface


This is the third lesson in a series designed to teach you how to
write a Java program to remove SPAM from your Email server before you
download it into your primary Email client.  The first lesson was
entitled Enlisting Java in the War Against SPAM, Part 1, The
Communications Module.

The previous lesson was entitled Enlisting Java in the War Against
SPAM, Part 2, The Screening Module.

The communications module

The first lesson explained the communications module used to
communicate with
your Email server, and to remove SPAM messages from the
server.

SPAM screening algorithm

The second lesson explained my SPAM screening algorithm.  The
program is designed to allow you to use my SPAM screening
algorithm, or to invent your own.  You can use my
algorithm as a starting point if you decide to invent your own.

Training the algorithm

My algorithm operates
separately on the Subject line, the From line,
and the body text of each Email message.  This
lesson explains my program for training the algorithm to do a better
job of
identifying SPAM in the future based on the Subject line
of Email messages.

The next lesson will explain my program for training the
algorithm to do a better job of
identifying SPAM in the future based on the body text of
Email messages.

Viewing tip

You may find it useful to open another copy of this lesson in a
separate browser window.  That will make it easier for you to
scroll back and forth among the different listings and figures while
you are reading about them.

Supplementary material

I recommend that you also study the other lessons in my extensive
collection of online Java tutorials.  You will find those lessons
published at Gamelan.com.
However, as of the date of this writing, Gamelan doesn’t maintain a
consolidated index of my Java tutorial lessons, and sometimes
they are difficult to locate there.  You will find a consolidated
index at www.DickBaldwin.com.

Preview

Can you write better SPAM screening
algorithms?

Did you ever think that you might be able to write better SPAM
screening algorithms than those available in the SPAM screening
software that you are now using?  If so, this series of lessons is
for you.

Even if that is not the case, like most of us, you are probably
overwhelmed by SPAM
and therefore you may find this series of lessons interesting.

Remove SPAM from the server

In this series of lessons, I am showing you how to write a
Java program
that supplements the SPAM screening software that you are currently
using.  The program is used to identify and remove SPAM from your
Email server before it is downloaded into your primary Email client.

Any SPAM that makes it past the SPAM screening program can be
further acted upon
by the SPAM screener that is built into your Email client.

Lessons in the series

This series consists of (at least) four lessons.  The
first lesson in the
series explained the communications module used to communicate with
your Email server, and to remove SPAM messages from the
server.

As mentioned earlier, the program is designed to allow you to invent
and implement your own
SPAM screening algorithm in addition to, or as an alternative to my
algorithm.

The second lesson explained the inner
workings of my SPAM screening algorithm, which operates
separately on the Subject line, the From line,
and the body text of each Email message.

Algorithm training programs

The screening program is
most useful
when the algorithm has been well trained, and the training of the
algorithm is kept up to date.  Although it is possible to perform
this training with a
text editor, you can be much more productive, and you are much more
likely to keep the training up to date using the programs that I will
explain in this lesson and the next.

This lesson explains a program named Pop302d,
which
uses historical data to train the algorithm to do a better job
of identifying SPAM in future messages based on the Subject
of the message.

The next lesson will explain another program named Pop302e,
which uses historical data to train the algorithm to do a
better job of identifying SPAM based on the body text of
the message, (which includes the From line but excludes the
Subject line).

Because of the need to train the algorithm and to keep it up to
date, and the ease with which
these programs make that possible, the training programs are
just as important as the main program.

Operational sequence

Here is the typical operational sequence that I go through each
morning to remove SPAM from my Email server before downloading it into
my primary Email client, and to train the algorithm to recognize
future SPAM messages like those that managed to evade the screening
program on the first pass that morning.

  1. Run the main program named Pop302 (explained in the
    previous two lessons)
    to identify SPAM and remove it from the
    server.  This normally allows a few (typically about five
    percent)
    SPAM messages (stragglers) to get
    through.  These straggler messages are stored in a history folder
    on my local disk.
  2. Run the program named Pop302d (explained in this
    lesson)
    to train the algorithm to recognize the stragglers as SPAM
    based on
    information in the Subject line.  (It is also
    possible to recognize that a message is not SPAM and to exclude it from
    further training activities.)
  3. Run the program named Pop302e (explained in the next
    lesson)
    to train the algorithm to recognize the stragglers as SPAM based on
    information
    in the body text.
  4. Go back and run the main program named Pop302 to remove
    the SPAM stragglers from the server.
  5. Run the primary Email client to download the remaining good
    messages from the Email server into my local Email inbox.

When I am in a hurry …

However, it isn’t necessary to perform all of these steps every
day.  On those mornings when I am in a hurry, I skip steps
2, 3, and 4, leaving the straggler messages in the local history folder
for use later.

(The straggler messages will, of course, end up in my
local Email inbox when I run my primary Email client without having
purposely
removed them from the server beforehand.)

Sometime later (perhaps the next day or several days later)
I will perform steps 2 and 3 to train the algorithm to recognize future
SPAM
messages represented by the characteristics of the SPAM messages that
have
been saved in the local
history folder.

Effectiveness of my algorithm

After several weeks of training, my
algorithm is reliably identifying about ninety-five percent of all SPAM
messages, allowing me to delete them from my Email server before
downloading them into my primary Email
client.  By executing steps 2, 3, and 4 above, I am able to also
eliminate the remaining five percent of the SPAM messages before
downloading them into my primary Email client.

Most
of
the messages that make it past the screen are messages written in
a non-English language that I am unable to read.  No one that I
know would send a message to me in a
language that I can’t read, so I consider such messages to be
SPAM.  They constitute the bulk of the five-percent of the SPAM
messages that make it past the screen.

Operational
Discussion

Offensive words and phrases

My SPAM screening algorithm screens for SPAM on the basis of words
or
phrases in the From line, words or phrases in the Subject
line, and
words or phrases in the body text.

Friendly Email addresses and subjects

A list of friendly Email addresses and friendly subjects is used to
screen the From
line and the Subject line.  Messages that are from
friendly Email addresses, and messages that have known good Subject
lines are preserved on the server and no information about those
messages is
saved on the local disk. They are simply ignored after determining that
they are friendly.

The friendly list is created and maintained using a simple text
editor.  This is a relatively short list that rarely changes and
no special training program is needed to keep this list
up to date.

Different lists for Subject and body
text

Different lists of words and phrases are used for screening Subject
lines and body text for SPAM. This is important because
the same set of
words
and phrases aren’t always appropriate for use in both cases.

For example, the word ANTIVIRUS is appropriate for screening the
Subject line, but is not appropriate for screening the
body text. The
word ANTIVIRUS often appears legitimately in the header of Email messages
that have been scanned for viruses by the server, but also often
appears in the Subject line of SPAM messages.

Conversely, some of the most useful material in the list used to screen
the body text is a large set of URLs for known spammer web sites. 
However, URLs have little value for screening the Subject line,
so there is no point in wasting computer time screening Subject lines
against URLs.

To summarize …

Although there are exceptions to every rule, the most useful material
for screening Subject lines consists of plain text
words and phrases that are commonly included in the Subject lines
of SPAM messages.

Such text is of very limited value when screening the body text
for reasons that I will explain later.  The most useful material
for screening the body text is a list of known spammer
URLs, which are of very limited value when screening Subject lines.

Common spammer tricks are defeated

Several common spammer tricks are defeated by my SPAM screening
algorithm.

Insertion of extraneous characters and case
changes

For example, the common spammer trick of inserting a small number of
extra characters
between the
characters in an offending word or phrase is defeated.  Also, the
common trick of mixing the case of the
characters in an offending word or phrase is defeated.

As a specific example, my algorithm will recommend deletion of any
message having
any of the following in its Subject line or its body
text if the word VIAGRA is included in the lists used to screen for
SPAM:

vIaGrA
V.IagRA
V.I.A.G.R.A

Therefore, when using this program to train the algorithm, if you find
a SPAM message containing V.IagRA in the subject line, you
should use the program to add the word VIagRA to the
list.  (Simply edit out the extraneous period between the V
and the I.)
The algorithm will deal with the other
variations of that word having extraneous characters inserted and
having random case changes.
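
The details of the screening algorithm were covered in the previous
lesson and are not repeated here.  As a rough illustration only, the
following minimal sketch shows one way that matching could tolerate
mixed case and a small number of extra characters between the
characters of a listed word.  The class name, the method name, and the
maxGap parameter are hypothetical, and the regular-expression approach
is an assumption rather than the approach used by Pop302.

```java
import java.util.regex.Pattern;

class LooseMatchSketch {
    // Returns true if the characters of 'word' occur in order in 'text',
    // ignoring case, with at most 'maxGap' extra characters between
    // consecutive characters. maxGap is a hypothetical tuning parameter.
    static boolean looselyContains(String text, String word, int maxGap) {
        String upperWord = word.toUpperCase();
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < upperWord.length(); i++) {
            if (i > 0) {
                // Allow up to maxGap arbitrary characters between letters.
                regex.append(".{0,").append(maxGap).append("}");
            }
            // Quote each character so that symbols such as '@' or '.'
            // in the listed word are matched literally.
            regex.append(Pattern.quote(String.valueOf(upperWord.charAt(i))));
        }
        return Pattern.compile(regex.toString())
                      .matcher(text.toUpperCase())
                      .find();
    }

    public static void main(String[] args) {
        String[] subjects =
            {"vIaGrA", "V.IagRA", "V.I.A.G.R.A", "Lunch on Friday?"};
        for (String s : subjects) {
            System.out.println(s + " -> " + looselyContains(s, "VIagRA", 1));
        }
    }
}
```

A larger maxGap catches more disguised variations, but it also raises
the risk of false positives, so a small value is the safer choice in a
sketch like this.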

Purposely misspelled words

As currently written, my algorithm has no way of dealing with another
common spammer trick of purposely misspelling words. 

(Perhaps this is an opportunity for you to improve the
algorithm.)

Therefore, if you find a SPAM message where the Subject line
contains the word V.I@gRA, you should use the program to add
the word VI@gRA to the list to deal with the incorrect
spelling.  (Once again, just edit out the extraneous period.)

Duplicates are not a problem

Don’t be concerned that you may forget and add a word or phrase to the
list more than once.  The program will eliminate all duplicates,
thereby preventing the length of the list from increasing due to
duplicates.

Bogus HTML tags

One of the reasons that plain text words and phrases are not very
useful in screening body text is because spammers often
insert long bogus HTML tags in the middle of such words and
phrases.  Because the tags are not recognized as valid HTML tags,
the HTML rendering engine simply ignores them when the SPAM message is
displayed in an HTML-compatible Email client.

My screening algorithm, as currently written, does not deal with this
situation.  It wouldn’t be difficult to update the algorithm to
cause it to delete bogus HTML tags before screening the body
text.  I will probably update the
algorithm to delete such tags at some point in the future, but the
current version doesn’t contain that capability.
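
For readers who want to experiment with this improvement, here is a
minimal sketch of one possible approach.  It is not part of Pop302;
the class and method names are hypothetical.  It simply removes
anything that looks like a tag, whether valid or bogus, before the
text is screened.

```java
class TagStripSketch {
    // Removes every substring that starts with '<', continues with any
    // characters other than '>', and ends with '>'. Both valid and bogus
    // tags are removed, which is acceptable when the only goal is to
    // screen the remaining text for offensive words and phrases.
    static String stripTags(String bodyText) {
        return bodyText.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("Buy VI<qzxv>AG</qzxv>RA today"));
    }
}
```

This is a crude filter.  It would also remove legitimate text that
happens to sit between a literal < and > character, so treat it as a
starting point only.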

The user interface

Figure 1 shows the GUI through which the user controls the training
program named Pop302d.  Figure 1 also partially illustrates
the procedure that I use to train the algorithm to
do a better job of identifying SPAM in the Subject line
of future messages.

User interface

Figure 1 Graphical User Interface

(Note that this GUI was purposely made narrow to cause
it to fit into this narrow publication format.  I
recommend that you increase the width of the Frame to at least 750
pixels, and increase the width of the TextField and TextArea objects to
at least 100 characters each.)

A message from the history folder

In Figure 1, a message previously stored in the history folder has been
loaded into the GUI.  The complete raw text of that message is
available for viewing in the large text area if desired.

The From
line and the Subject line are displayed in the
top two text fields in the GUI.  (I purposely deleted the
Email address of the sender.)

User instructions are displayed in the fourth text field.

Offensive text in the Subject line

In this case, the user has identified unique and offensive text (XA-N@X)
in
the Subject
line and has selected that text with the mouse.  Having
first selected the text with the mouse, the user has pressed the
button labeled Copy Selected Text.  That caused the
selected text to be automatically copied into the third text field.

Two additional steps

Two
additional steps are required to add a useful version of that text to
the
word
list used to screen the Subject
line of future messages.


The first step is to delete the “-” character that separates the A
from the N, at which time the contents of the third text field
will read
XAN@X.

The second step
is to press
the button labeled Post Text. This will cause the selected and (possibly)
edited text to be automatically added to the word list.

That is all that is required to cause the program to identify this
offensive text in the Subject lines of all future
messages.

Additional offensive material

At this point, the user may be able to identify other unique and
offensive words and phrases in the Subject line.  (The
Subject line might contain a long list of drugs, for example.)

The user can select any number of additional words and phrases from the
Subject line and add them individually to the list using
the above
procedure.

Can type, or copy and paste also

In addition, the user can add other words and phrases to the list by
either typing them into the third text field, or by selecting and
copying words and phrases from the large text area and pasting them
into the third text field.  Having done that, the user can cause
the word or phrase in the third text field to be added to the list
simply by pressing the Post Text button.

Process the next message

If the user then presses the Next button, the next message in
the history folder will be loaded into the GUI.  The current
message will
not be deleted from the history folder.

(This is
what you would normally do if you are going to use the same message
later to train the algorithm to better identify SPAM on the basis of
the body text.  I will discuss this procedure in the next lesson.)

If the user presses the Delete
Local File
button, the current file will be deleted from the
history folder and the next message in the history folder will be
loaded into the GUI.

(This is
what you would normally do if you have determined that the message is
not SPAM, or should not be used for further training of the algorithm
for some other reason.  Perhaps the message was received from a
friend whose Email address has not yet been added to the list of
friendly Email addresses discussed earlier in this lesson.  Note
that deleting the message file from the local disk does not delete the
message from the server.)

A
very simple process

As you can
see, the process of training the algorithm on the Subject line
consists simply of selecting text with the mouse and pressing buttons
to cause the selected text to be added to the word list.  This can
be accomplished very quickly with very little effort.  Except for
the possible requirement to delete extra characters, no actual typing
is required.

Be aggressive

My experience using a moderate speed computer is that the length of the
list used to screen the Subject line has very little
impact on the time required to run the main program to identify SPAM
and remove it from the server.  Therefore, you can afford to be
very aggressive when adding words and phrases to this list.

(As I will explain in the next lesson, that is not the
case for the list used to screen body text.  For reasons
that I will explain then, you will need to be somewhat concerned about
the length of that list.)

If you identify a word or a phrase in the Subject line
of a SPAM message that you believe is unique to SPAM, and is likely to
occur in future SPAM messages, go ahead and add it to the list.

Be careful of false positives

Be careful, however, of adding words or phrases to the list that will
create false positives in the future (a false positive is a
non-SPAM message that is identified as SPAM by the main program).

For example, you probably shouldn’t add the word GET to the
list.  This word is not unique, and it will be found in many
future messages, including those that are SPAM, and those that are
not SPAM.

False positives cause no permanent damage (you can always elect not
to delete
messages from the server when running the main program,
even
if the program identifies them as SPAM).

The problem with false positives is that they force you to concentrate
more carefully when making the
decision to delete or not to delete.  This can increase the time
required to process the messages on the server.

Program Code

Program Pop302d

This lesson explains the program named Pop302d,
which is used to train my screening algorithm to do a better job of
identifying future SPAM messages on the basis of the Subject
line.

Simple text files

All three word lists used by my screening algorithm are maintained in
local text files.  These files can be
created and edited with an ordinary text editor if need be.  Thus,
if one of the lists becomes corrupted, it is easy to correct the
situation using a text editor.  However, it is easier to train the
algorithm using the programs that I will explain in this and the next
lesson.

File names

The following file names are hard-coded into the various programs that
use these text files.  (You may
want to change these file names for your version of the programs. 
If you do, make certain that you change them everywhere that they are
used.)

  • Pop302a.txt – contains a word list used for screening the Subject
    line for offensive words and phrases.  This file is updated by the
    program discussed in this lesson.
  • Pop302b.txt – contains a word list used for screening the
    body
    text
    for offensive words and phrases.  This file is updated by a
    program named Pop302e that I will discuss in the next lesson.
  • Pop302c.txt – contains a list of friendly Email addresses
    and
    friendly subjects for
    screening the From and Subject lines to
    identify friendly messages.  This file is easy to keep up to date
    using a text editor.  No special program is required for updating
    this file.

Location of the text files

The programs require the three .txt files to
be in the same folder as the compiled .class files for
the programs named Pop302, Pop302d, and Pop302e.
However, you can easily modify the programs to change the location of
the .txt files if
you choose to do so.  Just be sure to change the location in all
three programs.

Local copies of the messages are stored in two different
folders.  Some of the
local copies are stored in a history folder while the remainder are
stored in an archive folder.  The locations of these folders on
the disk are hard-coded into the three programs.  You can change
the locations if you like, but be sure to make appropriate changes to
all three programs.

Testing

This program was tested using SDK 1.4.2 under WinXP.

Will discuss in fragments

I will discuss the program named Pop302d in fragments.  A
complete listing of
the program is provided in Listing 30 near the end of the lesson. 
You should be able to copy and paste that listing into your Java IDE to
compile and test the program on your system.

Purpose of the program

The purpose of this program is to process raw message text files
produced by the program named Pop302, using the information
contained in those files to update the word list stored in the file
named Pop302a.txt.

Designed for ease of use

The program is designed for extreme ease of use in order to encourage
the user to keep the word list up to date.

Beginning of the class definition

This program consists of one top-level class named Pop302d, and
six anonymous inner classes.  The top-level class definition
begins in
Listing 1.

Note that Pop302d extends Frame.  Therefore, an
object of type Pop302d is a GUI.

class Pop302d extends Frame{

  BufferedReader inputStream;
  PrintWriter outputStream;
  TextArea textArea = new TextArea(12,50);
  Button copyButton = new Button(
                          "Copy Selected Text");
  Button postButton = new Button("Post Text");
  Button deleteButton = new Button(
                           "Delete Local File");
  Button nextButton = new Button("Next");
  TextField fromField = new TextField(
               "From data will appear here",50);
  TextField subjField = new TextField(
            "Subject data will appear here",50);
  TextField outputWordField = new TextField(
            "User pastes output words here",50);
  TextField operMsgField = new TextField(
      "User instructions appear here. " +
      "Press Next to process first message.",50);
  TreeSet subjWordList;
  String[] dirList;
  int fileCounter = 0;
  //Change the following to move the message files
  // to a different location on the disk.
  File dataDir = new File("c:/MailFiles");
  String msgToUser =
      "\nPost phrases for this message.\n" +
      "Then press Next to process next message.";

Listing 1

Increase the size of the GUI

The code in Listing 1 declares numerous instance variables,
initializing many of them.

Note in particular the instantiation of objects of the classes TextArea
and TextField in Listing 1.  As mentioned earlier, this
GUI was purposely made narrow to force it to fit into this narrow
publication format.  The GUI is much more useful if it is
wider.  I recommend that you modify the parameters passed to the TextArea
and TextField constructors in Listing 1 to make them at least
100 characters wide (instead of 50).

I also recommend that you modify the parameters passed to the setSize
method later in the program (see Listing 27) to cause the Frame
object to be at least 750 pixels wide.

The main method

The main method for the program is shown in Listing 2.

  public static void main(String[] args){
    Pop302d thisObj = new Pop302d();
    thisObj.makeSubjWordList();
  }//end main

Listing 2

The main method instantiates an object of the Pop302d
class, causing
the constructor to be executed.

(Much of the interesting code in
the program resides in the constructor.)

Then the main method invokes the makeSubjWordList
method on a reference to the object.  I will discuss that method,
along with a couple of other utility methods before discussing the
constructor.

The makeSubjWordList method

The purpose of the makeSubjWordList method is to create a TreeSet
object
containing words used by the program named Pop302 to screen the Subject
lines of Email messages in an attempt to identify SPAM messages.

The makeSubjWordList method reads strings from a text file
named Pop302a.txt and creates the list as a TreeSet object
sorted in natural order with no duplicates.

The data read from the file is converted to upper case before being
added to the TreeSet object.
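
The following minimal sketch illustrates why a TreeSet is a convenient
container for this purpose.  It is not taken from Pop302d; the class
and method names are hypothetical, and it uses generics, which were
not available in the SDK version used for this lesson.  Converting
each entry to upper case before adding it means that entries differing
only in case collapse into a single entry.

```java
import java.util.TreeSet;

class WordListSketch {
    // Builds a sorted, duplicate-free word list from raw lines of text.
    static TreeSet<String> buildList(String[] rawLines) {
        TreeSet<String> list = new TreeSet<>();
        for (String line : rawLines) {
            // add() silently ignores an entry that is already present.
            list.add(line.toUpperCase());
        }
        return list;
    }

    public static void main(String[] args) {
        String[] lines = {"xan@x", "VIAGRA", "viagra", "Free Money"};
        // prints [FREE MONEY, VIAGRA, XAN@X]
        System.out.println(buildList(lines));
    }
}
```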

Also makes backup files

If that were all that the method did, it would be a relatively simple
method.  However, after creating the TreeSet object, the
method writes the
data from the object into a backup file named Pop302a.bakN,
where
N is the value of the next available file name in the
directory.

A new backup file with a unique name is created each time the program
is run.  Once the number of sequential backup files reaches 5, the
program automatically deletes the oldest file before creating a new
backup file.

Thus the program maintains a sequence of five backup files with
extensions bak0 through bak5 with one number missing.

Restoring data from a backup file

If you need to use one of the backup files to restore your primary
file, the age-order of the files should be determined by the
modification date and not by the name of the file.
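
If you would rather not inspect the modification dates by hand, a
small helper along the following lines could find the newest backup
for you.  This is a sketch only and is not part of Pop302d.  The class
and method names are hypothetical; the only thing taken from the
lesson is the Pop302a.bak file-name prefix.

```java
import java.io.File;

class NewestBackupSketch {
    // Returns the most recently modified file in 'dir' whose name starts
    // with "Pop302a.bak", or null if no such file exists.
    static File newestBackup(File dir) {
        File[] candidates = dir.listFiles(
            (d, name) -> name.startsWith("Pop302a.bak"));
        File newest = null;
        if (candidates != null) { // null means dir could not be listed
            for (File f : candidates) {
                if (newest == null
                        || f.lastModified() > newest.lastModified()) {
                    newest = f;
                }
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        File newest = newestBackup(new File("."));
        System.out.println(newest == null
            ? "No backup files found" : newest.getName());
    }
}
```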

The makeSubjWordList method begins in Listing 3. 

  void makeSubjWordList(){
    subjWordList = new TreeSet();

    try{
      BufferedReader inData
             = new BufferedReader(new FileReader(
                                  "Pop302a.txt"));
      String data; //temp holding variable

      while((data = inData.readLine()) != null){
        subjWordList.add(data.toUpperCase());
      }//end while loop

      inData.close();//Close input file

Listing 3

Read text file and populate TreeSet object

The code in Listing 3 reads the words and phrases from the text file
named Pop302a.txt (in the current directory) and
populates
the TreeSet object referred to by subjWordList with an
upper-case version of those words and phrases.

This code is straightforward and should not require detailed discussion.

When the code in Listing 3 finishes executing, the TreeSet
object is populated and ready for use.

Write a backup file

The code in Listing 4 begins the process of making a backup file before
returning to the calling method and allowing the data in the TreeSet
object to be modified.

      int N = 0;
      File theFile = null;
      for(N = 0;N < 6;N++){
        theFile = new File("Pop302a.bak" + N);
        if(!(theFile.exists()))break;
      }//end for loop

Listing 4

Determine the next allowable backup file name

The code in Listing 4 determines the next allowable backup file name by
determining which one of the following set of six backup files does not
exist in the directory:

  • Pop302a.bak0
  • Pop302a.bak1
  • Pop302a.bak2
  • Pop302a.bak3
  • Pop302a.bak4
  • Pop302a.bak5

The number associated with the first backup file found to be missing is
saved in the local variable named N.

(Initially, of course, they are all missing so a value
of 0 would be saved in N, and the first backup file would be
named Pop302a.bak0.)

Cause N to rotate from 0 through 5

When the value of N reaches 5, the backup file named Pop302a.bak0
is deleted as shown in Listing 5.  The backup file created by that
run will be named Pop302a.bak5 causing the file named Pop302a.bak0
to be the missing file during the next run.

      if(N == 5){//del file 0 for use next time
        new File("Pop302a.bak0").delete();
      }//end if

Listing 5

For N not equal to 5

For values of N other than 5, the next backup file in the
sequence will be deleted (if it exists) making that file name
available for the next run.  This is shown in Listing 6.

      else{//delete the next file in sequence
        if(new File(
               "Pop302a.bak" + (N + 1)).exists()){
          new File(
               "Pop302a.bak" + (N + 1)).delete();
        }//end if
      }//end else

Listing 6
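
Taken together, Listings 4 through 6 implement a rotation through six
file names with one name always left free for the next run.  The
following minimal sketch simulates that logic without touching the
disk, which may make the rotation easier to follow.  The class and
method names are hypothetical, and a set of integers stands in for the
backup files that exist in the directory.

```java
import java.util.Set;
import java.util.TreeSet;

class BackupRotationSketch {
    // Simulates one run of the program. 'existing' holds the numeric
    // suffixes of the backup files currently on disk. Returns the
    // suffix N chosen for the new backup file.
    static int simulateRun(Set<Integer> existing) {
        int n;
        for (n = 0; n < 6; n++) {      // same search as Listing 4
            if (!existing.contains(n)) break;
        }
        if (n == 5) {
            existing.remove(0);        // Listing 5: free up suffix 0
        } else {
            existing.remove(n + 1);    // Listing 6: free up the next suffix
        }
        existing.add(n);               // the new backup file is written
        return n;
    }

    public static void main(String[] args) {
        Set<Integer> existing = new TreeSet<>();
        for (int run = 1; run <= 8; run++) {
            int n = simulateRun(existing);
            System.out.println("Run " + run + ": wrote bak" + n
                + ", files now " + existing);
        }
    }
}
```

Starting from an empty directory, the simulated directory fills up
over the first five runs.  From then on it always holds five of the
six possible suffixes, and the missing suffix is the one that the
next run will use.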

Get output stream for the backup file

Having determined the value of N to be used in creating and
writing a backup file, the code in Listing 7 opens an output file
stream for a file using the name encapsulated in the File
object in Listing 4.

      DataOutputStream dataOut =
              new DataOutputStream(
                    new FileOutputStream(
                                  theFile));

Listing 7

Write and close the backup file

The code in Listing 8 uses an Iterator to read the contents of
the TreeSet object and to write the data into the backup file.

      Iterator iter = subjWordList.iterator();

      while(iter.hasNext()){
        data = (String)iter.next();
        dataOut.writeBytes(data + "\n");
      }//end while

      dataOut.close();
    }catch(Exception e){e.printStackTrace();}
  }//end makeSubjWordList

Listing 8

Then the code in Listing 8 closes the backup file and the method
returns.  At this point, the TreeSet object is populated
ready for use and a new backup file has been written containing the
contents of the TreeSet object.
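
Note that the writeBytes method of DataOutputStream writes only the
low-order byte of each character, which is fine for plain ASCII word
lists but not for other characters.  If you adapt the program to a
newer version of Java, a character-oriented writer is one alternative.
The following is a minimal sketch of that idea, not a replacement
taken from the lesson.  The class and method names are hypothetical,
and the choice of UTF-8 is an assumption; whatever code reads the file
back would need to use the same encoding.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.TreeSet;

class BackupWriterSketch {
    // Writes each entry of the word list on its own line. The
    // try-with-resources statement closes the file even if a write fails.
    static void writeList(TreeSet<String> words, Path target)
            throws IOException {
        try (BufferedWriter out =
                 Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String word : words) {
                out.write(word);
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        TreeSet<String> words = new TreeSet<>();
        words.add("VIAGRA");
        words.add("XAN@X");
        // A temporary file is used here so the sketch does not
        // overwrite any real backup file.
        Path target = Files.createTempFile("Pop302a", ".bak");
        writeList(words, target);
        System.out.println("Wrote " + words.size() + " entries to " + target);
    }
}
```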

Additional utility methods

The Pop302d class contains two additional utility methods,
which you can view in their entirety in Listing 30:

  • writeSubjWordList
  • removeStars

The writeSubjWordList method is completely straightforward and
shouldn’t require any discussion.  This method is used to write
the output
file containing a modified word list when the program terminates.

The removeStars method is the same as a method having the same
name that I discussed in the previous lesson entitled Enlisting
Java in the War Against SPAM, Part 2, The Screening Module.

This method is used to remove asterisks that were purposely inserted
into the message files when they were created to defeat any virus code
that might be lurking there.  I will simply refer you back to the
discussion in the previous lesson for this method.

The constructor

That brings us to the constructor for the class named Pop302d,
which begins in Listing 9.

The code in Listing 9 registers an anonymous WindowListener object
to service the close button on the Frame.

(The close button is typically the button with
the X in the upper-right corner of the Frame.)

  Pop302d(){//constructor
    this.addWindowListener(
      new WindowAdapter(){
        public void windowClosing(WindowEvent e){
          writeSubjWordList();
          System.exit(0);
        }//end windowClosing
      }//end WindowAdapter()
    );//end addWindowListener

Listing 9

The windowClosing method

This anonymous class defines the windowClosing method declared
in the WindowListener interface, by extending the WindowAdapter
class.

(If you don’t understand what I mean by this, see the
lessons on this topic at www.DickBaldwin.com.)

When the user presses the close button, the windowClosing
method is invoked.  The windowClosing method invokes the writeSubjWordList
method to write the contents of the TreeSet object into the
output file named Pop302a.txt.

Then the windowClosing method terminates the program.

(As you will see later, the contents of the TreeSet
object are also written into the output file when the user clicks the Next
button and there are no more messages to be processed.  I will
explain the rationale for that later.)

Set the layout for the GUI

The code in Listing 10 sets the layout for the GUI to FlowLayout.
This is a simple layout manager that works pretty well for this simple
GUI.

    setLayout(new FlowLayout());

Listing 10

An ActionListener on the Next button

Listing 11 shows the beginning of the code that registers an anonymous
ActionListener object on the Next button in Figure 1.

    nextButton.addActionListener(
      new ActionListener(){

Listing 11

The definition of this anonymous class, which defines the actionPerformed
method of the ActionListener interface, is rather long and
complex.  I will break it down into several fragments to discuss
it.

A String[] object containing file names

Later on, we will see code that populates an array object of type String[]
with references to String objects containing the names of all
the files in the directory that end
with the extension .txt.  That object is referred to by a
reference variable named dirList, which is declared in Listing
1
and used in Listing 12.
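
The lesson's own code for populating that array appears later.  In the
meantime, as a rough and hypothetical illustration of the idea, the
following sketch shows one common way to obtain the names of the .txt
files in a directory.  The class and method names are assumptions and
are not taken from Pop302d.

```java
import java.io.File;

class TxtFileListSketch {
    // Returns the names of all files in 'dir' that end with ".txt".
    // File.list returns null if 'dir' is not a readable directory,
    // so an empty array is substituted in that case.
    static String[] listTxtFiles(File dir) {
        String[] names = dir.list((d, name) -> name.endsWith(".txt"));
        return names == null ? new String[0] : names;
    }

    public static void main(String[] args) {
        String[] names = listTxtFiles(new File("."));
        System.out.println(names.length + " .txt file(s) found");
    }
}
```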

For now suffice it to say that this String[] object will have
been populated before the user has an opportunity to press the Next
button, which causes the code in the actionPerformed method
that
begins in Listing 12 to be executed.

        public void actionPerformed(
                                 ActionEvent e){

          if((fileCounter >= 0) &&
                 (fileCounter < dirList.length)){

Listing 12

As you may have surmised, the variable named fileCounter in
Listing 12 is
used to keep track of the number of files processed.

If not out of bounds

The actionPerformed method that starts in Listing 12 consists
of two major blocks of
code, delineated by an if/else statement.

The first block of code, which begins with the if statement in
Listing 12, is executed if the value of fileCounter is zero or greater
and is less than the number of files whose names are stored in
the String[] object.

The second block of code begins with the else clause that
appears in Listing 21, quite a bit later in this discussion.

If message is the last message

Having determined that the value of fileCounter is still within
bounds, the next step is to determine if the value of fileCounter
represents the last file in the directory.  The code to determine
this and to take
the necessary action if true is shown in Listing 13.

            if(fileCounter ==
                     (dirList.length - 1)){
              writeSubjWordList();
              msgToUser = "\n\nNo more messages."
                + "\nPost phrases for this "
                + "message.\n Then press "
                + "close to terminate.";
              //Disable the Next button so that
              // the user cannot fire any more
              // events of this type.
              nextButton.setEnabled(false);
            }//end if no more messages

Listing 13

No more message files

The body of the if statement in Listing 13 is executed when the
user clicks the Next button and the value of fileCounter
represents the last file in the directory.  Among other things, we
need to disable the Next button so that the user cannot attempt
to read and process any more message files.

Write the output file

The code in Listing 13 begins by invoking the writeSubjWordList
method to cause the current version of the word list in the TreeSet
object to be written into the output file.

Although the writeSubjWordList method will be invoked again
later
when the user presses the close button to terminate the
program, writing the file at this point ensures that all changes made
to the contents of the TreeSet object up to this point will be
reflected in the output file even if the user fails to properly
terminate the program by pressing the close button.

Construct message to the user

The code in Listing 13 constructs a message to the user that will be
displayed at the end of the message text in the large text area of
Figure 1.

This message indicates that while there are no more messages to be
processed, the user can post additional words and phrases to the list
before pressing the close button to terminate the
program. 

Terminating the program

If the user terminates the program properly by pressing the close
button, the contents of the TreeSet object will be written
into the output file again when the program terminates.  Those
contents will include any postings made while processing the last
message.

However, if the user terminates the program improperly by failing to
press the close button, only the final postings associated with the
last message will be lost.  By that time, the contents of the TreeSet
object will already have been written into the output file, which will
contain all postings made up to, but not including, the postings
associated with the last message.
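
The writeSubjWordList method itself is not shown in this part of the lesson.  The following is only a minimal sketch of the write-early, write-again idea, assuming that the word list is stored as plain text with one phrase per line.  The class name WordListIo and its two methods are hypothetical and are not taken from the program.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.TreeSet;

// Hypothetical helper, not the author's code: it assumes the word
// list is stored as plain text with one phrase per line.
class WordListIo {

  // Write every phrase in the set to the file, in sorted order.
  static void write(TreeSet<String> words, File file)
                                          throws IOException {
    try (PrintWriter out = new PrintWriter(new FileWriter(file))) {
      for (String word : words) {
        out.println(word);
      }
    }
  }

  // Read the file back into a new TreeSet, skipping blank lines.
  static TreeSet<String> read(File file) throws IOException {
    TreeSet<String> words = new TreeSet<>();
    try (BufferedReader in =
             new BufferedReader(new FileReader(file))) {
      String line;
      while ((line = in.readLine()) != null) {
        if (!line.trim().isEmpty()) {
          words.add(line);
        }
      }
    }
    return words;
  }

  public static void main(String[] args) throws IOException {
    File file = File.createTempFile("wordlist", ".txt");
    file.deleteOnExit();

    TreeSet<String> words = new TreeSet<>();
    words.add("FREE OFFER");
    write(words, file);     // early write, when the last message is shown

    words.add("ACT NOW");   // posted while viewing the last message
    write(words, file);     // final write, when the window is closed

    System.out.println(read(file));
  }
}
```

Because each write replaces the whole file, the second write simply supersedes the first.  If the second write never happens, the file still holds everything that was posted before the last message was displayed.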

Disable the Next button

Finally, the code in Listing 13 disables the Next button to
ensure that the user cannot attempt to read and process additional
message files.

Standard processing for all message files

The code beginning in Listing 14 is executed for all message files,
including the last one.

            textArea.setText("Processing " +
                   dirList[fileCounter] + "\n");

            operMsgField.setText("Paste a phrase"
              + " in the output field and press "
              + "Post. Post as many new phrases "
              + "as you want. Press next to "
              + "process next message.");

            outputWordField.setText("Paste "
                + "output phrase here and then "
                + "press Post.");

Listing 14

Messages for the user

The code in Listing 14 displays three messages for the user.  One
message is displayed at the top of the large text area in Figure
1.  The other two are displayed in the third and fourth text
fields in Figure 1.

Prepare to read the message file

The code in Listing 15 is executed in preparation for reading the
message file.

            try{
              BufferedReader inData
                     = new BufferedReader(
                       new FileReader(dataDir.
                       getAbsolutePath() + File.
                       separator + dirList[
                       fileCounter]));

              String data; //temp holding area

              inData.mark(10000);

              subjField.setText(
                      "No Subj line found yet");
              fromField.setText(
                      "No From line found yet");

Listing 15

The code in Listing 15 performs the following actions:

  • Get an input stream that can be used to read the file whose name
    is stored in the String[] object at index fileCounter.
    Note that when the file was written, asterisks were inserted at
    ten-character intervals to defeat any executable virus code that
    might be lurking there.
  • Declare a local variable named data that will be the
    temporary repository for each message line as it is read from the file.
  • Set a mark at the beginning of the file that will allow for
    rewinding the file to the beginning after some lines have been read.
  • Put text in the top two text fields of Figure 1 that will be
    displayed in the unusual event that the message doesn’t have From
    and Subject lines.
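
The mark and reset steps in the list above can be tried in isolation.  The following is a minimal sketch in which a StringReader stands in for the message file; the class name MarkResetDemo and the sample text are made up for the illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Minimal demonstration of mark() and reset() on a BufferedReader.
// A StringReader stands in for the message file.
class MarkResetDemo {

  // Reads the first line, rewinds to the mark, and reads it again.
  static String[] readFirstLineTwice(String text)
                                          throws IOException {
    try (BufferedReader in =
             new BufferedReader(new StringReader(text))) {
      // Remember the current position (the start of the stream).
      // The argument is the read-ahead limit in characters.
      in.mark(10000);
      String first = in.readLine();
      in.reset();                 // back to the marked position
      String again = in.readLine();
      return new String[]{first, again};
    }
  }

  public static void main(String[] args) throws IOException {
    String[] lines =
        readFirstLineTwice("From: a@example.com\nSubject: Hi\n");
    System.out.println(lines[0]);
    System.out.println(lines[1]);
  }
}
```

Both lines printed are the From line, because reset returns the stream to the position that was marked.  According to the documentation for BufferedReader, an attempt to reset after reading up to or beyond the read-ahead limit may fail, so the limit of 10000 characters deserves a second look if your message files are large.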

Read and display the Subject line

The code in Listing 16 reads far enough into the message to find the
Subject line and to display it in the second text field in Figure 1.

              while((data = inData.readLine())
                                      != null){

                data = removeStars(data);

                if(data.startsWith("Subject:")){
                  subjField.setText(
                            data.toUpperCase());
                  break;//No need to keep reading
                }//end if
              }//end while loop on null

Listing 16

The code in Listing 16 performs the following actions:

  • Read the next line of text from the file.  If readLine returns
    null, terminate the loop.
  • Invoke the removeStars method to remove the asterisks
    purposely inserted into the file when it was written.
  • Test to see if the line starts with "Subject:".  If
    not, loop and read the next line.  If so, convert the line to
    upper case, write it into the second text field in Figure 1, and break
    out of the loop.  In that case, there is no point in reading
    additional lines.
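
The loop described above can be expressed as a small reusable method.  This is a sketch only: the class name HeaderScan and the method findHeader are hypothetical, and the call to removeStars is left out because that method is not shown in this part of the lesson.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper: returns the first line that starts with the
// given prefix, converted to upper case, or null if none is found.
class HeaderScan {

  static String findHeader(BufferedReader in, String prefix)
                                          throws IOException {
    String data;
    while ((data = in.readLine()) != null) {
      if (data.startsWith(prefix)) {
        return data.toUpperCase();  // stop at the first match
      }
    }
    return null;  // end of input reached without a match
  }

  public static void main(String[] args) throws IOException {
    String message = "From: someone@example.com\n"
                   + "Subject: Free money\n"
                   + "\n"
                   + "Body text\n";
    BufferedReader in =
        new BufferedReader(new StringReader(message));
    System.out.println(findHeader(in, "Subject:"));
  }
}
```

Keep in mind that startsWith is case sensitive, so a header written as "subject:" would not be matched by this sketch or by the code in Listing 16.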

Get and display the From line

The code in Listing 17 rewinds the input stream from the file, and then
executes essentially the same code as Listing 16 to get and display the
From information in the first text field in Figure 1.

              inData.reset();

              while((data = inData.readLine())
                                      != null){
                data = removeStars(data);

                if(data.startsWith("From:")){
                  fromField.setText(
                            data.toUpperCase());
                  break;
                }//end if
              }//end while loop on null

Listing 17

Read and display entire message

The code in Listing 18 rewinds the input stream, and then reads and
displays the entire message, one line at a time, in the large text area
in Figure 1. 

              inData.reset();

              while((data = inData.readLine())
                                      != null){
                data = removeStars(data);
                textArea.append(data + "\n");
              }//end while loop

Listing 18

This data is displayed to help the user decide if this is a SPAM
message, and if so, what to do in terms of updating the word list used
for screening the Subject line of future messages.

Display user message and close the message file

The code in Listing 19 displays a message to the user in the large text
area at the end of the email message in Figure 1.

              textArea.append(msgToUser + "\n");
              inData.close();//Close file
            }catch(Exception ex){
              ex.printStackTrace();}

Listing 19

The message will be either the message assigned to msgToUser
in Listing 1, or the value assigned to that variable in Listing 13.

Then the code in Listing 19 closes the input file.
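
Note that the close method in Listing 19 is reached only if none of the earlier statements in the try block throws an exception.  The program was tested with SDK 1.4.2, where a try/finally block would be the way to guarantee the close.  In Java 7 and later, a try-with-resources statement does the same job.  The following sketch, with a made-up class named SafeClose, shows the pattern.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Sketch only: counts lines using try-with-resources so that the
// reader is closed even if readLine throws an exception.
class SafeClose {

  static int countLines(String text) throws IOException {
    try (BufferedReader in =
             new BufferedReader(new StringReader(text))) {
      int count = 0;
      while (in.readLine() != null) {
        count++;
      }
      return count;
    } // in.close() is called automatically here
  }

  public static void main(String[] args) throws IOException {
    System.out.println(countLines("line one\nline two\n"));
  }
}
```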

Increment the file counter

The code in Listing 20 increments the file counter so that the next
time the user presses the Next button, the program will be
ready to read and process the next message file.

            fileCounter++;
          }//end if on fileCounter in bounds

Listing 20

If fileCounter is out of bounds

Much earlier, in conjunction with the discussion of Listing 12, I told
you that the actionPerformed method consists of two major
blocks of code.  One block, which I have been discussing up to
this point, is executed when the value of fileCounter is in
bounds insofar as the number of available files is concerned.

Consider, however, the case where the Next button is pressed
when the directory is empty of files.  This case is handled by the
else clause in Listing 21.

          else{//file counter is out of bounds
            textArea.setText(
                "No more files. Press Close to "
                + "terminate.");
            nextButton.setEnabled(false);
          }//end else counter is out of bounds

        }//end actionPerformed
      }//end ActionListener
    );//end addActionListener

Listing 21

The code in the else clause in Listing 21 displays a message to
the user to the effect that there are no more files and disables the Next
button.

End of anonymous class definition for Next button

Listing 21 also signals the end of the code, begun in Listing 11,
which registers an anonymous ActionListener object on the Next
button.

Instantiate ActionListener object from anonymous class

The code that begins in Listing 22 registers a common ActionListener
on the Post Text button and also on the third text field in
Figure
1.  This makes it possible to post a new word or phrase to the
list by pressing the Post Text button, or by pressing the Enter
key when the text field has the focus.

    ActionListener postListener =
      new ActionListener(){
        public void actionPerformed(
                          ActionEvent e){
          String tempWord =
                     outputWordField.getText();
          subjWordList.add(tempWord);

          outputWordField.setText(
                         tempWord + " posted");
        }//end actionPerformed
      };//end ActionListener

Listing 22

The code in Listing 22 instantiates an object of an anonymous class
that implements the ActionListener interface, and saves that
object’s reference in the reference variable named postListener.

The actionPerformed method

The actionPerformed method defined in Listing 22 performs the
following actions:

  • Get the word or phrase from the third text field in Figure 1 and
    add it to the list in the TreeSet object.
  • Modify the text in the third text field to provide feedback to
    the user indicating that the posting is complete.
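
Two properties of the TreeSet class make it a convenient container for the word list: it ignores duplicate entries, and it delivers its contents in sorted order.  The following short sketch illustrates both.  The class name TreeSetDemo and the method named post are made up for the illustration.

```java
import java.util.TreeSet;

// Shows two properties of TreeSet that matter for the word list:
// duplicates are ignored and iteration is in sorted order.
class TreeSetDemo {

  static TreeSet<String> post(TreeSet<String> list, String phrase) {
    list.add(phrase);  // add returns false and changes nothing
                       // if the phrase is already present
    return list;
  }

  public static void main(String[] args) {
    TreeSet<String> subjWordList = new TreeSet<>();
    post(subjWordList, "LOWEST PRICES");
    post(subjWordList, "FREE OFFER");
    post(subjWordList, "LOWEST PRICES");  // duplicate, ignored
    System.out.println(subjWordList);     // [FREE OFFER, LOWEST PRICES]
  }
}
```

String comparisons in a TreeSet are case sensitive, which fits with the fact that the Subject line is converted to upper case before it is displayed.  Note also that Listing 22 adds whatever text is currently in the field, so pressing Post a second time without changing the field would add the feedback text (the phrase followed by " posted") as a new entry.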

Register ActionListener on Post Text button and output field

The code in Listing 23 registers the ActionListener object on
both the Post Text button and the third text field in Figure 1.

    postButton.addActionListener(postListener);
    outputWordField.addActionListener(
                                  postListener);

Listing 23

Register ActionListener on Copy Selected Text button

Recall from the earlier discussion of Figure 1 that the user can select
a block of text in the second text field and copy the selected text to
the third text field by pressing the Copy Selected Text button.

This is accomplished by an ActionListener object that is
registered on the Copy Selected Text button in Listing 24.

    copyButton.addActionListener(
      new ActionListener(){
        public void actionPerformed(
                          ActionEvent e){
          outputWordField.setText(
                 subjField.getSelectedText());
        }//end actionPerformed
      }//end new ActionListener
    );//end addActionListener

Listing 24

The actionPerformed method in Listing 24 invokes the getSelectedText
method on the second text field in Figure 1 and writes that block of
text into the third text field by invoking the setText method
on the third text field.

Register ActionListener on Delete Local File button

The code that begins in Listing 25 registers an ActionListener on
the Delete Local File button in Figure 1.  This ActionListener
makes it possible for the user to remove a file from the local history
directory.

While running this program, the user would typically delete files from
the history directory that
are recognized as not being SPAM.  Such files should not be used
for training the
screening algorithm.

    deleteButton.addActionListener(
      new ActionListener(){
        public void actionPerformed(
                          ActionEvent e){
          File tempFile = new File(
                dataDir.getAbsolutePath() +
                File.separator +
                dirList[fileCounter-1]
          );
          if(tempFile.exists()){
            tempFile.delete();//Delete the file
          }//end if

Listing 25

Process for deleting a file

The process for deleting a file in Java is to instantiate a File
object that represents the file to be deleted, and then to invoke the delete
method on that File object.

A File object that represents the file to be deleted

This is accomplished by the actionPerformed method defined in
Listing 25.

Recall that dataDir is a reference to a File object
that represents the directory where the message files are stored.
The argument passed to the File constructor in Listing 25 is a String
object representing the path and file name of the file to be
deleted.  It is built by concatenating the absolute path of the
directory, the file separator, and the name of the file.

(Note that when instantiating the new File
object, it is
necessary to subtract one from the value of the file counter to cause
it to reference the current file.  This is because it has already
been
incremented by the event handler for the Next button in
preparation for
processing the next file.)

Then the code in Listing 25 confirms that the file actually exists, and
if true, the file is deleted.
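
The same steps can be sketched as a stand-alone method.  The version below is hypothetical (the class DeleteDemo and the method deleteCurrent are not part of the program) and adds two checks that Listing 25 does not make: it confirms that fileCounter-1 is a valid index, which it would not be if the Delete Local File button were pressed before the Next button, and it reports the boolean value returned by the delete method.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper showing the delete steps with two extra checks:
// the index must be valid, and the result of delete() is reported.
class DeleteDemo {

  static boolean deleteCurrent(File dataDir, String[] dirList,
                               int fileCounter) {
    int current = fileCounter - 1;  // counter already points at next
    if (current < 0 || current >= dirList.length) {
      return false;                 // no message has been shown yet
    }
    File target = new File(dataDir, dirList[current]);
    // delete() returns false if the file could not be removed.
    return target.exists() && target.delete();
  }

  public static void main(String[] args) throws IOException {
    File file = File.createTempFile("msg", ".txt");
    File dir = file.getParentFile();
    String[] dirList = {file.getName()};

    System.out.println(deleteCurrent(dir, dirList, 0)); // false
    System.out.println(deleteCurrent(dir, dirList, 1)); // true
    System.out.println(file.exists());                  // false
  }
}
```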

Fire a synthetic ActionEvent on the Next button

The code in Listing 26 fires a synthetic ActionEvent on the Next
button to cause the program to process the next file in the history
directory without a requirement for the user to press the Next
button.

          Toolkit.getDefaultToolkit().
              getSystemEventQueue().
              postEvent(new ActionEvent(
                          nextButton,
                          ActionEvent.
                            ACTION_PERFORMED,
                          "Next"));
        }//end actionPerformed
      }//end ActionListener
    );//end addActionListener

Listing 26

I have discussed the firing of synthetic events in previous lessons in
this series, so I won’t discuss that topic further in this lesson.

Configure the GUI

The code in Listing 27 configures the GUI by placing the various
components in it, setting the title, setting the size, and making the
whole thing visible.

    add(copyButton);
    add(postButton);
    add(nextButton);
    add(deleteButton);
    add(fromField);
    add(subjField);
    add(outputWordField);
    add(operMsgField);
    add(textArea);
    setTitle("Copyright 2004, R.G.Baldwin");
    setSize(400,400);
    setVisible(true);

Listing 27

As mentioned earlier, you should increase the width
parameter to the setSize method (the first parameter, which has a
value of 400 in Listing 27) to at least 750 pixels.  That
will cause your GUI to be much wider and much more useful.

(Don’t forget to also change the width of the text
fields and the text area in Listing 1 to at least 100 characters
instead of only 50 characters.)

Create the directory listing for the history directory

The code in Listing 28 populates a String[] object referred to
by dirList with the names of the files in the history directory
having an extension of .txt.

The filter is an object of an anonymous class that implements the
FilenameFilter interface.

    dirList = dataDir.list(
      new FilenameFilter(){
        public boolean accept(
                    File dir,String name){
          if(!(new File(dir,name).isFile()))
            return false;
          return name.endsWith(".txt");
        }//end accept
      }//end FilenameFilter
    );//end list

Listing 28

What does Sun have to say about FilenameFilter?

The code in Listing 28 is a little cryptic.  Here is
part of what Sun has to say about the FilenameFilter interface:

“Instances of classes that implement this interface are
used to filter filenames. These instances are used to filter directory
listings in the list method of class File …”

The code in Listing 28 defines the accept method, which is
declared in the FilenameFilter
interface.

The list method of the File class

Recall that dataDir is a reference to an object of type File,
which represents the directory containing the message files.  The
code in Listing 28 invokes the list method on the File
object, passing a reference to an object of type FilenameFilter
as a parameter.

There are two overloaded versions of the list method in the File
class.  One of the overloaded versions takes no parameters, and,
according to Sun,

“Returns an array of strings naming the files and
directories in the directory…”

That is not the version of list invoked in Listing 28. 
Rather the version of the list method invoked in Listing 28

“Returns an array of strings naming the files and
directories in the directory … that satisfy the specified filter. The
behavior of this method is the same as that of the list() method,
except that the strings in the returned array must satisfy the filter.”

The filter

The characteristics of the filter referred to above are
determined by the behavior of the accept method defined in
Listing 28.  The accept method returns true if
the name of the file, which is received as an incoming parameter, ends
with .txt.  Otherwise, the accept method returns false.

In operation …

In operation, the code in the overloaded list method calls the accept
method once for each item in the directory, passing the directory and
the name of the item as parameters.  If that item is a file, and
if the name of the file ends with .txt, the accept method
returns true, and that file name is included in the String[]
object returned by the list method.  Otherwise, the accept
method returns false, and the name of that item is
not included in the returned String[] object.
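
The behavior described above can be checked with a small stand-alone program.  The sketch below builds a temporary directory, so the directory and file names are made up, as is the class name TxtFilterDemo.  It also handles the case where the list method returns null, which the documentation says happens when the File object does not denote a directory or an I/O error occurs.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;

class TxtFilterDemo {

  // Returns the sorted names of regular files ending in .txt,
  // or an empty array if the directory cannot be listed.
  static String[] listTxtFiles(File dir) {
    String[] names = dir.list(new FilenameFilter() {
      public boolean accept(File d, String name) {
        return new File(d, name).isFile()
            && name.endsWith(".txt");
      }
    });
    if (names == null) {       // not a directory, or an I/O error
      return new String[0];
    }
    Arrays.sort(names);        // list() guarantees no particular order
    return names;
  }

  public static void main(String[] args) throws IOException {
    File dir = Files.createTempDirectory("history").toFile();
    new File(dir, "b.txt").createNewFile();
    new File(dir, "a.txt").createNewFile();
    new File(dir, "notes.log").createNewFile();
    new File(dir, "folder.txt").mkdir();  // a directory, not a file

    System.out.println(Arrays.toString(listTxtFiles(dir)));
  }
}
```

Only a.txt and b.txt are reported.  The item named folder.txt is rejected because it is a directory, and notes.log is rejected because of its extension.  In the program discussed in this lesson, a null return value would cause a NullPointerException as soon as dirList.length is evaluated, so it is worth making sure that the history directory exists before running the program.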

Display a list of files

When the GUI first appears on the screen, the code in Listing 29 causes
a list of the text files in the history directory to be displayed in
the large text area in Figure 1.

    this.textArea.append("Files to be processed"
                                         + "\n");
    //Display the list of files
    for(int cnt = 0;cnt < dirList.length;cnt++){
      this.textArea.append(dirList[cnt] + "\n");
    }//end for loop

  }//end constructor
}//end class Pop302d

Listing 29

The code in Listing 29 is straightforward and shouldn’t require further
explanation.

End constructor and class

The code in Listing 29 also signals the end of the constructor as well
as the end of the class.  Once again, you will find a complete
listing of the program in Listing 30 near the end of the lesson.

Run the Program

I encourage you to copy the code from Listing 30 below, as well as
the program named Pop302 and the three
starter text files at the end of the previous lesson entitled Enlisting
Java in the War Against SPAM, Part 2, The Screening Module.

(The program named Pop302 is in Listing 34 in the previous
lesson.)

Compile and execute the
program named Pop302.  This should put some message files
in your history directory that can be used to test the program named Pop302d.

Then compile and execute the program named Pop302d from
Listing 30 below.  Experiment with it, making changes and
observing the results of your changes.  For example, I recommend
that you increase the width of the GUI to at least 750 pixels, and
increase the width of the text fields and text area to at least 100
characters.

Before executing the two programs, you will need to create three text
files
having the following names and purposes and store them in the folder
containing your compiled Java class files.

  • Pop302a.txt – contains offensive Subject line
    words and phrases
  • Pop302b.txt – contains offensive body text words
    and phrases
  • Pop302c.txt – contains friendly Email addresses and friendly Subject
    line material

Eventually you will need to populate these files with words
and phrases that work well for you.  The program named Pop302d
explained in this lesson will help you to update the file named Pop302a.txt
based on actual SPAM messages in your history directory.

In the meantime, I have provided sample text files in Listings 35,
36, and 37 in the previous lesson entitled Enlisting
Java in the War Against SPAM, Part 2, The Screening Module.

You can use these files as starter lists.  If
you receive the same kinds of SPAM that I receive, the words in these
lists should make it possible for you to test the two programs and to
get a few
hits on SPAM messages.

These are simply text files so feel free to add other words and
phrases as appropriate.

(Let me caution you not to enable the
DELE
code in the program named Pop302 until you are certain
that you actually want to delete messages from the server.  Once a
message is deleted from the server, there is no way to recover it from
the server.)

Summary

This lesson shows you how to train my SPAM
screening algorithm to do a better job of
identifying SPAM in the future based on the Subject line of SPAM
messages.

The lesson explains a program named Pop302d,
which
uses historical data to train the algorithm.

The GUI used to control the program is shown in Figure 1.  The
program
is designed for extreme ease of use.  Training the algorithm
consists
mainly of selecting text with the mouse and pressing buttons.  This can
be accomplished very quickly with very little effort.  Except for
the possible requirement to delete extra characters, no actual typing
is required.

After several weeks of training, my
algorithm is reliably identifying about ninety-five percent of all SPAM
messages, allowing me to delete them from my Email server before
downloading them into my primary Email
client.

Most of the messages that make it past the screen are messages
written in a non-English language that I am unable to read.  No one
that I know would send a message to me in a language that I can’t
read, so I consider such messages to be SPAM.  They constitute the
bulk of the five percent of the SPAM messages that make it past the
screen.

What’s Next?

Long range forecast

There are a number of spammer tricks that haven’t been addressed so
far in this series of articles.  I will be addressing many of them
in future articles.  I am also working on programs that combine
spam screening with virus protection.  I will be publishing those
programs as well.

HTML

Many SPAM messages are written in HTML.  The use of HTML
provides a myriad of opportunities for the spammer to defeat SPAM
blocking software.  For example, it is common practice for the
spammer to insert bogus HTML tags into offending words.  Because
the tags are not recognized by the HTML rendering engine, they are
simply ignored, causing the offending word to be rendered intact.

This is fairly easy to handle, and I will describe the solution in a
future article.
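
To make the trick concrete, the following is a rough sketch of one way to undo it.  It is not the solution that will be described in the future article.  It simply removes everything that looks like an HTML tag, which also removes legitimate tags, and it makes no attempt to deal with comments or with a stray < character in ordinary text.  The class name TagStripper and the sample text are made up.

```java
// A rough sketch, not the author's solution: remove everything that
// looks like an HTML tag so that split-up words are rejoined.
class TagStripper {

  static String stripTags(String html) {
    // Match "<", then any characters other than ">", then ">".
    return html.replaceAll("<[^>]*>", "");
  }

  public static void main(String[] args) {
    System.out.println(stripTags("Buy MED<xq>ICA</xq>TION now"));
    // prints: Buy MEDICATION now
  }
}
```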

Encoded Subject lines

The Email system allows for the Subject line to be
encoded in at least two different ways, making it appear to be
meaningless in its raw form.  Apparently this is intended to make
it possible to reliably send characters other than those included in
the seven-bit ASCII character set across the Internet.  Spammers
commonly use this as a way to disguise their subjects in raw text
form.  I will show you how to deal with this.
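
The two encodings in common use for header lines are the "B" (Base64) and "Q" (similar to quoted-printable) forms of the encoded words defined in RFC 2047, which look like =?charset?B?text?= or =?charset?Q?text?=.  The following is a simplified sketch of decoding them, not the code that will be presented later in this series.  The class name EncodedWordDecoder is made up, malformed input is not handled gracefully, and the RFC 2047 rule that white space between adjacent encoded words is ignored is not implemented.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch of decoding RFC 2047 "encoded words" such as
// =?UTF-8?B?SGVsbG8=?= in a Subject line. Not the author's code.
class EncodedWordDecoder {

  // =?charset?encoding?encoded-text?=
  private static final Pattern WORD = Pattern.compile(
      "=\\?([^?]+)\\?([BbQq])\\?([^?]*)\\?=");

  static String decode(String subject)
                         throws UnsupportedEncodingException {
    Matcher m = WORD.matcher(subject);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      String charset = m.group(1);
      byte[] bytes = m.group(2).equalsIgnoreCase("B")
          ? Base64.getDecoder().decode(m.group(3))
          : decodeQ(m.group(3));
      String text = new String(bytes, charset);
      m.appendReplacement(out, Matcher.quoteReplacement(text));
    }
    m.appendTail(out);
    return out.toString();
  }

  // "Q" encoding: "_" means space and "=XX" is a byte in hex.
  private static byte[] decodeQ(String s) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '_') {
        bytes.write(' ');
      } else if (c == '=' && i + 2 < s.length()) {
        bytes.write(
            Integer.parseInt(s.substring(i + 1, i + 3), 16));
        i += 2;  // skip the two hex digits just consumed
      } else {
        bytes.write(c);
      }
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(
        decode("Subject: =?UTF-8?B?RlJFRSBPRkZFUg==?="));
    System.out.println(
        decode("Subject: =?ISO-8859-1?Q?LOW_PRICE=21?="));
  }
}
```

The first call prints Subject: FREE OFFER and the second prints Subject: LOW PRICE!, which is the form in which the Subject line could then be compared against the word list.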

Encoded body text

The Email system also allows for encoding of the body of the Email
message to make it possible to send graphic images and other types of
material across the Internet.  Again, spammers find this to be a
convenient way to disguise their messages in raw text form.

Because Email messages can consist of multiple parts, some encoded
and some not encoded, this is somewhat more difficult to handle than
encoded Subject lines.  I will show you how to
deal with this.

Writing the message as HTML entities

Sometimes I encounter SPAM messages where the spammer has
represented most of the characters in the message as HTML
entities.  Sometimes this material is also encoded as described
above.  I will show you how to deal with this.
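
As an illustration of the idea, the following sketch converts numeric character references such as &#70; (decimal) and &#x46; (hexadecimal) back into ordinary characters.  It is an assumption about one reasonable approach, not the code that will be presented later, and it deliberately leaves named entities such as &amp; alone.  The class name EntityDecoder is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: replace decimal (&#70;) and hexadecimal (&#x46;) character
// references with the characters they stand for. Named entities such
// as &amp; are left untouched.
class EntityDecoder {

  private static final Pattern REF =
      Pattern.compile("&#([xX]?)([0-9a-fA-F]+);");

  static String decode(String html) {
    Matcher m = REF.matcher(html);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      int radix = m.group(1).isEmpty() ? 10 : 16;
      String replacement;
      try {
        int codePoint = Integer.parseInt(m.group(2), radix);
        replacement = new String(Character.toChars(codePoint));
      } catch (IllegalArgumentException ex) {
        replacement = m.group();  // malformed: keep original text
      }
      m.appendReplacement(
          out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(decode("&#70;&#82;&#x45;&#x45; offer"));
    // prints: FREE offer
  }
}
```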

Extraneous random material

Another big issue has to do with the inclusion of large numbers of
extraneous random words, random sentence fragments, or simply random
groups of letters in email messages.  Apparently this is done to
defeat statistical SPAM blocking programs such as the one described by
Paul Graham.  Also, such material is sometimes inserted into the
middle of offending words and phrases.

I don’t know how effective this is in defeating statistical
programs, but it
is a nuisance to a human observer like myself trying to
identify offending words and phrases within the message.  So far,
I
have identified three ways that spammers insert extraneous
material and make it almost invisible to the reader of the
message.  (I expect to identify other ways as time goes on.)

Below the </HTML> tag

The most obvious way to hide extraneous material is to add the
extraneous material
below the </HTML> tag.  Then it is simply ignored by the
HTML rendering engine.  This is relatively easy to deal with
insofar as a human observer is concerned, and would probably be easy
for a statistical processor to deal with as well.  I will show you
how I deal with this in my screening algorithm.
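
In the meantime, here is a minimal sketch of the general idea, which may differ from what the screening algorithm actually does.  It keeps the text up to and including the first closing html tag and discards whatever follows.  The class name HtmlTail is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: discard anything that follows the first closing
// html tag, since a rendering engine would not display it.
class HtmlTail {

  private static final Pattern END_TAG =
      Pattern.compile("</html\\s*>", Pattern.CASE_INSENSITIVE);

  // Returns the text up to and including the first closing tag,
  // or the whole text if there is no such tag.
  static String dropTail(String message) {
    Matcher m = END_TAG.matcher(message);
    return m.find() ? message.substring(0, m.end()) : message;
  }

  public static void main(String[] args) {
    System.out.println(
        dropTail("<html>Hi</HTML>zebra quartz lamp"));
    // prints: <html>Hi</HTML>
  }
}
```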

Color and size

A more subtle and effective way is to include the extraneous
material in the HTML body, but to cause the color of the extraneous
text to be the same as the background color.  That causes it to
become
invisible, but it still occupies space.

A third way is to include the
extraneous text in the HTML body, but to reduce its size so that it
appears as a small irregular dot.  This can be combined with
the color idea to cause the dot to become invisible.  Then it
becomes hardly noticeable to the reader of the message.

I don’t have an easy solution to the color and size issues yet, but I
am working on them.  If I develop a solution, I will explain it to
you.

Non-English messages

After a few weeks of training, my algorithm is currently
identifying about ninety-five percent of all SPAM messages.  Most
of
the messages that make it past the screen are messages written in
a non-English language that I am unable to read.

Although I don’t know anything about the intentions of the author (SPAM
or otherwise),
since I can’t read the message, I don’t want to see
it.  Also, no one that I know would send a message to me in a
language that I can’t read, so I consider such messages to be
SPAM.  They constitute the bulk of the five percent of the SPAM
messages that make it past the screen.

I have some ideas about how to deal with this based on the use of
non-ASCII characters, but I haven’t tried them yet.  If these
ideas bear fruit, I will tell you about them.
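
One simple form that such an idea could take is to measure what fraction of the characters in a message fall outside the seven-bit ASCII range.  The sketch below is purely an illustration of that measurement, not a tested screening rule.  The class name NonAsciiRatio is made up, any threshold applied to the result would have to be found by experiment, and encoded text would need to be decoded before the count means anything.

```java
// Illustration only: compute the fraction of characters in a string
// that lie outside the seven-bit ASCII range (codes above 127).
class NonAsciiRatio {

  static double ratio(String text) {
    if (text.isEmpty()) {
      return 0.0;            // avoid dividing by zero
    }
    int count = 0;
    for (int i = 0; i < text.length(); i++) {
      if (text.charAt(i) > 127) {
        count++;
      }
    }
    return (double) count / text.length();
  }

  public static void main(String[] args) {
    System.out.println(ratio("abcd"));          // 0.0
    System.out.println(ratio("ab\u00e9\u00e8")); // 0.5
  }
}
```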

Other spammer tricks

As time goes on, I expect to identify other tricks used by spammers
to defeat SPAM blocking software.  If I identify any that seem to
be significant, I will tell you about them as well.

The next lesson

The next lesson in this series will present and explain my
program named Pop302e, which provides an easy way to
train my screening algorithm to do a better job of identifying SPAM
based on
the body text of a message.

Complete Program Listing


A complete listing of the program is provided in Listing 30.

Disclaimer of responsibility:  If you elect to use this
program
you use it at your own risk.  Make absolutely certain that you
understand what you are doing before you execute the program. 
Inappropriate use could result in the loss of Email messages.  The
author of this program, Richard G. Baldwin, accepts no responsibility
for any losses that you may incur as a result of using this program.

/*File Pop302d.java Copyright 2003, R.G.Baldwin
Rev 01/11/04

The purpose of this program is to process a text
file produced by Pop302.java for the purpose of
using the information contained in that file to
update the word list stored in Pop302a.txt

Tested using SDK 1.4.2 under WinXP
************************************************/
import java.io.*;
import java.util.*;
import java.awt.*;
import java.awt.event.*;

class Pop302d extends Frame{

BufferedReader inputStream;
PrintWriter outputStream;
TextArea textArea = new TextArea(12,50);
Button copyButton = new Button(
"Copy Selected Text");
Button postButton = new Button("Post Text");
Button deleteButton = new Button(
"Delete Local File");
Button nextButton = new Button("Next");
TextField fromField = new TextField(
"From data will appear here",50);
TextField subjField = new TextField(
"Subject data will appear here",50);
TextField outputWordField = new TextField(
"User pastes output words here",50);
TextField operMsgField = new TextField(
"User instructions appear here. " +
"Press Next to process first message.",50);
TreeSet subjWordList;
String[] dirList;
int fileCounter = 0;
//Change the following to move the message files
// to a more permanent location on the disk.
File dataDir = new File("c:/MailFiles");
String msgToUser =
"\nPost phrases for this message.\n" +
"Then press Next to process next message.";

public static void main(String[] args){
Pop302d thisObj = new Pop302d();
thisObj.makeSubjWordList();
}//end main
//===========================================//

Pop302d(){//constructor
//Register a window listener to service
// the close button on the Frame. This is
// an anonymous class definition.
this.addWindowListener(
new WindowAdapter(){
public void windowClosing(WindowEvent e){
//Write the updated word list stored in
// a TreeSet object to an output file
// on shutdown. It is also written
// when you click the Next button and
// there are no remaining files to be
// processed.
writeSubjWordList();
System.exit(0);
}//end windowClosing
}//end WindowAdapter()
);//end addWindowListener

//Note, it will be necessary to temporarily
// make this Pop302d GUI narrower to produce
// the figures for publication.
setLayout(new FlowLayout());

//Register an ActionListener on the
// nextButton. This is an anonymous
// class definition.
nextButton.addActionListener(
new ActionListener(){
public void actionPerformed(
ActionEvent e){

//Protect against ArrayIndexOutOfBounds
if((fileCounter >= 0) &&
(fileCounter < dirList.length)){

if(fileCounter ==
(dirList.length - 1)){
//The user clicked the Next button
// but there are no more files.
//Write the modified word list
// stored in the TreeSet object to
// an output file. This also
// happens when the user clicks the
// close button on the Frame later,
// but this write operation is
// provided here just in case the
// user terminates without pressing
// the close button. The user can
// post additional words to the
// TreeSet object after this write
// operation occurs. That is why
// an additional write operation
// occurs when the user presses
// the close button.
writeSubjWordList();
msgToUser = "\n\nNo more messages."
+ "\nPost phrases for this "
+ "message.\n Then press "
+ "close to terminate.";
//Disable the Next button so that
// the user cannot fire any more
// events of this type.
nextButton.setEnabled(false);
}//end if no more messages

//Identify the file being processed
textArea.setText("Processing " +
dirList[fileCounter] + "\n");

//Provide instructions to the user.
operMsgField.setText("Paste a phrase"
+ " in the output field and press "
+ "Post. Post as many new phrases "
+ "as you want. Press next to "
+ "process next message.");
outputWordField.setText("Paste "
+ "output phrase here and then "
+ "press Post.");

try{
//Open the file containing a local
// copy of the message. Note that
// the message has been mangled by
// inserting asterisks in an
// attempt to protect against
// viruses.
BufferedReader inData
= new BufferedReader(
new FileReader(dataDir.
getAbsolutePath() + File.
separator + dirList[
fileCounter]));
String data; //temp holding area

//Precondition the display of
// Subject in the GUI by skipping
// header lines prior to the
// Subject line. Mark the beginning
// of the file. Set the
// readAheadLimit to 10000
// characters before the mark will
// be lost.
inData.mark(10000);
//Some messages may not contain a
// Subject or From line. Don't
// want the old one to continue to
// be visible in the GUI.
subjField.setText(
"No Subj line found yet");
fromField.setText(
"No From line found yet");
while((data = inData.readLine())
!= null){
//A null result indicates end of
// file.
//Remove the asterisks that were
// inserted into the data when
// the file was written in an
// attempt to protect against
// viruses. Append two asterisks
// to the end of each line.
data = removeStars(data);

//Trap the Subject line, convert
// it to upper case, and display
// it in a field on the GUI.
if(data.startsWith("Subject:")){
subjField.setText(
data.toUpperCase());
break;//No need to keep reading
}//end if
}//end while loop on null

//Reset back to beginning of file.
// The Subject for this message is
// now showing in the GUI.
inData.reset();

//Precondition the display of From
// line in the GUI by skipping
// header lines prior to the From
// line. Code is similar to that
// discussed above.
while((data = inData.readLine())
!= null){
data = removeStars(data);
if(data.startsWith("From:")){
fromField.setText(
data.toUpperCase());
break;
}//end if
}//end while loop on null

//Reset back to beginning of file.
// The From line for this message
// is now showing in the GUI. Read
// and display the entire file.
// This data is displayed for
// information purposes only to help
// the user decide what to do in
// terms of updating the word list
// used by Pop302 for processing
// the Subject line.
inData.reset();

//Read and display strings until
// eof is indicated by null.
while((data = inData.readLine())
!= null){
data = removeStars(data);
textArea.append(data + "\n");
}//end while loop

//Display messages to the user at
// the end of the data in the text
// area.
textArea.append(msgToUser + "\n");
inData.close();//Close file
}catch(Exception ex){
ex.printStackTrace();}

//Increment the fileCounter so that
// the next time the Next button
// fires an ActionEvent, the next
// file in the directory listing will
// be processed.
fileCounter++;

}//end if on fileCounter in bounds
else{
//File counter out of bounds. This
// happens if you delete all the
// files.
textArea.setText(
"No more files. Press Close to "
+ "terminate.");
nextButton.setEnabled(false);
}//end else counter is out of bounds
}//end actionPerformed
}//end ActionListener
);//end addActionListener

//Register an object of the following
// anonymous class on both the Post button
// and the outputWordField. That way, the
// contents of the outputWordField can be
// posted to the new word list by either
// clicking the Post button, or pressing the
// Enter key when the outputWordField has the
// focus.
ActionListener postListener =
new ActionListener(){
public void actionPerformed(
ActionEvent e){
//Get the word or phrase from the field
// and add it to the TreeSet object.
String tempWord =
outputWordField.getText();
subjWordList.add(tempWord);

//Provide feedback to confirm that it
// has been posted. This tells the
// user that she is free to post
// another word if she desires.
outputWordField.setText(
tempWord + " posted");
}//end actionPerformed
};//end ActionListener

//Register the ActionListener object on
// the two source objects.
postButton.addActionListener(postListener);
outputWordField.addActionListener(
postListener);

//Register an ActionListener on the
// copyButton to copy selected text to the
// outputWordField
copyButton.addActionListener(
new ActionListener(){
public void actionPerformed(
ActionEvent e){
outputWordField.setText(
subjField.getSelectedText());
}//end actionPerformed
}//end new ActionListener
);//end addActionListener

//Register an ActionListener on the Delete
// button to make it possible for the
// user to remove a file from the local
// directory. The user would typically
// delete files that are recognized as not
// being SPAM. Thus, they shouldn't be used
// to construct spam-blocker word lists.
deleteButton.addActionListener(
new ActionListener(){
public void actionPerformed(
ActionEvent e){
//Delete the local file currently being
// displayed in the GUI. Must subtract
// one from the value of the file
// counter to cause it to reference the
// current file because it has already
// been incremented by the event
// handler for the Next button in
// preparation for processing the next
// file.

//Create a File object that represents
// the current file.
File tempFile = new File(
dataDir.getAbsolutePath() +
File.separator +
dirList[fileCounter-1]);
if(tempFile.exists()){
tempFile.delete();//Delete the file
}//end if

//Fire a synthetic event on the Next
// button to cause the program to
// process the next file in the
// directory listing.
Toolkit.getDefaultToolkit().
getSystemEventQueue().
postEvent(new ActionEvent(
nextButton,
ActionEvent.
ACTION_PERFORMED,
"Next"));
}//end actionPerformed
}//end ActionListener
);//end addActionListener

//Configure the GUI by placing the various
// components on it.
add(copyButton);
add(postButton);
add(nextButton);
add(deleteButton);
add(fromField);
add(subjField);
add(outputWordField);
add(operMsgField);
add(textArea);
setTitle("Copyright 2004, R.G.Baldwin");
//Will need to make the GUI narrower in order
// to create the figures for publication.
setSize(400,400);
//Make the GUI visible.
setVisible(true);

//The following code creates a directory
// listing containing only those files that
// end in .txt.
//This is an anonymous implementation of a
// class that implements FilenameFilter.
dirList = dataDir.list(
new FilenameFilter(){
public boolean accept(
File dir,String name){
if(!(new File(dir,name).isFile()))
return false;
return name.endsWith(".txt");
}//end accept
}//end FilenameFilter
);//end list

//Create a message in the text area at
// startup showing the list of files in the
// directory that are available for
// processing.
this.textArea.append("Files to be processed"
+ "\n");
//Display the list of files
for(int cnt = 0;cnt < dirList.length;cnt++){
this.textArea.append(dirList[cnt] + "\n");
}//end for loop

}//end constructor
//===========================================//

//Purpose: To create a TreeSet object
// containing words used to filter the message
// subject lines in the program named
// Pop302.java.
//This method reads strings from a text file
// named Pop302a.txt and creates the list as
// a TreeSet object sorted in natural order
// with no duplicates.
//After creating the list, it writes the data
// from the list into a backup file named
// Pop302a.bakN, where N is the value of the
// next available file name in the directory.
//A new backup file with a unique name is
// created each time the program is run. Once
// the number of backup files reaches 5, the
// program automatically deletes the oldest
// file before creating a new backup
// file. Thus the program automatically
// maintains a sequence of five backup files
// with extensions .bak0 through .bak5 with one
// number missing. The age-order of the files
// should be determined by the modification date
// and not by the name of the file.
//The data read from the file is converted to
// upper case before being added to the TreeSet
// object.

void makeSubjWordList(){
subjWordList = new TreeSet();

//Read words or phrases from text file and
// populate the TreeSet object.
try{
BufferedReader inData
= new BufferedReader(new FileReader(
"Pop302a.txt"));
String data; //temp holding area

while((data = inData.readLine()) != null){
subjWordList.add(data.toUpperCase());
}//end while loop

inData.close();//Close input file

//Write a backup file before making any
// modifications to the data.

//First determine the name of the next
// backup file allowed in the directory.
int N = 0;
File theFile = null;
for(N = 0;N < 6;N++){
theFile = new File("Pop302a.bak" + N);
if(!(theFile.exists()))break;
}//end for loop

//Cause N to rotate from 0 through 5
if(N == 5){//del file 0 for use next time
new File("Pop302a.bak0").delete();
}//end if
else{//delete the next file in sequence
if(new File(
"Pop302a.bak" + (N + 1)).exists()){
new File(
"Pop302a.bak" + (N + 1)).delete();
}//end if
}//end else

//Now write the output file
DataOutputStream dataOut =
new DataOutputStream(
new FileOutputStream(
theFile));

//Use an Iterator object to access the data
// in the TreeSet object.
Iterator iter = subjWordList.iterator();

while(iter.hasNext()){
data = (String)iter.next();
dataOut.writeBytes(data + "\n");
}//end while

dataOut.close();
}catch(Exception e){e.printStackTrace();}
}//end makeSubjWordList
//===========================================//

//Purpose: To write the data from a TreeSet
// object into a file named Pop302a.txt that
// is used in the program named Pop302.java to
// filter the message subject lines.
//This method is the reverse of the method
// named makeSubjWordList.

void writeSubjWordList(){
try{
DataOutputStream dataOut =
new DataOutputStream(
new FileOutputStream(
"Pop302a.txt"));

//Use an iterator to access the data in
// the TreeSet object.
Iterator iter = subjWordList.iterator();
String data;

while(iter.hasNext()){
data = (String)iter.next();
dataOut.writeBytes(data + "\n");
}//end while

dataOut.close();
}catch(Exception e){e.printStackTrace();}
}//end writeSubjWordList
//===========================================//

//Purpose of this method is to remove the
// asterisks inserted into the data by the
// method named insertStars when the data files
// were stored on the disk. In addition to
// removing asterisks, two asterisks are
// appended to the end of each line. Note that
// all asterisks will be removed, not just
// those inserted earlier.
String removeStars(String stringIn){
StringBuffer stringBuf =
new StringBuffer(stringIn);
int index = 0;
while(index > -1){
index = stringBuf.lastIndexOf("*");
if(index > -1){
stringBuf.delete(index,index+1);
}//end if
}//end while
stringBuf.append("**");
return new String(stringBuf);
}//end removeStars()
//===========================================//

}//end class Pop302d
//============================================//

Listing 30
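
The backup rotation described in the comments for makeSubjWordList can be hard to follow inline. The following is a minimal sketch of that scheme, not the code from Listing 30. The class name BackupRotationSketch, the method writeBackup, and its dir and baseName parameters are hypothetical, and the sketch assumes a newer Java version than the listing because it uses java.nio.file.Files. It writes to the first free slot and then frees the following slot, wrapping from 5 back to 0, so that one suffix is always missing.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

// Hypothetical helper for illustration; not part of Pop302d.
class BackupRotationSketch {
  static final int SLOTS = 6; // suffixes 0 through 5

  // Writes the lines to the first free file named baseName + N in
  // dir, then deletes the file in the following slot (wrapping from
  // 5 back to 0) so that slot is free on the next run.
  static File writeBackup(File dir, String baseName,
                          List<String> lines) throws IOException {
    int n = 0;
    // Find the first missing slot. Assumption of this sketch: if
    // every slot already exists, slot 5 is reused.
    while (n < SLOTS - 1 && new File(dir, baseName + n).exists()) {
      n++;
    }
    File target = new File(dir, baseName + n);

    // Free the next slot in the rotation.
    File next = new File(dir, baseName + ((n + 1) % SLOTS));
    if (next.exists()) {
      next.delete();
    }

    // Write one word or phrase per line.
    Files.write(target.toPath(), lines, StandardCharsets.UTF_8);
    return target;
  }
}
```

Starting from an empty directory, six calls fill slots 0 through 5 and remove slot 0 along the way. The seventh call reuses slot 0 and removes slot 1, so five backup files remain on disk.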

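
The Next button handler in the listing scans the same file more than once by calling mark on the BufferedReader, reading until a header line is found, and then calling reset. Here is a small sketch of that pattern in isolation, reading from a String instead of a file so that it can be run without any data files. The names HeaderScanSketch, findHeader, and subjectAndFrom are hypothetical. One assumption differs from the listing: the read-ahead limit is sized from the message length rather than fixed at 10000, because reset may fail once more characters than the limit have been read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper for illustration; not part of Pop302d.
class HeaderScanSketch {

  // Returns the first line that starts with prefix, converted to
  // upper case, or the fallback text if no such line is found.
  static String findHeader(BufferedReader in, String prefix,
                           String fallback) throws IOException {
    String line;
    while ((line = in.readLine()) != null) {
      if (line.startsWith(prefix)) {
        return line.toUpperCase();
      }
    }
    return fallback;
  }

  // Scans the message twice from the top: once for the Subject
  // line and once for the From line.
  static String[] subjectAndFrom(String message) throws IOException {
    try (BufferedReader in =
             new BufferedReader(new StringReader(message))) {
      // Remember the start of the data. The limit is one larger
      // than the message so the mark survives a scan to the end.
      in.mark(message.length() + 1);
      String subj = findHeader(in, "Subject:",
                               "No Subj line found yet");
      in.reset(); // back to the marked position
      String from = findHeader(in, "From:",
                               "No From line found yet");
      return new String[] {subj, from};
    }
  }
}
```

Because each scan starts again from the mark, the order of the Subject and From lines in the message does not matter, and a missing header simply leaves the fallback text in place.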

Copyright 2004, Richard G. Baldwin.  Reproduction in whole or in
part in any form or medium without express written permission from
Richard Baldwin is prohibited.

About the author

Richard Baldwin
is a college professor (at Austin Community College in Austin, TX) and
private consultant whose primary focus is a combination of Java, C#,
and XML. In addition to the many platform- and/or language-independent
benefits of Java and C# applications, he believes that a combination of
Java, C#, and XML will become the primary driving force in the delivery
of structured information on the Web.

Richard has participated in numerous consulting projects, and he
frequently provides onsite training at the high-tech companies located
in and around Austin, Texas.  He is the author of Baldwin’s
Programming Tutorials, which has gained a worldwide following among
experienced and aspiring programmers. He has also published articles
in JavaPro magazine.

Richard holds an MSEE degree from Southern Methodist University
and has many years of experience in the application of computer
technology to real-world problems.

Baldwin@DickBaldwin.com

-end-
 
