Enlisting Java in the War Against SPAM: Training the Body Screener

Java Programming Notes # 2156


Preface


This is the fourth lesson in a series designed to teach you how to
write a Java program to remove SPAM messages from your Email server
before you download them into your primary Email client.  The first
lesson was entitled Enlisting Java in the War Against SPAM: The
Communications Module.

The previous lesson was entitled Enlisting Java in the War Against
SPAM: Training the Subject Line Screener.

The communications module

The first lesson explained the communications module used to
communicate with
your Email server, and to remove SPAM messages from the
server.

SPAM screening algorithm

The second lesson explained my SPAM screening algorithm.  The
program is designed to allow you to use my SPAM screening
algorithm, or to invent your own.  You can use my
algorithm as a starting point if you decide to invent your own.

My algorithm operates
separately on the Subject line, the From line,
and the body text of each Email message.  The
previous
lesson explained the program that I use to train that portion of the
algorithm that screens the message on the basis of the Subject line.

This lesson explains my program for training the
algorithm to do a better job of
identifying SPAM in the future based on the body text of
Email messages.

Future lessons will explain a large number of items contained in the
section entitled “What’s Next” in the previous lesson.

Viewing tip

You may find it useful to open another copy of this lesson in a
separate browser window.  That will make it easier for you to
scroll back and forth among the different listings and figures while
you are reading about them.

Supplementary material

I recommend that you also study the other lessons in my extensive
collection of online Java tutorials.  You will find those lessons
published at Gamelan.com.
However, as of the date of this writing, Gamelan doesn’t maintain a
consolidated index of my Java tutorial lessons, and sometimes
they are difficult to locate there.  You will find a consolidated
index at www.DickBaldwin.com.

Preview

Remove SPAM from the server

In this series of lessons, I am showing you how to write a
Java program
that supplements the SPAM screening software that you are currently
using.  The program is used to identify and remove SPAM from your
Email server before it is downloaded into your primary Email client.

Any SPAM that makes it past the SPAM screening program can be
further acted upon
by the SPAM screener that is built into your Email client.

Algorithm training programs

The screening program is
most useful
when the algorithm has been well trained, and the training of the
algorithm is kept up to date.  Although it is possible to perform
this training with a
text editor, you can be much more productive, and you are much more
likely to keep the training up to date using the programs that I have
provided.

The previous lesson explained a program named Pop302d,
which
uses historical data to train the algorithm to do a better job
of identifying SPAM in future messages based on the Subject
of the message.

This lesson explains a program named Pop302e,
which uses historical data to train the algorithm to do a
better job of identifying SPAM based on the body text of
the message.

Because the algorithm must be trained and kept up to date, and
because these programs make that easy to do, the training programs
are as important as the main screening program.

Operational sequence

Here is the typical operational sequence that I follow each
morning to remove SPAM from my Email server before downloading it into
my primary Email client, and to train the algorithm to recognize any
future SPAM messages that managed to evade the screening program on the
first pass that morning.

  1. Run the program named Pop302 to identify SPAM and to
    remove it from the
    server.  This normally allows a few SPAM messages (stragglers)
    to get
    through.  These straggler messages are stored in a history folder
    on my local disk.
  2. Run the program named Pop302d to train the algorithm to
    recognize the stragglers as SPAM
    based on
    information in the Subject line.  (It is also
    possible to recognize that a message is not SPAM and to exclude it from
    further training activities.)
  3. Run the program named Pop302e (explained in this
    lesson) to train the algorithm to recognize the stragglers as
    SPAM based on information in the body text.
  4. Go back and run the main program named Pop302 to remove
    the SPAM stragglers from the server.
  5. Run the primary Email client to download the remaining good
    messages from the Email server into my local Email inbox.

When I am in a hurry …

It isn’t necessary to perform all of these steps every
day.  On those mornings when I am in a hurry, I skip steps
2, 3, and 4, leaving the straggler messages in the local history folder
for use later.

Sometime later (perhaps the next day or several days later)
I perform steps 2 and 3 to train the algorithm to recognize future
SPAM
messages represented by the characteristics of the SPAM messages that
have
been saved in the local
history folder.

Effectiveness of my algorithm

After several weeks of training, my
algorithm is reliably identifying about ninety-five percent of all SPAM
messages, allowing me to delete them from my Email server before
downloading them into my primary Email
client.  By executing steps 2, 3, and 4 above, I am able to also
eliminate the remaining five percent of the SPAM messages before
downloading them into my primary Email client.

Most
of
the messages that make it past the initial screen are messages written
in
a non-English language that I am unable to read.  No one that I
know would send a message to me in a
language that I can’t read, so I consider such messages to be
SPAM, regardless of the original intent of the author.  Those
foreign language messages constitute the bulk of the five percent of
the SPAM
messages that make it past the screen.

Operational Discussion

Offensive words and phrases

My SPAM screening algorithm screens for SPAM on the basis of words
or
phrases in the From line, words or phrases in the Subject
line, and
words or phrases in the body text.  Offensive words
and phrases include IP addresses and URL addresses, and anything else
that you decide to include in the list of offensive words and
phrases.  At this point in time, there is nothing statistical
about my algorithm, nor does the algorithm use any concepts that might
be described as artificial intelligence.  However, that
may change in a future upgrade.

Friendly Email addresses and subjects

A list of friendly Email addresses and friendly subjects is used to
screen the From
line and the Subject line.  Messages that are from
friendly Email addresses, and messages that have known good Subject
lines are preserved on the server. They are simply ignored after
determining that
they are friendly.

The friendly list is created and maintained using a simple text
editor.  This is a relatively short list that rarely changes and
no special training program is needed to keep this list
up to date.

Different lists for Subject and body
text

Different lists of words and phrases are used for screening Subject
lines and body text for SPAM. This is important because
the same set of
words
and phrases aren’t always appropriate for use in both cases.

For example, the word ANTIVIRUS is appropriate for screening the
Subject line, but is not appropriate for screening the
body text. The
word ANTIVIRUS often appears legitimately in the header of Email messages
that have been scanned for viruses by the server, but also often
appears in the Subject line of SPAM messages.

Conversely, some of the most useful material in the list used to screen
the body text is a large set of URLs for known spammer web sites. 
However, URLs have little value for screening the Subject line,
so there is no point in wasting computer time screening Subject lines
against URLs.

To summarize …

Although there are exceptions to every rule, the most useful material
for screening Subject lines consists of plain text
words and phrases that are commonly included in the Subject lines
of SPAM messages.

Such text is of limited value when screening the body text
for reasons that I will explain later (but this is also subject to
change in a future upgrade).  The most useful material
for screening the body text is a list of known spammer
URLs, which are of very limited value when screening Subject lines.

Common spammer tricks are defeated

Several common spammer tricks are defeated by my SPAM screening
algorithm (and I will defeat other spammer tricks in future
upgrades).  Some of those tricks are described here, and some
will be described in future lessons.

Insertion of extraneous characters and case
changes

For example, the common spammer trick of inserting a small number of
extra characters
between the
characters in an offending word or phrase is defeated.  Also, the
common trick of mixing the case of the
characters in an offending word or phrase is defeated.

As a specific example, my algorithm will recommend deletion of any
message having
any of the following in its Subject line or its body
text if the word VIAGRA is included in the lists used to screen for
SPAM:

vIaGrA
V.IagRA
V.I.A.G.R.A

Therefore, when using this program to train the algorithm, if you find
a SPAM message containing V.IagRA in the body text, you
should use the program to add the word VIagRA to the
list.  (Simply edit out the extraneous period between the V
and the I.)
The algorithm will deal with the other
variations of that word having extraneous characters inserted and
having random case changes.
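To make the idea concrete, here is a minimal sketch of one way that this kind of tolerant matching could be written.  This is not the code used by the screening program.  The class name FuzzyWordMatcher, the method names, and the limit of two extra characters between letters (MAX_GAP) are all assumptions that I made up for illustration.  The sketch builds a case-insensitive regular expression that allows a small gap between consecutive characters of the listed word.

```java
import java.util.regex.Pattern;

class FuzzyWordMatcher {
    // Hypothetical limit on how many extra characters may sit
    // between two consecutive characters of the listed word.
    static final int MAX_GAP = 2;

    // Build a case-insensitive pattern such as
    // V.{0,2}?I.{0,2}?A... from a listed word such as VIAGRA.
    static Pattern toPattern(String word) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            if (i > 0) {
                sb.append(".{0,").append(MAX_GAP).append("}?");
            }
            // Quote each character so that '.', '@', etc. in the
            // listed word are matched literally.
            sb.append(Pattern.quote(String.valueOf(word.charAt(i))));
        }
        return Pattern.compile(sb.toString(),
                               Pattern.CASE_INSENSITIVE);
    }

    // True if the listed word appears anywhere in the text,
    // ignoring case and tolerating small gaps between letters.
    static boolean containsWord(String text, String word) {
        if (word.isEmpty()) {
            return false; // an empty word would match everything
        }
        return toPattern(word).matcher(text).find();
    }
}
```

With this sketch, the listed word VIAGRA matches vIaGrA, V.IagRA, and V.I.A.G.R.A, but it does not match a deliberately misspelled variant such as V.I@gRA.  That variant has to be listed separately.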

Purposely misspelled words

As currently written, my algorithm has no way of dealing with another
common spammer trick of purposely misspelling words. 

(Perhaps this is an opportunity for you to improve the
algorithm.)

Therefore, if you find a SPAM message where the body text contains the
word V.I@gRA, you should use the program to add
the word VI@gRA to the list to deal with the incorrect
spelling.  (Once again, just edit out the extraneous period.)

Duplicates are not a problem

Don’t be concerned that you may forget and add a word or phrase to the
list more than once.  The program will eliminate all duplicates,
thereby preventing the length of the list from increasing due to
duplicates.

Bogus HTML tags

One of the reasons that plain text words and phrases are not very
useful in screening body text is that spammers often
insert long bogus HTML tags in the middle of such words and
phrases.  Because the tags are not recognized as valid HTML tags,
the HTML rendering engine simply ignores them when the SPAM message is
displayed in an HTML compatible Email client.

My screening algorithm, as currently written, does not deal directly
with this situation.  In a future version, I plan to update the
algorithm to delete bogus HTML tags before screening the body
text.  I will publish that version once it becomes operational.  At
that point in time, the use of offensive words and phrases (in
addition to URLs) in the body text will become much more productive.
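As a rough illustration of what such a preprocessing step might look like, here is a minimal sketch.  It is not part of the current program, and the class name TagStripper and the method name stripTags are hypothetical.  The sketch removes everything that looks like a tag, valid or bogus, so that a word split by a tag is joined back together before screening.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Matches anything that looks like a tag: a '<', then any run
    // of characters other than '>', then the closing '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Return the body text with all tag-like sequences removed.
    static String stripTags(String bodyText) {
        return TAG.matcher(bodyText).replaceAll("");
    }
}
```

Note that this simple approach also removes legitimate tags, including any URLs that appear inside tag attributes.  Screening for URLs would therefore still need to be done on the raw text.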

The user interface

Figures 1 and 2 show the GUI through which the user controls the
training
program named Pop302e.

Figure 1 shows the GUI at startup, and Figure 2
shows the GUI containing a SPAM message from the history folder.

Figure 1 Graphical User Interface at startup

(Note that this GUI was purposely made narrow to cause
it to fit into this narrow publication format.  I
recommend that you increase the width of the Frame to at least 750
pixels, and increase the width of the TextField and TextArea objects to
at least 100 characters each.)

A URL has been identified

In Figure 2, the SPAM message is typical of many SPAM
messages being distributed at this time.  In this case, the
program has identified a URL associated with the spammer.  (See
the text in the third text field from the top, which now contains the
URL identified in the large text area.)

Figure 2 Graphical User Interface in operation

At this point, the user has
the option of
pressing the Post Word button to cause this URL to be
added to the list of offensive words and phrases.

(The user
can also edit the URL in the third text field, if needed, before
pressing the Post Word button.)

What do we know about this URL?

There is no question that
this message contains a reference
to the URL shown in Figure 2.  It isn’t difficult to establish
that this URL is associated with the message.  Just point
your browser to http://secure.bidz.com/MAIN/DEFAULT.ASP
and take a look at the web site represented by that URL.  As of
the date of this writing, it is a web site identified as
BidZ.com.

Whether or not you consider the message to be SPAM depends on whether
or not you desire to receive messages from this organization on a
frequent basis.  If you do consider it to be SPAM, press the Post
Word
button to put it on the list.  If you don’t consider it
to be SPAM, press the Delete Local File/NextMsg button to
remove it
from further consideration as SPAM, and consider putting the From
address on the friendly list mentioned earlier.

I receive a message with almost the exact same Subject
line from this spammer several times each week.

A message from the history folder


In Figure 2, a message previously stored in the history folder has been
loaded into the GUI.  The complete raw text of that message is
available for viewing in the large text area if desired.

The From
line and the Subject line are displayed in the
top two text fields in the GUI.  At this point in the operation,
the user has pressed the Select URL button, causing the
program to find the URL, to highlight it in boldface, and to copy it
into the third text field from the top.  As mentioned above, the
user has the option
of pressing the Post Word button to copy the contents of
that text field into the list of offensive words and phrases.  The
user also has the option of editing the contents of that text field
before posting if needed.
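The following minimal sketch shows one way that a URL could be located in the raw message text and extracted for display.  It is an illustration only, not the code behind the Select URL button.  The class name UrlFinder, the method names, the restriction to the http:// scheme, and the set of characters that end a URL are all assumptions.  The fromIndex parameter plays the kind of role that a saved search position would play if the user wanted to step from one URL to the next within the same message.

```java
class UrlFinder {
    private static final String SCHEME = "http://";

    // Return the index of the next "http://" (any case) at or
    // after fromIndex, or -1 if there is none.
    static int findUrlStart(String text, int fromIndex) {
        for (int i = Math.max(0, fromIndex);
             i + SCHEME.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, SCHEME, 0,
                                   SCHEME.length())) {
                return i;
            }
        }
        return -1;
    }

    // Return the URL that begins at start. The URL is assumed to
    // end at the first whitespace, quote, or angle bracket.
    static String extractUrl(String text, int start) {
        int end = start;
        while (end < text.length()) {
            char c = text.charAt(end);
            if (Character.isWhitespace(c) || c == '"'
                    || c == '\'' || c == '<' || c == '>') {
                break;
            }
            end++;
        }
        return text.substring(start, end);
    }
}
```

To step through every URL in a message, call findUrlStart again with fromIndex set to one past the start of the URL that was found previously, and stop when it returns -1.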

User instructions are displayed in the fourth text field.

Can type, or copy and paste also

In addition, the user can add other words and phrases to the list by
either typing them into the third text field, or by selecting and
copying words and phrases from the large text area and pasting them
into the third text field.  Having done that, the user can cause
the word or phrase in the third text field to be added to the list
simply by pressing the Post Word button.

Process the next message

When the user presses the Next Msg button, the next
message in
the history folder will be loaded into the GUI.  The current
message will
not be deleted from the history folder in this case.

If the user
presses the Delete
Local File/NextMsg
button, the current file will be deleted from
the
history folder and the next message in the history folder will be
loaded into the GUI.

A very simple process

As you can
see, the process of training the algorithm consists mainly of pressing
buttons to cause material to be identified and added to the word
list.  This can
be accomplished very quickly with very little effort.  Except for
the possible requirement to do an occasional edit on the contents of
the text field, no actual typing
is required.

Be careful of false positives

Be careful, however, of adding words or phrases to the list that will
create false positives in the future (a false positive is a
non-SPAM message that is identified as SPAM by the screening program).

As an extreme example, you probably shouldn’t add the words GET
or THE to the
list.  These words are not unique, and they will be found in many
future messages, including those that are SPAM, and those that are
not SPAM.

False positives cause no permanent damage (you can always elect not
to delete messages from the server when running the screening
program,
even
if the program identifies them as SPAM).

The main problem with false positives is that they force you to
concentrate
more carefully when making the
decision to delete or not to delete.  This can increase the time
required to process the messages on the server.

Program Code

Program Pop302e

This lesson explains the program named Pop302e,
which is used to train my screening algorithm to do a better job of
identifying future SPAM messages on the basis of the body text of the
message.

Simple text files

All three word lists used by my screening algorithm are maintained in
local text files.  These files can be
created and edited with an ordinary text editor if need be.  Thus,
if one of the lists becomes corrupted, it is easy to correct the
situation using a text editor.  However, it is easier to train the
algorithm using the programs that I have provided.

File names

The following file names are hard coded into the various programs that
use these text files.  (You may
want to change these file names for your version of the programs. 
If you do, make certain that you change them everywhere that they are
used.)

  • Pop302a.txt – contains a word list used for screening the Subject
    line for offensive words and phrases.  This file is updated by the
    program named Pop302d discussed in the previous lesson.
  • Pop302b.txt – contains a word list used for screening the
    body
    text
    for offensive words and phrases.  This file is updated by the
    program named Pop302e that I will discuss in this lesson.
  • Pop302c.txt – contains a list of friendly Email addresses
    and
    friendly subjects for
    screening the From and Subject lines to
    identify friendly messages.  This file rarely changes and is easy
    to keep up to date
    using a text editor.  No special program is required for updating
    this file.

Location of the text files

The programs require the three .txt files to
be located in the same folder as the compiled .class
files for
the programs named Pop302, Pop302d, and Pop302e.
However, you can easily modify the programs to change the location of
the .txt files if
you choose to do so.  Just be sure to change the location in all
three programs.

Local copies of the messages are stored in two different
folders.  Some of the
local copies are stored in a history folder while the remainder are
stored in an archive folder.  The locations of these folders on
the disk are hard coded into the three programs.  You can change
the locations if you like, but be sure to make appropriate changes to
all three programs.

Testing

This program was tested using SDK 1.4.2 under WinXP.

Will discuss in fragments

I will discuss the program named Pop302e in fragments.  A
complete listing of
the program is provided in Listing 10 near the end of the lesson. 
You should be able to copy and paste that listing into your Java IDE to
compile and test the program on your system.

Much of the code in this lesson is very similar to the program named Pop302d
that I explained in the previous lesson.  Therefore, much of the
explanation in this lesson will be very brief.  I will concentrate
mainly on those things that differentiate this program from the
previously discussed program.

Purpose of the program

The purpose of this program is to process raw message text files
produced by the program named Pop302, using the information
contained in those files to update the word list stored in the file
named Pop302b.txt.

Designed for ease of use

The program is designed for extreme ease of use in order to encourage
the user to keep the word list up to date.

Beginning of the class definition

This program consists of one top level class named Pop302e, and
seven anonymous inner classes.  The top level class definition
begins in
Listing 1.

Note that Pop302e extends Frame.  Therefore, an
object of type Pop302e is a GUI.

class Pop302e extends Frame{

  BufferedReader inputStream;
  PrintWriter outputStream;
  TextArea textArea = new TextArea(10,50);
  Button postButton = new Button("Post Word");
  Button deleteButton = new Button(
                    "Delete Local File/NextMsg");
  Button nextButton = new Button("Next Msg");
  Button selectIpButton = new Button(
                                   "Select IP");
  Button selectUrlButton = new Button(
                                  "Select URL");
  int prevUrl = 0;
  TextField fromField = new TextField(
               "From data will appear here",50);
  TextField subjField = new TextField(
            "Subject data will appear here",50);
  TextField outputWordField = new TextField(
            "User pastes output words here",50);
  TextField operMsgField = new TextField(
    "User instructions appear here. " +
    "Press Next Msg to process first message.",
    50);
  TreeSet bodyWordList;
  String[] dirList;
  int fileCounter = 0;
  //Change the following to move message files
  // to a more permanent location on the disk.
  File dataDir = new File("c:/MailFiles");
  String msgToUser =
        "\nPost phrases for this message.\n" +
        "Then press Next Msg to process " +
        "next message.";

Listing 1

Increase the size of the GUI

The code in Listing 1 declares numerous instance variables,
initializing many of them.

Note in particular the instantiation of objects of the classes TextArea
and TextField in Listing 1.  As mentioned earlier, this
GUI was purposely made narrow to force it to fit into this narrow
publication format.  The GUI is much more useful if it is
wider.  I recommend that you modify the parameters passed to the TextArea
and TextField constructors in Listing 1 to make them at least
100 characters wide (instead of 50).

I also recommend that you modify the parameters passed to the setSize
method later in the program (see Listing 9) to cause the Frame
object to be at least 750 pixels wide.

The main method

The main method for the program is shown in Listing 2.

  public static void main(String[] args){
    Pop302e thisObj = new Pop302e();
    thisObj.makeBodyWordList();
  }//end main

Listing 2

The main method instantiates an object of the Pop302e
class, causing
the constructor to be executed.

(Much of the interesting code in
the program resides in the constructor.)

Then the main method invokes the makeBodyWordList
method on a reference to the object.  I will briefly discuss that
method,
along with a couple of other utility methods before discussing the
constructor.

The makeBodyWordList method

The purpose of the makeBodyWordList method is to create a TreeSet
object
containing words used by the program named Pop302 to screen body
text of Email messages in an attempt to identify SPAM messages.

The makeBodyWordList method reads strings from a text file
named Pop302b.txt and creates the list as a TreeSet object
sorted in natural order with no duplicates.

The data read from the file is converted to upper case before being
added to the TreeSet object.
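The core of that behavior can be sketched in a few lines.  This is a simplified illustration, not the method from Listing 10.  The class name WordListLoader and the method name load are hypothetical, and the decision to skip empty lines is my assumption.  A TreeSet keeps its elements in natural order and silently ignores attempts to add a duplicate, which is what makes the list sorted and free of duplicates.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.TreeSet;

class WordListLoader {
    // Read one word or phrase per line, convert each to upper
    // case, and collect them in a sorted, duplicate-free set.
    static TreeSet<String> load(Reader source) {
        TreeSet<String> words = new TreeSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) { // assumption: skip blanks
                    words.add(line.toUpperCase(Locale.ROOT));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

To load the real list, you could pass in a FileReader opened on the file named Pop302b.txt.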

Also makes backup files

After creating the TreeSet object, the
method writes the
data from the object into a backup file named Pop302b.bakN,
where
N is the value of the next available file name in the
directory.

A new backup file with a unique name is created each time the program
is run.  Once the number of sequential backup files reaches 5, the
program automatically deletes the oldest file before creating a new
backup file.

Thus the program maintains a sequence of five backup files with
extensions bak0 through bak5 with one number missing.
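The arithmetic behind this rotation can be sketched as follows.  This is my own reconstruction of the behavior described above, not the code from Listing 10.  The class name BackupSlots and both method names are hypothetical.  The sketch treats the six possible suffixes as a ring: the next backup goes into the first empty slot that follows an occupied one, and the slot immediately after it is then cleared so that one gap always remains.

```java
class BackupSlots {
    // Suffixes bak0 through bak5 give six slots, of which at
    // most five are occupied at any time.
    static final int SLOT_COUNT = 6;

    // exists[i] is true when the file Pop302b.bak<i> is present.
    // Return the slot number to use for the next backup file.
    static int nextSlot(boolean[] exists) {
        for (int i = 0; i < SLOT_COUNT; i++) {
            int prev = (i + SLOT_COUNT - 1) % SLOT_COUNT;
            if (!exists[i] && exists[prev]) {
                return i; // first gap that follows a backup
            }
        }
        return 0; // no backups exist yet
    }

    // Return the slot holding the oldest backup, which must be
    // cleared so that one slot in the ring stays empty.
    static int slotToFree(int slotJustUsed) {
        return (slotJustUsed + 1) % SLOT_COUNT;
    }
}
```

For example, when files bak0 through bak4 exist, the next backup goes into bak5 and bak0 is deleted, leaving bak1 through bak5.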

You can view the makeBodyWordList method in its entirety in
Listing 10 near the end of the lesson.

The makeBodyWordList method is very similar to the makeSubjWordList
method that I discussed in the previous lesson.  Therefore, I
won’t discuss it further in this lesson.

Additional utility methods

The Pop302e class contains two additional utility methods,
which you can view in their entirety in Listing 10:

  • writeBodyWordList
  • removeStars

The writeBodyWordList method is completely straightforward and
shouldn’t require any discussion.  This method is used to write
the output
file containing a modified word list when the program terminates.

The removeStars method is the same as a method having the same
name that I discussed in an earlier lesson entitled Enlisting
Java in the War Against SPAM: The Screening Module.

This method is used to remove asterisks that were purposely inserted
into the message files when they were created to defeat any virus code
that might be lurking there.  I will simply refer you back to the
discussion in the earlier lesson for this method.

The constructor

That brings us to the constructor for the class named Pop302e,
which begins in Listing 3.

The code in Listing 3 registers an anonymous WindowListener object
to service the close button on the Frame.

  Pop302e(){//constructor
    this.addWindowListener(
      new WindowAdapter(){
        public void windowClosing(WindowEvent e){
          writeBodyWordList();
          System.exit(0);
        }//end windowClosing
      }//end WindowAdapter()
    );//end addWindowListener

    setLayout(new FlowLayout());

Listing 3

The windowClosing method

The windowClosing method in Listing 3 invokes the writeBodyWordList
method to write the current and updated contents of the TreeSet
object into the
output file named Pop302b.txt.

Then the windowClosing method terminates the program.

Set the layout for the GUI

The code in Listing 3 also sets the layout for the GUI to FlowLayout.
This is a simple layout manager that works pretty well for this simple
GUI.

An ActionListener on the Next Msg button

The code in Listing 4 registers an anonymous ActionListener
object on the Next Msg button in Figure 1.

    //Register an ActionListener on the
    // nextButton. This is an anonymous
    // class definition.
    nextButton.addActionListener(
      new ActionListener(){
        public void actionPerformed(
                                ActionEvent e){
          //Do this to assist the action listener
          // on the selectUrlButton
          prevUrl = 0;

          //Protect against ArrayIndexOutOfBounds
          if((fileCounter >= 0) &&
               (fileCounter < dirList.length)){

            //The user clicked the Next button
            // and this is the last file in the
            // directory listing.
            if(fileCounter ==
                       (dirList.length - 1)){

              //Write the modified word list
              // stored in the TreeSet object to
              // an output file. This also
              // happens when the user clicks the
              // close button on the Frame.
              writeBodyWordList();
              msgToUser = "\n\nNo more messages."
                   + "\nPost phrases for this "
                   + "message.\n Then press "
                   + "close to terminate.";
              //Disable the Next button so that
              // the user cannot fire any more
              // events of this type.
              nextButton.setEnabled(false);
            }//end if

            //Identify the file being processed
            textArea.setText("Processing " +
                   dirList[fileCounter] + "\n");

            //Provide instructions to the user.
            operMsgField.setText("Paste a phrase"
              + " in the output field and press "
              + "Post. Post as many new phrases "
              + "as you want. Press next to "
              + "process next message.");
            outputWordField.setText("Paste "
                + "output phrase here and then "
                + "press Post.");

            try{
              //Open the file containing a local
              // copy of the message. Note that
              // the message has been mangled by
              // inserting asterisks in an
              // attempt to protect against
              // viruses.
              BufferedReader inData
                  = new BufferedReader(
                    new FileReader(dataDir.
                      getAbsolutePath() + File.
                        separator + dirList[
                                  fileCounter]));
              String data; //temp holding area

              //Precondition the display of
              // Subject in the GUI by skipping
              // header lines prior to the
              // Subject line. Mark the beginning
              // of the file. Set the
              // readAheadLimit to 10000
              // characters before the mark will
              // be lost.
              inData.mark(10000);
              //Some messages may not contain a
              // Subject or From line. Don't
              // want the old one to continue to
              // be visible in the GUI.
              subjField.setText(
                       "No Subj line found yet");
              fromField.setText(
                       "No From line found yet");
              while((data = inData.readLine())
                                       != null){
                //A null result indicates end of
                // file.

                //Remove the asterisks that were
                // inserted into the data when
                // the file was written in an
                // attempt to protect against
                // viruses. Append two asterisks
                // to the end of each line.
                data = removeStars(data);

                //Trap the Subject line, convert
                // it to upper case, and display
                // it in a field on the GUI.
                if(data.startsWith("Subject:")){
                  subjField.setText(
                             data.toUpperCase());
                  break;//No need to keep reading
                }//end if
              }//end while loop on null

              //Reset back to beginning of file.
              // The Subject for this message is
              // now showing in the GUI.
              inData.reset();

              //Precondition the display of From
              // line in the GUI by skipping
              // header lines prior to the From
              // line. Code is similar to that
              // discussed above.
              while((data = inData.readLine())
                                       != null){
                data = removeStars(data);
                if(data.startsWith("From:")){
                  fromField.setText(
                             data.toUpperCase());
                  break;
                }//end if
              }//end while loop on null

              //Reset back to beginning of file.
              // The From line for this message
              // is now showing in the GUI. Read
              // and display the entire file.
              // This is the data that the user
              // will analyze to determine the
              // new words or phrases that will
              // be added to the word list.
              inData.reset();

              //Read and display strings until
              // eof is indicated by null.
              while((data = inData.readLine())
                                       != null){
                data = removeStars(data);
                textArea.append(data + "\n");
              }//end while loop

              //Display messages to the user at
              // the end of the data in the text
              // area.
              textArea.append(msgToUser + "\n");
              inData.close();//Close file
            }catch(Exception ex){
              ex.printStackTrace();}

            //Increment the fileCounter so that
            // the next time the Next button
            // fires an ActionEvent, the next
            // file in the directory listing will
            // be processed.
            fileCounter++;

          }//end if on fileCounter in bounds
          else{
            //File counter out of bounds. This
            // happens if you delete all the
            // files.
            textArea.setText(
                "No more files. Press Close to "
                + "terminate.");
          }//end else
        }//end actionPerformed
      }//end ActionListener
    );//end addActionListener

Listing 4

While the definition of this anonymous class, which defines the actionPerformed
method of the ActionListener interface, is rather long and
complex, it is almost identical to code that I discussed in detail in
the previous lesson in this series.  Therefore, I won’t bore you
by repeating that discussion here.  Rather, I will simply refer
you back to the previous lesson, and let the comments in Listing 4
speak for themselves.
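For those who don't have the previous lesson at hand, the mark and reset pattern used in Listing 4 can be sketched in isolation as follows.  This is a simplified illustration, not an excerpt from the program.  The class name HeaderScanner and the method name findHeaderLine are hypothetical, and the sketch omits the call to removeStars.  The reader is marked, scanned for the first line that starts with a given prefix, and then reset so that the next scan starts again from the marked position.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

class HeaderScanner {
    // Return the first line that starts with prefix, or null if
    // there is none. The reader is returned to the position that
    // it had on entry, provided that no more than readAheadLimit
    // characters were read during the scan.
    static String findHeaderLine(BufferedReader in,
                                 String prefix,
                                 int readAheadLimit) {
        try {
            in.mark(readAheadLimit);
            String found = null;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(prefix)) {
                    found = line;
                    break; // no need to keep reading
                }
            }
            in.reset(); // back to the marked position
            return found;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Be aware that the reset call can fail if the scan reads more characters than the read-ahead limit allows.  That can happen with a long message that lacks the line being sought.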

Set prevUrl to zero

I will point out one statement in Listing 4 that is different from the
code in the previous lesson.  The first statement in the
actionPerformed method in Listing 4 sets the value of a variable
named prevUrl to zero.  This will become important later when I
discuss the action listener that I will register on the Select URL
button in Figure 1.

Instantiate ActionListener object from
anonymous class

The code in Listing 5 registers a common ActionListener
on the Post Word button and also on the third text field in
Figure
1.  This makes it possible to post a new word or phrase to the
list by pressing the Post Word button, or by pressing the Enter
key when the text field has the focus.

Note that the code in Listing 5 defines an anonymous class, but it does
not use an anonymous object.

    ActionListener postListener =
      new ActionListener(){
        public void actionPerformed(
                                ActionEvent e){
          //Get the word or phrase from the field
          // and add it to the TreeSet object.
          String tempWord =
                      outputWordField.getText();
          bodyWordList.add(tempWord);

          //Provide feedback to confirm that it
          // has been posted. This tells the
          // user that she is free to post
          // another word if she desires.
          outputWordField.setText(
                          tempWord + " posted");
        }//end actionPerformed
      };//end ActionListener

    //Register the ActionListener object on
    // the two source objects.
    postButton.addActionListener(postListener);
    outputWordField.addActionListener(
                                  postListener);

Listing 5

Once again, the code in Listing 5 is almost identical to code that I
discussed in detail in the previous lesson, so I won’t discuss it
further here.

The Select IP button

Now we are getting into new territory that wasn’t covered in the
previous lesson.  Figure 3 shows the GUI after the user has
loaded a message file from the history folder and has pressed the Select
IP
button. 
The program has found the
IP address of the computer that (presumably) originated the
message, and has copied that IP address into the third text
field.  At this point, the user has the option of pressing the Post
Word
button to cause the IP address to be added to the list of
offending words and phrases.

The user also has the option of editing the IP address before pressing
the Post Word button.  For example, removing the rightmost
three digits would cause the truncated IP address to refer to a group
of machines rather than a single machine.
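
To make the effect of truncating the address concrete, here is a minimal sketch. It assumes that a posted phrase is later compared against the message text with a plain substring test; the class and method names are hypothetical and are not taken from the program in this series. It is written for a modern JDK rather than SDK 1.4.2.

```java
final class IpPrefixMatchSketch {

    // Assumption: screening is a simple "does the message text
    // contain this phrase" test.
    static boolean containsPhrase(String messageText, String phrase) {
        return messageText.indexOf(phrase) != -1;
    }

    public static void main(String[] args) {
        String header = "Received: from D3093X31 "
            + "(m198214191169.austin.cc.tx.us [198.214.191.169])";

        // The full address matches only that one machine.
        System.out.println(
            containsPhrase(header, "[198.214.191.169]"));
        // The truncated address matches every machine whose
        // address begins with 198.214.191.
        System.out.println(
            containsPhrase(header, "[198.214.191."));
        // A different group of machines does not match.
        System.out.println(
            containsPhrase(header, "[198.214.190."));
    }
}
```

Keeping the left square bracket and the trailing period in the truncated phrase is a design choice in this sketch. Without them, a phrase such as 198.214.19 would also match unrelated addresses such as 198.214.192.7.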

(This narrow publication format makes it a little
difficult to see, but the beginning of the IP address is highlighted on
the right side of the large text area.  This illustrates why you
should increase the width of the GUI before using the program.)

Figure 3 GUI with IP address selected.

How reliable is the originating IP address
information?

I don’t have a good answer to that question.  Obviously, there are
programmers who can write programs to insert a fake IP address into
each
packet before the packets are transmitted.  (Otherwise, it
would
not be possible to write programs that insert valid IP addresses into
packets before they are transmitted.)

I am led to believe, however, that it is much more difficult to fake
the originating IP address than it is to fake other parts of the header
such as the information in the Return-Path and From
lines of the header.  While the originating IP address may not be
totally reliable as a means of identifying the source of the message,
it is probably the most reliable item in the header.

The specifics

The message header for a POP3 message may contain several lines that
start with the text "Received:".  The one farthest from the
beginning of the message contains the IP address of the computer that
originated the message (possibly faked).  The ones closer
to the beginning of the
message show the IP addresses of machines along the way that relayed
the message from the sender to the receiver.
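
The ordering described above can be illustrated with a short sketch that lists the first bracketed IP address following each Received: entry, from the top of the header down. The class name, method name, and regular expression are assumptions made for illustration; the program in this lesson uses simple index searches instead. The sketch targets a modern JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReceivedHopsSketch {

    // A dotted IPv4 address enclosed in square brackets.
    private static final Pattern BRACKETED_IPV4 =
        Pattern.compile("\\[(\\d{1,3}(?:\\.\\d{1,3}){3})\\]");

    // Returns one address per "Received:" entry, in the order
    // the entries appear. The last element is the (possibly
    // faked) originating address.
    static List<String> hopAddresses(String header) {
        List<String> hops = new ArrayList<>();
        int start = header.indexOf("Received:");
        while (start != -1) {
            int next = header.indexOf("Received:", start + 1);
            // The segment runs up to the next "Received:" or,
            // for the last entry, to the end of the text.
            String segment = (next == -1)
                ? header.substring(start)
                : header.substring(start, next);
            Matcher m = BRACKETED_IPV4.matcher(segment);
            if (m.find()) {
                hops.add(m.group(1));
            }
            start = next;
        }
        return hops;
    }
}
```

Note that the last segment extends to the end of whatever text is passed in, so a bracketed address in the message body could be picked up if the final Received: entry has none. Passing only the header portion of the message avoids that.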

A typical message header

The header for a typical Email message is shown in Figure 4 (with
line breaks manually inserted to force it to fit in this
narrow publication format).

Note that this message was
not SPAM.
  Rather, this was a valid Email message that I
received from a colleague at the college where I teach.

...
Received: from mailhub1.austin.cc.tx.us
(root@mailhub1.austin.cc.tx.us [206.77.151.36])
by omnistarhost.com (8.11.6/8.11.6)
with ESMTP id i0QFxtc12960
for <Baldwin@DickBaldwin.com>;
Mon, 26 Jan 2004 09:59:55 -0600
Received: from monk.austincc.edu
(root@monk.austin.cc.tx.us [198.213.3.10])
by mailhub1.austin.cc.tx.us
(8.12.3/8.12.3/Debian-6.4) with ESMTP id
i0QFtIxS028428;
Mon, 26 Jan 2004 09:55:18 -0600
Received: from D3093X31 (m198214191169.
austin.cc.tx.us [198.214.191.169])
by monk.austincc.edu
(8.12.3/8.12.3/Debian -4) with SMTP id
i0QFt8De001872;

Mon, 26 Jan 2004 09:55:08 -0600
Message-ID: <000a01c3e425$60bc1e50$a9bfd6c6@
D3093X31>

Figure 4

The originating IP address

The “Received:” line containing the originating IP address is the
last one in Figure 4, the line that begins with Received: from D3093X31.

The originating IP address, 198.214.191.169, is
contained in the first pair of square brackets in that line.

(Had I not manually inserted line breaks, all of the
material in that Received: entry would appear on a single line.)

Who originated the message?

Figure 5 contains information about IP address 198.214.191.169
obtained from the ARIN WHOIS database at http://www.arin.net/whois/.

Search results for: 198.214.191.169  

Univ. of Texas System Office of Telecom.
Services NETBLK-THENET-CIDR-C2
(NET-198-214-0-0-1)
198.214.0.0 - 198.214.255.255
Austin Community College AUS-COM-COL2
(NET-198-214-176-0-1)
198.214.176.0 - 198.214.191.255

# ARIN WHOIS database, last updated
2004-01-26 19:15

Figure 5

The originating organization

You can click on the links in Figure 5 to learn more details about the
organization to which this IP address is assigned.  For example,
Figure
6 shows the information provided by one of the links in Figure 5.

OrgName:    Austin Community College
OrgID: ACC-9
Address: Riverside Campus
Address: 1020 Grove Blvd.
City: Austin
StateProv: TX
PostalCode: 78741
Country: US

NetRange: 198.214.176.0 - 198.214.191.255
CIDR: 198.214.176.0/20
NetName: AUS-COM-COL2
NetHandle: NET-198-214-176-0-1
Parent: NET-198-214-0-0-1
NetType: Reassigned
Comment:
RegDate: 1995-02-23
Updated: 1995-02-23

TechHandle: GK87-ARIN
TechName: --Name removed by Baldwin--
TechPhone: --Phone number removed by Baldwin--
TechEmail: --Email address removed by Baldwin--

Figure 6

I can attest to the fact that the organization identified in Figure 6
is correct.  I can also state that the contact name that I removed
from Figure 6 is a person having significant responsibilities with
regard to networking at the college.
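
The CIDR entry in Figure 6 can be checked against the originating IP address with a little arithmetic. The following is a minimal sketch, with hypothetical class and method names, that tests whether a dotted IPv4 address falls inside a block written in CIDR notation. It handles IPv4 only and targets a modern JDK.

```java
final class CidrSketch {

    // Convert a dotted IPv4 address to a 32-bit value held in
    // a long so that no sign problems arise.
    static long toLong(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException(
                "Not a dotted IPv4 address: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException(
                    "Octet out of range: " + part);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // True if ip lies inside a block such as 198.214.176.0/20.
    static boolean inBlock(String ip, String cidr) {
        String[] parts = cidr.split("/");
        int prefixLength = Integer.parseInt(parts[1]);
        // Keep the leftmost prefixLength bits of the address.
        long mask = (prefixLength == 0)
            ? 0L
            : (0xFFFFFFFFL << (32 - prefixLength)) & 0xFFFFFFFFL;
        return (toLong(ip) & mask) == (toLong(parts[0]) & mask);
    }

    public static void main(String[] args) {
        // prints true
        System.out.println(
            inBlock("198.214.191.169", "198.214.176.0/20"));
        // prints false
        System.out.println(
            inBlock("198.214.192.1", "198.214.176.0/20"));
    }
}
```

A /20 block keeps the leftmost 20 bits fixed, which corresponds to the NetRange of 198.214.176.0 through 198.214.191.255 shown in Figure 6.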

Why am I telling you this?

The jury is still out regarding the usefulness of the originating IP
address with regard to identifying the source of SPAM messages.
The IP addresses of
spammers seem to change frequently for a variety of reasons.

I
have accumulated several thousand recent SPAM messages.  As soon
as
I can find the time, I plan to do some statistical work to identify
those
originating IP addresses that have been used in a large percentage of
those messages.  I will add those IP addresses to my list of
offensive words and phrases.
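
One possible shape for that kind of statistical work is sketched below. It assumes that the originating IP address of each saved message has already been extracted into a list; the class name, method names, and the threshold are hypothetical, and the sample addresses used in the comments are reserved documentation addresses, not real data. The sketch targets a modern JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IpFrequencySketch {

    // Count how many messages each originating address
    // appears in.
    static Map<String, Integer> countAddresses(
            List<String> originatingIps) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String ip : originatingIps) {
            counts.merge(ip, 1, Integer::sum);
        }
        return counts;
    }

    // Return the addresses that account for at least
    // minFraction of all messages, for example 0.05 for
    // five percent.
    static List<String> frequentAddresses(
            List<String> originatingIps, double minFraction) {
        List<String> frequent = new ArrayList<>();
        if (originatingIps.isEmpty()) {
            return frequent;
        }
        Map<String, Integer> counts =
            countAddresses(originatingIps);
        for (Map.Entry<String, Integer> entry
                : counts.entrySet()) {
            double fraction = (double) entry.getValue()
                / originatingIps.size();
            if (fraction >= minFraction) {
                frequent.add(entry.getKey());
            }
        }
        return frequent;
    }
}
```

For example, given the four addresses 192.0.2.1, 192.0.2.1, 203.0.113.5, and 192.0.2.1 and a threshold of 0.5, only 192.0.2.1 would be returned.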

Let’s all complain in unison

However, the IP address is very useful for another purpose —
complaining.

Worms send SPAM and viruses

It probably wouldn’t do much good to complain to an originator that
distributes SPAM for profit.  However, some SPAM
and most viruses are actually sent by malicious code embedded in the
computers of unsuspecting people.  Sometimes complaining to the
technical contact for the originating IP address of a SPAM or virus
message will result in the malicious code being removed from the
computer. 
That will eliminate the SPAM and virus messages being transmitted by
that computer.

Every little bit helps

Removing such code from one computer wouldn’t have much impact on the
overall problem, but removing such code from many computers could have
a significant impact on the problem.  If nothing else, it would
make it easier to identify the real culprits in the SPAM and virus
problem.

For example, a good friend of mine was recently notified by her cable
modem ISP that a computer connected to the cable modem had been
transmitting SPAM or
viruses.  Apparently the ISP had received a complaint that
included the date, time, and originating IP address in the message
header.  The ISP was able to use this information to determine
that the IP address had been assigned to my friend’s cable modem at the
time that the message was transmitted.  My friend received
technical advice from the ISP to help in cleaning up
the computer and eliminating the problem.

At this point in time, the main reason that I have provided the ability
in
Figure 3 to easily extract the originating IP address is in the hope
that lots of users of the program will use that information to complain
to the organization to which the IP address is assigned. 
Remember, given the originating IP address, you can obtain information
about the organization to which that IP address is assigned at http://www.arin.net/whois/.

Automating the complaint process

One of the upgrades that I am considering for the future is to add a
module to the program that makes it possible to press a button while
viewing a raw SPAM message
and to automatically send a complaint via Email to the TechName
at
the TechEmail address illustrated in Figure 6 for each IP
address.  If I do
that, I might reasonably expect lots of users to take advantage of the
capability and to complain about SPAM and virus messages that
they receive.

Enough talk, let’s see some code

Listing 6 shows the code that registers an action listener
on the Select IP button shown in Figure 3.  This listener
finds and selects the IP address of the originator of the
message in the text string encapsulated in the TextArea object.

(Programmatically selecting text in a TextArea object is
similar to
manually selecting text with a mouse.)

Write the originating IP address into the text
field

Having identified and selected the text comprising the originating IP
address, the code in the action listener writes the selected characters
into the third text field in Figure 3.  At that point, the user
has the option of adding the IP address to the list of offensive words
and phrases by pressing the Post Word button.

(Perhaps equally important, the user has the option of
writing the information down and using it later to complain to the
organization to which the IP address is assigned.)

Where is that IP address again?

As mentioned earlier, the originating IP address is the first IP
address in square brackets that occurs in the last line in the message
that starts with the text Received:

    selectIpButton.addActionListener(
      new ActionListener(){
        public void actionPerformed(
            ActionEvent e){
          //Create a String replica of the
          // contents of the textArea.
          String theString =
              textArea.getText();
          //Get the index of the line
          // containing the originating
          // IP address.
          int receivedIndex =
              theString.lastIndexOf("Received:");
          //Scroll to make that line visible
          // in the text area.
          textArea.select(
              receivedIndex,receivedIndex+9);
          //Get the indices of the square
          // brackets containing the IP address
          int leftIpIndex = theString.indexOf(
              '[',receivedIndex);
          if(leftIpIndex != -1){
            int rightIpIndex =
                theString.indexOf(
                    ']',leftIpIndex);
            //Select the IP address and write
            // it into the outputWordField.
            textArea.select(leftIpIndex+1,
                rightIpIndex);
            outputWordField.setText(
                textArea.getSelectedText());
          }//end if not -1
          else{//leftIpIndex == -1
            outputWordField.setText(
                "No [...] on this Received line");
          }//end else
        }//end actionPerformed
      }//end ActionListener
    );//end addActionListener

Listing 6

With the combination of the operational description given above, and
the comments in the code, the code in Listing 6 should be
self-explanatory and should not require further discussion.
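
Listing 6 tests explicitly only for a missing left square bracket. As a hedged illustration, the same search can be written as a small method with no GUI dependencies that also reports a missing Received: line or a missing right bracket. The class and method names here are hypothetical and are not part of the program in this lesson.

```java
final class OriginatingIpSketch {

    // Returns the text inside the first pair of square
    // brackets following the last "Received:" in the message,
    // or null if any of the three markers cannot be found.
    static String originatingIp(String messageText) {
        int receivedIndex =
            messageText.lastIndexOf("Received:");
        if (receivedIndex == -1) {
            return null; // no Received: line at all
        }
        int left = messageText.indexOf('[', receivedIndex);
        if (left == -1) {
            return null; // no left bracket
        }
        int right = messageText.indexOf(']', left);
        if (right == -1) {
            return null; // unterminated bracket
        }
        return messageText.substring(left + 1, right);
    }
}
```

An action listener could call such a method and display either the returned address or an explanatory message in the third text field.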

The Select URL button

Each time the user presses the Select URL button in Figure 2,
the program finds the next URL in the text area, selects it, and copies
it into the third text field.  The HTTP:// text is
stripped from the beginning of the URL to reduce processing time later
when the URL is used to screen a message.

(A beneficial upgrade would be to also strip WWW from
the beginning of the URL when it occurs, because it doesn’t contribute
much in the way of identifying information for the URL.)
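
A minimal sketch of that upgrade follows. It assumes the server name has already been converted to upper case and had HTTP:// removed, as Listing 7 does; the class and method names are hypothetical.

```java
final class HostTrimSketch {

    // Remove a leading "WWW." from an upper-case server name.
    // Names that merely begin with the letters WWW, without
    // the following period, are left unchanged.
    static String stripWww(String host) {
        return host.startsWith("WWW.")
            ? host.substring(4)
            : host;
    }
}
```

Testing for the period as well as the three letters keeps a name such as WWWIDGETS.COM intact.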

Once the URL has been copied into the text field, the user has the
option of pressing the Post Word button to add it to the list
of
offensive words and phrases.

(The user can edit the URL in the text field before
posting if needed.  At this point in time, I usually eliminate the
WWW by editing it out.)

Figure 2 shows the URL for HTTP://SECURE.BIDZ.COM selected and copied
into the text field, ready for posting to the list.

Register action listener on the Select URL
button

The code in Listing 7 registers an action listener on the Select URL
button.  This listener finds and selects the next URL that appears
in the message text.  When it finds a URL, it writes the selected
characters into the third text field in Figure 2.

URLs are located by searching for HTTP:// in the text of the
message.

    selectUrlButton.addActionListener(
      new ActionListener(){
        public void actionPerformed(
            ActionEvent e){
          //Create an upper-case String
          // replica of the contents of the
          // textArea, and convert the contents
          // of the text area to upper case.
          String theString =
              textArea.getText().toUpperCase();
          textArea.setText(theString);

          //Get the index of the next URL and
          // save it as previous URL for
          // initialization purposes.
          int urlIndex =
              theString.indexOf(
                  "HTTP://",prevUrl);
          prevUrl = urlIndex + 1;

          //Get the index of the first / in the
          // URL following the //. Also deal
          // with the possibility that no URL
          // was found. This slash character
          // signals the end of the server name
          // and the beginning of a directory
          // on the server. Don't include the
          // directory name in the selected
          // text.
          if(urlIndex != -1){
            //A URL was found.
            int slashIndex = theString.indexOf(
                '/',urlIndex + 7);
            //Deal with the possibility that
            // the URL doesn't contain a slash
            // in the body of the URL and
            // there is no slash later in the
            // message.
            if(slashIndex != -1){
              //Deal with URLs that don't have
              // a slash in them causing the
              // program to find the next slash
              // somewhere in the body of the
              // text instead. Limit the amount
              // of selected text to 80
              // characters. The extra text
              // will require editing in the
              // text field by the user.
              if(slashIndex - urlIndex > 80){
                slashIndex = urlIndex + 80;
              }//end if
              //Select the URL, beginning after
              // the second slash, and write it
              // into the outputWordField.
              //Discard HTTP:// to reduce run
              // time later when screening
              // for SPAM messages on the
              // server.
              textArea.select(
                  urlIndex+7,slashIndex);
              //Put the selected URL into the
              // output word field.
              outputWordField.setText(
                  textArea.getSelectedText());
            }//if slashIndex != -1
            else{
              //Program doesn't handle this
              // unusual situation.
              outputWordField.setText("Can't "
                  + "find / following HTTP://");
            }//end else
          }//end if urlIndex != -1
          else{//urlIndex == -1
            outputWordField.setText(
                "Can't find HTTP://");
          }//end else
        }//end actionPerformed
      }//end ActionListener
    );//end addActionListener

Listing 7

Again, the previous operational description and the comments in the
code should make the code in Listing 7 self-explanatory.
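
As an illustration of one way to reduce the manual editing mentioned in the comments in Listing 7, the following sketch ends the server name at the first slash, quotation mark, angle bracket, or whitespace character, whichever comes first. This is an alternative technique, not the approach used in Listing 7, and the class and method names are hypothetical. It targets a modern JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class UrlHostSketch {

    private static final String MARKER = "HTTP://";

    // Returns the server name that follows the HTTP:// found
    // at urlIndex in upper-case text.
    static String hostAt(String upperText, int urlIndex) {
        int start = urlIndex + MARKER.length();
        int end = start;
        while (end < upperText.length()) {
            char c = upperText.charAt(end);
            if (c == '/' || c == '"' || c == '\''
                    || c == '<' || c == '>'
                    || Character.isWhitespace(c)) {
                break; // end of the server name
            }
            end++;
        }
        return upperText.substring(start, end);
    }

    // Returns every server name in the message, in order.
    static List<String> allHosts(String messageText) {
        String upper = messageText.toUpperCase(Locale.ROOT);
        List<String> hosts = new ArrayList<>();
        int index = upper.indexOf(MARKER);
        while (index != -1) {
            hosts.add(hostAt(upper, index));
            index = upper.indexOf(MARKER, index + 1);
        }
        return hosts;
    }
}
```

One detail of Listing 7 worth noting: when no further URL is found, urlIndex is -1, so the statement prevUrl = urlIndex + 1 sets prevUrl back to zero and the next press of the button starts the search again from the top of the message.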

The Delete Local File/NextMsg button

The code in Listing 8 registers an ActionListener on the Delete
Local File/NextMsg
button shown in Figure 2.  This makes it
possible for the user to remove a file from the local directory.

The user would typically delete files that are recognized as not being
SPAM. Also, the user would delete files for which analysis is complete.
When a file is deleted using this button, the next file is
automatically loaded ready for processing.

    deleteButton.addActionListener(
      new ActionListener(){
        public void actionPerformed(
            ActionEvent e){
          //Delete the local file currently being
          // displayed in the GUI. Must subtract
          // one from the value of the file
          // counter to cause it to reference the
          // current file because it has already
          // been incremented by the event
          // handler for the Next button in
          // preparation for processing the next
          // file.

          //Create a File object that represents
          // the current file.
          File tempFile = new File(
              dataDir.getAbsolutePath() +
              File.separator +
              dirList[fileCounter-1]);
          if(tempFile.exists()){
            tempFile.delete();//Delete the file
          }//end if

          //Fire a synthetic event on the Next
          // button to cause the program to
          // process the next file in the
          // directory listing.
          Toolkit.getDefaultToolkit().
              getSystemEventQueue().
              postEvent(new ActionEvent(
                  nextButton,
                  ActionEvent.
                      ACTION_PERFORMED,
                  "Next"));
        }//end actionPerformed
      }//end ActionListener
    );//end addActionListener

Listing 8

The code in Listing 8 is essentially the same as code that I discussed
in detail in the previous lesson.  Therefore, I won’t repeat that
discussion here.  Rather, I will simply refer you back to that
lesson.

Configure the GUI

The code in Listing 9 configures the GUI by placing the various
components in it, setting the title, setting the size, and making the
whole thing visible.

    add(postButton);
    add(nextButton);
    add(deleteButton);
    add(selectIpButton);
    add(selectUrlButton);
    add(fromField);
    add(subjField);
    add(outputWordField);
    add(operMsgField);
    add(textArea);
    setTitle("Copyright 2004, R.G.Baldwin");
    //You should increase the width to at least
    // 750 pixels, and increase the width of the
    // text fields and the text area to at least
    // 100 characters to make the GUI more
    // usable.
    setSize(400,400);
    //Make the GUI visible.
    setVisible(true);

Listing 9

The code in Listing 9 is completely straightforward, and shouldn’t
require further discussion.

The remaining code

The remaining code in Listing 10 near the end of the lesson is
essentially the same as code that I discussed in detail in the previous
lesson.  Therefore, I won’t repeat that discussion here.

Run the Program

I encourage you to copy the code from Listing 10 below, as well as
the programs and starter text files at the ends of the lessons entitled
Enlisting Java in the War Against SPAM: The Communications Module,
Enlisting Java in the War Against SPAM: The Screening Module, and
Enlisting Java in the War Against SPAM: Training the Subject Line
Screener.

Compile and execute the
program named Pop302.  This should put some message files
in your history directory that can be used to test the algorithm
training programs named Pop302d and Pop302e.

Then compile and execute the programs named Pop302d and Pop302e
to train the algorithm based on your history data.  Experiment
with the programs, making changes and observing the results of
your changes.

For maximum usability, I recommend that you increase the width of
the GUIs to at least 750 pixels, and increase the width of the text
fields and text areas in those GUIs to at least 100 characters.

Before executing the programs, you will need to create three text
files
having the following names and purposes and store them in the folder
containing your compiled Java class files.

  • Pop302a.txt – contains offensive Subject line
    words and phrases
  • Pop302b.txt – contains offensive body text words
    and phrases
  • Pop302c.txt – contains friendly Email addresses and friendly Subject
    line material
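
As a rough illustration of how such a file might be consumed, the following sketch reads phrases into a sorted, duplicate-free TreeSet. It assumes one phrase per line, which you should confirm against the starter files, and it is not the loading code used by the programs in this series. The class and method names are hypothetical, and the sketch targets a modern JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.TreeSet;

final class WordListSketch {

    // Read one phrase per line (an assumption), skipping
    // blank lines. The TreeSet keeps the phrases sorted and
    // silently discards duplicates.
    static TreeSet<String> parse(Reader source) {
        TreeSet<String> phrases = new TreeSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String phrase = line.trim();
                if (!phrase.isEmpty()) {
                    phrases.add(phrase);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return phrases;
    }
}
```

To read an actual file, pass in something like new java.io.FileReader("Pop302b.txt") from the folder that contains your class files.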

Eventually you will need to populate these files with words
and phrases that work well for you.  The programs named Pop302d
and Pop302e
will help you to update those files based on actual SPAM messages in
your history directory.

In the meantime, I have provided sample text files in Listings 35,
36, and 37 in the lesson entitled Enlisting
Java in the War Against SPAM: The Screening Module.

You can use those files as starter lists.  If
you receive the same kinds of SPAM that I receive, the words in these
lists should make it possible for you to test the programs and to
get a few hits on SPAM messages.

These are simple text files so feel free to add other words and
phrases as appropriate.

(Let me caution you not to enable the
DELE
code in the program named Pop302 until you are certain
that you actually want to delete messages from the server.  Once a
message is deleted from the server, there is no way to recover it from
the server.)

Summary

This lesson shows you how to train my SPAM
screening algorithm to do a better job of
identifying SPAM in the future based on the body text of SPAM
messages.

The lesson explains a program named Pop302e,
which
uses historical data to train the algorithm.

The GUI used to control the program is shown in Figures 1, 2, and
3.  The
program
is designed for extreme ease of use.  Training the algorithm
consists
mainly of pressing buttons and occasionally selecting text with the
mouse.  This can
be accomplished very quickly with very little effort.  Except for
the possible requirement to delete extra characters, no actual typing
is required.

After several weeks of training, my
algorithm is reliably identifying about ninety-five percent of all SPAM
messages, allowing me to delete them from my Email server before
downloading them into my primary Email
client.

What’s Next?

This program is a continuing work in progress.  Just today, for
example, I have been working to make the program more effective in
combating a flurry of new Email activity triggered by the propagation
of the W32.Novarg.A@mm
virus across the Internet.

The characteristics of the Email messages transporting the virus are
well defined at the Symantec
site.  I have used those characteristics to successfully
identify and delete the messages transporting the virus from my Email
server before downloading them into my primary Email client.  Once
I learned of the virus, and located the characteristics on the Symantec
web site, only a few minutes were required to create a
successful block.

However, I am also receiving large numbers of Email messages from
servers
that received messages containing the virus with the From line
faked to show my return Email address.  These messages, which are
complaining that I supposedly sent them a virus, are a little more
difficult to deal with.  They arrive in many different
formats so it has taken some time to nail down the key words in the
many different formats.  After about twenty-four hours, I pretty
well have that
situation under control as well.

Finally, I am receiving large numbers of undeliverable Email
messages from servers
that received messages containing the virus with the From line
faked to show my return Email address and an invalid To
address.  These messages are also difficult to deal with due to
two factors:

  • They arrive in many different formats.
  • They often contain an embedded copy of the original message.

The second item in the above list results in a message within a
message.  While I have the system updated to identify and delete
these messages, at this point, the display of the message in the large
text area of Figure 3 is sometimes less than ideal.

Note, however, that my primary goal in developing this software package
was to combat the daily flood of SPAM messages, and was not to deal
with the occasional flurry of messages resulting from viruses. 
The ability to deal with those messages to any degree of effectiveness
is simply a bonus.

I continue to make
improvements to the programs as I learn more about the tricks that
spammers use to thwart programs such as this one.  Some of those
tricks and possible ways to defeat those tricks were described near the
end of the previous lesson entitled Enlisting
Java in the War Against SPAM: Training the Subject Line Screener.

The next lesson, and several lessons following that one, will show you
my implementation of program upgrades designed to make the program more
effective against the more sophisticated spammer tricks.

Complete Program Listing


A complete listing of the program is provided in Listing 10.

Disclaimer of responsibility:  If you elect to use this
program
you use it at your own risk.  Make absolutely certain that you
understand what you are doing before you compile and execute the
program. 
Inappropriate use could result in the loss of Email messages.  The
author of this program, Richard G. Baldwin, accepts no responsibility
for any losses that you may incur as a result of using this program.

/*File Pop302e.java Copyright 2004, R.G.Baldwin
Rev 01/26/05

The purpose of this program is to process text
files produced by Pop302.java for the purpose of
using the information contained in those files to
update the word list stored in Pop302b.txt

Tested using SDK 1.4.2 under WinXP
************************************************/
import java.io.*;
import java.util.*;
import java.awt.*;
import java.awt.event.*;

class Pop302e extends Frame{

BufferedReader inputStream;
PrintWriter outputStream;
TextArea textArea = new TextArea(10,50);
Button postButton = new Button("Post Word");
Button deleteButton = new Button(
"Delete Local File/NextMsg");
Button nextButton = new Button("Next Msg");
Button selectIpButton = new Button(
"Select IP");
Button selectUrlButton = new Button(
"Select URL");
int prevUrl = 0;
TextField fromField = new TextField(
"From data will appear here",50);
TextField subjField = new TextField(
"Subject data will appear here",50);
TextField outputWordField = new TextField(
"User pastes output words here",50);
TextField operMsgField = new TextField(
"User instructions appear here. " +
"Press Next Msg to process first message.",
50);
TreeSet bodyWordList;
String[] dirList;
int fileCounter = 0;
//Change the following to move message files
// to a more permanent location on the disk.
File dataDir = new File("c:/MailFiles");
String msgToUser =
"\nPost phrases for this message.\n" +
"Then press Next Msg to process " +
"next message.";

public static void main(String[] args){
Pop302e thisObj = new Pop302e();
thisObj.makeBodyWordList();
}//end main
//===========================================//

Pop302e(){//constructor
//Register a window listener to service
// the close button on the Frame. This is
// an anonymous class definition.
this.addWindowListener(
new WindowAdapter(){
public void windowClosing(WindowEvent e){
//Write the updated word list stored in
// a TreeSet object to an output file
// on shutdown. It is also written
// when you click the Next button and
// there are no remaining files to be
// processed.
writeBodyWordList();
System.exit(0);
}//end windowClosing
}//end WindowAdapter()
);//end addWindowListener

setLayout(new FlowLayout());

//Register an ActionListener on the
// nextButton. This is an anonymous
// class definition.
nextButton.addActionListener(
new ActionListener(){
public void actionPerformed(
ActionEvent e){
//Do this to assist the action listener
// on the selectUrlButton
prevUrl = 0;

//Protect against ArrayIndexOutOfBounds
if((fileCounter >= 0) &&
(fileCounter < dirList.length)){

//The user clicked the Next button
// but there are no more files.
if(fileCounter ==
(dirList.length - 1)){

//Write the modified word list
// stored in the TreeSet object to
// an output file. This also
// happens when the user clicks the
// close button on the Frame.
writeBodyWordList();
msgToUser = "\n\nNo more messages."
+ "\nPost phrases for this "
+ "message.\n Then press "
+ "close to terminate.";
//Disable the Next button so that
// the user cannot fire any more
// events of this type.
nextButton.setEnabled(false);
}//end if

//Identify the file being processed
textArea.setText("Processing " +
dirList[fileCounter] + "\n");

//Provide instructions to the user.
operMsgField.setText("Paste a phrase"
+ " in the output field and press "
+ "Post. Post as many new phrases "
+ "as you want. Press next to "
+ "process next message.");
outputWordField.setText("Paste "
+ "output phrase here and then "
+ "press Post.");

try{
//Open the file containing a local
// copy of the message. Note that
// the message has been mangled by
// inserting asterisks in an
// attempt to protect against
// viruses.
BufferedReader inData
= new BufferedReader(
new FileReader(dataDir.
getAbsolutePath() + File.
separator + dirList[
fileCounter]));
String data; //temp holding area

//Precondition the display of
// Subject in the GUI by skipping
// header lines prior to the
// Subject line. Mark the beginning
// of the file. Set the
// readAheadLimit to 10000
// characters before the mark will
// be lost.
inData.mark(10000);
//Some messages may not contain a
// Subject or From line. Don't
// want the old one to continue to
// be visible in the GUI.
subjField.setText(
"No Subj line found yet");
fromField.setText(
"No From line found yet");
while((data = inData.readLine())
!= null){
//A null result indicates end of
// file.

//Remove the asterisks that were
// inserted into the data when
// the file was written in an
// attempt to protect against
// viruses. Append two asterisks
// to the end of each line.
data = removeStars(data);

//Trap the Subject line, convert
// it to upper case, and display
// it in a field on the GUI.
if(data.startsWith("Subject:")){
subjField.setText(
data.toUpperCase());
break;//No need to keep reading
}//end if
}//end while loop on null

//Reset back to beginning of file.
// The Subject for this message is
// now showing in the GUI.
inData.reset();

//Precondition the display of From
// line in the GUI by skipping
// header lines prior to the From
// line. Code is similar to that
// discussed above.
while((data = inData.readLine())
!= null){
data = removeStars(data);
if(data.startsWith("From:")){
fromField.setText(
data.toUpperCase());
break;
}//end if
}//end while loop on null

//Reset back to beginning of file.
// The From line for this message
// is now showing in the GUI. Read
// and display the entire file.
// This is the data that the user
// will analyze to determine the
// new words or phrases that will
// be added to the word list.
inData.reset();

//Read and display strings until
// eof is indicated by null.
while((data = inData.readLine())
!= null){
data = removeStars(data);
textArea.append(data + "\n");
}//end while loop

//Display messages to the user at
// the end of the data in the text
// area.
textArea.append(msgToUser + "\n");
inData.close();//Close file
}catch(Exception ex){
ex.printStackTrace();}

//Increment the fileCounter so that
// the next time the Next button
// fires an ActionEvent, the next
// file in the directory listing will
// be processed.
fileCounter++;

}//end if on fileCounter in bounds
else{
//File counter out of bounds. This
// happens if you delete all the
// files.
textArea.setText(
"No more files. Press Close to "
+ "terminate.");
}//end else
}//end actionPerformed
}//end ActionListener
);//end addActionListener

//Register an object of the following
// anonymous class on both the Post button
// and the outputWordField. That way, the
// contents of the outputWordField can be
// posted to the new word list by either
// clicking the Post button, or pressing the
// Enter key when the outputWordField has the
// focus. Note that this is an anonymous
// class, but it is not an anonymous object.
ActionListener postListener =
new ActionListener(){
public void actionPerformed(
ActionEvent e){
//Get the word or phrase from the field
// and add it to the TreeSet object.
String tempWord =
outputWordField.getText();
bodyWordList.add(tempWord);

//Provide feedback to confirm that it
// has been posted. This tells the
// user that she is free to post
// another word if she desires.
outputWordField.setText(
tempWord + " posted");
}//end actionPerformed
};//end ActionListener

//Register the ActionListener object on
// the two source objects.
postButton.addActionListener(postListener);
outputWordField.addActionListener(
postListener);

//Register an action listener on the
// selectIP button. This listener finds
// and selects the IP address of the
// originator of the message. Then it
// writes the selected characters into the
// outputWordField where it can be added
// to the word list by pressing the Post
// button.
//The IP address of the originator is the
// first IP address in square brackets that
// occurs in the last line in the message
// that starts with "Received:"
selectIpButton.addActionListener(
new ActionListener(){
public void actionPerformed(
ActionEvent e){
//Create a String replica of the
// contents of the textArea.
String theString =
textArea.getText();
//Get the index of the line
// containing the originating
// IP address.
int receivedIndex =
theString.lastIndexOf("Received:");
//Make that line visible
textArea.select(
receivedIndex,receivedIndex+9);
//Get the indices of the square
// brackets containing the IP address
int leftIpIndex =theString.indexOf(
'[',receivedIndex);
if(leftIpIndex != -1){
int rightIpIndex =
theString.indexOf(
']',leftIpIndex);
//Select the IP address and write
// it into the outputWordField.
textArea.select(leftIpIndex+1,
rightIpIndex);
outputWordField.setText(
textArea.getSelectedText());
}//end if not -1
else{//leftIpIndex == -1
outputWordField.setText(
"No [...] on this Received line");
}//end else
}//end actionPerformed
}//end ActionListener
);//end addActionListener


//Register an action listener on the
// selectUrl button. This listener finds
// and selects the next URL that appears in
// the message. When it finds a URL, it
// writes the selected characters into the
// outputWordField where it can be added
// to the word list by pressing the Post
// button.
//URLs are located by searching for HTTP://
//A good upgrade would be to eliminate WWW
// when it is included in the URL, because
// it doesn't add much information and
// consumes processing time later when
// messages on the server are being
// screened.
selectUrlButton.addActionListener(
new ActionListener(){
public void actionPerformed(
ActionEvent e){
//Create an upper-case String
// replica of the contents of the
// textArea, and convert the contents
// of the text area to upper case.
String theString =
textArea.getText().toUpperCase();
textArea.setText(theString);

//Get the index of the next URL and
// save it as previous URL.
int urlIndex =
theString.indexOf(
"HTTP://",prevUrl);
prevUrl = urlIndex + 1;

//Get the index of the first / in the
// URL. Deal with the possibility
// that no URL will be found.
if(urlIndex != -1){
int slashIndex =theString.indexOf(
'/',urlIndex + 7);
//Deal with the possibility that
// the URL doesn't contain a slash
// in the body of the URL and
// there is no slash later in the
// message.
if(slashIndex != -1){
//Deal with URLs that don't have
// a slash in them causing the
// program to find the next slash
// somewhere in the body of the
// text instead.
if(slashIndex - urlIndex > 80){
slashIndex = urlIndex + 80;
}//end if
//Select the URL, beginning after
// the second slash, and write it
// into the outputWordField.
//Discard HTTP:// to reduce run
// time later when screening
// for SPAM messages on the
// server.
textArea.select(
urlIndex+7,slashIndex);
//Put the selected URL into the
// output word field.
outputWordField.setText(
textArea.getSelectedText());
}//if slashIndex != -1
else{
outputWordField.setText("Can't "
+ "find / following HTTP://");
}//end else
}//end if urlIndex != -1
else{//urlIndex == -1
outputWordField.setText(
"Can't find HTTP://");
}//end else
}//end actionPerformed
}//end ActionListener
);//end addActionListener

//Register an ActionListener on the Delete
// button to make it possible for the
// user to remove a file from the local
// directory. The user would typically
// delete files that are recognized as not
// being SPAM. Also, the user would
// delete files for which analysis is
// complete. When a file is deleted using
// this button, the next file is
// automatically loaded ready for processing.
deleteButton.addActionListener(
new ActionListener(){
public void actionPerformed(
ActionEvent e){
//Delete the local file currently being
// displayed in the GUI. Must subtract
// one from the value of the file
// counter to cause it to reference the
// current file because it has already
// been incremented by the event
// handler for the Next button in
// preparation for processing the next
// file.

//Create a File object that represents
// the current file.
File tempFile = new File(
dataDir.getAbsolutePath() +
File.separator +
dirList[fileCounter-1]);
if(tempFile.exists()){
tempFile.delete();//Delete the file
}//end if

//Fire a synthetic event on the Next
// button to cause the program to
// process the next file in the
// directory listing.
Toolkit.getDefaultToolkit().
getSystemEventQueue().
postEvent(new ActionEvent(
nextButton,
ActionEvent.
ACTION_PERFORMED,
"Next"));
}//end actionPerformed
}//end ActionListener
);//end addActionListener

//Configure the GUI by placing the various
// components on it.
add(postButton);
add(nextButton);
add(deleteButton);
add(selectIpButton);
add(selectUrlButton);
add(fromField);
add(subjField);
add(outputWordField);
add(operMsgField);
add(textArea);
setTitle("Copyright 2004, R.G.Baldwin");
//You should increase the width to at least
// 750 pixels, and increase the width of the
// text fields and the text area to at least
// 100 characters to make the GUI more
// usable.
setSize(400,400);
//Make the GUI visible.
setVisible(true);

//The following code creates a directory
// listing containing only those files that
// end in .txt.
//This is an anonymous implementation of a
// class that implements FilenameFilter.
dirList = dataDir.list(
new FilenameFilter(){
public boolean accept(
File dir,String name){
if(!(new File(dir,name).isFile()))
return false;
return name.endsWith(".txt");
}//end accept
}//end FilenameFilter
);//end list

//Create a message in the text area at
// startup showing the list of files in the
// directory that are available for
// processing.
this.textArea.append("Files to be processed"
+ "\n");
//Display the list of files
for(int cnt = 0;cnt < dirList.length;cnt++){
this.textArea.append(dirList[cnt] + "\n");
}//end for loop
//Scroll back to the top of the text area
textArea.select(0,0);
}//end constructor
//===========================================//


//Purpose: To create a TreeSet object
// containing words used to filter the message
// body lines in the program named
// Pop302.java.
//This method reads strings from a text file
// named Pop302b.txt and creates the list as
// a TreeSet object sorted in natural order
// with no duplicates.
//After creating the list, it writes the data
// from the list into a backup file named
// Pop302b.bakN, where N is the value of the
// next available file name in the directory.
//A new backup file with a unique name is
// created each time the program is run. Once
// the number of backup files reaches 5, the
// program automatically deletes the oldest
// file before creating a new backup
// file. Thus the program automatically
// maintains a sequence of five backup files
// with extensions .bak0 through .bak5 with one
// number missing. The age-order of the files
// should be determined by the modification
// date and not by the name of the file.
//The data read from the file is converted to
// upper case before being added to the TreeSet
// object.

void makeBodyWordList(){
bodyWordList = new TreeSet();

//Read words or phrases from text file and
// populate the TreeSet object.
try{
BufferedReader inData
= new BufferedReader(new FileReader(
"Pop302b.txt"));
String data; //temp holding area

while((data = inData.readLine()) != null){
bodyWordList.add(data.toUpperCase());
}//end while loop

inData.close();//Close input file

//Write a backup file before making any
// modifications to the data.

//First determine the name of the next
// backup file allowed in the directory.
int N = 0;
File theFile = null;
for(N = 0;N < 6;N++){
theFile = new File("Pop302b.bak" + N);
if(!(theFile.exists()))break;
}//end for loop

//Cause N to rotate from 0 through 5
if(N == 5){//del file 0 for use next time
new File("Pop302b.bak0").delete();
}//end if
else{//delete the next file in sequence
if(new File(
"Pop302b.bak" + (N + 1)).exists()){
new File(
"Pop302b.bak" + (N + 1)).delete();
}//end if
}//end else

//Now write the output file
DataOutputStream dataOut =
new DataOutputStream(
new FileOutputStream(
theFile));

//Use an Iterator object to access the data
// in the TreeSet object.
Iterator iter = bodyWordList.iterator();

while(iter.hasNext()){
data = (String)iter.next();
dataOut.writeBytes(data + "\n");
}//end while

dataOut.close();
}catch(Exception e){e.printStackTrace();}
}//end makeBodyWordList
//===========================================//

//Purpose: To write the data from a TreeSet
// object into a file named Pop302b.txt that
// is used in the program named Pop302.java to
// filter the message body lines.
//This method is the reverse of the method
// named makeBodyWordList.

void writeBodyWordList(){
try{
DataOutputStream dataOut =
new DataOutputStream(
new FileOutputStream(
"Pop302b.txt"));

//Use an iterator to access the data in
// the TreeSet object.
Iterator iter = bodyWordList.iterator();
String data;

while(iter.hasNext()){
data = (String)iter.next();
dataOut.writeBytes(data + "\n");
}//end while

dataOut.close();
}catch(Exception e){e.printStackTrace();}
}//end writeBodyWordList
//===========================================//

//Purpose of this method is to remove the
// asterisks inserted into the data by the
// method named insertStars when the data files
// were stored on the disk. In addition to
// removing asterisks, two asterisks are
// appended to the end of each line.
String removeStars(String stringIn){
StringBuffer stringBuf =
new StringBuffer(stringIn);
int index = 0;
while(index > -1){
index = stringBuf.lastIndexOf("*");
if(index > -1){
stringBuf.delete(index,index+1);
}//end if
}//end while
stringBuf.append("**");
return new String(stringBuf);
}//end removeStars()

}//end class Pop302e
//============================================//

Listing 10
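
The selection logic in the selectIP and selectUrl listeners in Listing 10 is tied to the text area, which makes it awkward to try out on its own. The following is a minimal sketch, not part of Pop302e, that restates the same two searches as plain string methods. The class name MessageScan and the method names originatorIp and nextUrlHost are hypothetical. Like the listing, originatorIp takes the first bracketed text that follows the last occurrence of "Received:". Unlike the listing, nextUrlHost ends the host at the first slash, white space, quote, or angle bracket instead of relying on the 80-character cap, so treat that part as an assumption about what a reasonable upgrade might look like.

```java
import java.util.Locale;

// Hypothetical helper class for illustration only.
final class MessageScan {

    // Returns the text inside the first [...] pair that follows the last
    // "Received:" in the message, or null if no such pair exists.
    static String originatorIp(String message) {
        int received = message.lastIndexOf("Received:");
        if (received == -1) return null;
        int left = message.indexOf('[', received);
        if (left == -1) return null;
        int right = message.indexOf(']', left);
        if (right == -1) return null;
        return message.substring(left + 1, right);
    }

    // Returns the upper-case host portion of the first "HTTP://" URL found
    // at or after fromIndex, or null if there is none. The scheme is
    // discarded, as it is in the listing.
    static String nextUrlHost(String message, int fromIndex) {
        String upper = message.toUpperCase(Locale.ROOT);
        int start = upper.indexOf("HTTP://", fromIndex);
        if (start == -1) return null;
        int hostStart = start + 7; // skip over HTTP://
        int end = hostStart;
        // Assumed delimiters that end the host portion of a URL.
        String stops = "/ \t\r\n\"'<>";
        while (end < upper.length()
                && stops.indexOf(upper.charAt(end)) == -1) {
            end++;
        }
        return upper.substring(hostStart, end);
    }

    public static void main(String[] args) {
        String msg = "Received: from a [10.0.0.1]\r\n"
                + "Received: from b [192.0.2.7] by c\r\n"
                + "\r\nVisit http://www.example.com/buy now";
        System.out.println(originatorIp(msg));
        System.out.println(nextUrlHost(msg, 0));
    }
}
```

Because both methods return null when nothing is found, a listener that calls them can decide for itself what message to show in the output word field.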


Copyright 2004, Richard G. Baldwin.  Reproduction in whole or in
part in any form or medium without express written permission from
Richard Baldwin is prohibited.

About the author

Richard Baldwin
is a college professor (at Austin Community College in Austin, TX) and
private consultant whose primary focus is a combination of Java, C#,
and XML. In addition to the many platform and/or language independent
benefits of Java and C# applications, he believes that a combination of
Java, C#, and XML will become the primary driving force in the delivery
of structured information on the Web.

Richard has participated in numerous consulting projects, and he
frequently provides onsite training at the high-tech companies located
in and around Austin, Texas.  He is the author of Baldwin’s
Programming Tutorials, which has gained a worldwide following among
experienced and aspiring programmers. He has also published articles
in JavaPro magazine.

Richard holds an MSEE degree from Southern Methodist University
and has many years of experience in the application of computer
technology to real-world problems.

Baldwin@DickBaldwin.com

-end-
 
