Enlisting Java in the War Against SPAM: The Screening Module

  • February 3, 2004
  • By Richard G. Baldwin

Java Programming Notes # 2152


Preface

This is the second lesson in a series designed to teach you how to write a Java program to remove SPAM from your Email server before you download it into your primary Email client.  The first lesson was entitled Enlisting Java in the War Against SPAM, Part 1, The Communications Module.

The communications module

The first lesson explained the communications module used to communicate with your Email server, and to remove SPAM messages from the server.

SPAM screening algorithm

The program is designed to allow you to use my SPAM screening algorithm, or to invent your own.  This lesson explains the inner workings of my SPAM screening algorithm.  You can use my algorithm as a starting point if you decide to invent your own.

Training the algorithm


The next two lessons will explain how my algorithm can be trained to do an increasingly better job of screening SPAM over time.

Viewing tip

You may find it useful to open another copy of this lesson in a separate browser window.  That will make it easier for you to scroll back and forth among the different listings and figures while you are reading about them.

Supplementary material

I recommend that you also study the other lessons in my extensive collection of online Java tutorials.  You will find those lessons published at Gamelan.com.  However, as of the date of this writing, Gamelan doesn't maintain a consolidated index of my Java tutorial lessons, and sometimes they are difficult to locate there.  You will find a consolidated index at www.DickBaldwin.com.

Preview

Can you write better SPAM screening algorithms?

Did you ever think that you might be able to write better SPAM screening algorithms than those available in the SPAM screening software that you are now using?  If so, this series of lessons is for you.

Even if that is not the case, like most of us, you are probably overwhelmed by SPAM and therefore you may find this lesson interesting.

Remove SPAM from the server

In this and the previous lesson, I am showing you how to write a Java program that supplements the SPAM screening software that you are currently using.  This program is used to identify and remove SPAM from your Email server before it is downloaded into your primary Email client.

Any SPAM that makes it past this program can be further acted upon by the SPAM screener that is built into your Email client.

The communications module

This series consists of (at least) four lessons.  The first lesson in the series explained the communications module used to communicate with your Email server, and to remove SPAM messages from the server.

My SPAM screening algorithm

As mentioned above, this program is designed to allow you to invent and implement your own SPAM screening algorithm in addition to, or as an alternative to my algorithm.

This lesson explains the inner workings of my SPAM screening algorithm.  My algorithm operates separately on the Subject line, the From line, and the body text of each Email message.

Algorithm training programs

The third lesson will explain a companion program named Pop302d, designed to make use of historical data to train the algorithm to do a better job of identifying SPAM in future messages based on the Subject of the message.

The fourth lesson will explain another companion program named Pop302e, designed to make use of historical data to train the algorithm to do a better job of identifying SPAM based on the body text of the message (which includes the From line).

Because of the need to train the algorithm, and the ease with which these companion programs make that possible, the companion programs are just as important as the main program.

Operational sequence

Here is the typical operational sequence that I go through each morning to remove SPAM from my Email server before downloading it into my primary Email client, and to train the algorithm to recognize any future SPAM messages that made it through the screen that morning.

  1. Run the main program named Pop302 (explained in this and the previous lesson) to identify SPAM and remove it from the server.  This normally allows a few (typically about ten percent) SPAM messages (stragglers) to get through, which are stored in a history folder on my local disk.
  2. Run the program named Pop302d (explained in the next lesson) to train the algorithm to recognize the stragglers as SPAM based on information in the Subject line.
  3. Run the program named Pop302e (explained in Part 4) to train the algorithm to recognize the stragglers as SPAM based on information in the body text.
  4. Go back and run the main program named Pop302 to remove those SPAM straggler messages from the server.
  5. Run my primary Email client to download the remaining good messages into my local Email inbox.

When I am in a hurry ...

However, it isn't necessary to perform all of these steps every day.  On those mornings when I am in a hurry, I skip steps 2, 3, and 4, leaving the straggler messages in the local history folder for use later.

(The straggler messages will, of course, end up in my local Email inbox when I run my primary Email client without purposely removing them from the server beforehand.)

Sometime later (perhaps the next day or several days later) I will perform steps 2 and 3 to train the algorithm to recognize future SPAM messages represented by the characteristics of the messages that have been saved in the local history folder.

Effectiveness of my algorithm

After about one week of training, my algorithm was reliably identifying about ninety percent of all SPAM messages, allowing me to delete them from my Email server before downloading them into my primary Email client.  By executing steps 2, 3, and 4 above, I am able to also eliminate the remaining ten percent of the SPAM messages before downloading them into my primary Email client.

Discussion and Sample Code

The full Screen class

The version of the program that I discussed in the previous lesson contained a stripped-down version of a class named Screen. This version of the program allowed for testing the communications module on your system with your Email server without doing any actual screening for SPAM.

I will explain the full version of the class named Screen in this lesson.  In so doing, I explain my algorithm for identifying SPAM.

Purpose of the program

The purpose of this program is to read messages from a POP3 (Post Office Protocol - Version 3) server, to analyze the messages according to a set of screening rules, and to delete the messages that fail the screening test from the server.

(As written, the program asks the user to confirm the deletion of each message from the server, but this confirmation step could easily be removed if you decide to do so.)

Key words and phrases

My SPAM screening algorithm screens for SPAM on the basis of words or phrases in the From line, words or phrases in the Subject line, and words or phrases in the body text.

Friendly Email addresses and subjects

A list of friendly Email addresses and friendly subjects is used to screen the From line and the Subject line.  Messages that are from friendly Email addresses, and messages that have known good Subject lines are preserved on the server and no information about those messages is saved on the local disk. They are simply ignored after determining that they are friendly.

Different lists for Subject and body text

Different lists of words and phrases are used for screening Subject lines and body text for SPAM. This is important because the same set of words and phrases can't always be used for both cases.

For example, the word ANTIVIRUS is appropriate for screening the Subject line, but is not appropriate for screening the body text. The word ANTIVIRUS often appears legitimately in the header of Email messages that have been scanned for viruses by the server, but it also often appears in the Subject line of SPAM messages.

Common spammer tricks are defeated

Several common spammer tricks are defeated by my SPAM screening algorithm.

For example, the common spammer trick of inserting extra characters between the characters in an offending word or phrase is defeated.  Also, the common trick of mixing the case of the characters in an offending word or phrase is defeated.

As a specific example, my algorithm will recommend deletion of any message having any of the following in its Subject line or its body text if the word VIAGRA is included in the lists used to screen for SPAM:

vIaGrA
V.IagRA
V.I.A.G.R.A

Very important characteristics

These two characteristics of the algorithm alone have a significantly positive impact on the effectiveness of training the algorithm to do a better job of identifying SPAM in the future.

(You don't have to identify all of the variations of a word or phrase commonly used by spammers to fool the system.  The program does that for you automatically.)

My algorithm also defeats the common trick of appending random characters to the end of the Subject line, because it doesn't require a match for the entire Subject line.  Rather, it searches for words or phrases internal to the text of the Subject line.
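The following is a minimal sketch of one way to get the behavior described above. It is not the code from Listing 34. It assumes that a line is reduced to upper-case letters and digits before a simple substring search, and the names PhraseMatchSketch, normalize, and findOffendingPhrase are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Sketch only: one possible way to defeat mixed case, inserted characters,
// and random characters appended to the end of a Subject line.
class PhraseMatchSketch {

    // Upper-case the text and keep only letters and digits,
    // so that "V.I.a.G.r.A" becomes "VIAGRA".
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toUpperCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Return the first phrase found anywhere inside the line, or null if
    // none is found. A substring search is used, so the whole line does
    // not have to match.
    static String findOffendingPhrase(String line, List<String> phrases) {
        String normalizedLine = normalize(line);
        for (String phrase : phrases) {
            String normalizedPhrase = normalize(phrase);
            if (!normalizedPhrase.isEmpty()
                    && normalizedLine.contains(normalizedPhrase)) {
                return phrase;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("VIAGRA");
        String[] subjects = {
            "vIaGrA", "V.IagRA", "V.I.A.G.R.A",
            "Buy V.I.A.G.R.A now xqzt", "Lunch on Friday?"
        };
        for (String subject : subjects) {
            System.out.println(
                subject + " -> " + findOffendingPhrase(subject, phrases));
        }
    }
}
```

Note that discarding spaces and punctuation in this way also lets a phrase match across word boundaries, which can raise the number of false positives.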

The user interface

Figure 1 shows the GUI through which the user controls the program.

Graphical user interface

Figure 1 Graphical User Interface
(Note that this GUI was purposely made narrow to cause it to fit into this narrow publication format.  I recommend that you increase the width of the Frame to at least 750 pixels, and increase the width of the TextField and TextArea objects to at least 100 characters each.

Note also that this is an actual SPAM message, from which I purposely removed the Email address of the sender prior to publication.  The message may not have actually been sent by the individual whose Email address appeared on the From line.)

The Offending Phrase

When the program identifies a message that is a candidate for deletion, the reason for that recommendation is shown in the third text field from the top in Figure 1.

Deleting a message from the server

The user confirms that the message should be deleted from the Server by clicking the Delete button in Figure 1. If the user doesn't want to delete the message, he should click the Start/Next button instead.

(Note that the capability to actually delete messages from the server was disabled in the program shown in Listing 34 near the end of this lesson.  Make certain that you are ready to actually delete messages from the server before you enable that capability.)

Information available at decision time

As currently written, this program requires the user to confirm the actual deletion of each SPAM message from the server before that message is actually deleted.

At the point in time that the user is required to confirm deletion of a message from the server, the following information is available to assist the user in making the decision:

  • From line
  • Subject line
  • Offending line of text, which may or may not be the subject
  • Offending word or phrase in the offending line of text
  • Entire raw text of the message down to and including the offending line

No images are rendered

No images are rendered by the program, so it is not necessary for the user to view offending images in order to make the decision to delete.

Deletion is not required

Having viewed the above information, if the user is still unable to make an informed decision to delete the message, the user has the option to let the message pass through and to be downloaded into his primary Email client.  Once having viewed the message in the primary Email client, the user still has the option of updating the offending word lists with IP addresses, URLs, etc, so that deletion decisions on future similar messages will be easier to make.

Saved in local archive folder

The raw text of every message that is identified as a candidate for deletion from the server is saved in an archive folder on the local disk, regardless of whether the user elects to delete it from the server or not. Thus if a message is deleted from the server and it is later determined that this was a mistake, a raw text copy of the deleted message is available locally in the archive folder.

(You should probably empty this folder periodically so that it won't fill up your disk.)

In addition, I have plans to write several additional programs that will analyze large numbers of SPAM messages in the archive folder for at least two purposes:

  • To remove from the word lists those words and phrases that occur in only a very small percentage of SPAM messages.  Such entries increase run time without contributing significantly to the desired result.
  • To search for common characteristics among SPAM messages that can be used to improve the effectiveness of the screening algorithm.

Saved in history folder

Except for messages from friendly Email addresses and messages with friendly Subject lines, all messages that are not identified as candidates for deletion from the server are saved in a history folder on the local disk.  These messages are used later to train the algorithm to do a better job of identifying future SPAM messages.  I will explain this training process in Part 3 and Part 4 of this series of lessons.

Protection against viruses

Before any message is saved in a local file, asterisks are inserted into the text on ten-character intervals in an attempt to destroy any virus code that may be embedded in the message.
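As a rough illustration of this step, the sketch below inserts an asterisk after every ten characters of a string. It is not taken from Listing 34. The class and method names are hypothetical, and whether the actual program inserts or overwrites characters at each interval is an assumption here (the sketch inserts).

```java
// Sketch only: break up any embedded code by inserting an asterisk
// after every `interval` characters of the text.
class AsteriskSketch {

    static String insertAsterisks(String text, int interval) {
        StringBuilder sb =
            new StringBuilder(text.length() + text.length() / interval + 1);
        for (int i = 0; i < text.length(); i++) {
            sb.append(text.charAt(i));
            // After the 10th, 20th, 30th ... character, add an asterisk.
            if ((i + 1) % interval == 0) {
                sb.append('*');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints 0123456789*ABCDEFGHIJ*KLM
        System.out.println(insertAsterisks("0123456789ABCDEFGHIJKLM", 10));
    }
}
```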

If a message makes it through the screen and is later identified as having a virus as an attachment, a series of ten or more bytes can be extracted from the virus code and added to the word list as an offending phrase.  This should cause any future messages having that same virus code as an attachment to be identified as a candidate for deletion from the server.

Training programs

Companion programs that I have written are used to analyze the non-deleted message files saved locally in the history folder in order to train the algorithm to do a better job of identifying SPAM messages in the future.

These programs are designed for extreme ease of use to encourage the user to train the algorithm frequently.  The better the algorithm is trained, the better it will perform.

I will explain these training programs in detail in Part 3 and Part 4 of this series of lessons.  A brief preview of the training programs is provided below.

Simple text files

All three word lists are maintained in local text files, which can be created and edited with an ordinary text editor if need be.  Thus, if one of the lists becomes corrupted, it is easy to correct the situation using an ordinary text editor.

File names

The following file names are hard-coded into the program.  You may want to change these file names for your version of the program.

  • Local copy - the unique file name for a local copy of each message is based on the unique identifier for that message (UIDL) obtained from the mail server.
  • Pop302a.txt - contains a word list for screening the Subject line for offensive words and phrases.
  • Pop302b.txt - contains a word list for screening the body text for offensive words and phrases.
  • Pop302c.txt - contains a list of friendly Email addresses and friendly subjects for screening the From and Subject lines to identify friendly messages.

Location of the text files

As written, the program requires the three .txt files to be in the same folder as the compiled .class files for the programs named Pop302, Pop302d, and Pop302e.  However, you can easily modify the programs to change the location of the .txt files if you choose to do so.  Just be sure to change the location in all three programs.

The local copies of the messages are stored in two different folders.  Some of the local copies are stored in a history folder while the remainder are stored in an archive folder.  The locations of these folders on the disk are hard-coded into the three programs.  You can change the locations if you like, but be sure to make appropriate changes to all three programs.

Three classes

This program consists of two main classes and one minor class. As discussed in the previous lesson, an object of the class named Pop302 handles all communications with the POP3 server.

A method belonging to an object of the class named Screen is used to screen each message in an attempt to identify SPAM.  This is the class that I will explain in this lesson.

This class can be totally replaced by Java programmers who choose to design their own screening algorithm provided that they maintain the interface with the object of the class named Pop302.

An object of a very simple class named ScreenResult is used as a wrapper to return several items of information from the screening method to the calling method.

Testing

The program was tested using SDK 1.4.2 under WinXP in conjunction with two different POP3 Email servers.

Will discuss in fragments

I will discuss the class named Screen in fragments.  A complete listing of the program is provided in Listing 34 near the end of the lesson.  You should be able to copy and paste that listing into your Java IDE to compile and test the program on your system.

Improvements in the class named Pop302

Before getting into the details of the class named Screen, I want to mention that the program shown in Listing 34 contains a couple of improvements relative to the version explained in the previous lesson.

One of the improvements involves displaying the message number in the bottom of the text area of Figure 1.

The other improvement involves making a safety check to confirm that the message number being maintained locally is in synchronization with the message number on the server (in the UIDL) before deleting a message from the server.
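The sketch below shows one way such a safety check could look. It is not the code from Listing 34. It assumes the reply to a POP3 "UIDL n" command has the form "+OK n unique-id" defined by RFC 1939, and the names UidlCheckSketch and inSync are hypothetical.

```java
// Sketch only: confirm that the server still associates the locally
// recorded message number with the locally recorded unique identifier.
class UidlCheckSketch {

    // serverReply is the single-line reply to "UIDL n",
    // for example "+OK 2 QhdPYR:00WBw1Ph7x7".
    static boolean inSync(String serverReply,
                          int localMsgNumber,
                          String localUidl) {
        if (serverReply == null) {
            return false;
        }
        String[] parts = serverReply.trim().split("\\s+");
        if (parts.length != 3 || !parts[0].equals("+OK")) {
            return false; // error reply or unexpected format
        }
        try {
            return Integer.parseInt(parts[1]) == localMsgNumber
                && parts[2].equals(localUidl);
        } catch (NumberFormatException e) {
            return false; // message number was not numeric
        }
    }
}
```

A caller would send the DELE command only when inSync returns true, and otherwise skip the deletion.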

If you understand the rest of the program, these two modifications should not require a detailed explanation.

Deletion of messages is disabled

Also before getting into the details of the Screen class I want to show you a fragment containing three statements that are disabled in the Pop302 class in Listing 34.  The three disabled statements are shown in Listing 1.  (Note that the statements are separated by comments in Listing 34.)

/*Begin comment block
  outputStream.println(
                "DELE " + msgNumber);
  textArea.append(
      "DELE "+validateOneLine()+"\n");
  textArea.append(
      "Deleted:" + msgNumber + "\n");

End comment block*/
Listing 1

The three statements shown in Listing 1 were purposely disabled (by including them in a comment block) to prevent you from accidentally deleting messages from the server during your early testing of the program. Do not enable these three statements until you are ready to actually delete messages from the server.  At that point in time, you can enable the three statements by removing the comment indicators that surround them.

The Screen class

The Screen class implements a set of rules for identifying SPAM messages and for recommending whether or not a message should be deleted from the server.

If you have a better way to identify SPAM, you can replace this class with a completely different class definition, so long as you maintain the interface with the object of the class named Pop302.

An object of this class has one entry point and one exit point, which is the public instance method of the Screen class named screenMsg.

A callback to the GUI

However, there is an additional linkage between the two objects that you need to consider.  The constructor for the Screen class receives a reference to the GUI object created by instantiating the class named Pop302. A method in the object of the Screen class uses that reference to display progress on the text area belonging to the GUI.

This display of progress is comforting on those occasions when a very long message is encountered and the user needs assurance that the system is still working, and isn't hung up.

This callback link could easily be eliminated by deleting code from several locations in the Screen class and removing the callback parameter from the constructor.

Beginning of the Screen class

The Screen class begins in Listing 2, which declares several instance variables.

class Screen{

  TreeSet subjWordList;
  TreeSet bodyWordList;
  TreeSet friendlyWordList;
  Pop302 theGui;//save callback reference here
  String phrase;

Listing 2

The purpose of these instance variables will become clear as I discuss the code in which they are used.

The constructor

The constructor for the Screen class is shown in its entirety in Listing 3.

  Screen(Pop302 theGui){//constructor
    this.theGui = theGui;

    makeSubjWordList();
    makeBodyWordList();
    makeFriendlyWordList();
  }//end constructor

Listing 3

As you can see, the constructor receives and saves a reference to the GUI.  This reference is used later to display progress as discussed above.

Make word lists as TreeSet objects

The last three statements in the constructor invoke methods that read text files containing lists of words or phrases, and create TreeSet objects containing those words and phrases.  These TreeSet objects are used later to test for the occurrence of the words or phrases in raw text versions of Email messages.

The TreeSet objects are created and populated by invoking three very similar methods:
  • makeSubjWordList
  • makeBodyWordList
  • makeFriendlyWordList

I will discuss each of these methods in the sections that follow.

The makeSubjWordList method

The purpose of the makeSubjWordList method is to create a TreeSet object containing words and phrases used later to screen the message Subject lines.

The makeSubjWordList method is shown in Listing 4.  This method reads strings from a text file named Pop302a.txt and creates the list as a TreeSet object.

  private void makeSubjWordList(){
    subjWordList = new TreeSet();

    try{
      BufferedReader inData
          = new BufferedReader(new FileReader(
                                 "Pop302a.txt"));
      String data; //temp holding area

      while((data = inData.readLine()) != null){
        subjWordList.add(data);
      }//end while loop
      inData.close();//Close file
    }catch(Exception e){e.printStackTrace();}
  }//end makeSubjWordList

Listing 4

Why use the TreeSet class?

The TreeSet class was chosen for this purpose because it eliminates duplicates.

(Duplicates in the list are bad because they increase runtime with no beneficial effect.  One of the major problems with the message filter in the commercial Email client program that I use is that there is no way to avoid duplicates other than simply remembering that an item was previously placed in the filter.)

With my screening algorithm, even if the user creates duplicates in the text file while training the algorithm, duplicates are eliminated from the TreeSet object and also from the text file before actual processing begins.
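The small sketch below illustrates the duplicate-eliminating behavior of TreeSet. It is not part of Listing 34. It reads lines from an in-memory string rather than a file so that it is self-contained, it uses generics and try-with-resources (so it assumes Java 7 or later), and the names TreeSetDedupSketch, readWordList, and fromText are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.TreeSet;

// Sketch only: a TreeSet keeps one copy of each string, in sorted order.
class TreeSetDedupSketch {

    // Read one word or phrase per line, as the makeXxxWordList methods do.
    static TreeSet<String> readWordList(BufferedReader in) throws IOException {
        TreeSet<String> words = new TreeSet<>();
        String line;
        while ((line = in.readLine()) != null) {
            words.add(line); // adding a duplicate has no effect
        }
        return words;
    }

    // Convenience wrapper that treats a string as the contents of the file.
    static TreeSet<String> fromText(String text) {
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            return readWordList(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TreeSet<String> words =
            fromText("VIAGRA\nMORTGAGE\nVIAGRA\nREFINANCE\n");
        System.out.println(words); // prints [MORTGAGE, REFINANCE, VIAGRA]
    }
}
```

Because TreeSet compares strings exactly, entries that differ only in case are treated as different entries.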

The code in Listing 4 is straightforward and shouldn't require further explanation.

The makeBodyWordList method

The purpose of the makeBodyWordList method is to create a TreeSet object containing words and phrases used later to screen the text in the body of the message.

Separation of lists is important

It is important to maintain separate lists for screening the Subject line and the body text.  Because of the larger number of characters in the body text, false positives are more likely when screening the body text.

(A false positive arises when a message that is not SPAM fails one of the SPAM screening rules and is identified as SPAM by the screening algorithm.)

Some words work well and some don't

Therefore, some words and phrases that work well when screening the Subject line may produce false positives when screening the body text. For example, the common spammer word SLUT appears in the word SoLUTion with only one character separating the S and the L.  It is much more likely that the word SOLUTION will appear somewhere in the body text than in the Subject line (although it may appear in the Subject line as well, thus producing a false positive in either case).

On a more definitive note, the word ANTIVIRUS works well when screening the Subject line, but cannot be used to screen the body text.  Many servers insert the word ANTIVIRUS into the message header after they test the message for viruses. On the other hand, the word ANTIVIRUS often appears in the Subject line of SPAM messages.

IP addresses and URLs

IP addresses and URLs can be very useful in identifying SPAM during the screening of the body text.  However, they rarely occur in the Subject line. Therefore, testing the Subject line against a long list of IP addresses and URLs simply wastes computer time.

Some words to avoid

The following words (among others) probably should not be included in the list used to screen body text for the reasons given.  Undoubtedly you will identify other words and phrases that should be excluded from the list as you gain experience with the system.
  • PORN may be confused with IMPORTANT.
  • SPAM causes lots of false positives. As a remedy, I inserted a space following the M as in "SPAM " to decrease false positives. (This may also decrease valid hits as well.)
  • ANTIVIRUS appears in some valid message headers.
  • WEIGHT often appears in messages regarding HTML fonts.
  • SLUT may be confused with SOLUTION.
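
To see why short words cause trouble, consider the sketch below. It assumes a plain case-insensitive substring search, which is a simplification and not necessarily what the code in Listing 34 does. The names FalsePositiveSketch and containsIgnoreCase are hypothetical.

```java
import java.util.Locale;

// Sketch only: a substring search cannot tell a word from a fragment of
// a longer, innocent word.
class FalsePositiveSketch {

    static boolean containsIgnoreCase(String text, String word) {
        return text.toUpperCase(Locale.ROOT)
                   .contains(word.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // PORN is found inside IMPORTANT: a false positive.
        System.out.println(
            containsIgnoreCase("This is important", "PORN"));
        // WEIGHT is found inside ordinary HTML styling.
        System.out.println(
            containsIgnoreCase("<b style=\"font-weight:bold\">", "WEIGHT"));
        // The trailing space in "SPAM " avoids a hit on SPAMMER.
        System.out.println(
            containsIgnoreCase("Report every spammer", "SPAM "));
    }
}
```

The first two calls report a match and the third does not, which is the effect of the trailing-space remedy described for the word SPAM.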

The makeBodyWordList code

The code for the makeBodyWordList method is shown in Listing 5.  This method reads strings from a text file named Pop302b.txt and creates the list as a TreeSet object.

  private void makeBodyWordList(){
    bodyWordList = new TreeSet();

    try{
      BufferedReader inData
          = new BufferedReader(new FileReader(
                                 "Pop302b.txt"));
      String data; //temp holding area

      while((data = inData.readLine()) != null){
        bodyWordList.add(data);
      }//end while loop
      inData.close();//Close file
    }catch(Exception e){e.printStackTrace();}
  }//end makeBodyWordList

Listing 5

This code is straightforward and should not require explanation.

The makeFriendlyWordList method

The purpose of the makeFriendlyWordList method is to create a TreeSet object containing words and phrases used to pre-screen the message From and Subject lines before screening against the SPAM lists.  The objective of this pre-screening step is to identify messages that claim to be from an approved list of senders (often referred to as a white list), or messages with a known good Subject line.

Messages that have From or Subject lines matching the words or phrases in this list are not deleted from the server, and are not subjected to screening for SPAM.

Format for Email addresses

When adding Email addresses to the list contained in the text file, only the primary portion of the friendly Email address should be included.  For example, you will often see an Email address presented as follows:

Mary Smith <msmith@somewhere.com>

In this case, only the following portion should be included in the friendly list:

msmith@somewhere.com

This is the portion that is most likely to remain stable over time.  The remaining portion is simply window dressing added to the primary Email address by the program used to compose and address the message.
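The sketch below shows why storing only the primary address is more robust. It assumes that the friendly test is a case-insensitive substring search of the From line, which may differ from the code in Listing 34. The names FriendlySketch and isFriendly are hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: a header line is friendly if it contains any list entry.
class FriendlySketch {

    static boolean isFriendly(String headerLine, Set<String> friendlyList) {
        String upperLine = headerLine.toUpperCase(Locale.ROOT);
        for (String entry : friendlyList) {
            if (!entry.isEmpty()
                    && upperLine.contains(entry.toUpperCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> friendly =
            new TreeSet<>(Arrays.asList("msmith@somewhere.com"));

        // Both forms of the same sender match the bare address.
        System.out.println(isFriendly(
            "From: Mary Smith <msmith@somewhere.com>", friendly));
        System.out.println(isFriendly(
            "From: \"Smith, Mary\" <msmith@somewhere.com>", friendly));
        // An unknown sender does not match.
        System.out.println(isFriendly(
            "From: someone@elsewhere.com", friendly));
    }
}
```

Had the list contained the full string Mary Smith <msmith@somewhere.com>, the second From line above would not have matched.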

Code for the makeFriendlyWordList method

The makeFriendlyWordList method is shown in Listing 6.  This method reads strings from a text file named Pop302c.txt and creates the list as a TreeSet object.

  private void makeFriendlyWordList(){
    friendlyWordList = new TreeSet();

    try{
      BufferedReader inData
          = new BufferedReader(new FileReader(
                                 "Pop302c.txt"));
      String data; //temp holding area

      while((data = inData.readLine()) != null){
        friendlyWordList.add(data);
      }//end while loop
      inData.close();//Close file
    }catch(Exception e){e.printStackTrace();}
  }//end makeFriendlyWordList

Listing 6

This code is completely straightforward and therefore shouldn't require further explanation.

The screenMsg method

Up to this point, the code that I have presented has been rather mundane.  However, things should start getting a little more interesting at this point.

Setting the stage

The statement in Listing 7 was extracted from the Pop302 class in Listing 34.  This is the statement that ties the communication module (an object of the Pop302 class) to the screening module (an object of the Screen class).

              match = screener.screenMsg(
                      fileName,uidl,theResult);

Listing 7

Where are we at this point?

At this point in the execution of the program, the communication module has retrieved a message from the server and has written it into a file on the local disk with the path and file name given by fileName.

(The file name is based on the server's unique identifier for the message, given by uidl in Listing 7.)

Pass the file to the screenMsg method

The communication module passes the file name for the disk file containing a raw text copy of the message to the method named screenMsg where it will be screened for SPAM.

The screenMsg method needs fileName in order to read the file from the disk to perform the screen.

Why does the screenMsg method need uidl?

If the file is identified as SPAM, it will be moved from the history folder to an archive folder.  In order to do that, the screenMsg method needs to create a path and file name pointing to the archive folder.  For this, it needs the unique identifier, uidl.

(Obviously I could have parsed fileName inside the screenMsg method to get uidl, but I found it easier to simply pass it as a parameter to the screenMsg method.)

What is theResult?

In addition to fileName and uidl, the method needs a reference to an empty object of the class ScreenResult.  It populates that object in order to send information back to the calling method.  That object's reference is represented by theResult in Listing 7.  The screenMsg method populates this object with several pieces of data for later use by the communications module.

The screenMsg method returns boolean

The screenMsg method returns a boolean value, which is assigned to match in Listing 7.

If the return value is true, the screenMsg method has concluded that the message is SPAM and is a candidate for deletion from the server.

(Recall, however, that the communications module allows the user to make the final decision regarding deletion of the message from the server.)

If the return value is false ...

If the return value is false, the screenMsg method has found nothing to indicate that the message is SPAM.

(The message might not be SPAM, or it might be a form of SPAM that the algorithm doesn't yet know how to identify.  If it is deemed by the user that the message is SPAM, the message will be used later to train the algorithm to recognize that form of SPAM in future messages.)

About 90 percent effective

As of this writing, I am finding that the algorithm is able to identify about 90 percent of SPAM messages on the average.  The remaining ten percent of the SPAM messages are used to further train the algorithm to recognize spam of that type in the future.  I am hopeful that this performance will improve in the future as the algorithm becomes better trained.

Beginning of the screenMsg method

The code for the screenMsg method begins in Listing 8.

  public boolean screenMsg(String fileName,
             String uidl,ScreenResult theResult){

    //Initialize return value to false
    boolean match = false;

Listing 8

As you can see, the method signature matches my earlier description with respect to the statement in Listing 7 that calls the screenMsg method.

The code in Listing 8 initializes the variable named match to false.  This is the value that will be returned from the method if it is not overwritten later by the discovery of a match between a test phrase and a text line in the message.

Purpose of the screenMsg method

The screenMsg method is used to identify messages that are candidates for deletion from the server. Such identification is based on analyzing the file in which the message is stored locally, and comparing that file with the contents of TreeSet objects populated earlier with the contents of the files named Pop302a.txt, Pop302b.txt, and Pop302c.txt.

A return value of true means that the message is identified as SPAM and should be deleted from the server.

Returned String values

In addition to the boolean return value, references to four String objects are encapsulated in the incoming object of type ScreenResult.  The populated object is used later by the calling method in the communications module.  These String objects represent:
  • The text of the message's Subject line.
  • The text of the message's From line.
  • The offending word or phrase (if any) that was found in the Subject line or in the body of the message, which includes the From line.
  • The raw text of the message down to the line that includes the offending word or phrase, or the entire raw text of the message if no offending words or phrases were found.
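The ScreenResult class itself is not shown in this lesson.  A minimal sketch of such a container, assuming nothing more than four String fields, might look like the following.  The field names subject, from, thePhrase, and text are the ones used in the listings in this lesson.  The class layout and the empty-string initial values are assumptions.

```java
// Hypothetical sketch of the result container; the real class may differ.
class ScreenResult {
  String subject = "";   // text of the Subject line
  String from = "";      // text of the From line
  String thePhrase = ""; // matching word or phrase, if any
  String text = "";      // raw message text examined so far
}
```

Initializing text to an empty string matters in this sketch because the screening code builds that field up by repeated concatenation.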
Display information on the GUI for benefit of the user

This information is displayed in the GUI in Figure 1 by the communications module, which is an object of the class named Pop302.  This information is presented to help the user make an informed decision regarding deletion of the message from the server.
(If the screenMsg method returns false, the program doesn't pause for the user to make such a decision, and processing of the next message on the server begins immediately, with all of the above information having been removed from the GUI.)
Refer back to Figure 1

The list of items placed in the ScreenResult object includes the offending word or phrase that was found in the Subject line or in the body of the message.  Referring back to Figure 1, the Offending Phrase for the message shown in Figure 1 was V1AGRA.
(This was an easy one because there were no extraneous characters inserted in the offending word, although the spammer did use a numeral 1 character in place of an I.)
The Subject line

Also in Figure 1, the Offending Phrase was found in the Subject line, which means that the program made the decision very quickly (it didn't have to examine a large amount of body text in order to make a decision).

Data in the text area

As shown in the large text area in Figure 1, this was Message Number 5 in the dropbox on the server.

The raw message text is displayed in the text area down to the line that contained the Offending Phrase, which in this case was the Subject line.

The From line

The top-most text field in Figure 1 originally contained an Email address that purportedly was the address of the sender of the message.  However, I suspect that the person identified by that Email address wasn't actually the sender of the message, so I deleted the Email address before publishing this image.
(On the basis of the earliest RECEIVED: FROM line in the raw message text in Figure 1, the message appears to have been sent by a computer having an IP address of 204.85.84.207.  However, given the identity of the organization to which that IP address is assigned (according to a WHOIS Database Search at http://www.arin.net/whois/), this seems somewhat unlikely as well.  But one never knows who may be spamming.  Maybe that computer is infected with a Trojan horse that is broadcasting SPAM messages without the knowledge of its owner.)
Open the disk file for reading

The code in Listing 9 opens the file containing the local copy of the raw message text for reading.  Listing 9 also declares a local variable named data that will be used in the file reading process.

    try{
      BufferedReader inData
          = new BufferedReader(new FileReader(
                                          fileName));
      String data;//temp holding area

Listing 9

(Recall from the previous lesson that asterisks were inserted into the data on ten-character intervals in an attempt to destroy any executable virus code that may be included in the byte stream.)
Prepare to process the file

The code in Listing 10 is executed in preparation for processing the file containing the raw message data.

      inData.mark(10000);

      theResult.subject = "No Subj line found";
      theResult.from = "No From line found";

Listing 10

Mark the beginning of the input stream

The first statement in Listing 10 marks the beginning position in the input stream.  Subsequent calls to reset will attempt to reposition the stream to this point.

This mark will be used later to rewind the stream to the beginning.
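The mark and reset behavior can be seen in isolation with a small sketch.  The class and method names here are hypothetical, and a StringReader stands in for the message file.  The argument to mark is a read-ahead limit: according to the BufferedReader documentation, an attempt to reset after reading that many characters or more may fail, so a value of 10000 is only dependable for messages shorter than that.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class MarkResetDemo {
  // Reads the first line, reads further, rewinds, and reads the first line again.
  static String[] readFirstLineTwice(String text) {
    try (BufferedReader in = new BufferedReader(new StringReader(text))) {
      in.mark(10000);               // remember the starting position
      String first = in.readLine();
      in.readLine();                // move further into the stream
      in.reset();                   // rewind to the marked position
      String again = in.readLine();
      return new String[] {first, again};
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String[] lines = readFirstLineTwice("From: a\nSubject: b\nbody\n");
    System.out.println(lines[0] + " | " + lines[1]);
  }
}
```

Both elements of the returned array hold the same first line, which shows that reset returned the stream to the marked position.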

Populate the ScreenResult object


The last two statements in Listing 10 populate two of the fields in the ScreenResult object with default values, just in case one or the other of the corresponding lines is not found in the message.  If the lines are found in the message, these default values will be overwritten with the actual data from the message.

The removeStars method

Before going any further, I am going to put the discussion of the screenMsg method on hold for a moment and discuss the removeStars method shown in Listing 11.

  private String removeStars(String stringIn){
    StringBuffer stringBuf =
                      new StringBuffer(stringIn);
    int index = 0;
    while(index > -1){
      index = stringBuf.lastIndexOf("*");
      if(index > -1){
        stringBuf.delete(index,index+1);
      }//end if
    }//end while
    stringBuf.append("**");
    return new String(stringBuf);
  }//end removeStars()

Listing 11

The purpose of this method is to remove the asterisks that were inserted into the data by the method named insertStars before the file was written (see the previous lesson).  The method also appends two asterisks at the end of each line.
(Note that this method removes all asterisks, not just those inserted earlier.  If this proves to be a problem, this method should be modified to remove only those asterisks that occur on ten-character intervals.)
The code in this method is straightforward and shouldn't require further explanation.
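For comparison, the same observable behavior can be sketched in a single statement.  This is an alternative formulation, not the code used in the program, and it assumes Java 5 or later, where String.replace accepts string arguments.  The class name is hypothetical.

```java
class RemoveStarsDemo {
  // Drops every asterisk, then appends two asterisks to the end of the line,
  // which is what the method in Listing 11 does.
  static String removeStars(String stringIn) {
    return stringIn.replace("*", "") + "**";
  }

  public static void main(String[] args) {
    System.out.println(removeStars("Subject: V*i*a*g*r*a"));
  }
}
```

Like the original, this version removes every asterisk in the line, including any that the spammer put there.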

Return to discussion of the screenMsg method

Returning to the discussion of the screenMsg method, we are ready to examine the code used to screen the Subject line.  That code begins in Listing 12.

      while((data = inData.readLine()) != null){
        data = removeStars(data).toUpperCase();
        if(data.startsWith("SUBJECT:")){

Listing 12

The code in Listing 12 is the beginning of a while loop that reads successive lines of text from the disk file until it either runs out of lines (null) or encounters a line that starts with SUBJECT: (see Figure 1 for the format of the Subject line in the message).

If it runs out of lines before finding the Subject line, the loop will terminate.  If it finds the Subject line before running out of lines, the line will be processed and a break will be executed to terminate the loop.  Thus, the code in this loop will process only the Subject line.
(Note that the removeStars method is called to remove the asterisks from the data and the data is converted to upper case before testing for SUBJECT:)
Populate the output object

We are now in the body of the if statement begun in Listing 12.  A match for SUBJECT: has been found.  The code in Listing 13 populates the output object with the Subject line data overwriting the default value put there by the code in Listing 10.

          theResult.subject = data.toUpperCase();

Listing 13

This will result in the Subject line data being displayed in the second text field of the GUI of Figure 1 when the screenMsg method returns.

Screen against friendly words and phrases

The next step is to screen the Subject line data against the words and phrases in the friendly list.  If a word or phrase from the friendly list appears in the Subject line data, the message will be preserved on the server and will not be subjected to SPAM screening.

In addition, the local copy of the message currently located in the history folder will be deleted, because it is not considered to be SPAM.  That way, the message will not be used later when training the algorithm to do a better job of identifying SPAM.

Executing the screen

The code that screens the Subject line data against the friendly list begins in Listing 14.

          Iterator iterator =
                      friendlyWordList.iterator();
          while(iterator.hasNext()){
            String friendlyWord =
                     ((String)(iterator.next())).
                                   toUpperCase();
            match = false;
            if(!(friendlyWord.equals(""))){
              match = screenOnPhrase(
                          data,friendlyWord,0
                                              );
            }//end if

Listing 14

The friendly list is stored in a TreeSet object referred to by the reference variable named friendlyWordList.

An Iterator loop


The code in Listing 14 shows the beginning of an Iterator loop, used to iteratively extract each word or phrase from the friendly list and compare it with the Subject line data.  The comparison is actually rather complex and is performed in a method named screenOnPhrase, which I will discuss later.

As each friendly word or phrase is extracted from the friendly list, it is passed, along with the Subject line data, to the method named screenOnPhrase.

That method will return true if a match is found, and will return false if no match is found.

Extraneous characters

The third parameter in the call to the screenOnPhrase method specifies the number of extraneous characters allowed to occur between the characters in the data and still declare a match to be true.  In this case, a value of zero is passed for this parameter, meaning that no extraneous characters are allowed.
(As it turns out, the value of zero results in a trivial case, and I could have accomplished this more simply than by invoking the rather complex code in the screenOnPhrase method.  However, when I wrote the prototype for the program, I hadn't decided that I was going to use a value of zero here.)
Behavior of the screenOnPhrase method

Basically, in this case, the screenOnPhrase method is testing to see if the friendly word or phrase occurs anywhere within the Subject data line, and if so, it will return true.  Otherwise, it will return false.
(Note that everything has been converted to upper case at this point, so matching the case is not an issue.)
At this point, I can either branch off and discuss the screenOnPhrase method, or continue discussing the code in the screenMsg method.  I have decided to do the latter, and explain the inner workings of the screenOnPhrase method later.

If a match was found

Listing 15 shows what happens if the current friendly word or phrase was found in the Subject data line (the current word or phrase is that word or phrase most recently extracted from the friendlyWordList by the iterator).

            if(match == true){
              inData.close();
              new File(fileName).delete();

              theResult.thePhrase = phrase;

              return false;
            }//end if match = true

          }//end while iterator has next

Listing 15

Delete the file from the SPAM history folder

The first thing that happens when a match is found is that the input stream is closed and the file containing the raw message is deleted from the folder in which SPAM history messages are stored.  The rationale is that this is not a SPAM message, and should not be used later when training the algorithm to do a better job of identifying SPAM.

Populate the output object

The next thing that happens is that the matching phrase is stored in one of the fields of the ScreenResult object.
(In this case, that String isn't currently used in any significant way by the communications module, but it is available to be used in the future if needed.)
Return a false value

Perhaps the most significant thing that happens in Listing 15 is that the screenMsg method terminates and returns a false value to the calling method in the communications module (see Listing 7).  That essentially terminates the processing of this message.  The message is preserved on the server and is not subjected to further screening for SPAM.

If no match was found

If none of the words or phrases in the friendlyWordList match the Subject data line, control will fall out of the loop at the bottom of Listing 15 when the data in the friendlyWordList is exhausted.  This will transfer control to the code in Listing 16.

          break;
        }//end if data.startsWithSUBJECT:
      }//end while input is not null

Listing 16

The break statement

The break statement in Listing 16 is inside the body of the if statement that began in Listing 12.  This code is being executed because a data line was found that starts with SUBJECT:

If no match was found, the return statement in Listing 15 would not be executed, and control would reach this point.  Since this part of the code deals exclusively with the Subject line, and a Subject line was found, there is no point in reading any more input lines.  Hence the break statement in Listing 16 terminates the read loop that began in Listing 12.  No more data will be read from the file in this part of the code.


Rewind the input stream to the beginning

The code in Listing 17 resets the stream back to the mark that was set on the stream in Listing 10.  Since that mark was set at the beginning of the file, this code rewinds the data file back to the beginning.

      inData.reset();

Listing 17

Contents of ScreenResult object

At this point, the subject field in the output ScreenResult object contains Subject line data if a Subject line was found, or contains a message to the effect that no Subject line was found.  This latter message was put there by default in Listing 10, and was not overwritten if no Subject line was found.

Process the From line

The next step in the process is to process the From line for the purposes of:
  • Returning the From data in one of the fields of the ScreenResult object if a From line exists.
  • Determining if the message was sent from a friendly Email address.  If so, return false causing this message to be preserved on the server and exempt from SPAM screening.
Except for the fact that the code in Listing 18 extracts and processes a text line that starts with From:, the code in Listing 18 is essentially the same as the code used to process the Subject line beginning in Listing 12 and ending in Listing 17.  Therefore, it should not be necessary for me to explain that code again.

      while((data = inData.readLine()) != null){
        data = removeStars(data.toUpperCase());
        if(data.startsWith("FROM:")){
          theResult.from = data;
          Iterator iterator =
                      friendlyWordList.iterator();
          while(iterator.hasNext()){
            String friendlyWord =
                     ((String)(iterator.next())).
                                   toUpperCase();
            match = false;
            if(!(friendlyWord.equals(""))){
              match = screenOnPhrase(
                          data,friendlyWord,0);
            }//end if

            if(match == true){
              inData.close();
              new File(fileName).delete();
              theResult.thePhrase = phrase;
              return false;
            }//end if match = true
          }//end while iterator has next
          break;
        }//end if data starts with From
      }//end while input is not null

      inData.reset();

Listing 18

Rewind the input stream

Assuming that the method did not return false as the result of a match on a friendly Email address, the code in Listing 18 also resets the input stream in preparation for screening the entire message for SPAM.

Under that same assumption, at this point, the from field in the output ScreenResult object contains From line data if a From line was found, or contains a message to the effect that no From line was found.  This latter message was put there by default in Listing 10, and was not overwritten if no From line was found.
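The two-pass pattern used for the Subject and From lines (scan for a header line, fall back to a default, then rewind) can be sketched as a standalone helper.  The names HeaderScanDemo, findHeader, and subjectAndFrom are hypothetical, a StringReader stands in for the message file, and the asterisk removal and friendly-list screening are deliberately left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class HeaderScanDemo {
  // Returns the first upper-cased line that starts with prefix, or fallback
  // if no such line exists.  The reader is rewound to its mark on return.
  static String findHeader(BufferedReader in, String prefix, String fallback) {
    try {
      String result = fallback;
      String data;
      while ((data = in.readLine()) != null) {
        data = data.toUpperCase();
        if (data.startsWith(prefix)) {
          result = data;
          break; // only the first matching line is of interest
        }
      }
      in.reset(); // back to the position saved by mark()
      return result;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Returns {subject, from} for a short message held in memory.
  static String[] subjectAndFrom(String message) {
    try (BufferedReader in = new BufferedReader(new StringReader(message))) {
      in.mark(10000);
      String subject = findHeader(in, "SUBJECT:", "No Subj line found");
      String from = findHeader(in, "FROM:", "No From line found");
      return new String[] {subject, from};
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because each scan ends with a reset, the order in which the two header lines appear in the message does not matter.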

Missing Subject line and From line is rare

Experience indicates that it is very rare to receive an Email message that doesn't contain both a FROM: line and a SUBJECT: line in the header, although either or both may not contain any characters to the right of the space following the colon.  In fact, it is very common for the Subject line of SPAM messages to be completely blank.  (Perhaps that causes people to read the messages out of curiosity.)

The screenOnPhrase method

While discussing both the From line and the Subject line, I asked that you simply accept that the screenOnPhrase method can determine if one upper-case String object is contained as a substring within another upper-case String object.  I did that because there is no great challenge to programming such an operation when there are no extraneous characters.  In fact, one of the indexOf methods of the String class, which searches for a substring within a String, can accomplish this very handily.

No extraneous characters allowed

This assumes, of course, that extraneous characters are not allowed between the characters of the substring within the String.  That was the case in processing the From line and the Subject line above because I set the third parameter value to zero in the method call.  However, that is not the case in searching for offending words and phrases in a SPAM screen.
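For the case where no extraneous characters are allowed, the friendly-list test can be sketched with indexOf alone.  This is a simplification and not a drop-in replacement for screenOnPhrase.  The name friendlyWordList comes from the listings, while the class name and the isFriendly method are hypothetical.

```java
import java.util.TreeSet;

class FriendlyListDemo {
  // True if any non-empty friendly entry occurs verbatim in the line,
  // ignoring case.
  static boolean isFriendly(String line, TreeSet<String> friendlyWordList) {
    String data = line.toUpperCase();
    for (String entry : friendlyWordList) {
      String friendlyWord = entry.toUpperCase();
      // An empty entry must be skipped: indexOf("") is 0 for every line.
      if (!friendlyWord.isEmpty() && data.indexOf(friendlyWord) >= 0) {
        return true;
      }
    }
    return false;
  }
}
```

With a list containing java and bob@example.com, the line Subject: Java question is treated as friendly, while Subject: V1AGRA is not.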

A SPAM example

For example, here is a typical Subject line taken from a SPAM message in my archive folder.

Subject: T@ke 5O% off Ge|neric V*i*a*g*r*a 0nline t:0day

In order to recognize that the Subject line contains the word Viagra, it is necessary that the program be able to ignore the asterisks that separate the letters of the word V*i*a*g*r*a.

In order to recognize that the line contains the word Generic, it is necessary that the program be able to ignore the vertical bar that separates the e and the n in Ge|neric.

In order to recognize that the line contains the word t0day, it is necessary that the program be able to ignore the colon that separates the t and the 0 in t:0day.

Other common spammer tricks

This example also illustrates other common spammer tricks, such as swapping the zero character and the alphabetic O character, and replacing the lower-case a character with the @ character.

I haven't attempted to automate the resolution of substitution issues such as this, and probably won't.  Given the training programs that I will explain in the next two lessons, it is easy to use historical SPAM data to train the algorithm to recognize these variations.  If such a mangled word occurs only once in the SPAM history folder, the algorithm can be trained in a single training session to recognize it as SPAM in all future messages.

Ignoring extraneous characters

Now getting back to the issue of ignoring extraneous characters in the offending words and phrases, that is what the method named screenOnPhrase knows how to do very well.
(It is also what the Email filtering capability of the commercial Email client that I use doesn't know how to do at all.)
The capability to ignore extraneous characters is one of the keys to a successful SPAM screening program.
(The spammers never make it easy to identify and block their Email messages.  That is why a successful SPAM screening program must have the capability to learn each new spammer trick as soon as it appears through an ongoing, simple to use algorithm training effort.)
The screenOnPhrase method

The screenOnPhrase method requires an incoming parameter of type int named spanLim.  This is the parameter by which the programmer specifies how many extraneous characters will be allowed between letters in the offending word or phrase and still have it be recognized as an offending word or phrase.

When the screenOnPhrase method was used to process the From and Subject data, the value of this parameter was set to zero.  Thus no extraneous characters were allowed in the matching friendly Email address data or the matching friendly Subject line material.

In order for the screenOnPhrase method to recognize the words Viagra, Generic, and t0day in the above example, the value of the spanLim parameter would have to be 1 or greater.

As you will see later, the current version of this program uses a spanLim value of 1 to screen for SPAM.  Experience shows that this is successful in identifying most of the offensive words and phrases without unduly increasing the occurrence of false positives.

At this point, I will put the discussion of the screenMsg method on hold for a short while and explain the inner workings of the screenOnPhrase method.

Description of the screenOnPhrase method

This method tests a String to see if it contains a word or phrase that may have extraneous characters inserted into it, such as VI*A-GRA.

The method requires an incoming parameter of type int named spanLim.  If the String contains the sequence of characters, in the correct order, that make up the word or phrase, with spanLim or fewer extraneous characters between any two of the matching characters, the method returns true.  Otherwise, it returns false.

A spanLim example

For example, if spanLim = 1, the spammer can insert one character between any two of the characters that make up the offending word in the String and the offending word will still be detected.

However, if the spammer inserts two or more extraneous characters, the offending word will not be detected.

Be careful of false positives

You should be careful to avoid making spanLim too large.  Large values of spanLim result in more false positives because widely-separated characters can be considered to be part of the word or phrase.  For example, if spanLim = 2 or greater, the word PORN will be found in the word IMPORTANT.  However, if spanLim = 1, the word PORN will not be found in IMPORTANT.

Operation of the screenOnPhrase method

Basically this is how the screenOnPhrase method does what it does.  The method receives incoming String parameters named data and phrase, along with an int parameter named spanLim.  The objective is to determine if phrase is contained in data with no more than spanLim extraneous characters separating the matching characters.

Search for matching characters

First the method searches data for characters that match the characters in phrase, discarding all other characters.  While doing this, however, it keeps track of the original positions of the matching characters in data.

A new compressed string

The result is a new string containing only the characters that match the characters in phrase, all in their original order, with all extraneous characters discarded.  Let's refer to this new string as str.

For example, if the phrase is SPAM, str might look like the following after all extraneous characters have been discarded:

SMSPMASPAMMPAS

Does str contain phrase?

A test is made to determine if str contains a sequence of characters that matches phrase.  (In the above example, str does contain a sequence of characters matching SPAM, beginning at the seventh character.)

How many extraneous characters were discarded?

If a match is found, this means that the original data did contain phrase with the possibility of extraneous characters in between the characters of phrase.  However, there is still the issue of how many extraneous characters were discarded in order to get the positive match.  This is determined by examining the original position information that was saved while extraneous characters were being discarded.

If the number of characters discarded between any two of the characters matching the sequence was less than or equal to spanLim, the method returns true.  Otherwise it returns false.
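The steps described above can be sketched as one self-contained method.  This is a condensed illustration using generics and StringBuilder, not the listing-by-listing code explained in this lesson, and the class name is hypothetical.  It tests each character of data directly against phrase instead of first building a set of unique characters.

```java
import java.util.ArrayList;
import java.util.List;

class ScreenOnPhraseSketch {
  static boolean screenOnPhrase(String data, String phrase, int spanLim) {
    // Keep only the characters of data that occur in phrase, remembering
    // the original position of each kept character.
    StringBuilder str = new StringBuilder();
    List<Integer> locationData = new ArrayList<>();
    for (int i = 0; i < data.length(); i++) {
      if (phrase.indexOf(data.charAt(i)) >= 0) {
        str.append(data.charAt(i));
        locationData.add(i);
      }
    }

    // Look for phrase in the compressed string (first occurrence only).
    int match = str.indexOf(phrase);
    if (match == -1) {
      return false;
    }

    // Largest distance in data between adjacent matching characters.
    int maxSpan = 0;
    for (int cnt = 1; cnt < phrase.length(); cnt++) {
      int span = locationData.get(match + cnt)
          - locationData.get(match + cnt - 1);
      maxSpan = Math.max(maxSpan, span);
    }
    // Adjacent characters are 1 apart, so spanLim extraneous characters
    // correspond to a distance of spanLim + 1.
    return maxSpan <= spanLim + 1;
  }

  public static void main(String[] args) {
    System.out.println(screenOnPhrase("IMPORTANT", "PORN", 1));
    System.out.println(screenOnPhrase("IMPORTANT", "PORN", 2));
    System.out.println(screenOnPhrase("V*I*A*G*R*A", "VIAGRA", 1));
  }
}
```

The main method prints false, true, and true, matching the PORN and IMPORTANT example given earlier and the V*I*A*G*R*A example from the sample Subject line.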

Let's see this in code

The code to accomplish this is a little bit complex.  The screenOnPhrase method begins in Listing 19.

  private boolean screenOnPhrase(String data,
                                 String phrase,
                                 int spanLim){
    this.phrase = phrase;
    StringBuffer str = new StringBuffer();
    ArrayList locationData = new ArrayList();

Listing 19

The code in Listing 19 saves phrase in an instance variable, and declares a couple of new local variables that will be used later.  Note that str is a StringBuffer object and is not a String object. 
(An object of the StringBuffer class can have its contents modified, while an object of the String class is immutable.)
Compare data with phrase

The next step is to compare the characters in data with the unique characters in phrase, saving only the matching characters in str, and saving the original locations of the matching characters in locationData, which refers to an object of type ArrayList.

First, however, it is necessary to eliminate duplicate characters from phrase.

Eliminate duplicate characters from phrase

This is accomplished by the code in Listing 20 by storing the characters from phrase into a TreeSet object.  Storing the characters in a TreeSet object eliminates duplicates.
(It also sorts the characters, but that doesn't matter one way or the other in this case.)

    TreeSet treeSet = new TreeSet();
    for(int cnt = 0; cnt < phrase.length();
                                         cnt++){
      treeSet.add(
             new Character(phrase.charAt(cnt)));
    }//end for loop

    Iterator iter = treeSet.iterator();
    StringBuffer tempPhrase = new StringBuffer();
    while(iter.hasNext()){
      tempPhrase.append(
        ((Character)(iter.next())).charValue());
    }//end while
}//end while

Listing 20

Having stored the characters from phrase in the TreeSet object, the code in Listing 20 goes on to use an Iterator to extract the unique characters from the TreeSet and store them in a new StringBuffer object named tempPhrase.
(The characters are stored in a StringBuffer object because it is possible to build such an object one character at a time.  This is not possible with a String object.)
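The effect of this step can be seen with a small generic-collections sketch.  The class and method names are hypothetical, and the code assumes Java 7 or later.  The duplicates need to go because the nested loops that follow compare every character of data with every character of tempPhrase.  A character that appeared twice in tempPhrase would therefore be copied into str twice.

```java
import java.util.TreeSet;

class UniqueCharsDemo {
  // Returns the distinct characters of phrase in sorted order.
  static String uniqueChars(String phrase) {
    TreeSet<Character> treeSet = new TreeSet<>();
    for (int cnt = 0; cnt < phrase.length(); cnt++) {
      treeSet.add(phrase.charAt(cnt)); // a repeated character is ignored
    }
    StringBuilder tempPhrase = new StringBuilder();
    for (char c : treeSet) {
      tempPhrase.append(c);
    }
    return tempPhrase.toString();
  }

  public static void main(String[] args) {
    System.out.println(uniqueChars("VIAGRA"));
  }
}
```

For the phrase VIAGRA, which contains two A characters, the result is the five-character string AGIRV.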
Extract matching characters from data

Listing 21 uses a pair of nested for loops along with tempPhrase to extract matching characters from data and to store them (in their original order) in str.  The original position of each matching character is stored in locationData.

    for(int i = 0; i < data.length(); i++){
      for(int j = 0; j < tempPhrase.length();
                                          j++){
        if(data.charAt(i) ==
                         tempPhrase.charAt(j)){
          str.append(data.charAt(i));
          locationData.add(new Integer(i));
        }//end if
      }//end for on tempPhrase
    }//end for on data

Listing 21

This converts the original data into a compressed string of characters, each of which matches a character in phrase. All other characters have been discarded. Thus, if data contains phrase, it will occur somewhere in str with no extraneous characters separating the characters in phrase.
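To make the intermediate results concrete, the compression step can be run on its own.  The sketch below uses the same nested-loop structure but returns both results in a small holder class.  The names CompressDemo, Compressed, and compress are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CompressDemo {
  // Holds the compressed string and the original index of each kept character.
  static final class Compressed {
    final String str;
    final List<Integer> locationData;

    Compressed(String str, List<Integer> locationData) {
      this.str = str;
      this.locationData = locationData;
    }
  }

  // uniqueChars must not contain duplicates, or characters would be kept twice.
  static Compressed compress(String data, String uniqueChars) {
    StringBuilder str = new StringBuilder();
    List<Integer> locationData = new ArrayList<>();
    for (int i = 0; i < data.length(); i++) {
      for (int j = 0; j < uniqueChars.length(); j++) {
        if (data.charAt(i) == uniqueChars.charAt(j)) {
          str.append(data.charAt(i));
          locationData.add(i); // original position in data
        }
      }
    }
    return new Compressed(str.toString(), locationData);
  }
}
```

For the data GE|NERIC V*I*A*G*R*A and the unique characters AGIRV, str is GRIVIAGRA and the saved positions are 0, 5, 6, 9, 11, 13, 15, 17, and 19.  The phrase VIAGRA begins at index 3 of str, which corresponds to position 9 in the original data.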

Does str contain phrase?

The next step is the easy one.  Listing 22 tests to see if the new compressed string named str contains the original phrase.

    int match = str.indexOf(phrase);
    if(match == -1){
      return false;//no match
    }//end if

Listing 22

This is accomplished by invoking the indexOf method of the StringBuffer class, which returns the index within str of the first occurrence of phrase.

Behavior of the indexOf method

The indexOf method returns -1 if phrase does not occur within str, in which case the screenOnPhrase method simply returns false.  Otherwise, the method goes on to test for the maximum number of extraneous characters that separated the matching characters in the original data.
(While writing this explanation, I realized that phrase might have occurred more than once in data, with too many extraneous characters in the first occurrence and an acceptable number of extraneous characters in a later occurrence.  In this case, however, the algorithm would return false.

Given the reason for the extraneous characters in the first place, it is probably unlikely that this will happen.  However, this is a logic error that would be worth fixing.)
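One possible way to address the case described in this note is to examine every occurrence of phrase in the compressed string instead of only the first.  The sketch below is an illustration of that idea and is not part of the program presented in this lesson.  The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ScreenAllOccurrences {
  static boolean screenOnPhrase(String data, String phrase, int spanLim) {
    if (phrase.isEmpty()) {
      return false; // nothing to look for
    }
    // Compress data to the characters that occur in phrase.
    StringBuilder str = new StringBuilder();
    List<Integer> locationData = new ArrayList<>();
    for (int i = 0; i < data.length(); i++) {
      if (phrase.indexOf(data.charAt(i)) >= 0) {
        str.append(data.charAt(i));
        locationData.add(i);
      }
    }
    // Try each occurrence of phrase in str until one is tight enough.
    int match = str.indexOf(phrase);
    while (match != -1) {
      int maxSpan = 0;
      for (int cnt = 1; cnt < phrase.length(); cnt++) {
        int span = locationData.get(match + cnt)
            - locationData.get(match + cnt - 1);
        maxSpan = Math.max(maxSpan, span);
      }
      if (maxSpan <= spanLim + 1) {
        return true;
      }
      match = str.indexOf(phrase, match + 1); // next occurrence, if any
    }
    return false;
  }
}
```

With the data S---PAM AND SPAM, the phrase SPAM, and a spanLim value of 1, the first occurrence has three extraneous characters after the S and is rejected, but the second occurrence is accepted and the method returns true.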

Check number of extraneous characters

When there is a match, we need to confirm that the span between matching characters does not exceed the number allowed by the incoming parameter spanLim.  This is accomplished by the code in Listing 23.

    int maxSpan = 0;
    int locA = ((Integer)locationData.
                        get(match)).intValue();
    int locB = 0;
    for(int cnt = 1; cnt < phrase.length();
                                        cnt++){
      locB = ((Integer)locationData.get(
                     match + cnt)).intValue();
      int span = locB - locA;
      if(span > maxSpan){
        maxSpan = span;
      }//end if
      locA = locB;
    }//end for loop

    if(maxSpan > spanLim+1){
      return false;//span too large
    }else{
      return true;//made a match
    }//end else

  }//end screenOnPhrase

Listing 23

What happened before was ...

As each matching character was extracted from data in Listing 21, the original position of that character in data was encapsulated in an object of type Integer.  That Integer object's reference was appended to a list of such references in the object of type ArrayList referred to by locationData.

Thus the elements in the ArrayList refer to objects containing the original positions of successive matching characters in data.  This information will be used to calculate the maximum number of characters separating those matching characters.

The code in Listing 22 found the position of the first character of phrase in the compressed string referred to by str and saved that value in a local int variable named match.

What happens now is ...

The code in Listing 23 uses the value of match to extract an Integer object from the list containing the original position of the first matching character of phrase in data.  The contents of the Integer object are extracted and saved as locA.

Then the code in Listing 23 enters a for loop, extracting successive references from the list and using the information encapsulated therein to calculate the difference in original positions of the successive matching characters.  The maximum value of that difference is calculated.  This process continues until a number of original positions equal to the number of characters in phrase have been examined.  Then the maximum difference in position is compared with spanLim.

If it is determined that the number of characters between the original positions of the matching characters in data exceeds spanLim, the method returns false.  Otherwise, it returns true.
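The reason for comparing against spanLim+1 and not spanLim is that the span is a difference in positions.  Two characters that are directly adjacent have a span of 1, so a span of spanLim+1 corresponds to spanLim extraneous characters.  The span calculation can be sketched on its own as follows, with a hypothetical class name and a generic list in place of the raw ArrayList.

```java
import java.util.Arrays;
import java.util.List;

class SpanCheckDemo {
  // Largest difference between the original positions of adjacent matching
  // characters, starting at index match in locationData.
  static int maxSpan(List<Integer> locationData, int match, int phraseLength) {
    int maxSpan = 0;
    int locA = locationData.get(match);
    for (int cnt = 1; cnt < phraseLength; cnt++) {
      int locB = locationData.get(match + cnt);
      maxSpan = Math.max(maxSpan, locB - locA);
      locA = locB;
    }
    return maxSpan;
  }

  public static void main(String[] args) {
    // Positions of P, O, R, and N in IMPORTANT.
    List<Integer> positions = Arrays.asList(2, 3, 4, 7);
    System.out.println(maxSpan(positions, 0, 4));
  }
}
```

For PORN in IMPORTANT the positions are 2, 3, 4, and 7, giving a maximum span of 3.  With a spanLim value of 1, the test 3 > 2 is true and the match is rejected.  With a spanLim value of 2, the test 3 > 3 is false and the match is accepted.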

Return to the discussion of the screenMsg method


Now that we have an understanding of how the method named screenOnPhrase works, it is time to return to the discussion of the method named screenMsg.

Up to this point, the method has examined the Subject and From lines for two purposes:
  • To determine if they are friendly, and if so, to terminate SPAM screening for the current message.
  • To provide the contents of the two lines for later display in the GUI.
Screening for SPAM

If control still resides in the screenMsg method at this point (meaning that the message wasn't declared to be friendly on the basis of either the From line or the Subject line), it is time to screen the entire message looking for indications that the message is SPAM.

This is accomplished in two parts.  The Subject line is screened for SPAM using one word list and the body text is screened for SPAM using a different word list.  If either the Subject line or the body text is determined to contain SPAM, the method terminates returning true.
(A typical message contains a few lines of body text before the Subject line and potentially many lines of body text following the Subject line.

The lines are screened in the order that they occur in the message. Therefore, if the Subject line is determined to contain SPAM, the screening process will often require much less time than will be required to locate SPAM in the body text.)

If any line is determined to contain SPAM, the method terminates at that point returning true.

If no SPAM is identified in any line of the message, the method returns false.

The screening process

The screening process begins in Listing 24.

      int progressCounter = 0;
      while((data = inData.readLine()) != null){
        data = removeStars(data);
        if(data.startsWith("Subject")){

          data = data.toUpperCase();

          theResult.text =
                   theResult.text + data + "\n";

          //Display progress on the GUI.
          if(++progressCounter < 50){
            theGui.textArea.append(".");
          }else{//Display progress on a new line
            progressCounter = 0;
            theGui.textArea.append(".\n");
          }//end else

Listing 24

The code in Listing 24 reads a line of text from the input file, removes asterisks from the line and tests to see if the line starts with Subject: (note that the line hasn't been converted to upper case yet at this point).

If the line does not start with Subject:, an else clause (to be discussed later) is executed to screen the line as body text.

If the text line is the Subject line ...

For the case where the line does start with Subject:
  • The line is converted to upper case.
  • The line is appended to the contents of the field named text in the output object of type ScreenResult.
  • A single period is appended to the string currently residing in the text area of the GUI to be displayed as a progress indicator.  (Each set of 50 periods appears on a new line in the progress indicator.)
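The progress-indicator logic can be tried without a GUI.  In the following sketch a StringBuilder stands in for the text area, and the class and method names are hypothetical.

```java
class ProgressDemo {
  private int progressCounter = 0;
  // Stands in for the text area of the GUI.
  private final StringBuilder textArea = new StringBuilder();

  // Called once per processed line.
  void tick() {
    if (++progressCounter < 50) {
      textArea.append(".");
    } else { // the 50th period ends the current row
      progressCounter = 0;
      textArea.append(".\n");
    }
  }

  String text() {
    return textArea.toString();
  }
}
```

After 120 calls to tick, the text consists of two complete rows of 50 periods followed by a partial row of 20 periods.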

Screen the Subject line for SPAM

The code beginning in Listing 25 uses an Iterator to screen an upper-case version of the Subject line against upper-case versions of each of the offensive words and phrases stored in a TreeSet object referred to by subjWordList.


          match = false;
          Iterator iterator =
                          subjWordList.iterator();
          while(iterator.hasNext()){
            String subjWord =
                    ((String)(iterator.next())).
                                    toUpperCase();
            if(!(subjWord.equals(""))){
              match = screenOnPhrase(
                                data,subjWord,1);

            }//end if

Listing 25

Invoking screenOnPhrase

The actual process of screening against each offensive word or phrase in the TreeSet object occurs as a result of invoking the screenOnPhrase method (discussed earlier) to determine if the Subject line contains the offensive word or phrase.  A one-character separation is allowed between the characters in the offensive phrase in the Subject line.  The boolean value returned by screenOnPhrase is stored in the variable named match.

The value of match will eventually be returned to indicate whether or not the screenMsg method found a match between a text line in the message and an offensive word or phrase from one of the word lists.
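
To experiment with the effect of the separation parameter, a stand-in for screenOnPhrase can be sketched with a regular expression.  This is not the method presented earlier in the lesson, which may work differently; the class GapMatchSketch and the method containsWithGaps are hypothetical, and a non-negative maxGap is assumed.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for screenOnPhrase; the real method was
// presented earlier in the lesson and may work differently.
final class GapMatchSketch {

  // True if every character of phrase appears in line, in order,
  // with at most maxGap extraneous characters between neighbors.
  // Both arguments are compared in upper case.
  static boolean containsWithGaps(String line, String phrase,
                                  int maxGap) {
    if (phrase.isEmpty()) {
      return false; // an empty entry never counts as a match
    }
    String target = phrase.toUpperCase();
    StringBuilder regex = new StringBuilder();
    for (int i = 0; i < target.length(); i++) {
      if (i > 0) {
        // Allow zero to maxGap arbitrary characters between letters.
        regex.append(".{0,").append(maxGap).append("}");
      }
      regex.append(
          Pattern.quote(String.valueOf(target.charAt(i))));
    }
    return Pattern.compile(regex.toString())
                  .matcher(line.toUpperCase()).find();
  }
}
```

With a gap of one, this sketch flags V.I.A.G.R.A for the phrase VIAGRA, and it also flags SoLUTion for the phrase SLUT, which illustrates the kind of false alarm that the comments at the beginning of the program listing warn about.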

If the returned value is false ...

If the returned value is false, the Iterator loop continues looping, attempting to match offensive words or phrases from subjWordList with the Subject line until there are no more offensive words or phrases stored in subjWordList.

At that point, it is concluded that the Subject line doesn't contain SPAM.  Control is transferred back to the top of the while loop in Listing 24, where another line of text is read and screened.

(There can be only one Subject line in a properly formatted message, so the remaining lines will probably all be screened as body text.)

If the returned value is true ...

If the screenOnPhrase method returns true, the body of the if statement in Listing 26 is executed.

            if(match == true){
              //Move local message file
              inData.close();
              boolean moved =
                  new File(fileName).renameTo(
                      new File(
                          "c:/MailFiles/Archives/"
                          +uidl+".txt"));
              if(!moved)System.out.println(
                  "Unable to move file " + uidl);

              //Break out of Iterator loop
              break;
            }//end if match == true
          }//end while iterator has next
        }//end if data starts with Subject

Listing 26

Basically two things happen in the body of this if statement:
  • The local copy of the message is moved from a history folder to an archive folder.  (The message won't be needed for training the algorithm later because the algorithm already knows how to identify the message as SPAM.)
  • Control breaks out of the Iterator loop.  (Only one match against an offensive word or phrase is required to declare that a message is SPAM.)

I won't try to explain the process of moving the file.  If you don't understand that code, look it up in the Java API documentation.
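
For readers who want to try the move step by itself, here is a small sketch built on the same File.renameTo call.  The class ArchiveMoveSketch, its method names, and the temporary folder and file names are hypothetical.  Note that renameTo reports failure by returning false rather than by throwing an exception, and that its behavior is platform dependent, which is why Listing 26 checks the returned value.  Listing 26 also closes the input stream first, which matters on platforms where an open file often cannot be renamed.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical sketch of the move step; folder names are placeholders.
final class ArchiveMoveSketch {

  // Moves source into archiveDir under the same name.
  // Returns false (rather than throwing) if the rename fails.
  static boolean moveToArchive(File source, File archiveDir) {
    File target = new File(archiveDir, source.getName());
    return source.renameTo(target);
  }

  // Self-contained demonstration using a temporary folder.
  static boolean demo() {
    try {
      File history =
          Files.createTempDirectory("history").toFile();
      File archive = new File(history, "Archives");
      if (!archive.mkdir()) {
        return false;
      }
      File msg = new File(history, "uidl123.txt");
      Files.write(msg.toPath(), "Subject: test\n".getBytes());
      boolean moved = moveToArchive(msg, archive);
      // The file should now exist only in the archive folder.
      return moved
          && !msg.exists()
          && new File(archive, "uidl123.txt").exists();
    } catch (IOException e) {
      return false;
    }
  }
}
```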

Transfer of control

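After breaking out of the Iterator loop, control falls through to the if statement in Listing 30, which breaks out of the outer while loop because match contains true.  From there, control passes through Listing 31 and Listing 32 to the return statement in Listing 33.  A true value for match indicates that the Subject line identifies the message as SPAM.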

If the text line is not the Subject line ...

A line of text was read in Listing 24, and a test was made to see if that line was the Subject line for the message.  If not, control transfers to the code that begins in Listing 27.

Since the line of text is not the Subject line, it needs to be screened against the offensive words and phrases in a different list designed for screening body text.

Listing 27 executes several steps in preparation for that screening process.

        else{
          data = data.toUpperCase();
          theResult.text =
                     theResult.text + data + "\n";
          if(++progressCounter < 50){
            theGui.textArea.append(".");
          }else{
            progressCounter = 0;
            theGui.textArea.append(".\n");
          }//end else

Listing 27

The code in Listing 27:
  • Converts the text line to upper case.
  • Appends the text line to the contents of the field named text in the output object of type ScreenResult.
  • Causes a single period to be displayed in the progress indicator on the GUI of Figure 1.

Screen the message body text line

The line of message body text is actually screened by invoking the screenOnPhrase method in Listing 28.  The third parameter in the invocation of this method allows for one extraneous character to separate the characters of the offending phrase in the line of body text.

          Iterator iterator = bodyWordList.
                                      iterator();
          match = false;
          while(iterator.hasNext()){
            String bodyWord =
                    ((String)(iterator.next())).
                                    toUpperCase();
            if(!(bodyWord.equals(""))){
              match = screenOnPhrase(
                                data,bodyWord,1);
            }//end if

Listing 28

Loop on an Iterator

An upper-case version of the message text line is screened against upper-case versions of each of the offending words and phrases in a TreeSet object referred to by bodyWordList.

An Iterator is used to cause this process to continue until either a match is found, or the items in the list are exhausted.

The boolean value returned by screenOnPhrase is stored in the variable named match.

If the returned value is true ...

If the value returned by screenOnPhrase is true, the code in the body of the if statement of Listing 29 is executed.

            if(match == true){
              inData.close();
              boolean moved =
                  new File(fileName).renameTo(
                      new File(
                          "c:/MailFiles/Archives/"
                          +uidl+".txt"));
              if(!moved)System.out.println(
                  "Unable to move file " + uidl);

              break;
            }//end if match == true
          }//end while iterator has next
        }//end else for line not Subject line

Listing 29

This code performs the following operations:
  • Move the local copy of the message from the history folder to an archive folder for the same reasons given with respect to the Subject line earlier.
  • Break out of the Iterator loop because there is no need to test against any additional offensive words or phrases.

At this point, the value of match is true, meaning that a match has been found.  Control is transferred to the if statement at the top of Listing 30.

If screenOnPhrase returned false ...

On the other hand, if no matches were found for any of the words and phrases in bodyWordList, control reaches the if statement at the top of Listing 30 with match containing a value of false.

        if(match == true)break;
      }//end while loop on read until null

Listing 30

If match is false, the code in Listing 30 loops back to the top of Listing 24, reads the next line of text from the message, and begins the screening process all over again.

Close the file


If match is true in Listing 30, there is no need to do any further testing so the code in Listing 30 breaks out of the while loop responsible for reading lines of text from the file, transferring control to the top of Listing 31.

Control can also transfer to the top of Listing 31 when the end of the text file has been reached.  In that case, the value of match will be false, indicating that no match was found.

      inData.close();//Close file if still open
    }catch(Exception e){e.printStackTrace();}

Listing 31

The code in Listing 31 closes the file and finishes off the obligatory code for a try/catch block.

Store the final phrase

In the event that a match was found, the variable phrase contains the offending phrase that identifies the message as SPAM.  If a match was not found, the contents of phrase are of no significant value.  In either case, however, the value of phrase is stored in the field named thePhrase in the output object of type ScreenResult in Listing 32.

    theResult.thePhrase = phrase;

Listing 32

Return the value of match

The code in Listing 33 returns the value of the variable match.  This will either be the initial value of false if no match was found (see Listing 8) or will be true if a match was found and the initial value was overwritten with true.

    return match;
  }//end screenMsg method

}//end class Screen

Listing 33

Return points for the screenMsg method

The screenMsg method contains three return statements.

The first occurs in Listing 15 where the code explicitly returns false, indicating that a friendly phrase was found in the Subject line, and that the message should not be deleted from the server.

The second occurs in Listing 18 where the code explicitly returns false, indicating that the message was sent by a friend and therefore shouldn't be deleted from the server.

The third occurs in Listing 33.  A match value of false at this point indicates that the message was not identified as SPAM and should not be deleted from the server.  A match value of true at this point indicates that the message is believed to be SPAM and probably should be deleted from the server.

Preview of Future Lessons

This program is most useful when you have well-developed lists of offending words and phrases.  Although it is possible to create those lists with a text editor, you can be much more productive, and you are much more likely to keep the lists up to date, if you use the programs that I will present in the next two lessons.

Therefore, I will give you a preview of those two programs.  I will show you three images that partially illustrate the capabilities of the two programs.

The first image (Figure 2) shows the GUI used to train the algorithm to do a better job of identifying SPAM in the Subject line of a message. 

The second and third images (Figures 3 and 4) show two different aspects of the GUI used to train the algorithm to do a better job of identifying SPAM in the body text of a message.

(In all three cases, the width of the GUI was reduced to make it fit into this narrow publication format.  The version that I routinely use is much wider, and can therefore display much more information.)

Training the algorithm on the Subject line

Figure 2 illustrates the procedure that I use to train the algorithm to do a better job of identifying SPAM in the Subject line of future messages.

User interface

Figure 2 User interface for training on Subject line

In Figure 2, a message previously stored in the history folder has been loaded into the GUI.  The complete raw text of that message is available for viewing in the large text area if desired.  The From line and the Subject line are displayed in the top two text fields in the GUI.  (I purposely deleted the Email address of the sender in all three of these images.)  User instructions are displayed in the fourth text field.

Offensive text in the Subject line

In this case, the user has identified offensive text (XANAX) in the Subject line and has selected that text with the mouse.  Two additional steps are required to add that text to the word list used to screen the Subject line of future messages.

The first step is to press the button labeled Copy Selected Text.  This will cause the selected text to be copied into the third text field from the top where it can be edited if desired.

(In the event that the spammer inserted extra characters into the offensive text, such as in X-ANAX, the extra characters should be deleted before proceeding to the second step.)

The second step is to press the button labeled Post Text.  This will cause the selected and (possibly) edited text to be automatically added to the word list.

That is all that is required to cause the program to identify this offensive text in the Subject lines of all future messages.
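
The effect of posting text to the word list can be sketched as shown below.  This is not the training program itself, which is presented in the next lesson.  The class PostTextSketch and its methods are hypothetical, and the choice of a case-insensitive ordering is an assumption made for this illustration so that entries differing only in case are stored once.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of what pressing Post Text might do.
final class PostTextSketch {

  // Case-insensitive ordering so XANAX and xanax count as one entry.
  static Set<String> newWordList() {
    return new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
  }

  // Adds the edited text to the list; returns true if it was new.
  static boolean post(Set<String> wordList, String editedText) {
    String entry = editedText.trim();
    if (entry.isEmpty()) {
      return false; // nothing to add
    }
    return wordList.add(entry);
  }
}
```

Because the screening code converts each entry to upper case before comparing, the case in which an entry is stored does not affect screening.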

Process the next message

If the user then presses the Next button, the next message in the history folder will be loaded into the GUI.  The current message will not be deleted from the history folder.

(This is what you would normally do if you are going to use the same message later to train the algorithm to better identify SPAM on the basis of the body text.)

If the user presses the Delete Local File button, the current file will be deleted from the history folder and the next message in the history folder will be loaded into the GUI.

(This is what you would normally do if you have determined that the message is not SPAM, or should not be used for further training of the algorithm for some other reason.  Perhaps the message was received from a friend whose Email address has not yet been added to the list of friendly Email addresses discussed earlier in this lesson.  Note that deleting the message file from the local disk does not delete the message from the server.)

A very simple process

As you can see, the process of training the algorithm on the Subject line consists simply of selecting text with the mouse and pressing buttons to cause the selected text to be added to the word list.  This can be accomplished very quickly with very little effort.  Except for the possible requirement to delete extra characters, no actual typing is required.

(As an alternative, the user can type anything into the third text field and press the Post Text button to cause it to be added to the word list.  Any number of items can be added to the word list before moving on to the next message.)

Training the algorithm on the body text

Figure 3 illustrates one aspect of the procedure for training the algorithm to do a better job of identifying SPAM on the basis of the body text of future messages.

User interface

Figure 3 User interface for training on body text and IP address

Once again, a message previously stored in the history folder has been loaded into the GUI.  The complete raw text of that message is available for viewing in the large text area.  The From line and the Subject line are displayed in the top two text fields in the GUI.  User instructions are displayed in the fourth text field from the top.

Add originating IP address to the list

At this point, the user has pressed the button labeled Select IP.  This caused the program to search out the IP address of the computer that originally sent this message, and to copy that IP address into the third text field from the top.  All that is required to add that IP address to the list of offending phrases is to press the button labeled Post Word.

As you can see, getting the originating IP address of a SPAM message and adding it to the word list is very simple.  As before, once it is in the third text field, you can edit if you like before adding it to the list.

Getting offending text from the body text

Also at this point, you can scroll the text area.  If you visually identify something in that text that you believe will uniquely identify messages from this spammer in the future, you can copy and paste that text into the third text field, and then add the text to the list by pressing the Post Word button.

Adding URLs to the list

One of the best ways to identify SPAM is to identify URLs referenced in the SPAM messages.  This is something that is difficult, or at least expensive for the spammer to change frequently.  (Sometimes the identification of one critical URL will cause hundreds and perhaps thousands of future messages to be identified as SPAM.)

Adding URLs is very easy

Figure 4 illustrates a special feature of the program designed to let you capitalize on that weakness.  Once a message is loaded into the GUI, each time you press the button labeled Select URL, the program will search down through the message until it finds the next block of text that begins with HTTP://.  (This is normally an indication of a URL.)

The program will select that URL, beginning with HTTP:// and including everything out to the character before the next / character.  That / character normally separates the domain name from a directory or file name.  Then the program copies the selected text into the third text field.

(The spammer can much more easily change directory and file names than domain names, so they are excluded from the text that is selected and copied.)
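
The selection rule just described can be sketched as follows.  This is only an illustration of the rule, not the code of the training program, which is presented in a later lesson.  The class UrlPrefixSketch and the method findUrlPrefixes are hypothetical.  The sketch assumes that the text is searched in upper case, and that whitespace, quotation marks, angle brackets, or the end of the text also terminate a URL when no / character follows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Select URL search; not the author's code.
final class UrlPrefixSketch {
  private static final String MARKER = "HTTP://";

  // Returns each HTTP://host prefix in the order it appears.
  static List<String> findUrlPrefixes(String messageText) {
    String text = messageText.toUpperCase();
    List<String> found = new ArrayList<>();
    int start = text.indexOf(MARKER);
    while (start >= 0) {
      int end = start + MARKER.length();
      // Stop before the next '/'. Assumption: whitespace, quotes,
      // angle brackets, or end of text also end the URL.
      while (end < text.length()
          && "/ \t\r\n\"'<>".indexOf(text.charAt(end)) < 0) {
        end++;
      }
      found.add(text.substring(start, end));
      // Continue the search after the text just selected.
      start = text.indexOf(MARKER, end);
    }
    return found;
  }
}
```

Each element of the returned list corresponds to what one press of the Select URL button would copy into the text field under these assumptions.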

User interface

Figure 4 User interface for training on body text and URL

A URL has been identified

In Figure 4, the program has selected the URL being used by the spammer and has copied it into the text field.  At this point, the user can edit the URL if appropriate, and can add it to the list by pressing the Post Word button.

Each time the user presses the Select URL button the next URL in the message is copied into the text field.  When no more URLs can be found, a message to that effect is displayed in the text field.

Thus, it is very easy for the user to identify all the URLs being used by the spammer and to add some or all of them to the list.

The next message

The behavior of the Next button and the Delete Local File/Next button is the same as discussed relative to Figure 2.  I typically delete the file from the history folder after I have used it to train the algorithm on the basis of body text.

Stay tuned

So, stay tuned.  I will explain the programs that provide this training capability in the next two lessons in this series.

Run the Program

I encourage you to copy the code from Listing 34 and the three starter text files in Listing 35, Listing 36, and Listing 37 into your text editor.  Compile and execute the program.  Experiment with it, making changes, and observing the results of your changes.

You may want to modify this code to cause the message files to be stored in a different location on your disk.  If so, modify the strings in Listing 34 that read "c:/MailFiles/" + uidl + ".txt" and "c:/MailFiles/Archives/" + uidl +".txt" to specify a different folder. Make certain that the folder where you plan to save the files exists before running the program.
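
If you prefer not to edit several string literals, one option is to derive both paths from a single base folder and to create the folders at startup.  The sketch below illustrates that idea.  The class MailFoldersSketch and its methods are hypothetical and are not part of Listing 34.  Only the Archives subfolder name and the .txt extension are taken from the strings quoted above.

```java
import java.io.File;

// Hypothetical sketch: one base folder instead of repeated literals.
final class MailFoldersSketch {
  private final File historyDir;
  private final File archiveDir;

  MailFoldersSketch(String baseDir) {
    historyDir = new File(baseDir);
    archiveDir = new File(historyDir, "Archives");
  }

  // Creates both folders if needed; returns true if both exist.
  boolean ensureFoldersExist() {
    archiveDir.mkdirs(); // also creates historyDir as a parent
    return historyDir.isDirectory() && archiveDir.isDirectory();
  }

  // Path of a message file in the history folder.
  File historyFile(String uidl) {
    return new File(historyDir, uidl + ".txt");
  }

  // Path of a message file in the archive folder.
  File archiveFile(String uidl) {
    return new File(archiveDir, uidl + ".txt");
  }
}
```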

Before running the program, you will need to create three text files having the following names and purposes and store them in the folder containing your compiled Java class files for this program:

  • Pop302a.txt - contains offensive Subject line words and phrases
  • Pop302b.txt - contains offensive body text words and phrases
  • Pop302c.txt - contains friendly Email addresses and friendly Subject line material

Eventually you will need to populate these files with words and phrases that work well for you.  (The algorithm training programs that I will present in the next two lessons will be extremely helpful in this regard.)

In the meantime, I have provided sample files in Listing 35, Listing 36, and Listing 37 that you can use as starter lists.  If you receive the same kinds of SPAM that I receive, the words in these lists should make it possible for you to test the program and get a few hits on SPAM messages.

These are simply text files so feel free to add other words and phrases as appropriate.
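
If you want to inspect one of these files from a small program of your own, a loader along the following lines may help.  It assumes one word or phrase per line, which is an assumption of this sketch.  The class WordListLoaderSketch and its methods are hypothetical.  Blank lines are skipped, and a TreeSet is used because that is the collection type that the screening code iterates over.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.TreeSet;

// Hypothetical loader; assumes one word or phrase per line.
final class WordListLoaderSketch {

  // Reads all non-blank lines from source into a sorted set.
  static TreeSet<String> load(Reader source) throws IOException {
    TreeSet<String> words = new TreeSet<>();
    try (BufferedReader in = new BufferedReader(source)) {
      String line;
      while ((line = in.readLine()) != null) {
        String entry = line.trim();
        if (!entry.isEmpty()) { // skip blank lines
          words.add(entry);
        }
      }
    }
    return words;
  }

  // Convenience wrapper for trying the loader on literal text.
  static TreeSet<String> loadFromString(String text) {
    try {
      return load(new StringReader(text));
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

To read an actual file, pass a java.io.FileReader for a name such as Pop302a.txt to the load method.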

(Let me caution you not to enable the DELE code in Listing 34 until you are certain that you actually want to delete messages from the server.  Once a message is deleted from the server, there is no way to recover it from the server.)

Summary

The previous lesson explained the communications module of a program used to remove SPAM from your Email server before it is downloaded into your primary Email client.

This lesson explains my algorithm used to identify SPAM.  You can use the algorithm as is, or modify it to better suit your needs.

After about one week of training, my algorithm was reliably identifying about ninety percent of all SPAM messages.  I expect this performance to improve over time as the algorithm becomes better trained.

This program is most useful when you have well-developed lists of offending words and phrases.  Although it is possible to create those lists with a text editor, you can be much more productive, and you are much more likely to keep the lists up to date, if you use the programs that I will present in the next two lessons.

What's Next?

In the next lesson in this series, I will present and explain my program named Pop302d, which provides an easy way to train my screening algorithm to do a better job of identifying SPAM in the Subject line of a message.

Complete Program Listing

A complete listing of the program is provided in Listing 34.  In addition, starter text files are provided in Listing 35, Listing 36, and Listing 37.

The three DELE statements shown in red in Listing 34 have been purposely disabled to prevent you from accidentally deleting messages from your server while testing this program.

Do not enable these three statements until you are ready to actually delete messages from the server.  Once a message is deleted from the server, it cannot be recovered from the server.

Disclaimer of responsibility:  If you elect to use this program you use it at your own risk.  Make absolutely certain that you understand what you are doing before you execute the program.  The author of this program, Richard G. Baldwin, and the websites Developer.com and Gamelan.com accept no responsibility for any losses that you may incur as a result of using this program.

/*File Pop302.java Copyright 2004, R.G.Baldwin
Rev 01/02/04

Upgraded on 01/02/04 to do the following:
-Display msgNumber in text area while awaiting
decision to delete or not to delete.
-Confirm that local msgNumber is in synch with
message number on server (in UIDL) before
deletion of a message from the server.

The purpose of this program is to read messages
from a POP3 server, analyze the messages
according to screening rules, and delete those
messages from the server that fail the screening
test. (As written, the program asks the user
to confirm the deletion of each message, but
this confirmation step could easily be removed.)

This version of the program screens on the basis
of key words or phrases in the From line, key
words or phrases in the Subject line, and key
words or phrases in the body text.

A list of friendly Email addresses is used to
screen the From line. Messages that are from
friendly Email addresses are not deleted from
the server and no information about those
messages is saved on the local disk. They are
totally ignored after determining that they were
sent from a friendly Email address.

Different lists of words are used for screening
Subject lines and body text. For example,
ANTIVIRUS is appropriate for screening the
Subject line, but is not appropriate for
screening the body text. The word ANTIVIRUS
often appears legally in the header of Email
messages that have been scanned for viruses by
the server, but also often appears in the Subject
line of SPAM messages.

The common spammer tricks of inserting extra
characters between the characters in the
offending word and mixing the case of the
characters in the offending word are defeated by
this program.

For example, this program will flag for deletion
a message having any of the following in its
Subject line or its body text:

vIaGrA
V.IagRA
V.I.A.G.R.A

This program also defeats the common trick of
appending random characters to the end of the
Subject line, because it doesn't require a match
for the entire Subject line.

When the program detects a message that is a
candidate for deletion, the user is asked to
verify the deletion by clicking the Delete
button. If the user doesn't want to delete the
message, she should click the Start/Next
button.

The following information is available to the
user for making that decision:
- From
- Subject
- Offending line, which may also be the subject
- Offending word or phrase
- Entire raw text of the message up to and
including the offending line

All messages that are candidates for deletion
from the server are saved in an archive folder
on the local disk, regardless of whether the
user elects to delete them from the server. Thus
if a message is deleted from the server and it is
later determined that was a mistake, a raw text
copy of the deleted message is available locally
in the archive folder. You should probably empty
this folder periodically so that it won't fill
up your disk.

Except for friendly messages, all messages that
are not candidates for deletion from the server
are saved in a history folder on the local
disk. These messages can be used later to train
the program to do a better job of recognizing
SPAM.

Before any message is saved in a local file,
asterisks are inserted into the text on
ten-character intervals in an attempt to destroy
any virus code that may be embedded in the
message.

Numerous upgrades are possible. One possible
upgrade is to create a premium list of words and
phrases that will always result in deletion of
the message from the server without prior
approval by the user. For example, the user
might want to have any message containing
VIAGRA to be automatically deleted. However,
great care is urged in this regard. Certain
words such as SPAM and PORN occasionally occur
in a message with the letters separated by only
a few characters. This program would identify
those messages as being candidates for deletion.
For example, the offending word PORN occurs in
the non-offending word imPORtaNt with the letters
R and N separated by only two characters. The
word SLUT appears in the word SoLUTion with only
one character between the S and the L. The word
SPAM often occurs in different variations of
body text.

Another possible upgrade would be to allow the
user to specify the number of characters that may
occur between the letters of an offending word
or phrase. As programmed, that value is
hard-coded into the program, and as of this
writing, that value is one.

If the number of characters is set to zero, many
spam messages will avoid detection. If that
value is set to a large number, many false alarms
will occur. Therefore, care should be taken when
adjusting this value.

Another possible modification would be to allow
the program to automatically delete all
messages that are determined to be candidates
for deletion. Since these messages are saved
locally in an archive folder, a separate program
could be written to allow the user to review
those messages locally at her convenience just
in case a valid message was inadvertently
deleted from the server.

Companion programs that I have written provide
for creating and maintaining the word lists
discussed above in disk files. These programs
are used to analyze the non-deleted message files
saved locally in the history folder in order to
train this program to do a better job of
identifying SPAM messages in the future. These
programs are designed for ease of use to
encourage the user to train the program
frequently.

All three word lists are maintained in simple
text files, which can be edited with an
ordinary text editor if need be.

For technical information on POP3, see RFC 1725
at
http://www.cis.ohio-state.edu/htbin/rfc/rfc1725.html

A POP3 Command Summary follows based on the
information at that web site.

Minimal POP3 Commands:
USER name
PASS string
QUIT
STAT
LIST [msg]
RETR msg
DELE msg
NOOP
RSET
QUIT

Optional POP3 Commands:
APOP name digest
TOP msg n
UIDL [msg]

POP3 Replies:
+OK
-ERR

File names: The following file names are hard-
coded into the program:

The file name for a local copy of a message is
the unique identifier for that message obtained
from the mail server.

Pop302a.txt - contains a word list for screening
the Subject lines.

Pop302b.txt - contains a word list for screening
the body text lines.

Pop302c.txt - contains a list of friendly Email
addresses for screening the From lines to
identify friendly messages.

This program consists of two main classes. An
object of the class named Pop302 handles all
communications with the Pop3 server.

An object of the class named Screen screens each
message in an attempt to identify SPAM. This
class can be totally replaced by Java programmers
who wish to design their own screening algorithm
provided they maintain the interface with the
object of the class named Pop302.

Tested using SDK 1.4.2 under WinXP
************************************************/

import java.net.*;
import java.io.*;
import java.util.*;
import java.awt.*;
import java.awt.event.*;

class Pop302 extends Frame{
int msgCounter = 0;
int msgNumber;
TextArea textArea;
TextField subjField;
TextField fromField;
TextField operMsgField;
int numberMsgs = 0;
String uidl = "";//unique msg ID
BufferedReader inputStream;
PrintWriter outputStream;
Socket socket;
Screen screener;
String fileName;

public static void main(String[] args){
if(args.length != 3){
System.out.println("Usage: java Pop302 "
+ "server userName password");
System.exit(0);
}//end if

new Pop302(args[0],args[1],args[2]);
}//end main
//===========================================//

Pop302(String server,String userName,
String password){
//Instantiate a new Screen object and pass
// this to allow for the object to call back
// and update the progress indicator.
screener = new Screen(this);

int port = 110; //pop3 mail port
try{
//Get a socket, connected to the
// specified server on the specified
// port.
socket = new Socket(server,port);

//Get an input stream from the socket
inputStream = new BufferedReader(
new InputStreamReader(
socket.getInputStream()));

//Get an output stream to the socket.
// Note that this stream will autoflush.
outputStream = new PrintWriter(
new OutputStreamWriter(
socket.getOutputStream()),true);

//Display the msg received from the
// server on the command-line screen
// immediately following connection.
String connectMsg = validateOneLine();
System.out.println("Connected to server "
+ connectMsg);

//The communication process is now in the
// AUTHORIZATION state. Send the user
// name and password to the server. Note
// that the use of an APOP command
// for sending user name and password
// would probably be more secure
// if it is supported by the server.
// However, my server apparently doesn't
// support APOP.
//Commands are sent in plain text, upper
// case to the server. Some commands
// require an argument following the
// command, as is the case with USER.
//Send the command.
outputStream.println("USER " + userName);
//Get response and confirm that the
// response was +OK and was not -ERR.
String userResponse = validateOneLine();
//Display the response on the command-
// line screen. Cannot display in the
// GUI at this point in time because the
// GUI object is not ready for use at
// this point in the execution of the
// constructor.
System.out.println("USER " + userResponse);
//Send the password to the server
outputStream.println("PASS " + password);
//Validate the server's response as +OK.
// Display the response in the process.
System.out.println(
"PASS " + validateOneLine());
}catch(Exception e){e.printStackTrace();}

//Register a window listener to service
// the close button on the Frame. This is
// an anonymous class definition.
this.addWindowListener(
new WindowAdapter(){
public void windowClosing(WindowEvent e){

//Terminate the session with the
// server.
outputStream.println("QUIT");
String quitResponse =
validateOneLine();
//Display the response on the
// command-line screen.
System.out.println(
"QUIT " + quitResponse);
//Also display the response on the
// GUI. However, you probably won't
// see it because the GUI is
// closing.
textArea.append(quitResponse + "\n");

//Server is now in the UPDATE mode.
// It will delete all files marked
// with the DELE command earlier
// in the execution of the program.
//Close the socket
try{
socket.close();
}catch(Exception ex){
ex.printStackTrace();}

System.exit(0);
}//end windowClosing
}//end WindowAdapter()
);//end addWindowListener

//Note, this GUI was purposely made narrow
// in order to make it fit into the
// publication format. You should make
// it wider and also increase the width of
// the text fields and the TextArea defined
// below to make it more useful.
setLayout(new FlowLayout());
//Note that the compiler requires the
// references to the following buttons to
// be final because they are accessed from
// within an anonymous class definition.
final Button startButton =
new Button("Start/Next");
final Button deleteButton =
new Button("Delete");
subjField = new TextField(
"Display Subj here",50);
fromField = new TextField(
"Display From line here",50);
operMsgField = new TextField(
"Display operator messages here",50);
textArea = new TextArea(15,50);
textArea.append("Display raw data here\n");

//Register an ActionListener on the
// startButton. This is an anonymous
// class definition.
startButton.addActionListener(
new ActionListener(){
public void actionPerformed(
ActionEvent e){
//Clear the operator message field
operMsgField.setText("");

try{
//The communication process is now
// in the TRANSACTION state.
//Retrieve and screen messages
if(numberMsgs == 0){
//Calculate numberMsgs only at
// the beginning of the run,
// because it changes when
// messages are deleted.
outputStream.println("STAT");
String stat = validateOneLine();
//Get the number of messages as
// a String.
String numberMsgsStr =
stat.substring(
4,stat.indexOf(" ",5));
//Convert the String to an int.
numberMsgs = Integer.parseInt(
numberMsgsStr);
}//end if numberMsgs == 0
//NOTE: Msg numbers begin with 1,
// not 0.
//Retrieve and screen each
// message. Each msg ends with a
// period on a new line.
msgNumber = msgCounter + 1;

if(msgNumber <= numberMsgs){
//Process the next message.

//Get and save a unique identifier
// for the message from the server
// and validate the response.
outputStream.println(
"UIDL " + msgNumber);
uidl = validateOneLine();

//Open an output file to save
// the message. Use the UIDL
// as the file name. Others
// may need to modify the
// following code to identify
// a folder for local storage of
// the messages.
fileName =
"c:/MailFiles/" + uidl +".txt";
DataOutputStream dataOut =
new DataOutputStream(
new FileOutputStream(
fileName));

//Send a RETR command to begin
// the message retrieval process
outputStream.println(
"RETR " + msgNumber);
//Validate the response.
String retrResponse =
validateOneLine();

//Clear the text in the TextArea
// at the beginning of each new
// message. If you don't do
// this, the String being
// displayed will become very
// long and the program will run
// very slowly for large numbers
// of messages.
textArea.setText("");

//Read the first line in the
// message from the server.
String msgLine =
inputStream.readLine();
//Insert asterisks in the text
// in an attempt to destroy
// viruses before the file is
// stored locally.
msgLine = insertStars(msgLine);

//Continue reading lines until
// a line consisting of a single
// "." is encountered. That
// signals the end of the msg.
while(!(msgLine.equals("."))){
//Write the line to the output
// file and read the next
// line. Insert newline
// characters when writing the
// output to the file.
dataOut.writeBytes(
msgLine + "\n");
msgLine = inputStream.readLine();
//Insert asterisks to destroy
// virus code.
msgLine = insertStars(msgLine);
}//end while
//Close the output file. The
// message is now stored in a
// local file with a file name
// based on the unique ID
// provided by the server. Note
// that a unique ID provided by
// one server may duplicate a
// unique ID provided by a
// different server.
dataOut.close();

//Now screen the file testing
// for reasons to delete the
// message from the server.
//First initialize the text showing
// in the various components in the
// GUI.
fromField.setText("Call screener");
subjField.setText("Call screener");
operMsgField.setText(
"Call screener");
textArea.setText(
"Progress Meter: ");
//Initialize the match flag
// to false.
boolean match = false;

//Now cause the message file to be
// screened. In the event that you
// decide to design your own
// screening algorithm, this is
// where you would probably
// make the first modification to
// the program. Your version of
// the method named screenMsg
// should return true if it is
// recommending that the message be
// deleted from the server. Also,
// the object of type ScreenResult
// passed as a parameter to the
// method should be populated with
// information to be displayed in
// the text fields and text area of
// the GUI.
ScreenResult theResult =
new ScreenResult();
match = screener.screenMsg(
fileName,uidl,theResult);

//Now display the information
// encapsulated in the ScreenResult
// object by the screenMsg method.
fromField.setText(theResult.from);
subjField.setText(
theResult.subject);
operMsgField.setText(
"Offending Phrase: "
+ theResult.thePhrase);
textArea.setText(theResult.text);
textArea.append("Msg Number: "
+ msgNumber);

//At this point, the user can
// view the From line and the
// Subject line for the message,
// the complete text of the message
// down to the line containing the
// offending word or phrase, as
// well as that word or phrase.

//Increment the message counter
// in preparation for
// processing the next message.
msgCounter++;

//A return value of true means that
// the screener is recommending
// deletion of the message from the
// Email server.
if(match == true){
//The message has been flagged
// as a candidate for deletion
// from the server. Return
// from the actionPerformed
// method and take no further
// action until the user
// presses the Delete button
// or the Start/Next button.
//Pressing the Delete button
// causes the message to be
// deleted from the server.
//Pressing the Start/Next
// button causes it to be
// preserved.
return;
}//end if match == true

//Control reaches this point only
// if match is not true.
//The message is not a
// candidate for deletion from
// the server.
//At this point, we could
// require the user to press
// the Start/Next button to
// process the next message.
//However, we won't do that. The
// following code fires an event
// identical to that which would
// be fired if the user pressed
// the Start/Next button.
Toolkit.getDefaultToolkit().
getSystemEventQueue().
postEvent(new ActionEvent(
startButton,
ActionEvent.
ACTION_PERFORMED,
"Start/Next"));
}//end if msgNumber <= numberMsgs
else{//msgNumber > numberMsgs
//No more messages. Disable the
//Start/Next button.
startButton.setEnabled(false);
//Instruct the user to terminate
// the program.
subjField.setText(
"No more messages, press Close");
fromField.setText(
"No more messages, press Close");
operMsgField.setText(
"No more messages, press Close");
textArea.setText(
"No more messages, press Close");
}//end else
}//end try
catch(Exception ex){
ex.printStackTrace();}
}//end actionPerformed
}//end ActionListener
);//end addActionListener

//Register an ActionListener on the Delete
// button to make it possible for the
// user to remove a message from the
// server.
deleteButton.addActionListener(
new ActionListener(){
public void actionPerformed(
ActionEvent e){
//Clear the operator message field
operMsgField.setText("");

//Confirm that local msgNumber is in
// synch with message number on server
int firstSpace = fileName.indexOf(" ");
int secondSpace = fileName.indexOf(
" ",firstSpace + 1);
String chunk = fileName.substring(
firstSpace + 1,secondSpace);
if(Integer.parseInt(chunk)
!= msgNumber){
System.out.println(
"msgNumber synch error");
System.exit(0);//terminate
}//end if

//Deletion of a message from the
// server is accomplished by marking
// the message for deletion while in
// the TRANSACTION state. The
// message is actually deleted when
// the client sends a QUIT command
// to the server causing the server
// to enter the UPDATE state. If the
// program aborts prematurely before
// sending a QUIT command, marked
// messages are not deleted from the
// server.
//Mark the message for deletion.
//Note that the following three statements have
// been purposely disabled to prevent you from
// accidentally deleting messages from the server
// during your early testing of the program. Do
// not enable these three statements until you
// are certain that you really do want to delete
// messages from the server. At that point in
// time, you can enable the three statements by
// removing the comment indicators.

/*
outputStream.println(
"DELE " + msgNumber);
//Validate the response and display
// it on the GUI. You probably won't
// see it on the GUI because of what
// happens next. The program
// immediately clears the display
// and begins processing the
// next message. If you modify the
// program to eliminate the clearing
// of the display between messages,
// you will see this response.

textArea.append(
"DELE "+validateOneLine()+"\n");
textArea.append(
"Deleted:" + msgNumber + "\n");

*/

//Create and fire a synthetic event
// that simulates the user pressing
// the Start/Next button. This
// initiates the processing of the
// next message.
Toolkit.getDefaultToolkit().
getSystemEventQueue().
postEvent(new ActionEvent(
startButton,
ActionEvent.
ACTION_PERFORMED,
"Start/Next"));
}//end actionPerformed
}//end ActionListener
);//end addActionListener

//Configure the GUI by placing the
// various components on it, setting the size
// and making it visible.
add(startButton);
add(deleteButton);
add(fromField);
add(subjField);
add(operMsgField);
add(textArea);
setTitle("Copyright 2004, R.G.Baldwin");
//Increase the following parameters and
// modify the construction parameters for
// the text fields and the text area to
// increase the size of the GUI.
setSize(400,400);
//Make the GUI visible.
setVisible(true);
}//end constructor
//===========================================//

//Validate a one-line response.
//The purpose of this method is to confirm that
// the server returned +OK and not -ERR to the
// previous command.
//If +OK, the method returns the string
// returned by the server.
//If -ERR, the method displays the string
// returned by the server and terminates the
// session.
private String validateOneLine(){
try{
String response = inputStream.readLine();
if(response.startsWith("+OK")){
return response;
}else{
System.out.println(response);
//Terminate the session.
outputStream.println("QUIT");
socket.close();
System.out.println(
"Premature QUIT on -ERR");
System.exit(0);
}//end else
}catch(IOException e){e.printStackTrace();}
//The following return statement is required
// to satisfy the compiler.
return "Make compiler happy";
}//end validateOneLine()
//===========================================//

//Purpose of this method is to insert an
// asterisk (star) every tenth character in
// order to destroy virus code before it is
// written into the output file. While this
// makes the local version of the message
// harder to read, it does little to reduce its
// usefulness for computer analysis.
private String insertStars(String stringIn){
StringBuffer stringBuffer =
new StringBuffer(stringIn);
int length = stringBuffer.length();
for(int cnt = 9; cnt < length; cnt+=10){
stringBuffer.insert(cnt,'*');
}//end for loop
return new String(stringBuffer);
}//end insertStars
//===========================================//
}//end class Pop302
//=============================================//

//Class to encapsulate screening results. An
// object of this type is passed to the screenMsg
// method where it is populated with the results
// of the screen.
class ScreenResult{
public String subject = "";
public String from = "";
public String thePhrase = "";
public String text = "";
}//end ScreenResult
//=============================================//

//This class implements a set of rules for
// detecting SPAM messages and for recommending
// whether or not a message should be deleted
// from the server.
//If you have a better way to detect SPAM, you
// can replace this class by a completely
// different class definition, so long as you
// maintain the interface used by Pop302 (the
// constructor and the screenMsg method).
//
//An object of this class has one entry point and
// one exit point, which is the public method
// named screenMsg. However, the constructor
// receives a reference to the GUI object created
// by instantiating the class named Pop302. The
// object of this class uses that reference to
// display progress on the text area belonging
// to the GUI. This is comforting on those
// occasions when a very long message is
// encountered. This callback link could easily
// be eliminated by deleting code from two
// locations in this class and removing the
// callback constructor parameter.
class Screen{

TreeSet subjWordList;
TreeSet bodyWordList;
TreeSet friendlyWordList;
Pop302 theGui;//save callback reference here
String phrase;

Screen(Pop302 theGui){//constructor
this.theGui = theGui;
//Read the files containing word lists and
// create TreeSet objects containing those
// words or phrases in alphabetical order.
makeSubjWordList();
makeBodyWordList();
makeFriendlyWordList();
}//end constructor

//This method is used to identify messages that
// are candidates for being deleted from the
// server. Such identification is based on
// analyzing the file in which the message is
// stored locally. A return value of true means
// that the message is believed to be SPAM, and
// that the method recommends deleting the
// message from the server. In addition to
// the return value, certain strings are
// encapsulated in the incoming object of type
// ScreenResult. This information is posted
// on the GUI by the object of the class named
// Pop302.
public boolean screenMsg(String fileName,
String uidl,ScreenResult theResult){
//Initialize match to false
boolean match = false;
try{
//Open the file containing a local copy of
// the message. Note that the message has
// been modified by inserting asterisks
// in an attempt to protect against
// viruses.
BufferedReader inData
= new BufferedReader(new FileReader(
fileName));
String data;//temp holding area

//Get the Subject line by skipping header
// lines prior to the Subject line. Mark
// the beginning of the file to make it
// easy to rewind later. Set the readAhead
// Limit to 10000 characters before the
// mark will be lost.
inData.mark(10000);
//Populate the ScreenResult object just in
// case a Subject line isn't found and a
// From line isn't found later.
theResult.subject = "No Subj line found";
theResult.from = "No From line found";
while((data = inData.readLine()) != null){
//Remove the asterisks from the data that
// were inserted earlier in an attempt
// to defeat viruses.
data = removeStars(data).toUpperCase();
if(data.startsWith("SUBJECT:")){
//Put the Subject line in the
// ScreenResult object.
theResult.subject = data.toUpperCase();
//Screen against an upper-case version
// of the words and phrases in a
// TreeSet object containing friendly
// email addresses and subjects
Iterator iterator =
friendlyWordList.iterator();
while(iterator.hasNext()){
String friendlyWord =
((String)(iterator.next())).
toUpperCase();
match = false;
if(!(friendlyWord.equals(""))){
//The screenOnPhrase method is used
// to search for a match between
// the entries in the friendlyWord
// list and the text in the Subject
// line of the message. In this
// case, no extra characters are
// allowed in the phrase as
// it appears in the Subject line.
match = screenOnPhrase(
data,friendlyWord,0);
}//end if

if(match == true){
//The screenOnPhrase method found a
// match with a phrase in
// the friendlyWord list.
//The message should not
// be deleted from the server and
// the local copy should be
// deleted from the disk to
// prevent later analysis to
// extract target words or phrases
// for SPAM. This algorithm has no
// way of blocking SPAM with a
// Subject that matches a phrase
// in the friendlyWord list.
//Close the file and delete it from
// the local disk.
inData.close();
new File(fileName).delete();
//Store the phrase in the
// ScreenResult object.
theResult.thePhrase = phrase;
//Terminate the execution of this
// method, telling the program not
// to delete the message from the
// server by returning false. This
// is one of three return points
// from this method.
return false;
}//end if match = true
}//end while iterator has next
//Don't attempt to read any more lines
// from this file in this while loop.
break;
}//end if
}//end while loop

//Reset back to beginning of file. The
// Subject for this message is now showing
// in the GUI if the message contained a
// Subject line. Otherwise the GUI
// contains a message indicating that
// a Subject line wasn't found.
inData.reset();

//Get the From line by skipping header
// lines prior to the From line.
//Also test the From line against a list of
// friendly email addresses.
//If a match is found, do not delete the
// message from the server, but do delete
// the local file containing the message
// from the disk to prevent it from being
// analyzed later in an attempt to find
// target words or phrases for SPAM.
while((data = inData.readLine()) != null){
//Convert the data to all upper case and
// remove all asterisks from the data.
// Note that this will remove naturally
// occurring asterisks in addition to
// those that were inserted in an attempt
// to protect against virus code. If
// this proves to be a problem later, the
// removeStars method can be modified to
// remove only those asterisks that were
// inserted on ten-character intervals.
data = removeStars(data.toUpperCase());
if(data.startsWith("FROM:")){
//Put the From line in the ScreenResult
// object.
theResult.from = data;

//Screen against an upper-case version
// of the words and phrases in a
// TreeSet object containing friendly
// email addresses and subjects
Iterator iterator =
friendlyWordList.iterator();
while(iterator.hasNext()){
String friendlyWord =
((String)(iterator.next())).
toUpperCase();
match = false;
if(!(friendlyWord.equals(""))){
//The screenOnPhrase method is used
// to search for a match between
// the entries in the friendlyWord
// list and the text in the From
// line of the message. In this
// case, no extra characters are
// allowed in the Email address as
// it appears in the From line.
match = screenOnPhrase(
data,friendlyWord,0);
}//end if

if(match == true){
//The screenOnPhrase method found a
// match with an Email address in
// the friendlyWord list.
//Either this msg was sent by a
// friend, or a spammer sent it
// claiming to be from a friend.
// In either case, it should not
// be deleted from the server and
// the local copy should be
// deleted from the disk to
// prevent later analysis to
// extract target words or phrases
// for SPAM. This algorithm has no
// way of blocking SPAM that claims
// to have been sent by a friendly
// Email address.
//Close the file and delete it from
// the local disk.
inData.close();
new File(fileName).delete();
//Store the Email address in the
// ScreenResult object.
theResult.thePhrase = phrase;
//Terminate the execution of this
// method, telling the program not
// to delete the message from the
// server by returning false. This
// is one of three return points
// from this method.
return false;
}//end if match = true
}//end while iterator has next
//Don't attempt to read any more lines
// from this file in this while loop.
break;
}//end if data starts with From
}//end while loop on null

//Reset back to beginning of the file. The
// From line for this message is now
// showing in the GUI. Read and process
// the entire file.
inData.reset();

//Read and process strings until eof is
// indicated by null. Provide separate
// processing for the Subject line and all
// other lines in the message described
// herein as body text, although this also
// includes header data in the message.
//Note that more sophisticated forms of
// screening can be inserted at this point
// in the program.
int progressCounter = 0;
while((data = inData.readLine()) != null){
data = removeStars(data);
if(data.startsWith("Subject")){
//Process the Subject line.
//Process all data as upper case data
// to keep the spammer from hiding
// behind random case conversions.
data = data.toUpperCase();

//Append each line of data to the
// String stored in the ScreenResult
// object.
theResult.text =
theResult.text + data + "\n";
//Display progress on the GUI. Remove
// this code to break this link with
// the GUI object if desired.
if(++progressCounter < 50){
theGui.textArea.append(".");
}else{//Display progress on a new line
progressCounter = 0;
theGui.textArea.append(".\n");
}//end else

//Screen against an upper-case version
// of the words and phrases in a
// TreeSet object containing target
// words and phrases in the Subject
// line of the message.
match = false;
Iterator iterator =
subjWordList.iterator();
while(iterator.hasNext()){
String subjWord =
((String)(iterator.next())).
toUpperCase();
if(!(subjWord.equals(""))){
//Search for a match between the
// words or phrases in the subjWord
// list and the line of msg data.
// Allow one extra character to
// occur between the characters in
// the data and still make a match.
match = screenOnPhrase(
data,subjWord,1);
}//end if

if(match == true){
//Msg is a candidate for deletion.
//Don't need this local file for
// statistical analysis. Move it
// to a local archive folder.
inData.close();
boolean moved =
new File(fileName).renameTo(
new File(
"c:/MailFiles/Archives/"
+uidl+".txt"));
if(!moved)System.out.println(
"Unable to move file " + uidl);
//There is no need to test against
// any more words in this iterator
// loop.
break;
}//end if
}//end while iterator has next
}//end if data starts with Subject
//Data line does not start with Subject.
// Process the line as body text. Note
// that some body text occurs before the
// Subject line in the format of a
// typical message. Therefore, the code
// in the else clause will typically be
// executed several times before the
// code in the if clause discussed above
// will be executed.
else{
//Screen on an upper-case version of
// the message.
data = data.toUpperCase();
//Append the data line to the String
// value being stored in the
// ScreenResult object.
theResult.text =
theResult.text + data + "\n";
//Display progress in the GUI. Remove
// this code to break the callback link
// with the GUI if desired.
if(++progressCounter < 50){
theGui.textArea.append(".");
}else{
progressCounter = 0;
theGui.textArea.append(".\n");
}//end else
//Screen against an upper-case version
// of the words and phrases in a
// TreeSet object designed
// specifically for screening body
// text.
Iterator iterator = bodyWordList.
iterator();
match = false;
while(iterator.hasNext()){
String bodyWord =
((String)(iterator.next())).
toUpperCase();
if(!(bodyWord.equals(""))){
//Allow one character to occur
// between the characters in the
// data line and still make a
// match.
match = screenOnPhrase(
data,bodyWord,1);
}//end if

if(match == true){
//Msg is a candidate for deletion.
//Don't need this local file for
// statistical analysis. Move it
// to a local archive folder.
inData.close();
boolean moved =
new File(fileName).renameTo(
new File(
"c:/MailFiles/Archives/"
+uidl+".txt"));
if(!moved)System.out.println(
"Unable to move file " + uidl);
//There is no need to test against
// any more words in this iterator
// loop.
break;
}//end if
}//end while iterator has next
}//end else for line not Subject line
//A match has been found. No need to
// read any more data lines in this while
// loop.
if(match == true)break;
}//end while loop on read until null
inData.close();//Close file if still open
}catch(Exception e){e.printStackTrace();}
//Store the matching phrase (or the last
// phrase processed) in the ScreenResult
// object.
theResult.thePhrase = phrase;
//Return the value of match indicating
// whether or not a match was found. Note that
// this return statement is one of three return
// points in this method. This return will
// not be reached if a friendly phrase or
// Email address was found earlier when
// processing the Subject line or the From
// line.
return match;
}//end screenMsg
//===========================================//

//This method tests a string to see if it
// contains a word or phrase that may have
// extraneous characters inserted into it,
// such as VI*A-GRA.
//If the string contains the sequence of
// characters making up the word or phrase,
// with spanLim or fewer extraneous characters
// between any two of the word's characters,
// the method returns true. For example, if
// spanLim = 1, the spammer can insert one
// character between any two of the characters
// that make up the word and the word will
// still be detected. However, if the
// spammer inserts two or more characters,
// the offending word will not be detected.
//Need to be careful to avoid making spanLim
// too large. Large values of spanLim result
// in false alarms due to the fact that
// widely-separated characters can be
// considered to be part of the word or
// phrase. For example, if spanLim = 2 or
// greater, the word PORN will be found in
// the word imPORtaNt.
private boolean screenOnPhrase(String data,
String phrase,
int spanLim){
this.phrase = phrase;
StringBuffer str = new StringBuffer();
ArrayList locationData = new ArrayList();

//Compare each char in the data with each
// unique char in the word or phrase. If
// there is a match, append the char to str
// and save the location of the char in
// the ArrayList referred to by locationData.

//Eliminate duplicate char in the word or
// phrase by storing in a TreeSet. Note that
// this will also sort the char, but that
// doesn't matter.
TreeSet treeSet = new TreeSet();
for(int cnt = 0; cnt < phrase.length();
cnt++){
treeSet.add(
new Character(phrase.charAt(cnt)));
}//end for loop

//Get the unique characters from the set and
// save them in a StringBuffer
Iterator iter = treeSet.iterator();
StringBuffer tempPhrase = new StringBuffer();
while(iter.hasNext()){
tempPhrase.append(
((Character)(iter.next())).charValue());
}//end while

//Use the StringBuffer of unique characters
// to test the string and extract matching
// characters from the string. Discard all
// non-matching characters. This converts
// the original data into a string of
// characters, each of which is a character
// in the word or phrase. All other
// characters have been removed. Thus, if
// the data contains the word or phrase, it
// will occur somewhere in the compressed
// string with no extra characters in
// between. An example might be as follows:
// SMSPMASPAMMPAS
for(int i = 0; i < data.length(); i++){
for(int j = 0; j < tempPhrase.length();
j++){
if(data.charAt(i) ==
tempPhrase.charAt(j)){
str.append(data.charAt(i));
locationData.add(new Integer(i));
}//end if
}//end for on tempPhrase
}//end for on data

//Test to see if the extracted char sequence
// contains the word or phrase.
int match = str.indexOf(phrase);
if(match == -1){
return false;//no match
}//end if

//There is a match. Confirm that the span
// between target characters in data is not
// greater than allowed by the incoming
// spanLim parameter.
int maxSpan = 0;
int locA = ((Integer)locationData.
get(match)).intValue();
int locB = 0;
for(int cnt = 1; cnt < phrase.length();
cnt++){
locB = ((Integer)locationData.get(
match + cnt)).intValue();
int span = locB - locA;
if(span > maxSpan){
maxSpan = span;
}//end if
locA = locB;
}//end for loop

if(maxSpan > spanLim+1){
return false;//span too large
}else{
return true;//made a match
}//end else

}//end screenOnPhrase

//===========================================//
//Purpose of this method is to remove the
// asterisks inserted into the data by the
// method named insertStars, and to append two
// asterisks at the end of the line. Note that
// this method removes all asterisks, not just
// those inserted earlier. If this proves to
// be a problem, this method should be modified
// to remove only those asterisks that occur on
// ten-character intervals.
private String removeStars(String stringIn){
StringBuffer stringBuf =
new StringBuffer(stringIn);
int index = 0;
while(index > -1){
index = stringBuf.lastIndexOf("*");
if(index > -1){
stringBuf.delete(index,index+1);
}//end if
}//end while
stringBuf.append("**");
return new String(stringBuf);
}//end removeStars()
//===========================================//

//Purpose: To create a TreeSet object
// containing words used to screen the message
// subject lines.
//This method reads strings from a text file
// named Pop302a.txt and creates the list as
// a TreeSet object with no duplicates.
//See additional comments in the later section
// regarding the makeBodyWordList method.

private void makeSubjWordList(){
subjWordList = new TreeSet();

//Read word list from text file and populate
// the TreeSet object.
try{
BufferedReader inData
= new BufferedReader(new FileReader(
"Pop302a.txt"));
String data; //temp holding area

while((data = inData.readLine()) != null){
subjWordList.add(data);

}//end while loop
inData.close();//Close file
}catch(Exception e){e.printStackTrace();}
}//end makeSubjWordList
//===========================================//

//Purpose: To create a TreeSet object
// containing words and phrases used to screen
// the message BODY lines. See notes above
// regarding the list used to screen the
// Subject line of each message in the method
// named makeSubjWordList.

//It is important to maintain these two lists
// as separate lists. Because of the much
// larger number of characters in the body than
// in the Subject, false alarms are much more
// likely in the body. Therefore, individual
// words that work well when screening the
// Subject line may produce false alarms when
// screening the body. For example, the word
// PORN appears in the word IMPORTANT. It is
// much more likely that the word IMPORTANT
// will appear somewhere in the body than in
// the Subject line (although it may appear in
// the Subject line as well, thus producing a
// false alarm in both cases). Also, the word
// ANTIVIRUS works well in the Subject, but
// cannot be used to screen the body because
// many servers insert that word into the
// message header after they test the message
// for viruses. Also, IP addresses and URLs
// work well in the body, but rarely appear in
// the Subject. Therefore, testing the Subject
// against a long list of URLs simply wastes
// time.

//The following words (among others) should not
// be added to the list for the reasons given:

//PORN may be confused with IMPORTANT
//SPAM causes lots of false alarms. I inserted
// a space as in "SPAM " to decrease false
// alarms. Will probably also decrease valid
// hits.
//ANTIVIRUS appears in some valid message hdrs
//WEIGHT often appears in messages regarding
// html fonts
//SLUT may be confused with SOLUTION
//==End of prohibited list==


private void makeBodyWordList(){
bodyWordList = new TreeSet();

//Read word list from text file and populate
// the TreeSet object.
try{
BufferedReader inData
= new BufferedReader(new FileReader(
"Pop302b.txt"));
String data; //temp holding area

while((data = inData.readLine()) != null){
bodyWordList.add(data);

}//end while loop
inData.close();//Close file
}catch(Exception e){e.printStackTrace();}
}//end makeBodyWordList
//===========================================//

//Purpose: To create a TreeSet object
// containing friendly words and phrases used
// to screen the message From and Subject
// lines.
//This method reads strings from a text file
// named Pop302c.txt and creates the list as
// a TreeSet object with no duplicates.
//Only the primary portion of the friend's
// Email address should be included in the
// file used to create the list. This would
// be x@y.z

private void makeFriendlyWordList(){
friendlyWordList = new TreeSet();

//Read word list from text file and populate
// the TreeSet object.
try{
BufferedReader inData
= new BufferedReader(new FileReader(
"Pop302c.txt"));
String data; //temp holding area

while((data = inData.readLine()) != null){
friendlyWordList.add(data);

}//end while loop
inData.close();//Close file
}catch(Exception e){e.printStackTrace();}
}//end makeFriendlyWordList
//===========================================//
}//end class Screen

Listing 34
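
The comments on the removeStars method in the listing above point out that the method strips every asterisk, including any that were part of the original message, and suggest removing only those that fall on ten-character intervals. The following is a minimal sketch of that suggestion and is not part of the program. The class name StarCodecSketch and the method name removeInsertedStars are hypothetical. The sketch assumes that each line was produced by insertStars and was not altered afterward, and it omits the two trailing asterisks that removeStars appends.

```java
class StarCodecSketch {

  // Same behavior as insertStars in the listing: insert an
  // asterisk at index 9, 19, 29, ... for as long as the index
  // is less than the ORIGINAL length of the line.
  static String insertStars(String in) {
    StringBuilder sb = new StringBuilder(in);
    int length = in.length();
    for (int cnt = 9; cnt < length; cnt += 10) {
      sb.insert(cnt, '*');
    }
    return sb.toString();
  }

  // Hypothetical inverse: remove only the inserted asterisks.
  // Each inserted asterisk follows nine original characters
  // (ten for later groups), so a starred line of length m
  // contains m / 11 inserted asterisks (integer division).
  // Deleting from the highest index down keeps the lower
  // indices valid.
  static String removeInsertedStars(String starred) {
    StringBuilder sb = new StringBuilder(starred);
    int inserted = starred.length() / 11;
    for (int j = inserted - 1; j >= 0; j--) {
      sb.deleteCharAt(9 + 10 * j);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String original = "SUBJECT: 50% OFF *TODAY*";
    String starred = insertStars(original);
    String restored = removeInsertedStars(starred);
    System.out.println(starred);
    // The asterisks around TODAY survive the round trip.
    System.out.println(restored.equals(original)); // prints true
  }
}
```

Whether the extra bookkeeping is worthwhile depends on how often legitimate asterisks matter to your word lists. The blanket removal in the listing is simpler and does not depend on the line being unmodified.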

Sample file Pop302a.txt

AGE REVERSING PRODUCT
ANNUAL FEE
AT THE PUMP
AUTO BONANZA
AUTO WARRANTY
BECAREFUL DOWNLOADING MUSIC FILES
BOTOX
BOTTLES SOLD DAILY
C1ALIS
CIALIS
CODEINE
LEVITRA
NA1L-FUNGUS
NEVER REPAY YOUR CREDIT CARD DEBT
ORDER YOUR DRUGS
PATCH
PILLS
PILLZ
SEXUA1
SEXUAL
SILDENAFIL CITRATE
SIZE DOES MATTER
SLZE MATTERS
SOLVE THE PROBLEM DOWNSTAIRS
TERMINATE DEBT
TONER CARTRIDGES
V1@GRA
V1AGRA
VALIUM
VA|IUM
VI@GRA
VIAGRA
VIC0DIN
VICODAN
VICODIN
VIIAGRA
VLAGRA
VÍAGRA
XANAAX
XANAX
Z0LOFT
\/IAGR@

Listing 35
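
The sample list above carries several spellings of the same word, such as VIAGRA, V1AGRA, and VI@GRA. One plausible reason is that the matching rule in screenOnPhrase tolerates characters inserted between the letters of a phrase, but not characters substituted for them. The following sketch restates that rule in compact form to illustrate the difference. The class name SpanMatchSketch and the method name matchesWithSpan are hypothetical, and the sketch uses generics and StringBuilder where the listing uses raw collections and StringBuffer.

```java
import java.util.ArrayList;
import java.util.List;

class SpanMatchSketch {

  // Returns true if data contains the characters of phrase, in
  // order, with no more than spanLim extraneous characters
  // between any two adjacent characters of the phrase.
  static boolean matchesWithSpan(String data, String phrase,
                                 int spanLim) {
    // Keep only the characters of data that occur somewhere in
    // the phrase, and remember where each one came from.
    StringBuilder kept = new StringBuilder();
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < data.length(); i++) {
      char c = data.charAt(i);
      if (phrase.indexOf(c) >= 0) {
        kept.append(c);
        positions.add(i);
      }
    }

    // The phrase must appear in the compressed string.
    int start = kept.indexOf(phrase);
    if (start < 0) {
      return false;
    }

    // Measure the widest gap between adjacent matched
    // characters in the original data.
    int maxSpan = 0;
    for (int k = 1; k < phrase.length(); k++) {
      int span = positions.get(start + k)
               - positions.get(start + k - 1);
      maxSpan = Math.max(maxSpan, span);
    }
    return maxSpan <= spanLim + 1;
  }

  public static void main(String[] args) {
    // One inserted character between letters is tolerated.
    System.out.println(
        matchesWithSpan("BUY VI-A-GRA NOW", "VIAGRA", 1)); // true
    // Two inserted characters exceed spanLim = 1.
    System.out.println(
        matchesWithSpan("BUY VI--AGRA NOW", "VIAGRA", 1)); // false
    // A substituted character defeats the match entirely.
    System.out.println(
        matchesWithSpan("VI@GRA", "VIAGRA", 1)); // false
    // The false alarm described in the program comments.
    System.out.println(
        matchesWithSpan("IMPORTANT", "PORN", 1)); // false
    System.out.println(
        matchesWithSpan("IMPORTANT", "PORN", 2)); // true
  }
}
```

Like the method in the listing, the sketch examines only the first place where the phrase appears in the compressed string. A widely spaced occurrence early in a line can therefore hide a tightly spaced occurrence later in the same line.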

Sample file Pop302b.txt

123.456.789.123
HTTP://SOMESITE.ABC
AGE REVERSING PRODUCT
ANNUAL FEE
AT THE PUMP
AUTO BONANZA
AUTO WARRANTY
BECAREFUL DOWNLOADING MUSIC FILES
BOTOX
BOTTLES SOLD DAILY
C1ALIS
CIALIS
CODEINE
LEVITRA
NA1L-FUNGUS
NEVER REPAY YOUR CREDIT CARD DEBT
ORDER YOUR DRUGS
PATCH
PILLS
PILLZ
SEXUA1
SEXUAL
SILDENAFIL CITRATE
SIZE DOES MATTER
SLZE MATTERS
SOLVE THE PROBLEM DOWNSTAIRS
TERMINATE DEBT
TONER CARTRIDGES
V1@GRA
V1AGRA
VALIUM
VA|IUM
VI@GRA
VIAGRA
VIC0DIN
VICODAN
VICODIN
VIIAGRA
VLAGRA
VÍAGRA
XANAAX
XANAX
Z0LOFT
\/IAGR@

Listing 36

Sample file Pop302c.txt

BALDWIN@DICKBALDWIN.COM
MSNBC_BREAKINGNEWS_NEWSMAIL@MSNBC.COM
BOOKSTORE@INFORMIT.COM
ENews@SSA.GOV
Developer.com Update
MSNBC_DAILYMARKETCLOSE_NEWSMAIL@MSNBC.COM
ITSC 1313
ITSC1313

Listing 37
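
All three sample files are read the same way, one entry per line, into a TreeSet. The program converts each entry to upper case and skips empty entries at match time. The following sketch shows one possible variation that does both when the file is loaded. The class name WordListLoaderSketch and its methods are hypothetical, and reading the file as UTF-8 is an assumption; the listing uses FileReader with the platform default encoding. Entries are deliberately not trimmed, because the program comments describe an entry with a trailing space ("SPAM ").

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

class WordListLoaderSketch {

  // Build a sorted, duplicate-free set of upper-case entries.
  // Empty lines are dropped. Leading and trailing spaces are
  // preserved.
  static TreeSet<String> fromLines(List<String> lines) {
    TreeSet<String> words = new TreeSet<>();
    for (String line : lines) {
      if (!line.isEmpty()) {
        words.add(line.toUpperCase(Locale.ROOT));
      }
    }
    return words;
  }

  // Read a word-list file such as Pop302c.txt.
  static TreeSet<String> load(Path file) throws IOException {
    return fromLines(
        Files.readAllLines(file, StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException {
    // Write a small temporary file so that the sketch can run
    // without the real word lists.
    Path tmp = Files.createTempFile("Pop302c", ".txt");
    Files.write(tmp,
        Arrays.asList("ENews@SSA.GOV", "", "ITSC 1313",
                      "enews@ssa.gov"),
        StandardCharsets.UTF_8);
    TreeSet<String> friendly = load(tmp);
    Files.delete(tmp);
    // The empty line is dropped and the two spellings of the
    // same address collapse into one entry.
    System.out.println(friendly.size()); // prints 2
    System.out.println(friendly);
  }
}
```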

  

Copyright 2004, Richard G. Baldwin.  Reproduction in whole or in part in any form or medium without express written permission from Richard Baldwin is prohibited.

About the author

Richard Baldwin is a college professor (at Austin Community College in Austin, TX) and private consultant whose primary focus is a combination of Java, C#, and XML. In addition to the many platform and/or language independent benefits of Java and C# applications, he believes that a combination of Java, C#, and XML will become the primary driving force in the delivery of structured information on the Web.

Richard has participated in numerous consulting projects, and he frequently provides onsite training at the high-tech companies located in and around Austin, Texas.  He is the author of Baldwin's Programming Tutorials, which has gained a worldwide following among experienced and aspiring programmers. He has also published articles in JavaPro magazine.

Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.

Baldwin@DickBaldwin.com

-end-
 







