Enlisting Java in the War Against SPAM: The Screening Module
Java Programming Notes # 2152
- Preface
- Preview
- Discussion and Sample Code
- Preview of Future Lessons
- Run the Program
- Summary
- What's Next?
- Complete Program Listing
Preface
The communications module
The first lesson explained the communications module used to communicate with your Email server, and to remove SPAM messages from the server.
SPAM screening algorithm
The program is designed to allow you to use my SPAM screening algorithm, or to invent your own. This lesson explains the inner workings of my SPAM screening algorithm. You can use my algorithm as a starting point if you decide to invent your own.
Training the algorithm
The next two lessons will explain how my algorithm can be trained to do an increasingly better job of screening SPAM over time.
Viewing tip
You may find it useful to open another copy of this lesson in a separate browser window. That will make it easier for you to scroll back and forth among the different listings and figures while you are reading about them.
Supplementary material
I recommend that you also study the other lessons in my extensive collection of online Java tutorials. You will find those lessons published at Gamelan.com. However, as of the date of this writing, Gamelan doesn't maintain a consolidated index of my Java tutorial lessons, and sometimes they are difficult to locate there. You will find a consolidated index at www.DickBaldwin.com.
Preview
Can you write better SPAM screening algorithms?
Did you ever think that you might be able to write better SPAM
screening algorithms than those available in the SPAM screening
software that you are now using? If so, this series of lessons is
for you.
Even if that is not the case, like most of us, you are probably
overwhelmed by SPAM
and therefore you may find this lesson interesting.
Remove SPAM from the server
In this and the previous lesson, I am showing you how to write a
Java program
that supplements the SPAM screening software that you are currently
using. This program is used to identify and remove SPAM from your
Email server before it is downloaded into your primary Email client.
Any SPAM that makes it past this program can be further acted upon
by the SPAM screener that is built into your Email client.
The communications module
This series consists of (at least) four lessons. The
first lesson in the
series explained the communications module used to communicate with
your Email server, and to remove SPAM messages from the
server.
My SPAM screening algorithm
As mentioned above, this program is designed to allow you to invent and implement your own SPAM screening algorithm in addition to, or as an alternative to, my algorithm.
This lesson explains the inner
workings of my SPAM screening algorithm. My algorithm operates
separately on the Subject line, the From line,
and the body text of each Email message.
Algorithm training programs
The third lesson will explain a companion program named Pop302d,
designed to make
use of historical data to train the algorithm to do a better job
of identifying SPAM in future messages based on the Subject
of the message.
The fourth lesson will explain another companion program named Pop302e,
designed to
make use of historical data to train the algorithm to do a
better job of identifying SPAM based on the body text of
the message (which includes the From line).
Because of the need to train the algorithm, and the ease with which
these companion programs make that possible, the companion programs are
as important as the main program.
Operational sequence
Here is the typical operational sequence that I go through each
morning to remove SPAM from my Email server before downloading it into
my primary Email client, and to train the algorithm to recognize, in the
future, SPAM messages like those that made it through the screen that morning.
1. Run the main program named Pop302 (explained in this and the previous lesson) to identify SPAM and remove it from the server. This normally allows a few SPAM messages (stragglers, typically about ten percent) to get through, and these are stored in a history folder on my local disk.
2. Run the program named Pop302d (explained in the next lesson) to train the algorithm to recognize the stragglers as SPAM based on information in the Subject line.
3. Run the program named Pop302e (explained in Part 4) to train the algorithm to recognize the stragglers as SPAM based on information in the body text.
4. Go back and run the main program named Pop302 again to remove those SPAM straggler messages from the server.
5. Run my primary Email client to download the remaining good messages into my local Email inbox.
When I am in a hurry ...
However, it isn't necessary to perform all of these steps every
day. On those mornings when I am in a hurry, I skip steps
2, 3, and 4, leaving the straggler messages in the local history folder
for use later.
(The straggler messages will, of course, end up in my local Email inbox when I run my primary Email client without purposely removing them from the server beforehand.)
Sometime later (perhaps the next day or several days later)
I will perform steps 2 and 3 to train the algorithm to recognize future
SPAM
messages represented by the characteristics of the messages that have
been saved in the local
history folder.
Effectiveness of my algorithm
After about one week of training, my
algorithm was reliably identifying about ninety percent of all SPAM
messages, allowing me to delete them from my Email server before
downloading them into my primary Email
client. By executing steps 2, 3, and 4 above, I am able to also
eliminate the remaining ten percent of the SPAM messages before
downloading them into my primary Email client.
Discussion and Sample Code
The version of the program that I discussed in the previous lesson contained a stripped-down version of a class named Screen. This version of the program allowed for testing the communications module on your system with your Email server without doing any actual screening for SPAM.
I will explain the full version of the class named Screen in this lesson. In so doing, I explain my algorithm for identifying SPAM.
Purpose of the program
The purpose of this program is to read messages from a POP3 (Post Office Protocol - Version 3) server, to analyze the messages according to a set of screening rules, and to delete the messages that fail the screening test from the server.
(As written, the program asks the user to confirm the deletion of each message from the server, but this confirmation step could easily be removed if you decide to do so.)
Key words and phrases
My SPAM screening algorithm screens for SPAM on the basis of words or phrases in the From line, words or phrases in the Subject line, and words or phrases in the body text.
Friendly Email addresses and subjects
A list of friendly Email addresses and friendly subjects is used to screen the From line and the Subject line. Messages that are from friendly Email addresses, and messages that have known good Subject lines are preserved on the server and no information about those messages is saved on the local disk. They are simply ignored after determining that they are friendly.
Different lists for Subject and body text
Different lists of words and phrases are used for screening Subject lines and body text for SPAM. This is important because the same set of words and phrases can't always be used for both cases.
For example, the word ANTIVIRUS is appropriate for screening the Subject line, but is not appropriate for screening the body text. The word ANTIVIRUS often appears legitimately in the header of Email messages that have been scanned for viruses by the server, but it also often appears in the Subject line of SPAM messages.
Common spammer tricks are defeated
Several common spammer tricks are defeated by my SPAM screening algorithm.
For example, the common spammer trick of inserting extra characters between the characters in an offending word or phrase is defeated. Also, the common trick of mixing the case of the characters in an offending word or phrase is defeated.
As a specific example, my algorithm will recommend deletion of any message having any of the following in its Subject line or its body text if the word VIAGRA is included in the lists used to screen for SPAM:
vIaGrA
V.IagRA
V.I.A.G.R.A
Very important characteristics
These two characteristics of the algorithm alone have a significantly positive impact on the effectiveness of training the algorithm to do a better job of identifying SPAM in the future.
(You don't have to identify all of the variations of a word or phrase commonly used by spammers to fool the system. The program does that for you automatically.)
My algorithm also defeats the common trick of appending random characters to the end of the Subject line, because it doesn't require a match for the entire Subject line. Rather, it searches for words or phrases internal to the text of the Subject line.
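The following is a minimal sketch of one way to get this kind of tolerant matching. It is not the code from the Screen class. It assumes that a match may skip up to maxGap arbitrary characters between consecutive characters of the listed phrase and that case is ignored. The class name LooseMatchSketch, the method names, and the maxGap parameter are all hypothetical.

```java
import java.util.regex.Pattern;

class LooseMatchSketch {

    // Builds a case-insensitive pattern that tolerates up to maxGap
    // arbitrary characters between consecutive characters of the phrase.
    // (Assumed matching rule, chosen for illustration only.)
    static Pattern loosePattern(String phrase, int maxGap) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < phrase.length(); i++) {
            if (i > 0) {
                regex.append(".{0,").append(maxGap).append("}");
            }
            // Quote each character so that punctuation in a phrase is literal.
            regex.append(Pattern.quote(String.valueOf(phrase.charAt(i))));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Returns true if the phrase occurs anywhere inside the line, so
    // random characters appended to the end of the line do not matter.
    static boolean looseContains(String line, String phrase, int maxGap) {
        return loosePattern(phrase, maxGap).matcher(line).find();
    }

    public static void main(String[] args) {
        String[] subjects = {
            "vIaGrA", "V.IagRA", "V.I.A.G.R.A", "Cheap V.I.A.G.R.A xqzt"
        };
        for (String s : subjects) {
            System.out.println(s + " -> " + looseContains(s, "VIAGRA", 1));
        }
    }
}
```

With a gap of one character, all four sample Subject lines are reported as matches for VIAGRA. The same tolerance is also what allows a short phrase to match inside an innocent longer word, so a larger gap trades more hits for more false positives.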
The user interface
Figure 1 shows the GUI through which the user controls the program.

Figure 1 Graphical User Interface
(Note that this GUI was purposely made narrow to cause it to fit into this narrow publication format. I recommend that you increase the width of the Frame to at least 750 pixels, and increase the width of the TextField and TextArea objects to at least 100 characters each.
Note also that this is an actual SPAM message, from which I purposely removed the Email address of the sender prior to publication. The message may not have actually been sent by the individual whose Email address appeared on the From line.)
The Offending Phrase
When the program identifies a message that is a candidate for deletion, the reason for that recommendation is shown in the third text field from the top in Figure 1.
Deleting a message from the server
The user confirms that the message should be deleted from the server by clicking the Delete button in Figure 1. If the user doesn't want to delete the message, he should click the Start/Next button instead.
(Note that the capability to actually delete messages from the server was disabled in the program shown in Listing 34 near the end of this lesson. Make certain that you are ready to actually delete messages from the server before you enable that capability.)
Information available at decision time
As currently written, this program requires the user to confirm the actual deletion of each SPAM message from the server before that message is actually deleted.
At the point in time that the user is required to confirm deletion of a message from the server, the following information is available to assist the user in making the decision:
- From line
- Subject line
- Offending line of text, which may or may not be the subject
- Offending word or phrase in the offending line of text
- Entire raw text of the message down to and including the offending line
No images are rendered by the program, so it is not necessary for the user to view offending images in order to make the decision to delete.
Deletion is not required
Having viewed the above information, if the user is still unable to make an informed decision to delete the message, the user has the option to let the message pass through and be downloaded into his primary Email client. Having viewed the message in the primary Email client, the user still has the option of updating the offending word lists with IP addresses, URLs, etc., so that deletion decisions on future similar messages will be easier to make.
Saved in local archive folder
The raw text of every message that is identified as a candidate for deletion from the server is saved in an archive folder on the local disk, regardless of whether the user elects to delete the message from the server or not. Thus, if a message is deleted from the server and it is later determined that this was a mistake, a raw text copy of the deleted message is available locally in the archive folder.
(You should probably empty this folder periodically so that it won't fill up your disk.)
In addition, I have plans to write several additional programs that will analyze large numbers of SPAM messages in the archive folder for at least two purposes:
- To remove from the word lists those words and phrases that occur in only a very small percentage of SPAM messages and therefore increase run time without contributing significantly to the desired result.
- To search for common characteristics among SPAM messages that can be used to improve the effectiveness of the screening algorithm.
Except for messages from friendly Email addresses and messages with friendly Subject lines, all messages that are not identified as candidates for deletion from the server are saved in a history folder on the local disk. These messages are used later to train the algorithm to do a better job of identifying future SPAM messages. I will explain this training process in Part 3 and Part 4 of this series of lessons.
Protection against viruses
Before any message is saved in a local file, asterisks are inserted into the text on ten-character intervals in an attempt to destroy any virus code that may be embedded in the message.
If a message makes it through the screen and is later identified as having a virus as an attachment, a series of ten or more bytes can be extracted from the virus code and added to the word list as an offending phrase. This should cause any future messages having that same virus code as an attachment to be identified as a candidate for deletion from the server.
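As a rough illustration of the asterisk insertion described above, here is a minimal sketch. It assumes that "ten-character intervals" means one asterisk after every ten characters of the original text. The actual code lives in the communications module from the previous lesson and may differ. The class name StarInsertSketch and the method insertStars are hypothetical.

```java
class StarInsertSketch {

    // Inserts an asterisk after every `interval` characters of the input.
    // (Assumed interpretation of "ten-character intervals".)
    static String insertStars(String raw, int interval) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            out.append(raw.charAt(i));
            if ((i + 1) % interval == 0) {
                out.append('*');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(insertStars("ABCDEFGHIJKLMNOPQRSTUVWXY", 10));
    }
}
```

Under this assumption, no run of more than ten bytes of the original message survives intact in the saved file, which is the property the virus protection relies on.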
Training programs
Companion programs that I have written are used to analyze the non-deleted message files saved locally in the history folder in order to train the algorithm to do a better job of identifying SPAM messages in the future.
These programs are designed for extreme ease of use to encourage the user to train the algorithm frequently. The better the algorithm is trained, the better it will perform.
I will explain these training programs in detail in Part 3 and Part 4 of this series of lessons. A brief preview of the training programs is provided below.
Simple text files
All three word lists are maintained in local text files, which can be created and edited with an ordinary text editor if need be. Thus, if one of the lists becomes corrupted, it is easy to correct the situation using an ordinary text editor.
File names
The following file names are hard-coded into the program. You may want to change these file names for your version of the program.
- Local copy - the unique file name for a local copy of each message is based on the unique identifier for that message (UIDL) obtained from the mail server.
- Pop302a.txt - contains a word list for screening the Subject line for offensive words and phrases.
- Pop302b.txt - contains a word list for screening the body text for offensive words and phrases.
- Pop302c.txt - contains a list of friendly Email addresses and friendly subjects for screening the From and Subject lines to identify friendly messages.
As written, the program requires the three .txt files to be in the same folder as the compiled .class files for the programs named Pop302, Pop302d, and Pop302e. However, you can easily modify the programs to change the location of the .txt files if you choose to do so. Just be sure to change the location in all three programs.
The local copies of the messages are stored in two different folders. Some of the local copies are stored in a history folder while the remainder are stored in an archive folder. The locations of these folders on the disk are hard-coded into the three programs. You can change the locations if you like, but be sure to make appropriate changes to all three programs.
This program consists of two main classes and one minor class. As discussed in the previous lesson, an object of the class named Pop302 handles all communications with the POP3 server.
A method belonging to an object of the class named Screen is used to screen each message in an attempt to identify SPAM. This is the class that I will explain in this lesson.
This class can be totally replaced by Java programmers who choose to design their own screening algorithm provided that they maintain the interface with the object of the class named Pop302.
An object of a very simple class named ScreenResult is used as a wrapper to return several items of information from the screening method to the calling method.
Testing
The program was tested using SDK 1.4.2 under WinXP in conjunction with two different POP3 Email servers.
Will discuss in fragments
I will discuss the class named Screen in fragments. A complete listing of the program is provided in Listing 34 near the end of the lesson. You should be able to copy and paste that listing into your Java IDE to compile and test the program on your system.
Improvements in the class named Pop302
Before getting into the details of the class named Screen, I want to mention that the program shown in Listing 34 contains a couple of improvements relative to the version explained in the previous lesson.
One of the improvements involves displaying the message number in the bottom of the text area of Figure 1.
The other improvement involves making a safety check to confirm that the message number being maintained locally is in synchronization with the message number on the server (in the UIDL) before deleting a message from the server.
If you understand the rest of the program, these two modifications should not require a detailed explanation.
Deletion of messages is disabled
Also before getting into the details of the Screen class I want to show you a fragment containing three statements that are disabled in the Pop302 class in Listing 34. The three disabled statements are shown in Listing 1. (Note that the statements are separated by comments in Listing 34.)
/*Begin comment block |
The Screen class
The Screen class implements a set of rules for identifying SPAM messages and for recommending whether or not a message should be deleted from the server.
If you have a better way to identify SPAM, you can replace this class with a completely different class definition, so long as you maintain the interface with the object of the class named Pop302.
An object of this class has one entry point and one exit point, which is the public instance method of the Screen class named screenMsg.
A callback to the GUI
However, there is an additional linkage between the two objects that you need to consider. The constructor for the Screen class receives a reference to the GUI object created by instantiating the class named Pop302. A method in the object of the Screen class uses that reference to display progress on the text area belonging to the GUI.
This display of progress is comforting on those occasions when a very long message is encountered and the user needs assurance that the system is still working, and isn't hung up.
This callback link could easily be eliminated by deleting code from several locations in the Screen class and removing the callback parameter from the constructor.
Beginning of the Screen class
The Screen class begins in Listing 2, which declares several instance variables.
class Screen{ |
The constructor
The constructor for the Screen class is shown in its entirety in Listing 3.
Screen(Pop302 theGui){//constructor |
Make word lists as TreeSet objects
The last three statements in the constructor invoke methods that read text files containing lists of words or phrases, and create TreeSet objects containing those words and phrases. These TreeSet objects are used later to test for the occurrence of the words or phrases in raw text versions of Email messages.
The TreeSet objects are created and populated by invoking three very similar methods:
- makeSubjWordList
- makeBodyWordList
- makeFriendlyWordList
I will discuss each of these methods in the sections that follow.
The makeSubjWordList method
The purpose of the makeSubjWordList method is to create a TreeSet object containing words and phrases used later to screen the message Subject lines.
The makeSubjWordList method is shown in Listing 4. This method reads strings from a text file named Pop302a.txt and creates the list as a TreeSet object.
private void makeSubjWordList(){ |
The TreeSet class was chosen for this purpose because it eliminates duplicates.
(Duplicates in the list are bad because they increase runtime with no beneficial effect. One of the major problems with the message filter in the commercial Email client program that I use is that there is no way to avoid duplicates other than simply remembering that an item was previously placed in the filter.)
With my screening algorithm, even if the user creates duplicates in the text file while training the algorithm, duplicates are eliminated from the TreeSet object and also from the text file before actual processing begins.
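The duplicate-elimination idea can be sketched as follows. This is not the code from Listing 4. It is a minimal illustration that assumes one word or phrase per line and uses the java.nio.file API, which is newer than the SDK used for this lesson. The class name WordListSketch and the method loadAndClean are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.TreeSet;

class WordListSketch {

    // Reads one word or phrase per line into a TreeSet, then rewrites the
    // file from the set so that the file is also free of duplicates.
    static TreeSet<String> loadAndClean(Path listFile) throws IOException {
        TreeSet<String> words = new TreeSet<>();
        for (String line : Files.readAllLines(listFile)) {
            // Entries are kept verbatim (no trimming) so that a phrase
            // with a significant trailing space is preserved.
            if (!line.isEmpty()) {
                words.add(line); // a TreeSet silently ignores duplicates
            }
        }
        Files.write(listFile, words); // one entry per line, sorted
        return words;
    }
}
```

Note that a TreeSet with the default ordering treats entries that differ only in case as different strings, so this sketch removes exact duplicates only.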
The code in Listing 4 is straightforward and shouldn't require further explanation.
The makeBodyWordList method
The purpose of the makeBodyWordList method is to create a TreeSet object containing words and phrases used later to screen the text in the body of the message.
Separation of lists is important
It is important to maintain separate lists for screening the Subject line and the body text. Because of the larger number of characters in the body text, false positives are more likely when screening the body text.
(A false positive arises when a message that is not SPAM fails one of the SPAM screening rules and is identified as SPAM by the screening algorithm.)
Some words work well and some don't
Therefore, some words and phrases that work well when screening the Subject line may produce false positives when screening the body text. For example, the common spammer word SLUT appears in the word SoLUTion with only one character separating the S and the L. It is much more likely that the word SOLUTION will appear somewhere in the body text than in the Subject line (although it may appear in the Subject line as well, thus producing a false positive in either case).
On a more definitive note, the word ANTIVIRUS works well when screening the Subject line, but cannot be used to screen the body text. Many servers insert the word ANTIVIRUS into the message header after they test the message for viruses. On the other hand, the word ANTIVIRUS often appears in the Subject line of SPAM messages.
IP addresses and URLs
IP addresses and URLs can be very useful in identifying SPAM during the screening of the body text. However, they rarely occur in the Subject line. Therefore, testing the Subject line against a long list of IP addresses and URLs simply wastes computer time.
Some words to avoid
The following words (among others) probably should not be included in the list used to screen body text for the reasons given. Undoubtedly you will identify other words and phrases that should be excluded from the list as you gain experience with the system.
- PORN may be confused with IMPORTANT.
- SPAM causes lots of false positives. As a remedy, I inserted a space following the M as in "SPAM " to decrease false positives. (This may also decrease valid hits as well.)
- ANTIVIRUS appears in some valid message headers.
- WEIGHT often appears in messages regarding HTML fonts.
- SLUT may be confused with SOLUTION.
The code for the makeBodyWordList method is shown in Listing 5. This method reads strings from a text file named Pop302b.txt and creates the list as a TreeSet object.
private void makeBodyWordList(){ |
The makeFriendlyWordList method
The purpose of the makeFriendlyWordList method is to create a TreeSet object containing words and phrases used to pre-screen the message From and Subject lines before screening against the SPAM lists. The objective of this pre-screening step is to identify messages that claim to be from an approved list of senders (often referred to as a white list), or messages with a known good Subject line.
Messages that have From or Subject lines matching the words or phrases in this list are not deleted from the server, and are not subjected to screening for SPAM.
Format for Email addresses
When adding Email addresses to the list contained in the text file, only the primary portion of the friendly Email address should be included. For example, you will often see an Email address presented as follows:
Mary Smith <msmith@somewhere.com>
In this case, only the following portion should be included in the friendly list:
msmith@somewhere.com
This is the portion that is most likely to remain stable over time. The remaining portion is simply window dressing added to the primary Email address by the program used to compose and address the message.
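To see why the bare address is enough, consider the following minimal sketch of a friendly pre-screen. It assumes that a friendly entry matches when it occurs anywhere in the From line or the Subject line, ignoring case. That rule, the class name FriendlySketch, and the method isFriendly are assumptions made for illustration and are not taken from the Screen class.

```java
import java.util.Set;

class FriendlySketch {

    // Returns true if any friendly entry occurs, ignoring case, anywhere
    // in the From line or the Subject line. (Assumed matching rule.)
    static boolean isFriendly(String fromLine, String subjectLine,
                              Set<String> friendly) {
        String from = fromLine.toUpperCase();
        String subject = subjectLine.toUpperCase();
        for (String entry : friendly) {
            String key = entry.toUpperCase();
            if (from.contains(key) || subject.contains(key)) {
                return true;
            }
        }
        return false;
    }
}
```

With the entry msmith@somewhere.com in the list, a From line of either "Mary Smith <msmith@somewhere.com>" or "M. Smith <msmith@somewhere.com>" would be treated as friendly, because the match does not depend on the display name.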
Code for the makeFriendlyWordList method
The makeFriendlyWordList method is shown in Listing 6. This method reads strings from a text file named Pop302c.txt and creates the list as a TreeSet object.
private void makeFriendlyWordList(){ |
The screenMsg method
Up to this point, the code that I have presented has been rather mundane. However, things should start getting a little more interesting at this point.
Setting the stage
The statement in Listing 7 was extracted from the Pop302 class in Listing 34. This is the statement that ties the communication module (an object of the Pop302 class) to the screening module (an object of the Screen class).
match = screener.screenMsg( |
At this point in the execution of the program, the communication module has retrieved a message from the server and has written it into a file on the local disk with the path and file name given by fileName.
(The file name is based on the server's unique identifier for the message, given by uidl in Listing 7.)
Pass the file to the screenMsg method
The communication module passes the file name for the disk file containing a raw text copy of the message to the method named screenMsg where it will be screened for SPAM.
The screenMsg method needs fileName in order to read the file from the disk to perform the screen.
Why does the screenMsg method need uidl?
If the file is identified as SPAM, it will be moved from the history folder to an archive folder. In order to do that, the screenMsg method needs to create a path and file name pointing to the archive folder. For this, it needs the unique identifier, uidl.
(Obviously I could have parsed fileName inside the screenMsg method to get uidl, but I found it easier to simply pass it as a parameter to the screenMsg method.)
What is theResult?
In addition to fileName and uidl, the method needs a reference to an empty object of the class ScreenResult. It populates that object in order to send information back to the calling method. That object's reference is represented by theResult in Listing 7. The screenMsg method populates this object with several pieces of data for later use by the communications module.
The screenMsg method returns boolean
The screenMsg method returns a boolean value, which is assigned to match in Listing 7.
If the return value is true, the screenMsg method has concluded that the message is SPAM and is a candidate for deletion from the server.
(Recall, however, that the communications module allows the user to make the final decision regarding deletion of the message from the server.)
If the return value is false ...
If the return value is false, the screenMsg method has found nothing to indicate that the message is SPAM.
(The message might not be SPAM, or it might be a form of SPAM that the algorithm doesn't yet know how to identify. If it is deemed by the user that the message is SPAM, the message will be used later to train the algorithm to recognize that form of SPAM in future messages.)
About 90 percent effective
As of this writing, I am finding that the algorithm is able to identify about 90 percent of SPAM messages on average. The remaining ten percent of the SPAM messages are used to further train the algorithm to recognize SPAM of that type in the future. I am hopeful that this performance will improve in the future as the algorithm becomes better trained.
Beginning of the screenMsg method
The code for the screenMsg method begins in Listing 8.
public boolean screenMsg(String fileName, |
The code in Listing 8 initializes the variable named match to false. This is the value that will be returned from the method if it is not overwritten later by the discovery of a match between a test phrase and a text line in the message.
Purpose of the screenMsg method
The screenMsg method is used to identify messages that are candidates for deletion from the server. Such identification is based on analyzing the file in which the message is stored locally, and comparing that file with the contents of TreeSet objects populated earlier with the contents of the files named Pop302a.txt, Pop302b.txt, and Pop302c.txt.
A return value of true means that the message is identified as SPAM and should be deleted from the server.
Returned String values
In addition to the boolean return value, references to four String objects are encapsulated in the incoming object of type ScreenResult. The populated object is used later by the calling method in the communication module. These String objects represent:
- The text of the message's Subject line.
- The text of the message's From line.
- The offending word or phrase (if any) that was found in the Subject line or in the body of the message, which includes the From line.
- The raw text of the message down to the line that includes the
offending word or phrase, or the entire raw text of the message if no
offending words or phrases were found.
This information is displayed in the GUI in Figure 1 by the communication module, which is an object of the class named Pop302. This information is presented to help the user make an informed decision regarding deletion of the message from the server.
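The pattern of returning a boolean while also filling in a caller-supplied result object can be sketched as follows. This is a simplified illustration, not the ScreenResult class or the screenMsg method from Listing 34. It screens a list of lines in a single pass using plain case-insensitive substring tests, and every class, field, and method name in it is hypothetical.

```java
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a result wrapper holding four strings.
class ResultHolder {
    String subject = "";
    String from = "";
    String offendingPhrase = "";
    String rawText = "";
}

class ScreenContractSketch {

    // Returns true if any phrase is found. In either case the result
    // object is populated for later use by the caller.
    static boolean screen(List<String> lines, Set<String> phrases,
                          ResultHolder result) {
        StringBuilder seen = new StringBuilder();
        for (String line : lines) {
            seen.append(line).append('\n');
            String upper = line.toUpperCase();
            if (upper.startsWith("SUBJECT:")) result.subject = line;
            if (upper.startsWith("FROM:")) result.from = line;
            for (String phrase : phrases) {
                if (upper.contains(phrase.toUpperCase())) {
                    result.offendingPhrase = phrase;
                    // Raw text down to and including the offending line.
                    result.rawText = seen.toString();
                    return true;
                }
            }
        }
        result.rawText = seen.toString(); // no match: entire raw text
        return false;
    }
}
```

The caller creates an empty ResultHolder, passes it in, and reads the fields afterward, which avoids the need for a method that returns more than one value.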
(If the screenMsg method returns false, the program doesn't pause for the user to make such a decision, and processing of the next message on the server begins immediately, with all of the above information having been removed from the GUI.)
Refer back to Figure 1
The list of items placed in the ScreenResult object includes the offending word or phrase that was found in the Subject line or in the body of the message. Referring back to Figure 1, the Offending Phrase for the message shown in Figure 1 was V1AGRA.
(This was an easy one because there were no extraneous characters inserted in the offending word, although the spammer did use a numeral 1 character in place of an I.)
The Subject line
Also in Figure 1, the Offending Phrase was found in the Subject line, which means that the program made the decision very quickly (it didn't have to examine a large amount of body text in order to make a decision).
Data in the text area
As shown in the large text area in Figure 1, this was Message Number 5 in the dropbox on the server.
The raw message text is displayed in the text area down to the line that contained the Offending Phrase, which in this case was the Subject line.
The From line
The top-most text field in Figure 1 originally contained an Email address that purportedly was the address of the sender of the message. However, I suspect that the person identified by that Email address wasn't actually the sender of the message, so I deleted the Email address before publishing this image.
(On the basis of the earliest RECEIVED: FROM line in the raw message text in Figure 1, the message appears to have been sent by a computer having an IP address of 204.85.84.207. However, given the identity of the organization to which that IP address is assigned (according to a WHOIS Database Search at http://www.arin.net/whois/), this seems somewhat unlikely as well. But, one never knows who may be spamming. Maybe that computer is infected with a Trojan horse that is broadcasting SPAM messages without the knowledge of its owner.)
Open the disk file for reading
The code in Listing 9 opens the file containing the local copy of the raw message text for reading. Listing 9 also declares a local variable named data that will be used in the file reading process.
try{ |
(Recall from the previous lesson that asterisks were inserted into the data on ten-character intervals in an attempt to destroy any executable virus code that may be included in the byte stream.)
Prepare to process the file
The code in Listing 10 is executed in preparation for processing the file containing the raw message data.
inData.mark(10000); |
The first statement in Listing 10 marks the beginning position in the input stream. Subsequent calls to reset will attempt to reposition the stream to this point.
This mark will be used later to rewind the stream to the beginning.
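The mark-and-rewind pattern can be tried in isolation. The following is a minimal sketch, assuming that inData is a BufferedReader (the class and method names here are hypothetical, and a StringReader stands in for the disk file). Note that the argument to mark is a read-ahead limit: if more than that many characters are read before reset is called, the reset may fail.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class MarkResetSketch {
    // Reads lines until one starts with the given upper-case prefix,
    // then rewinds to the mark and returns the first line again.
    static String findThenRewind(String text, String prefix) {
        try (BufferedReader inData =
                 new BufferedReader(new StringReader(text))) {
            // Remember the start of the stream. The argument is the
            // number of characters that may be read before the mark
            // is allowed to become invalid.
            inData.mark(10000);
            String data;
            while ((data = inData.readLine()) != null) {
                if (data.toUpperCase().startsWith(prefix)) {
                    break; // found the line of interest
                }
            }
            // Rewind to the marked position (the beginning).
            inData.reset();
            return inData.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String msg = "Received: from example\nFrom: someone\n"
                   + "Subject: hello\n\nBody text";
        // After the rewind, the first line is read a second time.
        System.out.println(findThenRewind(msg, "SUBJECT:"));
    }
}
```

For messages whose headers may run longer than the read-ahead limit, a larger limit (or reopening the file) would be the safer choice.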
Populate the ScreenResult object
The last two statements in Listing 10 populate two of the fields in the ScreenResult object with default values, just in case one or the other of the corresponding lines are not found in the message. If the lines are found in the message, these default values will be overwritten with the actual data from the message.
The removeStars method
Before going any further, I am going to put the discussion of the screenMsg method on hold for a moment and discuss the removeStars method shown in Listing 11.
private String removeStars(String stringIn){ |
(Note that this method removes all asterisks, not just those inserted earlier. If this proves to be a problem, this method should be modified to remove only those asterisks that occur on ten-character intervals.)
The code in this method is straightforward and shouldn't require further explanation.
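For readers who want something concrete to experiment with, here is one possible way to write such a method. This is only a sketch under the assumption that every asterisk is to be removed; the class name is hypothetical, and the body is not copied from Listing 11.

```java
class RemoveStarsSketch {
    // Removes every asterisk from the incoming string, including any
    // that were part of the original message text.
    static String removeStars(String stringIn) {
        return stringIn.replace("*", "");
    }

    public static void main(String[] args) {
        // Asterisks inserted for safety and asterisks typed by the
        // spammer are treated alike.
        System.out.println(removeStars("SUBJECT: V*I*A*G*RA F*OR YOU"));
    }
}
```

A side effect of removing all asterisks is that a spammer's own asterisks disappear as well, which happens to help the screening process in cases such as V*I*A*G*R*A.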
Return to discussion of the screenMsg method
Returning to the discussion of the screenMsg method, we are ready to examine the code used to screen the Subject line. That code begins in Listing 12.
while((data = inData.readLine()) != null){ |
If it runs out of lines before finding the Subject line, the loop will terminate. If it finds the Subject line before running out of lines, the line will be processed and a break will be executed to terminate the loop. Thus, the code in this loop will process only the Subject line.
(Note that the removeStars method is called to remove the asterisks from the data and the data is converted to upper case before testing for SUBJECT:)
Populate the output object
We are now in the body of the if statement begun in Listing 12. A match for SUBJECT: has been found. The code in Listing 13 populates the output object with the Subject line data, overwriting the default value put there by the code in Listing 10.
theResult.subject = data.toUpperCase(); |
Screen against friendly words and phrases
The next step is to screen the Subject line data against the words and phrases in the friendly list. If a word or phrase from the friendly list appears in the Subject line data, the message will be preserved on the server and will not be subjected to SPAM screening.
In addition, the local copy of the message currently located in the history folder will be deleted, because it is not considered to be SPAM. That way, the message will not be used later when training the algorithm to do a better job of identifying SPAM.
Executing the screen
The code that screens the Subject line data against the friendly list begins in Listing 14.
Iterator iterator = |
An Iterator loop
The code in Listing 14 shows the beginning of an Iterator loop, used to iteratively extract each word or phrase from the friendly list and compare it with the Subject line data. The comparison is actually rather complex and is performed in a method named screenOnPhrase, which I will discuss later.
As each friendly word or phrase is extracted from the friendly list, it is passed, along with the Subject line data, to the method named screenOnPhrase.
That method will return true if a match is found, and will return false if no match is found.
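The shape of this loop can be sketched in a few lines. The sketch below is an illustration only: the class and method names are hypothetical, and a plain indexOf test stands in for the call to screenOnPhrase, which is a fair substitute only when no extraneous characters are allowed.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.TreeSet;

class FriendlyScreenSketch {
    // Stand-in for screenOnPhrase with no extraneous characters
    // allowed: a plain substring test.
    static boolean containsPhrase(String data, String phrase) {
        return data.indexOf(phrase) >= 0;
    }

    // Returns the first friendly phrase found in the Subject line,
    // or null if none of the phrases occurs in it.
    static String firstFriendlyMatch(String subjectLine,
                                     TreeSet<String> friendlyWordList) {
        String data = subjectLine.toUpperCase();
        Iterator<String> iterator = friendlyWordList.iterator();
        while (iterator.hasNext()) {
            String phrase = iterator.next();
            if (containsPhrase(data, phrase)) {
                return phrase; // one match is enough
            }
        }
        return null; // list exhausted without a match
    }

    public static void main(String[] args) {
        // Entries are assumed to be stored in upper case.
        TreeSet<String> friendly = new TreeSet<>(
            Arrays.asList("JAVA TUTORIAL", "PTA MEETING"));
        System.out.println(firstFriendlyMatch(
            "Subject: Question about your Java tutorial", friendly));
        System.out.println(firstFriendlyMatch(
            "Subject: V1AGRA", friendly));
    }
}
```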
Extraneous characters
The third parameter in the call to the screenOnPhrase method specifies the number of extraneous characters allowed to occur between the characters in the data and still declare a match to be true. In this case, a value of zero is passed for this parameter, meaning that no extraneous characters are allowed.
(As it turns out, the value of zero results in a trivial case, and I could have accomplished this more simply than by invoking the rather complex code in the screenOnPhrase method. However, when I wrote the prototype for the program, I hadn't decided that I was going to use a value of zero here.)
Behavior of the screenOnPhrase method
Basically, in this case, the screenOnPhrase method is testing to see if the friendly word or phrase occurs anywhere within the Subject data line, and if so, it will return true. Otherwise, it will return false.
(Note that everything has been converted to upper case at this point, so matching the case is not an issue.)
At this point, I can either branch off and discuss the screenOnPhrase method, or continue discussing the code in the screenMsg method. I have decided to do the latter, and explain the inner workings of the screenOnPhrase method later.
If a match was found
Listing 15 shows what happens if the current friendly word or phrase was found in the Subject data line (the current word or phrase is that word or phrase most recently extracted from the friendlyWordList by the iterator).
if(match == true){ |
The first thing that happens when a match is found is that the input stream is closed and the file containing the raw message is deleted from the folder in which SPAM history messages are stored. The rationale is that this is not a SPAM message, and should not be used later when training the algorithm to do a better job of identifying SPAM.
Populate the output object
The next thing that happens is that the matching phrase is stored in one of the fields of the ScreenResult object.
(In this case, that String isn't currently used in any significant way by the communications module, but it is available to be used in the future if needed.)
Return a false value
Perhaps the most significant thing that happens in Listing 15 is that the screenMsg method terminates and returns a false value to the calling method in the communications module (see Listing 7). That essentially terminates the processing of this message. It is preserved on the server and is not subjected to further screening for SPAM.
If no match was found
If none of the words or phrases in the friendlyWordList match the Subject data line, control will fall out of the loop at the bottom of Listing 15 when the data in the friendlyWordList is exhausted. This will transfer control to the code in Listing 16.
break; |
The break statement in Listing 16 is inside the body of the if statement that began in Listing 12. This code is being executed because a data line was found that starts with SUBJECT:
If no match was found, the return statement in Listing 15 would not be executed, and control would reach this point. Since this part of the code deals exclusively with the Subject line, and a Subject line was found, there is no point in reading any more input lines. Hence the break statement in Listing 16 terminates the read loop that began in Listing 12. No more data will be read from file in this part of the code.
Rewind the input stream to the beginning
The code in Listing 17 resets the stream back to the mark that was set on the stream in Listing 10. Since that mark was set at the beginning of the file, this code rewinds the data file back to the beginning.
inData.reset(); |
At this point, the subject field in the output ScreenResult object contains Subject line data if a Subject line was found, or contains a message to the effect that no Subject line was found. This latter message was put there by default in Listing 10, and was not overwritten if no Subject line was found.
Process the From line
The next step in the process is to process the From line for the purposes of:
- Returning the From data in one of the fields of the ScreenResult object if a From line exists.
- Determining if the message was sent from a friendly Email address. If so, return false, causing this message to be preserved on the server and exempt from SPAM screening.
while((data = inData.readLine()) != null){ |
Assuming that the method did not return false as the result of a match on a friendly Email address, the code in Listing 18 also resets the input stream in preparation for screening the entire message for SPAM.
Under that same assumption, at this point, the from field in the output ScreenResult object contains From line data if a From line was found, or contains a message to the effect that no From line was found. This latter message was put there by default in Listing 10, and was not overwritten if no From line was found.
Missing Subject line and From line is rare
Experience indicates that it is very rare to receive an Email message that doesn't contain both a FROM: line and a SUBJECT: line in the header, although either or both may not contain any characters to the right of the space following the colon. In fact, it is very common for the Subject line of SPAM messages to be completely blank. (Perhaps that causes people to read the messages out of curiosity.)
The screenOnPhrase method
While discussing both the From line and the Subject line, I asked that you simply accept that the screenOnPhrase method can determine if one upper-case String object is contained as a substring within another upper-case String object. I did that because there is no great challenge to programming such an operation when there are no extraneous characters. In fact, one of the indexOf methods of the String class, which searches for a substring within a String, can accomplish this very handily.
No extraneous characters allowed
This assumes, of course, that extraneous characters are not allowed between the characters of the substring within the String. That was the case in processing the From line and the Subject line above because I set the third parameter value to zero in the method call. However, that is not the case in searching for offending words and phrases in a SPAM screen.
A SPAM example
For example, here is a typical Subject line taken from a SPAM message in my archive folder.
Subject: T@ke 5O% off Ge|neric V*i*a*g*r*a 0nline t:0day
In order to recognize that the Subject line contains the word Viagra, it is necessary that the program be able to ignore the asterisks that separate the letters of the word V*i*a*g*r*a.
In order to recognize that the line contains the word Generic, it is necessary that the program be able to ignore the vertical bar that separates the e and the n in Ge|neric.
In order to recognize that the line contains the word t0day, it is necessary that the program be able to ignore the colon that separates the t and the 0 in t:0day.
Other common spammer tricks
This example illustrates another common spammer trick of switching zero characters with alphabetic O characters, replacing the lower-case a character with the @ character, etc.
I haven't attempted to automate the resolution of substitution issues such as this, and probably won't. Given the training programs that I will explain in the next two lessons, it is easy to use historical SPAM data to train the algorithm to recognize these variations. If such a mangled word occurs only once in the SPAM history folder, the algorithm can be trained in a single training session to recognize it as SPAM in all future messages.
Ignoring extraneous characters
Now getting back to the issue of ignoring extraneous characters in the offending words and phrases, that is what the method named screenOnPhrase knows how to do very well.
(It is also what the Email filtering capability of the commercial Email client that I use doesn't know how to do at all.)
The capability to ignore extraneous characters is one of the keys to a successful SPAM screening program.
(The spammers never make it easy to identify and block their Email messages. That is why a successful SPAM screening program must have the capability to learn each new spammer trick as soon as it appears through an ongoing, simple to use algorithm training effort.)
The screenOnPhrase method
The screenOnPhrase method requires an incoming parameter of type int named spanLim. This is the parameter by which the programmer specifies how many extraneous characters will be allowed between letters in the offending word or phrase and still have it be recognized as an offending word or phrase.
When the screenOnPhrase method was used to process the From and Subject data, the value of this parameter was set to zero. Thus no extraneous characters were allowed in the matching friendly Email address data or the matching friendly Subject line material.
In order for the screenOnPhrase method to recognize the words Viagra, Generic, and t0day in the above example, the value of the spanLim parameter would have to be 1 or greater.
As you will see later, the current version of this program uses a spanLim value of 1 to screen for SPAM. Experience shows that this is successful in identifying most of the offensive words and phrases without unduly increasing the occurrence of false positives.
At this point, I will put the discussion of the screenMsg method on hold for a short while and explain the inner workings of the screenOnPhrase method.
Description of the screenOnPhrase method
This method tests a String to see if it contains a word or phrase that may have extraneous characters inserted into it, such as VI*A-GRA.
The method requires an incoming parameter of type int named spanLim. If the String contains the sequence of characters, in the correct order, that make up the word or phrase, with spanLim or fewer extraneous characters between any two of the matching characters, the method returns true. Otherwise, it returns false.
A spanLim example
For example, if spanLim = 1, the spammer can insert one character between any two of the characters that make up the offending word in the String and the offending word will still be detected.
However, if the spammer inserts two or more extraneous characters, the offending word will not be detected.
Be careful of false positives
You should avoid making spanLim too large. Large values of spanLim produce more false positives because widely-separated characters can be considered to be part of the word or phrase. For example, if spanLim = 2 or greater, the word PORN will be found in the word IMPORTANT (the T and the A are the two characters that separate the R from the N). However, if spanLim = 1, the word PORN will not be found in IMPORTANT.
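The effect of spanLim can be explored with a small stand-alone matcher. The sketch below uses a simple backtracking search, which is not the method used in this lesson and may give different answers in some edge cases; the class and method names are hypothetical. It is included only to make the spanLim rule easy to experiment with.

```java
class SpanLimDemo {
    // Returns true if the characters of phrase occur in data, in
    // order, with at most spanLim extraneous characters between any
    // two successive matching characters.
    static boolean matchesWithGaps(String data, String phrase,
                                   int spanLim) {
        if (phrase.isEmpty()) {
            return false; // assumption: an empty phrase never matches
        }
        for (int start = 0; start < data.length(); start++) {
            if (data.charAt(start) == phrase.charAt(0)
                    && matchFrom(data, start, phrase, 1, spanLim)) {
                return true;
            }
        }
        return false;
    }

    // data.charAt(pos) has been matched with phrase.charAt(k - 1).
    // Try to match the rest of phrase, starting with character k.
    private static boolean matchFrom(String data, int pos,
                                     String phrase, int k,
                                     int spanLim) {
        if (k == phrase.length()) {
            return true; // every character of phrase was matched
        }
        // The next match may be at most spanLim characters further on.
        int last = Math.min(data.length() - 1, pos + 1 + spanLim);
        for (int next = pos + 1; next <= last; next++) {
            if (data.charAt(next) == phrase.charAt(k)
                    && matchFrom(data, next, phrase, k + 1, spanLim)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesWithGaps("V*I*A*G*R*A", "VIAGRA", 1));
        System.out.println(matchesWithGaps("IMPORTANT", "PORN", 1));
        System.out.println(matchesWithGaps("IMPORTANT", "PORN", 2));
    }
}
```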
Operation of the screenOnPhrase method
Basically this is how the screenOnPhrase method does what it does. The method receives incoming String parameters named data and phrase, along with an int parameter named spanLim. The objective is to determine if phrase is contained in data with no more than spanLim extraneous characters separating the matching characters.
Search for matching characters
First the method searches data for characters that match the characters in phrase discarding all other characters. While doing this, however, it keeps track of the original positions of the matching characters in data.
A new compressed string
The result is a new string containing only the characters that match the characters in phrase, all in their original order, with every extraneous character discarded from data. Let's refer to this new string as str.
For example, if the phrase is SPAM, str might look like the following after all extraneous characters have been discarded:
SMSPMASPAMMPAS
Does str contain phrase?
A test is made to determine if str contains sequences of characters that match phrase. (In the above example, str does contain a sequence of characters matching SPAM, beginning at the seventh character.)
How many extraneous characters were discarded?
If a match is found, this means that the original data did contain phrase with the possibility of extraneous characters in between the characters of phrase. However, there is still the issue of how many extraneous characters were discarded in order to get the positive match. This is determined by examining the original position information that was saved while extraneous characters were being discarded.
If the number of characters discarded between any two of the characters matching the sequence was less than or equal to spanLim, the method returns true. Otherwise it returns false.
Let's see this in code
The code to accomplish this is a little bit complex. The screenOnPhrase method begins in Listing 19.
private boolean screenOnPhrase(String data, |
(An object of the StringBuffer class can have its contents modified, while an object of the String class is immutable.)
Compare data with phrase
The next step is to compare the characters in data with the unique characters in phrase, saving only the matching characters in str, and saving the original locations of the matching characters in locationData, which refers to an object of type ArrayList.
First, however, it is necessary to eliminate duplicate characters from phrase.
Eliminate duplicate characters from phrase
This is accomplished by the code in Listing 20 by storing the characters from phrase into a TreeSet object. Storing the characters in a TreeSet object eliminates duplicates.
(It also sorts the characters, but that doesn't matter one way or the other in this case.)
TreeSet treeSet = new TreeSet(); |
(The characters are stored in a StringBuffer object because it is possible to build such an object one character at a time. This is not possible with a String object.)
Extract matching characters from data
Listing 21 uses a pair of nested for loops along with tempPhrase to extract matching characters from data and to store them (in their original order) in str. The original position of each matching character is stored in locationData.
for(int i = 0; i < data.length(); i++){ |
Does str contain phrase?
The next step is the easy one. Listing 22 tests to see if the new compressed string named str contains the original phrase.
int match = str.indexOf(phrase); |
Behavior of the indexOf method
The indexOf method returns -1 if phrase does not occur within str, in which case the screenOnPhrase method simply returns false. Otherwise, the method goes on to test for the maximum number of extraneous characters that separated the matching characters in the original data.
(While writing this explanation, I realized that phrase might have occurred more than once in data, with too many extraneous characters in the first occurrence and an acceptable number of extraneous characters in a later occurrence. In this case, however, the algorithm would return false.
Given the reason for the extraneous characters in the first place, it is probably unlikely that this will happen. However, this is a logic error that would be worth fixing.)
Check number of extraneous characters
When there is a match, we need to confirm that the span between matching characters does not exceed the number allowed by the incoming parameter spanLim. This is accomplished by the code in Listing 23.
int maxSpan = 0; |
As each matching character was extracted from data in Listing 21, the original position of that character in data was encapsulated in an object of type Integer. That Integer object's reference was appended to a list of such references in the object of type ArrayList referred to by locationData.
Thus the elements in the ArrayList refer to objects containing the original positions of successive matching characters in data. This information will be used to calculate the maximum number of characters separating those matching characters.
The code in Listing 22 found the position of the first character of phrase in the compressed string referred to by str and saved that value in a local int variable named match.
What happens now is ...
The code in Listing 23 uses the value of match to extract an Integer object from the list containing the original position of the first matching character of phrase in data. The contents of the Integer object are extracted and saved as locA.
Then the code in Listing 23 enters a for loop that extracts successive references from the list and uses the information encapsulated therein to calculate the difference in original positions of the successive matching characters. The maximum value of that difference is calculated. This process continues until a number of original positions equal to the number of characters in phrase have been examined. Then the maximum difference in position is compared with spanLim.
If it is determined that the number of characters between the original positions of the matching characters in data exceeds spanLim, the method returns false. Otherwise, it returns true.
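The steps described in Listings 19 through 23 can be pulled together into one compact sketch. This is a reconstruction from the description above, not a copy of the listings in this lesson: it uses generics, a StringBuilder in place of a StringBuffer, and a set lookup in place of the nested loops. It also assumes that the span between two matching characters is the number of characters lying between them, and that an empty phrase never matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ScreenOnPhraseSketch {
    static boolean screenOnPhrase(String data, String phrase,
                                  int spanLim) {
        if (phrase.isEmpty()) {
            return false; // assumption: an empty entry never matches
        }
        // Collect the unique characters of phrase.
        Set<Character> unique = new TreeSet<>();
        for (char c : phrase.toCharArray()) {
            unique.add(c);
        }
        // Keep only the characters of data that occur in phrase,
        // remembering where each one came from.
        StringBuilder str = new StringBuilder();
        List<Integer> locationData = new ArrayList<>();
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (unique.contains(c)) {
                str.append(c);
                locationData.add(i);
            }
        }
        // Does the compressed string contain phrase?
        int match = str.indexOf(phrase);
        if (match == -1) {
            return false;
        }
        // Find the largest number of discarded characters between
        // successive matching characters of the first occurrence.
        int maxSpan = 0;
        int locA = locationData.get(match);
        for (int k = 1; k < phrase.length(); k++) {
            int locB = locationData.get(match + k);
            maxSpan = Math.max(maxSpan, locB - locA - 1);
            locA = locB;
        }
        return maxSpan <= spanLim;
    }

    public static void main(String[] args) {
        System.out.println(
            screenOnPhrase("BUY V*I*A*G*R*A NOW", "VIAGRA", 1));
        System.out.println(screenOnPhrase("IMPORTANT", "PORN", 1));
        // Only the first occurrence in str is examined, so a clean
        // second occurrence is not noticed here.
        System.out.println(
            screenOnPhrase("V--IAGRA VIAGRA", "VIAGRA", 0));
    }
}
```

Because only the first occurrence found by indexOf is examined, this sketch shares the limitation mentioned earlier. Repeating the search with the two-argument form of indexOf, starting just past the previous match, would be one way to examine later occurrences as well.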
Return to the discussion of the screenMsg method
Now that we have an understanding of how the method named screenOnPhrase works, it is time to return to the discussion of the method named screenMsg.
Up to this point, the method has examined the Subject and From lines for two purposes:
- To determine if they are friendly, and if so, to terminate SPAM screening for the current message.
- To provide the contents of the two lines for later display in the GUI.
If control still resides in the screenMsg method at this point (meaning that the message wasn't declared to be friendly on the basis of either the From line or the Subject line), it is time to screen the entire message looking for indications that the message is SPAM.
This is accomplished in two parts. The Subject line is screened for SPAM using one word list and the body text is screened for SPAM using a different word list. If either the Subject line or the body text is determined to contain SPAM, the method terminates returning true.
(A typical message contains a few lines of body text before the Subject line and potentially many lines of body text following the Subject line. If any line is determined to contain SPAM, the method terminates at that point returning true.
The lines are screened in the order that they occur in the message. Therefore, if the Subject line is determined to contain SPAM, the screening process will often require much less time than will be required to locate SPAM in the body text.)
If no SPAM is identified in any line of the message, the method returns false.
The screening process
The screening process begins in Listing 24.
int progressCounter = 0; |
If the line does not start with Subject:, an else clause, (to be discussed later), is executed to screen the line as body text.
If the text line is the Subject line ...
For the case where the line does start with Subject:
- The line is converted to upper case.
- The line is appended to the contents of the field named text in the output object of type ScreenResult.
- A single period is appended to the string currently residing in the text area of the GUI to be displayed as a progress indicator. (Each set of 50 periods appears on a new line in the progress indicator.)
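The progress indicator logic is simple enough to sketch on its own. In the sketch below, the class and method names are hypothetical, and a returned string stands in for the text appended to the GUI's text area; a counter determines when a new line is started.

```java
class ProgressDotsSketch {
    // Builds the progress text for the given number of screened
    // lines: one period per line, with a line break after each
    // group of 50 periods.
    static String progressText(int linesScreened) {
        StringBuilder text = new StringBuilder();
        int progressCounter = 0;
        for (int i = 0; i < linesScreened; i++) {
            text.append('.');
            progressCounter++;
            if (progressCounter % 50 == 0) {
                text.append('\n'); // start a new row of periods
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // 120 screened lines produce two full rows and a partial row.
        System.out.println(progressText(120));
    }
}
```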
The code beginning in Listing 25 uses an Iterator to screen an upper-case version of the Subject line against upper-case versions of each of the offensive words and phrases stored in a TreeSet object referred to by subjWordList.
|
The actual process of screening against each offensive word or phrase in the TreeSet object occurs as a result of invoking the screenOnPhrase method (discussed earlier) to determine if the Subject line contains the offensive word or phrase. A one-character separation is allowed between the characters in the offensive phrase in the Subject line. The boolean value returned by screenOnPhrase is stored in the variable named match.
The value of match will eventually be returned to indicate whether or not the screenMsg method found a match between a text line in the message and an offensive word or phrase from one of the word lists.
If the returned value is false ...
If the returned value is false, the Iterator loop continues looping, attempting to match offensive words or phrases from subjWordList with the Subject line until there are no more offensive words or phrases stored in subjWordList.
At that point, it is concluded that the Subject line doesn't contain SPAM. Control is transferred back to the top of the while loop in Listing 24, where another line of text is read and screened.
(There can be only one Subject line in a properly formatted message, so the remaining lines will probably all be screened as body text.)
If the returned value is true ...
If the screenOnPhrase method returns true, the body of the if statement in Listing 26 is executed.
if(match == true){ |
- The local copy of the message is moved from a history folder to an archive folder. (The message won't be needed for training the algorithm later because the algorithm already knows how to identify the message as SPAM.)
- Control breaks out of the Iterator loop. (Only one match against an offensive word or phrase is required to declare that a message is SPAM.)
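Moving a file between folders can be done in several ways. The following sketch uses Files.move from the java.nio.file package, which is a substitution chosen for illustration and not necessarily what the program in this lesson does. The class, method, and folder names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ArchiveMoveSketch {
    // Moves a message file into the archive folder, creating the
    // folder if necessary, and returns the new location.
    static Path moveToArchive(Path messageFile, Path archiveDir) {
        try {
            Files.createDirectories(archiveDir);
            Path target =
                archiveDir.resolve(messageFile.getFileName());
            return Files.move(messageFile, target,
                              StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-check using temporary folders in place of the real
    // history and archive folders.
    static boolean demo() {
        try {
            Path history = Files.createTempDirectory("history");
            Path archive = Files.createTempDirectory("archive");
            Path msg = Files.write(history.resolve("msg5.txt"),
                "raw message text".getBytes(StandardCharsets.UTF_8));
            Path moved = moveToArchive(msg, archive);
            return Files.exists(moved) && !Files.exists(msg);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("moved: " + demo());
    }
}
```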
Transfer of control
After breaking out of the Iterator loop, control transfers to the return statement in Listing 33 below with match containing true. A true value for match indicates that the Subject line identifies the message as SPAM.
If the text line is not the Subject line ...
A line of text was read in Listing 24, and a test was made to see if that line was the Subject line for the message. If not, control transfers to the code that begins in Listing 27.
Since the line of text is not the Subject line, it needs to be screened against the offensive words and phrases in a different list designed for screening body text.
Listing 27 executes several steps in preparation for that screening process.
else{ |
- Converts the text line to upper case.
- Appends the text line to the contents of the field named text in the output object of type ScreenResult.
- Causes a single period to be displayed in the progress indicator on the GUI of Figure 1.
The line of message body text is actually screened by invoking the screenOnPhrase method in Listing 28. The third parameter in the invocation of this method allows for one extraneous character to separate the characters of the offending phrase in the line of body text.
Iterator iterator = bodyWordList. |
An upper-case version of the message text line is screened against upper-case versions of each of the offending words and phrases in a TreeSet object referred to by bodyWordList.
An Iterator is used to cause this process to continue until either a match is found, or the items in the list are exhausted.
The boolean value returned by screenOnPhrase is stored in the variable named match.
If the returned value is true ...
If the value returned by screenOnPhrase is true, the code in the body of the if statement of Listing 29 is executed.
if(match == true){ |
- Move the local copy of the message from the history folder to an archive folder for the same reasons given with respect to the Subject line earlier.
- Break out of the Iterator loop because there is no need to test against any additional offensive words or phrases.
If screenOnPhrase returned false ...
On the other hand, if no matches were found for any of the words and phrases in bodyWordList, control reaches the if statement at the top of Listing 30 with match containing a value of false.
if(match == true)break; |
Close the file
If match is true in Listing 30, there is no need to do any further testing so the code in Listing 30 breaks out of the while loop responsible for reading lines of text from the file, transferring control to the top of Listing 31.
Control can also transfer to the top of Listing 31 when the end of the text file has been reached. In that case, the value of match will be false, indicating that no match was found.
inData.close();//Close file if still open |
Store the final phrase
In the event that a match was found, the variable phrase contains the offending phrase that identifies the message as spam. If a match was not found, the contents of phrase are of no significant value. In either case, however, the value of phrase is stored in the field named thePhrase in the output object of type ScreenResult in Listing 32.
theResult.thePhrase = phrase; |
The code in Listing 33 returns the value of the variable match. This will either be the initial value of false if no match was found (see Listing 8) or will be true if a match was found and the initial value was overwritten with true.
return match; |
The screenMsg method contains three return statements.
The first occurs in Listing 15 where the code explicitly returns false, indicating that a friendly phrase was found in the Subject line, and that the message should not be deleted from the server.
The second occurs in Listing 18 where the code explicitly returns false, indicating that the message was sent by a friend and therefore shouldn't be deleted from the server.
The third occurs in Listing 33. A match value of false at this point indicates that the message was not identified as SPAM and should not be deleted from the server. A match value of true at this point indicates that the message is believed to be SPAM and probably should be deleted from the server.
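The overall flow of control, with its three return points, can be summarized in a simplified sketch. The sketch below is not the screenMsg method itself: it works on a list of lines already in memory, it leaves out the file handling and the population of the ScreenResult object, and it takes the matching engine as a parameter. The names Matcher, SUBSTRING, and friendlyAddressList are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ScreenMsgFlowSketch {
    // Hypothetical stand-in for the screenOnPhrase method.
    interface Matcher {
        boolean matches(String data, String phrase, int spanLim);
    }

    // Simplified matcher: a substring test that ignores spanLim.
    static final Matcher SUBSTRING =
        (data, phrase, spanLim) -> data.contains(phrase);

    static boolean anyMatch(String line, Set<String> phrases,
                            int spanLim, Matcher m) {
        for (String phrase : phrases) {
            if (m.matches(line, phrase, spanLim)) return true;
        }
        return false;
    }

    // Tests the first line that starts with prefix against a
    // friendly list, allowing no extraneous characters.
    static boolean headerIsFriendly(List<String> lines, String prefix,
                                    Set<String> friendly, Matcher m) {
        for (String raw : lines) {
            String line = raw.toUpperCase();
            if (line.startsWith(prefix)) {
                return anyMatch(line, friendly, 0, m);
            }
        }
        return false;
    }

    static boolean screenMsg(List<String> lines,
                             Set<String> friendlyWordList,
                             Set<String> friendlyAddressList,
                             Set<String> subjWordList,
                             Set<String> bodyWordList, Matcher m) {
        // First return: friendly phrase in the Subject line.
        if (headerIsFriendly(lines, "SUBJECT:", friendlyWordList, m))
            return false;
        // Second return: friendly address in the From line.
        if (headerIsFriendly(lines, "FROM:", friendlyAddressList, m))
            return false;
        // Third return: result of screening every line for SPAM,
        // allowing one extraneous character between letters.
        boolean match = false;
        for (String raw : lines) {
            String line = raw.toUpperCase();
            Set<String> wordList = line.startsWith("SUBJECT:")
                ? subjWordList : bodyWordList;
            match = anyMatch(line, wordList, 1, m);
            if (match) break;
        }
        return match;
    }

    public static void main(String[] args) {
        List<String> msg = Arrays.asList(
            "From: someone@example.com", "Subject: V1AGRA", "body");
        Set<String> none = new TreeSet<>();
        Set<String> subj = new TreeSet<>(Arrays.asList("V1AGRA"));
        System.out.println(
            screenMsg(msg, none, none, subj, none, SUBSTRING));
    }
}
```

Passing the matcher as a parameter keeps the flow of control separate from the matching rules, so a span-aware matcher can be swapped in without changing the rest of the sketch.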
Preview of Future Lessons
This program is most useful when you have well-developed lists of offending words and phrases. Although it is possible to create those lists with a text editor, you can be much more productive, and you are much more likely to update the lists using the programs that I will present in the next two lessons.
Therefore, I will give you a preview of those two programs. I will show you three images that partially illustrate the capabilities of the two programs.
The first image (Figure 2) shows the GUI used to train the algorithm to do a better job of identifying SPAM in the Subject line of a message.
The second and third images (Figures 3 and 4) show two different aspects of the GUI used to train the algorithm to do a better job of identifying SPAM in the body text of a message.
(In all three cases, the width of the GUI was reduced to make it fit into this narrow publication format. The version that I routinely use is much wider, and can therefore display much more information.)
Training the algorithm on the Subject line
Figure 2 illustrates the procedure that I use to train the algorithm to do a better job of identifying SPAM in the Subject line of future messages.

Figure 2 User interface for training on Subject line
In Figure 2, a message previously stored in the history folder has been loaded into the GUI. The complete raw text of that message is available for viewing in the large text area if desired. The From line and the Subject line are displayed in the top two text fields in the GUI. (I purposely deleted the Email address of the sender in all three of these images.) User instructions are displayed in the fourth text field.
Offensive text in the Subject line
In this case, the user has identified offensive text (XANAX) in the Subject line and has selected that text with the mouse. Two additional steps are required to add that text to the word list used to screen the Subject line of future messages.
The first step is to press the button labeled Copy Selected Text. This will cause the selected text to be copied into the third text field from the top where it can be edited if desired.
(In the event that the spammer inserted extra characters into the offensive text, such as in X-ANAX, the extra characters should be deleted before proceeding to the second step.)
The second step is to press the button labeled Post Text. This will cause the selected and (possibly) edited text to be automatically added to the word list.
That is all that is required to cause the program to identify this offensive text in the Subject lines of all future messages.
Process the next message
If the user then presses the Next button, the next message in the history folder will be loaded into the GUI. The current message will not be deleted from the history folder.
(This is what you would normally do if you are going to use the same message later to train the algorithm to better identify SPAM on the basis of the body text.)If the user presses the Delete Local File button, the current file will be deleted from the history folder and the next message in the history folder will be loaded into the GUI.
(This is what you would normally do if you have determined that the message is not SPAM, or should not be used for further training of the algorithm for some other reason. Perhaps the message was received from a friend whose Email address has not yet been added to the list of friendly Email addresses discussed earlier in this lesson. Note that deleting the message file from the local disk does not delete the message from the server.)
A very simple process
As you can see, the process of training the algorithm on the Subject line consists simply of selecting text with the mouse and pressing buttons to cause the selected text to be added to the word list. This can be accomplished very quickly with very little effort. Except for the possible requirement to delete extra characters, no actual typing is required.
(As an alternative, the user can type anything into the third text field and press the Post Text button to cause it to be added to the word list. Any number of items can be added to the word list before moving on to the next message.)
Training the algorithm on the body text
Figure 3 illustrates one aspect of the procedure for training the algorithm to do a better job of identifying SPAM on the basis of the body text of future messages.

Figure 3 User interface for training on body text and IP address
Once again, a message previously stored in the history folder has been loaded into the GUI. The complete raw text of that message is available for viewing in the large text area. The From line and the Subject line are displayed in the top two text fields in the GUI. User instructions are displayed in the fourth text field from the top.
Add originating IP address to the list
At this point, the user has pressed the button labeled Select IP. This caused the program to search out the IP address of the computer that originally sent this message, and to copy that IP address into the third text field from the top. All that is required to add that IP address to the list of offending phrases is to press the button labeled Post Word.
As you can see, getting the originating IP address of a SPAM message and adding it to the word list is very simple. As before, once it is in the third text field, you can edit if you like before adding it to the list.
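This lesson does not show how the Select IP button locates the originating IP address, so the following is only a rough sketch of one possible approach. It assumes that the bottom-most Received header records the first hop of the message, and that the sender's address appears there in square brackets. The class name OriginIpSketch, the method name findOriginIp, and the sample addresses are hypothetical. Keep in mind that spammers can forge Received headers, so any address found this way deserves a quick look before it is posted to the list.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one possible way to locate the originating
// IP address. This is not the method used by the training program.
class OriginIpSketch {

  // An IPv4-style address enclosed in square brackets, e.g. [192.0.2.10]
  private static final Pattern BRACKETED_IP =
      Pattern.compile("\\[(\\d{1,3}(?:\\.\\d{1,3}){3})\\]");

  // Returns the bracketed address found in the bottom-most Received
  // header, or null if no such address is found in the headers.
  static String findOriginIp(String rawMessage) {
    String last = null;
    boolean inReceived = false;
    for (String line : rawMessage.split("\r?\n")) {
      if (line.isEmpty()) {
        break; // a blank line marks the end of the headers
      }
      if (line.regionMatches(true, 0, "Received:", 0, 9)) {
        inReceived = true;
      } else if (!(line.startsWith(" ") || line.startsWith("\t"))) {
        inReceived = false; // some other header has started
      }
      if (inReceived) {
        Matcher m = BRACKETED_IP.matcher(line);
        if (m.find()) {
          last = m.group(1); // later headers overwrite earlier ones
        }
      }
    }
    return last;
  }

  public static void main(String[] args) {
    String msg =
        "Received: from mx.example.net ([192.0.2.10])\r\n"
      + "\tby mail.example.org; Mon, 2 Feb 2004\r\n"
      + "Received: from unknown (HELO x) ([203.0.113.7])\r\n"
      + "Subject: test\r\n"
      + "\r\n"
      + "Body text mentions [198.51.100.1]\r\n";
    System.out.println(findOriginIp(msg));
  }
}
```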
Getting offending text from the body text
Also at this point, you can scroll the text area. If you visually identify something in that text that you believe will uniquely identify messages from this spammer in the future, you can copy and paste that text into the third text field, and then add the text to the list by pressing the Post Word button.
Adding URLs to the list
One of the best ways to identify SPAM is to identify URLs referenced in the SPAM messages. This is something that is difficult, or at least expensive, for the spammer to change frequently. (Sometimes the identification of one critical URL will cause hundreds and perhaps thousands of future messages to be identified as SPAM.)
Adding URLs is very easy
Figure 4 illustrates a special feature of the program designed to let you capitalize on that weakness. Once a message is loaded into the GUI, each time you press the button labeled Select URL, the program will search down through the message until it finds the next block of text that begins with HTTP://. (This is normally an indication of a URL.)
The program will select that URL, beginning with HTTP:// and including everything out to the character before the next / character. That / character normally separates the domain name from a directory or file name. Then the program copies the selected text into the third text field.
(The spammer can much more easily change directory and file names than domain names, so they are excluded from the text that is selected and copied.)

Figure 4 User interface for training on body text and URL
A URL has been identified
In Figure 4, the program has selected the URL being used by the spammer and has copied it into the text field. At this point, the user can edit the URL if appropriate, and can add it to the list by pressing the Post Word button.
Each time the user presses the Select URL button the next URL in the message is copied into the text field. When no more URLs can be found, a message to that effect is displayed in the text field.
Thus, it is very easy for the user to identify all the URLs being used by the spammer and to add some or all of them to the list.
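The behavior of the Select URL button described above can be sketched with a regular expression. This is a minimal illustration, not the code of the training program. The class name UrlSelectorSketch and the method name nextUrl are hypothetical, and the sketch makes two assumptions: the search for HTTP:// is case-insensitive, and the selection also stops at whitespace, quotation marks, and angle brackets when no / character follows the domain name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the Select URL behavior.
class UrlSelectorSketch {

  // "http://" followed by everything up to (but not including) the
  // next '/', whitespace, quotation mark, or angle bracket.
  private static final Pattern URL_START =
      Pattern.compile("http://[^/\\s\"'<>]*", Pattern.CASE_INSENSITIVE);

  private final Matcher matcher;
  private boolean exhausted = false;

  UrlSelectorSketch(String messageText) {
    matcher = URL_START.matcher(messageText);
  }

  // Each call returns the next URL in the message, or null when
  // no more URLs can be found.
  String nextUrl() {
    if (exhausted) {
      return null;
    }
    if (matcher.find()) {
      return matcher.group();
    }
    exhausted = true;
    return null;
  }

  public static void main(String[] args) {
    String msg =
        "Click <a href=\"HTTP://www.example.com/buy/now.html\">here</a>\n"
      + "or visit http://pills.example.net/x?id=42 today.";
    UrlSelectorSketch selector = new UrlSelectorSketch(msg);
    String url;
    while ((url = selector.nextUrl()) != null) {
      System.out.println(url);
    }
    System.out.println("No more URLs found");
  }
}
```

Because directory and file names are excluded, both sample URLs are reduced to the protocol and domain name, which is the part that is hardest for the spammer to change.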
The next message
The behavior of the Next button and the Delete Local File/Next button is the same as discussed relative to Figure 2. I typically delete the file from the history folder after I have used it to train the algorithm on the basis of body text.
Stay tuned
So, stay tuned. I will explain the programs that provide this training capability in the next two lessons in this series.
Run the Program
I encourage you to copy the code from Listing 34 and the three starter text files in Listing 35, Listing 36, and Listing 37 into your text editor. Compile and execute the program. Experiment with it, making changes, and observing the results of your changes.
You may want to modify this code to cause the message files to be stored in a different location on your disk. If so, modify the strings in Listing 34 that read "c:/MailFiles/" + uidl + ".txt" and "c:/MailFiles/Archives/" + uidl + ".txt" to specify a different folder. Make certain that the folder where you plan to save the files exists before running the program.
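If you would rather have the folder created automatically, something like the following could be used. This is only a sketch under stated assumptions: it relies on the java.nio.file package, which requires Java 7 or later, and the class name MailFolderSketch and the method names ensureFolder and messagePath are hypothetical and do not appear in Listing 34.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: keeps the folder name in one place and
// creates the folder if it does not already exist.
class MailFolderSketch {

  // Create the folder (and any missing parent folders).
  // Does nothing if the folder already exists.
  static Path ensureFolder(String folder) throws IOException {
    return Files.createDirectories(Paths.get(folder));
  }

  // Build the path of the file for a message with the given UIDL,
  // equivalent to folder + "/" + uidl + ".txt".
  static Path messagePath(String folder, String uidl) {
    return Paths.get(folder, uidl + ".txt");
  }

  public static void main(String[] args) throws IOException {
    // The folder is taken from the command line. If none is given,
    // a MailFiles folder under the system temporary directory is used.
    String folder = args.length > 0
        ? args[0]
        : Paths.get(System.getProperty("java.io.tmpdir"), "MailFiles")
            .toString();
    Path created = ensureFolder(folder);
    System.out.println("Folder ready: " + created);
    System.out.println("Message file: " + messagePath(folder, "abc123"));
  }
}
```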
Before running the program, you will need to create three text files having the following names and purposes and store them in the folder containing your compiled Java class files for this program:
- Pop302a.txt - contains offensive Subject line words and phrases
- Pop302b.txt - contains offensive body text words and phrases
- Pop302c.txt - contains friendly Email addresses and friendly Subject line material
Eventually you will need to populate these files with words and phrases that work well for you. (The algorithm training programs that I will present in the next two lessons will be extremely helpful in this regard.)
In the meantime, I have provided sample files in Listing 35, Listing 36, and Listing 37 that you can use as starter lists. If you receive the same kinds of SPAM that I receive, the words in these lists should make it possible for you to test the program and get a few hits on SPAM messages.
These are simply text files so feel free to add other words and phrases as appropriate.
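To make the role of these files concrete, here is a minimal sketch of how a list file with one word or phrase per line might be read and applied to a Subject line. It is not the code from Listing 34. The class name WordListLoaderSketch and its method names are hypothetical, and the matching rule shown here (case-insensitive substring matching) is an assumption that may differ from the algorithm explained earlier in this lesson.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: load a word list and test text against it.
class WordListLoaderSketch {

  // Read a list file such as Pop302a.txt, one phrase per line.
  static List<String> load(String fileName) throws IOException {
    return clean(Files.readAllLines(Paths.get(fileName),
                                    StandardCharsets.UTF_8));
  }

  // Trim each line, drop blank lines, and convert to upper case.
  static List<String> clean(List<String> lines) {
    List<String> phrases = new ArrayList<>();
    for (String line : lines) {
      String phrase = line.trim().toUpperCase(Locale.ROOT);
      if (!phrase.isEmpty()) {
        phrases.add(phrase);
      }
    }
    return phrases;
  }

  // Return the first phrase contained in the text, or null if
  // none of the phrases occurs in the text.
  static String firstHit(String text, List<String> phrases) {
    String upper = text.toUpperCase(Locale.ROOT);
    for (String phrase : phrases) {
      if (upper.contains(phrase)) {
        return phrase;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // In-memory lines stand in for the contents of a list file.
    List<String> subjectList =
        clean(Arrays.asList("AGE REVERSING PRODUCT", "", " xanax "));
    System.out.println(
        firstHit("Try our new age reversing product today", subjectList));
    System.out.println(firstHit("Lunch on Friday?", subjectList));
  }
}
```

The same approach would apply to the body text list and to the list of friendly addresses, with only the file name and the text being tested changing.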
(Let me caution you not to enable the DELE code in Listing 34 until you are certain that you actually want to delete messages from the server. Once a message is deleted from the server, there is no way to recover it from the server.)
Summary
The previous lesson explained the communications module of a program used to remove SPAM from your Email server before it is downloaded into your primary Email client. This lesson explains my algorithm used to identify SPAM. You can use the algorithm as is, or modify it to better suit your needs.
After about one week of training, my algorithm was reliably identifying about ninety percent of all SPAM messages. I expect this performance to improve over time as the algorithm becomes better trained.
This program is most useful when you have well-developed lists of offending words and phrases. Although it is possible to create those lists with a text editor, you can be much more productive, and you are much more likely to update the lists using the programs that I will present in the next two lessons.
What's Next?
In the next lesson in this series, I will present and explain my program named Pop302d, which provides an easy way to train my screening algorithm to do a better job of identifying SPAM in the Subject line of a message.
Complete Program Listing
The three DELE statements shown in red in Listing 34 have been purposely disabled to prevent you from accidentally deleting messages from your server while testing this program.
Do not enable these three statements until you are ready to actually delete messages from the server. Once a message is deleted from the server, it cannot be recovered from the server.
Disclaimer of responsibility: If you elect to use this program you use it at your own risk. Make absolutely certain that you understand what you are doing before you execute the program. The author of this program, Richard G. Baldwin, and the websites Developer.com and Gamelan.com accept no responsibility for any losses that you may incur as a result of using this program.
Listing 34
/*File Pop302.java Copyright 2004, R.G.Baldwin
Listing 35
AGE REVERSING PRODUCT
Listing 36
123.456.789.123
Listing 37
BALDWIN@DICKBALDWIN.COM
Copyright 2004, Richard G. Baldwin. Reproduction in whole or in part in any form or medium without express written permission from Richard Baldwin is prohibited.
About the author
Richard Baldwin is a college professor (at Austin Community College in Austin, TX) and private consultant whose primary focus is a combination of Java, C#, and XML. In addition to the many platform and/or language independent benefits of Java and C# applications, he believes that a combination of Java, C#, and XML will become the primary driving force in the delivery of structured information on the Web.
Richard has participated in numerous consulting projects, and he frequently provides onsite training at the high-tech companies located in and around Austin, Texas. He is the author of Baldwin's Programming Tutorials, which has gained a worldwide following among experienced and aspiring programmers. He has also published articles in JavaPro magazine.
Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.
-end-
This article was originally published on February 3, 2004