Enlisting Java in the War Against SPAM: Training the Body Screener
Java Programming Notes # 2156
- Preface
- Preview
- Operational Discussion
- Program Code
- Run the Program
- Summary
- What's Next?
- Complete Program Listing
Preface
The communications module
The first lesson explained the communications module used to communicate with your Email server, and to remove SPAM messages from the server.
SPAM screening algorithm
The second lesson explained my SPAM screening algorithm. The program is designed to allow you to use my SPAM screening algorithm, or to invent your own. You can use my algorithm as a starting point if you decide to invent your own.
My algorithm operates separately on the Subject line, the From line, and the body text of each Email message. The previous lesson explained the program that I use to train that portion of the algorithm that screens the message on the basis of the Subject line.
This lesson explains my program for training the algorithm to do a better job of identifying SPAM in the future based on the body text of Email messages.
Future lessons will explain a large number of items contained in the section entitled "What's Next" in the previous lesson.
Viewing tip
You may find it useful to open another copy of this lesson in a separate browser window. That will make it easier for you to scroll back and forth among the different listings and figures while you are reading about them.
Supplementary material
I recommend that you also study the other lessons in my extensive collection of online Java tutorials. You will find those lessons published at Gamelan.com. However, as of the date of this writing, Gamelan doesn't maintain a consolidated index of my Java tutorial lessons, and sometimes they are difficult to locate there. You will find a consolidated index at www.DickBaldwin.com.
Preview
Remove SPAM from the server
In this series of lessons, I am showing you how to write a Java program that supplements the SPAM screening software that you are currently using. The program is used to identify and remove SPAM from your Email server before it is downloaded into your primary Email client.
Any SPAM that makes it past the SPAM screening program can be further acted upon by the SPAM screener that is built into your Email client.
Algorithm training programs
The screening program is most useful when the algorithm has been well trained, and the training of the algorithm is kept up to date. Although it is possible to perform this training with a text editor, you can be much more productive, and you are much more likely to keep the training up to date, using the programs that I have provided.
The previous lesson explained a program named Pop302d, which uses historical data to train the algorithm to do a better job of identifying SPAM in future messages based on the Subject of the message.
This lesson explains a program named Pop302e, which uses historical data to train the algorithm to do a better job of identifying SPAM based on the body text of the message.
Because of the need to train the algorithm and to keep it up to date, and the ease with which these programs make that possible, the training programs are just as important as the main screening program.
Operational sequence
Here is the typical operational sequence that I follow each morning to remove SPAM from my Email server before downloading it into my primary Email client, and to train the algorithm to recognize any future SPAM messages that managed to evade the screening program on the first pass that morning.
1. Run the program named Pop302 to identify SPAM and to remove it from the server. This normally allows a few SPAM messages (stragglers) to get through. These straggler messages are stored in a history folder on my local disk.
2. Run the program named Pop302d to train the algorithm to recognize the stragglers as SPAM based on information in the Subject line. (It is also possible to recognize that a message is not SPAM and to exclude it from further training activities.)
3. Run the program named Pop302e (explained in this lesson) to train the algorithm to recognize the stragglers as SPAM based on information in the body text.
4. Go back and run the main program named Pop302 to remove the SPAM stragglers from the server.
5. Run the primary Email client to download the remaining good messages from the Email server into my local Email inbox.
When I am in a hurry ...
It isn't necessary to perform all of these steps every day. On those mornings when I am in a hurry, I skip steps 2, 3, and 4, leaving the straggler messages in the local history folder for use later.
Sometime later (perhaps the next day or several days later) I perform steps 2 and 3 to train the algorithm to recognize future SPAM messages represented by the characteristics of the SPAM messages that have been saved in the local history folder.
Effectiveness of my algorithm
After several weeks of training, my algorithm is reliably identifying about ninety-five percent of all SPAM messages, allowing me to delete them from my Email server before downloading them into my primary Email client. By executing steps 2, 3, and 4 above, I am able to also eliminate the remaining five percent of the SPAM messages before downloading them into my primary Email client.
Most of the messages that make it past the initial screen are messages written in a non-English language that I am unable to read. No one that I know would send a message to me in a language that I can't read, so I consider such messages to be SPAM, regardless of the original intent of the author. Those foreign language messages constitute the bulk of the five percent of the SPAM messages that make it past the screen.
Operational Discussion
Offensive words and phrases
My SPAM screening algorithm screens for SPAM on the basis of words or phrases in the From line, words or phrases in the Subject line, and words or phrases in the body text. Offensive words and phrases include IP addresses and URL addresses, and anything else that you decide to include in the list of offensive words and phrases. At this point in time, there is nothing statistical about my algorithm, nor does the algorithm use any concepts that might be described as artificial intelligence. However, that may change in a future upgrade.
Friendly Email addresses and subjects
A list of friendly Email addresses and friendly subjects is used to screen the From line and the Subject line. Messages that are from friendly Email addresses, and messages that have known good Subject lines are preserved on the server. They are simply ignored after determining that they are friendly.
The friendly list is created and maintained using a simple text editor. This is a relatively short list that rarely changes and no special training program is needed to keep this list up to date.
Different lists for Subject and body text
Different lists of words and phrases are used for screening Subject lines and body text for SPAM. This is important because the same set of words and phrases isn't always appropriate for use in both cases.
For example, the word ANTIVIRUS is appropriate for screening the Subject line, but is not appropriate for screening the body text. The word ANTIVIRUS often appears legitimately in the header of Email messages that have been scanned for viruses by the server, but also often appears in the Subject line of SPAM messages.
Conversely, some of the most useful material in the list used to screen the body text is a large set of URLs for known spammer web sites. However, URLs have little value for screening the Subject line, so there is no point in wasting computer time screening Subject lines against URLs.
To summarize ...
Although there are exceptions to every rule, the most useful material for screening Subject lines consists of plain text words and phrases that are commonly included in the Subject lines of SPAM messages.
Such text is of limited value when screening the body text for reasons that I will explain later (but this is also subject to change in a future upgrade). The most useful material for screening the body text is a list of known spammer URLs, which are of very limited value when screening Subject lines.
Common spammer tricks are defeated
Several common spammer tricks are defeated by my SPAM screening algorithm (and I will defeat other spammer tricks in future upgrades). Some of those tricks are described here, and some will be described in future lessons.
Insertion of extraneous characters and case changes
For example, the common spammer trick of inserting a small number of extra characters between the characters in an offending word or phrase is defeated. Also, the common trick of mixing the case of the characters in an offending word or phrase is defeated.
As a specific example, my algorithm will recommend deletion of any message having any of the following in its Subject line or its body text if the word VIAGRA is included in the lists used to screen for SPAM:
vIaGrA
V.IagRA
V.I.A.G.R.A
Therefore, when using this program to train the algorithm, if you find a SPAM message containing V.IagRA in the body text, you should use the program to add the word VIagRA to the list. (Simply edit out the extraneous period between the V and the I.) The algorithm will deal with the other variations of that word having extraneous characters inserted and having random case changes.
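The screening code that defeats these tricks belongs to the screening module from an earlier lesson and is not reproduced here. As a rough illustration only, the following sketch shows one hypothetical way to match a list word while tolerating case changes and a small number of inserted characters. The class name ObfuscationSketch, the method names, and the limit MAX_GAP are assumptions made for this example, not names from the Pop302 programs.

```java
import java.util.regex.Pattern;

class ObfuscationSketch {

  // Hypothetical limit on the number of extraneous characters
  // tolerated between two consecutive characters of a list word.
  static final int MAX_GAP = 2;

  // Builds a case-insensitive pattern for one list word. Each
  // character is quoted so that characters such as @ or . in the
  // list word are matched literally.
  static Pattern toPattern(String listWord) {
    StringBuilder regex = new StringBuilder();
    for (int i = 0; i < listWord.length(); i++) {
      if (i > 0) {
        regex.append(".{0,").append(MAX_GAP).append("}?");
      }
      regex.append(Pattern.quote(String.valueOf(listWord.charAt(i))));
    }
    return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
  }

  // Returns true if the text contains the list word, allowing for
  // case changes and a few inserted characters.
  static boolean matches(String listWord, String text) {
    return toPattern(listWord).matcher(text).find();
  }

  public static void main(String[] args) {
    String[] samples = {"vIaGrA", "V.IagRA", "V.I.A.G.R.A", "vinegar"};
    for (String sample : samples) {
      System.out.println(sample + " -> " + matches("VIAGRA", sample));
    }
  }
}
```

A larger MAX_GAP catches more aggressive obfuscation but also raises the risk of false positives, so a small value is probably the safer starting point.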
Purposely misspelled words
As currently written, my algorithm has no way of dealing with another common spammer trick of purposely misspelling words.
(Perhaps this is an opportunity for you to improve the algorithm.)
Therefore, if you find a SPAM message where the body text contains the word V.I@gRA, you should use the program to add the word VI@gRA to the list to deal with the incorrect spelling. (Once again, just edit out the extraneous period.)
Duplicates are not a problem
Don't be concerned that you may forget and add a word or phrase to the list more than once. The program will eliminate all duplicates, thereby preventing the length of the list from increasing due to duplicates.
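The duplicate handling follows naturally from keeping the list in a TreeSet, as described later in this lesson. The sketch below illustrates the idea in isolation. The class name PostWordSketch and the method postWord are hypothetical, and converting posted words to upper case is an assumption based on how the list is loaded from the file.

```java
import java.util.Locale;
import java.util.TreeSet;

class PostWordSketch {

  // A TreeSet keeps its elements sorted and silently rejects
  // duplicates, so the list cannot grow through repeated posts.
  private final TreeSet<String> wordList = new TreeSet<String>();

  // Adds a word or phrase to the list. Returns true if it was new,
  // false if it was empty or already present.
  boolean postWord(String raw) {
    String word = raw.trim().toUpperCase(Locale.ROOT);
    if (word.length() == 0) {
      return false;
    }
    return wordList.add(word);
  }

  int size() {
    return wordList.size();
  }

  public static void main(String[] args) {
    PostWordSketch list = new PostWordSketch();
    System.out.println(list.postWord("VIagRA"));
    System.out.println(list.postWord("viagra"));
    System.out.println(list.size());
  }
}
```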
Bogus HTML tags
One of the reasons that plain text words and phrases are not very useful in screening body text is that spammers often insert long bogus HTML tags in the middle of such words and phrases. Because the tags are not recognized as valid HTML tags, the HTML rendering engine simply ignores them when the SPAM message is displayed in an HTML compatible Email client.
My screening algorithm, as currently written, does not deal directly with this situation. (I plan to update the algorithm to cause it to delete bogus HTML tags before screening the body text in a future version.) I will publish that version once it becomes operational. At that point in time, the use of offensive words and phrases (in addition to URLs) in the body text will become much more productive.
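Since that upgrade is not part of the current program, the following is only a sketch of what such a preprocessing step might look like. It takes the simplest possible approach and removes everything between angle brackets, whether the tag is bogus or valid. The class name TagStripSketch and the method stripTags are hypothetical.

```java
class TagStripSketch {

  // Removes anything that looks like a tag: an opening angle bracket,
  // any run of characters other than a closing bracket, and the
  // closing bracket. No attempt is made to tell bogus tags from
  // valid ones.
  static String stripTags(String text) {
    return text.replaceAll("<[^>]*>", "");
  }

  public static void main(String[] args) {
    System.out.println(stripTags("VI<qzxw>AG<!-- filler -->RA"));
  }
}
```

One consequence of this simple approach is that URLs inside tag attributes are removed along with the tags, so a screener built this way would probably need to check the raw text for URLs before stripping tags and checking for plain words.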
The user interface
Figures 1 and 2 show the GUI through which the user controls the training program named Pop302e.
Figure 1 shows the GUI at startup, and Figure 2 shows the GUI containing a SPAM message from the history folder.

Figure 1 Graphical User Interface at startup
(Note that this GUI was purposely made narrow to cause it to fit into this narrow publication format. I recommend that you increase the width of the Frame to at least 750 pixels, and increase the width of the TextField and TextArea objects to at least 100 characters each.)
A URL has been identified
In Figure 2, the SPAM message is typical of many SPAM messages being distributed at this time. In this case, the program has identified a URL associated with the spammer. (See the text in the third text field from the top, which now contains the URL identified in the large text area.)

Figure 2 Graphical User Interface in operation
At this point, the user has the option of pressing the Post Word button to cause this URL to be added to the list of offensive words and phrases.
(The user can also edit the URL in the third text field, if needed, before pressing the Post Word button.)
What do we know about this URL?
There is no question that this message contains a reference to the URL shown in Figure 2. It isn't difficult to establish that this URL is associated with the message. Just point your browser to http://secure.bidz.com/MAIN/DEFAULT.ASP and take a look at the web site represented by that URL. As of the date of this writing, it is a web site identified as BidZ.com.
Whether or not you consider the message to be SPAM depends on whether or not you desire to receive messages from this organization on a frequent basis. If you do consider it to be SPAM, press the Post Word button to put it on the list. If you don't consider it to be SPAM, press the Delete Local File/NextMsg button to remove it from further consideration as SPAM, and consider putting the From address on the friendly list mentioned earlier.
I receive a message with almost the exact same Subject line from this spammer several times each week.
A message from the history folder
In Figure 2, a message previously stored in the history folder has been loaded into the GUI. The complete raw text of that message is available for viewing in the large text area if desired.
The From line and the Subject line are displayed in the top two text fields in the GUI. At this point in the operation, the user has pressed the Select URL button, causing the program to find the URL, to highlight it in boldface, and to copy it into the third text field from the top. As mentioned above, the user has the option of pressing the Post Word button to copy the contents of that text field into the list of offensive words and phrases. The user also has the option of editing the contents of that text field before posting if needed.
User instructions are displayed in the fourth text field.
Can type, or copy and paste also
In addition, the user can add other words and phrases to the list by either typing them into the third text field, or by selecting and copying words and phrases from the large text area and pasting them into the third text field. Having done that, the user can cause the word or phrase in the third text field to be added to the list simply by pressing the Post Word button.
Process the next message
When the user presses the Next Msg button, the next message in the history folder will be loaded into the GUI. The current message will not be deleted from the history folder in this case.
If the user presses the Delete Local File/NextMsg button, the current file will be deleted from the history folder and the next message in the history folder will be loaded into the GUI.
A very simple process
As you can see, the process of training the algorithm consists mainly of pressing buttons to cause material to be identified and added to the word list. This can be accomplished very quickly with very little effort. Except for the possible requirement to do an occasional edit on the contents of the text field, no actual typing is required.
Be careful of false positives
Be careful, however, of adding words or phrases to the list that will create false positives in the future (a false positive is a non-SPAM message that is identified as SPAM by the screening program).
As an extreme example, you probably shouldn't add the words GET or THE to the list. These words are not unique, and they will be found in many future messages, including those that are SPAM, and those that are not SPAM.
False positives cause no permanent damage (you can always elect not to delete messages from the server when running the screening program, even if the program identifies them as SPAM).
The main problem with false positives is that they force you to concentrate more carefully when making the decision to delete or not to delete. This can increase the time required to process the messages on the server.
Program Code
Program Pop302e
This lesson explains the program named Pop302e, which is used to train my screening algorithm to do a better job of identifying future SPAM messages on the basis of the body text of the message.
Simple text files
All three word lists used by my screening algorithm are maintained in local text files. These files can be created and edited with an ordinary text editor if need be. Thus, if one of the lists becomes corrupted, it is easy to correct the situation using a text editor. However, it is easier to train the algorithm using the programs that I have provided.
File names
The following file names are hard coded into the various programs that use these text files. (You may want to change these file names for your version of the programs. If you do, make certain that you change them everywhere that they are used.)
- Pop302a.txt - contains a word list used for screening the Subject line for offensive words and phrases. This file is updated by the program named Pop302d discussed in the previous lesson.
- Pop302b.txt - contains a word list used for screening the body text for offensive words and phrases. This file is updated by the program named Pop302e that I will discuss in this lesson.
- Pop302c.txt - contains a list of friendly Email addresses and friendly subjects for screening the From and Subject lines to identify friendly messages. This file rarely changes and is easy to keep up to date using a text editor. No special program is required for updating this file.
The programs require the three .txt files to be located in the same folder as the compiled .class files for the programs named Pop302, Pop302d, and Pop302e. However, you can easily modify the programs to change the location of the .txt files if you choose to do so. Just be sure to change the location in all three programs.
Local copies of the messages are stored in two different folders. Some of the local copies are stored in a history folder while the remainder are stored in an archive folder. The locations of these folders on the disk are hard coded into the three programs. You can change the locations if you like, but be sure to make appropriate changes to all three programs.
Testing
This program was tested using SDK 1.4.2 under WinXP.
Will discuss in fragments
I will discuss the program named Pop302e in fragments. A complete listing of the program is provided in Listing 10 near the end of the lesson. You should be able to copy and paste that listing into your Java IDE to compile and test the program on your system.
Much of the code in this lesson is very similar to the program named Pop302d that I explained in the previous lesson. Therefore, much of the explanation in this lesson will be very brief. I will concentrate mainly on those things that differentiate this program from the previously discussed program.
Purpose of the program
The purpose of this program is to process raw message text files produced by the program named Pop302, using the information contained in those files to update the word list stored in the file named Pop302b.txt.
Designed for ease of use
The program is designed for extreme ease of use in order to encourage the user to keep the word list up to date.
Beginning of the class definition
This program consists of one top level class named Pop302e, and seven anonymous inner classes. The top level class definition begins in Listing 1.
Note that Pop302e extends Frame. Therefore, an object of type Pop302e is a GUI.
class Pop302e extends Frame{
The code in Listing 1 declares numerous instance variables, initializing many of them.
Note in particular the instantiation of objects of the classes TextArea and TextField in Listing 1. As mentioned earlier, this GUI was purposely made narrow to force it to fit into this narrow publication format. The GUI is much more useful if it is wider. I recommend that you modify the parameters passed to the TextArea and TextField constructors in Listing 1 to make them at least 100 characters wide (instead of 50).
I also recommend that you modify the parameters passed to the setSize method later in the program (see Listing 9) to cause the Frame object to be at least 750 pixels wide.
The main method
The main method for the program is shown in Listing 2.
public static void main(String[] args){
The main method begins by instantiating an object of the Pop302e class. (Much of the interesting code in the program resides in the constructor.) Then the main method invokes the makeBodyWordList method on a reference to the object. I will briefly discuss that method, along with a couple of other utility methods, before discussing the constructor.
The makeBodyWordList method
The purpose of the makeBodyWordList method is to create a TreeSet object containing words used by the program named Pop302 to screen body text of Email messages in an attempt to identify SPAM messages.
The makeBodyWordList method reads strings from a text file named Pop302b.txt and creates the list as a TreeSet object sorted in natural order with no duplicates.
The data read from the file is converted to upper case before being added to the TreeSet object.
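The complete makeBodyWordList method appears in the full program listing. The following is only a simplified sketch of the loading step described above, under the assumption that the file holds one word or phrase per line. The class name LoadWordListSketch and the method load are hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Locale;
import java.util.TreeSet;

class LoadWordListSketch {

  // Reads the word list from a text file into a TreeSet, which keeps
  // the entries in natural order and discards duplicates.
  // Assumption: one word or phrase per line; blank lines are skipped.
  static TreeSet<String> load(String fileName) throws IOException {
    TreeSet<String> words = new TreeSet<String>();
    try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
      String line;
      while ((line = in.readLine()) != null) {
        // Convert to upper case before adding, as described above.
        String word = line.trim().toUpperCase(Locale.ROOT);
        if (word.length() > 0) {
          words.add(word);
        }
      }
    }
    return words;
  }
}
```

Calling LoadWordListSketch.load("Pop302b.txt") would return the sorted list, provided the file exists in the current directory.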
Also makes backup files
After creating the TreeSet object, the method writes the data from the object into a backup file named Pop302b.bakN, where N is the value of the next available file name in the directory.
A new backup file with a unique name is created each time the program is run. Once the number of sequential backup files reaches 5, the program automatically deletes the oldest file before creating a new backup file.
Thus the program maintains a sequence of five backup files with extensions bak0 through bak5 with one number missing.
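The exact bookkeeping used by makeBodyWordList can be seen in the complete listing. As an illustration of the idea only, the sketch below implements one plausible reading of the scheme: six possible names, bak0 through bak5, with one name always unused. The new backup goes into the unused name, and the name that follows it (wrapping from 5 back to 0) holds the oldest backup, which is deleted first. The class name BackupRotationSketch, its methods, and this wraparound rule are assumptions.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.TreeSet;

class BackupRotationSketch {

  // Six possible extensions, bak0 through bak5; at most five in use.
  static final int SLOTS = 6;

  static File slot(File dir, String baseName, int n) {
    return new File(dir, baseName + ".bak" + n);
  }

  // Writes the word list to the next available backup file and
  // returns that file.
  static File writeBackup(File dir, String baseName, TreeSet<String> words)
      throws IOException {
    // The next available name is the first one that is not in use.
    int next = 0;
    for (int n = 0; n < SLOTS; n++) {
      if (!slot(dir, baseName, n).exists()) {
        next = n;
        break;
      }
    }
    // Once the names have wrapped around, the file just after the gap
    // is the oldest. Delete it before writing the new backup.
    File oldest = slot(dir, baseName, (next + 1) % SLOTS);
    if (oldest.exists()) {
      oldest.delete();
    }
    File backup = slot(dir, baseName, next);
    try (PrintWriter out = new PrintWriter(new FileWriter(backup))) {
      for (String word : words) {
        out.println(word);
      }
    }
    return backup;
  }
}
```

Under this reading, the first five runs create bak0 through bak4, the sixth run deletes bak0 and creates bak5, and the seventh deletes bak1 and creates a new bak0.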
You can view the makeBodyWordList method in its entirety in Listing 10 near the end of the lesson.
The makeBodyWordList method is very similar to the makeSubjWordList method that I discussed in the previous lesson. Therefore, I won't discuss it further in this lesson.
Additional utility methods
The Pop302e class contains two additional utility methods, which you can view in their entirety in Listing 10:
- writeBodyWordList
- removeStars
The removeStars method is the same as a method having the same name that I discussed in an earlier lesson entitled Enlisting Java in the War Against SPAM: The Screening Module. This method is used to remove asterisks that were purposely inserted into the message files when they were created to defeat any virus code that might be lurking there. I will simply refer you back to the discussion in the earlier lesson for this method.
The constructor
That brings us to the constructor for the class named Pop302e, which begins in Listing 3.
The code in Listing 3 registers an anonymous WindowListener object to service the close button on the Frame.
Pop302e(){//constructor
The windowClosing method in Listing 3 invokes the writeBodyWordList method to write the current and updated contents of the TreeSet object into the output file named Pop302b.txt.
Then the windowClosing method terminates the program.
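The writeBodyWordList method itself can be viewed in the complete listing. The sketch below shows, in simplified form, what saving the list on exit amounts to: the sorted contents of the TreeSet overwrite the text file. The class name WriteWordListSketch, the method write, and the one-entry-per-line file layout are assumptions.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.TreeSet;

class WriteWordListSketch {

  // Overwrites the named file with the contents of the word list.
  // Iterating over a TreeSet yields the entries in sorted order.
  // Assumption: one word or phrase per line.
  static void write(TreeSet<String> words, String fileName)
      throws IOException {
    try (PrintWriter out = new PrintWriter(new FileWriter(fileName))) {
      for (String word : words) {
        out.println(word);
      }
    }
  }
}
```

In a window-closing handler, a call such as write(wordList, "Pop302b.txt") would come just before the call that terminates the program, so that no posted words are lost.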
Set the layout for the GUI
The code in Listing 3 also sets the layout for the GUI to FlowLayout. This is a simple layout manager that works pretty well for this simple GUI.
An ActionListener on the Next Msg button
The code in Listing 4 registers an anonymous ActionListener object on the Next Msg button in Figure 1.
//Register an ActionListener on the
Set prevUrl to zero
I will point out one statement in Listing 4 that is different from the code in the previous lesson. The statement highlighted in red in Listing 4 sets the value of a variable named prevUrl to zero. This will become important later when I discuss the action listener that I will register on the Select URL button in Figure 1.
Instantiate ActionListener object from anonymous class
The code in Listing 5 registers a common ActionListener on the Post Word button and also on the third text field in Figure 1. This makes it possible to post a new word or phrase to the list by pressing the Post Word button, or by pressing the Enter key when the text field has the focus.
Note that the code in Listing 5 defines an anonymous class, but it does not use an anonymous object.
ActionListener postListener =
The Select IP button
Now we are getting into new territory that wasn't covered in the previous lesson. Figure 3 shows the GUI after the user has loaded a message file from the history folder and has pressed the Select IP button. The program has found the IP address of the computer that (presumably) originated the message, and has copied that IP address into the third text field. At this point, the user has the option of pressing the Post Word button to cause the IP address to be added to the list of offending words and phrases.
The user also has the option of editing the IP address before pressing the Post Word button. For example, removing the rightmost three digits would cause the truncated IP address to refer to a group of machines rather than a single machine.
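The program leaves this edit to the user, but the effect is easy to show in code. The sketch below trims an address back to its last period, which works whether the final group has one, two, or three digits. The class name IpPrefixSketch and the method dropLastGroup are hypothetical.

```java
class IpPrefixSketch {

  // Returns the address with its final group of digits removed. The
  // trailing period is kept so that the prefix 198.214.191. cannot
  // match an address such as 198.214.19.7.
  static String dropLastGroup(String ip) {
    int lastDot = ip.lastIndexOf('.');
    return lastDot < 0 ? ip : ip.substring(0, lastDot + 1);
  }

  public static void main(String[] args) {
    String prefix = dropLastGroup("198.214.191.169");
    System.out.println(prefix);
    // A simple substring test now matches any machine in the group.
    System.out.println("from host [198.214.191.7]".contains(prefix));
  }
}
```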
(This narrow publication format makes it a little difficult to see, but the beginning of the IP address is highlighted on the right side of the large text area. This illustrates why you should increase the width of the GUI before using the program.)

Figure 3 GUI with IP address selected.
How reliable is the originating IP address information?
I don't have a good answer to that question. Obviously, there are programmers who can write programs to insert a fake IP address into each packet before the packets are transmitted. (Otherwise, it would not be possible to write programs that insert valid IP addresses into packets before they are transmitted.)
I am led to believe, however, that it is much more difficult to fake the originating IP address than it is to fake other parts of the header such as the information in the Return-Path and From lines of the header. While the originating IP address may not be totally reliable as a means of identifying the source of the message, it is probably the most reliable item in the header.
The specifics
The message header for a POP3 message may contain several lines that start with the text "Received:". The one farthest from the beginning of the message contains the IP address of the computer that originated the message (possibly faked). The ones closer to the beginning of the message show the IP addresses of machines along the way that relayed the message from the sender to the receiver.
A typical message header
The header for a typical Email message is shown in Figure 4 (with line breaks manually inserted to force it to fit in this narrow publication format).
Note that this message was not SPAM. Rather, this was a valid Email message that I received from a colleague at the college where I teach.
...
I highlighted the "Received:" line containing the originating IP address in blue in Figure 4.
The originating IP address, which I highlighted in red, is contained in the first pair of square brackets in that line.
(Had I not manually inserted line breaks, all of the material in blue and red would appear on a single line.)
Who originated the message?
Figure 5 contains information about IP address 198.214.191.169 obtained from the ARIN WHOIS database at http://www.arin.net/whois/.
Search results for: 198.214.191.169
You can click on the links in Figure 5 to learn more details about the organization to which this IP address is assigned. For example, Figure 6 shows the information provided by one of the links in Figure 5.
OrgName: Austin Community College
Why am I telling you this?
The jury is still out regarding the usefulness of the originating IP address with regard to identifying the source of SPAM messages. The IP address of spammers seems to change frequently for a variety of reasons.
I have accumulated several thousand recent SPAM messages. As soon as I can find the time, I plan to do some statistical work to identify those originating IP addresses that have been used in a large percentage of those messages. I will add those IP addresses to my list of offensive words and phrases.
Let's all complain in unison
However, the IP address is very useful for another purpose -- complaining.
Worms send SPAM and viruses
It probably wouldn't do much good to complain to an originator that distributes SPAM for profit. However, some SPAM and most viruses are actually sent by malicious code embedded in the computers of unsuspecting people. Sometimes complaining to the technical contact for the originating IP address of a SPAM or virus message will result in the malicious code being removed from the computer. That will eliminate the SPAM and virus messages being transmitted by that computer.
Every little bit helps
Removing such code from one computer wouldn't have much impact on the overall problem, but removing such code from many computers could have a significant impact on the problem. If nothing else, it would make it easier to identify the real culprits in the SPAM and virus problem.
For example, a good friend of mine was recently notified by her cable modem ISP that a computer connected to the cable modem had been transmitting SPAM or viruses. Apparently the ISP had received a complaint that included the date, time, and originating IP address in the message header. The ISP was able to use this information to determine that the IP address had been assigned to my friend's cable modem at the time that the message was transmitted. My friend received technical advice from the ISP to help in cleaning up the computer and eliminating the problem.
At this point in time, the main reason that I have provided the ability in Figure 3 to easily extract the originating IP address is in the hope that lots of users of the program will use that information to complain to the organization to which the IP address is assigned. Remember, given the originating IP address, you can obtain information about the organization to which that IP address is assigned at http://www.arin.net/whois/.
Automating the complaint process
One of the upgrades that I am considering for the future is to add a module to the program that makes it possible to press a button while viewing a raw SPAM message and to automatically send a complaint via Email to the TechName at the TechEMail address illustrated for each IP address in Figure 6. If I do that, I might reasonably expect lots of users to take advantage of the capability and to complain about SPAM and virus messages that they receive.
Enough talk, let's see some code
Listing 6 shows the code that registers an action listener on the Select IP button shown in Figure 3. This listener finds and selects the IP address of the originator of the message in the text string encapsulated in the TextArea object.
(Programmatically selecting text in a TextArea object is similar to manually selecting text with a mouse.)
Write the originating IP address into the text field
Having identified and selected the text comprising the originating IP address, the code in the action listener writes the selected characters into the third text field in Figure 3. At that point, the user has the option of adding the IP address to the list of offensive words and phrases by pressing the Post Word button.
(Perhaps equally important, the user has the option of writing the information down and using it later to complain to the organization to which the IP address is assigned.)
Where is that IP address again?
As mentioned earlier, the originating IP address is the first IP address in square brackets that occurs in the last line in the message that starts with the text Received:
selectIpButton.addActionListener(
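The search rule described above (the last line starting with Received:, then the first IP address in square brackets on that line) can be sketched without any GUI code. The following is not the listener from the program, only a minimal illustration of the rule. The class name OriginatingIpSketch and the method find are hypothetical. The sketch also makes two choices of its own: it rejoins header lines that were wrapped onto indented continuation lines, and it stops searching at the first blank line, which marks the end of the header.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OriginatingIpSketch {

  // Four groups of one to three digits inside square brackets.
  private static final Pattern BRACKETED_IP =
      Pattern.compile("\\[(\\d{1,3}(?:\\.\\d{1,3}){3})\\]");

  // Returns the first bracketed IP address in the last Received: line
  // of the header, or null if none is found.
  static String find(String rawMessage) {
    String lastReceived = null;
    StringBuilder current = null; // Received: line being assembled
    for (String line : rawMessage.split("\r?\n")) {
      if (line.length() == 0) {
        break; // a blank line ends the header
      }
      char first = line.charAt(0);
      if (first == ' ' || first == '\t') {
        // A continuation line belongs to the previous header line.
        if (current != null) {
          current.append(' ').append(line.trim());
        }
        continue;
      }
      if (current != null) {
        lastReceived = current.toString();
      }
      current = line.startsWith("Received:") ? new StringBuilder(line) : null;
    }
    if (current != null) {
      lastReceived = current.toString();
    }
    if (lastReceived == null) {
      return null;
    }
    Matcher m = BRACKETED_IP.matcher(lastReceived);
    return m.find() ? m.group(1) : null;
  }
}
```

In a GUI version, the position of the match (m.start(1) and m.end(1)) would also be needed in order to select the text in the TextArea, but those positions would refer to the rejoined line rather than to the raw message text.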
The Select URL button
Each time the user presses the Select URL button in Figure 2, the program finds the next URL in the text area, selects it, and copies it into the third text field. The HTTP:// text is stripped from the beginning of the URL to reduce processing time later when the URL is used to screen a message.
(A beneficial upgrade would be to also strip WWW from the beginning of the URL when it occurs, because it doesn't contribute much in the way of identifying information for the URL.)
Once the URL has been copied into the text field, the user has the option of pressing the Post Word button to add it to the list of offensive words and phrases.
(The user can edit the URL in the text field before posting if needed. At this point in time, I usually eliminate the WWW by editing it out.)
Figure 2 shows the URL for HTTP://SECURE.BIDZ.COM selected and copied into the text field, ready for posting to the list.
Register action listener on the Select URL button
The code in Listing 7 registers an action listener on the Select URL button. This listener finds and selects the next URL that appears in the message text. When it finds a URL, it writes the selected characters into the third text field in Figure 2.
URLs are located by searching for HTTP:// in the text of the message.
selectUrlButton.addActionListener( |
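The search described above can be sketched as follows. This is a minimal illustration and not the code from Listing 7. The class UrlScanner and its methods are hypothetical. The sketch also assumes that a URL ends at the first whitespace character, quote, or angle bracket, and that the search ignores case. Neither assumption is confirmed by the text above.

```java
// Hypothetical helper, not part of Pop302e.
class UrlScanner {
    private static final String PREFIX = "HTTP://";

    // Returns the index of the next HTTP:// at or after fromIndex,
    // ignoring case, or -1 if there is none.
    static int findNextUrl(String text, int fromIndex) {
        for (int i = Math.max(0, fromIndex);
                 i + PREFIX.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, PREFIX, 0, PREFIX.length())) {
                return i;
            }
        }
        return -1;
    }

    // Assumption: the URL ends at whitespace, a quote, or an angle bracket.
    static int findUrlEnd(String text, int urlStart) {
        int i = urlStart + PREFIX.length();
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c) || c == '"' || c == '\''
                    || c == '<' || c == '>') {
                break;
            }
            i++;
        }
        return i;
    }

    // Returns the URL with the leading HTTP:// removed.
    static String stripPrefix(String text, int urlStart, int urlEnd) {
        return text.substring(urlStart + PREFIX.length(), urlEnd);
    }
}
```

To step through a message one URL at a time, a caller could keep the end offset of the previous URL and pass it as fromIndex on the next button press.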
The Delete Local File/NextMsg button
The code in Listing 8 registers an ActionListener on the Delete Local File/NextMsg button shown in Figure 2. This makes it possible for the user to remove a file from the local directory.
The user would typically delete files that are recognized as not being SPAM. Also, the user would delete files for which analysis is complete. When a file is deleted using this button, the next file is automatically loaded ready for processing.
deleteButton.addActionListener( |
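The delete-then-advance behavior described above can be sketched as follows. This is a minimal illustration and not the code from Listing 8. The class LocalFileDeleter and the method deleteAndGetNext are hypothetical, and the sketch assumes that "next" means the next file in name order within the directory, which the text above does not specify.

```java
import java.io.File;
import java.util.Arrays;

// Hypothetical helper, not part of Pop302e.
class LocalFileDeleter {
    // Deletes current and returns the next file to load, or null when
    // no later file remains. If the deletion fails, current is returned
    // so the caller can keep displaying it.
    static File deleteAndGetNext(File dir, File current) {
        if (!current.delete()) {
            return current;
        }
        File[] remaining = dir.listFiles();
        if (remaining == null) {
            return null; // dir is not a readable directory
        }
        // Assumption: files are visited in sorted order.
        Arrays.sort(remaining);
        for (File candidate : remaining) {
            if (candidate.isFile()
                    && candidate.getName().compareTo(current.getName()) > 0) {
                return candidate;
            }
        }
        return null;
    }
}
```

Because File.delete is permanent, a caller might reasonably confirm with the user before invoking a method like this on anything other than test data.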
Configure the GUI
The code in Listing 9 configures the GUI by placing the various components in it, setting the title, setting the size, and making the whole thing visible.
add(postButton); |
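The configuration steps named above (adding components, setting the title, setting the size, and making the GUI visible) can be sketched as follows. This is a minimal illustration and not the code from Listing 9. Apart from postButton, every name here is hypothetical, as are the layout manager, the title string, and the dimensions. The sketch assumes an AWT Frame subclass, which is consistent with the use of TextArea in this lesson.

```java
import java.awt.Button;
import java.awt.FlowLayout;
import java.awt.Frame;
import java.awt.TextArea;
import java.awt.TextField;
import java.awt.event.WindowAdapter;
import java.awt.event.WindowEvent;

// Hypothetical GUI skeleton, not part of Pop302e.
class TrainerGuiSketch extends Frame {
    final TextField wordField = new TextField(100);      // assumed width
    final Button postButton = new Button("Post Word");
    final TextArea messageArea = new TextArea(20, 100);  // assumed size

    TrainerGuiSketch() {
        // Assumption: a simple flow layout.
        setLayout(new FlowLayout());
        add(wordField);
        add(postButton);
        add(messageArea);
        setTitle("Body text trainer (sketch)");
        setSize(750, 500);
        // Release the window when the user closes it.
        addWindowListener(new WindowAdapter() {
            @Override
            public void windowClosing(WindowEvent e) {
                dispose();
            }
        });
    }

    public static void main(String[] args) {
        // Making the frame visible is the last configuration step.
        new TrainerGuiSketch().setVisible(true);
    }
}
```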
The remaining code
The remaining code in Listing 10 near the end of the lesson is essentially the same as code that I discussed in detail in the previous lesson. Therefore, I won't repeat that discussion here.
Run the Program
I encourage you to copy the code from Listing 10 below, as well as the programs and starter text files at the ends of the lessons entitled Enlisting Java in the War Against SPAM: The Communications Module, Enlisting Java in the War Against SPAM: The Screening Module, and Enlisting Java in the War Against SPAM: Training the Subject Line Screener.
Compile and execute the program named Pop302. This should put some message files in your history directory that can be used to test the algorithm training programs named Pop302d and Pop302e.
Then compile and execute the programs named Pop302d and Pop302e to train the algorithm based on your history data. Experiment with the programs, making changes and observing the results of your changes.
For maximum usability, I recommend that you increase the width of the GUIs to at least 750 pixels, and increase the width of the text fields and text areas in those GUIs to at least 100 characters.
The programs rely on the following text files:
- Pop302a.txt - contains offensive Subject line words and phrases
- Pop302b.txt - contains offensive body text words and phrases
- Pop302c.txt - contains friendly Email addresses and friendly Subject line material
Eventually you will need to populate these files with words and phrases that work well for you. The programs named Pop302d and Pop302e will help you to update those files based on actual SPAM messages in your history directory.
In the meantime, I have provided sample text files in Listings 35, 36, and 37 in the lesson entitled Enlisting Java in the War Against SPAM: The Screening Module. You can use those files as starter lists. If you receive the same kinds of SPAM that I receive, the words in these lists should make it possible for you to test the programs and to get a few hits on SPAM messages.
These are simple text files, so feel free to add other words and phrases as appropriate.
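As an illustration of how a simple word list of this kind might be loaded and tried against a message body, here is a minimal sketch. It is not the screening code from the earlier lessons. The class PhraseListSketch and its methods are hypothetical. The sketch assumes one word or phrase per line and a case-insensitive substring match, and neither assumption is confirmed by this lesson.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the Pop302 programs.
class PhraseListSketch {
    // Trims each line, converts it to upper case, and drops blank
    // lines and duplicates. Assumption: one phrase per line.
    static List<String> normalize(List<String> lines) {
        List<String> phrases = new ArrayList<>();
        for (String line : lines) {
            String phrase = line.trim().toUpperCase(Locale.ROOT);
            if (!phrase.isEmpty() && !phrases.contains(phrase)) {
                phrases.add(phrase);
            }
        }
        return phrases;
    }

    // Returns the first listed phrase found in the body, or null if
    // none is found. Assumption: case-insensitive substring matching.
    static String firstHit(String body, List<String> phrases) {
        String upper = body.toUpperCase(Locale.ROOT);
        for (String phrase : phrases) {
            if (upper.contains(phrase)) {
                return phrase;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Reads the body text list from the current directory.
        List<String> phrases =
            normalize(Files.readAllLines(Paths.get("Pop302b.txt")));
        System.out.println(phrases.size() + " phrases loaded");
    }
}
```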
(Let me caution you not to enable the DELE code in the program named Pop302 until you are certain that you actually want to delete messages from the server. Once a message is deleted from the server, there is no way to recover it from the server.)
Summary
This lesson shows you how to train my SPAM screening algorithm to do a better job of identifying SPAM in the future based on the body text of SPAM messages.
The lesson explains a program named Pop302e, which uses historical data to train the algorithm.
The GUI used to control the program is shown in Figures 1, 2, and 3. The program is designed for extreme ease of use. Training the algorithm consists mainly of pressing buttons and occasionally selecting text with the mouse. This can be accomplished very quickly with very little effort. Except for the possible requirement to delete extra characters, no actual typing is required.
After several weeks of training, my algorithm is reliably identifying about ninety-five percent of all SPAM messages, allowing me to delete them from my Email server before downloading them into my primary Email client.
What's Next?
This program is a continuing work in progress. Just today, for example, I have been working to make the program more effective in combating a flurry of new Email activity triggered by the propagation of the W32.Novarg.A@mm virus across the Internet.
The characteristics of the Email messages transporting the virus are well defined at the Symantec site. I have used those characteristics to successfully identify and delete the messages transporting the virus from my Email server before downloading them into my primary Email client. Once I learned of the virus and located the characteristics on the Symantec web site, only a few minutes were required to create a successful block.
However, I am also receiving large numbers of Email messages from servers that received messages containing the virus with the From line faked to show my return Email address. These messages, which are complaining that I supposedly sent them a virus, are a little more difficult to deal with. They arrive in many different formats so it has taken some time to nail down the key words in the many different formats. After about twenty-four hours, I pretty well have that situation under control as well.
Finally, I am receiving large numbers of undeliverable Email messages from servers that received messages containing the virus with the From line faked to show my return Email address and an invalid To address. These messages are also difficult to deal with due to two factors:
- They arrive in many different formats.
- They often contain an embedded copy of the original message.
Note, however, that my primary goal in developing this software package was to combat the daily flood of SPAM messages, and was not to deal with the occasional flurry of messages resulting from viruses. The ability to deal with those messages to any degree of effectiveness is simply a bonus.
I continue to make improvements to the programs as I learn more about the tricks that spammers use to thwart programs such as this one. Some of those tricks and possible ways to defeat those tricks were described near the end of the previous lesson entitled Enlisting Java in the War Against SPAM: Training the Subject Line Screener.
The next lesson, and several lessons following that one, will show you my implementation of program upgrades designed to make the program more effective against the more sophisticated spammer tricks.
Complete Program Listing
Disclaimer of responsibility: If you elect to use this program you use it at your own risk. Make absolutely certain that you understand what you are doing before you compile and execute the program. Inappropriate use could result in the loss of Email messages. The author of this program, Richard G. Baldwin, accepts no responsibility for any losses that you may incur as a result of using this program.
/*File Pop302e.java Copyright 2004, R.G.Baldwin |
Copyright 2004, Richard G. Baldwin. Reproduction in whole or in part in any form or medium without express written permission from Richard Baldwin is prohibited.
About the author
Richard Baldwin is a college professor (at Austin Community College in Austin, TX) and private consultant whose primary focus is a combination of Java, C#, and XML. In addition to the many platform and/or language independent benefits of Java and C# applications, he believes that a combination of Java, C#, and XML will become the primary driving force in the delivery of structured information on the Web.
Richard has participated in numerous consulting projects, and he frequently provides onsite training at the high-tech companies located in and around Austin, Texas. He is the author of Baldwin's Programming Tutorials, which has gained a worldwide following among experienced and aspiring programmers. He has also published articles in JavaPro magazine.
Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.
-end-