Java Programming Notes # 2152
- Preface
- Preview
- Discussion and Sample Code
- Preview of Future Lessons
- Run the Program
- Summary
- What’s Next?
- Complete Program Listing
Preface
This is the second lesson in a series designed to teach you how to
write
a Java program to remove SPAM from your Email server before you
download it into your primary Email client. The first lesson was
entitled Enlisting
Java in the War Against SPAM, Part 1, The Communications Module.
The communications module
The first lesson explained the communications module used to
communicate with
your Email server, and to remove SPAM messages from the
server.
SPAM screening algorithm
The program is designed to allow you to use my SPAM screening
algorithm, or to invent your own. This lesson explains the
inner workings of my SPAM screening algorithm. You can use my
algorithm as a starting point if you decide to invent your own.
Training the algorithm
The next two lessons will
explain how my algorithm can be trained to do an increasingly better
job of screening SPAM over time.
Viewing tip
You may find it useful to open another copy of this lesson in a
separate browser window. That will make it easier for you to
scroll back and forth among the different listings and figures while
you are reading about them.
Supplementary material
I recommend that you also study the other lessons in my extensive
collection of online Java tutorials. You will find those lessons
published at Gamelan.com.
However, as of the date of this writing, Gamelan doesn’t maintain a
consolidated index of my Java tutorial lessons, and sometimes
they are difficult to locate there. You will find a consolidated
index at www.DickBaldwin.com.
Preview
Can you write better SPAM screening algorithms?
Did you ever think that you might be able to write better SPAM
screening algorithms than those available in the SPAM screening
software that you are now using? If so, this series of lessons is
for you.
Even if that is not the case, like most of us, you are probably
overwhelmed by SPAM
and therefore you may find this lesson interesting.
Remove SPAM from the server
In this and the previous lesson, I am showing you how to write a
Java program
that supplements the SPAM screening software that you are currently
using. This program is used to identify and remove SPAM from your
Email server before it is downloaded into your primary Email client.
Any SPAM that makes it past this program can be further acted upon
by the SPAM screener that is built into your Email client.
The communications module
This series consists of (at least) four lessons. The
first lesson in the
series explained the communications module used to communicate with
your Email server, and to remove SPAM messages from the
server.
My SPAM screening algorithm
As mentioned above, this program is designed to allow you to invent
and implement your own
SPAM screening algorithm in addition to, or as an alternative to my
algorithm.
This lesson explains the inner
workings of my SPAM screening algorithm. My algorithm operates
separately on the Subject line, the From line,
and the body text of each Email message.
Algorithm training programs
The third lesson will explain a companion program named Pop302d,
designed to make
use of historical data to train the algorithm to do a better job
of identifying SPAM in future messages based on the Subject
of the message.
The fourth lesson will explain another companion program named Pop302e,
designed to
make use of historical data to train the algorithm to do a
better job of identifying SPAM based on the body text of
the message (which includes the From line).
Because of the need to train the algorithm, and the ease with which
these companion programs make that possible, the companion programs are
as important as the main program.
Operational sequence
Here is the typical operational sequence that I go through each
morning to remove SPAM from my Email server before downloading my mail into
my primary Email client, and to train the algorithm to recognize
future SPAM messages like those that made it through the screen that morning.
1. Run the main program named Pop302 (explained in this and the previous lesson) to identify SPAM and remove it from the server. This normally allows a few (typically about ten percent) SPAM messages (stragglers) to get through, which are stored in a history folder on my local disk.
2. Run the program named Pop302d (explained in the next lesson) to train the algorithm to recognize the stragglers as SPAM based on information in the Subject line.
3. Run the program named Pop302e (explained in Part 4) to train the algorithm to recognize the stragglers as SPAM based on information in the body text.
4. Go back and run the main program named Pop302 to remove those SPAM straggler messages from the server.
5. Run my primary Email client to download the remaining good messages into my local Email inbox.
When I am in a hurry …
However, it isn’t necessary to perform all of these steps every
day. On those mornings when I am in a hurry, I skip steps
2, 3, and 4, leaving the straggler messages in the local history folder
for use later.
(The straggler messages will, of course, end up in my
local Email inbox when I run my primary Email client without purposely
removing them from the server beforehand.)
Sometime later (perhaps the next day or several days later)
I will perform steps 2 and 3 to train the algorithm to recognize future
SPAM
messages represented by the characteristics of the messages that have
been saved in the local
history folder.
Effectiveness of my algorithm
After about one week of training, my
algorithm was reliably identifying about ninety percent of all SPAM
messages, allowing me to delete them from my Email server before
downloading them into my primary Email
client. By executing steps 2, 3, and 4 above, I am able to also
eliminate the remaining ten percent of the SPAM messages before
downloading them into my primary Email client.
Discussion and Sample Code
The full Screen class
The version of the program that I discussed in the previous lesson
contained a
stripped-down version of a class named Screen. That version of
the program
allowed for testing the communications module on your system with your
Email server
without doing any actual screening for SPAM.
I will explain the full version of the class named Screen in
this lesson. In so doing, I will explain my algorithm for identifying
SPAM.
Purpose of the program
The purpose of this program is to read messages from a POP3 (Post
Office Protocol – Version 3) server, to
analyze the messages according to a set of screening rules, and to
delete the messages that fail the screening test from the server.
(As written, the program asks the user to confirm the
deletion of each message from the server, but this confirmation step
could easily be
removed if you decide to do so.)
Key words and phrases
My SPAM screening algorithm screens for SPAM on the basis of words
or
phrases in the From line, words or phrases in the Subject
line, and
words or phrases in the body text.
Friendly Email addresses and subjects
A list of friendly Email addresses and friendly subjects is used to
screen the From
line and the Subject line. Messages that are from
friendly Email addresses, and messages that have known good Subject
lines are preserved on the server and no information about those
messages is
saved on the local disk. They are simply ignored after determining that
they are friendly.
Different lists for Subject and body text
Different lists of words and phrases are used for screening Subject
lines and body text for SPAM. This is important because the same set of
words
and phrases can’t always be used for both cases.
For example, the word ANTIVIRUS is appropriate for screening the
Subject line, but is not appropriate for screening the
body text. The
word ANTIVIRUS often appears legitimately in the header of Email messages
that have been scanned for viruses by the server, but also often
appears in the Subject line of SPAM messages.
Common spammer tricks are defeated
Several common spammer tricks are defeated by my SPAM screening
algorithm.
For example, the common spammer trick of inserting extra characters
between the
characters in an offending word or phrase is defeated. Also, the
common trick of mixing the case of the
characters in an offending word or phrase is defeated.
As a specific example, my algorithm will recommend deletion of any
message having
any of the following in its Subject line or its body
text if the word VIAGRA is included in the lists used to screen for
SPAM:
vIaGrA
V.IagRA
V.I.A.G.R.A
Very important characteristics
These two characteristics of the algorithm alone have a significantly
positive
impact on the effectiveness of training the algorithm to do a better
job of
identifying SPAM in the future.
(You don’t have to identify all of the variations of a
word or phrase commonly used by spammers to fool the system. The
program does that for you automatically.)
My algorithm also defeats the common trick of appending random
characters to the end of the Subject line, because it
doesn’t require a
match for the entire Subject line. Rather, it
searches for words
or phrases internal to the text of the Subject line.
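The three defenses described above (mixed case, inserted characters, and appended random characters) can be illustrated with a short sketch. This is not the code from Listing 34. It is one possible approach, under the assumption that a limited number of extra characters (the hypothetical parameter maxGap) is tolerated between successive characters of a listed phrase. The class and method names are invented for the illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Pop302 program.
class ObfuscationMatcher {

    // Build a case-insensitive pattern that tolerates up to maxGap
    // arbitrary characters between successive characters of the phrase.
    static Pattern toPattern(String phrase, int maxGap) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < phrase.length(); i++) {
            if (i > 0) {
                regex.append(".{0,").append(maxGap).append("}");
            }
            // Quote each character so that '.', '*', etc. are literals.
            regex.append(Pattern.quote(String.valueOf(phrase.charAt(i))));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // True if the phrase occurs anywhere inside the line, so characters
    // appended to the end of a Subject line do not prevent a match.
    static boolean matches(String line, String phrase, int maxGap) {
        return toPattern(phrase, maxGap).matcher(line).find();
    }
}
```

With maxGap set to 1, this sketch matches all three VIAGRA variations listed above. It also matches SLUT inside SoLUTion, which shows how this kind of tolerance can produce false positives.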
The user interface
Figure 1 shows the GUI through which the user controls the program.
Figure 1 Graphical User Interface
(Note that this GUI was purposely made narrow to cause
it to fit into this narrow publication format. I
recommend that you increase the width of the Frame to at least 750
pixels, and increase the width of the TextField and TextArea objects to
at least 100 characters each.
Note also that this is an actual SPAM message, from which I purposely
removed the Email address of the sender prior to publication. The
message may not have actually been sent by the individual whose Email
address appeared on the From line.)
The Offending Phrase
When the program identifies a message that is a candidate for deletion,
the reason for that recommendation is shown in the third text field
from the top in Figure 1.
Deleting a message from the server
The user confirms that the message should be deleted from the Server by
clicking the Delete button
in Figure 1. If the user doesn’t want to delete the message, he should
click the Start/Next button instead.
(Note that the capability to actually delete messages
from the server was disabled in the program shown in Listing 34 near
the end of this lesson. Make certain that you are ready to
actually delete messages from the server before you enable that
capability.)
Information available at decision time
As currently written, this program requires the user to confirm the
actual deletion of each SPAM message from the server before that
message is actually deleted.
At the point in time that the user is required to confirm deletion of a
message
from the server, the following information is available to assist the
user in
making
the decision:
- From line
- Subject line
- Offending line of text, which may or may not be the subject
- Offending word or phrase in the offending line of text
- Entire raw text of the message down to and including the
offending line
No images are rendered
No images are rendered by the program, so it is not necessary for the
user to view offending images in order to make the decision to delete.
Deletion is not required
Having viewed the above information, if the user is still unable to
make
an informed decision to delete the message, the user has the
option to let the message pass through and to be downloaded into his
primary Email client. Once having viewed the message in the
primary Email client, the user still has the option of updating the
offending word lists with IP addresses, URLs, etc, so that
deletion decisions on future similar messages will be easier to make.
Saved in local archive folder
The raw text of every message that is identified as a candidate for
deletion from the
server is saved in an archive folder on the local disk, regardless of
whether the user elects to delete it from the server or not. Thus if
a message is deleted from the server and it is later determined that
this was a mistake, a raw text copy of the deleted message is available
locally in the archive folder.
(You should probably empty this folder periodically so
that it won’t fill up your disk.)
In addition, I have plans to write several additional programs that
will analyze large numbers of SPAM messages in the archive folder for
at least two purposes:
- To remove words and phrases from the word lists that occur in only a very small percentage of SPAM messages, and therefore increase run time without contributing significantly to the desired result.
- To search for common characteristics among SPAM messages that can be used to improve the effectiveness of the screening algorithm.
Saved in history folder
Except for messages from friendly Email addresses and messages with
friendly Subject lines, all messages that
are not identified as candidates for deletion from the server are saved
in a history
folder on the local disk. These messages are used later to train
the algorithm to do a better job of identifying future SPAM
messages. I will explain this training process in Part 3 and Part
4 of this
series of lessons.
Protection against viruses
Before any message is saved in a local file, asterisks are inserted
into the text on ten-character intervals in an attempt to destroy any
virus code that may be embedded in the message.
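A minimal sketch of this kind of asterisk insertion is shown below. It is not taken from the program. The exact placement of the asterisks (here, after every tenth character of the whole text) and the names used are assumptions made for illustration.

```java
// Hypothetical helper, not part of the Pop302 program.
class Defang {

    // Insert an asterisk after every `interval` characters of the text.
    static String insertAsterisks(String text, int interval) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            out.append(text.charAt(i));
            if ((i + 1) % interval == 0) {
                out.append('*');
            }
        }
        return out.toString();
    }
}
```

For example, insertAsterisks("0123456789ABCDEFGHIJKL", 10) returns "0123456789*ABCDEFGHIJ*KL".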
If a message makes it through the screen and is later identified as
having a virus as an attachment, a series of ten or more bytes can be
extracted from the virus code and added to the word list as an
offending phrase. This should cause any future messages having
that
same virus code as an attachment to be identified as a candidate for
deletion from the server.
Training programs
Companion programs that I have written are used to analyze the
non-deleted message files
saved
locally in the history folder in order to train the algorithm to do a
better job of identifying SPAM messages in the future.
These programs
are designed for extreme ease of use to encourage the user to train the
algorithm frequently. The better the algorithm is trained, the
better it will perform.
I will explain these training programs in detail in Part 3 and Part 4
of this
series of
lessons. A brief preview
of the training programs is provided below.
Simple text files
All three word lists are maintained in local text files, which can be
created and edited with an ordinary text editor if need be. Thus,
if one of the lists becomes corrupted, it is easy to correct the
situation using an ordinary text editor.
File names
The following file names are hard-coded into the program. You may
want to change these file names for your version of the program.
- Local copy – the unique file name for a local copy of each message is based on the unique identifier for that message (UIDL) obtained from the mail server.
- Pop302a.txt – contains a word list for screening the Subject line for offensive words and phrases.
- Pop302b.txt – contains a word list for screening the body text for offensive words and phrases.
- Pop302c.txt – contains a list of friendly Email addresses and friendly subjects for screening the From and Subject lines to identify friendly messages.
Location of the text files
As written, the program requires the three .txt files to
be in the same folder as the compiled .class files for
the programs named Pop302, Pop302d, and Pop302e.
However, you can easily modify the programs to change the location of
the .txt files if
you choose to do so. Just be sure to change the location in all
three programs.
The local copies of the messages are stored in two different
folders. Some of the
local copies are stored in a history folder while the remainder are
stored in an archive folder. The locations of these folders on
the disk are hard-coded into the three programs. You can change
the locations if you like, but be sure to make appropriate changes to
all three programs.
Three classes
This program consists of two main classes and one minor class. As
discussed in the previous lesson, an
object of the class named
Pop302 handles all communications with the POP3 server.
A method belonging to an object of the class named Screen is
used to screen each message in an
attempt to identify SPAM. This is the class that I will explain
in this lesson.
This class can be totally replaced by Java programmers who choose to
design their own screening algorithm provided that they maintain the
interface with the object of the class named Pop302.
An object of a very simple class named ScreenResult is used as
a wrapper to return several items of information from the screening
method to the calling method.
Testing
The program was tested using SDK 1.4.2 under WinXP in conjunction with
two different POP3 Email servers.
Will discuss in fragments
I will discuss the class named Screen in fragments. A
complete listing of
the program is provided in Listing 34 near the end of the lesson.
You should be able to copy and paste that listing into your Java IDE to
compile and test the program on your system.
Improvements in the class named Pop302
Before getting into the details of the class named Screen, I
want to mention that the program shown in Listing 34 contains a couple
of improvements relative to the version explained in the previous
lesson.
One of the improvements involves displaying the message number in the
bottom of the text area of Figure 1.
The other improvement involves making a safety check to confirm that
the message number being maintained locally is in synchronization with
the message number on the server (in the UIDL) before deleting
a message from the server.
If you understand the rest of the program, these two modifications
should not require a detailed explanation.
Deletion of messages is disabled
Also before getting into the details of the Screen class I
want to show you a fragment containing three statements that are
disabled in the Pop302 class in Listing 34. The three
disabled statements are shown in Listing 1. (Note that the
statements are separated by comments in Listing 34.)
/*Begin comment block |
The three statements shown in Listing 1 were purposely disabled (by
including them in a comment block) to prevent you from accidentally
deleting messages from the server during your early testing of the
program. Do not enable these three statements until you are ready to
actually delete messages from the server. At that point in time,
you can enable the three statements by removing the comment indicators
that surround them.
The Screen class
The Screen class implements a set of rules for identifying SPAM
messages and for recommending whether or not a message should be
deleted from the server.
If you have a better way to identify SPAM, you can replace this class
with a completely different class definition, so long as you maintain its
interface with the object of the class named Pop302.
An object of this class has one entry point and one exit point, which
is the public instance method of the Screen class named screenMsg.
A callback to the GUI
However, there is an additional linkage between the two objects that
you need to
consider. The constructor for the Screen class receives
a reference to the GUI object
created by instantiating the class named Pop302. A method in
the object of the Screen class uses that reference to display
progress on the text area belonging to the GUI.
This display of progress is comforting on those occasions when a very
long message is encountered and the user needs assurance that the
system is still working, and isn’t hung up.
This callback link could easily be eliminated by deleting code from
several locations in the Screen class and removing the
callback parameter from the constructor.
Beginning of the Screen class
The Screen class begins in Listing 2, which declares several
instance variables.
class Screen{ |
The purpose of these
instance variables will become clear as I discuss the code in which
they are used.
The constructor
The constructor for the Screen class is shown in its entirety
in Listing 3.
Screen(Pop302 theGui){//constructor |
As you can see, the
constructor receives and saves a reference to the GUI. This
reference is used later to display progress as discussed above.
Make word lists as TreeSet objects
The last three statements in the constructor invoke methods that read
text files containing
lists of words or phrases, and create TreeSet objects
containing those words and phrases. These TreeSet
objects are used later to test for the occurrence of the words or
phrases in raw text versions of Email messages.
The TreeSet objects are created and populated by invoking three
very similar methods:
- makeSubjWordList
- makeBodyWordList
- makeFriendlyWordList
I will discuss each of these methods in the sections that follow.
The makeSubjWordList method
The purpose of the makeSubjWordList method is to
create a TreeSet object containing words and phrases used
later to screen the message Subject lines.
The makeSubjWordList method is shown in Listing 4.
This method reads strings from a text file named Pop302a.txt
and creates the list as a TreeSet object.
private void makeSubjWordList(){ |
Why use the TreeSet class?
The TreeSet class was chosen for this purpose because it
eliminates duplicates.
(Duplicates in the list are bad because they increase
runtime with no beneficial effect.
One of the major problems with the message filter in the commercial
Email client
program that I use is that
there is no way to avoid duplicates other than simply remembering that
an item was previously placed in the filter.)
With my screening algorithm, even if the user creates duplicates in the
text
file while training the algorithm, duplicates are eliminated from the TreeSet
object and also from the text file before
actual processing begins.
The code in Listing 4 is
straightforward and shouldn’t require further explanation.
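For readers who do not have Listing 4 at hand, the general idea of loading a word list into a TreeSet and removing duplicates from the text file can be sketched as follows. This is a sketch written for a current Java release (generics and java.nio.file), not the code from the lesson, which was tested under SDK 1.4.2. The class and method names are hypothetical, and the conversion to upper case is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.TreeSet;

// Hypothetical helper, not part of the Pop302 program.
class WordListSketch {

    // Read one phrase per line into a TreeSet (sorted, no duplicates),
    // then rewrite the file so the duplicates disappear from it as well.
    static TreeSet<String> loadAndClean(Path file) throws IOException {
        TreeSet<String> words = new TreeSet<>();
        try (BufferedReader in =
                 Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Assumption: phrases are compared in upper case.
                // The phrase is not trimmed, so an entry such as
                // "SPAM " keeps its trailing space.
                String phrase = line.toUpperCase();
                if (!phrase.trim().isEmpty()) {
                    words.add(phrase);
                }
            }
        }
        // Write the de-duplicated, sorted list back to the same file.
        Files.write(file, words, StandardCharsets.UTF_8);
        return words;
    }
}
```

Because a TreeSet silently ignores an attempt to add an element that it already contains, no explicit duplicate check is needed.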
The makeBodyWordList method
The purpose of the makeBodyWordList method is to create a TreeSet
object containing words and phrases used later to screen the text
in the body of the message.
Separation of lists is important
It is important to maintain separate lists for screening the Subject
line and the body text. Because of the larger number of
characters in the body text, false positives are more likely when
screening the body text.
(A false positive arises when a message that is not SPAM
fails one of the SPAM screening rules and is identified as SPAM by the
screening algorithm.)
Some words work well and some don’t
Therefore, some words and phrases that work well when screening the Subject
line may produce false positives when screening the body text.
For example, the common spammer word SLUT appears in the word SoLUTion
with only one character separating the S and the L. It is much
more likely that the word SOLUTION will appear somewhere in
the body text than in the Subject line (although it
may appear in the Subject line as well, thus producing a false
positive in either case).
On a more definitive note, the word ANTIVIRUS works well when
screening the Subject line, but cannot be used to
screen the body text. Many servers insert the word ANTIVIRUS
into the message header after they test the message for viruses. On the
other hand, the word ANTIVIRUS often appears in the Subject
line of SPAM messages.
IP addresses and URLs
IP addresses and URLs can be very useful in identifying SPAM during the
screening of the body text. However, they rarely occur in the Subject
line. Therefore, testing the Subject line
against a long list of IP addresses and URLs simply wastes computer
time.
Some words to avoid
The following words (among others) probably should not be
included in the list used to screen body text for the reasons
given. Undoubtedly you will identify other words and phrases that
should be excluded from the list as you gain experience with the system.
- PORN may be confused with IMPORTANT.
- SPAM causes lots of false positives. As a remedy, I inserted a space following the M as in “SPAM ” to decrease false positives. (This may decrease valid hits as well.)
- ANTIVIRUS appears in some valid message headers.
- WEIGHT often appears in messages regarding HTML fonts.
- SLUT may be confused with SOLUTION.
The makeBodyWordList code
The code for the makeBodyWordList method is shown in Listing
5.
This method reads strings from a text file named Pop302b.txt
and creates the list as a TreeSet object.
private void makeBodyWordList(){ |
This code is
straightforward and should not require explanation.
The makeFriendlyWordList method
The purpose of the makeFriendlyWordList method is to
create a TreeSet object containing words and phrases used to
pre-screen the message From and Subject lines
before screening against the SPAM lists. The objective of this
pre-screening step is to identify messages that claim to be from an
approved list of senders (often referred to as a white list),
or messages with a known good Subject line.
Messages that have From or Subject lines
matching the words or phrases in this list are not deleted from the
server, and are not subjected to screening for SPAM.
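The pre-screening step can be pictured with the following sketch. It assumes that a friendly entry counts as a match when it appears anywhere in the From line or the Subject line, ignoring case. That matching rule, and the class and method names, are assumptions for illustration and are not taken from Listing 34.

```java
import java.util.Set;

// Hypothetical helper, not part of the Pop302 program.
class FriendlySketch {

    // True if any friendly entry occurs in the From line or the
    // Subject line (case-insensitive substring match).
    static boolean isFriendly(String fromLine, String subjectLine,
                              Set<String> friendly) {
        String from = fromLine.toUpperCase();
        String subject = subjectLine.toUpperCase();
        for (String entry : friendly) {
            String key = entry.toUpperCase();
            if (from.contains(key) || subject.contains(key)) {
                return true;
            }
        }
        return false;
    }
}
```

A caller would test isFriendly first and skip the SPAM screening entirely when it returns true.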
Format for Email addresses
When adding Email addresses to the list contained in the text file,
only the primary portion of the friendly Email address should be
included. For example, you will often see an Email address
presented as follows:
Mary Smith <msmith@somewhere.com>
In this case, only the following portion should be included in the
friendly list:
msmith@somewhere.com
This is the portion that is most likely to remain stable over
time.
The remaining portion is simply window dressing added to the primary
Email address by the program used to compose and address the message.
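If you want to extract the primary portion automatically rather than by hand, a small helper along the following lines would do it. This helper is hypothetical and is not part of the program. It assumes that the address is enclosed in angle brackets when a display name is present.

```java
// Hypothetical helper, not part of the Pop302 program.
class AddressSketch {

    // Return the text between the last '<' and the following '>',
    // or the trimmed input when no such bracket pair is present.
    static String primaryAddress(String fromField) {
        int open = fromField.lastIndexOf('<');
        int close = fromField.indexOf('>', open + 1);
        if (open >= 0 && close > open) {
            return fromField.substring(open + 1, close).trim();
        }
        return fromField.trim();
    }
}
```

For the example above, primaryAddress("Mary Smith <msmith@somewhere.com>") returns "msmith@somewhere.com".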
Code for the makeFriendlyWordList method
The makeFriendlyWordList method is shown in Listing 6.
This method reads strings from a text file named Pop302c.txt
and creates the list as a TreeSet object.
private void makeFriendlyWordList(){ |
This code is completely
straightforward and therefore shouldn’t require further explanation.
The screenMsg method
Up to this point, the code that I have presented has been rather
mundane. However, things should start getting a little more
interesting from here on.
Setting the stage
The statement in Listing 7 was extracted from the Pop302 class
in Listing 34. This is the statement that ties the communication
module (an object of the Pop302 class) to the
screening module (an object of the Screen class).
match = screener.screenMsg( |
Where are we at this point?
At this point in the execution of the program, the communication module
has retrieved a message from the server and has written it into a file
on the local disk with the path and file name given by fileName.
(The file
name is based on the server’s unique identifier for the message, given
by uidl in Listing 7.)
Pass the file to the screenMsg method
The communication module passes the file name for the disk file
containing a raw text
copy of the message to the method named screenMsg where it will
be screened for SPAM.
The screenMsg method needs fileName in order to read the
file from the disk to perform the screen.
Why does the screenMsg method need uidl?
If the file is identified as SPAM, it will be moved from the history
folder to an archive folder. In order to do that, the screenMsg method needs to create a path
and file name pointing to the archive folder. For this, it needs
the unique identifier, uidl.
(Obviously I
could have parsed fileName inside the screenMsg method to get uidl, but I
found it easier to simply pass it as a parameter to the screenMsg method.)
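The move from the history folder to the archive folder might look like the sketch below when written with the java.nio.file API of current Java releases. The lesson's program was tested under SDK 1.4.2, which predates that API, so this is not the original code. The ".txt" extension, the method name, and the folder parameters are assumptions, and the sketch does not deal with UIDL values that contain characters that are illegal in file names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the Pop302 program.
class ArchiveSketch {

    // Move a message file into the archive folder, naming the target
    // after the message's unique identifier (uidl).
    static Path moveToArchive(Path historyFile, Path archiveDir,
                              String uidl) throws IOException {
        Files.createDirectories(archiveDir);           // no-op if it exists
        Path target = archiveDir.resolve(uidl + ".txt");
        return Files.move(historyFile, target,
                          StandardCopyOption.REPLACE_EXISTING);
    }
}
```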
What is theResult?
In addition to fileName and uidl, the method needs a
reference to an empty object of the class ScreenResult.
It populates that object in order to send information back to the
calling method. That object’s
reference is represented by theResult in Listing 7. The screenMsg method populates this object
with several pieces of data for later use by the communications
module.
The screenMsg method returns boolean
The screenMsg method returns a boolean value, which is assigned
to match in Listing 7.
If the return value is true, the screenMsg method has concluded that the
message is SPAM and is a
candidate for deletion from the server.
(Recall,
however, that the
communications module allows the user to make the final decision
regarding deletion of the message from the server.)
If the return value is false …
If the return value is false, the screenMsg method has found nothing to indicate
that the message is SPAM.
(The message
might not be SPAM, or it might be a form of SPAM that the algorithm
doesn’t yet know how to identify. If it is deemed by the user
that the message is SPAM, the message will be used later to train the
algorithm to recognize that form of SPAM in future messages.)
About 90 percent effective
As of this writing, I am finding that the algorithm is able to identify
about 90 percent of SPAM messages on the average. The remaining
ten percent of the SPAM messages are used to further train the
algorithm to recognize spam of that type in the future. I am
hopeful that this performance will improve in the future as the
algorithm becomes better trained.
Beginning of the screenMsg method
The code for the screenMsg method begins in Listing 8.
public boolean screenMsg(String fileName, |
As you can see, the method
signature matches my earlier description with respect to the statement
in
Listing 7 that calls the screenMsg method.
The code in Listing 8 initializes the variable named match to
false. This is the value that will be
returned from the method if it is not overwritten later by the
discovery of a match between a test phrase and a text line in the
message.
Purpose of the screenMsg method
The screenMsg method is
used to identify messages that are candidates for deletion from the
server. Such identification is based on analyzing the file in which the
message is stored locally, and comparing that file with the contents of
TreeSet objects populated earlier with the contents of the
files named Pop302a.txt, Pop302b.txt, and Pop302c.txt.
A return value of true means that the
message is identified as SPAM and should be deleted from the server.
Returned String values
In addition to the boolean return value, references to four String
objects
are encapsulated in the incoming object of type ScreenResult.
The populated object is used later by the calling method in the
communication module. These String objects
represent:
- The text of the message’s Subject line.
- The text of the message’s From line.
- The offending word or phrase (if any) that was found in the Subject line or in the body of the message, which includes the From line.
- The raw text of the message down to the line that includes the offending word or phrase, or the entire raw text of the message if no offending words or phrases were found.
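The pattern of returning a boolean while also filling in a caller-supplied result object can be sketched as follows. The field names and the simplified screening rule (a single upper-case substring test on the Subject line) are assumptions for illustration. The actual ScreenResult class and screenMsg method are in Listing 34 and may differ.

```java
// Hypothetical stand-in for the ScreenResult wrapper class.
class ScreenResultSketch {
    String subject = "";          // text of the Subject line
    String from = "";             // text of the From line
    String offendingPhrase = "";  // empty when nothing was found
    String rawText = "";          // raw text down to the offending line
}

// Hypothetical stand-in for a screening method.
class ScreenerSketch {

    // Populate the caller's result object and return true when the
    // Subject line contains the listed phrase (ignoring case).
    static boolean screenSubject(String subjectLine, String phrase,
                                 ScreenResultSketch result) {
        result.subject = subjectLine;
        result.rawText = subjectLine;
        boolean match =
            subjectLine.toUpperCase().contains(phrase.toUpperCase());
        if (match) {
            result.offendingPhrase = phrase;
        }
        return match;
    }
}
```

The caller creates an empty result object, passes it in, and then reads the fields to update the GUI when the return value is true.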
Display information on the GUI for benefit of the user
This information is displayed in the GUI in Figure 1 by the
communication module, which is an object of the class named Pop302.
This information is presented to help the user make an informed
decision regarding deletion of the message from the server.
(If the screenMsg method returns false, the
program doesn’t pause for the user to make such a decision, and
processing of the next message on the server begins immediately, with
all of the above information having been removed from the GUI.)
Refer back to Figure 1
The list of items placed in the ScreenResult
object includes the offending word or phrase that was found in
the Subject line or in the body of the message. Referring back to Figure
1, the Offending Phrase for the message shown
in Figure 1 was V1AGRA.
(This was an
easy one because there were no extraneous characters inserted in the
offending word, although the spammer did use a numeral 1
character in place of an I.)
The Subject line
Also in Figure 1, the Offending Phrase was found in the Subject
line, which means that the program made the decision very quickly
(it didn’t have to examine a large amount of body text in order to make
a decision).
Data in the text area
As shown in the large text area in Figure 1, this was Message Number 5
in the
dropbox on the server.
The raw message text is displayed in the text area down to the line
that contained the Offending Phrase, which in this case was the Subject
line.
The From line
The top-most text field in Figure 1 originally contained an Email
address that
purportedly was the address of the sender of the message.
However, I suspect that the person identified by that Email address
wasn’t actually the sender of the message, so I deleted the Email
address before publishing this image.
(On the
basis of the earliest RECEIVED: FROM line in the raw message text in
Figure 1, the
message appears to have been sent by a computer having an IP address of
204.85.84.207. However, given the identity of the organization to
which that IP address is assigned, (according to a WHOIS Database
Search at http://www.arin.net/whois/)
this seems somewhat unlikely as well. But, one never knows who
may be spamming. Maybe that computer is infected with a Trojan
horse that is broadcasting SPAM messages without the knowledge of its
owner.)
Open
the disk file for reading
The code in Listing 9 opens the file containing the local copy of the
raw message text for reading. Listing 9 also declares a local
variable named data that will be used in the file reading
process.
try{ |
(Recall from
the previous
lesson that asterisks were inserted into the data on ten-character
intervals in an attempt to destroy any executable virus code that may
be included in the byte stream.)
Prepare
to process the file
The code in Listing 10 is executed in preparation for processing the
file containing the raw message data.
inData.mark(10000); |
Mark
the beginning of the input stream
The first statement in Listing 10 marks the beginning
position in the input stream. Subsequent calls to reset will
attempt to reposition the stream to this point.
This mark will be used later to rewind the stream to the beginning.
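As a small illustration of the mark and reset technique, the sketch below marks the start of a BufferedReader, reads to the end, and then rewinds. The class and method names are hypothetical, and a StringReader stands in for the disk file so that the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demonstration class; a StringReader replaces the disk file.
class MarkResetSketch {
    // Returns the first line as read before and after rewinding the stream.
    static String[] readFirstLineTwice(String rawText) {
        try (BufferedReader inData = new BufferedReader(new StringReader(rawText))) {
            inData.mark(10000);                 // remember the start of the stream
            String first = inData.readLine();   // read the first line
            while (inData.readLine() != null) {
                // consume the remaining lines
            }
            inData.reset();                     // rewind to the mark
            String again = inData.readLine();   // the first line once more
            return new String[] {first, again};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] lines = readFirstLineTwice("Received: from x\nSubject: hi\nbody");
        System.out.println(lines[0].equals(lines[1])); // prints "true"
    }
}
```

Note that the argument to mark is a read-ahead limit. According to the Java API documentation, an attempt to reset the stream may fail once more than that many characters have been read past the mark, so a limit of 10000 only guarantees a rewind for the first 10,000 characters.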
Populate the ScreenResult object
The last two statements in Listing 10 populate two of the fields in the
ScreenResult object with default values, just in case one or the
other of the corresponding lines are not found in the message. If
the lines are found in the message, these default values will be
overwritten with the actual data from the message.
The removeStars method
Before going any further, I am going to put the discussion of the screenMsg
method on hold for a moment and discuss the removeStars method
shown in Listing 11.
private String removeStars(String stringIn){ |
The purpose of this method is to remove the asterisks that were
inserted into the data by the method named insertStars before
the file was written (see the previous lesson). The
method also appends two asterisks at
the end of each line.
(Note that this method removes all asterisks, not just
those inserted earlier. If this proves to be a problem, this
method should be modified to remove only those asterisks that occur on
ten-character intervals.)
The code in this method is straightforward and shouldn’t require
further explanation.
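For readers who would like to see the idea anyway, here is a minimal sketch of stripping every asterisk from a string. It covers only the removal step and is not a reproduction of Listing 11; the class name is hypothetical.

```java
// Hypothetical sketch of the asterisk-removal step only.
class RemoveStarsSketch {
    // Copies every character except '*' into a new string.
    static String removeStars(String stringIn) {
        StringBuilder out = new StringBuilder(stringIn.length());
        for (int i = 0; i < stringIn.length(); i++) {
            char c = stringIn.charAt(i);
            if (c != '*') {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeStars("Subject: H*ello wor*ld")); // prints "Subject: Hello world"
    }
}
```

As the note above points out, an approach like this also removes asterisks that were part of the original message, so V*i*a*g*r*a comes out as Viagra.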
Return to discussion of the screenMsg method
Returning to the discussion of the screenMsg method, we are
ready to examine the code used to screen the Subject line.
That code begins in Listing 12.
while((data = inData.readLine()) != null){ |
The code in Listing 12 is
the beginning of a while loop that reads successive lines of
text from
the disk file until it either runs out of lines (null) or
encounters a line that starts with SUBJECT: (see Figure 1
for the format of the Subject line in the message).
If it runs out of lines before finding the Subject line,
the loop will terminate. If it finds the Subject line
before running out of lines, the line will be processed and a break
will be executed to terminate the loop. Thus, the code in this
loop will process only the Subject line.
(Note that
the removeStars method is called to remove the asterisks from the data
and the data is converted to upper case before testing for SUBJECT:)
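The shape of such a loop can be sketched as follows. This is an illustration under stated assumptions rather than the code in Listing 12: the class and method names are hypothetical, a StringReader stands in for the disk file, and the asterisks are removed with String.replace instead of the removeStars method.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;

// Hypothetical sketch of a read loop that processes only the Subject line.
class FindSubjectSketch {
    // Returns the upper-case Subject line, or defaultValue if none is found.
    static String findSubject(String rawText, String defaultValue) {
        String subject = defaultValue;
        try (BufferedReader inData = new BufferedReader(new StringReader(rawText))) {
            String data;
            while ((data = inData.readLine()) != null) {
                // Remove asterisks and convert to upper case before testing.
                String cleaned = data.replace("*", "").toUpperCase(Locale.ROOT);
                if (cleaned.startsWith("SUBJECT:")) {
                    subject = cleaned;  // overwrite the default value
                    break;              // no need to read any more lines
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return subject;
    }

    public static void main(String[] args) {
        String raw = "Received: from x\nSub*ject: T@ke *5O% off\nbody text";
        System.out.println(findSubject(raw, "NO SUBJECT LINE"));
    }
}
```

If the loop runs out of lines first, the default value passed in is returned unchanged, which mirrors the default-then-overwrite approach described for the ScreenResult fields.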
Populate
the output object
We are now in the body of the if statement begun in Listing
12. A match for SUBJECT: has been found. The code
in Listing 13 populates the output object with the Subject line
data overwriting the default value put there by the code in Listing 10.
theResult.subject = data.toUpperCase(); |
This will result in the Subject line
data being displayed in the second text field of the GUI of Figure 1
when the screenMsg method returns.
Screen against friendly words and phrases
The next step is to screen the Subject line data
against the words and phrases in the friendly list. If a word or
phrase from the friendly list appears in the Subject line
data, the message will be preserved on the server and will not be
subjected to SPAM screening.
In addition, the local copy of the message currently located in the
history folder will be deleted, because it is not considered to be
SPAM. That way, the message will not be used later when training
the algorithm to do a better job of identifying SPAM.
Executing the screen
The code that screens the Subject line data against the
friendly list begins in Listing 14.
Iterator iterator = |
The friendly list is stored
in a TreeSet object referred to by the reference variable named
friendlyWordList.
An Iterator loop
The code in Listing 14 shows the beginning of an Iterator loop,
used to iteratively extract each word or phrase from the friendly
list and compare it with the Subject line data.
The comparison is actually rather complex and is performed in a method
named screenOnPhrase, which I will discuss later.
As each friendly word or phrase is extracted from the friendly list, it
is
passed, along with the Subject line data, to the method
named screenOnPhrase.
That method will return true if a match is found, and
will return false
if no match is found.
Extraneous characters
The third parameter in the call to the screenOnPhrase method
specifies the
number of extraneous characters allowed to occur between the characters
in the data and still declare a match to be true. In this
case, a value of zero is passed for this parameter, meaning that no
extraneous characters are allowed.
(As it turns
out, the value of zero results in a trivial case, and I could have
accomplished this more simply than by invoking the rather complex code
in the screenOnPhrase method. However, when I wrote the
prototype for the program, I hadn’t decided that I was going to use a
value of zero here.)
Behavior of the screenOnPhrase method
Basically, in this case, the screenOnPhrase method is testing
to see if the friendly word or phrase occurs anywhere within the Subject
data line, and if so, it will return true. Otherwise, it
will return false.
(Note that
everything has been converted to upper case at this point, so matching
the case is not an issue.)
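Because no extraneous characters are allowed here, the friendly-list screen reduces to a plain substring test. The sketch below shows that simpler form using an Iterator over a TreeSet and the indexOf method of the String class. It is an illustration, not the code in Listing 14: the class and method names are hypothetical, and it returns the matching phrase (or null) rather than a boolean so that the caller can see what matched.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of screening one line against a friendly list.
class FriendlyScreenSketch {
    // Returns the first friendly phrase contained in the line, or null.
    // Both the line and the list entries are assumed to be upper case.
    static String firstFriendlyMatch(String upperCaseLine, Set<String> friendlyWordList) {
        Iterator<String> iterator = friendlyWordList.iterator();
        while (iterator.hasNext()) {
            String phrase = iterator.next();
            if (upperCaseLine.indexOf(phrase) >= 0) {
                return phrase;  // a match; stop looking
            }
        }
        return null;  // the list is exhausted without a match
    }

    public static void main(String[] args) {
        Set<String> friendly = new TreeSet<>(Arrays.asList("BALDWIN", "JAVA TUTORIAL"));
        System.out.println(firstFriendlyMatch("SUBJECT: YOUR JAVA TUTORIAL", friendly));
    }
}
```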
At this point, I can either
branch off and discuss the screenOnPhrase method, or
continue discussing the code in the screenMsg method. I
have decided to do the latter, and explain the inner
workings of the screenOnPhrase method later.
If a match was found
Listing 15 shows what happens if the current friendly word or phrase
was found in the Subject data line (the current
word or phrase is that word or phrase most recently extracted from the friendlyWordList
by the iterator).
if(match == true){ |
Delete
the file from the SPAM history folder
The first thing that happens when a match is found is that the input
stream is closed and the file containing the raw message is deleted
from the folder in which SPAM history messages are stored. The
rationale is that this is not a SPAM message, and should not be used
later when training the algorithm to do a better job of identifying
SPAM.
Populate the output object
The next thing that happens is that the matching phrase is stored in
one of the fields of the ScreenResult object.
(In this case, that String
isn’t currently used in any significant way by the communications
module, but it is available to be used in the future if needed.)
Return a false value
Perhaps the most significant thing that happens in Listing 15 is that
the screenMsg method
terminates and returns a false value to the calling
method in the communications module (see Listing 7). That
essentially terminates the processing of this message. It is
preserved on the server and is not subjected to further screening for
SPAM.
If no match was found
If none of the words or phrases in the friendlyWordList match
the Subject data line, control will fall out of the
loop at the bottom of Listing 15 when the data in the friendlyWordList
is exhausted. This will transfer control to the code in Listing
16.
break; |
The
break statement
The break statement in Listing 16 is inside the body of the if
statement that began in Listing 12. This code is being executed
because a data line was found that starts with SUBJECT:
If no match was found, the return statement in Listing 15 would
not be executed, and control would reach this point. Since this
part
of the code deals exclusively with the Subject line,
and a Subject line
was found, there is no point in reading any more input lines.
Hence
the break statement in Listing 16 terminates the read loop
that
began in Listing 12. No more data will be read from the file in this
part of the code.
Rewind the input stream to the beginning
The code in Listing 17 resets the stream back to the mark that was set
on the stream in Listing 10. Since that mark was set at the
beginning of the file, this code rewinds the data file back
to the
beginning.
inData.reset(); |
Contents
of ScreenResult object
At this point, the subject
field in the output ScreenResult object contains Subject
line data if a Subject line was found, or
contains a message to the effect that no Subject line
was found. This latter message was put there by default in
Listing 10, and was not overwritten if no Subject line
was found.
Process the From line
The next step in the process is to process the From line
for the purposes of:
- Returning the From
data in one of the fields of the ScreenResult
object if a From line exists.
- Determining if the
message was sent from a friendly Email address. If so, return
false causing this message to be preserved on the server and exempt
from SPAM screening.
Except for the fact that
the code in Listing 18 extracts and processes a text line that starts
with From:, the code in Listing 18 is essentially the same as
the code used to process the Subject line beginning in
Listing 12 and ending in Listing 17. Therefore, it should not be
necessary for me to explain that code again.
while((data = inData.readLine()) != null){ |
Rewind
the input stream
Assuming that the method
did not return false as the result of a match on a friendly Email
address, the code in Listing 18 also resets the input stream in
preparation for screening the entire message for SPAM.
Under that same assumption, at this point, the from field in the output ScreenResult
object contains From line data if a From line
was found, or
contains a message to the effect that no From line
was found. This latter message was put there by default in
Listing 10, and was not overwritten if no From line
was found.
Missing Subject line and From line
is rare
Experience indicates that it is very rare to receive an Email message
that doesn’t contain both a FROM: line and a SUBJECT:
line in the header, although either or both may not contain any
characters to the right of the space following the colon. In
fact, it is very common for the Subject line of SPAM
messages
to be completely blank. (Perhaps that causes people to read
the messages out of curiosity.)
The screenOnPhrase method
While discussing both the From line and the Subject
line, I asked that you simply accept that the screenOnPhrase
method can determine if one upper-case String object is
contained as a substring within another upper-case String object.
I did that because there is no great challenge to programming such an
operation when there are no extraneous characters. In
fact, one of the indexOf methods of the String
class, which searches for a substring within a String,
can accomplish this very handily.
No extraneous characters allowed
This assumes, of course, that extraneous characters are not allowed
between the characters of the substring within the String.
That was the case in processing the From line and
the Subject line above because I set the third
parameter value to zero in the method call. However, that is not
the
case in searching for offending words and phrases in a SPAM screen.
A SPAM example
For example, here is a typical Subject line taken from
a SPAM message in my archive folder.
Subject: T@ke 5O% off Ge|neric V*i*a*g*r*a 0nline t:0day
In order to recognize that the Subject line contains
the word Viagra, it is necessary that the program be able to
ignore the asterisks that separate the letters of the word V*i*a*g*r*a.
In order to recognize that the line contains the word Generic,
it is necessary that the program be able to ignore the vertical bar
that separates the e and the n in Ge|neric.
In order to recognize that the line contains the word t0day, it
is necessary that the program be able to ignore the colon that
separates the t and the 0 in t:0day.
Other common spammer tricks
This example also illustrates
other common spammer tricks, such as swapping zero characters and
alphabetic O characters, and replacing the lower-case a
character with the @ character.
I haven’t attempted to automate the resolution of substitution issues
such as this,
and probably won’t. Given the training programs that I will
explain in the next two lessons, it is easy to use historical SPAM data
to train the algorithm to recognize these variations. If such a
mangled word occurs only once in the SPAM history folder, the algorithm
can be trained in a single training session to recognize it as SPAM in
all future messages.
Ignoring extraneous characters
Now getting back to the
issue of ignoring extraneous characters in the offending words and
phrases, that is what the method named screenOnPhrase knows how to do very well.
(It is also
what the Email filtering capability of the commercial Email client that
I use doesn’t
know how to do at all.)
The capability to ignore
extraneous characters is one of
the keys to a successful SPAM screening program.
(The
spammers never make it easy to identify and block their Email
messages. That is why a successful SPAM screening program must
have the capability to learn each new spammer trick as soon as it
appears through an ongoing, simple to use algorithm training effort.)
The screenOnPhrase method
The screenOnPhrase method requires an incoming parameter of
type int named spanLim. This is the parameter
by which the programmer specifies how many extraneous characters
will be allowed between letters in the offending word or phrase and
still have it be recognized as an offending word or phrase.
When the screenOnPhrase
method was used to process the From and Subject data,
the value of this parameter was set to zero. Thus no extraneous
characters were allowed in the matching friendly Email address data or
the matching friendly Subject line material.
In order for the screenOnPhrase method to recognize the words Viagra, Generic, and t0day in the above
example, the value of the spanLim parameter would have to be 1
or greater.
As you will see later, the current version of this program uses a spanLim
value of 1 to screen for SPAM. Experience shows that this is
successful in identifying most of the offensive words and phrases
without
unduly increasing the occurrence of false positives.
At this point, I will put the discussion of the screenMsg
method on hold for a short while and explain the inner workings of the screenOnPhrase method.
Description of the screenOnPhrase
method
This method tests a String to see if it contains
a word or phrase that may have extraneous characters inserted into it,
such as VI*A-GRA.
The method requires an incoming parameter of type int named spanLim.
If the String contains the sequence of characters, in the
correct order, that make up
the word or phrase, with spanLim or fewer extraneous
characters between any two of the matching characters, the
method returns true. Otherwise, it returns false.
A spanLim example
For example, if spanLim = 1, the spammer can insert one
character between any two of the characters that make up the offending
word in
the String and the offending word will still be detected.
However, if the spammer inserts two or more extraneous characters, the
offending word will not be detected.
Be careful of false positives
You should be careful to avoid making spanLim too large.
Large values of spanLim result in more false positives because
widely-separated characters can be considered to be
part of the word or phrase. For example, if spanLim = 2 or
greater, the word PORN will be found in the word IMPORTANT.
However, if spanLim = 1, the word PORN will not be
found in IMPORTANT.
Operation of the screenOnPhrase method
Basically this is how the screenOnPhrase method does what it
does. The method
receives incoming String parameters named data and phrase,
along with an int parameter named spanLim. The
objective is to determine if phrase is contained in data
with no more than spanLim extraneous characters separating the
matching characters.
Search for matching characters
First the method searches data for characters that match the
characters in phrase discarding all other characters.
While doing this, however, it keeps track of the original positions of
the
matching characters in data.
A new compressed string
The result is a new string containing only the characters that match
the characters in phrase, all in their original order.
Let’s refer to this as str. All extraneous characters
have been discarded from data producing a new string named str.
For example, if the phrase is SPAM, str might look like
the following after all extraneous characters have been discarded:
SMSPMASPAMMPAS
Does
str contain phrase?
A test is made to determine if str contains sequences of
characters that match phrase. (In the above example, str
does contain a sequence of characters matching SPAM, beginning
at the seventh character.)
How many extraneous characters were discarded?
If a match is found, this means that the original data did
contain phrase with the possibility of extraneous characters in
between the characters of phrase. However, there is still
the issue of how many
extraneous characters were discarded in order to get the positive
match.
This is determined by examining the original position information that
was saved while extraneous characters were being discarded.
If the number of characters discarded between any two of the characters
matching the sequence was less than or equal to spanLim, the
method returns true. Otherwise it returns false.
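The steps just described can be sketched compactly as shown below. This is an illustration under stated assumptions, not the program's actual listing: the class name and local variable names are chosen for the example, each character of data is tested for membership in phrase directly with indexOf, and only the first occurrence of phrase in the compressed string is examined.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the compress-then-check-gaps idea.
class SpanScreenSketch {
    static boolean screenOnPhrase(String data, String phrase, int spanLim) {
        if (phrase.isEmpty()) {
            return false;
        }
        // Keep only characters that occur in phrase, with their positions.
        StringBuilder str = new StringBuilder();
        List<Integer> locationData = new ArrayList<>();
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (phrase.indexOf(c) >= 0) {
                str.append(c);
                locationData.add(i);
            }
        }
        // Does the compressed string contain phrase at all?
        int match = str.indexOf(phrase);
        if (match < 0) {
            return false;
        }
        // Largest number of discarded characters between adjacent matches.
        int maxSpan = 0;
        for (int k = match + 1; k < match + phrase.length(); k++) {
            int gap = locationData.get(k) - locationData.get(k - 1) - 1;
            maxSpan = Math.max(maxSpan, gap);
        }
        return maxSpan <= spanLim;
    }

    public static void main(String[] args) {
        System.out.println(screenOnPhrase("V*I*A*G*R*A", "VIAGRA", 1)); // prints "true"
        System.out.println(screenOnPhrase("IMPORTANT", "PORN", 1));     // prints "false"
    }
}
```

With IMPORTANT and PORN, the kept characters sit at positions 2, 3, 4, and 7, so two characters (T and A) were discarded between R and N. That is why the match succeeds when spanLim is 2 and fails when it is 1.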
Let’s see this in code
The code to accomplish this is a little bit complex. The screenOnPhrase
method begins in Listing 19.
private boolean screenOnPhrase(String data, |
The code in Listing 19
saves phrase in an instance variable, and declares a couple of
new local variables that will be used later. Note that str
is a StringBuffer object and is not a String
object.
(An object
of
the StringBuffer class can have its contents modified, while an
object of the String class is immutable.)
Compare data with phrase
The next step is to compare the characters in data with the
unique characters in phrase, saving only the matching
characters in str, and saving the original locations of the
matching characters in locationData, which refers to an object
of type ArrayList.
First, however, it is necessary to eliminate duplicate characters from phrase.
Eliminate duplicate characters from phrase
This is accomplished by the code in Listing 20 by storing the
characters from phrase into a TreeSet object.
Storing the characters in a TreeSet object eliminates
duplicates.
(It also
sorts the characters, but that doesn’t matter
one way or the other in this case.)
TreeSet treeSet = new TreeSet(); |
Having stored the
characters from phrase in the TreeSet object, the code
in Listing 20 goes on to use an Iterator to extract the unique
characters from the TreeSet and store them in a new StringBuffer
object named tempPhrase.
(The
characters are stored in a StringBuffer object because it is
possible to build such an object one character at a time. This is
not possible with a String object.)
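A minimal sketch of this de-duplication step follows. The names treeSet and tempPhrase come from the text; the class and method names are hypothetical, generics are used for type safety, and StringBuilder (the unsynchronized counterpart of StringBuffer) is used to accumulate the characters.

```java
import java.util.Iterator;
import java.util.TreeSet;

// Hypothetical sketch of removing duplicate characters with a TreeSet.
class UniqueCharsSketch {
    // Returns the distinct characters of phrase in sorted order.
    static String uniqueChars(String phrase) {
        TreeSet<Character> treeSet = new TreeSet<>();
        for (int i = 0; i < phrase.length(); i++) {
            treeSet.add(phrase.charAt(i));  // duplicates are silently ignored
        }
        StringBuilder tempPhrase = new StringBuilder();
        Iterator<Character> iterator = treeSet.iterator();
        while (iterator.hasNext()) {
            tempPhrase.append(iterator.next().charValue());
        }
        return tempPhrase.toString();
    }

    public static void main(String[] args) {
        System.out.println(uniqueChars("VIAGRA")); // prints "AGIRV"
    }
}
```

The second A in VIAGRA is dropped, and the remaining five characters come out in sorted order.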
Extract
matching characters from data
Listing 21 uses a pair of nested for loops along with tempPhrase
to extract matching characters from data and to store them, (in
their original order), in str. The original position
of each matching character is stored in locationData.
for(int i = 0; i < data.length(); i++){ |
This converts the original data into a compressed string of
characters, each of which matches a character in phrase. All
other
characters have been discarded. Thus, if data contains phrase,
it will occur somewhere in str with no extraneous characters
separating the characters in phrase.
Does str contain phrase?
The next step is the easy one. Listing 22 tests to see if the new
compressed string named str contains the original phrase.
int match = str.indexOf(phrase); |
This is accomplished by
invoking the indexOf method of the StringBuffer class,
which returns the index within str of the first
occurrence of phrase.
Behavior of the indexOf method
The indexOf method returns -1 if phrase does not occur
within str, in which case the screenOnPhrase method
simply returns false. Otherwise, the method goes on to test for
the maximum number of extraneous characters that separated the matching
characters in the original data.
(While writing this explanation, I realized that phrase
might have occurred more than once in data, with too many
extraneous characters in the first occurrence and an acceptable number
of extraneous characters in a later occurrence. In this case,
however, the algorithm would return false.
Given the reason for the extraneous characters in the first place,
it is probably unlikely that this will happen. However, this is a
logic error that would be worth fixing.)
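One possible way to address that logic error is to keep searching the compressed string for later occurrences of phrase whenever an earlier occurrence has too wide a gap. The sketch below is only a suggestion along those lines, not a change that appears in the program; the class name, the helper method, and the local names are assumptions. It relies on the two-argument form of indexOf, which resumes the search from a given index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variant that examines every occurrence, not just the first.
class AllOccurrencesSketch {
    static boolean screenOnPhrase(String data, String phrase, int spanLim) {
        if (phrase.isEmpty()) {
            return false;
        }
        // Keep only characters that occur in phrase, with their positions.
        StringBuilder str = new StringBuilder();
        List<Integer> locationData = new ArrayList<>();
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (phrase.indexOf(c) >= 0) {
                str.append(c);
                locationData.add(i);
            }
        }
        // Try each occurrence of phrase in the compressed string in turn.
        int match = str.indexOf(phrase);
        while (match >= 0) {
            if (widestGap(locationData, match, phrase.length()) <= spanLim) {
                return true;
            }
            match = str.indexOf(phrase, match + 1);
        }
        return false;
    }

    // Largest count of discarded characters between adjacent matches.
    private static int widestGap(List<Integer> locationData, int match, int length) {
        int widest = 0;
        for (int k = match + 1; k < match + length; k++) {
            int gap = locationData.get(k) - locationData.get(k - 1) - 1;
            widest = Math.max(widest, gap);
        }
        return widest;
    }

    public static void main(String[] args) {
        // The first occurrence has a three-character gap; the second has none.
        System.out.println(screenOnPhrase("V---IAGRA VIAGRA", "VIAGRA", 1)); // prints "true"
    }
}
```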
Check number of extraneous characters
When there is a match, we need to confirm that the span between
matching characters does not exceed the number allowed by the incoming
parameter
spanLim. This is accomplished by the code in Listing 23.
int maxSpan = 0; |
What
happened before was …
As each matching character was extracted from data in Listing
21, the original position of that character in data was
encapsulated in
an object of type Integer. That Integer object’s
reference was
appended to a list of such references in the object of type ArrayList
referred to by locationData.
Thus the elements in the ArrayList refer to objects containing
the original positions of successive matching characters in data.
This information will be used to calculate the maximum number of
characters
separating those matching characters.
The code in Listing 22 found the position of the first character of phrase
in the compressed string referred to by str and saved that
value in a local int variable
named match.
What happens now is …
The code in Listing 23 uses the value of match to extract an Integer
object from the list containing the original position of the first
matching character of phrase in data. The
contents of the Integer
object are extracted and saved as locA.
Then the code in Listing 23 enters a for loop, extracting
successive references from the list and using the information
encapsulated therein to calculate the difference in original positions
of the successive matching characters. The maximum value of that
difference
is calculated. This process continues until a number
of original positions equal to the number of characters in phrase
have been examined. Then the maximum difference in position
is compared with spanLim.
If it is determined that the number of characters between the
original positions of the matching characters in data exceeds spanLim,
the method returns false. Otherwise, it returns true.
Return to the discussion of the screenMsg method
Now that we have an understanding of how the method named screenOnPhrase
works, it is time to return to the discussion of the method named screenMsg.
Up to this point, the method has examined the Subject and
From lines for two purposes:
- To determine if they
are friendly, and if so, to terminate SPAM screening for the current
message.
- To provide the
contents
of the two lines for later display in the GUI.
Screening
for SPAM
If control still resides in the screenMsg method at this point (meaning
that the message wasn’t declared to be friendly on the basis of either
the From line or the Subject line), it is time to
screen the entire message looking for indications that the message is
SPAM.
This is accomplished in two parts. The Subject line
is screened for SPAM using one word list and the body text
is screened for SPAM using a different word list. If either the Subject
line or the body text is
determined to contain SPAM, the method terminates returning true.
(A typical
message contains a few lines of body text before the Subject
line and potentially many lines of body text following
the Subject line.
The lines are screened in the order that they occur in the message.
Therefore, if the Subject line is determined to contain SPAM,
the screening process will often require much less time than will be
required to locate SPAM in the body text.)
If
any line is determined to contain SPAM, the method terminates at that
point returning true.
If no SPAM is identified in any line of the message, the method returns
false.
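The overall control flow can be sketched as follows. This is an illustration under stated assumptions and not the code in the listings: the class and method names are hypothetical, the message is supplied as a list of lines rather than read from a file, the matching test is passed in as a BiPredicate so that any matcher can be plugged in, and the method returns the offending phrase (or null) instead of a boolean. The names subjWordList and bodyWordList are taken from this lesson.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BiPredicate;

// Hypothetical sketch of screening each line against one of two word lists.
class ScreenLoopSketch {
    static String findOffendingPhrase(List<String> lines,
                                      Set<String> subjWordList,
                                      Set<String> bodyWordList,
                                      BiPredicate<String, String> matcher) {
        for (String raw : lines) {
            String noStars = raw.replace("*", "");
            // The Subject line gets its own word list; everything else is body text.
            Set<String> wordList =
                noStars.startsWith("Subject:") ? subjWordList : bodyWordList;
            String data = noStars.toUpperCase(Locale.ROOT);
            for (String phrase : wordList) {
                if (matcher.test(data, phrase)) {
                    return phrase;  // one match is enough to declare SPAM
                }
            }
        }
        return null;  // no line matched any offending phrase
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Received: from example", "Subject: Cheap V1AGRA", "Buy now");
        Set<String> subj = new TreeSet<>(Arrays.asList("V1AGRA"));
        Set<String> body = new TreeSet<>(Arrays.asList("BUY NOW"));
        // A plain substring test stands in for the real matcher here.
        System.out.println(findOffendingPhrase(lines, subj, body, (d, p) -> d.contains(p)));
    }
}
```

Because the lines are visited in message order, a hit on the Subject line ends the search before most of the body text is examined.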
The screening process
The screening process begins in Listing 24.
int progressCounter = 0; |
The code in Listing 24
reads a line of text from the input file, removes asterisks from the
line and tests to see if the line starts with Subject: (note
that the line hasn’t been converted to upper case yet at this point).
If the line does not start with Subject:, an else
clause, (to be discussed later), is executed to screen the line
as body
text.
If the text line is the Subject line …
For the case where the line does start with Subject:
- The line is converted
to upper case.
- The line is appended
to the contents of the field named text in the output object of
type ScreenResult.
- A single period is
appended to the string currently residing in the text area of the GUI
to be displayed as a progress indicator. (Each set of 50
periods appears on a new line in the progress indicator.)
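A progress indicator of this kind can be sketched in a few lines. The sketch below is an assumption about one way to do it, not the program's code: the class and method names are hypothetical, a StringBuilder stands in for the GUI text area, and only the variable name progressCounter is borrowed from Listing 24.

```java
// Hypothetical sketch of a period-per-line progress indicator.
class ProgressSketch {
    private final StringBuilder display = new StringBuilder(); // stands in for the text area
    private int progressCounter = 0;

    // Appends one period, starting a new line after every 50 periods.
    void tick() {
        if (progressCounter > 0 && progressCounter % 50 == 0) {
            display.append('\n');
        }
        display.append('.');
        progressCounter++;
    }

    String text() {
        return display.toString();
    }

    public static void main(String[] args) {
        ProgressSketch progress = new ProgressSketch();
        for (int line = 0; line < 120; line++) {
            progress.tick();  // one period per line of message text
        }
        System.out.println(progress.text());
    }
}
```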
Screen
the Subject line for SPAM
The code beginning in Listing 25 uses an Iterator to screen an
upper-case version of the Subject line against
upper-case versions of each of the offensive words and phrases stored
in
a TreeSet object referred to by subjWordList.
|
Invoking
screenOnPhrase
The actual process of screening against each offensive word or phrase
in
the TreeSet object occurs as
a result of invoking the screenOnPhrase method (discussed
earlier) to determine if the Subject line contains
the offensive word or phrase. A one-character separation is
allowed between the characters in the offensive phrase in the Subject
line. The
boolean value returned by screenOnPhrase is stored in the variable named match.
The value of match will eventually be returned to indicate
whether or not the screenMsg method found a match between a
text line in the message and an offensive word or phrase from one of
the word lists.
If the returned value is false …
If the returned value is false, the Iterator loop
continues looping, attempting to match offensive words or phrases from subjWordList
with the Subject line until there are no more
offensive words or phrases stored in subjWordList.
At that point, it is concluded that the Subject line
doesn’t contain SPAM. Control is transferred back to the top of
the while loop in Listing 24, where another line of text is
read and screened.
(There can be only one Subject line in a
properly
formatted message, so the remaining lines will probably all be screened
as body
text.)
If
the returned value is true …
If the screenOnPhrase method returns true, the body of the if
statement in Listing 26 is executed.
if(match == true){ |
Basically two things happen
in the body of this if statement:
- The local copy of the
message is moved from a history folder to an archive folder. (The
message won’t be needed for training the algorithm later because the
algorithm already knows how to identify the message as SPAM.)
- Control breaks out of
the Iterator loop. (Only one match against an
offensive word or phrase is required to declare that a message is SPAM.)
I won’t try to explain the
process of moving the file. If you don’t understand that code,
look it up in the Java API documentation.
Transfer of control
After breaking out of the Iterator loop, control transfers to
the return statement in Listing 33 below with match
containing true. A true value for match indicates that
the Subject line identifies the message as SPAM.
If the text line is not the Subject
line …
A line of text was read in Listing 24, and a test was made to see if
that line was the Subject line for the message.
If not, control transfers to the code that begins in Listing 27.
Since the line of text is not the Subject line, it
needs to be screened against the offensive words and phrases in a
different list designed for screening body text.
Listing 27 executes several steps in preparation for that screening
process.
else{ |
The code in Listing 27:
- Converts the text
line to upper case.
- Appends the text line
to the contents of the field named text in the output object of
type ScreenResult.
- Causes a single
period to be displayed in the progress indicator on the GUI of Figure 1.
Screen
the message body text line
The line of message body text is actually screened by invoking the screenOnPhrase
method in Listing 28. The third parameter in the invocation of
this method allows for one extraneous character to separate the
characters of the offending phrase in the line of body text.
Iterator iterator = bodyWordList. |
Loop
on an Iterator
An upper-case version of the message text line is screened against
upper-case versions of each of the offending words and phrases in a TreeSet
object referred to by bodyWordList.
An Iterator is used to cause this process to continue until
either a match is found, or the items in the list are exhausted.
The boolean value returned by screenOnPhrase is stored
in the variable named match.
If the returned value is true …
If the value returned by screenOnPhrase is true, the code in
the body of the if statement of Listing 29 is executed.
if(match == true){ |
This code performs the
following operations:
- Move the local copy
of the message from the history folder to an archive folder for the
same reasons given with respect to the Subject line
earlier.
- Break out of the Iterator
loop because there is no need to test against any additional offensive
words or phrases.
At this point, the value of
match is true meaning that a match has been found. Control
is transferred to the if statement at the top of Listing 30.
If screenOnPhrase returned false …
On the other hand, if no matches were found for any of the words and
phrases in bodyWordList, control reaches the if
statement at the top of Listing 30 with match containing a
value of false.
if(match == true)break; |
If match is false,
the code in Listing 30 loops back to the top of Listing 24, reads the
next line of text from the message, and begins the screening process
all
over again.
Close the file
If match is true in Listing 30, there is no need to do any
further testing so
the code in Listing 30 breaks out of the while loop responsible
for reading lines of text from the file, transferring control to the
top of Listing
31.
Control can also transfer to the top of Listing 31 when the end of the
text file has been reached. In that case, the value of match
will be false, indicating that no match was found.
inData.close();//Close file if still open |
The code in Listing 31
closes the file and finishes off the obligatory code for a try/catch
block.
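The overall control flow of Listings 24 through 31 (read a line, screen it against every phrase, break out as soon as a match is found, then close the file) can be sketched as shown below. This is a simplified illustration under several assumptions: the class name LineLoopSketch and the method screenBody are hypothetical, plain substring matching is used in place of screenOnPhrase, and a try-with-resources statement replaces the explicit call to close.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.TreeSet;

class LineLoopSketch {

  // Reads the message text one line at a time and returns the first
  // offending phrase found, or null if the end of the text is reached
  // without a match. The phrases are assumed to be stored in upper case.
  static String screenBody(Reader source, TreeSet<String> bodyWordList) {
    boolean match = false;
    String phrase = null;
    // try-with-resources closes the reader on every exit path.
    try (BufferedReader inData = new BufferedReader(source)) {
      String line;
      while ((line = inData.readLine()) != null) {
        String upperLine = line.toUpperCase();
        for (String candidate : bodyWordList) {
          // Simplification: substring test instead of screenOnPhrase.
          if (upperLine.contains(candidate)) {
            match = true;
            phrase = candidate;
            break; // leave the inner loop over the phrases
          }
        }
        if (match) {
          break; // leave the outer loop; no further lines are read
        }
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
    return match ? phrase : null;
  }
}
```

To screen a file on disk, a FileReader for that file could be passed as the source argument.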
Store the final phrase
In the event that a match was found, the variable phrase
contains the offending phrase that identifies the message as
spam. If a match was not found, the contents of phrase
are of no significant value. In either case, however, the value
of phrase is stored in the field named thePhrase in the
output object of type ScreenResult in Listing 32.
theResult.thePhrase = phrase; |
Return the value of match
The code in Listing 33 returns the value of the variable match.
This will either be the initial value of false if no match was found (see
Listing 8) or will be true if a match was found and the initial
value was overwritten with true.
return match; |
Return points for the screenMsg method
The screenMsg method contains three return statements.
The first occurs in Listing 15 where the code explicitly returns false,
indicating that a friendly phrase was found in the Subject line,
and that the message should not be deleted from the server.
The second occurs in Listing 18 where the code explicitly returns
false, indicating that the message was sent by a friend and therefore
shouldn’t be deleted from the server.
The third occurs in Listing 33. A match value of false
at this point indicates that the message was not identified as SPAM and
should not be deleted from the server. A match value of
true at this point indicates that the message is believed to be SPAM
and probably should be deleted from the
server.
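The three return points can be summarized with a compact sketch. This is not the screenMsg method itself: the class ReturnPointsSketch, its nested Result class, and the screen method with its parameters are all hypothetical, the lists are assumed to hold upper-case entries, and simple substring and set-membership tests stand in for the real screening code.

```java
import java.util.Set;

class ReturnPointsSketch {

  // Hypothetical stand-in for the ScreenResult output object.
  static class Result {
    String thePhrase;
  }

  // Returns true if the message is believed to be SPAM.
  static boolean screen(String subject, String from, String body,
                        Set<String> friendlySubjects,
                        Set<String> friends,
                        Set<String> offending,
                        Result theResult) {
    String upperSubject = subject.toUpperCase();
    for (String friendly : friendlySubjects) {
      if (upperSubject.contains(friendly)) {
        return false; // first return: friendly phrase in Subject line
      }
    }
    if (friends.contains(from.toUpperCase())) {
      return false; // second return: message was sent by a friend
    }
    boolean match = false; // initial value, overwritten on a match
    String phrase = null;
    String upperBody = body.toUpperCase();
    for (String candidate : offending) {
      if (upperBody.contains(candidate)) {
        match = true;
        phrase = candidate;
        break;
      }
    }
    theResult.thePhrase = phrase; // stored whether or not a match occurred
    return match; // third return: true only if a match was found
  }
}
```

Only the third return point can produce true, so a message is flagged as SPAM only if it passes both friendly checks and then matches an offending phrase.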
Preview of Future Lessons
This program is most useful
when you have well-developed lists of offending words and
phrases. Although it is possible to create those lists with a
text editor, you can be much more productive, and you are much more
likely to update the lists using the programs that I will present in
the next two lessons.
Therefore, I will give you a preview of those two
programs. I will show you three images that partially illustrate
the
capabilities of the two programs.
The first image (Figure 2) shows the GUI used to train the
algorithm to do a
better job of identifying SPAM in the Subject line of a
message.
The second and third images (Figures 3 and 4) show two
different aspects of the GUI used to train the algorithm to do a better
job of identifying SPAM in
the body text of a message.
(In all
three cases, the width of the GUI was reduced to make it fit into this
narrow publication format. The version that I routinely use is
much wider, and can therefore display much more information.)
Training the algorithm on the Subject line
Figure 2 illustrates the procedure that I use to train the algorithm to
do a better job of identifying SPAM in the Subject line
of future messages.
Figure 2 User interface for
training on Subject line
In Figure 2, a message previously stored in the history folder has been
loaded into the GUI. The complete raw text of that message is
available for viewing in the large text area if desired. The From
line and the Subject line are displayed in the
top two text fields in the GUI. (I purposely deleted the
Email address of the sender in all three of these images.)
User instructions are displayed in the fourth text field.
Offensive text in the Subject line
In this case, the user has identified offensive text (XANAX) in
the Subject
line and has selected that text with the mouse. Two
additional steps are required to add that text to the word list used to screen the Subject
line of future messages.
The first step is to press the button labeled Copy Selected Text.
This will cause the selected text to be copied into the third text
field from the top where it can be edited if desired.
(In the
event that the spammer inserted extra characters into the offensive
text,
such as in X-ANAX, the extra characters should be deleted before
proceeding to the second step.)
The second step is to press
the button labeled Post Text. This will cause the selected and (possibly)
edited text to be automatically added to the word list.
That is all that is required to cause the program to identify this
offensive text in the Subject lines of all future
messages.
Process the next message
If the user then presses the Next button, the next message in
the history folder will be loaded into the GUI. The current
message will
not be deleted from the history folder.
(This is
what you would normally do if you are going to use the same message
later to train the algorithm to better identify SPAM on the basis of
the body text.)
If the user presses the Delete
Local File button, the current file will be deleted from the
history folder and the next message in the history folder will be
loaded into the GUI.
(This is
what you would normally do if you have determined that the message is
not SPAM, or should not be used for further training of the algorithm
for some other reason. Perhaps the message was received from a
friend whose Email address has not yet been added to the list of
friendly Email addresses discussed earlier in this lesson. Note
that deleting the message file from the local disk does not delete the
message from the server.)
A very simple process
As you can
see, the process of training the algorithm on the Subject line
consists simply of selecting text with the mouse and pressing buttons
to cause the selected text to be added to the word list. This can
be accomplished very quickly with very little effort. Except for
the possible requirement to delete extra characters, no actual typing
is required.
(As an
alternative, the user can type anything into the third text field and
press the Post Text button to cause it to be added to the word
list. Any number of items can be added to the word list before
moving on to the next message.)
Training the algorithm on the body text
Figure 3 illustrates one aspect of the procedure for training the
algorithm to do a better job of identifying SPAM on the basis of the
body text of future messages.
Figure 3 User interface for
training on body text and IP address
Once again, a
message previously stored in the history folder has been
loaded into the GUI. The complete raw text of that message is
available for viewing in the large text area. The From
line and the Subject line are displayed in the
top two text fields in the GUI. User instructions are displayed
in the fourth text field from the top.
Add originating IP address to the list
At this point, the user has pressed the button labeled Select IP.
This caused the program to search out the IP address of the computer
that originally sent this message, and to copy that IP address into the
third text field from the top. All that is required to add that
IP address to the list of offending phrases is to press the button
labeled Post Word.
As you can see, getting the originating IP address of a SPAM message
and adding it to the word list is very simple. As before, once it
is in the third text field, you can edit if you like before adding it
to the list.
Getting offending text from the body text
Also at this point, you can scroll the text area. If you visually
identify something in that text that you believe will uniquely identify
messages from this spammer in the future, you can copy and paste that
text into the third text field, and then add the text to the list
by pressing the Post Word button.
Adding URLs to the list
One of the best ways to identify SPAM is to identify URLs referenced in
the SPAM messages. A URL is something that is difficult,
or at least expensive, for the spammer to change frequently. (Sometimes
the identification of one critical URL will cause hundreds and perhaps
thousands of future messages to be identified as SPAM.)
Adding URLs is very easy
Figure 4 illustrates a special feature of the program designed to let
you capitalize on that weakness. Once a message is loaded into
the GUI, each time you press the button labeled Select URL, the
program will search down through the message until it finds the next
block of text that begins with HTTP://. (This is
normally an indication of a URL.)
The program will select that URL, beginning with HTTP:// and including
everything out to the character before the next / character. That
/ character normally separates the domain name from a directory or file
name. Then the program copies the selected text into the third
text field.
(The spammer
can much more easily change directory and file names than domain names,
so they are excluded from the text that is selected and copied.)
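The selection rule described above (find the next HTTP://, then take everything up to the character before the next / character) can be sketched as follows. This is my own illustration, not the code from the training program: the class UrlSelectSketch and the method nextUrl are hypothetical, and I have assumed that the search ignores case and that the selection runs to the end of the text when no further / character exists.

```java
class UrlSelectSketch {

  private static final String PREFIX = "HTTP://";

  // Returns the next URL prefix found at or after fromIndex, running
  // from HTTP:// up to (but not including) the next '/' character.
  // Returns null when no more URLs can be found.
  static String nextUrl(String text, int fromIndex) {
    int start = Math.max(0, fromIndex);
    for (int i = start; i + PREFIX.length() <= text.length(); i++) {
      // Case-insensitive comparison (an assumption).
      if (text.regionMatches(true, i, PREFIX, 0, PREFIX.length())) {
        int end = text.indexOf('/', i + PREFIX.length());
        if (end < 0) {
          end = text.length(); // no directory or file name follows
        }
        return text.substring(i, end);
      }
    }
    return null;
  }
}
```

Repeated presses of a Select URL style button could be modeled by calling nextUrl again with an index just past the previous match.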
Figure 4 User interface for
training on body text and URL
A URL has been identified
In Figure 4, the program has selected the URL being used by the spammer
and has copied it into the text field. At this point, the user
can edit the URL if appropriate, and can add it to the list by
pressing the Post Word button.
Each time the user presses the Select URL button the next URL
in the message is copied into the text field. When no more URLs
can be found, a message to that effect is displayed in the text field.
Thus, it is very easy for the user to identify all the URLs being used
by the spammer and to add some or all of them to the list.
The next message
The behavior of the Next button and the Delete Local
File/Next button is the same as that discussed relative to Figure
2. I typically delete the file from the history folder after I
have used it to train the algorithm on the basis of body text.
Stay tuned
So, stay tuned. I will explain the programs that provide this
training capability in the next two lessons in this series.
Run the Program
I encourage you to copy the code from Listing 34 and the three
starter text files in Listing 35, Listing 36, and Listing 37 into
your text editor. Compile and execute the
program. Experiment with it, making changes and observing the
results of your changes.
You may want to modify this code to cause the message files to be
stored in a different location on your disk. If so, modify the strings
in Listing 34 that read "c:/MailFiles/" + uidl + ".txt"
and "c:/MailFiles/Archives/" + uidl + ".txt" to
specify a different folder. Make certain that the folder
where you plan to save the files exists before running the program.
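If you would rather have the program guarantee that the folder exists, something along the following lines could be used. This is only a sketch and is not part of Listing 34: the class MailFolderSketch and the method messagePath are hypothetical names, and the folder argument is whatever location you choose.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class MailFolderSketch {

  // Builds the path folder/uidl.txt, creating the folder (and any
  // missing parent folders) first if it does not already exist.
  static Path messagePath(String folder, String uidl) {
    Path dir = Paths.get(folder);
    try {
      Files.createDirectories(dir); // does nothing if it already exists
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return dir.resolve(uidl + ".txt");
  }
}
```

For example, messagePath("c:/MailFiles/Archives/", uidl) would produce the same file name as the second string quoted above, after making sure that the Archives folder is present.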
Before running the program, you will need to create three text files
having the following names and purposes and store them in the folder
containing your compiled Java class files for this program:
- Pop302a.txt – contains offensive Subject line words and phrases
- Pop302b.txt – contains offensive body text words and phrases
- Pop302c.txt – contains friendly Email addresses and friendly Subject
line material
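As a rough illustration of how a file of this kind might be read into a sorted list, consider the sketch below. It is not the loading code from Listing 34: the class WordListSketch and the method loadWordList are hypothetical, and I have assumed one word or phrase per line, with blank lines ignored and every entry converted to upper case.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.TreeSet;

class WordListSketch {

  // Reads one word or phrase per line into a TreeSet, which keeps the
  // entries sorted and silently discards duplicates.
  static TreeSet<String> loadWordList(Reader source) {
    TreeSet<String> list = new TreeSet<>();
    try (BufferedReader in = new BufferedReader(source)) {
      String line;
      while ((line = in.readLine()) != null) {
        String item = line.trim().toUpperCase();
        if (!item.isEmpty()) {
          list.add(item); // skip blank lines
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return list;
  }
}
```

Passing a FileReader opened on Pop302b.txt would produce a list suitable for the kind of upper-case comparison described earlier in this lesson.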
Eventually you will need to populate these files with words
and phrases that work well for you. (The algorithm training
programs that I will present in the next two lessons will be extremely
helpful in this regard.)
In the meantime, I have provided sample files in Listing 35,
Listing 36, and Listing 37 that you can use as starter lists. If
you receive the same kinds of SPAM that I receive, the words in these
lists should make it possible for you to test the program and get a few
hits on SPAM messages.
These are simply text files so feel free to add other words and
phrases as appropriate.
(Let me caution you not to enable the DELE code in Listing 34
until you are certain
that you actually want to delete messages from the server. Once a
message is deleted from the server, there is no way to recover it from
the server.)
Summary
The previous lesson explained the communications module of a program
used to remove SPAM from your Email server before it is
downloaded into your primary Email client.
This lesson explains my algorithm used to identify SPAM. You can
use the algorithm as is, or modify it to better suit your needs.
After about one week of training, my algorithm was reliably identifying
about ninety percent of all SPAM messages. I expect this
performance to improve over time as the algorithm becomes better
trained.
This program is most useful
when you have well-developed lists of offending words and
phrases. Although it is possible to create those lists with a
text editor, you can be much more productive, and you are much more
likely to update the lists using the programs that I will present in
the next two lessons.
What’s Next?
In the next lesson in this series, I will present and explain my
program named Pop302d, which provides an easy way to
train my screening algorithm to do a better job of identifying SPAM in
the Subject line of a message.
Complete Program Listing
A complete listing of the program is provided in Listing 34. In
addition, starter text files are provided in Listing 35, Listing 36,
and Listing 37.
The three DELE statements shown in red in Listing 34
have been purposely disabled to prevent you from accidentally deleting
messages from your server while testing this program.
Do not enable these three statements until you are ready
to actually delete messages from the server. Once a message is
deleted from the server, it cannot be recovered from the server.
Disclaimer of responsibility: If you elect to use this
program, you use it at your own risk. Make absolutely certain that you
understand what you are doing before you execute the program. The
author of this program, Richard G. Baldwin, and the websites Developer.com and Gamelan.com
accept no responsibility
for any losses that you may incur as a result of using this program.
/*File Pop302.java Copyright 2004, R.G.Baldwin |
Sample file Pop302a.txt
AGE REVERSING PRODUCT |
Sample file Pop302b.txt
123.456.789.123 |
Sample file Pop302c.txt
BALDWIN@DICKBALDWIN.COM |
Copyright 2004, Richard G. Baldwin. Reproduction in whole or
in
part in any form or medium without express written permission from
Richard
Baldwin is prohibited.
About the author
Richard Baldwin
is a college professor (at Austin Community College in Austin, TX) and
private consultant whose primary focus is a combination of Java, C#,
and XML. In addition to the many platform and/or language independent
benefits of Java and C# applications, he believes that a combination of
Java, C#, and XML will become the primary driving force in the delivery
of structured information on the Web.
Richard has participated in numerous consulting projects, and he
frequently provides onsite training at the high-tech companies located
in and around Austin, Texas. He is the author of Baldwin’s
Programming Tutorials, which
has gained a worldwide following among experienced and aspiring
programmers. He has also published articles in JavaPro magazine.
Richard holds an MSEE degree from Southern Methodist University
and has many years of experience in the application of computer
technology to real-world problems.
-end-