Java Programming Notes # 2186
Preface
Protection against spam
In recent months, I have been showing you how to write different kinds
of Java programs
to protect your email inbox from viruses and spam.
For example, the series of lessons that began with the lesson entitled Enlisting
Java in the War Against SPAM, Part 1, The Communications Module and
ended
with the lesson entitled
Enlisting Java
in the War Against SPAM: Training the Body Screener showed you how to write programs that
apply spam screening algorithms to your email.
Since the publication of those
lessons, I have made numerous improvements to the
screening algorithms. I am now prepared to
share the improved version of the spam screening programs with you.
Protection against email-borne viruses
The two lessons entitled Enlisting
Java in the War Against Email Viruses and
Enlisting Java
in the War Against Email Viruses, Part 2, A Much Faster Program described two
versions of a
program designed to protect your email database from email-borne
viruses.
The program described in the first lesson was completely
general, and should be compatible with any email client program running
on any platform.
The program in the second lesson is much faster,
but is limited to email client programs that store their messages
locally in a file format known as the MBOX format.
(Netscape 7, for example, uses MBOX-formatted files for
local storage of messages.)
Pulling it all together
In this and the next several lessons, I will pull all of this
technology together. I will present and explain the integrated
programs that I am currently using to protect
my email against both viruses and spam.
A challenge/response operating mode
These programs incorporate all of the spam screening improvements that
I have
made since publishing the original program in the lesson entitled Enlisting
Java in the War Against SPAM, Part 1, The Communications Module.
In addition, these programs incorporate a spam blocking
technique based on a challenge/response operating mode, which I
have not previously published. (I will explain the
challenge/response operating mode more fully later in this lesson.)
Several layers of defense
As you will see, these programs provide several layers of defense
against spam and viruses, with the keystone in the defense against spam
being the challenge/response operating mode.
I named this program BigDog
because it reminds me of a big ugly dog guarding my email inbox.
A general version and a more specialized version
As was the case in the earlier lessons, I will provide two
versions of the programs. One version provides compatibility with
all (or at
least most) email client programs. The second version is
faster than the first, but can only be used with email client programs
that use the MBOX file format for local storage of messages.
Supplementary material
I recommend that you also study the other lessons in my extensive
collection of online Java tutorials. You will find those lessons
published at Gamelan.com.
However, as of the date of this writing, Gamelan doesn’t maintain a
consolidated index of my Java tutorial lessons, and sometimes
they are difficult to locate there. You will find a consolidated
index at www.DickBaldwin.com.
Operational Discussion
The challenge/response methodology
Although my BigDog program provides several layers of defense
against viruses and spam, the keystone in the defense against spam is
the challenge/response system that I use. The basic
concept is that the first time someone sends an email message to me,
that person should be willing to respond to an email challenge and
confirm that they really did send the message. When they respond
to the challenge, their original message is delivered to my email inbox
and future messages from that person are delivered to my email inbox
without challenge.
This blocks most spam
This prevents most spam from reaching my inbox. To begin with,
most spammers don’t use valid email addresses and therefore cannot
respond to a challenge. (They won’t even receive it.)
Even those that do use a valid email address won’t usually respond to a
challenge. They send out thousands of spam messages and can’t
take
the time to respond to a challenge from a single recipient of one of
their messages.
On the rare occasion that a spammer does respond to a challenge, I put
their email address on a bad list and all future email from
that spammer is automatically blocked.
Will provide code in future lessons
This lesson provides an overview of how the programs work together to
block spam and viruses from my inbox. The next several lessons
will provide and explain the code for my virus and spam blocking
programs named BigDog02.
Similar to other challenge/response systems
In terms of the challenge/response concept, the program offers
capabilities similar to those described at the following URLs.
http://www.paganini.net/ask/
http://about.mailblocks.com/
http://www.0spam.com/
If you would like to read what others have to say about this concept,
visit the URLs listed above.
In addition to challenge/response, my BigDog02
programs provide several layers of protection against both spam and
viruses. I don’t believe that is the case with the products and
services described at the URLs listed above.
A freeware product
The material at http://www.paganini.net/ask/
describes a freeware product that apparently is not Windows
compatible. Because my BigDog program is written in Java,
it is compatible with many different operating systems. My BigDog
program is also free for personal use.
An online spam blocking service
As I understand it, http://about.mailblocks.com/
is the home page of a
spam blocking
service that can be purchased on the Web. With my BigDog program,
you don’t need to purchase an additional spam blocking service.
You can perform your virus and spam blocking locally. All that
you need to purchase is a good virus scanning program, which you should
already have anyway.
A free online service
The material at http://www.0spam.com/
describes a free online
service that claims to be compatible with any operating system and most
email servers.
Which one should you use?
If your objective is simply to obtain spam blocking services, you might
best be served by signing up for and using one of the products and
services listed above.
However, if your objectives also include virus protection and
learning how to program in Java, you might best be served by
studying my program and then evaluating it against
the other products and services listed above.
Compatibility
The BigDog02 programs run locally on any platform that
supports rudimentary Java, are compatible with POP3 email servers, and are
compatible with just about any email client program.
In writing the BigDog02 programs, I purposely avoided the use
of any exotic Java
programming features in
order to enhance the likelihood of platform compatibility across many
different operating systems.
My setup
I personally use the MBOX version of the BigDog02 program in
conjunction with the
email client program incorporated in Netscape v7.1 along with the
Norton
AntiVirus program.
It is very important to keep the virus definition files up to
date. Virus protection is only as good as the virus scanner
program being used. Because my email address has significant
exposure on a worldwide basis, I update my virus definition files on a
daily basis utilizing the Symantec Intelligent
Updater. This gives me confidence that I can trap and delete
messages transporting email-borne viruses before they contaminate my
email database. If your email address has less exposure than
mine, it may be satisfactory for you to update your virus definition
files less frequently.
(I am currently being pummeled by the different
variations of the W32.Beagle.J@mm
worm. I have trapped and deleted four or five email messages
containing this worm every day for the past couple of weeks.)
Message filters and junk mail controls
Because I use the Netscape email client, I am able to apply the Netscape
Message Filters and the Netscape
Junk Mail Controls to the output from the BigDog02 programs.
The Netscape filters allow me to direct the output from the BigDog02
program into specific email folders.
The Netscape Junk Mail Controls are based on a Bayesian analysis of
incoming
content, and as such can be trained to meet individual
spam-fighting needs. The use of the Junk Mail Controls
applies one additional level of defense against spam to the output from
the BigDog02 programs.
Not complicated to use
Because there are several programs involved, it might initially appear
that the programs are complicated to use. However, this is not
the case. My operating procedure is generally as follows.
Checking my email
Several times each day, I check my email by doing the following:
- Run the program named BigDog02g to download and save each
email message as a separate file on the disk.
- Run the antivirus program against the files downloaded by BigDog02g,
deleting any files that contain a virus.
- Run the program named BigDog02j to:
  - Forward the remaining messages to my email client program.
  - Send a challenge message in response to any message received from a
    stranger.
  - Delete the messages from my public email server.
- Run my email client program to read the messages now residing in
my local email data structure.
That’s all there is to it.
Training the spam screening algorithm
Occasionally, I do the following to enhance the ability of my spam
screening algorithm to identify spam:
- Manually copy a batch of message files from the Archives folder
into a folder named temp.
- Run the program named BigDog02k to delete those files that won’t
make a positive contribution to the process of training the algorithm.
- Run the program named BigDog02m to increase the vocabulary of
offensive words and phrases used by the spam screening algorithm to
identify spam.
I will have more to say later in this lesson about the role that the
spam screening algorithm plays in using these programs to block spam.
Why did I name these programs BigDog?
I gave the programs the name BigDog because they are not
wimpy little programs hiding under the desk applying filters to
incoming email messages in hopes of identifying those that are
spam. Rather, these programs are more like a big ugly dog
guarding my email
inbox, issuing a challenge to any stranger that tries to enter and
refusing entry to those that contain spam or viruses.
Messages from strangers are quarantined
All messages from strangers are put in temporary quarantine. In
order for the stranger to cause an email message to move from
quarantine
status into my email inbox, the stranger must be willing
to respond to a challenge message and to confirm that he or she
actually sent the
original message.
If the stranger responds to the challenge in the specified way, the
message
is automatically moved into my inbox. In addition, that person
ceases to be a stranger and future messages from that
person pass through to my inbox unimpeded, provided that they don’t
contain a virus.
(If I later decide that I don’t want to receive
messages from that person, it is an easy matter to block all future
messages from that email address.)
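The quarantine-and-release behavior described above can be pictured with a small in-memory sketch. This is not the BigDog02 code, which later lessons will present. The class name QuarantineSketch and its methods are hypothetical, and the way a reply to the challenge is actually recognized is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of quarantine bookkeeping; not the BigDog02 implementation.
final class QuarantineSketch {
    // Quarantined message files, keyed by the normalized sender address.
    private final Map<String, List<String>> quarantined = new HashMap<>();
    // Senders who have answered a challenge and are no longer strangers.
    private final Set<String> known = new HashSet<>();

    // Assumption: addresses are compared case-insensitively.
    private static String normalize(String address) {
        return address.trim().toLowerCase(Locale.ROOT);
    }

    boolean isStranger(String sender) {
        return !known.contains(normalize(sender));
    }

    /** Holds a message from a stranger; the caller is expected to send the challenge. */
    void quarantine(String sender, String messageFile) {
        quarantined.computeIfAbsent(normalize(sender), k -> new ArrayList<>()).add(messageFile);
    }

    /** Called once the sender has answered the challenge; returns the messages to deliver. */
    List<String> release(String sender) {
        String key = normalize(sender);
        known.add(key);
        List<String> held = quarantined.remove(key);
        return held == null ? Collections.<String>emptyList() : held;
    }

    public static void main(String[] args) {
        QuarantineSketch q = new QuarantineSketch();
        q.quarantine("Someone@Example.com", "msg001.txt");
        System.out.println(q.isStranger("someone@example.com"));
        System.out.println(q.release("someone@example.com"));
        System.out.println(q.isStranger("someone@example.com"));
    }
}
```

In this sketch, releasing a sender both returns the held messages and marks the sender as known, which mirrors the two effects of a successful response described above.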
Several different programs are involved
As mentioned above, there are two programs involved in the basic
day-to-day operation of the system and two additional programs used to
train the spam screening algorithm.
(Actually there are three programs involved in the basic
operation, but any one user
will only use two of them. All users will use the program named BigDog02g.
A user whose email client program supports MBOX files will use the
program named BigDog02j. Other users will use the program
named BigDog02i.)
The common program named BigDog02g downloads every message
currently on the user’s public email server and stores each of those
messages locally
as a separate file. This email server must be a POP3
server.
(The user never downloads mail directly from the public
email server, but rather downloads messages using the program that I
will provide.)
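As a rough illustration of what downloading from a POP3 server involves, the sketch below handles two details of the protocol defined in RFC 1939: reading the message count from a STAT reply, and reading a multi-line RETR response, which ends with a line containing a single period and in which the server doubles the leading period of any other line that starts with one. The class and method names are hypothetical, and this is not the BigDog02g code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper for two POP3 (RFC 1939) details; not the BigDog02g code.
final class Pop3Sketch {
    /** Extracts the message count from a STAT reply such as "+OK 2 320". */
    static int parseStatCount(String statReply) {
        String[] parts = statReply.trim().split("\\s+");
        if (parts.length < 3 || !parts[0].equals("+OK")) {
            throw new IllegalArgumentException("Unexpected STAT reply: " + statReply);
        }
        return Integer.parseInt(parts[1]);
    }

    /** Reads one multi-line response (for example the text sent by RETR) up to the "." line. */
    static String readMultiLine(BufferedReader in) {
        StringBuilder message = new StringBuilder();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals(".")) {
                    return message.toString();   // terminator reached
                }
                if (line.startsWith(".")) {
                    line = line.substring(1);    // undo the server's byte-stuffing
                }
                message.append(line).append("\r\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        throw new IllegalStateException("Stream ended before the terminating '.' line");
    }

    public static void main(String[] args) {
        String wire = "Subject: test\r\n\r\n..a line that began with a period\r\n.\r\n";
        System.out.print(readMultiLine(new BufferedReader(new StringReader(wire))));
        System.out.println(parseStatCount("+OK 2 320"));
    }
}
```

In a real download loop the reader would wrap the socket's input stream, and each message returned by readMultiLine would be written to its own file before the antivirus scan.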
Scan for viruses
At this point, the user runs his or her favorite virus scanner program
on the message files, eliminating any messages that contain viruses before
they become commingled with clean messages in the user’s email database.
(I described this operating mode and the advantages
thereof in the earlier lesson entitled Enlisting
Java in the War Against Email Viruses.)
The second program
After scanning the message files for viruses, the user runs the second
program.
(If the user’s email client program is an MBOX-based
program, the user runs the program named BigDog02j.
Otherwise,
the user runs the program named BigDog02i. Both programs
achieve
the same objective, but do so in somewhat different ways.)
Categorizing messages
The second program applies various criteria, including spam screening,
and categorizes each of the virus-free messages into
one of four categories:
- {GD} Good
- {BD} Bad
- {SP} Spam
- {QU} Quarantine
The text shown in matching curly braces in the above list is prefixed
(tagged) onto the subject line of each message. The
message
is then forwarded to the user’s email client program. This tag
can be used in conjunction with email filtering in the email client
program to direct the messages into different email folders.
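The tagging step can be sketched in a few lines. The four tags come from the list above; the class name SubjectTagger, the enum, and the method are hypothetical names chosen for illustration, not taken from the BigDog02 programs.

```java
// Hypothetical sketch of subject-line tagging; names are illustrative only.
final class SubjectTagger {
    /** The four categories described above, each with its subject-line tag. */
    enum Category {
        GOOD("{GD}"), BAD("{BD}"), SPAM("{SP}"), QUARANTINE("{QU}");

        final String tag;

        Category(String tag) {
            this.tag = tag;
        }
    }

    /** Prefixes the category tag onto the subject line. */
    static String tagSubject(Category category, String subject) {
        return category.tag + (subject == null ? "" : subject);
    }

    public static void main(String[] args) {
        System.out.println(tagSubject(Category.QUARANTINE, "Reduce Heart Risks"));
    }
}
```

A filter in the email client that looks for a subject beginning with one of these tags is then enough to route each message into its folder.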
(I also described the process of forwarding the
virus-free messages to the user’s email client program, but without
the
spam blocking features, in the lessons entitled Enlisting
Java in the War Against Email Viruses and
Enlisting
Java in the War Against Email Viruses, Part 2, A Much Faster Program.)
Forwarding the messages for the MBOX case
For the case where the user’s email client program is an MBOX-based
program, the BigDog02j program constructs an MBOX file
containing the virus-free messages and deposits that file in the
directory tree owned by the email client program. The next time
the user starts the email client program, all of the new messages will
appear
in a new folder in the client program display. Normal email
filtering capabilities provided by the client program can then be used
to
move those messages into specific folders based on the tags that were
prefixed onto the subject lines.
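To give a feel for what constructing an MBOX file involves, here is a minimal sketch of the common mbox layout: each message is preceded by a separator line that starts with "From " and is followed by a blank line, and any message line that itself starts with "From " is escaped with a leading ">". The class name, the envelope sender, and the date string passed in by the caller are assumptions for illustration. What BigDog02j actually writes, and where the client expects the file, is left to the later lessons.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the common mbox layout; not the BigDog02j code.
final class MboxSketch {
    /**
     * Joins raw messages into one mbox-style string.
     * envelopeSender and dateText only fill in the "From " separator line.
     */
    static String toMbox(List<String> rawMessages, String envelopeSender, String dateText) {
        StringBuilder mbox = new StringBuilder();
        for (String raw : rawMessages) {
            // Separator line that marks the start of each message.
            mbox.append("From ").append(envelopeSender).append(' ').append(dateText).append('\n');
            for (String line : raw.split("\r?\n")) {
                if (line.startsWith("From ")) {
                    mbox.append('>');   // escape so it is not read as a separator
                }
                mbox.append(line).append('\n');
            }
            mbox.append('\n');          // blank line ends the message
        }
        return mbox.toString();
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList(
            "Subject: {GD}Hello\n\nFrom here on, everything is fine.",
            "Subject: {QU}{094}{0}Reduce Heart Risks\n\nBody text");
        System.out.print(toMbox(messages, "sender@example.com", "Mon Mar  8 12:00:00 2004"));
    }
}
```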
The non-MBOX case
For the case where the user’s email client program is not an MBOX-based
program, the BigDog02i program constructs a new email message
for each of the virus-free messages and sends those
email messages to the user’s secret email account.
(The secret email account is an account that has never
been made known to
spammers, scammers, and virus writers. This email account
is not
required to be a POP3 account.)
The user checks for email on the secret account, and can apply
normal email filtering provided by the email client program to direct
the messages into different folders.
Other operations
In both cases, certain other operations are also performed. Those
operations will be described later in this lesson.
Local archives
A raw text version of every virus-free message is maintained as an
individual local file
for reasons that I will describe later.
(The user should manually delete the older message
files in the archives periodically to free up disk space.)
Lists
Several lists are maintained and used in the screening and tagging
process. All of the lists are plain text files, which can be
created and edited using a simple text editor.
Also, as you will see later, a special program named BigDog02m
is provided that uses actual email messages to update the list used for
spam screening on subject lines and HTML text. This program makes
it easy to identify offensive words and phrases and add them to the
list.
The GOOD list
The GOOD list is a plain text file that contains key phrases and words
that identify good messages. The occurrence of
one of these phrases or words in either the subject or the sender’s
email
address in a message will cause the message to be tagged {GD} and
forwarded to my email account with no further processing.
(On my system, the GOOD list is stored in a file named
BigDog02GoodList.txt, but as a Java programmer, you can modify the
program to use a different file name if you prefer.)
Messages that I want to read
Obviously, the GOOD list is used to identify email messages that I
want
to read.
For example, each semester I provide a set of specific keywords to my
students for them to include in the subject of each message that they
send to me. I place the new keywords in my GOOD list. The
occurrence of one of the keywords in a message from a student causes
that message to be delivered to my email inbox without the student
having to
satisfy the challenge/response requirement for strangers.
The GOOD list is automatically updated by the program whenever the
sender responds to a challenge message. (I will explain this
in more detail later.) Because the list is automatically
updated, and is subject to data loss in the event of a computer crash,
several levels of backup are automatically maintained.
The BAD list
The BAD list is also a plain text file containing key words and
phrases that identify bad messages. When one of these
words or phrases occurs in either the
subject or the sender’s email address, this causes the message to be
tagged {BD} and forwarded to my email account with no further
processing. Simple message filtering within my email client
program causes these messages to be stored temporarily in a Bad folder,
just in case I may need to refer to one of them later.
(I periodically delete the files in this folder to save
disk space.)
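The GOOD and BAD list checks described above amount to looking for any list entry in the subject or the sender's address. The sketch below shows one way to do that. It rests on several assumptions that the text does not confirm: one entry per line, case-insensitive substring matching, the GOOD list consulted before the BAD list, and UTF-8 list files. All names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of GOOD/BAD list screening; not the BigDog02 code.
final class ListScreenSketch {
    /** Loads a plain text list, one word or phrase per line (assumed UTF-8). */
    static List<String> load(Path listFile) throws IOException {
        return Files.readAllLines(listFile, StandardCharsets.UTF_8);
    }

    /** True if any non-blank entry occurs in the sender address or the subject. */
    static boolean matches(List<String> entries, String sender, String subject) {
        String senderLower = sender.toLowerCase(Locale.ROOT);
        String subjectLower = subject.toLowerCase(Locale.ROOT);
        for (String entry : entries) {
            String key = entry.trim().toLowerCase(Locale.ROOT);
            if (key.isEmpty()) {
                continue;   // ignore blank lines in the list file
            }
            if (senderLower.contains(key) || subjectLower.contains(key)) {
                return true;
            }
        }
        return false;
    }

    /** Returns "{GD}", "{BD}", or null when the message must go on to spam screening. */
    static String classify(List<String> good, List<String> bad, String sender, String subject) {
        if (matches(good, sender, subject)) {
            return "{GD}";
        }
        if (matches(bad, sender, subject)) {
            return "{BD}";
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> good = Arrays.asList("CS101-Spring");
        List<String> bad = Arrays.asList("pc-mall.com");
        System.out.println(classify(good, bad, "news@PC-MALL.COM", "Weekly specials"));
    }
}
```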
Messages that I don’t want to read
This list contains the email
addresses and subjects commonly used in messages that I don’t want to
read. These are messages that are easy to identify
and reject without the requirement for a fancy spam screening algorithm.
For example, I frequently receive messages from MAC-MALL.COM and
PC-MALL.COM. These messages are easy to identify and
reject. As far as I
know, these are legitimate companies that use email to advertise
legitimate products (as opposed to spammers and scammers that hide
behind false email addresses and try to camouflage their messages in an
air of respectability).
The small difference between spammers and scammers
My distinction between spammers and scammers is a very
small one. A spammer is someone who uses underhanded and
sometimes
unscrupulous email techniques in an attempt to sell me real, but
possibly
illegal products (such as prescription drugs without a
prescription).
A scammer is someone who uses underhanded and
unscrupulous email techniques simply to try to cheat me out of my
money.
For example, I suspect that some of the messages that I receive trying
to sell me prescription drugs from offshore pharmacies are
spammers. I suspect that the many messages that I receive from
Nigeria wanting to transfer millions of dollars into my bank account
are scammers, not spammers.
Since the distinction between them is so small, I usually
refer to both under the general description of spammer.
I still don’t want to read them
Even though MAC-MALL.COM is probably a legitimate company, because I
don’t own
a MAC, I am never interested in
information sent from MAC-MALL.COM.
Although I am occasionally
interested in products that are available from PC-MALL.COM, I know
where to find them on the web whenever I need them, so I’m not
interested in reading their frequent email advertisements either.
Both
of these (along with some email addresses from Nigeria) are
included in my BAD list.
Auto responder messages
My email address is well known across the web and throughout the
world. During each flurry of virus activity on the web, I receive
thousands of messages automatically sent by computers claiming that I
sent them a message containing a virus.
(These are cases where someone else sent a message
containing a virus and faked my email address as the sender.)
Since I am religious about using my antivirus software to keep my
computer clean and free of viruses, I’m confident that I didn’t send a
message containing a virus. Therefore, I’m not interested in
reading these
notification messages.
I have identified key phrases contained in
most such messages and have included these phrases in my BAD
list. This causes all such messages to be tagged as {BD} and
keeps these messages from cluttering my email inbox.
Basically, the messages in the {BD} category are messages that I never
look at unless I believe there is a possibility
that one of the valid messages that I sent to someone couldn’t be
delivered
and was returned undelivered.
The SPAM category
The BigDog02i and BigDog02j programs each apply a
common spam screener
to every message that is not categorized as either GOOD or BAD.
(This screener is an object instantiated from a class
named BigDog02SpamScreen01, which I am also going to provide in
a future lesson.)
This screener is similar to, but significantly improved over the
screener described in the earlier lesson entitled Enlisting
Java in the War Against SPAM, Part 2, The Screening Module.
For
example, this screener contains improvements in several areas discussed
below.
HTML in general
The use of HTML provides numerous opportunities for the spammer to hide
offensive content in an attempt to defeat spam blocking programs.
This includes the following:
- Creating the message in HTML with a large number of HTML tags
that tend to camouflage the offensive words and phrases.
- Inserting bogus HTML tags in the middle of offensive words to
keep the offensive words from being recognized.
- Writing offensive text as a series of HTML entities.
- Including large numbers of random words, phrases, and sentences
in an attempt to defeat statistical spam blocking programs.
The new screening module is reasonably effective in dealing with the
issues in the above list. Future lessons will explain the Java
code that I have written to deal with these issues.
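Until those lessons appear, the following sketch illustrates the general idea behind undoing two of the tricks in the list: it removes anything that looks like an HTML tag (which also rejoins words split by bogus tags) and then decodes numeric character entities. It is an assumption-laden illustration, not the screening module itself. The class name is hypothetical, named entities such as &amp; are left alone, and nothing here addresses random-word padding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of HTML clean-up before screening; not the BigDog02 screener.
final class HtmlCleaner {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Decimal (&#70;) or hexadecimal (&#x46;) numeric character references.
    private static final Pattern NUMERIC_ENTITY =
        Pattern.compile("&#([xX][0-9a-fA-F]{1,6}|[0-9]{1,7});");

    static String toPlainText(String html) {
        // Strip tags first so that a decoded "<" is never mistaken for markup.
        String noTags = TAG.matcher(html).replaceAll("");
        Matcher m = NUMERIC_ENTITY.matcher(noTags);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            boolean hex = body.charAt(0) == 'x' || body.charAt(0) == 'X';
            int codePoint = hex ? Integer.parseInt(body.substring(1), 16)
                                : Integer.parseInt(body);
            // Invalid code points are dropped rather than copied through.
            String decoded = Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : "";
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("Get r<x1>ich qu<zz>ick, it is &#70;&#x72;ee"));
    }
}
```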
Hiding techniques not dealt with
Some offensive content hiding techniques commonly used by spammers that
the new screening module doesn’t attempt to deal with are:
- Hiding offensive text in the background color.
- Displaying offensive text so small that it won’t be noticed by
the reader.
Although I probably could write the code to deal with these issues, I
believe that using the challenge/response operating mode for
final spam resolution is a more effective approach.
(The BigDog program uses spam screening mainly as a
preprocessor to reduce the workload on that portion of the program that
implements challenge/response. Spam screening also provides a
spam score that can be used visually to identify potentially good
messages that may not be properly handled by the challenge/response
system. I will have more to say about this later.)
Encoding in base64
As I understand it, the base64 encoding scheme was originally devised
to make it possible to reliably transmit eight-bit data through
transmission systems designed to handle seven-bit data. The
encoding scheme has been in use for many years.
Among other things, the use of base64 encoding makes it possible to:
- Transmit image data reliably across the Internet.
- Transmit non-English characters reliably across the Internet.
Unfortunately, the use of base64 encoding also makes it possible for
spammers to hide offensive text from spam blocking programs that are
not equipped to deal effectively with the hiding technique. The
new spam screening module deals effectively with the following:
- Encoded subject lines.
- Encoded body text in single-part messages.
- Encoded body text in multipart messages.
Future lessons will explain the Java code that I have written to deal
with these issues.
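As a preview of the kind of decoding involved, here is a minimal sketch that uses the standard java.util.Base64 class. It assumes that encoded subject lines use the MIME encoded-word form =?charset?B?data?= and that an encoded body part arrives as line-wrapped base64 text. The class and method names are hypothetical, and this is not the code from the screening module.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of base64 decoding for screening; not the BigDog02 screener.
final class Base64Sketch {
    // Matches encoded words of the form =?charset?B?data?= (base64 variant only).
    private static final Pattern ENCODED_WORD =
        Pattern.compile("=\\?([^?]+)\\?[bB]\\?([^?]*)\\?=");

    /** Replaces each base64 encoded word in a subject line with its decoded text. */
    static String decodeSubject(String subject) {
        Matcher m = ENCODED_WORD.matcher(subject);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded;
            try {
                byte[] bytes = Base64.getDecoder().decode(m.group(2));
                decoded = new String(bytes, Charset.forName(m.group(1)));
            } catch (IllegalArgumentException e) {
                decoded = m.group();   // bad data or unknown charset: leave untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Decodes a base64 body part; the MIME decoder ignores the line breaks. */
    static String decodeBody(String base64Body, Charset charset) {
        return new String(Base64.getMimeDecoder().decode(base64Body), charset);
    }

    public static void main(String[] args) {
        System.out.println(decodeSubject("=?UTF-8?B?RnJlZSBtb25leQ==?="));
        System.out.println(decodeBody("RnJlZSBt\r\nb25leQ==", StandardCharsets.UTF_8));
    }
}
```

Once the text has been decoded, it can be screened against the same word and phrase lists as any other subject line or body.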
Non-English messages
I also receive many messages written in some language other than
English. I have no way of knowing whether these messages are
valid messages or spam, because the only spoken language that I know is
English.
Therefore, regardless of the content of the messages, I don’t want them
to clutter my email inbox because I can’t read them. Future
lessons will explain the Java code that I have written to deal with
this issue.
(I understand that some of you may want to remove this
code so that you can read the messages, and that is perfectly fine with
me.)
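The text above does not say how non-English messages are recognized, so the following is only one crude possibility, offered as a sketch: measure what fraction of the letters in a message fall outside the ASCII range and compare it with a threshold. The class name, the method names, and the threshold are all assumptions. A heuristic like this would catch messages written in non-Latin scripts but would miss languages that use mostly unaccented Latin letters.

```java
// Hypothetical heuristic for spotting non-English text; the real method may differ.
final class NonEnglishHeuristic {
    /** Fraction of letters that lie outside the ASCII range (0.0 if there are no letters). */
    static double nonAsciiLetterRatio(String text) {
        int letters = 0;
        int nonAscii = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (Character.isLetter(codePoint)) {
                letters++;
                if (codePoint > 127) {
                    nonAscii++;
                }
            }
        }
        return letters == 0 ? 0.0 : (double) nonAscii / letters;
    }

    /** The threshold is an arbitrary tuning value supplied by the caller. */
    static boolean looksNonEnglish(String text, double threshold) {
        return nonAsciiLetterRatio(text) > threshold;
    }

    public static void main(String[] args) {
        System.out.println(looksNonEnglish("Ticket Delivery Notification", 0.3));
        System.out.println(looksNonEnglish("\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440", 0.3));
    }
}
```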
The bottom line on spam screening
As mentioned earlier, the BigDog programs use spam screening
mainly as a
preprocessor to reduce the workload on that portion of the program that
implements the challenge/response system, and to provide a spam
score that can be used to identify potentially good messages that are
not handled well by the challenge/response system.
Most spammers use invalid email addresses,
so there is no point in sending them a challenge message. In most
cases, it will simply be returned as undelivered mail. Even among
those that may use valid email addresses, most of them send thousands
of
spam messages at a time. They aren’t
interested in paying attention to a challenge message and usually won’t
respond.
(Occasionally a spammer or a scammer does use a valid
email address and will respond to a challenge message. Notable
among them are the people in Nigeria who send scam messages. They
often respond to the challenge, and when they do, I simply enter their
email address in my BAD list. That way, I never have to look at
another email message from that same email address.)
The spam screening lists
The new spam screening module uses two lists of offensive phrases and
words to perform the screen. These lists are maintained in simple
text files having the following names:
- BigDog02SubjAndHtml.txt
- BigDog02RawText.txt
The first list is used to screen the subject line and to screen the
body of email messages containing HTML where the HTML has been
converted
to plain text.
The second list is used to screen raw body text that does not contain
HTML.
The screening of subject lines and cleaned-up HTML runs very
rapidly. Therefore, I don’t need to be particularly concerned
about the size of the list used for that purpose.
Screening raw body text takes somewhat longer. Therefore, I
need to keep the second list shown above trimmed down such that every
item in the list has a high probability of making a hit.
Avoiding false positives
Depending on how aggressive I am in establishing and updating the
lists, and how
aggressive I am in setting a threshold value named hitLimit in
the program, I can adjust the sensitivity of the screening algorithm.
For
example, at one extreme, the spam screening algorithm will identify
almost all spam messages, but will also falsely identify some good
messages as spam (false
positives).
At the other extreme, the algorithm will fail to properly identify some
spam messages, but will experience very few, if any, false
positives.
Challenge/response is the main spam blocking process
The challenge/response procedure is the main spam blocking
technique used in
the BigDog programs. Spam screening is simply a
preprocessor for the challenge/response process. The use
of spam screening as a preprocessor avoids the sending of challenge
messages to spammers whose email addresses are probably invalid
anyway. Among other things, this conserves bandwidth.
Because the spam screening process is not the main technique used to
block spam in the BigDog program, I keep the spam screening
algorithm adjusted to a relatively low sensitivity. I would
rather have the spam screening algorithm miss a spam
message, so that the message is challenged instead, than have the
algorithm falsely identify
a good message as spam.
By adjusting things this way, I can be very confident that all
messages categorized as SPAM have been correctly categorized. I
rarely even look at them. I simply direct them into a SPAM folder
(in case I want to go back and look at them later) and manually
delete them after a few days to conserve disk space.
A known deficiency in the challenge/response procedure
Unfortunately, this is not a perfect world. I am aware
of one
major flaw in the use of the challenge/response
procedure. Occasionally I receive a message that was sent automatically
by a computer, but that I do want to read.
(For example, my health plan requires me to order
prescription drugs for delivery by mail rather than to purchase them
from a local pharmacy. Occasionally, I receive an email message
from the drug company. I have their email address in my GOOD
list, but there is a possibility that they may change that email
address in the future. I want to read those email
messages.)
Good messages from computers could be lost
If the address of the sending computer (or a keyword in the subject
of the message) is not in the GOOD list, that message will be
placed in quarantine (to be discussed later). A challenge
message will be sent back to the computer that sent the message.
However, it is very unlikely that the computer that sent the message
will respond to the challenge.
(Even if the computer is running an auto responder
program, it is unlikely that it will respond in the required manner to
the challenge message.)
It is possible, therefore, that
the message could simply remain in the QUARANTINE folder until I
manually delete it later to free up disk space.
Spam screening to the rescue
One of the ways that I reduce the likelihood of false positives in the
spam categorization process is by requiring that the spam screening
module experience multiple hits against offensive words and phrases in
the subject and the body of the message. The number of hits must
meet a threshold value identified by a variable named hitLimit
before a message is categorized as SPAM.
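The threshold test can be sketched as follows. The variable name hitLimit comes from the program. Everything else is an assumption made for illustration, including the class name and the choice to count each list entry at most once, using case-insensitive substring matching.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of hit counting against a word list; not the BigDog02 screener.
final class HitCounter {
    /** Counts how many list entries occur somewhere in the text (each entry counts once). */
    static int countHits(List<String> offensivePhrases, String text) {
        String haystack = text.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String phrase : offensivePhrases) {
            String key = phrase.trim().toLowerCase(Locale.ROOT);
            if (!key.isEmpty() && haystack.contains(key)) {
                hits++;
            }
        }
        return hits;
    }

    /** A message is categorized as SPAM only when the hits meet the threshold. */
    static boolean isSpam(int hits, int hitLimit) {
        return hits >= hitLimit;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("free money", "act now", "no prescription");
        int hits = countHits(phrases, "FREE MONEY if you act now!");
        System.out.println(hits + " hits, spam: " + isSpam(hits, 3));
    }
}
```

Raising hitLimit lowers the sensitivity of the screener, which trades missed spam for fewer false positives.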
The spam score
Regardless of whether the message is ultimately categorized as SPAM or
as QUARANTINE, the number of hits (the spam score) is prefixed
onto the subject of the message. This makes it visible later when
viewing the message in the email client program. For example,
here is the subject line for a typical message in the QUARANTINE folder.
{QU}{094}{0}Reduce Heart Risks
The {QU} tag indicates that this message was categorized as QUARANTINE
and caused this message to be placed in the QUARANTINE folder.
The message number
The {094} tag indicates that this was message number 94 on the public
email server before it was downloaded.
(The message number was useful for testing and debugging
the program during the development stage. I may decide to remove
it now that I have most of the bugs worked out of the program.)
A spam score of zero
The {0} tag indicates that this message had a spam score of zero.
(The spam screening module didn’t find any offensive
words or phrases in this message. In the particular case of the
message identified above however, an examination of
the message showed that the message actually was spam. The lack
of a
spam score greater than zero resulted from the fact that the spammer
managed to cleverly hide the offensive material in lots of complex
HTML.)
Checking the spam score
Before deleting messages from the QUARANTINE folder, I select all
messages that have a spam score of 0 and do a visual check on the
subjects and the senders. Selecting a spam score of 0 usually
results in a small
number of messages that require checking. Most of the messages in
QUARANTINE for which
the sender fails to respond to the challenge message have a spam score
of 1 or greater.
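I do this selection by eye in the email client, but the tag layout also lends itself to a programmatic check. The sketch below parses a subject of the form shown earlier, {QU}{094}{0}Reduce Heart Risks, and flags quarantined messages with a spam score of zero. The class, field, and method names are hypothetical, and the sketch assumes that the three tags always appear in that order.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for tagged subject lines; not part of the BigDog02 programs.
final class TaggedSubject {
    // {category}{message number}{spam score}original subject
    private static final Pattern FORMAT = Pattern.compile(
        "^\\{([A-Z]{2})\\}\\{(\\d{1,9})\\}\\{(\\d{1,9})\\}(.*)$", Pattern.DOTALL);

    final String category;
    final int messageNumber;
    final int spamScore;
    final String subject;

    private TaggedSubject(String category, int messageNumber, int spamScore, String subject) {
        this.category = category;
        this.messageNumber = messageNumber;
        this.spamScore = spamScore;
        this.subject = subject;
    }

    /** Returns the parsed tags, or null if the subject does not carry all three tags. */
    static TaggedSubject parse(String taggedSubject) {
        Matcher m = FORMAT.matcher(taggedSubject);
        if (!m.matches()) {
            return null;
        }
        return new TaggedSubject(m.group(1), Integer.parseInt(m.group(2)),
                                 Integer.parseInt(m.group(3)), m.group(4));
    }

    /** True for quarantined messages whose spam score is zero. */
    static boolean needsVisualCheck(String taggedSubject) {
        TaggedSubject t = parse(taggedSubject);
        return t != null && t.category.equals("QU") && t.spamScore == 0;
    }

    public static void main(String[] args) {
        System.out.println(needsVisualCheck("{QU}{094}{0}Reduce Heart Risks"));
    }
}
```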
A backup check
The spam score provides a simple mechanism by which I can quickly do a
backup
visual check to avoid losing good messages that were sent by a
computer. I can usually identify a computer-generated message
that I want to read
by looking at the sender and the subject.
(Most computer-generated spam messages claim to have
been sent by a person, not by a computer. For example, the spam
message that I highlighted earlier claims to have been sent by someone
named Noe Bragg. This probably is not a real person, or if it is
the
name of a real person, the name probably doesn’t match up with the
email address that was given.)
A good computer-generated message
For example, here are typical From and Subject lines
for a computer-generated message that I did want to read:
From: “American Airlines (AA)”
<notify@aa.com>
Subject: Ticket Delivery Notification
When I identify such a message, I manually enter the sending computer’s
email address,
and possibly some keywords from the subject, into the GOOD list so that
future messages from the same computer will be categorized as GOOD.
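A GOOD list of this kind could be sketched as follows. This is only one plausible matching rule, not the rule used by the BigDog programs. It is assumed here that an entry matches when the From line contains the stored address and the subject contains every stored keyword. The class name GoodList and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical GOOD list: the matching rule is an assumption.
class GoodList {
    // One entry: a sender address plus optional subject keywords.
    private static final class Entry {
        final String address;
        final List<String> keywords;

        Entry(String address, List<String> keywords) {
            this.address = address;
            this.keywords = keywords;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Stores the address and keywords in lower case for case-insensitive matching.
    void add(String address, String... subjectKeywords) {
        List<String> keywords = new ArrayList<>();
        for (String k : subjectKeywords) {
            keywords.add(k.toLowerCase(Locale.ROOT));
        }
        entries.add(new Entry(address.toLowerCase(Locale.ROOT), keywords));
    }

    // True when some entry's address appears in the From line and
    // all of that entry's keywords appear in the subject.
    boolean isGood(String from, String subject) {
        String f = from.toLowerCase(Locale.ROOT);
        String s = subject.toLowerCase(Locale.ROOT);
        for (Entry e : entries) {
            if (!f.contains(e.address)) {
                continue;
            }
            boolean allFound = true;
            for (String k : e.keywords) {
                if (!s.contains(k)) {
                    allFound = false;
                    break;
                }
            }
            if (allFound) {
                return true;
            }
        }
        return false;
    }
}
```

With an entry added as add("notify@aa.com", "Ticket"), the From and Subject lines shown above would match, while a message from the same address with an unrelated subject would not.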
Could miss some messages
I usually ignore all messages in QUARANTINE with a spam score greater
than zero. I recognize that if I receive a good message from a
computer and that
message receives a spam score greater than zero, I will probably lose
the message without reading it. Unfortunately, there is no
perfect system. The likelihood of this happening is so low that
I’m willing to accept the risk.
Training the screening algorithms
It is possible to train the screening algorithms using a text editor to
add words and phrases to the text files. However, this isn’t very
convenient. The programs named BigDog02k and BigDog02m
are provided to make it both convenient and easy to train the
algorithm used to screen subject lines and HTML text for offensive
words and phrases.
(No special program is provided to make it convenient
and easy to train the algorithm that screens raw (non-HTML) body
text. Most spam messages use HTML. Therefore, the benefits
of screening raw body text are minimal. Also, because this
algorithm runs rather slowly, care must be taken to keep the list of
offensive words and phrases short and to the point. Therefore, I
decided not to make it convenient and easy to add words and phrases to
the list used to screen raw body text. As a Java programmer, you
could certainly write such a program. If you decide to do this,
you might want to take a look at the program presented in the lesson
entitled Enlisting
Java in the War Against SPAM: Training the Body Screener as a
starting point.)
The programs named BigDog02k and BigDog02m
Training the algorithm used to screen subject lines and HTML text
essentially consists of increasing the vocabulary of offensive words
and phrases used in the screening process. I do this on a
somewhat random basis. Whenever I have a few minutes
to spare (such as while waiting for one of my classes to begin),
I copy thirty or forty files from the Archives
folder into a folder named temp and then run these two programs
in succession.
The program named BigDog02k
This program has a very simple purpose. It examines each of the
message files in the temp folder, and deletes all of the files
that would be categorized by the programs named BigDog02i or BigDog02j
as {GD}, {BD}, or {SP}. In addition, the
program deletes all message files in the {QU} category with a spam
score greater than zero.
(There is no point in using these files for training the
algorithm because the algorithm already knows how to deal with them.)
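The selection rule described above can be sketched as a simple predicate. This is not the BigDog02k code. The Verdict class, which stands in for whatever the categorizing programs report about a file, and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the pruning rule; not the BigDog02k source.
class TrainingFilter {
    // Stand-in for the categorizer's result for one message file.
    static final class Verdict {
        final String fileName;
        final String category; // GD, BD, SP, or QU
        final int spamScore;

        Verdict(String fileName, String category, int spamScore) {
            this.fileName = fileName;
            this.category = category;
            this.spamScore = spamScore;
        }
    }

    // Only {QU} messages with a spam score of zero are useful for training.
    static boolean keepForTraining(Verdict v) {
        return "QU".equals(v.category) && v.spamScore == 0;
    }

    // Returns the names of the files that would be deleted from temp.
    static List<String> filesToDelete(List<Verdict> verdicts) {
        List<String> doomed = new ArrayList<>();
        for (Verdict v : verdicts) {
            if (!keepForTraining(v)) {
                doomed.add(v.fileName);
            }
        }
        return doomed;
    }
}
```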
The program named BigDog02m
This program is very similar in operation to the program described in
the lesson entitled
Enlisting Java
in the War Against SPAM: Training the Subject Line Screener. The main difference is that this
version of the program is much more sophisticated than the one
described in that earlier lesson. For example, this version knows
how to deal with spam hiding techniques such as base64 encoding and
bogus HTML tags.
For the time being, I will simply refer you back to that earlier
lesson.
However, I will discuss the operation of this version of the program in
more detail in a subsequent lesson, providing suggestions as to how the
program might be used to best advantage.
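To give a rough idea of what dealing with these two hiding techniques can involve, here is a minimal sketch that uses the standard java.util.Base64 class and a regular expression. It is not the BigDog02m code. The class name Unhider is hypothetical, the ISO-8859-1 character set is an assumption, and the tag-stripping pattern is deliberately naive.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch; not the BigDog02m source.
class Unhider {
    // Decodes a base64 body. The MIME decoder tolerates the line
    // breaks that email software inserts into encoded text.
    static String decodeBase64(String encoded) {
        byte[] raw = Base64.getMimeDecoder().decode(encoded);
        // Assumption: the body is ISO-8859-1 text.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Removes anything that looks like a tag, real or bogus, so that
    // words split by tags are rejoined before screening.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // prints: mortgage
        System.out.println(stripTags("mort<zz>gage"));
    }
}
```

Stripping the tags before matching means that a phrase broken up by a made-up tag such as <zz> can still be found by a plain substring search.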
Why continue to train the algorithm
When you first start using this set of programs, you will need to
create a text file containing offensive words and phrases used in the
screening of subject lines and HTML body text. The programs named
BigDog02k and BigDog02m will provide the most
convenient way for you to populate that list.
After a while, your list will contain a large percentage of the
offensive words and phrases commonly used in spam. For example,
my list
currently contains about 1800 offensive words and phrases.
Desensitized to spam
I currently have the sensitivity of the spam screening algorithm turned
down in order to
greatly reduce the risk of false positives.
I am currently using a hitLimit value of 6 in the
program named BigDog02j. As a result, the spam screener
must find six or more offensive words or phrases in a message to
categorize that message as
{SP} rather than {QU}.
(Actually, the spam scoring algorithm is a little more
complicated than this, but hopefully you get the idea of how to adjust
the sensitivity of the screening algorithm.)
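The effect of hitLimit can be illustrated with a simplified sketch. As noted above, the real scoring algorithm is more complicated than this. Here the score is assumed to be the number of listed words or phrases that occur at least once in the text. The class name SpamScorer and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified, hypothetical scoring; the real algorithm is more involved.
class SpamScorer {
    // Counts how many listed phrases occur at least once in the text.
    static int score(String text, List<String> offensivePhrases) {
        String t = text.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String phrase : offensivePhrases) {
            if (t.contains(phrase.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return hits;
    }

    // A message reaches {SP} only when the score meets the limit;
    // otherwise it stays in {QU}.
    static String categorize(int score, int hitLimit) {
        return score >= hitLimit ? "SP" : "QU";
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("free money", "act now");
        int s = score("Act NOW for free money", phrases);
        // prints: 2 QU
        System.out.println(s + " " + categorize(s, 6));
    }
}
```

Raising hitLimit makes the screener less sensitive, which sends more messages to {QU} and fewer to {SP}.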
The bad news and the good news
The bad news is that this causes the program to send a lot of
challenge messages to spammers, which result in undeliverable email
messages in return.
The good news is that this eliminates the
need for me to be concerned about false positives. Once a message
is categorized as {SP}, it has a very high probability of actually
being spam.
Diminishing returns
Once the list is populated to the level of my list, continued training
of the
algorithm becomes less effective in causing messages to be categorized
as {SP} rather than {QU}. Thus, it might appear that continued
training of the algorithm is a waste of time.
A very important aspect of continued training
However, even when the list is populated at this level, continued
training does have one very significant operational impact.
Recall my earlier discussion about needing to examine messages in the
QUARANTINE category with a spam score of zero to identify messages sent
by a computer that I might want to read. It is very desirable to
keep the number of such messages as small as practical.
Continued training of the algorithm has a very significant impact in
causing messages that might otherwise end up in QUARANTINE with a spam
score of zero
to have a spam score of one or more instead.
(A single hit on an offensive word or phrase is
sufficient to cause the spam score to advance from zero to one.)
This significantly decreases the number of messages that I need to
examine looking for good messages sent by a computer.
The QUARANTINE category
The {QU} category is the most interesting of all the categories.
All messages that
are not categorized as {GD}, {BD}, or {SP} are put into the {QU}
category along with the spam score for the message.
Statistically, once the GOOD list is reasonably
well populated, most of the QUARANTINE messages will be spam,
scam,
or messages that transport viruses. However, occasionally a good
message from someone whose email address hasn’t yet been added to the
GOOD list will end up in QUARANTINE.
Forwarded for viewing in email client program
As with the other three categories, all messages that fall into the
QUARANTINE category are tagged {QU} and forwarded for viewing in my
email
client program.
(As mentioned earlier, each message in QUARANTINE is
also tagged with
the spam score for that message.)
Other than to periodically look at messages in
the QUARANTINE category with a spam score of zero to identify good
computer-generated messages, I rarely look at messages in the
QUARANTINE category. If there is a good message in the QUARANTINE
category
that was sent by a human, it should be automatically identified later
and moved into the GOOD category through the challenge/response
procedure.
Good human-generated messages in QUARANTINE
A very small percentage of the messages in the QUARANTINE
category are messages that I really do want to read, even
though that might not be apparent from an examination of the subject or
the sender.
(I receive many email messages regarding Java with
subjects that simply read Help, Hello, Hi,
etc. These are also common subjects for spam, scam, and messages
that transport viruses. It is this small percentage of good
messages that causes email
screening to be such a complex process.)
The challenge/response procedure
Whenever a message is categorized as QUARANTINE, a challenge message
containing words similar to those shown in Figure 1 is automatically
sent to the sender. The idea is that anyone who
really
wants to communicate with me via email should be willing to respond to
the
message, on a one-time basis, to have their email address added to my
GOOD list, and to have their original message moved from my QUARANTINE
folder into my GOOD folder.
Figure 1. Text of the challenge message (beginning only): I recently received a message from your ...
Retrieving the original message
The subject of the challenge message contains a unique ID that
identifies the original message. When a response to the
challenge is received, the program goes to the local Archives folder,
retrieves the original message, tags it as {GD}, and
forwards it to be viewed in my email client program. It will be
there
for me to read the next time I start my email client program.
In addition, the sender’s email address is added to the GOOD list so
that future messages from that sender will be forwarded to my email
client program without delay.
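The bookkeeping behind this procedure can be sketched as follows. This is an illustration only, not the code from the BigDog programs. The ID format in the subject, the class name ChallengeTracker, and the in-memory map and set are all assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of challenge/response bookkeeping.
class ChallengeTracker {
    // Assumed ID format inside the challenge subject: [ID:abc123]
    private static final Pattern ID = Pattern.compile("\\[ID:([A-Za-z0-9]+)\\]");

    // Maps each outstanding ID to {sender address, archived file name}.
    private final Map<String, String[]> pending = new HashMap<>();
    private final Set<String> goodList = new HashSet<>();

    // Records the quarantined message and builds the challenge subject.
    String challengeSubject(String id, String sender, String archiveFile) {
        pending.put(id, new String[] {sender, archiveFile});
        return "Please confirm your message [ID:" + id + "]";
    }

    // On a valid response, adds the sender to the GOOD list and returns
    // the archived file to be tagged {GD} and forwarded. Returns null
    // when the subject carries no known ID.
    String handleResponse(String responseSubject) {
        Matcher m = ID.matcher(responseSubject);
        if (!m.find()) {
            return null;
        }
        String[] entry = pending.remove(m.group(1));
        if (entry == null) {
            return null;
        }
        goodList.add(entry[0]);
        return entry[1];
    }

    boolean isGood(String sender) {
        return goodList.contains(sender);
    }
}
```

Because the ID is removed from the pending map on the first valid response, a repeated or forged response with the same ID is ignored in this sketch.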
An additional twist
There is one additional twist that appears near the end of Figure
1. The message also contains instructions on how to file a
complaint if
the email address has been hijacked and used for the distribution of
spam, scam, or viruses.
(Unfortunately, hijacking of email addresses is a fairly
common practice in the
distribution of viruses that get their email addresses from the local
address book on contaminated computers.)
What’s Next?
I will begin presenting and explaining code in the next lesson.
That lesson and several lessons following that one will explain each of
the programs. The lessons will also explain the code behind the
screening improvements that have been made since the publication of the
lesson entitled Enlisting
Java in the War Against SPAM, Part 2, The Screening Module.
Copyright 2004, Richard G. Baldwin. Reproduction in whole or
in
part in any form or medium without express written permission from
Richard
Baldwin is prohibited.
About the author
Richard Baldwin
is a college professor (at Austin Community College in Austin, TX) and
private consultant whose primary focus is a combination of Java, C#,
and XML. In addition to the many platform and/or language independent
benefits of Java and C# applications, he believes that a combination of
Java, C#, and XML will become the primary driving force in the delivery
of structured information on the Web.
Richard has participated in numerous consulting projects, and he
frequently provides onsite training at the high-tech companies located
in and around Austin, Texas. He is the author of Baldwin’s
Programming Tutorials, which
has gained a worldwide following among experienced and aspiring
programmers. He has also published articles in JavaPro magazine.
Richard holds an MSEE degree from Southern Methodist University
and has many years of experience in the application of computer
technology to real-world problems.