
Overview of the BigDog Email Protection Program

  • May 11, 2004
  • By Richard G. Baldwin

Java Programming Notes # 2186


Preface

Protection against spam

In recent months, I have been showing you how to write different kinds of Java programs to protect your email inbox from viruses and spam.  For example, the series of lessons that began with the lesson entitled Enlisting Java in the War Against SPAM, Part 1, The Communications Module and ended with the lesson entitled Enlisting Java in the War Against SPAM: Training the Body Screener showed you how to write programs that apply spam screening algorithms to your email.

Since the publication of those lessons, I have made numerous improvements to the screening algorithms.  I am now prepared to share the improved version of the spam screening programs with you.

Protection against email-borne viruses

The two lessons entitled Enlisting Java in the War Against Email Viruses and Enlisting Java in the War Against Email Viruses, Part 2, A Much Faster Program described two versions of a program designed to protect your email database from email-borne viruses.

The program described in the first lesson was completely general, and should be compatible with any email client program running on any platform.

The program in the second lesson is much faster, but is limited to email client programs that store their messages locally in a file format known as the MBOX format.

(Netscape 7, for example, uses MBOX-formatted files for local storage of messages.)

Pulling it all together

In this and the next several lessons, I will pull all of this technology together.  I will present and explain the integrated programs that I am currently using to protect my email against both viruses and spam.

A challenge/response operating mode

These programs incorporate all of the spam screening improvements that I have made since publishing the original program in the lesson entitled Enlisting Java in the War Against SPAM, Part 1, The Communications Module.

In addition, these programs incorporate a spam blocking technique based on a challenge/response operating mode, which I have not previously published.  (I will explain the challenge/response operating mode more fully later in this lesson.)

Several layers of defense

As you will see, these programs provide several layers of defense against spam and viruses, with the keystone in the defense against spam being the challenge/response operating mode.

I named this program BigDog because it reminds me of a big ugly dog guarding my email inbox.

A general version and a more specialized version

As was the case in the earlier lessons, I will provide two versions of the programs.  One version provides compatibility with all (or at least most) email client programs.  The second version is faster than the first, but can only be used with email client programs that use the MBOX file format for local storage of messages.

Supplementary material

I recommend that you also study the other lessons in my extensive collection of online Java tutorials.  You will find those lessons published at Gamelan.com.  However, as of the date of this writing, Gamelan doesn't maintain a consolidated index of my Java tutorial lessons, and sometimes they are difficult to locate there.  You will find a consolidated index at www.DickBaldwin.com.

Operational Discussion

The challenge/response methodology

Although my BigDog program provides several layers of defense against viruses and spam, the keystone in the defense against spam is the challenge/response system that I use.  The basic concept is that the first time someone sends an email message to me, that person should be willing to respond to an email challenge and confirm that they really did send the message.  When they respond to the challenge, their original message is delivered to my email inbox and future messages from that person are delivered to my email inbox without challenge.

This blocks most spam


This prevents most spam from reaching my inbox.  To begin with, most spammers don't use valid email addresses and therefore cannot respond to a challenge.  (They won't even receive it.)  Even those that do use a valid email address won't usually respond to a challenge.  They send out thousands of spam messages and can't take the time to respond to a challenge from a single recipient of one of their messages.

On the rare occasion that a spammer does respond to a challenge, I put their email address on a bad list and all future email from that spammer is automatically blocked.

Will provide code in future lessons

This lesson provides an overview of how the programs work together to block spam and viruses from my inbox.  The next several lessons will provide and explain the code for my virus and spam blocking programs named BigDog02.

Similar to other challenge/response systems

In terms of the challenge/response concept, the program is similar to the products and services described at the following URLs.

http://www.paganini.net/ask/
http://about.mailblocks.com/
http://www.0spam.com/

If you would like to read what others have to say about this concept, visit the URLs listed above.

In addition to challenge/response, my BigDog02 programs provide several layers of protection against both spam and viruses.  I don't believe that is the case with the products and services described at the URLs listed above.

A freeware product

The material at http://www.paganini.net/ask/ describes a freeware product that apparently is not Windows compatible.  Because my BigDog program is written in Java, it is compatible with many different operating systems.  My BigDog program is also free for personal use.

An online spam blocking service

As I understand it, http://about.mailblocks.com/ is the home page of a spam blocking service that can be purchased on the Web.  With my BigDog program, you don't need to purchase an additional spam blocking service.  You can perform your virus and spam blocking locally.  All that you need to purchase is a good virus scanning program, which you should already have anyway.

A free online service

The material at http://www.0spam.com/ describes a free online service that claims to be compatible with any operating system and most email servers.

Which one should you use?

If your objective is simply to obtain spam blocking services, you might best be served by signing up for and using one of the products and services listed above.

However, if your objectives also include virus protection and learning how to program in Java, you might best be served by studying my program, and then evaluating it against the other products and services listed above.

Compatibility

The BigDog02 programs run locally on any platform that supports rudimentary Java, are compatible with POP3 email servers, and are compatible with just about any email client program.

In writing the BigDog02 programs, I purposely avoided the use of any exotic Java programming features in order to enhance the likelihood of platform compatibility across many different operating systems.

My setup

I personally use the MBOX version of the BigDog02 program in conjunction with the email client program incorporated in Netscape v7.1 along with the Norton AntiVirus program.

It is very important to keep the virus definition files up to date.  Virus protection is only as good as the virus scanner program being used.  Because my email address has significant exposure on a worldwide basis, I update my virus definition files on a daily basis utilizing the Symantec Intelligent Updater.  This gives me confidence that I can trap and delete messages transporting email-borne viruses before they contaminate my email database.  If your email address has less exposure than mine, it may be satisfactory for you to update your virus definition files less frequently.

(I am currently being pummeled by the different variations of the W32.Beagle.J@mm worm.  I have trapped and deleted four or five email messages containing this worm every day for the past couple of weeks.)

Message filters and junk mail controls

Because I use the Netscape email client, I am able to apply the Netscape Message Filters and the Netscape Junk Mail Controls to the output from the BigDog02 programs.

The Netscape filters allow me to direct the output from the BigDog02 program into specific email folders.

The Netscape Junk Mail Controls are based on a Bayesian analysis of incoming content, and as such can be trained to meet individual spam-fighting needs.  The use of the Junk Mail Controls applies one additional level of defense against spam to the output from the BigDog02 programs.

Not complicated to use

Because there are several programs involved, it might initially appear that the programs are complicated to use.  However, this is not the case.  My operating procedure is generally as follows.

Checking my email

Several times each day, I check my email by doing the following:
  • Run the program named BigDog02g to download and save each email message as a separate file on the disk.
  • Run the antivirus program against the files downloaded by BigDog02g, deleting any files that contain a virus.
  • Run the program named BigDog02j to:
    • Forward the remaining messages to my email client program.
    • Send a challenge message in response to any message received from a stranger.
    • Delete the messages from my public email server.
  • Run my email client program to read the messages now residing in my local email data structure.
That's all there is to it.

Training the spam screening algorithm

Occasionally, I do the following to enhance the ability of my spam screening algorithm to identify spam:
  • Manually copy a batch of message files from the Archives folder into a folder named temp.
  • Run the program named BigDog02k to delete those files that won't make a positive contribution to the process of training the algorithm.
  • Run the program named BigDog02m to increase the vocabulary of offensive words and phrases used by the spam screening algorithm to identify spam.

I will have more to say later in this lesson about the role that the spam screening algorithm plays in using these programs to block spam.

Why did I name these programs BigDog?

I gave the programs the name BigDog because they are not wimpy little programs hiding under the desk applying filters to incoming email messages in hopes of identifying those that are spam.  Rather, these programs are more like a big ugly dog guarding my email inbox, issuing a challenge to any stranger that tries to enter and refusing entry to those that contain spam or viruses.

Messages from strangers are quarantined

All messages from strangers are put in temporary quarantine.  In order for the stranger to cause an email message to move from quarantine status into my email inbox, the stranger must be willing to respond to a challenge message and to confirm that he or she actually sent the original message.

If the stranger responds to the challenge in the specified way, the message is automatically moved into my inbox.  In addition, that person ceases to be a stranger and future messages from that person pass through to my inbox unimpeded, provided that they don't contain a virus.
(If I later decide that I don't want to receive messages from that person, it is an easy matter to block all future messages from that email address.)
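
The quarantine-and-release behavior described above can be sketched in a few lines of Java.  This is only an illustration under stated assumptions, not the BigDog02 code: the class name QuarantineSketch, its method names, and the in-memory collections are all hypothetical, and the sketch assumes that sender addresses are compared without regard to case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a challenge/response quarantine; not the BigDog02 code. */
class QuarantineSketch {
    private final Set<String> knownSenders = new HashSet<>();
    private final Set<String> blockedSenders = new HashSet<>();
    private final Map<String, List<String>> quarantine = new HashMap<>();

    /** Returns "BLOCK", "DELIVER", or "CHALLENGE" for an incoming message. */
    String receive(String sender, String message) {
        String key = sender.toLowerCase(Locale.ROOT); // assumption: case-insensitive addresses
        if (blockedSenders.contains(key)) return "BLOCK";
        if (knownSenders.contains(key)) return "DELIVER";
        // Stranger: hold the message; the caller would now send a challenge.
        quarantine.computeIfAbsent(key, k -> new ArrayList<>()).add(message);
        return "CHALLENGE";
    }

    /** The stranger answered the challenge: remember the sender and release held messages. */
    List<String> confirm(String sender) {
        String key = sender.toLowerCase(Locale.ROOT);
        knownSenders.add(key);
        List<String> released = quarantine.remove(key);
        return released == null ? new ArrayList<>() : released;
    }

    /** Blocks all future mail from this address. */
    void block(String sender) {
        String key = sender.toLowerCase(Locale.ROOT);
        knownSenders.remove(key);
        blockedSenders.add(key);
    }
}
```

In this sketch, a first message from a stranger yields "CHALLENGE", a call to confirm releases the held message, and later messages from the same address yield "DELIVER" until block is called.
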
Several different programs are involved

As mentioned above, two programs are involved in the basic day-to-day operation of the system, and two additional programs are used to train the spam screening algorithm.

(Actually there are three programs involved in the basic operation, but any one user will only use two of them.  All users will use the program named BigDog02g.  A user whose email client program supports MBOX files will use the program named BigDog02j.  Other users will use the program named BigDog02i.)

The common program named BigDog02g downloads every message currently on the user's public email server and stores each of those messages locally as a separate file.  This email server must be a POP3 server.

(The user never downloads mail directly from the public email server, but rather downloads messages using the program that I will provide.)

Scan for viruses

At this point, the user runs his or her favorite virus scanner program on the message files, eliminating any messages that contain viruses before they become commingled with clean messages in the user's email database.

(I described this operating mode and the advantages thereof in the earlier lesson entitled Enlisting Java in the War Against Email Viruses.)

The second program

After scanning the message files for viruses, the user runs the second program.

(If the user's email client program is an MBOX-based program, the user runs the program named BigDog02j.  Otherwise, the user runs the program named BigDog02i.  Both programs achieve the same objective, but do so in somewhat different ways.)

Categorizing messages

The second program applies various criteria, including spam screening, and categorizes each of the virus-free messages into one of four categories:
  • {GD} Good
  • {BD} Bad
  • {SP} Spam
  • {QU} Quarantine
The text shown in matching curly braces in the above list is prefixed (tagged) onto the subject line of each message.  The message is then forwarded to the user's email client program.  This tag can be used in conjunction with email filtering in the email client program to direct the messages into different email folders.

(I also described the process of forwarding the virus-free messages to the user's email client program, but without the spam blocking features, in the lessons entitled Enlisting Java in the War Against Email Viruses and Enlisting Java in the War Against Email Viruses, Part 2, A Much Faster Program.)

Forwarding the messages for the MBOX case

For the case where the user's email client program is an MBOX based program, the BigDog02j program constructs an MBOX file containing the virus-free messages and deposits that file in the directory tree owned by the email client program.  The next time the user starts the email client program, all of the new messages will appear in a new folder in the client program display.  Normal email filtering capabilities provided by the client program can then be used to move those messages into specific folders based on the tags that were prefixed onto the subject lines.
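
To give a rough idea of what constructing an MBOX file involves, here is a minimal sketch of appending one raw message to an MBOX-formatted buffer.  It follows the classic MBOX convention in which each message starts with a separator line beginning with "From " and any body line that begins with "From " is escaped with a leading ">".  The class name MboxSketch and its parameters are hypothetical, and the exact separator line that a particular email client expects is an assumption that would have to be checked against that client.

```java
/** Hypothetical sketch of writing messages in classic mbox form; not the BigDog02j code. */
class MboxSketch {
    /**
     * Appends one raw message to an mbox-formatted buffer.
     * asctimeDate is assumed to be pre-formatted, e.g. "Tue May 11 10:00:00 2004".
     */
    static void appendMessage(StringBuilder mbox, String envelopeSender,
                              String asctimeDate, String rawMessage) {
        // Separator line that marks the start of a message.
        mbox.append("From ").append(envelopeSender).append(' ')
            .append(asctimeDate).append('\n');
        for (String line : rawMessage.split("\r?\n")) {
            // Escape lines that would otherwise look like a separator line.
            if (line.startsWith("From ")) {
                mbox.append('>');
            }
            mbox.append(line).append('\n');
        }
        // A blank line terminates the message.
        mbox.append('\n');
    }
}
```

The resulting text could then be written to a file inside the email client's folder tree, which is the step that the paragraph above describes.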

The non-MBOX case

For the case where the user's email client program is not an MBOX based program, the BigDog02i program constructs a new email message for each of the virus-free messages and sends those email messages to the user's secret email account.

(The secret email account is an account that has never been made known to spammers, scammers, and virus writers.  This email account is not required to be a POP3 account.)

The user checks for email on the secret account, and can apply normal email filtering provided by the email client program to direct the messages into different folders.

Other operations

In both cases, certain other operations are also performed.  Those operations will be described later in this lesson.

Local archives

A raw text version of every virus-free message is maintained as an individual local file for reasons that I will describe later.

(The user should manually delete the older message files in the archives periodically to free up disk space.)

Lists

Several lists are maintained and used in the screening and tagging process.  All of the lists are plain text files, which can be created and edited using a simple text editor.

Also, as you will see later, a special program named BigDog02m is provided that uses actual email messages to update the list used for spam screening on subject lines and HTML text.  This program makes it easy to identify offensive words and phrases and add them to the list.

The GOOD list

The GOOD list is a plain text file that contains key phrases and words that identify good messages.  The occurrence of one of these phrases or words in either the subject or the sender's email address in a message will cause the message to be tagged {GD} and forwarded to my email account with no further processing.
(On my system, the GOOD list is stored in a file named BigDog02GoodList.txt, but as a Java programmer, you can modify the program to use a different file name if you prefer.)
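
The kind of lookup described above can be sketched as follows.  This is a minimal illustration, not the BigDog02 code: the class name PhraseListSketch is hypothetical, the list is passed in as already-loaded lines (for example, from Files.readAllLines), and the case-insensitive comparison is an assumption.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of matching a plain-text phrase list; not the BigDog02 code. */
class PhraseListSketch {
    /**
     * Returns true if any non-blank list entry occurs in the subject
     * or in the sender's address (case-insensitive, by assumption).
     */
    static boolean matches(List<String> phrases, String subject, String sender) {
        String subjectText = subject.toLowerCase(Locale.ROOT);
        String senderText = sender.toLowerCase(Locale.ROOT);
        for (String phrase : phrases) {
            String needle = phrase.trim().toLowerCase(Locale.ROOT);
            if (needle.isEmpty()) {
                continue; // ignore blank lines in the list file
            }
            if (subjectText.contains(needle) || senderText.contains(needle)) {
                return true;
            }
        }
        return false;
    }
}
```

A message for which matches returns true against the GOOD list would be tagged {GD} and receive no further processing.
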
Messages that I want to read

Obviously, the GOOD list is used to identify email messages that I want to read.

For example, each semester I provide a set of specific keywords to my students for them to include in the subject of each message that they send to me.  I place the new keywords in my GOOD list.  The occurrence of one of the keywords in a message from a student causes that message to be delivered to my email inbox without the student having to satisfy the challenge/response requirement for strangers.

The GOOD list is automatically updated by the program whenever the sender responds to a challenge message.  (I will explain this in more detail later.)  Because the list is automatically updated, and is subject to data loss in the event of a computer crash, several levels of backup are automatically maintained.

The BAD list

The BAD list is also a plain text file containing key words and phrases that identify bad messages.  When one of these words or phrases occurs in either the subject or the sender's email address, this causes the message to be tagged {BD} and forwarded to my email account with no further processing.  Simple message filtering within my email client program causes these messages to be stored temporarily in a Bad folder, just in case I may need to refer to one of them later.

(I periodically delete the files in this folder to save disk space.)

Messages that I don't want to read

This list contains the email addresses and subjects commonly used in messages that I don't want to read.  These are messages that are easy to identify and reject without the requirement for a fancy spam screening algorithm.

For example, I frequently receive messages from MAC-MALL.COM and PC-MALL.COM.  These messages are easy to identify and reject.  As far as I know, these are legitimate companies that use email to advertise legitimate products (as opposed to spammers and scammers that hide behind false email addresses and try to camouflage their messages in an air of respectability).

The small difference between spammers and scammers

My distinction between spammers and scammers is a very small one.  A spammer is someone who uses underhanded and sometimes unscrupulous email techniques in an attempt to sell me real, but possibly illegal, products (such as prescription drugs without a prescription).

A scammer is someone who uses underhanded and unscrupulous email techniques simply to try to cheat me out of my money.

For example, I suspect that some of the messages that I receive trying to sell me prescription drugs from offshore pharmacies are from spammers.  I suspect that the many messages that I receive from Nigeria wanting to transfer millions of dollars into my bank account are from scammers, not spammers.

Since the distinction between them is so small, I usually refer to both under the general description of spammer.

I still don't want to read them

Even though MAC-MALL.COM is probably a legitimate company, because I don't own a MAC, I am never interested in information sent from MAC-MALL.COM.

Although I am occasionally interested in products that are available from PC-MALL.COM, I know where to find them on the web whenever I need them, so I'm not interested in reading their frequent email advertisements either.

Both of these (along with some email addresses from Nigeria) are included in my BAD list.

Auto responder messages

My email address is well known across the web and throughout the world.  During each flurry of virus activity on the web, I receive thousands of messages automatically sent by computers claiming that I sent them a message containing a virus.

(These are cases where someone else sent a message containing a virus and faked my email address as the sender.)

Since I am religious about using my antivirus software to keep my computer clean and free of viruses, I'm confident that I didn't send a message containing a virus.  Therefore, I'm not interested in reading these notification messages.

I have identified key phrases contained in most such messages and have included these phrases in my BAD list.  This causes all such messages to be tagged as {BD} and keeps these messages from cluttering my email inbox.

Basically, the messages in the {BD} category are messages that I never look at unless I believe there is a possibility that one of the valid messages that I sent to someone couldn't be delivered and was returned undelivered.

The SPAM category

The BigDog02i and BigDog02j programs each apply a common spam screener to every message that is not categorized as either GOOD or BAD.

(This screener is an object instantiated from a class named BigDog02SpamScreen01, which I am also going to provide in a future lesson.)

This screener is similar to, but significantly improved over, the screener described in the earlier lesson entitled Enlisting Java in the War Against SPAM, Part 2, The Screening Module.  For example, this screener contains improvements in several areas discussed below.

HTML in general

The use of HTML provides numerous opportunities for the spammer to hide offensive content in an attempt to defeat spam blocking programs.  This includes the following:
  • Creating the message in HTML with a large number of HTML tags that tend to camouflage the offensive words and phrases.
  • Inserting bogus HTML tags in the middle of offensive words to keep the offensive words from being recognized.
  • Writing offensive text as a series of HTML entities.
  • Including large numbers of random words, phrases, and sentences in an attempt to defeat statistical spam blocking programs.
The new screening module is reasonably effective in dealing with the issues in the above list.  Future lessons will explain the Java code that I have written to deal with these issues.
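
As a rough illustration of how the first three items in the list above might be countered, the following sketch removes HTML tags outright (so that a bogus tag in the middle of a word no longer splits it) and decodes decimal numeric entities before any word matching is done.  It is not the code from the screening module: the class name HtmlCleanupSketch is hypothetical, and a simple regular expression is assumed to be good enough, which would not hold for every malformed message.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of reducing HTML to plain text before screening. */
class HtmlCleanupSketch {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern NUMERIC_ENTITY = Pattern.compile("&#(\\d+);");

    static String toPlainText(String html) {
        // Remove tags with no replacement so split words are rejoined.
        String noTags = TAG.matcher(html).replaceAll("");

        // Decode decimal numeric entities such as &#110; into characters.
        Matcher m = NUMERIC_ENTITY.matcher(noTags);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(noTags, last, m.start());
            int codePoint;
            try {
                codePoint = Integer.parseInt(m.group(1));
            } catch (NumberFormatException e) {
                codePoint = -1; // too large to be a real character
            }
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                out.append(m.group()); // leave invalid entities untouched
            }
            last = m.end();
        }
        out.append(noTags, last, noTags.length());

        // Collapse runs of whitespace left behind by the removed markup.
        return out.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Named entities, hexadecimal entities, and the random-word padding in the last list item are not handled by this sketch.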

Hiding techniques not dealt with

Some offensive content hiding techniques commonly used by spammers that the new screening module doesn't attempt to deal with are:
  • Hiding offensive text in the background color.
  • Displaying offensive text so small that it won't be noticed by the reader.
Although I probably could write the code to deal with these issues, I believe that using the challenge/response operating mode for final spam resolution is a more effective approach.

(The BigDog program uses spam screening mainly as a preprocessor to reduce the workload on that portion of the program that implements challenge/response.  Spam screening also provides a spam score that can be used visually to identify potentially good messages that may not be properly handled by the challenge/response system.  I will have more to say about this later.)

Encoding in base64

As I understand it, the base64 encoding scheme was originally devised to make it possible to reliably transmit eight-bit data through transmission systems designed to handle seven-bit data.  The encoding scheme has been in use for many years.

Among other things, the use of base64 encoding makes it possible to:
  • Transmit image data reliably across the Internet.
  • Transmit non-English characters reliably across the Internet.
Unfortunately, the use of base64 encoding also makes it possible for spammers to hide offensive text from spam blocking programs that are not equipped to deal effectively with the hiding technique.  The new spam screening module deals effectively with the following:
  • Encoded subject lines.
  • Encoded body text in single-part messages.
  • Encoded body text in multipart messages.
Future lessons will explain the Java code that I have written to deal with these issues.
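
In the meantime, here is a minimal sketch of the first item, decoding a base64-encoded subject line so that it can be screened as ordinary text.  It assumes the subject uses the standard MIME encoded-word form (=?charset?B?data?=) and relies on java.util.Base64, which was added to Java long after this lesson was written.  The class name Base64SubjectSketch is hypothetical and this is not the code from the screening module.

```java
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of decoding base64 encoded-words in a subject line. */
class Base64SubjectSketch {
    // Matches =?charset?B?base64data?= (the "B" form of a MIME encoded-word).
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?[Bb]\\?([^?]*)\\?=");

    static String decodeSubject(String subject) {
        Matcher m = ENCODED_WORD.matcher(subject);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(subject, last, m.start());
            try {
                byte[] bytes = Base64.getDecoder().decode(m.group(2));
                out.append(new String(bytes, Charset.forName(m.group(1))));
            } catch (IllegalArgumentException e) {
                // Bad base64 data or unknown charset: keep the original text.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(subject, last, subject.length());
        return out.toString();
    }
}
```

For example, decodeSubject("=?UTF-8?B?SGVsbG8=?= there") yields "Hello there".  Encoded body text would additionally require parsing the Content-Transfer-Encoding header and, for multipart messages, the part boundaries, which this sketch does not attempt.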

Non-English messages

I also receive many messages written in some language other than English.  I have no way of knowing whether these messages are valid messages or spam, because the only spoken language that I know is English.

Therefore, regardless of the content of the messages, I don't want them to clutter my email inbox because I can't read them.  Future lessons will explain the Java code that I have written to deal with this issue.
(I understand that some of you may want to remove this code so that you can read the messages, and that is perfectly fine with me.)
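
One simple way to flag such messages, shown here purely as an assumption and not as the method used in BigDog02, is to measure what fraction of the letters in the text fall outside the ASCII range.  The class name NonEnglishSketch and the threshold parameter are hypothetical.

```java
/** Hypothetical heuristic for spotting text that is probably not English. */
class NonEnglishSketch {
    /**
     * Returns true if the fraction of letters outside the ASCII range
     * exceeds the given threshold (for example, 0.3).
     */
    static boolean looksNonEnglish(String text, double threshold) {
        int letters = 0;
        int nonAsciiLetters = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetter(codePoint)) {
                letters++;
                if (codePoint > 127) {
                    nonAsciiLetters++;
                }
            }
            i += Character.charCount(codePoint);
        }
        return letters > 0 && (double) nonAsciiLetters / letters > threshold;
    }
}
```

A heuristic like this catches text written in non-Latin scripts, but it will not flag languages that are written mostly with unaccented Latin letters.
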
The bottom line on spam screening

As mentioned earlier, the BigDog programs use spam screening mainly as a preprocessor to reduce the workload on that portion of the program that implements the challenge/response system, and to provide a spam score that can be used to identify potentially good messages that are not handled well by the challenge/response system.

Most spammers use invalid email addresses, so there is no point in sending them a challenge message.  In most cases, it will simply be returned as undelivered mail.  Even among those that may use valid email addresses, most of them send thousands of spam messages at a time.  They aren't interested in paying attention to a challenge message and usually won't respond.

(Occasionally a spammer or a scammer does use a valid email address and will respond to a challenge message.  Notable among them are the people in Nigeria who send scam messages.  They often respond to the challenge, and when they do, I simply enter their email address in my BAD list.  That way, I never have to look at another email message from that same email address.)

The spam screening lists

The new spam screening module uses two lists of offensive phrases and words to perform the screen.  These lists are maintained in simple text files having the following names:
  • BigDog02SubjAndHtml.txt
  • BigDog02RawText.txt
The first list is used to screen the subject line and to screen the body of email messages containing HTML where the HTML has been converted to plain text.

The second list is used to screen raw body text that does not contain HTML.

The screening of subject lines and cleaned-up HTML runs very rapidly.  Therefore, I don't need to be particularly concerned about the size of the list used for that purpose.

The screening of raw body text takes somewhat longer.  Therefore, I need to keep the second list shown above trimmed down such that every item in the list has a high probability of making a hit.

Avoiding false positives

Depending on how aggressive I am in establishing and updating the lists, and how aggressive I am in setting a threshold value named hitLimit in the program, I can adjust the sensitivity of the screening algorithm.

For example, at one extreme, the spam screening algorithm will identify almost all spam messages, but will also falsely identify some good messages as spam (false positives).

At the other extreme, the algorithm will fail to properly identify some spam messages, but will experience very few, if any, false positives.

Challenge/response is the main spam blocking process

The challenge/response procedure is the main spam blocking technique used in the BigDog programs.  Spam screening is simply a preprocessor for the challenge/response process.  The use of spam screening as a preprocessor avoids the sending of challenge messages to spammers whose email addresses are probably invalid anyway.  Among other things, this conserves bandwidth.

Because the spam screening process is not the main technique used to block spam in the BigDog program, I keep the spam screening algorithm adjusted to a relatively low sensitivity.  I would rather have the spam screening algorithm miss a spam message, so that the message is simply challenged, than have it falsely identify a good message as spam.

By adjusting things this way, I can be very confident that all messages categorized as SPAM have been correctly categorized.  I rarely even look at them.  I simply direct them into a SPAM folder (in case I want to go back and look at them later) and manually delete them after a few days to conserve disk space.

A known deficiency in the challenge/response procedure

Unfortunately, this is not a perfect world.  I am aware of one major flaw in the use of the challenge/response procedure.  Occasionally I receive a message that is automatically sent by a computer that I do want to read.

(For example, my health plan requires me to order prescription drugs for delivery by mail rather than to purchase them from a local pharmacy.  Occasionally, I receive an email message from the drug company.  I have their email address in my GOOD list, but there is a possibility that they may change that email address in the future.  I want to read those email messages.)

Good messages from computers could be lost

If the address of the sending computer (or a keyword in the subject of the message) is not in the GOOD list, that message will be placed in quarantine (to be discussed later).  A challenge message will be sent back to the computer that sent the message.  However, it is very unlikely that the computer that sent the message will respond to the challenge.

(Even if the computer is running an auto responder program, it is unlikely that it will respond in the required manner to the challenge message.)

It is possible, therefore, that the message could simply remain in the QUARANTINE folder until I manually delete it later to free up disk space.

Spam screening to the rescue

One of the ways that I reduce the likelihood of false positives in the spam categorization process is by requiring that the spam screening module experience multiple hits against offensive words and phrases in the subject and the body of the message.  The number of hits must meet a threshold value identified by a variable named hitLimit before a message is categorized as SPAM.
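
The thresholding idea can be sketched as follows.  The name hitLimit comes from the text above; everything else (the class name HitCountSketch, counting at most one hit per list entry, and case-insensitive matching) is an assumption made for illustration and is not taken from the BigDog02 code.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of counting hits against a list of offensive phrases. */
class HitCountSketch {
    /** Counts how many list entries occur in the subject or the body. */
    static int countHits(List<String> offensivePhrases, String subject, String body) {
        String text = (subject + "\n" + body).toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String phrase : offensivePhrases) {
            String needle = phrase.trim().toLowerCase(Locale.ROOT);
            // Assumption: each list entry contributes at most one hit.
            if (!needle.isEmpty() && text.contains(needle)) {
                hits++;
            }
        }
        return hits;
    }

    /** A message is categorized as SPAM only when the hit count meets hitLimit. */
    static boolean isSpam(int hits, int hitLimit) {
        return hits >= hitLimit;
    }
}
```

Raising hitLimit lowers the sensitivity of the screen: fewer good messages are falsely categorized as SPAM, and more spam falls through to the challenge/response stage.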

The spam score

Regardless of whether the message is ultimately categorized as SPAM or as QUARANTINE, the number of hits (the spam score) is prefixed onto the subject of the message.  This makes it visible later when viewing the message in the email client program.  For example, here is the subject line for a typical message in the QUARANTINE folder.

{QU}{094}{0}Reduce Heart Risks

The {QU} tag indicates that this message was categorized as QUARANTINE and caused this message to be placed in the QUARANTINE folder.

The message number

The {094} tag indicates that this was message number 94 on the public email server before it was downloaded.

(The message number was useful for testing and debugging the program during the development stage.  I may decide to remove it now that I have most of the bugs worked out of the program.)

A spam score of zero

The {0} tag indicates that this message had a spam score of zero.

(The spam screening module didn't find any offensive words or phrases in this message.  In the particular case of the message identified above, however, an examination of the message showed that the message actually was spam.  The lack of a spam score greater than zero resulted from the fact that the spammer managed to cleverly hide the offensive material in lots of complex HTML.)
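
The tagged subject format shown in the example above can be produced and taken apart with a few lines of Java.  This is a sketch only: the class name TaggedSubjectSketch is hypothetical, and the three-digit zero padding of the message number is inferred from the single example {094} rather than from the BigDog02 code.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of building and parsing a tagged subject line. */
class TaggedSubjectSketch {
    // {category}{message number}{spam score}original subject
    private static final Pattern TAGS =
            Pattern.compile("^\\{([A-Z]{2})\\}\\{(\\d+)\\}\\{(\\d+)\\}(.*)$");

    /** Prefixes the three tags onto the subject. */
    static String build(String categoryCode, int messageNumber,
                        int spamScore, String subject) {
        return String.format(Locale.ROOT, "{%s}{%03d}{%d}%s",
                categoryCode, messageNumber, spamScore, subject);
    }

    /**
     * Returns {category, message number, spam score, subject} as strings,
     * or null if the subject does not start with the three tags.
     */
    static String[] parse(String tagged) {
        Matcher m = TAGS.matcher(tagged);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }
}
```

Calling build("QU", 94, 0, "Reduce Heart Risks") reproduces the example subject line, and parse recovers the category and spam score, which is the information that a mail filter or a manual check would use.
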
Checking the spam score

Before deleting messages from the QUARANTINE folder, I select all messages that have a spam score of 0 and do a visual check on the subjects and the senders.  Selecting a spam score of 0 usually results in a small number of messages that require checking.  Most of the messages in QUARANTINE for which the sender fails to respond to the challenge message have a spam score of 1 or greater.

A backup check

The spam score provides a simple mechanism by which I can quickly do a backup visual check to avoid losing good messages that were sent by a computer.  I can usually identify a computer-generated message that I want to read by looking at the sender and the subject.
(Most computer-generated spam messages claim to have been sent by a person, not by a computer.  For example, the spam message that I highlighted earlier claims to have been sent by someone named Noe Bragg.  This probably is not a real person, or if it is the name of a real person, the name probably doesn't match up with the email address that was given.)
A good computer-generated message

For example, here are typical From and Subject lines for a computer-generated message that I did want to read:

From: "American Airlines (AA)" <notify@aa.com>
Subject: Ticket Delivery Notification


When I identify such a message, I manually enter the sending computer's email address, and possibly some keywords from the subject into the GOOD list so that future messages from the same computer will be categorized as GOOD.

Could miss some messages

I usually ignore all messages in QUARANTINE with a spam score greater than zero.  I recognize that if I receive a good message from a computer and that message receives a spam score greater than zero, I will probably lose the message without reading it.  Unfortunately, there is no perfect system.  The likelihood of this happening is so low that I'm willing to accept the risk.

Training the screening algorithms

It is possible to train the screening algorithms using a text editor to add words and phrases to the text files.  However, this isn't very convenient.  The programs named BigDog02k and BigDog02m are provided to make it both convenient and easy to train the algorithm used to screen subject lines and HTML text for offensive words and phrases.
(No special program is provided to make it convenient and easy to train the algorithm that screens raw (non-HTML) body text.  Most spam messages use HTML.  Therefore, the benefits of screening raw body text are minimal.  Also, because this algorithm runs rather slowly, care must be taken to keep the list of offensive words and phrases short and to the point.  Therefore, I decided not to make it convenient and easy to add words and phrases to the list used to screen raw body text.  As a Java programmer, you could certainly write such a program.  If you decide to do this, you might want to take a look at the lesson entitled Enlisting Java in the War Against SPAM: Training the Body Screener as a starting point.)
The programs named BigDog02k and BigDog02m

Training the algorithm used to screen subject lines and HTML text essentially consists of increasing the vocabulary of offensive words and phrases used in the screening process.  I do this on a somewhat random basis.  Whenever I have a few minutes to spare (such as while waiting for one of my classes to begin), I copy thirty or forty files from the Archives folder into a folder named temp and then run these two programs in succession.

The program named BigDog02k

This program has a very simple purpose.  It examines each of the message files in the temp folder, and deletes all of the files that would be categorized by the programs named BigDog02i or BigDog02j as {GD}, {BD}, or {SP}.  In addition, the program deletes all message files in the {QU} category with a spam score greater than zero.
(There is no point in using these files for training the algorithm because the algorithm already knows how to deal with them.)
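The deletion rule just described can be expressed as a small predicate.  The following is a minimal sketch under stated assumptions, not the actual BigDog02k code.  The class name TrainingFilter and the method shouldDelete are hypothetical, and the sketch assumes that each message file has already been assigned a category and a spam score.

```java
// Hypothetical sketch of the rule used to thin out the temp folder
// before training.  Only {QU} messages with a spam score of zero
// survive, because those are the ones the screener missed.
final class TrainingFilter {

    // Returns true when the message file should be deleted from temp.
    static boolean shouldDelete(String category, int spamScore) {
        // Messages that the screener already handles correctly.
        boolean alreadyHandled = category.equals("GD")
            || category.equals("BD")
            || category.equals("SP");
        // Quarantined messages that already produced at least one hit.
        boolean scoredQuarantine = category.equals("QU") && spamScore > 0;
        return alreadyHandled || scoredQuarantine;
    }

    public static void main(String[] args) {
        System.out.println(shouldDelete("SP", 7));
        System.out.println(shouldDelete("QU", 2));
        System.out.println(shouldDelete("QU", 0));
    }
}
```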
The program named BigDog02m

This program is very similar in operation to the program described in the lesson entitled Enlisting Java in the War Against SPAM: Training the Subject Line Screener.  The main difference is that this version of the program is much more sophisticated than the one described in that earlier lesson.  For example, this version knows how to deal with spam hiding techniques such as base64 encoding and bogus HTML tags.

For the time being, I will simply refer you back to that earlier lesson.  However, I will discuss the operation of this version of the program in more detail in a subsequent lesson, providing suggestions as to how the program might be used to best advantage.
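To give a rough idea of what dealing with base64 encoding and bogus HTML tags might involve, here is a minimal sketch.  It is not the code used in BigDog02m.  The class name SpamUnhider and both method names are hypothetical, the character set is an assumption, and the sketch uses java.util.Base64, which was added to the standard library in Java 8, long after this lesson was written.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch of two steps that expose hidden words
// before the text is compared against the offensive word list.
final class SpamUnhider {

    // Decodes a base64-encoded message body.  The MIME decoder
    // ignores the line breaks found in encoded email bodies.
    // ISO-8859-1 is assumed here; a real program would read the
    // charset from the message headers.
    static String decodeBase64Body(String encoded) {
        byte[] bytes = Base64.getMimeDecoder().decode(encoded);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Removes everything that looks like an HTML tag, real or
    // bogus, so that a word split by tags is joined back together.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }
}
```

With this approach, text such as Lo<zq>w ra</b>tes collapses to Low rates once the tags are stripped, at which point an ordinary word-list search can find it.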

Why continue to train the algorithm

When you first start using this set of programs, you will need to create a text file containing offensive words and phrases used in the screening of subject lines and HTML body text.  The programs named BigDog02k and BigDog02m will provide the most convenient way for you to populate that list.

After a while, your list will contain a large percentage of the offensive words and phrases commonly used in spam.  For example, my list currently contains about 1800 offensive words and phrases.

Desensitized to spam

I currently have the sensitivity of the spam screening algorithm turned down in order to greatly reduce the risk of false positives.

I am currently using a hitLimit value of 6 in the program named BigDog02j.  As a result, the spam screener must find six or more offensive words or phrases in a message to categorize that message as {SP} rather than {QU}.
(Actually, the spam scoring algorithm is a little more complicated than this, but hopefully you get the idea of how to adjust the sensitivity of the screening algorithm.)
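The basic idea behind the hitLimit value can be sketched as follows.  This is a simplified illustration, not the scoring algorithm in BigDog02j, which is more complicated as noted above.  The class name HitLimitSketch and the methods countHits and categorize are hypothetical, and the sketch applies only to messages that have not already been categorized as {GD} or {BD}.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: count the offensive phrases found in a
// message and compare the count against hitLimit.
final class HitLimitSketch {

    // Counts how many phrases from the list appear in the text,
    // ignoring case.  Each phrase counts at most once.
    static int countHits(String text, List<String> offensivePhrases) {
        String lower = text.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String phrase : offensivePhrases) {
            if (lower.contains(phrase.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return hits;
    }

    // A message needs hitLimit or more hits to be tagged {SP}.
    // Anything below the limit stays in {QU}.
    static String categorize(int hits, int hitLimit) {
        return hits >= hitLimit ? "SP" : "QU";
    }
}
```

Raising hitLimit makes the screener less sensitive, so fewer messages are tagged {SP} and more are left in {QU} to be resolved through the challenge/response procedure.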
The bad news and the good news

The bad news is that this causes the program to send a lot of challenge messages to spammers, which result in undeliverable email messages in return.

The good news is that this eliminates the need for me to be concerned about false positives.  Once a message is categorized as {SP}, it has a very high probability of actually being spam.

Diminishing returns

Once the list is populated to the level of my list, continued training of the algorithm becomes less effective in causing messages to be categorized as {SP} rather than {QU}.  Thus, it might appear that continued training of the algorithm is a waste of time.

A very important aspect of continued training

However, even when the list is populated at this level, continued training does have one very significant operational impact.  Recall my earlier discussion about needing to examine messages in the QUARANTINE category with a spam score of zero to identify messages sent by a computer that I might want to read.  It is very desirable to keep the number of such messages as small as practical.

Continued training of the algorithm has a very significant impact in causing messages that might otherwise end up in QUARANTINE with a spam score of zero to have a spam score of one or more instead.
(Only one hit on an offensive word or phrase is sufficient to cause the spam score to advance from zero to one.)
This significantly decreases the number of messages that I need to examine looking for good messages sent by a computer.

The QUARANTINE category

The {QU} category is the most interesting of all the categories.  All messages that are not categorized as {GD}, {BD}, or {SP} are put into the {QU} category along with the spam score for the message.

Statistically, once the GOOD list is reasonably well populated, most of the QUARANTINE messages will be spam, scam, or messages that transport viruses.  However, occasionally a good message from someone whose email address hasn't yet been added to the GOOD list will end up in QUARANTINE.

Forwarded for viewing in email client program

As with the other three categories, all messages that fall into the QUARANTINE category are tagged {QU} and forwarded for viewing in my email client program.
(As mentioned earlier, each message in QUARANTINE is also tagged with the spam score for that message.)
Other than to periodically look at messages in the QUARANTINE category with a spam score of zero to identify good computer-generated messages, I rarely look at messages in the QUARANTINE category.  If there is a good message in the QUARANTINE category that was sent by a human, it should be automatically identified later and moved into the GOOD category through the challenge/response procedure.

Good human generated messages in QUARANTINE


A very small percentage of the messages in the QUARANTINE category are messages that I really do want to read, even though that might not be apparent from an examination of the subject or the sender.
(I receive many email messages regarding Java with subjects that simply read Help, Hello, Hi, etc.  These are also common subjects for spam, scam, and messages that transport viruses.  It is this small percentage of good messages that causes email screening to be such a complex process.)
The challenge/response procedure

Whenever a message is categorized as QUARANTINE, a challenge message containing words similar to those shown in Figure 1 is automatically sent to the sender of the message.  The idea is that anyone who really wants to communicate with me via email should be willing to respond to the message, on a one-time basis, to have their email address added to my GOOD list, and to have their original message moved from my QUARANTINE folder into my GOOD folder.

I recently received a message from your
Email address with the following subject
and date:

SUBJECT: (Subject of the original message)
DATE: (Date of the original message)

Because your Email address has not been
entered into the Approved Sender list of my
SPAM blocking software, the message has been
placed in the Quarantine folder. To move
the message from the Quarantine folder into
my Inbox, you will need to press your Reply
button and send this message back to me
making no changes to the Subject line or the
body of the message. This will also cause
your Email address to be added to my
Approved Sender list so that future messages
from you won't be similarly delayed.

I apologize for this inconvenience.
However, due to the large amount of SPAM
that I must contend with, I have been
forced to implement a mail handling system
that asks you for a one-time confirmation
that you intend to communicate with me via
Email.

If you didn't send the original message, I
apologize for the intrusion. However, it is
possible that someone is using your Email
address for misleading, possibly fraudulent,
and possibly malicious purposes. I strongly
encourage you to file a complaint regarding
the inappropriate use of your Email address.
I have provided all of the information below
that you will need to file such a
complaint.

The information provided below my signature
block is the full header of the original
Email message. You will find a short
tutorial at
http://www.dickbaldwin.com/java/Java2158.htm
that explains how to use this header to file
a complaint.

If we all band together in opposing SPAM and
Email viruses, perhaps we can have a
positive impact on this increasingly serious
problem.

Regards,
Richard G. Baldwin

=======HEADER BEGINS HERE========

(Header details deleted for brevity.)

Figure 1

Retrieving the original message

The subject of the challenge message contains a unique ID that identifies the original message.  When a response to the challenge is received, the program goes to the local Archives folder, retrieves the original message, tags it as {GD}, and forwards it to be viewed in my email client program.  It will be there for me to read the next time I start my email client program.

In addition, the sender's email address is added to the GOOD list so that future messages from that sender will be forwarded to my email client program without delay.
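The handling of a challenge response can be sketched as follows.  This is an in-memory illustration under stated assumptions, not the actual BigDog code.  The class name ChallengeResponseSketch, its methods, and the use of a map in place of the Archives folder are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the challenge/response bookkeeping.
final class ChallengeResponseSketch {
    // Stands in for the local Archives folder: unique ID -> subject.
    private final Map<String, String> archive = new HashMap<>();
    // Stands in for the GOOD list of approved sender addresses.
    private final Set<String> goodList = new HashSet<>();

    // Records a quarantined message under the unique ID that is
    // placed in the subject of the challenge message.
    void quarantine(String uniqueId, String subject) {
        archive.put(uniqueId, subject);
    }

    // Called when a response to a challenge arrives.  Retrieves the
    // original message, tags it {GD}, and approves the sender.
    // Returns an empty Optional when the ID is unknown.
    Optional<String> handleResponse(String uniqueId, String senderAddress) {
        String original = archive.get(uniqueId);
        if (original == null) {
            return Optional.empty();
        }
        goodList.add(senderAddress.toLowerCase(Locale.ROOT));
        return Optional.of("{GD}" + original);
    }

    // True when future messages from this address need no challenge.
    boolean isApproved(String senderAddress) {
        return goodList.contains(senderAddress.toLowerCase(Locale.ROOT));
    }
}
```

In this sketch, a response carrying an unknown ID is ignored and does not add the sender to the GOOD list.  Whether the real programs behave that way is an assumption.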

An additional twist

There is one additional twist that appears near the end of Figure 1.  The message also contains instructions on how to file a complaint if the email address has been hijacked and used for the distribution of spam, scam, or viruses.
(Unfortunately, hijacking of email addresses is a fairly common practice in the distribution of viruses that get their email addresses from the local address book on contaminated computers.)

What's Next?

I will begin presenting and explaining code in the next lesson.  That lesson and several lessons following that one will explain each of the programs.  The lessons will also explain the code behind the screening improvements that have been made since the publication of the lesson entitled Enlisting Java in the War Against SPAM, Part 2, The Screening Module.


Copyright 2004, Richard G. Baldwin.  Reproduction in whole or in part in any form or medium without express written permission from Richard Baldwin is prohibited.

About the author

Richard Baldwin is a college professor (at Austin Community College in Austin, TX) and private consultant whose primary focus is a combination of Java, C#, and XML. In addition to the many platform and/or language independent benefits of Java and C# applications, he believes that a combination of Java, C#, and XML will become the primary driving force in the delivery of structured information on the Web.

Richard has participated in numerous consulting projects, and he frequently provides onsite training at the high-tech companies located in and around Austin, Texas.  He is the author of Baldwin's Programming Tutorials, which has gained a worldwide following among experienced and aspiring programmers. He has also published articles in JavaPro magazine.

Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.

Baldwin@DickBaldwin.com

-end-
 






