Introduction to Input Validation with Perl

A very important, well-known, yet too often lightly dismissed problem in
software security is that of trust management. There are many parties involved
in the building and deployment of a software product (even if there’s only
one developer), and the entities that interact with the resulting system
are more numerous still, often with diverse interests. Among these entities there
necessarily exists a complicated network of explicit and implicit trust
relationships, which in all but the simplest situations is too difficult
(often impossible) to analyze. The problem, of course, lies in the fact that
links in this trust relationship network are vulnerable to abuse if their
role and importance have been underestimated in the development process.
Thus, the challenge that developers face is that of balancing the amount of
effort invested in analyzing all possible interactions with their software
against the desired level of reliability and security of their product. If these
two objectives could be quantified, their values would be inversely proportional
to each other.

A thorough examination of trust management issues in software security
could easily constitute a multivolume work by itself, and there is a lot of
related research underway. For a good general introduction to the subject,
consult [1] and Chapter 13 of [2]. This article will focus on one particular
aspect of the problem: that of proper input validation. The article has
two objectives. The first is to introduce the reader to the problem
and to discuss some relevant secure programming techniques. The second
is to confront the question “How can software be designed and implemented to
withstand malicious input attacks?” The high-level scripting language Perl
and the GNU/Linux platform will be used to illustrate key implementation ideas,
but most of the discussion will be applicable to any other development environment.

Before we attempt to answer the above-posed question, two other preliminary
questions must be considered: “What constitutes input to a program?”
and “What constitutes malicious input?” We need a good understanding of
these terms, for much depends upon the answers. Unfortunately, both concepts
are fuzzier than we would like, and their meaning in turn depends on the
purpose and the details of the software itself. It is clear, for instance,
that if a program interacts with a human user then every bit of data provided
by the user is input to the program. Furthermore, all other resources, such as
filesystems, databases, and network interfaces, from which the program obtains
information can also be viewed as inputs. More subtly, a program often depends
on certain parameters of the environment in which it executes, such as various
path lists, input field separator values, etc. Those too are input.

The answer to the second question, namely “What constitutes malicious
input?”, is even more dependent on the particular situation. Input that is
fully legitimate for one application can be devastating for another. So
much so, in fact, that we will avoid giving an empirical formulation of the
concept and will instead adopt an intuitive working definition:
malicious input is all input that causes software to behave in a manner
inconsistent with the software’s specified behavior. Thus if a user
provides data which makes the program do something it is not supposed to do,
we will call this data “malicious input.”

“And as long as I am wishing, I would also like a pony,” said Alice.
So let us continue asking questions while we are at it. How does malicious
input actually break software? The short answer is: by exploiting trust
relationships. Software design is akin to mathematical modeling in some
aspects. In both cases we seek to develop an approximation of some ideal
system, which is close enough to the real world as to be useful, yet simplified
enough as to be manageable. The only way we know of to do this is by
making assumptions. Programmers make assumptions all the time, often
intentional but just as often unconscious. Many of these assumptions
involve extending trust to other parties, such as the user, the software
distribution medium, the execution environment, the development environment,
and many, many others. Malicious input violates the assumptions and throws
the software into a state that has not been anticipated by its creators.
Sometimes such input is generated accidentally, but more frequently trust
relationships are abused deliberately, thus the adjective “malicious.”
Let us look at an example.

Suppose we want to build a simple CGI application that provides a calendar
service to users, featuring calendar personalization, event notification
by e-mail, and the possibility of viewing other users’ personal calendars.
To ensure some degree of privacy, the application may offer the option of
marking personal entries into the calendar as private or public, so that an
entry will be viewable to other users only if it is marked as public.
A user would first register an account with the service, then log in,
and be presented with options to view or edit their personal calendar, look
up the public entries in someone else’s calendar, request e-mail notification
for events, or log out. An entry (or event) in the calendar would consist of
a date, a description of the event, a flag of whether the entry is private,
and a list of users who requested e-mail notification when the event occurs.
The application may allow event descriptions to be entered as hypertext so they
can link to other events or external resources.

Before we plunge into more details, let us consider the security implications
of our design. First we need to identify the entities involved. We, as
the designers and developers of the system, are one such entity (although
it may be possible to break this further down.) The vendor that will deliver
our product to customers may be another. The customer, who will install our
application on their Web server to offer service to end users is also an entity.
Then, of course, there are the end users who will benefit from the service, and
others who just happen to be browsing the customer’s Web site. There are
malicious entities, who keep a sharp eye on the server and will not leave anything
as attractive as our software unnoticed. If we think more about it, we are
likely to be able to extend this list to a fair size. If we were doing something
more complicated, the list could quickly grow out of manageable proportions, while
it will still most likely be incomplete. But once we have identified the players,
we need to concern ourselves with establishing the relations between them, and
whether any of these relations present a security risk. As it turns out, almost
all of them do. The customer who installed our software is clearly taking the
risk of someone compromising the application and consequently the customer’s
server, if not their entire network. This may be accomplished by crackers
who exploit an input vulnerability of the Calendar, or perhaps by someone who
managed to insert a “trojan horse” in our software before it reached the
customer. The end users also take numerous risks by merely using the service.
If someone compromises the server, the users’ secrets will be revealed. If
our protection of private entries in the calendar is not well implemented,
users may be able to view each other’s secrets. If we were not careful when we
built the Calendar’s ability to contain hypertext, a user may exploit a
vulnerability in another user’s browser by placing malicious hypertext code
in a public entry. Some attacks and vulnerabilities are obvious, but most
are hidden, and the hidden ones are usually the cruelest,
associated with the greatest risk. Computer security is about handling
subtlety. As developers, we must try our best to protect each entity that
interacts with our software from every other entity.

With these security objectives in mind, we can now focus on the components of
our application. Just as there are many entities interacting with our software,
so are there many pieces that comprise it and interact with each other internally.
Our CGI application may, for instance, have an authentication mechanism that
prompts the user for their log-in information, enciphers the password and compares
it to the one stored in that user’s record in a file on the server’s filesystem.
It may store users’ calendars in a SQL database and request and update database
records in accord with the users’ actions. It may also make use of an external
mail agent app to send event notifications to users. For any but the
simplest applications, we would also use various modules and libraries developed
earlier, possibly by someone else. The pattern here is that many components
perform different functions and pass information to each other, all at the
request of the end user. The input that the user provides to our application
is used to determine the behavior of these components. Unexpected input is
likely to cause unexpected behavior, and because of the complex interconnections
that we have seen, unexpected behavior spreads quickly from one component to
the next. As a special case, carefully crafted input can cause carefully predicted
“unexpected” behavior, designed to exploit a weakness of the system.
To prevent this, all input (whether from the user or from other sources, as we
discussed earlier) must be cautiously examined and filtered before it can be
used safely.

To illustrate this idea, suppose we want our Calendar to offer users the
ability to save different versions of their own or other users’ calendars for later
reference. While viewing a calendar, the user would click on a Save button and
be asked to provide a name for the calendar they wish to store. From within
the CGI script we might handle this in a manner similar to this:

  $username = param('username');  # Get the user name from a CGI parameter
  $filename = param('filename');  # Get the desired file name 
  $filename = 'users/'.$username.'/'.$filename;  # Specify the directory
  open (CLF, ">$filename");       # Open the file for writing
  print CLF $currentCalendar;     # Print the current calendar into the file
  close (CLF);                    # Close the file

To a security-conscious programmer this would scream trouble, but sometimes
even experienced developers fail to recognize how user input can impact the
security of their system. The problem here is that we are trusting the user
to provide a nice simple name for their calendar. While almost anything that
a normal user can think of to name their calendar would be okay for us, a
knowledgeable attacker can easily come up with file names that can break our
software and the server on which it is running. One such choice, known as a
“backward directory traversal attack,” is to precede the filename with a
series of “../” (dot-dot-slash) symbols, which instruct the operating system to
look for the file in an upper-level directory. If the Web server has sufficient
privileges, the attacker can use this technique to overwrite any file on
the host. To prevent this, we must filter out any such symbols before we use
the input. An easy and fairly efficient way to do this is by using regular
expressions:

  $filename = param('filename');  # ...as before
  $filename =~ s/\.\.\///g;      # <-- Filter out ../ symbols!
  $filename = 'users/'.$username.'/'.$filename;  # as before...

This is a little better, but we must also make sure we do this for the
$username variable, since it too becomes part of the filename, and there’s
nothing yet to stop the user from putting “../” in their user name.
Unfortunately, the dot-dot-slash trick is not the only thing that can form
malicious input. Suppose later, when the user wants to load a saved calendar,
our script does this:

  $filename = param('filename');  # Get file name from a CGI parameter
  $filename =~ s/\.\.\///g;      # Filter out ../ symbols
  chdir ('users/'.$username);     # Switch to user's directory
  open (CLF, "$filename");        # Open the file for reading
  $currentCalendar = <CLF>;       # Read in the calendar
  print $currentCalendar;         # Output it to the user
  close (CLF);                    # Close the file
  chdir ('../..');                # Go back to the script's directory

Many things can go wrong here. If the filename is a reference to an already
open file handle, such as “&STDIN”, the script will read from that file
handle instead (which could happen to be associated with some other user’s
calendar!). If the filename begins or ends with a pipe symbol (“|”) attached to
a shell command, the script will execute that command, and with the pipe at the
end it will read from the command’s output.
This way the attacker can read any file on the system, and worse yet, execute any
command that the server has privileges to run! Those are all subtle consequences
of the way Perl implements the “open” function. In other languages and platforms
some of those problems may not be present, but other similar problems are likely
to exist on any system. Crackers make a hobby of learning such obscure things.
Security-conscious developers must make it their duty to do the same.
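
One way to sidestep the special characters described above, assuming Perl 5.6 or later, is the three-argument form of open(), in which the mode is passed separately from the file name. The sketch below is an illustration rather than the article's own code; the helper name read_calendar and its arguments are hypothetical. It does nothing about "../" sequences, so the name still has to be validated.

```perl
use strict;
use warnings;

# Hypothetical helper: read a saved calendar whose name came from the user.
# Returns the file contents, or undef if the file cannot be opened.
sub read_calendar {
    my ($dir, $name) = @_;
    my $path = "$dir/$name";

    # Three-argument open: the mode ('<') is a separate argument, so $path is
    # always taken as a literal file name. A leading or trailing "|" or "&"
    # has no special meaning here.
    open(my $fh, '<', $path) or return;

    local $/;                  # slurp mode: read the whole file at once
    my $content = <$fh>;
    close($fh);
    return $content;
}
```

With this form, a name such as "ls |" is simply looked up as a (most likely nonexistent) file, and the open fails instead of running a command.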

How can we avoid the problems mentioned? We could add more regular expressions
to filter out other bad characters, such as “&” and “|”. But it becomes very
difficult to be certain that we have not missed any characters that have special
meaning to Perl. The function system() exists to convince us of this
fact. It is a function which, like open(), demonstrates a
vulnerability in the interface between the script and the underlying operating
system. Say we want to use the program sendmail to notify a user of an event
in their calendar. We might do something like:

  $username = param('username');
  $message = "message_for_$username";
  open (MF, ">$message");
  print MF "To: $username\n";
  print MF "From: The Calendar\n";
  print MF "Subject: Your event is about to happen!\n\n";
  print MF $event;
  print MF ".\n";
  close (MF);
  system ("sendmail -t < $message");

The troubling variable here is $username, because it came from the user, is not
validated, and makes it into a call to system(). The problems are very similar to the
ones we had with open(), except here we have a lot more of them. If
the user name contains a semicolon, followed by a command, the command will get
executed right along with sendmail. We could filter out semicolons, but the
shell could use other symbols for command separation (and indeed almost all
shells use the newline character for this purpose as well.) Pipes can also
cause problems here, and so can the “>” and “<” symbols used for output and
input redirection. The backtick operator in Perl is equivalent to the
system() function in this respect. The bottom line is, there are
too many characters and combinations of characters that have a special meaning
to too many components. Worse yet, we cannot even know what all of those
characters are, because Perl, the shell, and other applications we might have
to deal with, can all assign special meanings to their own sets of special
characters, and they don’t even have to tell us about it. So filtering out
symbols is not really the best idea. A better approach is to filter in
symbols. That is, only allow characters that are legitimate, and cut off
everything else. A regular expression that would do this for our $filename
(and for the $username just as well) could look like this:

  $filename =~ s/[^A-Za-z0-9_.-]//g;
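
The same set of characters can also be used to reject a bad name outright instead of silently rewriting it. The following is a minimal sketch under stated assumptions: the helper name valid_name and the 64-character limit are hypothetical choices, not part of the Calendar design.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the name unchanged if it is acceptable,
# and undef otherwise.
sub valid_name {
    my ($name) = @_;
    return unless defined $name;

    # Same allowed characters as the expression above: letters, digits,
    # underscore, period, and dash. The length limit is an assumption.
    return unless $name =~ /\A[A-Za-z0-9_.-]{1,64}\z/;

    # A name made only of periods ("." or "..") still names a directory.
    return if $name =~ /\A\.+\z/;

    return $name;
}

for my $candidate ('work-2001.cal', '../../etc/passwd', '|ls', '..') {
    my $verdict = defined valid_name($candidate) ? 'accepted' : 'rejected';
    print "$candidate: $verdict\n";
}
```

Rejecting has one practical advantage over stripping: the user learns that the name was not accepted, rather than finding the calendar saved under a different name.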

Notice that we only allow letters, numbers, underscores, dashes, and
periods, which is what you would normally expect to find in a file name. This makes
us feel a little less nervous about those open() and
system() calls. But what about all the other input that gets used
by our script? In [4], you can find many more examples of Perl functions that
have the problems of open() and system(). We cannot,
however, always afford to filter input so drastically. It is unreasonable,
for instance, to forbid semicolons in the free-text entries in
users’ calendars. Sometimes we can do things like escaping every non-alphanumeric
character in the input with a backslash to prevent it from being interpreted
by Perl or by the shell. We would use a regular expression to do this:

  $userinput =~ s/(\W)/\\$1/g;
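
Escaping is one way to keep the shell from interpreting input. Another, when the command can be run without a shell at all, is to pass the program and its arguments as a list. The sketch below assumes a Unix-like system and Perl 5.8 or later; the helper name run_without_shell is hypothetical, and the child process is simply the Perl interpreter itself ($^X) echoing its argument.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command without a shell and return its output.
sub run_without_shell {
    my (@cmd) = @_;

    # List-form pipe open: Perl executes $cmd[0] directly and hands the
    # remaining elements to it as separate arguments, so no shell ever
    # parses them.
    open(my $out, '-|', @cmd) or die "cannot run $cmd[0]: $!";

    local $/;                  # slurp mode
    my $output = <$out>;
    close($out);
    return $output;
}

# The semicolon is passed through as ordinary data.
my $hostile = 'alice; echo pwned';
print run_without_shell($^X, '-e', 'print $ARGV[0]', $hostile), "\n";
```

The same idea could be applied to the sendmail call by opening a pipe for writing, for example open(my $mail, '|-', '/usr/sbin/sendmail', '-t'), where the path is an assumption about the host. The list form of system() behaves the same way with respect to the shell.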

Other times, however, it is not Perl or the shell that concerns us, but some
other piece of software, like the user’s browser for example. Since in our
application we allow users to enter hypertext (say, HTML) that will later be
interpreted by other users’ browsers (e.g., when they view this user’s public
calendar), we in effect allow one user to run code on another user’s machine.
This can clearly have undesirable effects. So far we have talked about being
careful as to which users we trust and how much we trust them. In this case,
the user trusts us not to have malicious hypertext contents on the pages they
are viewing. For instance, if a malicious user creates an HTML form as part of a
calendar entry and makes it so that it resembles our own layout, but asks users
for their log-in information and mails it to the attacker, then other users
will be apt to believe that it is all right to provide their personal information,
since the form is part of our program. In a similar manner, a calendar entry
may contain JavaScript or some other code that will get interpreted by other
users’ browsers. Those types of attacks are known as “cross-site scripting”
and are also a kind of malicious input attack, since the hostile content
came into the system through input that our application accepted.

A good way to minimize cross-site scripting problems is to restrict the set of
HTML tags that we allow users to enter. Almost all HTML tags which have attributes
can be dangerous, especially the ones that point to other URIs and the ones that
can contain scripts. Thus for a restricted subset of HTML to be safe, it must
by necessity be very (sometimes overly) restricted. A good starting point is
to allow only the tags <p>, <b>, <i>, <em>, <strong>,
<pre>, <br>, and their corresponding ending tags [5]. To do this,
we can first convert all occurrences of these tags in the input to [p], [b], [i],
etc. We can then invalidate all other tags by replacing < with &lt;, >
with &gt;, and & with &amp;. After we have done this, we can replace
the [tags] back to <tags> again, and be done with it. Here’s some code that
would do this:

  $input =~ s/<p>/[p]/gi;
  $input =~ s/<b>/[b]/gi;
  
  # ...
  
  $input =~ s/<\/p>/[\/p]/gi;
  $input =~ s/<\/b>/[\/b]/gi;

  # ...

  # Escape & first, so the entities produced below are not escaped again
  $input =~ s/&/&amp;/g;
  $input =~ s/</&lt;/g;
  $input =~ s/>/&gt;/g;

  $input =~ s/\[p\]/<p>/gi;
  $input =~ s/\[b\]/<b>/gi;
  
  # ...

  $input =~ s/\[\/p\]/<\/p>/gi;
  $input =~ s/\[\/b\]/<\/b>/gi;

  # and so on.  
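
A variation on the steps above, offered as a sketch rather than as the article's method: escape everything first, then restore only the bare, attribute-free tags on the allowed list. This avoids the intermediate [tag] placeholders, so a user who literally types "[b]" does not end up with a real tag. The function name sanitize_html is hypothetical.

```perl
use strict;
use warnings;

# The tags named in the text; anything else stays escaped.
my @ALLOWED_TAGS = qw(p b i em strong pre br);

# Hypothetical helper: returns a copy of $input that is safe to embed in a page.
sub sanitize_html {
    my ($input) = @_;
    my $tags = join '|', @ALLOWED_TAGS;

    # 1. Escape everything. The ampersand must go first, otherwise the
    #    entities produced for < and > would be escaped a second time.
    $input =~ s/&/&amp;/g;
    $input =~ s/</&lt;/g;
    $input =~ s/>/&gt;/g;

    # 2. Turn back only exact matches such as &lt;b&gt; or &lt;/b&gt;.
    #    Tags carrying attributes never match and therefore stay escaped.
    $input =~ s/&lt;(\/?)($tags)&gt;/<$1\L$2\E>/gi;

    return $input;
}

print sanitize_html('<b>Lunch</b> <script>alert(1)</script>'), "\n";
```

Like the placeholder approach, this still leaves every tag with attributes, including links, unusable.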

This solution does not, however, allow us to embed hyperlinks in user input.
To do this, we must allow the <a href=""> tag, but then we have to
validate the contents of the referred URI very carefully. See [5] for a
discussion on URI validation, as well as many other aspects of secure
programming.

Input, as we saw earlier, is not always so obvious. There are environment
variables that influence the behavior of our applications. Those must be
carefully examined before being used, and the ones that will not be used must be
erased early in the program’s execution. There are also hidden parameters in
HTML forms that get passed to CGI scripts for various purposes. Just because
these parameters are called hidden does not mean that the user cannot see them.
In fact, the user can set them to any desired value, since the HTML form is
submitted by the user’s browser. Applications must not trust the origins of
this input. The same goes for “cookies,” which are used as persistent data
storage on the client side. The user has full freedom to modify the value of
such cookies, so they cannot be depended on for security-critical tasks.
We must be aware of all this if we want to have any hope of building secure
and dependable software.
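
For the environment variables mentioned above, the clean-up can be as small as the following sketch. The helper name scrub_environment is hypothetical, and both the list of variables removed and the replacement PATH are assumptions that would need adjusting to the host and to whatever the script actually calls.

```perl
use strict;
use warnings;

# Hypothetical helper, meant to be called at the very top of the CGI script,
# before any external program is run.
sub scrub_environment {
    # Variables that can change how a shell or a child process behaves.
    delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

    # Replace the inherited search path with a short, fixed one
    # (an assumption; list only the directories the script really needs).
    $ENV{PATH} = '/bin:/usr/bin';
    return;
}

scrub_environment();
```

Setting PATH explicitly, rather than trimming the inherited value, means the script never depends on what the Web server or the user happened to pass in.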

Let us now turn back to our original question: “How can software be designed
and implemented to withstand malicious input attacks?” The short answer here
is that it can’t. Maybe for some unreasonably restricted meanings of “software”
and “input” we could conceive of every possible malicious input attack and devise
defensive mechanisms, but in the real world the possibilities for something to go
wrong are literally unbounded. Furthermore, we are almost guaranteed that if an
attack which we have not considered exists, someone with the right morals and
motivation will eventually find it and exploit our software. What then are we to
do?

“Act for the best, hope for the best, and take what comes.” [3] To this,
I have to add “…then repeat.” As we have seen, it turns out that by
restricting the set of entities our software trusts to a minimum and by
conscientiously validating all input, we can avoid the majority of malicious
attacks and almost all accidental breaches (see [6] for further discussion on the
role of trust in security). It is, therefore, most important to think of as
many things that can break our software as possible while still in the initial
design phases, and to keep all of this in mind during implementation.
Security, however, is a process, not a product, as Bruce Schneier so rightly put it.
As the software we build serves its uptime, it will inevitably be subjected to
rough weather and will sometimes fail to defend itself and the people it serves.
We must then make a note of this, go back to the code or even back to the drawing
board, and apply the lessons we’ve learned to ensure that we have eliminated the
new avenue for security compromises.

This last piece is a matter of discipline.

References and Recommended Reading

[1] B. Schneier, “Secrets and Lies”, 2000.

[2] J. Viega and G. McGraw, “Building Secure Software”, 2001.

[3] Fitz-James Stephen, quoted by William James in “The Will to Believe”.

[4] J. Dimov, “Security Issues in Perl Scripts”, 2000.

[5] D. Wheeler, “Secure Programming for Linux and Unix HOWTO”, 2001.

[6] J. Viega, T. Kohno, B. Potter, “Trust and Mistrust in Secure Applications”,
Communications of the ACM, vol. 44, num. 2, 2001.

About the Author

Jordan Dimov is a consultant for Cigital Inc. in Dulles, Va., and a member
of Cigital’s Software Security Group.
