Introduction to Input Validation with Perl
A very important, well-known, yet too often lightly dismissed problem in software security is that of trust management. There are many parties involved in the building and deployment of a software product (even if there's only one developer), and the entities that interact with the resulting system are more numerous still, often with diverse interests. Among these entities there necessarily exists a complicated network of explicit and implicit trust relationships, which in all but the simplest situations is too difficult (often impossible) to analyze. The problem, of course, dwells in the fact that links in this trust relationship network are vulnerable to abuse if their role and importance have been underestimated in the development process. Thus, the challenge that developers face is that of balancing the amount of effort invested in analyzing all possible interactions with their software against the desired level of reliability and security of their product. If these two objectives could be quantified, their values would be inversely proportional to each other.
A thorough examination of trust management issues in software security could easily constitute a multivolume work by itself, and there is a lot of related research underway. For a good general introduction to the subject, consult  and Chapter 13 of . This article will focus on one particular aspect of the problem -- that of proper input validation. The article has two objectives. The first goal is to introduce the reader to the problem and to discuss some relevant secure programming techniques. The second objective is to confront the question "How can software be designed and implemented to withstand malicious input attacks?" The high-level scripting language Perl and the GNU/Linux platform will be used to illustrate key implementation ideas, but most of the discussion will be applicable to any other development environment.
Before we attempt to answer the above-posed question, two other preliminary questions must be considered: "What constitutes input to a program?" and "What constitutes malicious input?" We need a good understanding of these terms, for much depends upon the answers. Unfortunately, both concepts are fuzzier than we would like, and their meaning in turn depends on the purpose and the details of the software itself. It is clear, for instance, that if a program interacts with a human user then every bit of data provided by the user is input to the program. Furthermore, all other resources, such as filesystems, databases, and network interfaces, from which the program obtains information can also be viewed as inputs. More subtly, a program often depends on certain parameters of the environment in which it executes, such as various path lists, input field separator values, etc. Those too are input.
The answer to the second question, namely "What constitutes malicious input?" is even more dependent on the particular situation. Input that is fully legitimate for one application can be devastating for another. So much so in fact, that we will avoid giving an empirical formulation of the concept and will instead make our own intuitive working definition: Malicious input is all input that causes software to behave in a manner inconsistent with the software's specified behavior. Thus if a user provides data which makes the program do something it is not supposed to do, we will call this data "malicious input."
"And as long as I am wishing, I would also like a pony," said Alice. So let us continue asking questions while we are on it. How does malicious input actually break software? The short answer is: by exploiting trust relationships. Software design is akin to mathematical modeling in some aspects. In both cases we seek to develop an approximation of some ideal system, which is close enough to the real world as to be useful, yet simplified enough as to be manageable. The only way we know of to do this is by making assumptions. Programmers make assumptions all the time, often intentional but just as often unconscious. Many of these assumptions involve extending trust to other parties, such as the user, the software distribution medium, the execution environment, the development environment, and many, many others. Malicious input violates the assumptions and throws the software into a state that has not been anticipated by its creators. Sometimes such input is generated accidentally, but more frequently trust relationships are abused deliberately, thus the adjective "malicious." Let us look at an example.
Suppose we want to build a simple CGI application that provides a calendar service to users, featuring calendar personalization, event notification by e-mail, and the possibility of viewing other users' personal calendars. To ensure some degree of privacy, the application may offer the option of marking personal entries into the calendar as private or public, so that an entry will be viewable to other users only if it is marked as public. A user would first register an account with the service, then log in, and be presented with options to view or edit their personal calendar, look up the public entries in someone else's calendar, request e-mail notification for events, or log out. An entry (or event) in the calendar would consist of a date, a description of the event, a flag of whether the entry is private, and a list of users who requested e-mail notification when the event occurs. The application may allow event descriptions to be entered as hypertext so they can link to other events or external resources.
Before we plunge into more details, let us consider the security implications of our design. First we need to identify the entities involved. We, as the designers and developers of the system, are one such entity (although it may be possible to break this down further). The vendor that will deliver our product to customers may be another. The customer, who will install our application on their Web server to offer service to end users, is also an entity. Then, of course, there are the end users who will benefit from the service, and others who just happen to be browsing the customer's Web site. There are malicious entities, who keep a sharp eye on the server and will not leave anything as attractive as our software unnoticed. If we think more about it, we are likely to be able to extend this list to a fair size. If we were doing something more complicated, the list could quickly grow out of manageable proportions, while still most likely being incomplete. But once we have identified the players, we need to concern ourselves with establishing the relations between them, and whether any of these relations present a security risk. As it turns out, almost all of them do. The customer who installed our software is clearly taking the risk of someone compromising the application and consequently the customer's server, if not their entire network. This may be accomplished by crackers who exploit an input vulnerability of the Calendar, or perhaps by someone who managed to insert a "trojan horse" in our software before it reached the customer. The end users also take numerous risks by merely using the service. If someone compromises the server, the users' secrets will be revealed. If our protection of private entries in the calendar is not well implemented, users may be able to view each other's secrets.
If we were not careful when we built the Calendar's ability to contain hypertext, a user may exploit a vulnerability in another user's browser by placing malicious hypertext code in a public entry. Some attacks and vulnerabilities are obvious, but most are often hidden. And the ones that are hidden are usually the cruelest, associated with the greatest risk. Computer security is about handling subtlety. As developers, we must try our best to protect each entity that interacts with our software from every other entity.
With these security objectives in mind, we can now focus on the components of our application. Just as there are many entities interacting with our software, so are there many pieces that comprise it and interact with each other internally. Our CGI application may, for instance, have an authentication mechanism that prompts the user for their log-in information, enciphers the password and compares it to the one stored in that user's record in a file on the server's filesystem. It may store users' calendars in a SQL database and request and update database records in accord with the users' actions. It may also make use of an external mail agent to send event notifications to users. For any but the simplest applications, we would also use various modules and libraries developed earlier, possibly by someone else. The pattern here is that many components perform different functions and pass information to each other, all at the request of the end user. The input that the user provides to our application is used to determine the behavior of these components. Unexpected input is likely to cause unexpected behavior, and because of the complex interconnections that we have seen, unexpected behavior in one component tends to spread quickly to the others. As a special case, carefully crafted input can cause carefully predicted "unexpected" behavior, designed to exploit a weakness of the system. To prevent this, all input (whether from the user or from other sources, as we discussed earlier) must be cautiously examined and filtered before it can be used safely.
To illustrate this idea, suppose we want our Calendar to offer users the ability to save different versions of their own or other users' calendars for later reference. While viewing a calendar, the user would click on a Save button and be asked to provide a name for the calendar they wish to store. From within the CGI script we may handle this in a manner similar to this:
    $username = param('username');                 # Get the user name from a CGI parameter
    $filename = param('filename');                 # Get the desired file name
    $filename = 'users/'.$username.'/'.$filename;  # Specify the directory
    open (CLF, ">$filename");                      # Open the file for writing
    print CLF $currentCalendar;                    # Print the current calendar into the file
    close (CLF);                                   # Close the file
To a security-conscious programmer this would scream trouble, but sometimes even experienced developers fail to recognize how user input can impact the security of their system. The problem here is that we are trusting the user to provide a nice simple name for their calendar. While almost anything that a normal user can think of to name their calendar would be okay for us, a knowledgeable attacker can easily come up with file names that can break our software and the server on which it is running. One such choice, known as a "backward directory traversal attack," is to precede the filename with a series of "../" (dot-dot-slash) sequences, which instruct the operating system to look for the file in an upper-level directory. If the Web server has sufficient privileges, the attacker can use this technique to overwrite any file on the host. To prevent this, we must filter out any such sequences before we use the input. An easy and fairly efficient way to do this is by using regular expressions:
    $filename = param('filename');                 # ...as before
    $filename =~ s/\.\.\///g;                      # <-- Filter out ../ symbols!
    $filename = 'users/'.$username.'/'.$filename;  # as before...
This is a little better, but we must also make sure we do this for the $username variable, for it too becomes part of the filename, and there's nothing yet to stop the user from putting "../" in their user name. Unfortunately, the dot-dot-slash trick is not the only thing that can form malicious input. Suppose later, when the user wants to load a saved calendar, our script does this:
    $filename = param('filename');   # Get file name from a CGI parameter
    $filename =~ s/\.\.\///g;        # Filter out ../ symbols
    chdir ('users/'.$username);      # Switch to user's directory
    open (CLF, "$filename");         # Open the file for reading
    $currentCalendar = <CLF>;        # Read in the calendar
    print $currentCalendar;          # Output it to the user
    close (CLF);                     # Close the file
    chdir ('../..');                 # Go back to the script's directory
Many things can go wrong here. If the filename is a reference to an already open file handle, such as "&STDIN", the script will read from that file handle instead (which could happen to be associated with some other user's calendar!). If the filename ends with a pipe symbol ("|") preceded by a shell command, the script will execute the shell command and will read from its output; a filename that begins with a pipe also runs the command, with the script's file handle connected to the command's input. This way the attacker can read any file on the system, and worse yet, execute any command that the server has privileges to run! Those are all subtle consequences of the way Perl implements the "open" function. In other languages and platforms some of those problems may not be present, but other similar problems are likely to exist on any system. Crackers make a hobby of learning such obscure things. Security-conscious developers must make it their duty to do the same.
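One way to sidestep the open() behaviors just described, offered here as a minimal sketch and not as a complete remedy, is the three-argument form of open(), in which the mode is passed separately and the file name is treated purely as a name. The helper name read_calendar_file is hypothetical and introduced only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Calendar code above): read a whole
# file using the three-argument form of open(). The mode '<' is a separate
# argument, so the name is never scanned for "|", "&", "<" or ">".
sub read_calendar_file {
    my ($path) = @_;
    open(my $fh, '<', $path)
        or return;              # open failed; the caller decides what to report
    local $/;                   # slurp mode: read the entire file at once
    my $contents = <$fh>;
    close($fh);
    return $contents;
}
```

With this form, a name such as "echo hacked |" is simply looked up as a (presumably nonexistent) file, so the call fails instead of running a command. It does nothing about directory traversal, which still has to be handled by validating the name itself.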
How can we avoid the problems mentioned? We could add more regular expressions to filter out other bad characters, such as "&" and "|". But it becomes very difficult to be certain that we have not missed any characters that have special meaning to Perl. The function system() exists to convince us of this fact. It is a function which, like open(), demonstrates a vulnerability in the interface between the script and the underlying operating system. Say we want to use the program sendmail to notify a user of an event in their calendar. We might do something like:
    $username = param('username');
    $message = "message_for_$username";
    open (MF, ">$message");
    print MF "To: $username\n";
    print MF "From: The Calendar\n";
    print MF "Subject: Your event is about to happen!\n\n";
    print MF $event;
    print MF ".\n";
    close (MF);
    system ("sendmail -t < $message");
The troubling variable here is $username, because it came from the user, is not validated, and makes it into a system call. The problems are very similar to the ones we had with open(), except here we have a lot more of them. If the user name contains a semicolon, followed by a command, the command will get executed right along with sendmail. We could filter out semicolons, but the shell could use other symbols for command separation (and indeed almost all shells use the newline character for this purpose as well). Pipes can also cause problems here, and so can the ">" and "<" symbols used for output and input redirection. The backtick operator in Perl is equivalent to the system() function in this respect. The bottom line is, there are too many characters and combinations of characters that have a special meaning to too many components. Worse yet, we can not even know what all of those characters are, because Perl, the shell, and other applications we might have to deal with, can all assign special meanings to their own sets of special characters, and they don't even have to tell us about it. So filtering out symbols is not really the best idea. A better approach is to filter in symbols. That is, only allow characters that are legitimate, and cut off everything else. A regular expression that would do this for our $filename (and for the $username just as well) could look like this:
    $filename =~ s/[^A-Za-z0-9_.-]//g;
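A stricter variant of the same filter-in idea, sketched here under our own assumptions, is to reject a name outright when it contains anything off the list, so that bad input is refused and not silently rewritten. The helper name is_safe_name, the ban on a leading period, and the 64-character limit are all illustrative choices, not requirements.

```perl
use strict;
use warnings;

# Hypothetical helper: accept a name only if every character is on the
# allowlist, it does not begin with a period (which also rules out "."
# and ".."), and it is at most 64 characters long. The length limit is an
# arbitrary assumption made for this sketch.
sub is_safe_name {
    my ($name) = @_;
    return 0 unless defined $name;
    # \A and \z anchor to the true start and end of the string, so a
    # trailing newline cannot slip through the way it could with $.
    return $name =~ /\A[A-Za-z0-9_-][A-Za-z0-9_.-]{0,63}\z/ ? 1 : 0;
}
```

A caller could then refuse the request when is_safe_name($filename) or is_safe_name($username) returns 0, before either value is used to build a path.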
Notice that we only allow letters, numbers, underscores, dashes, and periods -- what you would normally expect to find in a file name. This makes us feel a little less nervous about those system() calls. But what about all the other input that gets used by our script? In , you can find many more examples of Perl functions that have the problems of system(). We can not, however, always afford to filter input so drastically. It is unreasonable, for instance, to forbid users from putting semicolons in the free-text entries in their calendars. Sometimes we can do things like escaping every non-word character in the input (anything other than a letter, digit, or underscore) with a backslash to prevent it from being interpreted by Perl or by the shell. We would use a regular expression to do this:
    $userinput =~ s/(\W)/\\$1/g;
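Where the only goal of escaping is to keep the shell from interpreting the input, another option is not to involve the shell at all. Perl's system() and piped open() both accept the command as a list of separate arguments, in which case the program is started directly. The sketch below illustrates the idea; the helper name capture_without_shell is hypothetical, and the running Perl interpreter ($^X) stands in for a real program such as a mail agent.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command given as an explicit argument list
# and return everything it prints. The list form of a piped open() starts
# the program directly, so no shell ever parses the arguments.
sub capture_without_shell {
    my (@command) = @_;
    open(my $out, '-|', @command)
        or die "Cannot run $command[0]: $!";
    local $/;                       # read all of the output at once
    my $output = <$out>;
    close($out);
    return $output;
}

# $^X is the path of the running perl, used here as a harmless stand-in.
# The child simply prints its first argument back.
my $hostile = 'bob; echo hacked';
print capture_without_shell($^X, '-e', 'print $ARGV[0]', $hostile), "\n";
```

The semicolon in the argument reaches the child program as ordinary text and no second command is run. The program being called may still give its own meaning to what it receives, so this complements input validation and does not replace it.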
A good way to minimize cross-site scripting problems is to restrict the set of HTML tags that we allow users to enter. Almost all HTML tags which have attributes can be dangerous, especially the ones that point to other URIs and the ones that can contain scripts. Thus for a restricted subset of HTML to be safe, it must by necessity be very (sometimes overly) restricted. A good starting point is to allow only the tags <p>, <b>, <i>, <em>, <strong>, <pre>, <br>, and their corresponding ending tags. To do this, we can first convert all occurrences of these tags in the input to [p], [b], [i], etc. We can then invalidate all other tags by replacing & with &amp;, < with &lt;, and > with &gt;. After we have done this, we can replace the [tags] back to <tags> again, and be done with it. Here's some code that would do this:
    $input =~ s/<p>/[p]/gi;
    $input =~ s/<b>/[b]/gi;
    # ...
    $input =~ s/<\/p>/[\/p]/gi;
    $input =~ s/<\/b>/[\/b]/gi;
    # ...
    $input =~ s/&/&amp;/g;    # escape & first, so the entities added next are not re-escaped
    $input =~ s/</&lt;/g;
    $input =~ s/>/&gt;/g;
    $input =~ s/\[p\]/<p>/gi;
    $input =~ s/\[b\]/<b>/gi;
    # ...
    $input =~ s/\[\/p\]/<\/p>/gi;
    $input =~ s/\[\/b\]/<\/b>/gi;
    # and so on.
This solution does not, however, allow us to embed hyperlinks in user input. To do this, we must allow the <a href=""> tag, but then we have to validate the contents of the referred URI very carefully. See  for a discussion on URI validation, as well as many other aspects of secure programming.
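The tag-restriction scheme described above can also be written more compactly. The following is a sketch of a variation, not the code shown earlier: it escapes everything first and then restores only the bare allowed tags, which avoids the round trip through square brackets. The helper name sanitize_html is hypothetical, and the tag list is the one given in the text.

```perl
use strict;
use warnings;

# Tags allowed through, taken from the list in the text. Only bare tags
# (no attributes) are ever restored.
my @ALLOWED_TAGS = qw(p b i em strong pre br);

# Hypothetical helper: escape all markup, then turn the escaped form of
# each allowed opening or closing tag back into a real tag.
sub sanitize_html {
    my ($input) = @_;
    $input =~ s/&/&amp;/g;      # must come first so later entities survive
    $input =~ s/</&lt;/g;
    $input =~ s/>/&gt;/g;
    my $tags = join '|', @ALLOWED_TAGS;
    $input =~ s/&lt;(\/?)($tags)&gt;/<$1$2>/gi;
    return $input;
}
```

Because a tag carrying any attribute does not match the restoring pattern, something like <b onclick="..."> stays escaped and is displayed as text.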
Input, as we saw earlier, is not always so obvious. There are environment variables that influence the behavior of our applications. Those must be carefully examined before being used, and the ones that will not be used must be erased early in the program's execution. There are also hidden parameters in HTML forms that get passed to CGI scripts for various purposes. Just because these parameters are called hidden does not mean that the user can not see them. In fact, the user can set them to any desired value, since the HTML form is submitted by the user's browser. Applications must not trust the origins of this input. The same goes for "cookies" which are used as persistent data storage on the client side. The user has full freedom to modify the value of such cookies, so they can not be depended on for security critical tasks. We must be aware of all this if we want to have any hope of building secure and dependable software.
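For the environment variables mentioned above, a common precaution is to reset the environment to a known state before running any external program. The following is a minimal sketch; the helper name scrub_environment is hypothetical, and the PATH value is an assumption that would need to match the directories actually required on the host.

```perl
use strict;
use warnings;

# Hypothetical helper: remove variables that can change how a shell
# behaves and pin PATH to fixed system directories.
sub scrub_environment {
    delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};
    $ENV{PATH} = '/usr/bin:/bin';   # assumed value; adjust for the host
    return;
}
```

Calling scrub_environment() early in the script, before any call to system(), backticks, or a piped open(), keeps inherited values of these variables from influencing the programs the script starts.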
Let us now turn back to our original question: "How can software be designed and implemented to withstand malicious input attacks?" The short answer here is that it can't. Maybe for some unreasonably restricted meanings of "software" and "input" we could conceive of every possible malicious input attack and devise defensive mechanisms, but in the real world the possibilities for something to go wrong are literally unbounded. Furthermore, we are almost guaranteed that if an attack which we have not considered exists, someone with the right morals and motivation will eventually find it and exploit our software. What then are we to do?
"Act for the best, hope for the best, and take what comes."  To this, I have to add "...then repeat." As we have seen, it turns out that by restricting the set of entities our software trusts to a minimum and by conscientiously validating all input, we can avoid the majority of malicious attacks and almost all accidental breaches (see  for further discussion on the role of trust in security). It is, therefore, most important to think of as many things that can break our software as possible while still in the initial design phases, and to keep all of this in mind during implementation. Security, however, is a process, not a product, as Bruce Schneier so rightly put it. As the software we build serves its uptime, it will inevitably be subjected to rough weather and will sometimes fail to defend itself and the people it serves. We must then make a note of this, go back to the code or even back to the drawing board, and apply the lessons we've learned to ensure that we have eliminated the new avenue for security compromises.
This last piece is a matter of discipline.
References and Recommended Reading
 B. Schneier, "Secrets and Lies", 2000.
 J. Viega and G. McGraw, "Building Secure Software", 2001.
 Fitz-James Stephen, quoted by William James in "The Will to Believe".
 J. Viega, T. Kohno, B. Potter, "Trust and Mistrust in Secure Applications", Communications of the ACM, vol. 44, num. 2, 2001.
About the Author
Jordan Dimov is a consultant for Cigital Inc. in Dulles, Va., and a member of Cigital's Software Security Group.