Chapter 4. Validate All Input


Wisdom will save you from the ways of wicked men, from men whose words are perverse...

  Proverbs 2:12 (NIV)
Table of Contents
4.1. Command line
4.2. Environment Variables
4.2.1. Some Environment Variables are Dangerous
4.2.2. Environment Variable Storage Format is Dangerous
4.2.3. The Solution - Extract and Erase
4.3. File Descriptors
4.4. File Contents
4.5. Web-Based Application Inputs (Especially CGI Scripts)
4.6. Other Inputs
4.7. Human Language (Locale) Selection
4.7.1. How Locales are Selected
4.7.2. Locale Support Mechanisms
4.7.3. Legal Values
4.7.4. Bottom Line
4.8. Character Encoding
4.8.1. Introduction to Character Encoding
4.8.2. Introduction to UTF-8
4.8.3. UTF-8 Security Issues
4.8.4. UTF-8 Legal Values
4.8.5. UTF-8 Illegal Values
4.8.6. UTF-8 Related Issues
4.9. Prevent Cross-site Malicious Content on Input
4.10. Filter HTML/URIs That May Be Re-presented
4.10.1. Remove or Forbid Some HTML Data
4.10.2. Encoding HTML Data
4.10.3. Validating HTML Data
4.10.4. Validating Hypertext Links (URIs/URLs)
4.10.5. Other HTML tags
4.10.6. Related Issues
4.11. Forbid HTTP GET To Perform Non-Queries
4.12. Limit Valid Input Time and Load Level

Some inputs are from untrusted users, so those inputs must be validated (filtered) before being used. You should determine what is legal and reject anything that does not match that definition. Do not do the reverse (identify what is illegal and write code to reject those cases), because you are likely to forget to handle an important case of illegal input.

There is a good reason for identifying ``illegal'' values, though: they make a set of tests (usually just executed in your head) for checking that your validation code is thorough. When I set up an input filter, I mentally attack the filter to see if there are illegal values that could get through. Depending on the input, here are a few examples of common ``illegal'' values that your input filters may need to prevent: the empty string, ".", "..", "../", anything starting with "/" or ".", anything with "/" or "&" inside it, any control characters (especially NUL and newline), and/or any characters with the ``high bit'' set (especially values decimal 254 and 255). Again, your code should not be checking for ``bad'' values; you should do this check mentally to be sure that your pattern ruthlessly limits input values to legal values. If your pattern isn't sufficiently narrow, you need to carefully re-examine the pattern to see if there are other problems.
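The mental attack described above can also be written down as a small regression test. The following is only a sketch: the pattern `LEGAL`, the helper names `is_legal` and `leaks`, and the exact attack list are illustrative assumptions, not part of any library. Note that the filter itself still only describes legal input; the ``illegal'' values appear solely in the test data.

```python
import re

# Hypothetical allow-list: 1 to 32 characters from ASCII letters, digits,
# "_" and "-", where the first character must be a letter or digit.
LEGAL = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,31}")


def is_legal(value: str) -> bool:
    # fullmatch() anchors both ends of the string, so a trailing newline
    # cannot slip through the way it can with "$" in re.match().
    return LEGAL.fullmatch(value) is not None


# Common "illegal" values, used only to attack the filter.
ATTACKS = [
    "", ".", "..", "../", "/etc/passwd", ".hidden", "a/b", "a&b",
    "a\x00b", "a\nb", "name\n", "a\xfeb", "a\xffb",
]


def leaks(check, attacks):
    """Return the attack strings that the given filter wrongly accepts."""
    return [a for a in attacks if check(a)]
```

An empty result from `leaks(is_legal, ATTACKS)` means none of the listed attacks got through. It does not prove the pattern is narrow enough, so the pattern itself still deserves a careful reading.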

Limit the maximum character length (and minimum length if appropriate), and be sure to not lose control when such lengths are exceeded (see Chapter 5 for more about buffer overflows).

For strings, identify the legal characters or legal patterns (e.g., as a regular expression) and reject anything not matching that form. There are special problems when strings contain control characters (especially linefeed or NUL) or shell metacharacters; it is often best to ``escape'' such metacharacters immediately when the input is received so that such characters are not accidentally sent. CERT goes further and recommends escaping all characters that aren't in a list of characters not needing escaping [CERT 1998, CMU 1998]. See Section 7.2 for more information on limiting call-outs.
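As a minimal sketch of the length and pattern checks just described, the function below validates a made-up ``username'' field received as raw bytes. The field, its limits (`MIN_LEN`, `MAX_LEN`), and the pattern are assumptions chosen for illustration; a real program would substitute its own definition of legal input.

```python
import re

# Hypothetical limits and pattern for an illustrative "username" field.
MIN_LEN = 1
MAX_LEN = 16
USERNAME = re.compile(r"[a-z][a-z0-9_]*")


def validate_username(raw: bytes) -> str:
    """Return the username as text, or raise ValueError if it is not legal."""
    # Check the length first, so oversized input is rejected before any
    # further processing is attempted.
    if not (MIN_LEN <= len(raw) <= MAX_LEN):
        raise ValueError("length out of range")
    try:
        # Strict ASCII decoding rejects every byte with the high bit set.
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        raise ValueError("non-ASCII byte in input") from None
    # Only input matching the whole legal pattern is accepted; control
    # characters and shell metacharacters are simply not in the legal set.
    if USERNAME.fullmatch(text) is None:
        raise ValueError("illegal characters")
    return text
```

Because the legal set here contains no control characters or shell metacharacters, nothing needs escaping afterwards. If such characters must be accepted, escaping them on receipt is a separate step that this sketch does not cover.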

Limit all numbers to the minimum (often zero) and maximum allowed values.

A full email address checker is actually quite complicated, because there are legacy formats that greatly complicate validation if you need to support all of them; see mailaddr(7) and IETF RFC 822 [RFC 822] for more information if such checking is necessary.
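Returning to numeric input, the minimum and maximum limits can be enforced at the point where the text is converted. This is a sketch under assumptions: the helper name `parse_bounded_int` and the 18-digit cap are illustrative choices, and only plain decimal integers are treated as legal.

```python
def parse_bounded_int(text: str, minimum: int, maximum: int) -> int:
    """Parse a decimal integer and enforce minimum <= value <= maximum."""
    digits = text[1:] if text.startswith("-") else text
    # Accept plain ASCII digits only. int() on its own would also accept
    # forms such as " 42", "+42", "4_2" and non-ASCII digits. The length
    # cap (an arbitrary illustrative limit) rejects absurdly long input
    # before conversion.
    if not (0 < len(digits) <= 18 and digits.isascii() and digits.isdigit()):
        raise ValueError("not a plain decimal integer")
    value = int(text)
    if not (minimum <= value <= maximum):
        raise ValueError("value out of range")
    return value
```

For example, a TCP port field could be checked with `parse_bounded_int(field, 1, 65535)`.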

Filenames should be checked; usually you will want to not include ``..'' (higher directory) as a legal value. In filenames it's best to prohibit any change in directory, e.g., by not including ``/'' in the set of legal characters. Often you shouldn't support ``globbing'', that is, expanding filenames using ``*'', ``?'', ``['' (matching ``]''), and possibly ``{'' (matching ``}''). For example, the command ``ls *.png'' does a glob on ``*.png'' to list all PNG files. The C fopen(3) command (for example) doesn't do globbing, but the command shells perform globbing by default, and in C you can request globbing using (for example) glob(3). If you don't need globbing, just use the calls that don't do it where possible (e.g., fopen(3)) and/or disable them (e.g., escape the globbing characters in a shell). Be especially careful if you want to permit globbing. Globbing can be useful, but complex globs can take a great deal of computing time. For example, on some ftp servers, performing a few of these requests can easily cause a denial-of-service of the entire machine:

ftp> ls */../*/../*/../*/../*/../*/../*/../*/../*/../*/../*/../*/../*

Trying to allow globbing, yet limit globbing patterns, is probably futile. Instead, make sure that any such programs run as a separate process and use process limits to limit the amount of CPU and other resources they can consume. See Section 6.3.8 for more information on this approach, and see Section 3.6 for more information on how to set these limits.
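For the more common case where globbing is not needed at all, the filename rules above (no ``..'', no ``/'', no globbing characters) can be sketched as an allow-list. The policy, the names `SAFE_NAME`, `is_safe_filename`, and `open_in_dir`, and the rule that a name must begin with a letter or digit are assumptions made for illustration. Python's built-in `open()`, like fopen(3), does not expand glob patterns.

```python
import os
import re

# Hypothetical policy: 1 to 255 characters from ASCII letters, digits,
# ".", "_" and "-", and the first character must be a letter or digit.
# "/" and the globbing characters "*", "?", "[", "]", "{", "}" are
# simply absent from the legal set.
SAFE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{0,254}")


def is_safe_filename(name: str) -> bool:
    # Requiring a leading letter or digit excludes ".", "..", hidden
    # files, and names beginning with "-".
    return SAFE_NAME.fullmatch(name) is not None


def open_in_dir(base_dir: str, name: str):
    """Open a file by name inside base_dir, refusing illegal names."""
    if not is_safe_filename(name):
        raise ValueError("illegal filename")
    # open() performs no globbing; the name is used literally.
    return open(os.path.join(base_dir, name), "rb")
```

This sketch checks only the name. It does not deal with symbolic links inside `base_dir` or with limiting the resources of programs that do glob, which need the separate measures described above.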

When accepting cookie values, make sure to check that the domain value for any cookie you're using is the expected one. Otherwise, a (possibly cracked) related site might be able to insert spoofed cookies. Here's an example from IETF RFC 2965 of how failing to do this check could cause a problem:

Unless you account for them, the legal character patterns must not include characters or character sequences that have special meaning to either the program internals or the eventual output:

These tests should usually be centralized in one place so that the validity tests can be easily examined for correctness later.

Make sure that your validity test is actually correct; this is particularly a problem when checking input that will be used by another program (such as a filename, email address, or URL). Often these tests have subtle errors, producing the so-called ``deputy problem'' (where the checking program makes different assumptions than the program that actually uses the data). If there's a relevant standard, look at it, but also search to see if the program has extensions that you need to know about.

While parsing user input, it's a good idea to temporarily drop all privileges, or even create separate processes (with the parser having permanently dropped privileges, and the other process performing security checks against the parser requests). This is especially true if the parsing task is complex (e.g., if you use a lex-like or yacc-like tool), or if the programming language doesn't protect against buffer overflows (e.g., C and C++). See Section 6.3 for more information on minimizing privileges.

When using data for security decisions (e.g., ``let this user in''), be sure to use trustworthy channels. For example, on a public Internet, don't just use the machine IP address or port number as the sole way to authenticate users, because in most environments this information can be set by the (potentially malicious) user. See Section 6.9 for more information.

The following subsections discuss different kinds of inputs to a program; note that input includes process state such as environment variables, umask values, and so on. Not all inputs are under the control of an untrusted user, so you need only worry about those inputs that are.