How to use data validation to avoid Remote File Inclusion (RFI) vulnerabilities in your code, with examples ~ WhiteCyber

A remote file inclusion (RFI) vulnerability is a security flaw in programming code. Whenever a script receives data from outside itself, there is a danger that the data was sent by a malicious attacker, a hacker, who designed it so that it would corrupt the execution of the script and trick it into doing something it wasn't supposed to do.
One of the actions that a corrupted script can be tricked into doing is to fetch a file from a distant website (that is the "remote file") and include() it into the body of the corrupted script (that is the "inclusion"). Whatever program code is in the remote (but now local!) file becomes part of the corrupted script, and it executes right along with all the other code.
RFI therefore allows hackers to run their code on your server, with the same access permissions (to folders and files) that your own code has.
The key to a successful RFI attack is that the hacker must be able to send the URL of the remote file into your script, disguised as innocent data.
That's easy. All they have to do is find (or guess) the avenues by which your script accepts incoming data, note the variable names you use (or guess them, using common names), and then start sending your script ordinary requests of the type it normally expects, with one difference: the value of every variable they send is the URL of the remote script they want your script to execute.
They have no control over whether your script actually uses the incoming data in PHP include(), include_once(), require(), or require_once() statements (or their equivalents in other languages), but that is so often the case that this is a high-percentage play for them.
An important defense against RFI attack is to write your scripts to examine every incoming variable to ensure that its data type, character composition, format, and value are "legal" according to the characteristics your script expects that variable to have. If an incoming variable is not what you expect, your script must not use it.
This is called data validation or "sanitizing" or "scrubbing". It is the topic of this article.

Untrusted data (from outside the script) requires validation

When you set a variable explicitly in a script with:

$a = 4;

that data is considered trusted. You are in control of it. You set it yourself, and presumably not maliciously. Likewise, if you read data from a file or other source that you completely control, that data is trusted.
Data is "untrusted" when it comes from a source you don't control completely.
Common ways PHP scripts receive untrusted data from the outside world:

$_GET[''] data, received from the user in the URL query string.
$_POST[''] data, usually received from the user through HTML form submissions.
$_COOKIE[''] data, received from the user in the cookie sent by their browser.
You initially control the cookie when you create it, but the user can edit and modify it before their browser sends it back.
$_REQUEST[''] data: the $_GET[''], $_POST[''], and $_COOKIE[''] variables combined into one array. Which of the three are included depends on PHP's request_order and variables_order settings.
Any other PHP predefined variable, such as an entry in $_SERVER[''], that came from the user or their browser, or that is based on information that came from them.

Before each of these incoming variables is used in a script, it is necessary to ensure that it has a value in the set of, or within the range of, the legitimate values your script is designed to handle for that variable. If it is not, you should instead give it a safe default value, or not use it at all, or reject the submission and inform the user that the input was invalid, whichever option is appropriate to your application.
Some of the ways you can test variables include:

Ensure length is within the expected range, or cut it to the maximum acceptable length.
Test it against a regular expression (regex) to ensure it doesn't contain unacceptable characters.
Ensure numeric input has only digits and other number-related characters. Examples: +-0123456789.e
Ensure numeric input is within the expected numeric range.
Compare the value (string or numeric) against a list of all possible acceptable values. Ensure that it matches one of them.
Test the input with a PHP Validating Filter.
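
The first test in the list, checking or cutting the length, might be sketched as follows. The helper names are hypothetical, the code assumes PHP 7 or later, and any limits passed in are whatever your own application expects.

```php
<?php
// Hypothetical helpers; the names are illustrative assumptions.

// Returns TRUE only if $value is a string whose byte length is within range.
function length_is_acceptable($value, int $min, int $max): bool
{
    if (!is_string($value)) {
        return false; // e.g. ?name[]=x arrives as an array, not a string
    }
    $length = strlen($value); // counts bytes; mb_strlen() counts characters if mbstring is installed
    return ($length >= $min) && ($length <= $max);
}

// Cuts $value to at most $max bytes instead of rejecting it.
function cut_to_length(string $value, int $max): string
{
    return substr($value, 0, $max);
}
```

Cutting a value to length does not by itself make it legal. The shortened value still has to pass the other tests before it is used.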

The example code will show methods of doing all these tests, but first let's do some experimenting to see what RFI is all about.

Valid and invalid data in form submissions, and an RFI demonstration

Although this page doesn't have a form on it, it is designed to handle form submissions using HTTP GET requests.
You submit the data manually by copying and pasting URLs for this page into your browser's address bar. The URLs have the same format as the ones a browser generates when you submit a form. Doing it manually will help you understand what an RFI attack is and how it works.
The hypothetical form has two fields:

Age: input is numeric only, and only values from 0 to 114 are allowed.
Favorite Color: input can be text (red, blue, green) or numeric (1=red, 2=blue, 3=green).
If the script receives a number, it is translated within the script to the corresponding color.

Here are some example URLs to paste into your browser address bar. You'll see the result of your "form submissions" at the top of the resulting page.

This is a legitimate request with legal values. Your age is 25, and your favorite color is blue. If you wish, you can manually edit the URL to experiment with other combinations:

http://25yearsofprogramming.com/blog/2011/20110124.htm?age=25&color=blue

This is a legitimate request with age=50 and a numerically encoded color (3=green):

http://25yearsofprogramming.com/blog/2011/20110124.htm?age=50&color=3

This is an invalid request with a non-numeric age and a color that the script is not designed to handle. The resulting output color is not violet but red because the script sets red as the default color. When the user submits an illegal color, their input is ignored. Age is handled the same way. The default age is 0:

http://25yearsofprogramming.com/blog/2011/20110124.htm?age=noyb&color=violet

So far, all seems quiet. Whatever you enter, the output you get is an age and a color. If you try to do something invalid, you get a default age and a default color. Big deal.

The next experiment is an RFI attack. Copy and paste the URL into your address bar. If your browser displays the text below as 2 lines, copy them both. It's a one-line URL. If your sharp eyes spot a typographical error in the URL query string, don't correct it. It's intentional (see note 5 under Notes):

http://25yearsofprogramming.com/blog/2011/20110124.htm?age=75&color=htpp://25yearsofprogramming.com/robots.txt

What happened??! You've got a lot of nerve! You hacked my website!! OK, not really. We're pretending. What happened was this:

My script was expecting you to supply a color, as before. It uses a PHP include() command to include one of my website files into the page text, based on which color you submitted.
I FAILED to use the methods described later in this article to test whether what you supplied really was a legitimate color, or any color at all.

So my script made the mistake of including the value that you DID supply, which was the URL of my robots.txt file, and that's what you see on the result page.

You completely hijacked the color processing that was supposed to occur, and tricked my script into doing something different, not what I, the programmer, wanted to happen, but what YOU, the hacker, wanted to happen.
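
The mistake described above usually comes down to one line. The sketch below is a guess at the general pattern, not the actual code behind this page. The function name and the colors/ file paths are hypothetical, and the code assumes PHP 7 or later.

```php
<?php
// VULNERABLE PATTERN (shown as a comment so it cannot run by accident):
//   include($_GET['color']);   // includes whatever the visitor sent

// SAFER PATTERN: the visitor's input only selects an entry from a fixed list.
// The input itself never becomes part of the path given to include().
function color_file_for($input): string
{
    // Hypothetical mapping of legal colors to local files
    $files = array(
        'red'   => 'colors/red.txt',
        'blue'  => 'colors/blue.txt',
        'green' => 'colors/green.txt',
    );

    if (is_string($input) && isset($files[$input])) {
        return $files[$input];
    }
    return $files['red']; // default when the input is missing or not a legal color
}

// Possible use, assuming the files exist:
//   include(color_file_for(isset($_GET['color']) ? $_GET['color'] : null));
```

With this arrangement, a submitted URL is simply not a key in the list, so the default file is included instead.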

The simple printout of my robots.txt looks harmless enough, but do you appreciate what a horrible thing has just happened?
What if you had given it the URL of a PHP script located on (for our example) YOUR website? My script would just as happily have fetched that file from your website, and included that into the page. But this time it's not a harmless robots.txt. It contains PHP code. That code would have become part of MY script, and it would have executed.

If my original code (in pseudo-code form) were:

fetch_the_color_file();
place_its_text_on_the_page();
do_more_stuff();

it could then become:

fetch_the_color_file();
place_its_text_on_the_page(); // but it's PHP code!, so it runs and does this:
make_a_list_of_all_files_in_my_site();
for(every_file)
{
 open_the_file_in_append_mode();
 add_a_virus_infected_iframe_to_the_bottom();
 save_the_file();
}
do_more_stuff();

All that extra code came from the file on your website. It got inserted right into the middle of my own code, and it ran. Now every single page of my site has a virus-infected iframe in it. That's what I get for failing to make sure that what you sent me was a legitimate color!

PHP $_GET[''] Data Validation Example Code

The examples show several methods of validating $_GET[''] variables. The same methods apply to $_POST[''] and the other sources of untrusted data. The PHP functions used are documented at php.net.
The basic strategy is the same for all methods:

It's easier if you don't use the incoming $_GET[''] variables directly throughout your script.
Instead, create an ordinary local variable to hold each incoming value.
Initialize each variable with a legitimate starting value.
That will be its default if the incoming replacement value is missing or invalid.
Test each incoming $_GET[''] to make sure it is completely legitimate for what it is supposed to be.
If it is supposed to be an integer, it must contain only digits.
If it is supposed to be one of 5 possible values, make sure it exactly matches one of the 5.
Transfer the $_GET[''] value to the local variable only if it survived the tests.
Otherwise, use the default values or abort the script, whichever is appropriate to the application.
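
The strategy above can be condensed into one small helper. This is only a sketch: the function name, its parameters, and the sample $request array are hypothetical, and the code assumes PHP 7 or later. In a real script you would pass $_GET or $_POST as the first argument.

```php
<?php
// Hypothetical helper: returns the trimmed incoming value only if it passes
// the supplied test; otherwise returns the default.
function validated_or_default(array $source, string $key, $default, callable $isValid)
{
    if (!isset($source[$key]) || !is_string($source[$key])) {
        return $default; // missing, or not a plain string (e.g. an array)
    }
    $value = trim($source[$key]);
    return $isValid($value) ? $value : $default;
}

// The test for the article's color field: an exact match against the legal list.
$isLegalColor = function (string $v): bool {
    return in_array($v, array('red', 'blue', 'green'), true);
};

// Sample data standing in for $_GET
$request = array('color' => ' blue ');
$Color = validated_or_default($request, 'color', 'red', $isLegalColor);
```

The rest of the script then uses only $Color, never the raw incoming value.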

After you remove the comments, you'll see that none of the examples have much code.
Validation can be easy.

1) Compare incoming value against an array of all possible legal values

If there are many legal values, you could keep the list in a file and use the file() function to read it into the array when you need it.

<?php
// LOCAL VARIABLE WITH ITS LEGITIMATE DEFAULT VALUE
$Color = 'red';

// ARRAY OF ALL POSSIBLE LEGAL VALUES FOR THE VARIABLE
$LegalColors = array
(
 'red',
 'blue',
 'green'
);

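// IF USER SUBMITTED A COLOR VALUE, AND IT IS A STRING.
// A REQUEST LIKE ?color[]=x ARRIVES AS AN ARRAY, WHICH trim() CANNOT HANDLE.
if(isset($_GET['color']) && is_string($_GET['color']))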
{
 // REMOVE IRRELEVANT LEADING/TRAILING WHITESPACE FROM THE INCOMING TEXT
 $_GET['color'] = trim($_GET['color']);

 // CHECK AGAINST THE LEGAL-VALUES ARRAY, WITH STRICT TYPE CHECKING
 if(in_array($_GET['color'], $LegalColors, TRUE))
 {
  // TRANSFER THE INCOMING VALUE TO THE LOCAL VARIABLE
  $Color = $_GET['color'];
 }
 // AN else {} HERE COULD ABORT THE SCRIPT IF THE VALUE WAS ILLEGAL
}
?>
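
If the legal values are kept in a file, as suggested at the start of this section, reading them might look like the sketch below. The helper name is hypothetical, the file is assumed to hold one value per line, and the code assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: loads one legal value per line from a file you control.
function load_legal_values(string $path): array
{
    // FILE_IGNORE_NEW_LINES strips line endings; FILE_SKIP_EMPTY_LINES drops blank lines
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return array(); // unreadable file: nothing is legal, so the default stays in effect
    }

    // Trim stray spaces and drop any lines that end up empty
    $values = array_filter(array_map('trim', $lines), function ($v) {
        return $v !== '';
    });
    return array_values($values);
}

// Possible use: $LegalColors = load_legal_values($PathToYourListFile);
```

The file counts as trusted data only if visitors cannot write to it, so it is best kept somewhere they cannot reach.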

2) Test incoming value against a regular expression

This example uses a regular expression to test against all possible legal values. That is an exact duplication of the array validation method above, but regex testing can be used more flexibly than that: you can test for variations and patterns rather than against specific entire strings. Be sure that your regular expression matches all the possible legal values, but nothing else.

<?php
$Color = 'red';

if(isset($_GET['color']) && is_string($_GET['color'])) // SUBMITTED, AND A STRING RATHER THAN AN ARRAY
{
 $_GET['color'] = trim($_GET['color']);
 if(preg_match('/^(red|blue|green)$/u', $_GET['color']))
  $Color = $_GET['color'];
}
?>

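<?php
// OTHER USEFUL REGULAR EXPRESSIONS. A WEB SEARCH WILL FIND MANY COMMON ONES.
// THE D MODIFIER MAKES $ MATCH ONLY AT THE VERY END, NOT BEFORE A TRAILING NEWLINE.
// $_GET['var'] IS ASSUMED TO HAVE PASSED isset() AND is_string() CHECKS ALREADY.

if(preg_match('/^[A-Z]{1,8}$/D', $_GET['var'])) { /* 1-8 UPPERCASE ALPHABETIC */ }
if(preg_match('/^[A-Z0-9]{1,8}$/Di', $_GET['var'])) { /* 1-8 UPPER/lower ALPHANUMERIC */ }
?>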
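
As an illustration of testing for a pattern rather than a fixed list, here is a sketch that accepts any 6-digit hexadecimal color code. The hex color field is a hypothetical addition, not part of the form used in this article, and the code assumes PHP 7 or later.

```php
<?php
// Hypothetical example: accept a 6-digit hex color such as "#1a2b3c".
function is_hex_color($value): bool
{
    // \A and \z anchor the match to the very start and very end of the string,
    // so nothing (not even a trailing newline) can be smuggled in around it.
    return is_string($value) && preg_match('/\A#[0-9a-f]{6}\z/i', $value) === 1;
}
```

A URL cannot pass this test, because it contains characters the pattern does not allow and is not exactly 7 characters long.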

3) Test incoming value with switch cases

The switch method allows some additional flexibility: you can translate incoming values to different values for internal use. This example, in addition to allowing the color names, allows numeric color values of 1,2,3 and uses the cases to translate them to red,blue,green for internal use. If I only used the numbers in publicly visible URLs, I could prevent anyone knowing what values they are translated to internally.

<?php
$Color = 'red';

if(isset($_GET['color']) && is_string($_GET['color'])) // SUBMITTED, AND A STRING RATHER THAN AN ARRAY
{
 $_GET['color'] = trim($_GET['color']);

 switch($_GET['color'])   
 {
  case 'red': 
  case '1': 
   $Color = 'red';  
   break;
  case 'blue': 
  case '2':
   $Color = 'blue';
   break;
  case 'green': 
  case '3':
   $Color = 'green';
   break;
  default:
   // YOU COULD ABORT SCRIPT HERE
   break;
 }
}
?>
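
The same translation can be done with a lookup table instead of a switch. This is an alternative sketch, not a replacement for the method above; the function name is hypothetical and the code assumes PHP 7 or later. One difference worth knowing: switch compares loosely, so inputs such as "01" or "1.0" match case '1', while the table lookup below accepts only the exact keys.

```php
<?php
// Hypothetical helper: translates an incoming color name or number
// to the internal color name, or returns the default.
function translate_color($input, string $default = 'red'): string
{
    $map = array(
        'red'   => 'red',   '1' => 'red',
        'blue'  => 'blue',  '2' => 'blue',
        'green' => 'green', '3' => 'green',
    );

    if (!is_string($input)) {
        return $default; // missing value, or an array such as ?color[]=x
    }
    $key = trim($input);
    return array_key_exists($key, $map) ? $map[$key] : $default;
}
```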

4) Validating numeric values

<?php
$Age = 0;

if(isset($_GET['age']) && is_string($_GET['age']))  // IF USER SUBMITTED AN AGE VALUE AS A STRING
{
 $_GET['age'] = trim($_GET['age']);

 // "IF THE INPUT CONSISTS OF 1 TO 3 DIGITS"
 if(preg_match('/^[0-9]{1,3}$/', $_GET['age']))
 {
  // FORCE THE VARIABLE TO THE REQUIRED TYPE
  settype($_GET['age'], 'integer');

  // TEST FOR ACCEPTABLE MINIMUM, MAXIMUM VALUES
  if(($_GET['age'] >= 0) && ($_GET['age'] <= 114))
  {
   // ACCEPT THE VALUE TO OUR LOCAL VARIABLE
   $Age = (int)$_GET['age']; 
  }
 }
 // AGAIN, IF INCOMING VALUE WASN'T VALID, LOCAL $Age WASN'T CHANGED.
}
?>
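
Numbers that are not integers can be handled along the same lines. The sketch below assumes a hypothetical weight field, in kilograms from 0 to 500, which is not part of the form used in this article. It uses the FILTER_VALIDATE_FLOAT filter to check the format and then tests the range by hand. The code assumes PHP 7 or later.

```php
<?php
// Hypothetical field: a weight in kilograms between 0 and 500.
function validate_weight($input, float $default = 0.0): float
{
    if (!is_string($input)) {
        return $default; // missing, or not a plain string
    }

    // filter_var() returns the number as a float, or FALSE if it is not a valid number
    $value = filter_var(trim($input), FILTER_VALIDATE_FLOAT);
    if ($value === false) {
        return $default;
    }

    // TEST FOR ACCEPTABLE MINIMUM, MAXIMUM VALUES
    if ($value < 0.0 || $value > 500.0) {
        return $default;
    }
    return $value;
}
```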

5) Validating numeric values with a PHP Filter

This alternative uses a PHP 5.2+ validating "filter function" to validate an integer with less code:

<?php

$Age = 0;

// RETURN VALUE IS THE VALIDATED INTEGER ON SUCCESS, FALSE IF VALIDATION FAILS, OR NULL IF 'age' WAS NOT SENT
$i = filter_input(INPUT_GET, 'age', 
 FILTER_VALIDATE_INT, 
 array('options'=>array('min_range'=>0, 'max_range'=>114)));

if(($i !== FALSE) && ($i !== NULL))
 $Age = $i;

?>

The following code is equivalent. I currently recommend using it instead because it appears to me that filter_var is more reliable, predictable, and portable than filter_input. The user comments at php.net about filter_input (see the link) mention odd behavior that I've also experienced.
We must use isset() because reading $_GET['age'] when it is not set raises an "undefined index" notice (an "undefined array key" warning in PHP 8) before filter_var even runs. In the example, the default $Age of 0 is used if $_GET['age'] is not set, and is also specified as the default value if it is set but invalid.

<?php

$Age = 0;
if(isset($_GET['age'])) 
 $Age = filter_var($_GET['age'],
  FILTER_VALIDATE_INT, 
  array('options'=>array('default'=>$Age, 'min_range'=>0, 'max_range'=>114)));

?>
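
When a form has several fields, filter_var_array() can validate them in one call. The sketch below applies it to the age and color fields from this article. The function name validate_submission is hypothetical, and the code assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: validates both form fields at once.
// $input would normally be $_GET.
function validate_submission(array $input): array
{
    $defaults = array('age' => 0, 'color' => 'red');

    $rules = array(
        'age' => array(
            'filter'  => FILTER_VALIDATE_INT,
            'options' => array('min_range' => 0, 'max_range' => 114),
        ),
        'color' => array(
            'filter'  => FILTER_VALIDATE_REGEXP,
            'options' => array('regexp' => '/^(red|blue|green)$/D'),
        ),
    );

    $filtered = filter_var_array($input, $rules);

    $clean = array();
    foreach ($defaults as $key => $default) {
        $value = is_array($filtered) ? ($filtered[$key] ?? null) : null;
        // FALSE means the value failed validation; NULL means it was not sent
        $clean[$key] = ($value === false || $value === null) ? $default : $value;
    }
    return $clean;
}
```

The strict comparisons against FALSE and NULL matter here, because a legitimate age of 0 must not be mistaken for a failure.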

More defenses against RFI

Two other methods of RFI defense can serve as backup, in case you make a mistake in your script and allow some variables to go unvalidated, or in case an application you use contains not-yet-discovered RFI vulnerabilities:

Configure PHP so that it will not fetch files from remote websites, even if a program tells it to.
See the settings allow_url_fopen and allow_url_include.
Use Apache .htaccess to ban (reject, without processing) requests where the HTTP query string contains a URL disguised as innocent data.
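
Both backup defenses are configuration rather than code. As a rough sketch, the two PHP settings named above could be set like this in php.ini. Note that turning off allow_url_fopen also stops legitimate code from opening URLs with functions such as file_get_contents(), so check what your applications need first.

```ini
; php.ini
; Do not let file functions open URLs as if they were files
allow_url_fopen = Off
; Do not let include/require fetch remote files (Off is already PHP's default)
allow_url_include = Off
```

For the .htaccess approach, one possible rule set is shown below. It is an assumption about how such a rule might be written, not the rules used on this site. It requires mod_rewrite, and it will also reject legitimate requests that carry a URL in the query string, so it only suits sites that never expect one.

```apache
# .htaccess
RewriteEngine On
# Reject requests whose query string contains something that looks like a URL,
# whether written plainly (http://) or URL-encoded (http%3A%2F%2F)
RewriteCond %{QUERY_STRING} (https?|ftp)(:|%3A)(/|%2F){2} [NC]
RewriteRule ^ - [F]
```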

Notes

1. All security-related validation must be done in your server-side PHP code. If you want, you can do preliminary validation on the client side with JavaScript in the user's browser, but they can easily avoid that validation by turning JavaScript off. Besides, malicious robots (which are the real threat) don't run your JavaScript. They don't even load your web page. They send their malicious data directly to your PHP script. It's similar to how, in the earlier RFI experiment, you entered your "malicious" requests directly into your address bar.
2. There is another type of attack called Local File Inclusion (LFI). It attempts to trick your script into including a sensitive file (such as a password file) from your own server. The method of attack is the same: it sends the path and name of the file it wants to see, hoping your script will include() it. The defense is also the same: if you receive a variable value that is not one of the legal ones you were expecting, don't use it.
3. You can discover whether your website is receiving RFI or LFI attacks at my hack attempt identifier.
4. Whenever a security vulnerability report, such as at Secunia.com, says that an application uses "unsanitized input", it means that the application fails to use the methods described above to validate input. It is therefore vulnerable to attack.
5. The misspelled "htpp" in the demonstration URL is deliberate. Because of the protections I use against real RFI attacks, the demonstration works only because it is a simulation. If you paste the correctly spelled URL into your address bar, it will be a real RFI attack, and you'll get a blank page. Use your browser's Back button to return to this page.
6. Although this article and its examples are about PHP, the same principles apply to all languages. It's not the 1980s anymore. Any program that will be used by someone other than you needs to protect itself against the possibility that incoming data is malicious. Data validation is tedious for the programmer and causes code bloat, but it's necessary.

WhiteCyber

Saturday, June 11, 2011