DEF CON Quals were this weekend, and as always, they delivered some of the hardest pwning challenges we’ve seen this year. insanity was not even one of them, but we still spent several hours solving it, and had tons of fun. The concept is a classic: The program implements a custom VM with weird operations, and you have to reverse and exploit it…
There is a slight twist though: insanity accepts its input as “voice commands”, encoded as a 16 kHz, 8-bit audio file. TL;DR here is the full exploit for your listening pleasure:
But let’s cut to it. The program uses the PocketSphinx library for speech recognition. Here is the code of the main function:
So, how does it work?
The program reads a series of data chunks from the user, each prefixed by their length (packed as a little-endian DWORD). It decompresses every chunk using zlib and XORs every byte with 0x80. The samples get expanded to 16 bits by multiplying the values with 0x100.
At this point, we made a stupid reversing mistake that made the challenge more diffult than necessary for us: We thought that file had to fit into a buffer of size 0x10000. At 16 kHz, this makes each chunk at most 4 seconds in playback time, which is not very much. In fact however, the input is streamed into the PocketSphinx library, so it could be as large as we want.
Each audio chunk is then individually passed to the PocketSphinx library
for recognition. The program assumes that the recognized string has the
"insanity "*X + "insane" for some integer X. The value X is then
appended to the VM memory and will later be interpreted as a VM instruction.
If recognition succeeds (as in, returns some hypothesis string), but does not
yield a string of the correct format, the program simply continues to read
chunks. This means that if we send a fake chunk with non-audio data and then
end with 4 zero bytes, we control the contents of the
completely. This behaviour will turn out to be useful in the actual exploit.
Pretty insane so far? Yeah we though so too.
Generating the numbers 1 to 8
We tried several methods of generating the audio files, and ended up with something that works at least up the value 8, which was enough for our purposes:
- Use the macOS speech synthesizer to create the sentence:
say -v samantha -o sound.aiff insanity insanity ... insane.
- Convert to 16-bit, 16 kHz WAV using ffmpeg:
ffmpeg -i sound.aiff -ar 16000 sound.wav.
- For values 6 and above, do manual post-processing to bring the playback time down to under 4 seconds (the limit which we thought we had). We used audacity to cut off some silence in the beginning and end of the file, truncate the last word a bit and increase the tempo by ~35%.
- Read the 16-bit samples from the audio file and convert them to 8-bit using
rounding. This step could have easily been done using
ffmpeg -a pcm_u8, but this gave us worse recognition results than with the rounding for our sped up speech samples.
- XOR every sample with 0x80.
Shout out to Samantha for lending us her easily recognizable voice. You might want to listen to the resulting value 8 (in case you missed it in the exploit above, or you are simply slowly going insane just as we did during the development of this exploit):
Here is Python code that automates the last two steps:
And here is some sample code that sends a series of VM commands:
The virtual machine implemented by the program is fairly simple. It’s a stack based machine, operating on 8 byte values which can either be strings or integers. Strings are stored as pointers to null-terminated strings and marked as such by setting the MSB to 1. The stack grows upwards, towards higher addresses. After all code is read into the program, a 0 is placed directly behind the code, and the stack pointer is set to point there (i.e. the stack begins right after the code). Additionally, the first two values (before the code, at offset 0 and 1) are string pointers to the memory itself and to the decoded input string (“insanity insanity …”). The final layout the looks something like this:
+--------+ +--> "insanity insanity ..." | +----+---+----+---+--------+--------+--------+--------+--------+ +-->| str | str | instr0 | instr1 | ... | zero | .... | +--------+--------+--------+--------+--------+--------+--------+ ^ ^ | IP | SP
It should be noted that nothing prevents SP from pointing into code (and IP from pointing into the stack for that matter). As such we control the initial stack by writing the values after our instructions and removing the trailing zero (e.g. by issuing an
add as first instruction).
The VM implements the following opcodes:
exits the interpreter loop and thus causes main() to return
Pushes the string “insanity” onto the stack
Pops two values from the stack and writes their sum back. In case the two values are strings, the function will concatenate the strings (into a newly allocated buffer). However, it fails to set the MSB of the resulting pointer. A pointer into the heap is thus treated as an integer
Subtracts the two values on the top of the stack. Only works for integers
Multiplies the two values on the top of the stack. Only works for integers
Compares the two values on the stack and pushes either 1 (equal) or 0 (not equal) onto the stack. Works for integers and strings
Loads an integer value from one of the two string pointers stored at the beginning of the virtual memory. The offset is popped from the stack. There is no bounds checking, so this enables a relative read past the virtual memory buffer
Stores the value at SP - 1 into the virtual address at SP. No bounds checking is performed, so this gives us an arbitrary write past the virtual memory buffer.
8: Cond. Jmp
If the value at SP - 1 is not equal to 0, this performs a jump to the virtual address at SP. No bounds checking is performed on the virtual address.
Converts the lower byte of the value at SP into a newly allocated string and pushes that onto the stack
opcode - 10 onto the stack. I.e. the opcode 14 would push the value 4 onto the stack.
Since we are somewhat limited in the values we can write to the initial stack, we have to generate large integer values ourselves. The following Python code implements this:
We now have everything we need to finish the exploit.
The binary is compiles as PIE, as such, we require an information leak first. This is quite easy though: since we have an unbounded read relative to the base address of the VM memory, which is located on the stack, we can read the return address of main() into our program, add an offset to it, and write the resulting value back onto the stack. We will need multiple reads and writes for the final exploit, and thus need to load multiple large values onto the stack. Since this is somewhat cumbersome with our current setup, we would first like to pivot execution into an area of memory that we fully control: the zlib output buffer, which is located behind the virtual memory buffer. To do that, we will supply a small program that will simply perform a long jump into the start of the decompression buffer:
calc(17362) push(1) # The conditional for the jmp, must be != 0 jmp()
We can now write arbitrary bytecode. The final steps are fairly straightforward: We assemble a very short ROP chain on the stack that will load a pointer to the string “/bin/sh” (found in the libc) into rdi, then jump to system(). Implemented in “Insanity”, this looks as follows:
push(33779) # Push offset to return address from memory buffer onto the stack load(0) # Load return address onto the stack. # '0' selects the string at offset 0 in the virtual memory, which points to itself push(offset_system - offset___libc_start_main_ret) add() # Add offset to leaked libc address push(33781) # Push new offset, 2 qwords behind return address store() # Trigger out-of-bounds write # Repeat ... push(33779) load(0) push(offset_str_bin_sh - offset___libc_start_main_ret) add() push(33780) store() push(33779) load(0) push(offset_pop_rdi - offset___libc_start_main_ret) add() push(33779) store() # Trigger ROP chain exit()
The final exploit can be found here.