Intercepting System Calls on x86_64 Windows

Abstract

This post presents a technique for achieving Universal System Call Hooking
under WOW64 [1] from usermode.

In the following sections we briefly introduce system calls and the
motivation for this article, look at how Linux’s ptrace(2) handles system
call interception and how WOW64 Windows performs system calls, and finally
show how we take advantage of this to intercept every system call, giving
third-party software a way to analyze and/or modify system calls made by a
process.

Additionally, we will discuss possible improvements, advantages and
disadvantages of the presented technique.

Finally, we provide the reader with a Proof of Concept including all sources,
pre-compiled binaries, a sample output and a small analysis based on the
output and the source.

Introduction

A system call is, as the name suggests, a call to the “system”, or kernel.
System calls are the lowest level a process can reach, because they are the
only way to communicate with the kernel. Whatever the kernel does with the
data a process provides cannot be seen by the process; a process only sees
the result of the system call. It is worth noting that any kind of
Input/Output goes through the kernel (be it a file, socket, etc.)

Whereas the Linux kernel provides the ptrace(2) [2] API, which allows
processes to intercept system calls made by child processes [3], Windows
does not provide such functionality. Windows does, however, ship with a
fairly comprehensive debugging framework, which we will not use here. That
is, we do not use dbghelp.dll to help debugging, nor do we attach to a
process at all. This means that anti-debugger methods do not work against
processes that are being hooked using this technique, or using RREAT [POC]
in general.

Because Windows does not natively support system call hooking, there have
been a number of kernel drivers (e.g. rootkits) which intercept system
calls by hooking the SSDT [4], installing kernel hooks, etc.

Using the technique presented in this post, one does not need
administrator privileges (which are required to load drivers into the
Windows kernel) to perform system call hooking. Instead, it requires (at
least) the same privileges as the process we want to intercept. This makes
the technique fairly attractive. For one, a user doesn’t have to run as
administrator in order to debug another process. Secondly, 64bit Windows
kernels make it fairly hard to load drivers at all, and once loaded, the
driver is not allowed to hook anything (using the traditional methods),
because PatchGuard [5] will jump in when it detects the modification
(typically by halting the system with a bug check.)

ptrace(2)’s approach

The Linux kernel provides the ptrace(2) API, which handles everything
related to the debugging of child processes. It also supports intercepting
system calls and does this in a very clean way, so we follow the same
approach in our implementation.

Linux’s ptrace(2) works as follows. When a child process performs a system
call, the parent (the debugger) gets a notification before the system call
is executed (we will call this the pre-event from now on). At this point
the parent is able to inspect the arguments to the system call by reading
the registers (or the stack, depending on the syscall calling convention
[6].) The parent also has the ability to alter arguments, because it can
change registers and write memory in the child. After the parent finishes
the pre-event inspection, it tells ptrace(2) that the child can execute the
system call. The Linux kernel then continues with the execution of the
system call that was made by the child process (with either the original
parameters, or parameters that were altered by the parent.)

After the system call finishes, and right before the child process resumes
execution, the post-event is triggered: the parent receives another
notification. This time the parent is able to read the return code and any
altered parameters (e.g. functions such as read(2) fill a buffer, given as
a parameter, with contents from a file.) Note that the parent is,
logically, also capable of altering the return code and/or anything else.

It should be obvious that the ability to intercept system calls gives full
control over a process’s behaviour.

Architecture

The technique presented here is specific to WOW64 processes (that is,
32bit binaries running on a 64bit Windows version). However, with some
extra work this technique can also be deployed on 32bit Windows versions
(although it’s not as “reliable” on 32bit Windows for various reasons.)

In order to be able to run 32bit binaries on a 64bit Windows version, there
has to be an extra layer between usermode and kernelmode. This is because
a 64bit kernel expects 64bit pointers as parameters, whereas the 32bit
application provides 32bit pointers.

This brings us to segments. On WOW64 there are two code segments. The code
from our 32bit binary runs in segment 0x23, which makes the CPU execute it
as 32bit code (compatibility mode). However, when a syscall is made, the
CPU has to switch to segment 0x33, which makes the CPU use the 64bit
instruction set.
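
The two values above are ordinary x86 segment selectors, so they can be
split into their standard fields: a descriptor table index, a table
indicator and a requested privilege level. The following is a minimal
sketch of that decoding; the struct and function names are made up for
this illustration and are not part of the PoC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of an x86 segment selector (hypothetical helper type). */
struct selector_fields {
    unsigned index; /* descriptor table index, bits 3..15 */
    unsigned ti;    /* table indicator, bit 2: 0 = GDT, 1 = LDT */
    unsigned rpl;   /* requested privilege level, bits 0..1 */
};

/* Split a 16bit selector such as 0x23 or 0x33 into its fields. */
static void decode_selector(uint16_t sel, struct selector_fields *out)
{
    assert(out != NULL);
    out->index = sel >> 3;
    out->ti    = (sel >> 2) & 1u;
    out->rpl   = sel & 3u;
}
```

Both 0x23 and 0x33 decode to a GDT entry requested at privilege level 3
(usermode); they only differ in the descriptor index (4 versus 6), and it
is that descriptor which selects 32bit or 64bit execution.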

Switching segments in WOW64 is done by a so-called far jump (a jump which
has an address and a segment.) Because this far jump is kind of tricky,
and hardcoded, it only appears once in ntdll.dll (where all system calls
take place.)

By replacing the instruction at this particular address, we can intercept
every system call.
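
To make the shape of such a far jump concrete: in 32bit code a direct far
jump is encoded as the opcode 0xEA, followed by a 32bit offset and a 16bit
segment selector, seven bytes in total. Below is a small sketch that
recognizes this encoding in a byte buffer. The helper names are
hypothetical, and how the bytes are obtained from the target process is
left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAR_JMP_OPCODE 0xEA /* jmp ptr16:32 */
#define FAR_JMP_LENGTH 7    /* opcode + 32bit offset + 16bit selector */

struct far_jump {
    uint32_t offset;   /* destination address within the new segment */
    uint16_t selector; /* destination code segment, e.g. 0x33 */
};

/* Returns 1 and fills *out if code starts with a direct far jump,
 * returns 0 otherwise. */
static int decode_far_jump(const uint8_t *code, size_t len,
                           struct far_jump *out)
{
    assert(code != NULL && out != NULL);
    if (len < FAR_JMP_LENGTH || code[0] != FAR_JMP_OPCODE)
        return 0;
    /* Both operands are stored in little-endian byte order. */
    out->offset = (uint32_t)code[1] | ((uint32_t)code[2] << 8) |
                  ((uint32_t)code[3] << 16) | ((uint32_t)code[4] << 24);
    out->selector = (uint16_t)(code[5] | (code[6] << 8));
    return 1;
}
```

A selector of 0x33 in the decoded result is a good sanity check that the
bytes really are the 32bit to 64bit switch.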

Windows’ Implementation

As stated earlier, there is only one address in ntdll.dll where the segment
switch is done. So whenever a process makes a system call, this particular
address is hit. In other words, by redirecting the execution flow from this
address, we are able to intercept every system call the process makes.

The next problem we face: how do we get that address? Let’s dive into some
assembly, namely that of the ZwCreateFile() [7] function (on Windows 7
SP1.)

All this function does is set up the arguments correctly for the system
call. The important parts are 0x52 (this is another notation for 52h),
which is the system call number for the ZwCreateFile() API, and the fact
that the ‘edx’ register is loaded with the address of the first argument.
The instructions that initialize these registers are followed by the call
instruction, which continues code execution at the address stored in
fs:[0xc0]. We fire up a debugger and find the address in fs:[0xc0] (simply
by executing “mov eax, fs:[0xc0]” and reading the value of ‘eax’
afterwards.) We analyze the instruction at this address and see that it
is, as you might have guessed already, a far jump!

We have just seen what the ZwCreateFile() function looks like, and it’s
worth noting that every system call stub looks just like it, except that
each loads a different number into the ‘eax’ register (every function has
a unique system call number.)

Now that we know what to hook and at which address, it is time to get to
our implementation.

Internal Workings

Our PoC is based on several components: a redirection at the child’s far
jump, some injected code in the child to handle system calls and notify
the parent, and finally a thread in the parent, which waits for
notifications.

Setting up the Universal System Call Hooking mechanism works as follows.

The parent creates an Event object and duplicates it to the child process.
This way, when the child signals the event object, the parent gets a
notification, which is what we need for pre- and post-event notifications.

The parent allocates a memory page in the child and writes some
hand-written machine code to it (this is 32bit machine code). Before the
machine code is written to the child, a few dependencies are updated in
it, such as the address of the far jump and the handle of the Event object
in the child process. As ntdll.dll appears to be mapped to the same base
address in the child and parent process, we can simply take the far
address (an address with segment) from the parent process.
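
Updating such a dependency boils down to overwriting a placeholder inside
the machine code template with the real value, in little-endian byte
order. The sketch below shows this for a single 32bit placeholder. The
function name and the idea of addressing the placeholder by a byte offset
are assumptions made for this illustration; the PoC may organize its
template differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32bit value in little-endian order at code[offset..offset+3].
 * Returns 1 on success, 0 if the value would not fit in the buffer. */
static int patch_u32(uint8_t *code, size_t size, size_t offset,
                     uint32_t value)
{
    assert(code != NULL);
    if (offset > size || size - offset < 4)
        return 0;
    for (int i = 0; i < 4; i++)
        code[offset + i] = (uint8_t)(value >> (8 * i));
    return 1;
}
```

For example, given a hypothetical template fragment 68 00 00 00 00 (push
imm32), calling patch_u32() with offset 1 and the duplicated event handle
turns it into an instruction that pushes that handle.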

The parent creates a thread (in its own process) which is basically an
infinite loop that waits for the child to signal the event object (more on
this later.)

The parent overwrites the far jump in the child with a jump to the
hand-written machine code that we injected earlier.
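
The overwrite in the last step can be sketched as follows, assuming the
seven-byte far jump is replaced with a five-byte near jump (opcode 0xE9
followed by a 32bit displacement relative to the end of the instruction)
and the two leftover bytes are padded with NOPs. This is one plausible
encoding, not necessarily the one the PoC uses, and the function name is
hypothetical. The resulting bytes would then be written into the child
with the usual process memory APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a 7 byte patch: "jmp rel32" from address `from` to address `to`,
 * followed by two NOPs to cover the rest of the original far jump. */
static void build_jump_patch(uint8_t patch[7], uint32_t from, uint32_t to)
{
    assert(patch != NULL);
    /* The displacement is relative to the address of the next
     * instruction (from + 5); unsigned arithmetic wraps modulo 2^32,
     * which yields the right encoding for backward jumps as well. */
    uint32_t rel = to - (from + 5u);
    patch[0] = 0xE9;
    for (int i = 0; i < 4; i++)
        patch[1 + i] = (uint8_t)(rel >> (8 * i));
    patch[5] = 0x90; /* nop */
    patch[6] = 0x90; /* nop */
}
```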

Implementation of the Injected Machine Code

The injected machine code behaves just like ptrace(2)’s method, but in
order to achieve this, a few hacks are necessary. (ptrace(2)’s method:
pre-event, perform real system call, post-event.)

After a child notifies the parent it enters a busy-loop [8], which gives
the parent enough time to catch up. The parent will now suspend [9] the
thread in the child process, and after it has been suspended, read the CPU
registers.

The parent keeps track of whether this notification is a pre-event or a
post-event, simply by toggling a boolean value on every notification.
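
This bookkeeping is tiny, but it is easy to get off by one, so here is a
minimal sketch of it. The type and function names are made up for this
example; the state starts out as "not inside a system call", so the first
notification is treated as a pre-event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum event_kind { PRE_EVENT, POST_EVENT };

/* Per-child bookkeeping kept by the parent (hypothetical). */
struct tracer_state {
    bool in_syscall; /* true between a pre-event and its post-event */
};

/* Classify the notification that just arrived and flip the state. */
static enum event_kind next_event(struct tracer_state *st)
{
    assert(st != NULL);
    enum event_kind kind = st->in_syscall ? POST_EVENT : PRE_EVENT;
    st->in_syscall = !st->in_syscall;
    return kind;
}
```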

If the notification is a pre-event, the parent will read the values from
the ‘eax’ and ‘edx’ CPU registers (these contain the system call number and
the address of the first argument, respectively.) From here on, the parent
can decide to read all arguments to the function, by reading data from the
child process (at the address specified by the ‘edx’ register.)
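
Since the arguments of a 32bit system call are consecutive 32bit values
starting at the address in ‘edx’, decoding them is straightforward once
the raw bytes have been copied out of the child. The sketch below assumes
those bytes are already in a local buffer (how they are read from the
child is left out), and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode up to max_args little-endian 32bit arguments from raw bytes
 * that were copied from the child, starting at the address in 'edx'.
 * Returns the number of arguments decoded. */
static size_t decode_args(const uint8_t *raw, size_t raw_len,
                          uint32_t *args, size_t max_args)
{
    assert(raw != NULL && args != NULL);
    size_t count = raw_len / 4; /* ignore a trailing partial value */
    if (count > max_args)
        count = max_args;
    for (size_t i = 0; i < count; i++) {
        const uint8_t *p = raw + 4 * i;
        args[i] = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                  ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    }
    return count;
}
```

Arguments that are pointers (such as a filename structure) still refer to
memory in the child, so following them requires another read from the
child process.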

When the notification is a post-event, on the other hand, the parent can
read the return value of the system call and optionally any altered
arguments. A lot of Windows APIs alter arguments to the system call, such
as ZwReadFile(), a low-level variant of fread(), which changes the contents
of the ‘buffer’ parameter to the contents from a file (or any other
stream.) In other words, the post-event allows the parent to read the
‘buffer’ parameter that was filled by the ZwReadFile() system call, and
therefore the parent is able to read and modify these contents.

When the parent finishes processing either a pre- or post-event, it will
set the Instruction Pointer (also called Program Counter) past the
busy-loop and resume the thread, this way the thread will happily continue
executing.
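
As an illustration of this step, assume the busy-loop is the classic
two-byte “jump to self” instruction (EB FE). This is an assumption; the
PoC’s loop may be encoded differently. Moving past it then means adding
two to the saved instruction pointer before writing the thread context
back. The sketch works on a local copy of the injected page, and all
names in it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* If *eip points at a "jmp $" (EB FE) inside the injected page, advance
 * it past the loop and return 1; otherwise leave it alone and return 0.
 * `page` is a local copy of the injected code, which is mapped at
 * address `page_base` in the child. */
static int skip_busy_loop(const uint8_t *page, size_t page_len,
                          uint32_t page_base, uint32_t *eip)
{
    assert(page != NULL && eip != NULL);
    if (*eip < page_base)
        return 0;
    size_t off = (size_t)(*eip - page_base);
    if (off >= page_len || page_len - off < 2)
        return 0;
    if (page[off] != 0xEB || page[off + 1] != 0xFE)
        return 0;
    *eip += 2; /* length of the two-byte jump */
    return 1;
}
```

Checking that the thread really is parked on the loop before moving the
instruction pointer guards against resuming a thread that has not reached
the busy-loop yet.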

To signal the parent we need the ZwSetEvent() API, however, we don’t want
to intercept our own API call, so instead of calling the API (or fs:[0xc0]
for that matter), we do a call to the far address directly.

Features and Limitations

The main feature is, well, Universal System Call Hooking in another
process.

Advantages of the technique described here include: the ability to run
without the need for administrator access, the ability to read and modify
arguments (or even the system call!), etc.

The two major disadvantages are as follows. First, a process will always
be able to bypass this method if it’s specifically written to do so.
Second, there can be quite some slowdown, as one system call turns into
three system calls (ZwSetEvent() twice, for the pre- and post-events, and
of course the real system call.) On top of that comes any overhead that
the parent brings, plus the overhead the kernel adds for Inter Process
Communication (signalling an event object in another process) and
manipulation of the thread in the child process (i.e. suspending and
resuming a thread, reading and writing a thread’s CPU registers, and
reading and writing memory in the child.)

The current implementation, which can be found in the Proof of Concept, is
unfortunately single-threaded only.

Optimizations and Improvements

A major improvement, speed-wise, is using a whitelist table in the child.
Our Proof of Concept already does this: when the parent injects the
hand-written machine code, it also allocates a 64k table (no optimizations
here.) Each entry in this table maps to a boolean value which indicates
whether the system call should be hooked or not. So, when the child
performs a system call, the value of the ‘eax’ register (which contains the
system call number) is checked against the table. If the boolean value is
false, the code simply executes the original system call and does nothing
else. This way system calls we are not interested in are not sent to the
parent, so the only overhead for those system calls is a table lookup,
which is quite “cheap.”
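
In the PoC this check is part of the injected machine code, but the logic
is easier to read in C. The following sketch assumes one byte per entry
and 65536 entries, indexed by the low 16 bits of ‘eax’; the exact layout
of the PoC’s table is not spelled out here, so treat both as assumptions.
All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SYSCALL_TABLE_SIZE 65536u

/* One byte per system call number: non-zero means "notify the parent". */
struct hook_table {
    uint8_t hooked[SYSCALL_TABLE_SIZE];
};

/* Enable or disable hooking for one system call number. */
static void hook_table_set(struct hook_table *t, uint16_t number,
                           bool enabled)
{
    assert(t != NULL);
    t->hooked[number] = enabled ? 1 : 0;
}

/* The check performed on every system call: a single table lookup. */
static bool hook_table_lookup(const struct hook_table *t, uint32_t eax)
{
    assert(t != NULL);
    return t->hooked[eax & 0xFFFFu] != 0;
}
```

With a zero-initialized table nothing is hooked; enabling only 0x52 (the
ZwCreateFile() number seen earlier) makes every other system call take
the fast path.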

As the current implementation only supports single-threaded applications,
multi-threaded support should also be added. A simple approach is as
follows. When the child creates a new thread, the parent creates a new
event object and duplicates it to the child. The parent then makes sure
that the event object handle in the child is stored somewhere in the
thread’s Thread Local Storage [10]. From here on, when the child makes a
system call, it notifies the parent using an event object unique to the
current thread. The parent then sees that the system call occurred in a
certain thread and handles the event for that specific thread. Note that
the parent is able to receive a notification when a child creates a new
thread, because this is a system call as well.

Because replacing the far jump with a normal jump is quite easy to detect,
it might be interesting to, instead of replacing the instruction
completely, only replace the far address (this would result in an address
with a segment of 0x23.) If software still picks this up, one could go
further and modify the 64bit code located at the original far address
(e.g. jump back to 32bit code from there.) Methods to make it harder to
detect are only limited by your imagination ;)

If somebody were to add multi-threaded support, great care has to be taken
regarding Race Conditions [11]. Race Conditions can occur when multiple
threads try to read and write to and from the same memory addresses, for
example. In our situation, a malicious thread might alter a thread’s
registers (by using the Windows API), thereby possibly causing undefined
behaviour in the parent. The best way to avoid this is by reading all
registers, stack memory and whatever is needed only once and reusing this
data, rather than reading it every time it’s needed.

Proof of Concept

Source with Binaries can be found here.
Up-to-date source can be found on GitHub.
This Proof of Concept has been tested successfully on 64bit Windows Vista
with Service Pack 2 and 64bit Windows 7 with Service Pack 1.

The PoC consists of two parts, a parent and a child. The child attempts to
open a file called “hello world!”, which does not exist. The parent reads
the filename from the child process after receiving the pre-event
notification and shows the return value (which is an error, because the
file cannot be found) after receiving the post-event.

Running the Proof of Concept (on a 64bit Windows machine of course!) gives
something like the following.

$ parent.exe
Universal System Call Hooking PoC   (C) 2012 Jurriaan Bremer
Started hooking child with process identifier: 3308
Child 3308 opens file: "hello world!".
Child 3308 return value: 0xc0000034 -1073741772

Note that 0xc0000034 is defined as STATUS_OBJECT_NAME_NOT_FOUND, in other
words, no file with that name could be found.
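
The two numbers in the output above are the same 32bit status value,
printed once as hexadecimal and once as a signed integer. The sketch
below reproduces that formatting and extracts the severity field (the top
two bits of an NTSTATUS, where 3 means error). The function names are
made up for this example and are not taken from the PoC.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Severity is stored in the top two bits: 0 = success,
 * 1 = informational, 2 = warning, 3 = error. */
static unsigned ntstatus_severity(uint32_t status)
{
    return status >> 30;
}

/* Format a status as "0x<hex> <signed decimal>", like the PoC output. */
static void format_ntstatus(uint32_t status, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    int32_t as_signed;
    /* Reinterpret the same 32 bits as a signed value. */
    memcpy(&as_signed, &status, sizeof as_signed);
    snprintf(buf, len, "0x%08" PRIx32 " %" PRId32, status, as_signed);
}
```

The negative decimal value is therefore not a separate error code, just
the signed view of a status whose top bits mark it as an error.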

If we dive into the child’s source, we see a single line of code:

fclose(fopen("hello world!", "r"));

What is interesting about this is the fact that, although we call the
fopen() function in our code, code execution still ends up at
ZwCreateFile(). This simply shows that fopen() is a big wrapper around
CreateFile(), which is in turn a wrapper around ZwCreateFile().

The parent’s source is also fairly straightforward and well-documented,
go read it if you like. As the PoC’s source will show, the internals of
our implementation lie in a library named RREAT; needless to say, you can
find the internals there.

A small note regarding RREAT: RREAT was made with unpacking packed
software in mind. Therefore it’s built in such a way that a failing API
leads to an exit() call. When you’re running an unpacking script and
something goes differently than expected, it means that the sample was not
packed in the way you thought it was, and therefore your unpacking script
is wrong, so there is no need for the process to keep running. Because of
this RREAT is not very robust by default; keep this in mind when
developing on top of RREAT ;)

That was it for today, hopefully you liked the content of this post. Feel
free to contact me with any suggestions, criticism etc. Pull requests are
also more than welcome, see here.

Cheers,
Jurriaan

References

  1. WOW64 – Windows 32bit on Windows 64bit
  2. ptrace(2) man page
  3. Attaching to a running process makes it a child process as well.
  4. SSDT Table Hooking
  5. PatchGuard – Kernel Patch Protection
  6. Calling Conventions on Wikipedia
  7. ZwCreateFile() on MSDN
  8. Busy Loop on Wikipedia
  9. When a thread is suspended, it doesn’t execute until it’s resumed again.
  10. Thread Local Storage on Wikipedia
  11. Race Conditions on Wikipedia

4 thoughts on “Intercepting System Calls on x86_64 Windows”

  1. Looks interesting!
    Why the POC is limited to debugging a single specific child instead of allowing the user of testing it with any executable? (I’m specifically thinking about using it on php…)

    • The POC is limited due to the “current” technical implementation. The parent injects some code into the child and gets a notification whenever the child does a system call by using “call dword fs:[0xc0]” at that particular address in ntdll. Even though you could do this to trace PHP, I’d suggest you use a DBI framework like Pintool (http://pintool.org/, or see my other blogposts about Pintool) to solve this problem, as it’s fairly trivial to “bypass” the hooking method discussed in this article.

      If you have any further questions, please let me know ;)
