Deep-Dive in WoW64

Mon 12 September 2016 by hakril

Intro

When working on PythonForWindows, I had multiple encounters with some specificities of WoW64 (Windows 32-bit on Windows 64-bit) and the challenges/opportunities it offers.

A few weeks ago, I tried to play once again with cross-bitness execution, and more precisely with 64b Vectored Exception Handlers (VEH). What I thought would be a "fun afternoon" turned out to be far more challenging than expected and forced me to explore the internals of WoW64.

In this article, all Python code is based on PythonForWindows.

The basics - What I knew before the exploration

WoW64

WoW64 is the x86 emulator that allows 32-bit Windows-based applications to run seamlessly on 64-bit Windows. [1]

On the filesystem (FS) there is the C:\windows\Syswow64 directory that contains the 32-bit versions of the system DLLs / binaries. There is a redirection layer somewhere in WoW64 that redirects FS requests from C:\windows\system32 to C:\windows\Syswow64, but I didn't know where or how.

When exploring the loaded modules of a WoW64 process, we can see some peculiarities:

>>> import windows
>>> cp = windows.current_process
>>> cp.bitness
32
>>> cp.peb.modules[:4]  # Extract the list of loaded modules from the PEB
[<LoadedModule "python.exe" at 0x3c152b0>, <LoadedModule "ntdll.dll" at 0x3c11030>, <LoadedModule "kernel32.dll" at 0x3c110d0>, <LoadedModule "kernelbase.dll" at 0x3c11120>]
>>> cp.peb_syswow.modules  # Extract the list of loaded modules from the PEB64
[<RemoteLoadedModule64 "python.exe" at 0x3c152b0>, <RemoteLoadedModule64 "ntdll.dll" at 0x3c15760>, <RemoteLoadedModule64 "wow64.dll" at 0x3c157b0>, <RemoteLoadedModule64 "wow64win.dll" at 0x3c15080>, <RemoteLoadedModule64 "wow64cpu.dll" at 0x3c15800>]

As we can see, a WoW64 process has two PEBs:

  • A 32-bit one that is the "standard" PEB of the WoW64 process and contains all the 32b loaded DLLs
  • A 64-bit one that is used by WoW64 and contains the WoW64 DLLs
    • ntdll (64 bits)
    • wow64.dll
    • wow64win.dll
    • wow64cpu.dll

A WoW64 process contains both 32b and 64b code, which means that it can freely switch from one mode to the other. On Intel processors the switch of processor mode is controlled by the Code Segment selector (CS) through far jmp, far call, iret, ...

On Windows the CS values for userland mode are:

  • 0x23 for the 32bits mode
  • 0x33 for the 64bits mode

We can see the segments referenced by those two CS values using windbg:

32.1: kd> dg 0x23
                                                    P Si Gr Pr Lo
Sel        Base              Limit          Type    l ze an es ng Flags
---- ----------------- ----------------- ---------- - -- -- -- -- --------
0023 00000000`00000000 00000000`ffffffff Code RE Ac 3 Bg Pg P  Nl 00000cfb

32.1: kd> dg 0x33
                                                    P Si Gr Pr Lo
Sel        Base              Limit          Type    l ze an es ng Flags
---- ----------------- ----------------- ---------- - -- -- -- -- --------
0033 00000000`00000000 00000000`00000000 Code RE Ac 3 Nb By P  Lo 000002fb

The important part is the "Long" flag: Nl means 32b ("Not long", I guess) and Lo means "Long". This means that a 32b process can easily execute 64b code by using far jmp/call/ret instructions. As for why the Limit of 0x33 is 0: the processor does not check the limit of 64b code segments. This has been documented in various posts and it's quite useful for interacting with a 64b process from a 32b one. I made an implementation in PythonForWindows before exploring WoW64 seriously.

>>> import windows
>>> import windows.native_exec.simple_x64 as x64  # x64 assembler
>>> windows.current_process.bitness
32
>>> code = x64.assemble("""
... mov rax, 0x4141414141414141;
... mov rcx, 0x0101010101010101;
... add rax, rcx;
... ret""")
>>> code
'H\xb8AAAAAAAAH\xb9\x01\x01\x01\x01\x01\x01\x01\x01H\x01\xc8\xc3'
>>> r = windows.syswow64.execute_64bits_code_from_syswow(code)
>>> hex(r)
'0x4242424242424242L'
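
To make the two dg outputs above easier to read, here is a small standalone sketch (plain Python, no PythonForWindows needed) that decodes a segment selector and the Flags column. The bit layout assumed for the Flags value (access byte in bits 0-7, then AVL/L/D/G in bits 8-11) is my reading of the windbg output against the Intel descriptor format, and the helper names are made up for this example.

```python
def decode_selector(selector):
    """Split a segment selector into (table index, table indicator, RPL)."""
    return {
        "index": selector >> 3,        # entry number in the descriptor table
        "ldt": bool(selector & 0x4),   # TI bit: 0 = GDT, 1 = LDT
        "rpl": selector & 0x3,         # requested privilege level
    }


def decode_dg_flags(flags):
    """Decode the 'Flags' column printed by windbg's dg command (assumed layout)."""
    return {
        "type": flags & 0xF,                   # 0xb = code, execute/read, accessed
        "dpl": (flags >> 5) & 0x3,             # descriptor privilege level
        "present": bool(flags & 0x80),         # P bit
        "long": bool(flags & 0x200),           # L bit: 64-bit code segment
        "big": bool(flags & 0x400),            # D/B bit: 32-bit default size
        "page_granular": bool(flags & 0x800),  # G bit
    }


cs32 = decode_dg_flags(0x00000CFB)  # Flags of selector 0x23
cs64 = decode_dg_flags(0x000002FB)  # Flags of selector 0x33

print(decode_selector(0x23), cs32["long"], cs32["big"])
print(decode_selector(0x33), cs64["long"], cs64["big"])
```

Both selectors are GDT entries with RPL 3; the only thing that differs between the two descriptors, apart from the ignored limit and granularity, is the pair of L and D/B bits.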

Exception Handling (EH)

Exception Handling constructions will play an important role in the next parts of this article. This is a small overview of what I knew about the different types of exception handling.

In 32bit there is the Structured Exception Handling (SEH) based on the stack.

In 64-bit, instead of SEH there is Table-based Exception Handling (TEH), which relies on a lot of structures in the PE itself. All I knew is that I had never taken the time to look into it until now.

In both 32-bit and 64-bit there are Vectored Exception Handlers (VEH), which are called before the other EH constructions. A VEH can be added/removed via ntdll!RtlAddVectoredExceptionHandler and ntdll!RtlRemoveVectoredExceptionHandler. VEH are quite cool and can allow you to write a kind of debugger working inside a single process.

An important note about the VEH API: a VEH can return two values:

  • EXCEPTION_CONTINUE_SEARCH (0) will call the other VEH and then the other EH constructions
  • EXCEPTION_CONTINUE_EXECUTION (-1) will resume the execution at the context passed to the VEH
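
As a rough illustration of these two return values, here is a toy model of the dispatch order in plain Python. It is only a sketch of the behaviour described above, not of the real ntdll code: the handler signature, the dict-based record and context, and the frame_based callback standing in for SEH/TEH are all simplifications invented for the example.

```python
EXCEPTION_CONTINUE_SEARCH = 0
EXCEPTION_CONTINUE_EXECUTION = -1
STATUS_ILLEGAL_INSTRUCTION = 0xC000001D


def dispatch_exception(vectored_handlers, frame_based_dispatch, record, context):
    """Toy model: every VEH is tried first, then the SEH/TEH machinery."""
    for handler in vectored_handlers:
        if handler(record, context) == EXCEPTION_CONTINUE_EXECUTION:
            return "resumed by VEH"          # execution restarts from 'context'
    return frame_based_dispatch(record, context)


def logging_veh(record, context):
    context["seen_by"].append("logging_veh")
    return EXCEPTION_CONTINUE_SEARCH         # let the next handler decide


def fixing_veh(record, context):
    context["seen_by"].append("fixing_veh")
    if record["code"] == STATUS_ILLEGAL_INSTRUCTION:
        context["eip"] += 4                  # skip the 4 faulty bytes
        return EXCEPTION_CONTINUE_EXECUTION
    return EXCEPTION_CONTINUE_SEARCH


def frame_based(record, context):
    context["seen_by"].append("frame_based")
    return "handled by SEH/TEH"


ctx = {"eip": 0x1000, "seen_by": []}
result = dispatch_exception([logging_veh, fixing_veh], frame_based,
                            {"code": STATUS_ILLEGAL_INSTRUCTION}, ctx)
print(result, hex(ctx["eip"]), ctx["seen_by"])
```

Note that as soon as one VEH returns EXCEPTION_CONTINUE_EXECUTION, neither the remaining VEH nor the frame-based handlers get to see the exception.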

Problem - "why does it not work?"

Here are the important points:

  • You can execute 64b code in a WoW64 process
  • ntdll64 is loaded in the process
  • VEH API is in ntdll
  • VEH are cool

Knowing that, I wanted to setup a 64b VEH on a WoW64 process. All we need to do is:

  • Get the address of ntdll64!RtlAddVectoredExceptionHandler
  • Write our 64b VEH in memory
  • Call ntdll64!RtlAddVectoredExceptionHandler(0, handler_addr)
  • Trigger an exception (for testing purpose)

For our first try, we will make a simple VEH that writes a value in memory (as a proof of execution) and returns EXCEPTION_CONTINUE_SEARCH (0) to trigger the standard EH constructions.

## simple_continue_search.py
import ctypes
import windows
import windows.native_exec.simple_x64 as x64

# VEH installer, this code will be in the 'veh' module for the next examples.
cp = windows.current_process
RtlAddVectoredExceptionHandler = cp.peb_syswow.modules[1].pe.exports["RtlAddVectoredExceptionHandler"]

def install_veh_64(handler_code):
    handler_addr = windows.native_exec.native_function.allocator.write_code(handler_code)
    install_veh =  x64.MultipleInstr()
    install_veh += x64.Mov("RCX", 0)
    install_veh += x64.Mov("RDX", handler_addr)
    install_veh += x64.Push("RDX") * 4
    install_veh += x64.Mov("RAX", RtlAddVectoredExceptionHandler)
    install_veh += x64.Call("RAX")
    install_veh += x64.Pop("RDX") * 4
    install_veh += x64.Ret()
    windows.syswow64.execute_64bits_code_from_syswow(install_veh.get_code())
# End of 'veh' module

DATA = ctypes.c_ulonglong()
# Our VEH
handler =  x64.MultipleInstr()
handler += x64.Mov(x64.deref(ctypes.addressof(DATA)), 0x42424242)
handler += x64.Mov("RAX", 0)
handler += x64.Ret()

# Install our 64b VEH
install_veh_64(handler.get_code())

print("Data = {0:#x}".format(DATA.value))
try:
    cp.execute("\xff" * 10) # Trigger an EXCEPTION_ILLEGAL_INSTRUCTION(0xc000001dL)
except WindowsError as e:
    print(e)
print("Data = {0:#x}".format(DATA.value))

PS D:\Sogeti\2016\SysWow64DD\article_examples> python .\simple_continue_search.py
Data = 0x0
[Error -1073741795] Windows Error 0xC000001D
Data = 0x42424242

Nice! We are indeed able to execute a 64b VEH for a 32b exception. But our exception is still sent to the 32b world, as our VEH returns EXCEPTION_CONTINUE_SEARCH.

The next step is to make a VEH that returns EXCEPTION_CONTINUE_EXECUTION to completely bypass the 32b world for the handling of our exception.

The next sample generates an Illegal Instruction by executing "\xff\xff\xff\xff"; our VEH will just replace those bytes with four nops and resume execution.

## simple_resume_execution.py
import windows
import windows.generated_def as gdef  # Windows definitions: structs / constants
import windows.native_exec.simple_x64 as x64
from veh import install_veh_64

cp = windows.current_process

## Our trigger function
trigger = windows.native_exec.create_function("\xff\xff\xff\xff\xc3", [gdef.PVOID])
trigger_addr = trigger.code_addr

# Our VEH
handler =  x64.MultipleInstr()
## XSTATE hack (see "The last pitfall" at the end of the article)
handler += x64.Mov("RAX", x64.mem("[RCX + 8]"))
handler += x64.Mov(x64.mem("[RAX + 0x30]"), 0x10001f)
## End of XSTATE hack
handler +=     x64.Mov("RAX", 0x90909090)
handler +=     x64.Mov(x64.deref(trigger_addr), "EAX")
handler += x64.Mov("RAX", 0xffffffff)
handler += x64.Ret()

install_veh_64(handler.get_code())

print("Trigger at {0:#x} with code {1!r}".format(trigger_addr, cp.read_memory(trigger_addr, 5)))
try:
    trigger()
except WindowsError as e:
    print(e)
print("Trigger at {0:#x} with code {1!r}".format(trigger_addr, cp.read_memory(trigger_addr, 5)))

PS D:\Sogeti\2016\SysWow64DD\article_examples> python .\simple_resume_execution.py
Trigger at 0x1380029 with code '\xff\xff\xff\xff\xc3'
## Nothing... it just hangs.

Okay, it does not seem to work, let's attach a 64b debugger (windbg obviously).

(1bdc.1bbc): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
wow64cpu!TurboDispatchJumpAddressStart+0x5:
00000000`779d1db6 41ff24cf        jmp     qword ptr [r15+rcx*8] ds:00000000`779d23f8=48000001c4878b48

0:000> rrcx
rcx=000000000000011d

## Let's dump the code of trigger()
0:000> u 0x1380029 L5
00000000`01380029 90              nop
00000000`0138002a 90              nop
00000000`0138002b 90              nop
00000000`0138002c 90              nop
00000000`0138002d c3              ret

It looks like our VEH was executed, as the code of trigger() has been rewritten with nops, but we crashed in wow64cpu!TurboDispatchJumpAddressStart+0x5 with a pretty big rcx for an offset in a jump-table.

At this point a few questions arise:

  • Where am I? what really is wow64cpu and why do I crash in 64b mode?
  • What is this jump in TurboDispatchJumpAddress?
  • Why is my RCX this big?
  • Why does EXCEPTION_CONTINUE_EXECUTION break everything?

What we really want to know is how WoW64 works, so let's seriously explore it and its internals.

Exploring WoW64

Initialization

When the first userland instruction of our WoW64 process is executed, three PEs are loaded in memory:

0:000> u eip L1
ntdll!LdrInitializeThunk:
00007ffe`7cd58c80 4053            push    rbx

0:000> lm f
start             end                 module name
00000000`1c980000 00000000`1c98b000   python   python.exe
00000000`774c0000 00000000`7762f000   ntdll_774c0000 ntdll.dll
00007ffe`7cd40000 00007ffe`7ceed000   ntdll    ntdll.dll

The ntdll_774c0000 in windbg is the 32b version whereas the other one is the 64b one. In the rest of this article I will use ntdll64 and ntdll32 to mark the difference.

One of the first things ntdll64!LdrpInitializeProcess does is check whether it's running in a WoW64 process. The check is done with NtQueryInformationProcess(ProcessWow64Information) which returns:

  • 0 if the process is not WoW64
  • The address of the PEB32 if the process is WoW64

If the process is WoW64, ntdll64!LdrpLoadWow64 is called. It loads wow64.dll (and its dependencies wow64cpu.dll and wow64win.dll) and extracts the addresses of the exports that are of interest to ntdll64:

  • Wow64LdrpInitialize
  • Wow64ApcRoutine
  • Wow64PrepareForException

Finally, it calls wow64!Wow64LdrpInitialize.

And here we are, in the first WoW64 specific code!

The first thing Wow64LdrpInitialize does is retrieve the PEB32 with an interesting technique. It uses the following instructions:

mov     rax, gs:30h
mov     ecx, [rax + 2030h]

After the first instruction, rax contains the address of the TEB64 (which has a size of 0x1818 on my Windows 8.1). Furthermore, RAX + 0x2000 happens to be the address of the TEB32, and 0x30 is the offset of the field ProcessEnvironmentBlock in the TEB32.

This means that an invariant of WoW64, for now, is that the TEB32 is located just after the TEB64 (+0x2000). The free space after the PEB32 (PEB32 + 0x250) is used as Wow64InfoPtr and its address is stored in [TEB64 + 0x14D0] (inside the TlsSlots array). Although the use of Wow64InfoPtr is not really important for the rest of the article, it's interesting to see the assumptions WoW64 makes about the memory layout.
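
The address arithmetic described above can be summarised in a few lines of plain Python. This is only a sketch: the offsets are the ones observed in this article (Windows 8.1), the TlsSlots base of 0x1480 in the TEB64 comes from the public symbols and should be treated as an assumption, and the two addresses used in the example are made up.

```python
TEB32_OFFSET_FROM_TEB64 = 0x2000   # the TEB32 sits right after the TEB64
TEB32_PEB_FIELD = 0x30             # TEB32.ProcessEnvironmentBlock
PEB32_WOW64INFO_OFFSET = 0x250     # free space after the PEB32
TEB64_WOW64INFO_FIELD = 0x14D0     # where the Wow64InfoPtr is stored
TEB64_TLS_SLOTS = 0x1480           # assumed offset of TEB64.TlsSlots


def wow64_layout(teb64, peb32):
    """Compute the addresses WoW64 derives from the TEB64 and the PEB32."""
    teb32 = teb64 + TEB32_OFFSET_FROM_TEB64
    return {
        "teb32": teb32,
        # this is the address read by: mov ecx, [rax + 2030h]
        "peb32_pointer_field": teb32 + TEB32_PEB_FIELD,
        "wow64info": peb32 + PEB32_WOW64INFO_OFFSET,
        "wow64info_field": teb64 + TEB64_WOW64INFO_FIELD,
    }


def tls_slot_index(teb64_offset):
    """Index in TlsSlots of a given TEB64 offset (8-byte slots)."""
    return (teb64_offset - TEB64_TLS_SLOTS) // 8


# Hypothetical addresses, only there to exercise the helpers
layout = wow64_layout(teb64=0x7EFDB000, peb32=0x7EFDE000)
print(sorted((name, hex(addr)) for name, addr in layout.items()))
print(tls_slot_index(TEB64_WOW64INFO_FIELD))
```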

wow64!Wow64LdrpInitialize then calls wow64!ProcessInit, which does three interesting things among others.

  • First, it tries to load the "system32\wow64log.dll" DLL. If the DLL is present and exports the following functions [Wow64LogInitialize, Wow64LogSystemService, Wow64LogMessageArgList, Wow64LogTerminate], it will stay in memory as a logging DLL. As it is not present by default, it provides a nice persistence method in WoW64 processes, which has already been discovered by other people. [6]
  • Second, wow64!ProcessInit calls wow64!Wow64pInitializeFilePathRedirection, which is a good starting point to find the code responsible for the filesystem redirection.
  • Lastly, it calls wow64!CpuProcessInit, which sets wow64!CpuThreadSize to the size of a CONTEXT32.

Back to wow64!Wow64LdrpInitialize! After initializing the process, WoW64 needs to initialize the thread. It first checks the destination of the entry point: if it's in a 64b DLL, it just returns and lets ntdll64 do its work (I guess that's for the threads on ntdll!DbgBreakPoint that might be injected by debuggers). The part that interests us is the one where the target is in the initial exe. In this case WoW64 needs to initialize the thread for WoW64 execution.

It will reserve a buffer of size wow64!CpuThreadSize + 0x0f on the stack and call wow64!ThreadInit whose most notable action is a call to wow64cpu!CpuThreadInit with this buffer as parameter. wow64cpu!CpuThreadInit performs two important actions:

  • It sets TEB32.WOW32Reserved ([TEB32+0xC0]) to wow64cpu!KiFastSystemCall
  • It writes the buffer received as parameter in [TEB64 + 0x1488]

It also writes some values in the buffer that can be identified as segment selectors. Furthermore, the locations of those values match the offsets of a CONTEXT32 with an initial displacement.

lea     r8, [rcx + 0Fh] ; RCX is our buffer
mov     rcx, gs:30h
and     r8, 0FFFFFFFFFFFFFFF0h
mov     [rcx + 1488h], r8
mov     [r8 + YOLOCONTEXT.SegCs], 23h ; YOLOCONTEXT is a CONTEXT32 with 0x90 of padding before
mov     [r8 + YOLOCONTEXT.SegDs], 2Bh
mov     [r8 + YOLOCONTEXT.SegEs], 2Bh
mov     [r8 + YOLOCONTEXT.SegSs], 2Bh
mov     [r8 + YOLOCONTEXT.SegFs], 53h
mov     [r8 + YOLOCONTEXT.SegGs], 2Bh
mov     [r8 + YOLOCONTEXT.EFlags], 202h

This allows us to make a guess that there is a CONTEXT32 accessible somewhere in the TEB64 of our thread, should be useful!
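
To check which fields those writes land on, the 32b CONTEXT can be described with ctypes and its offsets printed. This is a sketch under assumptions: the layout below is the x86 CONTEXT from winnt.h as I know it, the initial displacement inside the wow64cpu buffer is deliberately not modelled, and initial_context() and align_up_16() only mimic the assembly shown above.

```python
import ctypes

DWORD = ctypes.c_uint32


class FLOATING_SAVE_AREA(ctypes.Structure):
    _fields_ = [
        ("ControlWord", DWORD), ("StatusWord", DWORD), ("TagWord", DWORD),
        ("ErrorOffset", DWORD), ("ErrorSelector", DWORD),
        ("DataOffset", DWORD), ("DataSelector", DWORD),
        ("RegisterArea", ctypes.c_ubyte * 80),
        ("Cr0NpxState", DWORD),
    ]


class CONTEXT32(ctypes.Structure):
    _fields_ = [
        ("ContextFlags", DWORD),
        ("Dr0", DWORD), ("Dr1", DWORD), ("Dr2", DWORD), ("Dr3", DWORD),
        ("Dr6", DWORD), ("Dr7", DWORD),
        ("FloatSave", FLOATING_SAVE_AREA),
        ("SegGs", DWORD), ("SegFs", DWORD), ("SegEs", DWORD), ("SegDs", DWORD),
        ("Edi", DWORD), ("Esi", DWORD), ("Ebx", DWORD), ("Edx", DWORD),
        ("Ecx", DWORD), ("Eax", DWORD),
        ("Ebp", DWORD), ("Eip", DWORD), ("SegCs", DWORD), ("EFlags", DWORD),
        ("Esp", DWORD), ("SegSs", DWORD),
        ("ExtendedRegisters", ctypes.c_ubyte * 512),
    ]


def align_up_16(address):
    """Same computation as: lea r8, [rcx + 0Fh] / and r8, 0FFFFFFFFFFFFFFF0h."""
    return (address + 0xF) & ~0xF


def initial_context():
    """Values written by wow64cpu!CpuThreadInit according to the listing above."""
    ctx = CONTEXT32()
    ctx.SegCs = 0x23
    ctx.SegDs = ctx.SegEs = ctx.SegSs = ctx.SegGs = 0x2B
    ctx.SegFs = 0x53
    ctx.EFlags = 0x202
    return ctx


ctx = initial_context()
for name in ("SegGs", "SegFs", "Eax", "Eip", "SegCs", "EFlags", "Esp", "SegSs"):
    print("{0:<7} +{1:#x}".format(name, getattr(CONTEXT32, name).offset))
print("sizeof(CONTEXT32) = {0:#x}".format(ctypes.sizeof(CONTEXT32)))
```

Adding the displacement used by wow64cpu to these offsets gives the location of each register inside the buffer stored at [TEB64 + 0x1488].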

PLOT TWIST: while writing the article I discovered that the initialization of the CONTEXT32 at [TEB + 0x1488] previously described is also done in the kernel by nt!PspWow64InitThread.

We are now at the last step of the initialization of the WoW64 Thread: wow64!Wow64SetupInitialCall. This function arranges the CONTEXT32 in the same way the kernel does before passing execution to userland (in nt!PspInitializeThunkContext) by:

  • Reserving space on the stack for another CONTEXT32
  • Copying the initial context onto the stack
  • Modifying the initial context
    • setting esp to the new stack value after CONTEXT32 reservation
    • setting eip to ntdll32!LdrInitializeThunk

That setup allows the correct initialization of the userland by ntdll32, which will execute ntdll32!LdrInitializeThunk and call NtContinue with the context on the stack, resuming execution on the 'real' ntdll entry point ntdll32!RtlUserThreadStart.

Everything is set up, let's execute some 32-bit code!

64b to 32b execution

The last instruction in wow64!Wow64LdrpInitialize is a call to wow64!RunCpuSimulation, which is just an infinite loop around wow64cpu!CpuSimulate.

wow64cpu!CpuSimulate performs the following actions:

  • load SysWowCallDispatchTable in r15
  • load CpupReturnFromSimulatedCode in r10
  • restore all 32b registers from the CONTEXT32 at [TEB + 0x1488]
  • stack pivot on the 32b stack (the 64b stack pointer is stored in r14)
  • Far Jmp on CS32:Eip
  • BOOM 32b code execution !

After the far jmp we are now executing the 32b code of our WoW64 process. Next step is to analyse how syscalls are made and the 32 to 64 transition.

32b to 64b execution

This part has already been documented [3].

The normal transition between 32b and 64b occurs when the 32b code performs a syscall.

ntdll32 does not really perform a syscall but calls [TEB32 + 0xC0], which was set to wow64cpu!KiFastSystemCall in wow64cpu!CpuThreadInit. wow64cpu!KiFastSystemCall is the only 32-bit code in wow64cpu:

jmp     0033:wow64cpu!CpupReturnFromSimulatedCode
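
For reference, this kind of far jmp is easy to build by hand. The sketch below encodes the direct far jump form (opcode EA followed by a 32-bit offset and a 16-bit selector) as it would appear in 32b code; the helper name and the target address are only examples.

```python
import struct

CS_32BIT = 0x23
CS_64BIT = 0x33


def far_jmp(selector, offset):
    """Encode 'jmp selector:offset' (opcode EA, ptr16:32) for 32-bit code."""
    return b"\xea" + struct.pack("<IH", offset, selector)


# Example target address, standing in for CpupReturnFromSimulatedCode
stub = far_jmp(CS_64BIT, 0x779D1D84)
print(" ".join("{0:02x}".format(b) for b in bytearray(stub)))
```

The whole 32b to 64b transition therefore fits in 7 bytes: everything else happens on the 64b side.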

So the bitness transition entry point really is wow64cpu!CpupReturnFromSimulatedCode that performs the following actions:

  • Save the 32b registers in the TEB (eip is set to the return value on the stack)
  • Stack pivot on the 64b stack (which was in r14)
  • Execution is then dispatched by the 16 high bits of eax through the SysWowCallDispatchTable

This switch is the instruction that leads our VEH to crash.

The SysWowCallDispatchTable is nothing very fancy: it allows for quick dispatch of simple syscalls. There are multiple targets for syscalls with 0 to 4 arguments (Signed or Non-Signed) that do not need modification, and some fast paths for specific syscalls like GetCurrentProcessorNumber. The default target redirects to wow64!Wow64SystemServiceEx, which dispatches to the specific handler for each syscall. Those handlers, among other things, transform complex 32b structures into 64b ones before calling the actual syscall.
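
The dispatch itself boils down to a bit of arithmetic on eax, sketched below in plain Python. The function names are mine, and the eax value in the example is hypothetical: it was simply chosen so that its 16 high bits match the rcx seen in the windbg crash earlier (0x11d), from which the table base held in r15 can be recomputed.

```python
def turbo_index(eax):
    """Index used by 'jmp qword ptr [r15 + rcx*8]': the 16 high bits of eax."""
    return (eax & 0xFFFFFFFF) >> 16


def dispatch_slot(table_base, eax):
    """Address of the 8-byte table entry the jump reads its target from."""
    return table_base + turbo_index(eax) * 8


# Values taken from the crash: jmp qword ptr [r15+rcx*8] ds:00000000`779d23f8
CRASH_SLOT = 0x779D23F8
CRASH_RCX = 0x11D
table_base = CRASH_SLOT - CRASH_RCX * 8   # what r15 must have contained

eax = 0x011D0000                          # hypothetical eax giving rcx = 0x11d
print(hex(table_base), hex(turbo_index(eax)), hex(dispatch_slot(table_base, eax)))
```

With an eax that was never meant to be a syscall number, the index points far past the dispatch table and the jump reads whatever bytes happen to be there.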

For example: wow64!whNtCreateFile calls a method to create and fill a 64b OBJECT_ATTRIBUTES, which itself calls wow64!RedirectObjectAttributes, which is responsible for rewriting accesses to c:\windows\system32 into c:\windows\syswow64 (with a few exceptions) [7].
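
To give an idea of what such a redirection looks like, here is a toy model in plain Python. It is not the algorithm of wow64!RedirectObjectAttributes (which works on NT object names, not on DOS paths): the function is made up, it ignores things like the Sysnative alias or disabled redirection, and the list of exempted sub-directories is the one I recall from Microsoft's File System Redirector documentation, so double check it before relying on it.

```python
import ntpath

SYSTEM32 = "c:\\windows\\system32"
SYSWOW64 = "c:\\windows\\syswow64"
# Sub-directories of system32 assumed to be exempt from redirection
NOT_REDIRECTED = ("catroot", "catroot2", "driverstore", "drivers\\etc",
                  "logfiles", "spool")


def redirect(path):
    """Return the path a WoW64 process would really access (toy model).

    Redirected paths are returned lowercased; other paths are left untouched.
    """
    norm = ntpath.normpath(path).lower()
    if norm != SYSTEM32 and not norm.startswith(SYSTEM32 + "\\"):
        return path                        # not under system32 at all
    relative = norm[len(SYSTEM32):].lstrip("\\")
    for exempt in NOT_REDIRECTED:
        if relative == exempt or relative.startswith(exempt + "\\"):
            return path                    # one of the exceptions
    return ntpath.join(SYSWOW64, relative) if relative else SYSWOW64


print(redirect("C:\\Windows\\System32\\kernel32.dll"))
print(redirect("C:\\Windows\\System32\\drivers\\etc\\hosts"))
```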

Normal execution in WoW64

Now, we have a good idea of how normal execution in WoW64 works, but it does not help us to understand the cause of our crash. So the next step is to examine how exception handling works in WoW64. Only problem is, I had no idea at the time how normal exception handling worked in 64b.

WoW64 Exception Handling

Table-based Exception Handling (TEH)

Before exploring how 64b EH works, all I knew was that there is a big list of RUNTIME_FUNCTION in IDA.

A RUNTIME_FUNCTION contains three pieces of information:

  • Function start address
  • Function end address
  • Unwind info address

The UNWIND_INFO contains all the effects the function has on the stack and nonvolatile registers! This information is stored in UNWIND_CODE entries, which are pseudo-code describing the SAVE of nonvolatile registers and ALLOC on the stack.

  • UWOP_PUSH_NONVOL
  • UWOP_ALLOC_SMALL
  • ...

This construction makes it possible to walk the stack and get a valid context at any frame without any tricks, relying only on the information in the PE.
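
Here is a minimal sketch of that idea in plain Python: a toy "virtual unwind" that undoes a prolog from a list of unwind codes. It only handles the two codes named above, assumes the exception happened after the prolog completed, and uses dicts for the registers and the stack; none of this is the real RtlVirtualUnwind, it just shows why the table is enough to rebuild the caller's context.

```python
def virtual_unwind(context, stack, unwind_codes):
    """Undo a function prolog described by (op, operand) unwind codes.

    context:      dict of register values, must contain "rsp"
    stack:        dict mapping addresses to 8-byte values
    unwind_codes: listed in reverse prolog order, the order they are undone in
    """
    ctx = dict(context)
    for op, operand in unwind_codes:
        if op == "UWOP_ALLOC_SMALL":        # prolog did: sub rsp, operand
            ctx["rsp"] += operand
        elif op == "UWOP_PUSH_NONVOL":      # prolog did: push <operand>
            ctx[operand] = stack[ctx["rsp"]]
            ctx["rsp"] += 8
        else:
            raise ValueError("unsupported unwind code: %s" % op)
    ctx["rip"] = stack[ctx["rsp"]]          # return address pushed by the call
    ctx["rsp"] += 8
    return ctx


# Hypothetical function whose prolog is: push rbx / sub rsp, 0x20
stack = {
    0x1000: 0x401234,   # return address in the caller
    0x0FF8: 0xAAAA,     # rbx saved by the prolog
}
inside_callee = {"rip": 0x402000, "rsp": 0x0FD8, "rbx": 0xBBBB}
codes = [("UWOP_ALLOC_SMALL", 0x20), ("UWOP_PUSH_NONVOL", "rbx")]

caller = virtual_unwind(inside_callee, stack, codes)
print(sorted((reg, hex(value)) for reg, value in caller.items()))
```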

The UNWIND_INFO also contains the exception handler for the function and some handler-specific data. For C/C++ binaries the language handler is __C_specific_handler_0. The handler-specific data are in a structure called SCOPE_TABLE:

typedef struct _SCOPE_TABLE {
    ULONG Count;
    struct
    {
        ULONG BeginAddress;
        ULONG EndAddress;
        ULONG HandlerAddress;
        ULONG JumpTarget;
    } ScopeRecord[1];
} SCOPE_TABLE, *PSCOPE_TABLE;

Each ScopeRecord element represents a try{}/Except{} or a finally{} construct.

If JumpTarget is not NULL, it represents a try{}/Except{} construct; otherwise it's a finally{} one. In the case of a try{}/Except{}, the function at HandlerAddress can return the following values [2]:

  • EXCEPTION_CONTINUE_EXECUTION (–1): Exception is dismissed. Continue execution at the point where the exception occurred.
  • EXCEPTION_CONTINUE_SEARCH (0): Exception is not recognized. Continue to search up the stack for a handler.
  • EXCEPTION_EXECUTE_HANDLER (1): Exception is recognized. Transfer control to the exception handler.

What we need to remember is that if HandlerAddress returns 1, the stack is unwound to the matching function and execution resumes at JumpTarget.
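
As an illustration, a raw SCOPE_TABLE can be parsed and searched with a few lines of plain Python. The two helpers are hypothetical and the table bytes are made up; the sketch only applies the rule stated above (a non-NULL JumpTarget means try{}/Except{}) and does not try to reproduce everything the real language handler does.

```python
import struct


def parse_scope_table(data):
    """Parse a raw SCOPE_TABLE (Count followed by 4-ULONG records)."""
    (count,) = struct.unpack_from("<I", data, 0)
    records = []
    for i in range(count):
        begin, end, handler, target = struct.unpack_from("<4I", data, 4 + 16 * i)
        records.append({"begin": begin, "end": end,
                        "handler": handler, "jump_target": target})
    return records


def find_scope(records, rva):
    """Return (kind, record) for the first scope covering rva, or None."""
    for record in records:
        if record["begin"] <= rva < record["end"]:
            kind = "try/except" if record["jump_target"] else "finally"
            return kind, record
    return None


# Made-up table with one try/except scope and one finally scope
raw = struct.pack("<I4I4I",
                  2,
                  0x1000, 0x1020, 0x2000, 0x1030,
                  0x1040, 0x1060, 0x2100, 0)
scopes = parse_scope_table(raw)
print(find_scope(scopes, 0x1010))
print(find_scope(scopes, 0x1050))
```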

You can find more information about TEH in these articles:

Passing the exception to 32b

With this information in mind, let's explore how exception handling is normally performed in a WoW64 process.

When an exception that must be forwarded to userland occurs, the kernel does two things:

  • It pushes a CONTEXT and an EXCEPTION_RECORD on the stack
  • It calls ntdll!KiUserExceptionDispatcher

ntdll!KiUserExceptionDispatcher is responsible for calling the appropriate exception handlers (VEH / SEH / TEH) and for restoring the program's state according to the handlers, or raising the exception again if it was not handled.

The subtlety with an exception happening during WoW64 32-bit code execution is that we have the CONTEXT64 of the exception pushed on the 32b stack. To retrieve a consistent state, ntdll64 first calls wow64!Wow64PrepareForException, which performs two actions:

  • It copies the top of the 32b stack to the 64b one and stack-pivot
  • It consolidates the CONTEXTs state in wow64cpu!CpuResetToConsistentState

wow64cpu!CpuResetToConsistentState updates both CONTEXTs by applying these transformations:

  • Translates the CONTEXT64 to a valid CONTEXT32 and stores it in [TEB + 0x1488]
  • Updates the CONTEXT64 to be a valid representation of the 64b execution
    • set rip to wow64cpu!CpupReturnFromSimulatedCode
    • set rsp to 64b stack (r14 in the original CONTEXT)
    • set SegCS to CS_64BIT

After that, ntdll64 is able to perform the standard task of exception dispatching. Now let's examine what is happening at the exception dispatch level for the two cases encountered with our VEH.

VEH64b: The case where everything works

In this scenario, no VEH handles the exception, so the TEH is triggered. ntdll64 walks the stack looking for try{}/Except{} constructions. The interesting try{}/Except{} is the one in the function wow64!RunCpuSimulation (around the infinite loop calling wow64cpu!CpuSimulate). The except filter RunCpuSimulation$filt$0 calls the method wow64!Pass64bitExceptionTo32Bit. This method, once again, performs the same work as the kernel concerning exception dispatching:

  • Reserve some space on the 32b stack
  • Copy the CONTEXT32 on it
  • Set eip to ntdll32!KiUserExceptionDispatcher

All this work is done on the CONTEXT32 at [TEB + 0x1488].

Once RunCpuSimulation$filt$0 has done its work, it returns EXCEPTION_EXECUTE_HANDLER, meaning that execution continues at the corresponding SCOPE_TABLE.JumpTarget which is... the infinite loop around wow64cpu!CpuSimulate! (The 64b stack is restored at the same time, during the unwind.) Execution is then resumed in the 32b world, more precisely at ntdll32!KiUserExceptionDispatcher, which will handle the 32b exception.

Figure: Normal exception handling of a 32b exception in WoW64

VEH64b: The case where everything is wrong

In the second scenario, a VEH handles the exception. In that case the whole TEH process is bypassed and ntdll64 restores execution based on the CONTEXT64 on the stack. Don't forget that the CONTEXT64 has been modified by wow64cpu!CpuResetToConsistentState to show a valid representation of the 64b world at the time of the 32b execution.

Here is the stack frame of the CONTEXT64:

# Rdx is our CONTEXT64
0:000> dt nt!_CONTEXT @rdx Rsp Rip
ntdll!_CONTEXT
+0x098 Rsp : 0xdfed80
+0x0f8 Rip : 0x779d1d84

0:000> kn = 0xdfed80 0x779d1d84  10
# Child-SP          RetAddr           Call Site
00 00000000`00dfed80 00000000`779f219a wow64cpu!CpupReturnFromSimulatedCode
01 00000000`00dfee30 00000000`779f20d2 wow64!RunCpuSimulation+0xa
02 00000000`00dfee80 00007ff9`240a3c35 wow64!Wow64LdrpInitialize+0x172
03 00000000`00dff3c0 00007ff9`2408245e ntdll!LdrpInitializeProcess+0x1591
04 00000000`00dff6e0 00007ff9`23ff8c8e ntdll!_LdrpInitialize+0x8977e
05 00000000`00dff750 00000000`00000000 ntdll!LdrInitializeThunk+0xe

The CONTEXT64 rip was set to wow64cpu!CpupReturnFromSimulatedCode whereas rax is set to the eax at execution time.

Given that the following code is executed by wow64cpu!CpupReturnFromSimulatedCode:

mov     ecx, eax
shr     ecx, 10h
jmp     qword ptr [r15 + rcx * 8]

It seems obvious that the CONTEXT64 crafted by wow64cpu has not been designed at all to be used as a valid context for restoration. In the current state our VEH is doomed to fail. Let's remind ourselves of the expected behavior and of the tools available to our VEH:

  • We want to resume the 32b execution where the exception was raised
  • We have complete control of the CONTEXT64 that will be resumed if we return EXCEPTION_CONTINUE_EXECUTION.

Let's have a look at the other stack frames described by our CONTEXT64. The interesting one is frame 01, which is the infinite loop around wow64cpu!CpuSimulate. Returning to this code would allow us to resume 32b execution with the state stored at [TEB + 0x1488], which is exactly the 32b state at the time of the exception.

How can we resume execution at wow64!RunCpuSimulation+0xa with a consistent CONTEXT64? Easy, let's unwind the stack! We just need our VEH to perform a 1-frame in-place unwind of the CONTEXT64 it received as parameter.

Here are the steps to perform the unwinding:

  • call ntdll64!RtlLookupFunctionEntry(CONTEXT64.Rip) to retrieve some information for the next call.
  • call ntdll64!RtlVirtualUnwind [5]

The ntdll!RtlVirtualUnwind API has an explicit name, it performs one step of unwinding on a context: exactly what we are looking for. After that our CONTEXT64 should be ready to be restored, so our VEH can return EXCEPTION_CONTINUE_EXECUTION without crashing the process.

The last pitfall

Before showing the POC, there is one last cause of failure for our VEH. On hardware that has the xsave feature, the CONTEXT64.ContextFlags that we receive will have the CONTEXT_XSTATE flag set (which indicates the save/restore of the CPU extended state via the XSAVE/XRSTOR instructions). The subtlety is that a CONTEXT64 with the CONTEXT_XSTATE flag must be aligned on 64 bytes (it's an XRSTOR requirement). The kernel, in its infinite wisdom, aligns the stack before copying the CONTEXT of an exception that needs to be delivered to userland. However, when wow64!Wow64PrepareForException copies the CONTEXT64 from the 32b stack to the 64b one, it only aligns the stack on 0x10, potentially making this CONTEXT64 unfit for a restore with CONTEXT_XSTATE.

The simplest solution is to change the CONTEXT64.ContextFlags in our VEH to remove the CONTEXT_XSTATE flag. One could also change the realignment instruction in wow64!Wow64PrepareForException from and REG, 0FFFFFFFFFFFFFFF0h to and REG, 0FFFFFFFFFFFFFFC0h.
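
Both halves of the problem are easy to check with a few lines of plain Python. The flag values below are the AMD64 CONTEXT_* constants from winnt.h as I know them, the address is hypothetical, and strip_xstate() clears only the XSTATE bit where the POC simply writes the constant 0x10001f.

```python
CONTEXT_AMD64 = 0x00100000
CONTEXT_CONTROL = CONTEXT_AMD64 | 0x01
CONTEXT_INTEGER = CONTEXT_AMD64 | 0x02
CONTEXT_SEGMENTS = CONTEXT_AMD64 | 0x04
CONTEXT_FLOATING_POINT = CONTEXT_AMD64 | 0x08
CONTEXT_DEBUG_REGISTERS = CONTEXT_AMD64 | 0x10
CONTEXT_XSTATE = CONTEXT_AMD64 | 0x40

CONTEXT_ALL = (CONTEXT_CONTROL | CONTEXT_INTEGER | CONTEXT_SEGMENTS |
               CONTEXT_FLOATING_POINT | CONTEXT_DEBUG_REGISTERS)


def strip_xstate(context_flags):
    """Clear only the XSTATE bit (CONTEXT_AMD64 is shared by every flag)."""
    return context_flags & ~0x40


def is_aligned(address, alignment):
    return address % alignment == 0


def align_down(address, alignment):
    """Same computation as 'and REG, ~(alignment - 1)' for a power of two."""
    return address & ~(alignment - 1)


flags = CONTEXT_ALL | CONTEXT_XSTATE
print(hex(flags), "->", hex(strip_xstate(flags)))

context_addr = 0xDFED90   # hypothetical CONTEXT64 address after the copy
print(is_aligned(context_addr, 0x10), is_aligned(context_addr, 0x40))
print(hex(align_down(context_addr, 0x40)))
```

An address that satisfies the 0x10 alignment only has one chance in four of also satisfying the 0x40 one, which is why the crash may look random from one run to the next.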

A VEH64 in WoW64

Proof of concept

With all that let's have a look at a fully working POC.

To prove that the VEH64 is executed and successfully resumes the 32b execution, the POC does the following things:

  • Installs the VEH64
  • Executes an invalid instruction ("\xff\xff\xff\xff") that triggers the VEH
  • The VEH:
    • Rewrites the invalid instruction with four NOP ("\x90\x90\x90\x90")
    • Increments a counter as a proof of execution
    • Sets CONTEXT32.Eax to 0x42424242
    • Changes the CONTEXT64.ContextFlags (XSTATE hack)
    • Performs the unwind

import windows
import windows.generated_def as gdef
import windows.native_exec.simple_x64 as x64
import windows.native_exec.simple_x86 as x86  # x86 assembler
from windows.native_exec import create_function  # Create python function executing native code
from veh import install_veh_64

BYPASS_XSTATE_CRASH = True
cp = windows.current_process

# Retrieve the important functions in ntdll64
RtlLookupFunctionEntry = cp.peb_syswow.modules[1].pe.exports["RtlLookupFunctionEntry"]
RtlVirtualUnwind = cp.peb_syswow.modules[1].pe.exports["RtlVirtualUnwind"]

# Our trigger function (32b)
mov_eax = x86.assemble("mov eax, 0x11223344")
code = mov_eax + "\xff" * 4 + x86.assemble("ret")
trigger_bp32 = windows.native_exec.create_function(code, [gdef.PVOID])
print("trigger_bp32 = 0x{0:08x}".format(trigger_bp32.code_addr))
bad_instr_addr = trigger_bp32.code_addr + len(mov_eax)

# Alloc some memory for our VEH64
counter = windows.native_exec.native_function.allocator.reserve_size(8)
addr1 = windows.native_exec.native_function.allocator.reserve_size(8)
addr2 = windows.native_exec.native_function.allocator.reserve_size(8)
addr3 = windows.native_exec.native_function.allocator.reserve_size(8)
addr4 = windows.native_exec.native_function.allocator.reserve_size(0x1000)

# x64 calling convention helpers
reserve_for_call = x64.Push("RDX") * 4
clean_after_call = x64.Pop("RDX") * 4

# Our VEH !
veh64 =  x64.MultipleInstr()
veh64 +=     x64.Inc(x64.deref(counter))
veh64 += x64.Mov("RAX", x64.mem("[RCX]"))
veh64 += x64.Mov("EAX", x64.mem("[RAX]"))
veh64 += x64.Cmp("EAX", int(gdef.STATUS_ILLEGAL_INSTRUCTION))
veh64 += x64.Jnz(":CONTINUE_SEARCH")
veh64 +=     x64.Mov("RAX", 0x90909090)
veh64 +=     x64.Mov(x64.deref(bad_instr_addr), "EAX") # Rewrite the "\xff\xff\xff\xff"
veh64 +=     x64.Mov("EAX", x64.mem("gs:[0x30]")) # Retrieve TEB
veh64 +=     x64.Mov("EAX", x64.mem("[EAX + 0x1488]")) # Retrieve CONTEXT32
veh64 +=     x64.Mov("EDI", 0x42424242)
veh64 +=     x64.Mov(x64.mem("[EAX + 0xb4]"), "EDI") # setup CONTEXT32.Eax
veh64 +=     x64.Mov("RAX", x64.mem("[RCX + 8]")) # Retrieve CONTEXT64
veh64 +=     x64.Mov("RDI", x64.mem("[RAX + 0xf8]")) # RIP AT Exception
veh64 +=     x64.Mov("RSI", x64.mem("[RAX + 0x98]")) # RSP AT Exception
if BYPASS_XSTATE_CRASH:
    veh64 +=     x64.Mov(x64.mem("[RAX + 0x30]"), 0x10001f) # Remove CONTEXT_XSTATE
# Call RtlLookupFunctionEntry
# Save RCX
veh64 +=    x64.Push("RCX")
# setup stack/registers for call
veh64 +=    reserve_for_call
veh64 +=    x64.Mov("RCX", "RDI")
veh64 +=    x64.Mov("RDX", addr1)
veh64 +=    x64.Mov("R8", addr2)
veh64 +=    x64.Mov("RAX", RtlLookupFunctionEntry)
veh64 +=    x64.Call("RAX")
veh64 +=    clean_after_call
# Call to RtlVirtualUnwind(0, [addr1], RDI, RAX, [RCX + 8], addr3, addr4, NULL)
veh64 +=    x64.Push(0)
veh64 +=    x64.Push(addr4)
veh64 +=    x64.Push(addr3)
veh64 +=    x64.Mov("RCX", x64.mem("[RSP + 0x18]"))
veh64 +=    x64.Mov("RCX", x64.mem("[RCX + 0x8]"))
veh64 +=    x64.Push("RCX")
veh64 +=    x64.Mov("R9", "RAX")
veh64 +=    x64.Mov("R8", "RDI")
veh64 +=    x64.Mov("RDX", x64.deref(addr1))
veh64 +=    x64.Mov("RCX", 0)
veh64 +=    reserve_for_call
veh64 +=    x64.Mov("RAX", RtlVirtualUnwind)
veh64 +=    x64.Call("RAX")
veh64 +=    clean_after_call
# Pop the stack parameters pushed for RtlVirtualUnwind
veh64 +=    x64.Pop("RCX") * 4
# Pop save RCX
veh64 +=    x64.Pop("RCX")
veh64 +=    x64.Mov("RAX", 0xffffffff)
veh64 +=    x64.Ret()
veh64 += x64.Label(":CONTINUE_SEARCH")
veh64 += x64.Xor("RAX", "RAX")
veh64 += x64.Ret()

install_veh_64(veh64.get_code())

# Triggers and test
import ctypes
print("Bad instr code: <{0}>".format(cp.read_memory(bad_instr_addr, 4).encode('hex')))
res = trigger_bp32()
print("trigger_bp32 result = <{0}>".format(hex(res)))
print("Counter = " + hex(cp.read_qword(counter)))
print("Bad instr code: <{0}>".format(cp.read_memory(bad_instr_addr, 4).encode('hex')))

PS D:\Sogeti\2016\SysWow64DD\article> python .\VectoredHandlerSysWow64_simple.py
trigger_bp32 = 0x01200029
Bad instr code: <ffffffff>
trigger_bp32 result = <0x42424242>
Counter = 0x1
Bad instr code: <90909090>

It works!

What can be done from the VEH64

As demonstrated in the previous POC, the VEH64 can (obviously) read and write the whole memory of the process and modify the CONTEXT32.

However, wow64cpu!CpuSimulate does not restore the complete context before passing execution to 32b code: the debug registers are not restored (because it can only be done from ring-0).

But we also have access to the CONTEXT64 that is restored by the kernel via NtContinue, so in order to set a hardware breakpoint, you need to set up Dr[0-3, 7] in the CONTEXT64 and voilà!
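
The only non-obvious part is the value to put in Dr7. Here is a small standalone helper in plain Python that computes it from the layout described in the Intel manual (local enable bits at the bottom, one RW/LEN nibble per breakpoint starting at bit 16). The function names are mine, and the CONTEXT64 offsets in the comments are what I expect from the public AMD64 CONTEXT layout, so check them against your own symbols.

```python
BREAK_ON_EXECUTE = 0b00
BREAK_ON_WRITE = 0b01
BREAK_ON_READ_WRITE = 0b11

# Expected offsets in the CONTEXT64 (assumption, from the public layout)
CONTEXT64_DR0 = 0x48   # Dr1..Dr3 follow at +0x50, +0x58, +0x60
CONTEXT64_DR7 = 0x70

LEN_ENCODING = {1: 0b00, 2: 0b01, 8: 0b10, 4: 0b11}


def dr7_enable(dr7, slot, kind=BREAK_ON_EXECUTE, length=1):
    """Return dr7 with a local breakpoint enabled for Dr<slot> (0 to 3)."""
    if slot not in (0, 1, 2, 3):
        raise ValueError("slot must be between 0 and 3")
    if kind == BREAK_ON_EXECUTE and length != 1:
        raise ValueError("execute breakpoints must use a length of 1")
    shift = 16 + 4 * slot
    dr7 &= ~(0xF << shift)                          # clear RW/LEN of this slot
    dr7 |= (kind | (LEN_ENCODING[length] << 2)) << shift
    dr7 |= 1 << (2 * slot)                          # L<slot>: local enable
    return dr7


def dr7_disable(dr7, slot):
    """Return dr7 with the local enable bit of Dr<slot> cleared."""
    return dr7 & ~(1 << (2 * slot))


exec_bp = dr7_enable(0, slot=0)                     # execute breakpoint on Dr0
write_bp = dr7_enable(0, slot=1, kind=BREAK_ON_WRITE, length=4)
print(hex(exec_bp), hex(write_bp), hex(dr7_disable(exec_bp, 0)))
```

For a hardware-execute breakpoint on Dr0, the VEH therefore only has to write the target address in CONTEXT64.Dr0 and set bit 0 of CONTEXT64.Dr7.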

As a final demo, here is an update of the previous POC with the setup of a hardware-exec breakpoint through the CONTEXT64. The POC still triggers an invalid instruction that is rewritten by nop. The handler additionally adds a hardware-execute breakpoint on the first nop that will be triggered at the second call of the function.

Note: the hardware-execute breakpoint is not triggered immediately on return from the VEH the first time, because the CPU sets the Resume Flag (RF) in the EFLAGS image pushed on the stack for a fault-class exception (Intel manual 3B: 17.3.1.1).
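The effect of the Resume Flag can be modelled in a few lines. This is only a simplified model of the CPU behaviour written for this note (exec_breakpoint_fires is a hypothetical helper, the EFLAGS values are arbitrary examples, and the R/W and LEN bits of Dr7 are ignored):

```python
RF = 1 << 16  # EFLAGS.RF (Resume Flag) is bit 16

def exec_breakpoint_fires(eflags, dr7, index=3):
    """An execute breakpoint in slot `index` fires only if the slot is
    enabled in Dr7 (Ln or Gn bit) and RF is clear in EFLAGS."""
    enabled = bool(dr7 & (0b11 << (index * 2)))
    return enabled and not (eflags & RF)

# Return from the illegal instruction fault: RF is set in the saved EFLAGS,
# so the first nop executes without triggering the breakpoint.
print(exec_breakpoint_fires(0x10246, 0xc0))  # False

# Second call of the function: RF is clear, the breakpoint fires.
print(exec_breakpoint_fires(0x246, 0xc0))    # True

# Once the handler has set Dr7 to 0, nothing fires anymore.
print(exec_breakpoint_fires(0x246, 0x0))     # False
```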

When the breakpoint is triggered, the handler receives a STATUS_WX86_SINGLE_STEP exception: the VEH just removes the hardware breakpoint and changes the value of CONTEXT32.Eax to 0x11223344.

import windows
import windows.generated_def as gdef
import windows.native_exec.simple_x64 as x64
import windows.native_exec.simple_x86 as x86
from windows.native_exec import create_function
from veh import install_veh_64

BYPASS_XSTATE_CRASH = True
cp = windows.current_process

# Retrieve the important functions in ntdll64
RtlLookupFunctionEntry = cp.peb_syswow.modules[1].pe.exports["RtlLookupFunctionEntry"]
RtlVirtualUnwind = cp.peb_syswow.modules[1].pe.exports["RtlVirtualUnwind"]

# Our trigger function, which sets Eax to 0x10 before generating the illegal instruction exception
mov_eax = x86.assemble("mov eax, 0x10")
code = mov_eax + "\xff" * 4 + x86.assemble("ret")
trigger_bp32 = windows.native_exec.create_function(code, [gdef.PVOID])
bad_instr_addr = trigger_bp32.code_addr + len(mov_eax)

# Alloc some memory for our VEH64
counter = windows.native_exec.native_function.allocator.reserve_size(8)
addr1 = windows.native_exec.native_function.allocator.reserve_size(8)
addr2 = windows.native_exec.native_function.allocator.reserve_size(8)
addr3 = windows.native_exec.native_function.allocator.reserve_size(8)
addr4 = windows.native_exec.native_function.allocator.reserve_size(0x1000)

# x64 calling convention helpers
reserve_for_call = x64.Push("RDX") * 4
clean_after_call = x64.Pop("RDX") * 4

# Our VEH !
veh64 =  x64.MultipleInstr()
veh64 +=     x64.Inc(x64.deref(counter))
veh64 += x64.Mov("RAX", x64.mem("[RCX]"))
veh64 += x64.Mov("EAX", x64.mem("[RAX]"))
veh64 += x64.Cmp("EAX", int(gdef.STATUS_ILLEGAL_INSTRUCTION))
veh64 += x64.Jnz(":CONTINUE_SEARCH")
# case: STATUS_ILLEGAL_INSTRUCTION
veh64 +=     x64.Mov("RAX", 0x90909090)
veh64 +=     x64.Mov(x64.deref(bad_instr_addr), "EAX")  # Rewrite the "\xff\xff\xff\xff"
veh64 +=     x64.Mov("EAX", x64.mem("gs:[0x30]"))  # Retrieve TEB
veh64 +=     x64.Mov("EAX", x64.mem("[EAX + 0x1488]"))  # Retrieve CONTEXT32
veh64 +=     x64.Mov("EDI", 0x42424242)
veh64 +=     x64.Mov(x64.mem("[EAX + 0xb4]"), "EDI")  # setup CONTEXT32.Eax
veh64 +=     x64.Mov("RAX", x64.mem("[RCX + 8]"))  # Retrieve CONTEXT64
# Setup the hardware-execute BP
veh64 +=     x64.Mov(x64.mem("[RAX + 0x60]"),  bad_instr_addr)  # Setup DR3 on the first nop
veh64 +=     x64.Mov(x64.mem("[RAX + 0x70]"), 0xc0)   # Setup Dr7
veh64 +=     x64.Label(":CONTINUE_EXECUTION")
# Unwind the stack and resume execution
veh64 +=     x64.Mov("RAX", x64.mem("[RCX + 8]"))
veh64 +=     x64.Mov("RDI", x64.mem("[RAX + 0xf8]"))  # RIP AT Exception
veh64 +=     x64.Mov("RSI", x64.mem("[RAX + 0x98]"))  # RSP AT Exception
if BYPASS_XSTATE_CRASH:
    veh64 +=     x64.Mov(x64.mem("[RAX + 0x30]"), 0x10001f)  # Remove CONTEXT_XSTATE
# Call RtlLookupFunctionEntry
veh64 +=    x64.Push("RCX")
veh64 +=    reserve_for_call
veh64 +=    x64.Mov("RCX", "RDI")
veh64 +=    x64.Mov("RDX", addr1)
veh64 +=    x64.Mov("R8", addr2)
veh64 +=    x64.Mov("RAX", RtlLookupFunctionEntry)
veh64 +=    x64.Call("RAX")
veh64 +=    clean_after_call
# Call to RtlVirtualUnwind(0, [addr1], RDI, RAX, [RCX + 8], addr3, addr4, NULL)
veh64 +=    x64.Push(0)
veh64 +=    x64.Push(addr4)
veh64 +=    x64.Push(addr3)
veh64 +=    x64.Mov("RCX", x64.mem("[RSP + 0x18]"))
veh64 +=    x64.Mov("RCX", x64.mem("[RCX + 0x8]"))
veh64 +=    x64.Push("RCX")
veh64 +=    x64.Mov("R9", "RAX")
veh64 +=    x64.Mov("R8", "RDI")
veh64 +=    x64.Mov("RDX", x64.deref(addr1))
veh64 +=    x64.Mov("RCX", 0)
veh64 +=    reserve_for_call
veh64 +=    x64.Mov("RAX", RtlVirtualUnwind)
veh64 +=    x64.Call("RAX")
veh64 +=    clean_after_call
veh64 +=    x64.Pop("RCX") * 4
veh64 +=    x64.Pop("RCX")
# Return EXCEPTION_CONTINUE_EXECUTION
veh64 +=    x64.Mov("RAX", 0xffffffff)
veh64 +=    x64.Ret()
veh64 += x64.Label(":CONTINUE_SEARCH")
veh64 += x64.Mov("RDI", x64.mem("[RCX + 8]"))
veh64 += x64.Mov(x64.mem("[RDI + 0x70]"), 0) # Set Dr7 to 0
veh64 += x64.Cmp("EAX", int(gdef.STATUS_WX86_SINGLE_STEP))
veh64 += x64.Jnz(":CONTINUE_VEH_SEARCH")
# case: STATUS_WX86_SINGLE_STEP (hardware-execute BP in 32b)
veh64 +=     x64.Mov("EAX", x64.mem("gs:[0x30]"))  # Retrieve TEB
veh64 +=     x64.Mov("EAX", x64.mem("[EAX + 0x1488]")) # Retrieve CONTEXT32
veh64 +=     x64.Mov("EDI", 0x11223344)
veh64 +=     x64.Mov(x64.mem("[EAX + 0xb4]"), "EDI")  # Setup CONTEXT32.Eax
veh64 +=     x64.Jmp(":CONTINUE_EXECUTION")
veh64 += x64.Label(":CONTINUE_VEH_SEARCH")
veh64 += x64.Xor("RAX", "RAX")
veh64 += x64.Ret()

install_veh_64(veh64.get_code())


print("Bad instr code: <{0}>".format(cp.read_memory(bad_instr_addr, 4).encode('hex')))
print("[+] Trigger 1 !") # Trigger the Invalid Instruction
res = trigger_bp32()
print("trigger_bp32 result = <{0}>".format(hex(res)))
print("Counter = " + hex(cp.read_qword(counter)))
print("Bad instr code: <{0}>".format(cp.read_memory(bad_instr_addr, 4).encode('hex')))


print("[+] Trigger 2 !") # Trigger the hardware-execute breakpoint
res = trigger_bp32()
print("trigger_bp32 result = <{0}>".format(hex(res)))
print("Counter = " + hex(cp.read_qword(counter)))
print("Bad instr code: <{0}>".format(cp.read_memory(bad_instr_addr, 4).encode('hex')))


print("[+] Trigger 3 !") # Trigger nothing
res = trigger_bp32()
print("trigger_bp32 result = <{0}>".format(hex(res)))
print("Counter = " + hex(cp.read_qword(counter)))
print("Bad instr code: <{0}>".format(cp.read_memory(bad_instr_addr, 4).encode('hex')))

PS D:\Sogeti\2016\SysWow64DD\article> python .\VectoredHandlerSysWow64.py
Bad instr code: <ffffffff>
[+] Trigger 1 !
trigger_bp32 result = <0x42424242>
Counter = 0x1
Bad instr code: <90909090>
[+] Trigger 2 !
trigger_bp32 result = <0x11223344>
Counter = 0x2
Bad instr code: <90909090>
[+] Trigger 3 !
trigger_bp32 result = <0x10>
Counter = 0x2
Bad instr code: <90909090>