Stack spoofing is a really cool malware technique that isn’t new, but has been receving some more attention recently. The goal of this post is to introduce readers to the concept and dive into two implementations. This post will focus only on call stack spoofing in x64 Windows with “active” spoofing techniques.
Preface
None of the work here is novel. The majority of the stack spoofing research that I will be describing was done by Klezvirus, trickster012, waldo-irc, and namazso. Massive thanks to them for providing the foundation for me to dig into such a cool topic.
During my internship at X-Force Red I got to work with 0xboku in researching the work above. Some of that work got to be integrated with BokuLoader, which was really awesome. Without Bobby’s help I wouldn’t have been able to understand these techniques as much, so massive thanks to him for sticking through several hour calls with my dumb questions.
Lastly, I am not an expert on this subject, this post is meant to just share what I understand with others trying to cook funny software.
Introduction
EDR have more capabilities beyond inline hooking. It is possible for them to utilize the call stack of a function call to determine whether a function is malicious or not. Elastic has demonstrated this capability and I am guessing this call stack telemetry is obtained via kernel callbacks registered for process and thread creation events, and probably more.
Call stack?
Every thread in a process has its own call stack. When a function is called, a frame is created for it; this frame is a space for it store some data for that function, like local variables. Right below the base of the frame contains a return address; the location of the frame's caller. This way, when the function finishes execution, it can return to its caller.
Using a debugger, we can place breakpoints on some function calls and view our call stack. When our implant (in my case, meterpreter), makes a function call (gethostbyname
in the screenshot below), we can see that the call stack points back to its location in memory.
A little note: the most recent caller is higher in the stack (lower number), each entry you go down is the above entry’s caller. The additional addresses beyond the meterpreter return address are “leaked” stack values; when unbacked memory is encountered by a walk, it assumes the stack frame is just of size 8, causing it to just dump stack values until it encounters a 0.
The size and the allocation don't line up!
Yeah lol, that's just a consequence of my janky meterpreter reflective loader that cannot free itself due to IAT hooks that are defined within the loader. gethostbyname was called by meterpreter.
From here, EDR could:
- Kill our implant because that address is RX (Read-Executable) memory unbacked by a file on disk
- Run a memory scan and kill our implant because it will probably get flagged
- Flag our process to warrant more investigation by an IR operator
- Some other voodoo
This is not ideal. So how can we address this? We can try spoofing our callstack; making it less suspicious by attempting to hide the source of our call. There are multiple ways to achieve this; this blog post will not cover all of them.
Active Spoofing vs Passive Spoofing
The techniques I will introducing fall under the “active” category; that is, they allow for the spoofing of any function being called. See the following projects:
- VulcanRaven
- AceLdr’s ret addr spoofing
Passive spoofing is something I sort of went over in my previous blog post and it is a type of spoofing that can only occur when the implant is sleeping. See the following projects:
- ThreadStackSpoofer
- CallStackMasker
- Unwinder
- TitanLdr’s sleep technique. This was also implemented in AceLdr
The active spoofing techniques I am covering have been done by the SilentMoonwalk project.
x64 Call Stack Primer
Unlike x86, only the RSP is necessary for keeping track of stack frames. Take the following code
void FuncC()
{
return;
}
void FuncB(int a, int b, int c, int d)
{
FuncC();
return;
}
int FuncA(int a, int b, int br, int uh)
{
int c = a;
int d = b;
FuncB(1, 1, 1, 1);
return 5;
}
void main()
{
FuncA(1, 2, 3, 4);
}
The contents of the call stack when FuncB() is called is below on the left. The Process Hacker stack walk view is on the right.
The space between the return addresses (the “Return to X”) is what I will refer to as the “stack size” of a function. Understanding how to calculate this stack size will be important for trying to spoof stacks as it is a common attribute utilized when unwinding the stack.
The .pdata section and unwind codes
The .pdata section is a PE section which includes information to assist with exception handling. More specifically, it is an array of RUNTIME_FUNCTION
s.
typedef struct _IMAGE_RUNTIME_FUNCTION_ENTRY {
DWORD BeginAddress;
DWORD EndAddress;
union {
DWORD UnwindInfoAddress;
DWORD UnwindData;
} DUMMYUNIONNAME;
} RUNTIME_FUNCTION
The BeginAddress
and EndAddress
are offsets to the base of the PE. The actual code for the function this RUNTIME_FUNCTION
belongs to is between those addresses. The union contains the interesting bit; an offset to the UNWIND_INFO
struct.
typedef struct _UNWIND_INFO {
BYTE Version : 3;
BYTE Flags : 5;
BYTE SizeOfProlog;
BYTE CountOfCodes;
BYTE FrameRegister : 4;
BYTE FrameOffset : 4;
UNWIND_CODE UnwindCode[1];
} UNWIND_INFO, * PUNWIND_INFO;
This UNWIND_INFO
contains an array of UNWIND_CODE
s. We can iterate through these to calculate the stack size of the function.
typedef union _UNWIND_CODE {
struct {
BYTE CodeOffset;
BYTE UnwindOp : 4;
BYTE OpInfo : 4;
};
USHORT FrameOffset;
} UNWIND_CODE, * PUNWIND_CODE;
There are different code types, with some affecting the stack size of a function by varying amounts.
Creating our own frames
With the information we have so far, we can try to create “synthetic” frames via the following steps.
- Decrement
rsp
by the stack size - Set
[rsp]
to the a return address - Repeat at if we want to create another frame
Active Spoofing: Synthetic Frames
Note: This method was done by both VulcanRaven and SilentMoonwalk and is essentially an extension of namaszo’s return address spoofing, implemented by AceLdr by Kyle Avery.
More details about the technique here
The technique, originally posted in 2018, would spoof the stack by replacing the return address to a jmp [rbx]
gadget within a signed dll. This is done via the following steps:
- Store the original return address in a struct
- Overwrite the return address with the address of the struct
- Store a handler address at the base of the struct
- Store the original rbx in the struct
- Set the rbx to the address of the struct.
- Jump to the function we wish to call
Putting everything together, this is what we need to accomplish:
- Locate the
RUNTIME_FUNCTION
of the frames we wish to craft within their modules’ .pdata section - Parse the
UNWIND_CODE
s to calculate their stack sizes - Push a 0
- Create our fake frames
- Locate a
jmp [rbx]
gadget and create a frame for that - Jump to the function we wish to call
The strategy I implemented in my first POC pushes a 0 and creates two fake “origin” frames; these were ntdll.RtlUserThreadStart + 0x21
and kernel32.BaseThreadInitThunk+0x14
as they were the most common start frames on the system I was using. The initial 0 being pushed will cause stack walks to prematurely end their walks, so only our synthetic frames and the frames generated after our function call are shown.
More details about LoudSunRun here
I wanted to recreate the synthetic frames technique from SilentMoonWalk, since the original codebase was kinda large and I wanted to understand the technique more. The main differences between my implementation and SilentMoonWalk's synthetic one is a smaller code base, indirect syscall support, and multiple argument support (via relocating stack arguments to the call site).
On the left is the stack we’re trying to create, and on the right is the execution flow.
Spoofing NtAllocateVirtualMemory
causes the stack to appear like below.
We can see the initial 0 pushed caused the walk to be stopped at our synthetic RtlUserThreadStart
frame; our function call no longer traces back to meterpreter.
The code to the butchered version of KaynLdr I used to replace meterpreter’s reflective loader is available here. This was so I could integrate this implementation of stack spoofing into meterpreter via IAT hooking. It does not bypass defender lol.
More details about the reflective loader here
I wanted to try writing a reflective loader, so I took 5pider's KaynLdr and modified it a bit. I saw IAT hooking mentioned by Bobby Cooke's reflective loader post, and saw the opportunity to try to implement stack spoofing via IAT hooks. The difference between the loader and LoudSunRun is that the loader is able to spoof calls without any WinAPIs being used as function address resolution and the RUNTIME_FUNCTION
resolution is done manually.
Can I see the relevant code?
SureRuntime Function Finder
This function finds the RUNTIME_FUNCTION
of an addrew via iterated through all the modules in the PEB and checks if they are .dll
s. Afterwards it gets the .pdata address range of the module and iterates through all of the RUNTIME_FUNCTION
s of the module, checking if the requested address lies within any of the RUNTIME_FUNCTION
s. If not, the next module is checked and the loop continues.
PRUNTIME_FUNCTION GetRuntimeFunction( PVOID Address, PVOID* ImageBase )
{
// Walk the PEB Modules, try to find the module that function falls within
PLDR_DATA_TABLE_ENTRY pModule = ( PLDR_DATA_TABLE_ENTRY ) ( ( PPEB ) PPEB_PTR )->Ldr->InMemoryOrderModuleList.Flink;
PLDR_DATA_TABLE_ENTRY pFirstModule = pModule;
do
{
// Get Size of DLL;
PVOID Base = NULL;
PIMAGE_NT_HEADERS NtHeaders = NULL;
PIMAGE_SECTION_HEADER SecHeader = NULL;
DWORD TextSize = NULL;
PVOID LowerBound = NULL;
PVOID UpperBound = NULL;
Base = pModule->Reserved2[ 0 ];
if ( Base )
{
if ( *( PWORD )( Base ) == 0x5A4D )
{
NtHeaders = (PVOID) ( Base + ( ( PIMAGE_DOS_HEADER ) Base )->e_lfanew );
if ( *( PWORD )NtHeaders == 0x4550 )
{
if ( NtHeaders->FileHeader.Characteristics & 0x2000 )
{
SecHeader = IMAGE_FIRST_SECTION( NtHeaders );
for ( int i = 0; i < NtHeaders->FileHeader.NumberOfSections; i++ )
{
if ( HashString( SecHeader[ i ].Name, 0 ) == H_PDATA )
{
LowerBound = (PBYTE) Base + SecHeader[ i ].VirtualAddress;
UpperBound = (PBYTE) LowerBound + SecHeader[ i ].Misc.VirtualSize;
}
}
if ( LowerBound && UpperBound )
{
// Let's find the PRUNTIME_FUNCTION in the pdata
PVOID RelativeAddress = Address - Base;
for ( PRUNTIME_FUNCTION p = LowerBound; p < UpperBound ; p++ )
{
if ( RelativeAddress >= p->BeginAddress && RelativeAddress <= p->EndAddress )
{
*ImageBase = Base;
return p;
}
}
}
}
}
}
}
pModule = ( PLDR_DATA_TABLE_ENTRY ) pModule->Reserved1[ 0 ];
} while ( pModule && pModule != pFirstModule );
return NULL;
}
Spoof Struct
This struct contains a lot of information necessary for the assembly code to set up the frames and perform the synthetic spoofing properly. AceLdr/Namszo's return address patching used a struct, and since this technique just builds upon that, I just expanded the struct.
typedef struct
{
PVOID Fixup; // 0
PVOID OG_retaddr; // 8
PVOID rbx; // 16
PVOID rdi; // 24
PVOID BTIT_ss; // 32
PVOID BTIT_retaddr; // 40
PVOID Gadget_ss; // 48
PVOID RUTS_ss; // 56
PVOID RUTS_retaddr; // 64
PVOID ssn; // 72
PVOID trampoline; // 80
PVOID rsi; // 88
PVOID r12; // 96
PVOID r13; // 104
PVOID r14; // 112
PVOID r15; // 120
} PRM, * PPRM;
_retaddr
is the return address we want on to show up on the frame. _ss
is the stack size of the respective address we're creating a frame work. The first member is Fixup
so we can just store this struct address in the rbx
and our gadget will jump to the handler. There are some nonvolatile registers stored here for restoration at the handler (Fixup
). And there's a ssn
in case the function we wish to spoof is a syscall. This code already jumps directly to the func we want to spoof, so syscall support was easy.
Assembly to set up frames
This is a decent size chunk of assembly and uses a lot of offsets based on the previous struct. It's painful to read lol. The calling convention for Spoof is Spoof(param1, param2, param3, param4, &SpoofStruct, AddrOfFunc, NumofStackArgs, argN)
. The first 4 args are passed via rcx rdx r8 r9
so those are easy. Argument 5 is the spoof struct, 6 is the address of the function we wish to execute (a syscall instruction works here, too), and 7 is the number of arguments passed onto the stack. Since any argument beyond the first 4 are passed via the stack, I wanted to notify Spoof
if there was more than 4 arguments, and how many more there were. This is important because aside from just setting up the frames and jumping to our function, Spoof
is going to relocate our stack arguments onto the new call site. This makes it such that you don't need to pass arguments via struct members.
Spoof:
pop r12 ; Real return address in r12
mov r10, rdi ; Store OG rdi in r10
mov r11, rsi ; Store OG rsi in r11
mov rdi, [rsp + 32] ; Storing struct in the rdi
mov rsi, [rsp + 40] ; Storing function to call
; ---------------------------------------------------------------------
; Storing our original registers
; ---------------------------------------------------------------------
mov [rdi + 24], r10 ; Storing OG rdi into param
mov [rdi + 88], r11 ; Storing OG rsi into param
mov [rdi + 96], r12 ; Storing OG r12 into param
mov [rdi + 104], r13 ; Storing OG r13 into param
mov [rdi + 112], r14 ; Storing OG r14 into param
mov [rdi + 120], r15 ; Storing OG r15 into param
; ---------------------------------------------------------------------
; Prepping to move stack args
; ---------------------------------------------------------------------
xor r11, r11 ; r11 will hold the # of args that have been "pushed"
mov r13, [rsp + 30h] ; r13 will hold the # of args total that will be pushed
xor r14, r14
add r14, 8
add r14, [rdi + 56] ; stack size of RUTS
add r14, [rdi + 48] ; stack size of BTIT
add r14, [rdi + 32] ; stack size of our gadget frame
sub r14, 20h ; first stack arg is located at +0x28 from rsp, so we sub 0x20 from the offset. Loop will sub 0x8 each time
mov r10, rsp
add r10, 30h ; offset of stack arg added to rsp
looping:
xor r15, r15 ; r15 will hold the offset + rsp base
cmp r11, r13 ; comparing # of stack args added vs # of stack args we need to add
je finish
; ---------------------------------------------------------------------
; Getting location to move the stack arg to
; ---------------------------------------------------------------------
sub r14, 8 ; 1 arg means r11 is 0, r14 already 0x28 offset.
mov r15, rsp ; get current stack base
sub r15, r14 ; subtract offset
; ---------------------------------------------------------------------
; Procuring the stack arg
; ---------------------------------------------------------------------
add r10, 8
push qword [r10]
pop qword [r15] ; move the stack arg into the right location
; ---------------------------------------------------------------------
; Increment the counter and loop back in case we need more args
; ---------------------------------------------------------------------
add r11, 1
jmp looping
finish:
; ----------------------------------------------------------------------
; Pushing a 0 to cut off the return addresses after RtlUserThreadStart.
; ----------------------------------------------------------------------
push 0
; ----------------------------------------------------------------------
; RtlUserThreadStart + 0x14 frame
; ----------------------------------------------------------------------
sub rsp, [rdi + 56]
mov r11, [rdi + 64]
mov [rsp], r11
; ----------------------------------------------------------------------
; BaseThreadInitThunk + 0x21 frame
; ----------------------------------------------------------------------
sub rsp, [rdi + 32]
mov r11, [rdi + 40]
mov [rsp], r11
; ----------------------------------------------------------------------
; Gadget frame
; ----------------------------------------------------------------------
sub rsp, [rdi + 48]
mov r11, [rdi + 80]
mov [rsp], r11
; ----------------------------------------------------------------------
; Adjusting the param struct for the fixup
; ----------------------------------------------------------------------
mov r11, rsi ; Copying function to call into r11
mov [rdi + 8], r12 ; Real return address is now moved into the "OG_retaddr" member
mov [rdi + 16], rbx ; original rbx is stored into "rbx" member
mov rbx, [rdi] ; Fixup address is moved into rbx
mov [rdi], rbx ; Fixup member now holds the address of Fixup
mov rbx, rdi ; Address of param struct (Fixup) is moved into rbx
; ----------------------------------------------------------------------
; Syscall stuff. Shouldn't affect performance even if a syscall isnt made
; ----------------------------------------------------------------------
mov r10, rcx
mov rax, [rdi + 72]
jmp r11
section .text$C
Fixup:
mov rcx, rbx
add rsp, [rbx + 48] ; Stack size
add rsp, [rbx + 32] ; Stack size
add rsp, [rbx + 56] ; Stack size
mov rbx, [rcx + 16]
mov rdi, [rcx + 24] ; ReStoring OG rdi
mov rsi, [rcx + 88] ; ReStoring OG rsi
mov r12, [rcx + 96] ; ReStoring OG r12
mov r13, [rcx + 104] ; ReStoring OG r13
mov r14, [rcx + 112] ; ReStoring OG r14
mov r15, [rcx + 120] ; ReStoring OG r15
jmp [rcx + 8]
Synthetic Frames vs Elastic + Bitdefender
I tested this successfully (so far as I can tell) against Elastic and Bitdefender. Call stack spoofing isn’t a silver bullet; if we’re doing a sacrificial process + creating a remote thread, there’s much more going on than just a suspicious unbacked call site. Something less loud in nature and still popular (I think) is unhooking. Though Elastic does not use userland hooks, BitDefender does, so I tested a POC which performed unhooking with indirect syscalls, one with spoofing, and one without. The video below has more details.
The Cooler Spoof: Moonwalking
With the synthetic approach, the stack never truly unwinds to its base; the zero causes a premature cutoff for the unwinding process. The SilentMoonWalk project has the capability to create frames of dynamic size. Klezvirus called it moonwalking, but I like to call it “stitching” because of the way the dynamic stack size would “stitch” multiple frames together.
Earlier I stated that stack size was a common attribute used during stack unwinding, but it is not the only one. Interestingly, if a frame that utilizes the RBP
as a frame pointer is followed by a frame which pushes the RBP
, then it is possible to create a frame of dynamic size.
Below the frame setup that we’re going for is on the left, and the walk perspective is on the right.
It’s a complicated setup, hence I didn’t integrate it into my meterpreter reflective loader. But it’s really cool because the stack walk actually unwinds to the base of the stack. The UWOP_SET_FPREG
frame being dynamic in size makes it seem like the implant’s frames get stitched into it frame, hence I like to call it stitching. It still includes the gadget frame so we can regain execution flow in a similar manner to the synthetic frame approach.
Let’s actually break down the necessary components to perform this, though.
- A frame for a function that uses the RBP as a frame pointer (Frame #1)
- This can be determined via the operation code within the
UNWIND_CODE
of theRUNTIME_FUNCTION
of the frame’s function. If this code is 3 (UWOP_SET_FPREG
), then this function utilizes theRBP
as a frame pointer.
- This can be determined via the operation code within the
- A frame for a function that pushes the RBP (Frame #2)
- This can be determined via the operation code being an
UWOP_PUSH_NONVOL
, with the operation info in theUNWIND_CODE
being 5 (RBP_OP_INFO
)
- This can be determined via the operation code being an
- The index of the RBP being pushed (was it the first register pushed, or the second, etc.)
- We just iterate through the
UNWIND_CODE
s of Frame #2 and count how manyUWOP_PUSH_NONVOL
s there were until we hit one withRBP_OP_INFO
- We just iterate through the
- The offset from the origin frame (ie:
BaseThreadInitThunk + 0x14
)- This offset is calculated via the stack address of
BaseThreadInitThunk + 0x14
- ( Stack Size of Frame #1 - (UNWIND_CODE.FrameOffset
of Frame #1 * 16 ) - 8 )
- This offset is calculated via the stack address of
With this information, we then build our frames like so:
- Create Frame #1
- Create Frame #2
- Insert the stack address at an offset above Frame #1. This stack address is equal to the origin frame address, minus the previously mentioned offset. The stack address placement offset is equal to the index of
RBP
being pushed, times 8. - Create gadget frame
While the spoofed call output may seem similar to the synthetic approach, it actually unwinds to the base of the stack rather than the stack being truncated by a 0.
Note that this was a standalone POC and not implemented into the meterpreter reflective loader, hence all memory addresses are backed by modules. The LoudSunRun.exe
backed frames would be the implant frames had this been implemented.
Defensive Considerations
The spoofed stack from these techniques can be identified manually by looking weird. For example, in the previously shown spoofed NtAllocateVirtualMemory
call, NtAllocateVirtualMemory
usually won’t return to SetDefaultCommConfigW
, if at all.
These implementations relay on a gadget that jumps to a nonvolatile registers to return execution flow to the implant. This is because nonvolatile registers have their values preserved over function calls; they are a safe place to store the address of a fixup handler and/or a struct with information. A frame with a return address to a jmp nonvolatile
or jmp [nonvolatile]
register should be seen as suspicious. This seems to be something game cheats have actually picked upon, based on this article by the Secret Club.
Additionally, Klezvirus introduced a novel detection in his Defcon Talk. His tool, Eclipse, checks if the instruction before the return address isn’t a call
; if there isn’t then a call has been spoofed. If we check our jmp [rbx]
gadget, we can see that the instruction before it definitely isn’t a call.
Even if the instruction before the return address is a call
, Eclipse checks to make sure that the call
is calling the function in the next frame. There is a ton of other interesting and novel techniques in that talk that I can’t cover because I don’t know enough, so please read it if you can.
Bonus: Function Proxying
This technique isn’t stack spoofing but it is intended to address the same defense detection, so I thought it’s worth covering. Essentially, some functions can run callbacks and pass arguments to them, all within a separate thread. This causes a target function to be executed with normal looking call stack without the need for spoofing. See the following posts/projects:
- WorkItemLoadLibrary by rad9800
- Hiding in PlainSight by Paranoid Ninja
While those stacks are completely clean, there’s the possiblity of EDR hooking those WinAPIs needed to perform function proxying. And to unhook them it would require executing other WinAPIs/syscalls which could be detected via call stack again. A chicken and egg problem, sort of. There’s probably a solution to this, I’m just not there yet lol.
Regarding detection opportunities, proxy calling is in a similar case as spoofing stuff, except rather than having a gadget frame, if the “lower api” call is proxied (e.g: to bypass hooks), then the stack will still be missing the typical frames for that function.
Conclusion
This topic is super cool, and there’s more to dig into it. It took a few months, but I think it was worth it. I hope this helped to introduce and explain this (in my opinion) beast of a technique.
Credits and References
A big list of people who helped me or projects/blogs that I referenced during my research.
- KlezVirus - SilentMoonWalk is amazing, and his Defcon talk even more so. Seriously, check the slides and all the Moonwalk talks
- 0xboku - For a ton of help during my internship and guiding me through this topic
- William Burgess - CallStackMasker, VulcanRaven, and this article
- Namaszo - Return address spoofing
- 5pider - Havoc Framework, Kaynldr, ShellcodeTemplate, and for being epic
- Kudaes - Unwinder
- mgeeky - ThreadStackSpoofer
- Paranoid Ninja - Hiding in PlainSight
- rad9800 - Proxy Loading
- Kyle Avery - AceLdr
- realoriginal - TitanLdr
- spotheplanet - Unhooking code