What is Loader Lock? | Elliot on Security

by Prapattimynk, Thursday, 7 December 2023 (3 months ago)


In Windows, every DLL starts by executing its initialization function known as DllMain. This function runs while internal loader synchronization objects, including loader lock, are held. So, you must be especially careful not to violate a lock hierarchy in your DllMain; otherwise, a deadlock may occur.

Loader lock is a critical section. In WinDbg, you can detect the presence of loader lock with this command:

0:000> !critsec ntdll!LdrpLoaderLock

CritSec ntdll!LdrpLoaderLock+0 at 00007ffef2ef55c8
WaiterWoken        No
LockCount          0
RecursionCount     1
OwningThread       46e0
EntryCount         0
ContentionCount    0
*** Locked

External code can search the Process Environment Block (PEB) for loader lock. Then you can use RtlIsCriticalSectionLocked (an NTDLL export) to check its status:

0:000> dt _PEB @$peb -n LoaderLock
ntdll!_PEB
   +0x110 LoaderLock : 0x00007ffe`f2ef55c8 _RTL_CRITICAL_SECTION

Its location at offset +0x110 in the PEB has been stable since Windows NT 4.0 (the predecessor to Windows 2000). This is the offset for 64-bit processes; 32-bit processes have this member at offset +0xa0. Still, loader lock is officially an opaque implementation detail that Microsoft is contractually free to change or remove at any time.

A previous (outdated) look at loader lock states that the purpose of this lock is controlling access between threads to the module list. Let’s put that theory to the test on a modern Windows 10 system.

First, we will confirm the hypothesis was true at the time. We will base this analysis on ReactOS code. ReactOS is an open source reimplementation of Microsoft Windows built from the ground up by reverse engineering. It targets Windows Server 2003 support (additionally featuring some Windows 7+ APIs); this is around the same time Raymond Chen wrote his article summarizing loader lock in 2004.

Looking into the LoadLibrary function seems like an excellent place to start.

Delving into ReactOS source code, we follow the call chain LoadLibraryW -> LoadLibraryExW -> LdrLoadDll -> LdrpLoadDll. In both LdrLoadDll and LdrpLoadDll (and all of their subfunctions), it’s clear to see that loader lock and no other locks are acquired before reading/modifying existing entries or adding/removing new entries to/from the module list. In LdrLoadDll, loader lock is acquired and released with LdrLockLoaderLock and LdrUnlockLoaderLock, respectively. In LdrpLoadDll, the effect is the same by directly calling RtlEnterCriticalSection and RtlLeaveCriticalSection on loader lock (LdrpLoaderLock).

Info: Function prefixes such as Ldr (loader) and Ldrp (loader internals) sort Native API / NT components into groups. Here’s a longer list of them if you want to know more.

For reading/modifying existing module entries, this fact becomes even more apparent when simply looking at any function which touches the LoadCount of a module in the list. If a module’s LoadCount (or reference count) equals zero, it gets unloaded from the process. A module’s LoadCount is stored with the rest of the module’s information in the module list. Loader lock is always the only lock acquired before interacting with the LoadCount.

Looking into a reading function like GetModuleHandle, we can see that BasepGetModuleHandleExW -> RtlPcToFileHeader is in turn called to find the requested module. GetModuleHandle calls BasepGetModuleHandleExW with NoLock set to TRUE, thereby causing BasepGetModuleHandleExW to not acquire loader lock (no LdrLockLoaderLock). However, upon entry into RtlPcToFileHeader, loader lock is immediately acquired (RtlEnterCriticalSection (NtCurrentPeb()->LoaderLock)) before walking the module list to find the requested module. This quick look confirms that the loader also acquires loader lock for reads (this was obvious because performing writes already wasn’t atomic/lock-free in nature, but it’s good to verify).

From this short look into the loader source code, we can conclude that the theory is absolutely true for the legacy Windows Server 2003 loader. Furthermore, the legacy loader uses loader lock as one big lock around all functions that do loader work. This lock protects against not only concurrent module list access but also concurrent module loads/unloads, initialization/deinitialization (i.e. DllMain), and more.

Reading the source code of the aforementioned LdrLoadDll and LdrpLoadDll ReactOS functions, one might notice that LdrLoadDll acquires loader lock then calls LdrpLoadDll and it acquires loader lock again. How can LdrpLoadDll acquire loader lock when LdrLoadDll has already acquired it?

This question is in the same vein as a question I got in response to my previous article regarding loader lock: HOW is it possible for LoadLibrary to work from DllMain when we’re still under Loader Lock!?!? It’s true: calling LoadLibrary from DllMain (while not considered best practice by Microsoft) successfully loads libraries with no prior steps:

// DllMain boilerplate code (required in every DLL)
#include <windows.h>

BOOL WINAPI DllMain(HINSTANCE hinstDll, DWORD fdwReason, LPVOID lpvReserved)
{
    switch (fdwReason)
    {
    case DLL_PROCESS_ATTACH:
        // This DLL, for example, will successfully load from DllMain
        // Ensure the DLL isn't already loaded with the WinDbg !address command
        // LoadLibraryW matches the wide string literal whether or not UNICODE is defined
        LoadLibraryW(L"user32");
        break;
    }

    return TRUE;
}

But how? Confusion regarding this stems from a misunderstanding of what a critical section is. A critical section is a thread synchronization mechanism. It’s not for synchronizing subroutines within the same thread.

This fact means critical sections support recursive acquisition (this is fundamental). That is, a lock can be acquired multiple times in the same thread without waiting for its release. Here we have our sample test DLL (Dll2) containing the above code as a demonstration of this ability:

Loader lock is acquired recursively by the same OwningThread, thus increasing RecursionCount. As a result, program execution continues without waiting.

This screenshot is taken on Windows 10, hence the LdrpReleaseLoaderLock function.

The recursive acquisition of loader lock is a natural occurrence when loading a library that depends on other libraries. Indeed, the ReactOS code for LdrLoadDll makes reference to recursive loads with variable names such as LdrpShowRecursiveLoads.

If another thread tries to come along and acquire loader lock (a critical section) simultaneously as our thread is already holding the lock, then that increases the lock’s ContentionCount and the other thread has to wait for its release.
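To see recursive acquisition outside of a debugger, here is a minimal sketch that uses a POSIX recursive mutex as a stand-in for a critical section. This is not loader code: `acquire_recursively` is a hypothetical helper I made up for illustration, and the only claim is that an owner-tracking lock lets its owner re-enter without waiting.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <pthread.h>

/* Hypothetical helper (not a loader function): lock a recursive mutex
 * `depth` times on the calling thread, then unlock it as many times.
 * Returns how many acquisitions succeeded without blocking. */
int acquire_recursively(int depth)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t lock;
    int acquired = 0;

    assert(depth >= 0);

    /* A recursive mutex records its owner and a recursion count,
     * much like OwningThread and RecursionCount in !critsec output. */
    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);

    for (int i = 0; i < depth; i++) {
        /* trylock never blocks; for the current owner of a recursive
         * mutex it succeeds and bumps the recursion count. */
        if (pthread_mutex_trylock(&lock) != 0)
            break;
        acquired++;
    }

    /* Every acquisition needs a matching release before another
     * thread could take the lock. */
    for (int i = 0; i < acquired; i++)
        pthread_mutex_unlock(&lock);

    pthread_mutex_destroy(&lock);
    return acquired;
}
```

Calling `acquire_recursively(3)` from a single thread should report three successful acquisitions, mirroring how LoadLibrary from DllMain re-enters loader lock on the thread that already owns it.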

A critical section can effectively be used as a subroutine synchronization mechanism if you’re careful not to call any code that would recursively acquire it. However, that’s not its primary purpose and using a critical section in that scenario unnecessarily increases overhead when a simpler lock would suffice. Splitting up a thread synchronization lock into separate, more specialized locks would also increase concurrency, thus improving a system’s perceived performance.

Keep in mind that, while possible, loading libraries from DllMain is still not the best practice and is officially unsupported by Microsoft. It’s poor for performance because invoking long-running operations while holding loader lock blocks other threads from loading libraries. If such operations are carried out during load-time (at process startup), all program execution gets held up! This is priority inversion and it’s best avoided. Additionally, there may be previous Windows versions where loading libraries from DllMain isn’t possible due to some design/implementation quirk (and we all know how much Microsoft likes their decades upon decades of backward compatibility). In particular, it appears that doing risky things from DllMain during DLL_PROCESS_DETACH (not during DLL_PROCESS_ATTACH as shown above) is an acutely horrible idea, especially before Windows Vista.

Now that we’re familiar with loaders gone by – let’s take a look at a modern Windows 10 (22H2) loader!

Through this analysis, we will see how what was once one large blocking loader lock around all loader work (coarse-grained locking) has been split up into smaller, specialized locks (more fine-grained locking), to increase concurrency, thereby improving perceived performance.

Before analyzing the modern Windows loader, it’s essential to determine what type of data structures the loader stores module information in. The shared data structures determine where and what kind of locking would be necessary to protect module information from unsynchronized access, thereby helping us in our analysis.

The module list is stored as a single list but linked in multiple orders. It holds LDR_DATA_TABLE_ENTRY structures, each of which lives in an allocation on the heap:

0:000> x /0 ntdll!PebLdr
00007ffe`f2efb4c0
0:000> dt _PEB_LDR_DATA 00007ffe`f2efb4c0
ntdll!_PEB_LDR_DATA
   +0x000 Length           : 0x58
   +0x004 Initialized      : 0x1 ''
   +0x008 SsHandle         : (null) 
   +0x010 InLoadOrderModuleList : _LIST_ENTRY [ 0x000001e7`f9c12d30 - 0x000001e7`f9c12ba0 ]
   +0x020 InMemoryOrderModuleList : _LIST_ENTRY [ 0x000001e7`f9c12d40 - 0x000001e7`f9c12bb0 ]
   +0x030 InInitializationOrderModuleList : _LIST_ENTRY [ 0x000001e7`f9c12bc0 - 0x000001e7`f9c12bc0 ]
   +0x040 EntryInProgress  : (null) (Unused in Windows 10)
   +0x048 ShutdownInProgress : 0 ''
   +0x050 ShutdownThreadId : (null) 
0:000> $$ The following command is generated by clicking on `InLoadOrderModuleList` in WinDbg command output
0:000> dx -r1 (*((ntdll!_LIST_ENTRY *)0x7ffef2efb4d0))
(*((ntdll!_LIST_ENTRY *)0x7ffef2efb4d0))                 [Type: _LIST_ENTRY]
    [+0x000] Flink            : 0x1e7f9c12d30 [Type: _LIST_ENTRY *]
    [+0x008] Blink            : 0x1e7f9c12ba0 [Type: _LIST_ENTRY *]
$$ Click on `Flink`/`Blink` (forward/backward link) to inspect the next/previous list entry from the current (first) entry of 0x7ffef2efb4d0
0:000> !address 0x1e7f9c12d30
...
Usage:                  Heap
...
0:000> dt _LDR_DATA_TABLE_ENTRY 0x1e7f9c12d30
ntdll!_LDR_DATA_TABLE_ENTRY
   +0x000 InLoadOrderLinks : _LIST_ENTRY [ 0x000001e7`f9c12ba0 - 0x00007ffe`f2efb4d0 ]
   +0x010 InMemoryOrderLinks : _LIST_ENTRY [ 0x000001e7`f9c12bb0 - 0x00007ffe`f2efb4e0 ]
   +0x020 InInitializationOrderLinks : _LIST_ENTRY [ 0x00000000`00000000 - 0x00000000`00000000 ]
   +0x030 DllBase          : 0x00007ff6`28690000 Void
   +0x038 EntryPoint       : 0x00007ff6`286912d0 Void
   +0x040 SizeOfImage      : 0x7000
   +0x048 FullDllName      : _UNICODE_STRING "C:\Users\user\source\repos\EmptyProject\x64\Release\EmptyProject.exe"
   +0x058 BaseDllName      : _UNICODE_STRING "EmptyProject.exe"
   +0x068 FlagGroup        : [4] "???"
   +0x068 Flags            : 0x22c4 (Flags variable stores all flag states)
   +0x068 PackagedBinary   : 0y0 (List all possible flags)
   +0x068 MarkedForRemoval : 0y0
   +0x068 ImageDll         : 0y1
   +0x068 LoadNotificationsSent : 0y0
   +0x068 TelemetryEntryProcessed : 0y0
   +0x068 ProcessStaticImport : 0y0
   +0x068 InLegacyLists    : 0y1
   +0x068 InIndexes        : 0y1
   +0x068 ShimDll          : 0y0
   +0x068 InExceptionTable : 0y1
   +0x068 ReservedFlags1   : 0y00
   +0x068 LoadInProgress   : 0y0
   +0x068 LoadConfigProcessed : 0y1
   +0x068 EntryProcessed   : 0y0
   +0x068 ProtectDelayLoad : 0y0
   +0x068 ReservedFlags3   : 0y00
   +0x068 DontCallForThreads : 0y0
   +0x068 ProcessAttachCalled : 0y0
   +0x068 ProcessAttachFailed : 0y0
   +0x068 CorDeferredValidate : 0y0
   +0x068 CorImage         : 0y0
   +0x068 DontRelocate     : 0y0
   +0x068 CorILOnly        : 0y0
   +0x068 ChpeImage        : 0y0
   +0x068 ReservedFlags5   : 0y00
   +0x068 Redirected       : 0y0
   +0x068 ReservedFlags6   : 0y00
   +0x068 CompatDatabaseProcessed : 0y0 (End list of all possible flags)
   +0x06c ObsoleteLoadCount : 0xffff
   +0x06e TlsIndex         : 0
   +0x070 HashLinks        : _LIST_ENTRY [ 0x00007ffe`f2efb240 - 0x00007ffe`f2efb240 ]
   +0x080 TimeDateStamp    : 0x655c238e
   +0x088 EntryPointActivationContext : (null) 
   +0x090 Lock             : (null) (Unused in Windows 10; verified with watchpoints)
   +0x098 DdagNode         : 0x000001e7`f9c12e60 _LDR_DDAG_NODE
   +0x0a0 NodeModuleLink   : _LIST_ENTRY [ 0x000001e7`f9c12e60 - 0x000001e7`f9c12e60 ]
   +0x0b0 LoadContext      : 0x000000e5`62eff0e0 _LDRP_LOAD_CONTEXT
   +0x0b8 ParentDllBase    : (null) 
   +0x0c0 SwitchBackContext : (null) 
   +0x0c8 BaseAddressIndexNode : _RTL_BALANCED_NODE
   +0x0e0 MappingInfoIndexNode : _RTL_BALANCED_NODE
   +0x0f8 OriginalBase     : 0x00007ff6`28690000
   +0x100 LoadTime         : _LARGE_INTEGER 0x01da1deb`cb90b0a4
   +0x108 BaseNameHashValue : 0x6190c450
   +0x10c LoadReason       : 4 ( LoadReasonDynamicLoad )
   +0x110 ImplicitPathOptions : 0
   +0x114 ReferenceCount   : 2
   +0x118 DependentLoadFlags : 0
   +0x11c SigningLevel     : 0 ''
$$ Pro tip: Generate a list of all module entries with this command:
0:000> !list -x "dt ntdll!_LDR_DATA_TABLE_ENTRY" @@C++(&@$peb->Ldr->InLoadOrderModuleList)

A nuance is that the first entry in the module list doesn’t point to a LDR_DATA_TABLE_ENTRY but instead to ntdll!PebLdr+0x10 (confirmed by running ln command on the first entry, shown as 0x7ffef2efb4d0 above). This first entry is written during ntdll!PebLdr initialization (inspect with !peb command) near the start of ntdll!LdrpInitializeProcess. ntdll!LdrpInitializeProcess is an expansive function handling all process initialization when a process first starts up.

As we can see, all three of these link orders, including InLoadOrderLinks, InMemoryOrderLinks, and InInitializationOrderLinks are of type LIST_ENTRY, which means they’re doubly linked and circular.

Visualization of module list. Arrows are bidirectional due to double linking. All credit to the original author.
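The circular, doubly linked layout can be sketched in a few lines of portable C. This is only an illustration: `list_entry` mirrors the two-pointer shape of LIST_ENTRY, while `module_entry` and the helper names are hypothetical stand-ins I chose, not the real NTDLL definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the Windows LIST_ENTRY: just two pointers. */
typedef struct list_entry {
    struct list_entry *flink;  /* forward link  */
    struct list_entry *blink;  /* backward link */
} list_entry;

/* Hypothetical, heavily trimmed stand-in for LDR_DATA_TABLE_ENTRY. */
typedef struct module_entry {
    list_entry in_load_order_links;
    const char *base_dll_name;
} module_entry;

/* An empty circular list is a head that points at itself. */
void list_init(list_entry *head)
{
    head->flink = head;
    head->blink = head;
}

/* Link `entry` just before the head, i.e. at the tail
 * (the same job InsertTailList does). */
void list_insert_tail(list_entry *head, list_entry *entry)
{
    list_entry *tail = head->blink;
    entry->flink = head;
    entry->blink = tail;
    tail->flink = entry;
    head->blink = entry;
}

/* Walk forward until we arrive back at the head. The head itself is
 * not embedded in a module_entry, so it is never converted to one. */
size_t list_count(const list_entry *head)
{
    size_t n = 0;
    for (const list_entry *e = head->flink; e != head; e = e->flink)
        n++;
    return n;
}

/* Recover the containing module_entry from its embedded links
 * (the idea behind the CONTAINING_RECORD macro). */
module_entry *entry_from_links(list_entry *links)
{
    assert(links != NULL);
    return (module_entry *)((char *)links -
                            offsetof(module_entry, in_load_order_links));
}
```

The walk stops when it gets back to the head, which matches the nuance above: the head lives in ntdll!PebLdr rather than inside a LDR_DATA_TABLE_ENTRY.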

Each LDR_DATA_TABLE_ENTRY possesses a HashLinks member. These hash links point into LdrpHashTable and act as a layer of abstraction over each list entry in the previously shown module list. These hash links improve lookup performance when searching for a module.

This hash table contains 32 buckets. LdrpHashTable is 512 bytes in size (ln command), and each bucket is made up of a list entry containing two pointers for Flink/Blink (16 bytes), so we can prove 32 buckets by doing 512 / 16 = 32. This size has remained unchanged since the legacy loader (in ReactOS source code).
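The arithmetic above, plus how a hash might map to a bucket, can be sketched as follows. The sizes come from the text; the bucket selection is an assumption on my part (masking the hash with the bucket count minus one, which only works because 32 is a power of two), and `bucket_index` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

enum {
    HASH_TABLE_BYTES = 512, /* size of LdrpHashTable reported by ln   */
    LIST_ENTRY_BYTES = 16,  /* two 8-byte pointers (Flink/Blink), x64 */
    BUCKET_COUNT = HASH_TABLE_BYTES / LIST_ENTRY_BYTES
};

/* ASSUMPTION: the bucket is chosen by keeping only the low bits of
 * the hash. This is an illustration, not verified loader behavior. */
unsigned bucket_index(uint32_t base_name_hash)
{
    assert((BUCKET_COUNT & (BUCKET_COUNT - 1)) == 0); /* power of two */
    return base_name_hash & (BUCKET_COUNT - 1);
}
```

Under that assumption, the BaseNameHashValue of 0x6190c450 shown in the earlier dump would land in bucket 16.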

Hashing each name is done by calling LdrpHashUnicodeString (which in turn calls RtlHashUnicodeString). Upon resolution (hashing), each name resolves to one of the LDR_DATA_TABLE_ENTRY entries (technically LDR_MODULE entries which are the same thing but with fewer members) in the module list.

A hash table (or hash map) is an array in which each index (“bucket”) holds a list head. Suppose a collision occurs (imperfect hash function) whereby hashing resolves two names to the same bucket. In that case, both entries are linked onto that bucket’s list, which holds every entry that hashes there. This approach is called separate chaining and is the most common method of collision resolution. Software uses hash tables because lookups typically take constant time on average.

Starting with Windows 8, each LDR_DATA_TABLE_ENTRY is given two new members called BaseAddressIndexNode and MappingInfoIndexNode, both of type RTL_BALANCED_NODE.

I’ll let this excerpt from Windows Internals: System architecture, processes, threads, memory management, and more, Part 1 (7th edition) take it from here:

Additionally, because lookups in linked lists are algorithmically expensive (being done in linear time), the loader also maintains two red-black trees, which are efficient binary lookup trees. The first is sorted by base address, while the second is sorted by the hash of the module’s name. With these trees, the searching algorithm can run in logarithmic time, which is significantly more efficient and greatly speeds up process-creation performance in Windows 8 and later.

Beginning with Windows 8, each LDR_DATA_TABLE_ENTRY is given a LDR_DDAG_NODE member. Microsoft added this member to solve issues in the resolution of complex dependency chains between libraries as they’re loaded and unloaded.

The extra “D” on DDAG most likely stands for “dependency”, which makes sense because this DAG is for tracking dependencies.

A graph data structure is a superset of the tree and directed acyclic graph (DAG) data structures. Trees and DAGs are directional, meaning they have parent-child relationships. Each node in a tree can only have one parent, unlike a DAG where each node can have multiple parents. Both data structures are acyclic.

Controlling access to shared data structures like those reviewed above is likely only achievable by full or per-node locking. A programmer’s choice would be weighed for costs and benefits; typically, per-node locking is not worth the trade-off. In the case of the Windows loader, I’ll tell you upfront that the loader only does full locking to control access to data structures.

Particularly in the case of the module linked list, it’s linked together in three different orders. There’s no single atomic assembly instruction (such as lock cmpxchg for atomically modifying a simple flag) you could give the CPU to do all that in one step, thus enabling a developer to write so-called “lock-free” code. So, we can expect to see the code employing OS-level synchronization mechanisms.

Starting with Windows Vista / Server 2008, a new lock variety was added to Windows known as the slim read/write (SRW) lock. SRW locks introduced two new lock types to the Windows API, including an exclusive/write lock and shared/read lock. Most notable for our purposes is the exclusive SRW lock. Unlike critical sections, this lock type doesn’t keep track of the acquiring thread ID, making it useful for doing synchronization between subroutines (within the same thread and between threads; the acquiring thread is irrelevant). In terms of locks, it’s about as minimal as it gets only storing a single pointer-sized integer which is set to indicate whether the lock is unlocked (0), owned/locked (1), contended (2), or there’s a wait block for keeping track of who tried to acquire a contended lock first when there are multiple waiters (bitwise or with the StackWaitBlock address all according to ReactOS code). Its minimal nature could improve performance for highly parallelized workloads that don’t require the extra features offered by a critical section. In the following analysis, we will see how the modern Windows loader uses this newer exclusive SRW lock.
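To illustrate what "no owner tracking" means in practice, here is a toy exclusive lock built on C11 atomics. It is not the SRW lock implementation: it models only the unlocked (0) and locked (1) states, has no wait path, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy lock: one pointer-sized word, 0 = unlocked,
 * 1 = locked. No owner and no recursion count are stored anywhere. */
typedef struct toy_lock {
    atomic_uintptr_t state;
} toy_lock;

void toy_lock_init(toy_lock *l)
{
    atomic_init(&l->state, 0);
}

/* One atomic compare-and-swap from 0 to 1 (a lock cmpxchg on x86).
 * Returns true if this call took the lock. */
bool toy_try_acquire(toy_lock *l)
{
    uintptr_t expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->state, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Anyone may release; the lock cannot tell who acquired it. */
void toy_release(toy_lock *l)
{
    assert(atomic_load_explicit(&l->state, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

A second `toy_try_acquire` on the same thread fails because the lock has no idea the caller already holds it. A blocking acquire in that position would simply wait forever, which is the key behavioral difference from a critical section.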

In WinDbg, we set a breakpoint on ntdll!RtlAcquireSRWLockExclusive, tell the debugger to stop on NTDLL library load using the sxe ld:ntdll command, and hit Go!

Pretty soon, we hit our breakpoint when LdrpInitializeProcess -> LdrpInsertModuleToIndex calls RtlAcquireSRWLockExclusive to acquire a lock known as the LdrpModuleDatatableLock (LdrpInitializeProcess does a few things before this using different SRW locks but they’re unrelated).

ntdll!LdrpInsertModuleToIndex:
mov     qword ptr [rsp+8], rbx
push    rdi
sub     rsp, 20h
mov     rdi, rcx
mov     rbx, rdx
lea     rcx, [ntdll!LdrpModuleDatatableLock (7ff9f74bd260)]
call    ntdll!RtlAcquireSRWLockExclusive (7ff9f73790a0)
mov     rdx, rbx
mov     rcx, rdi
call    ntdll!LdrpInsertModuleToIndexLockHeld (7ff9f7364744)
lea     rcx, [ntdll!LdrpModuleDatatableLock (7ff9f74bd260)]
mov     rbx, qword ptr [rsp+30h]
add     rsp, 20h
pop     rdi
; This is a tail call optimization:
; It's equivalent to a call then ret but faster
jmp     ntdll!RtlReleaseSRWLockExclusive (7ff9f7362c70)

Upon analyzing the registers immediately before LdrpInsertModuleToIndexLockHeld so we can know the passed arguments, we see that this is adding NTDLL’s own LDR_DATA_TABLE_ENTRY to the index of modules (confirmed by running r rcx; dt _LDR_DATA_TABLE_ENTRY <RCX_VALUE>). Stepping up in the call stack to LdrpInitializeProcess (this is an expansive function for handling all process initialization on process startup), we see these three interesting functions called one after another:

  • LdrpAllocateModuleEntry
  • LdrpInsertDataTableEntry
  • LdrpInsertModuleToIndex

Let’s do a deep dive into what each of these functions is doing to our known data structures.

Module Entry Creation Deep Dive

LdrpAllocateModuleEntry calls RtlAllocateHeap to allocate the new module entry on the heap. These allocations come from the process heap, which was already created during LdrpInitializeProcess by calling LdrpInitializeProcessHeap, which in turn calls RtlCreateHeap.

RtlAllocateHeap is called twice:

  • The memory returned by the first call becomes this module’s LDR_DATA_TABLE_ENTRY. A pointer to it becomes the return value of LdrpAllocateModuleEntry as a whole.
  • The memory returned by the second call becomes an LDR_DDAG_NODE, which the new LDR_DATA_TABLE_ENTRY points to through its DdagNode member.

In the context of being called from LdrpInitializeProcess during process startup, NTDLL is a little special in that a pointer to its LDR_DATA_TABLE_ENTRY gets put into ntdll!LdrpNtDllDataTableEntry for easy access shortly after LdrpAllocateModuleEntry returns.

LdrpInsertDataTableEntry hashes the BaseDllName member (e.g. ntdll.dll) from the LDR_DATA_TABLE_ENTRY by calling LdrpHashUnicodeString. Based on the hash, a bucket from ntdll!LdrpHashTable is chosen. A pointer to this bucket is added to HashLinks, a doubly linked list in LDR_DATA_TABLE_ENTRY. Then, a pointer to the current LDR_DATA_TABLE_ENTRY.HashLinks gets put into the hash table at the chosen bucket. Insertion is done with InsertTailList, which always gets compiled inline.

LdrpInsertDataTableEntry also adds the newly allocated LDR_DATA_TABLE_ENTRY into the module list, linking the new list entry in the InLoadOrderLinks and InMemoryOrderLinks orders.

ReactOS has a function similar to this called LdrpInsertMemoryTableEntry which appears to have been its name in the Windows 2000 era. One difference I notice is that LdrpInsertDataTableEntry performs extra sanity checks before modifying both these data structures. If one of those checks fails, a __fastfail (int 29h) with code FAST_FAIL_CORRUPT_LIST_ENTRY is raised. These few extra checks bolster security against exploits by catching memory corruption earlier.

LdrpInsertModuleToIndex calls RtlRbInsertNodeEx to insert the RTL_BALANCED_NODE at MappingInfoIndexNode inside the current LDR_DATA_TABLE_ENTRY (this is not a pointer; the structure is directly embedded). The tree’s root node at ntdll!LdrpMappingInfoIndex (an RTL_RB_TREE) is only modified if one of its direct descendants is added or removed. Otherwise, the Parent argument is non-NULL, and RtlRbInsertNodeEx creates the new node as a descendant of the specified node.

RtlRbInsertNodeEx is called again, this time performing the operation for LDR_DATA_TABLE_ENTRY.BaseAddressIndexNode and ntdll!LdrpModuleBaseAddressIndex.

Module Initialization

Initialization is the last step in the process of setting up a module.

The remaining module list order, InInitializationOrderLinks, gets linked. For NTDLL, InInitializationOrderLinks is linked immediately after ntdll!RtlInitializeHistoryTable returns still in LdrpInitializeProcess. For a normal module load (e.g. LoadLibrary), this happens early in LdrpInitializeNode (called by LdrpInitializeGraphRecurse). LdrpInitializeNode later calls LdrpCallInitRoutine, in turn calling the module’s DllMain where module initialization occurs.

During module initialization and deinitialization, the DAG comes into play. For a normal library load (e.g. LoadLibrary), LdrpInitializeGraphRecurse recurses on the DAG to start calling each module’s initialization function (DllMain) in the correct order until all dependencies are initialized:

call    ntdll!LdrpAcquireLoaderLock (7ffef2dce6c4)
mov     rcx, qword ptr [rdi+98h]
mov     rcx, qword ptr [rdi+98h] ; Duplicate instruction is here as a performance optimization for some CPU bug?
lea     r8, [rsp+50h]
mov     rdx, rsi
mov     byte ptr [rsp+50h], 0
call    ntdll!LdrpInitializeGraphRecurse (7ffef2dfc018)
mov     r8d, eax
mov     edx, 2
mov     ebx, eax
call    ntdll!LdrpReleaseLoaderLock (7ffef2dce664)
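The "dependencies first" ordering described above can be sketched as a post-order walk over a dependency graph. This is a simplified model under my own assumptions: `dag_node` is not the real LDR_DDAG_NODE, the module names are invented, and the real LdrpInitializeGraphRecurse tracks far more state than a single flag.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_DEPS = 4 };

/* Hypothetical, simplified dependency node (not the real LDR_DDAG_NODE). */
typedef struct dag_node {
    const char *name;
    struct dag_node *deps[MAX_DEPS]; /* modules this one imports from */
    size_t dep_count;
    int initialized;                 /* set once "DllMain" has run    */
} dag_node;

/* Initialize every dependency first, then the node itself.
 * `order` receives names in the order initialization ran and must
 * have room for every reachable node. Returns the updated entry
 * count. Assumes the graph really is acyclic. */
size_t init_graph_recurse(dag_node *node, const char **order, size_t count)
{
    assert(node != NULL && node->dep_count <= MAX_DEPS);

    if (node->initialized)
        return count;            /* shared dependency: only run once */

    for (size_t i = 0; i < node->dep_count; i++)
        count = init_graph_recurse(node->deps[i], order, count);

    node->initialized = 1;       /* stand-in for calling DllMain */
    order[count++] = node->name;
    return count;
}
```

With an application that depends on two libraries sharing a common dependency, the common library is initialized exactly once and before either of its dependents, and the application itself comes last.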

Likewise, during a normal FreeLibrary, LdrpUnloadNode calls our DLL’s DllMain, passing DLL_PROCESS_DETACH as the reason. Note that the Windows loader only unloads the immediate node (module) and none of its dependencies; this is just how FreeLibrary works on Windows:

call    ntdll!LdrpAcquireLoaderLock (7ffef2dce6c4)
mov     rcx, rbx
call    ntdll!LdrpUnloadNode (7ffef2dfa4c8)
xor     r8d, r8d
lea     edx, [r8+8]
call    ntdll!LdrpReleaseLoaderLock (7ffef2dce664)

On FreeLibrary, module deinitialization is the last thing done before LDR_DATA_TABLE_ENTRY.ReferenceCount and LDR_DATA_TABLE_ENTRY.DdagNode->LoadCount are decremented, causing module unload if both counts equal zero.

An interesting lock we have surrounding these function calls.

LdrpModuleDatatableLock is an exclusive lock that protects the module linked list, hash table, and red-black trees during module entry read or write operations.

Whenever the Windows loader wants to ensure these data structures remain in an unchanged, consistent, and valid state, LdrpModuleDatatableLock is acquired. This includes, for example, acquisition during module search operations like LdrpFindLoadedDllByName.

A nuance is that acquiring LdrpModuleDatatableLock (or LdrpLoaderLock) isn’t necessary during LdrpInitializeProcess. This is because our thread remains the only thread in the process and new threads spawned into our process during LdrpInitializeProcess (a remote process could call CreateRemoteThread) won’t be able to make progress anyway due to LdrpInitCompleteEvent (a Win32 event) waiting. During early initialization in LdrpInitialize, new threads wait (NtWaitForSingleObject) on LdrpInitCompleteEvent before doing anything. It’s not until LdrpProcessInitializationComplete calls NtSetEvent on LdrpInitCompleteEvent, thereby allowing other threads to move, that locking is necessary. Even if LdrpModuleDatatableLock is a subroutine locking mechanism, it’s not relevant to hold it because, in practice, our single thread isn’t going to do something rash that would deadlock itself or do inconsistent modification to a data structure whether LdrpModuleDatatableLock is locked or not (certainly not before calling into any third-party, non-Microsoft code).

This nuance allows LdrpModuleDatatableLock to not be held during LdrpInsertDataTableEntry, instead only acquiring it in LdrpInsertModuleToIndex. I suspect that the only reason LdrpInitializeProcess calls LdrpInsertModuleToIndex, thus acquiring LdrpModuleDatatableLock, is because not acquiring it would mean having to call LdrpInsertModuleToIndexLockHeld, which would be a misnomer in this context: it’s not that the lock is held, it’s just that you don’t have to acquire it given this unique circumstance of process initialization.

I’ve confirmed that during normal loader operation (e.g. doing a LoadLibrary), the loader calls RtlAcquireSRWLockExclusive to acquire LdrpModuleDatatableLock, safely calls LdrpInsertDataTableEntry, then safely calls LdrpInsertModuleToIndexLockHeld directly, lastly releasing by doing a RtlReleaseSRWLockExclusive on LdrpModuleDatatableLock. This pattern of operations protects all of the relevant module info data structures during process run-time.
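The wrapper versus "LockHeld" split seen in this pattern can be sketched as below. This is an analogy, not loader code: a plain POSIX mutex stands in for the exclusive SRW lock, a counter stands in for the module index, and all function names are hypothetical ones modelled on the names in this article.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-ins: one non-recursive lock guarding one shared counter
 * that represents the module index. */
static pthread_mutex_t module_datatable_lock = PTHREAD_MUTEX_INITIALIZER;
static int modules_in_index = 0;

/* "LockHeld" variant: the caller must already own the lock. */
void insert_module_to_index_lock_held(void)
{
    modules_in_index++;
}

/* Convenience wrapper: acquire, do the work, release. */
void insert_module_to_index(void)
{
    pthread_mutex_lock(&module_datatable_lock);
    insert_module_to_index_lock_held();
    pthread_mutex_unlock(&module_datatable_lock);
}

/* Run-time load path: one acquisition covers both insertions, so the
 * LockHeld variant is called directly. Calling the wrapper here would
 * try to take a non-recursive lock twice on the same thread. */
void insert_new_module(void)
{
    pthread_mutex_lock(&module_datatable_lock);
    /* ...linking the entry into the module list would go here... */
    insert_module_to_index_lock_held();
    pthread_mutex_unlock(&module_datatable_lock);
}

/* Read the counter under the same lock. */
int module_count(void)
{
    pthread_mutex_lock(&module_datatable_lock);
    int n = modules_in_index;
    pthread_mutex_unlock(&module_datatable_lock);
    assert(n >= 0);
    return n;
}
```

Because the lock is not recursive, the naming convention carries real weight: the suffix tells the caller which of the two entry points is safe to use from code that already holds the lock.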

But wait, how about the one and only: loader lock? We had to get away from it to put this information in context, but we’re ready to discuss it now!

So, we know that loader lock isn’t responsible for protecting the linked list, hash table, or red-black tree. This leaves only one data structure: the DAG.

Recall the code in the Module Initialization section. During LoadLibrary in LdrpInitializeGraphRecurse, you would have seen the code get an address at a register plus offset 0x98 (mov rcx, qword ptr [rdi+98h]). For FreeLibrary, this same operation is also done before LdrpUnloadNode; it’s just out of frame. And what do we know is at offset 0x98?

   +0x098 DdagNode         : 0x000001e7`f9c12e60 _LDR_DDAG_NODE

Yes, here we have a pointer to a DdagNode being extracted from the LDR_DATA_TABLE_ENTRY of our currently loading module!

I believe we have our answer: Loader lock is a critical section that controls access to the DAG data structure used by the loader to track dependency chains and protects against concurrent DLL initialization/deinitialization.

The loader also acquires loader lock during RtlExitUserProcess, which ensures no new libraries are initializing while the process tries to shut down.

The Windows and Linux (glibc ld.so) loaders are very different. For one, Linux only maintains a single data structure for module info, a non-circular doubly linked list of link_map structures… that’s it.

On Linux, the Windows critical section (a thread synchronization mechanism) equivalent is a mutex (this is part of POSIX defined pthread). On Windows, critical sections and mutexes are the same, except the former is intra-process, whereas the latter is inter-process.

Glibc source code refers to loader lock as _dl_load_lock. This lock protects from concurrent loads and unloads and is structured similarly to the Windows Server 2003 loader lock. It’s acquired by dlopen or dlclose upon committing to do any loader work.

_dl_load_write_lock works like a modern Windows loader’s LdrpModuleDatatableLock. These exclusive/write locks control access to module data structures. The only difference is that _dl_load_write_lock is briefly acquired and released once for every dlopen on Linux. In contrast, I counted the equivalent LdrpModuleDatatableLock being acquired 20 times for each LoadLibrary (passing in the full path to an empty test DLL) on Windows.

Architecturally speaking, the reason loader lock problems are significantly more prevalent on Windows than on Linux comes down to the design approach of each operating system: Windows lock hierarchies are much less modular than those on Linux. In other words, the loader’s state may be implicitly shared with other Windows components due to the monolithic architecture of the Windows API. Hence, doing unrelated things that synchronize threads like spawning and waiting on a thread’s creation (without loading/unloading any libraries on the new thread) can violate the greater NTDLL lock hierarchy. Contrast that with the Unix philosophy.

I intend to release a GitHub repo soon containing the full info, including my experiments with the Windows vs. Linux loaders!

In this article, we learned the building blocks of a modern Windows loader. Using this knowledge, we understood how operating systems perform locking around the relevant shared data structures and sections of code.

Atop these building blocks is another layer of abstraction: the parallel loader. Introduced in Windows 10 to further improve performance, this is a thread pool (i.e. a bunch of threads assigned and ready to do one task at any time) for only loader work. These show up as ntdll!TppWorkerThread threads in WinDbg. Following NTDLL initialization, LdrpInitializeProcess calls LdrpInitParallelLoadingSupport, thus beginning parallel loader setup. I’ll gloss over this by stating that the LdrpAllocatePlaceHolder function allocates a LdrpWorkQueue LIST_ENTRY item and then calls LdrpAllocateModuleEntry to create a module entry, thus creating a whole work item. Loader work threads then read work items from the work queue and do the appropriate work (mapping or snapping). Now you’re seeing how this whole system starts to come together! If you want to learn more about the modern Windows loader, then I recommend you check out Windows 10 Parallel Loading Breakdown by Jeffrey Tang from BlackBerry as he provides a fantastic high-level overview.

In any case, I hope reading this article allowed you to more deeply appreciate everything that goes on under the hood when you double-click a program on Windows.
