The Nightmare of Dynamic Linking
In Part 3, we built a functional user-space loader in Zig. It could read an ELF file, map it into memory, set up the stack, and jump to the entry point. It worked perfectly for static binaries.
But when we tried to run /bin/ls, it crashed.
The reason is simple: /bin/ls is missing parts. It doesn't contain the code for printf, open, or malloc. It relies on libc.so. To run it, we need a Dynamic Linker.
You might think: "Well, we already wrote a loader. Can't we just load libc.so the same way we
loaded the main program?"
If we only had to support our own custom libraries, dynamic linking would be straightforward. We
could define a simple ABI, load the file, and patch a few pointers. The reason writing a
production-grade dynamic linker is one of the hardest tasks in systems programming isn't the concept
of linking itself, but satisfying the insane, undocumented, implicit requirements of libc.
Shared libraries are the work of the devil, the one true sign that the apocalypse is at hand. — Tom Duff
The Recursive Problem
The Dynamic Linker (usually /lib64/ld-linux-x86-64.so.2) is specified in the PT_INTERP header of
the executable.
Our job is to map the executable, map the interpreter, and then jump to the interpreter's entry
point. The interpreter then loads libc, resolves symbols, and eventually jumps to the application.
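To make that first step concrete, here is a minimal sketch in C of how a loader might find the interpreter path; find_interp is a made-up name, the file is assumed to be fully read into buf, and all bounds checks are omitted:

#include <elf.h>
#include <stddef.h>

/* Walk the program headers of an ELF file (already read into 'buf')
   and return the PT_INTERP path, or NULL for a static binary. */
const char *find_interp(const unsigned char *buf) {
    const Elf64_Ehdr *eh = (const Elf64_Ehdr *)buf;
    const Elf64_Phdr *ph = (const Elf64_Phdr *)(buf + eh->e_phoff);
    for (int i = 0; i < eh->e_phnum; i++) {
        if (ph[i].p_type == PT_INTERP) {
            /* The segment's contents are a NUL-terminated path,
               e.g. "/lib64/ld-linux-x86-64.so.2". */
            return (const char *)(buf + ph[i].p_offset);
        }
    }
    return NULL;
}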
But here is the catch: The Dynamic Linker is itself a shared library.
It contains position-independent code. It has a Global Offset Table (GOT). It needs relocation.
But who relocates the relocator? Well, the relocator relocates the relocator, by relocating its own relocations.
When the dynamic linker starts, it is running in a hostile environment. It cannot access its own global variables because its GOT hasn't been initialized yet. It cannot use string literals. It has to perform a "Bootstrap Relocation" on itself using careful assembly or restricted C code that avoids static data entirely.
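A minimal sketch of that bootstrap step, assuming only R_X86_64_RELATIVE entries need handling at this point (which ld.so arranges for itself at build time) and that base, rela, and count come from registers and the stack, never from globals; bootstrap_relocate is an illustrative name:

#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Patch our own memory: every R_X86_64_RELATIVE entry means
   "store (load base + addend) at (load base + offset)".
   Nothing here touches a global or a string literal, because
   neither is usable before this function finishes. */
static void bootstrap_relocate(uintptr_t base, const Elf64_Rela *rela, size_t count) {
    for (size_t i = 0; i < count; i++) {
        if (ELF64_R_TYPE(rela[i].r_info) == R_X86_64_RELATIVE) {
            *(uintptr_t *)(base + rela[i].r_offset) =
                base + (uintptr_t)rela[i].r_addend;
        }
    }
}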
The Final Boss: libc
If it were just about mapping files and looking up symbols, writing a dynamic linker would be a fun weekend project.
The real reason it is a nightmare is libc.
We tend to think of libc as a collection of helper functions (strlen, printf), nothing particularly impressive next to modern standard libraries. In reality, libc acts as a comprehensive runtime environment that makes massive assumptions about the state of the machine, explodes when those assumptions are not met, and still provides nothing really useful.
The "System Call" Lie
The integration between libc and the OS is deeper than you might expect. On Linux, they are so
intertwined that the line blurs.
Run man 2 mmap in your terminal. Section 2 is reserved for "System Calls," yet the documentation plainly states:
NAME
       mmap, munmap - map or unmap files or devices into memory

LIBRARY
       Standard C library (libc, -lc)
The documentation lies. It describes the C wrapper, not the actual kernel interface.
The actual kernel mmap syscall on x86_64 passes its fourth through sixth arguments in the registers r10, r8, and r9, and reports failure by returning a negative errno value between -4095 and -1. The C function mmap, however, follows the normal calling convention (arguments on the stack on 32-bit x86), returns MAP_FAILED on error, and sets errno.
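To make the gap visible, here is a sketch of invoking the raw x86_64 mmap syscall directly; raw_mmap is a made-up wrapper, but the register assignments and the -4095..-1 error convention are the kernel's actual ABI:

#include <stddef.h>
#include <stdint.h>

/* Raw x86_64 mmap: syscall number 9, arguments 4-6 in r10, r8, r9
   (the syscall instruction clobbers rcx/r11, so rcx is unusable). */
static void *raw_mmap(void *addr, size_t len, int prot, int flags, int fd, int64_t off) {
    register long r10 __asm__("r10") = flags;
    register long r8  __asm__("r8")  = fd;
    register long r9  __asm__("r9")  = off;
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(9), "D"(addr), "S"(len), "d"(prot),
                        "r"(r10), "r"(r8), "r"(r9)
                      : "rcx", "r11", "memory");
    /* The kernel returns -4095..-1 on error. libc's wrapper turns that
       into errno plus MAP_FAILED; here the caller checks the range. */
    return (void *)ret;
}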
"So what?" you ask. "I'll just write my own syscall wrapper in assembly. Zig does this for standalone binaries!"
You can, but the moment you load libc.so alongside your code, you enter a minefield. libc
assumes it owns the world. It caches process IDs. It assumes it handles all signal dispositions. It
assumes it manages the brk heap.
Most importantly, other libraries you might want to load (like libGL.so, libssl.so, or
libX11.so) depend on libc symbols. They will call open, malloc, and pthread_mutex_lock. If
your loader hasn't initialized libc's internal state exactly the way _start usually does, those
functions will crash or corrupt memory.
You are effectively locked into the libc ecosystem not because of the kernel, but because of the
dependency tree of every other shared object in existence.
The Horror of TLS (Thread Local Storage)
Try accessing errno. It looks like a global integer, right?
int x = errno;
It isn't. Since multiple threads can run at once, errno must be unique to each thread. In modern
glibc, that line of C code actually compiles to something like this assembly:
mov rax, fs:[0xffffffffffffff18] ; location of errno relative to FS segment
On x86_64, the fs segment register is used for Thread Local Storage. libc expects fs to point
to a Thread Control Block (TCB).
If our loader jumps to libc code without allocating a TCB and telling the kernel about it (via the
arch_prctl syscall), the CPU tries to dereference fs:offset. Since fs is 0 by default, this is
a null pointer dereference. And since 0 isn't usually mapped: Segfault.
But it gets worse. The TCB has a specific layout that libc relies on. The first entry in the TCB must be a pointer to the TCB itself. Why? Because code sometimes needs the TCB's address as an ordinary pointer, and reading the fs base register directly was historically slow or simply unavailable, so the ABI reads it from fs:[0] instead. If you don't set this up exactly right, the runtime crashes before main even starts.
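Here is a minimal sketch of the setup our loader would have to perform before letting any fs-relative code run; the two-field tcb is a drastic simplification of glibc's real tcbhead_t, and libc's syscall() is used purely for brevity:

#include <asm/prctl.h>    /* ARCH_SET_FS */
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Drastically simplified TCB: real ones carry stack guards,
   pointers into libc, and the static TLS blocks themselves. */
struct tcb {
    struct tcb *self;  /* fs:[0] must point back at the TCB */
    void       *dtv;   /* per-thread Dynamic Thread Vector */
};

static void install_tcb(void) {
    struct tcb *t = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    t->self = t;
    /* Point the fs base at the TCB; from here on, fs-relative
       accesses like errno stop being null dereferences. */
    syscall(SYS_arch_prctl, ARCH_SET_FS, t);
}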
The dlopen & pthread_create Conspiracy
Here is where the OS, the Linker, and libc form a tangled knot of dependencies.
When you run a program, the linker calculates how much TLS space is needed by the executable and all
loaded libraries. It allocates a block, sets fs, and everyone is happy. This is Static TLS.
But then someone calls dlopen("plugin.so", RTLD_LAZY).
We are loading a new library at runtime. This library has its own __thread variables. It needs TLS
space. But we have threads running already! We can't just resize their stacks or shift their
existing TLS blocks, because that would invalidate pointers held by other threads.
So now we need Dynamic TLS.
The dynamic linker maintains a "generation counter." Every time a new library with TLS is loaded, the generation increments.
The linker allocates space for the new TLS data on the heap and records the new module in its global bookkeeping. Each thread keeps a private directory of TLS blocks, the Dynamic Thread Vector (DTV), and existing threads' DTVs are now "out of date." The next time a thread accesses a TLS variable from that new library, it goes through a helper, __tls_get_addr. The helper notices the generation mismatch, allocates the block for that specific thread, and updates the thread's private DTV.
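A heavily hedged sketch of the idea: the names, the fixed module count, and the use of calloc and _Thread_local are stand-ins for glibc internals, which reach the per-thread DTV through fs instead:

#include <stddef.h>
#include <stdlib.h>

enum { MAX_MODULES = 64 };       /* real DTVs grow dynamically */

typedef struct {
    size_t generation;           /* last generation this thread saw */
    void  *blocks[MAX_MODULES];  /* per-module TLS, filled lazily */
} dtv_t;

static size_t global_generation;   /* bumped when dlopen loads TLS */
static _Thread_local dtv_t my_dtv; /* per-thread, via fs in reality */

void *tls_get_addr_sketch(size_t module, size_t size, size_t offset) {
    if (my_dtv.generation != global_generation) {
        /* A library with TLS was dlopen'ed since this thread last
           looked; a real implementation grows the DTV here. */
        my_dtv.generation = global_generation;
    }
    if (my_dtv.blocks[module] == NULL) {
        /* First access from this thread: allocate this module's
           block now, for this thread only. */
        my_dtv.blocks[module] = calloc(1, size);
    }
    return (char *)my_dtv.blocks[module] + offset;
}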
And pthread_create? It needs to know about all of this. It needs to ask the dynamic linker: "How
much Static TLS do I need to allocate for this new thread? And how much 'Surplus' space should I
leave just in case someone calls dlopen later?"
This creates a circular dependency that is terrifying to behold. The threading library (part of
libc) needs to allocate TLS for new threads, but it doesn't know about the loaded modules or the
DTV. So, libc calls back into private functions inside the dynamic linker (like
_dl_allocate_tls) to do the heavy lifting.
If you are writing a custom dynamic linker, you can't just load libc; you have to expose an undocumented interface to libc so it can do its job. At that point, you might as well write a libc too.
IFUNCs: Code Running During Relocation
libc is rightfully obsessed with performance. To that end it ships multiple implementations of memcpy: one for AVX-512, one for AVX2, one for SSE4, a generic fallback, and others.
When you look up the symbol memcpy in the symbol table, you don't get the address of the function, but the address of a Resolver Function.
The loader is supposed to run this resolver function. The resolver checks the CPU features (CPUID)
and returns the address of the actual memcpy to use. The loader then writes that address into the
GOT.
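You can try this mechanism yourself with the GNU ifunc attribute (gcc/clang on Linux). my_memcpy and its two variants here are made up, but the resolver genuinely runs during relocation:

#include <stddef.h>
#include <string.h>

static void *memcpy_generic(void *d, const void *s, size_t n) {
    char *dp = d;
    const char *sp = s;
    while (n--) *dp++ = *sp++;
    return d;
}

static void *memcpy_avx2ish(void *d, const void *s, size_t n) {
    return memcpy(d, s, n);  /* stand-in for a hand-tuned AVX2 version */
}

/* The resolver: runs during relocation, before main and before
   constructors, so it must not depend on anything unrelocated. */
static void *(*resolve_my_memcpy(void))(void *, const void *, size_t) {
    __builtin_cpu_init();  /* required before __builtin_cpu_supports here */
    return __builtin_cpu_supports("avx2") ? memcpy_avx2ish
                                          : memcpy_generic;
}

void *my_memcpy(void *d, const void *s, size_t n)
    __attribute__((ifunc("resolve_my_memcpy")));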
Think about the implications of this. The loader must execute arbitrary code from the library during the relocation phase.
This creates a massive "chicken and egg" problem. If the resolver function tries to call malloc,
and malloc hasn't been relocated yet, the process crashes. If it calls a function in another
library that hasn't been loaded, it crashes.
Because of this tight coupling, the Dynamic Linker and libc are usually built from the same source
tree (e.g., glibc). They share internal implementation details that are not part of the public
ABI. If you try to replace one without the other, you are in for a world of pain.
All of this happens before the program has started. Before main. Before constructors. The resolver runs in the same half-relocated world as everything else at this stage, and one touch of an unrelocated global or an unbound function brings the whole house of cards down.
Conclusion
We started this series by asking: "What happens before main?"
In the static case (our loader from Part 3), the answer is: "The kernel maps the file, sets up the stack, and we jump."
In the dynamic case, the answer is: "We map the linker, which bootstraps itself, allocates thread control blocks, negotiates with the kernel to set segment registers, resolves hardware-specific function implementations by running arbitrary code, manages a generational garbage collector for thread-local storage, runs constructors, and then jumps."
It is a miracle it works at all.
This complexity is likely why many modern languages - like Rust, Go and Zig - default to static compilation.
Personally, I prefer static compilation. It is simple. With virtual memory, we don't even need the
code to be relocatable (executables can just assume a fixed address). And we have the space. In
modern user-space programs, assets (images, audio, UI resources) take up way more space than the
code itself. Saving a few megabytes by sharing libc often isn't worth the architectural nightmare
required to support it.