Linux syscall ABI

Tue 05 July 2011 by jj

A quick post to summarize the Linux kernel syscall ABI on the x86 (i386) and x86-64 architectures.

It is hard to come by a short summary of how to do direct syscalls under the Linux kernel. This is not intended to be exhaustive or authoritative, but at least now I'll have something to bookmark :)

Linux kernel - X86

int 80h

The kernel ABI allows up to 6 arguments, all passed in registers:

value       storage
syscall nr  eax
arg 1       ebx
arg 2       ecx
arg 3       edx
arg 4       esi
arg 5       edi
arg 6       ebp

After the syscall, the return value is stored in eax, and execution continues after the int 80h instruction. All other register values are preserved.

Ex:

mov eax, __NR_write
mov ebx, 1
mov ecx, string_label
mov edx, string_length
int 80h
ret
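
For reference, the same call can be sketched from C with GCC-style inline assembly (which uses AT&T syntax, so int 80h is spelled `int $0x80`). This is only a minimal sketch under a few assumptions: the names `int80_write` and `int80_demo` are made up for this example, and when the code is not compiled for i386 it falls back to the libc `syscall()` wrapper, so only a 32-bit build actually executes int 80h.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: write(2) issued through int 80h on i386.
 * On any other target it falls back to libc's syscall(), which returns
 * -1 and sets errno instead of returning a negative errno value. */
long int80_write(int fd, const void *buf, size_t len)
{
#if defined(__i386__)
    long ret;
    __asm__ volatile ("int $0x80"
                      : "=a"(ret)               /* return value in eax */
                      : "a"((long)SYS_write),   /* syscall nr in eax   */
                        "b"((long)fd),          /* arg 1 in ebx        */
                        "c"(buf),               /* arg 2 in ecx        */
                        "d"(len)                /* arg 3 in edx        */
                      : "memory");
    return ret;
#else
    return syscall(SYS_write, fd, buf, len);
#endif
}

/* Hypothetical self-check: pushes 5 bytes through a pipe and reads them
 * back. Returns 0 on success, -1 otherwise. */
int int80_demo(void)
{
    int fds[2];
    char back[8] = {0};
    long written, got;

    if (pipe(fds) != 0)
        return -1;
    written = int80_write(fds[1], "hello", 5);
    /* Only read if the write went through, so we never block on an empty pipe. */
    got = (written == 5) ? (long)read(fds[0], back, sizeof back) : -1;
    close(fds[0]);
    close(fds[1]);
    assert(got <= (long)sizeof back);
    return (got == 5 && memcmp(back, "hello", 5) == 0) ? 0 : -1;
}
```

On the int 80h path a failed syscall comes back as a small negative number in eax (the negated errno), which is why the helper returns a signed `long`.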

sysenter

This instruction comes from Intel. AMD CPUs also accept it, but only in legacy 32-bit mode (not in long mode, i.e. under a 64-bit kernel).

value       storage
syscall nr  eax
arg 1       ebx
arg 2       ecx
arg 3       edx
arg 4       esi
arg 5       edi
arg 6       dword ptr [ebp]

Due to the CPU design, execution resumes after the syscall at a fixed address, which under Linux is set at boot to be somewhere in the vdso.

The kernel restores esp to the value ebp had during sysenter, and jumps to the following code:

pop ebp
pop edx
pop ecx
ret

This means that after the syscall, the situation is:

final value  values at sysenter
eax          syscall return value
eip          dword ptr [ebp+12]
ecx          dword ptr [ebp+8]
edx          dword ptr [ebp+4]
ebp          dword ptr [ebp]
esp          ebp+16
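
To double-check the table above, the return sequence can be replayed on a fake stack in plain C. This is only a model, not kernel or vdso code: the struct and function names are invented for this sketch, and the stack is assumed to be an array of 32-bit words whose first element sits at address `base`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the user-visible registers after a sysenter syscall. */
struct regs_after {
    uint32_t ebp, edx, ecx, eip, esp;
};

/* Reads the dword at address addr from a stack modeled as an array of
 * 32-bit words whose first element lives at address base. */
static uint32_t load32(const uint32_t *stack, uint32_t base, uint32_t addr)
{
    assert(addr >= base && (addr - base) % 4 == 0);
    return stack[(addr - base) / 4];
}

/* Replays "esp = ebp; pop ebp; pop edx; pop ecx; ret". */
struct regs_after sysenter_return(const uint32_t *stack, uint32_t base,
                                  uint32_t ebp_at_sysenter)
{
    struct regs_after r;
    uint32_t esp = ebp_at_sysenter;               /* kernel restores esp from ebp */

    r.ebp = load32(stack, base, esp); esp += 4;   /* pop ebp */
    r.edx = load32(stack, base, esp); esp += 4;   /* pop edx */
    r.ecx = load32(stack, base, esp); esp += 4;   /* pop ecx */
    r.eip = load32(stack, base, esp); esp += 4;   /* ret     */
    r.esp = esp;
    return r;
}
```

Feeding it a four-word frame with ebp pointing at the first word gives eip the fourth word and leaves esp 16 bytes above the original ebp, which matches the table.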

Ex:

mov eax, __NR_write
mov ebx, 1
mov ecx, string_label
mov edx, string_length

push syscall_ret
sub esp, 12
mov ebp, esp
sysenter

ud2

syscall_ret:
ret

I'm not sure how all this would work in the event of a sys_restart.

Also note that ebp must point to valid memory, even if the syscall neither returns nor uses stack arguments (e.g. __NR_exit).

syscall

In 32-bit mode, this instruction is specific to AMD CPUs.

I was not able to test this one, which I believe to be similar to sysenter, except that syscall saves its return address, so the kernel resumes execution right after the syscall instruction instead of the fixed vdso address.

gs:[10h]

The correct way to make a syscall under Linux is to use the vdso trampoline, which the kernel initializes with the correct opcode sequence for your CPU.

The ABI is the same as the int 80h one.

Note that the glibc loader is responsible for setting up gs:10h; the kernel will *not* do that on its own. The dynamic loader ld-linux.so uses set_thread_area() to set up the thread area that gs points to, and stores at offset 10h the address of the vdso syscall entry point (AT_SYSINFO), which it finds in the auxiliary vector passed on the stack at the process entrypoint.

Ex:

mov eax, __NR_write
mov ebx, 1
mov ecx, string_label
mov edx, string_length
call gs:[10h]
ret
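
The auxiliary vector entries involved can be inspected from C. The sketch below assumes glibc 2.16 or later for `getauxval()`; the two function names are made up for the example. As far as I know, AT_SYSINFO is only supplied to 32-bit x86 processes, so expect 0 for it elsewhere.

```c
#include <assert.h>
#include <elf.h>        /* AT_SYSINFO, AT_SYSINFO_EHDR, ELFMAG, SELFMAG */
#include <string.h>
#include <sys/auxv.h>   /* getauxval(), glibc 2.16 or later */

/* Address of the vdso syscall entry point, or 0 if the kernel did not
 * supply AT_SYSINFO to this process. */
unsigned long vdso_syscall_entry(void)
{
    return getauxval(AT_SYSINFO);
}

/* Base address of the vdso ELF image, or 0 if there is none. */
unsigned long vdso_base(void)
{
    unsigned long base = getauxval(AT_SYSINFO_EHDR);

    /* When present, the mapping starts with a regular ELF header. */
    assert(base == 0 || memcmp((const void *)base, ELFMAG, SELFMAG) == 0);
    return base;
}
```

On i386 the value returned by `vdso_syscall_entry()` is what the loader ends up storing at gs:[10h]; `vdso_base()` is the start of the vdso image itself, which is a different address.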

Linux kernel - x64 (aka X86_64, amd64)

syscall

Both AMD and Intel use the syscall instruction.

value       storage
syscall nr  rax
arg 1       rdi
arg 2       rsi
arg 3       rdx
arg 4       r10
arg 5       r8
arg 6       r9

Execution resumes after the syscall instruction, with the return value in the rax register.

The values of rcx and r11 are not preserved across the syscall (the instruction itself saves the return address in rcx and the flags in r11); all other registers are.

Ex:

mov rax, __NR_write
mov rdi, 1
mov rsi, string_label
mov rdx, string_length
syscall
ret
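
A rough C equivalent, using GCC-style inline assembly, shows where all six arguments go (arg 4 in r10, arg 5 in r8, arg 6 in r9) and how rcx and r11 are declared as clobbered. It is a sketch, not a definitive implementation: `raw_syscall6` and `raw_syscall_demo` are names invented here, a 4096-byte page is assumed, and on anything other than 64-bit x86 the code simply falls back to the libc `syscall()` wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical 6-argument raw syscall wrapper.
 * The raw path returns -errno on failure; the fallback path returns -1. */
long raw_syscall6(long nr, long a1, long a2, long a3,
                  long a4, long a5, long a6)
{
#if defined(__x86_64__) && defined(__LP64__)
    long ret;
    register long r10 __asm__("r10") = a4;   /* arg 4 */
    register long r8  __asm__("r8")  = a5;   /* arg 5 */
    register long r9  __asm__("r9")  = a6;   /* arg 6 */

    __asm__ volatile ("syscall"
                      : "=a"(ret)                          /* result in rax */
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3),
                        "r"(r10), "r"(r8), "r"(r9)
                      : "rcx", "r11", "memory");           /* not preserved */
    return ret;
#else
    return syscall(nr, a1, a2, a3, a4, a5, a6);
#endif
}

/* Hypothetical self-check: maps one anonymous page (a 6-argument
 * syscall), touches it and unmaps it. Returns 0 on success. */
int raw_syscall_demo(void)
{
    long p = raw_syscall6(SYS_mmap, 0, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Failures land in [-4095, -1] on both paths. */
    if (p >= -4095 && p <= -1)
        return -1;
    *(volatile char *)p = 42;
    assert(*(volatile char *)p == 42);
    return raw_syscall6(SYS_munmap, p, 4096, 0, 0, 0, 0) == 0 ? 0 : -1;
}
```

mmap is a convenient test because it uses all six arguments: if r8 and r9 were swapped, the offset would be -1 and the call would fail.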

int 80h

On kernels compiled with the 'CONFIG_IA32_EMULATION' feature, X64 code can call legacy 32-bit syscalls using int 80h.

The ABI is the same as for x86 (arg 1 in ebx, ...)

Note that this mode cannot reference memory above 0xffffffff, and that the syscall number stored in eax is the X86 one.

sysenter

It is possible to use the sysenter instruction in X64 binaries, but I don't know the ABI here. The process seems to segfault every time, and I did not investigate further.

To be continued !