SIMD Registers (XMM / YMM / ZMM)

x86-64 has a layered SIMD register file that grew across three instruction-set extensions: SSE (Streaming SIMD Extensions), AVX (Advanced Vector Extensions), and AVX-512.

Register hierarchy

Extension	Width	Register names	Count
SSE (baseline on x86-64)	128 bits	`xmm0`–`xmm15`	16
AVX	256 bits	`ymm0`–`ymm15`	16
AVX-512	512 bits	`zmm0`–`zmm31`	32

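Each wider register aliases the narrower ones: xmm3 is the low 128 bits of ymm3, which in turn is the low 256 bits of zmm3. On AVX-512 hardware in 64-bit mode the XMM and YMM names also extend to xmm16–xmm31 and ymm16–ymm31, reachable only through EVEX-encoded instructions.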

Writing xmm3 with a legacy SSE (non-VEX-encoded) instruction leaves bits 128 and above of ymm3/zmm3 unchanged. Writing xmm3 with a VEX- or EVEX-encoded instruction zeroes every bit above 127, up to the full register width the CPU implements.
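The two write behaviours can be sketched with a toy software model. This is only an illustration, not a description of how hardware is built: the `zmm_model` struct and both function names are hypothetical, and the register is modelled as a plain 64-byte array.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 512-bit register: bytes[0..15] play the role
   of xmm, bytes[0..31] of ymm, and bytes[0..63] of zmm. */
typedef struct {
    uint8_t bytes[64];
} zmm_model;

/* Legacy (non-VEX) SSE write: only the low 128 bits change. */
void write_xmm_legacy(zmm_model *reg, const uint8_t value[16]) {
    memcpy(reg->bytes, value, 16);
}

/* VEX/EVEX-encoded 128-bit write: the low 128 bits are replaced and
   every bit above them is cleared. */
void write_xmm_vex(zmm_model *reg, const uint8_t value[16]) {
    memcpy(reg->bytes, value, 16);
    memset(reg->bytes + 16, 0, sizeof reg->bytes - 16);
}
```

Starting from a register filled with 0xAA, `write_xmm_legacy` leaves bytes 16 to 63 at 0xAA, whereas `write_xmm_vex` leaves them at zero.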

Lane interpretation

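A single XMM register can hold any of the 128-bit types below. `__m256` is listed for comparison; it needs a YMM register.

Type	Contents
`__m128i`	16 × i8, 8 × i16, 4 × i32, 2 × i64
`__m128`	4 × float (single-precision)
`__m128d`	2 × double (double-precision)
`__m256`	8 × float (256 bits, YMM register)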
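The register itself carries no type; the instruction decides how the 128 bits are split into lanes. A minimal sketch of this, using SSE2 intrinsics from `<emmintrin.h>`: the two function names are made up for illustration, and the code assumes an x86-64 target, where SSE2 is always available.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Treat both inputs as 16 x 8-bit lanes (PADDB): a carry never leaves
   its byte. */
void add_as_16x_i8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* Treat the same bytes as 4 x 32-bit lanes (PADDD): a carry propagates
   across the four bytes of each lane. */
void add_as_4x_i32(const uint8_t a[16], const uint8_t b[16], uint8_t out[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

With a[0] = 0xFF and b[0] = 0x01 (all other bytes zero), the byte-lane version wraps byte 0 to 0x00 and leaves byte 1 at zero, while the 32-bit-lane version carries into byte 1.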

Calling conventions

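System V AMD64: xmm0–xmm7 pass floating-point and vector arguments; xmm0–xmm1 hold floating-point return values. All of xmm0–xmm15 are caller-saved; no vector register is preserved across a call.
Microsoft x64: xmm0–xmm3 pass the first four floating-point arguments, assigned by argument position (the slots shared with rcx, rdx, r8, r9); xmm0 holds floating-point return values. `__m128` arguments are passed by pointer unless `__vectorcall` is used. xmm6–xmm15 are callee-saved; xmm0–xmm5 are caller-saved.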
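The practical difference shows up as soon as integer and floating-point arguments are mixed. The sketch below uses a hypothetical function, `mix`, purely to annotate where each argument would be expected under the two conventions; the register assignments in the comment follow the rules above and are not something the C code itself can observe.

```c
/* Hypothetical function used only to illustrate register assignment.
 *
 * System V AMD64 (integer and floating-point arguments counted separately):
 *   a -> xmm0, n -> edi, b -> xmm1, result -> xmm0
 *
 * Microsoft x64 (one slot per argument position):
 *   a -> xmm0, n -> edx, b -> xmm2, result -> xmm0
 */
double mix(double a, int n, double b) {
    return a * n + b;
}
```

When reading a disassembly, a second floating-point argument arriving in xmm2 rather than xmm1 is therefore a hint that the binary follows the Microsoft convention.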

Reverse-engineering notes

Auto-vectorised loops emit MOVDQU, PCMPEQB, PAND, PADDW etc. on xmm registers. Decompilers often fail to reconstruct the high-level loop; reading the raw assembly is usually faster.
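MOVAPS / MOVAPD require 16-byte-aligned memory operands; MOVUPS / MOVUPD do not. A misaligned MOVAPS raises a general-protection fault (#GP), which the OS typically reports as a segmentation fault or access violation. In practice the most common cause is a stack that is not 16-byte aligned.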
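xmm0 carries float and double return values, so a function whose result the caller reads from xmm0 most likely returns a floating-point or vector type. Treat it as a hint rather than proof: under System V, small structs made up of floating-point fields are also returned in xmm0 (and xmm1).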
vzeroupper / vzeroall appear before calling non-AVX code to avoid AVX–SSE transition penalties; their presence signals that the surrounding code uses YMM or ZMM registers.
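Returning to the MOVAPS / MOVUPS note above, the alignment requirement can be sketched in C with SSE intrinsics. This is a minimal illustration under stated assumptions: `is_aligned16` and `sum4` are hypothetical helper names, and although `_mm_load_ps` and `_mm_loadu_ps` typically compile to MOVAPS and MOVUPS respectively, the compiler is free to choose differently.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Returns non-zero when p is a multiple of 16, i.e. safe for MOVAPS. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Sums four consecutive floats, using the aligned load only when the
   address allows it and the unaligned load otherwise. */
float sum4(const float *p) {
    __m128 v = is_aligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Code that calls the aligned load unconditionally on an address such as `buf + 1` (4 bytes past a 16-byte boundary) is the pattern that produces the fault described in the note.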