The Architectural Bottleneck of Memory-Bound Inference
The transition of Large Language Models (LLMs) from the training environment to production deployment exposes a fundamental dichotomy in modern computational architecture. While the training phase of a transformer model is inherently compute-bound — characterized by highly parallelized, dense matrix multiplications that efficiently saturate the ALUs and Tensor Cores of a GPU — the autoregressive inference process presents an entirely different systemic profile. In production serving, LLM inference is fundamentally memory-bound, governed strictly by the bandwidth of the memory bus rather than the raw FLOPS the silicon can execute.
To comprehend this bottleneck, one must dissect the anatomy of an inference request, which operates across two distinct and mechanically disparate phases: the prefill phase and the decode phase. During the prefill phase, the inference engine ingests the entirety of the user's prompt simultaneously — this phase is heavily compute-bound, as the system calculates the initial hidden states and computes the attention matrices for the full input sequence in parallel. The primary latency metric associated with the prefill phase is Time-To-First-Token (TTFT).
However, once the first token is emitted, the engine transitions into the decode phase, which generates all subsequent tokens sequentially, one step at a time. At each iteration, the engine must load the entire model weight matrix from global VRAM into the compute cores, alongside the historical context of all previously generated tokens. This historical context is maintained in a dynamic tensor known as the Key-Value (KV) cache. While the KV cache resolves the mathematical redundancy of autoregressive generation, it introduces a catastrophic systems architecture challenge: the tensor grows dynamically and unpredictably with every single iteration.
In traditional deep learning frameworks, execution kernels demand that tensors be stored in contiguous memory to facilitate the vectorized memory accesses expected by standard CUDA implementations. This strict contiguity requirement, colliding with the unpredictable length of generated output sequences, creates severe inefficiencies in VRAM management — ultimately bottlenecking the number of concurrent requests a GPU can handle and stranding computational capacity behind a memory wall.
The Pathology of KV Cache Memory Waste
Prior to the introduction of advanced virtualization techniques in GPU memory management, legacy LLM serving engines such as FasterTransformer and HuggingFace TGI suffered from phenomenally poor memory utilization. Empirical profiling of these early inference servers revealed a startling metric: only 20.4% to 38.2% of the allocated KV cache memory was actively utilized to store actual token states. The remaining 61.8% to 79.6% of precious GPU VRAM was entirely wasted, locked away by suboptimal, contiguous allocation strategies.
This monumental degree of resource waste is not a singular phenomenon but rather a combination of three distinct pathological memory inefficiencies: internal fragmentation, reservation waste, and external fragmentation.
Internal Fragmentation
Internal fragmentation constitutes the largest and most pervasive source of memory waste. The root cause lies in the unpredictability of the generative process. Because the final output length is strictly unknown at the moment a request is admitted, the inference engine is forced into a defensive posture — pre-allocating a monolithic block of memory based on the model's maximum permissible context length.
If an inference server receives a prompt and allocates a contiguous buffer sufficient for 2048 tokens, but the model completes the response and emits an End-Of-Sequence (EOS) token after only 40 tokens, the remaining 2008 memory slots are trapped. This over-allocated, unused memory is physically bound to that specific sequence's execution context and inaccessible to the global memory pool, effectively starving the GPU of its own capacity and accounting for up to 80% of the observed memory waste.
Reservation Waste
Reservation waste represents a secondary, corollary inefficiency. Even in theoretical scenarios where a serving engine implements sophisticated heuristics to accurately guess the final sequence length, the memory required for all future tokens must still be reserved in advance. Because the system must preserve the contiguous nature of the memory chunk for the entire lifetime of the request, the space allocated for token 100 cannot be used while the engine is currently generating token 10. This reserved space is effectively idle — forcing the global scheduler into a highly conservative stance, prioritizing the future stability of an ongoing generation over the aggregate throughput of the overall system.
External Fragmentation
External fragmentation is a systemic degradation that occurs over time as a byproduct of varying sequence lengths. As multiple concurrent requests of differing sizes complete their lifecycles and release their contiguous memory blocks, the available VRAM becomes checkerboarded — characterized by numerous small, non-contiguous gaps of free memory interspersed between active allocations. Even if the total cumulative volume of free VRAM is theoretically sufficient to host a new large incoming request, the allocator will fail to serve the request if it cannot locate a single, unbroken, continuous block.
Mathematical Derivation of VRAM Consumption
The exact byte size of the KV cache per token is a deterministic function of the model's architectural dimensions and execution precision. The governing equation is:
$$\text{Memory\_per\_token} = 2 \times N_{\text{layers}} \times (N_{\text{heads}} \times D_{\text{head}}) \times \text{Bytes}_{\text{precision}}$$

The initial multiplier of 2 accounts for storing both the Key and Value vectors independently. $N_{\text{layers}}$ is the total number of transformer blocks, $N_{\text{heads}}$ is the total number of attention heads, $D_{\text{head}}$ is the internal dimensionality of each attention head, and $\text{Bytes}_{\text{precision}}$ is determined by the numerical data type — FP16/BF16 yields 2 bytes, INT8/FP8 yields 1 byte, and INT4 yields 0.5 bytes.
Applying this derivation to the OPT-13B model — featuring 40 transformer layers, 40 attention heads, and a head dimension of 128 — executing in standard FP16 precision:
$$\text{Memory\_per\_token} = 2 \times 40 \times (40 \times 128) \times 2 = 819{,}200 \text{ bytes} \approx 0.8 \text{ MB}$$

While 0.8 MB per token appears trivial in isolation, it scales brutally under naive contiguous allocation. Pre-allocating a contiguous block to support 2048 tokens instantly spikes to ~1.6 GB per individual request. Attempting to batch 16 concurrent requests consumes 25.6 GB of VRAM solely for the KV cache, independent of model weights. If actual generated sequences average only 100 tokens, 1.52 GB per request — or 24.3 GB total — is entirely lost to fragmentation.
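The arithmetic above is easy to sanity-check in a few lines of plain Python; a minimal sketch using the OPT-13B dimensions from the derivation (function and variable names are illustrative):

```python
def kv_bytes_per_token(n_layers: int, n_heads: int, d_head: int,
                       bytes_per_elem: float) -> float:
    """2x for storing K and V separately, summed across all layers and heads."""
    return 2 * n_layers * (n_heads * d_head) * bytes_per_elem

# OPT-13B in FP16: 40 layers, 40 heads, head dimension 128, 2 bytes/element
per_token = kv_bytes_per_token(40, 40, 128, 2)   # 819,200 bytes
per_request = per_token * 2048                   # contiguous max-length reservation
batch_16 = per_request * 16                      # KV cache for a 16-request batch
wasted_16 = (2048 - 100) * per_token * 16        # trapped if outputs average 100 tokens
```

Swapping in a quantized precision (`bytes_per_elem=1` for FP8, `0.5` for INT4) immediately shows how quantization shrinks the per-token footprint discussed later in the document.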
| Waste Category | Root Architectural Cause | Impact on Concurrency & Hardware |
|---|---|---|
| Internal Fragmentation | Pre-allocating contiguous blocks to the maximum sequence boundary | Consumes and traps up to 60–80% of total allocated VRAM for unused tokens. |
| Reservation Waste | Reserving contiguous bounds for anticipated future token generation | Idles usable VRAM that could otherwise be assigned to parallel prefill phases. |
| External Fragmentation | Releasing contiguous blocks of highly variable, unpredictable lengths | Leaves checkerboard gaps in VRAM, preventing large block allocations and forcing queues. |
Virtual Memory Paradigms Applied to GPU VRAM: The PagedAttention Architecture
To resolve the strict, debilitating dependency on contiguous memory allocation, systems architects sought inspiration from foundational operating system designs — specifically, the concept of virtual memory and demand paging. In standard operating systems like Linux or Windows, virtual memory decouples the continuous, linear address space expected by an executing software process from the actual, physical address space of the hardware RAM. This abstraction allows the OS to partition physical memory into small, fixed-size units known as page frames, loading only the actively required pages into physical memory while keeping the logical view contiguous for the application.
The PagedAttention architecture imports this exact OS paradigm directly into the GPU's execution environment. Under the PagedAttention framework, the KV cache for a given sequence is forcibly decoupled into two distinct entities: a logical KV cache and a physical KV cache.
The logical KV cache is presented to the autoregressive generation loop as an entirely contiguous sequence — perfectly mirroring traditional tensor layouts and satisfying the basic mathematical requirements of the transformer architecture. Beneath this abstraction, however, the physical KV cache is heavily partitioned into small, fixed-size blocks, each containing the attention keys and values for a specific, constant number of tokens. In this architectural analogy, physical KV blocks map to OS memory pages, individual tokens map to bytes, and concurrent sequences map to OS processes.
Because the physical blocks do not need to reside contiguously within global VRAM, external fragmentation is completely eradicated. The memory allocator is free to assign any available physical block from its global pool to an expanding sequence. Since blocks are allocated strictly on demand — a new block is drawn from the pool only when the previous block is entirely filled — reservation waste all but disappears. Internal fragmentation is reduced to its mathematical minimum: it is confined to the unfilled positions of the very last physical block of each sequence.
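The on-demand discipline can be sketched in a few lines. This is a toy allocator, not vLLM's actual BlockSpaceManager; all names are illustrative:

```python
class PagedAllocator:
    """Toy paged KV allocator: physical blocks are drawn on demand from a pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # global physical pool
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids
        self.lengths: dict[int, int] = {}             # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:    # last block full: draw a new one
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def waste_tokens(self, seq_id: int) -> int:
        """Internal fragmentation: unfilled slots in the final block only."""
        return len(self.block_tables[seq_id]) * self.block_size - self.lengths[seq_id]


alloc = PagedAllocator(num_blocks=64, block_size=16)
for _ in range(40):            # the 40-token completion from the earlier example
    alloc.append_token(seq_id=0)
# 40 tokens occupy ceil(40 / 16) = 3 blocks; only 8 slots are idle, not 2008
```

Contrast this with the contiguous scheme discussed earlier: the same 40-token completion under max-length pre-allocation strands 2008 slots, while here the waste is bounded by `block_size - 1` tokens per sequence.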
Block Table Address Translation
The linchpin of the PagedAttention architecture is the Block Table, a heavily optimized, low-level data structure responsible for translating logical blocks to physical blocks on the fly. Much like an OS page table utilizes a Translation Lookaside Buffer (TLB), the Block Table maintains a highly efficient, per-sequence mapping that bridges the virtual and physical domains.
When the execution engine evaluates the Query vector for a newly generated token, it must calculate attention scores against all previous tokens. The engine iterates linearly through the logical blocks of the sequence. For each logical block, the execution thread uses the Block Table to perform an immediate lookup, retrieving the physical block address residing in global VRAM. The engine then fetches the Key vectors from that physical address, computes the attention scores via the standard dot product ($Q \cdot K^T$), and loads the corresponding Value vectors to multiply and accumulate. The intermediate results across all physically distributed blocks are normalized via a softmax function to produce the final, cohesive attention output.
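The translate-then-fetch loop can be illustrated outside CUDA. The scalar Python sketch below (single head, unbatched, hand-picked toy vectors) mirrors the logical walk through the Block Table, with physical block ids deliberately non-adjacent:

```python
import math

# Physical KV store: physical_block_id -> list of (key, value) pairs
physical_kv = {
    7: [([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]),
        ([0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])],
    3: [([0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0])],
}
block_table = [7, 3]  # logical block 0 -> physical 7, logical block 1 -> physical 3

def paged_attention(query, block_table, physical_kv):
    scores, values = [], []
    for physical_id in block_table:                  # linear walk over logical blocks
        for key, value in physical_kv[physical_id]:  # fetch via address translation
            scores.append(sum(q * k for q, k in zip(query, key)))  # Q . K^T
            values.append(value)
    weights = [math.exp(s) for s in scores]          # softmax across all blocks
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(query))]

out = paged_attention([1.0, 0.0, 0.0, 0.0], block_table, physical_kv)
# the output weights every stored Value by its softmax-normalized score,
# even though the underlying blocks live at scattered physical addresses
```

The real kernel performs the same traversal with vectorized loads and fused accumulation; the point of the sketch is that correctness depends only on the Block Table ordering, not on physical adjacency.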
The Mechanics of Block Sizing and L1/L2 Cache Utilization
The determination of the exact number of tokens stored within a single physical KV block — referred to as the block size — is a critical optimization vector that dictates hardware-level cache efficiency and memory bandwidth saturation. If the system configuration sets the block size to 16, and the model's head dimension is 128, a single block for one specific head will store exactly $16 \times 128 = 2048$ floating-point elements.
Counter-intuitively, allocating exceptionally large block sizes (e.g., 64 or 128 tokens per block) yields significantly degraded performance metrics during the decode phase. Smaller block sizes, typically configured at 16 or 32 tokens, maintain substantially higher memory bandwidth utilization because they exhibit vastly superior hit rates within the highly constrained L1 cache of the Streaming Multiprocessor (SM). While smaller blocks invariably introduce minor computational overhead on the CPU side due to increased dynamic block allocations and necessary Block Table updates, the massive reduction in slow global VRAM round-trips heavily offsets this latency.
Low-Level CUDA Implementation and Execution Hierarchy
Executing attention over physically fragmented memory blocks inherently breaks compatibility with standard, highly optimized dense attention kernels. Legacy implementations, such as the baseline FlashAttention kernels, fundamentally mandate that the input tensors reside in contiguous global memory to leverage highly optimized block-tiling and SRAM caching. To achieve high performance without this contiguity, serving frameworks like vLLM implement fully customized paged attention kernels written directly in low-level CUDA (attention_kernels.cu).
The architecture of this custom kernel maps the complex logical flow of PagedAttention directly onto the physical execution hierarchy of the NVIDIA GPU — specifically distributing work across thread blocks, warps, and individual threads. In CUDA architecture, a warp is a synchronous execution unit comprising exactly 32 threads (WARP_SIZE) that execute identical instructions simultaneously in a SIMT fashion. A thread block is a larger grouping of threads that are scheduled on the same Streaming Multiprocessor and share ultra-fast, low-latency shared memory.
Within the highly optimized PagedAttention kernel, each thread block is assigned the calculation for one specific attention head of a specific sequence. The warps within that thread block are then systematically distributed across the physically scattered KV blocks. For instance, if a sequence context spans 6 physical blocks and the thread block utilizes 4 warps, the kernel's internal scheduler will interleave the workload to maximize parallel memory reads: Warp 0 computes attention over the 0th and 4th physical blocks, Warp 1 handles the 1st and 5th blocks, Warp 2 computes the 2nd block, and Warp 3 handles the 3rd block.
```cuda
// Each thread block → one attention head of one sequence
// Warps within the thread block → distributed over the sequence's KV blocks
// grid_dim  = (num_seqs, num_heads)
// block_dim = NUM_THREADS                   // typically 128 or 256

// Inside the kernel: interleaved warp assignment over physical KV blocks
int warp_id   = threadIdx.x / WARP_SIZE;    // 0 … NUM_WARPS-1
int num_warps = NUM_THREADS / WARP_SIZE;

for (int b = warp_id; b < num_kv_blocks; b += num_warps) {   // interleaved stride
    int physical_block_id = block_table[seq_id][b];          // logical → physical
    const float* k_block = kv_cache + physical_block_id * block_stride_k;
    const float* v_block = kv_cache + physical_block_id * block_stride_v;
    // dot(Q, K^T) against k_block, accumulate softmax numerator with v_block …
}
```
This warp-level parallelization over disjointed memory locations ensures that memory coalescing is maximized within the strict constraints of fragmented physical pages. When threads within a warp read data from slow global VRAM into fast shared memory, their access patterns are meticulously structured to align with the 128-byte cache line boundaries of the GPU architecture — ensuring the kernel maintains near-theoretical limits on memory bandwidth utilization, successfully simulating the speed of dense matrix operations despite the lack of macroscopic tensor contiguity.
Bypassing Kernel Complexity: The vAttention Paradigm
While PagedAttention effectively eradicates memory fragmentation, it imposes immense software complexity and engineering overhead. Because physical memory is non-contiguous, developers cannot easily integrate new, state-of-the-art dense attention kernels without painstakingly rewriting them to parse the proprietary Block Tables.
The vAttention framework proposes a novel, competing architectural paradigm that achieves the exact same reduction in memory fragmentation without abandoning fundamental memory contiguity. Instead of managing physical pages at the user-space software level using Block Tables, vAttention leverages the low-level Virtual Memory Management (VMM) APIs built directly into recent CUDA versions: cuMemAddressReserve, cuMemCreate, cuMemMap, and cuMemSetAccess.
Under the vAttention architecture, the serving framework preemptively reserves a massive, entirely contiguous block of virtual memory for every sequence upon initialization using cuMemAddressReserve. This reservation consumes virtual address space but uses zero physical VRAM. As the sequence grows dynamically during decode, vAttention allocates physical memory pages and maps them to the reserved contiguous virtual address space on-demand using cuMemMap.
To the application layer and to standard, unmodified execution kernels like FlashAttention, the KV cache appears perfectly contiguous in virtual space — satisfying all strict mathematical requirements of optimized deep learning frameworks. Meanwhile, the CUDA driver handles the physical fragmentation, allocating blocks wherever free VRAM exists. By removing CPU overhead from user-level Block Table lookups and permitting the direct use of highly optimized, unmodified FlashAttention kernels, vAttention has demonstrated raw decode throughput improvements of up to 1.99× over standard PagedAttention implementations in specific benchmarks.
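The reserve-then-map flow can be mimicked in user space. The sketch below is a conceptual stand-in for the CUDA driver calls named above, not the CUDA API itself; the page granularity and all names are assumptions for illustration:

```python
class VirtualKVCache:
    """Toy model of vAttention's flow: the virtual range is contiguous from the
    start, but physical pages are attached only on first touch."""

    PAGE_TOKENS = 16  # tokens covered by one physical page (assumed granularity)

    def __init__(self, max_tokens: int, physical_pool: list[int]):
        # analogue of cuMemAddressReserve: address space only, zero physical VRAM
        self.num_virtual_pages = -(-max_tokens // self.PAGE_TOKENS)  # ceil division
        self.page_map: dict[int, int] = {}  # virtual page -> physical page
        self.pool = physical_pool
        self.length = 0

    def append_token(self) -> None:
        page = self.length // self.PAGE_TOKENS
        if page not in self.page_map:       # analogue of cuMemCreate + cuMemMap
            self.page_map[page] = self.pool.pop()
        self.length += 1


cache = VirtualKVCache(max_tokens=2048, physical_pool=list(range(1000)))
for _ in range(40):
    cache.append_token()
# 128 virtual pages reserved, yet only 3 physical pages actually mapped
```

The key property the sketch preserves is that a consumer of the cache indexes a single linear range (`0 … length-1`) and never sees the page map, just as an unmodified FlashAttention kernel sees one contiguous virtual tensor.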
Advanced Memory Sharing: Copy-on-Write and Prefix Caching
The introduction of block-level virtualization unlocks highly complex memory management capabilities that were fundamentally impossible with standard contiguous tensors. Chief among these are Copy-on-Write (CoW) semantics and Automatic Prefix Caching (APC), which further compress memory overhead during complex decoding strategies like parallel sampling, beam search, and multi-turn conversational agents.
Reference Counting and Copy-on-Write Mechanisms
When a system executes parallel sampling — generating multiple distinct outputs simultaneously from a single, shared input prompt — traditional serving engines allocate independent, duplicate KV caches for the prompt across every parallel sequence. If a prompt is 1000 tokens long and the user requests 5 parallel completions, traditional systems store 5000 tokens worth of identical prompt data, introducing massive redundant memory duplication.
PagedAttention natively resolves this by implementing physical block sharing. The Block Tables for multiple distinct sequences can all point to the exact same physical blocks for their shared prompt. To ensure memory safety, the underlying BlockSpaceManager implements strict, OS-style reference counting. Each physical block explicitly tracks how many logical sequences currently reference it.
If multiple sequences share a physical block, its reference count is initialized to $N$. When one of the sequences diverges during generation and needs to append a new token to a shared block that still contains empty slots, the memory kernel detects that the reference count is greater than 1. Instead of mutating the shared block and corrupting the state for other sequences, the system triggers a Copy-on-Write mechanism — allocates a new independent physical block from the free_block_queue, copies the existing token data, decrements the original block's reference count, and updates the diverging sequence's Block Table to point to the newly allocated block. This fine-grained, block-level CoW can reduce overall memory overhead by up to 55%, translating directly into a 2.2× improvement in generation throughput for high-branching tasks.
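The reference-counting logic described above can be condensed into a toy manager. This is a simplified sketch of the mechanism, not vLLM's actual BlockSpaceManager; block contents are plain token lists:

```python
class BlockManager:
    """Toy block manager with OS-style reference counting and copy-on-write."""

    def __init__(self, num_blocks: int):
        self.free_block_queue = list(range(num_blocks))
        self.ref_count: dict[int, int] = {}
        self.data: dict[int, list[int]] = {}

    def alloc(self) -> int:
        block_id = self.free_block_queue.pop()
        self.ref_count[block_id] = 1
        self.data[block_id] = []
        return block_id

    def fork(self, block_id: int) -> None:
        self.ref_count[block_id] += 1        # another sequence now shares this block

    def append(self, block_id: int, token: int) -> int:
        """Append a token; copy the block first if it is shared (ref count > 1)."""
        if self.ref_count[block_id] > 1:     # copy-on-write trigger
            new_id = self.free_block_queue.pop()
            self.data[new_id] = list(self.data[block_id])   # copy existing tokens
            self.ref_count[new_id] = 1
            self.ref_count[block_id] -= 1
            block_id = new_id
        self.data[block_id].append(token)
        return block_id                      # caller updates its Block Table entry


mgr = BlockManager(num_blocks=8)
shared = mgr.alloc()
mgr.append(shared, 101)              # prompt token; block unshared, mutate in place
mgr.fork(shared)                     # a second sampled sequence maps the same block
diverged = mgr.append(shared, 202)   # divergence triggers copy-on-write
# diverged points at a fresh block; the original still holds just [101]
```

Note that the copy is block-granular: only the one partially filled block is duplicated, while all fully shared prompt blocks remain referenced by both sequences.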
Automatic Prefix Caching vs. Radix Tree Architectures
Beyond local sequence sharing, modern serving architectures extend PagedAttention to global memory domains using Automatic Prefix Caching (APC). A physical KV block can be uniquely and deterministically identified by computing a cryptographic hash of the tokens residing within the block combined with the tokens in the prefix preceding that block. The inference engine maintains a global hash table mapping hash(prefix_tokens + block_tokens) directly to physical block IDs. When a new request enters the scheduler, the engine parses its prompt, computes block-level hashes, and queries the global table. If a cache hit occurs, the scheduler entirely skips the compute-heavy prefill phase for those tokens.
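The hash-keyed lookup can be sketched concretely. This is an illustrative reduction (SHA-256 over token ids, tiny block size, hypothetical helper names), not vLLM's actual hashing scheme:

```python
import hashlib
import itertools

BLOCK = 4  # tokens per physical block (toy value)

def block_hash(prefix: tuple, block_tokens: tuple) -> str:
    """Deterministic block id: hash of every token up to and including this block."""
    payload = ",".join(map(str, prefix + block_tokens))
    return hashlib.sha256(payload.encode()).hexdigest()

prefix_cache: dict[str, int] = {}      # hash -> physical block id
_block_ids = itertools.count()

def prefill(prompt: list) -> tuple:
    """Return (block_table, cache_hits) for the full blocks of a prompt."""
    table, hits = [], 0
    full = len(prompt) - len(prompt) % BLOCK
    for i in range(0, full, BLOCK):
        h = block_hash(tuple(prompt[:i]), tuple(prompt[i:i + BLOCK]))
        if h in prefix_cache:
            hits += 1                  # prefill compute skipped for this block
        else:
            prefix_cache[h] = next(_block_ids)
        table.append(prefix_cache[h])
    return table, hits

system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
t1, hits1 = prefill(system_prompt + [9, 10, 11, 12])
t2, hits2 = prefill(system_prompt + [13, 14, 15, 16])
# hits2 == 2: both full system-prompt blocks are reused across requests
```

Including the preceding prefix in the hash is what makes the scheme position-aware: an identical block of tokens appearing after a different prefix hashes to a different physical block, so only true shared prefixes are deduplicated.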
Alternative frameworks like SGLang propose a different methodology known as RadixAttention. Rather than hashing discrete, fixed-size blocks, RadixAttention structures the entire KV cache prefix-sharing memory pool into an irregular Radix Tree, mapping the conversational flow of requests into a dependency graph. The system counts the total number of requests residing within each sub-tree and applies weighted priority evictions based on tree-shape locality — rendering it exceptionally efficient in highly dynamic, unpredictable scenarios such as Tree-of-Thought reasoning or complex multi-turn agentic workflows where conversational branches diverge and converge asynchronously.
| Feature | vLLM — Automatic Prefix Caching | SGLang — RadixAttention |
|---|---|---|
| Data Structure | Global Hash Table mapping block tokens to physical IDs. | Irregular Radix Tree dependency graph mapping conversational flows. |
| Optimization Philosophy | Static and structured; excels in predictable batch processing and templated prompts. | Dynamic and automatic; excels in unpredictable multi-turn and Tree-of-Thought branching. |
| Eviction Mechanism | Flat Least Recently Used (LRU) queue based on block hashes. | Weighted priority evictions based on sub-tree locality and historical request counts. |
Iteration-Level Scheduling and Continuous Batching
Memory fragmentation optimizations are functionally useless if the scheduler routing the workloads cannot exploit the newly freed memory space. Legacy LLM inference engines utilized static, request-level batching. In a static batch, the GPU processes a group of requests but must wait for the entire batch to complete all their generations before dispatching the next batch from the queue. If one request generates 500 tokens while the others generate 10 tokens, the GPU sits heavily underutilized, blocked by the longest sequence while the completed slots remain empty.
To maximize the dynamic allocation nature of PagedAttention, advanced frameworks implement continuous batching, also known as iteration-level scheduling. The scheduler evaluates the queue at every single decoding step (iteration). When a sequence emits an EOS token, the BlockSpaceManager immediately frees its physical blocks back to the memory pool (free_block_queue). In the very next iteration step, the scheduler detects the newly available VRAM and dynamically pulls a new request from the waiting queue into the active batch, instantly initiating its prefill phase and absorbing latency variance.
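The admission discipline can be simulated in a few lines. The sketch below models each request as a fixed token budget and ignores prefill cost and memory limits; it exists only to show slots being backfilled mid-batch:

```python
from collections import deque

def continuous_batching(requests, max_active: int):
    """Toy iteration-level scheduler. Each request is (request_id, tokens_to_generate);
    a finished sequence frees its slot for the very next iteration."""
    waiting = deque(requests)
    active: dict = {}
    trace = []                            # batch composition at each iteration
    while waiting or active:
        # admit new work the moment capacity exists; never wait for the whole batch
        while waiting and len(active) < max_active:
            request_id, n_tokens = waiting.popleft()
            active[request_id] = n_tokens
        trace.append(sorted(active))
        for request_id in list(active):   # one decode step per active sequence
            active[request_id] -= 1
            if active[request_id] == 0:   # EOS: blocks return to the pool at once
                del active[request_id]
    return trace

trace = continuous_batching([("a", 2), ("b", 5), ("c", 1)], max_active=2)
# "c" is admitted the moment "a" finishes, while "b" keeps decoding uninterrupted
```

Under static request-level batching the same workload would hold "c" in the queue until "b" (the longest sequence) completed, which is exactly the underutilization the paragraph above describes.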
Preemption, CPU Swapping, and Eviction Policies
Because continuous batching pushes GPU memory utilization to its absolute physical limits, the scheduler frequently encounters boundary states where the pool of free physical blocks is completely exhausted during an active decode step. To recover gracefully, the scheduler engages in request preemption. The engine selects a set of lower-priority sequences to evict from the active batch, relying on two distinct eviction policies:
Swapping: The engine leverages a secondary CpuBlockPool to orchestrate a data transfer over the PCIe bus. The physical blocks belonging to the preempted sequences are copied into the host's CPU RAM, freeing the high-speed GPU blocks for the remaining active requests. When the preempted requests are later re-scheduled, the scheduler issues a swap-in command to restore the blocks back to the GPU.
Recomputation: If swapping is disabled or CPU memory is bounded, the scheduler drops the KV blocks of the preempted sequence entirely. When the request is eventually resumed, it is treated as a completely new prompt, forcing the engine to waste compute cycles recomputing the entire sequence up to its preemption point.
| Preemption Mode | Mechanism | Operational Cost | Optimal Use Case |
|---|---|---|---|
| Swapping | Evicts active KV blocks to host CPU RAM via PCIe interconnect. | Latency tied to PCIe bandwidth limits (e.g., PCIe 4.0/5.0 bottlenecks). | Long-context inference where prefill recomputation cost is prohibitive. |
| Recomputation | Discards physical blocks entirely; sequence is re-prefilled upon resumption. | High computational ALU overhead to rebuild the KV cache from scratch. | Short-context prompts where FLOPS are cheaper than PCIe transfer times. |
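The trade-off in the table can be framed as a back-of-the-envelope cost model. Every constant below is an illustrative assumption, including the fixed swap setup cost, which stands in for transfer orchestration overhead:

```python
def cheaper_recovery(num_tokens: int) -> str:
    """Compare preemption-recovery modes for an OPT-13B-scale model.
    Assumed constants: 0.8 MB of KV cache per token, 2 * 13e9 FLOPs per
    prefill token, ~25 GB/s effective PCIe 4.0 bandwidth, ~150 TFLOPS of
    dense FP16 compute, and a fixed 20 ms swap setup cost."""
    swap_s = 0.020 + num_tokens * 819_200 / 25e9    # PCIe transfer + fixed overhead
    recompute_s = num_tokens * 2 * 13e9 / 150e12    # re-prefill from scratch
    return "recompute" if recompute_s < swap_s else "swap"

short = cheaper_recovery(100)    # FLOPS beat the PCIe round-trip for short contexts
long = cheaper_recovery(4000)    # re-prefilling 4k tokens is prohibitive; swap wins
```

The crossover point shifts with hardware: faster interconnects (or NVLink) push it toward swapping, while cheaper compute pushes it toward recomputation, which is why serving engines expose the preemption mode as a configuration knob.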
Vulnerability Analysis: The Fill-and-Squeeze Attack
The highly deterministic nature of the scheduler's memory boundaries exposes continuous batching frameworks to a specific Denial-of-Service (DoS) vector classified as the "Fill-and-Squeeze" attack. An adversary weaponizes the Inter-Token Latency (ITL) side-channel across two distinct phases:
During the "Fill" phase, the attacker monitors the ITL metrics by sending short probe queries. When ITL drops, indicating abundant free KV cache capacity, the attacker injects high-intensity payloads — prompts explicitly engineered to force maximum context length generations. This action deterministically exhausts the free_block_queue, forcing the scheduler into Memory-Based Head-of-Line (HOL) blocking, where the admission of all new requests is paused due to VRAM starvation.
During the "Squeeze" phase, once the system is hovering precisely at its physical memory limit, the attacker switches tactics to continuous, low-intensity micro-requests. Because physical memory is fully saturated, these micro-requests tip the load balance, forcing the scheduler into an uncontrolled, catastrophic thrashing loop of continuous LIFO preemption and highly latent PCIe swapping or recomputation.
Hardware Bottlenecks: RTX 3090 vs RTX 4090 under High Throughput
Deploying PagedAttention frameworks on consumer or prosumer-grade hardware uncovers distinct physical bottlenecks. During the prefill phase, inference is heavily compute-bound. Because the RTX 4090 possesses substantially more compute units running at significantly higher clock speeds, it drastically outperforms the RTX 3090 in TTFT metrics — generally up to 50% faster in scenarios heavily reliant on raw FLOPS.
However, during the decode phase — the continuous, autoregressive generation of output tokens — performance scaling flattens dramatically. Decoding is strictly memory-bandwidth bound. The RTX 3090 features a memory bandwidth of 936 GB/s, while the newer RTX 4090 only offers a marginal architectural uplift to 1008 GB/s. As a result, the Output Tokens Per Second (OTPS) metric across high-concurrency batching is much closer than the generational gap would imply. In highly optimized continuous batching workloads using vLLM, the RTX 3090 operates only 14% to 16% slower than the RTX 4090 in sustained throughput for models like Qwen-8B or LLaMA-7B.
| Hardware Architecture | Core Generation | Peak Memory Bandwidth | Median TTFT (Prefill) | Median OTPS (Decode) |
|---|---|---|---|---|
| NVIDIA RTX 3090 | Ampere | 936 GB/s | ~564 ms | 28 OTPS (QwQ-32B) |
| NVIDIA RTX 4090 | Ada Lovelace | 1008 GB/s | <500 ms | ~32 OTPS (QwQ-32B) |
| NVIDIA H200 | Hopper (HBM3e) | ~4.8 TB/s | ~334 ms | 30 OTPS (Cloud Configured) |
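A single-stream roofline estimate makes the bandwidth ceiling concrete. This is an illustrative upper bound that assumes every decode step streams the full FP16 weight matrix once and ignores the KV cache, caching effects, and batching:

```python
def decode_tokens_per_second(params_billion: float, bytes_per_param: float,
                             bandwidth_gbps: float) -> float:
    """Upper bound on single-stream decode speed: every generated token must
    stream the entire weight matrix across the memory bus once."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_gbps * 1e9 / bytes_per_token

# A hypothetical 13B-parameter model in FP16 (2 bytes/param) on both cards
rtx_3090 = decode_tokens_per_second(13, 2, 936)    # ~36 tokens/s ceiling
rtx_4090 = decode_tokens_per_second(13, 2, 1008)   # ~39 tokens/s ceiling
# the ~8% bandwidth gap, not the FLOPS gap, bounds the decode-phase difference
```

The same formula explains the quantization lever mentioned below: halving `bytes_per_param` doubles the decode ceiling on identical hardware.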
The PCIe Interconnect Penalty
When configuring multi-GPU setups for Tensor Parallelism (TP) or managing heavy KV cache CPU-swapping during scheduler preemption, the interconnection bus becomes the paramount bottleneck. The RTX 3090 is the final generation of NVIDIA consumer cards to officially feature NVLink — a proprietary interconnect capable of bidirectional bandwidths far exceeding standard PCIe. The RTX 4090 strictly relies on the standard PCIe bus for all GPU-to-GPU and GPU-to-CPU communications.
A PCIe 5.0 x16 interconnect provides a theoretical 64 GB/s unidirectional bandwidth (yielding roughly 52–60 GB/s in real-world application), whereas PCIe 4.0 x16 drops to half that capacity. In high-concurrency environments, prefix-cache retrieval and page swapping over PCIe can account for up to 70% to 90% of total switching latency. Consequently, local infrastructure heavily relies on preventing page faults entirely — optimizing the model's memory footprint through FP8 or INT4 quantization directly reduces the byte-per-token overhead, staving off PCIe swap saturation and preserving high concurrent batching rates.
Hardware-Specific Compilation: TensorRT-LLM
While frameworks like vLLM provide highly flexible, hardware-agnostic hypervisors for managing PagedAttention, TensorRT-LLM takes a more rigid approach. It diverges from the open-source adaptability of vLLM by tightly coupling execution logic to proprietary NVIDIA hardware. While vLLM processes models out-of-the-box using JIT-like compilation and custom CUDA kernels, TensorRT-LLM relies on aggressive, ahead-of-time compilation utilizing deep CUDA graph fusions and Tensor Core optimization paths highly specific to the deployed hardware architecture (e.g., Ada Lovelace or Hopper).
This creates a strict operational dichotomy. TensorRT-LLM achieves peak, absolute theoretical hardware efficiency on enterprise deployments — particularly for FP8 inference on data-center H100s — but suffers from complex setup overhead, rigid deployment pipelines, and a complete lack of hardware independence. In contrast, vLLM remains flexible, capable of managing PagedAttention and continuous batching across diverse consumer setups, AMD ROCm architectures, and Intel Gaudi hardware without requiring engineers to rewrite model definitions.
The Evolution of the Scheduling Engine: V1 vs V2 Architectures
In legacy systems, the engine was strictly phase-locked — a single iteration step could process either prefill requests or decode requests, but never both simultaneously. This led to microscopic stalls in execution, where GPU cycles were wasted transitioning between memory-bound decodes and compute-bound prefills.
Modern engine re-architectures deeply decouple these constraints, introducing highly asynchronous scheduling modules. The modern V2 scheduler evaluates the waiting, running, and swapped queues globally at every iteration. It strictly prioritizes decode requests to ensure generation latency remains low, but simultaneously computes precise token budgets for the upcoming iteration. If the allocate_slots function confirms that the free_block_queue possesses sufficient capacity, the scheduler seamlessly interleaves the processing of partial prefill chunks alongside the autoregressive decode passes within the exact same GPU execution step. This interleaved compute-memory utilization keeps both the ALUs and the memory bus saturated constantly, dramatically raising the aggregate throughput floor and preventing the GPU from ever entering an idle state during high-load production serving.
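The budgeting step can be sketched as a planner that protects decode latency and spends the remainder on chunked prefill. This is a conceptual reduction with hypothetical names, not the actual V2 scheduler:

```python
def plan_iteration(decodes: list, prefill_chunks: dict, token_budget: int,
                   chunk_size: int) -> list:
    """Toy V2-style step plan: decode requests first (1 token each), then
    prefill chunks consume whatever token budget remains."""
    plan, budget = [], token_budget
    for seq in decodes:                  # decode latency is strictly prioritized
        if budget == 0:
            break
        plan.append((seq, 1))
        budget -= 1
    for seq, remaining in prefill_chunks.items():
        if budget == 0:
            break
        take = min(chunk_size, remaining, budget)  # trim the chunk to the budget
        plan.append((seq, take))
        budget -= take
    return plan

plan = plan_iteration(["d0", "d1"], {"p0": 700}, token_budget=256, chunk_size=512)
# -> [("d0", 1), ("d1", 1), ("p0", 254)]: a partial prefill chunk rides along
#    with the decode passes in the same GPU execution step
```

Because the prefill chunk is trimmed rather than deferred, the compute-heavy prefill work fills the arithmetic slack left by the memory-bound decode passes, which is the interleaving the paragraph above describes.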
Synthesizing the Paradigm Shift
The transition from static, contiguous memory allocation to dynamic, virtualized memory paging represents the most critical architectural breakthrough in modern LLM serving infrastructures. By accurately diagnosing the pathologies of internal fragmentation, reservation waste, and external fragmentation, PagedAttention entirely circumvents the hardware limitations that previously throttled local inference to a fraction of its theoretical capacity.
Through the rigorous application of Block Tables, fine-grained CUDA thread-warp synchronization, and block-level Copy-on-Write semantics, continuous batching frameworks can push GPU VRAM utilization to its absolute physical boundary, reducing memory waste to near zero. While the selection of underlying hardware — ranging from PCIe-bottlenecked Ada Lovelace architectures to NVLink-enabled Ampere clusters — dictates the raw throughput ceiling, the virtualization of the KV cache ensures that every byte of bandwidth is harnessed efficiently.
As competing paradigms like RadixAttention and low-level CUDA VMM mapping continue to evolve, the core abstraction of decoupled logical and physical memory will remain the foundational pillar ensuring that large language models can be deployed locally, efficiently, and at massive scale.