Glossary
Core terms in AI compute-center storage acceleration (KV Cache, disaggregation, EBOF, NVMe-oF).
Which AI-storage terms should you know first?
- Disaggregation: decoupling storage and compute so each scales independently
- KV Cache offload: tiered offload of GPU-memory KV cache to fast external all-flash
- NVMe-oF / RoCEv2: remote NVMe over lossless Ethernet at near-local latency
- EBOF: Ethernet-attached bunch of flash for scale-out
KV Cache
Attention key-value pairs cached during LLM inference to avoid recomputation and speed up long-context generation; the cache consumes a large amount of GPU memory.
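To make the GPU-memory cost concrete, here is a minimal sizing sketch. It assumes a standard transformer layout in which each layer stores one key tensor and one value tensor per token; the function name `kv_cache_bytes` and the model shape in the example are hypothetical and not taken from any product described here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate the bytes needed to hold keys and values for all layers and tokens."""
    # The leading factor 2 covers one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, chosen only for illustration (FP16, 2 bytes per element).
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.2f} GiB per sequence")  # prints "2.00 GiB per sequence"
```

Because the estimate grows linearly with both sequence length and batch size, long contexts and high concurrency multiply it quickly.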
KV Cache offload
Tiered offload of KV Cache from GPU memory to external high-speed storage, extending context length and raising concurrency and throughput (up to ~73.7% cost cut, S5).
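The tiering idea can be sketched as a two-level cache. This is a toy illustration under stated assumptions, not any vendor's implementation: the class name `TieredKVCache` is hypothetical, an `OrderedDict` stands in for GPU memory, a plain `dict` stands in for external all-flash, and least-recently-used eviction is an assumed policy.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier KV block store with LRU demotion (illustrative only)."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for GPU memory
        self.slow = {}             # stands in for external all-flash storage

    def put(self, block_id, data):
        # A block lives in exactly one tier, so drop any stale slow-tier copy.
        self.slow.pop(block_id, None)
        self.fast[block_id] = data
        self.fast.move_to_end(block_id)
        # Demote least-recently-used blocks once the fast tier overflows.
        while len(self.fast) > self.fast_capacity:
            victim, payload = self.fast.popitem(last=False)
            self.slow[victim] = payload

    def get(self, block_id):
        if block_id in self.fast:
            self.fast.move_to_end(block_id)  # mark as recently used
            return self.fast[block_id]
        if block_id in self.slow:
            data = self.slow.pop(block_id)
            self.put(block_id, data)  # promote back instead of recomputing
            return data
        return None  # miss in both tiers: the caller must recompute


cache = TieredKVCache(fast_capacity=2)
for block in ("a", "b", "c"):
    cache.put(block, f"kv-{block}")
print(sorted(cache.slow))  # block "a" was demoted to the slow tier
print(cache.get("a"))      # fetched from the slow tier and promoted
```

A hit in the slow tier replaces a recomputation with a storage read, which is why the latency and bandwidth of the external tier matter.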
Disaggregation
Decoupling storage from compute so each scales independently, avoiding buying compute just to add storage.
EBOF
Ethernet-attached Bunch of Flash; an all-flash expansion unit over NVMe-oF enabling independent storage scale-out.
GPUDirect Storage
Lets GPUs exchange data with storage directly, bypassing CPU bounce buffers to cut copies and latency (up to 351 GiB/s sequential read, S4).
NVMe-oF
NVMe over Fabrics; extends NVMe across the network so remote all-flash behaves like local disks.
RoCEv2
RDMA over Converged Ethernet v2; low-latency lossless RDMA transport over Ethernet.
CPFS
Parallel file system that provides shared storage with high aggregate bandwidth for multi-GPU training and inference.
Token throughput
Effective tokens generated per unit compute per unit time; a key economics metric for compute centers.
GPU utilization
Share of time a GPU spends on useful compute; often only 30-50% when I/O-bound, and storage acceleration can raise it 2-3x (S4).
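A quick arithmetic check on these figures, as a sketch only: utilization cannot exceed 100%, so the achievable multiplier depends on the starting point. The helper name `lifted_utilization` is hypothetical, and the baselines and factors are simply the endpoints quoted above.

```python
def lifted_utilization(baseline, factor):
    """Apply a speedup factor to a utilization fraction, capped at 100%."""
    return min(1.0, baseline * factor)


for baseline in (0.30, 0.50):
    for factor in (2, 3):
        result = lifted_utilization(baseline, factor)
        print(f"baseline {baseline:.0%} x{factor} -> {result:.0%}")
```

From a 30% baseline a 3x lift reaches about 90%, while from a 50% baseline anything beyond 2x is capped at 100%.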
All-flash storage
Storage built entirely on NVMe SSDs, offering high IOPS, high bandwidth and low latency.
WS5000 / WS7000
ZK-Storage's disaggregated all-flash appliances: the WS5000 is in mass production; the WS7000 targets AI compute centers (70M-IOPS class).
Inference Context Externalization
An architecture that centralizes the intermediate state generated during large-model inference on a shared storage layer and manages it through standardized protocols. Decoupling compute from context storage relieves single-node VRAM bottlenecks and supports multi-instance sharing and dynamic migration, which improves cluster scheduling efficiency and horizontal scalability.
Microsecond Parallel Data Channel
A storage transport architecture for AI workloads that delivers ultra-low latency and high concurrency through multi-link aggregation. It keeps data supply stable during intensive scheduling, with typical access latency of about 20 μs and about 50 million random IOPS, and supports heterogeneous clusters, with adaptation covering more than 90% of domestic GPUs.
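The latency and IOPS figures can be related through Little's law (average requests in flight = throughput × latency). The sketch below assumes both quoted numbers hold at the same time in steady state, which the source does not state; the function name `outstanding_ios` is hypothetical.

```python
def outstanding_ios(iops, latency_us):
    """Little's law: average number of in-flight I/Os at steady state."""
    # Latency is given in microseconds, so divide by 1e6 to convert to seconds.
    return iops * latency_us / 1_000_000


in_flight = outstanding_ios(iops=50_000_000, latency_us=20)
print(in_flight)  # 1000.0
```

Under that assumption, sustaining both figures requires roughly a thousand I/Os in flight at any moment, which is consistent with the emphasis on parallel, aggregated links.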