Glossary

Core terms in AI compute-center storage acceleration (KV Cache, disaggregation, EBOF, NVMe-oF).

Quick answer

Which AI-storage terms should you know first?

Disaggregation: Decoupling storage and compute so each scales independently
KV Cache offload: Tiered offload of GPU-memory KV cache to fast external all-flash storage
NVMe-oF / RoCEv2: Remote NVMe access over lossless Ethernet at near-local latency
EBOF: Ethernet-attached Bunch of Flash for independent storage scale-out

KV Cache

Attention key-value pairs cached during LLM inference so that earlier tokens need not be recomputed, which speeds up long-context generation. The cache consumes a large amount of GPU memory.
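To make the memory cost concrete, here is a minimal sizing sketch. It uses the standard estimate for a transformer KV cache (one key and one value tensor per layer, per token). The model shape below is a hypothetical example, not a figure from this glossary.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer and every token."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape (assumption): 32 layers, 32 KV heads,
# head dimension 128, FP16 values (2 bytes), one 32,768-token sequence.
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=32_768)
print(f"{size / 2**30:.1f} GiB")  # prints "16.0 GiB"
```

Because the size grows linearly with both sequence length and batch size, long contexts and high concurrency exhaust GPU memory quickly.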

KV-Cache offload

Tiered offload of KV Cache to external high-speed storage, which extends usable context and raises concurrency and throughput (up to a ~73.7% cost reduction, S5).
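The tiering idea can be sketched as a two-level cache: a small hot tier standing in for GPU memory and a larger cold tier standing in for external storage. This is an illustrative toy, not any vendor's implementation; the class name `TieredKVCache`, the LRU eviction policy, and the in-memory dict used as the "external" tier are all assumptions.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier cache: LRU hot tier plus an unbounded cold tier."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # hot tier, least recently used first
        self.external = {}        # stand-in for external all-flash storage

    def put(self, key, block):
        self.gpu[key] = block
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_capacity:
            old_key, old_block = self.gpu.popitem(last=False)
            # Offload the evicted block instead of discarding it.
            self.external[old_key] = old_block

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.external:
            block = self.external.pop(key)
            self.put(key, block)  # promote back to the hot tier
            return block
        return None  # full miss: the caller has to recompute


cache = TieredKVCache(gpu_capacity=2)
for name in ("a", "b", "c"):
    cache.put(name, f"kv-block-{name}")
offloaded_first = sorted(cache.external)  # ["a"]
restored = cache.get("a")                 # promoted; "b" is offloaded instead
```

A hit in the cold tier costs a storage read rather than a recomputation, which is why the latency of the external tier matters.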

Disaggregation

Decoupling storage from compute so each scales independently, avoiding buying compute just to add storage.

EBOF

Ethernet-attached Bunch of Flash; an all-flash expansion unit over NVMe-oF enabling independent storage scale-out.

GPUDirect Storage

Lets GPUs exchange data with storage directly, bypassing the CPU and its host-memory bounce buffer, which cuts copies and latency (up to 351 GiB/s sequential read, S4).

NVMe-oF

NVMe over Fabrics; extends NVMe across the network so remote all-flash behaves like local disks.

RoCEv2

RDMA over Converged Ethernet v2; a low-latency RDMA transport carried over UDP/IP on Ethernet, typically deployed on a lossless fabric.

CPFS

Parallel file system providing high aggregate bandwidth shared storage for multi-GPU training/inference.

Token throughput

Effective tokens generated per unit compute per unit time; a key economics metric for compute centers.

GPU utilization

Share of time a GPU spends on useful compute; often only 30-50% when IO-bound, with a 2-3x lift attributed to storage acceleration (S4).
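As a rough illustration of how IO stalls drive this metric, the sketch below treats utilization as compute time divided by total wall time. The timings are made-up numbers chosen for the arithmetic, not measurements from S4.

```python
def gpu_utilization(compute_s, io_stall_s):
    """Fraction of wall time spent on useful compute."""
    return compute_s / (compute_s + io_stall_s)


# Hypothetical step timings in seconds (assumptions for illustration).
before = gpu_utilization(compute_s=40, io_stall_s=60)  # IO-bound: 0.40
after = gpu_utilization(compute_s=40, io_stall_s=10)   # faster storage: 0.80
print(f"{before:.0%} -> {after:.0%}")  # prints "40% -> 80%"
```

The compute work is unchanged in both cases; only the time spent waiting on data shrinks.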

All-flash storage

Storage built entirely on NVMe SSDs, offering high IOPS, high bandwidth and low latency.

WS5000 / WS7000

ZK-Storage's disaggregated all-flash appliances: WS5000 in mass production; WS7000 for AI compute centers (70M-IOPS class).

Inference Context Externalization

An architecture that centralizes the intermediate state generated during large-model inference on a shared storage layer and manages it through standardized protocols. Decoupling compute from context storage relieves single-node VRAM bottlenecks and supports multi-instance sharing and dynamic migration, which improves cluster scheduling efficiency and horizontal scalability.

Microsecond Parallel Data Channel

A storage transmission architecture for AI workloads that delivers ultra-low latency and high concurrency via multi-link aggregation. It keeps data supply stable during intensive scheduling, with typical access latency of about 20 μs and about 50,000,000 random IOPS, and supports heterogeneous clusters, with adaptation to over 90% of domestic GPUs.
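The latency and IOPS figures above can be related through Little's law (requests in flight = throughput x latency). The sketch below applies it to the quoted numbers; treating both figures as simultaneous steady-state averages is an assumption.

```python
def outstanding_requests(iops, latency_us):
    """Little's law: average requests in flight = arrival rate * latency."""
    return iops * latency_us / 1_000_000  # convert microseconds to seconds


# Figures quoted in the entry above: ~50,000,000 IOPS at ~20 microseconds.
in_flight = outstanding_requests(iops=50_000_000, latency_us=20)
print(in_flight)  # prints 1000.0
```

Under that assumption, roughly 1,000 requests must be in flight across all links at any moment to sustain the quoted rate, which is one reason parallel links and deep queues are needed.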

Last updated: 2026-06-24