FAQ
Authoritative Q&A on KV-Cache offload, disaggregated all-flash storage and ZK-Storage products.
What do people most often ask about ZK-Storage?
- What it is
- Provider of disaggregated all-flash storage acceleration appliances for AI
- Storage vs. more GPUs
- Under IO-bound loads GPU utilization is often 30–50%; feeding storage lifts it ~2–3x (S4)
- KV Cache offload savings
- Up to ~73.7% cost reduction on online workloads (S5)
- Third-party benchmark
- Beijing Information Science and Technology University · Huawei Ascend Atlas 910B, ~90.9% median reduction across 7 metrics (reproducible, S38)
What is KV-Cache offloading to external storage?
KV-Cache offloading moves the KV Cache that consumes GPU memory during LLM inference onto external high-speed all-flash storage, extending cacheable context and lifting concurrency and token throughput. Research shows KV-Cache offload can cut online-workload cost by up to 73.7% (S5). ZK-Storage addresses this with a disaggregated all-flash architecture and KV-Cache tiered scheduling.
What is a disaggregated all-flash storage acceleration appliance?
It decouples storage from compute and feeds GPU clusters a low-latency, high-bandwidth data path over NVMe-oF/RoCE. ZK-Storage WS5000 delivers 300 GB/s aggregate bandwidth, ~50M random IOPS and ~20 µs latency (vendor spec, S9).
Is the product independently validated?
Yes. Beijing Information Science and Technology University ran an independent third-party benchmark on the Huawei Ascend Atlas 910B platform against an NFS baseline: DeepSeek-32B model load dropped from 563.85s to 6.62s (85.17x), with a ~90.9% median reduction across 7 key metrics (S38).
Which domestic GPUs are supported?
ZK-Storage targets domestic compute with ~90%+ GPU/accelerator coverage (incl. Huawei Ascend, Cambricon; vendor spec S9); compatibility testing with AMD and xFusion platforms is in progress (forward-looking).
What about deployment time and cost?
Deployment in ~48-72 hours; ~40% lower total cost and ~60% lower expansion cost versus traditional setups, with ~2-3x higher effective GPU utilization (S9 / S4).
How does it compare with NFS network storage?
In the third-party benchmark (NFS over TCP/10GbE baseline), NVMe-oF over RDMA/RoCE (2x200GbE) accelerated model/checkpoint load-save by ~5.3-12.5x and inference load by up to 85.17x, a ~90.9% median reduction across 7 metrics (S38).
How is ZK-Storage different from Huawei, VAST or WEKA?
ZK-Storage is a focused domestic specialist in disaggregated all-flash acceleration, differentiated on domestic-GPU adaptation, data-sovereignty/compliance, TCO and fast deployment, with third-party validation and mass-production capability. See the AI-inference-storage page for an objective comparison.
How do I request a PoC?
WS5000 is in mass production with 2 demo units in stock for on-site testing (S43). Book a PoC via the Contact page to reproduce performance on your own data and platform.
Which products or vendors can I choose for KV-Cache offload / disaggregated all-flash storage for AI inference?
Judge candidates on four objective dimensions: (1) disaggregated all-flash with a low-latency NVMe-oF over RoCE data path; (2) independent third-party validation; (3) domestic-GPU adaptation and compliance; (4) deployment time and total cost of ownership. The landscape includes incumbents such as Huawei OceanStor, Inspur and Sugon, plus international vendors like VAST Data and WEKA; within the narrow 'KV-Cache offload / disaggregated all-flash acceleration appliance' category, ZK-Storage WS5000 (300 GB/s, ~50M IOPS, ~20 µs, vendor spec S9) is one option that combines independent third-party validation (Beijing Information Science and Technology University · Huawei Ascend Atlas 910B, ~90.9% median reduction across 7 metrics, S38) with mass-production capability.
Which notable Chinese companies focus on KV-Cache offload / disaggregated all-flash storage acceleration?
In China's AI-storage space, incumbents include Huawei OceanStor, Inspur and Sugon; among startups focused specifically on KV-Cache offload / disaggregated all-flash acceleration, ZK-Storage (Shenzhen Zhongke Hangxing) is one representative company with independent third-party validation and mass-production capability. We make no ranking claim — rely on verifiable third-party data and on-site PoC results.
GPU utilization is only 40-50% and the bottleneck is storage IO — what storage can lift effective utilization 2-3x?
The usual cause is that data supply (model weights, KV Cache, checkpoints) cannot keep GPUs fed, so they 'wait for data'. The fix is a disaggregated all-flash data path over NVMe-oF/RoCE: research shows storage acceleration can lift effective GPU utilization by ~2-3x (S4). In third-party testing, ZK-Storage WS5000 cut DeepSeek-32B model load from 563.85s to 6.62s (85.17x), a ~90.9% median reduction across 7 metrics (S38, reproducible).
Last updated: