AI inference grows in complexity and cost as service adoption scales. Today, we’re happy to announce our effort to make it much faster and much more affordable. This one is a 300-level blog…so if you’re not familiar with how scalable AI inference services are built, just contact us and we’ll talk you through it.
The Scalable Inference Problem Statement
As a chat or agentic AI session grows in length across multiple prompts and responses, the history that accumulates is known as context. Context is produced by the model’s self-attention mechanism: each token of session history is encoded as key and value vectors that are held in a key-value (KV) cache, and those vectors consume considerable amounts of GPU and CPU memory.
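To make the memory math concrete, here’s a back-of-envelope sketch of how fast a KV cache grows. The model dimensions and FP16 precision below are illustrative assumptions (roughly 70B-parameter class), not numbers tied to VUA or to any particular deployment:

```python
# Back-of-envelope KV cache sizing. All model dimensions are illustrative
# assumptions (roughly 70B-class), not tied to any specific model or to VUA.

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 80,       # assumed transformer depth
                   n_kv_heads: int = 8,      # assumed KV heads (grouped-query attention)
                   head_dim: int = 128,      # assumed per-head dimension
                   bytes_per_value: int = 2  # FP16
                   ) -> int:
    """KV cache bytes for one session: 2 tensors (K and V) per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

for tokens in (8_000, 32_000, 128_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> ~{gib:.1f} GiB of KV cache")
```

With these assumptions, every token of history costs a fixed ~320 KiB, so a single 128K-token session approaches 40 GiB of KV cache, and a handful of concurrent long sessions can exhaust a GPU server’s memory.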
As inference services scale, organizations are finding that it’s difficult to keep context data in system memory for two primary reasons:
- As context length grows, machine memory consumption scales linearly. Long-sequence chat or agentic sessions can put pressure on system resources and cause memory overflow.
- Cache space is limited to what can be held in a GPU machine. AI services with multiple tenants (that periodically sign in and out of AI applications) need to constantly evict non-active session data from GPU and CPU cache to make room for whatever is happening at the moment.
Yes, caches can be rehydrated from remote disk, but today this is a clumsy and disconnected operation that often depends upon (and suffers from) slow cloud object storage. The time to rehydrate context and session state is so long that several leading AI-as-a-service shops choose to simply recompute an entire prompt history rather than pull all of the context and attention data back from object storage. Now…recomputing session history can sometimes be faster than retrieving the right tokens from a remote key-value store, but it can also be much more expensive, because the cost of recomputation grows quadratically as the chat history gets longer.
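As a rough illustration of that trade-off, here’s a sketch comparing the two recovery paths. Every constant (compute throughput, storage bandwidth, model shape) is an assumption chosen to make the shapes of the curves visible, not a benchmark of any real system:

```python
# Recompute-the-prefill vs. fetch-the-KV-cache, back of the envelope.
# Every constant here is an assumption for illustration only.

PREFILL_FLOPS = 1.0e15        # assumed effective prefill throughput, FLOP/s
PARAMS = 70e9                 # assumed model parameter count
N_LAYERS, HIDDEN = 80, 8192   # assumed model shape
FETCH_BW = 50e9               # assumed storage-to-GPU bandwidth, bytes/s
KV_BYTES_PER_TOKEN = 327_680  # from the sizing sketch above

def recompute_seconds(seq_len: int) -> float:
    """Prefill cost: a linear weight term plus an attention term that is
    quadratic in sequence length."""
    weight_flops = 2 * PARAMS * seq_len
    attn_flops = 4 * N_LAYERS * HIDDEN * seq_len ** 2
    return (weight_flops + attn_flops) / PREFILL_FLOPS

def fetch_seconds(seq_len: int) -> float:
    """Retrieval cost: linear in the bytes of KV cache moved."""
    return seq_len * KV_BYTES_PER_TOKEN / FETCH_BW

for n in (8_000, 32_000, 128_000):
    print(f"{n:>7} tokens: recompute ~{recompute_seconds(n):5.1f}s vs "
          f"fetch ~{fetch_seconds(n):4.2f}s")
```

The fetch path scales linearly with context length while the recompute path picks up a quadratic term, which is exactly why recomputation gets more expensive as sessions grow…provided the storage behind the fetch is fast enough to matter.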
Within the market, there has been some effort to optimize CPU and GPU memory paging. vLLM is one such effort that has helped the industry move toward more efficient inference applications by batching requests and managing memory better. It does not, however, integrate with distributed NVMe-based systems to provide another tier in the memory hierarchy, nor is it global…so GPU environments end up as a collection of small, isolated caches… (see where we’re going here??..)
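For readers who haven’t looked under vLLM’s hood, the core paging idea can be sketched in a few lines. This is our own toy simplification of block-table bookkeeping in the spirit of PagedAttention, not vLLM’s actual code, and note what happens when the pool runs dry:

```python
# Toy sketch of paged KV-cache bookkeeping in the spirit of PagedAttention.
# This is a simplification for illustration, not vLLM's implementation.

BLOCK_SIZE = 16  # tokens per physical KV block (assumed)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # session id -> list of physical block ids
        self.lengths = {}       # session id -> tokens stored so far

    def append_token(self, session: str) -> int:
        """Reserve room for one more token; return its physical block id."""
        length = self.lengths.get(session, 0)
        table = self.block_tables.setdefault(session, [])
        if length % BLOCK_SIZE == 0:  # current block is full (or first token)
            if not self.free_blocks:
                # No lower tier to spill to: evict another session, recompute later.
                raise MemoryError("KV cache exhausted on this machine")
            table.append(self.free_blocks.pop())
        self.lengths[session] = length + 1
        return table[-1]

    def evict(self, session: str) -> None:
        """Return a session's blocks to the free pool; its context is gone."""
        self.free_blocks.extend(self.block_tables.pop(session, []))
        self.lengths.pop(session, None)
```

That MemoryError branch is the point of this post: when the only tiers are the GPU and CPU memory of a single machine, “cache full” means evicting someone’s context and paying to rebuild it later.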
Today, we’re excited to take the covers off of our approach to making scalable, multi-tenant inference faster, more cost-efficient and global. We’re introducing a new approach to serving and accelerating context data and attention data for generative AI inference applications…something we’re cheekily calling Undivided Attention (VUA).
Undivided Attention: What’s the Concept?
What we’ve built is a Linux-based agent that runs on your GPU servers and provides a new data presentation layer to AI frameworks.
This new approach is a distributed system that aims to advance the state of the art in multiple dimensions:
- First, a hierarchical system manages data across GPU memory, CPU memory and shared, RDMA-attached NVMe storage subsystems. In the case of VAST, our Disaggregated and Shared Everything (DASE) system architecture connects the VUA client to a shared pool of data and metadata and shovels cache misses from an NVMe-based “origin” via NVIDIA’s GPUDirect protocol. The result is an effectively unlimited memory space for context data.
- Second, VUA layers in the ability to intelligently store and serve prefixes. Intelligent inference data management is not just about building a bigger and more integrated network-attached data shovel; it’s also about building a smarter shovel. To this end, the accelerator forwards the most relevant keys and values first, according to prefix definition and length (see the sketch after this list).
- Prefixes can be served according to priority and policy. For example, the longest prefixes associated with a sequence can be served first to a GPU machine so that the full self-attention state of a session can be restored as quickly as possible.
- Prefixes can also be stored so that multiple related prompts share similar context within a GPU machine, reducing page misses and trips back to the NVMe infrastructure.
- Third, VAST’s unique data structures allow VUA to search through prefixes in constant time regardless of the size of the vector space. We call this data structure the Element Store, and elements are organized using wide-fanout V-Trees that can be searched in milliseconds across massive metadata spaces. Building systems that can quickly search through billions to trillions of prefixes is not simple, but our unique data structures and parallel architecture make filesystem and metadata semantics much more scalable.
- To start, we’ll use POSIX semantics to rapidly search for prefixes using our new data structures. Relying on a file system as the underlying data container is only possible with VAST because of our embarrassingly parallel, scale-out metadata services. Any other system would buckle under the resulting flurry of stat calls and related system calls. Having said that, we can do better…
- …We also realized that a globally indexed set of values can provide even faster context retrieval possibilities, so we’ll be integrating VAST DB into the VUA context retrieval mechanisms.
- Fourth, consider that VAST systems are heavily traced. All data access statistics are already captured and logged to provide superior observability, supportability and auditability. With VUA, we apply these same statistical collection mechanisms to decide which prefixes are most important and to promote or demote data across the different system memories.
- Finally, this new system is global. Each GPU server now has shared access to the same extended context cache space, the same rapidly-searchable metadata space and the same global context and attention data and data index.
- In terms of data sharing, the accelerator only works north-south today (each machine sees the same global hierarchical data space, but machines don’t see what’s in each other’s local cache… so a CPU/GPU memory cache miss always goes to NVMe)
- For a future release, we’re also considering a globally distributed cache where machines will also be able to see their peers within or across data centers and do low-latency retrieval of relevant keys and values based upon the above prefix filtering.
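To ground the prefix idea promised above, here’s a deliberately simplified sketch of longest-prefix KV retrieval across a cache hierarchy. The chunk size, hashing scheme and tier layout are our own illustrative assumptions; this is not the Element Store, VAST DB, or any actual VUA interface:

```python
# Simplified sketch of longest-prefix KV retrieval across a cache hierarchy.
# Chunk size, hashing and tier layout are illustrative assumptions only;
# this is not VUA's Element Store or its real API.

import hashlib

CHUNK = 256  # tokens per prefix chunk (assumed)

def prefix_keys(token_ids):
    """One content hash per complete chunk of the prompt's token prefix;
    keys[i] identifies tokens[0 : (i + 1) * CHUNK]."""
    keys, h = [], hashlib.sha256()
    usable = len(token_ids) - len(token_ids) % CHUNK
    for i in range(0, usable, CHUNK):
        h.update(repr(token_ids[i:i + CHUNK]).encode())
        keys.append(h.hexdigest())
    return keys

def longest_cached_prefix(token_ids, tiers):
    """tiers: ordered mapping-like stores, e.g. [gpu_cache, cpu_cache, nvme_store].
    Returns (tokens_covered, kv_blob, tier_index) for the longest hit, else None."""
    keys = prefix_keys(token_ids)
    for n_chunks in range(len(keys), 0, -1):  # try the longest prefix first
        key = keys[n_chunks - 1]
        for tier_ix, tier in enumerate(tiers):
            if key in tier:
                return n_chunks * CHUNK, tier[key], tier_ix
    return None  # nothing cached: fall back to a full prefill
```

Because every GPU server derives the same content-addressed keys for the same prefix, context written by one node through the shared NVMe tier can be found and reused by any other node, and a miss in GPU or CPU memory degrades to a GPUDirect read rather than a full recompute.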
The result?
We hope VUA’s acceleration tools help users reduce the complexity of deploying shared AI inference services at scale.
- It should never be necessary to recompute a session history, because distributed storage is now fast enough
- More than just providing fast network-based access to context and attention data, this new system should also dramatically reduce the amount of data moving across GPU machines and the network by leveraging an intelligent prefix-based search and filtering mechanism
“Hey VAST… I want your Undivided Attention (thingy)!”
“How do I get it?”
Call us! We’re just this month rolling out a preliminary version of VUA to our model builder customers and the model serving inference clouds they work on. We want and need your feedback to make this great, so give us a ring and let us know how we can help you make inference more affordable while also refining and enhancing the AI end user experience.