
Kubernetes for GenAI: Why it makes so much sense

Generative AI (or GenAI) is evolving rapidly and becoming indispensable for many organizations. It goes beyond simple predictions, enhancing applications with code completion, automation, and deep domain knowledge. Whether your use case is web-based chat, customer service, documentation search, content generation, image editing, infrastructure troubleshooting, or countless other functions, GenAI promises to help us become more efficient problem solvers.

Kubernetes, which recently celebrated its 10th anniversary, offers valuable features for running GenAI workloads. Over the years, Kubernetes and the cloud-native community have improved, integrated, and automated numerous infrastructure layers to make the lives of administrators, developers, and operations professionals easier.

GenAI frameworks can build on this work and run well on Kubernetes. For example, the Operator Framework is already being used to integrate GenAI with Kubernetes because it enables applications to be deployed and managed in an automated, scalable way.

Let’s dig a little deeper into why Kubernetes is a great fit for building GenAI workloads.

Why generative AI on Kubernetes makes sense

Kubernetes provides building blocks for any type of application. It offers workload scheduling, automation, observability, persistent storage, security, networking, high availability, node labeling, and other features critical to GenAI and other applications.

Take, for example, deploying a base GenAI model like Google’s Gemma or Meta’s Llama 2 to worker nodes with graphics processing units (GPUs). The Container Storage Interface (CSI) driver mechanisms built into Kubernetes make it much easier to provide persistent shared storage for a model so that inference engines can quickly load it into GPU memory.
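As a rough sketch of what that looks like, the PersistentVolumeClaim below requests ReadWriteMany storage so that model weights can be mounted by pods on any GPU node. The storage class name and size are assumptions; use whichever RWX-capable CSI driver your cluster actually provides.

```yaml
# Hypothetical PVC for sharing model weights across GPU nodes.
# The storage class name is an assumption; substitute whatever
# RWX-capable CSI driver (NFS, CephFS, etc.) your cluster offers.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  accessModes:
    - ReadWriteMany          # multiple nodes can mount the same model files
  storageClassName: nfs-csi  # assumption: an RWX-capable CSI storage class
  resources:
    requests:
      storage: 200Gi         # sized for multi-gigabyte model weights
```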

Another example is running a vector database like Chroma within a Retrieval-Augmented Generation (RAG) pipeline. Databases often need to remain highly available, and Kubernetes’ built-in scheduling capability coupled with CSI drivers can allow vector databases to be moved to different workers in the Kubernetes cluster. This is critical in the event of node, network, zone, and other failures, as it keeps your pipelines running with access to the embeddings.
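Here is a minimal sketch of that pattern: a StatefulSet running Chroma with a volumeClaimTemplate, so that if the pod is rescheduled the scheduler reattaches its data volume on the new node. The image tag, port, and data path are assumptions to verify against the Chroma release you deploy.

```yaml
# Hypothetical StatefulSet for a Chroma vector database.
# Image tag, container port, and data path are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: chroma
spec:
  serviceName: chroma
  replicas: 1
  selector:
    matchLabels:
      app: chroma
  template:
    metadata:
      labels:
        app: chroma
    spec:
      containers:
        - name: chroma
          image: chromadb/chroma:latest   # assumption: upstream Chroma image
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: chroma-data
              mountPath: /data            # assumption: Chroma's persistence path
  volumeClaimTemplates:
    - metadata:
        name: chroma-data
      spec:
        accessModes: ["ReadWriteOnce"]    # block storage that follows the pod on reschedule
        resources:
          requests:
            storage: 20Gi
```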

Whether it’s observability, networking, or more, Kubernetes’ “batteries included” architecture makes it a great place for GenAI applications.

Enabling GPUs on Kubernetes

Upstream Kubernetes supports management of Intel, AMD, and NVIDIA GPUs through its device plugin framework, provided an administrator has provisioned and installed the required hardware and drivers on the nodes.
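To make that concrete, here is a hedged smoke-test Pod that requests a single NVIDIA GPU through the device plugin’s extended resource. The CUDA image tag is an assumption; Intel and AMD plugins expose their own resource names.

```yaml
# Once a device plugin is running, pods request GPUs like any other resource.
# The resource name below is NVIDIA's; Intel and AMD plugins advertise their own.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.3.2-base-ubuntu22.04  # assumption: any CUDA base image works
      command: ["nvidia-smi"]                     # print visible GPUs and exit
      resources:
        limits:
          nvidia.com/gpu: 1   # extended resource advertised by the NVIDIA device plugin
```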

This, along with third-party integration via plug-ins and operators, equips Kubernetes with the essential building blocks needed to enable GenAI workloads.

Vendor-supported operators such as the Intel Device Plugins Operator and the NVIDIA GPU Operator can also help reduce management overhead. For example, the NVIDIA GPU Operator manages the installation and lifecycle of drivers, the CUDA runtime, and the container toolkit, so you don’t have to handle each component separately.
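As an illustrative, non-authoritative sketch, a Helm values override for the GPU Operator can toggle which of those components the operator owns. The key names below are assumptions to confirm against the chart version you install; the point is that driver and toolkit lifecycle become operator-managed rather than hand-rolled per node.

```yaml
# Illustrative values override for the NVIDIA GPU Operator Helm chart.
# Key names are assumptions; check the chart's documentation for your version.
driver:
  enabled: true    # let the operator install and upgrade the GPU driver on nodes
toolkit:
  enabled: true    # let the operator manage the NVIDIA container toolkit
```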

Providing models and inference engines

Enabling GPUs on a Kubernetes cluster is only a small part of the overall GenAI puzzle. GPUs are needed to run GenAI models on Kubernetes, but the full infrastructure layer includes other elements such as shared storage, inference engines, serving layers, embedding models, web apps, and batch jobs required to run a GenAI application.

Once a model is trained and available, it needs to be downloaded and loaded into the Kubernetes environment. Many of the base models can be downloaded from Hugging Face and then loaded into the serving layer, which is part of the inference server or inference engine.
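One hedged way to do that download inside the cluster is a one-shot Kubernetes Job that writes the weights to the shared volume. The model ID, PVC name, and CLI invocation below are assumptions to adapt to your setup, and gated models such as Gemma or Llama additionally require a Hugging Face token, which is omitted here.

```yaml
# Hypothetical one-shot Job that pulls model weights from Hugging Face
# into the shared PVC sketched earlier. Model ID, PVC name, and the
# pip/huggingface-cli invocation are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: download-model
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: downloader
          image: python:3.11-slim
          command: ["/bin/sh", "-c"]
          args:
            - |
              pip install --quiet huggingface_hub &&
              huggingface-cli download google/gemma-2b --local-dir /models/gemma-2b
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-weights   # the shared RWX claim from the storage example
```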

An inference engine or server, such as NVIDIA Triton Inference Server or Hugging Face Text Generation Inference (TGI), is software that interacts with pre-trained models: loading and unloading them, processing requests to the model, returning results, monitoring logs and versions, and more.

Inference engines and serving layers do not need to run on Kubernetes, but that’s what I’ll focus on here. You can deploy Hugging Face TGI to Kubernetes via Helm, a Kubernetes application package manager. This Helm chart from Substratus AI is an example of how you can deploy TGI and expose it in a Kubernetes environment using a simple configuration file to define the model and GPU-labeled nodes.
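Rather than reproduce that chart’s values file, here is a minimal sketch of roughly what the resulting workload looks like: a Deployment pinned to GPU-labeled nodes, requesting one GPU and mounting the shared model volume. The image tag, node label, container port, cache path, and model ID are assumptions, and tokens or secrets for gated models are omitted.

```yaml
# Minimal TGI Deployment sketch (not the Helm chart's exact output).
# Image tag, node label, port, cache path, and model ID are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      nodeSelector:
        gpu: "true"                      # assumption: label applied to GPU workers
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest
          args: ["--model-id", "google/gemma-2b"]   # assumption: example model
          ports:
            - containerPort: 80          # assumption: TGI's default HTTP port
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models
              mountPath: /data           # assumption: TGI's model cache location
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-weights     # shared weights from the storage example
```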

Data and storage

Running models and GenAI architectures requires several types of data storage in addition to the raw data sets fed into the training process.

For one, it is not realistic to copy large language models (LLMs), which can be gigabytes to terabytes in size, to every node after downloading them. A better approach is shared storage, such as a Network File System (NFS)-backed file system. The model is downloaded once into shared storage and can then be mounted by any node that needs to load it onto an available GPU.
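For clusters without a dynamic RWX provisioner, a statically provisioned NFS volume can back a shared claim like the one sketched earlier. The server address and export path below are placeholders.

```yaml
# Statically provisioned NFS volume for shared model storage.
# The NFS server address and export path are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-store
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadWriteMany                  # many nodes can mount the same model files
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nfs.example.internal     # placeholder NFS server
    path: /exports/models            # placeholder export path
```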

Another potential use case for data storage is running a RAG framework to supplement running models with external or newer sources. RAG frameworks often use vectorized data and vector databases, and a block storage-based Persistent Volume (PV) and Persistent Volume Claim (PVC) in Kubernetes can improve vector database availability.

Finally, the application using the model may need its own persistence for user data, sessions, and more. This depends heavily on the application and its data storage requirements. For example, a chatbot may persist a user’s recent queries so their conversation history can be referenced later.

RAG frameworks

Another use case is implementing RAG or a context augmentation framework using tools like LlamaIndex or LangChain. Deployed base models are typically trained on datasets frozen at a given point in time, and RAG or context augmentation can supply additional context to an LLM. These frameworks add a step to the query process that passes newly acquired data, along with the user query, to the LLM.

For example, an organization with a model trained on corporate documents may implement a RAG framework so that documents created after training can provide context for a query. Data in a RAG framework is typically loaded, split into smaller chunks, converted into vector embeddings, and stored in a vector database such as Chroma, PGVector, or Milvus. These embeddings can represent various data types, including text, audio, and images.

RAG frameworks retrieve relevant information from the embeddings, and the model uses it as additional context in its generative response. Vector data is typically far smaller than the model itself, but it still benefits from persistent storage.

Using Kubernetes to run stateful workloads is nothing new. For example, the PGVector extension can be added to a Postgres cluster deployed via CloudNativePG, with PVCs backing its storage. PVCs provide highly available persistent storage locations for databases, allowing data to follow workloads around a Kubernetes cluster. This can be important for the integrity of the RAG framework in the event of failures or pod lifecycle events.
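As a hedged sketch, and assuming the Postgres image in use bundles the pgvector extension, a CloudNativePG Cluster resource along these lines requests per-instance persistent storage and enables the extension at bootstrap. Field names should be checked against the CloudNativePG version you run.

```yaml
# Sketch of a CloudNativePG Cluster with pgvector enabled at bootstrap.
# Assumes the chosen Postgres image ships the pgvector extension; verify
# field names against your CloudNativePG version.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: rag-postgres
spec:
  instances: 3                     # three instances for high availability
  storage:
    size: 50Gi                     # each instance gets its own PVC
  bootstrap:
    initdb:
      database: rag
      owner: rag
      postInitApplicationSQL:
        - CREATE EXTENSION IF NOT EXISTS vector;
```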

Conclusion

Kubernetes provides a GenAI toolbox that supports compute scheduling, third-party operators, storage integrations, GPU enablement, security frameworks, monitoring and logging, application lifecycle management, and more. These are all significant tactical advantages of using Kubernetes as a platform for GenAI.

When you use Kubernetes as a platform for your GenAI application, the benefits it offers to operators, engineers, DevOps professionals, and application developers ultimately extend to the deployment and usability of GenAI infrastructure and applications.

