Building a RAG chat-based assistant on Amazon EKS Auto Mode and NVIDIA NIMs


Chat-based assistants powered by Retrieval Augmented Generation (RAG) are transforming customer support, internal help desks, and enterprise search by delivering fast, accurate answers grounded in your own data. With RAG, you can use a ready-to-deploy foundation model (FM) and enrich it with your own data, making responses relevant and context-aware without the need for fine-tuning or retraining. Running these chat-based assistants on Amazon Elastic Kubernetes Service (Amazon EKS) gives you the flexibility to use a variety of FMs while retaining full control over your data and infrastructure.

Amazon EKS scales with your workload and is cost-efficient for both steady and fluctuating demand. Because EKS is certified Kubernetes-conformant, it is compatible with existing applications running in a standard Kubernetes environment, whether hosted in on-premises data centers or in public clouds. For your data plane, you can take advantage of a wide range of compute options, including CPUs, GPUs, AWS purpose-built AI chips (AWS Inferentia and AWS Trainium), and Arm-based CPUs (AWS Graviton), to match performance and cost requirements. This flexibility makes Amazon EKS an ideal candidate for running heterogeneous workloads, because you can combine different compute types within the same cluster to optimize both performance and cost efficiency.

NVIDIA NIM microservices are containerized services that deploy and serve FMs and integrate with AWS services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon EKS, and Amazon SageMaker. NIM microservices are distributed as Docker containers and are available through the NVIDIA NGC Catalog. Deploying GPU-accelerated models manually requires you to select and configure runtimes such as PyTorch or TensorFlow, set up inference servers such as Triton, implement model optimizations, and troubleshoot compatibility issues. This takes engineering time and expertise. NIM microservices eliminate this complexity by automating these technical decisions and configurations for you.

The NVIDIA NIM Operator is a Kubernetes management tool that facilitates the operation of model-serving components and services. It handles large language models (LLMs), embedders, and other model types through NVIDIA NIM microservices within Kubernetes environments. The Operator streamlines microservice management through three primary custom resources. First, the NIMCache resource facilitates model downloading from NGC and network storage persistence. This enables multiple microservice instances to share a single cached model, improving microservice startup time. Second, the NIMService resource manages individual NIM microservices, creating Kubernetes deployments within specified namespaces. Third, the NIMPipeline resource functions as an orchestrator for multiple NIM service resources, allowing coordinated management of service groups. This architecture enables efficient operation and lifecycle management, with particular emphasis on reducing inference latency through model caching and supporting automated scaling capabilities.

NVIDIA NIM, coupled with the NVIDIA NIM Operator, provides a streamlined solution to the deployment complexities described earlier. In this post, we demonstrate the implementation of a practical RAG chat-based assistant using a comprehensive stack of modern technologies. The solution uses NVIDIA NIM microservices for both LLM inference and text embedding, with the NIM Operator handling their deployment and management. The architecture incorporates Amazon OpenSearch Serverless to store and query high-dimensional vector embeddings for similarity search.

The underlying Kubernetes infrastructure of the solution is provided by EKS Auto Mode, which supports GPU-accelerated Amazon Machine Images (AMIs) out of the box. These images include the NVIDIA device plugin, the NVIDIA container toolkit, precompiled NVIDIA kernel drivers, the Bottlerocket operating system, and Elastic Fabric Adapter (EFA) networking. You can use Auto Mode with Accelerated AMIs to spin up GPU instances, without manually installing and configuring GPU software components. Simply specify GPU-based instance types when creating Karpenter NodePools, and EKS Auto Mode will launch GPU-ready worker nodes to run your accelerated workloads.

Solution overview

The following architecture diagram shows how NVIDIA NIM microservices running on Amazon EKS Auto Mode power our RAG chat-based assistant solution. The design combines GPU-accelerated model serving with vector search in Amazon OpenSearch Serverless, using the NIM Operator to manage model deployment and caching through persistent Amazon Elastic File System (Amazon EFS) storage.

Architectural diagram showing NVIDIA NGC integration with AWS services including EKS, NIM Cache, and GPU NodePool

Solution diagram (numbers indicate steps in the solution walkthrough section)

The solution follows these high-level steps:

  1. Create an EKS cluster
  2. Set up Amazon OpenSearch Serverless
  3. Create an EFS file system and set up necessary permissions
  4. Create Karpenter GPU NodePool
  5. Install NVIDIA Node Feature Discovery (NFD) and NIM Operator
  6. Create nim-service namespace and NVIDIA secrets
  7. Create NIMCaches
  8. Create NIMServices

Solution walkthrough

In this section, we walk through the implementation of this RAG chat-based assistant solution step by step. We create an EKS cluster, configure Amazon OpenSearch Serverless and EFS storage, set up GPU-enabled nodes with Karpenter, deploy NVIDIA components for model serving, and finally integrate a chat-based assistant client built with Gradio and LangChain. This end-to-end setup demonstrates how to combine LLM inference on Kubernetes with vector search capabilities, forming the foundation for a scalable, production-grade system once monitoring, auto scaling, and reliability features are added.

Prerequisites

To begin, ensure you have installed and set up the following required tools:

  1. AWS CLI (version aws-cli/2.27.11 or later)
  2. kubectl
  3. eksctl (use version v0.195.0 or later to support Auto Mode)
  4. Helm

These tools need to be properly configured according to the Amazon EKS setup documentation.

Clone the reference repository and cd into the root folder:

git clone https://github.com/aws-samples/sample-rag-chatbot-nim
cd sample-rag-chatbot-nim/infra

Environment setup

You need an NGC API key to authenticate and download NIM models. To generate the key, you can enroll (for free) in the NVIDIA Developer Program and then follow the NVIDIA guidelines.

Next, set up a few environment variables (replace the values with your information):

export CLUSTER_NAME=automode-nims-blog-cluster
export AWS_DEFAULT_REGION={your region}
export NVIDIA_NGC_API_KEY={your key}

Pattern deployment

To deploy the solution, complete the steps in the following sections.

Create an EKS cluster

Deploy the EKS cluster using EKS Auto Mode, with eksctl:

CHATBOT_SA_NAME=${CLUSTER_NAME}-client-service-account
IAM_CHATBOT_ROLE=${CLUSTER_NAME}-client-eks-pod-identity-role

cat << EOF | eksctl create cluster -f -
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ${CLUSTER_NAME}
  region: ${AWS_DEFAULT_REGION}

autoModeConfig:
  enabled: true

iam:
  podIdentityAssociations:
    - namespace: default
      serviceAccountName: ${CHATBOT_SA_NAME}
      createServiceAccount: true
      roleName: ${IAM_CHATBOT_ROLE}
      permissionPolicy:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - "aoss:*"
            Resource: "*"

addons:
- name: aws-efs-csi-driver
  useDefaultPodIdentityAssociations: true
EOF

Pod Identity Associations connect Kubernetes service accounts to AWS Identity and Access Management (IAM) roles, allowing pods to access AWS services securely. In this configuration, a service account will be created and associated with an IAM role, granting it full permissions to OpenSearch Serverless (in a production environment, restrict privileges according to the principle of least privilege).
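If you want to tighten this policy, the permissionPolicy block in the preceding eksctl configuration can be scoped to the OpenSearch Serverless data-plane action and, once the collection exists, to its specific ARN. The following is a hedged sketch; the account ID and collection ID are placeholders you would substitute with your own values:

permissionPolicy:
  Version: "2012-10-17"
  Statement:
    - Effect: Allow
      Action:
        # aoss:APIAccessAll is the data-plane action OpenSearch Serverless checks;
        # fine-grained index permissions are still governed by the data access policy created later.
        - "aoss:APIAccessAll"
      Resource: "arn:aws:aoss:${AWS_DEFAULT_REGION}:<account-id>:collection/<collection-id>"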

NIMCaches require volume AccessMode: ReadWriteMany. Amazon Elastic Block Store (Amazon EBS) volumes provided by EKS Auto Mode aren’t suitable because they support ReadWriteOnce only and can’t be mounted by multiple nodes. Storage options that support AccessMode: ReadWriteMany include Amazon EFS, as shown in this example, or Amazon FSx for Lustre, which offers higher performance for workloads with greater throughput or latency requirements.

The preceding command takes a few minutes to complete. When it finishes, eksctl updates your kubeconfig to point to the new cluster. You can validate that the cluster is up and running and that the EFS CSI driver add-on is installed by entering the following command:

kubectl get pods --all-namespaces

Expected output:

NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
kube-system   efs-csi-controller-55b8dd6f57-wpzbg   3/3     Running   0          3m7s
kube-system   efs-csi-controller-55b8dd6f57-z2gzc   3/3     Running   0          3m7s
kube-system   efs-csi-node-6k5kz                    3/3     Running   0          3m7s
kube-system   efs-csi-node-pvv2v                    3/3     Running   0          3m7s
kube-system   metrics-server-6d67d68f67-7x4tg       1/1     Running   0          6m15s
kube-system   metrics-server-6d67d68f67-l4xv6       1/1     Running   0          6m15s

Set up Amazon OpenSearch Serverless

A vector database stores and searches through numerical representations of text (embeddings). Such a component is essential in RAG chat-based assistant architectures because it facilitates finding relevant information related to a user question based on semantic similarity rather than exact keyword matches.

We use Amazon OpenSearch Service as the vector database. OpenSearch Service provides a managed solution for deploying, operating, and scaling OpenSearch clusters within AWS Cloud infrastructure. As part of this service, Amazon OpenSearch Serverless offers an on-demand configuration that automatically handles scaling to match your application’s requirements.

First, use AWS PrivateLink to create a private connection between the cluster’s Amazon Virtual Private Cloud (Amazon VPC) and Amazon OpenSearch Serverless. This keeps traffic within the AWS network and avoids public internet routing.

Enter the following commands to retrieve the cluster’s virtual private cloud (VPC) ID, CIDR block range, and subnet IDs, and store them in corresponding environment variables:

VPC_ID=$(aws eks describe-cluster \
    --name $CLUSTER_NAME \
    --query "cluster.resourcesVpcConfig.vpcId" \
    --output text \
    --region=$AWS_DEFAULT_REGION) && \
CIDR_RANGE=$(aws ec2 describe-vpcs \
    --vpc-ids $VPC_ID \
    --query "Vpcs[].CidrBlock" \
    --output text \
    --region $AWS_DEFAULT_REGION) && \
SUBNET_IDS=($(aws eks describe-cluster \
    --name $CLUSTER_NAME \
    --query "cluster.resourcesVpcConfig.subnetIds[]" \
    --region $AWS_DEFAULT_REGION \
    --output text))

Use the following code to create a security group for OpenSearch Serverless in the VPC, add an inbound rule to the security group allowing HTTPS traffic (port 443) from your VPC’s CIDR range, and create an OpenSearch Serverless VPC endpoint connected to the subnets and security group:

AOSS_SECURITY_GROUP_ID=$(aws ec2 create-security-group \
    --group-name ${CLUSTER_NAME}-AOSSSecurityGroup \
    --description "${CLUSTER_NAME} AOSS security group" \
    --vpc-id $VPC_ID \
    --region $AWS_DEFAULT_REGION \
    --query 'GroupId' \
    --output text) && \
aws ec2 authorize-security-group-ingress \
    --group-id $AOSS_SECURITY_GROUP_ID \
    --protocol tcp \
    --port 443 \
    --region $AWS_DEFAULT_REGION \
    --cidr $CIDR_RANGE && \
VPC_ENDPOINT_ID=$(aws opensearchserverless create-vpc-endpoint \
    --name ${CLUSTER_NAME}-aoss-vpc-endpoint \
    --subnet-ids "${SUBNET_IDS[@]}" \
    --security-group-ids $AOSS_SECURITY_GROUP_ID \
    --region $AWS_DEFAULT_REGION \
    --vpc-id $VPC_ID \
    --query 'createVpcEndpointDetail.id' \
    --output text)

In the following steps, create the security policies and the OpenSearch Serverless collection (a logical unit to store and organize documents).

  1. Create an encryption policy for the collection:
AOSS_COLLECTION_NAME=${CLUSTER_NAME}-collection
ENCRYPTION_POLICY_NAME=${CLUSTER_NAME}-encryption-policy
aws opensearchserverless create-security-policy \
    --name ${ENCRYPTION_POLICY_NAME}\
    --type encryption \
    --policy "{\"Rules\":[{\"ResourceType\":\"collection\",\"Resource\":[\"collection/${AOSS_COLLECTION_NAME}\"]}],\"AWSOwnedKey\":true}"

  2. Create the network policy that restricts access to the collection so that it’s reachable only through the VPC endpoint created earlier:
NETWORK_POLICY_NAME=${CLUSTER_NAME}-network-policy
aws opensearchserverless create-security-policy \
    --name ${NETWORK_POLICY_NAME} \
    --type network \
    --policy "[{\"Description\":\"Allow VPC endpoint access\",\"Rules\":[{\"ResourceType\":\"collection\",\"Resource\":[\"collection/${AOSS_COLLECTION_NAME}\"]}],\"SourceVPCEs\":[\"$VPC_ENDPOINT_ID\"]}]"

  3. Create the data access policy that grants the chat-based assistant’s IAM role permission to interact with indexes in the collection:
DATA_POLICY_NAME=${CLUSTER_NAME}-data-policy
IAM_CHATBOT_ROLE_ARN=$(aws iam get-role --role-name ${IAM_CHATBOT_ROLE} --query 'Role.Arn' --output text)
aws opensearchserverless create-access-policy \
    --name ${DATA_POLICY_NAME} \
    --type data \
    --policy "[{\"Rules\":[{\"ResourceType\":\"index\",\"Resource\":[\"index/${AOSS_COLLECTION_NAME}/*\"],\"Permission\":[\"aoss:CreateIndex\",\"aoss:DescribeIndex\",\"aoss:ReadDocument\",\"aoss:WriteDocument\",\"aoss:UpdateIndex\",\"aoss:DeleteIndex\"]}],\"Principal\":[\"${IAM_CHATBOT_ROLE_ARN}\"]}]"

  4. Create the OpenSearch Serverless collection itself:
AOSS_COLLECTION_ID=$(aws opensearchserverless create-collection \
    --name ${AOSS_COLLECTION_NAME} \
    --type VECTORSEARCH \
    --region ${AWS_DEFAULT_REGION} \
    --query 'createCollectionDetail.id' \
    --output text)
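The collection takes a few minutes to become active. Before moving on, you can optionally check its status with the AWS CLI; the following query returns ACTIVE when the collection is ready:

aws opensearchserverless batch-get-collection \
    --ids ${AOSS_COLLECTION_ID} \
    --region ${AWS_DEFAULT_REGION} \
    --query 'collectionDetails[0].status' \
    --output text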

Create EFS file system and set up necessary permissions

Create an EFS file system:

EFS_FS_ID=$(aws efs create-file-system \
    --region $AWS_DEFAULT_REGION \
    --performance-mode generalPurpose \
    --query 'FileSystemId' \
    --output text)

EFS requires mount targets, which are VPC network endpoints that connect your EKS nodes to the EFS file system. These mount targets must be reachable from your EKS worker nodes, and access is controlled using security groups.

  1. Execute the following command to set up the mount targets and configure the necessary security group rules:
EFS_SECURITY_GROUP_ID=$(aws ec2 create-security-group \
    --group-name ${CLUSTER_NAME}-EfsSecurityGroup \
    --description "${CLUSTER_NAME} EFS security group" \
    --vpc-id $VPC_ID \
    --region $AWS_DEFAULT_REGION \
    --query 'GroupId' \
    --output text) && \
aws ec2 authorize-security-group-ingress \
    --group-id $EFS_SECURITY_GROUP_ID \
    --protocol tcp \
    --port 2049 \
    --region $AWS_DEFAULT_REGION \
    --cidr $CIDR_RANGE && \
for subnet in "${SUBNET_IDS[@]}"; do
    aws efs create-mount-target \
        --file-system-id $EFS_FS_ID \
        --subnet-id $subnet \
        --security-groups $EFS_SECURITY_GROUP_ID \
        --region $AWS_DEFAULT_REGION 
done

  2. Create the StorageClass in Amazon EKS for Amazon EFS:
cat << EOF | kubectl apply -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: ${EFS_FS_ID}
  directoryPerms: "777"
EOF

  3. Validate the EFS storage class:
kubectl get storageclass efs

These are the expected results:

NAME   PROVISIONER       RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
efs    efs.csi.aws.com   Delete          Immediate           false                  9s

Create Karpenter GPU NodePool

To create the Karpenter GPU NodePool, enter the following code:

cat << EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-node-pool
spec:
  template:
    metadata:
      labels:
        type: karpenter
        NodeGroupType: gpu-node-pool
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      taints:
        - key: nvidia.com/gpu
          value: "Exists"
          effect: "NoSchedule"

      requirements:
        - key: "eks.amazonaws.com/instance-family"
          operator: In
          values: ["g5"]
        - key: "eks.amazonaws.com/instance-size"
          operator: In
          values: [ "2xlarge", "4xlarge", "8xlarge", "16xlarge", "12xlarge", "24xlarge"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]

  limits:
    cpu: "1000"
EOF

This NodePool is designed for GPU workloads using Amazon EC2 G5 instances, which feature NVIDIA A10G GPUs. The taint ensures that only workloads specifically designed for GPU usage are scheduled on these nodes, maintaining efficient resource utilization. In a production environment, you might also consider using Amazon EC2 Spot Instances to optimize costs.
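Because of the NoSchedule taint, any pod that should run on these nodes, including the NIM microservices deployed later, needs a matching toleration and a GPU resource request. The following is a minimal sketch of the relevant pod spec fields:

# Toleration matching the NodePool taint above (operator Exists tolerates any taint value)
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
# GPU request so Kubernetes (and Karpenter) place the pod on a GPU node
resources:
  limits:
    nvidia.com/gpu: 1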

Enter the following command to validate successful creation of the NodePool:

kubectl get nodepools

These are the expected results:

NAME              NODECLASS   NODES   READY   AGE
general-purpose   default     0       True    15m
gpu-node-pool     default     0       True    8s
system            default     2       True    15m

The gpu-node-pool NodePool was created and currently has 0 nodes. To inspect the existing nodes further, enter this command:

kubectl get nodes -o custom-columns=NAME:.metadata.name,READY:"status.conditions[?(@.type=='Ready')].status",OS-IMAGE:.status.nodeInfo.osImage,INSTANCE-TYPE:.metadata.labels.'node\.kubernetes\.io/instance-type'

This is the expected output:

NAME                  READY    OS-IMAGE                                           INSTANCE-TYPE
i-0b0c1cd3d744883cd   True     Bottlerocket (EKS Auto) 2025.4.26 (aws-k8s-1.32)   c6g.large
i-0e1f33e42fac76a09   True     Bottlerocket (EKS Auto) 2025.4.26 (aws-k8s-1.32)   c6g.large

There are two instances, launched by EKS Auto Mode with the non-accelerated Bottlerocket AMI variant (aws-k8s-1.32) and the CPU-only (non-GPU) c6g instance type.

Install NVIDIA NFD and NIM Operator

NFD is a Kubernetes add-on that detects available hardware features and system configuration. NFD and the NIM Operator are installed using Helm charts, each with its own custom resource definitions (CRDs).

  1. Before proceeding with installation, verify if related CRDs exist in your cluster:
# Check for NFD-related CRDs
kubectl get crds | grep nfd

# Check for NIM-related CRDs
kubectl get crds | grep nim

If these CRDs aren’t present, both commands will return no results.

  2. Add the Helm repos:
helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

  3. Install the NFD dependency for the NIM Operator:
helm install node-feature-discovery nfd/node-feature-discovery \
  --namespace node-feature-discovery \
  --create-namespace

  4. Validate the pods are up and CRDs were created:
kubectl get po -n node-feature-discovery

Expected output:

NAME                                             READY   STATUS    RESTARTS   AGE
node-feature-discovery-gc-5b65f7f5b6-q4hlr       1/1     Running   0          79s
node-feature-discovery-master-767dcc6cb8-6hc2t   1/1     Running   0          79s
node-feature-discovery-worker-sg852              1/1     Running   0          43s

kubectl get crds | grep nfd

Expected output:

nodefeaturegroups.nfd.k8s-sigs.io            2025-05-05T01:23:16Z
nodefeaturerules.nfd.k8s-sigs.io             2025-05-05T01:23:16Z
nodefeatures.nfd.k8s-sigs.io                 2025-05-05T01:23:16Z

  5. Install the NIM Operator:
helm install nim-operator nvidia/k8s-nim-operator \
  --namespace nim-operator \
  --create-namespace \
  --version v2.0.0

If you receive a “402 Payment Required” message when installing version v2.0.0 of the NIM Operator, as shown in the preceding code example, install version v1.0.1 instead.

  6. Validate the pod is up and CRDs were created:
kubectl get po -n nim-operator

Expected output:

NAME                                             READY   STATUS    RESTARTS   AGE
nim-operator-k8s-nim-operator-6d988f78df-h4nqn   1/1     Running   0          24s

kubectl get crds | grep nim

Expected output:

nimcaches.apps.nvidia.com                    2025-05-05T01:18:00Z
nimpipelines.apps.nvidia.com                 2025-05-05T01:18:00Z
nimservices.apps.nvidia.com                  2025-05-05T01:18:01Z

Create nim-service namespace and NVIDIA secrets

In this section, create the nim-service namespace and add two secrets containing your NGC API key.

  1. Create namespace and secrets:
kubectl create namespace nim-service
kubectl create secret -n nim-service docker-registry ngc-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password=$NVIDIA_NGC_API_KEY
kubectl create secret -n nim-service generic ngc-api-secret \
    --from-literal=NGC_API_KEY=$NVIDIA_NGC_API_KEY

  2. Validate secrets were created:
kubectl -n nim-service get secrets

The following is the expected result:

NAME             TYPE                             DATA   AGE
ngc-api-secret   Opaque                           1      13s
ngc-secret       kubernetes.io/dockerconfigjson   1      14s

ngc-secret is a Docker registry secret used to authenticate and pull NIM container images from NVIDIA’s NGC container registry. The Docker username is the literal string $oauthtoken (hence the single quotes in the command), and the password is your NGC API key.

ngc-api-secret is a generic secret used by the model puller init container to authenticate and download models from the same registry.

Create NIMCaches

RAG enhances chat applications by enabling AI models to access either internal domain-specific knowledge or external knowledge bases, reducing hallucinations and providing more accurate, up-to-date responses. In a RAG system, a knowledge base is created from domain-specific documents. These documents are sliced into smaller pieces of text. The text pieces and their generated embeddings are then uploaded to a vector database. Embeddings are numerical representations (vectors) that capture the meaning of text, where similar text content results in similar vector values. When questions are received from users, they’re also sent with their respective embeddings to the database for semantic similarity search. The database returns the closest matching chunks of text, which are used by an LLM to provide a domain-specific answer.

We use Meta’s llama-3-2-1b-instruct as the LLM and NVIDIA Retrieval QA E5 (embedqa-e5-v5) as the embedder.

This section covers the deployment of NIMCaches for storing both the LLM and embedder models. Local storage of these models speeds up pod initialization by eliminating the need for repeated downloads. Our llama-3-2-1b-instruct LLM, with 1B parameters, is a relatively small model and uses 2.5 GB of storage space. The storage requirements and initialization time increase when larger models are used. Although the initial setup of the LLM and embedder caches takes 10–15 minutes, subsequent pod launches will be faster because the models are already available in the cluster’s local storage.
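The nim-caches.yaml file in the repository defines one NIMCache per model. As a rough sketch of what such a manifest looks like, the following example for the LLM uses field names from the NIM Operator’s NIMCache API; the exact image tag, model settings, and sizes in the repository may differ:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      # Model puller image from the NGC catalog; the tag here is illustrative
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:latest
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
  storage:
    pvc:
      # Provision a shared PVC on the efs StorageClass created earlier
      create: true
      storageClass: efs
      size: 50Gi
      volumeAccessMode: ReadWriteMany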

Enter the following command:

kubectl apply -f nim-caches.yaml

This is the expected output:

nimcache.apps.nvidia.com/nv-embedqa-e5-v5 created
nimcache.apps.nvidia.com/meta-llama-3-2-1b-instruct created

NIMCaches will create PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs) to store the models, with STORAGECLASS efs:

kubectl get -n nim-service pv,pvc

The following is the expected output:

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                        STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
persistentvolume/pvc-5fa98625-ea65-4aef-99ff-ca14001afb47   50Gi       RWX            Delete           Bound    nim-service/nv-embedqa-e5-v5-pvc             efs                                      77s
persistentvolume/pvc-ab67e4dc-53df-47e7-95c8-ec6458a57a01   50Gi       RWX            Delete           Bound    nim-service/meta-llama-3-2-1b-instruct-pvc   efs                                      76s

NAME                                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/meta-llama-3-2-1b-instruct-pvc   Bound    pvc-ab67e4dc-53df-47e7-95c8-ec6458a57a01   50Gi       RWX            efs                             77s
persistentvolumeclaim/nv-embedqa-e5-v5-pvc             Bound    pvc-5fa98625-ea65-4aef-99ff-ca14001afb47   50Gi       RWX            efs                             77s

Enter the following to validate NIMCaches:

kubectl get nimcaches -n nim-service

This is the expected output once the caches are ready (the STATUS column is initially blank, shows InProgress for 10–15 minutes while the models download, and then changes to Ready):

NAME                         STATUS   PVC                              AGE
meta-llama-3-2-1b-instruct   Ready    meta-llama-3-2-1b-instruct-pvc   13m
nv-embedqa-e5-v5             Ready    nv-embedqa-e5-v5-pvc             13m

Create NIMServices

NIMServices are custom resources that manage NVIDIA NIM microservices. To deploy the LLM and embedder services, enter the following:

kubectl apply -f nim-services.yaml

The following is the expected output:

nimservice.apps.nvidia.com/meta-llama-3-2-1b-instruct created
nimservice.apps.nvidia.com/nv-embedqa-e5-v5 created
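For reference, a NIMService manifest along the following lines ties each microservice to its cache, a GPU request, and a ClusterIP service. This is a hedged sketch based on the NIM Operator’s NIMService API; the actual nim-services.yaml in the repository may differ in image tags and additional settings such as tolerations:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: latest            # illustrative tag
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct   # reuse the model cached by the NIMCache
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000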

Validate the NIMServices:

kubectl get nimservices -n nim-service

The following is the expected output:

NAME                         STATUS   AGE
meta-llama-3-2-1b-instruct   Ready    5m25s
nv-embedqa-e5-v5             Ready    5m24s

Our models are stored on an EFS file system, which the pods mount through a PVC. That translates to faster pod startup times. In fact, notice in the preceding example that the NIMServices are ready in approximately 5 minutes. This time includes launching GPU nodes through Karpenter and pulling and starting the container images.

Compared to the 10–15 minutes required for internet-based model downloads, as experienced during the NIMCaches deployment, loading models from the local cache reduces startup time considerably, enhancing the overall system scaling speed. If you need higher-performance storage, you could explore alternatives such as Amazon FSx for Lustre.

Enter the following command to check the nodes again:

kubectl get nodes -o custom-columns=NAME:.metadata.name,READY:"status.conditions[?(@.type=='Ready')].status",OS-IMAGE:.status.nodeInfo.osImage,INSTANCE-TYPE:.metadata.labels.'node\.kubernetes\.io/instance-type'

The following is the expected output:

NAME                  READY   OS-IMAGE                                                          INSTANCE-TYPE
i-0150ecedccffcc17f   True    Bottlerocket (EKS Auto) 2025.4.26 (aws-k8s-1.32)                  c6g.large
i-027bf5419d63073cf   True    Bottlerocket (EKS Auto) 2025.4.26 (aws-k8s-1.32)                  c5a.large
i-0a1a1f39564fbf125   True    Bottlerocket (EKS Auto, Nvidia) 2025.4.21 (aws-k8s-1.32-nvidia)   g5.2xlarge
i-0d418bd8429dd12cd   True    Bottlerocket (EKS Auto, Nvidia) 2025.4.21 (aws-k8s-1.32-nvidia)   g5.2xlarge

Karpenter launched two new GPU instances to support the NIMServices, using the accelerated Bottlerocket AMI variant (aws-k8s-1.32-nvidia). The number and type of instances launched might vary depending on Karpenter’s algorithm, which takes into consideration parameters such as instance availability and cost.

Confirm that the NIMService STATUS is Ready before progressing further.
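Optionally, you can smoke-test the LLM endpoint from inside the cluster before deploying the client. NIM LLM microservices expose an OpenAI-compatible API; the model name and the scratch pod name in the following sketch are assumptions, so if the request fails, query GET /v1/models on the service to list the served model name:

kubectl run nim-smoke-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s http://meta-llama-3-2-1b-instruct.nim-service.svc.cluster.local:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"meta/llama-3.2-1b-instruct","messages":[{"role":"user","content":"Hello"}],"max_tokens":32}'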

Chat-based assistant client

We now deploy a Python client that implements the chat-based assistant interface using the Gradio and LangChain libraries. Gradio creates the web interface and chat components, handling the frontend presentation. LangChain connects the various components and implements RAG through multiple services in our EKS cluster. Meta’s llama-3-2-1b-instruct serves as the base language model, and nv-embedqa-e5-v5 creates text embeddings. OpenSearch acts as the vector store, managing these embeddings and enabling similarity search. This setup allows the chat-based assistant to retrieve relevant information and generate contextual responses.

Sequence diagram showing question-answering workflow with document upload process

  1. Enter the following commands to deploy the client, which is hosted as a container image in the Amazon Elastic Container Registry (Amazon ECR) Public Gallery (the application’s source files are available in the client folder of the cloned repository):
AOSS_INDEX=${CLUSTER_NAME}-index
CHATBOT_CONTAINER_IMAGE=public.ecr.aws/h6c7e9p3/aws-rag-chatbot-eks-nims:1.0

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: rag-chatbot
  labels:
    app: rag-chatbot
spec:
  ports:
  - port: 7860
    protocol: TCP
  selector:
    app: rag-chatbot
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-chatbot
spec:
  selector:
    matchLabels:
      app: rag-chatbot
  template:
    metadata:
      labels:
        app: rag-chatbot
    spec:
      serviceAccountName: ${CHATBOT_SA_NAME}
      containers:
      - name: rag-chatbot
        image: ${CHATBOT_CONTAINER_IMAGE}
        ports:
        - containerPort: 7860
          protocol: TCP
        env:
        - name: AWS_DEFAULT_REGION
          value: ${AWS_DEFAULT_REGION}
        - name: OPENSEARCH_COLLECTION_ID
          value: ${AOSS_COLLECTION_ID}
        - name: OPENSEARCH_INDEX
          value: ${AOSS_INDEX}
        - name: LLM_URL
          value: "http://meta-llama-3-2-1b-instruct.nim-service.svc.cluster.local:8000/v1"
        - name: EMBEDDINGS_URL
          value: "http://nv-embedqa-e5-v5.nim-service.svc.cluster.local:8000/v1"
EOF

  2. Check the client pod status:
kubectl get pods -l app=rag-chatbot

The following is the example output:

NAME                           READY   STATUS    RESTARTS   AGE
rag-chatbot-6678cd95cb-4mwct   1/1     Running   0          60s

  3. Port-forward the client’s service:
kubectl port-forward service/rag-chatbot 7860:7860 &

  4. Open a browser window at http://127.0.0.1:7860.

In the following screenshot, we prompted the chat-based assistant about a topic that isn’t in its knowledge base yet: “What is Amazon Nova Canvas.”

First prompt

The chat-based assistant can’t find information on the topic and can’t formulate a proper answer.

  5. Download the file from https://docs.aws.amazon.com/pdfs/ai/responsible-ai/nova-canvas/nova-canvas.pdf and upload it through the client UI by switching to the Document upload tab in the top left, as shown in the following screenshot. The client generates embeddings for the document and uploads them to OpenSearch Serverless.

Document upload

The expected result is nova-canvas.pdf appearing in the list of uploaded files, as shown in the following screenshot.

Document uploaded

  6. Wait 15–30 seconds for OpenSearch Serverless to process and index the data. Ask the same question, “What is Amazon Nova Canvas,” and you will receive a different answer, as shown in the following screenshot.

Final answer

Cleanup

To clean up the cluster and the EFS resources created so far, enter the following command:

aws efs describe-mount-targets \
    --region $AWS_DEFAULT_REGION \
    --file-system-id $EFS_FS_ID \
    --query 'MountTargets[*].MountTargetId' \
    --output text \
    | xargs -n1 aws efs delete-mount-target \
        --region $AWS_DEFAULT_REGION \
        --mount-target-id
 

Wait approximately 30 seconds for the mount targets to be removed, then enter the following commands:

aws efs delete-file-system --file-system-id $EFS_FS_ID --region $AWS_DEFAULT_REGION
eksctl delete cluster --name=$CLUSTER_NAME --region $AWS_DEFAULT_REGION

To delete the OpenSearch Serverless collection and policies, enter the following commands:

aws opensearchserverless delete-collection \
    --id ${AOSS_COLLECTION_ID}

aws opensearchserverless delete-security-policy \
    --name ${ENCRYPTION_POLICY_NAME} \
    --type encryption
    
aws opensearchserverless delete-security-policy \
    --name ${NETWORK_POLICY_NAME} \
    --type network

aws opensearchserverless delete-access-policy \
    --name ${DATA_POLICY_NAME} \
    --type data
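The OpenSearch Serverless VPC endpoint and the two security groups created during the walkthrough aren’t removed by the preceding commands. The following is a hedged sketch for removing them; if eksctl can’t fully delete the cluster’s VPC because these resources still exist, remove them and retry the cluster deletion:

aws opensearchserverless delete-vpc-endpoint \
    --id ${VPC_ENDPOINT_ID} \
    --region $AWS_DEFAULT_REGION

aws ec2 delete-security-group \
    --group-id $AOSS_SECURITY_GROUP_ID \
    --region $AWS_DEFAULT_REGION

aws ec2 delete-security-group \
    --group-id $EFS_SECURITY_GROUP_ID \
    --region $AWS_DEFAULT_REGION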

Conclusion

In this post, we showed how to deploy a RAG-enabled chat-based assistant on Amazon EKS, using NVIDIA NIM microservices, integrating an LLM for text generation, an embedding model, and Amazon OpenSearch Serverless for vector storage. Using EKS Auto Mode with GPU-accelerated AMIs, we streamlined our deployment by automating the setup of GPU infrastructure. We specified GPU-based instance types in our Karpenter NodePools, and the system automatically provisioned worker nodes with all necessary NVIDIA components, including device plugins, container toolkit, and kernel drivers. The implementation demonstrated the effectiveness of RAG, with the chat-based assistant providing informed responses when accessing relevant information from its knowledge base. This architecture showcases how Amazon EKS can streamline the deployment of AI solutions, maintaining production-grade reliability and scalability.

As a challenge, try enhancing the chat-based assistant application by implementing chat history functionality to preserve context across conversations. This allows the LLM to reference previous exchanges and provide more contextually relevant responses. To further learn how to run artificial intelligence and machine learning (AI/ML) workloads on Amazon EKS, check out our EKS best practices guide for running AI/ML workloads, join one of our Get Hands On with Amazon EKS event series, and visit AI on EKS deployment-ready blueprints.


About the authors

Riccardo Freschi is a Senior Solutions Architect at AWS who specializes in Modernization. He helps partners and customers transform their IT landscapes by designing and implementing modern cloud-native architectures on AWS. His focus areas include container-based applications on Kubernetes, cloud-native development, and establishing modernization strategies that drive business value.


Christina Andonov is a Sr. Specialist Solutions Architect at AWS, helping customers run AI workloads on Amazon EKS with open source tools. She’s passionate about Kubernetes and known for making complex concepts easy to understand.


