This tutorial walks you through building a Retrieval-Augmented Generation (RAG) chatbot that answers questions using your own documents. You will use Managed Service for PostgreSQL with the pgvector extension to store and search text embeddings, JupyterLab to prepare the knowledge base and run the retrieval logic, and a Serverless AI LLM endpoint (vLLM with an OpenAI-compatible API) serving the Qwen3-0.6B model for generation. You run the full RAG pipeline from the notebook, calling the endpoint for each answer. By the end, you will have a working example of an electronics store assistant that uses product catalogs and policies to answer customer questions.

Costs

Nebius AI Cloud charges you for the resources you create in this tutorial: the Managed Service for PostgreSQL cluster, the VM that runs JupyterLab, and the serverless AI endpoint.

Prerequisites

  • Make sure that you are in a group that has at least the editor role within your tenant; for example, the default editors group. You can check this in the Administration → IAM section of the web console.
  • Create resources in a project in one of the following regions: eu-north1, eu-west1, us-central1 or private eu-north2.
  • In the web console, go to Administration → Limits and make sure that, in the region you use, you have at least one NVIDIA® Hopper® H200 GPU for regular VMs without reservations and one virtual machine under Compute, and one allocation under Virtual Private Cloud. Increase the quotas if needed.

Steps

Create a Managed Service for PostgreSQL cluster

  1. In the web console, go to Managed Service for PostgreSQL.
  2. Click Create cluster and configure the cluster:
    • Enter a name.
    • In Access, select Public and private so that JupyterLab can connect to the cluster from the internet.
    • In Resources, choose a preset (for example, 4 vCPUs – 16 GiB RAM) and set the storage size.
    • In Database, set the database name, username and password. For example: database rag-example, user rag_example_user and a strong password.
  3. Click Create cluster and wait until it is ready.
  4. Open the cluster page. On the Cluster overview tab, click Copy endpoint URL and select Public RW endpoint URL to copy the connection host. In the General block, you can copy Bootstrap database and Username if needed. You will need the endpoint host, database name, username and password in the notebook.
Managed Service for PostgreSQL clusters have the pgvector extension available. You will enable it in the notebook.

Deploy the JupyterLab application

  1. Deploy JupyterLab on a VM:
    1. In the web console, go to Applications.
    2. Find JupyterLab by searching for it or browsing by category, then open the application page.
    3. Click Deploy on VM. The creation page for a container over VM opens.
    4. Securely save the generated token. You will need the token to access the application UI.
    5. Set computing resources (for example, Non-GPU Intel Ice Lake and 4 vCPUs – 16 GiB RAM) and local storage size.
    6. In Access, add a username and SSH key.
    7. Click Create container over VM and wait until the VM is running. It takes about five minutes.
  2. Open the application UI:
    1. In the sidebar, go to Compute → Containers over VMs.
    2. Open the VM page and then click the web UI link to connect to JupyterLab. Open a new notebook there. You will run all the code below in this notebook.

Prepare the knowledge base in JupyterLab

  1. In JupyterLab, open a terminal and install dependencies:
    pip install psycopg2-binary sentence-transformers
    
  2. Create three Markdown files in the notebook’s working directory: laptop_catalog.md, headphones_catalog.md, and return_policy.md. Use sections separated by ## headers. Full example contents, in that order (laptop entries, then headphone entries, then return rules):
    ## Dell Inspiron 15
    - Type: Laptop
    - Price: $849.99
    - RAM: 16GB DDR4
    - Storage: 512GB SSD
    - CPU: Intel Core i7-1255U
    - Display: 15.6" FHD
    - Battery Life: Up to 8 hours
    - Return Policy: 14 days unopened, 7 days if opened (10% restocking fee)
    
    ## Lenovo IdeaPad Slim 5
    - Type: Laptop
    - Price: $699.99
    - RAM: 8GB DDR4
    - Storage: 256GB SSD
    - CPU: AMD Ryzen 5 5500U
    - Display: 14" FHD
    - Battery Life: Up to 10 hours
    - Return Policy: 14 days unopened only
    
    ## Apple MacBook Air M2
    - Type: Laptop
    - Price: $1,199.00
    - RAM: 8GB Unified Memory
    - Storage: 512GB SSD
    - CPU: Apple M2
    - Display: 13.6" Retina
    - Battery Life: Up to 18 hours
    - Return Policy: 7 days unopened only
    
    ## Sony WH-1000XM5
    - Type: Headphones
    - Price: $399.99
    - Style: Over-Ear
    - Features: Noise Cancelling, Wireless, 30-hour Battery Life
    - Return Policy: 14 days with packaging
    
    ## Apple AirPods Pro (2nd Gen)
    - Type: Headphones
    - Price: $249.00
    - Style: In-Ear
    - Features: Active Noise Cancellation, Transparency Mode, Wireless Charging
    - Return Policy: 7 days unopened
    
    ## JBL Tune 510BT
    - Type: Headphones
    - Price: $49.95
    - Style: On-Ear
    - Features: Wireless, 40-hour Battery, Fast Charging
    - Return Policy: 30 days (opened or unopened)
    
    ## General Return Rules:
    - Proof of purchase is required for all returns.
    - All returns must be initiated within the stated return window.
    
    ## Return Electronics (Laptops, Phones, Tablets):
    - Unopened: Return within 14 days for full refund.
    - Opened: Return within 7 days with 10% restocking fee.
    
    ## Return Accessories (Headphones, Chargers, Cases):
    - Varies by product. See specific item entry.
    - Generally:
      - High-end headphones: 14 days with packaging.
      - Budget headphones: 30 days regardless of packaging.
    
    ## Return Refund Timeline:
    - Refunds are processed within 5 business days after return approval.
    
    ## Non-Returnable Items:
    - Software licenses, gift cards, and opened hygiene products.
    
  3. In the notebook, run the following in order:
    1. Connect to the database and create the table. Replace <host>, <dbname>, <user> and <password> with your Managed Service for PostgreSQL cluster connection details: use Public RW endpoint URL as the hostname and the Bootstrap database, Username and password from the General block. The all-MiniLM-L6-v2 model produces 384-dimensional vectors. If you use another model, change vector(384) in the table definition.
      import psycopg2
      
      conn = psycopg2.connect(
          dbname="<dbname>",
          user="<user>",
          password="<password>",
          host="<host>",
          port=5432,
          sslmode="require"
      )
      cur = conn.cursor()
      
      cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
      conn.commit()
      
      cur.execute("""
          CREATE TABLE IF NOT EXISTS documents (
              id SERIAL PRIMARY KEY,
              content TEXT,
              embedding vector(384)
          )
      """)
      conn.commit()
      
    2. Load the embedding model and define helpers.
      from sentence_transformers import SentenceTransformer
      
      model = SentenceTransformer('all-MiniLM-L6-v2')
      
      def get_embedding(text):
          return model.encode(text).tolist()
      
      def add_document(content):
          embedding = get_embedding(content)
          cur.execute("INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)", (content, embedding))
      
      def chunk_text(text):
          raw_chunks = text.split("## ")
          return ["## " + chunk.strip() for chunk in raw_chunks if chunk.strip()]
      
      def get_chunks_from_docs(*filenames):
          for filename in filenames:
              with open(filename, 'r', encoding='utf-8') as f:
                  text = f.read()
                  for chunk in chunk_text(text):
                      add_document(chunk)
              conn.commit()
      
    3. Ingest the documents:
      get_chunks_from_docs('laptop_catalog.md', 'headphones_catalog.md', 'return_policy.md')
      
    4. Define retrieval and prompt building:
      def search_documents(query, limit=5):
          query_embedding = get_embedding(query)
          cur.execute("""
              SELECT content, embedding <=> %s::vector AS distance
              FROM documents
              ORDER BY distance
              LIMIT %s
          """, (query_embedding, limit))
          return [f"{i}. {content}" for i, (content, _) in enumerate(cur.fetchall(), 1)]
      
      def build_prompt(query, retrieved_chunks):
          context = "\n\n".join(retrieved_chunks)
          return f"""Answer the following question based on the context below.
      
      Context:
      {context}
      
      Question: {query}
      
      Answer:"""
      
    Your knowledge base is now ready for RAG queries.
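Before moving on, it can help to sanity-check the two mechanics the pipeline relies on without touching the database: `chunk_text` splits on `## ` headers, and the pgvector `<=>` operator orders results by cosine distance. A self-contained sketch (the sample text and vectors below are made up for illustration):

```python
import math

# Same chunking logic as in the notebook, checked offline.
def chunk_text(text):
    raw_chunks = text.split("## ")
    return ["## " + chunk.strip() for chunk in raw_chunks if chunk.strip()]

sample = "## Item A\n- Price: $10\n\n## Item B\n- Price: $20\n"
chunks = chunk_text(sample)
print(len(chunks))  # 2 chunks, one per "## " section

# The pgvector <=> operator computes cosine distance (1 - cosine similarity).
# A pure-Python equivalent, useful for reasoning about result ordering:
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

query = [1.0, 0.0]
near = [0.9, 0.1]   # points mostly the same way as the query
far = [0.0, 1.0]    # orthogonal to the query
assert cosine_distance(query, near) < cosine_distance(query, far)
```

Because `ORDER BY distance` sorts ascending, the chunks most similar to the query come back first.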

Deploy a serverless LLM endpoint

The endpoint exposes an OpenAI-compatible API that you will call from your JupyterLab notebook.
  1. In the web console, go to AI Services → Endpoints and click Create endpoint.
  2. Under Endpoint settings → Image path, enter vllm/vllm-openai:latest.
  3. Under Ports, set the container port to 8000.
  4. Expand Advanced settings:
    • In Entrypoint command, enter python3 -m vllm.entrypoints.openai.api_server.
    • In Arguments, enter --model Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8000.
  5. Under Computing resources, choose With GPU, then select a platform (for example, NVIDIA® H200 NVLink) and a preset (for example, 1 GPU — 16 vCPUs — 200 GiB RAM).
  6. Click Create and wait until the endpoint status is Running. It may take five minutes.
  7. Open the endpoint. In the Network section, copy the public endpoint value, which has the http://<IP_address>:8000 format.
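The notebook code expects the endpoint address without a trailing slash. A small helper (hypothetical, not part of the Nebius tooling) that normalizes whatever you paste from the console:

```python
def normalize_api_base(raw):
    """Normalize a copied endpoint value into an API base URL.

    Strips surrounding whitespace and any trailing slash, and adds the
    http:// scheme if it is missing, so f"{base}/v1/chat/completions"
    produces a valid URL.
    """
    base = raw.strip().rstrip("/")
    if not base.startswith(("http://", "https://")):
        base = "http://" + base
    return base

print(normalize_api_base("203.0.113.10:8000/"))
# http://203.0.113.10:8000
```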

Call the LLM from the notebook

  1. Add the following code to your JupyterLab notebook to call the deployed endpoint. Replace <ENDPOINT_IP> with the endpoint address from Deploy a serverless LLM endpoint.
    import requests
    import re
    
    API_BASE = "http://<ENDPOINT_IP>"
    MODEL_ID = "Qwen/Qwen3-0.6B"
    
    def ask_llm(prompt):
        response = requests.post(
            f"{API_BASE}/v1/chat/completions",
            headers={
                "Content-Type": "application/json"
            },
            json={
                "model": MODEL_ID,
                "messages": [
                    {
                        "role": "system",
                        "content": "You are a helpful store assistant. Answer based only on the provided information."
                    },
                    {
                        "role": "user",
                        "content": prompt
                    }
                ],
                "max_tokens": 2560,
                "temperature": 0.6
            }
        )
        data = response.json()
        raw_text = data["choices"][0]["message"]["content"]
        # Optional: remove reasoning traces if present
        clean_text = re.sub(r"<think>.*?</think>", "", raw_text, flags=re.DOTALL).strip()
        return clean_text
    
  2. Run a RAG query:
    query = input("Your question: ")
    results = search_documents(query)
    answer = ask_llm(build_prompt(query, results))
    print(answer)
    
The pipeline retrieves relevant document chunks from Managed Service for PostgreSQL and sends the constructed prompt to your serverless endpoint for generation.
Your question: What laptop and headphones should I buy with a budget of 1000 USD?
Answer: The laptop to buy is the Sony WH-1000XM5, and the headphones are the JBL Tune 510BT. Both products are within the 1000 USD budget.
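Qwen3 models can emit `<think>…</think>` reasoning traces before the final answer; the cleanup step in `ask_llm` removes them. You can verify the regex offline on a made-up response:

```python
import re

def strip_reasoning(raw_text):
    # Same cleanup as in ask_llm: drop any <think>...</think> block,
    # including newlines inside it (re.DOTALL), then trim whitespace.
    return re.sub(r"<think>.*?</think>", "", raw_text, flags=re.DOTALL).strip()

raw = "<think>\nThe budget is 1000 USD...\n</think>\n\nThe JBL Tune 510BT fits the budget."
print(strip_reasoning(raw))
# The JBL Tune 510BT fits the budget.
```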

How to delete the created resources

Some of the created resources are chargeable. If you don’t need them, delete these resources so Nebius AI Cloud doesn’t charge for them:
  • JupyterLab VM: In the web console, go to Compute → Containers over VMs, open the VM and delete it.
  • Serverless AI endpoint: In the web console, go to AI Services → Endpoints. Open the endpoint, stop it if it is running, and then delete it.
  • Managed Service for PostgreSQL cluster: In the web console, go to Managed Service for PostgreSQL, open the cluster and delete it.

Postgres, PostgreSQL and the Slonik Logo are trademarks or registered trademarks of the PostgreSQL Community Association of Canada, and used with their permission.