Lightspeed Core Stack

RAG Configuration Guide

This document explains how to configure and customize your RAG pipeline.




Introduction

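Lightspeed Core Stack (LCS) supports two complementary RAG strategies: inline RAG, which injects retrieved context before the LLM request, and tool RAG, which exposes retrieval to the model as a tool.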

Both strategies can be enabled independently via the rag section of lightspeed-stack.yaml. See BYOK Feature Documentation for configuration details.

The Embedding Model is used to convert queries and documents into vector representations for similarity matching.

[!NOTE] The same Embedding Model should be used to both create the vector store and to query it.
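
As a rough illustration of similarity matching over vector representations, the sketch below ranks a few documents against a query using cosine similarity. The three-dimensional vectors and the document ids are toy values invented for this example; real embeddings come from the embedding model (768 dimensions for the model used in this guide), and the vector store may use a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        # Vectors produced by different embedding models often differ in size,
        # which is one reason to use the same model for indexing and querying.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document ids ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Toy data (hypothetical): three "documents" and one "query" in 3 dimensions.
docs = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]
ranking = rank_documents(query, docs)
```

The dimension check also shows why mixing models tends to fail early: a query vector of one size cannot be compared against stored vectors of another.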


Prerequisites

Set Up the Vector Database

Use the rag-content repository to build a compatible vector database.

[!IMPORTANT] The resulting DB must be in a supported format (e.g., FAISS with SQLite metadata). This can be configured when using the tool to generate the index.


Download an Embedding Model

Download a local embedding model such as sentence-transformers/all-mpnet-base-v2, either by using the script in rag-content or by downloading it manually and placing it in your desired path.

[!NOTE] The embedding model can also be downloaded automatically at first start-up (which will be slower). In the byok_rag section of lightspeed-stack.yaml, specify a supported model name as embedding_model instead of a local path. The model will be downloaded to the ~/.cache/huggingface/hub folder.


Configure BYOK Knowledge Sources

[!WARNING] Deprecated in 0.7.0: The top-level byok_rag, rag, okp, and reranker sections are deprecated. In 0.7.0, all RAG-related configuration is unified under a single rag section: stores move to rag.byok.stores (with backend instead of rag_type), retrieval strategies move to rag.retrieval.inline/rag.retrieval.tool, OKP moves to rag.okp, and the reranker moves to rag.retrieval.inline.reranker. See the v0.7.0 Migration Guide for full details and examples.

BYOK knowledge sources are configured in the byok_rag section of lightspeed-stack.yaml. The required configuration is automatically generated at startup when using make run, make run-stack, docker-compose, or library mode; no manual enrichment is needed.

FAISS example

byok_rag:
  - rag_id: custom-index
    rag_type: inline::faiss
    embedding_model: sentence-transformers/all-mpnet-base-v2  # or path to local model
    embedding_dimension: 768
    vector_db_id: vs_8c94967b-81cc-4028-a294-9cfac6fd9ae2                                    # Generated by rag-content during index creation
    db_path: <path-to-vector-index>                            # e.g. /home/USER/vector_db/faiss_store.db

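Where embedding_model is a supported model name or the path to a local model, vector_db_id is the identifier generated by rag-content during index creation, and db_path is the path to the vector index (for example, /home/USER/vector_db/faiss_store.db).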

See the full working config example for more details.

pgvector example

This example shows how to configure a remote PostgreSQL database with the pgvector extension for storing embeddings.

Install a PostgreSQL version that is compatible with your pgvector version, then log in with psql and enable the extension:

CREATE EXTENSION IF NOT EXISTS vector;

Each pgvector-backed table follows this schema:

[!NOTE] The vector_db_id (e.g. rhdocs) points to the table named vector_store_rhdocs in the specified database, which stores the vector embeddings.

byok_rag:
  - rag_id: pgvector-example
    rag_type: remote::pgvector
    embedding_model: sentence-transformers/all-mpnet-base-v2
    embedding_dimension: 768
    vector_db_id: rhdocs  # becomes PostgreSQL table 'vector_store_rhdocs'
    host: ${env.POSTGRES_HOST}
    port: ${env.POSTGRES_PORT}
    db: ${env.POSTGRES_DATABASE}
    user: ${env.POSTGRES_USER}
    password: ${env.POSTGRES_PASSWORD}

[!NOTE] Connection fields (host, port, db, user, password) default to ${env.POSTGRES_*} environment variable references when omitted. Use environment variables for credentials.
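
The two conventions described for pgvector (the table name derived from vector_db_id, and connection fields falling back to ${env.POSTGRES_*} references) can be sketched as follows. This is an illustration, not the Lightspeed Stack implementation: the function names are hypothetical, the identifier check is an assumption added for safety, and db.example.com is a placeholder host.

```python
import re

# Defaults taken from the example above: each connection field maps to an
# environment variable reference that is used when the field is omitted.
DEFAULT_CONNECTION_REFS = {
    "host": "${env.POSTGRES_HOST}",
    "port": "${env.POSTGRES_PORT}",
    "db": "${env.POSTGRES_DATABASE}",
    "user": "${env.POSTGRES_USER}",
    "password": "${env.POSTGRES_PASSWORD}",
}


def with_connection_defaults(entry: dict) -> dict:
    """Return a copy of a byok_rag entry with omitted connection fields filled in."""
    merged = dict(DEFAULT_CONNECTION_REFS)
    merged.update(entry)  # explicitly configured fields win over the defaults
    return merged


def pgvector_table_name(vector_db_id: str) -> str:
    """Return the table name that a vector_db_id points to."""
    # Assumption for this sketch only: accept letters, digits and underscores,
    # so the result is always a plain SQL identifier.
    if not re.fullmatch(r"[A-Za-z0-9_]+", vector_db_id):
        raise ValueError(f"unsupported vector_db_id for this sketch: {vector_db_id!r}")
    return f"vector_store_{vector_db_id}"


entry = with_connection_defaults(
    {"rag_id": "pgvector-example", "vector_db_id": "rhdocs", "host": "db.example.com"}
)
table = pgvector_table_name(entry["vector_db_id"])
```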


Add an Inference Model (LLM)

vLLM on RHEL AI (Llama 3.1) example

[!NOTE] The following example assumes that podman’s CDI has been properly configured to enable GPU support.

The vllm-openai Docker image is used to serve the Llama-3.1-8B-Instruct model.
The following example shows how to run it on RHEL AI with podman:

podman run \
  --device "${CONTAINER_DEVICE}" \
  --gpus ${GPUS} \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
  -p ${EXPORTED_PORT}:8000 \
  --ipc=host \
  docker.io/vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --chat-template examples/tool_chat_template_llama3.1_json.jinja

The example command above enables tool calling for Llama 3.1 models. For other supported models and configuration options, see the Tool Calling page of the vLLM documentation.

After starting the container, configure the vLLM provider in your run.yaml, matching model_id with the model provided in the podman run command.

[...]
models:
[...]
- model_id: meta-llama/Llama-3.1-8B-Instruct # Same as the model name in the 'podman run' command
  provider_id: vllm
  model_type: llm
  provider_model_id: null

providers:
  [...]
  inference:
  - provider_id: vllm
    provider_type: remote::vllm
    config:
      url: http://localhost:${env.EXPORTED_PORT:=8000}/v1/ # Replace localhost with the url of the vLLM instance
      api_token: <your-key-here> # if any
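
The configuration above uses two forms of environment reference: ${env.NAME} and ${env.NAME:=default}. The sketch below shows one plausible reading of them, where the default is used only when the variable is not set. It is an assumption-based illustration with a hypothetical function name; the actual resolver may treat empty values or missing variables differently.

```python
import re

# Matches ${env.NAME} and ${env.NAME:=default}; group 2 is None without a default.
_ENV_REF = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)(?::=([^}]*))?\}")


def resolve_env_refs(value: str, environ: dict) -> str:
    """Replace every ${env.*} reference in value using the given mapping."""

    def _replace(match):
        name, default = match.group(1), match.group(2)
        if name in environ:
            return environ[name]
        if default is not None:
            return default
        # Assumption: a reference with no value and no default is an error.
        raise KeyError(f"environment variable {name} is not set and has no default")

    return _ENV_REF.sub(_replace, value)


url_template = "http://localhost:${env.EXPORTED_PORT:=8000}/v1/"
default_url = resolve_env_refs(url_template, {})
custom_url = resolve_env_refs(url_template, {"EXPORTED_PORT": "9000"})
```

Passing the mapping explicitly (for example os.environ in real use) keeps the function easy to test without touching the process environment.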

OpenAI example

Add a provider for your language model in your run.yaml (e.g., OpenAI):

models:
[...]
- model_id: my-model 
  provider_id: openai
  model_type: llm
  provider_model_id: <model-name> # e.g. gpt-4o-mini

providers:
[...]
  inference:
  - provider_id: openai
    provider_type: remote::openai
    config:
      api_key: ${env.OPENAI_API_KEY}

Make sure to export your API key:

export OPENAI_API_KEY=<your-key-here>

[!NOTE] When experimenting with different models, providers and vector_dbs, you might need to manually unregister the old ones via the CLI.

Azure OpenAI

Not yet supported.

Ollama

The remote::ollama provider does not support tool calling, so RAG as a tool is not available. However, inline RAG is supported.

vLLM Mistral

RAG tool calls were not working properly when experimenting with mistralai/Mistral-7B-Instruct-v0.3 on vLLM.

OKP/Solr Vector IO

The OKP (Offline Knowledge Portal) Solr Vector IO is a read-only vector search provider that integrates with Apache Solr for enhanced vector search capabilities. It enables retrieving contextual information from Solr-indexed Red Hat documents to enhance query responses with support for hybrid search and chunk window expansion.

How to Enable OKP/Solr Vector IO

1. Configure Lightspeed Stack (lightspeed-stack.yaml):

rag:
  inline:
    - okp               # inject OKP context before the LLM request
  tool:
    - okp               # expose OKP as the file_search tool

okp:
  rhokp_url: ${env.RH_SERVER_OKP}   # OKP base URL (env var or literal URL)
  offline: true         # true = use parent_id for source URLs (offline mode)
                        # false = use reference_url (online mode)

Set rhokp_url to the base URL of your OKP server. Use ${env.RH_SERVER_OKP} to read the URL from the environment; when omitted or empty, a default from the application constants is used.

[!NOTE] When okp is listed in rag.inline or rag.tool, Lightspeed Stack automatically enriches the underlying configuration at startup with the required vector_io provider and registered_resources entries for the OKP vector store. No manual registration is needed.

Query Request Example:

curl -sX POST http://localhost:8080/v1/query \
    -H "Content-Type: application/json" \
    -d '{"query" : "how do I secure a nodejs application with keycloak?"}' | jq .
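
For callers that prefer Python over curl, the same request can be sketched with the standard library. The endpoint and payload mirror the curl command above; the helper names are hypothetical, and the response is assumed to be JSON (as the jq pipe suggests).

```python
import json
import urllib.request


def build_query_request(query: str, base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/query endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_query(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (needs a running service)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_query_request("how do I secure a nodejs application with keycloak?")
```

Calling send_query(request) performs the actual HTTP call; it is kept separate so the request can be inspected or logged first.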

Query Processing:

  1. When OKP is enabled, queries use the portal-rag vector store
  2. Vector search is performed with configurable parameters:
    • k: Number of results (default: 5)
    • score_threshold: Minimum similarity score (default: 0.0)
    • mode: Search mode (default: “hybrid”). Per-request configurable.
  3. Results include document metadata and source URLs
  4. Document URLs are built based on the offline setting:
    • Offline mode: Uses parent_id with Mimir base URL
    • Online mode: Uses reference_url from document metadata
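
The URL selection in step 4 can be sketched as below. The list above only states which metadata field each mode uses; the way parent_id is joined to the base URL, the function name, and the example values are assumptions made for illustration.

```python
from typing import Optional


def build_source_url(metadata: dict, offline: bool, base_url: str) -> Optional[str]:
    """Pick the source URL for a retrieved document based on the offline setting."""
    if offline:
        parent_id = metadata.get("parent_id")
        if not parent_id:
            return None
        # Assumption: parent_id is appended to the base URL as a path segment.
        return f"{base_url.rstrip('/')}/{str(parent_id).lstrip('/')}"
    # Online mode: use the URL stored in the document metadata as-is.
    return metadata.get("reference_url") or None


# Hypothetical metadata and base URL, for illustration only.
doc_metadata = {
    "parent_id": "docs/example-guide",
    "reference_url": "https://docs.example.com/example-guide",
}
offline_url = build_source_url(doc_metadata, offline=True, base_url="https://mimir.example.com/")
online_url = build_source_url(doc_metadata, offline=False, base_url="https://mimir.example.com/")
```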

Query Filtering:

To further filter the OKP context, set the chunk_filter_query field in the okp section of lightspeed-stack.yaml. Filters follow the OKP key:value format and are applied as a static fq parameter on every OKP search request.

okp:
  rhokp_url: ${env.RH_SERVER_OKP}
  chunk_filter_query: "product:*openshift*"

Per-request filtering is also available on all inference endpoints via the solr request field, which accepts mode (semantic, hybrid, or lexical) and filters (key:value format). Legacy payloads that omit mode and filters and instead send filter key:value pairs at the top level still work, with mode set to hybrid.

Example:

{
  "query": "How do I configure routes?",
  "solr": {
    "mode": "hybrid",
    "filters": { "fq": ["product:*openshift*"] }
  }
}
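
One way to picture the handling of new-style and legacy payloads is the normalization sketch below. It assumes that "top level" means the top level of the solr object, and that a missing mode falls back to hybrid; the function name is hypothetical and the service's actual validation may differ.

```python
ALLOWED_MODES = ("semantic", "hybrid", "lexical")


def normalize_solr_field(solr):
    """Return a dict with 'mode' and 'filters' from a new-style or legacy value."""
    if solr is None:
        return {"mode": "hybrid", "filters": {}}
    if "mode" in solr or "filters" in solr:
        # New-style payload: read both keys, defaulting whichever is missing.
        mode = solr.get("mode", "hybrid")
        filters = solr.get("filters", {})
    else:
        # Legacy payload (assumed shape): the whole object is the filter mapping.
        mode, filters = "hybrid", dict(solr)
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}, got {mode!r}")
    return {"mode": mode, "filters": filters}


new_style = normalize_solr_field({"mode": "hybrid", "filters": {"fq": ["product:*openshift*"]}})
legacy = normalize_solr_field({"fq": ["product:*openshift*"]})
```

Under these assumptions both calls produce the same normalized value, which is the behavior the legacy compatibility note describes.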

Prerequisites:

Chunk volume:

[!WARNING] Deprecated in 0.7.0: The chunk limit constants below are replaced by configurable fields in lightspeed-stack.yaml (rag.byok.max_chunks, rag.okp.max_chunks, rag.retrieval.inline.max_chunks, rag.retrieval.tool.max_chunks). See the v0.7.0 Migration Guide for details.

OKP and BYOK scores are not directly comparable (different scoring systems), so score_multiplier (a BYOK-only concept) does not apply to OKP results. To control the number of retrieved chunks, set the constants in src/constants.py:

| Constant | Value | Description |
| --- | --- | --- |
| INLINE_RAG_MAX_CHUNKS | 10 | Hard upper bound on the final merged inline RAG chunks (BYOK + OKP) delivered to the LLM |
| OKP_RAG_MAX_CHUNKS | 5 | Fetch hint for OKP (Inline RAG); controls how many chunks enter the reranking pool |
| BYOK_RAG_MAX_CHUNKS | 10 | Fetch hint for BYOK stores (Inline RAG); controls how many chunks enter the reranking pool |
| TOOL_RAG_MAX_CHUNKS | 10 | Max chunks retrieved via Tool RAG (file_search); independent from INLINE_RAG_MAX_CHUNKS |
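
How the inline limits interact can be sketched as follows: the two fetch hints bound what enters the reranking pool, and INLINE_RAG_MAX_CHUNKS caps what is delivered. The constant values come from the table above; the function, the chunk dictionaries, and the okp_first stand-in reranker are hypothetical and do not reflect the real reranking logic.

```python
INLINE_RAG_MAX_CHUNKS = 10
OKP_RAG_MAX_CHUNKS = 5
BYOK_RAG_MAX_CHUNKS = 10


def merge_inline_chunks(byok_chunks, okp_chunks, rerank):
    """Pool chunks from both sources, rerank them, and apply the hard upper bound."""
    # Fetch hints: limit how many chunks from each source enter the pool.
    pool = list(byok_chunks[:BYOK_RAG_MAX_CHUNKS]) + list(okp_chunks[:OKP_RAG_MAX_CHUNKS])
    ranked = rerank(pool)
    # Hard cap on what is finally delivered to the LLM.
    return ranked[:INLINE_RAG_MAX_CHUNKS]


def okp_first(pool):
    """Stand-in reranker for this example: OKP chunks first, order otherwise kept."""
    return sorted(pool, key=lambda chunk: chunk["source"] != "okp")


byok = [{"source": "byok", "text": f"byok chunk {i}"} for i in range(12)]
okp = [{"source": "okp", "text": f"okp chunk {i}"} for i in range(8)]
merged = merge_inline_chunks(byok, okp, okp_first)
```

With 12 BYOK and 8 OKP candidates, the pool holds 15 chunks (10 + 5) and the merged result holds 10, so raising one fetch hint alone does not increase what the LLM receives.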

Limitations:


Complete Configuration Reference

To enable RAG functionality, configure the byok_rag and rag sections in your lightspeed-stack.yaml.

Below is an example of a working lightspeed-stack.yaml configuration with a FAISS-backed BYOK store (ocp-docs) enabled for both inline and tool RAG:

[!TIP] We recommend starting with a minimal working configuration and extending it as needed.

name: Lightspeed Core Service (LCS)
service:
  host: localhost
  port: 8080
  auth_enabled: false

byok_rag:
  - rag_id: ocp-docs
    rag_type: inline::faiss
    embedding_model: sentence-transformers/all-mpnet-base-v2
    embedding_dimension: 768
    vector_db_id: vs_3a7f9b2e-45dc-4e1a-b8f2-1c9d0e3f5a6b
    db_path: /home/USER/lightspeed-stack/vector_dbs/ocp_docs/faiss_store.db

rag:
  inline:
    - ocp-docs
  tool:
    - ocp-docs

The BYOK vector store providers and registered resources are automatically generated at startup from the byok_rag entries above. Models and inference providers must be configured separately in your run.yaml.
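
A quick consistency check for a configuration like the one above is to confirm that every id listed under rag.inline and rag.tool refers to a defined byok_rag entry (or to okp). The sketch below works on an already-parsed configuration dictionary; the function name is hypothetical and Lightspeed Stack may perform its own, different validation at startup.

```python
def find_unknown_rag_ids(config: dict) -> list:
    """Return references in rag.inline / rag.tool that match no known source."""
    known = {entry["rag_id"] for entry in config.get("byok_rag", [])}
    # 'okp' is listed in rag.inline / rag.tool without a byok_rag entry.
    known.add("okp")
    rag = config.get("rag", {})
    unknown = []
    for strategy in ("inline", "tool"):
        for rag_id in rag.get(strategy, []):
            if rag_id not in known:
                unknown.append(f"rag.{strategy}: {rag_id}")
    return unknown


# Parsed form of the relevant parts of the example configuration above.
example_config = {
    "byok_rag": [{"rag_id": "ocp-docs", "rag_type": "inline::faiss"}],
    "rag": {"inline": ["ocp-docs"], "tool": ["ocp-docs"]},
}
problems = find_unknown_rag_ids(example_config)
```

An empty list means every reference resolves; a typo such as ocp-doc under rag.tool would be reported as "rag.tool: ocp-doc".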


System Prompt Guidance for RAG (as a tool)

When using RAG as a tool, the knowledge_search tool must be explicitly referenced in your system prompt. Without clear instructions, models may use the tool inconsistently.

Tool-Aware sample instruction:

You are a helpful assistant with access to a 'knowledge_search' tool. When users ask questions, ALWAYS use the knowledge_search tool first to find accurate information from the documentation before answering.

RAG annotations

The top-level vector_stores block in run.yaml may include annotation_prompt_params to control whether extra RAG annotation instructions are injected into the model prompt (for example, citation-style markers). The default configuration sets enable_annotations: false under that block to avoid unwanted annotations.