RAG Configuration Guide
This document explains how to configure and customize your RAG pipeline. You will:
- Initialize a vector store
- Download and point to a local embedding model
- Configure an inference provider (LLM)
- Choose a RAG strategy (Inline RAG or Tool RAG)
Table of Contents
- Introduction
- Prerequisites
- Configure BYOK Knowledge Sources
- Add an Inference Model (LLM)
- Complete Configuration Reference
- System Prompt Guidance for RAG (as a tool)
- RAG annotations
- References
Introduction
Lightspeed Core Stack (LCS) supports two complementary RAG strategies:
- Inline RAG: context is fetched from BYOK vector stores and/or OKP and injected before the LLM request. No tool calls are required.
- Tool RAG: the LLM can call the file_search tool during generation to retrieve context on demand from BYOK vector stores and/or OKP.
Both strategies can be enabled independently via the rag section of lightspeed-stack.yaml. See BYOK Feature Documentation for configuration details.
The Embedding Model is used to convert queries and documents into vector representations for similarity matching.
[!NOTE] The same Embedding Model should be used to both create the vector store and to query it.
Prerequisites
Set Up the Vector Database
Use the rag-content repository to build a compatible vector database.
[!IMPORTANT] The resulting DB must be in a supported format (e.g., FAISS with SQLite metadata). This can be configured when using the tool to generate the index.
Download an Embedding Model
Download a local embedding model such as sentence-transformers/all-mpnet-base-v2, either with the script in rag-content or manually, and place it in your desired path.
[!NOTE] The embedding model can also be downloaded automatically at first start-up (which will be slower). In the byok_rag section of lightspeed-stack.yaml, specify a supported model name as embedding_model instead of a local path. The model will be downloaded to the ~/.cache/huggingface/hub folder.
Configure BYOK Knowledge Sources
[!WARNING] Deprecated in 0.7.0: The top-level byok_rag, rag, okp, and reranker sections are deprecated. In 0.7.0, all RAG-related configuration is unified under a single rag section: stores move to rag.byok.stores (with backend instead of rag_type), retrieval strategies move to rag.retrieval.inline / rag.retrieval.tool, OKP moves to rag.okp, and the reranker moves to rag.retrieval.inline.reranker. See the v0.7.0 Migration Guide for full details and examples.
BYOK knowledge sources are configured in the byok_rag section of lightspeed-stack.yaml. The required configuration is automatically generated at startup when using make run, make run-stack, docker-compose, or library mode — no manual enrichment is needed.
FAISS example
byok_rag:
  - rag_id: custom-index
    rag_type: inline::faiss
    embedding_model: sentence-transformers/all-mpnet-base-v2 # or path to local model
    embedding_dimension: 768
    vector_db_id: vs_8c94967b-81cc-4028-a294-9cfac6fd9ae2 # Generated by rag-content during index creation
    db_path: <path-to-vector-index> # e.g. /home/USER/vector_db/faiss_store.db
Where:
- embedding_model is the embedding model identifier or the path to the local model folder
- db_path is the path to the vector index (.db file in this case)
- vector_db_id is the ID generated by rag-content during index creation (e.g. vs_8c94967b-81cc-4028-a294-9cfac6fd9ae2)
See the full working config example for more details.
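Before starting the service, it can help to sanity-check an entry like the FAISS example above. The following is a minimal sketch, not the service's own validation: the helper name `check_byok_entry` and the choice of which fields count as required are assumptions made for illustration, based only on the fields shown in the example.

```python
from pathlib import Path
from typing import List

# Field names come from the FAISS example; treating them all as required
# is an assumption for this sketch, not the service's actual rule set.
REQUIRED_FIELDS = (
    "rag_id",
    "rag_type",
    "embedding_model",
    "embedding_dimension",
    "vector_db_id",
)


def check_byok_entry(entry: dict) -> List[str]:
    """Return human-readable problems found in one byok_rag entry."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]

    dimension = entry.get("embedding_dimension")
    if dimension is not None and (not isinstance(dimension, int) or dimension <= 0):
        problems.append("embedding_dimension must be a positive integer")

    # Only file-backed stores such as FAISS carry a db_path.
    if entry.get("rag_type") == "inline::faiss":
        db_path = entry.get("db_path")
        if not db_path:
            problems.append("inline::faiss entries need a db_path")
        elif not Path(db_path).is_file():
            problems.append(f"db_path does not exist: {db_path}")
    return problems
```

An empty list means the entry passed these basic checks; it does not guarantee that the index was built with the same embedding model.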
pgvector example
This example shows how to configure a remote PostgreSQL database with the pgvector extension for storing embeddings.
Install a PostgreSQL version that is compatible with pgvector, then log in with psql and enable the extension with: CREATE EXTENSION IF NOT EXISTS vector;
Each pgvector-backed table follows this schema:
- id (text): UUID identifier of the chunk
- document (jsonb): JSON containing content and metadata associated with the embedding
- embedding (vector(n)): the embedding vector, where n is the embedding dimension and will match the model’s output size (e.g. 768 for all-mpnet-base-v2)
[!NOTE] The vector_store_id (e.g. rhdocs) is used to point to the table named vector_store_rhdocs in the specified database, which stores the vector embeddings.
byok_rag:
  - rag_id: pgvector-example
    rag_type: remote::pgvector
    embedding_model: sentence-transformers/all-mpnet-base-v2
    embedding_dimension: 768
    vector_db_id: rhdocs # becomes PostgreSQL table 'vector_store_rhdocs'
    host: ${env.POSTGRES_HOST}
    port: ${env.POSTGRES_PORT}
    db: ${env.POSTGRES_DATABASE}
    user: ${env.POSTGRES_USER}
    password: ${env.POSTGRES_PASSWORD}
[!NOTE] Connection fields (host, port, db, user, password) default to ${env.POSTGRES_*} environment variable references when omitted. Use environment variables for credentials.
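The two conventions described above (the table name derived from the store ID, and connection fields falling back to environment variable references) can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the default values simply mirror the references used in the pgvector example rather than being read from the service's code.

```python
# Defaults mirror the ${env.POSTGRES_*} references from the pgvector example.
# The mapping and both helper names are hypothetical, for illustration only.
CONNECTION_DEFAULTS = {
    "host": "${env.POSTGRES_HOST}",
    "port": "${env.POSTGRES_PORT}",
    "db": "${env.POSTGRES_DATABASE}",
    "user": "${env.POSTGRES_USER}",
    "password": "${env.POSTGRES_PASSWORD}",
}


def pgvector_table_name(vector_db_id: str) -> str:
    """Name of the PostgreSQL table that holds a store's embeddings."""
    return f"vector_store_{vector_db_id}"


def with_connection_defaults(entry: dict) -> dict:
    """Fill omitted connection fields with environment variable references."""
    return {**CONNECTION_DEFAULTS, **entry}
```

For the example entry, `pgvector_table_name("rhdocs")` yields the `vector_store_rhdocs` table mentioned in the note.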
Add an Inference Model (LLM)
vLLM on RHEL AI (Llama 3.1) example
[!NOTE] The following example assumes that podman’s CDI has been properly configured to enable GPU support.
The vllm-openai Docker image is used to serve the Llama-3.1-8B-Instruct model.
The following example shows how to run it on RHEL AI with podman:
podman run \
--device "${CONTAINER_DEVICE}" \
--gpus ${GPUS} \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
-p ${EXPORTED_PORT}:8000 \
--ipc=host \
docker.io/vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--enable-auto-tool-choice \
--tool-call-parser llama3_json --chat-template examples/tool_chat_template_llama3.1_json.jinja
The example command above enables tool calling for Llama 3.1 models. For other supported models and configuration options, see the vLLM documentation: vLLM: Tool Calling
After starting the container, configure the vLLM provider in your run.yaml, matching model_id with the model provided in the podman run command.
[...]
models:
  [...]
  - model_id: meta-llama/Llama-3.1-8B-Instruct # Same as the model name in the 'podman run' command
    provider_id: vllm
    model_type: llm
    provider_model_id: null
providers:
  [...]
  inference:
    - provider_id: vllm
      provider_type: remote::vllm
      config:
        url: http://localhost:${env.EXPORTED_PORT:=8000}/v1/ # Replace localhost with the url of the vLLM instance
        api_token: <your-key-here> # if any
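The url value above uses a `${env.NAME:=default}` reference. As a rough mental model of how such a reference expands, here is a simplified sketch. It is not the resolver used by the stack: the function name is hypothetical, and the rule "use the variable when it is set, otherwise the default, otherwise fail" is an assumption about the intended semantics.

```python
import re
from typing import Mapping

# Matches ${env.NAME} and ${env.NAME:=default}.
_ENV_REF = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)(?::=([^}]*))?\}")


def resolve_env_refs(text: str, environ: Mapping[str, str]) -> str:
    """Expand environment references in a configuration string (sketch)."""

    def substitute(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        if name in environ:
            return environ[name]
        if default is not None:
            return default
        raise KeyError(f"environment variable {name} is not set and has no default")

    return _ENV_REF.sub(substitute, text)
```

Calling `resolve_env_refs(url, os.environ)` on the url from the example would produce a concrete address, using port 8000 when EXPORTED_PORT is not exported.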
OpenAI example
Add a provider for your language model in your run.yaml (e.g., OpenAI):
models:
  [...]
  - model_id: my-model
    provider_id: openai
    model_type: llm
    provider_model_id: <model-name> # e.g. gpt-4o-mini
providers:
  [...]
  inference:
    - provider_id: openai
      provider_type: remote::openai
      config:
        api_key: ${env.OPENAI_API_KEY}
Make sure to export your API key:
export OPENAI_API_KEY=<your-key-here>
[!NOTE] When experimenting with different models, providers and vector_dbs, you might need to manually unregister the old ones via the CLI.
Azure OpenAI
Not yet supported.
Ollama
The remote::ollama provider does not support tool calling, so RAG as a tool is not available. However, inline RAG is supported.
vLLM Mistral
The RAG tool calls were not working properly when experimenting with mistralai/Mistral-7B-Instruct-v0.3 on vLLM.
OKP/Solr Vector IO
The OKP (Offline Knowledge Portal) Solr Vector IO is a read-only vector search provider that integrates with Apache Solr for enhanced vector search capabilities. It enables retrieving contextual information from Solr-indexed Red Hat documents to enhance query responses with support for hybrid search and chunk window expansion.
How to Enable OKP/Solr Vector IO
1. Configure Lightspeed Stack (lightspeed-stack.yaml):
rag:
  inline:
    - okp # inject OKP context before the LLM request
  tool:
    - okp # expose OKP as the file_search tool
okp:
  rhokp_url: ${env.RH_SERVER_OKP} # OKP base URL (env var or literal URL)
  offline: true # true = use parent_id for source URLs (offline mode)
                # false = use reference_url (online mode)
Set rhokp_url to the base URL of your OKP server. Use ${env.RH_SERVER_OKP} to read the URL from the environment; when omitted or empty, a default from the application constants is used.
[!NOTE] When okp is listed in rag.inline or rag.tool, Lightspeed Stack automatically enriches the underlying configuration at startup with the required vector_io provider and registered_resources entries for the OKP vector store. No manual registration is needed.
Query Request Example:
curl -sX POST http://localhost:8080/v1/query \
-H "Content-Type: application/json" \
-d '{"query" : "how do I secure a nodejs application with keycloak?"}' | jq .
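The same request can be issued from Python using only the standard library. This is a minimal sketch that assumes the service listens on localhost:8080 as in the curl example; the helper names `build_query_request` and `send` are illustrative and not part of any client library.

```python
import json
import urllib.request


def build_query_request(query: str, base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build the POST request for the /v1/query endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs a running service)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the service running, `send(build_query_request("how do I secure a nodejs application with keycloak?"))` returns the decoded response body.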
Query Processing:
- When OKP is enabled, queries use the portal-rag vector store
- Vector search is performed with configurable parameters:
  - k: Number of results (default: 5)
  - score_threshold: Minimum similarity score (default: 0.0)
  - mode: Search mode (default: "hybrid"). Per-request configurable.
- Results include document metadata and source URLs
- Document URLs are built based on the offline setting:
  - Offline mode: Uses parent_id with Mimir base URL
  - Online mode: Uses reference_url from document metadata
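The offline/online choice for source URLs can be pictured with a short sketch. The guide only states which metadata field is used in each mode, so everything else here is an assumption: the function name is hypothetical, the base URL is passed in rather than guessed, and joining the base URL and parent_id with a single slash is an illustrative choice, not the documented URL format.

```python
from typing import Optional


def build_source_url(metadata: dict, offline: bool, offline_base_url: str) -> Optional[str]:
    """Pick the source URL for a retrieved chunk (illustrative sketch)."""
    if offline:
        parent_id = metadata.get("parent_id")
        if not parent_id:
            return None
        # Assumption: base URL and parent_id are joined with one slash.
        return f"{offline_base_url.rstrip('/')}/{str(parent_id).lstrip('/')}"
    # Online mode: use the URL recorded in the document metadata, if any.
    return metadata.get("reference_url")
```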
Query Filtering:
To further filter the OKP context, set the chunk_filter_query field in the okp section of
lightspeed-stack.yaml. Filters follow the OKP key:value format and are applied as a static
fq parameter on every OKP search request.
okp:
  rhokp_url: ${env.RH_SERVER_OKP}
  chunk_filter_query: "product:*openshift*"
Per-request filtering is also available on all inference endpoints via the solr request field, which accepts mode (semantic, hybrid, or lexical) and filters (key:value format). Legacy payloads that omit mode/filters and send filter key:value pairs at the top level of solr still work, with mode set to hybrid.
Example:
{
  "query": "How do I configure routes?",
  "solr": {
    "mode": "hybrid",
    "filters": { "fq": ["product:*openshift*"] }
  }
}
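One way to read the rule about legacy payloads is sketched below. This is an interpretation, not the service's parsing code: the function name is hypothetical, and it assumes that a solr object containing neither mode nor filters should be treated entirely as filters with mode set to hybrid.

```python
from typing import Optional

VALID_MODES = {"semantic", "hybrid", "lexical"}


def normalize_solr_options(solr: Optional[dict]) -> dict:
    """Return a {'mode', 'filters'} dict for current and legacy payload shapes."""
    if not solr:
        return {"mode": "hybrid", "filters": {}}
    if "mode" in solr or "filters" in solr:
        mode = solr.get("mode", "hybrid")
        filters = solr.get("filters", {})
    else:
        # Legacy shape (assumed): filter key:value pairs at the top level.
        mode, filters = "hybrid", dict(solr)
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported search mode: {mode}")
    return {"mode": mode, "filters": filters}
```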
Prerequisites:
- The OKP server must be running and accessible at the URL given in okp.rhokp_url (or ${env.RH_SERVER_OKP}). For instructions on how to pull and run the OKP image, visit: https://github.com/lightspeed-core/lightspeed-providers/lightspeed_stack_providers/providers/remote/solr_vector_io/solr_vector_io/README.md
Chunk volume:
[!WARNING] Deprecated in 0.7.0: The chunk limit constants below are replaced by configurable fields in lightspeed-stack.yaml (rag.byok.max_chunks, rag.okp.max_chunks, rag.retrieval.inline.max_chunks, rag.retrieval.tool.max_chunks). See the v0.7.0 Migration Guide for details.
OKP and BYOK scores are not directly comparable (different scoring systems), so
score_multiplier (a BYOK-only concept) does not apply to OKP results. To control
the number of retrieved chunks, set the constants in src/constants.py:
| Constant | Value | Description |
|---|---|---|
| INLINE_RAG_MAX_CHUNKS | 10 | Hard upper bound on the final merged inline RAG chunks (BYOK + OKP) delivered to the LLM |
| OKP_RAG_MAX_CHUNKS | 5 | Fetch hint for OKP (Inline RAG); controls how many chunks enter the reranking pool |
| BYOK_RAG_MAX_CHUNKS | 10 | Fetch hint for BYOK stores (Inline RAG); controls how many chunks enter the reranking pool |
| TOOL_RAG_MAX_CHUNKS | 10 | Max chunks retrieved via Tool RAG (file_search); independent from INLINE_RAG_MAX_CHUNKS |
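How the three inline limits interact can be illustrated with a small sketch. The constant values are copied from the table; the function name and the `rerank` callable are hypothetical stand-ins, and the real merge and reranking logic may differ.

```python
from typing import Callable, List

# Values copied from the table above.
INLINE_RAG_MAX_CHUNKS = 10
OKP_RAG_MAX_CHUNKS = 5
BYOK_RAG_MAX_CHUNKS = 10


def select_inline_chunks(
    byok_chunks: List[str],
    okp_chunks: List[str],
    rerank: Callable[[List[str]], List[str]],
) -> List[str]:
    """Build the reranking pool from both sources, then cap the final result."""
    # Per-source fetch hints decide how many chunks enter the pool.
    pool = byok_chunks[:BYOK_RAG_MAX_CHUNKS] + okp_chunks[:OKP_RAG_MAX_CHUNKS]
    # The inline limit is a hard bound on what reaches the LLM.
    return rerank(pool)[:INLINE_RAG_MAX_CHUNKS]
```

With these values, up to 15 chunks can enter the pool, but at most 10 are delivered to the LLM after reranking.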
Limitations:
- This is a read-only provider - no insert/delete operations
Complete Configuration Reference
To enable RAG functionality, configure the byok_rag and rag sections in your lightspeed-stack.yaml.
Below is an example of a working lightspeed-stack.yaml configuration with:
- A local all-mpnet-base-v2 embedding model
- A FAISS-based vector store
- Inline and Tool RAG enabled
[!TIP] We recommend starting with a minimal working configuration and extending it as needed.
name: Lightspeed Core Service (LCS)
service:
  host: localhost
  port: 8080
  auth_enabled: false
byok_rag:
  - rag_id: ocp-docs
    rag_type: inline::faiss
    embedding_model: sentence-transformers/all-mpnet-base-v2
    embedding_dimension: 768
    vector_db_id: vs_3a7f9b2e-45dc-4e1a-b8f2-1c9d0e3f5a6b
    db_path: /home/USER/lightspeed-stack/vector_dbs/ocp_docs/faiss_store.db
rag:
  inline:
    - ocp-docs
  tool:
    - ocp-docs
The BYOK vector store providers and registered resources are automatically generated at startup from the byok_rag entries above. Models and inference providers must be configured separately in your run.yaml.
System Prompt Guidance for RAG (as a tool)
When using RAG, the knowledge_search tool must be explicitly referenced in your system prompt. Without clear instructions, models may inconsistently use the tool.
Tool-Aware sample instruction:
You are a helpful assistant with access to a 'knowledge_search' tool. When users ask questions, ALWAYS use the knowledge_search tool first to find accurate information from the documentation before answering.
RAG annotations
The top-level vector_stores block in run.yaml may include annotation_prompt_params to control whether extra RAG annotation instructions are injected into the model prompt (for example, citation-style markers). The default configuration sets enable_annotations: false under that block to avoid unwanted annotations.