Lightspeed Core Stack - Architecture Overview

Version: 1.0
Last Updated: January 2026
Status: Living Document


Table of Contents

  1. Introduction
  2. Core Components
  3. Request Processing Pipeline
  4. Database Architecture
  5. API Endpoints
  6. Deployment & Operations
  Appendix

1. Introduction

1.1 What is Lightspeed Core Stack?

Lightspeed Core Stack (LCORE) is an enterprise-grade middleware service that provides a robust layer between client applications and Large Language Model (LLM) backends. It adds essential enterprise features such as authentication, authorization, quota management, caching, and observability to LLM interactions.

LCore is built on Llama Stack - Meta’s open-source framework that provides standardized APIs for building LLM applications. Llama Stack offers a unified interface for models, RAG (vector stores), tools, and safety (shields) across different providers. LCore communicates with Llama Stack to orchestrate all LLM operations.

To enhance LLM responses, LCore leverages RAG (Retrieval-Augmented Generation), which retrieves relevant context from vector databases before generating answers. Llama Stack manages the vector stores, and LCore queries them to inject relevant documentation, knowledge bases, or previous conversations into the LLM prompt.

1.2 Key Features

1.3 System Overview

┌─────────────────────────────────────────────────────────┐
│                   Client Applications                   │
│  (Web UI, CLI, VS Code Extension, Mobile Apps, etc.)   │
└────────────────────┬────────────────────────────────────┘
                     │ REST/A2A/JSON-RPC
                     ▼
┌─────────────────────────────────────────────────────────┐
│                  LCore (This Service)                   │
│  ┌───────────────────────────────────────────────────┐  │
│  │          Enterprise Layer                         │  │
│  │  • Authentication & Authorization (RBAC)          │  │
│  │  • Quota & Rate Limiting                          │  │
│  │  • Caching & Conversation Management              │  │
│  │  • Metrics & Observability                        │  │
│  └───────────────────────────────────────────────────┘  │
│                         ▼                               │
│  ┌───────────────────────────────────────────────────┐  │
│  │          Request Processing                       │  │
│  │  • LLM Orchestration (via Llama Stack)            │  │
│  │  • Tool Integration (MCP servers)                 │  │
│  │  • RAG & Context Management                       │  │
│  └───────────────────────────────────────────────────┘  │
│                         ▼                               │
│  ┌───────────────────────────────────────────────────┐  │
│  │          Storage Layer                            │  │
│  │  • 4 Separate Databases                           │  │
│  │    (User, Cache, Quota, A2A State)                │  │
│  └───────────────────────────────────────────────────┘  │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
          ┌──────────────────┐
          │   Llama Stack    │
          │  (LLM Backend)   │
          │                  │
          │  • Models & LLMs │
          │  • RAG Stores    │
          │  • Shields       │
          └────────┬─────────┘
                   │ (manages & invokes)
                   ▼
          ┌──────────────────┐
          │  MCP Servers     │
          │  (Remote HTTP)   │
          └──────────────────┘

2. Core Components

This section describes the major functional components that make up LCore. Each component handles a specific aspect of the system’s operation.

2.1 Entry Points & Lifecycle Management

Primary Files: lightspeed_stack.py, app/main.py, app/routers.py

Purpose: Orchestrates application startup, shutdown, and request routing.

Key Responsibilities:

Note: All configured MCP servers must be running and accessible at startup time for LCore to initialize successfully.


2.2 Configuration System (configuration.py, models/config.py)

Purpose: Load, validate, and provide access to service configuration from YAML files

Configuration Files:

LCore requires two main configuration files:

  1. LCore Configuration (lightspeed-stack.yaml):
    • Service settings (host, port, logging, CORS)
    • Authentication and authorization methods
    • Database connections (user DB, cache, quota, A2A)
    • MCP server endpoints and credentials
    • Quota limits and schedules
    • User data collection preferences
    • Default models and system prompts
  2. Llama Stack Configuration (run.yaml):
    • Required for both library and server modes
    • Defines LLM providers, models, RAG stores, shields
    • See Llama Stack documentation for details
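
To make this concrete, here is a minimal sketch of how configuration.py / models/config.py might load and validate lightspeed-stack.yaml using Pydantic. The class and field names below are illustrative assumptions, not the actual schema:

# Hypothetical sketch: load lightspeed-stack.yaml into typed, validated settings.
# Class and field names are assumptions for illustration only.
from pathlib import Path

import yaml
from pydantic import BaseModel


class ServiceSettings(BaseModel):
    host: str = "0.0.0.0"
    port: int = 8080


class LCoreConfig(BaseModel):
    name: str
    service: ServiceSettings


def load_configuration(path: str) -> LCoreConfig:
    """Read the YAML file and validate it against the Pydantic schema."""
    raw = yaml.safe_load(Path(path).read_text())
    return LCoreConfig(**raw)  # raises pydantic.ValidationError on invalid config


config = load_configuration("lightspeed-stack.yaml")
print(config.service.port)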

Configuration Validation:


2.3 Authentication System (authentication/)

Purpose: Verify the identity of incoming requests

Authentication Providers:

Provider         Use Case                          Token Handling
No Auth          Development, testing              No token (empty string)
No Auth + Token  Testing with token passthrough    Bearer token passed through
Kubernetes       K8s service accounts              K8s service account token validated and forwarded
Red Hat SSO      Red Hat environments              X-RH-Identity header (no separate token)
API Key          API key authentication            API key from Authorization header
JWK/JWT          JWT tokens                        JWT validated and forwarded

Authentication Result (AuthTuple):

All authentication modules return a standardized 4-tuple: (user_id, username, roles, token)

Note: LCore does not generate tokens - it extracts the client’s original token from the request (typically Authorization header) and forwards it to backend services.
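
For illustration, a minimal sketch of an authentication module that returns this 4-tuple. Only the tuple shape comes from the description above; the function name and default values are assumptions:

# Hypothetical sketch of the standardized authentication result (AuthTuple).
from fastapi import Request

AuthTuple = tuple[str, str, list[str], str]  # (user_id, username, roles, token)


def noop_auth_with_token(request: Request) -> AuthTuple:
    """Illustrative no-auth provider that passes the client's bearer token through."""
    auth_header = request.headers.get("Authorization", "")
    token = auth_header.removeprefix("Bearer ").strip()
    return ("anonymous", "anonymous", ["user"], token)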


2.4 Authorization System (authorization/)

Purpose: Enforce role-based access control (RBAC) on actions

Key Components:

middleware.py:

resolvers.py:

Authorization Actions:

The system defines 30+ actions that can be authorized. Examples (see docs/auth.md for complete list):

Query Actions:

Conversation Management:

Administrative Actions:

Agent-to-Agent Protocol:

Metadata Operations:

How Authorization Works:

  1. Each endpoint is decorated with required action (e.g., @authorize(Action.QUERY))
  2. User’s roles are extracted from AuthTuple
  3. Authorization module checks if any user role has permission for the action
  4. Request proceeds if authorized, returns 403 Forbidden if not
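
A minimal sketch of how such a decorator could enforce the check is shown below. The Action values and the role-to-action mapping are illustrative assumptions, not the real access-control rules:

# Hypothetical sketch of role-based authorization around an endpoint handler.
import functools
from enum import Enum, auto

from fastapi import HTTPException


class Action(Enum):
    QUERY = auto()
    CONVERSATION_DELETE = auto()


# Illustrative role -> permitted-actions mapping; the real rules live in authorization/.
ROLE_PERMISSIONS = {
    "developer": {Action.QUERY},
    "admin": {Action.QUERY, Action.CONVERSATION_DELETE},
}


def authorize(action: Action):
    """Allow the call only if one of the user's roles grants the required action."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, roles: list[str], **kwargs):
            if not any(action in ROLE_PERMISSIONS.get(role, set()) for role in roles):
                raise HTTPException(status_code=403, detail="Forbidden")
            return await func(*args, roles=roles, **kwargs)
        return wrapper
    return decorator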

2.5 Llama Stack Client (client.py)

Purpose: Communicate with the Llama Stack backend service for LLM operations
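
For illustration, a minimal sketch using the llama-stack-client Python package to connect to a Llama Stack server and list the available models. The base URL is an assumption; in LCore it comes from the service configuration:

# Hypothetical sketch: connect to a Llama Stack server and inspect available models.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # assumed address

for model in client.models.list():
    print(model)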

Llama Stack APIs Used:


2.6 Quota Management (quota/)

Purpose: Enforce token usage limits and track consumption

Components:

Quota Limiters:

Token Usage Tracking:

Background Jobs:

Quota Enforcement Flow:

  1. Before LLM Call:
    • Check if user has available quota
    • Raise QuotaExceedError if the quota is exhausted
    • Request is blocked with 429 status code
  2. After LLM Call:
    • Count input and output tokens from LLM response
    • Record token usage in Token Usage History table
    • Consume tokens from user’s quota
    • Update quota counters
  3. On Error:
    • If LLM call fails, no tokens are consumed
    • Quota remains unchanged
    • User can retry the request
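
A minimal sketch of this flow is shown below. Only QuotaExceedError and the 429 mapping come from the description above; the limiter interface and response fields are assumptions:

# Hypothetical sketch of quota enforcement around an LLM call.
class QuotaExceedError(Exception):
    """Raised when the caller has no remaining token quota (mapped to HTTP 429)."""


async def query_with_quota(limiter, llm_call, user_id: str, prompt: str):
    # 1. Before the LLM call: block immediately if the quota is exhausted.
    if not limiter.has_available_quota(user_id):
        raise QuotaExceedError(user_id)  # the error handler returns 429 to the client

    # 3. On error: an exception from llm_call propagates and no tokens are consumed.
    response = await llm_call(prompt)

    # 2. After the LLM call: record usage and consume tokens from the user's quota.
    limiter.consume(user_id, response.input_tokens + response.output_tokens)
    return response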

2.7 Caching System (cache/)

Purpose: Store full conversation transcripts for retrieval, debugging, and compliance

When an LLM response is received, the system creates a CacheEntry containing the query, response, referenced documents (RAG URLs), provider/model metadata, and timestamps. This entry is stored in the cache database for conversation retrieval (GET /conversations/{id}), debugging, analytics, and compliance auditing.
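
A minimal sketch of such an entry as a dataclass (the field names are assumptions drawn from the description above):

# Hypothetical sketch of a cache entry for one query/response exchange.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CacheEntry:
    conversation_id: str
    query: str
    response: str
    referenced_documents: list[str]  # RAG URLs returned alongside the answer
    provider: str
    model: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))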

Implementations:


2.8 Metrics System (metrics/)

Purpose: Export Prometheus-compatible metrics for observability and monitoring

Metric Categories:

API Metrics:

LLM Metrics:

Quota Metrics:

Shield Metrics:

Metrics Endpoint:
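
For illustration, a minimal sketch of exposing Prometheus metrics from a FastAPI application with prometheus_client. The metric name and labels are assumptions; the real definitions live in metrics/:

# Hypothetical sketch: count API calls and serve them at GET /metrics.
from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, generate_latest

app = FastAPI()

# Illustrative metric; actual metric names and labels are defined in metrics/.
rest_api_calls_total = Counter(
    "rest_api_calls_total", "Total REST API calls", ["path", "status_code"]
)


@app.get("/health")
def health() -> dict:
    rest_api_calls_total.labels(path="/health", status_code="200").inc()
    return {"status": "ok"}


@app.get("/metrics")
def metrics() -> Response:
    """Render all registered metrics in the Prometheus text exposition format."""
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)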


2.9 MCP Integration (utils/mcp_*)

Purpose: Enable LLMs to call external tools via Model Context Protocol (MCP) servers

MCP servers are remote HTTP services that expose tools/capabilities to LLMs (e.g., Kubernetes management, web search, databases, custom business logic).

How It Works:

  1. Configuration: MCP servers are defined in the config file with name, URL, and authorization headers
  2. Registration at Startup: LCore tells Llama Stack about each MCP server by calling toolgroups.register() - this makes the MCP server’s tools available in Llama Stack’s tool registry
  3. Query Processing: When processing a query, LCore determines which tools to make available to the LLM and finalizes authorization headers (e.g., merging client-provided tokens with configured headers)
  4. Tool Execution: When the LLM calls a tool, Llama Stack routes the request to the appropriate MCP server URL with the finalized authorization headers
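
A minimal sketch of step 2 using the llama-stack-client package is shown below. The toolgroups.register() call itself comes from the description above; the parameter names, provider id, and endpoint value are assumptions for illustration:

# Hypothetical sketch: make a remote MCP server's tools visible to Llama Stack.
# Parameter names and values are illustrative assumptions.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # assumed address

client.toolgroups.register(
    toolgroup_id="mcp::kubernetes",                       # name taken from LCore's config
    provider_id="model-context-protocol",                 # assumed MCP provider id
    mcp_endpoint={"uri": "https://mcp.example.com/sse"},  # remote MCP server URL
)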

Authorization:

Client-Provided Authentication (MCP-HEADERS):

Clients can provide their own authentication tokens for specific MCP servers using the MCP-HEADERS request header. This is used when MCP servers are configured with "client" as the authorization value.

Use GET /v1/mcp-auth/client-options to discover which servers accept client authentication and what header names they expect.
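
For example, a client could discover those options before issuing a query (a sketch using the requests library; the base URL and response shape are assumptions):

# Hypothetical sketch: ask LCore which MCP servers accept client-supplied credentials.
import requests

BASE_URL = "http://localhost:8080"  # assumed LCore address
headers = {"Authorization": "Bearer <client-token>"}

resp = requests.get(f"{BASE_URL}/v1/mcp-auth/client-options", headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json())  # servers configured with "client" auth and the header names they expect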

Limitations:


2.10 A2A Protocol Support (app/endpoints/a2a.py, a2a_storage/)

Purpose: Enable external AI agents to call LCore as an A2A-compatible agent

External agents interact with LCore through a multi-step process:

  1. Discovery: The agent calls GET /.well-known/agent.json to retrieve LCore’s capabilities, skills, and supported modes
  2. Message Exchange: The agent sends messages via POST /a2a using JSON-RPC 2.0 format (e.g., message/send method) with a context_id to identify the conversation
  3. Context Mapping: The A2A context store maps the external agent’s context_id to LCore’s internal conversation_id, enabling multi-turn conversations (storage: PostgreSQL, SQLite, or in-memory)
  4. Query Processing: LCore processes the message through its standard query pipeline (including LLM calls via Llama Stack) and returns the response to the external agent

External A2A requests go through LCore’s standard authentication system (K8s, RH Identity, API Key, etc.).
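
For illustration, a minimal sketch of steps 1 and 2 as an external agent might perform them. The endpoint paths, JSON-RPC 2.0 envelope, method name, and context_id come from the description above; the params payload shape is an assumption:

# Hypothetical sketch: discover LCore's agent card, then send an A2A message.
import requests

BASE_URL = "http://localhost:8080"  # assumed LCore address
auth = {"Authorization": "Bearer <agent-token>"}

# 1. Discovery: fetch LCore's capabilities, skills, and supported modes.
card = requests.get(f"{BASE_URL}/.well-known/agent.json", headers=auth, timeout=30).json()

# 2. Message exchange: JSON-RPC 2.0 message/send (payload fields are illustrative).
rpc_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "message/send",
    "params": {
        "context_id": "ctx-123",  # mapped to an internal conversation_id by LCore
        "message": {"role": "user", "parts": [{"text": "How do I scale pods?"}]},
    },
}
reply = requests.post(f"{BASE_URL}/a2a", json=rpc_request, headers=auth, timeout=60)
print(reply.json())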


3. Request Processing Pipeline

This section illustrates how requests flow through LCore from initial receipt to final response.

3.1 Complete Pipeline Overview

Every API request flows through this standardized pipeline:

  1. FastAPI Routing - Match URL path, parse parameters
  2. Middleware Layer - CORS validation, metrics timer, exception handling
  3. Authentication - Extract and validate auth token, return AuthTuple(user_id, username, roles, token)
  4. Authorization - Check user roles against required action permissions
  5. Endpoint Handler - Execute business logic (see concrete example below)
  6. Middleware Response - Update metrics, log response

Concrete Example:

Here’s how a real query flows through the system:

User Query: “How do I scale pods in Kubernetes?”

Step-by-Step Processing:

  1. Request Arrives - POST /v2/query with query text and optional conversation_id
  2. Authentication - Validate JWT token, extract user_id="user123", roles=["developer"]
  3. Authorization - Check if “developer” role has QUERY action permission ✅
  4. Quota Check - User has 50,000 tokens available ✅
  5. Model Selection - Use configured default model (e.g., meta-llama/Llama-3.1-8B-Instruct)
  6. Context Building - Retrieve conversation history, query RAG vector stores for relevant docs, determine available MCP tools
  7. Llama Stack Call - Send complete request with system prompt, RAG context, MCP tools, and shields
  8. LLM Processing - Llama Stack generates response, may invoke MCP tools, returns token counts
  9. Post-Processing - Apply shields, generate conversation summary if new
  10. Store Results - Save to Cache DB, User DB, consume quota, update metrics
  11. Return Response - Complete LLM response with referenced documents, token usage, and remaining quota
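
As a concrete request, the query above could be sent like this (a sketch using the requests library; the base URL is an assumption, and response fields beyond those listed in step 11 are not shown):

# Hypothetical sketch of the query walked through above.
import requests

BASE_URL = "http://localhost:8080"  # assumed LCore address

resp = requests.post(
    f"{BASE_URL}/v2/query",
    headers={"Authorization": "Bearer <jwt-token>"},
    json={
        "query": "How do I scale pods in Kubernetes?",
        # "conversation_id" is optional; omit it to start a new conversation
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # LLM answer plus referenced documents and token usage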

Key Takeaways:


3.2 Error Handling

Exception Types and HTTP Status Codes:
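
While the complete mapping lives in the middleware layer, the pattern can be sketched with the two status codes already mentioned in this document (403 for authorization failures, 429 for QuotaExceedError). The handler below is illustrative, not the actual implementation:

# Hypothetical sketch: translate a domain exception into an HTTP status code.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()


class QuotaExceedError(Exception):
    """Illustrative stand-in for the quota module's exception."""


@app.exception_handler(QuotaExceedError)
async def quota_exceeded_handler(request: Request, exc: QuotaExceedError) -> JSONResponse:
    # Quota exhaustion surfaces to the client as 429 Too Many Requests.
    return JSONResponse(status_code=429, content={"detail": "Quota exceeded"})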


4. Database Architecture

LCore uses a multi-database strategy to optimize for different data access patterns and lifecycles.

4.1 Multi-Database Strategy

The system uses 4 separate databases for different purposes:

Database   Purpose                          Technology         Size
User DB    Conversation metadata            SQLAlchemy ORM     Small
Cache DB   Full conversation transcripts    psycopg2/sqlite3   Large
Quota DB   Token usage and limits           psycopg2/sqlite3   Medium
A2A DB     Agent-to-agent protocol state    SQLAlchemy async   Small

4.2 Why Separate Databases?

Each database has different lifecycles, access patterns, and scaling needs:


5. API Endpoints

This section documents the REST API endpoints exposed by LCore for client interactions.

5.1 Core Query Endpoints

Non-Streaming Query:

Streaming Query:


5.2 Conversation Management

List Conversations: GET /conversations

Get Conversation: GET /conversations/{conversation_id}

Delete Conversation: DELETE /conversations/{conversation_id}
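
For example (a sketch using the requests library; the base URL and the shape of the returned JSON are assumptions):

# Hypothetical sketch: list, fetch, and delete conversations.
import requests

BASE_URL = "http://localhost:8080"  # assumed LCore address
headers = {"Authorization": "Bearer <client-token>"}

conversations = requests.get(f"{BASE_URL}/conversations", headers=headers, timeout=30).json()

conversation_id = "example-id"  # illustrative; use an id returned by the listing above
transcript = requests.get(
    f"{BASE_URL}/conversations/{conversation_id}", headers=headers, timeout=30
).json()

requests.delete(f"{BASE_URL}/conversations/{conversation_id}", headers=headers, timeout=30)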


5.3 Information Endpoints

List Models: GET /models

List Providers: GET /providers

List Tools: GET /tools

Discover MCP Client Auth Options: GET /v1/mcp-auth/client-options

List Shields: GET /shields

List RAG Databases: GET /rags


5.4 Administrative Endpoints

Health Check: GET /health

Readiness Check: GET /readiness

Metrics: GET /metrics

Feedback: POST /feedback


5.5 A2A Protocol Endpoints

Agent Card Discovery: GET /.well-known/agent.json

A2A JSON-RPC: POST /a2a


6. Deployment & Operations

LCore supports two deployment modes, each suited for different operational requirements.

6.1 Deployment Modes

Library Mode:

Server Mode:


Appendix

A. Configuration Examples

See the examples/ directory in the repository root for complete configuration examples.



End of Architecture Overview