Lightspeed core service

Overview

This document is the deliverable for LCORE-1314. It presents the design options for conversation history compaction in lightspeed-stack, with a recommendation and a proof-of-concept validation.

The problem: When a conversation’s token count exceeds the model’s context window, Llama Stack’s inference provider rejects the request. lightspeed-stack catches this and returns HTTP 413. The conversation is stuck — the user must start over.

The recommendation: Use LLM-based summarization. When estimated tokens approach the context window limit, summarize older turns and keep recent turns verbatim. This is provider-agnostic, proven (Anthropic and LangChain use the same pattern), and can use a domain-specific prompt for Red Hat product support.

PoC validation: A working proof-of-concept was built and tested with 50 queries across 4 compaction cycles. The results are summarized in PoC results.

Decisions for @ptisnovs and @sbunciak

These are the high-level decisions that determine scope, approach, and cost. Each has a recommendation — please confirm or override.

Decision 1: Which approach to conversation history management?

When a conversation gets too long for the context window, what should lightspeed-stack do?

| Option | Description | Complexity | Context quality |
|---|---|---|---|
| A | LLM summarization | Medium | Good |
| B | Tiered memory (MemGPT-style) | High | Excellent |
| C | Delegate to provider-native APIs | Low-Med | Varies |

See Design alternatives for full pros/cons of each.

Recommendation: A (LLM summarization). Proven pattern, provider-agnostic, good context quality. Option C is simpler but creates vendor lock-in and can’t use a domain-specific prompt.

Decision 2: Recursive or additive summarization?

When compaction triggers a second time, how should the new summary relate to the previous one?

Recursive: The LLM re-summarizes the previous summary together with new turns. Produces a single rolling summary. Simple, but early context gets progressively diluted — our PoC showed that after 4 cycles, the final summary lost major topics from the first cycle (Kubernetes fundamentals, Helm, Istio details were all missing).

Additive: Each chunk’s summary is generated once and kept. The context becomes [summary of turns 1-N] + [summary of turns N+1-M] + ... + [recent turns]. Preserves fidelity of each chunk. Total summary size grows linearly, and eventually the summaries themselves may need compaction (at which point you fall back to recursive).

See PoC results for the experimental evidence.

Recommendation: Additive, with recursive as a fallback when total summary size approaches the context limit.

Decision 3: Which model for summarization?

| Option | Description | Cost | Quality |
|---|---|---|---|
| A | Same model as the user’s query | Higher | Best |
| B | Configurable (default=same, allow cheaper) | Flexible | Flexible |
| C | Always a small/cheap model | Lowest | Varies |

Recommendation: A (same model as the user’s query). Keeps it simple — one model, no additional configuration, no risk of quality mismatch.

Decision 4: Compaction threshold strategy

How do we decide when to trigger compaction?

The threshold is a percentage of the model’s context window. “70%” means: trigger when estimated input tokens exceed 70% of the window, leaving 30% for the new query and response. The percentage adapts automatically to different models — if you switch from a 128K model to a 32K model, the threshold changes from ~90K to ~22K with no config change.

| Combo | Description | Flexibility |
|---|---|---|
| B | Percentage of context window only | Low |
| B+A | Percentage + fixed token floor | Low-Med |
| B+D | Percentage + admin-configurable via YAML | Medium |
| B+A+D | Percentage + floor + admin-configurable | Medium-High |

(A = fixed token count, B = percentage, D = admin-configurable defaults.)

Recommendation: B+A+D. Percentage as the primary mechanism, a fixed token floor for safety (prevents triggering on very small windows), and admin-configurable via YAML so deployments can tune it.

Example for a 128K context window at 70% threshold:
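The example body appears to be missing here; the arithmetic can be reconstructed from Decision 4 itself. A minimal sketch of the B+A+D check — the config names (`THRESHOLD_RATIO`, `TOKEN_FLOOR`) are illustrative, not the final YAML keys:

```python
# Hedged sketch of the B+A+D threshold check for a 128K-context model.
CONTEXT_WINDOW = 128_000
THRESHOLD_RATIO = 0.7   # the "B" part, admin-configurable via YAML (the "D" part)
TOKEN_FLOOR = 4_000     # the "A" part: never trigger below this absolute count

# The floor wins on very small windows, preventing constant compaction there.
trigger_at = max(int(CONTEXT_WINDOW * THRESHOLD_RATIO), TOKEN_FLOOR)
print(trigger_at)  # 89600: trigger at ~89.6K tokens, leaving ~38.4K for query + response
```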

Decision 5: Where does summarization happen?

| Option | Description |
|---|---|
| A | In lightspeed-stack (recommended) |
| B | In Llama Stack (upstream contribution) |
| C | Split: trigger in lightspeed, summarize in Llama |

Recommendation: A. lightspeed-stack controls the conversation flow, has the domain knowledge (Red Hat support), and doesn’t require upstream coordination. Llama Stack upstream has no active work here — see Appendix A.

Technical decisions for @ptisnovs

These are implementation-level decisions. They don’t affect scope or cost significantly but determine how the code is structured.

Decision 6: How to handle conversation_id after compaction?

After compaction, the LLM should see the summary + recent turns, not the full original history. How do we achieve this?

| Option | Description |
|---|---|
| A | Stop using conversation param; build full input explicitly |
| B | Inject summary as a message into the existing Llama Stack conversation |
| C | Create a new Llama Stack conversation with summary as first message |

Recommendation: B. Inject summary as a marked item into the existing conversation, then select from the marker onward when building context. This preserves a single continuous conversation identity — the user sees one conversation, the Conversations API returns complete history, and the audit trail is unbroken. lightspeed-stack still controls what the LLM sees by filtering items at the marker boundary. The PoC used C (new conversation), which validated the summarization mechanism but breaks conversation identity.
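The marker-based selection can be sketched as follows. The item dictionaries and the `compaction_marker` type are assumptions for illustration, not the actual Llama Stack conversation item schema:

```python
# Illustrative only: item shape and "compaction_marker" type are hypothetical.
def select_context(items: list[dict]) -> list[dict]:
    """Return items from the last compaction marker onward (all items if none)."""
    last_marker = None
    for i, item in enumerate(items):
        if item.get("type") == "compaction_marker":
            last_marker = i
    return items if last_marker is None else items[last_marker:]

items = [
    {"type": "message", "content": "old turn 1"},
    {"type": "message", "content": "old turn 2"},
    {"type": "compaction_marker", "content": "Summary of turns 1-2"},
    {"type": "message", "content": "recent turn"},
]
print(len(select_context(items)))  # 2: the summary marker plus the recent turn
```

The full item list stays in the conversation (complete history for the Conversations API and audit trail); only the LLM context is filtered.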

Decision 7: What to do with the truncated field?

The truncated field in QueryResponse is currently deprecated and hardcoded to False.

| Option | Description |
|---|---|
| A | Un-deprecate it (True when summary is active) |
| B | Keep deprecated; add compacted: bool |
| C | Add context_status: "full" / "summarized" |

Recommendation: C. Distinguishes between “full context” (no compaction) and “summarized” (compaction happened). Can be extended with additional values later if needed.

Decision 8: Summary storage location

| Option | Description |
|---|---|
| A | Extend lightspeed conversation cache (CacheEntry) |
| B | New dedicated table |
| C | Store in Llama Stack (as conversation item metadata) |

Recommendation: A. Co-locates summary with existing conversation metadata. All cache backends (SQLite, Postgres, memory) would need the schema extension.

Schema:

class ConversationSummary(BaseModel):
    summary_text: str
    summarized_through_turn: int  # last turn included in summary
    token_count: int              # tokens in the summary itself
    created_at: str               # ISO 8601
    model_used: str               # model used for summarization

Decision 9: Buffer zone calculation

The “buffer zone” is the most recent turns kept verbatim (not summarized).

| Approach | Description | Pros | Cons |
|---|---|---|---|
| Turns | Keep last N turns | Simple, intuitive | Turns vary wildly in size |
| Tokens | Keep last T tokens of recent messages | Precise, predictable | May split a turn in the middle |
| Hybrid | Keep last N turns, capped at T tokens | Intuitive + safe | Slightly more logic |

Anthropic’s compaction uses token-based thresholds throughout — the buffer is implicit (whatever fits after the compaction block).

Recommendation: Hybrid with degrading guard. Start with the last 4 turns. If their token count exceeds the available budget, degrade to 3, then 2, then 1, then 0. This handles pathological cases where a few large turns (e.g., with tool results) would overflow the context even after summarizing everything else.
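The degrading guard can be sketched as a simple loop. `count_tokens` here is a whitespace-split stand-in for the real tiktoken-based estimator:

```python
# Sketch of the degrading buffer guard: start at 4 turns, shrink until the
# verbatim buffer fits the token budget; degrade to an empty buffer if needed.
def pick_buffer(turns: list[str], budget: int, start: int = 4,
                count_tokens=lambda t: len(t.split())) -> list[str]:
    for n in range(min(start, len(turns)), -1, -1):
        buffer = turns[len(turns) - n:]          # the last n turns
        if sum(count_tokens(t) for t in buffer) <= budget:
            return buffer
    return []

turns = ["a big turn " * 100] + ["short turn"] * 3
print(len(pick_buffer(turns, budget=50)))  # 3: degrades past the oversized turn
```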

Decision 10: Concurrency during compaction

What happens if a second request arrives for the same conversation while compaction is running?

| Option | Description |
|---|---|
| A | No protection (accept race condition risk) |
| B | Blocking: per-conversation lock, concurrent requests wait |
| C | Optimistic: check if summary already exists, skip if so |

Recommendation: B (blocking). Compaction modifies conversation state — concurrent requests could append messages mid-compaction or trigger duplicate compactions. A per-conversation lock ensures consistency. This matches industry practice (Cursor, Claude Code both use synchronous compaction).

Decision 11: Compaction progress notification

Should the client be notified that compaction is in progress (before the summarization LLM call)?

| Option | Description |
|---|---|
| A | No notification (client sees an unexplained delay) |
| B | Streaming event before compaction (e.g., compaction_started) |
| C | Response header or field after the fact only |

Recommendation: B for the streaming endpoint. Emit a compaction event before the summarization call so the client can display “Compacting conversation…” or similar. Non-streaming requests have no mid-request notification mechanism, so they just see a slower response.

Proposed JIRAs

Each JIRA includes an agentic tool instruction that an assignee can optionally feed to Claude Code or similar.

LCORE-????: Add token estimation to lightspeed-stack

Description: Add the ability to estimate token counts before sending requests to the LLM. This is the prerequisite for the compaction trigger — we need to know when conversation history is approaching the context window limit.

Scope:

Acceptance criteria:

Agentic tool instruction:

Read the "Token estimation" and "Configuration" sections in
docs/design/conversation-compaction/conversation-compaction.md.
Key files: pyproject.toml, src/models/config.py, src/utils/ (new module).
Add config fields following the pattern in models/config.py around line 1418
(ConversationHistoryConfiguration).

LCORE-????: Implement conversation summarization module

Description: Create the core summarization logic — given a list of conversation turns and a prompt, call the LLM and return a summary string. Includes the domain-specific summarization prompt for Red Hat product support.

Scope:

Acceptance criteria:

Agentic tool instruction:

Read the "Architecture" section (especially "Additive summarization",
"Conversation partitioning", and "Summarization prompt") in
docs/design/conversation-compaction/conversation-compaction.md.
Key files: src/utils/ (new module), src/models/config.py.

LCORE-????: Extend conversation cache for summaries

Description: Add summary storage to the conversation cache so summaries persist across requests and survive restarts.

Scope:

Acceptance criteria:

Agentic tool instruction:

Read the "Summary storage" section in
docs/design/conversation-compaction/conversation-compaction.md.
Key files: src/cache/, src/models/.
Follow existing cache backend patterns (test_sqlite_cache.py, test_postgres_cache.py).

LCORE-????: Integrate compaction into the query flow

Description: Wire the token estimator, summarization module, and summary cache into the actual request path so compaction triggers automatically.

Scope:

Acceptance criteria:

Agentic tool instruction:

Read the "Changed request flow after compaction" and "Implementation Suggestions"
sections in docs/design/conversation-compaction/conversation-compaction.md.
Key files: src/utils/responses.py (around line 292), src/app/endpoints/query.py,
src/app/endpoints/streaming_query.py.
The insertion point is in prepare_responses_params(), after conversation_id is
resolved but before ResponsesApiParams is built.

LCORE-????: Update response model and API

Description: Add a context_status field (or equivalent, per Decision 7) to the response so clients know whether compaction occurred.

Scope:

Acceptance criteria:

Agentic tool instruction:

Read the "API response changes" section in
docs/design/conversation-compaction/conversation-compaction.md.
Key files: src/models/responses.py (around line 410, the existing truncated field).

LCORE-????: Verify test coverage for compaction

Description: Review all compaction-related code and verify that unit tests, integration tests, and E2E tests cover the critical paths. Add any missing tests.

Scope:

Acceptance criteria:

Agentic tool instruction:

Read the "Appendix A: PoC Evidence" section in
docs/design/conversation-compaction/conversation-compaction.md
and the full experiment data in docs/design/conversation-compaction/poc-results/.
Key test files: tests/unit/utils/, tests/integration/endpoints/,
tests/e2e/features/.

LCORE-????: Coordinate with UI team on compaction indicator

Description: Define the API contract for communicating compaction status to the UI. Two signals: (1) context_status field in the response, and (2) a compaction_started streaming event emitted before the summarization call.

Scope:

Acceptance criteria:

PoC results

A proof-of-concept was built in lightspeed-stack and tested against a real Llama Stack + OpenAI (gpt-4o-mini) setup.

What the PoC does

The PoC hooks into prepare_responses_params() in src/utils/responses.py. When message_count (from the lightspeed DB) exceeds a threshold, it:

  1. Fetches full conversation history from Llama Stack.
  2. Splits into “old” (to summarize) and “recent” (to keep verbatim).
  3. Calls the LLM with a summarization prompt to produce a summary.
  4. Creates a new Llama Stack conversation seeded with [summary + recent turns].
  5. Uses the new conversation for the current query.

Important: The PoC diverges from the production design in several ways:

The PoC is not production code — it validates the core summarization mechanism.

Experiment 1: 5 queries, threshold=3

Experiment 2: 50 queries, threshold=10

50 queries across 10 topic blocks (Kubernetes, Docker, Podman, OpenShift, Helm, Istio, Tekton/ArgoCD, observability, security, wrap-up). 6 “probe” queries placed at turns 11, 21, 31, 41, 46, 50 — these ask the LLM to recall specific earlier topics to test whether compaction preserved them.

Results

Context fidelity

Post-compaction baseline growth

Post-compaction input tokens: 1565 → 2362 → 3280 → 4076.

Each recursive summary is larger than the last because it carries the weight of all prior summaries. This means that after enough cycles, the summary itself approaches the context limit. This is the main argument for additive summarization over recursive.

Summary quality

| Summary | Turns summarized | Quality | Notes |
|---|---|---|---|
| 1 | 1-8 | Good | Focused, accurate |
| 2 | Summary 1 + 9-18 | Good | Broader, well-structured |
| 3 | Summary 2 + 19-26 | Good | Comprehensive, covers all prior topics |
| 4 | Summary 3 + 27-37 | Problem | Dominated by ArgoCD, lost broader context |

Summary 4’s quality drop is likely because the LLM prioritized the detailed recent content over the already-compressed summary text.

Full evidence in poc-results/.

PoC code

All linters pass (black, pylint, pyright, ruff, pydocstyle, mypy).

How conversations work today

Request flow

User Query → lightspeed-stack
  1. Resolve model, system prompt, tools
  2. Build input (query + inline RAG + attachments)
  3. Pass conversation_id to Llama Stack
  ↓
Llama Stack Responses API
  4. Retrieve full conversation history from storage
  5. Build prompt: [system] + [full history] + [user query]
  6. Call LLM inference provider
  7. If context exceeded → error bubbles up → HTTP 413
  8. Store response in conversation
  ↓
lightspeed-stack
  9. Extract LLM text, tool calls, documents, token usage
  10. Cache in conversation cache
  11. Return QueryResponse (truncated=False, always)

Key components

| Component | Role | Code |
|---|---|---|
| lightspeed-stack | FastAPI wrapper; delegates to Llama | src/utils/responses.py:322-331 |
| Llama Stack | Conversation storage + LLM calls | openai_responses.py:206-278, streaming.py:399-413 |
| conversation_items | Rich items (tool calls, MCP) for UI | conversations.py:81-98 |
| conversation_messages | Chat messages for LLM context | responses_store.py:71-77 |

What happens when context is exceeded

  1. Llama Stack sends the full prompt to the inference provider.
  2. Provider rejects (HTTP 400/413 with “context_length” in error message).
  3. lightspeed-stack catches RuntimeError (library mode) or APIStatusError.
  4. Returns PromptTooLongResponse (HTTP 413) to the user.
  5. No recovery. No truncation. No summarization. Conversation is stuck.

Evidence: query.py:321-325, streaming_query.py:312-317.

The truncated field

The truncated field exists in QueryResponse but is deprecated and hardcoded to False.

It was added anticipating future truncation support, then deprecated when that work didn’t happen.

Llama Stack’s truncation support

The truncation parameter exists in the Responses API:

The TODO at streaming.py:400 says: “Implement actual truncation logic when ‘auto’ mode is supported.” Dormant since Feb 2026 (Issue #4890: zero comments, no assignee, no milestone).

Token estimation

| Capability | lightspeed-stack | Llama Stack |
|---|---|---|
| Pre-inference estimation | None | None |
| Post-inference (usage) | Yes | Yes |
| Tokenizer dependency | None | tiktoken (RAG) |
| Context window registry | None | Partial |

There is no way to estimate token count before sending to the LLM. Adding tiktoken to lightspeed-stack is a prerequisite for any token-based trigger.

tiktoken runs on CPU only — no API calls, no GPU. Cost is ~1-5ms for a 10K token conversation, negligible compared to the LLM call.
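A hedged sketch of what pre-inference estimation could look like. The model-to-encoding mapping, the per-message overhead constant, and the chars/4 fallback are assumptions, not a spec:

```python
# Sketch of pre-inference token estimation; requires `pip install tiktoken`.
def estimate_tokens(messages: list[dict], model: str = "gpt-4o-mini") -> int:
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        count = lambda text: len(enc.encode(text))
    except Exception:  # tiktoken unavailable or model unknown to it
        count = lambda text: max(1, len(text) // 4)  # rough ~4 chars/token heuristic
    # +4 per message approximates role/framing overhead in chat formats.
    return sum(count(m["content"]) + 4 for m in messages)

history = [{"role": "user", "content": "How do I debug a CrashLoopBackOff?"}]
print(estimate_tokens(history) > 0)  # True
```

Estimates only need to be approximate: they gate a threshold at 70% of the window, so a few percent of error is absorbed by the remaining 30% headroom.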

How other APIs handle compaction

OpenAI Responses API

Approach: Server-side stateful with opaque compaction.

| Pros | Cons |
|---|---|
| Zero developer intervention needed | Opaque: can’t inspect what’s preserved |
| Server manages all state | Vendor lock-in (encrypted blobs) |
| Manual compact for control | All input tokens re-billed each turn |

Anthropic Messages API

Approach: Stateless with transparent compaction.

Default summarization prompt:

“You have written a partial transcript for the initial task above. Please write a summary of the transcript. The purpose of this summary is to provide continuity so you can continue to make progress towards solving the task in a future context, where the raw history above may not be accessible and will be replaced with this summary.”

| Pros | Cons |
|---|---|
| Transparent, readable summaries | Custom instructions replace default |
| Custom summarization prompts | Client must handle compaction blocks |
| pause_after_compaction for control | Stateless: client manages all history |
| Context editing strategies composable | |

AWS Bedrock Converse API

Approach: Stateless with zero built-in context management.

| Pros | Cons |
|---|---|
| Model-agnostic (Claude, Llama…) | Zero built-in context management |
| No data retention (privacy) | Full burden on developer |
| Simple, predictable | Cost grows linearly (full history re-sent) |

Comparison

| Feature | OpenAI | Anthropic | Bedrock |
|---|---|---|---|
| State management | Server-side | Client-side | Client-side |
| Auto compaction | Yes | Yes | No |
| Manual compaction | Yes | Via trigger | No |
| Summary transparency | Opaque | Transparent | N/A |
| Custom prompts | No | Yes | N/A |
| Pause after compact | No | Yes | N/A |
| Context editing | No | Yes | N/A |

How other tools handle compaction

ChatGPT (Consumer)

Claude (Consumer and Claude Code)

MemGPT / Letta

LangChain

| Strategy | Trigger | Preserves | LLM cost |
|---|---|---|---|
| BufferMemory | None | Everything | 0 extra |
| WindowMemory | Message count | Last k messages | 0 extra |
| SummaryMemory | Every turn | Rolling summary | 1 call/turn |
| SummaryBufferMemory | Token threshold | Recent + summary | 1 call/trigger |
| TokenBufferMemory | Token threshold | Recent by tokens | 0 extra |

SummaryBufferMemory is the proven hybrid: keep recent messages verbatim, summarize older ones. Trigger is token-threshold-based.

Existing approaches

There are four approaches to handling long conversation history (excluding simple FIFO truncation, which loses all older context and is not considered here):

| # | Approach | Examples | Complexity | Context quality |
|---|---|---|---|---|
| 1 | No management | Bedrock, raw Anthropic | Trivial | Full until fail |
| 2 | LLM summarization | Anthropic compact, OpenAI compact | Medium | Good |
| 3 | Hybrid buffer+summary | LangChain SummaryBuffer, Claude | Medium-High | Very good |
| 4 | Tiered hierarchical | MemGPT/Letta | High | Excellent |

Design alternatives for lightspeed-stack

Given our architecture (lightspeed-stack wraps Llama Stack) and the constraint that we implement in lightspeed-stack (see Appendix A for why not upstream):

Alternative A: LLM-based summarization

When approaching the context limit, use the LLM to summarize older turns, keeping recent turns verbatim.

Implementation:

  1. Add token estimation to lightspeed-stack.
  2. When estimated tokens exceed threshold (e.g., 70% of context window):
    1. Split conversation into “old” (summarize) and “recent” (keep).
    2. Send old turns to LLM with a summarization prompt.
    3. Store the summary.
    4. Build context as: [system] + [summary] + [recent turns] + [user query].
  3. Additive: when threshold hit again, generate a new summary for the new chunk and append it to the existing summaries.
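The context assembly in step 2.4 (extended by step 3's list of chunk summaries) can be sketched as follows. The message shapes and the summary-as-system-message convention are illustrative assumptions:

```python
# Sketch of additive context assembly: each compaction produces one immutable
# chunk summary; the context is all chunk summaries plus verbatim recent turns.
def build_context(system: str, summaries: list[str], recent: list[str],
                  query: str) -> list[dict]:
    messages = [{"role": "system", "content": system}]
    for s in summaries:  # [summary of turns 1-N] + [summary of turns N+1-M] + ...
        messages.append({"role": "system", "content": f"Conversation summary: {s}"})
    messages += [{"role": "user", "content": t} for t in recent]  # simplified roles
    messages.append({"role": "user", "content": query})
    return messages

ctx = build_context("You help with Red Hat product support.",
                    ["turns 1-8: user set up OpenShift", "turns 9-18: Helm issues"],
                    ["latest verbatim turn"],
                    "What did we decide about Helm?")
print(len(ctx))  # 5: 1 system + 2 summaries + 1 recent turn + 1 query
```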

| Pros | Cons |
|---|---|
| Preserves semantic context from older turns | Extra LLM call for summarization |
| Well-proven pattern (Anthropic, LangChain) | Summarization quality depends on model |
| Additive — each chunk summarized once | Latency: adds 1 LLM call at trigger |
| Can use domain-specific summarization prompt | Must handle summary storage |

Trigger mechanism

Token-based, not turn-based:

trigger_when(estimated_tokens > context_window * threshold_ratio)

threshold_ratio configurable, defaulting to 0.7 (trigger at 70% of context window, leaving 30% for the new query + response).

Conversation partitioning

Split the conversation into “old” turns (to be summarized) and “recent” turns (kept verbatim).

Buffer zone: degrading guard — start with N turns (default 4). If their token count exceeds the available budget, degrade to N-1, then N-2, down to 0.

Summarization prompt

Domain-specific for Red Hat product support:

Summarize this conversation history for an AI assistant that helps with
Red Hat product support. Preserve:
1. The user's original question and environment details.
2. All error messages, commands run, and their outcomes.
3. Key decisions and their rationale.
4. What was resolved and what remains open.
5. Clear attribution (what the user reported vs what the assistant suggested).

Be concise but complete. The assistant will use this summary as its only
memory of older conversation turns.

Summary storage

Extend lightspeed’s conversation cache. See Decision 8.

Changed request flow

User Query → lightspeed-stack
  1. Resolve model, system prompt, tools
  2. Build input (query + RAG + attachments)
  3. Acquire per-conversation lock
  4. Estimate total tokens: system + history + new query
  5. If over threshold:
     a. Emit compaction event (streaming)
     b. Summarize old turns
     c. Inject summary as marked item into Llama Stack conversation
     d. Store summary chunk in cache
  6. Build context: select items from last summary marker onward
  7. Call Llama Stack with conversation parameter (marker-based selection)
  ↓
Llama Stack
  8. Processes conversation (marker + recent turns + new query)
  ↓
lightspeed-stack
  9. Response stored in same conversation (continuous history)
  10. Release per-conversation lock
  11. Return QueryResponse (context_status="summarized" if summary was used)

After compaction, the summary is a marked item in the existing conversation. lightspeed-stack controls what the LLM sees by selecting from the marker onward. The conversation identity is preserved.

Alternative B: Hybrid with compaction-proof instructions

Alternative A + a “compaction-proof” instruction layer (inspired by Claude Code’s CLAUDE.md pattern).

Additional features over A:

| Pros | Cons |
|---|---|
| All benefits of A | All costs of A |
| Critical instructions never lost | Pinning adds UX complexity |
| Users can protect important context | More state to manage |

Verdict: Good enhancement for later. Not essential for v1.

Alternative C: Tiered memory (MemGPT-inspired)

Three-tier memory: working context, recall storage (searchable conversation history), archival storage (extracted facts).

| Pros | Cons |
|---|---|
| Nothing truly lost | High complexity |
| LLM can retrieve old context on demand | Requires vector DB for recall |
| Best long-term context quality | Multiple LLM calls per turn |
| Cross-conversation memory | Significant architecture changes |

Verdict: Too complex for v1.

Alternative D: Delegate to provider-native compaction

Use OpenAI’s or Anthropic’s native compaction APIs. Implement client-side only for providers without native support.

| Pros | Cons |
|---|---|
| Leverages best-in-class compaction | Divergent behavior across providers |
| Less code to maintain | Opaque compaction for OpenAI |
| Provider handles edge cases | Can’t customize for Red Hat domain |
| Free quality improvements over time | Vendor lock-in for compaction format |

Verdict: Breaks the provider-agnostic principle. Not recommended as primary approach, but could be offered as an opt-in optimization.

Cost and latency

Summarization cost

Each summarization call consumes tokens:

Example for a 128K context window at 70% threshold:
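The example body appears to be missing here; the shape of the cost can be reconstructed with illustrative numbers (the buffer and summary sizes below are assumptions, not measurements):

```python
# Hedged cost arithmetic for a 128K window at a 70% threshold. At trigger time
# roughly 89.6K tokens of history exist; with a 4-turn verbatim buffer, the
# rest is sent to the summarization call as input.
window, ratio = 128_000, 0.7
history_at_trigger = int(window * ratio)   # 89,600 tokens
buffer_tokens = 8_000                      # assumed size of the 4 kept turns
summary_tokens = 1_500                     # assumed summary output size

# The summarization call pays the old turns as input plus the summary as output.
summarize_input = history_at_trigger - buffer_tokens
print(summarize_input, summary_tokens)     # 81600 1500
```

The cost is paid once per trigger turn; subsequent turns send only the (much smaller) summary plus recent turns.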

Latency impact

| Scenario | Current | With compaction |
|---|---|---|
| Normal turn | 1 LLM call | 1 LLM call (no change) |
| Trigger turn | 1 LLM call (or 413) | 2 LLM calls (summarize + respond) |
| Post-trigger turn | 1 LLM call | 1 LLM call (no change) |

Summarization adds latency only on the trigger turn. In our PoC, compaction turns took 14-40 seconds (vs 9-20 for normal turns).

What’s required

| Requirement | Status | Effort |
|---|---|---|
| Token estimation (tiktoken) | Not present | Small |
| Context window registry (per model) | Not present | Small |
| Summary storage in conversation cache | Schema change | Medium |
| Summarization prompt design | New | Medium |
| Context building logic change | Core change | Large |
| Configuration (threshold, buffer size) | New config | Small |

Dependencies

| Dependency | Type | Blocker? |
|---|---|---|
| tiktoken library | New dependency | No |
| Model context window sizes | Configuration | No |
| Llama Stack conversation API | Already exists | No |
| Conversation cache schema | Schema change | No |
| Upstream Llama Stack changes | None needed | No |
No external dependencies or cross-team coordination needed. The feature is fully self-contained within lightspeed-stack (except the UI indicator).

Appendix A: Llama Stack upstream status

As of 2026-03-16:

The truncation work upstream is about OpenAI API conformance, not about building context management algorithms. Implementing in lightspeed-stack is the right approach.

Appendix B: Anxhela Coba’s SVD suggestion

Anxhela suggested using SVD (Singular Value Decomposition) on conversation embeddings as an alternative to LLM-based summarization.

Assessment: Not practical for this use case. SVD on embeddings produces compressed vector representations that an LLM cannot consume as text context. The LLM needs natural language in its context window, not a compressed matrix. Information loss is uncontrollable — you can’t guarantee it preserves specific facts or decisions.

LLM-generated summaries produce natural language the model can directly read, and the summarization prompt can control what gets preserved.

Acknowledged as considered; not pursued.

Appendix C: Ondrej Metelka’s provider lock-in concern

Ondrej noted: “if it is implemented on the provider/openai API level — then this feature is locked for providers conforming to these particular endpoints.”

Valid concern. This is why Alternative A (LLM-based summarization in lightspeed-stack) is recommended over Alternative D (delegate to provider-native compaction).

By implementing in lightspeed-stack:

Appendix D: Reference sources