Lightspeed core service

Conversation Compaction Design

   
Date: 2026-03-16
Component: lightspeed-stack
Authors: Maxim Svistunov
Feature / Initiative: LCORE-1311
Spike: LCORE-1314
Links: spike doc at docs/design/conversation-compaction/conversation-compaction-spike.md

What

Conversation history compaction for lightspeed-stack. When a conversation’s token count approaches the model’s context window limit, lightspeed-stack summarizes older turns using the LLM and keeps recent turns verbatim. The conversation continues without hitting HTTP 413.

Full conversation history is preserved in Llama Stack for UI display and audit. Only the LLM’s input context is compacted.

Why

Today, when a conversation exceeds the model’s context window, Llama Stack’s inference provider rejects the request. lightspeed-stack catches this and returns HTTP 413 (PromptTooLongResponse). The conversation is stuck — the user must start over.

Current failure path (verified in code):

Llama Stack sends full prompt → provider rejects (400/413, "context_length")
→ lightspeed-stack catches RuntimeError or APIStatusError
→ returns PromptTooLongResponse (HTTP 413)
→ no recovery, no truncation, no summarization

Evidence: query.py:321-325, streaming_query.py:312-317.

Requirements

R1
When estimated input tokens exceed a configurable threshold (default 70% of the model’s context window), lightspeed-stack must summarize older conversation turns before sending the request to the LLM.

R2
Recent turns must be preserved verbatim, not summarized. The buffer zone uses a degrading guard: start at N turns (default 4), but if those N turns exceed the token budget, degrade to N-1, then N-2, down to 0. This prevents compaction from producing output that still doesn’t fit.

R3
Summarization must be additive: each chunk’s summary is generated once and kept. Summaries are only re-summarized when the total summary size itself approaches the context limit.

R4
The summarization prompt must be domain-specific for Red Hat product support, preserving error messages, commands, outcomes, and user environment details.

R5
The same model used for the user’s query must be used for summarization.

R6
Full conversation history must remain accessible via the Llama Stack Conversations API (for UI display and audit). Only the LLM’s input context uses summaries.

R7
The response must include a context_status field indicating "full" (no compaction) or "summarized" (compaction occurred).

R8
Token estimation must run on every request using tiktoken. Cost is ~1-5ms, negligible compared to the LLM call.

R9
Compaction configuration must be admin-configurable via YAML: threshold ratio, fixed token floor, and buffer zone size.

R10
After compaction, lightspeed-stack injects the summary as a marked item into the existing Llama Stack conversation. When building context for the LLM, lightspeed-stack selects only items from the last summary marker onward. This preserves a single continuous conversation identity in Llama Stack while giving lightspeed-stack control over what the LLM sees.

R11
Compaction must be blocking per conversation. If a request triggers compaction, concurrent requests on the same conversation must wait until compaction completes. This prevents race conditions (e.g., two requests both triggering compaction, or a new message being appended mid-compaction).

R12
The streaming endpoint must emit a compaction event (e.g., {"event": "compaction_started"}) before the summarization LLM call begins, so the client can display a progress indicator. Non-streaming requests have no mid-request notification mechanism.
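A minimal sketch of the ordering that R12 asks for, assuming the streaming endpoint emits server-sent-event `data:` lines. Only the `compaction_started` payload comes from the requirement; `sse`, `stream_with_compaction`, the `summarize` and `answer` callables, and the `token` event name are hypothetical.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable


def sse(payload: dict) -> str:
    """Serialize one payload as a server-sent-event data line."""
    return f"data: {json.dumps(payload)}\n\n"


async def stream_with_compaction(
    needs_compaction: bool,
    summarize: Callable[[], Awaitable[None]],
    answer: Callable[[], AsyncIterator[str]],
) -> AsyncIterator[str]:
    """Yield the compaction event before the summarization call starts."""
    if needs_compaction:
        # Emitted first so the client can show a progress indicator.
        yield sse({"event": "compaction_started"})
        await summarize()
    # Hypothetical event shape for the normal answer stream.
    async for token in answer():
        yield sse({"event": "token", "data": token})
```

Because the event is yielded before `summarize()` is awaited, the client receives it even if summarization takes many seconds.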

Use Cases

U1
As a user, I want long conversations to continue working instead of failing with HTTP 413, so that I don’t lose my troubleshooting context.

U2
As a user, I want the assistant to remember what we discussed earlier in the conversation (key decisions, error messages, steps tried), even if the raw messages were summarized.

U3
As a user, I want to see my full conversation history in the UI, even if the LLM is working from a summarized version.

U4
As an administrator, I want to configure when compaction triggers and how much recent context to preserve, so that I can tune the tradeoff between context quality and token usage.

U5
As a developer integrating with the API, I want to know whether the response was generated from full context or summarized context, so that I can display an appropriate indicator.

Architecture

Overview

User Query → lightspeed-stack
  1. Resolve model, system prompt, tools
  2. Build input (query + RAG + attachments)
  3. Acquire per-conversation lock (blocks concurrent requests)
  4. Estimate total tokens (tiktoken): system + history + new query
  5. If compaction needed (tokens > threshold):
     a. Emit compaction event on streaming endpoint
     b. Retrieve conversation history from Llama Stack
     c. Split into "old" (summarize) and "recent" (keep)
        — degrading guard: reduce recent turns if they exceed token budget
     d. Summarize old turns → inject summary as marked item into conversation
     e. Store summary chunk in conversation cache
  6. Build context: select items from last summary marker onward + new query
  7. Call Llama Stack Responses API with conversation parameter
     (Llama Stack loads items from marker onward)
  ↓
Llama Stack
  8. Processes conversation (summary marker + recent turns + new query)
  ↓
lightspeed-stack
  9. Response stored in same conversation (continuous history)
  10. Update conversation cache
  11. Release per-conversation lock
  12. Return QueryResponse with context_status="summarized" (or "full")
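Steps 3 and 11 of the flow above could be sketched with one `asyncio.Lock` per conversation. This is an illustration only: `conversation_lock` and the module-level dictionary are hypothetical names, and the sketch assumes a single-process deployment (multiple workers would need a shared lock, which the design does not specify).

```python
import asyncio
from collections import defaultdict
from contextlib import asynccontextmanager
from typing import AsyncIterator

# One lock per conversation ID; requests on other conversations are unaffected.
# Assumption: locks are never evicted in this sketch.
_conversation_locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)


@asynccontextmanager
async def conversation_lock(conversation_id: str) -> AsyncIterator[None]:
    """Hold the conversation's lock for the whole request (steps 3 to 11)."""
    lock = _conversation_locks[conversation_id]
    async with lock:
        yield
```

A request handler would wrap token estimation, compaction, and the LLM call in `async with conversation_lock(conversation_id):`, so a second request on the same conversation waits until the first has finished.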

Token estimation

Add tiktoken as a dependency. Create src/utils/token_estimator.py:

def estimate_tokens(text: str, encoding_name: str = "cl100k_base") -> int: ...
def estimate_conversation_tokens(messages: list, system_prompt: str) -> int: ...

Context window sizes are configured per model in YAML:

inference:
  default_model: openai/gpt-4o-mini
  context_windows:
    openai/gpt-4o-mini: 128000
    openai/gpt-4o: 128000

Token estimation runs on every request. Cost: ~1-5ms for a 10K token conversation, negligible compared to the LLM call (seconds).
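One possible shape for the two functions, as a sketch rather than the final module. It assumes messages are dicts with a string `content` key and adds a fixed per-message overhead; both are assumptions, as are the `count` parameter and the character-based fallback used when tiktoken is not installed.

```python
from typing import Callable

try:
    import tiktoken  # the dependency proposed above
except ImportError:  # keeps the sketch importable without it
    tiktoken = None

# Assumed framing overhead per message (role markers etc.); not from the spec.
PER_MESSAGE_OVERHEAD = 4


def estimate_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken, or fall back to roughly 4 characters per token."""
    if tiktoken is None:
        return (len(text) + 3) // 4
    encoding = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() treats special-token strings as ordinary text.
    return len(encoding.encode(text, disallowed_special=()))


def estimate_conversation_tokens(
    messages: list,
    system_prompt: str,
    count: Callable[[str], int] = estimate_tokens,
) -> int:
    """Sum the system prompt and every message, plus the assumed overhead."""
    total = count(system_prompt)
    for message in messages:
        total += count(message["content"]) + PER_MESSAGE_OVERHEAD
    return total
```

Passing the counter as a parameter keeps the function easy to unit-test without loading an encoding.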

Trigger mechanism

Token-based, not turn-based. Turn sizes vary wildly (a turn with tool results can be 10x a simple Q&A).

trigger_when(estimated_tokens > context_window * threshold_ratio)

threshold_ratio defaults to 0.7 (70% of context window). A fixed token floor (e.g., 4096 tokens) prevents triggering on very small context windows.

These values, together with the buffer-zone settings, are admin-configurable via YAML:

compaction:
  enabled: true
  threshold_ratio: 0.7
  token_floor: 4096
  buffer_turns: 4
  buffer_max_ratio: 0.3

Example for a 128K context window at 70% threshold:
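A minimal sketch of the arithmetic, assuming the token floor acts as a lower bound on the trigger threshold (the text does not spell out how the two combine). `compaction_threshold` and `should_compact` are hypothetical names.

```python
def compaction_threshold(
    context_window: int,
    threshold_ratio: float = 0.7,
    token_floor: int = 4096,
) -> int:
    """Token count above which compaction triggers."""
    # Assumption: the floor is a lower bound on the ratio-based threshold.
    return max(round(context_window * threshold_ratio), token_floor)


def should_compact(
    estimated_tokens: int,
    context_window: int,
    threshold_ratio: float = 0.7,
    token_floor: int = 4096,
) -> bool:
    """True when the estimate exceeds the effective threshold."""
    threshold = compaction_threshold(context_window, threshold_ratio, token_floor)
    return estimated_tokens > threshold
```

With a 128,000-token window and the default ratio of 0.7, `compaction_threshold(128_000)` is 89,600, so a request estimated at 90,000 tokens triggers compaction and one at 89,600 does not.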

Conversation partitioning

When triggered, split the conversation into two zones: "old" turns, which are summarized, and a buffer zone of "recent" turns, which are kept verbatim.

The buffer zone uses a degrading guard: start with N turns (default 4) and estimate their token count. If they exceed the available budget (context window minus summary minus new query), reduce to N-1 turns and re-estimate. Continue degrading (4→3→2→1→0) until the buffer fits. This handles pathological cases where a few large turns (e.g., with tool results) consume most of the context.
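The degrading guard can be sketched as a loop that shrinks the buffer until it fits. The function name, the representation of turns as plain strings, and the `budget` and `count` parameters are assumptions for illustration.

```python
from typing import Callable, Sequence


def split_with_degrading_guard(
    turns: Sequence[str],
    budget: int,
    count: Callable[[str], int],
    buffer_turns: int = 4,
) -> tuple[list[str], list[str]]:
    """Return (old, recent), shrinking the recent buffer until it fits the budget."""
    for keep in range(min(buffer_turns, len(turns)), 0, -1):
        recent = list(turns[-keep:])
        if sum(count(turn) for turn in recent) <= budget:
            return list(turns[:-keep]), recent
    # Degraded to zero: every turn goes to the summarizer.
    return list(turns), []
```

When even a single recent turn exceeds the budget, the buffer is empty and all turns are summarized, which matches the 4→3→2→1→0 sequence described above.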

Additive summarization

Each chunk’s summary is generated independently and kept:

After 1st compaction:  [summary of turns 1-N] + [recent turns] + [query]
After 2nd compaction:  [summary of turns 1-N] + [summary of turns N+1-M] + [recent turns] + [query]
After 3rd compaction:  [summary 1] + [summary 2] + [summary 3] + [recent turns] + [query]

When total summary token count itself approaches the context limit, fall back to recursive re-summarization of the oldest summary chunks.

Why additive over recursive: a PoC experiment with 50 queries and 4 compaction cycles showed that recursive summarization progressively loses early-conversation context. By the 4th cycle, the summary had lost Kubernetes fundamentals, Helm, and Istio details that were discussed in the first 15 turns. See poc-results/01-analysis.txt for full evidence.
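A sketch of how additive chunks and the recursive fallback might fit together, assuming summaries are held as a list of strings, oldest first. `add_summary_chunk`, `summary_budget`, and the `resummarize` callable are hypothetical. Merging the two oldest chunks is one possible fallback policy, not one the design prescribes.

```python
from typing import Callable


def add_summary_chunk(
    chunks: list[str],
    new_chunk: str,
    summary_budget: int,
    count: Callable[[str], int],
    resummarize: Callable[[str], str],
) -> list[str]:
    """Append a chunk; merge the oldest chunks only if the total exceeds the budget."""
    result = [*chunks, new_chunk]
    while len(result) > 1 and sum(count(c) for c in result) > summary_budget:
        # Fallback path: re-summarize the two oldest chunks into one.
        merged = resummarize(result[0] + "\n" + result[1])
        result = [merged, *result[2:]]
    return result
```

In the common case the loop body never runs, so existing summaries are kept exactly as first generated.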

Summarization prompt

Domain-specific for Red Hat product support:

Summarize this conversation history for an AI assistant that helps with
Red Hat product support. Preserve:
1. The user's original question and environment details.
2. All error messages, commands run, and their outcomes.
3. Key decisions and their rationale.
4. What was resolved and what remains open.
5. Clear attribution (what the user reported vs what the assistant suggested).

Be concise but complete. The assistant will use this summary as its only
memory of older conversation turns.

Summary storage

Extend the lightspeed-stack conversation cache with summary fields. Schema:

from pydantic import BaseModel


class ConversationSummary(BaseModel):
    summary_text: str
    summarized_through_turn: int  # last turn index included in this chunk
    token_count: int              # tokens in this summary chunk
    created_at: str               # ISO 8601
    model_used: str               # model used for summarization

A conversation may have multiple summary chunks (one per compaction event). All cache backends (SQLite, Postgres, memory) need this schema extension.
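For the SQLite backend, the extension might look like the sketch below. The table name, the `conversation_id` and `chunk_index` columns, and `load_summaries` are hypothetical. The remaining columns mirror the `ConversationSummary` fields.

```python
import sqlite3

# Hypothetical table; one row per summary chunk of a conversation.
CREATE_SUMMARIES = """
CREATE TABLE IF NOT EXISTS conversation_summaries (
    conversation_id         TEXT    NOT NULL,
    chunk_index             INTEGER NOT NULL,
    summary_text            TEXT    NOT NULL,
    summarized_through_turn INTEGER NOT NULL,
    token_count             INTEGER NOT NULL,
    created_at              TEXT    NOT NULL,
    model_used              TEXT    NOT NULL,
    PRIMARY KEY (conversation_id, chunk_index)
)
"""


def load_summaries(conn: sqlite3.Connection, conversation_id: str) -> list[tuple]:
    """Return a conversation's summary chunks, oldest first."""
    return conn.execute(
        "SELECT summary_text, summarized_through_turn, token_count, "
        "created_at, model_used FROM conversation_summaries "
        "WHERE conversation_id = ? ORDER BY chunk_index",
        (conversation_id,),
    ).fetchall()
```

Keying rows by conversation and chunk index keeps multiple compaction events for one conversation in order.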

Changed request flow after compaction

After compaction, lightspeed-stack injects the summary as a marked conversation item into the existing Llama Stack conversation. The summary item has a recognizable marker (e.g., metadata tag or content prefix) so that lightspeed-stack can identify it when loading history.

When building context for subsequent requests, lightspeed-stack fetches conversation items and selects only those from the last summary marker onward. The conversation parameter continues to be used — Llama Stack still manages the conversation. lightspeed-stack just controls which items form the LLM context.

This preserves a single continuous conversation identity. The user sees one conversation in the UI, and the Conversations API returns the full history including summary items.
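Selecting items from the last summary marker onward could look like this sketch. It assumes items are plain dicts and that the marker is a metadata flag. The key name `compaction_summary` is hypothetical, since the design leaves the exact marker open.

```python
from typing import Any

# Hypothetical marker key; the design allows a metadata tag or a content prefix.
SUMMARY_MARKER_KEY = "compaction_summary"


def items_from_last_summary(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the most recent summary item and everything after it."""
    for index in range(len(items) - 1, -1, -1):
        metadata = items[index].get("metadata") or {}
        if metadata.get(SUMMARY_MARKER_KEY):
            return items[index:]
    # No compaction has happened yet: the full history is the context.
    return list(items)
```

Scanning from the end means the cost depends on the number of items since the last compaction rather than on the full history length.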

API response changes

Add context_status field to QueryResponse and StreamingQueryResponse:

context_status: str = Field(
    "full",
    description="Context status: 'full' (no compaction), "
    "'summarized' (older turns summarized).",
)

The existing truncated field remains deprecated.

Configuration

Add to models/config.py, following the existing ConfigurationBase pattern:

class CompactionConfiguration(ConfigurationBase):
    enabled: bool = Field(
        False,
        title="Enable compaction",
        description="When true, older conversation turns are summarized "
        "when estimated tokens approach the context window limit.",
    )
    threshold_ratio: float = Field(
        0.7,
        title="Threshold ratio",
        description="Trigger compaction when estimated tokens exceed this "
        "fraction of the model's context window (0.0-1.0).",
    )
    token_floor: int = Field(
        4096,
        title="Token floor",
        description="Minimum token count before compaction can trigger. "
        "Prevents triggering on very small context windows.",
    )
    buffer_turns: int = Field(
        4,
        title="Buffer turns",
        description="Number of recent turns to keep verbatim.",
    )
    buffer_max_ratio: float = Field(
        0.3,
        title="Buffer max ratio",
        description="Maximum fraction of context window the buffer zone "
        "can occupy, regardless of buffer_turns.",
    )

Add compaction field to the root Configuration class.

Implementation Suggestions

Key files and insertion points

| File | What to do |
| --- | --- |
| pyproject.toml | Add tiktoken dependency |
| src/utils/token_estimator.py | New module: estimate_tokens(), estimate_conversation_tokens() |
| src/utils/compaction.py | New module: summarization logic, partitioning, additive summary management |
| src/models/config.py | Add CompactionConfiguration (near ConversationHistoryConfiguration) |
| src/configuration.py | Add compaction_configuration property to AppConfig singleton |
| src/utils/responses.py | Modify prepare_responses_params(): insert compaction check (see below) |
| src/app/endpoints/query.py | No changes needed; compaction happens inside prepare_responses_params() |
| src/app/endpoints/streaming_query.py | No changes needed; same function is used |
| src/models/responses.py | Add context_status field to QueryResponse and StreamingQueryResponse |
| src/cache/ (all backends) | Extend schema for ConversationSummary storage |

Insertion point in responses.py

The compaction hook goes in prepare_responses_params(). Its signature:

async def prepare_responses_params(
    client: AsyncLlamaStackClient,
    query_request: QueryRequest,
    user_conversation: Optional[UserConversation],
    ...
) -> ResponsesApiParams:

At the insertion point (after line 297), the following are available:

After compaction, the summary is injected as a conversation item in Llama Stack. When building the next request, lightspeed-stack fetches items from the conversation, filters to only those after the last summary marker, and passes them as input alongside the conversation parameter. The conversation parameter is still used — the conversation identity is preserved.

Fetching conversation history

Use the same pattern as conversations_v1.py:240-246:

items_response = await client.conversations.items.list(
    conversation_id=llama_stack_conv_id,
    after=None,
    include=None,
    limit=None,
    order="asc",
)

Config pattern

All config classes extend ConfigurationBase, which sets extra="forbid". Use Field() with defaults, title, and description. Add @model_validator(mode="after") for cross-field validation if needed.

Example config files go in examples/.

Test patterns

Latency and Cost

| Scenario | Current | With compaction |
| --- | --- | --- |
| Normal turn | 1 LLM call | 1 LLM call + ~2ms tiktoken (no change) |
| Trigger turn | 1 LLM call (or 413) | 2 LLM calls (summarize + respond) |
| Post-trigger turn | 1 LLM call | 1 LLM call (no change) |

Compaction adds latency only on the trigger turn. In PoC testing, compaction turns took 14-40 seconds vs 9-20 seconds for normal turns (gpt-4o-mini).

Open Questions for Future Work

Appendix A: PoC Evidence

A proof-of-concept was built and tested.

Experiment 1 (5 queries, threshold=3): Compaction triggered successfully. LLM response after compaction correctly referenced all 4 prior topics.

Experiment 2 (50 queries, threshold=10): 4 compaction cycles. Demonstrated that recursive summarization loses early-conversation context after multiple cycles — this is why the spec requires additive summarization (R3).

Evidence files:

PoC code (not production quality, for reference only):

Appendix B: How Other APIs Handle This

| Feature | OpenAI | Anthropic | Bedrock |
| --- | --- | --- | --- |
| State management | Server-side | Client-side | Client-side |
| Auto compaction | Yes (opaque) | Yes (transparent) | No |
| Custom summarization prompts | No | Yes | N/A |
| Context editing | No | Yes (clear_tool_uses, clear_thinking) | N/A |

See the spike doc (conversation-compaction-spike.md) for full comparison including ChatGPT, Claude Code, MemGPT/Letta, and LangChain.

Appendix C: Reference Sources