Diagnosis Engine: From Evidence to Actionable Findings

The Critiqor diagnosis engine takes the raw event stream captured during an OpenClaw session and transforms it into a structured, actionable diagnosis. It runs after critiqor finalize closes the session, processes every event in session.json, and produces a complete picture of what went wrong, why it happened, and how severe the impact is. No LLM judgment is involved at any stage — every finding is derived deterministically from observed runtime evidence.

The Diagnosis Pipeline

Session File (session.json)
     ↓
Evidence Processing (normalize events, count by type)
     ↓
Failure Detectors (6 OpenClaw-native detectors)
     ↓
Scoring Engine (weighted trust score across 6 dimensions)
     ↓
Causal Graph Builder
     ↓
Diagnosis (primary cause, severity, chain explanation)
     ↓
Dashboard Data (diagnosis.json)

Evidence Processing: normalizes every event to a consistent shape (assigning event_id, resolving the event type, and ensuring timestamps are present), then counts events by type to give detectors fast access to the event distribution.
Failure Detectors: each detector scans the normalized event stream for one specific OpenClaw failure pattern. Detectors run one after another, each independently of the others, so multiple detectors can fire on the same session.
Scoring Engine: converts detected failures into per-dimension scores using a weighted formula, then computes a final integer trust score from 0 to 100.
Causal Graph Builder: constructs a directed graph whose nodes are events and failures and whose edges represent precedes, causes, and reinforces relationships. This graph powers the visual root-cause view in the dashboard.
Diagnosis: selects the highest-impact failure cause as the primary diagnosis and assembles the causal chain explanation.
Dashboard Data: writes the completed diagnosis.json artifact that the local dashboard reads.
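As a rough illustration of the Evidence Processing stage, the sketch below normalizes a list of raw events and counts them by type. The field names (type, event_type, timestamp), the evt-NNNN id format, and the timestamp fallback are assumptions made for this example, not the engine's actual schema.

```python
from collections import Counter

# Hypothetical placeholder used when no timestamp can be inferred.
FALLBACK_TIMESTAMP = "1970-01-01T00:00:00Z"


def normalize_events(raw_events):
    """Return copies of the events with event_id, type, and timestamp filled in."""
    normalized = []
    last_timestamp = FALLBACK_TIMESTAMP
    for index, raw in enumerate(raw_events):
        event = dict(raw)  # shallow copy; the input list is left untouched
        # Assign an id only when the event does not already carry one.
        event["event_id"] = raw.get("event_id") or f"evt-{index:04d}"
        # Assumed resolution order for the event type.
        event["type"] = raw.get("type") or raw.get("event_type") or "unknown"
        # Reuse the previous timestamp instead of the wall clock, so the
        # result stays identical across runs on the same input.
        event["timestamp"] = raw.get("timestamp") or last_timestamp
        last_timestamp = event["timestamp"]
        normalized.append(event)
    return normalized


def count_by_type(events):
    """Count normalized events by their resolved type."""
    return Counter(event["type"] for event in events)
```

The sketch deliberately avoids reading the current time: any value that depends on when the engine runs would break the guarantee that identical input yields identical output.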

The 6 Failure Detectors

Each detector targets one failure type from the OpenClaw failure taxonomy. All detectors are deterministic — they check specific event patterns, not model-generated interpretations.

Detector	Failure Type	What It Catches
Loop Control	`infinite_tool_loop`	>=3 identical tool calls (same tool name + same arguments)
Memory Integrity	`memory_degradation`	Memory events with status `recall_failed`, `ignored`, `lost`, or `miss`
Tool Output Utilization	`ignoring_tool_outputs`	Tool outputs marked `used: false`, `referenced: false`, or `status: ignored`
Context Health	`context_pollution`	Context events with saturation >=85% or an action of `compaction`
Cost Efficiency	`cost_explosion`	Total tokens >=12,000 or >=3 duplicate tool actions in a session
Skill Adherence	`skill_failure`	Skill events with `invoked: false` or status `ignored`, `mismatch`, or `failed`
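
To make the Loop Control row concrete, here is a minimal sketch of how a deterministic detector of this kind could be written. It only illustrates the ">=3 identical tool calls" rule from the table; the event fields (type, tool, arguments) and the shape of the returned finding are assumptions, not Critiqor's actual data model.

```python
import json
from collections import Counter

LOOP_THRESHOLD = 3  # from the table above: >=3 identical tool calls


def detect_infinite_tool_loop(events):
    """Return one finding per (tool, arguments) pair seen LOOP_THRESHOLD+ times."""
    signatures = Counter()
    for event in events:
        if event.get("type") != "tool_call":
            continue
        # Serialize arguments with sorted keys so key order does not matter.
        arguments = json.dumps(event.get("arguments", {}), sort_keys=True)
        signatures[(event.get("tool"), arguments)] += 1
    # Counter keeps insertion order, so findings come out in a stable order.
    return [
        {"type": "infinite_tool_loop", "tool": tool, "identical_calls": count}
        for (tool, _arguments), count in signatures.items()
        if count >= LOOP_THRESHOLD
    ]
```

Because the check is a plain count over canonicalized call signatures, running it twice on the same events always yields the same findings.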

Each detected failure produces a structured OpenClawFailureCause record containing:

type — the failure taxonomy key (e.g. infinite_tool_loop)
severity — medium, high, or critical based on detection thresholds
evidence — the specific events that triggered the detector
causal_chain — the ordered sequence of event types that led to the failure
impact_score — a negative integer representing the trust penalty
description — a human-readable explanation of what was observed
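
The record described above could be modeled roughly as follows. Only the six field names and the allowed severity values come from this page; the Python types, the validation in __post_init__, and every value in the example instance (including the impact score of -40) are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List

ALLOWED_SEVERITIES = ("medium", "high", "critical")


@dataclass(frozen=True)
class OpenClawFailureCause:
    type: str                       # failure taxonomy key
    severity: str                   # medium, high, or critical
    evidence: List[Dict[str, Any]]  # events that triggered the detector
    causal_chain: List[str]         # ordered event types leading to the failure
    impact_score: int               # negative trust penalty
    description: str                # human-readable explanation

    def __post_init__(self):
        if self.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.impact_score >= 0:
            raise ValueError("impact_score must be a negative integer")


# Hypothetical instance; the values are illustrative only.
example = OpenClawFailureCause(
    type="infinite_tool_loop",
    severity="high",
    evidence=[{"event_id": "evt-0003", "type": "tool_call"}],
    causal_chain=["tool_call", "tool_call", "tool_call"],
    impact_score=-40,
    description="The same tool was called 3 times with identical arguments.",
)
record = asdict(example)  # plain dict, ready to serialize as JSON
```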

Severity Thresholds

Severity escalates based on observed counts:

Loop Control: high for 3–4 identical calls or 1–2 retries; critical for >=5 calls or >=3 retries
Memory Integrity: medium for 1–2 bad memory events; high for >=3
Tool Output Utilization: medium for 1 ignored output; high for >=2
Context Health: medium for saturation 85–94%; high if any event reaches >=95%
Cost Efficiency: high for token totals 12,000–29,999; critical for >=30,000 tokens
Skill Adherence: medium for 1 failed skill event; high for >=2
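
The thresholds above translate directly into small pure functions. The sketch below is one possible encoding; the function names are hypothetical, and returning None for "not detected" is a choice made for this example. It does not cover two cases whose severity this page leaves unspecified: a context event that fires only because of a compaction action, and a cost failure triggered only by duplicate tool actions.

```python
def loop_severity(identical_calls, retries=0):
    """high for 3-4 identical calls or 1-2 retries; critical for >=5 calls or >=3 retries."""
    if identical_calls >= 5 or retries >= 3:
        return "critical"
    if identical_calls >= 3 or retries >= 1:
        return "high"
    return None


def count_severity(count, high_at):
    """Shared rule: medium from the first bad event, high from `high_at` onward."""
    if count >= high_at:
        return "high"
    if count >= 1:
        return "medium"
    return None


def memory_severity(bad_events):
    return count_severity(bad_events, high_at=3)


def tool_output_severity(ignored_outputs):
    return count_severity(ignored_outputs, high_at=2)


def skill_severity(failed_events):
    return count_severity(failed_events, high_at=2)


def context_severity(max_saturation_pct):
    """medium for 85-94% saturation; high once any event reaches >=95%."""
    if max_saturation_pct >= 95:
        return "high"
    if max_saturation_pct >= 85:
        return "medium"
    return None


def cost_severity(total_tokens):
    """high for 12,000-29,999 tokens; critical for >=30,000."""
    if total_tokens >= 30_000:
        return "critical"
    if total_tokens >= 12_000:
        return "high"
    return None
```

Checking the stricter threshold first keeps each function a simple cascade, so boundary values such as 95% saturation or 30,000 tokens land in the higher band.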

The Scoring Engine

After detectors run, the scoring engine computes a trust score from 0 to 100 across six weighted dimensions:

Dimension	Weight	Failure Type Penalized
`loop_control`	20%	`infinite_tool_loop`
`tool_output_utilization`	20%	`ignoring_tool_outputs`
`memory_integrity`	15%	`memory_degradation`
`context_health`	15%	`context_pollution`
`cost_efficiency`	15%	`cost_explosion`
`skill_adherence`	15%	`skill_failure`

Each dimension starts at 100 and is lowered by the impact score of its corresponding detected failure (the impact score is a negative integer, so a larger penalty means a lower dimension score). Dimensions with no detected failure remain at 100. The final trust score is the weighted sum of the six dimension scores, using the weights in the table above. The readiness level is then derived from the trust score and failure severity:

unsafe_for_production — trust score less than 60, or any critical-severity failure
review_recommended — trust score 60–79, or any high-severity failure
ready_for_runtime — trust score >=80, no high or critical failures
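
The weighted sum and the readiness rules can be sketched as follows. The weights and the failure-to-dimension mapping come from the table above. Three things are assumptions of this example: dimension scores are clamped at 0, the final score is rounded half up, and readiness rules are checked from most to least severe. The impact score of -40 used in the usage lines is also hypothetical.

```python
# Weights in percent, taken from the dimension table.
WEIGHTS = {
    "loop_control": 20,
    "tool_output_utilization": 20,
    "memory_integrity": 15,
    "context_health": 15,
    "cost_efficiency": 15,
    "skill_adherence": 15,
}

FAILURE_TO_DIMENSION = {
    "infinite_tool_loop": "loop_control",
    "ignoring_tool_outputs": "tool_output_utilization",
    "memory_degradation": "memory_integrity",
    "context_pollution": "context_health",
    "cost_explosion": "cost_efficiency",
    "skill_failure": "skill_adherence",
}


def trust_score(failures):
    """Return (integer trust score, per-dimension scores)."""
    scores = {dimension: 100 for dimension in WEIGHTS}
    for failure in failures:
        dimension = FAILURE_TO_DIMENSION[failure["type"]]
        # impact_score is negative, so adding it lowers the dimension score.
        scores[dimension] = max(0, scores[dimension] + failure["impact_score"])
    weighted = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    # Integer arithmetic with half-up rounding (an assumption of this sketch).
    return (weighted + 50) // 100, scores


def readiness(score, failures):
    """Apply the readiness rules, most severe first (assumed precedence)."""
    severities = {failure["severity"] for failure in failures}
    if score < 60 or "critical" in severities:
        return "unsafe_for_production"
    if score < 80 or "high" in severities:
        return "review_recommended"
    return "ready_for_runtime"


# Usage: one critical loop failure with a hypothetical impact score of -40.
failures = [{"type": "infinite_tool_loop", "severity": "critical", "impact_score": -40}]
score, dimensions = trust_score(failures)  # loop_control drops to 60, score is 92
level = readiness(score, failures)         # "unsafe_for_production"
```

The usage lines show why severity is checked alongside the score: a single critical failure in a 20% dimension still leaves the trust score at 92, yet the session is classified as unsafe_for_production.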

No LLM Judgment

Every output from the diagnosis engine — trust scores, failure causes, causal graphs, readiness levels, and cost analysis — is computed deterministically from the event stream. Critiqor does not send evidence to an LLM for scoring, does not use a model to generate failure descriptions, and does not apply any heuristic that varies between runs on identical inputs. The same session.json will always produce the same diagnosis.json. For a full walkthrough of how scores translate to deployment decisions, see the Evaluation Guide.
