The Critiqor diagnosis engine takes the raw event stream captured during an OpenClaw session and transforms it into a structured, actionable diagnosis. It runs after critiqor finalize closes the session, processes every event in session.json, and produces a complete picture of what went wrong, why it happened, and how severe the impact is. No LLM judgment is involved at any stage — every finding is derived deterministically from observed runtime evidence.

The Diagnosis Pipeline

1. Session File (session.json)
2. Evidence Processing (normalize events, count by type)
3. Failure Detectors (6 OpenClaw-native detectors)
4. Scoring Engine (weighted trust score across 6 dimensions)
5. Causal Graph Builder
6. Diagnosis (primary cause, severity, chain explanation)
7. Dashboard Data (diagnosis.json)

  • Evidence Processing normalizes every event to a consistent shape (assigning event_id, resolving event type, and ensuring timestamps are present), then counts events by type to give detectors fast access to the event distribution.
  • Failure Detectors each scan the normalized event stream for a specific OpenClaw failure pattern. They run independently and in sequence; multiple detectors can fire on the same session.
  • Scoring Engine converts detected failures into per-dimension scores using a weighted formula, then computes a final integer trust score from 0 to 100.
  • Causal Graph Builder constructs a directed graph where nodes are events and failures, and edges represent precedes, causes, and reinforces relationships. This graph powers the visual root-cause view in the dashboard.
  • Diagnosis selects the highest-impact failure cause as the primary diagnosis and assembles the causal chain explanation.
  • Dashboard Data writes the completed diagnosis.json artifact that the local dashboard reads.
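
The Evidence Processing stage can be pictured with a minimal sketch. The field names (type, event_type, timestamp), the fallback ID scheme, and the timestamp carry-forward rule are all assumptions made for illustration; the actual session.json schema and Critiqor's normalization rules may differ. The fallbacks are chosen to be deterministic so that identical input yields identical output.

```python
from collections import Counter

# Hypothetical placeholder used only when no earlier timestamp exists.
EPOCH = "1970-01-01T00:00:00Z"


def normalize_events(raw_events):
    """Return copies of the raw events with event_id, type, and timestamp set."""
    normalized = []
    last_timestamp = EPOCH
    for index, raw in enumerate(raw_events):
        event = dict(raw)
        # Assumed fallback: derive an ID from the event's position in the stream.
        event["event_id"] = raw.get("event_id") or f"evt-{index:04d}"
        # Assumed resolution order for the event type.
        event["type"] = raw.get("type") or raw.get("event_type") or "unknown"
        # Assumed fallback: reuse the previous event's timestamp.
        event["timestamp"] = raw.get("timestamp") or last_timestamp
        last_timestamp = event["timestamp"]
        normalized.append(event)
    return normalized


def count_by_type(events):
    """Count normalized events by type so detectors can check the distribution."""
    return Counter(event["type"] for event in events)
```

A detector could then consult count_by_type(events) first and skip its scan entirely when the event type it cares about never occurs.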

The 6 Failure Detectors

Each detector targets one failure type from the OpenClaw failure taxonomy. All detectors are deterministic — they check specific event patterns, not model-generated interpretations.
| Detector | Failure Type | What It Catches |
| --- | --- | --- |
| Loop Control | infinite_tool_loop | >=3 identical tool calls (same tool name + same arguments) |
| Memory Integrity | memory_degradation | Memory events with status recall_failed, ignored, lost, or miss |
| Tool Output Utilization | ignoring_tool_outputs | Tool outputs marked used: false, referenced: false, or status: ignored |
| Context Health | context_pollution | Context events with saturation >=85% or an action of compaction |
| Cost Efficiency | cost_explosion | Total tokens >=12,000 or >=3 duplicate tool actions in a session |
| Skill Adherence | skill_failure | Skill events with invoked: false or status ignored, mismatch, or failed |
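
As an illustration of what a deterministic pattern check looks like, here is a rough sketch of the Loop Control rule (3 or more calls with the same tool name and the same arguments). The event type name tool_call and the field names tool and arguments are hypothetical; only the threshold comes from the table above.

```python
import json
from collections import Counter

LOOP_THRESHOLD = 3  # ">=3 identical tool calls" from the detector table


def detect_tool_loops(events):
    """Return {(tool, canonical_args): count} for calls repeated >= threshold."""
    calls = Counter()
    for event in events:
        if event.get("type") != "tool_call":  # assumed event type name
            continue
        # Serialize arguments with sorted keys so key order does not matter.
        args_key = json.dumps(event.get("arguments", {}), sort_keys=True)
        calls[(event.get("tool"), args_key)] += 1
    return {key: n for key, n in calls.items() if n >= LOOP_THRESHOLD}
```

Because the check is a plain count over canonicalized arguments, running it twice on the same event list gives the same result.
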
Each detected failure produces a structured OpenClawFailureCause record containing:
  • type: the failure taxonomy key (e.g. infinite_tool_loop)
  • severity: medium, high, or critical, based on detection thresholds
  • evidence: the specific events that triggered the detector
  • causal_chain: the ordered sequence of event types that led to the failure
  • impact_score: a negative integer representing the trust penalty
  • description: a human-readable explanation of what was observed
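
The record can be modeled roughly as follows. The field names come from the list above; the Python types, the validation rules, and the example values (including the impact score of -40) are assumptions for illustration, not the engine's actual definition.

```python
from dataclasses import dataclass, field, asdict

ALLOWED_SEVERITIES = {"medium", "high", "critical"}


@dataclass
class OpenClawFailureCause:
    type: str
    severity: str
    description: str
    impact_score: int
    evidence: list = field(default_factory=list)      # assumed: list of event IDs
    causal_chain: list = field(default_factory=list)  # ordered event types

    def __post_init__(self):
        if self.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.impact_score >= 0:
            raise ValueError("impact_score must be a negative integer")


# Hypothetical example values.
cause = OpenClawFailureCause(
    type="infinite_tool_loop",
    severity="high",
    description="3 identical calls to the same tool with the same arguments",
    impact_score=-40,
    evidence=["evt-0003", "evt-0004", "evt-0005"],
    causal_chain=["tool_call", "tool_call", "tool_call"],
)
record = asdict(cause)  # plain dict, ready to serialize as JSON
```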

Severity Thresholds

Severity escalates based on observed counts:
  • Loop Control: high for 3–4 identical calls or 1–2 retries; critical for >=5 calls or >=3 retries
  • Memory Integrity: medium for 1–2 bad memory events; high for >=3
  • Tool Output Utilization: medium for 1 ignored output; high for >=2
  • Context Health: medium for saturation 85–94%; high if any event reaches >=95%
  • Cost Efficiency: high for token totals 12,000–29,999; critical for >=30,000 tokens
  • Skill Adherence: medium for 1 failed skill event; high for >=2
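
The thresholds above can be expressed as small pure functions. This is a sketch: the function names and signatures are hypothetical, and cases the list does not specify (the severity of a compaction-only context event, or of a cost failure triggered only by duplicate tool actions) are deliberately left out.

```python
def loop_severity(identical_calls, retries=0):
    """Loop Control: critical at >=5 calls or >=3 retries, high at 3-4 or 1-2."""
    if identical_calls >= 5 or retries >= 3:
        return "critical"
    if identical_calls >= 3 or retries >= 1:
        return "high"
    return None  # detector did not fire


def count_severity(bad_events, high_at):
    """Memory Integrity (high_at=3); Tool Output and Skill Adherence (high_at=2)."""
    if bad_events >= high_at:
        return "high"
    if bad_events >= 1:
        return "medium"
    return None


def context_severity(saturations):
    """Context Health: high if any event reaches >=95%, medium for 85-94%."""
    if any(s >= 95 for s in saturations):
        return "high"
    if any(s >= 85 for s in saturations):
        return "medium"
    return None


def cost_severity(total_tokens):
    """Cost Efficiency by token total: critical at >=30,000, high at >=12,000."""
    if total_tokens >= 30_000:
        return "critical"
    if total_tokens >= 12_000:
        return "high"
    return None
```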

The Scoring Engine

After detectors run, the scoring engine computes a trust score from 0 to 100 across six weighted dimensions:
| Dimension | Weight | Failure Type Penalized |
| --- | --- | --- |
| loop_control | 20% | infinite_tool_loop |
| tool_output_utilization | 20% | ignoring_tool_outputs |
| memory_integrity | 15% | memory_degradation |
| context_health | 15% | context_pollution |
| cost_efficiency | 15% | cost_explosion |
| skill_adherence | 15% | skill_failure |
Each dimension starts at 100 and is reduced by the penalty in the impact_score of its corresponding detected failure. Dimensions with no detected failure remain at 100. The final trust score is the weighted sum of the six dimension scores, expressed as an integer from 0 to 100.

The readiness level is then derived from the trust score and failure severity:
  • unsafe_for_production — trust score less than 60, or any critical-severity failure
  • review_recommended — trust score 60–79, or any high-severity failure
  • ready_for_runtime — trust score >=80, no high or critical failures
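
A worked sketch of the scoring and readiness rules follows. The weights and readiness thresholds are taken from this section; clamping each dimension to the 0 to 100 range, rounding the weighted sum with round(), checking the readiness rules from most to least restrictive, and the example impact score of -40 are all assumptions.

```python
# Weights in percent, from the dimension table.
WEIGHTS = {
    "loop_control": 20,
    "tool_output_utilization": 20,
    "memory_integrity": 15,
    "context_health": 15,
    "cost_efficiency": 15,
    "skill_adherence": 15,
}


def trust_score(impact_scores):
    """impact_scores maps dimension name to a negative impact_score."""
    total = 0
    for dimension, weight in WEIGHTS.items():
        # Each dimension starts at 100; the (negative) impact score lowers it.
        score = 100 + impact_scores.get(dimension, 0)
        score = max(0, min(100, score))  # assumed clamp
        total += weight * score
    return round(total / 100)  # assumed rounding rule


def readiness(score, severities):
    """Apply the readiness rules, most restrictive first (assumed order)."""
    if score < 60 or "critical" in severities:
        return "unsafe_for_production"
    if score < 80 or "high" in severities:
        return "review_recommended"
    return "ready_for_runtime"


# One high-severity loop failure with a hypothetical impact score of -40:
# loop_control drops to 60, so the trust score is 0.20 * 60 + 0.80 * 100 = 92.
score = trust_score({"loop_control": -40})
level = readiness(score, ["high"])
```

In this example the trust score is 92, yet the readiness level is review_recommended, because a high-severity failure rules out ready_for_runtime regardless of the score.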

No LLM Judgment

Every output from the diagnosis engine — trust scores, failure causes, causal graphs, readiness levels, and cost analysis — is computed deterministically from the event stream. Critiqor does not send evidence to an LLM for scoring, does not use a model to generate failure descriptions, and does not apply any heuristic that varies between runs on identical inputs. The same session.json will always produce the same diagnosis.json. For a full walkthrough of how scores translate to deployment decisions, see the Evaluation Guide.