critiqor finalize closes the session, processes every event in session.json, and produces a complete picture of what went wrong, why it happened, and how severe the impact is. No LLM judgment is involved at any stage — every finding is derived deterministically from observed runtime evidence.
The Diagnosis Pipeline
The first stage normalizes the event stream (handling event_id, resolving event type, and ensuring timestamps are present), then counts events by type to give detectors fast access to the event distribution.
Failure Detectors each scan the normalized event stream for a specific OpenClaw failure pattern. They run independently and in sequence; multiple detectors can fire on the same session.
Scoring Engine converts detected failures into per-dimension scores using a weighted formula, then computes a final integer trust score from 0 to 100.
Causal Graph Builder constructs a directed graph where nodes are events and failures, and edges represent precedes, causes, and reinforces relationships. This graph powers the visual root-cause view in the dashboard.
Diagnosis selects the highest-impact failure cause as the primary diagnosis and assembles the causal chain explanation.
Dashboard Data writes the completed diagnosis.json artifact that the local dashboard reads.
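As a rough illustration of the normalization and counting stage, the sketch below fills in missing fields and tallies events by type. The function name normalize_events, the evt-N id format, the event_type alias, and the fixed fallback timestamp are all assumptions made for this example; they are not taken from Critiqor's implementation.

```python
from collections import Counter

# Assumption: a fixed fallback keeps the output identical across runs.
FALLBACK_TIMESTAMP = "1970-01-01T00:00:00Z"


def normalize_events(raw_events):
    """Return (normalized_events, counts_by_type) for a list of event dicts."""
    normalized = []
    for index, raw in enumerate(raw_events):
        event = dict(raw)  # shallow copy so the caller's dicts are not modified
        # Assumption: a missing event_id is replaced by a positional id.
        event.setdefault("event_id", f"evt-{index}")
        # Assumption: the type may arrive under "type" or "event_type".
        event["type"] = raw.get("type") or raw.get("event_type") or "unknown"
        # Assumption: a missing timestamp gets the fixed fallback value.
        event.setdefault("timestamp", FALLBACK_TIMESTAMP)
        normalized.append(event)
    # Count events by type so later stages can look up the distribution cheaply.
    counts = Counter(event["type"] for event in normalized)
    return normalized, counts
```

Because the fallback values are constants rather than generated at run time, the same input list always yields the same normalized output, which matches the deterministic behavior the pipeline promises.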
The 6 Failure Detectors
Each detector targets one failure type from the OpenClaw failure taxonomy. All detectors are deterministic: they check specific event patterns, not model-generated interpretations.

| Detector | Failure Type | What It Catches |
|---|---|---|
| Loop Control | infinite_tool_loop | >=3 identical tool calls (same tool name + same arguments) |
| Memory Integrity | memory_degradation | Memory events with status recall_failed, ignored, lost, or miss |
| Tool Output Utilization | ignoring_tool_outputs | Tool outputs marked used: false, referenced: false, or status: ignored |
| Context Health | context_pollution | Context events with saturation >=85% or an action of compaction |
| Cost Efficiency | cost_explosion | Total tokens >=12,000 or >=3 duplicate tool actions in a session |
| Skill Adherence | skill_failure | Skill events with invoked: false or status ignored, mismatch, or failed |
Each detector that fires produces an OpenClawFailureCause record containing:

- type: the failure taxonomy key (e.g. infinite_tool_loop)
- severity: medium, high, or critical, based on detection thresholds
- evidence: the specific events that triggered the detector
- causal_chain: the ordered sequence of event types that led to the failure
- impact_score: a negative integer representing the trust penalty
- description: a human-readable explanation of what was observed
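To make the record shape and the Loop Control rule concrete, here is a minimal sketch of a detector that emits one such record. The dataclass mirrors the six fields listed above. The event field names (type, tool, arguments), the function name detect_infinite_tool_loop, and the impact_score value of -20 are hypothetical placeholders, since the real schema and penalty values are not given here.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass
class OpenClawFailureCause:
    type: str
    severity: str
    evidence: list
    causal_chain: list
    impact_score: int
    description: str


def _call_key(event):
    # Same tool name + same arguments; sort_keys makes argument order irrelevant.
    args = json.dumps(event.get("arguments", {}), sort_keys=True, default=str)
    return (event.get("tool"), args)


def detect_infinite_tool_loop(events):
    """Return a failure record if one tool call repeats 3+ times, else None."""
    calls = [e for e in events if e.get("type") == "tool_call"]
    counts = Counter(_call_key(e) for e in calls)
    if not counts:
        return None
    top_key, top_count = counts.most_common(1)[0]
    if top_count < 3:
        return None
    evidence = [e for e in calls if _call_key(e) == top_key]
    return OpenClawFailureCause(
        type="infinite_tool_loop",
        # Documented thresholds: high for 3-4 identical calls, critical for >=5.
        severity="critical" if top_count >= 5 else "high",
        evidence=evidence,
        causal_chain=[e["type"] for e in evidence],
        impact_score=-20,  # hypothetical placeholder penalty
        description=f"{top_count} identical calls to {top_key[0]}",
    )
```

The detector only counts and compares events, so it returns the same record every time it sees the same input. Retry-based escalation is left out of this sketch to keep it short.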
Severity Thresholds
Severity escalates based on observed counts:

- Loop Control: high for 3–4 identical calls or 1–2 retries; critical for >=5 calls or >=3 retries
- Memory Integrity: medium for 1–2 bad memory events; high for >=3
- Tool Output Utilization: medium for 1 ignored output; high for >=2
- Context Health: medium for saturation 85–94%; high if any event reaches >=95%
- Cost Efficiency: high for token totals 12,000–29,999; critical for >=30,000 tokens
- Skill Adherence: medium for 1 failed skill event; high for >=2
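These thresholds lend themselves to a small lookup table. The sketch below encodes the numbers listed above. The names SEVERITY_THRESHOLDS, severity_for, and loop_severity are illustrative, and the choice of what counts as the "observed" value for each detector (event count, peak saturation percent, or token total) is an assumption of this example.

```python
# (minimum observed value, severity), ordered from the highest threshold down.
SEVERITY_THRESHOLDS = {
    "memory_degradation": [(3, "high"), (1, "medium")],        # bad memory events
    "ignoring_tool_outputs": [(2, "high"), (1, "medium")],     # ignored outputs
    "context_pollution": [(95, "high"), (85, "medium")],       # peak saturation %
    "cost_explosion": [(30_000, "critical"), (12_000, "high")],  # total tokens
    "skill_failure": [(2, "high"), (1, "medium")],             # failed skill events
}


def severity_for(failure_type, observed):
    """Return the severity for an observed value, or None if below every threshold."""
    for minimum, severity in SEVERITY_THRESHOLDS[failure_type]:
        if observed >= minimum:
            return severity
    return None


def loop_severity(identical_calls, retries=0):
    """Loop Control has two inputs, so it gets its own function."""
    if identical_calls >= 5 or retries >= 3:
        return "critical"
    if identical_calls >= 3 or retries >= 1:
        return "high"
    return None
```

For example, `severity_for("context_pollution", 94)` returns `"medium"` and `loop_severity(3, retries=3)` returns `"critical"`. Cases the list does not cover, such as a compaction action with no saturation reading, or a cost failure triggered only by duplicate tool actions, are deliberately left out.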
The Scoring Engine
After detectors run, the scoring engine computes a trust score from 0 to 100 across six weighted dimensions:

| Dimension | Weight | Failure Type Penalized |
|---|---|---|
| loop_control | 20% | infinite_tool_loop |
| tool_output_utilization | 20% | ignoring_tool_outputs |
| memory_integrity | 15% | memory_degradation |
| context_health | 15% | context_pollution |
| cost_efficiency | 15% | cost_explosion |
| skill_adherence | 15% | skill_failure |
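The exact per-dimension formula is not spelled out here, so the following is only one plausible reading of how the weights could combine. It assumes each dimension starts at 100, that each failure's negative impact_score is added to its dimension and clamped to the 0–100 range, and that the trust score is the rounded weighted average. The weights and the failure-to-dimension mapping come from the table; the function name trust_score and the clamping rule are assumptions.

```python
# Weights in percent, taken from the table; they sum to 100.
WEIGHTS = {
    "loop_control": 20,
    "tool_output_utilization": 20,
    "memory_integrity": 15,
    "context_health": 15,
    "cost_efficiency": 15,
    "skill_adherence": 15,
}

DIMENSION_FOR_FAILURE = {
    "infinite_tool_loop": "loop_control",
    "ignoring_tool_outputs": "tool_output_utilization",
    "memory_degradation": "memory_integrity",
    "context_pollution": "context_health",
    "cost_explosion": "cost_efficiency",
    "skill_failure": "skill_adherence",
}


def trust_score(failures):
    """failures: iterable of (failure_type, impact_score), impact_score negative.

    Returns (trust_score, per_dimension_scores).
    """
    dimension_scores = {name: 100 for name in WEIGHTS}
    for failure_type, impact_score in failures:
        dimension = DIMENSION_FOR_FAILURE[failure_type]
        # Assumption: penalties accumulate per dimension and are clamped to [0, 100].
        penalized = dimension_scores[dimension] + impact_score
        dimension_scores[dimension] = max(0, min(100, penalized))
    weighted_total = sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
    return round(weighted_total / 100), dimension_scores
```

Under these assumptions, a single infinite_tool_loop failure with an impact_score of -50 halves the loop_control dimension and lowers the trust score by 10 points, because that dimension carries a 20% weight.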
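The trust score and the detected failure severities together determine one of three readiness levels:

- unsafe_for_production: trust score less than 60, or any critical-severity failure
- review_recommended: trust score 60–79, or any high-severity failure
- ready_for_runtime: trust score >=80, with no high or critical failures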
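Because the readiness conditions overlap (a session can score above 80 and still contain a high-severity failure), it helps to read them as an ordered check from most to least restrictive. The sketch below shows that ordering. The function name readiness_level and its argument shapes are illustrative only.

```python
def readiness_level(trust_score, severities):
    """Map a 0-100 trust score and the detected failure severities to a readiness level."""
    severities = set(severities)
    # Most restrictive rule first: a low score or any critical failure blocks production.
    if trust_score < 60 or "critical" in severities:
        return "unsafe_for_production"
    # Next: a middling score or any high-severity failure calls for review.
    if trust_score < 80 or "high" in severities:
        return "review_recommended"
    # Otherwise the score is >=80 with no high or critical failures.
    return "ready_for_runtime"
```

With this ordering, `readiness_level(85, ["high"])` returns `"review_recommended"` even though the score alone would qualify as ready, and `readiness_level(92, ["medium"])` returns `"ready_for_runtime"`.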
No LLM Judgment
Every output from the diagnosis engine (trust scores, failure causes, causal graphs, readiness levels, and cost analysis) is computed deterministically from the event stream. Critiqor does not send evidence to an LLM for scoring, does not use a model to generate failure descriptions, and does not apply any heuristic that varies between runs on identical inputs. The same session.json will always produce the same diagnosis.json.
For a full walkthrough of how scores translate to deployment decisions, see the Evaluation Guide.