# Diagnosis Engine: From Evidence to Actionable Findings

> Critiqor's diagnosis engine scans session evidence with deterministic detectors to produce trust scores, failure causes, and deployment readiness verdicts.

The Critiqor diagnosis engine takes the raw event stream captured during an OpenClaw session and transforms it into a structured, actionable diagnosis. It runs after `critiqor finalize` closes the session, processes every event in `session.json`, and produces a complete picture of what went wrong, why it happened, and how severe the impact is. No LLM judgment is involved at any stage — every finding is derived deterministically from observed runtime evidence.

## The Diagnosis Pipeline

```
Session File (session.json)
     ↓
Evidence Processing (normalize events, count by type)
     ↓
Failure Detectors (6 OpenClaw-native detectors)
     ↓
Scoring Engine (weighted trust score across 6 dimensions)
     ↓
Causal Graph Builder
     ↓
Diagnosis (primary cause, severity, chain explanation)
     ↓
Dashboard Data (diagnosis.json)
```

**Evidence Processing** normalizes every event to a consistent shape — assigning `event_id`, resolving `event` type, and ensuring timestamps are present — then counts events by type to give detectors fast access to the event distribution.
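As a rough illustration of this step, the sketch below normalizes a list of raw event dictionaries and counts them by type. It is not Critiqor's implementation: the `evt-0000` ID format, the fallback from a `type` key, the `unknown` label, the carried-forward timestamp, and the sample event names are all assumptions made for the example.

```python
from collections import Counter


def normalize_events(raw_events, fallback_timestamp="1970-01-01T00:00:00Z"):
    """Return copies of the events with event_id, event, and timestamp set."""
    normalized = []
    last_timestamp = fallback_timestamp
    for index, raw in enumerate(raw_events):
        event = dict(raw)  # copy so the caller's data is not mutated
        # Assumed ID scheme: position in the stream, zero-padded.
        event.setdefault("event_id", f"evt-{index:04d}")
        # Assumed resolution order: "event", then "type", then "unknown".
        event["event"] = event.get("event") or event.get("type") or "unknown"
        # Assumed fallback: reuse the previous timestamp so output stays
        # deterministic (no wall-clock reads).
        event["timestamp"] = event.get("timestamp") or last_timestamp
        last_timestamp = event["timestamp"]
        normalized.append(event)
    return normalized


def count_by_type(events):
    """Count normalized events by their resolved type."""
    return Counter(event["event"] for event in events)


events = normalize_events([
    {"event": "tool_call", "timestamp": "2024-01-01T00:00:00Z"},
    {"type": "memory"},
    {},
])
counts = count_by_type(events)
```

Avoiding the system clock in the fallback keeps the sketch consistent with the engine's requirement that identical inputs produce identical outputs.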

**Failure Detectors** each scan the normalized event stream for a specific OpenClaw failure pattern. The detectors run one after another, and none depends on another's result, so multiple detectors can fire on the same session.

**Scoring Engine** converts detected failures into per-dimension scores using a weighted formula, then computes a final integer trust score from 0 to 100.

**Causal Graph Builder** constructs a directed graph where nodes are events and failures, and edges represent `precedes`, `causes`, and `reinforces` relationships. This graph powers the visual root-cause view in the dashboard.
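One way to picture this structure is a small directed graph with typed edges. The class below is a hypothetical sketch, not the builder's actual data model: the `CausalGraph` name, its methods, and the node IDs are invented for illustration, and only the three relation names come from the description above.

```python
from dataclasses import dataclass, field

RELATIONS = ("precedes", "causes", "reinforces")


@dataclass
class CausalGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> "event" | "failure"
    edges: list = field(default_factory=list)  # (source, relation, target)

    def add_node(self, node_id, kind):
        if kind not in ("event", "failure"):
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, source, relation, target):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((source, relation, target))

    def sources(self, target, relation):
        """Nodes that point at `target` through the given relation."""
        return [s for s, r, t in self.edges if t == target and r == relation]


graph = CausalGraph()
graph.add_node("evt-0001", "event")
graph.add_node("evt-0002", "event")
graph.add_node("infinite_tool_loop", "failure")
graph.add_edge("evt-0001", "precedes", "evt-0002")
graph.add_edge("evt-0002", "causes", "infinite_tool_loop")
```

Storing edges as an append-only list keeps insertion order stable, which matters if the graph is later serialized and compared across runs.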

**Diagnosis** selects the highest-impact failure cause as the primary diagnosis and assembles the causal chain explanation.

**Dashboard Data** writes the completed `diagnosis.json` artifact that the local dashboard reads.

## The 6 Failure Detectors

Each detector targets one failure type from the OpenClaw failure taxonomy. All detectors are deterministic — they check specific event patterns, not model-generated interpretations.

| Detector                | Failure Type            | What It Catches                                                                 |
| ----------------------- | ----------------------- | ------------------------------------------------------------------------------- |
| Loop Control            | `infinite_tool_loop`    | >=3 identical tool calls (same tool name + same arguments)                      |
| Memory Integrity        | `memory_degradation`    | Memory events with status `recall_failed`, `ignored`, `lost`, or `miss`         |
| Tool Output Utilization | `ignoring_tool_outputs` | Tool outputs marked `used: false`, `referenced: false`, or `status: ignored`    |
| Context Health          | `context_pollution`     | Context events with saturation >=85% or an action of `compaction`               |
| Cost Efficiency         | `cost_explosion`        | Total tokens >=12,000 or >=3 duplicate tool actions in a session                |
| Skill Adherence         | `skill_failure`         | Skill events with `invoked: false` or status `ignored`, `mismatch`, or `failed` |
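To make the first row concrete, here is a minimal sketch of a loop check that counts identical tool calls. The event shape (`event`, `tool`, and `arguments` keys) and the `tool_call` type name are assumptions, and the sketch counts matches across the whole session rather than requiring them to be consecutive, which may differ from the real detector.

```python
import json
from collections import Counter

LOOP_THRESHOLD = 3  # from the table: >=3 identical tool calls


def detect_identical_tool_calls(events):
    """Return (tool, arguments_json, count) for calls repeated >= threshold."""
    signatures = Counter()
    for event in events:
        if event.get("event") != "tool_call":  # hypothetical type name
            continue
        # Sort keys so argument order does not affect the comparison.
        args = json.dumps(event.get("arguments", {}), sort_keys=True)
        signatures[(event.get("tool"), args)] += 1
    return [
        (tool, args, count)
        for (tool, args), count in signatures.items()
        if count >= LOOP_THRESHOLD
    ]


session = [
    {"event": "tool_call", "tool": "search", "arguments": {"q": "x", "limit": 5}},
    {"event": "tool_call", "tool": "search", "arguments": {"limit": 5, "q": "x"}},
    {"event": "tool_call", "tool": "search", "arguments": {"q": "x", "limit": 5}},
    {"event": "tool_call", "tool": "search", "arguments": {"q": "y", "limit": 5}},
]
loops = detect_identical_tool_calls(session)
```

Serializing arguments with sorted keys is one way to treat `{"q": "x", "limit": 5}` and `{"limit": 5, "q": "x"}` as the same call.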

Each detected failure produces a structured `OpenClawFailureCause` record containing:

* **`type`** — the failure taxonomy key (e.g. `infinite_tool_loop`)
* **`severity`** — `medium`, `high`, or `critical` based on detection thresholds
* **`evidence`** — the specific events that triggered the detector
* **`causal_chain`** — the ordered sequence of event types that led to the failure
* **`impact_score`** — a negative integer representing the trust penalty
* **`description`** — a human-readable explanation of what was observed
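The record could be modeled along the following lines. Only the field names and the allowed severity values come from the list above; the field types, the validation rules, and the sample values (including the `-40` impact score) are assumptions for illustration.

```python
from dataclasses import dataclass, asdict

SEVERITIES = ("medium", "high", "critical")


@dataclass(frozen=True)
class OpenClawFailureCause:
    type: str            # failure taxonomy key
    severity: str        # one of SEVERITIES
    evidence: tuple      # event IDs that triggered the detector (assumed shape)
    causal_chain: tuple  # ordered event types leading to the failure
    impact_score: int    # negative trust penalty
    description: str

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
        if self.impact_score >= 0:
            raise ValueError("impact_score must be a negative integer")


cause = OpenClawFailureCause(
    type="infinite_tool_loop",
    severity="high",
    evidence=("evt-0000", "evt-0001", "evt-0002"),
    causal_chain=("tool_call", "tool_call", "tool_call"),
    impact_score=-40,  # hypothetical value
    description="The same tool call was repeated 3 times.",
)
record = asdict(cause)
```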

### Severity Thresholds

Severity escalates with the observed counts, saturation levels, and token totals:

* **Loop Control**: `high` for 3–4 identical calls or 1–2 retries; `critical` for >=5 calls or >=3 retries
* **Memory Integrity**: `medium` for 1–2 bad memory events; `high` for >=3
* **Tool Output Utilization**: `medium` for 1 ignored output; `high` for >=2
* **Context Health**: `medium` for saturation 85–94%; `high` if any event reaches >=95%
* **Cost Efficiency**: `high` for token totals 12,000–29,999; `critical` for >=30,000 tokens
* **Skill Adherence**: `medium` for 1 failed skill event; `high` for >=2
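The thresholds above can be expressed as a few small functions. This is a sketch under assumptions: the function names are invented, `None` is used to mean "detector did not fire", and cases the list does not specify (for example, severity when Cost Efficiency fires only on duplicate tool actions, or Context Health fires only on `compaction`) are left out.

```python
def loop_severity(identical_calls, retries=0):
    if identical_calls >= 5 or retries >= 3:
        return "critical"
    if identical_calls >= 3 or retries >= 1:
        return "high"
    return None


def count_severity(count, high_at):
    """Shared rule for the count-based detectors.

    Memory Integrity uses high_at=3; Tool Output Utilization and
    Skill Adherence use high_at=2.
    """
    if count <= 0:
        return None
    return "high" if count >= high_at else "medium"


def context_severity(saturations):
    """Take saturation percentages from context events; the peak decides."""
    peak = max(saturations, default=0)
    if peak >= 95:
        return "high"
    if peak >= 85:
        return "medium"
    return None


def cost_severity(total_tokens):
    if total_tokens >= 30_000:
        return "critical"
    if total_tokens >= 12_000:
        return "high"
    return None
```

For example, `count_severity(2, high_at=3)` gives `"medium"` for two bad memory events, while `count_severity(2, high_at=2)` gives `"high"` for two ignored tool outputs.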

## The Scoring Engine

After detectors run, the scoring engine computes a trust score from 0 to 100 across six weighted dimensions:

| Dimension                 | Weight | Failure Type Penalized  |
| ------------------------- | ------ | ----------------------- |
| `loop_control`            | 20%    | `infinite_tool_loop`    |
| `tool_output_utilization` | 20%    | `ignoring_tool_outputs` |
| `memory_integrity`        | 15%    | `memory_degradation`    |
| `context_health`          | 15%    | `context_pollution`     |
| `cost_efficiency`         | 15%    | `cost_explosion`        |
| `skill_adherence`         | 15%    | `skill_failure`         |

Each dimension starts at 100 and is lowered by the magnitude of the impact score of its corresponding detected failure. Dimensions with no detected failure remain at 100. The final trust score is the weighted sum of the six dimension scores, using the weights in the table above.
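A worked example may help. The sketch below applies the weights from the table to one detected failure; the `-40` impact score is a made-up value, and both the clamping of dimension scores to the 0–100 range and the round-half-up step are assumptions, since the rounding rule is not specified here.

```python
# Weights in percent, taken from the table above.
WEIGHTS = {
    "loop_control": 20,
    "tool_output_utilization": 20,
    "memory_integrity": 15,
    "context_health": 15,
    "cost_efficiency": 15,
    "skill_adherence": 15,
}

DIMENSION_FOR_FAILURE = {
    "infinite_tool_loop": "loop_control",
    "ignoring_tool_outputs": "tool_output_utilization",
    "memory_degradation": "memory_integrity",
    "context_pollution": "context_health",
    "cost_explosion": "cost_efficiency",
    "skill_failure": "skill_adherence",
}


def dimension_scores(failures):
    """Start every dimension at 100 and apply each failure's penalty."""
    scores = {dimension: 100 for dimension in WEIGHTS}
    for failure in failures:
        dimension = DIMENSION_FOR_FAILURE[failure["type"]]
        # impact_score is negative, so adding it lowers the score.
        penalized = scores[dimension] + failure["impact_score"]
        scores[dimension] = max(0, min(100, penalized))  # assumed clamp
    return scores


def trust_score(scores):
    """Weighted sum of dimension scores as an integer from 0 to 100."""
    total = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    return (total + 50) // 100  # integer round-half-up (assumed rule)


scores = dimension_scores([{"type": "infinite_tool_loop", "impact_score": -40}])
final = trust_score(scores)  # 20*60 + 20*100 + 4*(15*100) = 9200 -> 92
```

Because `loop_control` carries a 20% weight, a 40-point penalty in that dimension lowers the final score by 8 points, from 100 to 92.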

The readiness level is then derived from the trust score and failure severity:

* **`unsafe_for_production`** — trust score less than 60, or any `critical`-severity failure
* **`review_recommended`** — trust score 60–79, or any `high`-severity failure
* **`ready_for_runtime`** — trust score >=80, no `high` or `critical` failures
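These rules can overlap (a session might score 70 and also contain a `critical` failure), so the sketch below assumes they are checked from most to least restrictive and the first match wins. The function name and argument shapes are hypothetical.

```python
def readiness_level(trust_score, severities):
    """Map a trust score and the detected failure severities to a verdict."""
    if trust_score < 60 or "critical" in severities:
        return "unsafe_for_production"
    if trust_score < 80 or "high" in severities:
        return "review_recommended"
    return "ready_for_runtime"


verdicts = [
    readiness_level(92, ["high"]),      # high failure overrides a good score
    readiness_level(70, ["critical"]),  # critical failure wins over 60-79
    readiness_level(85, ["medium"]),    # medium failures do not block
]
```

Under this reading, any Loop Control finding yields at least `review_recommended`, because that detector's lowest severity is `high`.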

## No LLM Judgment

Every output from the diagnosis engine — trust scores, failure causes, causal graphs, readiness levels, and cost analysis — is computed deterministically from the event stream. Critiqor does not send evidence to an LLM for scoring, does not use a model to generate failure descriptions, and does not apply any heuristic that varies between runs on identical inputs.

The same `session.json` will always produce the same `diagnosis.json`.

For a full walkthrough of how scores translate to deployment decisions, see the [Evaluation Guide](/guides/evaluation).
