# Evaluation: How Critiqor Scores and Diagnoses Agent Runs

> Critiqor computes trust scores, failure diagnoses, causal graphs, and recommendations from runtime evidence using six deterministic dimension detectors.

Critiqor's evaluation pipeline is entirely deterministic: no LLM judgment is used at any stage. Every score, failure cause, causal chain, and deployment recommendation is computed by inspecting the raw events collected in `session.json` and applying the rule-based detectors defined in `openclaw.py`. The same evidence always produces the same diagnosis, which makes results auditable, reproducible, and safe to use in CI/CD gates.

The pipeline runs inside `diagnose_openclaw_events()`, which accepts the session's event list and returns an `OpenClawDiagnosis` dataclass containing all of the fields documented below.

***

## Trust Score

The trust score is a single integer from 0 to 100. It is the weighted sum of six per-dimension scores. Each dimension score starts at 100 and is reduced by the penalty (`impact_score`) of any failure detected in that dimension.

### Dimension weights

| Dimension                 | Weight | Failure type penalised  |
| ------------------------- | ------ | ----------------------- |
| `loop_control`            | 20%    | `infinite_tool_loop`    |
| `tool_output_utilization` | 20%    | `ignoring_tool_outputs` |
| `memory_integrity`        | 15%    | `memory_degradation`    |
| `context_health`          | 15%    | `context_pollution`     |
| `cost_efficiency`         | 15%    | `cost_explosion`        |
| `skill_adherence`         | 15%    | `skill_failure`         |

### Score formula

Each dimension score is computed in `_openclaw_scores()`:

```
dimension_score = max(0, 100 − |impact_score for that failure type|)
```

The weighted trust score is then computed in `_weighted_score()`:

```
trust_score = round(Σ dimension_score × weight)
```

The result is clamped to the range `[0, 100]`.
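As a rough illustration of the two formulas above, the following sketch recomputes a trust score from a set of impact scores. The function names, the dictionary shapes, and the use of negative `impact_score` values are assumptions made for this example; they are not taken from `openclaw.py`. Weights are stored as integer percentages, and Python's built-in `round()` stands in for whatever rounding `_weighted_score()` actually applies.

```python
# Weights from the "Dimension weights" table, as integer percentages.
WEIGHTS = {
    "loop_control": 20,
    "tool_output_utilization": 20,
    "memory_integrity": 15,
    "context_health": 15,
    "cost_efficiency": 15,
    "skill_adherence": 15,
}

# Failure type penalised in each dimension (same table).
DIMENSION_FAILURE = {
    "loop_control": "infinite_tool_loop",
    "tool_output_utilization": "ignoring_tool_outputs",
    "memory_integrity": "memory_degradation",
    "context_health": "context_pollution",
    "cost_efficiency": "cost_explosion",
    "skill_adherence": "skill_failure",
}


def dimension_scores(impact_scores):
    """Map failure type -> impact_score to dimension -> score (0..100).

    Assumption: a failure type missing from impact_scores was not detected.
    """
    return {
        dim: max(0, 100 - abs(impact_scores.get(failure, 0)))
        for dim, failure in DIMENSION_FAILURE.items()
    }


def weighted_trust_score(scores):
    """Weighted sum of the dimension scores, rounded and clamped to [0, 100]."""
    total = sum(scores[dim] * pct for dim, pct in WEIGHTS.items()) / 100
    return max(0, min(100, round(total)))


# Hypothetical run: both failures at their documented maximum penalty.
impacts = {"infinite_tool_loop": -30, "memory_degradation": -25}
print(weighted_trust_score(dimension_scores(impacts)))  # 90
```

In this hypothetical run the weighted sum is 14 + 20 + 11.25 + 15 + 15 + 15 = 90.25, which rounds to 90.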

### Penalty caps

Each failure detector imposes an `impact_score` that grows with evidence severity but is capped to prevent any single failure from dominating the score:

| Failure type            | Maximum penalty |
| ----------------------- | --------------- |
| `infinite_tool_loop`    | 30 points       |
| `ignoring_tool_outputs` | 30 points       |
| `cost_explosion`        | 30 points       |
| `memory_degradation`    | 25 points       |
| `skill_failure`         | 24 points       |
| `context_pollution`     | 22 points       |

A run with no detected failures in any dimension receives a trust score of 100.

***

## Executive Summary

The executive summary combines the trust score, readiness level, and primary failure type into a concise run-level verdict.

### Readiness level

`_readiness_level()` maps the trust score and failure severities to one of three levels:

| Level                   | Condition                                                 |
| ----------------------- | --------------------------------------------------------- |
| `ready_for_runtime`     | Trust ≥ 80 **and** no high-severity or critical failures  |
| `review_recommended`    | Trust 60–79 **or** any high-severity failure is present   |
| `unsafe_for_production` | Trust less than 60 **or** any critical failure is present |
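The table can be read as a small decision function. The sketch below is one possible reading, not the body of `_readiness_level()`: it assumes the rows are checked from most to least severe, so that a run matching more than one row receives the stricter level. The function name and the `severities` argument are hypothetical.

```python
def readiness_level_sketch(trust_score, severities):
    """Return a readiness level for a trust score and a list of failure severities.

    Assumption: the most severe matching row of the table wins.
    """
    if trust_score < 60 or "critical" in severities:
        return "unsafe_for_production"
    if trust_score < 80 or "high" in severities:
        return "review_recommended"
    return "ready_for_runtime"


print(readiness_level_sketch(92, ["medium"]))    # ready_for_runtime
print(readiness_level_sketch(92, ["high"]))      # review_recommended
print(readiness_level_sketch(92, ["critical"]))  # unsafe_for_production
```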

### Critical severity thresholds

A failure escalates to `critical` severity, which triggers `unsafe_for_production` regardless of the trust score, in two cases:

* **`infinite_tool_loop`:** the same tool call is repeated **5 or more** times, **or** 3 or more `retry_event` entries are present in the same session.
* **`cost_explosion`:** cumulative token usage across all `token_usage` events reaches **30,000 tokens or more**.
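The two thresholds above could be checked along the following lines. This is a sketch under stated assumptions: the event field names (`type`, `tool`, `arguments`, `tokens`) are hypothetical, and "the same tool call" is interpreted as the same tool name with matching arguments. The real detectors in `openclaw.py` may inspect events differently.

```python
import json
from collections import Counter

LOOP_REPEAT_THRESHOLD = 5   # identical tool calls
RETRY_THRESHOLD = 3         # retry_event entries per session
TOKEN_THRESHOLD = 30_000    # cumulative tokens per session


def loop_is_critical(events):
    """True if a tool call repeats 5+ times or the session has 3+ retries."""
    repeats = Counter(
        # Serialise arguments so that equal dicts compare equal as keys.
        (e.get("tool"), json.dumps(e.get("arguments"), sort_keys=True))
        for e in events
        if e.get("type") == "tool_call"
    )
    retries = sum(1 for e in events if e.get("type") == "retry_event")
    most_repeated = max(repeats.values(), default=0)
    return most_repeated >= LOOP_REPEAT_THRESHOLD or retries >= RETRY_THRESHOLD


def cost_is_critical(events):
    """True if cumulative usage across token_usage events reaches 30,000."""
    total = sum(e.get("tokens", 0) for e in events if e.get("type") == "token_usage")
    return total >= TOKEN_THRESHOLD
```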

The executive summary is the first thing shown in the run detail view and is also included in the output of `critiqor runs`.

***

## Primary Diagnosis

When multiple failure causes are detected in a single run, `_primary_diagnosis()` selects the one with the highest absolute `impact_score` and reports it as the root cause.

### Primary diagnosis fields

```json
{
  "root_cause_failure_type": "infinite_tool_loop",
  "causal_chain_explanation": "tool_call -> tool_failure_or_no_progress -> retry_same_action -> loop_flagged",
  "severity": "critical",
  "description": "Tool call repeated 6 times with matching arguments."
}
```

| Field                      | Description                                                  |
| -------------------------- | ------------------------------------------------------------ |
| `root_cause_failure_type`  | The failure type with the largest penalty                    |
| `causal_chain_explanation` | The `" -> "`-separated causal chain string from the detector |
| `severity`                 | `"medium"`, `"high"`, or `"critical"`                        |
| `description`              | Human-readable description of what was observed              |

If no failures are detected, `root_cause_failure_type` is `null` and `causal_chain_explanation` reads `"No OpenClaw failure mode was detected from runtime evidence."`.
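The selection rule can be sketched as a `max()` over the detected causes. The shape of each cause dictionary (`failure_type`, `impact_score`, `severity`, `description`, `causal_chain`) is an assumption for illustration, as is the tie-breaking behaviour (the first cause with the largest penalty wins). Only the two fields documented for the no-failure case are filled in when nothing is detected.

```python
NO_FAILURE_EXPLANATION = "No OpenClaw failure mode was detected from runtime evidence."


def primary_diagnosis_sketch(causes):
    """Pick the cause with the largest absolute impact_score as the root cause."""
    if not causes:
        # The source documents only these two fields for a clean run.
        return {
            "root_cause_failure_type": None,
            "causal_chain_explanation": NO_FAILURE_EXPLANATION,
        }
    top = max(causes, key=lambda c: abs(c["impact_score"]))
    return {
        "root_cause_failure_type": top["failure_type"],
        "causal_chain_explanation": " -> ".join(top["causal_chain"]),
        "severity": top["severity"],
        "description": top["description"],
    }


# Hypothetical detector output for one run.
causes = [
    {
        "failure_type": "context_pollution",
        "impact_score": -12,
        "severity": "medium",
        "description": "Context grew close to saturation.",
        "causal_chain": ["context_growth", "saturation_or_compaction", "key_state_risk"],
    },
    {
        "failure_type": "infinite_tool_loop",
        "impact_score": -30,
        "severity": "critical",
        "description": "Tool call repeated 6 times with matching arguments.",
        "causal_chain": ["tool_call", "tool_failure_or_no_progress", "retry_same_action", "loop_flagged"],
    },
]
print(primary_diagnosis_sketch(causes)["root_cause_failure_type"])  # infinite_tool_loop
```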

### Causal chain explanations by failure type

Each detector in `openclaw.py` hard-codes a causal chain that reflects the sequence of observations that led to the failure:

| Failure type            | Causal chain                                                                    |
| ----------------------- | ------------------------------------------------------------------------------- |
| `infinite_tool_loop`    | `tool_call -> tool_failure_or_no_progress -> retry_same_action -> loop_flagged` |
| `memory_degradation`    | `memory_stored -> recall_failed_or_ignored -> state_reconstruction_failed`      |
| `ignoring_tool_outputs` | `tool_call -> tool_output -> decision_skipped_output -> unsupported_agent_step` |
| `context_pollution`     | `context_growth -> saturation_or_compaction -> key_state_risk`                  |
| `cost_explosion`        | `repeated_reasoning_or_calls -> token_waste -> cost_spike`                      |
| `skill_failure`         | `skill_available -> skill_not_selected_or_failed -> generic_execution`          |

***

## Evidence Section

The `evidence_summary` block in the diagnosis is a compact snapshot of every event type observed in the session. It is computed by `_evidence_summary()` from the normalized event list.

### Evidence summary fields

```json
{
  "event_count": 47,
  "event_counts": {
    "tool_call": 14,
    "tool_output": 14,
    "retry_event": 3,
    "error_event": 1,
    "token_usage": 6,
    "state_transition": 5,
    "memory_event": 4
  },
  "tool_calls": 14,
  "tool_outputs": 14,
  "memory_events": 4,
  "retries": 3,
  "errors": 1,
  "state_transitions": 5
}
```

| Field               | Source                                          |
| ------------------- | ----------------------------------------------- |
| `event_count`       | Total events in the session                     |
| `event_counts`      | Per-type breakdown of every observed event type |
| `tool_calls`        | Count of `tool_call` events specifically        |
| `tool_outputs`      | Count of `tool_output` events specifically      |
| `memory_events`     | Count of `memory_event` events                  |
| `retries`           | Count of `retry_event` events                   |
| `errors`            | Count of `error_event` events                   |
| `state_transitions` | Count of `state_transition` events              |

The dashboard Evidence panel links these counts to the raw events in the `session.json` file stored at `runs/<run_id>/session.json`.
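A summary with these fields can be derived from a single pass over the events. The sketch below assumes each normalized event is a dictionary with a `type` key, which is a hypothetical shape; it is not the code of `_evidence_summary()`.

```python
from collections import Counter


def evidence_summary_sketch(events):
    """Count events by type and expose the documented convenience fields."""
    counts = Counter(e["type"] for e in events)
    return {
        "event_count": len(events),
        "event_counts": dict(counts),
        # Counter returns 0 for event types that never occurred.
        "tool_calls": counts["tool_call"],
        "tool_outputs": counts["tool_output"],
        "memory_events": counts["memory_event"],
        "retries": counts["retry_event"],
        "errors": counts["error_event"],
        "state_transitions": counts["state_transition"],
    }
```

Because `event_count` is the length of the event list, it always equals the sum of the values in `event_counts`, as in the sample above (14 + 14 + 3 + 1 + 6 + 5 + 4 = 47).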

***

## Recommendations

Each failure cause has an associated description that explains what the detector observed. The dashboard surfaces the primary failure's `causal_chain_explanation` as a step-by-step explanation of why the run scored the way it did, and pairs it with a remediation direction derived from the failure type.

| Failure type            | Remediation direction                                                                                                                                                          |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `infinite_tool_loop`    | Review loop termination conditions; add retry caps to prevent the agent from re-issuing identical tool calls without progress                                                  |
| `memory_degradation`    | Check memory storage and recall logic; ensure that stored context is retrievable and actively used in downstream decisions                                                     |
| `ignoring_tool_outputs` | Ensure tool results are incorporated into agent decisions; outputs that are received but never referenced indicate a disconnect between the tool layer and the reasoning layer |
| `context_pollution`     | Reduce context window usage; avoid unnecessary compaction events that discard key state silently                                                                               |
| `cost_explosion`        | Review tool call efficiency; add token budget guards to prevent runaway token accumulation across multi-turn sessions                                                          |
| `skill_failure`         | Verify skill selection logic and skill invocation flow; a mismatch or ignored skill means the agent is falling back to generic execution when a specialized path is available  |

***

## Agent Health / Run History

Every completed run produces a trust score that is persisted alongside the diagnosis in the `runs/` directory. The dashboard Agent Health view plots these scores over time, making it possible to see whether an agent is improving, stable, or degrading across successive runs.

Run files follow the naming convention `runs/run_001.json`, `runs/run_002.json`, and so on, with IDs assigned sequentially by `next_run_id()` in `session.py`. The CLI command `critiqor runs` reads this directory and prints a summary table showing run ID, trust score, readiness level, primary failure type, and tool call count for each completed run.
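One way to produce sequential IDs in this format is to scan the directory for existing run files and add one to the highest number found. The helper below is a hypothetical sketch of that idea; the actual `next_run_id()` in `session.py` may work differently (for example, by keeping a counter).

```python
import re
from pathlib import Path

RUN_FILE = re.compile(r"run_(\d+)\.json")


def next_run_id_sketch(runs_dir):
    """Return the next ID in the run_001, run_002, ... sequence."""
    numbers = [
        int(match.group(1))
        for path in Path(runs_dir).glob("run_*.json")
        if (match := RUN_FILE.fullmatch(path.name))
    ]
    # An empty or missing directory starts the sequence at run_001.
    return f"run_{max(numbers, default=0) + 1:03d}"
```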

***

## Root Cause Analysis

The causal graph is built by `build_openclaw_causal_graph()` and stored in the `causal_graph` field of the diagnosis. It is a node-edge graph with three edge types:

| Edge type    | Meaning                                                                                                                                                                                                           |
| ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `precedes`   | Temporal ordering — every consecutive pair of events in the timeline is connected with a `precedes` edge                                                                                                          |
| `causes`     | Evidence-to-failure — each piece of evidence that contributed to a detected failure has a `causes` edge pointing to the failure node                                                                              |
| `reinforces` | Repeated evidence — when multiple evidence items within a single failure cause share the same failure node, consecutive items are connected with `reinforces` edges to show that the signal accumulated over time |

### Node types

Every event in the timeline becomes an event node. Each detected failure becomes a `failure` node with an ID of the form `failure_<type>` (e.g., `failure_infinite_tool_loop`). Failure nodes are the terminal points in the causal graph — all `causes` edges terminate at them.
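The edge and node rules above can be sketched as follows. The input shapes are assumptions for illustration: each timeline event carries an `id`, and each failure cause lists the IDs of its contributing events in order under a hypothetical `evidence_ids` key. The output layout of `build_openclaw_causal_graph()` is not specified here, so the `nodes`/`edges` dictionaries are likewise illustrative.

```python
def build_causal_graph_sketch(timeline, causes):
    """Build event and failure nodes plus precedes/causes/reinforces edges."""
    nodes = [{"id": event["id"], "kind": "event"} for event in timeline]
    edges = []

    # precedes: every consecutive pair of events in the timeline.
    for earlier, later in zip(timeline, timeline[1:]):
        edges.append({"source": earlier["id"], "target": later["id"], "type": "precedes"})

    for cause in causes:
        failure_id = f"failure_{cause['failure_type']}"
        nodes.append({"id": failure_id, "kind": "failure"})
        evidence = cause["evidence_ids"]

        # causes: each piece of evidence points at the failure node.
        for event_id in evidence:
            edges.append({"source": event_id, "target": failure_id, "type": "causes"})

        # reinforces: consecutive evidence items for the same failure.
        for earlier, later in zip(evidence, evidence[1:]):
            edges.append({"source": earlier, "target": later, "type": "reinforces"})

    return {"nodes": nodes, "edges": edges}
```

For a timeline of three events where the first and third are evidence for an `infinite_tool_loop`, this yields two `precedes` edges, two `causes` edges ending at `failure_infinite_tool_loop`, and one `reinforces` edge.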

### Reading the causal chain

The `primary_diagnosis.causal_chain_explanation` field is the most human-readable entry point into the causal graph. It is the `" -> "`-joined sequence of the detector's `causal_chain` list — for example:

```
tool_call -> tool_failure_or_no_progress -> retry_same_action -> loop_flagged
```

Each step in this chain corresponds to a class of evidence observed in the runtime timeline. The full graph (accessible from the Evidence panel in the dashboard) shows the individual event nodes that instantiate each step, connected by the `causes` and `reinforces` edges that link them to the failure node.
