to_dict() method that returns a JSON-serializable dict. Import any type directly from the top-level critiqor package:
Type Aliases
ToolCall
Represents a single observed tool invocation. Produced by EvidenceRecorder.record_tool_call() and collected in EvaluationEvidence.tool_calls.
| Field | Type | Description |
|---|---|---|
| tool | str | Tool name. |
| args | dict | Arguments passed to the tool. |
| id | str \| None | Optional call ID used to correlate with a ToolOutput. |
| timestamp | float \| None | Unix timestamp of the call. |
ToolOutput
Represents the result of a single tool invocation. Produced by EvidenceRecorder.record_tool_output() and collected in EvaluationEvidence.tool_outputs.
| Field | Type | Description |
|---|---|---|
| tool | str | Tool name. |
| output | Any | The tool’s result. |
| call_id | str \| None | Correlates with the id of a prior ToolCall. |
| error | str \| None | Error message if the tool call failed; None on success. |
| timestamp | float \| None | Unix timestamp of the output. |
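The id and call_id fields exist so that calls and outputs can be matched up after a run. The sketch below shows one way to do that pairing. The two dataclasses are hypothetical stand-ins that only mirror the documented fields (they are not critiqor's own classes), and pair_calls_with_outputs is an illustrative helper, not part of the library.

```python
from dataclasses import dataclass
from typing import Any, Optional


# Hypothetical stand-ins mirroring the documented fields; not critiqor's classes.
@dataclass
class ToolCall:
    tool: str
    args: dict
    id: Optional[str] = None
    timestamp: Optional[float] = None


@dataclass
class ToolOutput:
    tool: str
    output: Any
    call_id: Optional[str] = None
    error: Optional[str] = None
    timestamp: Optional[float] = None


def pair_calls_with_outputs(calls, outputs):
    """Match each ToolCall to the ToolOutput whose call_id equals its id."""
    outputs_by_id = {o.call_id: o for o in outputs if o.call_id is not None}
    # Calls without an id cannot be correlated, so they pair with None.
    return [(c, outputs_by_id.get(c.id) if c.id is not None else None) for c in calls]


calls = [
    ToolCall(tool="search", args={"q": "weather"}, id="call-1"),
    ToolCall(tool="calculator", args={"expr": "2+2"}),  # no id supplied
]
outputs = [ToolOutput(tool="search", output={"temp_c": 21}, call_id="call-1")]

pairs = pair_calls_with_outputs(calls, outputs)
```

A call that pairs with None either had no id or never produced a recorded output, which is worth checking before interpreting a run.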
RuntimeMetrics
Aggregated runtime statistics for a single agent execution. Populated automatically by EvidenceRecorder.finish() or supplied directly to Critiqor.evaluate().
| Field | Type | Description |
|---|---|---|
| latency | float \| None | Wall-clock duration of the execution in seconds. |
| token_usage | dict | Token usage breakdown, e.g. {"prompt_tokens": 120, "completion_tokens": 80, "total_tokens": 200}. |
| retries | int | Number of retry events observed during the run. |
| errors | list[str] | List of error message strings captured during the run. |
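For callers who assemble these statistics by hand, the following sketch shows the expected shape of the data. The RuntimeMetrics dataclass here is a hypothetical stand-in with assumed defaults, and the error string is made up for illustration; only the field names and the token_usage keys come from the table above.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical stand-in mirroring the documented fields; defaults are assumptions.
@dataclass
class RuntimeMetrics:
    latency: Optional[float] = None
    token_usage: dict = field(default_factory=dict)
    retries: int = 0
    errors: list = field(default_factory=list)


start = time.perf_counter()
# ... the agent execution being measured would run here ...
elapsed = time.perf_counter() - start  # wall-clock seconds

metrics = RuntimeMetrics(
    latency=elapsed,
    token_usage={"prompt_tokens": 120, "completion_tokens": 80, "total_tokens": 200},
    retries=1,
    errors=["TimeoutError: search tool timed out"],  # illustrative message
)

# Simple consistency check on the token breakdown before handing it off.
usage = metrics.token_usage
totals_agree = usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
```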
FailureCause
A structured explanation for a trust-score penalty. Failure causes are detected deterministically by detect_failure_causes() and returned in CritiqorResult.failure_causes.
| Field | Type | Description |
|---|---|---|
| type | str | Failure type identifier, e.g. "infinite_tool_loop", "ignored_tool_output", "unsupported_claims", "redundant_tool_calls", "runtime_failures", "confidence_mismatch". |
| severity | FailureSeverity | "low", "medium", or "high". |
| impact | int | Trust-score penalty applied by this cause (negative integer). |
| description | str | Human-readable description of what was observed. |
| root_cause | RootCause \| None | Optional deeper root cause enrichment. |
| recommendation | str | Suggested remediation. Empty string if none is available. |
RootCause
Optional enrichment nested inside a FailureCause. Provides a deeper causal explanation and a concrete fix recommendation.
| Field | Type | Description |
|---|---|---|
| description | str | Explanation of the underlying cause. |
| impact | str | Human-readable description of the downstream impact. |
| trust_penalty | int | Trust-score deduction contributed by this root cause. |
| recommended_fix | str | Concrete remediation suggestion. |
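To make the nesting of RootCause inside FailureCause concrete, here is a minimal sketch using hypothetical stand-in dataclasses that mirror the documented fields. The penalty values, descriptions, and the positive sign used for trust_penalty are illustrative assumptions, and to_dict() is implemented with dataclasses.asdict purely for the example; the library's actual serialization may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


# Hypothetical stand-ins mirroring the documented fields; not critiqor's classes.
@dataclass
class RootCause:
    description: str
    impact: str
    trust_penalty: int  # sign convention is an assumption here
    recommended_fix: str


@dataclass
class FailureCause:
    type: str
    severity: str  # "low", "medium", or "high"
    impact: int  # negative integer, per the table above
    description: str
    root_cause: Optional[RootCause] = None
    recommendation: str = ""

    def to_dict(self) -> dict:
        # asdict() recurses into the nested RootCause dataclass.
        return asdict(self)


causes = [
    FailureCause(
        type="infinite_tool_loop",
        severity="high",
        impact=-20,  # illustrative value
        description="The same tool was called repeatedly with identical arguments.",
        root_cause=RootCause(
            description="No guard limits repeated tool calls.",
            impact="The run exhausted its step budget.",
            trust_penalty=20,
            recommended_fix="Cap identical consecutive tool calls.",
        ),
        recommendation="Add a retry limit.",
    ),
    FailureCause(
        type="unsupported_claims",
        severity="medium",
        impact=-10,  # illustrative value
        description="The answer stated facts not found in any tool output.",
    ),
]

total_penalty = sum(c.impact for c in causes)
serialized = [c.to_dict() for c in causes]
payload = json.dumps(serialized)  # confirms the result is JSON-serializable
```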
EvaluationRecord
A persisted representation of one Critiqor evaluation. Returned by save_evaluation() and loaded back by load_evaluations(). Also produced by CritiqorResult.to_record().
| Field | Type | Description |
|---|---|---|
| run_id | str | Unique run identifier (UUID). |
| agent_id | str | Agent identifier. |
| timestamp | str | ISO 8601 UTC timestamp of the evaluation. |
| scores | dict | Per-dimension reliability scores keyed by dimension name. |
| failure_causes | list[FailureCause] | All failure causes detected for this run. |
| trust_score | int | Overall trust score (0–100). |
| evidence_level | EvidenceLevel | Evidence quality used for this evaluation. |
| evaluation_confidence | int | Critiqor’s self-confidence in the evaluation (0–100). |
| deployment_recommendation | DeploymentRecommendation | The deployment gate result for this run. |
EvaluationRecord.to_dict()
Returns a JSON-serializable dict. Failure causes are serialized via their own to_dict() methods.
EvaluationRecord.from_dict()
Reconstructs an EvaluationRecord from a previously serialized dict. Unknown or invalid field values are coerced to safe defaults.
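The round trip and the coercion behavior can be illustrated with a reduced stand-in. The class below is hypothetical and carries only a subset of the documented fields. The specific defaults it falls back to (0 for trust_score, "review_recommended" for the recommendation, an empty dict for scores) are assumptions chosen for the example, since this page does not state which defaults the library uses.

```python
from dataclasses import asdict, dataclass, field

# Documented recommendation values; a tuple avoids hashing untrusted input.
VALID_RECOMMENDATIONS = ("safe_to_deploy", "review_recommended", "unsafe_for_production")


# Hypothetical, reduced stand-in; not critiqor's EvaluationRecord.
@dataclass
class EvaluationRecord:
    run_id: str
    agent_id: str
    trust_score: int = 0
    deployment_recommendation: str = "review_recommended"
    scores: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "EvaluationRecord":
        # Coerce a malformed trust score to 0, then clamp into 0..100.
        try:
            trust = int(data.get("trust_score", 0))
        except (TypeError, ValueError):
            trust = 0
        trust = max(0, min(100, trust))

        # Fall back to an assumed default for unknown recommendation values.
        rec = data.get("deployment_recommendation")
        if rec not in VALID_RECOMMENDATIONS:
            rec = "review_recommended"

        scores = data.get("scores")
        return cls(
            run_id=str(data.get("run_id", "")),
            agent_id=str(data.get("agent_id", "")),
            trust_score=trust,
            deployment_recommendation=rec,
            scores=dict(scores) if isinstance(scores, dict) else {},
        )


original = EvaluationRecord(
    run_id="r1",
    agent_id="a1",
    trust_score=82,
    deployment_recommendation="safe_to_deploy",
    scores={"reasoning": 80},
)
restored = EvaluationRecord.from_dict(original.to_dict())

# A damaged payload still loads, with invalid values replaced by defaults.
repaired = EvaluationRecord.from_dict(
    {"run_id": "r2", "agent_id": "a1", "trust_score": "not-a-number",
     "deployment_recommendation": "ship_it"}
)
```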
PolicyCheckResult
Returned by check_policy(). Represents a CI/CD deployment gate decision for a given agent run.
| Field | Type | Description |
|---|---|---|
| passed | bool | True if the run met all configured policy thresholds. |
| deployment_recommendation | DeploymentRecommendation | The deployment decision: "safe_to_deploy", "review_recommended", or "unsafe_for_production". |
| messages | list[str] | Human-readable messages explaining the gate result, including which thresholds passed or failed. |
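In a CI/CD job, a result like this is typically translated into a process exit code. The sketch below assumes a hypothetical stand-in for PolicyCheckResult and an illustrative helper named gate_exit_code; the message text and threshold numbers are invented for the example and do not come from the library.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in mirroring the documented fields; not critiqor's class.
@dataclass
class PolicyCheckResult:
    passed: bool
    deployment_recommendation: str
    messages: list = field(default_factory=list)


def gate_exit_code(result: PolicyCheckResult) -> int:
    """Return 0 to let the pipeline continue, 1 to fail the job."""
    for message in result.messages:
        print(message)  # surface the explanation in the CI log
    return 0 if result.passed else 1


result = PolicyCheckResult(
    passed=False,
    deployment_recommendation="unsafe_for_production",
    messages=["trust_score 58 is below the required minimum of 70"],  # illustrative
)
exit_code = gate_exit_code(result)
```

In a real pipeline step, the returned value would usually be passed to sys.exit() so that a failed gate stops the deployment.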
TrendAnalysis
Returned by analyze_trends(). Summarizes the direction and magnitude of reliability change across multiple historical runs for a single agent.
| Field | Type | Description |
|---|---|---|
| trust_trend | TrendDirection | Overall trend direction: "improving", "stable", "declining", or "insufficient_data". |
| trust_change | int | Average change in trust score per run (positive = improving). |
| hallucination_change | int | Average change in the hallucination score per run. |
| tool_reliability_change | int | Average change in the tool reliability score per run. |
| reasoning_change | int | Average change in the reasoning score per run. |
| summary | str | Human-readable narrative of the trend. |
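One plausible reading of "average change per run" is the mean of the differences between consecutive runs. The sketch below implements that reading and maps it onto the four documented trend directions. The rounding, the tolerance of 1 point for "stable", and the two-run minimum are all assumptions; analyze_trends() may compute these values differently.

```python
def average_change_per_run(scores):
    """Mean of consecutive differences, rounded to an int (0 if fewer than 2 runs)."""
    if len(scores) < 2:
        return 0
    deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
    return round(sum(deltas) / len(deltas))


def trust_trend(scores, tolerance=1):
    """Classify a score history; tolerance is a hypothetical 'stable' band."""
    if len(scores) < 2:
        return "insufficient_data"
    change = average_change_per_run(scores)
    if change > tolerance:
        return "improving"
    if change < -tolerance:
        return "declining"
    return "stable"


history = [70, 74, 79, 85]  # trust scores from oldest to newest run
change = average_change_per_run(history)  # deltas are 4, 5, 6
direction = trust_trend(history)
```

The same helper could be applied to each per-dimension score history to obtain values analogous to hallucination_change, tool_reliability_change, and reasoning_change.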
ReliabilityCertification
Returned by certify_run(). Encodes a standardized certification level for a run or benchmark suite result.
| Field | Type | Description |
|---|---|---|
| certification_level | CertificationLevel | "none", "bronze", "silver", "gold", or "platinum". |
| trust_score | int | Trust score used to determine the certification level. |
| percentile | int | Percentile rank among historical runs. |
| markdown_badge | str | Ready-to-embed Markdown badge string for README files. |
| criteria | dict | The threshold criteria that were evaluated to arrive at the certification level. |
AgentProfile
Registered identity for an agent, used for cross-agent ranking and leaderboard participation.
| Field | Type | Description |
|---|---|---|
| agent_id | str | Unique agent identifier. |
| name | str | Display name. Defaults to agent_id if not set. |
| category | str | Agent category: "coding", "research", "customer_support", or "general". |
| metadata | dict | Arbitrary additional metadata. |
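The "defaults to agent_id" rule for name can be sketched with a dataclass post-init hook. The class below is a hypothetical stand-in; using an empty string to mean "not set" and "general" as the default category are assumptions made for the example.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in mirroring the documented fields; not critiqor's class.
@dataclass
class AgentProfile:
    agent_id: str
    name: str = ""  # empty string stands for "not set" in this sketch
    category: str = "general"  # assumed default
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        # Fall back to the identifier when no display name was given.
        if not self.name:
            self.name = self.agent_id


unnamed = AgentProfile(agent_id="agent-7")
named = AgentProfile(agent_id="agent-8", name="Support Bot", category="customer_support")
```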
BenchmarkResult
Returned by benchmark_run(). Aggregates scores across all prompts in a benchmark suite.
| Field | Type | Description |
|---|---|---|
| name | str | Benchmark suite name. |
| agent_type | AgentType | Agent category used for percentile ranking. |
| trust_score | int | Average trust score across all benchmark runs. |
| percentile | int | Percentile rank among agents in the same category. |
| run_count | int | Number of prompts evaluated. |
| scores | dict[str, int] | Average per-dimension scores across all runs. |
| results | list[CritiqorResult] | Individual results for each benchmark prompt. |
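The aggregate scores field can be thought of as a per-dimension mean over the individual results. The helper below is an illustrative sketch of that aggregation, not the library's code. The dimension names and the rounding to the nearest integer are assumptions.

```python
def average_scores(per_run_scores):
    """Average each dimension across runs, rounding to an int."""
    collected = {}
    for scores in per_run_scores:
        for dimension, value in scores.items():
            collected.setdefault(dimension, []).append(value)
    return {dim: round(sum(vals) / len(vals)) for dim, vals in collected.items()}


# Per-prompt scores from three benchmark runs (dimension names are assumed).
runs = [
    {"hallucination": 90, "reasoning": 70},
    {"hallucination": 80, "reasoning": 90},
    {"hallucination": 70, "reasoning": 80},
]

averaged = average_scores(runs)
run_count = len(runs)
```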
CausalGraph
A structured causal graph for a single failure event. Returned by build_causal_graph().
| Field | Type | Description |
|---|---|---|
| failure_event | str | The root failure type (e.g. "infinite_tool_loop"). |
| causal_graph | list[CausalGraphEdge] | Ordered list of directed causal edges. |
| run_id | str \| None | Run identifier this graph was built from, if available. |
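An ordered edge list like causal_graph can be flattened into a single readable chain. The sketch below assumes that each CausalGraphEdge has cause and effect attributes and that the edges form a linear chain; neither detail is documented on this page, and both classes are hypothetical stand-ins rather than the library's implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical edge shape: the cause/effect attribute names are assumptions.
@dataclass
class CausalGraphEdge:
    cause: str
    effect: str


# Hypothetical stand-in mirroring the documented fields; not critiqor's class.
@dataclass
class CausalGraph:
    failure_event: str
    causal_graph: list = field(default_factory=list)
    run_id: Optional[str] = None

    def explain(self) -> str:
        if not self.causal_graph:
            return ""
        # Assumes each edge's effect is the next edge's cause (a linear chain).
        steps = [self.causal_graph[0].cause] + [e.effect for e in self.causal_graph]
        return " -> ".join(steps)


graph = CausalGraph(
    failure_event="unsupported_claims",
    causal_graph=[
        CausalGraphEdge("Prompt was ambiguous", "Agent selected incorrect tool"),
        CausalGraphEdge("Agent selected incorrect tool", "Evidence was missing"),
        CausalGraphEdge("Evidence was missing", "Final answer hallucinated"),
    ],
)
chain = graph.explain()
```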
CausalGraph.explain()
Returns the causal chain as a human-readable string, e.g.:
"Prompt was ambiguous -> Agent selected incorrect tool -> Evidence was missing -> Final answer hallucinated"
ReliabilityInsight
An executive summary generated by generate_insights() from historical reliability data.
| Field | Type | Description |
|---|---|---|
| summary | str | High-level narrative of agent reliability trends. |
| primary_drivers | list[str] | The top contributing factors to recent reliability changes. |