Critiqor Data Types: Public API Type Reference Guide

All Critiqor data types are frozen dataclasses — immutable once constructed. Every type exposes a to_dict() method that returns a JSON-serializable dict. Import any type directly from the top-level critiqor package:

from critiqor import ToolCall, ToolOutput, FailureCause, EvaluationRecord, PolicyCheckResult, TrendAnalysis
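As a rough illustration of what the frozen-dataclass contract implies, the sketch below uses a local stand-in class so that it runs without critiqor installed. `ExampleCall`, its defaults, and the body of its `to_dict()` are assumptions made for illustration; only the field names are taken from the `ToolCall` table in this reference.

```python
import dataclasses
import json
from dataclasses import dataclass, field
from typing import Optional


# Stand-in for illustration only; this is not the real critiqor.ToolCall.
@dataclass(frozen=True)
class ExampleCall:
    tool: str
    args: dict = field(default_factory=dict)
    id: Optional[str] = None
    timestamp: Optional[float] = None

    def to_dict(self) -> dict:
        # Assumed behavior: a plain dict of the instance's fields.
        return dataclasses.asdict(self)


call = ExampleCall(tool="search", args={"query": "weather"}, id="call-1")

# A frozen dataclass rejects attribute assignment after construction.
try:
    call.tool = "other"
    mutated = True
except dataclasses.FrozenInstanceError:
    mutated = False

# The dict returned by to_dict() can be passed straight to json.dumps.
payload = json.dumps(call.to_dict())
```

To derive a modified copy of a frozen instance, the standard library's `dataclasses.replace(call, tool="other")` builds a new object instead of mutating the old one. Note that freezing only blocks attribute assignment; a `dict` stored in a field can still be changed in place.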

Type Aliases

TrustLevel             = Literal["High", "Moderate", "Low"]
EvidenceLevel          = Literal["response_only", "trace_available", "fully_instrumented"]
FailureSeverity        = Literal["low", "medium", "high"]
DeploymentRecommendation = Literal["safe_to_deploy", "review_recommended", "unsafe_for_production"]
AgentType              = Literal["coding", "research", "customer_support", "general"]
CertificationLevel     = Literal["none", "bronze", "silver", "gold", "platinum"]
TrendDirection         = Literal["improving", "stable", "declining", "insufficient_data"]
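Because these aliases are `Literal` types, their allowed values can be read back at runtime with `typing.get_args`. The sketch below redeclares two of the aliases locally so it runs on its own; the `is_valid` helper is hypothetical and not part of critiqor.

```python
from typing import Literal, get_args

# Redeclared locally so the sketch runs without critiqor installed.
TrustLevel = Literal["High", "Moderate", "Low"]
FailureSeverity = Literal["low", "medium", "high"]


def is_valid(value: object, alias) -> bool:
    """Return True if value is one of the strings allowed by a Literal alias."""
    return value in get_args(alias)


severity_values = get_args(FailureSeverity)
```

The check is case-sensitive, which matters here: `TrustLevel` values are capitalized while `FailureSeverity` values are lowercase.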

`ToolCall`

Represents a single observed tool invocation. Produced by EvidenceRecorder.record_tool_call() and collected in EvaluationEvidence.tool_calls.

Field	Type	Description
`tool`	`str`	Tool name.
`args`	`dict`	Arguments passed to the tool.
`id`	`str \| None`	Optional call ID used to correlate with a `ToolOutput`.
`timestamp`	`float \| None`	Unix timestamp of the call.

`ToolOutput`

Represents the result of a single tool invocation. Produced by EvidenceRecorder.record_tool_output() and collected in EvaluationEvidence.tool_outputs.

Field	Type	Description
`tool`	`str`	Tool name.
`output`	`Any`	The tool’s result.
`call_id`	`str \| None`	Correlates with the `id` of a prior `ToolCall`.
`error`	`str \| None`	Error message if the tool call failed; `None` on success.
`timestamp`	`float \| None`	Unix timestamp of the output.
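The `id` / `call_id` correlation described above can be sketched as a simple lookup. The two classes below are local stand-ins that mirror the documented fields (their default values are assumptions), and `pair_calls_with_outputs` is a hypothetical helper, not a critiqor function.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


# Local stand-ins mirroring the documented fields; not the library classes.
@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)
    id: Optional[str] = None
    timestamp: Optional[float] = None


@dataclass(frozen=True)
class ToolOutput:
    tool: str
    output: Any = None
    call_id: Optional[str] = None
    error: Optional[str] = None
    timestamp: Optional[float] = None


def pair_calls_with_outputs(calls, outputs):
    """Match each call to the output whose call_id equals the call's id."""
    by_call_id = {o.call_id: o for o in outputs if o.call_id is not None}
    # Calls without an id, or without a matching output, pair with None.
    return [(c, by_call_id.get(c.id) if c.id is not None else None) for c in calls]


calls = [
    ToolCall(tool="search", args={"q": "docs"}, id="c1"),
    ToolCall(tool="fetch", args={"url": "https://example.com"}, id="c2"),
    ToolCall(tool="noop"),
]
outputs = [
    ToolOutput(tool="fetch", call_id="c2", error="timeout"),
    ToolOutput(tool="search", output=["hit"], call_id="c1"),
]

pairs = pair_calls_with_outputs(calls, outputs)
# A non-None error marks a failed invocation.
failed_tools = [c.tool for c, o in pairs if o is not None and o.error is not None]
```

Keying on `call_id` rather than list position keeps the pairing correct even when outputs arrive out of order, as in the sample data.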

`RuntimeMetrics`

Aggregated runtime statistics for a single agent execution. Populated automatically by EvidenceRecorder.finish() or supplied directly to Critiqor.evaluate().

Field	Type	Description
`latency`	`float \| None`	Wall-clock duration of the execution in seconds.
`token_usage`	`dict`	Token usage breakdown, e.g. `{"prompt_tokens": 120, "completion_tokens": 80, "total_tokens": 200}`.
`retries`	`int`	Number of retry events observed during the run.
`errors`	`list[str]`	List of error message strings captured during the run.
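When metrics are supplied directly rather than collected by the recorder, they have to be measured by the caller. The following is a minimal sketch of one way to do that, assuming a local stand-in for `RuntimeMetrics` with guessed defaults; the `measure` helper is hypothetical and does not reflect how `EvidenceRecorder.finish()` works internally.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


# Stand-in mirroring the documented fields; defaults are assumptions.
@dataclass(frozen=True)
class RuntimeMetrics:
    latency: Optional[float] = None
    token_usage: dict = field(default_factory=dict)
    retries: int = 0
    errors: list = field(default_factory=list)


def measure(fn, token_usage=None) -> RuntimeMetrics:
    """Run fn once, recording wall-clock latency and any raised error message."""
    start = time.perf_counter()
    errors = []
    try:
        fn()
    except Exception as exc:
        # Store the message string, matching the documented list[str] type.
        errors.append(str(exc))
    latency = time.perf_counter() - start
    return RuntimeMetrics(latency=latency, token_usage=dict(token_usage or {}), errors=errors)


metrics = measure(
    lambda: 1 / 0,
    token_usage={"prompt_tokens": 120, "completion_tokens": 80, "total_tokens": 200},
)
```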

`FailureCause`

A structured explanation for a trust-score penalty. Failure causes are detected deterministically by detect_failure_causes() and returned in CritiqorResult.failure_causes.

Field	Type	Description
`type`	`str`	Failure type identifier, e.g. `"infinite_tool_loop"`, `"ignored_tool_output"`, `"unsupported_claims"`, `"redundant_tool_calls"`, `"runtime_failures"`, `"confidence_mismatch"`.
`severity`	`FailureSeverity`	`"low"`, `"medium"`, or `"high"`.
`impact`	`int`	Trust-score penalty applied by this cause (negative integer).
`description`	`str`	Human-readable description of what was observed.
`root_cause`	`RootCause \| None`	Optional enrichment describing the underlying root cause.
`recommendation`	`str`	Suggested remediation. Empty string if none is available.

`RootCause`

Optional enrichment nested inside a FailureCause. Provides a deeper causal explanation and a concrete fix recommendation.

Field	Type	Description
`description`	`str`	Explanation of the underlying cause.
`impact`	`str`	Human-readable description of the downstream impact.
`trust_penalty`	`int`	Trust-score deduction contributed by this root cause.
`recommended_fix`	`str`	Concrete remediation suggestion.
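To show how `FailureCause` and its nested `RootCause` fit together, here is a sketch with local stand-in classes built from the two field tables above. The `to_dict()` body, the sample values, and the sign used for `trust_penalty` are assumptions; the source only states that `impact` on `FailureCause` is a negative integer.

```python
from dataclasses import asdict, dataclass
from typing import Optional


# Stand-ins mirroring the documented fields; not the library classes.
@dataclass(frozen=True)
class RootCause:
    description: str
    impact: str
    trust_penalty: int
    recommended_fix: str


@dataclass(frozen=True)
class FailureCause:
    type: str
    severity: str
    impact: int
    description: str
    root_cause: Optional[RootCause] = None
    recommendation: str = ""

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses, so root_cause becomes a dict.
        return asdict(self)


causes = [
    FailureCause(
        type="infinite_tool_loop",
        severity="high",
        impact=-15,
        description="The same tool was called repeatedly with identical arguments.",
        root_cause=RootCause(
            description="No stop condition after an empty result.",
            impact="The run exhausted its step budget.",
            trust_penalty=15,  # sign convention assumed for this example
            recommended_fix="Cap repeated identical calls.",
        ),
    ),
    FailureCause(
        type="redundant_tool_calls",
        severity="low",
        impact=-10,
        description="Two calls returned data already available.",
    ),
]

# impact values are negative, so their sum is the total deduction.
total_penalty = sum(c.impact for c in causes)
high_severity = [c.type for c in causes if c.severity == "high"]
first_as_dict = causes[0].to_dict()
```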

`EvaluationRecord`

A persisted representation of one Critiqor evaluation. Returned by save_evaluation() and loaded back by load_evaluations(). Also produced by CritiqorResult.to_record().

Field	Type	Description
`run_id`	`str`	Unique run identifier (UUID).
`agent_id`	`str`	Agent identifier.
`timestamp`	`str`	ISO 8601 UTC timestamp of the evaluation.
`scores`	`dict`	Per-dimension reliability scores keyed by dimension name.
`failure_causes`	`list[FailureCause]`	All failure causes detected for this run.
`trust_score`	`int`	Overall trust score (0–100).
`evidence_level`	`EvidenceLevel`	Evidence quality used for this evaluation.
`evaluation_confidence`	`int`	Critiqor’s self-confidence in the evaluation (0–100).
`deployment_recommendation`	`DeploymentRecommendation`	The deployment gate result for this run.

`EvaluationRecord.to_dict()`

Returns a JSON-serializable dict. Failure causes are serialized via their own to_dict() methods.

`EvaluationRecord.from_dict()`

EvaluationRecord.from_dict(payload: dict) -> EvaluationRecord

Reconstructs an EvaluationRecord from a previously serialized dict. Unknown or invalid field values are coerced to safe defaults.
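The coercion behavior can be pictured with a small sketch. The source does not say which defaults `from_dict()` uses, so the fallbacks chosen here (`0` for scores, `"response_only"`, `"review_recommended"`) and the helper names are all hypothetical; the point is only the general pattern of clamping and falling back instead of raising.

```python
from typing import Any

EVIDENCE_LEVELS = ("response_only", "trace_available", "fully_instrumented")
RECOMMENDATIONS = ("safe_to_deploy", "review_recommended", "unsafe_for_production")


def coerce_score(value: Any, default: int = 0) -> int:
    """Clamp a score into 0..100, falling back to default if it is not numeric."""
    try:
        return max(0, min(100, int(value)))
    except (TypeError, ValueError, OverflowError):
        return default


def coerce_choice(value: Any, allowed: tuple, default: str) -> str:
    """Keep value if it is an allowed string, otherwise use the default."""
    return value if value in allowed else default


def coerce_record_fields(payload: dict) -> dict:
    """Coerce a few EvaluationRecord fields; the defaults are illustrative guesses."""
    return {
        "trust_score": coerce_score(payload.get("trust_score")),
        "evaluation_confidence": coerce_score(payload.get("evaluation_confidence")),
        "evidence_level": coerce_choice(
            payload.get("evidence_level"), EVIDENCE_LEVELS, "response_only"
        ),
        "deployment_recommendation": coerce_choice(
            payload.get("deployment_recommendation"), RECOMMENDATIONS, "review_recommended"
        ),
    }


cleaned = coerce_record_fields(
    {"trust_score": 140, "evaluation_confidence": "n/a", "evidence_level": "bogus"}
)
```

Because invalid values are silently replaced, a caller that needs strict validation should check the payload before handing it to `from_dict()`.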

`PolicyCheckResult`

Returned by check_policy(). Represents a CI/CD deployment gate decision for a given agent run.

Field	Type	Description
`passed`	`bool`	`True` if the run met all configured policy thresholds.
`deployment_recommendation`	`DeploymentRecommendation`	The deployment decision: `"safe_to_deploy"`, `"review_recommended"`, or `"unsafe_for_production"`.
`messages`	`list[str]`	Human-readable messages explaining the gate result — which thresholds passed or failed.
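In a CI job, a result like this is typically turned into a process exit code. The sketch below assumes a local stand-in for `PolicyCheckResult` and a hypothetical `gate_exit_code` helper; how `check_policy()` is called and configured is not shown because its signature is not given here.

```python
from dataclasses import dataclass, field


# Stand-in mirroring the documented fields; not the library class.
@dataclass(frozen=True)
class PolicyCheckResult:
    passed: bool
    deployment_recommendation: str
    messages: list = field(default_factory=list)


def gate_exit_code(result: PolicyCheckResult) -> int:
    """Print the gate messages and map the outcome to a shell exit code."""
    print(f"recommendation: {result.deployment_recommendation}")
    for message in result.messages:
        print(f"  - {message}")
    # Exit code 0 lets the pipeline continue; non-zero fails the job.
    return 0 if result.passed else 1


blocked = PolicyCheckResult(
    passed=False,
    deployment_recommendation="unsafe_for_production",
    messages=["trust_score 42 is below the configured minimum"],  # sample text
)
exit_code = gate_exit_code(blocked)
```

In a real pipeline script the last step would be `sys.exit(gate_exit_code(result))`, so that a failed gate stops the deployment.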

`TrendAnalysis`

Returned by analyze_trends(). Summarizes the direction and magnitude of reliability change across multiple historical runs for a single agent.

Field	Type	Description
`trust_trend`	`TrendDirection`	Overall trend direction: `"improving"`, `"stable"`, `"declining"`, or `"insufficient_data"`.
`trust_change`	`int`	Average change in trust score per run (positive = improving).
`hallucination_change`	`int`	Average change in the hallucination score per run.
`tool_reliability_change`	`int`	Average change in the tool reliability score per run.
`reasoning_change`	`int`	Average change in the reasoning score per run.
`summary`	`str`	Human-readable narrative of the trend.
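One plausible reading of "average change per run" is the mean of consecutive score differences, rounded to an integer. The sketch below implements that reading; the formula, the rounding, and the `tolerance` threshold used to pick a direction are assumptions, not the documented behavior of `analyze_trends()`.

```python
from typing import Optional, Sequence


def average_change_per_run(scores: Sequence[int]) -> Optional[int]:
    """Mean difference between consecutive runs, or None with fewer than two runs."""
    if len(scores) < 2:
        return None
    deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
    return round(sum(deltas) / len(deltas))


def trend_direction(change: Optional[int], tolerance: int = 1) -> str:
    """Map an average change to a TrendDirection value (tolerance is hypothetical)."""
    if change is None:
        return "insufficient_data"
    if change > tolerance:
        return "improving"
    if change < -tolerance:
        return "declining"
    return "stable"


history = [60, 64, 71, 75]  # trust scores from oldest to newest run
change = average_change_per_run(history)
direction = trend_direction(change)
```

For the sample history the consecutive differences are 4, 7, and 4, so the average change is 5 and the direction is `"improving"`.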

`ReliabilityCertification`

Returned by certify_run(). Encodes a standardized certification level for a run or benchmark suite result.

Field	Type	Description
`certification_level`	`CertificationLevel`	`"none"`, `"bronze"`, `"silver"`, `"gold"`, or `"platinum"`.
`trust_score`	`int`	Trust score used to determine the certification level.
`percentile`	`int`	Percentile rank among historical runs.
`markdown_badge`	`str`	Ready-to-embed Markdown badge string for README files.
`criteria`	`dict`	The threshold criteria that were evaluated to arrive at the certification level.

`AgentProfile`

Registered identity for an agent, used for cross-agent ranking and leaderboard participation.

Field	Type	Description
`agent_id`	`str`	Unique agent identifier.
`name`	`str`	Display name. Defaults to `agent_id` if not set.
`category`	`str`	Agent category: `"coding"`, `"research"`, `"customer_support"`, or `"general"`.
`metadata`	`dict`	Arbitrary additional metadata.
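A default that depends on another field, such as `name` falling back to `agent_id`, needs a small workaround in a frozen dataclass. The sketch below shows one common way to do it with `__post_init__`; it is a stand-in, not the actual critiqor implementation, and the `"general"` default for `category` is an assumption.

```python
from dataclasses import dataclass, field


# Stand-in mirroring the documented fields; not the library class.
@dataclass(frozen=True)
class AgentProfile:
    agent_id: str
    name: str = ""
    category: str = "general"  # assumed default
    metadata: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.name:
            # Frozen dataclasses block normal assignment, so the derived
            # default is set through object.__setattr__ during construction.
            object.__setattr__(self, "name", self.agent_id)


unnamed = AgentProfile(agent_id="support-bot-v2")
named = AgentProfile(agent_id="support-bot-v2", name="Support Bot", category="customer_support")
```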

`BenchmarkResult`

Returned by benchmark_run(). Aggregates scores across all prompts in a benchmark suite.

Field	Type	Description
`name`	`str`	Benchmark suite name.
`agent_type`	`AgentType`	Agent category used for percentile ranking.
`trust_score`	`int`	Average trust score across all benchmark runs.
`percentile`	`int`	Percentile rank among agents in the same category.
`run_count`	`int`	Number of prompts evaluated.
`scores`	`dict[str, int]`	Average per-dimension scores across all runs.
`results`	`list[CritiqorResult]`	Individual results for each benchmark prompt.
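The aggregation behind `scores` can be sketched as a per-dimension mean over the individual runs. The dimension names and the use of `round()` are assumptions here, and `average_scores` is a hypothetical helper rather than part of `benchmark_run()`.

```python
from typing import Dict, List


def average_scores(per_run_scores: List[Dict[str, int]]) -> Dict[str, int]:
    """Average each dimension across runs, rounding to the nearest integer."""
    dimensions = sorted({name for scores in per_run_scores for name in scores})
    averaged = {}
    for name in dimensions:
        # Only runs that report this dimension contribute to its average.
        values = [scores[name] for scores in per_run_scores if name in scores]
        averaged[name] = round(sum(values) / len(values))
    return averaged


runs = [
    {"hallucination": 80, "reasoning": 70},
    {"hallucination": 90, "reasoning": 76},
]
suite_scores = average_scores(runs)
run_count = len(runs)
```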

`CausalGraph`

A structured causal graph for a single failure event. Returned by build_causal_graph().

Field	Type	Description
`failure_event`	`str`	The root failure type (e.g. `"infinite_tool_loop"`).
`causal_graph`	`list[CausalGraphEdge]`	Ordered list of directed causal edges.
`run_id`	`str \| None`	Run identifier this graph was built from, if available.

`CausalGraph.explain()`

Returns the causal chain as a human-readable string, e.g.: "Prompt was ambiguous -> Agent selected incorrect tool -> Evidence was missing -> Final answer hallucinated"
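A string like the one above can be produced by walking the ordered edges and joining their endpoints. The fields of `CausalGraphEdge` are not listed in this reference, so the `cause` and `effect` attributes below are hypothetical, and the sketch assumes the edges form a single linear chain.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical edge shape; the real CausalGraphEdge fields are not documented here.
@dataclass(frozen=True)
class Edge:
    cause: str
    effect: str


def explain(edges: List[Edge]) -> str:
    """Join an ordered, linear chain of edges into 'A -> B -> C' form."""
    if not edges:
        return ""
    # Start from the first cause, then append each edge's effect in order.
    chain = [edges[0].cause] + [edge.effect for edge in edges]
    return " -> ".join(chain)


edges = [
    Edge("Prompt was ambiguous", "Agent selected incorrect tool"),
    Edge("Agent selected incorrect tool", "Evidence was missing"),
    Edge("Evidence was missing", "Final answer hallucinated"),
]
explanation = explain(edges)
```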

`ReliabilityInsight`

An executive summary generated by generate_insights() from historical reliability data.

Field	Type	Description
`summary`	`str`	High-level narrative of agent reliability trends.
`primary_drivers`	`list[str]`	The top contributing factors to recent reliability changes.
