Deterministic agent testing

Your agent doesn't crash. It quietly does the wrong thing.

faultline breaks your agent's tools on purpose — wrong numbers, stale data, empty responses — and catches the moment it confidently ships a wrong answer with no error. Deterministically. No LLM judge.

$ pip install faultline
faultline
Wraps any Python tool call
LangChain · LlamaIndex · OpenAI · CrewAI · or your own
The silent failure

A 200 OK can still be completely wrong.

Your tests check that the agent runs. They don't check that a tool quietly handed it a stale price, a truncated list, or a number off by a factor of ten — and that the agent acted on it anyway. No exception. No alert. Just a wrong action with a price tag.

> agent.run("reorder low-stock items")
tool get_inventory() → 200 OK
returned { sku: "A‑12", qty: 2 } # real: 240
agent place_order(sku="A‑12", qty=238)
✓ run completed — exit 0
✗ faultline: SILENT‑WRONG — acted on a corrupted value
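
The trace above can be reproduced with a toy agent in plain Python. This is a minimal sketch, not faultline's API: `get_inventory_corrupted`, `reorder_agent`, `TARGET_STOCK` and `is_silent_wrong` are hypothetical names chosen to mirror the trace.

```python
# Illustrative only: a toy agent and a corrupted tool, not faultline's API.
TARGET_STOCK = 240  # assumed restock level, chosen to match the trace


def get_inventory_corrupted(sku):
    # The tool "succeeds" but reports qty 2 when the real value is 240.
    return {"sku": sku, "qty": 2}


def reorder_agent(get_inventory, sku):
    # Naive agent: trusts the tool and orders up to the target level.
    stock = get_inventory(sku)
    return {"action": "place_order", "sku": sku, "qty": TARGET_STOCK - stock["qty"]}


def is_silent_wrong(action, real_qty):
    # Invariant: never order when the real stock already meets the target.
    return action["qty"] > 0 and real_qty >= TARGET_STOCK


action = reorder_agent(get_inventory_corrupted, "A-12")
print(action["qty"], is_silent_wrong(action, real_qty=240))
```

Nothing raises and the run exits cleanly. Only a check against the real value shows the order should never have been placed.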
Watch the catch

A real Claude agent, caught moving money wrong.

A realistic 7-tool support agent (refunds, emails, tickets), written normally, with no planted bugs. On honest data it's clean. Then a stale cache bends one number and it refunds $210 on a $42 order, with no error anywhere. faultline catches it, and also reports honestly what the model handled fine. The whole interrogation cost $0.16, and the agent and battery ship in the repo so you can re-run it: pip install faultline.

run it yourself, live: faultlineapp.com/demo.html

How it works

Six ways to break it before production does.

Every mode injects realistic faults into your tools and watches what the agent does — then gates CI when it does the wrong thing. Start with zero config: faultline scan agent.py:my_agent — no suite file, no rules, it finds your tools and breaks them itself.

probe

Honest edge cases you define — the inputs you already worry about.

faultline probe suite.py

fuzz

Auto-generated edge inputs: empty, null, bent numbers, dropped keys.

faultline fuzz suite.py
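
A rough sketch of the kind of edge inputs the fuzz description lists (empty, null, bent numbers, dropped keys), assuming a dict-shaped tool response. `fuzz_variants` and the 10x bend are hypothetical and say nothing about how faultline generates its own inputs.

```python
import copy


def fuzz_variants(payload):
    """Yield (label, corrupted_copy) pairs for a dict-shaped tool response."""
    yield "empty", {}
    yield "null", None
    for key, value in payload.items():
        # Dropped key: the field is missing entirely.
        dropped = copy.deepcopy(payload)
        del dropped[key]
        yield f"drop:{key}", dropped
        # Bent number: scale numeric fields (bool is an int subclass, so skip it).
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            bent = copy.deepcopy(payload)
            bent[key] = value * 10
            yield f"bend:{key}", bent


variants = dict(fuzz_variants({"sku": "A-12", "qty": 24}))
for label, value in variants.items():
    print(label, value)
```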

scenarios

Hard real-world situations the agent has to reason through.

faultline scenarios suite.py

replay

Re-run a recorded trace and confirm the verdict still holds.

faultline replay run.json
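
One way to picture replay: re-evaluate a recorded trace with a deterministic rule and compare the fresh verdict with the stored one. The trace layout (`order_total`, `steps`, `recorded_verdict`) and the refund rule below are assumptions for illustration, not the format of faultline's run.json.

```python
import json


def verdict_for(trace):
    # Deterministic rule: a refund larger than the order total is silent-wrong.
    for step in trace["steps"]:
        if step["tool"] == "refund" and step["amount"] > trace["order_total"]:
            return "SILENT-WRONG"
    return "OK"


def replay(trace_json):
    # Recompute the verdict and report whether it matches the recorded one.
    trace = json.loads(trace_json)
    fresh = verdict_for(trace)
    return fresh == trace["recorded_verdict"], fresh


recorded = json.dumps({
    "order_total": 42,
    "steps": [{"tool": "refund", "amount": 210}],
    "recorded_verdict": "SILENT-WRONG",
})
holds, fresh = replay(recorded)
print(holds, fresh)
```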

mine

Learn invariants from good runs, then enforce them.

faultline mine suite.py
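
As an illustration of learning invariants from good runs, the sketch below mines a min/max range per numeric field and flags values outside it. The range-based approach and the names `mine_ranges` and `violations` are assumptions. The source does not say which invariants faultline actually mines.

```python
def mine_ranges(good_runs):
    """Learn a (min, max) range per numeric field from known-good runs."""
    ranges = {}
    for run in good_runs:
        for field, value in run.items():
            low, high = ranges.get(field, (value, value))
            ranges[field] = (min(low, value), max(high, value))
    return ranges


def violations(run, ranges):
    """Return the fields whose value falls outside the learned range."""
    return [
        field for field, value in run.items()
        if field in ranges and not ranges[field][0] <= value <= ranges[field][1]
    ]


good = [{"refund": 42, "items": 1}, {"refund": 18, "items": 3}]
learned = mine_ranges(good)
bad = violations({"refund": 210, "items": 2}, learned)
print(learned, bad)
```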

chaos

The full fault library — timeouts, stale data, truncation, wrong numbers.

faultline run suite.py
No LLM judge

One input, one verdict — every single run.

Detection is behavioral, not a second model's opinion. The same fault produces the same verdict, so it gates CI without flaking.

The fault library

From the failures your tests catch — to the ones they don't.

01
Loud

timeout

The tool hangs. Your tests already see this — it throws.

get_quote() → TimeoutError
✓ your tests catch this
02
Loud

server-error

A 5xx comes back. Loud, logged, handled.

fetch() → 503
✓ your tests catch this
03
Silent

truncate

The list comes back short. No error — the agent just sees fewer rows.

list_orders() → [3 of 240]
✗ acts on a partial list
04
Silent

null-response

An empty payload reads as "nothing found" instead of "lookup failed."

lookup() → null
✗ proceeds as if empty
05
Silent

stale-data

Yesterday's price returned as today's. 200 OK. Completely wrong.

get_price() → $41 (was $58)
✗ trades on stale data
06
Silent

wrong-number

A quantity off by 10×. The agent orders on it without blinking.

qty: 2 → 240
✗ silent-wrong — the headline
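
The four silent faults above can each be pictured as a small function applied to a tool's real response. This is a sketch under assumptions: the function names, the `keep` and `factor` defaults, and the idea of modelling stale data as "the previous value in a history list" are all hypothetical.

```python
def truncate(rows, keep=3):
    # The list comes back short, with no error.
    return rows[:keep]


def null_response(_real):
    # An empty payload replaces the real response.
    return None


def stale(history):
    # Return the previous value instead of the latest one.
    return history[-2]


def wrong_number(value, factor=10):
    # A quantity off by a constant factor.
    return value * factor


print(len(truncate(list(range(240)))))
print(null_response({"found": True}))
print(stale([41, 58]))
print(wrong_number(24))
```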
The interceptor

It sits between your agent and its tools.

faultline wraps every tool call, swaps the real response for a corrupted one, and compares what the agent does against what it should do — by your invariants, not a second model's opinion.

One line wires it into the framework you already use — fl.instrument(graph) for LangGraph, LangChain, LlamaIndex, pydantic-ai, and crewAI, each verified against the real installed library.
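
The interception idea can be sketched in a few lines of plain Python: wrap the tool, replace its real response with a faulted one, then check the agent's recorded actions against invariants. `intercept`, `check`, the toy `agent` and the invariant are hypothetical stand-ins, not faultline's `fl.instrument` or its internals.

```python
import functools


def intercept(tool, fault):
    """Wrap a tool so its real response is replaced by a faulted one."""
    @functools.wraps(tool)
    def wrapped(*args, **kwargs):
        real = tool(*args, **kwargs)
        return fault(real)
    return wrapped


def check(actions, invariants):
    """Return the names of invariants that any recorded action violates."""
    return [inv.__name__ for inv in invariants if not all(inv(a) for a in actions)]


def get_order_total(order_id):
    return 42.0


def refund_within_order_total(action):
    return action["amount"] <= action["order_total"]


def agent(get_total, order_id):
    # Toy agent: refunds whatever amount the tool reports.
    return [{"tool": "refund", "amount": get_total(order_id), "order_total": 42.0}]


clean = check(agent(get_order_total, "o-1"), [refund_within_order_total])
faulted_tool = intercept(get_order_total, lambda total: total * 5)
caught = check(agent(faulted_tool, "o-1"), [refund_within_order_total])
print(clean, caught)
```

The verdict comes from a plain predicate over the agent's actions, so the same fault gives the same result on every run.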

faultline intercepts a broken tool call
The dashboard

Every silent failure, on one screen.

Resilience score, the runs that caught a silent-wrong, and a fault matrix across your agents — wired straight from your CI.

the real dashboard — a 90-second tour, run on a real algo-trading bot

Measured, not claimed

On an 85-case benchmark with adversarial traps.

97.5% recall · 39 of 40
2.2% false-alarm · 1 of 45
100% deterministic
0 LLM judges

Self-authored, independently audited, zero labels overturned. Run it yourself →

Benchmark agents are deterministic Python. Separately, faultline has been demonstrated catching a silent-wrong on a real Claude tool-calling agent — one scenario, not a rate. Reproduce it in 5 minutes →
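
The rates behind the counts can be worked out directly. The sketch assumes the 40 cases are the ones containing a silent failure and the 45 are clean cases, which is the usual reading of recall and false-alarm rate and matches the 85-case total.

```python
# Counts taken from the benchmark summary; the faulty/clean split is an assumption.
caught, faulty_cases = 39, 40
false_alarms, clean_cases = 1, 45

recall = caught / faulty_cases
false_alarm_rate = false_alarms / clean_cases
total_cases = faulty_cases + clean_cases

print(f"recall {recall:.1%}, false-alarm {false_alarm_rate:.1%}, cases {total_cases}")
```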

Where this sits

Not another eval platform.

Evals, observability and faultline answer three different questions. Most teams shipping serious agents will end up wanting all three — faultline is built to sit beside your existing stack, not replace it.

Question 1 · quality

"Is the answer good?"

Eval platforms score outputs against datasets and rubrics — prompt regressions, answer quality, A/B-ing models. Usually judged by an LLM, which is the right tool for subjective quality.

LangSmith · Braintrust · DeepEval · promptfoo
Question 2 · visibility

"What happened in prod?"

Observability traces every call your agent made in production — latency, cost, token usage, full request trees. Essential once you've shipped; it explains failures after they happen.

Langfuse · Arize Phoenix · OpenInference · Datadog
Question 3 · behavior under failure

"What does it DO when its tools lie?"

faultline corrupts your agent's tool data on purpose — wrong numbers, stale or empty responses — and deterministically catches the agent silently acting on it. No LLM judge, no flaky verdicts: a CI gate and a runtime seatbelt for the failure the other two layers aren't built to catch.

faultline — free, open source
commit → unit tests → evals · quality gate → faultline · silent-failure gate → deploy → observability + faultline guard · runtime

the same agent, tested both ways — your eval passes it, faultline catches it

How it scales

From a free CI gate to tamper-evident evidence.

Free

CI gate

A GitHub Action that fails the build the moment your agent silently mishandles a fault. Six modes, all gate CI.

Runtime

Runtime guard

The same checks, live. A seatbelt that blocks an irreversible action before it fires on rule-breaking data.

Evidence

Attestation

A tamper-evident report — edit one verdict and the hash breaks. Reproducible evidence an auditor can re-check.

Install

Integrated in minutes. Permanent in your repo.

pip install faultline — pure standard library, zero dependencies. First verdict in two minutes: faultline init scaffolds a suite + CI workflow, faultline doctor preflights your agent, faultline scan breaks its tools. Then each rung lives in a different layer of your project.

In your repo · every PR

CI gate — free

Run faultline init — it writes this workflow and a starter suite for you. Every pull request then fails the build on a silent failure.
# .github/workflows/faultline.yml
- uses: actions/checkout@v4
- uses: aaravanmay/faultline@main
  with:
    suite: faultline_suite.py
In your production code

Runtime guard

Wrap the irreversible action once. In enforce mode a rule-breaking action raises before it fires.
import faultline as fl  # assumed import alias; the snippets refer to the package as fl
order = fl.wrap(order, is_action=True)

with fl.guard([no_oversell], mode="enforce"):
    agent.run(task)  # bad action → blocked
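
To make "raises before it fires" concrete, here is a minimal sketch of an enforce-mode guard in plain Python. The `Guard` class, `BlockedAction` and the shape of the `no_oversell` rule are hypothetical. Only the rule's name comes from the snippet above, and this is not how `fl.guard` is implemented.

```python
class BlockedAction(Exception):
    """Raised when a rule rejects an action in enforce mode."""


class Guard:
    def __init__(self, rules, mode="enforce"):
        self.rules = rules
        self.mode = mode
        self.violations = []

    def wrap(self, action):
        def guarded(**kwargs):
            # Check every rule before the irreversible action runs.
            broken = [rule.__name__ for rule in self.rules if not rule(kwargs)]
            if broken:
                self.violations.append(broken)
                if self.mode == "enforce":
                    raise BlockedAction(f"blocked by {broken}")
            return action(**kwargs)
        return guarded


def no_oversell(call):
    # Assumed rule: never order more than is in stock.
    return call["qty"] <= call["in_stock"]


placed = []


def place_order(**kwargs):
    placed.append(kwargs)


order = Guard([no_oversell]).wrap(place_order)
order(qty=2, in_stock=5)
try:
    order(qty=9, in_stock=5)
    blocked = False
except BlockedAction:
    blocked = True
print(blocked, len(placed))
```

In this sketch any mode other than "enforce" records the violation and lets the action through, which is one plausible way to trial rules before turning on blocking.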
In your release pipeline

Attestation

Each build writes a tamper-evident verdict file. Anyone can re-verify it — edit one number and the hash breaks.
faultline attest suite.py
faultline verify faultline.report.json
# verified: 3 verdict(s), hash OK
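
The "edit one number and the hash breaks" property can be illustrated with a SHA-256 digest over a canonical JSON encoding of the verdicts. The report layout and the `attest` and `verify` helpers below are assumptions for illustration, not the format of faultline.report.json.

```python
import hashlib
import json


def _digest(verdicts):
    # Canonical encoding: sorted keys and fixed separators give stable bytes.
    body = json.dumps(verdicts, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def attest(verdicts):
    return {"verdicts": verdicts, "sha256": _digest(verdicts)}


def verify(report):
    return _digest(report["verdicts"]) == report["sha256"]


report = attest([
    {"case": "stale-data", "verdict": "SILENT-WRONG"},
    {"case": "timeout", "verdict": "OK"},
])
ok_before = verify(report)
report["verdicts"][0]["verdict"] = "OK"  # tamper with one verdict
ok_after = verify(report)
print(ok_before, ok_after)
```

A bare digest only detects edits when the hash itself is stored or signed somewhere the editor cannot reach, since anyone who can rewrite the file can also recompute it.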

Ship agents you can actually trust.

Open source. Deterministic. Catches the bug your evals miss.