faultline breaks your agent's tools on purpose — wrong numbers, stale data, empty responses — and catches the moment it confidently ships a wrong answer with no error. Deterministically. No LLM judge.

Your tests check that the agent runs. They don't check that a tool quietly handed it a stale price, a truncated list, or a number off by a factor of ten — and that the agent acted on it anyway. No exception. No alert. Just a wrong action with a price tag.
A realistic 7-tool support agent (refunds, emails, tickets), written normally — no planted bugs. On honest data it’s clean. Then a stale cache bends one number and it refunds $210 on a $42 order, with no error anywhere. faultline catches it, plus what the model handled fine — reported honestly. The whole interrogation cost $0.16, and the agent + battery ship in the repo so you can re-run it — pip install faultline.
run it yourself, live: faultlineapp.com/demo.html
Every mode injects realistic faults into your tools and watches what the agent does — then gates CI when it does the wrong thing. Start with zero config: faultline scan agent.py:my_agent — no suite file, no rules, it finds your tools and breaks them itself.
Honest edge cases you define — the inputs you already worry about.
faultline probe suite.pyAuto-generated edge inputs: empty, null, bent numbers, dropped keys.
faultline fuzz suite.pyHard real-world situations the agent has to reason through.
faultline scenarios suite.pyRe-run a recorded trace and confirm the verdict still holds.
faultline replay run.jsonLearn invariants from good runs, then enforce them.
faultline mine suite.pyThe full fault library — timeouts, stale data, truncation, wrong numbers.
faultline run suite.pyDetection is behavioral, not a second model's opinion. The same fault produces the same verdict, so it gates CI without flaking.
The tool hangs. Your tests already see this — it throws.
get_quote() → TimeoutErrorA 500 comes back. Loud, logged, handled.
fetch() → 503The list comes back short. No error — the agent just sees fewer rows.
list_orders() → [3 of 240]An empty payload reads as "nothing found" instead of "lookup failed."
lookup() → nullYesterday's price returned as today's. 200 OK. Completely wrong.
get_price() → $41 (was $58)A quantity off by 10×. The agent orders on it without blinking.
qty: 2 → 240
faultline wraps every tool call, swaps the real response for a corrupted one, and compares what the agent does against what it should do — by your invariants, not a second model's opinion.
One line wires it into the framework you already use — fl.instrument(graph) for LangGraph, LangChain, LlamaIndex, pydantic-ai, and crewAI, each verified against the real installed library.

Resilience score, the runs that caught a silent-wrong, and a fault matrix across your agents — wired straight from your CI.
the real dashboard — a 90-second tour, run on a real algo-trading bot
Self-authored, independently audited, zero labels overturned. Run it yourself →
Benchmark agents are deterministic Python. Separately, faultline has been demonstrated catching a silent-wrong on a real Claude tool-calling agent — one scenario, not a rate. Reproduce it in 5 minutes →
Evals, observability and faultline answer three different questions. Most teams shipping serious agents will end up wanting all three — faultline is built to sit beside your existing stack, not replace it.
Eval platforms score outputs against datasets and rubrics — prompt regressions, answer quality, A/B-ing models. Usually judged by an LLM, which is the right tool for subjective quality.
Observability traces every call your agent made in production — latency, cost, token usage, full request trees. Essential once you've shipped; it explains failures after they happen.
faultline corrupts your agent's tool data on purpose — wrong numbers, stale or empty responses — and deterministically catches the agent silently acting on it. No LLM judge, no flaky verdicts: a CI gate and a runtime seatbelt for the failure the other two layers aren't built to catch.
the same agent, tested both ways — your eval passes it, faultline catches it

A GitHub Action that fails the build the moment your agent silently mishandles a fault. Six modes, all gate CI.
The same checks, live. A seatbelt that blocks an irreversible action before it fires on rule-breaking data.
A tamper-evident report — edit one verdict and the hash breaks. Reproducible evidence an auditor can re-check.
pip install faultline — pure standard library, zero dependencies. First verdict in two minutes: faultline init scaffolds a suite + CI workflow, faultline doctor preflights your agent, faultline scan breaks its tools. Then each rung lives in a different layer of your project.
# .github/workflows/faultline.yml - uses: actions/checkout@v4 - uses: aaravanmay/faultline@main with: suite: faultline_suite.py
order = fl.wrap(order, is_action=True) with fl.guard([no_oversell], mode="enforce"): agent.run(task) # bad action → blocked
faultline attest suite.py
faultline verify faultline.report.json
# verified: 3 verdict(s), hash OK
Open source. Deterministic. Catches the bug your evals miss.