Building BuchhalterPython: How We Set Up Agentic Infrastructure Before Writing Code

Before writing the first line of application code for BuchhalterPython, we invested a few hours designing and building agentic infrastructure. That’s one of the things AI fundamentally changes: what used to take weeks of scaffolding and deliberation now takes an afternoon. This wasn’t bureaucracy dressed up as planning. It was the difference between a project that ships and one that collapses under its own ambition.

The temptation is always the same: start coding immediately. You know what you want to build. You’ve sketched the data model on a whiteboard. Why not just start? The honest answer: because infrastructure becomes infinitely more expensive to change after code depends on it.

Six Agents, One Mission

We began by defining our team structure — not of humans, but of AI agents, each with a single, clear responsibility:

  • Backend Engineer: Write and optimise core business logic
  • Test Engineer: Design test strategies and catch regressions
  • PM Agent: Manage roadmap, priorities, and requirements flow
  • Wiki Writer: Document decisions and architecture
  • Infra Engineer: Handle deployments, CI/CD, monitoring
  • Learning Chronicler: Capture retrospectives and improvement insights

This sounds like premature structure, but it’s actually the opposite. By clarifying responsibilities upfront, we avoid the chaos of every agent trying to solve every problem. A test engineer knows exactly what it’s optimising for. A backend engineer isn’t distracted by infrastructure concerns. When someone needs to fix a bug, the right agent is already warmed up.
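A single-responsibility roster like this is simple to encode. Here's a minimal sketch — the `Agent` dataclass and `agents_for` helper are illustrative, not BuchhalterPython's actual code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """One agent, one clearly scoped responsibility."""
    name: str
    responsibility: str


# The roles mirror the list above; the identifiers are our own invention.
AGENTS: list[Agent] = [
    Agent("backend-engineer", "write and optimise core business logic"),
    Agent("test-engineer", "design test strategies and catch regressions"),
    Agent("pm", "manage roadmap, priorities, and requirements flow"),
    Agent("wiki-writer", "document decisions and architecture"),
    Agent("infra-engineer", "handle deployments, CI/CD, monitoring"),
    Agent("learning-chronicler", "capture retrospectives and improvements"),
]


def agents_for(keyword: str) -> list[Agent]:
    """Route a task to the agents whose responsibility mentions the keyword."""
    return [a for a in AGENTS if keyword.lower() in a.responsibility]
```

The point isn't the data structure; it's that routing becomes a lookup instead of a negotiation.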

Golden Standards as Rails

We then codified non-negotiable practices — what I call “golden standards”:

  • Type hints on every function parameter and return value
  • Docstrings for every public method (one-liner minimum)
  • Cross-platform file paths (no hardcoded backslashes)
  • 100-character line limits (readability over cleverness)
  • Test-driven development as the default mode, not an afterthought

These aren’t suggestions. They’re embedded in how agents work. A backend agent won’t commit code without type hints. A test agent won’t accept unmocked external dependencies. This creates a gravity well: doing the right thing becomes the path of least resistance.
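To make the standards concrete, here's what a function meeting all of them might look like — `invoice_path` and its directory layout are hypothetical examples, not project code:

```python
from pathlib import Path


def invoice_path(base_dir: Path, invoice_id: str) -> Path:
    """Return the storage path for an invoice PDF."""
    # pathlib joins segments portably on every OS — no hardcoded backslashes
    return base_dir / "invoices" / f"{invoice_id}.pdf"
```

Type hints on every parameter and the return value, a one-liner docstring, cross-platform paths, and well under 100 characters per line.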

Testing in Layers

We designed a three-layer testing pyramid:

Local Unit Tests via pytest. Agents run these before committing anything. If a test fails locally, it doesn’t reach the repository.

Drone CI/CD. Every commit triggers automated testing in an isolated environment. This catches environment-specific failures and coupling issues that local tests might miss.

Staging Deployment. The final layer. Code that passes CI runs in a staging environment before touching production. This catches integration issues that automated tests sometimes miss.
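At the base of the pyramid, unit tests mock out every external dependency so the suite stays fast and network-free. A sketch of what that looks like in pytest style — `extract_total` and the `recognise` method are invented for illustration:

```python
from unittest import mock


def extract_total(ocr_client, document: bytes) -> float:
    """Parse the invoice total from an OCR client's text output."""
    text = ocr_client.recognise(document)
    for line in text.splitlines():
        if line.lower().startswith("total:"):
            return float(line.split(":", 1)[1].strip())
    raise ValueError("no total line found in OCR output")


def test_extract_total_with_mocked_ocr() -> None:
    """The OCR service is mocked — no network call ever runs in the suite."""
    fake = mock.Mock()
    fake.recognise.return_value = "Invoice 42\nTotal: 119.00"
    assert extract_total(fake, b"%PDF...") == 119.0
    fake.recognise.assert_called_once()
```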

This structure means failures surface where they’re cheap (locally) rather than where they’re catastrophically expensive (in front of users). Each layer exists to stop problems before the next one has to.

Token Optimisation as Architecture

Here’s where infrastructure thinking really paid dividends. Agentic work is stateless by nature — each query is independent. That means context must be reloaded every time. A naive approach regenerates the entire project structure, dependencies, and relevant code on every query. Expensive.

We built index files: curated, human-edited summaries of key modules and their responsibilities. A backend agent needing context about database schema doesn’t read 50 files. It reads one 300-word index. The goal is simple: every token you don’t spend on context is a token spent on actual work.
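The loading side of this can be trivially small. A sketch, assuming one curated Markdown index per module — the file layout and `load_context` helper are our own illustration:

```python
from pathlib import Path


def load_context(index_dir: Path, module: str, max_chars: int = 2_000) -> str:
    """Return the curated index for a module instead of its full source tree."""
    index = index_dir / f"{module}.md"
    # Prefer returning nothing over dumping raw source files into the prompt
    if not index.exists():
        return ""
    return index.read_text(encoding="utf-8")[:max_chars]
```

The cap matters as much as the lookup: an index that grows unbounded quietly recreates the problem it was meant to solve.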

For document processing, the cost story gets concrete. BuchhalterPython will use Mistral for OCR — currently the best OCR model available, and with its EU headquarters, it’s also the most straightforward choice for GDPR compliance. Mistral’s batch API cuts processing costs by 50% compared to real-time requests. For a system that will process hundreds of invoices and receipts, that’s not a rounding error.
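As a back-of-envelope check — the per-page price below is a placeholder; only the 50% batch discount comes from Mistral's pricing:

```python
def ocr_cost(pages: int, price_per_page: float, batched: bool = True) -> float:
    """Estimated OCR spend; the batch API halves the real-time price."""
    discount = 0.5 if batched else 1.0
    return pages * price_per_page * discount
```

At a hypothetical €0.01 per page, 1,000 pages cost roughly €5 batched versus €10 real-time — a gap that compounds with every monthly processing run.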

These aren’t micro-optimisations. They’re decisions that determine whether the project remains cost-effective at scale.

Why This Matters

The conventional workflow is: build fast, hit problems, restructure, rebuild. This works for weekend projects. It fails spectacularly for anything requiring hundreds of AI-generated commits — something we learned the hard way with the n8n predecessor.

Agentic workflows are like scaling a restaurant kitchen. The first two cooks can coordinate by shouting across the line. The tenth cook needs a documented brigade system. By the time you’re working with six agents across hundreds of commits, infrastructure isn’t overhead — it’s the only thing that works.

What surprised us most wasn’t the infrastructure complexity. It was how much clearer the domain became once we’d defined agent responsibilities and testing strategies. The act of deciding how to test something forces you to understand what it actually does.

Takeaway

If you’re building anything with agentic AI — whether it’s internal tooling or a commercial product — I’d recommend the same approach. Spend a few hours on infrastructure. Define your agents. Codify your standards. Then code faster than you ever thought possible.

The infrastructure isn’t slowing you down. It’s enabling you to move at scale.

Next in this series: Building BuchhalterPython: Architecture Before the First Commit (Part 2) — five architectural decisions made before writing a single line of business logic.