Lernreise 2/7: The Starting Point: A Thousand Untagged Documents

Six months ago I migrated my homelab. The old machine was a netbook that had served faithfully for years, and by the end of its life it was doing things that its manufacturer had never intended and would probably have considered unkind. I replaced it with a MiniPC: an N150 processor, 16 GB of RAM, a form factor that fits in a drawer. It runs Proxmox. It is fast. It is quiet. I am unreasonably pleased with it.

The migration was, like all migrations, an opportunity to do things properly this time. I moved from Mayan EDMS to Paperless-NGX. Fresh database. Clean structure. New approach to tagging, document types, and correspondents.

Around document 100, I stopped.


The problem with “doing things properly this time” is that it still requires doing them. Paperless-NGX is a genuinely good piece of software, and its built-in classifier will, over time, learn your patterns and make reasonable suggestions. But “over time” means: after you have classified enough documents for it to learn from. Which means you have to classify them first.

I had roughly a thousand documents. The classifier helped. It was not, as I noted at the time, the yellow from the egg. (Direct translation from German. It means: not exactly ideal.) Tesseract, the OCR engine underneath, is adequate for well-scanned PDFs and struggles with everything else.

My tagging structure started tidy and became progressively messier as I made decisions I later reversed, added categories I never used, and created correspondents with slightly different spellings of the same organisation’s name.
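The spelling problem in particular lends itself to a small script. What follows is a minimal sketch, not what I actually ran: it assumes the correspondent names have already been exported as a plain list, and the function names, the list of legal suffixes, and the 0.85 threshold are all made up for illustration. It only uses Python's standard library.

```python
import re
from difflib import SequenceMatcher

# Assumption: legal-form suffixes that should not count as a difference.
LEGAL_SUFFIXES = {"gmbh", "ag", "ltd", "inc"}


def normalise(name: str) -> str:
    """Lower-case the name, drop punctuation and legal-form suffixes."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    words = [w for w in cleaned.split() if w not in LEGAL_SUFFIXES]
    return " ".join(words)


def find_near_duplicates(names, threshold=0.85):
    """Return pairs of names whose normalised forms are nearly identical."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            ratio = SequenceMatcher(
                None, normalise(first), normalise(second)
            ).ratio()
            if ratio >= threshold:
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    correspondents = [
        "Telekom Deutschland GmbH",
        "Telekom Deutschland",
        "Deutsche Bahn",
    ]
    for first, second in find_near_duplicates(correspondents):
        print(f"possible duplicate: {first!r} / {second!r}")
```

The script only proposes candidates. Deciding which spelling survives is still a human job, and the threshold would need tuning against a real list.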

I can find any document in seconds. But assemble all receipts for the year, organised by category, for the tax accountant? That requires manual work I had been putting off for months.


About four weeks before this learning week, I had set up n8n. I had clicked together a workflow with AI assistance: n8n pulls documents from Paperless, sends them through Mistral OCR, pushes the results through a Mistral Small LLM with a structured prompt, and writes the metadata back.
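The workflow itself lives in n8n, not in code, but its shape is simple enough to write down. Here is a rough sketch under stated assumptions: the four stages are passed in as plain callables, and the names `Document`, `run_pipeline`, `fetch`, `ocr`, `classify`, and `write_back` are hypothetical. The real HTTP calls to Paperless and Mistral are deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Document:
    """Hypothetical stand-in for a document pulled from Paperless."""
    doc_id: int
    content: bytes


def run_pipeline(
    fetch: Callable[[], Iterable[Document]],
    ocr: Callable[[bytes], str],
    classify: Callable[[str], dict],
    write_back: Callable[[int, dict], None],
) -> int:
    """Run every fetched document through OCR and the LLM, then store the result.

    Returns the number of documents processed.
    """
    processed = 0
    for doc in fetch():
        text = ocr(doc.content)            # stage 2: OCR service
        metadata = classify(text)          # stage 3: LLM with structured prompt
        write_back(doc.doc_id, metadata)   # stage 4: update the document
        processed += 1
    return processed
```

Keeping the stages as separate functions mirrors the separate nodes in the n8n workflow, and it means each stage can be swapped or tested on its own.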

The Mistral OCR was the revelation.

I tested it on receipts photographed in a hurry on a phone: poor lighting, slightly blurred. It extracted the amounts. Not just the totals, the line items. I handed it a crumpled supermarket receipt photographed at an angle and it came back with the correct euro figure. That is genuinely impressive, and if you have spent any time with Tesseract you will understand why.

The LLM part was less impressive. It helped, but the metadata it returned was inconsistent. Sometimes it got the document type right. Sometimes it invented a correspondent. The structure of my tags was too messy to guide it reliably, and the prompts I had written were not precise enough.
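One cheap guard against invented correspondents is to check the LLM's answer against the lists that already exist before anything is written back. This is a sketch of the idea, not part of the workflow as I built it: it assumes the model returns a JSON object, and the field names `document_type` and `correspondent` as well as the function name are assumptions for illustration.

```python
import json


def validate_metadata(raw, known_types, known_correspondents):
    """Check an LLM response against the values that already exist.

    Returns (metadata, problems). metadata is None if the response
    cannot be parsed at all; problems is a list of human-readable strings.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None, ["response is not valid JSON"]
    if not isinstance(data, dict):
        return None, ["response is not a JSON object"]

    problems = []
    checks = [
        ("document_type", "document type", known_types),
        ("correspondent", "correspondent", known_correspondents),
    ]
    for field, label, allowed in checks:
        value = data.get(field)
        # Anything that is not a known string counts as invented.
        if not isinstance(value, str) or value not in allowed:
            problems.append(f"unknown {label}: {value!r}")
    return data, problems
```

A document with a non-empty problem list could be routed to a review queue instead of being updated automatically.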


This became the goal of the week.

Not just “use AI for document management”, which is vague. A specific, useful outcome: one click to prepare everything for the tax accountant. All documents for a given year, correctly tagged by category, correspondents clean, financial data extracted and structured. Self-hosted. Extensible. Self-improving over time.

The constraints: fifty euros. One week. A homelab that was already running a ten-year-old MariaDB instance that started its life as a MySQL database and has survived more migrations than I care to count.

The infrastructure: Proxmox with several VMs and LXC containers, a Docker host with assorted containers, n8n already running, Paperless already running, and roughly a thousand documents in various states of classification.

The ambition: considerably larger than the constraints.


Lernreise, part 2 of 7. Follow the lernreise tag for the full series.