The tracebloc Playbook: How to Achieve Top Performance in Automated Claims Classification
Tracebloc is a tool for benchmarking AI models on private data. This Playbook breaks down how a team used tracebloc to benchmark AI models on their claims data and discovered which model truly delivered the best results. Find out more on our website or schedule a call with the founder directly.
Why Model Performance Matters
Every inaccurate classification costs money. Using tracebloc, an insurance company uncovered which NLP model truly performs under pressure, saving over €3 million a year compared to manual workflows.
Step 1: The Challenge
Julia Reinhardt, Head of Claims Automation at an insurance company in Zurich, is tasked with streamlining the triage and processing of incoming insurance claims while reducing manual workload and review delays.
Today, incoming claims arrive as PDFs, emails, or scanned forms. A single case may include 10 to 50 heterogeneous documents: police reports, medical notes, invoices, repair estimates, damage photos, etc. Manually sorting these slows down processing, leads to human error, and causes SLA breaches. Julia’s goal is to introduce an AI-based classification engine that automatically labels documents by type and priority, helping claims handlers to find critical cases faster.
Key requirements:
· Model must run on-prem or in a secure private cloud environment
· Must achieve ≥98% document classification accuracy on real-world claims data
· Must integrate with the existing claims platform and comply with BaFin and GDPR regulations
The insurance company has access to a proprietary dataset: 500,000 labeled claims documents across 12 categories. Developing a custom model is an option, but it would be time-consuming. Julia instead uses tracebloc to set up a secure sandbox and run a structured evaluation of highly specialized external vendors. This lets her benchmark state-of-the-art AI solutions on her own data without exposing it.
Step 2: What the Vendors Claimed
Each vendor submitted commercial and technical proposals:
| Vendor | Claimed Accuracy | Cost per Document | Integration Complexity |
|---|---|---|---|
| A | 96.5% | €0.08 | Low |
| B | 98.0% | €0.20 | Moderate |
| C | 98.5% | €0.22 | Moderate |
All vendors claimed ≥96% classification accuracy. Julia's team focused on recall for minority classes (e.g. medical invoices, police reports) and misclassification rate, especially in multi-page documents.
Step 3: Secure Evaluation and Fine-Tuning
Using tracebloc, Julia sets up a secure evaluation environment within the company's infrastructure. Vendors receive no raw data; models are fine-tuned on-prem in a secured setup to ensure full compliance.
Inside this environment, each vendor's model gets 400,000 labeled documents for fine-tuning, and 100,000 held-out documents are reserved for benchmarking. The standard metrics are accuracy, per-class recall, misclassification rate, and latency per document.
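As a rough illustration of how three of these metrics relate to each other, here is a minimal sketch in plain Python. It is not tracebloc's implementation. The function name `classification_metrics` and the toy labels are assumptions made for this example, and latency is left out because it depends on the serving setup.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy, misclassification rate, and per-class recall."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one labeled document")

    total = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))

    # Recall for a class = correctly predicted documents of that class
    # divided by all documents that truly belong to that class.
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recall = {label: hits[label] / n for label, n in support.items()}

    accuracy = correct / total
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1 - accuracy,
        "per_class_recall": recall,
    }


# Toy held-out labels (hypothetical, not the insurer's data).
y_true = ["invoice", "invoice", "police_report", "medical_note", "police_report"]
y_pred = ["invoice", "invoice", "police_report", "invoice", "medical_note"]
metrics = classification_metrics(y_true, y_pred)
```

In this toy case overall accuracy looks acceptable while recall for the rarer classes is poor, which is why per-class recall is tracked separately from accuracy.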
Following initial baselines, vendors fine-tune their models and submit updated versions. Results show a meaningful gap between claimed and actual performance.
Step 4: Observed Results After Testing
| Vendor | Claimed Accuracy | Baseline Accuracy | Accuracy After Fine-Tuning |
|---|---|---|---|
| A | 96.5% | 93.2% | 94.1% |
| B | 98.0% | 94.8% | 98.2% |
| C | 98.5% | 95.6% | 98.6% |
Surprise outcome: Vendors B and C both exceeded their claimed accuracy after on-prem fine-tuning, with Vendor C scoring highest overall. Vendor A stayed below its claim and below the 98% requirement.
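The comparison in the results table can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the accuracy figures and the 98% threshold come from this Playbook, while the names `results`, `summarize`, and `REQUIRED_ACCURACY` are invented for the example.

```python
REQUIRED_ACCURACY = 98.0  # percent, from the key requirements in Step 1

# Accuracy figures (percent) copied from the results table.
results = {
    "A": {"claimed": 96.5, "baseline": 93.2, "fine_tuned": 94.1},
    "B": {"claimed": 98.0, "baseline": 94.8, "fine_tuned": 98.2},
    "C": {"claimed": 98.5, "baseline": 95.6, "fine_tuned": 98.6},
}


def summarize(results, required):
    """Compare measured accuracy with each vendor's claim and the requirement."""
    rows = {}
    for vendor, r in results.items():
        rows[vendor] = {
            # Positive: the fine-tuned model beat the vendor's own claim.
            "gap_vs_claim": round(r["fine_tuned"] - r["claimed"], 1),
            # Accuracy points gained through on-prem fine-tuning.
            "fine_tuning_gain": round(r["fine_tuned"] - r["baseline"], 1),
            "meets_requirement": r["fine_tuned"] >= required,
        }
    return rows


summary = summarize(results, REQUIRED_ACCURACY)
qualified = [v for v, row in summary.items() if row["meets_requirement"]]
```

On these numbers, only Vendors B and C clear the 98% bar, and both gain roughly three accuracy points from fine-tuning on the company's own data.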
Step 5: Business Case – Cost of Misclassification
Assumptions:
· Annual document volume: 5 million
· Manual classification cost: €0.10 per document
· Misclassification rate baseline: 7% → 350,000 errors/year
· Cost per error (e.g. wrong triage, SLA breach, fraud risk): €15
· AI usage: full-scale, automated classification with human override on edge cases
| Strategy | Accuracy | Misclassified Docs | Error Cost | AI Cost | Total Cost |
|---|---|---|---|---|---|
| Manual Only | ~93% | 350,000 | €5,250,000 | €0 | €5,250,000 |
| Vendor A | 94.1% | 244,500 | €3,667,500 | €350,000 | €4,017,500 |
| Vendor B | 98.2% | 85,000 | €1,275,000 | €400,000 | €1,675,000 |
| Vendor C ✅ | 98.6% | 80,000 | €1,200,000 | €400,000 | €1,600,000 |
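The arithmetic behind the table is error cost = misclassified documents × €15, and total cost = error cost + AI cost. The sketch below takes the misclassified counts and AI costs as given in the table rather than deriving them from accuracy or per-document price. The names `strategies` and `total_cost` are assumptions made for this example.

```python
COST_PER_ERROR_EUR = 15  # from the assumptions list

# Misclassified documents per year and annual AI cost (EUR), as listed in the table.
strategies = {
    "Manual Only": {"misclassified": 350_000, "ai_cost": 0},
    "Vendor A": {"misclassified": 244_500, "ai_cost": 350_000},
    "Vendor B": {"misclassified": 85_000, "ai_cost": 400_000},
    "Vendor C": {"misclassified": 80_000, "ai_cost": 400_000},
}


def total_cost(misclassified, ai_cost, cost_per_error=COST_PER_ERROR_EUR):
    """Total annual cost = error cost + AI cost."""
    return misclassified * cost_per_error + ai_cost


totals = {name: total_cost(**s) for name, s in strategies.items()}

# Annual savings of each strategy relative to the manual-only workflow.
savings_vs_manual = {name: totals["Manual Only"] - t for name, t in totals.items()}

# Share of manual-only errors that Vendor C eliminates.
error_reduction_c = 1 - (
    strategies["Vendor C"]["misclassified"] / strategies["Manual Only"]["misclassified"]
)
```

With these inputs, Vendor C comes out at €1,600,000 per year, which is €3,650,000 less than the manual-only total.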
Step 6: Decision – Human + AI Hybrid Strategy
After secure benchmarking and integration testing, Vendor C’s fine-tuned model reached 98.6% accuracy, significantly reducing misclassification. The selected hybrid setup includes automatic classification for all documents, with human review for critical document types and flagged uncertainties.
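One way to picture the hybrid routing described above is a simple rule: send a document to a human if it belongs to a critical type or if the model is unsure. This is a minimal sketch, not the deployed logic. The Playbook names neither the critical document types nor a confidence threshold, so `CRITICAL_TYPES`, `CONFIDENCE_THRESHOLD`, `Prediction`, and `route` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical values chosen for illustration only.
CRITICAL_TYPES = {"police_report", "medical_invoice"}
CONFIDENCE_THRESHOLD = 0.90


@dataclass
class Prediction:
    doc_id: str
    label: str
    confidence: float  # model's confidence in `label`, between 0 and 1


def route(pred: Prediction) -> str:
    """Return 'human_review' for critical or uncertain documents, else 'auto'."""
    if pred.label in CRITICAL_TYPES:
        return "human_review"
    if pred.confidence < CONFIDENCE_THRESHOLD:
        return "human_review"
    return "auto"


# Toy predictions (hypothetical).
batch = [
    Prediction("doc-1", "repair_estimate", 0.99),
    Prediction("doc-2", "police_report", 0.99),
    Prediction("doc-3", "invoice", 0.62),
]
decisions = {p.doc_id: route(p) for p in batch}
```

In practice the threshold would be tuned on held-out data so that the human review queue stays manageable while still catching most uncertain cases.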
Benefits:
· Over 75% reduction in classification errors (from 350,000 to 80,000 per year)
· ~€3.65M annual savings compared to manual-only workflows
· Seamless integration into the company's existing claims system
· End-to-end audit trail and full GDPR compliance