
Enhancing Claims Processing with Secure AI Document Classification
Participants: 9
End Date: 17.12.26
Dataset: dc7u46yv
Resources: 2 CPU (8.59 GB) | 1 GPU (22.49 GB)
Compute: 0 / 0 F
Submits: 0 / 5
Overview

Tracebloc is a tool for benchmarking AI models on private data. This Playbook breaks down how a team used tracebloc to benchmark AI models on their claims data and discovered which model truly delivered the best results. Find out more on our website or schedule a call with the founder directly.
Every inaccurate classification costs money. Using tracebloc, an insurance company uncovered which NLP model truly performs under pressure, saving over €3 million a year compared to manual workflows.
Julia Reinhardt, Head of Claims Automation at an insurance company in Zurich, is tasked with streamlining the triage and processing of incoming insurance claims while reducing manual workload and review delays.
Today, incoming claims arrive as PDFs, emails, or scanned forms. A single case may include 10 to 50 heterogeneous documents: police reports, medical notes, invoices, repair estimates, damage photos, etc. Manually sorting these slows down processing, leads to human error, and causes SLA breaches. Julia's goal is to introduce an AI-based classification engine that automatically labels documents by type and priority, helping claims handlers find critical cases faster.
The insurance company has access to a proprietary dataset: 500,000 labeled claims documents across 12 categories. While developing a custom model is an option, it would be time-consuming. Julia instead decides to use tracebloc to set up a secure sandbox and launch a structured evaluation of highly specialized external vendors. This enables her to benchmark state-of-the-art AI solutions on her data while keeping it secure and never exposing any of it.
Each vendor submitted commercial and technical proposals:
| VENDOR | CLAIMED ACCURACY | COST PER DOCUMENT | INTEGRATION COMPLEXITY |
| --- | --- | --- | --- |
| A | 96.5% | €0.08 | Low |
| B | 98.0% | €0.20 | Moderate |
| C | 98.5% | €0.22 | Moderate |
All vendors claimed ≥96% classification accuracy. Julia's team focused on recall for minority classes (e.g. medical invoices, police reports) and misclassification rate, especially in multi-page documents.
Using tracebloc, Julia sets up a secure evaluation environment within the company's infrastructure. Vendors receive no raw data; models are fine-tuned on-prem in a secured setup to ensure full compliance.
The company provides each vendor with 400,000 labeled documents for fine-tuning and 100,000 held-out documents for benchmarking. The standard metrics are accuracy, per-class recall, misclassification rate, and latency per document.
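The benchmark metrics above are straightforward to compute from predicted versus true labels. The sketch below is purely illustrative, using toy labels and made-up function names; it is not tracebloc's actual API:

```python
# Illustrative computation of the benchmark metrics: accuracy,
# per-class recall, and misclassification rate. Names are hypothetical.
from collections import defaultdict

def benchmark_metrics(y_true, y_pred):
    total = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    per_class_total = defaultdict(int)   # documents per true class
    per_class_hits = defaultdict(int)    # correctly labeled per class
    for t, p in zip(y_true, y_pred):
        per_class_total[t] += 1
        if t == p:
            per_class_hits[t] += 1
    return {
        "accuracy": correct / total,
        "misclassification_rate": 1 - correct / total,
        # Recall per class: of all documents truly in class c,
        # what fraction were labeled c? This is the metric Julia's
        # team watches for minority classes like medical invoices.
        "per_class_recall": {
            c: per_class_hits[c] / per_class_total[c] for c in per_class_total
        },
    }

# Toy example: "invoice" is a minority class, so its recall matters most.
y_true = ["invoice", "report", "report", "invoice", "photo", "report"]
y_pred = ["invoice", "report", "photo",  "report",  "photo", "report"]
m = benchmark_metrics(y_true, y_pred)
```

A high overall accuracy can hide poor recall on rare classes, which is exactly why per-class recall is tracked separately here.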
Following initial baselines, vendors fine-tune their models and submit updated versions. Results show a meaningful gap between claimed and actual performance.
| VENDOR | CLAIMED ACCURACY | BASELINE ACCURACY | ACCURACY AFTER FINE-TUNING |
| --- | --- | --- | --- |
| A | 96.5% | 93.2% | 94.1% |
| B | 98.0% | 94.8% | 98.2% |
| C | 98.5% | 95.6% | 98.6% |
Surprise outcome: Vendor C surpassed its own claim after on-prem fine-tuning, outperforming all others.
| STRATEGY | ACCURACY | MISCLASSIFIED DOCS | ERROR COST | AI COST | TOTAL COST |
| --- | --- | --- | --- | --- | --- |
| Manual Only | ~93% | 350,000 | €5,250,000 | €0 | €5,250,000 |
| Vendor A | 94.1% | 244,500 | €3,667,500 | €350,000 | €4,017,500 |
| Vendor B | 98.2% | 85,000 | €1,275,000 | €400,000 | €1,675,000 |
| Vendor C ✅ | 98.6% | 80,000 | €1,200,000 | €400,000 | €1,600,000 |
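The totals in the table follow a simple formula: misclassified documents times a per-document error cost, plus the AI licensing cost. A minimal sketch, assuming a €15 error cost per misclassified document (implied by the Manual Only row, not stated explicitly in the source):

```python
# Back-of-the-envelope cost model behind the comparison table.
# The €15 per-document error cost is an assumption inferred from
# the table (350,000 x €15 = €5,250,000 for Manual Only).
ERROR_COST_PER_DOC = 15  # euros, assumed

def total_cost(misclassified_docs: int, ai_cost: int) -> int:
    """Annual cost = error cost of misclassified docs + AI cost."""
    return misclassified_docs * ERROR_COST_PER_DOC + ai_cost

manual = total_cost(350_000, 0)         # Manual Only baseline
vendor_b = total_cost(85_000, 400_000)  # Vendor B after fine-tuning
annual_savings = manual - vendor_b      # over €3 million a year
```

The same arithmetic applied to Vendor C's figures yields the roughly €3.6 million annual saving cited in the introduction.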
After secure benchmarking and integration testing, Vendor C’s fine-tuned model reached 98.6% accuracy, significantly reducing misclassification. The selected hybrid setup includes automatic classification for all documents, with human review for critical document types and flagged uncertainties.