Enhancing Claims Processing with Secure AI Document Classification

End Date: 31.12.26
Dataset: dc7u46yv
Resources: 2 CPU (8.59 GB) | 1 GPU (22.49 GB)
Compute: 0 / 100.00 PF
Submits: 0/5

The tracebloc Playbook: How to Achieve Top Performance in Automated Claims Classification

Tracebloc is a tool for benchmarking AI models on private data. This Playbook breaks down how a team used tracebloc to benchmark AI models on their claims data and discovered which model truly delivered the best results. Find out more on our website or schedule a call with the founder directly.

Why Model Performance Matters

Every inaccurate classification costs money. Using tracebloc, an insurance company uncovered which NLP model truly performs under pressure, saving over €3 million a year compared to manual workflows.

Step 1: The Challenge

Julia Reinhardt, Head of Claims Automation at an insurance company in Zurich, is tasked with streamlining the triage and processing of incoming insurance claims while reducing manual workload and review delays.

Today, incoming claims arrive as PDFs, emails, or scanned forms. A single case may include 10 to 50 heterogeneous documents: police reports, medical notes, invoices, repair estimates, damage photos, etc. Manually sorting these slows down processing, leads to human error, and causes SLA breaches. Julia’s goal is to introduce an AI-based classification engine that automatically labels documents by type and priority, helping claims handlers to find critical cases faster.

Key requirements:

Model must run on-prem or in a secure private cloud environment
Must achieve ≥98% document classification accuracy on real-world claims data
Must integrate with the existing claims platform and comply with BaFin and GDPR regulations

The insurance company has access to a proprietary dataset: 500,000 labeled claims documents across 12 categories. While developing a custom model is an option, it would be time-consuming. Julia instead decides to use tracebloc to set up a secure sandbox and launch a structured evaluation of highly specialized external vendors. This lets her benchmark state-of-the-art AI solutions on her own data without exposing any of it.

Step 2: What the Vendors Claimed

Each vendor submitted commercial and technical proposals:

VENDOR	CLAIMED ACCURACY	COST PER DOCUMENT	INTEGRATION COMPLEXITY
A	96.5%	€0.08	Low
B	98.0%	€0.20	Moderate
C	98.5%	€0.22	Moderate

All vendors claimed ≥96% classification accuracy. Julia's team focused on recall for minority classes (e.g. medical invoices, police reports) and misclassification rate, especially in multi-page documents.

Step 3: Secure Evaluation and Fine-Tuning

Using tracebloc, Julia sets up a secure evaluation environment within the company's infrastructure. Vendors receive no raw data; their models are fine-tuned on-prem in a secured setup to ensure full compliance.

For each vendor, the company makes 400,000 labeled documents available for fine-tuning inside the secure environment and holds out the remaining 100,000 for benchmarking. All models are scored on the same metrics: accuracy, per-class recall, misclassification rate, and latency per document.
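The four metrics above can be computed from a model's predictions on the held-out set. The following is a minimal sketch in plain Python, not tracebloc's actual scoring code: the function name `evaluate`, the label strings, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def evaluate(y_true, y_pred, latencies_ms):
    """Compute the four benchmark metrics for one model on a held-out set."""
    if not (len(y_true) == len(y_pred) == len(latencies_ms)) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    total = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))

    # Per-class recall: of all documents that truly belong to a class,
    # the share the model labeled correctly.
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_class_recall = {label: hits[label] / n for label, n in support.items()}

    return {
        "accuracy": correct / total,
        "misclassification_rate": (total - correct) / total,
        "per_class_recall": per_class_recall,
        "mean_latency_ms": sum(latencies_ms) / total,
    }


# Toy data, invented for illustration only.
y_true = ["invoice", "invoice", "police_report", "medical_note", "police_report"]
y_pred = ["invoice", "medical_note", "police_report", "medical_note", "invoice"]
latencies_ms = [40, 50, 60, 50, 50]

metrics = evaluate(y_true, y_pred, latencies_ms)
print(metrics)
```

Per-class recall is reported separately because overall accuracy can hide weak performance on rare document types, which is exactly where the team focused its attention.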

Following initial baselines, vendors fine-tune their models and submit updated versions. Results show a meaningful gap between claimed and actual performance.

Step 4: Observed Results After Testing

VENDOR	CLAIMED ACCURACY	BASELINE ACCURACY	ACCURACY AFTER FINE-TUNING
A	96.5%	93.2%	94.1%
B	98.0%	94.8%	98.2%
C	98.5%	95.6%	98.6%

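Surprise outcome: after on-prem fine-tuning, Vendors B and C both slightly exceeded their own claims, and Vendor C outperformed all others. Vendor A stayed 2.4 percentage points below its claim and missed the 98% requirement.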
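The comparison between claimed and measured accuracy can be reproduced with a few lines of arithmetic. This is an illustrative sketch using the figures from the table above and the ≥98% requirement from Step 1; the names `results` and `summarize` are assumptions, and `Decimal` is used only to avoid floating-point noise in the percentage differences.

```python
from decimal import Decimal

REQUIRED_ACCURACY = Decimal("98.0")  # requirement from Step 1, in percent

# (claimed, baseline, after fine-tuning), in percent, from the results table
results = {
    "A": (Decimal("96.5"), Decimal("93.2"), Decimal("94.1")),
    "B": (Decimal("98.0"), Decimal("94.8"), Decimal("98.2")),
    "C": (Decimal("98.5"), Decimal("95.6"), Decimal("98.6")),
}


def summarize(claimed, baseline, tuned):
    """Return the gap to the vendor's claim and the gain from fine-tuning."""
    return {
        "gap_to_claim": tuned - claimed,           # percentage points
        "gain_from_fine_tuning": tuned - baseline,  # percentage points
        "meets_requirement": tuned >= REQUIRED_ACCURACY,
    }


summary = {vendor: summarize(*row) for vendor, row in results.items()}
for vendor, s in summary.items():
    print(vendor, s["gap_to_claim"], s["gain_from_fine_tuning"], s["meets_requirement"])
```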

Step 5: Business Case – Cost of Misclassification

Assumptions:

Annual document volume: 5 million
Manual classification cost: €0.10 per document
Misclassification rate baseline: 7% → 350,000 errors/year
Cost per error (e.g. wrong triage, SLA breach, fraud risk): €15
AI usage: full-scale, automated classification with human override on edge cases

STRATEGY	ACCURACY	MISCLASSIFIED DOCS	ERROR COST	AI COST	TOTAL COST
Manual Only	~93%	350,000	€5,250,000	€0	€5,250,000
Vendor A	94.1%	244,500	€3,667,500	€350,000	€4,017,500
Vendor B	98.2%	85,000	€1,275,000	€400,000	€1,675,000
Vendor C ✅	98.6%	80,000	€1,200,000	€400,000	€1,600,000
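The cost comparison follows a simple rule: total cost is error cost plus AI cost, and savings are measured against the manual-only total. The sketch below derives the manual baseline from the stated assumptions (5 million documents, 7% error rate, €15 per error) and takes the error-cost and AI-cost figures for the vendors from the table as given, without recomputing them from the per-document prices in Step 2. Variable names are assumptions made for illustration.

```python
ANNUAL_VOLUME = 5_000_000
BASELINE_ERROR_RATE_PERCENT = 7
COST_PER_ERROR_EUR = 15

# Manual baseline derived from the assumptions: 7% of 5 million documents.
baseline_errors = ANNUAL_VOLUME * BASELINE_ERROR_RATE_PERCENT // 100
baseline_error_cost = baseline_errors * COST_PER_ERROR_EUR

# (error cost, AI cost) in EUR per year; vendor rows copied from the table.
strategies = {
    "Manual Only": (baseline_error_cost, 0),
    "Vendor A": (3_667_500, 350_000),
    "Vendor B": (1_275_000, 400_000),
    "Vendor C": (1_200_000, 400_000),
}

totals = {name: err + ai for name, (err, ai) in strategies.items()}
savings = {name: totals["Manual Only"] - total for name, total in totals.items()}
best = min(totals, key=totals.get)

for name in strategies:
    print(f"{name}: total €{totals[name]:,}, savings €{savings[name]:,}")
print("Lowest total cost:", best)
```

With these inputs, the lowest total cost belongs to Vendor C, and its savings against the manual baseline come to €3,650,000.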

Step 6: Decision – Human + AI Hybrid Strategy

After secure benchmarking and integration testing, Vendor C’s fine-tuned model reached 98.6% accuracy, significantly reducing misclassification. The selected hybrid setup includes automatic classification for all documents, with human review for critical document types and flagged uncertainties.
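A hybrid setup of this kind amounts to a routing rule applied to each prediction. The sketch below shows one possible form, assuming the model returns a label and a confidence score. The set of critical types, the 0.90 threshold, and all names (`Prediction`, `route`, `CRITICAL_TYPES`) are hypothetical; the playbook does not specify them.

```python
from dataclasses import dataclass

# Hypothetical values: the playbook does not state which document types are
# critical or which confidence threshold flags a prediction as uncertain.
CRITICAL_TYPES = frozenset({"police_report", "medical_note"})
CONFIDENCE_THRESHOLD = 0.90


@dataclass(frozen=True)
class Prediction:
    document_id: str
    label: str
    confidence: float


def route(prediction: Prediction) -> str:
    """Decide whether a classified document needs human review."""
    if prediction.label in CRITICAL_TYPES:
        return "human_review"  # critical types are always reviewed
    if prediction.confidence < CONFIDENCE_THRESHOLD:
        return "human_review"  # flagged as uncertain
    return "auto_accept"


examples = [
    Prediction("doc-1", "invoice", 0.97),
    Prediction("doc-2", "invoice", 0.62),
    Prediction("doc-3", "police_report", 0.99),
]
decisions = {p.document_id: route(p) for p in examples}
print(decisions)
```

Every document is still classified automatically; the rule only decides which results a claims handler sees before they take effect.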

Benefits:

Over 75% reduction in classification errors compared to the manual baseline
~€3.65M annual savings compared to manual-only workflows
Seamless integration into the company's existing claims system
End-to-end audit trail and full GDPR compliance