
Drone Aerial Object Detection: 10 Vehicle Classes, Smart City AI

Participants: 10
End Date: 31.12.26
Dataset: docbeb26
Resources: 2 CPU (8.59 GB) | 1 GPU (22.49 GB)
Compute: 0 / 100.00 PF
Submits: 0/5


Overview

About this use case: Five municipal transport authorities each operate drone fleets for real-time traffic monitoring — and each has a detection model that plateaus against its own city's footage, while GDPR and national surveillance law make cross-border footage sharing impossible. tracebloc federates training across all five city networks so every authority's model improves from the combined signal, without a single frame of aerial surveillance leaving the jurisdiction that captured it. Explore the data, submit your own model, and see how your approach compares.

Problem

Municipal transport authorities are deploying drone fleets for real-time traffic monitoring — counting vehicles, detecting congestion, and supporting enforcement across urban networks. The challenge: aerial object detection models trained on one city's footage generalise poorly to another. Different road layouts, traffic densities, and vehicle mixes mean a model that performs well above Rotterdam's ring road struggles above Lyon's boulevard périphérique. The term for this problem is domain shift, and in smart city surveillance it is constant. Markus Dreyer, Head of Smart Mobility at a German metropolitan transport authority, has a model achieving 89% recall at 90% precision across ten vehicle classes. He knows neighbouring authorities are facing the same ceiling — and that the combined signal from five city networks would push all five models past it.

Solution

Five municipal transport authorities collectively contribute models to a shared tracebloc workspace, seeded with 6,471 anonymised aerial frames and 343,204 vehicle annotations from the host authority's network. Each authority submits their detection model to the workspace. Inside tracebloc's containerised training environment, each model trains on the shared footage library — fine-tuning its weights to the vehicle class distribution, object density, and small-object challenge of this urban environment — without any raw footage leaving the host authority's infrastructure. tracebloc handles orchestration, scores each adapted model against the holdout set, and publishes results to a live leaderboard ranked by recall at 90% precision. This is a federated learning application of cross-city model improvement: the drone footage stays local, the detection performance improves across all five networks.
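The orchestration pattern described above can be sketched in a few lines. This is a hypothetical illustration only — tracebloc's actual API is not shown on this page, and the names `run_round`, `train_fn`, and `score_fn` are invented for the sketch. The point is the data flow: model weights move to the footage, and only a scalar ranking metric comes back out.

```python
def run_round(authorities, train_fn, score_fn, holdout):
    """One evaluation round: every submitted model fine-tunes next to the
    data, is scored on the shared holdout, and only the metric leaves.

    authorities: dict mapping authority name -> submitted model
    train_fn:    runs containerised fine-tuning inside the host enclave
    score_fn:    evaluates an adapted model (e.g. recall at 90% precision)
    """
    leaderboard = []
    for name, model in authorities.items():
        adapted = train_fn(model)            # weights update; footage never exported
        score = score_fn(adapted, holdout)   # single scalar ranking metric
        leaderboard.append((name, score))
    # Rank descending by the metric, as on the published leaderboard
    return sorted(leaderboard, key=lambda entry: entry[1], reverse=True)
```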

Outcome

In this example, two contributing authorities improved their recall by more than four percentage points after their models trained on the shared footage library — gains that translated directly into fewer missed congestion events and lower false alarm rates for enforcement teams. The tracebloc workspace stays active as participating authorities rotate new footage batches into the training pool, and as the vehicle class mix evolves with new transport modes entering city networks. The leaderboard records which detection approaches hold up across urban environments.

The Operational Challenge

Markus's team operates a fleet of eleven drones across a metropolitan network covering 840 km² of urban road infrastructure. The drones run continuous monitoring shifts during peak hours and on-demand during events and incidents. Every frame the fleet captures is processed by an on-board detection pipeline that must classify ten object categories — from the common (car, pedestrian) to the operationally critical (bus, truck) and the statistically rare (tricycle, awning-tricycle) — within the latency constraint of real-time monitoring.

The detection pipeline feeds three downstream systems: a congestion dashboard reviewed by traffic operations staff every fifteen minutes, an enforcement flagging system that routes suspected violations to human reviewers, and an incident response integration that alerts the city's emergency coordination centre when the scene density or composition crosses a defined threshold. Errors in the detection pipeline do not stay in the pipeline. A missed truck blocks a downstream count. A false alarm on the enforcement system triggers a review workflow that takes 40 minutes of staff time to close. A missed pedestrian cluster in the incident feed delays an emergency response decision.

The precision threshold for the enforcement feed is fixed at 90%. Everything below that threshold generates false alarms at a rate that operations management will not accept. Within that constraint, recall is the metric that matters — every percentage point of recall below the current baseline represents real missed detections on a live city network.

The cross-city training problem is structural. Smart city surveillance footage is subject to GDPR and national data protection law in every participating country. Surveillance imagery cannot leave the jurisdiction that captured it. Each authority's legal team has been explicit: footage does not travel. But the detection models those authorities have built are plateauing, and the five authorities — in different countries, with different traffic profiles — collectively hold the diverse training signal that none of them has access to individually. The challenge is not data volume. Each authority has footage. The challenge is diversity: a model trained only on one city's intersection layouts and vehicle mixes will always underperform on another city's.

UAV object detection at altitude introduces a layer of technical difficulty that ground-level camera systems do not face. At 80 to 120 metres, a car occupies a bounding box of roughly 30×20 pixels on a 512×512 frame. Pedestrians can be 8×20 pixels. Tricycles and awning-tricycles — categories that matter for enforcement and urban planning but appear infrequently in any single city's footage — are frequently in the sub-100 pixel² range. The scene density compounds this: the shared evaluation dataset averages 53 annotated objects per frame, with a median of 42. Detecting a pedestrian occluded behind a van in a frame that also contains 40 other objects requires models trained on exactly this kind of density — not on single-object benchmark imagery.
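The sub-100 px² threshold can be checked directly against the dataset's Pascal VOC annotations. A minimal sketch, assuming the standard VOC XML layout (`object`/`name`/`bndbox` elements); the helper name `box_stats` is ours:

```python
import xml.etree.ElementTree as ET

def box_stats(voc_xml: str, small_thresh: int = 100):
    """Parse a Pascal VOC annotation string and report per-object box areas.

    Objects under `small_thresh` px² are flagged as 'very small' — at
    80 to 120 metres of altitude these are the hardest detections in the frame.
    """
    root = ET.fromstring(voc_xml)
    stats = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        width = int(bb.findtext("xmax")) - int(bb.findtext("xmin"))
        height = int(bb.findtext("ymax")) - int(bb.findtext("ymin"))
        area = width * height
        stats.append({
            "class": obj.findtext("name"),
            "area": area,
            "very_small": area < small_thresh,
        })
    return stats
```

Run over all 6,471 annotation files, a loop like this is how the 44.9% very-small-object share in the dataset table is derived.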

Inference latency is a hard constraint. The real-time monitoring use case requires results within 20 ms per frame. A model that achieves 95% recall at 500 ms is not a real-time monitoring model. It is a batch analysis tool with the wrong name.
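Checking a candidate model against the 20 ms budget is straightforward to instrument. A minimal sketch — `infer` stands in for any per-frame detection callable, and gating on the 95th percentile rather than the mean is our assumption about how a real-time budget would be enforced:

```python
import statistics
import time

def latency_profile(infer, frames, budget_ms=20.0):
    """Time per-frame inference and check it against the real-time budget.

    Gates on the 95th-percentile latency, since a real-time pipeline
    cares about tail frames, not just the average.
    """
    times_ms = []
    for frame in frames:
        start = time.perf_counter()
        infer(frame)
        times_ms.append((time.perf_counter() - start) * 1000.0)
    p95 = statistics.quantiles(times_ms, n=20)[-1]  # 19th of 19 cut points = P95
    return {
        "mean_ms": statistics.fmean(times_ms),
        "p95_ms": p95,
        "within_budget": p95 <= budget_ms,
    }
```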

Stakeholders

  • Markus Dreyer, Head of Smart Mobility: Owns detection performance across the authority's drone network. KPIs: recall at 90% precision across all ten vehicle classes, inference latency under 20 ms, false alarm rate on the enforcement feed
  • Traffic Operations Manager: Reviews the congestion dashboard and relies on vehicle counts being accurate; missed detections translate directly into wrong green-phase timing recommendations
  • Director of Urban Infrastructure: Responsible for long-term planning decisions that use drone-derived traffic data; systematic misclassification of vehicle types distorts modal split analysis
  • Legal / Data Protection Officer: GDPR and national surveillance data law prohibit footage from leaving the authority's jurisdiction; any cross-city collaboration must operate without data transfer
  • Head of Enforcement Operations: False alarm rate on the enforcement feed directly determines team workload; a 1% increase in false positives generates hundreds of additional manual reviews per week

The Underlying Dataset

The evaluation footage library contains 6,471 aerial frames at 512×512 pixels, with a total of 343,204 annotated vehicle and pedestrian objects across ten detection classes. Full dataset statistics, class distributions, bounding box size analysis, and spatial density patterns are available in the Exploratory Data Analysis tab.

This dataset is augmented. It was constructed to match the statistical structure of real-world drone traffic monitoring footage — the object density per frame, the class distribution, the proportion of very small objects, and the annotation coverage — without containing any identifiable location data, vehicle registrations, or surveillance imagery from real city networks.

Property | Value
Total images | 6,471
Image dimensions | 512×512 px
Annotation format | Pascal VOC (XML bounding boxes)
Total annotated objects | 343,204
Average objects per frame | 53 (median: 42)
Very small objects (<100 px²) | 153,960 (44.9%)
Annotation coverage | 100% — all frames annotated
Evaluation metric | Recall at 90% precision, IoU ≥ 0.50
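The evaluation metric, recall at 90% precision with IoU ≥ 0.50, can be computed from scored detections without any framework dependency. A minimal sketch (function names are ours): a detection counts as a true positive when it matches a ground-truth box at IoU ≥ 0.50, and the confidence threshold is swept to find the best recall that still keeps precision at or above the floor.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def recall_at_precision(detections, n_ground_truth, precision_floor=0.90):
    """detections: (confidence, is_true_positive) pairs over the holdout set.

    Sweeps the confidence threshold from strict to lenient and returns the
    best recall reachable while precision stays at or above the floor.
    """
    tp = fp = 0
    best_recall = 0.0
    for conf, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        if tp / (tp + fp) >= precision_floor:
            best_recall = max(best_recall, tp / n_ground_truth)
    return best_recall
```

Fixing precision first and maximising recall second mirrors the operational constraint: the enforcement feed's false alarm budget is non-negotiable, and recall is optimised within it.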

Class distribution:

Class | Annotations | Share
Car | 144,866 | 42.2%
Pedestrian | 79,337 | 23.1%
Motor | 29,647 | 8.6%
People | 27,059 | 7.9%
Van | 24,956 | 7.3%
Truck | 12,875 | 3.8%
Bicycle | 10,480 | 3.1%
Bus | 5,926 | 1.7%
Tricycle | 4,812 | 1.4%
Awning-tricycle | 3,246 | 0.9%

The class imbalance reflects real-world urban traffic composition. Car and pedestrian dominate the annotation set; tricycle and awning-tricycle together account for fewer than 2.3% of objects. A model that ignores these rare classes entirely can still achieve high overall recall — which is why per-class recall on the rarest four classes is tracked separately on the leaderboard alongside overall performance.
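Tracking rare classes separately is a small computation once per-class match counts exist. A minimal sketch, assuming the leaderboard's rare-class set is the four rarest categories named above (the helper name and input shape are ours):

```python
RARE_CLASSES = {"tricycle", "awning-tricycle", "bus", "bicycle"}

def per_class_recall(per_class_counts):
    """per_class_counts: dict of class -> (true_positives, ground_truth_count).

    Returns per-class recall plus the rare-class average that the
    leaderboard tracks alongside overall recall, so gains concentrated
    on cars and pedestrians cannot hide rare-class degradation.
    """
    recalls = {cls: tp / gt for cls, (tp, gt) in per_class_counts.items() if gt}
    rare = [r for cls, r in recalls.items() if cls in RARE_CLASSES]
    return recalls, (sum(rare) / len(rare) if rare else None)
```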

How Evaluation Works

Each authority submitted their aerial object detection model to the tracebloc workspace. The evaluation ran in two phases.

Phase 1 — Baseline performance. Each authority's model was benchmarked as-submitted on the evaluation holdout set, with no exposure to the shared footage library. This establishes the true cross-city generalisation baseline: what each city's model actually delivers when deployed above a different urban environment.

Phase 2 — Fine-tuning. Authorities were given access to the training environment inside the tracebloc workspace. Each authority transferred their detection model into tracebloc and ran training on the shared footage library — fine-tuning the model weights to the object density, class distribution, and small-object characteristics of this urban environment. After training, the adapted model was evaluated automatically against the holdout set. No raw footage was exported. Each authority received only their own results; no authority had visibility into another's training runs or scores before the leaderboard published.

Each contributor received:

  • Training access: The training portion of the shared footage library (5,177 frames, the 80% side of an 80/20 train/holdout split) for model fine-tuning inside the workspace
  • Evaluation environment: Sandboxed execution — adapted models run against the holdout set, no footage export path available
  • Metrics tracked: Overall recall at 90% precision (IoU ≥ 0.50), per-class recall across all ten vehicle categories, rare class average recall (tricycle, awning-tricycle, bus, bicycle), inference latency per frame (ms)
  • Key constraint: Models exceeding 20 ms inference latency are flagged on the leaderboard; the real-time monitoring use case cannot accept higher latency regardless of recall performance
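For an 80/20 split shared across five independent training environments, the partition has to be reproducible. One common approach, sketched here as an assumption rather than tracebloc's documented mechanism, is to key the assignment on a hash of the frame identifier so every containerised run derives the identical split with no coordination:

```python
import hashlib

def assign_split(frame_ids, train_frac=0.8):
    """Deterministic train/holdout split keyed on a hash of each frame id.

    Hashing (rather than random shuffling) means the same frame always
    lands in the same partition, across runs and across authorities.
    """
    split = {"train": [], "holdout": []}
    for fid in frame_ids:
        bucket = int(hashlib.sha256(fid.encode()).hexdigest(), 16) % 100
        split["train" if bucket < train_frac * 100 else "holdout"].append(fid)
    return split
```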

Results

→ View the full model leaderboard — complete authority rankings, per-class recall breakdown, and latency measurements across all submissions.

Authority | Baseline Recall | After Fine-tuning | Rare Class Recall | Latency
Authority A (YOLOv9) | 87.4% | 91.3% | 73.2% | 16 ms
Authority B (RT-DETR) ✅ | 89.1% | 94.5% | 80.6% | 18 ms
Authority C (YOLOv8) | 85.6% | 89.0% | 69.8% | 12 ms
Authority D (DINOv2) | 88.3% | 93.8% | 58.4% | 19 ms
Authority E (YOLOv10) | 86.9% | 90.7% | 71.5% | 14 ms

What the numbers reveal:

Authority B's RT-DETR model delivered the strongest post-fine-tuning recall at 94.5% alongside the highest rare class recall at 80.6%. Starting from the strongest baseline in the group at 89.1%, it gained 5.4 percentage points after training on the shared footage library — a gain that held across all ten vehicle classes, including the operationally critical low-frequency categories.

Authority D achieved 93.8% overall recall after fine-tuning — the second-highest figure. Its rare class recall of 58.4% tells a different story. The model improved its aggregate performance by concentrating gains on dominant classes (car at 42.2% of annotations, pedestrian at 23.1%) while its tricycle and awning-tricycle detection degraded. For a traffic enforcement use case where tricycle classification matters, Authority D's overall number masks an operational problem.

Authority C achieved the fastest inference at 12 ms but ended the fine-tuning phase at 89.0% overall recall — below the group average — and 69.8% rare class recall. Speed without recall is not a solution for a monitoring system where missed detections generate downstream errors in every connected operational system.

Business Impact

Illustrative assumptions: 1,000 operational monitoring shifts per year / 100 traffic decisions per shift influenced by detection output / €800 average cost per missed detection event (misallocated traffic management resource, missed enforcement action, delayed incident response)

Authority | Recall After Fine-tuning | Missed Detection Rate | Incidents/Year | Detection Cost | Workspace Cost (p.a.) | Total Annual Cost
Internal baseline | 89.0% | 11.0% | 11,000 | €8,800,000 | — | €8,800,000
Authority A | 91.3% | 8.7% | 8,700 | €6,960,000 | €120,000 | €7,080,000
Authority B ✅ | 94.5% | 5.5% | 5,500 | €4,400,000 | €120,000 | €4,520,000
Authority C | 89.0% | 11.0% | 11,000 | €8,800,000 | €120,000 | €8,920,000
Authority D | 93.8% | 6.2% | 6,200 | €4,960,000 | €120,000 | €5,080,000

Authority B reduces total annual operational cost from €8,800,000 (internal baseline) to €4,520,000 — a saving of €4,280,000 per year — while staying within the 20 ms latency constraint that real-time monitoring requires. Authority D's headline recall of 93.8% produces €560,000 in additional cost compared to Authority B once the rare class detection gap is factored into enforcement and incident response decisions.
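The cost figures above follow directly from the three illustrative assumptions. A small sketch that reproduces them (the function name is ours):

```python
def annual_cost(recall, shifts=1_000, decisions_per_shift=100,
                cost_per_miss=800, workspace_cost=120_000):
    """Annual operational cost under the stated illustrative assumptions:
    1,000 shifts x 100 influenced decisions, €800 per missed detection,
    plus the workspace fee where one applies."""
    decisions = shifts * decisions_per_shift
    missed = round(decisions * (1 - recall))
    return missed * cost_per_miss + workspace_cost
```

For example, `annual_cost(0.945)` reproduces Authority B's €4,520,000 total, and the internal baseline is `annual_cost(0.89, workspace_cost=0)`.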

Decision

Markus selects Authority B's RT-DETR model configuration as the reference architecture for the joint network's next fine-tuning cycle. The model is deployed in shadow mode alongside the existing detection pipeline across three high-density monitoring zones for eight weeks, validating that 94.5% recall holds at the full frame rate before the pipeline switch.

The tracebloc workspace stays active. As each of the five participating authorities completes new footage collection cycles, they contribute updated models for evaluation — and the shared footage library grows with anonymised batches from additional city environments. The leaderboard evolves from a one-time evaluation record into ongoing governance infrastructure for the multi-authority smart city network: which detection approaches are improving, which are degrading as traffic patterns change, and which are ready to be recommended across city boundaries.

Explore this use case further:

  • View the model leaderboard — full authority rankings, per-class recall, latency measurements
  • Explore the dataset — class distribution, bounding box size analysis, object density
  • Start training — submit your own aerial object detection model to this evaluation

Related use cases: See how the same evaluation approach applies to AI weld inspection in automotive manufacturing and satellite crop classification and yield forecasting. For a broader view of what federated learning applications look like across industries, see our federated learning applications guide.

Deploy your workspace or schedule a call.

Disclaimer

Disclaimer: The dataset used in this use case is augmented — designed to closely reflect the statistical structure of real-world drone traffic monitoring footage, including object density per frame, class distribution across ten vehicle and pedestrian categories, and the proportion of very small bounding boxes, without containing any identifiable location data, vehicle registration information, or surveillance imagery from real city networks. The persona, authority configurations, claimed performance figures, business impact assumptions, and collaboration scenario are illustrative and based on patterns observed across smart city and urban transport deployments. They do not represent any specific municipality, authority, or contractual outcome.