
29.07.27
Training larger models has become the default answer to every accuracy problem. It works — until the bill arrives. GPU costs compound at scale, edge hardware cannot run a 100M-parameter model in real time, and sustainability teams are starting to ask how many grams of CO2 each inference generates. This workspace is built around the opposite instinct: bring your model, make it as accurate and as efficient as possible, and see exactly where it stands. The leaderboard scores both — accuracy and efficiency, side by side, on real data that never leaves the infrastructure.
The ML community has spent a decade optimising for accuracy. Benchmark scores improve, parameter counts grow, and compute requirements expand accordingly. What that trajectory misses is everything that happens after training: the cost of running inference at scale, the hardware constraints of edge deployment, the carbon budget of a production system processing millions of requests per day.
A model that achieves 95% accuracy at 50 GFLOPS per inference is not equivalent to one that achieves 93% at 3 GFLOPS — not when it runs on a factory-floor camera unit with no GPU, a medical device with a fixed power envelope, or inside a company that has committed to reducing its AI carbon footprint. The accuracy gap is small. The deployment gap is the difference between viable and impossible.
Frugal AI is the practice of building models that are accurate enough and efficient enough for the environment where they actually run. It is not a compromise on quality. It is precision about what quality means in context.
tracebloc brings builders together around a single premise: accuracy and efficiency should be measured together. You submit model code to the workspace. The code trains and evaluates against the benchmark dataset inside the infrastructure — the data never moves. What comes back is a composite score that weights task accuracy alongside FLOPS and gCO2e per inference.
The AI model leaderboard does not just show who built the most accurate model. It shows who built the best model relative to the compute budget it consumes. A lean model scoring 91% accuracy at 2 GFLOPS may rank above a heavier model at 94% — and both are visible on the leaderboard, so every builder can see exactly where the tradeoff sits across the full field of submissions.
This is a federated learning application of a broader principle: you do not need to see the data to build a good model, and you do not need unlimited compute to build an accurate one.
The benchmark dataset is a structured classification task representative of the types of problems deployed in real enterprise environments: labelled data collected under real-world conditions with natural variation across samples and classes. It is approachable enough that builders can reach a strong baseline quickly, while remaining rich enough that efficiency optimisations — quantisation, pruning, architecture selection — have measurable impact on the leaderboard score.
The dataset lives inside the tracebloc workspace. Builders interact with it through the training environment and explore its structure through the EDA tab, which provides class distributions, sample counts, and condition breakdowns computed directly on the data. No download is required or available.
This mirrors the constraint that applies in production: the model must perform where the data lives, not on a local copy. Spending time in the EDA tab before writing a single line of training code is the fastest path to understanding what the data will reward.
The scoring rubric is explicit about the tradeoff it measures.
Task accuracy is the primary dimension — the proportion of held-out samples correctly classified. This is the floor: a model that sacrifices accuracy below the viable threshold is not a useful submission regardless of how few FLOPS it uses.
FLOPS at inference time measures the computational cost of processing a single sample through the submitted model. Lower is better, all else equal. This is the direct proxy for hardware compatibility, latency, and operating cost at scale.
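For intuition on what this dimension measures, per-inference compute can be estimated layer by layer. The sketch below counts FLOPs for stacked dense layers with made-up sizes — an illustration of the accounting, not the benchmark's actual measurement code or architecture.

```python
# Rough per-inference FLOP count for a feed-forward classifier.
# Layer sizes are illustrative, not the benchmark's architecture.

def dense_flops(n_in: int, n_out: int) -> int:
    """A dense layer costs ~2 * n_in * n_out FLOPs per sample
    (one multiply and one add per weight)."""
    return 2 * n_in * n_out

def model_flops(layer_sizes: list[int]) -> int:
    """Total FLOPs for one forward pass through stacked dense layers."""
    return sum(dense_flops(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

# Example: 1024-dim input -> 256 -> 64 -> 10 classes
print(model_flops([1024, 256, 64, 10]))  # 558336 FLOPs per inference
```

Real measurement tools also count convolutions, attention, and activation functions, but the principle is the same: the cost is fixed by the architecture before a single weight is trained.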
gCO2e per inference is the estimated carbon equivalent of a single forward pass, computed from the FLOPS measurement using a standard energy conversion factor. This metric makes the environmental cost of model design visible and directly comparable across submissions — a figure that rarely appears on any other AI benchmark leaderboard.
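The FLOPS-to-carbon conversion can be sketched as follows. The hardware-efficiency and grid-intensity figures here are illustrative assumptions chosen for the sketch; the leaderboard's actual conversion factors are not specified in this document.

```python
def gco2e_per_inference(
    flops: float,
    flops_per_joule: float = 1e10,      # assumed accelerator efficiency (10 GFLOP per joule)
    grid_gco2e_per_kwh: float = 400.0,  # assumed grid carbon intensity (illustrative)
) -> float:
    """Estimate grams of CO2-equivalent for one forward pass."""
    joules = flops / flops_per_joule
    kwh = joules / 3.6e6  # 1 kWh = 3.6 MJ
    return kwh * grid_gco2e_per_kwh

# The 3 GFLOPS vs 50 GFLOPS comparison from the text: per-inference carbon
# scales linearly with FLOPS under this model, so the lean model emits
# 3/50ths of the heavy model's carbon per request.
lean = gco2e_per_inference(3e9)
heavy = gco2e_per_inference(50e9)
```

At millions of requests per day, that linear factor is what separates a rounding error from a line item in a carbon budget.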
The composite score is computed as:
Score = 0.70 × Accuracy + 0.20 × FLOPS_Efficiency + 0.10 × gCO2e_Efficiency

where FLOPS_Efficiency and gCO2e_Efficiency are normalised inverse scores relative to the field of submitted models. A model at the median FLOPS for the workspace scores 0.5 on that dimension; a model at the 10th percentile scores close to 1.0.
The weighting is deliberate. Accuracy matters most — but compute cost and carbon cost are treated as real constraints, not afterthoughts. Builders who optimise for accuracy alone will find themselves mid-table behind leaner models that concede little on the accuracy dimension.
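Under those definitions, the composite score can be sketched in a few lines. The percentile-rank normalisation below is one plausible reading of "normalised inverse scores relative to the field" — it gives roughly 0.5 at the median and approaches 1.0 at the low-FLOPS end — but the workspace's exact normalisation may differ.

```python
# A minimal sketch of the composite score; the normalisation is an
# assumption, not the workspace's published implementation.

def inverse_percentile(value: float, field: list[float]) -> float:
    """Fraction of the field with a higher (worse) cost: lower cost -> higher score."""
    return sum(v > value for v in field) / len(field)

def composite_score(
    accuracy: float,
    flops: float,
    gco2e: float,
    field_flops: list[float],
    field_gco2e: list[float],
) -> float:
    return (
        0.70 * accuracy
        + 0.20 * inverse_percentile(flops, field_flops)
        + 0.10 * inverse_percentile(gco2e, field_gco2e)
    )
```

With this weighting, a one-point accuracy concession (0.007 on the score) is recovered by moving from the median to roughly the 30th percentile on FLOPS alone — which is why accuracy-only optimisers drift mid-table.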
The frugal AI benchmark rewards model engineering decisions that rarely surface in accuracy-only benchmarks:
Architecture selection. Choosing a backbone designed for efficiency — lightweight CNNs, compact transformers, MobileNet-class architectures — rather than defaulting to the largest available pretrained model. The FLOPS gap between architectures is often larger than the accuracy gap.
Quantisation. Post-training quantisation (INT8, FP16) reduces inference cost with limited accuracy degradation when applied carefully. The leaderboard makes this tradeoff visible across all submissions simultaneously.
Pruning. Structured pruning removes redundant capacity from overparameterised models. On a narrow vertical task, a pruned specialist consistently outperforms a dense generalist at a fraction of the compute.
Knowledge distillation. Training a smaller student model against a larger teacher recovers significant accuracy while maintaining the efficiency profile of the student architecture. Distilled models consistently appear near the top of efficiency-weighted leaderboards.
Task-specific fine-tuning. General-purpose pretrained models carry representational capacity that a narrow vertical task does not need. Fine-tuning a compact model on domain-specific data frequently outperforms a large general model — at a fraction of the inference cost.
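As one concrete example from the list above, post-training INT8 quantisation can be sketched in a few lines. The symmetric per-tensor scheme below is a simplified illustration of the idea, not the workspace's evaluation pipeline; production toolchains also calibrate activation ranges and fuse operations.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantisation onto the signed INT8 range [-127, 127]."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for an accuracy check."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)  # stand-in weight tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantisation step
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

The weights now occupy a quarter of their FP32 footprint, and on hardware with INT8 kernels the forward pass is correspondingly cheaper — the kind of gain the FLOPS and gCO2e dimensions reward directly.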
Builders submit code to the tracebloc workspace via the training and fine-tuning tab. The submitted code executes inside the workspace on the benchmark dataset. Training, inference, FLOPS measurement, gCO2e calculation, and scoring all happen within the infrastructure. No data is transferred outside it.
After evaluation completes, results are published automatically to the leaderboard. You can resubmit updated code to improve your position at any time. Submission format requirements, environment specifications, and framework compatibility are documented in the training tab.
The leaderboard is live and updates with each completed submission. Current rankings, accuracy scores, FLOPS measurements, gCO2e values, and composite efficiency scores are available on the leaderboard tab. The leaderboard displays the full accuracy-efficiency frontier across all submissions — so every builder can see not just their position but where their model sits on the tradeoff curve relative to the field.
The gCO2e estimates displayed on the leaderboard are computed from FLOPS measurements using a standard energy-to-carbon conversion methodology. These figures are indicative of relative carbon cost across model architectures rather than precise measurements of actual energy consumption, which varies by hardware, data centre location, and energy mix. They are intended to make efficiency tradeoffs visible and comparable within this benchmarking context and should not be used as certified carbon figures for regulatory or ESG reporting purposes.