Across 3.83 million line items and €131.9M of gross revenue scored on a strict out-of-time holdout, the model produced:
€33.98M in predicted returns against €35.00M actual, a 2.9% aggregate gap.
A pooled ROC-AUC of 0.75 and F1 of 0.50 across the full benchmark, which places the pooled result in the prioritization band; the strongest merchant (ROC-AUC 0.86, F1 0.72) reaches production-decisioning quality.
A pooled top decile with a 65% actual return rate versus 6.5% in the bottom decile, a 10× separation between the items the model flags as risky and the items it flags as safe.
The model is accurate enough at the aggregate to plug directly into finance reporting and discriminating enough at the item level to drive operational decisions one line at a time.
What we tested
For every line item in every order placed from January through March 2026, the model produced a probability between 0 and 1 that the customer would return that item. We then waited for the real outcomes and graded the model against them, on data the model had never seen.
Five live merchants of varying size, vertical, and return-rate profile were included to cover the spread of conditions the model encounters in production.
Store-aggregate calibration
Multiplying each item's gross revenue by its predicted return probability and summing over a merchant's items gives the model's implied store-level return forecast, the number CFOs ask for. The gap column below is (predicted − actual) / actual.
Merchant | Gross revenue | Actual returns | Predicted returns | Gap |
Shop A | €17.0M | €6,039,265 | €5,664,600 | −6.2% |
Shop B | €49.6M | €14,608,633 | €13,046,191 | −10.7% |
Shop C | €48.3M | €10,095,599 | €11,328,712 | +12.2% |
Shop D | €12.3M | €3,433,109 | €3,229,999 | −5.9% |
Shop E | €4.7M | €823,581 | €708,134 | −14.0% |
All | €131.9M | €35,000,187 | €33,977,636 | −2.9% |
Individual merchants land within roughly ±14% of their actual return total. Pooled across the full benchmark, the model is within 2.9% of the actual €35M figure, in part because Shop C's over-forecast offsets the under-forecasts at the other four merchants. This is a finance-grade aggregate forecast built directly from per-item predictions, with no calibration step applied.
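The store-level calculation can be sketched in a few lines. This is a minimal illustration rather than the production pipeline: the function names `expected_returns` and `gap_pct` and the three toy line items are hypothetical, while the euro totals are copied from the table above.

```python
def expected_returns(items):
    """Sum of gross revenue x predicted return probability over line items.

    `items` is an iterable of (gross_revenue_eur, predicted_probability) pairs.
    """
    return sum(revenue * prob for revenue, prob in items)


def gap_pct(predicted, actual):
    """Signed forecast gap as a percentage of the actual return total."""
    return 100.0 * (predicted - actual) / actual


# Toy line items (hypothetical values, for illustration only).
toy_items = [(100.0, 0.10), (50.0, 0.60), (80.0, 0.25)]
toy_forecast = expected_returns(toy_items)

# Published totals from the table above: (actual, predicted) in euros.
published = {
    "Shop A": (6_039_265, 5_664_600),
    "Shop B": (14_608_633, 13_046_191),
    "Shop C": (10_095_599, 11_328_712),
    "Shop D": (3_433_109, 3_229_999),
    "Shop E": (823_581, 708_134),
    "All": (35_000_187, 33_977_636),
}

gaps = {
    name: round(gap_pct(predicted, actual), 1)
    for name, (actual, predicted) in published.items()
}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f}%")
```

A negative gap means the model under-forecast returns for that merchant; a positive gap means it over-forecast.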
Per-merchant results
Merchant | Items scored | Actual return rate | F1 | ROC-AUC | Precision | Recall | Top-decile | Bottom-decile |
Shop A | 362,000 | 35.4% | 0.72 | 0.86 | 0.75 | 0.69 | 99.8% | 6.6% |
Shop B | 1,100,325 | 29.4% | 0.52 | 0.74 | 0.59 | 0.47 | 70.4% | 12.7% |
Shop C | 2,158,500 | 20.7% | 0.44 | 0.73 | 0.44 | 0.44 | 51.6% | 3.4% |
Shop D | 188,097 | 23.4% | 0.42 | 0.68 | 0.43 | 0.42 | 51.0% | 11.7% |
Shop E | 22,933 | 16.2% | 0.20 | 0.58 | 0.24 | 0.17 | 24.1% | 11.3% |
Pooled | 3,831,855 | 24.7% | 0.50 | 0.75 | 0.52 | 0.48 | 65.2% | 6.5% |
Reading the metrics
F1 combines precision and recall into one score (their harmonic mean). Production fraud, churn, and credit-risk models typically score 0.70–0.85. Shop A (0.72) falls inside that range; Shops B, C, and D score between 0.42 and 0.52, and Shop E scores 0.20.
ROC-AUC is the probability that the model scores a randomly chosen returned item above a randomly chosen kept item. 0.50 = random; 0.80–0.90 = production decisioning quality. The pooled 0.75 sits just below that range, in the 0.65–0.80 prioritization band; the strongest merchant reaches 0.86.
Top / Bottom decile is the actual return rate in the highest-scored and lowest-scored tenth of items, after sorting by predicted probability and slicing into 10 equal buckets. A wide gap means the score genuinely separates returned items from kept items.
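The ranking metrics described above can be sketched in plain Python. This is an illustrative version, not the benchmark's actual scoring code: the function names and the 20-item toy sample are assumptions, and the pairwise ROC-AUC loop is only practical for small samples.

```python
def roc_auc(labels, scores):
    """Probability that a random returned item outscores a random kept item.

    Brute-force pairwise version; tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def decile_return_rates(labels, scores):
    """Actual return rate per decile; decile 0 holds the lowest scores.

    Assumes at least 10 items so that no bucket is empty.
    """
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: t[0])]
    n = len(ranked)
    rates = []
    for d in range(10):
        bucket = ranked[d * n // 10:(d + 1) * n // 10]
        rates.append(sum(bucket) / len(bucket))
    return rates


# Toy sample (hypothetical): 20 items, 1 = returned, 0 = kept.
toy_scores = [i / 20 for i in range(20)]
toy_labels = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

toy_auc = roc_auc(toy_labels, toy_scores)
toy_deciles = decile_return_rates(toy_labels, toy_scores)
print(f"ROC-AUC: {toy_auc:.3f}")
print("Return rate by decile:", toy_deciles)
```

At benchmark scale a rank-based ROC-AUC implementation would replace the pairwise loop, but the quantity being measured is the same.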
Showcase: a high-discrimination merchant
The clearest read on what the model can do is the merchant whose data is most amenable to it.
Shop A — 362,000 items, 35.4% actual return rate
Confusion matrix at threshold 0.5:
| Predicted KEEP | Predicted RETURN | Row total |
Actual KEEP | 204,877 | 28,795 | 233,672 |
Actual RETURN | 40,217 | 88,111 | 128,328 |
→ Of 128,328 real returns, the model flagged 88,111 (69% recall). Of 116,906 flags, 88,111 were correct (75% precision).
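The precision, recall, and F1 figures for Shop A can be reproduced from the four confusion-matrix cells. In this sketch the helper name `classification_metrics` is hypothetical; the counts are the ones in the matrix above.

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts.

    tp: returned items flagged as RETURN
    fp: kept items flagged as RETURN
    fn: returned items flagged as KEEP
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Shop A at threshold 0.5, taken from the confusion matrix above.
shop_a = classification_metrics(tp=88_111, fp=28_795, fn=40_217)

for name, value in shop_a.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these match the Shop A row of the per-merchant table.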
Decile lift — Shop A:
Decile | Actual return rate |
0 (safest) | 6.6% |
1 | 8.5% |
2 | 11.0% |
3 | 14.8% |
4 | 19.2% |
5 | 25.5% |
6 | 34.4% |
7 | 48.4% |
8 | 86.1% |
9 (riskiest) | 99.8% |
In a clean signal environment like Shop A's, 99.8% of the items in the top-scored decile are returned and 93.4% of the items in the bottom-scored decile are kept. This is the basis for downstream automation.
Decile lift across all five merchants
Decile | Shop A | Shop B | Shop C | Shop D | Shop E | Pooled |
0 (safest) | 6.6% | 12.7% | 3.4% | 11.7% | 11.3% | 6.5% |
1 | 8.5% | 12.0% | 9.9% | 12.5% | 13.4% | 10.1% |
2 | 11.0% | 14.1% | 9.5% | 13.7% | 14.5% | 11.2% |
3 | 14.8% | 16.8% | 12.0% | 14.9% | 13.5% | 14.1% |
4 | 19.2% | 19.9% | 15.0% | 17.0% | 13.5% | 16.8% |
5 | 25.5% | 24.4% | 18.3% | 20.2% | 14.6% | 20.6% |
6 | 34.4% | 30.9% | 22.1% | 24.8% | 16.8% | 25.4% |
7 | 48.4% | 39.6% | 28.1% | 32.2% | 19.6% | 32.9% |
8 | 86.1% | 52.8% | 37.0% | 36.3% | 20.4% | 44.0% |
9 (riskiest) | 99.8% | 70.4% | 51.6% | 51.0% | 24.1% | 65.2% |
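Every merchant's curve rises from the safe deciles to the risky ones. Shop A, Shop D, and the pooled curve are strictly monotonic; Shops B, C, and E each show one small dip among the lowest four deciles before climbing steadily. The curves with the highest top-decile rates (Shop A: 99.8%, a 15× top-to-bottom ratio; Shop B: 70.4%, about 5.5×) are immediately operationalizable; the shallower curves still provide usable ordering for prioritization workflows.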
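One way to summarize each curve is its top-to-bottom ratio, together with a check for adjacent deciles where the rate dips. The sketch below runs both over the table above; the helper names `spread_ratio` and `inversions` are hypothetical, and the rates are copied from the table.

```python
# Actual return rate (%) by decile, decile 0 = safest, from the table above.
decile_rates = {
    "Shop A": [6.6, 8.5, 11.0, 14.8, 19.2, 25.5, 34.4, 48.4, 86.1, 99.8],
    "Shop B": [12.7, 12.0, 14.1, 16.8, 19.9, 24.4, 30.9, 39.6, 52.8, 70.4],
    "Shop C": [3.4, 9.9, 9.5, 12.0, 15.0, 18.3, 22.1, 28.1, 37.0, 51.6],
    "Shop D": [11.7, 12.5, 13.7, 14.9, 17.0, 20.2, 24.8, 32.2, 36.3, 51.0],
    "Shop E": [11.3, 13.4, 14.5, 13.5, 13.5, 14.6, 16.8, 19.6, 20.4, 24.1],
    "Pooled": [6.5, 10.1, 11.2, 14.1, 16.8, 20.6, 25.4, 32.9, 44.0, 65.2],
}


def spread_ratio(rates):
    """Riskiest-decile return rate divided by safest-decile return rate."""
    return rates[-1] / rates[0]


def inversions(rates):
    """Adjacent decile pairs (i, i + 1) where the rate strictly decreases."""
    return [(i, i + 1) for i in range(len(rates) - 1) if rates[i + 1] < rates[i]]


for name, rates in decile_rates.items():
    print(f"{name}: spread {spread_ratio(rates):.1f}x, dips at {inversions(rates)}")
```

An empty list of dips means the curve never decreases from one decile to the next.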
What predicts performance
The model adapts to each merchant's data profile. Three properties largely determine how strong the per-item signal is:
Base return rate. Higher return rates give the classifier more variance to learn from. Merchants in the 25–35% range produce the strongest item-level discrimination.
Data volume. More items per SKU and per customer means more signal per training pass. Volume is a multiplier on base-rate signal, not a substitute for it.
Catalog stability. Merchants with stable repeat catalogs (most SKUs seen multiple times in training) outperform high-churn catalogs.
This means the score's operational value scales with merchant conditions — and we size onboarding expectations accordingly before go-live.
Recommended usage by performance band
ROC-AUC band | Operational fit |
≥ 0.80 | Automation |
0.65 to < 0.80 | Prioritization |
< 0.65 | Aggregate forecasting and category-blended scoring |
Four of the five merchants in this benchmark fall in the top two bands.
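The band table can be expressed as a small lookup. This is a sketch under one stated assumption: a score of exactly 0.80 is assigned to the Automation band. The function name `operational_fit` is hypothetical, and the ROC-AUC values come from the per-merchant results table.

```python
def operational_fit(roc_auc):
    """Map a merchant's ROC-AUC to the recommended usage band.

    Assumption: a score of exactly 0.80 counts as Automation.
    """
    if roc_auc >= 0.80:
        return "Automation"
    if roc_auc >= 0.65:
        return "Prioritization"
    return "Aggregate forecasting and category-blended scoring"


# ROC-AUC per merchant, from the per-merchant results table.
merchant_auc = {
    "Shop A": 0.86,
    "Shop B": 0.74,
    "Shop C": 0.73,
    "Shop D": 0.68,
    "Shop E": 0.58,
}

fits = {name: operational_fit(auc) for name, auc in merchant_auc.items()}
top_two = [n for n, f in fits.items() if f in ("Automation", "Prioritization")]

for name, fit in fits.items():
    print(f"{name}: {fit}")
```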
Methodology: Jan–Mar 2026 out-of-time holdout. Items joined on (order_id, item_variant_id) — scoring is restricted to pairs present in both the actuals and predictions feeds. Returned = refund > €0. Threshold for binary classification = 0.5. No per-merchant calibration applied — numbers reflect raw model output across €131.9M of gross revenue.
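The join and labeling rules in the methodology can be sketched as follows. Everything here is illustrative: the function name `build_scored_set`, the dictionary layout, and the toy records are hypothetical, and treating a probability of exactly 0.5 as a RETURN flag is an assumption, since the text does not say which side the boundary falls on.

```python
def build_scored_set(actuals, predictions, threshold=0.5):
    """Inner-join the two feeds on (order_id, item_variant_id) and label rows.

    actuals:     {(order_id, item_variant_id): refund_eur}
    predictions: {(order_id, item_variant_id): predicted_probability}
    Only keys present in both feeds are scored.
    """
    shared_keys = sorted(actuals.keys() & predictions.keys())
    return [
        {
            "key": key,
            "returned": actuals[key] > 0,  # Returned = refund > EUR 0
            "probability": predictions[key],
            # Assumption: a probability equal to the threshold is flagged.
            "flagged": predictions[key] >= threshold,
        }
        for key in shared_keys
    ]


# Toy feeds (hypothetical records).
actuals = {("o1", "v1"): 0.0, ("o1", "v2"): 19.99, ("o2", "v1"): 0.0, ("o3", "v9"): 5.00}
predictions = {("o1", "v1"): 0.12, ("o1", "v2"): 0.71, ("o2", "v1"): 0.55, ("o4", "v3"): 0.40}

scored = build_scored_set(actuals, predictions)
for row in scored:
    print(row)
```

Items that appear in only one feed, such as ("o3", "v9") and ("o4", "v3") in the toy data, are dropped before any metric is computed.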
