[Beta] Return Prediction Model: Performance Benchmark

Across 3.83 million line items and €131.9M of gross revenue scored on a strict out-of-time holdout, the model produced:

€33.98M in predicted returns against €35.00M actual: a 2.9% aggregate shortfall.
A pooled ROC-AUC of 0.75 and F1 of 0.50 across the full benchmark. That is just below the 0.80–0.90 production-decisioning range for ROC-AUC, which the strongest merchant reaches at 0.86.
A pooled top-decile return rate of 65.2% versus 6.5% in the bottom decile: a 10× separation between the items the model scores as riskiest and the items it scores as safest.

At the pooled level the model is accurate enough to feed finance reporting, and for merchants with strong item-level discrimination it is sharp enough to drive operational decisions one line at a time.

What we tested

For every line item in every order placed from January through March 2026, the model produced a probability between 0 and 1 that the customer would return that item. We then waited for the real outcomes and graded the model against them, on data the model had never seen.

Five live merchants of varying size, vertical, and return-rate profile were included to cover the spread of conditions the model encounters in production.

Store-aggregate calibration

For each merchant, multiplying every item's gross revenue by its predicted return probability and summing over all items gives the model's implied store-level return forecast, the number CFOs ask for.

Merchant	Gross revenue	Actual returns	Predicted returns	Gap (predicted vs actual)
Shop A	€17.0M	€6,039,265	€5,664,600	−6.2%
Shop B	€49.6M	€14,608,633	€13,046,191	−10.7%
Shop C	€48.3M	€10,095,599	€11,328,712	+12.2%
Shop D	€12.3M	€3,433,109	€3,229,999	−5.9%
Shop E	€4.7M	€823,581	€708,134	−14.0%
All	€131.9M	€35,000,187	€33,977,636	−2.9%

Individual merchants land within ±14% of their actual return total. Four of the five are under-forecast and one (Shop C) is over-forecast, so the per-merchant errors partly offset when pooled. Across the full benchmark, the model is within 2.9% of the real number on a €35M figure: a finance-grade aggregate forecast emerging from per-item predictions, with no calibration step applied.
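The arithmetic behind the table can be sketched in a few lines of Python. This is an illustration only: the function names and the three-item sample are hypothetical, and only the Shop A and pooled totals are taken from the table above.

```python
def forecast_returns(items):
    """Sum gross revenue x predicted return probability over line items.

    `items` is an iterable of (gross_revenue_eur, predicted_probability)
    pairs; this layout is an assumption made for the example.
    """
    return sum(revenue * prob for revenue, prob in items)


def gap_pct(predicted, actual):
    """Signed gap of the forecast relative to actual returns, in percent."""
    return (predicted - actual) / actual * 100.0


# Hypothetical line items: (gross revenue in EUR, predicted probability).
sample_items = [(100.0, 0.5), (50.0, 0.2), (200.0, 0.25)]
sample_forecast = forecast_returns(sample_items)

# Totals taken from the table above.
shop_a_gap = gap_pct(5_664_600, 6_039_265)
pooled_gap = gap_pct(33_977_636, 35_000_187)
print(f"Shop A gap: {shop_a_gap:+.1f}%  pooled gap: {pooled_gap:+.1f}%")
```

A negative gap means the model forecast fewer returns than actually occurred.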

Per-merchant results

Merchant	Items scored	Actual return rate	F1	ROC-AUC	Precision	Recall	Top-decile return rate	Bottom-decile return rate
Shop A	362,000	35.4%	0.72	0.86	0.75	0.69	99.8%	6.6%
Shop B	1,100,325	29.4%	0.52	0.74	0.59	0.47	70.4%	12.7%
Shop C	2,158,500	20.7%	0.44	0.73	0.44	0.44	51.6%	3.4%
Shop D	188,097	23.4%	0.42	0.68	0.43	0.42	51.0%	11.7%
Shop E	22,933	16.2%	0.20	0.58	0.24	0.17	24.1%	11.3%
Pooled	3,831,855	24.7%	0.50	0.75	0.52	0.48	65.2%	6.5%

Reading the metrics

F1 is the harmonic mean of precision and recall. Production fraud, churn, and credit-risk models typically score 0.70–0.85. Shop A (0.72) falls within that range; Shops B, C, and D score 0.42–0.52, and Shop E scores 0.20.
ROC-AUC is the probability that the model scores a randomly chosen returned item above a randomly chosen kept item. 0.50 = random; 0.80–0.90 = production-decisioning quality. The pooled 0.75 sits just below that, in the "acceptable production" range; the strongest merchant reaches 0.86.
Top / Bottom decile is the actual return rate in the highest- and lowest-scored buckets when items are sorted by predicted probability and split into 10 equal-sized buckets. A wide gap means the score genuinely separates likely returns from likely keeps.
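For readers who want to see the three metrics as code, here is a minimal pure-Python sketch. The function names and the six-item toy data are assumptions for illustration; the benchmark's actual evaluation code is not shown in this report, and a production pipeline would more likely use a library implementation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def roc_auc(labels, scores):
    """Share of (returned, kept) pairs where the returned item scores higher.

    Ties count as half. This pairwise form is O(n^2) and meant for small
    illustrative inputs only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bucket_return_rates(labels, scores, n_buckets=10):
    """Sort items by score (ascending) and return the actual return rate per bucket.

    Assumes the number of items is a multiple of n_buckets.
    """
    ordered = [y for _, y in sorted(zip(scores, labels))]
    size = len(ordered) // n_buckets
    return [sum(ordered[i * size:(i + 1) * size]) / size
            for i in range(n_buckets)]


# Toy data (hypothetical): 1 = returned, 0 = kept.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9]

toy_auc = roc_auc(toy_labels, toy_scores)
toy_buckets = bucket_return_rates(toy_labels, toy_scores, n_buckets=3)
pooled_f1 = f1_score(0.52, 0.48)  # pooled precision and recall from the table
print(round(toy_auc, 3), toy_buckets, round(pooled_f1, 2))
```

The last element of the bucket list corresponds to the top (riskiest) bucket and the first to the bottom (safest) one.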

Showcase: a high-discrimination merchant

The clearest read on what the model can do is the merchant whose data is most amenable to it.

Shop A — 362,000 items, 35.4% actual return rate

Confusion matrix at threshold 0.5:

	Predicted KEEP	Predicted RETURN	Row total
Actual KEEP	204,877	28,795	233,672
Actual RETURN	40,217	88,111	128,328

→ Of 128,328 real returns, the model flagged 88,111 (69% recall). Of 116,906 flags, 88,111 were correct (75% precision).
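The recall and precision figures follow directly from the four cells of the matrix. A short sketch of that arithmetic, with variable names chosen for the example:

```python
# Shop A counts from the confusion matrix above (threshold 0.5).
tn, fp = 204_877, 28_795   # actual KEEP:   predicted KEEP, predicted RETURN
fn, tp = 40_217, 88_111    # actual RETURN: predicted KEEP, predicted RETURN

total_items = tn + fp + fn + tp
flagged = tp + fp                      # everything predicted RETURN
actual_returns = tp + fn               # everything that really came back

recall = tp / actual_returns           # share of real returns that were flagged
precision = tp / flagged               # share of flags that were real returns
f1 = 2 * precision * recall / (precision + recall)
return_rate = actual_returns / total_items

print(f"recall={recall:.2f} precision={precision:.2f} "
      f"f1={f1:.2f} return_rate={return_rate:.1%}")
```

The results round to the Shop A row of the per-merchant table.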

Decile lift — Shop A:

Decile	Actual return rate
0 (safest)	6.6%
1	8.5%
2	11.0%
3	14.8%
4	19.2%
5	25.5%
6	34.4%
7	48.4%
8	86.1%
9 (riskiest)	99.8%

When the model is given a clean signal environment, the top 10% of items it flags really do come back 99.8% of the time, and the bottom 10% really are kept 93.4% of the time. This is the basis for downstream automation.

Decile lift across all five merchants

Decile	Shop A	Shop B	Shop C	Shop D	Shop E	Pooled
0 (safest)	6.6%	12.7%	3.4%	11.7%	11.3%	6.5%
1	8.5%	12.0%	9.9%	12.5%	13.4%	10.1%
2	11.0%	14.1%	9.5%	13.7%	14.5%	11.2%
3	14.8%	16.8%	12.0%	14.9%	13.5%	14.1%
4	19.2%	19.9%	15.0%	17.0%	13.5%	16.8%
5	25.5%	24.4%	18.3%	20.2%	14.6%	20.6%
6	34.4%	30.9%	22.1%	24.8%	16.8%	25.4%
7	48.4%	39.6%	28.1%	32.2%	19.6%	32.9%
8	86.1%	52.8%	37.0%	36.3%	20.4%	44.0%
9 (riskiest)	99.8%	70.4%	51.6%	51.0%	24.1%	65.2%

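Every merchant shows a rising ramp from safe to risky deciles. Shop A, Shop D, and the pooled curve are strictly monotonic; Shops B, C, and E have small inversions in the lower deciles (Shop B at decile 1, Shop C at decile 2, Shop E at decile 3). The curves with the highest top-decile return rates (Shop A: 99.8% vs 6.6%, a 15× spread; Shop B: 70.4% vs 12.7%, a 5.5× spread) are immediately operationalizable; the shallower curves still provide usable ordering for prioritization workflows.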
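The shape of each curve can be checked mechanically from the table. The sketch below uses hypothetical helper names: it computes the top-to-bottom ratio for each column and lists any adjacent deciles where the return rate dips instead of rising.

```python
# Actual return rate (%) per decile, copied from the table above.
DECILE_RATES = {
    "Shop A": [6.6, 8.5, 11.0, 14.8, 19.2, 25.5, 34.4, 48.4, 86.1, 99.8],
    "Shop B": [12.7, 12.0, 14.1, 16.8, 19.9, 24.4, 30.9, 39.6, 52.8, 70.4],
    "Shop C": [3.4, 9.9, 9.5, 12.0, 15.0, 18.3, 22.1, 28.1, 37.0, 51.6],
    "Shop D": [11.7, 12.5, 13.7, 14.9, 17.0, 20.2, 24.8, 32.2, 36.3, 51.0],
    "Shop E": [11.3, 13.4, 14.5, 13.5, 13.5, 14.6, 16.8, 19.6, 20.4, 24.1],
    "Pooled": [6.5, 10.1, 11.2, 14.1, 16.8, 20.6, 25.4, 32.9, 44.0, 65.2],
}


def spread(rates):
    """Ratio of the riskiest decile's return rate to the safest decile's."""
    return rates[-1] / rates[0]


def dips(rates):
    """Deciles d where decile d + 1 has a strictly lower rate than decile d."""
    return [d for d in range(len(rates) - 1) if rates[d + 1] < rates[d]]


for name, rates in DECILE_RATES.items():
    print(f"{name}: spread {spread(rates):.1f}x, dips after deciles {dips(rates)}")
```

An empty dip list means the column never decreases from one decile to the next.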

What predicts performance

The model adapts to each merchant's data profile. Three properties largely determine how strong the per-item signal is:

Base return rate. Higher return rates give the classifier more variance to learn from. In this benchmark, the two merchants with return rates of roughly 25–35% (Shops A and B) show the strongest item-level discrimination.
Data volume. More items per SKU and per customer mean more signal per training pass. Volume is a multiplier on base-rate signal, not a substitute for it.
Catalog stability. Merchants with stable repeat catalogs (most SKUs seen multiple times in training) outperform high-churn catalogs.

This means the score's operational value scales with merchant conditions — and we size onboarding expectations accordingly before go-live.

Recommended usage by performance band

ROC-AUC band	Operational fit
≥ 0.80	Automation
0.65 to < 0.80	Prioritization
< 0.65	Aggregate forecasting and category-blended scoring

Four of the five merchants in this benchmark fall in the top two bands.
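The band table translates into a simple lookup. The sketch below is one possible encoding, with a hypothetical function name; it assumes a score of exactly 0.80 belongs to the Automation band, as the "≥ 0.80" row implies.

```python
def operational_fit(roc_auc):
    """Map a merchant's ROC-AUC to the recommended usage band."""
    if roc_auc >= 0.80:
        return "Automation"
    if roc_auc >= 0.65:
        return "Prioritization"
    return "Aggregate forecasting and category-blended scoring"


# ROC-AUC per merchant, from the per-merchant results table.
MERCHANT_AUC = {
    "Shop A": 0.86,
    "Shop B": 0.74,
    "Shop C": 0.73,
    "Shop D": 0.68,
    "Shop E": 0.58,
}

fits = {name: operational_fit(auc) for name, auc in MERCHANT_AUC.items()}
top_two = [name for name, fit in fits.items()
           if fit in ("Automation", "Prioritization")]

for name, fit in fits.items():
    print(f"{name}: {fit}")
```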

Methodology: Jan–Mar 2026 out-of-time holdout. Items joined on (order_id, item_variant_id) — scoring is restricted to pairs present in both the actuals and predictions feeds. Returned = refund > €0. Threshold for binary classification = 0.5. No per-merchant calibration applied — numbers reflect raw model output across €131.9M of gross revenue.
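As a rough sketch of the join and labeling rules in the methodology, assuming each feed is a mapping keyed by (order_id, item_variant_id): the function name, the dictionary layout, and the sample rows are hypothetical. The report does not say whether a probability of exactly 0.5 counts as a predicted return, so the `>=` comparison is also an assumption.

```python
def build_scoring_set(actuals, predictions, threshold=0.5):
    """Inner-join the two feeds and derive binary labels.

    actuals:     {(order_id, item_variant_id): refund amount in EUR}
    predictions: {(order_id, item_variant_id): predicted return probability}
    Only keys present in BOTH feeds are scored.
    """
    rows = []
    for key in sorted(actuals.keys() & predictions.keys()):
        rows.append({
            "key": key,
            "returned": actuals[key] > 0,                      # Returned = refund > EUR 0
            "predicted_return": predictions[key] >= threshold,  # assumed >=
            "probability": predictions[key],
        })
    return rows


# Hypothetical sample feeds.
sample_actuals = {("o1", "v1"): 0.0, ("o1", "v2"): 19.99, ("o2", "v1"): 5.0}
sample_predictions = {("o1", "v1"): 0.1, ("o1", "v2"): 0.8, ("o3", "v9"): 0.6}

scored = build_scoring_set(sample_actuals, sample_predictions)
for row in scored:
    print(row)
```

In this sample, ("o2", "v1") and ("o3", "v9") each appear in only one feed, so neither is scored.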