[Closed Beta] Return Prediction Model: Performance Benchmark

Klar DS return-prediction model · Jan–Mar 2026 out-of-time holdout · five live merchants

Written by Marc Garbella

Across 3.83 million line items and €131.9M of gross revenue scored on a strict out-of-time holdout, the model produced:

  • €33.98M in predicted returns against €35.00M actual — a 2.9% aggregate gap on a number worth €35M.

  • A pooled ROC-AUC of 0.75 and F1 of 0.50 across the full benchmark. The ROC-AUC sits in the 0.65–0.80 prioritization band, below the 0.80–0.90 range associated with production decisioning; the strongest merchant reaches 0.86.

  • A pooled top decile with a 65% actual return rate versus 6.5% in the bottom decile: a 10× separation between the items the model scores as riskiest and the items it scores as safest.

The model is accurate enough at the aggregate to plug directly into finance reporting and discriminating enough at the item level to drive operational decisions one line at a time.

What we tested

For every line item in every order placed from January through March 2026, the model produced a probability between 0 and 1 that the customer would return that item. We then waited for the real outcomes and graded the model against them, on data the model had never seen.

Five live merchants of varying size, vertical, and return-rate profile were included to cover the spread of conditions the model encounters in production.

Store-aggregate calibration

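Multiplying each line item's gross revenue by its predicted return probability and summing within each merchant gives the model's implied store-level return forecast, the number CFOs ask for. The gap is (predicted − actual) / actual, so a negative gap means the model under-forecast returns.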

| Merchant | Gross revenue | Actual returns | Predicted returns | Gap (predicted vs actual) |
|---|---|---|---|---|
| Shop A | €17.0M | €6,039,265 | €5,664,600 | −6.2% |
| Shop B | €49.6M | €14,608,633 | €13,046,191 | −10.7% |
| Shop C | €48.3M | €10,095,599 | €11,328,712 | +12.2% |
| Shop D | €12.3M | €3,433,109 | €3,229,999 | −5.9% |
| Shop E | €4.7M | €823,581 | €708,134 | −14.0% |
| All | €131.9M | €35,000,187 | €33,977,636 | −2.9% |

Individual merchants land within ±14% of their actual return total. Pooled across the full benchmark, the model is within 2.9% of the real number on a €35M figure — a finance-grade aggregate forecast emerging from per-item predictions, with no calibration step applied.
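The forecast arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the function names `store_forecast` and `gap_pct` and the toy line items are assumptions made for the example, while the `PUBLISHED` totals are copied from the table.

```python
from collections import defaultdict

def store_forecast(items):
    """Sum revenue * predicted return probability per merchant.

    items: iterable of (merchant, gross_revenue_eur, predicted_probability).
    """
    totals = defaultdict(float)
    for merchant, revenue, prob in items:
        totals[merchant] += revenue * prob
    return dict(totals)

def gap_pct(predicted, actual):
    """Signed gap in percent; negative means the model under-forecast."""
    return 100.0 * (predicted - actual) / actual

# Toy line items (hypothetical, for illustration only).
toy_items = [
    ("Shop X", 100.0, 0.10),
    ("Shop X", 50.0, 0.60),
    ("Shop Y", 80.0, 0.25),
]
forecast = store_forecast(toy_items)

# (predicted, actual) return totals in EUR, copied from the table above.
PUBLISHED = {
    "Shop A": (5_664_600, 6_039_265),
    "Shop B": (13_046_191, 14_608_633),
    "Shop C": (11_328_712, 10_095_599),
    "Shop D": (3_229_999, 3_433_109),
    "Shop E": (708_134, 823_581),
}
gaps = {m: round(gap_pct(p, a), 1) for m, (p, a) in PUBLISHED.items()}

# Pooled gap: sum the totals first, then compare.
pooled_pred = sum(p for p, _ in PUBLISHED.values())
pooled_actual = sum(a for _, a in PUBLISHED.values())
pooled_gap = round(gap_pct(pooled_pred, pooled_actual), 1)
```

Recomputing the gaps from the published totals reproduces the Gap column. The pooled gap is smaller than most per-merchant gaps because Shop C's over-forecast partly offsets the under-forecasts of the other four merchants.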

Per-merchant results

| Merchant | Items scored | Actual return rate | F1 | ROC-AUC | Precision | Recall | Top-decile return rate | Bottom-decile return rate |
|---|---|---|---|---|---|---|---|---|
| Shop A | 362,000 | 35.4% | 0.72 | 0.86 | 0.75 | 0.69 | 99.8% | 6.6% |
| Shop B | 1,100,325 | 29.4% | 0.52 | 0.74 | 0.59 | 0.47 | 70.4% | 12.7% |
| Shop C | 2,158,500 | 20.7% | 0.44 | 0.73 | 0.44 | 0.44 | 51.6% | 3.4% |
| Shop D | 188,097 | 23.4% | 0.42 | 0.68 | 0.43 | 0.42 | 51.0% | 11.7% |
| Shop E | 22,933 | 16.2% | 0.20 | 0.58 | 0.24 | 0.17 | 24.1% | 11.3% |
| Pooled | 3,831,855 | 24.7% | 0.50 | 0.75 | 0.52 | 0.48 | 65.2% | 6.5% |

Reading the metrics

  • F1 combines precision and recall into one score (their harmonic mean), computed here at the 0.5 threshold. Production fraud, churn, and credit-risk models typically score 0.70–0.85. Four of five merchants (F1 0.42–0.72) sit at or above the "decent prioritization" band; the strongest, Shop A at 0.72, is in the production-decisioning band.

  • ROC-AUC is the probability that the model scores a randomly chosen returned item above a randomly chosen kept item. 0.50 = random; 0.80–0.90 = production decisioning quality. The pooled 0.75 puts the model in the "acceptable production" range; the strongest merchant reaches 0.86.

  • Top / Bottom decile is the actual return rate in the highest-scored and lowest-scored tenth of items, after sorting by predicted probability and slicing into 10 equal buckets. A wide gap means the score genuinely separates returned items from kept ones.
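To make the ROC-AUC and decile definitions concrete, here is a small pure-Python sketch. The function names and the 20-item toy sample are assumptions for illustration; the pairwise AUC loop is quadratic and only suitable for small samples, so a benchmark of this size would need a rank-based implementation instead.

```python
def roc_auc(labels, scores):
    """Share of (returned, kept) pairs where the returned item scores higher.

    Ties count as half a win. labels: 1 = returned, 0 = kept.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def decile_return_rates(labels, scores, buckets=10):
    """Actual return rate per bucket, from lowest to highest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    rates = []
    for b in range(buckets):
        idx = order[b * n // buckets:(b + 1) * n // buckets]
        rates.append(sum(labels[i] for i in idx) / len(idx))
    return rates

# Toy sample: 20 items with evenly spaced scores (hypothetical data).
toy_scores = [i / 20 for i in range(20)]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

auc = roc_auc(toy_labels, toy_scores)
deciles = decile_return_rates(toy_labels, toy_scores)
bottom_decile, top_decile = deciles[0], deciles[-1]
```

In the toy sample every item in the top bucket is a return and none in the bottom bucket is, which is the pattern the decile columns summarize for each merchant.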

Showcase: a high-discrimination merchant

The clearest read on what the model can do is the merchant whose data is most amenable to it.

Shop A — 362,000 items, 35.4% actual return rate

Confusion matrix at threshold 0.5:

| | Predicted KEEP | Predicted RETURN | Row total |
|---|---|---|---|
| Actual KEEP | 204,877 | 28,795 | 233,672 |
| Actual RETURN | 40,217 | 88,111 | 128,328 |

→ Of 128,328 real returns, the model flagged 88,111 (69% recall). Of 116,906 flags, 88,111 were correct (75% precision).
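The Shop A figures can be re-derived directly from the four confusion-matrix cells. The sketch below uses only the counts published above; the variable names are illustrative.

```python
# Shop A confusion-matrix cells at threshold 0.5 (counts from the table above).
tn = 204_877  # actual KEEP, predicted KEEP
fp = 28_795   # actual KEEP, predicted RETURN
fn = 40_217   # actual RETURN, predicted KEEP
tp = 88_111   # actual RETURN, predicted RETURN

total = tn + fp + fn + tp
flagged = tp + fp          # items the model predicted as RETURN
real_returns = tp + fn     # items that were actually returned

precision = tp / flagged
recall = tp / real_returns
f1 = 2 * precision * recall / (precision + recall)
return_rate = real_returns / total

summary = {
    "precision": round(precision, 2),
    "recall": round(recall, 2),
    "f1": round(f1, 2),
    "return_rate_pct": round(100 * return_rate, 1),
}
```

The rounded values match the Shop A row of the per-merchant table: precision 0.75, recall 0.69, F1 0.72, and a 35.4% actual return rate.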

Decile lift — Shop A:

| Decile | Actual return rate |
|---|---|
| 0 (safest) | 6.6% |
| 1 | 8.5% |
| 2 | 11.0% |
| 3 | 14.8% |
| 4 | 19.2% |
| 5 | 25.5% |
| 6 | 34.4% |
| 7 | 48.4% |
| 8 | 86.1% |
| 9 (riskiest) | 99.8% |

When the model is given a clean signal environment, the 10% of items it scores highest are returned 99.8% of the time, and the 10% it scores lowest are kept 93.4% of the time. This is the basis for downstream automation.

Decile lift across all five merchants

| Decile | Shop A | Shop B | Shop C | Shop D | Shop E | Pooled |
|---|---|---|---|---|---|---|
| 0 (safest) | 6.6% | 12.7% | 3.4% | 11.7% | 11.3% | 6.5% |
| 1 | 8.5% | 12.0% | 9.9% | 12.5% | 13.4% | 10.1% |
| 2 | 11.0% | 14.1% | 9.5% | 13.7% | 14.5% | 11.2% |
| 3 | 14.8% | 16.8% | 12.0% | 14.9% | 13.5% | 14.1% |
| 4 | 19.2% | 19.9% | 15.0% | 17.0% | 13.5% | 16.8% |
| 5 | 25.5% | 24.4% | 18.3% | 20.2% | 14.6% | 20.6% |
| 6 | 34.4% | 30.9% | 22.1% | 24.8% | 16.8% | 25.4% |
| 7 | 48.4% | 39.6% | 28.1% | 32.2% | 19.6% | 32.9% |
| 8 | 86.1% | 52.8% | 37.0% | 36.3% | 20.4% | 44.0% |
| 9 (riskiest) | 99.8% | 70.4% | 51.6% | 51.0% | 24.1% | 65.2% |

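Every merchant's return rate rises from the safest to the riskiest decile, though not strictly monotonically: Shop B dips between deciles 0 and 1, Shop C between deciles 1 and 2, and Shop E between deciles 2 and 3. Measured as the ratio of top-decile to bottom-decile return rate, the spread is about 15× for Shops A and C, 5.5× for Shop B, 4.4× for Shop D, and 2.1× for Shop E. The steepest curves are immediately operationalizable; the shallower curves still provide usable ordering for prioritization workflows.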
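The shape of each curve can be checked mechanically from the decile table. The sketch below copies the published rates and reports, per merchant, any adjacent deciles where the return rate falls and the ratio of the top to the bottom decile. The helper names `reversals` and `spread` are assumptions for this example.

```python
# Actual return rate (%) per decile, safest to riskiest, from the table above.
DECILE_RATES = {
    "Shop A": [6.6, 8.5, 11.0, 14.8, 19.2, 25.5, 34.4, 48.4, 86.1, 99.8],
    "Shop B": [12.7, 12.0, 14.1, 16.8, 19.9, 24.4, 30.9, 39.6, 52.8, 70.4],
    "Shop C": [3.4, 9.9, 9.5, 12.0, 15.0, 18.3, 22.1, 28.1, 37.0, 51.6],
    "Shop D": [11.7, 12.5, 13.7, 14.9, 17.0, 20.2, 24.8, 32.2, 36.3, 51.0],
    "Shop E": [11.3, 13.4, 14.5, 13.5, 13.5, 14.6, 16.8, 19.6, 20.4, 24.1],
    "Pooled": [6.5, 10.1, 11.2, 14.1, 16.8, 20.6, 25.4, 32.9, 44.0, 65.2],
}

def reversals(rates):
    """Pairs of adjacent deciles (k, k + 1) where the rate decreases."""
    return [(k, k + 1) for k in range(len(rates) - 1) if rates[k + 1] < rates[k]]

def spread(rates):
    """Ratio of the riskiest decile's return rate to the safest decile's."""
    return rates[-1] / rates[0]

report = {
    name: (reversals(rates), round(spread(rates), 1))
    for name, rates in DECILE_RATES.items()
}

for name, (dips, ratio) in report.items():
    print(f"{name}: spread {ratio}x, reversals at {dips or 'none'}")
```

A ratio is sensitive to a small denominator: Shop C's high spread comes largely from its 3.4% bottom decile, so the absolute gap in percentage points is worth reading alongside it.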

What predicts performance

The model adapts to each merchant's data profile. Three properties largely determine how strong the per-item signal is:

  • Base return rate. Higher return rates give the classifier more variance to learn from. Merchants with return rates of roughly 25–35% produce the strongest item-level discrimination: Shop A (35.4%) and Shop B (29.4%) have the highest ROC-AUC in this benchmark.

  • Data volume. More items per SKU and per customer means more signal per training pass. Volume is a multiplier on base-rate signal, not a substitute for it.

  • Catalog stability. Merchants with stable repeat catalogs (most SKUs seen multiple times in training) outperform high-churn catalogs.

This means the score's operational value scales with merchant conditions — and we size onboarding expectations accordingly before go-live.

Recommended usage by performance band

| ROC-AUC band | Operational fit |
|---|---|
| ≥ 0.80 | Automation |
| 0.65–0.80 | Prioritization |
| < 0.65 | Aggregate forecasting and category-blended scoring |

Four of the five merchants in this benchmark fall in the top two bands.
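As a sketch, the band table translates into a simple lookup. The function name `operational_fit` is hypothetical, and the boundary handling is an assumption: the table does not say which band a value of exactly 0.65 belongs to, so this example treats 0.65 as Prioritization. The ROC-AUC values are the ones reported in the per-merchant results.

```python
def operational_fit(roc_auc):
    """Map a merchant's ROC-AUC to the recommended usage band."""
    if roc_auc >= 0.80:
        return "Automation"
    if roc_auc >= 0.65:  # assumption: 0.65 itself counts as Prioritization
        return "Prioritization"
    return "Aggregate forecasting and category-blended scoring"

# ROC-AUC per merchant, from the per-merchant results table.
MERCHANT_AUC = {
    "Shop A": 0.86,
    "Shop B": 0.74,
    "Shop C": 0.73,
    "Shop D": 0.68,
    "Shop E": 0.58,
}

bands = {name: operational_fit(auc) for name, auc in MERCHANT_AUC.items()}
top_two = [n for n, b in bands.items() if b in ("Automation", "Prioritization")]
```

With these values Shop A lands in Automation, Shops B, C, and D in Prioritization, and Shop E in the aggregate-forecasting band.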

Methodology: Jan–Mar 2026 out-of-time holdout. Items joined on (order_id, item_variant_id) — scoring is restricted to pairs present in both the actuals and predictions feeds. Returned = refund > €0. Threshold for binary classification = 0.5. No per-merchant calibration applied — numbers reflect raw model output across €131.9M of gross revenue.
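The grading rules in the methodology note can be illustrated with a short sketch. Everything about the data layout here is an assumption: the two feeds are represented as dictionaries keyed by (order_id, item_variant_id), the function name `grade` and the sample rows are invented for the example, and a probability exactly equal to the threshold is treated as a predicted return, which the note does not specify.

```python
def grade(actuals, predictions, threshold=0.5):
    """Count confusion-matrix cells over items present in both feeds.

    actuals:     {(order_id, item_variant_id): refund_eur}
    predictions: {(order_id, item_variant_id): return_probability}
    """
    shared = actuals.keys() & predictions.keys()  # inner join on the key pair
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for key in shared:
        returned = actuals[key] > 0              # Returned = refund > EUR 0
        flagged = predictions[key] >= threshold  # assumption: >= rather than >
        if returned and flagged:
            counts["tp"] += 1
        elif flagged:
            counts["fp"] += 1
        elif returned:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# Hypothetical feeds; one key on each side has no partner and is dropped.
toy_actuals = {
    ("o1", "v1"): 0.0,
    ("o1", "v2"): 19.99,
    ("o2", "v1"): 5.0,
    ("o3", "v9"): 12.0,  # no prediction for this pair
}
toy_predictions = {
    ("o1", "v1"): 0.2,
    ("o1", "v2"): 0.8,
    ("o2", "v1"): 0.3,
    ("o4", "v4"): 0.9,   # no actual outcome for this pair
}

toy_counts = grade(toy_actuals, toy_predictions)
```

Restricting to shared keys means items missing from either feed affect neither the metrics nor the revenue totals, so the coverage of the join is worth reporting alongside the scores.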
