Evaluating Model-Free Policy Optimization in Masked-Action Environments via an Exact Blackjack Oracle

Abstract

Infinite-shoe casino blackjack provides a rigorous, exactly verifiable benchmark for discrete stochastic control under dynamically masked actions. Under a fixed Vegas-style ruleset (S17, 3:2 payout, dealer peek, double on any two, double after split, resplit to four), an exact dynamic programming (DP) oracle was derived over 4,600 canonical decision cells. This oracle yielded ground-truth action values, optimal policy labels, and a theoretical expected value (EV) of -0.00161 per hand. To evaluate sample-efficient policy recovery, three model-free optimizers were trained via simulated interaction: masked REINFORCE with a per-cell exponential moving average baseline, simultaneous perturbation stochastic approximation (SPSA), and the cross-entropy method (CEM). REINFORCE was the most sample-efficient, achieving a 46.37% action-match rate and an EV of -0.04688 after 10^6 hands, outperforming CEM (39.46%, 7.5x10^6 evaluations) and SPSA (38.63%, 4.8x10^6 evaluations). However, all methods exhibited substantial cell-conditional regret, indicating persistent policy-level errors despite smooth reward convergence. This gap shows that tabular environments with severe state-visitation sparsity and dynamic action masking remain challenging, while aggregate reward curves can obscure critical local failures. As a negative control, it was proven and empirically confirmed that under i.i.d. draws without counting, optimal bet sizing collapses to the table minimum. In addition, larger wagers strictly increased volatility and ruin without improving expectation. These results highlight the need for exact oracles and negative controls to avoid mistaking stochastic variability for genuine algorithmic performance.
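As a rough illustration of the first optimizer named in the abstract, the sketch below shows one way a masked REINFORCE update with a per-cell exponential moving average baseline could look for a tabular policy. The abstract does not specify the parameterization, so everything here is an assumption made for illustration: the softmax-over-logits policy, the function names `masked_softmax` and `reinforce_update`, the learning rate `lr`, and the baseline decay `beta`. It is not the authors' implementation.

```python
import math

def masked_softmax(logits, mask):
    """Softmax restricted to legal actions; illegal actions get probability 0."""
    legal = [l for l, m in zip(logits, mask) if m]
    if not legal:
        raise ValueError("at least one action must be legal")
    peak = max(legal)  # subtract the max legal logit for numerical stability
    exps = [math.exp(l - peak) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(logits, baselines, cell, mask, action, reward,
                     lr=0.05, beta=0.01):
    """One in-place masked REINFORCE step for a single visited decision cell.

    logits:    dict mapping cell -> list of per-action logits (hypothetical layout)
    baselines: dict mapping cell -> EMA of rewards observed in that cell
    lr, beta:  illustrative hyperparameters, not values from the paper
    """
    probs = masked_softmax(logits[cell], mask)
    advantage = reward - baselines[cell]
    for a, legal in enumerate(mask):
        if legal:
            # Gradient of log pi(action) w.r.t. logit a under a masked softmax.
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[cell][a] += lr * advantage * grad
    # Per-cell exponential moving average baseline.
    baselines[cell] += beta * (reward - baselines[cell])

# Usage: one cell with three actions, the third masked out (e.g. split not allowed).
logits = {0: [0.0, 0.0, 0.0]}
baselines = {0: 0.0}
reinforce_update(logits, baselines, cell=0, mask=[True, True, False],
                 action=0, reward=1.0)
```

Because the mask removes illegal actions before normalization, their logits never receive gradient, which is one simple way to keep the policy's probability mass on the legal action set of each cell.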


1️⃣ One-sentence summary

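This paper uses infinite-shoe blackjack as an exactly verifiable benchmark and evaluates three model-free optimization algorithms against an exact dynamic programming oracle, finding that the algorithms still make significant errors on individual decisions despite smooth reward curves, and stressing the importance of exact benchmarks and negative controls for avoiding misjudgments of algorithmic performance.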
