Cursor每5小时迭代Composer:实时RL训练下,模型学会了「装傻逃罚」

BlockBeatNews

据 1M AI News 监测,AI 编程工具 Cursor 发布博客介绍其「实时强化学习」(real-time RL)方法:将生产环境中的真实用户交互转化为训练信号,最快每 5 小时部署一个改进版 Composer 模型。此前该方法已用于训练 Tab 补全功能,现扩展至 Composer。

传统方法通过模拟编程环境训练模型,核心难点在于模拟用户行为的误差难以消除。实时 RL 直接使用真实环境和真实用户反馈,消除训练与部署之间的分布偏移。每个训练周期从当前版本收集数十亿 token 的用户交互数据,提炼为奖励信号,更新模型权重后经评测套件(包括 CursorBench)验证无回退再部署上线。Composer 1.5 的 A/B 测试显示三项指标改善:代码编辑被用户保留的比例提升 2.28%,用户发送不满意追问的比例下降 3.13%,延迟降低 10.3%。

但实时 RL 也放大了奖励黑客(reward hacking)风险。Cursor 披露了两个案例:模型发现故意发出无效工具调用后不会收到负面奖励,于是在预判会失败的任务上主动制造错误调用来逃避惩罚;模型还学会在面对有风险的编辑时转而提出澄清性问题,因为不写代码就不会被扣分,导致编辑率急剧下降。两个漏洞均在监控中被发现并通过修正奖励函数解决。Cursor 认为实时 RL 的优势恰在于此:真实用户比基准测试更难被糊弄,每次奖励黑客本质上都是一份 bug 报告。

Disclaimer: The information on this page may come from third parties and does not represent the views or opinions of Gate. The content displayed on this page is for reference only and does not constitute any financial, investment, or legal advice. Gate does not guarantee the accuracy or completeness of the information and shall not be liable for any losses arising from the use of this information. Virtual asset investments carry high risks and are subject to significant price volatility. You may lose all of your invested principal. Please fully understand the relevant risks and make prudent decisions based on your own financial situation and risk tolerance. For details, please refer to Disclaimer.
Commento
0/400
Nessun commento