One wonders at which point models will be sneaky enough to bypass simple eval sandboxes. The article has:
    # Evaluate the equation with restricted globals and locals
    result = eval(equation, {"__builtins__": None}, {})
but that's not enough, as you can rebuild access to builtins from objects and then go from there: https://ideone.com/qzNtyu
By the way, writing this greatly benefited from DeepThink-r1, while o1 just gave me a lobotomized refusal (CoT: "The user's request involves injecting code to bypass a restricted Python environment, suggesting a potential interest in illegal activities. This is a serious matter and aligns closely with ethical guidelines."). So I just cancelled my ChatGPT subscription - why did we ever put up with this? "This distillation thingie sounds pretty neat!"
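To make that point concrete (the linked ideone presumably shows the full chain; the snippet below is my own illustration of just the first step, not anything from the article): nulling out __builtins__ doesn't block attribute access, so a bare literal still walks back into the class hierarchy.
    # Illustration only: with __builtins__ stripped, attribute chains on a bare
    # literal still reach every loaded class, from which the import machinery
    # (and from there os etc.) can be recovered.
    sandbox_globals = {"__builtins__": None}
    probe = "().__class__.__base__.__subclasses__()"
    classes = eval(probe, sandbox_globals, {})
    print(len(classes), "classes reachable despite __builtins__ being None")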
> that's not enough as you can rebuild access to builtins from objects
In this specific case, it's safe, as that wouldn't pass the regex just a few lines before the eval:
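For context, the guard being referred to is an allow-list check of roughly this shape (the exact pattern below is recalled from the post rather than quoted, so treat it as an approximation):
    import re

    # Approximate shape of the check that runs before the eval in the blog's
    # equation reward: only digits, + - * / ( ) . and whitespace are allowed,
    # so no letters and therefore no attribute or function names get through.
    ALLOWED = re.compile(r"^[\d+\-*/().\s]+$")

    def safe_countdown_eval(equation: str):
        if not ALLOWED.match(equation):
            return None  # rejected before it ever reaches eval
        return eval(equation, {"__builtins__": None}, {})

    print(safe_countdown_eval("(95 - 21) / 2"))   # 37.0
    print(safe_countdown_eval("().__class__"))    # None - letters rejected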
Commenting on the R1 reproduction: the heavy lifting there is done by Hugging Face's trl [0] library, and by the heavy use of compute.
[0] Transformer Reinforcement Learning - https://huggingface.co/docs/trl/en/index
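For anyone curious what "trl does the heavy lifting" looks like in practice, the setup is roughly the sketch below. The kwargs may differ by trl version, the model id and dataset are placeholders, and the reward function is a simplified stand-in for the blog's two rule-based rewards - an outline, not the post's actual code:
    import re
    from datasets import Dataset
    from trl import GRPOConfig, GRPOTrainer

    # Rule-based reward in the shape trl expects: takes the generated
    # completions (plus any dataset columns as kwargs) and returns one float
    # per completion. The blog adds a second, equation-correctness reward.
    def format_reward(completions, **kwargs):
        pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
        return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

    # Tiny placeholder dataset with the "prompt" column GRPOTrainer expects.
    dataset = Dataset.from_dict({
        "prompt": ["Using the numbers [19, 36, 55, 7], create an equation that equals 65. "
                   "Think in <think> tags and put the final equation in <answer> tags."]
    })

    # Illustrative config; argument names follow trl's GRPOConfig, values are
    # placeholders rather than the blog's settings.
    config = GRPOConfig(
        output_dir="mini-r1-grpo",
        num_generations=8,          # samples drawn per prompt for the group baseline
        max_completion_length=1024,
    )

    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-3B-Instruct",   # any causal LM id; placeholder choice
        reward_funcs=[format_reward],
        args=config,
        train_dataset=dataset,
    )
    trainer.train()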
The fact that () and . are there miiiight enable a pyjail escape.
See also https://github.com/jailctf/pyjailbreaker
See also https://blog.pepsipu.com/posts/albatross-redpwnctf
That's a neat trick!
It does still require letters to be able to spell attribute/function names (unless I'm reading it wrong in that blog post).
> why did we ever put up with this?
Is this a serious question?
What's surprising about this is how sparsely defined the rewards are. Even if the model learns the formatting reward, if it never chances upon a solution, there isn't any feedback/reward to push it to learn to solve the game more often.
So what are the chances of randomly guessing a solution?
The toy Countdown dataset here has 3 to 4 numbers, which are combined with 4 symbols (+, -, x, ÷). With 3 numbers that gives roughly 3! * 4^3 = 384 possible symbol combinations; with 4 numbers, 6144. By the tensorboard log [0], even after just 10 learning steps the model already has a success rate just below 10% - roughly 8 successes out of the 80 generations produced so far (8 generations are used per step). If we make the simplifying assumption that the model learned nothing in those 10 steps and is guessing randomly with a 1/384 chance on 3-number problems, the expected number of successes in 80 tries is only about 0.2, and the probability of seeing 8 or more is vanishingly small. One interpretation is to take that as a p-value and reject the hypothesis that the model's base success rate is pure random guessing - the base model already solves the 3-number Countdown game at an above-chance rate.
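A quick way to sanity-check that back-of-the-envelope argument (using the 1/384 estimate for the 3-number search space from above; the numbers are assumptions, not measurements from the run):
    # Back-of-the-envelope check, assuming 384 possible 3-number expressions
    # and 80 total generations (10 steps x 8 generations per step).
    from math import comb

    p = 1 / 384       # assumed chance of a random guess hitting a valid solution
    n = 80            # generations produced by step 10
    observed = 8      # ~10% observed success rate over those generations

    expected = n * p
    p_at_least_one = 1 - (1 - p) ** n
    p_at_least_obs = sum(comb(n, k) * p**k * (1 - p) ** (n - k)
                         for k in range(observed, n + 1))

    print(f"expected successes under pure guessing: {expected:.2f}")   # ~0.21
    print(f"P(at least 1 success):  {p_at_least_one:.3f}")            # ~0.19
    print(f"P(at least {observed} successes): {p_at_least_obs:.2e}")  # vanishingly small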
This aligns with my intuition - I suspect that with proper prompting, LLMs should be able to solve Countdown decently well without any training. Though maybe not a 3B model?
The model likely "parlays" its successes on 3 numbers into learning to solve 4 numbers. Or does it? The final learned ~50% success rate matches the frequency of 4-number problems in Jiayi Pan's Countdown dataset [1]. Phil does provide examples of successful 4-number solutions, but maybe the model hasn't become consistent at 4 numbers yet.
[0]: https://www.philschmid.de/static/blog/mini-deepseek-r1/tenso... [1]: https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3t...
> What's surprising about this is how sparsely defined the rewards are
Yeah, I would expect the rewards not to be binary. One could easily devise a scoring function in the range [0, 1] that depends on how far the model is from the "correct" answer (for example, normalized Levenshtein distance). Whether that would actually do any good is anyone's guess.
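As a purely illustrative version of that idea (not anything from the post), a shaped reward in [0, 1] could decay with the normalized edit distance between the produced answer and the target string:
    # Illustrative dense reward: 1.0 for an exact match, decaying with the
    # normalized Levenshtein distance between answer and target strings.
    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def shaped_reward(answer: str, target: str) -> float:
        if not answer and not target:
            return 1.0
        return 1.0 - levenshtein(answer, target) / max(len(answer), len(target))

    print(shaped_reward("55", "55"))  # 1.0
    print(shaped_reward("55", "56"))  # 0.5 instead of a flat 0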
"Conclusion
The release of DeepSeek R1 and its research paper might be breakpoint for the open-science and open-source development. Just a week after DeepSeek release, we've been able to reproduce a simple version of R1 learned "reasoning" using GRPO and the Countdown Game. While our implementation focuses on a specific task rather than general reasoning and convergence into a very specific "reasoning" format, it shows that the method is working.
In our mini R1 experiment we used GRPO, with two rule-based reward but already required significant compute: 4 H100 GPUs running for 6 hours to complete just 450 training steps on a 3B parameter model."
I was getting pretty hyped about the potential for GRPO in my own projects until you said 20 minutes for a single training step with batch size 1! Is that likely to improve?
That has already improved a lot. Initially they were generating new samples with transformers and were talking in GitHub issues about using vLLM to batch-generate samples. Lower in the blog post it seems they already did that.
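For reference, if I'm reading the current trl API right, the switch is a config flag on GRPOConfig (flag names recalled from the docs, so double-check against your trl version):
    from trl import GRPOConfig

    # Generation is the bottleneck: each GRPO step has to sample several
    # completions per prompt. Offloading that sampling to vLLM instead of
    # plain transformers .generate() is what sped things up.
    config = GRPOConfig(
        output_dir="mini-r1-grpo",
        num_generations=8,
        use_vllm=True,                       # serve generation through vLLM
        vllm_gpu_memory_utilization=0.7,     # assumed knob; check your trl version
    )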
I'd imagine using optimized/faster reward functions could already make a difference.
What about this one? https://github.com/Jiayi-Pan/TinyZero
They do mention it here
> Note: This blog is inspired by Jiayi Pan [1] who initially explored the idea and proofed it with a small model.
I might have written it as
> Note: This blog is inspired by Jiayi Pan [1] who also reproduced the "Aha Moment" with their TinyZero [2] model.
[1] https://x.com/jiayi_pirate/status/1882839370505621655 (1.1M views btw)
[2] https://github.com/Jiayi-Pan/TinyZero
A lot of people are busy reproing R1 right now. I think this is the spark.
this is really cool!
Wow!