In an impressive demonstration of cost-effective AI research, a group of researchers has successfully replicated DeepSeek’s R1-Zero model for just $30.
Dubbed TinyZero, this project focuses on countdown and multiplication tasks, leveraging reinforcement learning (RL) to enable a 3-billion-parameter (3B) base language model (LM) to develop self-verification and search abilities autonomously.
Built on the veRL framework, TinyZero showcases how reinforcement learning can help large language models (LLMs) evolve reasoning capabilities independently.
The researchers behind this project highlight an “Aha!” moment that users can experience firsthand with minimal computational costs.
For those interested in exploring the methodology, a detailed experiment log is available on Weights & Biases, with further insights shared in a Twitter thread. The team has also confirmed that a formal research paper is forthcoming.
The research team selected the “countdown game” as their test environment, a mathematical challenge where the AI generates equations from a set of numbers to reach a specific target.
This game is well suited to testing problem-solving, since it rewards logical reasoning and strategic trial and error. Initially, the model produced random outputs with no clear strategy.
However, through reinforcement learning, it gradually refined its approach, developing logical reasoning skills independently.
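To make this concrete, below is a minimal sketch of the kind of rule-based reward the countdown game lends itself to: an equation scores only if it uses exactly the given numbers and evaluates to the target. The function names and the simple 0/1 reward shape are illustrative assumptions, not TinyZero's actual implementation.

# Illustrative countdown reward check (not TinyZero's actual code): an equation
# earns reward only if it uses exactly the given numbers and hits the target,
# so the model has to learn to search for and verify its own answers.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(node):
    """Evaluate an arithmetic AST limited to numbers and + - * /."""
    if isinstance(node, ast.Expression):
        return safe_eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    raise ValueError("disallowed expression")

def countdown_reward(equation: str, nums: list, target: int) -> float:
    """Return 1.0 for a correct equation, 0.0 otherwise (hypothetical reward shape)."""
    try:
        tree = ast.parse(equation, mode="eval")
        used = sorted(int(n.value) for n in ast.walk(tree) if isinstance(n, ast.Constant))
        if used != sorted(nums):            # every given number used exactly once
            return 0.0
        return 1.0 if abs(safe_eval(tree) - target) < 1e-6 else 0.0
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0

print(countdown_reward("(6 * 4) + 1", [1, 4, 6], 25))  # 1.0
print(countdown_reward("6 + 4 + 1", [1, 4, 6], 25))    # 0.0

Because the reward depends only on a verifiable arithmetic check, no human labels or learned reward model are needed, which is a large part of what keeps the experiment so cheap.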
Running TinyZero: Installation and Setup
To replicate TinyZero, users can follow a straightforward setup process:
Installation Steps
- Create Environment:
conda create -n zero python=3.9
- Install Torch (optional):
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
- Install vLLM:
pip3 install vllm==0.6.3
- Install veRL and Dependencies:
pip install -e .
pip3 install flash-attn --no-build-isolation
pip install wandb IPython matplotlib
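With the dependencies in place, a quick import check (an optional sanity step, not part of the official instructions) confirms the pinned versions resolved and that CUDA is visible:

# Optional sanity check that the pinned dependencies installed correctly.
import torch
import vllm
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("vllm:", vllm.__version__)
print("flash-attn:", flash_attn.__version__)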
Countdown Task: Training TinyZero
Data Preparation
Activate the environment and preprocess the dataset:
conda activate zero
python ./examples/data_preprocess/countdown.py --local_dir {path_to_your_dataset}
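Before training, it can help to spot-check the preprocessed data. The sketch below assumes the script writes parquet files (e.g. train.parquet) under the directory passed to --local_dir, as is common in veRL pipelines; the file name and column names here are assumptions to verify against the actual output.

# Spot-check the preprocessed countdown dataset (file and column names assumed).
import pandas as pd

df = pd.read_parquet("{path_to_your_dataset}/train.parquet")  # assumed output file
print(df.columns.tolist())  # expect prompt- and target-style fields
print(df.iloc[0])           # inspect one countdown example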
Training on a Single GPU
For models up to 1.5B parameters, a single GPU setup works effectively:
export N_GPUS=1
export BASE_MODEL={path_to_your_model}
export DATA_DIR={path_to_your_dataset}
export ROLLOUT_TP_SIZE=1
export EXPERIMENT_NAME=countdown-qwen2.5-0.5b
export VLLM_ATTENTION_BACKEND=XFORMERS
bash ./scripts/train_tiny_zero.sh
Scaling Up: Training a 3B+ Model
For larger models that exhibit more advanced reasoning skills, a two-GPU configuration is recommended:
export N_GPUS=2
export BASE_MODEL={path_to_your_model}
export DATA_DIR={path_to_your_dataset}
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS
bash ./scripts/train_tiny_zero.sh
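Here, ROLLOUT_TP_SIZE=2 tells the rollout engine to shard the model across both GPUs with tensor parallelism. The snippet below is a standalone illustration of what that setting corresponds to in vLLM itself, not the TinyZero training loop; the prompt is made up.

# Standalone illustration of tensor-parallel rollout in vLLM (not the training script).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-3B", tensor_parallel_size=2)  # one shard per GPU
params = SamplingParams(temperature=1.0, max_tokens=256)
prompt = "Using the numbers [1, 4, 6], create an equation that equals 25."
print(llm.generate([prompt], params)[0].outputs[0].text)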
Instruct Ablation: Experimenting with Qwen-2.5-3B
The team also experimented with the instruction-tuned Qwen-2.5-3B-Instruct model. This requires reprocessing the data so that prompts follow the instruct chat template:
conda activate zero
python examples/data_preprocess/countdown.py --template_type=qwen-instruct --local_dir={path_to_your_dataset}
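The --template_type=qwen-instruct flag wraps each question in the instruct model's chat template instead of feeding it as raw text. The sketch below uses the standard transformers chat-template API to show roughly what that wrapping looks like; it is an illustration, not the project's preprocessing code.

# Rough illustration of the qwen-instruct chat template (not the project's code).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
question = "Using the numbers [1, 4, 6], create an equation that equals 25."
chat = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
print(chat)  # the question wrapped in <|im_start|>/<|im_end|> markers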
Training follows a similar two-GPU setup:
export N_GPUS=2
export BASE_MODEL={path_to_your_model}
export DATA_DIR={path_to_your_dataset}
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b-instruct
export VLLM_ATTENTION_BACKEND=XFORMERS
bash ./scripts/train_tiny_zero.sh
TinyZero was developed on the veRL framework and uses Qwen2.5-series base models. The research team, comprising Jiayi Pan, Junjie Zhang, Xingyao Wang, Lifan Yuan, Hao Peng, and Alane Suhr, has made the project open source on GitHub.
TinyZero's success demonstrates that cutting-edge reasoning behaviors can be reproduced and studied on a remarkably small budget, potentially paving the way for more affordable AI research.