This project is an exploration into training a Large Language Model (Gemma-3 4B-it) to play the game of Wordle using Reinforcement Learning (RL) with LoRA. The entire training and inference pipeline is optimized to run locally on Apple Silicon using the MLX framework.
The primary goals were to gain hands-on experience with RL, understand the challenges and hardware constraints of local training, and compare RL techniques to traditional Supervised Fine-Tuning (SFT).
If you are not familiar with Wordle, the best way to learn is to play a round: wordle-nyt.
While Wordle can be solved deterministically using algorithms based on information theory (as beautifully explained by 3Blue1Brown), it presents a fascinating and constrained environment for Reinforcement Learning.
An algorithmic approach typically works by filtering the list of possible answers after each round of feedback and then choosing the next guess that maximizes the expected information gain.
This project takes a different approach: Can we teach a language model to develop a strategic policy for playing Wordle simply by rewarding good moves and penalizing bad ones? This makes it a perfect, self-contained problem for learning and applying RL concepts.
Let's look at this example where the secret word is “STARS”:
Played 'TARES', got feedback '---x✓'
Possible words remaining: 7 -> ['STARS', 'DRATS', 'BRATS', 'FRATS', 'PRATS', 'ARTIS', 'ROTAS']
Played 'FIORD', got feedback 'xxx✓x'
Possible words remaining: 1 -> ['STARS']
🎉 🎉 🎉 Congratulations! 'STARS' is the correct answer, after 3 plays!
The algorithmic version starts with the following assumptions:
1) we have a finite list of words (words_list);
2) we have a finite list of allowed guesses, where allowed_guesses <= words_list.
After each guess, we get feedback on the letter positions, which allows us to keep only the words that are still possible. First, the word ‘TARES’ provides the maximum amount of information (~6.3 bits); from that feedback, there are now 7 possible words remaining.
To figure out which of the 7 is the answer, the idea behind the algorithm is to propose a word from the allowed guesses that provides the maximum information gain. Here that word is ‘FIORD’, since after its feedback we are left with only one remaining word.
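To make this concrete, here is a minimal, self-contained sketch of that filter-then-maximize-entropy loop. It is illustrative only (the feedback symbols follow the example above: ✓ = green, - = yellow, x = gray) and is not the project's implementation; the real solver lives in the notebook referenced below.

```python
import math
from collections import Counter

def feedback(guess: str, secret: str) -> str:
    """Two-pass Wordle feedback: '✓' green, '-' yellow, 'x' gray (handles repeated letters)."""
    pattern = ["x"] * 5
    remaining = list(secret)
    for i, (g, s) in enumerate(zip(guess, secret)):   # greens first
        if g == s:
            pattern[i] = "✓"
            remaining[i] = None
    for i, g in enumerate(guess):                     # then yellows
        if pattern[i] == "x" and g in remaining:
            pattern[i] = "-"
            remaining[remaining.index(g)] = None
    return "".join(pattern)

def filter_candidates(candidates, guess, observed):
    """Keep only secrets consistent with the feedback we actually observed."""
    return [w for w in candidates if feedback(guess, w) == observed]

def entropy(guess, candidates):
    """Expected information (in bits) of a guess over a uniform candidate set."""
    counts = Counter(feedback(guess, s) for s in candidates)
    n = len(candidates)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Toy example with a tiny word list; the real solver uses the full dictionary.
words = ["stars", "drats", "brats", "tares", "fiord", "rotas"]
candidates = filter_candidates(words, "tares", feedback("tares", "stars"))
best = max(words, key=lambda w: entropy(w, candidates))
print(candidates, "-> next guess:", best)
```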
The Colab notebook scripts/wordle_no_rl.ipynb implements the 3Blue1Brown Wordle approach: pick a secret word and let the algorithm guess it.
Check out scripts/calculate_word_entropy_mlx.py to calculate the entropy of each word. The results are available in data/word_entropy.json and are used later in our reward function.
This project was developed entirely within the Apple Silicon ecosystem (initially M1, later M4 Pro). While PyTorch is a common choice, I switched to Apple’s MLX framework for several key reasons:
- Common PyTorch quantization workflows rely on bitsandbytes, which lacks stable support for Apple Silicon (#252).
- MLX is built from the ground up for unified memory and Apple's Metal Performance Shaders (MPS).

This project was trained on a Mac M4 Pro with 48 GB of RAM using the mlx-community/gemma-3-4b-it-bf16 model.
Clone the repository and set up a Python virtual environment.
git clone https://github.com/charbull/wordle-lora-rl.git
cd wordle-lora-rl
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
You will need to download the Gemma-3 model weights from the Hugging Face Hub. This project uses the 4B-parameter version.
# For full training
hf download mlx-community/gemma-3-4b-it-bf16
# For faster iteration/testing
hf download mlx-community/gemma-3-270m-it-bf16
Note: Update the model path in your config file to point to the downloaded directory.
The training and inference scripts are controlled by a central config.json file. This file specifies the model, data paths, LoRA configuration, RL parameters, and more. See src/utils/config.py for detailed field descriptions.
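As a quick sanity check, the snippet below just loads the config and prints its top-level keys; the field names themselves are defined by src/utils/config.py, not here.

```python
import json

# Load the training config and list its top-level sections (model, data paths,
# LoRA settings, RL parameters, ...). The authoritative schema lives in
# src/utils/config.py.
with open("config/grpo_lora_config.json") as f:
    cfg = json.load(f)

for key, value in cfg.items():
    print(f"{key}: {value!r}")
```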
The reward function uses word entropy to encourage smart opening guesses. You can pre-calculate this for the entire word list.
python -m scripts.calculate_word_entropy_mlx
Results are saved to data/word_entropy.json.
To train the model effectively, we generate synthetic game data. This provides the model with partially completed games (0-4 turns of history), allowing it to learn from various states instead of getting stuck on opening moves.
python -m scripts.data_synth --mode rl
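For intuition, here is a minimal sketch of what synthesizing a partially completed game can look like: pick a secret, play 0-4 throwaway guesses, and record the history. The helper names and record layout are illustrative assumptions; the actual generator is scripts/data_synth.py.

```python
import random

def feedback(guess: str, secret: str) -> str:
    """Same two-pass feedback helper as in the earlier sketch (✓ / - / x)."""
    pattern, remaining = ["x"] * 5, list(secret)
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            pattern[i], remaining[i] = "✓", None
    for i, g in enumerate(guess):
        if pattern[i] == "x" and g in remaining:
            pattern[i] = "-"
            remaining[remaining.index(g)] = None
    return "".join(pattern)

def synth_example(words, max_history=4):
    """One training example: a secret plus 0-4 turns of already-played history."""
    secret = random.choice(words)
    history = []
    for _ in range(random.randint(0, max_history)):
        guess = random.choice(words)
        history.append({"guess": guess, "feedback": feedback(guess, secret)})
    return {"secret": secret, "history": history}

words = ["stars", "tares", "fiord", "crane", "rotas"]
print(synth_example(words))
```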
Before starting a long training run, it’s recommended to clear your system’s memory cache to prevent slowdowns from memory swapping.
sudo purge
Start the RL training process using the desired configuration file.
python -m scripts.train_gemma_rl --config ./config/grpo_lora_config.json
A LoRA adapter trained for 500 steps is available on the Hugging Face Hub. You can download it and run side-by-side comparisons against the base model.
# Run a single game (6 turns)
python -m scripts.play_sxs
# Run a full evaluation across 150 games
python -m scripts.evaluation_sxs
Visualize the cumulative wins and loss curves from a training log file.
python -m scripts.plot_cumulative_wins --file ./logs/your_metrics_file.jsonl
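If you want to reproduce the plot by hand, a rough sketch is below. It assumes each JSONL record carries a boolean win field, which is an assumption about the log schema; adapt the key to whatever the training script actually writes.

```python
import json
import matplotlib.pyplot as plt

# Read one metrics record per line and accumulate wins over games.
wins = []
with open("logs/your_metrics_file.jsonl") as f:
    for line in f:
        record = json.loads(line)
        wins.append(1 if record.get("win") else 0)  # "win" key is an assumption

cumulative = [sum(wins[: i + 1]) for i in range(len(wins))]
plt.plot(cumulative)
plt.xlabel("Game")
plt.ylabel("Cumulative wins")
plt.show()
```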
Run the unit tests:
python -m unittest
The core of this project is the reward function, which guides the agent to become a proficient Wordle player. It’s a combination of strong penalties (the “stick”) for breaking rules and positive bonuses (the “carrot”) for strategic play.
The system calculates two primary values for each guess: a total reward for the guess itself, and a game_score adjusted by penalties for efficiency (like turn count and response length), which is used directly to update the model during training.

The total reward is a composite of several components, categorized into penalties for mistakes and bonuses for good strategy; a minimal sketch of how they combine appears after the lists below.
These are strong negative rewards designed to quickly teach the agent the fundamental rules and constraints of the game.
- format_fail_penalty: A large penalty is applied if the model's response does not contain a valid 5-letter word. This is the most basic rule.
- repetition_penalty: The agent is penalized for guessing a word it has already used in the current game.
- green_position_penalty: Not placing a known green letter in its correct spot.
- yellow_letter_penalty: Failing to include a known yellow letter anywhere in the guess.
- gray_letter_penalty: Using a letter that has previously been identified as gray.
- not_in_dictionary_penalty: A penalty is applied if the guess is a 5-letter word but is not in the official Wordle dictionary.

These are positive rewards designed to encourage intelligent, information-seeking behavior.
- solution_correct_guess: A large, positive reward is given for correctly guessing the secret word, as this is the ultimate objective.
- valid_guess_base: Any valid, non-losing guess receives a small base reward to encourage participation.
- information_gain_bonus_coeff: For the first guess, the agent is rewarded based on a pre-calculated entropy score for its chosen word. This encourages the use of optimal starting words (like “SOARE” or “CRANE”) that are statistically most likely to reveal the most information about the secret word.
- new_letter_bonus: After the first turn, the strategy shifts to rewarding exploration. The agent receives a bonus for each new, previously unused letter it includes in its guess. This encourages the agent to use its turns to test new characters and narrow down the possibilities.
- possibility_reduction_bonus: This is a direct reward for making an informative guess. The system calculates the number of possible remaining answers before and after the current guess. The reward is proportional to the percentage of possible solutions that were eliminated by the guess, directly incentivizing moves that prune the search space effectively.

These are “soft” penalties designed to refine the agent's behavior, encouraging it to be not just correct, but also efficient.
- green_reuse_penalty: A penalty for placing a known green letter in its correct spot again. That letter slot is already “solved,” so it should be used to test a new letter if possible.
- yellow_reuse_penalty: A penalty for using a known yellow letter in a guess. This encourages the agent to use “eliminator” words with all-new letters to discover more greens and yellows, rather than just rearranging known yellows.
- time_penalty_per_guess: A small, constant penalty is applied for every guess made. This incentivizes the agent to solve the puzzle in as few turns as possible.
- length_penalty_per_token: A minor penalty is applied based on the total number of tokens in the model's generated response (including its reasoning). This encourages concise output.
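To tie the pieces above together, here is a minimal sketch of how such a composite reward could be assembled. The coefficient names mirror the config keys listed above, but the function signature, the state bookkeeping, and the early-return ordering are illustrative assumptions rather than the project's actual reward code; penalty coefficients are assumed to be negative, and the soft reuse penalties are omitted for brevity.

```python
def compute_reward(guess, state, cfg):
    """Sketch of a composite Wordle reward. `state` bundles known greens/yellows/grays,
    prior guesses, candidate counts and response length; `cfg` holds the coefficients
    named above (penalty values assumed negative)."""
    # Hard rule violations dominate everything else.
    if guess is None or len(guess) != 5 or not guess.isalpha():
        return cfg["format_fail_penalty"]
    if guess in state["guessed_words"]:
        return cfg["repetition_penalty"]
    if guess not in state["dictionary"]:
        return cfg["not_in_dictionary_penalty"]

    reward = cfg["valid_guess_base"]

    # Penalize ignoring known constraints.
    reward += cfg["green_position_penalty"] * sum(
        1 for pos, letter in state["greens"].items() if guess[pos] != letter)
    reward += cfg["yellow_letter_penalty"] * sum(
        1 for letter in state["yellows"] if letter not in guess)
    reward += cfg["gray_letter_penalty"] * sum(
        1 for letter in state["grays"] if letter in guess)

    # Strategy bonuses.
    if guess == state["secret"]:
        reward += cfg["solution_correct_guess"]
    if state["turn"] == 0:
        # Reward high-entropy openers using the pre-computed word_entropy.json scores.
        reward += cfg["information_gain_bonus_coeff"] * state["entropy"].get(guess, 0.0)
    else:
        reward += cfg["new_letter_bonus"] * len(set(guess) - state["used_letters"])
        pruned = 1 - state["candidates_after"] / max(state["candidates_before"], 1)
        reward += cfg["possibility_reduction_bonus"] * pruned

    # Efficiency penalties.
    reward += cfg["time_penalty_per_guess"]
    reward += cfg["length_penalty_per_token"] * state["response_tokens"]
    return reward
```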
Training was run for 500 steps using the configuration in config/grpo_lora_config.json.
The model showed a clear learning trend, with the cumulative win rate increasing steadily during both training and evaluation phases.
We evaluated the trained LoRA adapter against the base Gemma-3 model on 150 unseen games. We tested two key variables: game history (starting from scratch vs. a partially completed game) and sampling temperature (deterministic vs. creative guesses).
Providing the model with previous turns gives it crucial context, leading to a dramatic improvement in performance.
Temperature = 0.1 (More Deterministic): The trained model’s choices are more focused, leading to a consistent and significant performance gain over the base model.
Temperature = 0.9 (More Creative): With higher temperature, the model’s guesses are more random. While it still outperforms the base model, its win rate is lower and less consistent compared to the low-temperature setting.
When starting from scratch, the model’s performance drops significantly, highlighting its weakness in developing an optimal opening strategy.
Temperature = 0.1: The LoRA model still shows a slight edge, but the performance for both models is much lower.
Temperature = 0.9: At high temperature and with no history, the strategic advantage is nearly lost, and performance is poor for both models.
Next Steps to Improve Performance:
- Fix the opening move to a high-information word (e.g., CRANE or SOARE) and let the fine-tuned model take over from turn two. This guarantees a strong, information-rich start every time.

Over the course of training a language model to play Wordle using Reinforcement Learning, we encountered and solved a series of progressively more complex challenges. This section summarizes the key technical and strategic lessons from that process.
The majority of our initial debugging was not about AI strategy, but about fundamental software engineering and data integrity. An RL agent cannot learn if its environment is flawed.
- A single, shared game_logic.py file was critical: it eliminated inconsistencies between data generation, training, and evaluation.

An RL agent is a relentless optimizer. It will not learn what you want it to learn; it will learn what you incentivize it to learn. Any loophole in the reward function will be found and exploited.
- The model sometimes produced repetitive filler instead of a guess (e.g., Final Final Final...). It discovered that the penalty for this “lazy” inaction was sometimes less severe than the penalty for making a thoughtful but incorrect guess.
- The fix was to make the format failure penalty (format_fail_penalty) unequivocally the worst possible outcome. This closed the loophole and forced the model to engage with the actual task.

The model's performance is not just a function of its weights, but of the quality and clarity of the input it receives.
- Compact, symbolic feedback strings (e.g., ✓✓xxx) were less effective. The best results came from providing a complete, plain-English “state summary” (Current Knowledge:, Green Letters:, Words Already Guessed:, etc.). Clear, structured, natural language is key.
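As an illustration only, here is a sketch of how such a plain-English state summary could be assembled. The labels come from the lesson above, but the exact wording and field layout of the project's real prompts may differ.

```python
def build_state_summary(greens, yellows, grays, guessed):
    """greens: {position: letter}; yellows/grays: sets of letters; guessed: past words."""
    def fmt(items):
        return ", ".join(items) or "none yet"

    green_txt = fmt(f"{letter.upper()} at position {pos + 1}"
                    for pos, letter in sorted(greens.items()))
    yellow_txt = fmt(sorted(l.upper() for l in yellows))
    gray_txt = fmt(sorted(l.upper() for l in grays))
    guessed_txt = fmt(w.upper() for w in guessed)
    return (
        "Current Knowledge:\n"
        f"Green Letters: {green_txt}\n"
        f"Yellow Letters (in the word, wrong position): {yellow_txt}\n"
        f"Gray Letters (not in the word): {gray_txt}\n"
        f"Words Already Guessed: {guessed_txt}"
    )

# Example state after playing TARES against the secret STARS.
print(build_state_summary({4: "s"}, {"t", "a", "r"}, {"e"}, ["tares"]))
```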
The structure of the training data had a direct and measurable impact on the model's ability to learn.
A key finding was the challenge of training a model with RL without a preceding Supervised Fine-Tuning (SFT) step. While our Rank 16 run proved this is possible, it is a difficult and unstable path.
- Smaller models often failed to produce the required output structure (<think>, <guess>, 5-letter words), indicating they lacked the capacity to learn the structure from the RL signal alone. For smaller models, SFT is likely not just helpful, but necessary.
- Runaway memory swapping (20 GB Swap Used) was crippling the training process. A simple system restart to clear the memory was the fix. The health of the hardware is a critical, non-obvious hyperparameter.
- Increasing num_generations is extremely memory-intensive. This is due to the KV Cache, the model's “working memory.” Each parallel generation requires its own multi-gigabyte KV Cache. Increasing from 2 to 4 generations had a massive memory impact, whereas increasing the LoRA rank was comparatively cheap in terms of RAM. Understanding this trade-off is essential for configuring runs that don't overload your hardware.
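As a rough illustration of why num_generations dominates memory, here is a back-of-the-envelope KV-cache estimate. The architecture numbers below are placeholders, not Gemma-3's exact configuration, and the totals will not match the real runs; the point is that the cache grows linearly with the number of parallel generations.

```python
def kv_cache_gib(num_generations, num_layers=34, num_kv_heads=8,
                 head_dim=256, seq_len=4096, bytes_per_value=2):
    """Estimate KV-cache size: keys + values, per layer, per token, per generation."""
    per_generation = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value
    return num_generations * per_generation / 1024 ** 3

for n in (2, 4, 8):
    print(f"num_generations={n}: ~{kv_cache_gib(n):.1f} GiB")
```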