wordle-lora-rl

Wordle-RL: Training a Language Model to Play Wordle with Reinforcement Learning on Apple Silicon

License: MIT

This project is an exploration into training a Large Language Model (Gemma-3 4B-it) to play the game of Wordle using Reinforcement Learning (RL) with LoRA. The entire training and inference pipeline is optimized to run locally on Apple Silicon using the MLX framework.

The primary goals were to gain hands-on experience with RL, understand the challenges and hardware constraints of local training, and compare RL techniques to traditional Supervised Fine-Tuning (SFT).

Why Wordle? The RL Challenge

What is Wordle?

If you are not familiar with Wordle, the best way to get a feel for it is to play a round: wordle-nyt.

Do we need RL?

While Wordle can be solved deterministically using algorithms based on information theory (as beautifully explained by 3Blue1Brown), it presents a fascinating and constrained environment for Reinforcement Learning.

An algorithmic approach typically works by:

  1. Maintaining a list of all possible secret words.
  2. Choosing a guess that maximizes the expected information gain (entropy), effectively splitting the remaining possibilities as evenly as possible.
  3. Filtering the list of possibilities based on the feedback and repeating the process.

This project takes a different approach: Can we teach a language model to develop a strategic policy for playing Wordle simply by rewarding good moves and penalizing bad ones? This makes it a perfect, self-contained problem for learning and applying RL concepts.

Let’s look at an example where the secret word is “STARS”:

Played 'TARES', got feedback '---x✓'

Possible words remaining: 7 -> ['STARS', 'DRATS', 'BRATS', 'FRATS', 'PRATS', 'ARTIS', 'ROTAS']
Played 'FIORD', got feedback 'xxx✓x'

Possible words remaining: 1 -> ['STARS']

🎉 🎉 🎉 Congratulations! 'STARS' is the correct answer, after 3 plays!

The algorithmic version starts with the following assumptions:

  1. We have a finite list of words (words_list).
  2. We have a finite list of allowed guesses, where allowed_guesses <= words_list.

After each guess, we get feedback on the letter positions, which allows us to keep only the words that are still possible. The first word, ‘TARES’, provides the maximum expected information (~6.3 bits); given its feedback, there are now 7 possible words remaining.

To work out which of the 7 it is, the algorithm proposes a word from the allowed guesses that maximizes the expected information gain. Here that word is ‘FIORD’: after its feedback, we are left with only one remaining word.
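A minimal sketch of this loop, using the same feedback symbols as the trace above (‘✓’ correct position, ‘-’ present elsewhere, ‘x’ absent); the helper names are illustrative, not the ones used in the repo:

```python
from collections import Counter

def feedback(guess: str, secret: str) -> str:
    """Wordle feedback string: '✓' right spot, '-' in word but wrong spot, 'x' absent."""
    result = ["x"] * 5
    # Count secret letters that were not matched exactly; these can still yield '-'.
    unmatched = Counter(s for g, s in zip(guess, secret) if g != s)
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            result[i] = "✓"
        elif unmatched[g] > 0:
            result[i] = "-"
            unmatched[g] -= 1
    return "".join(result)

def filter_candidates(candidates: list[str], guess: str, observed: str) -> list[str]:
    """Keep only the secrets that would have produced the observed feedback."""
    return [w for w in candidates if feedback(guess, w) == observed]

# Example from the trace above: filtering on 'TARES' -> '---x✓' leaves the 7 candidates.
```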

Play Wordle

The Colab notebook scripts/wordle_no_rl.ipynb implements the 3Blue1Brown Wordle approach. Pick a secret word and let the algorithm guess it.

Calculate wordle word entropy

Check out scripts/calculate_word_entropy_mlx.py to calculate the entropy of each word. The results are available in data/word_entropy.json.

These entropy values will be used later in our reward function.
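Conceptually, a word’s entropy is the Shannon entropy of the distribution of feedback patterns it would produce across the candidate words. A simplified sketch of that idea (the actual script, calculate_word_entropy_mlx.py, computes this with MLX), reusing the feedback() helper sketched earlier:

```python
import math
from collections import Counter

def word_entropy(guess: str, candidates: list[str]) -> float:
    """Expected information gain (bits) from playing `guess`:
    the Shannon entropy of the feedback-pattern distribution."""
    patterns = Counter(feedback(guess, secret) for secret in candidates)
    total = sum(patterns.values())
    return -sum((n / total) * math.log2(n / total) for n in patterns.values())

# Per the example above, 'TARES' yields ~6.3 bits against the full word list.
```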

Understanding Policy Optimization
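The config used later (config/grpo_lora_config.json) points to a GRPO-style setup: for each prompt the model samples a group of candidate completions, scores each with the reward function, and uses group-relative advantages to decide which completions to reinforce. A minimal, illustrative sketch of that advantage step (not the project’s actual training code):

```python
import mlx.core as mx

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> mx.array:
    """GRPO-style advantages: each completion's reward is normalized by the
    mean and standard deviation of its own sampling group."""
    r = mx.array(rewards)
    return (r - mx.mean(r)) / (mx.sqrt(mx.var(r)) + eps)

# Completions that beat the group average get a positive advantage (their tokens
# become more likely); below-average completions are pushed down.
```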

The Technology Stack: Why MLX?

This project was developed entirely within the Apple Silicon ecosystem (initially M1, later M4 Pro). While PyTorch is a common choice, I switched to Apple’s MLX framework for several key reasons:

  1. Hardware Compatibility: Training with libraries like Hugging Face TRL often requires bitsandbytes for quantization, which lacks stable support for Apple Silicon (#252). MLX is built from the ground up for unified memory and Apple’s Metal Performance Shaders (MPS).
  2. Enforcing Local Constraints: MLX’s primary focus on Apple Silicon forced me to solve performance and memory issues locally, providing deeper insights into hardware limitations without the easy “escape hatch” of a cloud GPU.
  3. Performance: Early benchmarks suggest MLX can be significantly faster than PyTorch on MPS for certain training workloads (comparison).
  4. Modern API: MLX’s API is inspired by both PyTorch and JAX, making it intuitive and powerful.

This project was trained on a Mac M4 Pro with 48 GB of RAM using the mlx-community/gemma-3-4b-it-bf16 model.
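As a quick smoke test of the stack, the mlx-lm package can load this model and generate a completion. A minimal sketch, assuming a recent mlx-lm release and that the weights have been downloaded (see below):

```python
from mlx_lm import load, generate

# Same model used for training in this project.
model, tokenizer = load("mlx-community/gemma-3-4b-it-bf16")

prompt = "You are playing Wordle. Propose a strong 5-letter opening guess."
print(generate(model, tokenizer, prompt=prompt, max_tokens=64))
```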

Getting Started

1. Setup Environment

Clone the repository and set up a Python virtual environment.

git clone https://github.com/charbull/wordle-lora-rl.git
cd wordle-lora-rl

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

2. Download the Model

You will need to download the Gemma-3 model weights from the Hugging Face Hub. This project uses the 4B-parameter version.

# For full training
hf download mlx-community/gemma-3-4b-it-bf16

# For faster iteration/testing
hf download mlx-community/gemma-3-270m-it-bf16

Note: Update the model path in your config file to point to the downloaded directory.

Usage

The training and inference scripts are controlled by a central config.json file. This file specifies the model, data paths, LoRA configuration, RL parameters, and more. See src/utils/config.py for detailed field descriptions.
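The config is plain JSON, so it is easy to inspect before a run; a small sketch (the key names are deliberately not spelled out here; see src/utils/config.py for the real schema):

```python
import json

with open("config/grpo_lora_config.json") as f:
    cfg = json.load(f)

# The file groups the settings described above: model path, data paths,
# LoRA configuration, and RL hyperparameters.
print(sorted(cfg.keys()))
```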

1. Pre-computation (Optional)

The reward function uses word entropy to encourage smart opening guesses. You can pre-calculate this for the entire word list.

python -m scripts.calculate_word_entropy_mlx

Results are saved to data/word_entropy.json.
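A hedged sketch of how these pre-computed values can then be consumed inside the reward (the JSON format shown in the comment is an assumption; the actual reward code lives in the repo):

```python
import json

with open("data/word_entropy.json") as f:
    word_entropy = json.load(f)  # assumed format: {"tares": 6.3, ...}

def opening_guess_bonus(guess: str, scale: float = 0.1) -> float:
    """Small positive reward for high-information opening guesses (scale is illustrative)."""
    return scale * word_entropy.get(guess.lower(), 0.0)
```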

2. Generate Synthetic Data

To train the model effectively, we generate synthetic game data. This provides the model with partially completed games (0-4 turns of history), allowing it to learn from various states instead of getting stuck on opening moves.

python -m scripts.data_synth --mode rl
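A simplified sketch of what such a synthetic example might look like (names and structure are illustrative, not the script’s actual output format), reusing the feedback() helper from the earlier sketch:

```python
import random

def make_training_example(secret: str, openers: list[str], max_history: int = 4) -> dict:
    """Build a partially played game: 0-4 prior turns, each with its feedback."""
    history = []
    for _ in range(random.randint(0, max_history)):
        guess = random.choice(openers)
        history.append({"guess": guess, "feedback": feedback(guess, secret)})
    return {"secret": secret, "history": history}
```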

3. Clear System Cache

Before starting a long training run, it’s recommended to clear your system’s memory cache to prevent slowdowns from memory swapping.

sudo purge

4. Run Training

Start the RL training process using the desired configuration file.

python -m scripts.train_gemma_rl --config ./config/grpo_lora_config.json

5. Evaluate a Pre-Trained Model

A LoRA adapter trained for 500 steps is available on the Hugging Face Hub. You can download it and run side-by-side comparisons against the base model.

# Run a single game (6 turns)
python -m scripts.play_sxs

# Run a full evaluation across 150 games
python -m scripts.evaluation_sxs

6. Plot Training Metrics

Visualize the cumulative wins and loss curves from a training log file.

python -m scripts.plot_cumulative_wins --file ./logs/your_metrics_file.jsonl
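Each line of the metrics file is a JSON record, so you can also inspect it directly; a rough sketch (the "win" field name is an assumption, check the log for the real keys):

```python
import json
import matplotlib.pyplot as plt

wins = []
with open("logs/your_metrics_file.jsonl") as f:
    for line in f:
        record = json.loads(line)
        wins.append(1 if record.get("win") else 0)  # "win" is an assumed key

cumulative = [sum(wins[: i + 1]) for i in range(len(wins))]
plt.plot(cumulative)
plt.xlabel("training step")
plt.ylabel("cumulative wins")
plt.show()
```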

7. Run Unit Tests

python -m unittest

The Reinforcement Learning Strategy

The core of this project is the reward function, which guides the agent to become a proficient Wordle player. It’s a combination of strong penalties (the “stick”) for breaking rules and positive bonuses (the “carrot”) for strategic play.

The system calculates two primary values for each guess:

  1. Game Score: A score that reflects the quality of the Wordle guess itself.
  2. Training Reward: The game_score adjusted by penalties for efficiency (like turn count and response length), which is used directly to update the model during training.

The total reward is a composite of several components, categorized into penalties for mistakes and bonuses for good strategy.

1. Penalties for Rule Violations and Mistakes (The “Stick”)

These are strong negative rewards designed to quickly teach the agent the fundamental rules and constraints of the game.

2. Bonuses for Strategic Play (The “Carrot”)

These are positive rewards designed to encourage intelligent, information-seeking behavior.

3. Penalties for Inefficiency

These are “soft” penalties designed to refine the agent’s behavior, encouraging it to be not just correct, but also efficient.
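Putting the three categories together, the overall shape of the computation is roughly the following (a sketch only; the constants and exact checks are illustrative, not the repo’s actual values):

```python
def game_score(valid_word: bool, respects_feedback: bool, info_gain_bits: float, solved: bool) -> float:
    """The 'stick' for rule violations, the 'carrot' for strategic, information-seeking play."""
    if not valid_word or not respects_feedback:
        return -1.0                   # illustrative rule-violation penalty
    score = 0.1 * info_gain_bits      # illustrative bonus for strategic play
    if solved:
        score += 1.0                  # illustrative win bonus
    return score

def training_reward(game_score_value: float, turn_number: int, response_tokens: int) -> float:
    """Training reward = game score minus soft efficiency penalties (turn count, response length)."""
    return game_score_value - 0.05 * turn_number - 0.001 * response_tokens
```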

Results and Analysis

Training was run for 500 steps using the configuration in config/grpo_lora_config.json.

Training Performance

The model showed a clear learning trend, with the cumulative win rate increasing steadily during both training and evaluation phases.

[Figure: Training vs. Eval Cumulative Wins]
[Figure: Training Loss Curve]

Evaluation: LoRA vs. Base Model

We evaluated the trained LoRA adapter against the base Gemma-3 model on 150 unseen games. We tested two key variables: game history (starting from scratch vs. a partially completed game) and sampling temperature (deterministic vs. creative guesses).

With Game History (Starting from Turns 1-4)

Providing the model with previous turns gives it crucial context, leading to a dramatic improvement in performance.

Without Game History (Starting from Turn 1)

When starting from scratch, the model’s performance drops significantly, highlighting its weakness in developing an optimal opening strategy.

Analysis and Key Findings

  1. Game History is Crucial: The model’s primary strength is using constraints from previous turns. Its performance is dramatically better when it has context to work with.
  2. Low Temperature Wins: For a logical puzzle like Wordle, a lower sampling temperature (e.g., 0.1) consistently yields better results. The deterministic, high-probability choices are more effective than the creative, random guesses introduced by a high temperature.
  3. Weak Opening Strategy: The model is effective at deduction but has not learned an optimal opening strategy. Its performance is highly dependent on its default first guess, which explains the poor results when starting without history.

Next Steps to Improve Performance:


Lessons Learned: Training a Wordle-Solving RL Agent

Over the course of training a language model to play Wordle using Reinforcement Learning, we encountered and solved a series of progressively more complex challenges. This document summarizes the key technical and strategic lessons from that process.

Lesson 1: The System is the Foundation. Get it Right First.

The majority of our initial debugging was not about AI strategy, but about fundamental software engineering and data integrity. An RL agent cannot learn if its environment is flawed.

Lesson 2: RL is a Battle Against “Reward Hacking”

An RL agent is a relentless optimizer. It will not learn what you want it to learn; it will learn what you incentivize it to learn. Any loophole in the reward function will be found and exploited.

Lesson 3: Prompt Engineering is a High-Impact Lever

The model’s performance is not just a function of its weights, but of the quality and clarity of the input it receives.

Lesson 4: Data and Curriculum Drive the Learning Curve

The structure of the training data had a direct and measurable impact on the model’s ability to learn.

Lesson 5: “Straight to RL” is a High-Wire Act

A key finding was the challenge of training a model with RL without a preceding Supervised Fine-Tuning (SFT) step. While our Rank 16 run proved this is possible, it is a difficult and unstable path.

Lesson 6: Know Your Hardware and Its “Hidden” Bottlenecks