Meta-RL Induces Exploration in Language Agents
LaMer is a Meta-RL framework for training LLM agents to explore and adapt to the environment at test time. LaMer induces efficient exploration through cross-episode training while enabling in-context policy adaptation via self-reflection, resulting in effective test-time scaling and strong generalization to harder and unseen tasks.
Reinforcement learning (RL) has enabled training large language model (LLM) agents to interact with environments and solve multi-turn, long-horizon tasks. However, RL-trained agents often struggle on tasks that require active exploration and fail to adapt efficiently from trial-and-error experience. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework that encourages exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% gains on Sokoban, MineSweeper, and Webshop, respectively. Moreover, LaMer demonstrates better generalization to more challenging or previously unseen tasks compared to RL-trained agents.
Publication
Meta-RL Induces Exploration in Language Agents
Yulun Jiang*, Liangze Jiang*, Damien Teney, Michael Moor**, Maria Brbić**.
@article{jiang2025metarl,
title={Meta-RL Induces Exploration in Language Agents},
author={Yulun Jiang and Liangze Jiang and Damien Teney and Michael Moor and Maria Brbic},
journal={arXiv preprint arXiv:2512.16848},
year={2025}
}
Key idea: Training LLM Agents with Meta-RL
Our goal is to build agents that can actively explore their environment, gather feedback, and leverage this experience for more effective exploitation. We therefore consider a multi-episode regime, where an episode is the unit of exploration and exploitation. In early episodes, the agent is encouraged to gather diverse experiences and informative feedback from the environment, which are then used to adapt its policy in later episodes. This process aligns with the concept of meta reinforcement learning (Meta-RL), which focuses on "learning to reinforcement learn" in order to rapidly adapt to new tasks and environments. We observe that Meta-RL produces more diverse samples while simultaneously achieving higher performance, reaching a better balance between exploration and exploitation than standard RL baselines.
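To make the regime concrete, the following Python sketch shows one trial as a sequence of episodes on the same task; the names (env, agent, K_EPISODES) and their interfaces are illustrative assumptions rather than the released LaMer API.

# Illustrative sketch of the multi-episode trial regime (assumed interfaces, not the released API).
K_EPISODES = 3  # episodes per trial: early episodes explore, later ones exploit

def run_trial(env, agent, task):
    """Run K episodes on the same task, carrying feedback across episodes."""
    context = [task]              # in-context memory shared across episodes
    episode_rewards = []
    for _ in range(K_EPISODES):
        obs, done, total_reward = env.reset(task), False, 0.0
        while not done:
            action = agent.act(context, obs)   # policy conditioned on all past episodes
            obs, reward, done = env.step(action)
            total_reward += reward
        episode_rewards.append(total_reward)
        # Feedback from this episode becomes part of the prompt for the next one,
        # so later episodes can exploit what earlier episodes discovered.
        context.append(env.feedback())
    return episode_rewards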
LaMer: Cross-episode training and in-context self-adaptation
We present LaMer, a general Meta-RL framework for LLM agent training. LaMer is built on two key design principles. First, it introduces a cross-episode training scheme that treats each trial as a sequence of episodes, enabling the agent to explore in early episodes and exploit this information in later ones. The agent is trained to maximize the cross-episode cumulative reward. Second, LaMer uses self-reflection as an in-context adaptation mechanism, allowing the agent to summarize past experiences and adjust its strategy accordingly without updating the model parameters.
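A minimal sketch of these two components follows, under assumed names (cross_episode_returns, adapt_in_context, gamma_traj, agent.generate); the released implementation may structure this differently.

def cross_episode_returns(episode_rewards, gamma_traj=0.5):
    """Credit each episode with its own reward plus the discounted rewards of
    all later episodes in the trial, so early exploration is rewarded for
    enabling later success. The default gamma_traj here is only an example."""
    returns, running = [], 0.0
    for r in reversed(episode_rewards):
        running = r + gamma_traj * running
        returns.append(running)
    return list(reversed(returns))

def adapt_in_context(agent, context, last_episode):
    """Self-reflection as in-context adaptation: summarize the previous episode
    in natural language and append it to the prompt; no gradient update."""
    reflection = agent.generate(
        context + [last_episode, "Reflect on what went wrong and plan the next attempt."]
    )
    return context + [reflection]

The cross-episode returns are what the training objective maximizes, while adapt_in_context is the only mechanism used at test time, where model parameters stay frozen.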
Evaluation Settings
We evaluate LaMer on four challenging and diverse environments: Sokoban, MineSweeper, Webshop, and ALFWorld. We compare LaMer with prompting methods and RL baselines, and we also evaluate its generalization to harder tasks and to tasks under distribution shift.
Overall results
Meta-RL obtains better performance. Across all three environments, LaMer trained with Meta-RL consistently outperforms both prompting-based baselines and RL-training methods on the final pass@3 success rate, achieving gains of 11%, 14%, and 19% on Sokoban, MineSweeper, and Webshop, respectively. Together, these results demonstrate that LaMer delivers consistent benefits, enabling the trained agents to solve long-horizon tasks in complex environments.
Meta-RL exhibits stronger test-time scaling. LaMer-trained agents show substantial performance gains across attempts. From pass@1 to pass@3, LaMer yields improvements of 13.5% (Sokoban), 30.3% (MineSweeper), and 20.3% (Webshop), significantly larger than those of RL and prompting-based baselines. These results indicate that LaMer has learned to actively explore in earlier episodes and adapt effectively from its mistakes, leading to effective test-time scaling.
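For concreteness, the snippet below computes pass@k as the fraction of tasks solved within the first k sequential attempts; this is our reading of the metric in the multi-episode setting and is shown only to make the scaling comparison explicit.

def pass_at_k(success_per_attempt, k):
    """success_per_attempt: one list of per-attempt booleans per task."""
    return sum(any(attempts[:k]) for attempts in success_per_attempt) / len(success_per_attempt)

# Toy example: three tasks, three attempts each.
outcomes = [[False, True, True], [False, False, False], [True, True, True]]
print(pass_at_k(outcomes, 1))  # 0.333... : only the third task is solved on the first try
print(pass_at_k(outcomes, 3))  # 0.666... : the first task is recovered by the third attempt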
Meta-RL induces exploration. We quantify exploration by estimating the entropy of the empirical distribution over unique sampled trajectories. We observe that the base model exhibits the highest entropy, while standard RL converges to deterministic behavior. In contrast, LaMer preserves significantly higher diversity than the RL baselines, allowing more exploration at test time.
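A small sketch of this exploration measure, assuming each sampled trajectory is represented as a hashable object (e.g., a string of actions):

import math
from collections import Counter

def trajectory_entropy(trajectories):
    """Entropy of the empirical distribution over unique sampled trajectories."""
    counts = Counter(trajectories)
    n = len(trajectories)
    probs = [c / n for c in counts.values()]
    return -sum(p * math.log(p) for p in probs)

# A deterministic policy repeats one trajectory (entropy 0); a diverse policy
# spreads probability mass over many unique trajectories (higher entropy).
print(trajectory_entropy(["a", "a", "a", "a"]))   # 0.0
print(trajectory_entropy(["a", "b", "c", "d"]))   # log(4) ~ 1.386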
Generalization to harder tasks
Next, we evaluate the trained agents on Sokoban and MineSweeper tasks of increased difficulty. As expected, models trained with both RL and Meta-RL perform worse on harder tasks with more boxes or mines in the grid. However, Meta-RL consistently outperforms RL at all difficulty levels. This consistent gap indicates that LaMer trained with Meta-RL not only performs better on the training distribution but also generalizes better to harder tasks.
Generalization to unseen tasks
We further evaluate out-of-distribution (OOD) generalization in the ALFWorld environment. We train the agents on four task types (Pick, Look, Clean, Heat) as in-distribution (ID) and hold out two (Cool, Pick2) as OOD. While standard RL performs well on ID tasks (>20% improvement over prompting), it struggles on OOD tasks. In contrast, LaMer consistently outperforms RL on both ID and OOD tasks, indicating better out-of-distribution generalization.
Influence of Trajectory Discount Factor
The trajectory discount factor controls how rewards propagate across episodes, thereby mediating the balance between exploration and exploitation during training. We analyze the impact of this factor across environments and find that the optimal setting varies by environment. For Sokoban and Webshop, intermediate values yield the best results, suggesting a need to balance immediate and long-term rewards. In contrast, MineSweeper benefits from higher values, indicating that extended credit assignment better supports strategic exploration in this domain. Overall, the trajectory discount factor serves as an effective mechanism for tuning the exploration-exploitation trade-off.
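As a worked example of this mechanism, the snippet below reuses the hypothetical cross_episode_returns sketch from above to show how different discount values redistribute credit when only the final episode of a trial succeeds:

rewards = [0.0, 0.0, 1.0]   # success only in the last episode of the trial

for gamma in (0.0, 0.5, 1.0):
    print(gamma, cross_episode_returns(rewards, gamma_traj=gamma))
# 0.0 -> [0.0, 0.0, 1.0]    episodes are credited independently (standard RL behavior)
# 0.5 -> [0.25, 0.5, 1.0]   early exploration receives partial credit for later success
# 1.0 -> [1.0, 1.0, 1.0]    early episodes fully share the final success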
Code
A PyTorch implementation of LaMer is available on GitHub.
Contributors
The following people contributed to this work: