
MAPPO: When Your Agents Need to Learn Together

Apr 24, 2024

Training one RL agent is hard. Training multiple agents that need to coordinate? That’s a different problem entirely.

The environment keeps shifting because every agent is learning and changing its behavior. What worked five minutes ago doesn’t work now because your teammates (or opponents) have updated their policies. This non-stationarity makes standard RL algorithms unstable or ineffective.

Multi-Agent Proximal Policy Optimization (MAPPO) addresses this by combining centralized training with decentralized execution. During training, a shared critic sees everything and provides stable learning signals. During execution, each agent acts independently using only its local observations.
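To make the split concrete, here is a minimal sketch of that actor/critic structure in PyTorch. It is illustrative only: the class names (Actor, CentralizedCritic), the toy dimensions, and the choice of concatenated local observations as the "global state" are assumptions for the example, and the PPO clipped-surrogate update itself is omitted.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized policy: sees only its own local observation."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        # Action logits from this agent's local observation only.
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Centralized value function: sees the global state during training."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        # Scalar value estimate V(s) of the joint state.
        return self.net(state)

# Toy setup: 3 agents, each with an 8-dim local observation and 4 actions.
n_agents, obs_dim, n_actions = 3, 8, 4
actors = [Actor(obs_dim, n_actions) for _ in range(n_agents)]
# Assumption: the "global state" is just the concatenated local observations.
critic = CentralizedCritic(state_dim=n_agents * obs_dim)

local_obs = [torch.randn(1, obs_dim) for _ in range(n_agents)]

# Execution: each agent acts from its own observation, no communication needed.
actions = [torch.distributions.Categorical(logits=a(o)).sample()
           for a, o in zip(actors, local_obs)]

# Training: the shared critic evaluates the joint state, giving every actor
# a common, more stable baseline for its PPO advantage estimates.
global_state = torch.cat(local_obs, dim=-1)
value = critic(global_state)
print(actions, value.item())
```

In practice, MAPPO implementations often share actor parameters across homogeneous agents and may feed the critic a richer global state than a plain concatenation; the point here is only the asymmetry: the critic gets global information at training time, the actors never do at execution time.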

Why Multi-Agent RL Is Different

In single-agent RL, the environment is fixed (or changes in predictable ways). You can assume the transition dynamics are stationary. Your agent explores, collects data, and gradually improves its policy.

Add multiple learning agents and this assumption breaks. Agent A learns a new strategy. From Agent B’s perspective, the environment just changed — Agent A now behaves differently. Agent B adapts. Now Agent A sees a different environment. Policies chase each other in circles, never converging.
