
Multi-Agent Reinforcement Learning Soft Introduction: Cooperation

Whether you are a student in Georgia Tech's OMSCS CS 7642 Reinforcement Learning course or a curious learner, this article aims to discuss the high-level concepts of Multi-Agent Reinforcement Learning.

Here are some of the topics that will be discussed:

  • Quick overview of Cooperative, Competitive, and Mixed Environments
  • Quick overview of Game Theory
  • Quick overview of the Difficulties of Multi-Agent Reinforcement Learning problems
  • Quick overview of the following: Exploration, Reward Shaping, Global Reward Sharing, Reward Gaming, Centralized Learning and Decentralized Execution

This article is part of a series of articles for the OMSCS CS 7642 Reinforcement Learning class.

OMSCS Reinforcement Learning Series:

  1. Georgia Tech Reinforcement Learning: Preparing for Success
  2. Single Agent Reinforcement Learning: Basic Concepts and Terminologies
  3. Turbocharging Advantage Actor-Critic with Proximal Policy Optimization
  4. Advantage Actor-Critic with Proximal Policy Optimization: A Journey Through Code
  5. Multi-Agent Reinforcement Learning Soft Introduction: Cooperation

Friendly Caution

Hello dear reader! I'm thrilled that I get to share my knowledge. However, I am still a student, journeying through the complex field of computer science. While I strive to present information that is accurate and insightful, there may be instances where I misinterpret or oversimplify concepts. Please approach this article with a curious and critical mind and consult the learning material sections for the most accurate and comprehensive information available.

Multi-Agent Reinforcement Learning

In Multi-Agent Reinforcement Learning (MARL) Environments, agents may have different objectives, and their actions can influence each other's outcomes. This introduces complex dynamics, such as cooperation, competition, coordination, and communication.

MARL is complicated because even in a completely deterministic environment where transitions are 100% predictable, multiple agents will turn what was once a deterministic environment into a stochastic environment.

This is because, from each agent's point of view, the other agents are assumed to be a never-changing or static part of the environment. However, that is not true.

Consider two agents, each with their own individual policy. As they learn and update their policies based on their observations, they don't account for each other's policies. This leads to a situation where each agent's policy quickly becomes outdated as the other agents update theirs. Ultimately, this creates a feedback loop where no agent can effectively learn.

In the end, the goal of multi-agent reinforcement learning problems is to find a way for all agents in the environment to learn effectively in what will always be a stochastic setting.


Additional Terminologies

MARL introduces some additional terminology that expands on the basic terminology used in single-agent environments.

Agent Space?

In the context of multiple agents, a term is needed to represent a single agent out of the group. While the specific term may vary depending on the paper or context, for the purpose of this article, I will denote an agent in the agent space as \(i \in I\).

Joint Transition Probability

In the context of cooperative play, at each timestep every agent selects an action, and all actions are executed simultaneously in the environment.

This concept is important because it determines the size of the joint policy space and hints at how complex finding an optimal policy can become.

A joint transition probability is denoted as

$$ \large J(s^{i}_{t}, a^{i}_{t}) \rightarrow (s^{i}_{t+1}, r^{i}_{t+1})$$ where \(i\) denotes the identity of an individual agent, \(s^{i}_{t}\) is the state of the individual agent at timestep \(t\), \(a^{i}_{t}\) is the action of an individual agent at timestep \(t\), and \(r^{i}_{t+1}\) is the reward an individual agent receives for taking that action in that state.
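
To make the notation concrete, here is a minimal sketch in plain Python of what a joint transition looks like at a single timestep; the agent ids, state types, and toy dynamics are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical types: states, actions, and rewards are keyed by agent id i ∈ I.
State, Action, Reward = int, int, float

def joint_step(
    states: Dict[str, State],      # s_t^i for each agent i
    actions: Dict[str, Action],    # a_t^i chosen simultaneously by each agent i
) -> Dict[str, Tuple[State, Reward]]:
    """Toy joint transition J(s_t^i, a_t^i) -> (s_{t+1}^i, r_{t+1}^i).

    In a real environment, the next state and reward of agent i depend on the
    *joint* action of all agents, which is what makes the problem appear
    stochastic from any single agent's point of view.
    """
    results = {}
    for agent_id, state in states.items():
        # Placeholder dynamics: next state depends on every agent's action.
        joint_effect = sum(actions.values())
        next_state = state + actions[agent_id] + joint_effect
        reward = 1.0 if next_state % 2 == 0 else 0.0
        results[agent_id] = (next_state, reward)
    return results

# Example: two agents acting simultaneously at one timestep.
print(joint_step({"agent_0": 0, "agent_1": 3}, {"agent_0": 1, "agent_1": 2}))
```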

Game Theory

Many concepts discussed and derived in Multi-Agent Reinforcement Learning (MARL), a subfield of Artificial Intelligence, have roots in Game Theory, a field born from economics that studies strategic decision-making.

Game Theory assumes that all players in a game act and behave rationally; under that assumption, it becomes possible to reason about which actions or strategies a player will take when playing a game.

The concepts in MARL and Game Theory intersect heavily; the two fields simply use different terminology to discuss similar ideas.

For example:

  • RL's policy and timesteps are Game Theory's strategy and stages.
  • Deterministic and stochastic policies in RL are called pure and mixed strategies in Game Theory.
  • Where RL speaks of an optimal policy, Game Theory may speak of a Nash Equilibrium.
  • RL's fully observable states correspond to Game Theory's perfect information.
  • RL's maximum cumulative return is Game Theory's maximum payoff, and RL's value function becomes Game Theory's payoff function.

Understanding the intersection of MARL and Game Theory, despite their different terminologies, can significantly enhance your ability to comprehend how strategies are formed in multi-player games. This knowledge can be directly applied to real-world scenarios, making your understanding of MARL concepts more practical and impactful.


Types of Games

Game Theory has ways of classifying games. There are two broad forms: static games and dynamic games. Static games are those in which players act once and the game ends immediately. Dynamic games, on the other hand, are games in which players take turns sequentially or act repeatedly, with moves usually categorized as occurring at some specific timestep. Dynamic games are the ones that tend to be actively researched: Go, Chess, League of Legends, etc.

One concept that underpins this is the idea of common knowledge: information that all players know, that all players know the others know, and so on, such as it being common knowledge that player 1 will utilize strategy 1.

There are many types of games, but I will go over a few that you may see in the RL literature.

Extensive Form Games

Extensive-form games are simply games where players take sequential turns making moves. In these games, there is a static order for when players can take action and a set of rules known to everyone about the game. Every action changes the state of the game, which in turn changes the way players behave in future turns.

Consider games like Go or Chess, where players must anticipate their opponents' moves based on their own actions. This strategic depth is a hallmark of extensive form games, making them a stimulating challenge for players of all levels.

Games in extensive form can be represented as a game tree. In RL and game AI, these trees are typically searched with techniques such as Monte Carlo Tree Search or minimax.

Lastly, extensive form games are ones where the sequence of actions players make matters in the decision process.
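
For a concrete picture of how a game tree is searched, here is a minimal minimax sketch over a tiny hand-built, two-level game tree; the tree, payoffs, and representation are invented purely for illustration (it is not a full MCTS implementation):

```python
# Minimal minimax over a tiny hand-built game tree (illustrative only).
# A node is either a numeric payoff (leaf, from player 1's perspective)
# or a list of child nodes reachable by the player to move.

def minimax(node, maximizing: bool) -> float:
    if isinstance(node, (int, float)):          # leaf: payoff for player 1
        return node
    child_values = [minimax(child, not maximizing) for child in node]
    return max(child_values) if maximizing else min(child_values)

# A depth-2 game: player 1 moves first, player 2 responds.
game_tree = [
    [3, 5],    # if player 1 picks move A, player 2 chooses between payoffs 3 and 5
    [2, 9],    # if player 1 picks move B, player 2 chooses between payoffs 2 and 9
]

# Player 2 minimizes, so move A yields 3 and move B yields 2; player 1 prefers A.
print(minimax(game_tree, maximizing=True))  # -> 3
```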

Strategic Form Games

Strategic form, also called normal form, is a game in which we care more about the "strategy" than the set of specific sequential actions that lead to particular outcomes. The goal is to find the probability distribution of one or more strategies in the strategy space that can lead a player to victory. Simply put, strategic games are only concerned with some abstract strategy that leads to winning or losing rather than the sequence of specific actions that lead to those outcomes. Strategic games are ones where the payoffs for all players depend on all players' actions and where payoffs are interdependent.

In two-player games, these types of games are typically represented as a matrix or a table. The row and column of the tables represent the entire action space for players one and two, respectively. The intersection of the actions shows the payoff for both players when those actions occur. From this action space, we can determine a "strategy," which is a probability distribution of actions that will be selected to lead to the best overall outcome.

The most straightforward strategic game is rock, paper, scissors (or jankenpon). Each player must choose a "strategy" from the available strategies. In this case, the best plan is to pick rock, paper, and scissors each with probability \( \frac{1}{3} \). A rational opponent, perceiving this strategy, will adopt the same one, which leads both players to the best long-term outcome available to them: a draw.
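
As a small illustration, here is rock, paper, scissors written as a payoff matrix for the row player, together with the expected payoff of the uniform \( \frac{1}{3} \) mixed strategy; this is a NumPy sketch using the usual win/draw/loss encoding, not code from any particular library:

```python
import numpy as np

# Row player's payoff matrix: rows/cols are (rock, paper, scissors).
# +1 = win, 0 = draw, -1 = loss, from the row player's perspective.
payoff = np.array([
    [ 0, -1,  1],   # rock     vs (rock, paper, scissors)
    [ 1,  0, -1],   # paper    vs ...
    [-1,  1,  0],   # scissors vs ...
])

uniform = np.array([1/3, 1/3, 1/3])      # the mixed strategy (1/3, 1/3, 1/3)

# Expected payoff when both players use the uniform mixed strategy.
print(uniform @ payoff @ uniform)        # -> 0.0, the long-run draw

# Even against a biased opponent (e.g., 80% rock), uniform still averages 0:
biased = np.array([0.8, 0.1, 0.1])
print(uniform @ payoff @ biased)         # -> 0.0 (uniform cannot be exploited)
```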

Coalitional Form Games

Coalitional form games, also called characteristic function form games, are simply games in which the goal is for all players to work in one or multiple groups to achieve some goal. In a sense, they're games in which cooperation is the primary focus, and the focus is mainly on the group's collective payoffs rather than the strategy that leads to those payoffs.

Think of cooperative puzzle games such as an escape room where the primary goal is for all members in the game to succeed by solving the puzzle.

Bayesian Games

In Bayesian games, players have incomplete information about the game's state and other players' strategies. Common knowledge about player strategies and the state of the game is therefore uncertain, even though the rules of the game, the available actions, and the possible payoffs are known; players must instead maintain beliefs about the world and about opponents' actions.

In this type of game, players will have to create beliefs about the uncertainties of the game world and use that to create a probability distribution over the strategies they may take. Beliefs will most likely be created about the state of the world and the strategies of opponents.

Types of Environments

We can classify environments based on how the agents in the environment interact with one another.

Cooperative

In the cooperative setting, all agents work together. Think of this as an escape room game. No one wants to betray or hurt other agents, and the success of one agent contributes to the success of all agents.

Cooperative environments emphasize teamwork to achieve a goal. While it's fine for a single agent to achieve a goal alone, it will typically be more efficient if all agents learn to carry out certain subtasks or subgoals to achieve the true goal.

Ultimately, all agents should learn policies that maximize a collective reward.

Lastly, although a cooperative game may have indirect competition, such as a three-legged race, there are no agents working adversarially against any other agent in the environment.

This article, in particular, emphasizes understanding MARL in the context of cooperative play.

Competitive

The complete opposite of a cooperative setting, the competitive setting is where all agents seek to maximize their rewards while minimizing the rewards for all other agents in the environment.

This means that the success of one agent requires that other agents fail. Think of Chess, Go, or a free-for-all competitive video game, where each agent individually seeks to win, and by doing so, all other agents must lose.

In a two-player setting, this would be considered a zero-sum game: whatever one player gains, the other loses.

Mixed

Lastly, a mixed environment shares the elements of both cooperative and competitive environments. Think of team-based games such as League of Legends, Economics, or Politics (National, Corporate, University, etc).

Agents in these environments must work with some agents and compete against others according to the rules of the environment. Perhaps the environment is a free-for-all, where it is in an agent's best interest to form alliances and strategically compete against rival alliances. Other times it is a team-based competitive game, such as football or League of Legends, where teammates cooperate with one another while simultaneously competing against an adversarial team.

The complexity of these environments most closely reflects the real world, so this is a newer, challenging, and comparatively less explored area of research.

Types of Player Information

There are two main types of common player information in the MARL space: perfect information and imperfect information.

Perfect Information

When all states are available to a player in the game, it is said that the player has perfect information.

In this scenario, the player is capable of forming accurate models of their allies and opponents' policies, as all relevant information is available to the player. Think of the games of Go and Chess.

Imperfect Information

A player is said to have imperfect information when some states are hidden from them.

In multiplayer video games, the hidden information could be things such as the location of opponents. In a game like poker, it would be the cards opponents are holding. In these cases, agents must rely on past observations and probabilities, both to predict the actions of opponents and to choose their own actions.

Types of Strategies

Based on the type of game, and the information a player has, we can then categorize what type of strategy a player can infer. Think of a strategy as no different than a policy in reinforcement learning.

Pure Strategies

A pure strategy is one where a player is capable of creating a concrete plan for playing a game from beginning to end. Basically, the player knows what move to make based on all of the information available. In the context of RL, this is similar to an agent that always selects the max action for a given state. Another way of thinking about it is that, based on all available information, the player selects a deterministic action.

Imagine a player at a casino who knows the payout behavior of a machine at all times. The machine pays out $100 fifty percent of the time, and the player loses the other fifty percent of the time, but the amount lost is always between $0 and $98. In this scenario, the pure strategy is to keep playing against the machine: even in the worst case (losing the full $98 on every losing play), the expected value is 0.5 × $100 − 0.5 × $98 = $1 per game. This is opposed to the other pure strategy available, which is to not play at all and net $0.

A pure strategy is said to strictly dominate when it yields a higher payoff (or average reward) than every other pure strategy the player could choose, regardless of what the other players do.

Although this scenario is simplistic, it hopefully illustrates what a pure strategy is: the player can pick a single strategy that never changes, because that strategy always benefits the player no matter what strategy the opponent uses.
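
For completeness, here is the casino comparison written out as a tiny calculation; the numbers simply restate the example above:

```python
# Worst-case expected value per game for the "keep playing" pure strategy:
# win $100 with probability 0.5, and in the worst case lose the maximum $98
# with probability 0.5.
p_win, win_amount = 0.5, 100.0
p_lose, worst_loss = 0.5, 98.0

keep_playing = p_win * win_amount - p_lose * worst_loss   # 50 - 49 = 1.0
do_not_play = 0.0

# "Keep playing" strictly dominates "do not play": $1 > $0 per game, even in
# the worst case, so the player never has a reason to change strategies.
print(keep_playing, do_not_play)
```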

Mixed Strategies

A mixed strategy is one where a player selects one of several pure strategies according to a probability distribution. It should also be noted that this probability distribution is independent of the strategies of other players. In the context of RL, this is similar to an agent selecting an action from a probability distribution conditioned on the state.

Think again of rock, paper, scissors. The opponent may adopt a strategy such as only playing rock, or perhaps only playing scissors and rock. The strategy that cannot be exploited, no matter which strategy the opponent uses, is to select rock, paper, or scissors uniformly at random, each 1/3 of the time.

The point of mixed strategies is that there exist games where no single pure strategy strictly dominates, and so a player must sometimes make use of a mixed strategy.

Nash Equilibrium

When studying MARL (Multi-Agent Reinforcement Learning) problems in the literature, you'll likely come across the concept of Nash Equilibrium.

Nash Equilibrium is a fundamental concept in game theory. Simply put, a Nash Equilibrium is a state in which no player in the game can benefit by unilaterally changing their strategy, given that all other players' strategies remain the same. In other words, in a Nash Equilibrium, each player's strategy is the best response to the strategies of all other players.

In MARL, agents take actions while "considering" the actions of other agents in the environment. This decision process mirrors game theory, where all agents are rational and seek to make the best decisions for themselves. In MARL, all agents should aim to find the best possible strategic policy, ideally leading to a Nash Equilibrium where no agent, or team, can improve their outcome by changing their strategy alone.
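
To ground the definition, here is a small brute-force check for pure-strategy Nash equilibria in a two-player matrix game, using the classic Prisoner's Dilemma payoffs as a stand-in example (a sketch for intuition, not a general equilibrium solver):

```python
import numpy as np

# Prisoner's Dilemma payoffs: index 0 = cooperate, 1 = defect.
# payoff_1[i, j] is player 1's payoff when player 1 plays i and player 2 plays j.
payoff_1 = np.array([[-1, -3],
                     [ 0, -2]])
payoff_2 = payoff_1.T            # symmetric game

def pure_nash_equilibria(p1, p2):
    """Return all (i, j) where neither player can gain by deviating alone."""
    equilibria = []
    for i in range(p1.shape[0]):
        for j in range(p1.shape[1]):
            best_for_1 = p1[i, j] >= p1[:, j].max()   # player 1 can't improve given j
            best_for_2 = p2[i, j] >= p2[i, :].max()   # player 2 can't improve given i
            if best_for_1 and best_for_2:
                equilibria.append((i, j))
    return equilibria

print(pure_nash_equilibria(payoff_1, payoff_2))  # -> [(1, 1)]: both defect
```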


Difficulties of MARL Problems

Before diving into the foundational concepts of doing well in MARL settings, we need to first understand what issues they aim to solve.

As you may already know, MARL makes everything about single-agent reinforcement learning even harder. MARL introduces extra problems on top of the problems that were inherent in single-agent reinforcement learning.

Stochasticity

The main issue in Multi-Agent Reinforcement Learning (MARL) is that multiple agents turn any deterministic problem into a stochastic one.

In single-agent problems, an agent updates its policies based on the Markov Decision Process (MDP) of the world. Although the MDP is unknown to both us and the agent, it remains constant. This constancy allows the single agent to update its view of the world and continue learning until it converges to an optimal policy.

However, once multiple agents are introduced, the MDP of the world becomes dynamic. This is because all other agents are continuously updating their policies, behaviors, and interactions with the world. As a result, the MDP is always changing because agents are constantly altering their behavior.

An always-changing MDP is akin to a moving target. When an agent learns a new policy, that policy may eventually become obsolete because it was tailored to an older version of the MDP.

An effective algorithm for agent learning should address the problem of ever-changing individual agent viewpoints. One such solution is the foundational concept called Centralized Learning, Decentralized Execution (CLDE).

Increased Action Space

The problem MARL faces is that the joint action space explodes as the number of actions, agents, or timesteps increases: with \(|A|\) actions per agent, \(|I|\) agents, and \(T\) timesteps, there are \(|A|^{|I|}\) joint actions at every step and \((|A|^{|I|})^{T}\) possible joint action sequences.

As you can see, this is a problem for all agents, complicating their learning. It also demonstrates that a single state-action pair can lead to different results based on the actions of other agents in the environment.

Moreover, as the action space increases, so does the complexity of learning for all agents in the environment.

One should consider how to address this issue in the choice of algorithm.
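
To get a feel for these numbers, here is a tiny calculation for a hypothetical setting with 5 actions per agent, 3 agents, and a 10-step horizon; the values are made up purely for illustration:

```python
# Size of the joint action space for |A| actions, |I| agents, and T timesteps.
num_actions, num_agents, horizon = 5, 3, 10   # hypothetical values

joint_actions_per_step = num_actions ** num_agents          # |A|^|I| = 125
joint_action_sequences = joint_actions_per_step ** horizon  # (|A|^|I|)^T

print(joint_actions_per_step)    # 125 joint actions at every single timestep
print(joint_action_sequences)    # ~9.3e20 possible joint action sequences
```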

Scalability Issues

The increased action space also leads to a more generic issue of scalability.

For example, consider a solution that takes in the states of multiple agents and outputs the joint actions for multiple agents. As the state space increases, so does the complexity of the algorithm that produces the joint action. In the case of a simple neural network, this could lead to an explosion in the network's width or depth.

The general understanding is that no algorithm can scale better than a purely decentralized learning approach, where each agent takes care of its own learning.

However, centralized learning, decentralized execution offers scalability comparable to a purely decentralized approach while also reducing the amount of learning required per agent.

Sparse Rewards

Although sparse rewards are not unique to MARL, they are still a significant problem.

When designing rewards for the environment, one should also consider how difficult it is for agents to receive rewards.

The concept of learning requires that an agent be capable of reaching the true goal and, therefore, receiving a reward signal.

However, in many complex problems, it is often nearly impossible for an agent to stumble upon a reward when the true goal requires multiple subgoals to be achieved first.

Take, for example, the task of making coffee. Although the true goal may be a completed cup of coffee, it could be difficult for an agent to learn this when placed in a room full of useless objects that are not necessary for making coffee.

In this case, one would need to consider some form of reward shaping, where subgoals lead to the true goal.

We can reward an agent for grabbing certain items in the environment or, perhaps, reward an agent for approaching certain objects in the environment.

How one designs the reward structure of the environment is an open-ended question, with the answer depending on the problem domain.

This concept is not new; its roots stem from animal behavior theory and operant conditioning. However, this simple concept has formed the basis for better learning-task frameworks such as Curriculum Learning.

Lastly, as Andrychowicz et al. wrote in their seminal paper "Hindsight Experience Replay", sometimes the issue with sparse rewards and/or large state spaces is not a lack of diversity in exploration, but rather the impracticality of exploration in hard conditions (Section 3.1). Basically, one should consider alternatives to the simple exploration heuristics used in simpler problems.


Non-Unique Learning Goals and Rewards

In a cooperative setting, there are many situations where, despite multiple agents in the environment, their goals may be non-unique.

Take, for example, the multiplayer game League of Legends, which is in a sense a mixed environment. Each team consists of five players, each with their own unique role in the game. Despite the collective goal of beating the other team, their individual goals and subgoals may not align perfectly.

Consider a player with the healer role. A healer's goal may be to heal teammates rather than attack opponents. This contrasts with a player in the mage role, whose goal is to deal massive amounts of damage to the opponents' strongest character, which in turn contrasts with the player in the tank role, whose goal is to protect the strongest character on their own team.

Although all roles share the collective goal of winning the game, their subgoals and, therefore, their learning tasks are completely different.

From that point of view, it makes intuitive sense that each agent should be rewarded differently based on how they explore and interact with the world.

A healer may need to be penalized for attacking opponents, whereas a mage may be rewarded for the same action in the exact same situation.

This leads to a very open-ended question: How should a programmer design the reward shaping of the world?

Should the design of reward shaping be done manually, utilizing domain knowledge? Perhaps this reward shaping should be done programmatically, through concepts and techniques already used in the research field of choice.

Again, this is something to consider, and as always, the answer is highly dependent on the problem domain.

Partially Observable Markov Decision Process

Although a more advanced concept in RL, a Partially Observable Markov Decision Process (POMDP) is an MDP where some information about the state is hidden from the player. Many challenging and intriguing problems exist in this space.

Take, for example, the game of Poker. When watching a game of poker, the audience is privy to the entire state space, meaning the audience knows exactly what cards each player is holding.

However, each player only sees the cards they are holding and the cards on the table. Players cannot see the opponents' cards, and therefore some parts of the state are hidden from them.

How does one form a strategy when they don't know the full state space? The answer comes from observations, the history of seen observations from specific states, and most importantly, probabilities.

Literature in the field does converge to a general answer: generate beliefs over the state space and over the policies of allies and opponents. From there, the agent can create a mixed strategy over actions, one that changes as its beliefs about the state space and about ally and opponent policies change.
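
As a minimal sketch of the belief idea (nothing close to a full POMDP solver), here is a discrete Bayesian belief update over a hypothetical hidden state, such as which hand an opponent might be holding; the likelihood numbers are made up for illustration:

```python
import numpy as np

def update_belief(belief: np.ndarray, likelihood: np.ndarray) -> np.ndarray:
    """One Bayesian belief update: posterior ∝ likelihood * prior."""
    posterior = likelihood * belief
    return posterior / posterior.sum()

# Hypothetical: the opponent holds one of three hidden hands, initially equally likely.
belief = np.array([1/3, 1/3, 1/3])

# We observe the opponent raising the bet; suppose each hidden hand makes that
# observation more or less likely (these likelihoods are invented for illustration).
p_raise_given_hand = np.array([0.9, 0.5, 0.1])

belief = update_belief(belief, p_raise_given_hand)
print(belief)  # belief shifts toward the strong hand; a mixed strategy can condition on it
```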

This is an important consideration, but it is beyond the scope of this article's soft introduction.

RL Is Still Trying to Understand

Also consider that academic researchers are still learning more about single-agent reinforcement learning (RL), meaning that RL concepts are not yet fully understood.

Take, for example, the Deadly Triad, introduced by Sutton, which describes why function approximation (such as neural networks) can become unstable when combined with bootstrapping and off-policy learning (e.g., Q-Learning). However, there are algorithms that combine all three elements of the Deadly Triad and still perform at superhuman levels, such as the seminal paper by Mnih et al., "Human-level control through deep reinforcement learning" (2015), which uses an off-policy, bootstrapping algorithm (Q-Learning) in conjunction with neural networks.

Foundational Concepts

When programming algorithms for multiple agents in a cooperative environment, it's crucial to grasp fundamental concepts. This understanding enables one to effectively orchestrate agents to collaborate with teammates and navigate potential pitfalls inherent in such interactions.

Exploration

Exploration is a common theme and problem in reinforcement learning, better known as the exploration-exploitation tradeoff. However, this problem is exacerbated when it comes to MARL.

In a single-agent environment, the goal of the agent is to seek maximum rewards for itself, which in isolation is fine.

But what about in a cooperative setting? Is it still in the best interest of all parties that each agent seeks to maximize its own rewards?

The answer is simply no. Agents should, in some form, consider the rewards of other agents, and therefore, they should consider working together.

In that sense, they should consider exploring together and exploiting together as well.

When it comes to solving a true goal that requires teamwork, it is important that all agents explore the joint transitions together to see which joint transitions lead to the true goal.

To incentivize exploration, we can use techniques such as reward shaping, global reward sharing, and Centralized Learning, Decentralized Execution (CLDE).


Reward Shaping

Reward shaping is a common way of getting an agent to learn faster, and it is especially useful in environments that are naturally sparse in rewards. Sparse simply means that it is difficult for the agent to reach the goal state, given that by default an environment only rewards the agent for achieving the goal.

The concept of reward shaping is simple: introduce supplementary rewards for completing subtasks along the experience trajectory that leads to the agent's true goal.

It sounds simple in principle but is difficult in practice, especially as the problem domain increases in difficulty. Sutton mentions that this is a trial-and-error process with no real format or framework.

However there are benefits to reward shaping. They can both speed up learning and improve exploration. One simple exploration reward shaping strategy is to give the agent a tiny negative reward for any state-action pair that does not lead to the goal or terminal state.

As with most things, there are no free lunches. There are tradeoffs when attempting reward shaping, especially when it comes to the design of the rewards. Reward shaping leans heavily on domain knowledge, as you'll need to understand a lot about the environment in order to design good subtask rewards.
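
As a hedged sketch of what shaped rewards can look like in practice, here is a toy reward function for the coffee example from earlier; the subgoal names and reward values are invented and not a recommended design:

```python
from typing import Optional

# Toy shaped reward for the "make coffee" example: a small step penalty to
# encourage efficiency, small bonuses for hand-picked subgoals, and the
# large reward reserved for the true goal. All values are illustrative.
SUBGOAL_BONUSES = {
    "picked_up_mug": 0.1,
    "picked_up_coffee_grounds": 0.1,
    "reached_coffee_machine": 0.2,
}

def shaped_reward(event: Optional[str], reached_true_goal: bool) -> float:
    if reached_true_goal:
        return 10.0                       # the true goal: a finished cup of coffee
    reward = -0.01                        # tiny per-step penalty to discourage wandering
    if event in SUBGOAL_BONUSES:
        reward += SUBGOAL_BONUSES[event]  # bonus for completing a subgoal
    return reward

print(shaped_reward("picked_up_mug", reached_true_goal=False))  # -> 0.09
```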

Learning Materials:

  • Chapter 17.4 of Reinforcement Learning (Sutton and Barto)

Global Reward Sharing

Cooperative Multi-Agent Reinforcement Learning problems occur when two or more agents in an environment work together to achieve a goal.

In the context of rewards, the agent who achieves the goal is the only one to get the reward, but this may not be desirable in the context of multiple agents. This is because all agents are trying to maximize their rewards during an episode; therefore, they compete with one another.

In a cooperative setting, some subtasks and goals are agent-interdependent, where achieving the goal may require multiple agents working together. Sometimes, one should favor collective success over individual success; in this case, consider sharing rewards amongst multiple agents.
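
A minimal sketch of global reward sharing: each agent's individual reward is replaced by a pooled team signal. The plain average used here is just one of many possible mixing schemes:

```python
from typing import Dict

def share_rewards(individual_rewards: Dict[str, float]) -> Dict[str, float]:
    """Replace each agent's reward with the team average so that one agent's
    success is felt by everyone (which also opens the door to free-riding)."""
    team_reward = sum(individual_rewards.values()) / len(individual_rewards)
    return {agent_id: team_reward for agent_id in individual_rewards}

# Only agent_2 reached the goal this step, but every agent is rewarded.
print(share_rewards({"agent_0": 0.0, "agent_1": 0.0, "agent_2": 3.0}))
# -> {'agent_0': 1.0, 'agent_1': 1.0, 'agent_2': 1.0}
```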

However, sharing rewards can also introduce an issue known as "reward gaming."


Reward Gaming

When you deviate from the standard reward paradigm of 'only the agent who achieves the goal gets the reward', you open the door to a concept known as 'reward gaming.' This is the notion that an agent may start behaving in ways that are not in line with your intentions, potentially leading to unintended outcomes.

For instance, consider a scenario where an agent is tasked with cleaning a room. If the agent is rewarded for every object it moves, it might start moving objects back and forth to maximize its rewards, instead of cleaning the room.

Sometimes, simply through bad luck in reward design, the agent will find a loophole that maximizes the total rewards seen in an episode.

In global reward sharing, you end up with a concept called "free-riding," which is the idea that an agent can choose to do nothing and still see rewards.

Grasping the concept of reward gaming is crucial. Unfortunately, solving this complex issue often requires trial and error and domain knowledge.


Centralized Learning, Decentralized Execution

Centralized Learning and Decentralized Execution (CLDE) is one of the most recommended concepts in the MARL problem space as it eliminates many of the problems and difficulties found in MARL problem domains.

The concept is straightforward. In a decentralized approach to Multi-Agent Reinforcement Learning, such as in the Actor-Critic framework, all agents in the environment have their own actor and critic neural networks.

Decentralization is an excellent concept for multiple agents who may never interact with each other. However, it may not be helpful for agents who should be working together because there is no way to share information.

To remedy this, we can introduce the concept of sharing just the critic neural network or both the actor and critic neural networks. This essentially allows the agents to share information.

A major strength of sharing information in Centralized Learning, Decentralized Execution (CLDE) is that instead of each agent learning individually, all agents are learning together. In essence, each agent explores different parts of the policy space. When other agents enter those spaces, they build upon the knowledge already discovered.

This collaborative learning approach allows for stationarity in policies in certain regions of the policy space. When actions in these regions become stationary or predictable, it helps other policies in different regions converge. These other policies can make the assumption that certain parts of the policy space are stationary, which simplifies their own learning process.

Eventually, as more policies become stationary, the entire policy space stabilizes, leading to a final policy that is effective for all states in the environment.

A quick note: research on A2C with PPO shows better performance when both the actor and critic neural networks are shared among all agents.
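
Here is a minimal PyTorch-style sketch of the shared-critic flavor of CLDE: each agent keeps its own actor over its local observation, while a single centralized critic sees the concatenated observations of all agents during learning. The sizes and architecture are hypothetical, and this is only an outline, not a full training loop:

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, N_AGENTS = 8, 4, 3   # hypothetical sizes

class Actor(nn.Module):
    """Decentralized execution: each actor only sees its own observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACT_DIM))
    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class CentralizedCritic(nn.Module):
    """Centralized learning: the critic sees every agent's observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM * N_AGENTS, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, joint_obs):
        return self.net(joint_obs)

actors = [Actor() for _ in range(N_AGENTS)]      # one actor per agent
critic = CentralizedCritic()                     # one shared critic

# Execution: each agent acts from its own local observation only.
local_obs = [torch.randn(OBS_DIM) for _ in range(N_AGENTS)]
actions = [actor(obs).sample() for actor, obs in zip(actors, local_obs)]

# Learning: the critic evaluates the joint observation of all agents.
value = critic(torch.cat(local_obs))
print(actions, value)
```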


Communication

The concept of Centralized Learning, Decentralized Execution (CLDE) introduces the essential idea of communication in MARL settings.

Although this term can be vague, its practical importance in MARL cannot be overstated.

The core concept of communication is about sharing information among all cooperative agents and, importantly, determining what information should be shared.

In CLDE, the information shared or communicated among all agents in the context of an Actor-Critic framework includes the "Value" of being in a particular state or the "Action" one should take in a given state. If the actions become stationary or predictable in certain regions of the policy space, this allows other agents to learn a policy in reaction to those stationary actions.

We can extend this further by introducing other forms of communication among agents.

This could involve direct information from the environment, such as agent location, distance between agents, or the subgoal the agent is currently trying to solve.

Other forms of communication can involve adding more information about the state space as observed by individual agents, which can be used as feature engineering techniques.

For example, in a partially observable environment like football, agents may have limited vision but can communicate with each other across the entire environment. If an agent sees the football, it can communicate that to other agents. This could involve adding to the state space the location of the ball or identifying which agent has the ball. Then, all other agents can act on that information, even if they are not observing it themselves.
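
As a sketch of this kind of communication, here is a hypothetical helper that broadcasts a ball location seen by one agent into every teammate's observation before they act; the observation sizes and sentinel value are invented for illustration:

```python
from typing import Dict, Optional, Tuple
import numpy as np

def broadcast_ball_location(
    local_obs: Dict[str, np.ndarray],
    ball_seen_by: Dict[str, Optional[Tuple[float, float]]],
) -> Dict[str, np.ndarray]:
    """Append a communicated ball location to every agent's observation.

    Any agent that sees the ball shares its coordinates; if no agent sees it,
    everyone receives a sentinel value of (-1, -1).
    """
    sightings = [loc for loc in ball_seen_by.values() if loc is not None]
    shared = np.array(sightings[0] if sightings else (-1.0, -1.0))
    return {aid: np.concatenate([obs, shared]) for aid, obs in local_obs.items()}

obs = {"striker": np.zeros(4), "keeper": np.zeros(4)}
seen = {"striker": (0.7, 0.2), "keeper": None}        # only the striker sees the ball
print(broadcast_ball_location(obs, seen)["keeper"])   # keeper now "knows" the ball is at (0.7, 0.2)
```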

One important point of communication is that it helps agents collaborate and coordinate their actions.

Lastly, based on the literature, as long as agents are capable of communicating, they are capable of learning in a decentralized fashion. A notable example comes from the literature on reinforcement learning for Unmanned Aerial Vehicles (UAVs). That literature describes learning done in a decentralized fashion: UAVs communicate with each other over channels, share whatever information is available to them, and, based on that successfully shared information, come up with joint actions to achieve whatever task is at hand.

Universal Value Function Approximator

Introduced by Schaul et al., Universal Value Function Approximators (UVFA) is the idea that we can train an agent to learn multiple goals, including all subgoals and true goals, within a single neural network.

The theory is sound, but the practicality is challenging in the MARL setting as it becomes problem-dependent.

However, I leave this here as it may provide you with some ideas, especially if you consider using this in conjunction with game AI concepts, such as Finite State Machines or Blackboard algorithms. These are ideas that I have come up with, and not necessarily those recommended in the field.
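
A minimal sketch of the UVFA idea in PyTorch-style code: the value network takes both a state and a goal as input, so one network can be queried for any subgoal or for the true goal. The dimensions and architecture are hypothetical:

```python
import torch
import torch.nn as nn

STATE_DIM, GOAL_DIM = 8, 4   # hypothetical sizes

class UVFA(nn.Module):
    """V(s, g): one network approximates the value of state s under goal g."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + GOAL_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

uvfa = UVFA()
state = torch.randn(STATE_DIM)
subgoal, true_goal = torch.randn(GOAL_DIM), torch.randn(GOAL_DIM)

# The same network is reused for every goal the agent might pursue.
print(uvfa(state, subgoal), uvfa(state, true_goal))
```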


Role Labelling

A simpler concept than Blackboard algorithms is the idea of assigning an agent a "role" throughout the entirety of the cooperative game. Role labelling is a concept I created to tackle a MARL problem and is not something I believe you'll find in the literature.

The concept is straightforward: each agent is assigned a "role" in the cooperative game, and the individual agent's reward structure is statically determined based on that role.

As the game progresses, you can either swap the roles around based on a simple heuristic defined by the programmer, termed dynamic role labeling, or keep the roles intact throughout the entirety of the cooperative game, termed static role labeling.

As mentioned by an amazing person, who happens to be a former/current TA at Georgia Tech's Reinforcement Learning class, you can also consider masking actions based on the roles; essentially, you can hide and/or show actions based on the roles assigned to the agent to reduce the exploration of the action space for the individual agent.

In my experiments, static role labeling has been shown to improve learning performance, although I'm not sure to what extent the improved learning can be attributed to role labeling, nor have I tested it against dynamic role labeling.
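
As a hedged sketch of static role labelling with action masking, here is a toy setup where each agent's reward weights and visible actions depend on its assigned role; the role names, weights, and masks are invented for illustration:

```python
import numpy as np

ACTIONS = ["attack", "heal", "defend", "move"]

# Static role assignment: each agent keeps one role for the whole game.
ROLES = {"agent_0": "healer", "agent_1": "mage", "agent_2": "tank"}

# Role-specific reward weights and action masks (1 = allowed, 0 = hidden).
ROLE_REWARD_WEIGHTS = {
    "healer": {"heal_done": 1.0, "damage_done": -0.5},
    "mage":   {"heal_done": 0.0, "damage_done": 1.0},
    "tank":   {"heal_done": 0.0, "damage_done": 0.2},
}
ROLE_ACTION_MASKS = {
    "healer": np.array([0, 1, 1, 1]),   # healers never see "attack"
    "mage":   np.array([1, 0, 0, 1]),
    "tank":   np.array([1, 0, 1, 1]),
}

def role_reward(agent_id: str, heal_done: float, damage_done: float) -> float:
    w = ROLE_REWARD_WEIGHTS[ROLES[agent_id]]
    return w["heal_done"] * heal_done + w["damage_done"] * damage_done

def masked_action_logits(agent_id: str, logits: np.ndarray) -> np.ndarray:
    """Hide disallowed actions by pushing their logits to -inf before sampling."""
    mask = ROLE_ACTION_MASKS[ROLES[agent_id]]
    return np.where(mask == 1, logits, -np.inf)

print(role_reward("agent_0", heal_done=5.0, damage_done=2.0))   # healer: 5 - 1 = 4.0
print(masked_action_logits("agent_0", np.zeros(4)))             # "attack" logit -> -inf
```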

Conclusion

This article is simply a quick introduction to multi-agent reinforcement learning. There is a lot of information out there: many different academic papers and concepts tackle MARL.

My recommendation for going further in your journey is to read the Multi-Agent Reinforcement Learning book by Albrecht et al.

Thank you for joining me. Warmest wishes on your success.

OMSCS Reinforcement Learning Series:

  1. Georgia Tech Reinforcement Learning: Preparing for Success
  2. Single Agent Reinforcement Learning: Basic Concepts and Terminologies
  3. Turbocharging Advantage Actor-Critic with Proximal Policy Optimization
  4. Advantage Actor-Critic with Proximal Policy Optimization: A Journey Through Code
  5. Multi-Agent Reinforcement Learning Soft Introduction: Cooperation
