Evolutionary Reinforcement Learning in an Ecosystem-Based Environment
In this post we describe a project in which we compared different learning algorithms in an open-ended ecosystem simulation by making agents compete against each other to survive.
We wanted to see how agents learn in a simple, simulated ecosystem, and how different aspects like evolution (evolutionary algorithms) and learning (reinforcement learning) affect the performance and survival chances of a species as a whole.
We first ran experiments in a simplified single-agent case to tune some hyperparameters and to analyse the methods in isolation.
Then we investigated how these agents interact with one another, both through competition within the same species and through the predator/prey relationship.
The simulation consisted of multiple types of entities: prey, predators, and food. As you might expect, predators ate the prey and prey ate the food. There were some extra mechanics, like energy: an agent gained energy by eating and lost it over time. Once an agent had enough energy, it could reproduce, producing a mutated copy of itself. Agents also died after exhausting a fixed lifespan.
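The energy and reproduction mechanics can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation; all constants (energy gains, drain rate, reproduction threshold, lifespan) are made-up placeholders.

```python
class Agent:
    """Toy sketch of the energy/reproduction mechanics (illustrative constants)."""

    def __init__(self, energy=10.0):
        self.energy = energy
        self.age = 0

    def step(self, ate=False):
        """One simulation tick: gain energy by eating, lose it over time."""
        self.age += 1
        if ate:
            self.energy += 5.0   # assumed energy gained per meal
        self.energy -= 1.0       # assumed constant energy drain per tick

    def alive(self, max_lifespan=100):
        """Agents die when out of energy or past a fixed lifespan."""
        return self.energy > 0 and self.age < max_lifespan

    def maybe_reproduce(self, threshold=20.0):
        """With enough energy, spend some of it to produce a child."""
        if self.energy >= threshold:
            self.energy -= threshold / 2
            return Agent(energy=threshold / 2)  # child (mutation omitted here)
        return None
```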
Our second simulation was a simplified version of the above: a single prey agent, no predators, and no energy or reproduction mechanics. We modelled it as a standard reinforcement learning environment in which the agent received a positive reward for eating food and a negative reward for hitting obstacles.
This allowed us to run simple experiments to compare different methods in a controlled setting.
Agents could observe their surroundings in 8 directions; for each direction, the observation contained the distance to the closest piece of food, the closest predator, the closest prey, and the closest wall.
This is shown in the image below. Agents chose actions by passing this information through a neural network, as in the image on the left. The observations were fed directly into the network, and each output corresponded to the score for a specific action. The agent simply took the action with the highest-valued output node, moving it in one of the 8 directions.
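The decision step can be sketched as below: 8 directions times 4 observation channels (food, predator, prey, wall) go in, one score per movement direction comes out, and the agent moves in the argmax direction. The hidden layer and all sizes here are assumptions for illustration, not the project's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

N_DIRS, N_CHANNELS = 8, 4  # 8 viewing directions, 4 distance channels each

# Randomly initialised weights for a small two-layer network (illustrative).
W1 = rng.normal(size=(N_DIRS * N_CHANNELS, 16))
W2 = rng.normal(size=(16, N_DIRS))

def act(observation):
    """Map a flat vector of distances to a movement direction (0..7)."""
    hidden = np.tanh(observation @ W1)
    scores = hidden @ W2               # one score per movement direction
    return int(np.argmax(scores))      # take the highest-scoring direction

obs = rng.random(N_DIRS * N_CHANNELS)  # stand-in observation vector
action = act(obs)
```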
The methods we explored to actually update these networks were the following:
- Q-learning with eligibility traces.
- A genetic algorithm.
- Evolutionary Reinforcement Learning (ERL), which combines the two methods above.
- A random agent.
- A hardcoded agent that performed a hand-crafted ‘optimal’ action each time.
Q-learning estimates the expected future reward of each action and takes the action with the highest estimate, then updates these estimates based on the rewards it actually receives.
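A tabular sketch of Q-learning with eligibility traces (in the style of Watkins's Q(λ)) is shown below. The table sizes and hyperparameters are illustrative assumptions; the project used a neural network rather than a table, but the update has the same shape.

```python
import numpy as np

n_states, n_actions = 16, 8
Q = np.zeros((n_states, n_actions))   # action-value estimates
E = np.zeros_like(Q)                  # eligibility traces
alpha, gamma, lam = 0.1, 0.9, 0.8     # assumed hyperparameters

def update(s, a, r, s_next, greedy):
    """One TD step; traces spread the error back to recently visited pairs."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]  # TD error
    E[s, a] += 1                       # accumulate trace for (s, a)
    Q[...] = Q + alpha * delta * E     # update every traced pair at once
    if greedy:
        E[...] = gamma * lam * E       # decay traces after a greedy action
    else:
        E[...] = 0                     # cut traces after an exploratory action

update(s=0, a=1, r=1.0, s_next=2, greedy=True)
```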
A genetic algorithm mimics natural selection and evolution: we have a population of individuals, each with a “fitness” score, which measures how well it solves the problem at hand. We select the top-performing agents to create a new population through a process called crossover, in which two parents are combined to form children. The children are then mutated slightly to introduce genetic diversity and explore the search space more thoroughly. In our case, crossover simply splices two networks together, taking half of the weights from one parent and half from the other.
The idea is that, over enough generations, well-performing traits are reinforced and passed down, while poorly performing traits are discarded.
Evolutionary Reinforcement Learning combines these two paradigms: agents learn and update their weights with reinforcement learning during their lifetimes, and the best-performing agents of each generation serve as the starting point for the next.
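At a high level, the ERL loop looks like the sketch below. The `rl_train`, `evaluate`, and `mutate` callables are hypothetical stand-ins for the lifetime learning step, the fitness evaluation, and the mutation operator; how the best agent seeds the next generation is simplified here to copying a single winner.

```python
def erl_loop(population, n_generations, rl_train, evaluate, mutate):
    """Toy ERL outer loop: lifetime RL inside each generation, then selection."""
    for _ in range(n_generations):
        # Each agent refines its weights with RL during its lifetime.
        population = [rl_train(agent) for agent in population]
        scores = [evaluate(agent) for agent in population]
        best = max(range(len(population)), key=scores.__getitem__)
        # The best agent's *learned* weights seed a new, mutated population.
        population = [mutate(population[best]) for _ in population]
    return population

# Trivial numeric stand-ins, just to show the control flow.
result = erl_loop(
    population=[0, 1, 2],
    n_generations=1,
    rl_train=lambda w: w + 1,   # "learning" increments the weight
    evaluate=lambda w: w,       # fitness is the weight itself
    mutate=lambda w: w,         # mutation is a no-op here
)
```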
We ran many experiments in the single-agent case, and the main takeaway was that ERL performs well in general: it is competitive with Q-learning and outperforms a plain genetic algorithm.
We evaluated the agents on a few fixed levels with frozen networks. The results were quite variable and often ended in tragedy, with the agent stuck in a corner without a way to escape.
A small grid search over the most important hyperparameters helped somewhat and resulted in a better ERL agent.
In the full multi-agent ecosystem we observed some interesting behaviour: species competing, predator and prey populations oscillating naturally, and the environment settling at a fixed carrying capacity.
The multi-agent case was quite inconsistent, with different random initialisations leading to drastically different results.
The random agents also sometimes prevailed, surviving simply by randomly sweeping the environment.
These issues could be mitigated by tuning the simulation parameters: the size of the environment, how long agents survive, and how many agents are spawned initially.
Overall, the project proved successful: the ERL agents improved over time and generalised well, in contrast to the hard-coded agents, which had to be rewritten for every new rule and simulation.
In both simulations we could observe the agents gradually improve, avoiding obstacles and collecting more food.
The full code, as well as a more detailed report, can be found on GitHub.
References / Acknowledgements
Great blog post about a similar topic: https://andrearamazzina.com/2018/02/05/reinforcement-learning-in-a-predator-prey-model/
- [Ackley and Littman 1992] Ackley and Littman. Evolutionary Reinforcement Learning (ERL). 1992.
- [Fraczkowski and Olsen 2014] Rachel Fraczkowski and Megan Olsen. An agent-based predator-prey model with reinforcement learning. In Swarmfest 2014: 18th Annual Meeting on Agent-Based Modeling & Simulation, June 2014.
See our report for a more complete list of references.