AlphaEvolve [1] is a promising new coding agent by Google DeepMind. Let's look at what it is and why it is generating hype. Much of the Google paper rests on the claim that AlphaEvolve facilitates novel research through its ability to improve code until it solves a problem exceptionally well. Remarkably, the authors report that AlphaEvolve has already achieved such research breakthroughs.
In this article, we will go through some basic background knowledge, then dive into the Google DeepMind paper, and finally look at how to get OpenEvolve [2] running, an open-source demo implementation of the gist of the AlphaEvolve paper. In the end, you will be ready to run your own experiments! We will also briefly discuss the possible implications.
What you will not get, however, is an absolute statement on "how good it is". Applying this tool is still labor-intensive and costly, especially for difficult problems.
Indeed, it is difficult to determine the extent of this breakthrough, which builds upon previous research. The most significant citation is another Google DeepMind paper from 2023 [4]. Google is definitely suggesting a lot here with regard to the possible research applications. And they seem to be trying to scale those applications up: AlphaEvolve has already produced numerous novel research results in their lab, they claim.
Now other researchers have to reproduce the results and put them into context, and additional proof of the tool's value needs to be established. This is not straightforward, and again, will take time.
The first open-source attempts at applying the AlphaEvolve algorithms were available within days. One of these attempts is OpenEvolve, which implemented the solution in a clean and understandable way. This helps others to evaluate similar approaches and determine their benefits.
But let’s start from the beginning. What is all of this about?
If you are reading this, then you have probably heard of coding agents. They typically apply large language models (LLMs) to automatically generate computer programs at breathtaking speeds. Rather than producing text, the chatbot generates Python code or something else. By checking the output of the generated program after each attempt, a coding agent can automatically produce and improve actionable computer programs. Some consider this a powerful evolution of LLM capabilities. The story goes like this: Initially, LLMs were just confabulating and dreaming up text and output in other modalities, such as images. Then came agents that could work off to-do lists, run continuously and even manage their own memory. With structured JSON output and tool calls, this was further extended to give agents access to additional services. Finally, coding agents were developed that can create and execute algorithms in a reproducible fashion. In a sense, this enables the LLM to cheat by extending its capabilities to include those that computers have had for a long time.
There is much more to creating a reliable LLM system, more on this in future articles. For AlphaEvolve, however, reliability is not a primary concern. Its tasks have limited scope, and the outcome must be clearly measurable (more on this below).
Anyway, coding agents. There are many. To implement your own, you could start with frameworks such as smolagents, swarms or Letta. If you just want to start coding with the support of a coding agent, popular tools are GitHub Copilot, integrated in VS Code, as well as Aider and Cursor. These tools internally orchestrate LLM chatbot interactions by providing the right context from your code base to the LLM in real time. Since these tools provide semi-autonomous functions on top of the stateless LLM interface, they are called "agentic."
How extremely stupid not to have thought of that!
Google is now claiming a sort of breakthrough based on coding agents. Is it something big and new? Well, not really. They applied something very old.
Rewind to 1809: Charles Darwin was born. His book On the Origin of Species, which outlined evidence that natural selection leads to biological evolution, led biologist Thomas Henry Huxley to the above exclamation.
Of course, there are other forms of evolution besides biological evolution. As a figure of speech, you can claim evolution wherever survival of the fittest leads to a particular outcome. Love, the stars — you name it. In computer science, Evolutionary Algorithms (with genetic algorithms as the most common subclass) follow a simple approach. First, randomly generate n configurations. Then, check if any of the configurations meets your needs (evaluate their fitness). If so, stop. If not, pick one or multiple parent configurations — ideally, very fit ones — and create a new configuration by mixing the parents (this is optional and is referred to as crossover; a single parent works too), optionally add random mutations, remove a few of the previous configurations — preferably, weak ones — and start over.
There are three things to note here:
The necessity of a fitness function means that there is measurable success. AlphaEvolve doesn't do science on its own, finding just anything for you. It works on a perfectly defined goal, for which you may already have a solution, just not the best one.
Why not make the goal "get mega rich"? A short warning: Evolutionary algorithms are slow. They require a large population size and many generations to reach their local optimum by chance. And they don't always identify the global optimum solution. This is why you and I ended up where we are, right? If the goal is too broad and the initial population is too primitive, be prepared to let it run a few million years with an unclear outcome.
Why introduce mutations? In evolutionary algorithms, they help overcome the flaw of getting stuck in a local optimum too easily. Without randomness, the algorithm may quickly find a poor solution and get stuck on a path where additional evolution cannot lead to further improvements, simply because the population of possible parent configurations may be insufficient to allow for the creation of a better individual. This inspires a central design objective in AlphaEvolve: mix strong and weak LLMs, and mix elite parent configurations with more mundane ones. This variety enables faster iterations (idea exploration) while still leaving room for innovation.
Background knowledge: Example of how to implement a basic evolutionary algorithm
As finger practice, or to get a basic feel for what evolutionary algorithms can look like, here is an example:
import random

POP, GEN, MUT = 20, 100, 0.5
f = lambda x: -x**2 + 5

# Create a uniformly distributed start population
pop = [random.uniform(-5, 5) for _ in range(POP)]

for g in range(GEN):
    # Sort by fitness, best first
    pop.sort(key=f, reverse=True)
    best = pop[0]
    print(f"gen #{g}: best x={best}, fitness={f(best)}")
    # Eliminate the weakest 50%
    pop = pop[:POP//2]
    # Double the number of individuals and introduce mutations
    pop = [p + random.gauss(0, MUT) for p in pop for _ in (0, 1)]

best = max(pop, key=f)
print(f"best x={best}, fitness={f(best)}")
The goal is to maximize the fitness function -x²+5 by getting x as close to 0 as possible. The random "population" with which the system is initialized is modified in each generation. The weaker half is eliminated, and the remaining half produces "offspring" by having a Gaussian value (a random mutation) added to it. Note: In the given example, the elimination of half the population and the introduction of "children" could have been skipped. The result would have been the same if every individual were simply mutated. However, in other implementations, such as genetic algorithms where two parents are mixed to produce offspring, the elimination step is necessary.
Since the program is stochastic, the output will differ each time you execute it, but it will look similar to this:
gen #0: best x=0.014297341502906846, fitness=4.999795586025949
gen #1: best x=-0.1304768836196552, fitness=4.982975782840903
gen #2: best x=-0.06166058197494284, fitness=4.996197972630512
gen #3: best x=0.051225496901524836, fitness=4.997375948467192
gen #4: best x=-0.020009912942005076, fitness=4.999599603384054
gen #5: best x=-0.002485426169108483, fitness=4.999993822656758
[..]
best x=0.013335836440791615, fitness=4.999822155466425
Pretty close to zero, I guess. Simple, eh? You may also have noticed two attributes of the evolutionary process:
The results are random, yet the fittest candidates converge.
Evolution doesn’t necessarily identify the optimum, not even an obvious one.
With LLMs in the picture, things get more exciting. The LLM can intelligently guide the direction the evolution takes. Like you and me, it would figure out that x must be zero.
How it works: Meet AlphaEvolve
AlphaEvolve is a coding agent that combines smart prompt generation, an evolutionary algorithm that refines the provided context, and two strong base LLMs. The primary model generates many ideas quickly, while the stronger secondary LLM raises the quality level. The algorithm works irrespective of which LLM models are used, but more powerful models produce better results.
In AlphaEvolve, evolution for the LLM means that its context adapts with each inference. Essentially, the LLM is provided with information on successful and unsuccessful past code attempts, and this list of programs is refined through an evolutionary algorithm with each iteration. The context also provides feedback on the programs' fitness results, indicating their strengths and weaknesses. Human instructions for a specific problem can also be added (the LLM researcher and the human researchers form a team, in a way, helping each other). Finally, the context includes meta-prompts, self-managed instructions from the LLM. These meta-prompts evolve in the same way that the fittest code results evolve.
The specific evolutionary algorithm that was implemented is noteworthy. It combines a strategy called MAP-Elites [5] with the island-based population models known from traditional genetic algorithms. Island-based population models allow subpopulations to evolve separately. MAP-Elites, on the other hand, is a smart search strategy that keeps only the fittest candidate within each niche of a multidimensional feature grid. By combining the approaches, exploration and exploitation are mixed: at a certain rate, the elite is selected and adds diversity to the gene pool.
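To make the MAP-Elites idea concrete, here is my own minimal sketch (not AlphaEvolve's implementation), reusing the toy fitness function from the earlier example. Each cell of a feature grid keeps only its fittest candidate, and parents are drawn from these elites; an island model would run several such grids in parallel and occasionally exchange candidates between them:

```python
import random

def fitness(x):
    return -x**2 + 5

def features(x):
    # Map a candidate to a discrete grid cell: (sign, magnitude bucket)
    return (x > 0, min(int(abs(x)), 4))

grid = {}  # cell -> fittest candidate seen for that cell
for _ in range(1000):
    # Pick a random elite as parent (or a random start while the grid is empty)
    parent = random.choice(list(grid.values())) if grid else random.uniform(-5, 5)
    child = parent + random.gauss(0, 0.5)
    cell = features(child)
    # A candidate only survives if it beats the current elite of its cell
    if cell not in grid or fitness(child) > fitness(grid[cell]):
        grid[cell] = child

best = max(grid.values(), key=fitness)
print(f"best x={best}, fitness={fitness(best)}")
```

Note how weaker niches (e.g., large-magnitude cells) survive alongside the overall best candidate: that is exactly the diversity that plain truncation selection would destroy.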
Fitness is determined as a multidimensional vector of values, each of which is to be maximized. No weighting seems to be used, i.e., all values are equally important. The authors dismiss concerns that this could be an issue when a single metric matters most, suggesting that good code often improves the results for multiple metrics.
Fitness is evaluated in two stages (the “evaluation cascade”): First, a quick test is performed to filter out obviously poor candidate solutions. Only in the second stage, which may take more execution time, is the full evaluation performed. The goal of this is to maximize throughput by considering many ideas quickly and not wasting more resources than necessary on bad ideas.
This whole approach is easily parallelized, which also helps throughput. The authors are thinking big: They mention that even problem evaluations that take hundreds of computing hours for a single test are possible in this setup. Bad candidates are discarded early, and the many long-running tests take place simultaneously in a datacenter.
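Conceptually, the cascade boils down to something like this (a sketch of mine with hypothetical function names and threshold; OpenEvolve's equivalents of the two stages appear later in this article):

```python
def cascade_evaluate(program_path, threshold=0.5):
    # Stage 1: a cheap smoke test (does the candidate run at all?)
    quick_metrics = evaluate_stage1(program_path)
    if quick_metrics.get("distance_score", 0.0) < threshold:
        return quick_metrics  # discard obviously poor candidates early
    # Stage 2: the expensive, full evaluation
    return evaluate(program_path)
```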
The LLM’s output is a list of code sequences that the LLM wants replaced. This means the LLM does not have to reproduce the entire program but can instead trigger modifications to specific lines. This presumably allows AlphaEvolve to handle larger code bases more efficiently. To accomplish this, the LLM is instructed in its system prompt to use the following diff output format:
<<<<<<< SEARCH
search text
=======
replace text
>>>>>>> REPLACE
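Applying such a diff is straightforward. The following is my own minimal sketch (not code from the paper) of how each SEARCH/REPLACE block in the LLM output could be applied to a source string:

```python
import re

DIFF_PATTERN = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)

def apply_diffs(source: str, llm_output: str) -> str:
    """Apply every SEARCH/REPLACE block found in the LLM output to the source."""
    for search, replace in DIFF_PATTERN.findall(llm_output):
        if search not in source:
            raise ValueError(f"Search block not found in source:\n{search}")
        source = source.replace(search, replace, 1)
    return source
```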
Key findings from the paper
Much of the paper discusses relevant research advancements that AlphaEvolve already produced. The research problems were expressed in code with a clear evaluator function. This is usually possible for problems in mathematics, computer science and related fields.
Specifically, the authors describe the following research results produced by AlphaEvolve:
They report that AlphaEvolve found (slightly) faster algorithms for matrix multiplication. They mention that this required non-trivial changes with 15 separate, noteworthy advancements.
They used it for finding search algorithms in different mathematical problems.
They were able to improve data center scheduling with the help of AlphaEvolve.
They had AlphaEvolve optimize a Verilog hardware circuit design.
Attempts to optimize compiler-generated code produced some results with 15–32% speed improvement. The authors suggest that this could be systematically used to optimize code performance.
Note that the magnitude of these results is under discussion.
In addition to the immediate research results produced by AlphaEvolve, the authors’ ablations are also insightful. In an ablation study, researchers attempt to determine which parts of a system contribute most to the results by systematically removing parts of it (see page 18, fig. 8). We learn that:
Self-guided meta prompting of the LLM didn’t contribute much.
The primary versus secondary model mixture improves results slightly.
Human-written context in the prompt contributes quite a bit to the results.
Finally, the evolutionary algorithm that produces the evolving context passed to the LLM makes all the difference. The results demonstrate that AlphaEvolve's evolutionary aspect is crucial for successfully solving problems. This suggests that evolutionary prompt refinement can vastly increase LLM capability.
OpenEvolve: Setup
It is time to start doing your own experiments with OpenEvolve. Setting it up is simple. First, decide whether you want to use Docker. Docker may add an extra security layer, because coding agents may pose security risks (see further below).
To install natively, just clone the Git repository, create a virtual environment, and install the requirements:
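Something along these lines should do, assuming the repository location from the OpenEvolve README at the time of writing:

```bash
git clone https://github.com/codelion/openevolve.git
cd openevolve
python3 -m venv .venv
source .venv/bin/activate
pip install -e .  # installs OpenEvolve along with its requirements
```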
The agent will optimize the initial program and produce the best program as its output. Depending on how many iterations you invest, the result may improve more and more, but there is no definite logic to determine the ideal stopping point. Typically, you have a “compute budget” that you exhaust, or you wait until the results seem to plateau.
The agent takes an initial program and the evaluation program as input and, with a given configuration, produces new evolutions of the initial program. For each evolution, the evaluator executes the current program evolution and returns metrics to the agent, which aims to maximize them. Once the configured number of iterations is reached, the best program found is written to a file. (Image by author)
Let’s start with a very basic example.
In your initial_program.py, define your function, then mark the sections you want the agent to be able to modify with # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END comments. The code does not necessarily need to do anything useful yet; it can simply return a valid, constant value. However, if the code already represents a basic solution that you wish to optimize, you will see results much sooner during the evolution process. initial_program.py will be executed by evaluator.py, so you can define any function names and logic; the two just have to fit together. Let's assume this is your initial program:
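Here is a minimal stand-in consistent with the evaluator shown below (note the bare my_function at the end: the sandboxed executor used in the evaluator returns the value of the last expression, and the evaluator checks that this value is callable):

```python
# EVOLVE-BLOCK-START
def my_function(x):
    # A deliberately naive starting point: return a constant
    return 0
# EVOLVE-BLOCK-END

# End with the bare function object so the sandboxed executor
# returns it as the program's value:
my_function
```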
Next, implement the evaluation functions. Remember the cascade evaluation from earlier? There are two evaluation functions: evaluate_stage1(program_path) performs quick trials to see whether the program runs properly and seems generally okay: execute it, measure the time, check for exceptions and valid return types, etc.
In the second stage, the evaluate(program_path) function is supposed to perform a full assessment of the provided program. For example, if the program is stochastic and therefore does not always produce the same output, in stage 2 you may execute it multiple times (taking more time for the evaluation), as done in the example code in the examples/function_minimization/ folder. Each evaluation function must return metrics of your choice; just make sure that "bigger is better", because this is what the evolutionary algorithm will optimize for. This allows you to have the program optimized for different goals, such as execution time, accuracy, memory usage, etc. — whatever you can measure and return.
from smolagents.local_python_executor import LocalPythonExecutor

def load_program(program_path, additional_authorized_imports=["numpy"]):
    try:
        with open(program_path, "r") as f:
            code = f.read()
        # Execute the code in a sandboxed environment
        executor = LocalPythonExecutor(
            additional_authorized_imports=additional_authorized_imports
        )
        executor.send_tools({})  # Allow safe builtins
        return_value, stdout, is_final_answer_bool = executor(code)
        # Confirm that return_value is a callable function
        if not callable(return_value):
            raise Exception("Program does not contain a callable function")
        return return_value
    except Exception as e:
        raise Exception(f"Error loading program: {str(e)}")

def evaluate_stage1(program_path):
    try:
        program = load_program(program_path)
        return {"distance_score": program(1)}
    except Exception as e:
        return {"distance_score": 0.0, "error": str(e)}

def evaluate(program_path):
    try:
        program = load_program(program_path)
        # If my_function(x)==x for all values from 1..100, give the highest score 1.
        score = 1 - sum(program(x) != x for x in range(1, 101)) / 100
        return {
            "distance_score": score,  # Score is a value between 0 and 1
        }
    except Exception as e:
        return {"distance_score": 0.0, "error": str(e)}
This evaluator program requires the installation of smolagents, which is used for sandboxed code execution:
pip3 install smolagents
With this evaluator, my_function(x) has to return x for each tested value. If it does, it receives a score of 1. Will the agent optimize the initial program to do just that?
Before trying it out, set your configuration options in config.yaml. The full list of available options is documented in configs/default_config.yml. Here are a few important options for configuring the LLM:
log_level: "INFO"  # Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)

llm:
  # Primary model (used most frequently)
  primary_model: "o4-mini"
  primary_model_weight: 0.8  # Sampling weight for primary model

  # Secondary model (used for occasional high-quality generations)
  secondary_model: "gpt-4o"
  secondary_model_weight: 0.2  # Sampling weight for secondary model

  # API configuration
  api_base: "https://api.openai.com/v1/"
  api_key: "sk-.."

prompt:
  system_message: "You are an expert programmer specializing in tricky code problems. Your task is to find a function that returns an integer that matches an unknown, but trivial requirement."
You can configure LLMs from another OpenAI-compatible endpoint, such as a local Ollama installation, using settings like:
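For instance, Ollama exposes an OpenAI-compatible API on port 11434; the model names are whatever you have pulled locally (cogito:14b here is just an example matching the experiment below):

```yaml
llm:
  primary_model: "cogito:14b"
  secondary_model: "cogito:14b"
  api_base: "http://localhost:11434/v1/"
  api_key: "ollama"  # Ollama ignores the key, but the field must be set
```

Then start the agent with the initial program, the evaluator and your configuration; the entry-point script below is the one from the OpenEvolve repository at the time of writing:

```bash
python3 openevolve-run.py initial_program.py evaluator.py \
  --config config.yaml --iterations 100
```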
It will then whiz away… And, magically, it will work!
Did you notice the system prompt I used?
You are an expert programmer specializing in tricky code problems. Your task is to find a function that returns an integer that matches an unknown, but trivial requirement.
The first time I ran the agent, it tried “return 42”, which is a reasonable attempt. The next attempt was “return x”, which, of course, was the answer.
The harder problem in the examples/function_minimization/ folder of the OpenEvolve repository makes things more interesting:
Top left: Initial program; Center: OpenEvolve iterating over different attempts with the OpenAI models; Top right: Initial metrics; Bottom right: Current version metrics (50x speed, video by author)
Here, I ran two experiments with 100 iterations each. The first try, with cogito:14b as both the primary and the secondary model, took over an hour on my system. Note that forgoing a stronger secondary model is not recommended, but it increased speed in my local setup because no model switching occurred.
[..] 2025-05-18 18:09:53,844 - INFO - New best program 18de6300-9677-4a33-b2fb-9667147fdfbe replaces ad6079d5-59a6-4b5a-9c61-84c32fb30052
[..] 2025-05-18 18:09:53,844 - INFO - 🌟 New best solution found at iteration 5: 18de6300-9677-4a33-b2fb-9667147fdfbe
[..]
Evolution complete!
Best program metrics:
  runs_successfully: 1.0000
  value: -1.0666
  distance: 2.7764
  value_score: 0.5943
  distance_score: 0.3135
  overall_score: 0.5101
  speed_score: 1.0000
  reliability_score: 1.0000
  combined_score: 0.5506
  success_rate: 1.0000
In contrast, using OpenAI's gpt-4o as the primary model and gpt-4.1 as an even stronger secondary model, I had a result in 25 minutes.
Surprisingly, the final metrics look similar despite GPT-4o being far more capable than the 14-billion-parameter cogito LLM. Note: Bigger numbers are better! The algorithm aims to maximize all metrics. However, while watching the OpenAI models run through the iterations, they seemed to try more innovative combinations. Perhaps the problem was too simple for them to gain an advantage in the end, though.
A note on security
Please note that OpenEvolve itself does not implement any security controls, despite coding agents posing considerable security risks. The Hugging Face team has documented the security considerations of coding agents. To reduce the security risk to a reasonable degree, the evaluator function above used a sandboxed execution environment that only allows the import of whitelisted libraries and the execution of whitelisted functions. If the LLM produced a program that attempted forbidden imports, an exception such as the following would be triggered:
Error loading program: Code execution failed at line 'import os' due to: InterpreterError
Without this extra effort, the executed code would have full access to your system and could delete files, etc.
Discussion and outlook
What does it all mean, and how will it be used?
Running well-prepared experiments takes considerable computing power, and only a few people can specify them. The results come in slowly, so comparing them to alternative solutions is not trivial. However, in theory, you can describe any problem, either directly or indirectly, in code.
What about non-code use cases or situations where we lack proper metrics? Perhaps fitness functions could return a metric based on another LLM's evaluation, for example of text quality. An ensemble of LLM reviewers could evaluate and score candidates. As it turns out, the authors of AlphaEvolve hint at this option as well. They write:
While AlphaEvolve does allow for LLM-provided evaluation of ideas, this is not a setting we have optimized for. However, concurrent work shows this is possible [3]
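To sketch what that might look like in OpenEvolve terms: the following is purely my own illustration, reusing the load_program helper from the evaluator above and assuming the evolved function returns a piece of text to be judged. An ensemble of reviewers would average several such scores.

```python
from openai import OpenAI

client = OpenAI()

def evaluate(program_path):
    # Run the evolved program to obtain the text to be judged
    text = load_program(program_path)()
    # Ask an LLM reviewer for a quality score
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Rate the quality of the following text on a scale "
                       "from 0.0 to 1.0. Respond with the number only.\n\n" + text,
        }],
    )
    try:
        score = float(response.choices[0].message.content.strip())
    except ValueError:
        score = 0.0  # unparseable judge output counts as failure
    return {"text_quality": max(0.0, min(1.0, score))}
```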
Another outlook discussed in the paper is using AlphaEvolve to improve the base LLMs themselves. That does not imply superspeed evolution, though. The paper mentions that “feedback loops for improving the next version of AlphaEvolve are on the order of months”.
Regarding coding agents, I wonder which benchmarks would be helpful and how AlphaEvolve would perform in them. SWE-Bench is one such benchmark. Could we test it that way?
Finally, what about the outlook for OpenEvolve? Hopefully it will continue. Its author has stated that reproducing some of the AlphaEvolve results is a goal.
More importantly: How much potential do evolutionary coding agents have and how can we maximize the impact of these tools and achieve a broader accessibility? And can we scale the number of problems we feed to them somehow?
Let me know your thoughts. What’s your opinion on all of this? Leave a comment below! If you have facts to share, all the better. Thanks for reading!
References
Novikov et al., AlphaEvolve: A Gemini-Powered Coding Agent for Designing Advanced Algorithms (2025), Google DeepMind
Asankhaya Sharma, OpenEvolve: Open-source implementation of AlphaEvolve (2025), GitHub
Gottweis et al., Towards an AI co-scientist (2025), arXiv:2502.18864
Romera-Paredes et al., Mathematical discoveries from program search with large language models (2023), Nature
Mouret and Clune, Illuminating search spaces by mapping elites (2015), arXiv:1504.04909