15 January 2026
The Multi-Agent Approach: When One AI Agent Is No Longer Enough
In the previous article, I talked about a single agent as a working tool for DevOps tasks. We looked at how an agent can be useful while the task remains local: gathering context, analyzing configuration, finding a weak spot in a pipeline, or preparing a draft solution.

But as soon as the task becomes longer, problems can appear. You may need to separately gather context about the pipeline, then check a Dockerfile or a deployment script, and then, for example, understand whether the proposed fix introduces a new risk for the release or operations. In a single pass, this kind of work starts to sprawl, the agent's focus gets lost, and efficiency begins to drop.
At this point, one agent starts mixing several types of work at once. It plans, executes, checks itself, and carries all the accumulated context. For a short task, this is still tolerable. For a long one, it starts to break down.

This is where the solution becomes clear: split the work into several roles, each handled by a separate AI agent.

Introduction

As long as the task is local, one agent is usually enough: analyze a Dockerfile, read a log, find a weak spot in a pipeline.

The problem starts when the task consists of many subtasks. There is a need to analyze separately, execute separately, verify the result separately, and still keep the overall context in mind. At this point, a universal role starts working worse than several more focused ones.
Take a typical deployment failure scenario. First, you need to gather context about the pipeline and environment. Then find where the error occurs. After that, prepare a fix. And then verify that the changes are correct and do not break anything else.

When one agent does all of this, a problem appears. It starts mixing roles.

At the same time, it is:
  • analyzing the task;
  • building a plan;
  • executing steps;
  • evaluating its own result;
  • trying not to forget all the accumulated context.

And the more stages there are, and the more diverse they are, the more this approach leads to a loss of efficiency. The quality of verification drops, the reasoning becomes simplified, the context starts to spread out, and the result gets worse.

That is why the idea of using specialized subagents for each type of subtask arises naturally. When one role starts combining too many functions, there is a natural desire to separate them.
What the multi-agent approach is

There is a lot of hype around multi-agent systems right now. But if we strip away the buzzwords, the idea is very simple: the task requires several roles, and one agent handles that combination poorly.

One role keeps the plan and the order of steps. Another goes into the files, commands, and facts of the task. A third reviews the result separately from the author of the solution. If needed, a domain-specific role appears on top of that: security, infrastructure, documentation.

This works roughly the same way as organizing a process among employees. When a complex piece of work is inconvenient for one role to handle, it gets split into parts. Multi-agent work becomes useful where a task needs to be broken down into subtasks.

How it looks in practice
In the simplest version, the scheme looks like this.
  1. The orchestrator agent gathers and structures the context.
  2. The second agent proposes a solution or a draft of the changes.
  3. The third checks the result: where the risks are, what was missed, and what needs to be checked manually.
  4. If necessary, the orchestrator monitors the order of steps and decides when the result is ready to be passed to a human.
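To make the division concrete, here is a minimal sketch of that three-role scheme in Python. The agent "brains" are stubbed out as plain functions, and all names and checks are illustrative, not a real framework API:

```python
# Minimal sketch of the orchestrator -> executor -> critic scheme.
# Each role is a stub; a real system would back each one with an LLM call.

def orchestrator_gather(task: str) -> dict:
    """Role 1: collect and structure the context for the task."""
    return {"task": task, "files": ["azure-pipelines.yml"], "facts": []}

def executor_propose(context: dict) -> str:
    """Role 2: propose a draft solution based on the structured context."""
    return f"draft fix for: {context['task']}"

def critic_review(context: dict, draft: str) -> list[str]:
    """Role 3: check the result independently and list risks or omissions."""
    issues = []
    if "rollback" not in draft:
        issues.append("warning: no rollback plan in the draft")
    return issues

def run_pipeline(task: str) -> tuple[str, list[str]]:
    ctx = orchestrator_gather(task)
    draft = executor_propose(ctx)
    issues = critic_review(ctx, draft)
    return draft, issues

draft, issues = run_pipeline("deployment fails on step 'docker push'")
```

The point is not the stub logic but the shape: each role sees only its own inputs, and the handoffs between roles are explicit.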

In more complex cases, domain roles appear. For example, one agent looks only at security, another at operational risks, and a third at documentation.

Established approaches to multi-agent work

Multi-agent work did not appear yesterday, and by now the community has already developed practical approaches to building such systems.

The first is plan-and-execute. The idea is simple: before launching executors, someone must break the task down into steps and dependencies. Otherwise, execution almost immediately turns into chaotic jumping between logs, files, hypotheses, and fixes. This is where the orchestrator role appears.
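A sketch of what the orchestrator's plan-and-execute output might look like, with hypothetical step names and a simple dependency-respecting ordering (the ordering assumes the plan has no cycles):

```python
# Plan-and-execute sketch: the orchestrator emits steps with explicit
# dependencies before any executor runs. Step names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    depends_on: list[str] = field(default_factory=list)

def plan(task: str) -> list[Step]:
    """A hypothetical plan for a deployment-failure task."""
    return [
        Step("gather_context"),
        Step("locate_error", depends_on=["gather_context"]),
        Step("draft_fix", depends_on=["locate_error"]),
        Step("review_fix", depends_on=["draft_fix"]),
    ]

def execution_order(steps: list[Step]) -> list[str]:
    """Order steps so each one runs only after its dependencies (acyclic plans only)."""
    done: set[str] = set()
    order: list[str] = []
    while len(order) < len(steps):
        for s in steps:
            if s.name not in done and all(d in done for d in s.depends_on):
                done.add(s.name)
                order.append(s.name)
    return order

order = execution_order(plan("fix failing staging deploy"))
```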

The second important approach is ReAct. It defines the working pattern of an executor agent: first a hypothesis, then an action, then observation of the result. The agent looks at a file, runs a command, sees the output, and adjusts the next step.
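The hypothesis -> action -> observation cycle can be sketched like this; the `think` and `act` functions are stand-ins for a real model call and real tools:

```python
# Toy ReAct-style loop. In a real agent, think() would be an LLM call
# and act() would run a shell command or read a file.

def think(state: dict) -> str:
    """Form the next hypothesis from the current state (stubbed)."""
    return "error likely in Dockerfile" if not state["observations"] else "fix base image tag"

def act(hypothesis: str) -> str:
    """Execute the action the hypothesis suggests (stubbed tool call)."""
    return f"output of action for: {hypothesis}"

def react_loop(max_steps: int = 3) -> dict:
    state = {"observations": []}
    for _ in range(max_steps):
        hypothesis = think(state)
        observation = act(hypothesis)
        # Each observation feeds the next hypothesis.
        state["observations"].append((hypothesis, observation))
    return state

state = react_loop(2)
```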

The third is context management. Not every agent needs the entire context. Moreover, today's models still struggle to use very long contexts effectively. So the system works better when context is loaded in measured portions: only the required files, only the required instructions, only the relevant facts. Otherwise, the system starts drowning in its own noise before it has time to bring value.
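A toy illustration of measured context loading: only files that look relevant to the task are pulled in, up to a small limit. The file names and the keyword scoring are invented for the example:

```python
# Measured context loading sketch: score repository files against the
# task's keywords and load only the top matches, not the whole repo.

REPO_FILES = {
    "azure-pipelines.yml": "trigger: main ...",
    "Dockerfile": "FROM python:3.12 ...",
    "README.md": "project overview ...",
    "docs/architecture.md": "system design ...",
}

def select_context(task: str, files: dict[str, str], limit: int = 2) -> dict[str, str]:
    """Load at most `limit` files whose names match the task's keywords."""
    keywords = task.lower().split()
    scored = {name: sum(k in name.lower() for k in keywords) for name in files}
    relevant = sorted(scored, key=scored.get, reverse=True)[:limit]
    # Drop files with no match at all, even if the limit allows more.
    return {name: files[name] for name in relevant if scored[name] > 0}

ctx = select_context("dockerfile build fails in pipelines", REPO_FILES)
```

A real system would use embeddings or an index instead of substring matching, but the budget-first principle is the same.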

The fourth approach is a separate critic as an isolated role. The critic must be at the same level as the main role or stronger. Isolation is important here. If the critic receives all the internal chatter, earlier doubts, and reasoning of the author, it quickly starts checking not the result, but the logic it has already been contaminated by. That is why it is important for the critic to see the task, criteria, and result, not the executor's chain of reasoning.
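A sketch of that isolation: the critic's prompt is assembled only from the task, the criteria, and the result, while the executor's trace is deliberately dropped. Field names are illustrative:

```python
# Critic isolation sketch: the critic judges the artifact, never the
# author's chain of reasoning.

def build_critic_prompt(task: str, criteria: list[str], result: str,
                        executor_trace: list[str]) -> str:
    # executor_trace is accepted but intentionally never used: leaking it
    # would contaminate the critic with the executor's logic.
    lines = [f"Task: {task}", "Criteria:"]
    lines += [f"- {c}" for c in criteria]
    lines += ["Result:", result]
    return "\n".join(lines)

prompt = build_critic_prompt(
    task="fix failing docker build",
    criteria=["build passes", "no new security risks"],
    result="pin base image to python:3.12-slim",
    executor_trace=["tried :latest first", "guessed it was the cache"],
)
```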

The fifth approach is an explicit rubric for criticism. This is a concrete working scheme for the critic. Without it, the critic very quickly turns into a vague "I don't like this" role. When there are clear categories such as blocker, warning, and suggestion, the review starts working noticeably better. It brings clarity to the critic's output, which improves interpretation by other agents.
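The rubric can be made explicit in code; the three severity levels below mirror the ones named above, and the acceptance rule (only blockers fail the review) is one reasonable convention, not the only one:

```python
# Explicit review rubric sketch: every finding carries a severity, and
# the pass/fail decision is a deterministic rule, not a vague opinion.
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    BLOCKER = "blocker"
    WARNING = "warning"
    SUGGESTION = "suggestion"

@dataclass
class Finding:
    severity: Severity
    message: str

def review_passes(findings: list[Finding]) -> bool:
    """Accept the result only if there are no blockers."""
    return not any(f.severity is Severity.BLOCKER for f in findings)

findings = [
    Finding(Severity.WARNING, "no rollback plan"),
    Finding(Severity.SUGGESTION, "add a comment to the Dockerfile"),
]
```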

The sixth layer is Reflexion and proper iteration completion. It is important for the system to take into account not only the context, but also previous iterations of its own work. Otherwise, it will simply go in circles and burn tokens. At the same time, the number of iterations must be limited. At some point, the system must be able to say that a human is needed next.

There are also additional amplifiers. RAG is useful where important knowledge lives outside the current dialogue: in documentation, knowledge packages, standards, and internal bases. Tree of Thoughts and Skeleton-of-Thought help when the orchestrator first needs to build a solution skeleton or branch several plan options. Multi-Agent Debate and Mixture of Agents are appropriate when one review is no longer enough and you need either to confront positions or run the result through several levels of refinement. Spec-Driven Development is useful in tasks where you first need to agree on a specification and only then move to implementation. But this is already fine-tuning for the needs and preferences of a specific project or person.

How do you configure all this?

By now it should be clear that a multi-agent system cannot be built with one large prompt. Fortunately, the industry has developed approaches to logically organizing the work of multiple agents.

The first thing to know is AGENTS.md. It lives at the root of the repository and serves as a project map. It is convenient to keep core context there: what is in the repository, how to run the project, where the sensitive areas are, what commands exist, what limitations and working boundaries apply. This file is the first thing an agent reads; it serves as the project's baseline context store.
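As a rough illustration, an AGENTS.md might be organized like this; the sections are a common convention rather than a fixed schema, and all contents here are invented:

```markdown
# AGENTS.md  (illustrative structure, not a fixed standard)

## What this repository is
Service X: deployment pipelines, Dockerfiles, IaC for environment Y.

## How to run
- build: `make build`
- tests: `make test`

## Sensitive areas
- `deploy/prod/` - changes require human confirmation
- secrets are injected via the CI vault, never committed

## Boundaries
- do not run destructive commands (`rm -rf`, `terraform destroy`)
- stop and escalate after 3 failed iterations
```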

Separate prompt files for each agent are needed for the same reason roles are separated in the first place. The orchestrator, executor, and critic should not live in one long text. They have different tasks, different toolsets, and different ways of looking at the result. If all of this is thrown into one prompt, the executor will start absorbing the critic's logic, the critic will receive unnecessary noise, and the file itself will quickly become hard to maintain. In practice, a scheme where each role has its own file with its own contract works much more reliably.

A separate layer is skills, usually in the form of SKILL.md. They are needed where knowledge is repeated and should be connected only when necessary: security, documentation, architectural patterns, or specifics of a particular stack. Practice shows that if an agent prompt becomes too bloated, the agent starts losing track of instructions in the middle of it (the well-known lost-in-the-middle effect). That is why it is better to describe the rules for connecting skills in the main prompt than to dump everything into one pile.
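For example, a security skill might look roughly like this; the layout is a convention, and the trigger rules and checks are invented for illustration:

```markdown
# SKILL.md - security review  (illustrative layout)

## When to load
Connect this skill when the task touches Dockerfiles, CI secrets,
or anything under `deploy/`.

## Rules
- flag any credential or token committed to the repository
- check that base images are pinned to a digest or an explicit tag
- treat disabled TLS verification as a blocker
```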

On top of that, there is usually a shared rules layer. For example, workspace instructions or copilot-instructions.md hold what should apply to all roles at once: general constraints, quality requirements, and baseline behavior. This is a different type of information. It is also useful to keep it separate so you do not duplicate the same things in every agent prompt.

Overview files such as llms.txt can also be useful, and sometimes separate .prompt.md files for repeatable scenarios. The first helps enter the project quickly without long reading. In effect, it duplicates AGENTS.md in a form oriented at LLM consumption. The second is useful when the same task is repeated many times and it is more convenient to formalize it as a reusable prompt block rather than rewrite it manually for every agent.

Next come the artifacts produced by the system's work. As soon as agents do more than one step, a question appears almost immediately: where should the context of the current task be stored, and how can we later understand what exactly the agents did within that task?

This is why a layer of session context almost always appears next to the permanent files. It can be a separate TASK_CONTEXT.md or another working file where the task statement, accepted constraints, important findings, previous failed attempts, and current status are accumulated. The meaning is very simple: the next pass over the task should not start from zero. If the executor has already hit a dead end, the critic has already found a weak spot, and the orchestrator has already narrowed the scope, this must be saved explicitly somewhere, not live only in the memory of the last dialogue.
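A session context file might be laid out like this; the headings and contents are one possible convention, invented for the example:

```markdown
# TASK_CONTEXT.md  (illustrative layout)

## Task
Deployment to staging fails at the `docker push` step.

## Constraints
- no changes to prod configs

## Findings so far
- registry credentials expire mid-pipeline (executor, iteration 1)

## Rejected hypotheses
- network timeout (disproved by logs, iteration 1)

## Status
Iteration 2: fix drafted, waiting for critic review.
```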

On top of this, tracing of agent steps usually appears as well. As a rule, it is needed for debugging: who started the task, which iteration it was, who executed what, when the critic returned comments, where escalation to a human happened, and at which step the process got stuck.
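A minimal sketch of such tracing: every agent action is appended as a structured record that can later be dumped as JSON lines. Field names are illustrative:

```python
# Step-tracing sketch: a flat list of structured records answers "who
# did what, on which iteration, and where did the process stall".
import json
import time

trace: list[dict] = []

def log_step(role: str, action: str, iteration: int, detail: str = "") -> None:
    trace.append({
        "ts": time.time(),
        "role": role,
        "action": action,
        "iteration": iteration,
        "detail": detail,
    })

log_step("orchestrator", "plan", 1)
log_step("executor", "draft_fix", 1)
log_step("critic", "return_comments", 1, detail="warning: no rollback plan")
log_step("orchestrator", "escalate_to_human", 2)

# Dump as JSON lines for later analysis or debugging.
dump = "\n".join(json.dumps(r) for r in trace)
```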

As a result, beyond the idea of multi-agent work itself, the modern industry already has a number of practical approaches for implementing it. And all of this is actively developing right now.

What the final working scheme looks like

If we combine these approaches into a single working loop, it looks roughly like this.
  1. The orchestrator receives the task and first breaks it down into steps. This is where plan-and-execute works: first order, then execution.
  2. The system loads only the necessary context. Here, both careful knowledge loading and file organization matter: the project map from AGENTS.md, the short overview from llms.txt, general rules from instructions, the required skills, and only the files that are actually relevant to the task.
  3. The executor works in short cycles of "hypothesis -> action -> observation." This is practical ReAct, without which an engineering task quickly turns into guesswork.
  4. The critic receives the result separately from the executor and checks it against an explicit rubric. This is where LLM-as-Judge, critic isolation, and review based on predefined severity levels come together.
  5. During the work, the system updates the session context of the task: what has already been checked, which hypotheses were rejected, and which comments need to be addressed in the next pass. Without this, Reflexion quickly turns into a nice word with no memory.
  6. In parallel, the step log is also saved: planning, execution, critique, new iteration, escalation, completion. This makes the system observable and later allows you to analyze not only the result, but the path to it.
  7. If real problems are found, the next iteration does not start from scratch, but takes previous mistakes into account. This is where Reflexion really starts to work.
  8. If iterations do not produce a clear result, the process stops and the task is returned to a human. This is not a weakness of the system, but a sign of healthy architecture.
  9. For risky actions, permissions must be separated in advance. The executor gets only the tools required for its role, the critic remains as close to read-only as possible, and irreversible steps are confirmed by a human.
  10. In more complex tasks, specialist roles and additional review loops can be added on top of this scheme. But the basic logic remains the same: role separation, isolated review, controlled iterations, context preservation, and a clear stopping point.
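Point 9 above can be sketched as a simple per-role tool allowlist, with irreversible tools gated on human confirmation. The tool and role names are illustrative:

```python
# Per-role permission sketch: each role gets only the tools its job
# needs; irreversible tools always require explicit human confirmation.

ROLE_TOOLS = {
    "orchestrator": {"read_file", "write_task_context"},
    "executor":     {"read_file", "edit_file", "run_tests"},
    "critic":       {"read_file"},  # as close to read-only as possible
}
NEEDS_HUMAN = {"deploy_prod", "delete_resource"}

def allowed(role: str, tool: str, human_confirmed: bool = False) -> bool:
    """Check whether a role may use a tool right now."""
    if tool in NEEDS_HUMAN:
        return human_confirmed
    return tool in ROLE_TOOLS.get(role, set())
```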

This is exactly what makes multi-agent work not just theory, but a repeatable engineering practice.

Where this can be useful

In DevOps, the answer is very simple: the task itself usually already consists of different types of work. Infrastructure, pipelines, containers, security, diagnostics, and documentation all live side by side.

For example, when analyzing a problematic deployment, one role can go through azure-pipelines.yml and DeploymentPackagePipline.yml, another can compare DockerfilePlamar, DockerfileScheduler, and startup configs, while a third checks security issues. One large, tangled task gets decomposed into several manageable steps.

This logic transfers well to other engineering areas too. In development, one role can write a draft solution, the second can check architectural consistency, and the third can assess change risks.

In documentation, one role gathers facts, the second writes the explanation, and the third looks for semantic gaps and dangerous oversimplifications.

The same works in QA, analytics, and research tasks. This is why multi-agent work is becoming an integral part of the AI agenda, even though in essence it repeats the structure of large teams, adjusted for its own specifics.
Limitations of the approach

But it is important not to swing to the opposite extreme. Multi-agent work has its own cost.

First, orchestration becomes more complex. You need to decide which roles are actually needed, how they pass context to one another, and when the process stops.

Second, the requirements for rule quality increase significantly. A vague task statement in a multi-agent system almost always becomes even more vague. If role boundaries are unclear, the process quickly falls apart.

Third, the cost of mistakes in the architecture of the process itself increases. Poor role distribution gives not amplification but duplication, extra iterations, and noise, which is effectively money spent without a result.

And finally, there remains the temptation to overcomplicate the system too early. There are many tasks where one agent and normal manual review are more than enough.

So multi-agent work should not be seen as a universal improvement, but as a tool that is not needed everywhere.

When multi-agent work is truly justified
The criterion is very simple: if the task is short, local, and fits well in one head, multi-agent work only gets in the way. But if the task is long, mixes different types of work, requires a separate review of the result, and the cost of an error is noticeable, then this approach starts to pay off.

It is in these tasks that multi-agent work provides a noticeable increase in productivity and quality.

However, I want to stress one important thing. Even a multi-agent system remains a tool. It is not a replacement for a person, let alone a team. It is only a way to become even more effective.

Summary
In the end, we can say that multi-agent work becomes useful exactly when one role can no longer carry the entire task.

As soon as build logs, a YAML pipeline, a Dockerfile, release risks, and the need to verify the result separately all appear together, role separation starts to bring value.

We discussed the approaches and even the file structure for such a system, but this is still theory. Let's move further toward practice. In the next article, we will look at AutoGen, a Python framework created specifically for building multi-agent systems.