AutoGen: Multi-Agent Conversation Frameworks for Finance AI
After Gorilla showed that a single LLM can learn to call thousands of APIs accurately, the natural question is: what happens when you give multiple LLMs distinct roles and let them talk to each other? AutoGen (Wu et al., 2023) answers that question by building a framework for multi-agent conversation, and reading it now feels timely — most production finance AI systems I see being designed involve at least three agents by default.
The paper
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (Wu, Bansal, Zhang et al., Microsoft Research, 2023) proposes a framework where "conversable agents" — each backed by some combination of an LLM, tools, and human input — send messages to one another until a task is complete. The framework introduces two built-in agent types: AssistantAgent (driven by an LLM) and UserProxyAgent (which can execute code and relay human input), plus a GroupChatManager that routes turns in larger ensembles.
The core idea is what the authors call "conversation programming": instead of hand-writing orchestration logic in code, you specify what each agent should do via natural-language system prompts and let message passing handle control flow. The paper demonstrates this across math problem-solving, retrieval-augmented QA, ALFWorld decision-making, and an operations-research application called OptiGuide.
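The control-flow idea is easy to see in miniature. This is not AutoGen's actual API, just a stdlib sketch of the pattern: each agent is a function from the message history to a reply, and a plain message-passing loop replaces hand-written orchestration until a termination marker appears.

```python
# Toy illustration of "conversation programming" (not AutoGen's real API):
# agents map the shared message history to a reply, and a message-passing
# loop handles control flow until one of them emits "TERMINATE".

def assistant(history):
    # Stand-in for an LLM call: propose code, then signal completion
    # once the execution result appears in the history.
    if not any("result: 4" in msg for msg in history):
        return "please run: print(2 + 2)"
    return "TERMINATE"

def user_proxy(history):
    # Stand-in for code execution: run the proposed snippet, return output.
    if "print(2 + 2)" in history[-1]:
        return "result: 4"
    return "nothing to execute"

def converse(agents, opening, max_turns=10):
    history = [opening]
    for turn in range(max_turns):
        reply = agents[turn % len(agents)](history)
        if reply == "TERMINATE":
            break
        history.append(reply)
    return history

transcript = converse([assistant, user_proxy], "solve 2 + 2 with code")
```

The point of the sketch is that neither agent knows about the loop; the "program" lives in the prompts (here, the stand-in functions) and the message-passing convention, which is exactly the conversation-programming claim.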
Key ideas
- Accuracy lift on the MATH benchmark: a two-agent AutoGen setup (an LLM assistant plus a code-executing proxy) reaches 69.48% on the MATH test set, compared to 55.18% for GPT-4 used alone, a 14.3-point gain from adding code-execution feedback.
- Human-in-the-loop is first class: the `UserProxyAgent` has a configurable `human_input_mode` (`ALWAYS`, `NEVER`, or `TERMINATE`), meaning you can dial oversight up or down without changing the agent's logic.
- Dynamic group chat: the `GroupChatManager` selects the next speaker based on conversation state rather than a fixed round-robin order, which lets workflows branch in response to emerging results.
- OptiGuide safety gain: attaching a SafeGuard agent to a supply-chain optimization workflow improved unsafe-code detection F1 by 8 percentage points on GPT-4 and 35 points on GPT-3.5, while shrinking the user's codebase from 430 lines to 100.
- Interactive retrieval: in QA tasks, the assistant agent could request additional context by emitting an `UPDATE CONTEXT` signal; this triggered on roughly 19.4% of Natural Questions, and the overall F1 was 23.40%.
- Composability by design: any AutoGen agent is itself a valid "tool" that another agent can call, so hierarchical pipelines compose without special glue code.
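The three oversight modes are worth pinning down, since they are the knob a finance deployment would actually turn. A minimal sketch of the gating logic, with the mode names taken from the paper and the decision function itself being illustrative:

```python
# Sketch of UserProxyAgent's three human-oversight modes as described in
# the paper. The mode names are real; this gating function is illustrative.

def needs_human_input(mode: str, is_termination_message: bool) -> bool:
    if mode == "ALWAYS":
        # Prompt the human on every incoming message.
        return True
    if mode == "TERMINATE":
        # Only prompt when the conversation is about to end,
        # giving the human a chance to veto or extend it.
        return is_termination_message
    if mode == "NEVER":
        # Fully automated: code execution and auto-replies only.
        return False
    raise ValueError(f"unknown human_input_mode: {mode}")
```

For a write-capable accounting agent, `TERMINATE` is the interesting middle setting: the pipeline runs unattended but a human gets the final say before the session closes.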
What holds up — and what doesn't
The MATH and ALFWorld results are solid — controlled, reproducible comparisons against named baselines with real benchmarks. The 69.48% figure is meaningful because it isolates the benefit of code execution feedback within a structured conversation loop.
What's weaker is the cost and latency analysis, or rather its absence. Every GroupChat turn triggers a full LLM call with the accumulated conversation history. A four-agent workflow with ten rounds means forty LLM calls minimum, each with a growing context window. The paper never reports token cost or latency for any of its applications. In a live accounting pipeline processing thousands of transactions, that omission is not academic — it determines whether the approach is viable at all.
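The omission is easy to quantify with a back-of-envelope model. The numbers below are illustrative assumptions, not figures from the paper: if every turn appends roughly m tokens and each call re-reads the full accumulated history, prompt tokens grow quadratically in the number of calls.

```python
# Back-of-envelope cost model (illustrative numbers, not from the paper):
# call i re-reads i prior messages of ~m tokens each, so total prompt
# tokens across n calls are m * (1 + 2 + ... + n) = m * n(n+1)/2.

def total_prompt_tokens(num_calls: int, tokens_per_message: int) -> int:
    return tokens_per_message * num_calls * (num_calls + 1) // 2

# Four agents, ten rounds: forty calls minimum, as noted above.
print(total_prompt_tokens(40, 500))  # 410000 prompt tokens at ~500 tokens/message
```

Forty calls at an assumed 500 tokens per message already means about 410k prompt tokens for one workflow run; multiply by thousands of transactions and the quadratic term dominates the bill.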
The conversation programming metaphor is also more fragile than it looks in demos. The GroupChatManager selects the next speaker by prompting the LLM to choose from a list of agents. That selection is itself a probabilistic text generation step, meaning control flow can go wrong in subtle ways that don't raise exceptions. For a ledger write-back agent — where the order of operations matters and a misplaced tool call could corrupt a journal entry — non-deterministic speaker selection is a real liability.
Finally, the evaluation tasks are all single-session, short-horizon. There is no experiment where agents accumulate state across days, handle contradictory instructions, or need to resolve conflicts between an older agent memory and a newer ledger entry. These are exactly the scenarios that arise in real accounting workflows.
Why this matters for finance AI
The finance AI case for multi-agent systems is straightforward: reconciliation, posting, and reporting are naturally separate concerns. A Beancount pipeline could have a LedgerReaderAgent that queries the ledger as read-only, a ReconcilerAgent that compares transactions against bank statements, a WriterAgent that proposes new entries, and a ReviewerAgent that checks them against chart-of-accounts rules before any write is committed. AutoGen's UserProxyAgent pattern is the right abstraction for the WriterAgent — it can execute the actual ledger write and return the result as a message that the ReviewerAgent inspects.
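The propose/review gate at the heart of that pipeline can be sketched in a few lines. Everything here is hypothetical: the agent names come from the paragraph above, the balance rule is a stand-in, and real code would use AutoGen agents and Beancount's own loader rather than these stubs.

```python
# Sketch of the WriterAgent -> ReviewerAgent gate from the pipeline above.
# Agent logic and the chart-of-accounts rule are hypothetical stand-ins.

def writer_agent(bank_line):
    # Propose a double-entry transaction for one bank statement line.
    return [("Assets:Checking", -bank_line["amount"]),
            ("Expenses:Unknown", bank_line["amount"])]

def reviewer_agent(entry, chart_of_accounts):
    # Reject entries that touch unknown accounts or do not balance to zero.
    accounts_ok = all(acct in chart_of_accounts for acct, _ in entry)
    balanced = abs(sum(amt for _, amt in entry)) < 1e-9
    return accounts_ok and balanced

def commit_if_approved(bank_line, chart_of_accounts, ledger):
    entry = writer_agent(bank_line)
    if reviewer_agent(entry, chart_of_accounts):
        ledger.append(entry)  # the reviewer gates the write itself
        return True
    return False

ledger = []
ok = commit_if_approved({"amount": 42.50},
                        {"Assets:Checking", "Expenses:Unknown"}, ledger)
```

The structural point is that `commit_if_approved` never writes without the reviewer's verdict; in AutoGen terms, the ReviewerAgent's message is a precondition of the WriterAgent's tool call, not a downstream audit.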
The OptiGuide SafeGuard result is the most directly transferable finding: adding a dedicated verification agent to catch unsafe actions improved detection substantially, and the detection happened inside the conversation loop rather than as a post-hoc audit. That is exactly the architecture I would want for Beancount write-back safety — a verifier that blocks the commit, not one that alerts after the fact.
The non-deterministic speaker selection problem is solvable: you can override the GroupChatManager with a deterministic Python function that routes based on message content. But you have to know to do that, and the paper doesn't foreground it as a concern.
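A deterministic router is small enough to sketch. Newer AutoGen releases accept a custom callable as the group chat's speaker-selection method (check your version's API); the version below is framework-agnostic plain Python, and the message markers and agent names are hypothetical examples for a ledger pipeline.

```python
# Framework-agnostic sketch of deterministic speaker routing, replacing
# the LLM-prompted speaker choice. Markers and agent names are hypothetical.

ROUTES = [
    ("PROPOSED ENTRY", "ReviewerAgent"),     # every proposal must be reviewed
    ("REVIEW: APPROVED", "WriterAgent"),     # approved entries go back for commit
    ("REVIEW: REJECTED", "ReconcilerAgent"), # rejections return to reconciliation
]

def select_next_speaker(last_message: str, default: str = "ReconcilerAgent") -> str:
    # Route on message content instead of asking an LLM to pick a speaker,
    # so control flow is reproducible and auditable.
    for marker, agent in ROUTES:
        if marker in last_message:
            return agent
    return default
```

With routing pinned down like this, the LLMs still generate the message contents, but the order of operations (the part that can corrupt a journal entry) is no longer a sampling outcome.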
What to read next
- AgentBench: Evaluating LLMs as Agents (Liu et al., arXiv:2308.03688, ICLR 2024) — benchmarks LLMs across eight distinct agent environments including web browsing, coding, and database manipulation; the gap between commercial and open-source models is the key finding and directly informs which base models to use for finance agent pipelines.
- TradingAgents: Multi-Agents LLM Financial Trading Framework (arXiv:2412.20138) — directly instantiates the AutoGen pattern for financial markets with specialized analyst, researcher, trader, and risk-manager agents; the Sharpe ratio and max drawdown results give the first real performance numbers for multi-agent finance systems.
- AGENTLESS: Demystifying LLM-based Software Engineering Agents (Xia et al., arXiv:2407.01514) — argues that a simple, agentless two-phase approach (localize, then repair) outperforms complex multi-agent frameworks on SWE-bench; a useful counterweight to the assumption that more agents always helps.
