Can AI Reason About Politics? I Built a Trading Agent to Find Out
I spin up my AI prediction market trading agent. It produces an elegant dashboard showing all the contracts I can bet on, sorted by how much “edge” my agent tells me we have. I choose the first option, our bet with the highest expected gain: Republicans to win control of the House and Senate in 2026. The market price is only 20%, and my agent assures me that the Republicans have a 52.9% chance to win. Time to get that edge.
There’s just one problem. I click through to view my agent’s reasoning: “my evidence tells me that these elections already occurred and the Republicans won.” Hmmmm. Why does my agent think the election already occurred? And if it did, shouldn’t the probability be 100%, not 52.9%? Maybe my agent isn’t quite as intelligent as I was hoping...
If we could build AI systems that genuinely reason about politics—systems that grasp strategy, voter behavior, economic incentives, foreign policy constraints, the power of concentrated interests, and so on—we would have something meaningfully new. Models that aren’t “just” good at writing code, answering math problems, or helping us draft emails, but that can even provide insight into how political outcomes come to pass.
The recent rise of prediction markets may point towards a way to get there. Unlike most existing AI evaluations, they are forward-looking by construction: the outcomes don’t yet exist, can’t be scraped from the internet, and can’t be smuggled into the training data accidentally or intentionally. This makes them a rarity in the AI evaluation landscape: a benchmark that can’t be gamed through memorization.
Over the holidays, I plunged into this strange and fascinating world. I designed, built, and rebuilt a suite of AI agents specialized to make predictions and suggest trades for political prediction markets, which I’m now using as part of an ongoing academic study. In this piece, I’ll explain what I did, what I learned, and how it can help us to design AI with real political intelligence.
Here’s what happens when you try to teach an AI the dark arts of political analysis.
The basic setup
I built my agent in Python using Claude Code. First, the program pulls all politically relevant contracts from Kalshi via their free API. Using AI—Anthropic’s Haiku model, specifically—I parse the contract and its resolution rules. The program searches for relevant news coverage in GDELT, an enormous database which draws from a continuously updated, crawl-based universe of tens of thousands of global news, government, NGO, and blog websites across 100+ languages. I then use Anthropic’s heavier-duty models, Sonnet and Opus 4.5, to interpret these news articles, compare them to the market price, and provide an estimate of the probability of the outcome, taking into account the specific resolution rules.
Finally, the program outputs a dashboard that shows the contracts, their market prices, the AI agent’s estimates, and its trade recommendation.
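To make the last step concrete, here is a minimal sketch of how dashboard rows could be ranked by edge. It is illustrative only, not the agent's actual code: the `Row` class, the `min_edge` threshold of 5 points, and the tickers are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Row:
    ticker: str          # hypothetical label, not a real Kalshi ticker
    market_price: float  # YES price in dollars, read as a probability
    model_prob: float    # the agent's estimated probability of YES

    @property
    def edge(self) -> float:
        # Positive edge: the model thinks YES is underpriced.
        return self.model_prob - self.market_price


def recommend(row: Row, min_edge: float = 0.05) -> str:
    """Turn an edge into a trade suggestion; min_edge is an assumed cutoff."""
    if row.edge >= min_edge:
        return "BUY YES"
    if row.edge <= -min_edge:
        return "BUY NO"
    return "PASS"


def build_dashboard(rows, min_edge: float = 0.05):
    """Sort rows by absolute edge, largest first, and attach a suggestion."""
    ranked = sorted(rows, key=lambda r: abs(r.edge), reverse=True)
    return [
        (r.ticker, r.market_price, r.model_prob, round(r.edge, 3),
         recommend(r, min_edge))
        for r in ranked
    ]


rows = [
    Row("EXAMPLE-A", 0.20, 0.529),  # the 20% vs 52.9% case from the intro
    Row("EXAMPLE-B", 0.50, 0.48),
    Row("EXAMPLE-C", 0.50, 0.15),
]
for line in build_dashboard(rows):
    print(line)
```

A large edge is only a reason to read the reasoning, not a reason to trade: as the opening example shows, the biggest apparent edges are often the biggest model errors.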

AI is magic
A confession: I am not a software engineer. I’m a political scientist who can write code when I need to, but building even a relatively simple system like this from scratch—pulling data from multiple APIs, parsing thousands of contracts, orchestrating calls to multiple AI models, outputting a functional dashboard—would have taken me an annoyingly long time. I built this one over a holiday break, mostly by carefully planning out the project in conversation with Claude Code, approving various actions, and iterating when things broke or behaved in unexpected ways.
That’s the real story here. I stitched together Kalshi’s API, GDELT’s vast news archive, and frontier AI models into a functioning analytical tool without writing most of the code myself. The barrier between “I wonder if this would work” and “let me try it” has effectively collapsed.
We talk a lot about AI replacing jobs or achieving superhuman benchmark performance. We talk less about what happens when individual researchers, traders, or journalists can each command their own personal swarm of AI agents—tools that pull data, synthesize evidence, and surface insights on demand. A political scientist with a laptop and an API key can now build his own trading desk. My agent makes plenty of mistakes, as we’ll see. But the fact that it exists at all suggests something meaningful has shifted.
Signs of real intelligence
My agent had plenty of stumbles, but it also offered promising glimmers of intelligence applied at scale. There’s no way I could go through the thousands of contracts listed on Kalshi and come up with sophisticated takes on all of them. But the agent surfaced a number of contracts with useful analysis.
For example, the agent thought that President Trump is more likely to add himself to Mt. Rushmore than the market thinks—and its reasoning seemed apt. It pointed out that Trump floated this idea five years ago, recently renamed the Kennedy Center after himself (pending ongoing litigation), and that the resolution rules state that he merely has to submit an Executive Order, not actually complete the addition.
Another example: the agent thought carefully about whether President Trump might pardon Steve Bannon for a second time. On the positive side of the ledger, it noted that Trump has demonstrated strong loyalty to his most controversial allies, has pardoned many others in situations similar to Bannon’s, and isn’t afraid of political controversy. Cutting the other way, it noted that Trump and Bannon have fallen out, that recent revelations about Bannon and Epstein might increase the political costs of a pardon, and even pointed out that since Bannon has already served his prison time, the value of a pardon is lower now.
In both cases, the reasoning seemed sound, though it was nothing so brilliant that a person couldn’t come up with it given enough time. Thus far, in my experiments, the edge these agents provide, to the extent it exists, comes from their ability to be roughly as insightful as a normal human but at the scale of thousands of contracts, far more than any one person could give meaningful attention to.
Because I could only crudely link in simple sources of external information, the agent could not match the extraordinary efforts that some prediction market traders have famously pulled off, such as deep dives into papal conclaves or commissioning private polls. But for automated analysis at scale, it was still impressive. It might not substitute for the incredible sleuthing of pro traders, but it could be a useful complement.
Growing pains
At the same time, my AI agent made a number of mistakes—some hilarious, some grave—that expose where the technology is still rough around the edges.
Confused about time
The House race example I started this piece with wasn’t a fluke. The agent repeatedly struggled with time—what’s resolved versus pending, what information is current versus stale, where we are in a political process.
I’ve been able to fix most of these issues by having the program tell the AI in the prompt what today’s date is, and reminding it to make sure to use current information, but this seems kludgy.
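A minimal sketch of that kind of date-anchored prompt follows. The wording of the prompt, the `build_prompt` helper, and the closing date used in the example are all assumptions for illustration, not the actual prompt the agent uses.

```python
from datetime import date
from typing import Optional


def build_prompt(question: str, rules: str, close_date: date,
                 today: Optional[date] = None) -> str:
    """Return a forecasting prompt that states the current date explicitly."""
    today = today or date.today()
    days_left = (close_date - today).days
    if days_left < 0:
        timing = f"The contract closed {-days_left} days ago."
    else:
        timing = f"The contract closes in {days_left} days."
    return (
        f"Today's date is {today.isoformat()}. {timing}\n"
        "Treat any event dated after today as not yet having happened, "
        "and prefer the most recent evidence.\n\n"
        f"Question: {question}\n"
        f"Resolution rules: {rules}\n"
        "Give your probability of YES as a number between 0 and 1."
    )


prompt = build_prompt(
    "Will Tim Cook step down as Apple CEO before 2027?",
    "(resolution rules text goes here)",
    close_date=date(2026, 12, 31),  # assumed closing date for illustration
    today=date(2025, 12, 29),
)
print(prompt)
```

Computing the days remaining in code, rather than asking the model to work it out, removes one place where the date arithmetic can go wrong.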
I’m not the only person who’s noticed this problem. On Jan 5, a widely followed prediction market account on X pointed out a debacle in which an automated bot lost $100,000 after it detected President Trump mentioning the phrase “fake news” and subsequently purchased the wrong week’s mention market contract for the phrase. Whoops!
A similar example I saw in mine: the agent got very excited about a contract asking if Tim Cook would step down as Apple CEO before 2027. It assured me there was almost no chance this could resolve to yes with only a few days left in the year…but unless I’m mistaken, we’re only about to enter 2026, not 2027!
Difficulties with probabilities
If you think Candidate A has a 40% chance of winning and Candidate B has a 70% chance in a two-candidate race, something has gone wrong. But that’s the kind of mistake my agent made, more often than I would have liked.
It seems like the agent doesn’t reliably update correlated beliefs together. If new information makes it more confident that X will happen, it should become correspondingly less confident about mutually exclusive alternatives. But the models often treat each contract as an isolated question rather than part of an interconnected system.
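One simple guard is to check that estimates for mutually exclusive, exhaustive outcomes sum to 1 before acting on them. The sketch below is an illustration under that assumption; `coherence_gap` and `normalize` are hypothetical helper names, and proportional rescaling is a crude patch, not a substitute for the model reasoning about the outcomes jointly.

```python
def coherence_gap(probs: dict) -> float:
    """How far the probabilities of exclusive, exhaustive outcomes are from 1."""
    return sum(probs.values()) - 1.0


def normalize(probs: dict) -> dict:
    """Rescale the estimates proportionally so that they sum to 1."""
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive number")
    return {outcome: p / total for outcome, p in probs.items()}


# The two-candidate example: 40% and 70% cannot both be right.
estimates = {"Candidate A": 0.40, "Candidate B": 0.70}
print("gap:", round(coherence_gap(estimates), 3))
print("rescaled:", {k: round(v, 3) for k, v in normalize(estimates).items()})
```

A large gap is arguably more useful as a warning flag than as something to silently repair: it suggests the contracts were analyzed in isolation and should be re-run together.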
More generally, it’s not clear how well the agent reasons about probabilities, even though I designed my program to elicit predictions in probabilistic form. A recent academic study evaluating frontier models on Kalshi prediction data found “systematic overconfidence across all models” and reported that “extended reasoning worsens rather than improves calibration.” The researchers concluded that “epistemic calibration”—the alignment between expressed confidence and actual accuracy—is “a distinct capability, separate from accuracy, that current training approaches fail to adequately develop.”
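For readers who want to see what separating calibration from accuracy looks like in practice, here is a minimal sketch using two standard tools: the Brier score (mean squared error between forecast and outcome) and a binned calibration table. The function names and the choice of five bins are assumptions for illustration, and this is not the method used in the study quoted above.

```python
from collections import defaultdict


def brier_score(probs, outcomes):
    """Mean squared error between forecasts and 0/1 outcomes (lower is better)."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def calibration_table(probs, outcomes, n_bins=5):
    """Group forecasts into equal-width bins and compare the mean forecast
    with the observed frequency of YES in each bin."""
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, o))
    table = {}
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_forecast = sum(p for p, _ in pairs) / len(pairs)
        hit_rate = sum(o for _, o in pairs) / len(pairs)
        table[idx] = (mean_forecast, hit_rate, len(pairs))
    return table


# An overconfident forecaster: says 90% four times, is right only twice.
forecasts = [0.9, 0.9, 0.9, 0.9]
results = [1, 1, 0, 0]
print(brier_score(forecasts, results))
print(calibration_table(forecasts, results))
```

In a well-calibrated forecaster, the mean forecast and the hit rate in each bin are close. Overconfidence shows up as high-probability bins whose hit rate falls well short of the forecast.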
Missing political nuance
Politics is complicated, and my agent often seemed to overinterpret crude news articles without appreciating the broader context. For example, it insisted that the Labour Party has essentially no chance of winning the next UK election, based on recent polling evidence covered in the news. The agent is right that Labour is not favored (the market has the party at roughly a 1 in 4 chance to win), but its estimate of 4.1% is clearly too low.
The strange thing is that the “Key Drivers” it offers are quite sophisticated, yet they’re interpreted in a shallow way. This contract was analyzed by Sonnet, and my best guess is that, as one of the cheaper models, it probably failed to appreciate how important it is that UK elections are first-past-the-post, which historically consolidates support around the two major parties and makes it very difficult for third parties like Reform UK to convert poll numbers into seats (though British politics has certainly surprised us before).
When I manually asked GPT-5.2 the same question, it built from many of the same data points, but was sure to mention that the next UK election is likely very far off, Labour could still win a substantial number of seats given the way voting works, and that betting markets put Labour around a 27% chance to win.
Lacking important context—and being overconfident
In many cases, the model was hamstrung by its lack of relevant context because I failed to provide it with enough external data while using cheaper models that don’t search the web or access other external information. Often, instead of being humbled by its lack of context, the models reacted by opining confidently and incorrectly.
For example, Sonnet was convinced there was *no way* the song ‘Sleigh Ride’ by the Ronettes would be in the top 10 of Billboard’s Hot 100 for the week of Jan 3rd, 2026. In this case, it actually understood the timing issue, and flagged that Christmas might make the song more popular. Even so, it airily asserted that “historical data shows” that the song “rarely if ever cracks the top 10 even during peak Christmas week.”
But actually…the song regularly hits the top 10 around Christmas time (here are articles about 2022 and 2023), and it did so again the week of January 3rd, 2026. Sonnet was just wrong! Interestingly, other models got this right, and below I’ll show how Sonnet was able to correct its mistake after hearing from them.
Budgeting problems
An obvious problem that I didn’t fully anticipate when building my AI agent was how to budget for it: it’s actually kind of expensive to ask frontier AI models to analyze thousands of prediction-market contracts on a regular basis.
I faced a hard tradeoff. I could analyze all 675 active contracts with a cheap model like Haiku and get shallower analysis with lots of blunders. Or I could use heavy-duty models like Opus 4.5, get much higher quality analyses, but burn through money fast. The math is stark: running every contract through Opus twice a day could cost thousands of dollars a month, while a cheaper model like Haiku might be in the hundreds (all depending on token usage).
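The arithmetic behind that tradeoff can be sketched in a few lines. Every number in the example calls below other than the 675 contracts and the twice-a-day schedule is a placeholder: the token counts and per-million-token prices are made up for illustration and are not actual model prices.

```python
def monthly_cost(n_contracts: int, runs_per_day: int,
                 input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float,
                 days: int = 30) -> float:
    """Estimated monthly spend in USD for analyzing every contract on a schedule.

    Token counts are per analysis; prices are USD per million tokens.
    """
    per_call = (input_tokens * usd_per_m_input
                + output_tokens * usd_per_m_output) / 1_000_000
    return n_contracts * runs_per_day * days * per_call


# Placeholder inputs: 10k input and 1k output tokens per analysis.
expensive = monthly_cost(675, 2, 10_000, 1_000,
                         usd_per_m_input=10.0, usd_per_m_output=50.0)
cheap = monthly_cost(675, 2, 10_000, 1_000,
                     usd_per_m_input=1.0, usd_per_m_output=5.0)
print(f"expensive model: ${expensive:,.0f} per month")
print(f"cheap model:     ${cheap:,.0f} per month")
```

Because cost scales linearly in every input, halving the number of contracts, the run frequency, or the context length each halves the bill, which is what makes selective analysis attractive.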
My solution was a tiered approach. Haiku handles the initial triage, flagging contracts that look promising. A mid-tier model, Sonnet, takes a second look at those. And Opus—the expensive artillery—only gets called in for the top 25 contracts where deep reasoning might actually matter. This brings costs down to something manageable while (hopefully) preserving quality where it counts.
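In outline, the tiering is a three-stage filter. The sketch below is an illustration, not the agent's real code: the three model calls are passed in as plain functions, and the way a contract gets "flagged" (a cheap-model edge above an assumed threshold) is a guess at one reasonable rule.

```python
def tiered_analysis(contracts, cheap, mid, deep,
                    flag_threshold=0.05, top_k=25):
    """Run a cheap pass on everything, a mid pass on flagged contracts,
    and the expensive pass only on the top_k survivors.

    cheap(c) and mid(c) return an estimated edge; deep(c) returns a full
    analysis. All three are stand-ins for model calls.
    """
    # Stage 1: cheap triage over every contract.
    flagged = [c for c in contracts if abs(cheap(c)) >= flag_threshold]
    # Stage 2: mid-tier second look, ranked by the size of its edge estimate.
    ranked = sorted(flagged, key=lambda c: abs(mid(c)), reverse=True)
    # Stage 3: expensive deep dive on the top handful only.
    return {c: deep(c) for c in ranked[:top_k]}


# Toy stand-ins so the sketch runs without any API calls.
contracts = [f"C{i}" for i in range(100)]
toy_edge = lambda c: int(c[1:]) / 100
result = tiered_analysis(contracts, cheap=toy_edge, mid=toy_edge,
                         deep=lambda c: f"deep analysis of {c}")
print(len(result), "contracts reached the expensive tier")
```

The obvious risk in this design is that the cheap tier's blunders decide what the expensive tier ever sees, so a contract wrongly dismissed at stage 1 never gets a second opinion.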
The broader point is that the binding constraint has shifted. Building the tool was easy; running it is expensive. The scarce resource isn’t engineering time anymore—it’s inference compute, metered by the token. Allocating that resource efficiently across questions of varying importance is, in effect, a new kind of research design problem.
The wisdom of the AI crowd: assembling my council
After playing with the agent for a while, it was clear it was adding value—but also making enough mistakes that I started thinking about how to make it smarter.
Recently, Andrej Karpathy released a simple project he called “LLM Council.” The idea is straightforward: instead of asking one AI model for advice, you ask several. Each model gives its answer, then each model reviews and ranks the others’ responses, and finally a designated “Chairman” model synthesizes everything into a final recommendation.
Karpathy built it as a weekend hack to help him get better answers to hard questions, such as working through difficult passages in books he was reading. He described it as “99% vibe coded” and said he had no intention of maintaining it. But the underlying insight is cool: a council of models arguing with each other might catch errors and aggregate diverse thoughts and ideas to produce a superior collective judgment. Enamored of the idea, I spent some time building an experiment to test how deliberation and collective decision-making might improve how models answer test questions, finding that they can lead to better results in some cases.
For my prediction market agent, these results suggested a clear design: use a diverse council of frontier models, and let them see each other’s reasoning before committing to a final probability. Whether this translates to better forecasting performance remains to be seen—mathematical reasoning and political prediction are different domains. But the principle seems sound. Collective intelligence requires genuine diversity of perspective, and deliberation can help surface the best reasoning rather than just the most common answer.
So I added this functionality to my dashboard. Asking all the major frontier models to debate and vote on a contract is slow and expensive, so I make debates a user-initiated option for eligible contracts—those with sufficient trading volume that are resolving soon. When I trigger a debate, each model submits an initial probability estimate, they discuss their reasoning, and then each updates its estimate. The final output is a distribution of views across models, plus a synthesized recommendation.
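The two-round structure can be sketched as follows. This is an illustration under stated assumptions rather than the dashboard's actual implementation: each model is represented by a plain function taking the question and its peers' views, the stub models simply move halfway toward their peers, and the median stands in for the synthesized recommendation.

```python
from statistics import median


def run_council(question, models):
    """Two-round debate: independent estimates, then one revision round.

    `models` maps a name to a function (question, peers) -> (prob, rationale),
    where `peers` is a dict of the other models' first-round answers.
    """
    # Round 1: every model answers without seeing anyone else.
    first = {name: ask(question, {}) for name, ask in models.items()}
    # Round 2: every model sees the others' estimates and rationales.
    final = {}
    for name, ask in models.items():
        peers = {k: v for k, v in first.items() if k != name}
        final[name] = ask(question, peers)
    consensus = median(prob for prob, _ in final.values())
    return {"initial": first, "final": final, "consensus": consensus}


def make_stub(prior):
    """A toy model that starts at `prior` and moves halfway toward its peers."""
    def ask(question, peers):
        if not peers:
            return prior, "initial view"
        peer_mean = sum(p for p, _ in peers.values()) / len(peers)
        return (prior + peer_mean) / 2, "moved halfway toward peers"
    return ask


council = {"model_a": make_stub(0.10),
           "model_b": make_stub(0.30),
           "model_c": make_stub(0.60)}
outcome = run_council("Will the event happen by the deadline?", council)
print({k: round(v[0], 3) for k, v in outcome["final"].items()})
print("consensus:", round(outcome["consensus"], 3))
```

The median is used here because it is robust to a single outlier model; a chairman model that reads the rationales, as in Karpathy's design, can weigh arguments rather than just numbers.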
“I made a critical error”
To see how this council works, let’s go back to the Sleigh Ride example. After Sonnet botched it and claimed it had no chance of being in the Billboard Top 10 during Christmas time, I opened the question up for debate. GPT-5.2, Gemini Pro, and Grok-4 all correctly pointed out that Sleigh Ride is regularly in the top 10. Interestingly, Sonnet was persuaded! After hearing their reasoning, it acknowledged “I made a critical error by not verifying the actual historical chart performance” and it massively changed its assessment of the contract. The transcript is kind of amazing.
My first trade
I wanted to really commit to this project, so I put real money into Kalshi and resolved to place bets based on my agent’s advice.
Of all the trades it recommended, one stood out: a contract on whether the Trump Administration would release additional Epstein documents by December 28th. The market price was about 50 cents—implying a 50% chance of release. My agent thought it was considerably less likely.
There were two reasons this one caught my eye. First, it was resolving soon, which meant I’d get quick feedback on a possible trade. And second, unlike a number of the blunders I encountered in other analyses like the ones chronicled above, this one appeared well reasoned and plausible. My own sense was that Trump is not eager to release Epstein-related documents and wouldn’t rush to do so during a holiday break when it would be so easy to let the matter lie.
I decided to spend some precious API credits and convene my council. The debate was in equal parts fascinating, blunderous, and insightful. (The full transcript can be viewed here.)
First, Sonnet 4.5 got a bit hung up on trying to figure out if the event might have already occurred—evidently unwilling to trust the market price and lack of resolution. But the model reassured itself that recent news did not suggest the documents had yet been released.
It then turned to a potentially more useful insight: releasing the documents likely requires bureaucratic processes that would take a while in normal times and take even longer during a holiday period.
GPT-5.2 offered a similar analysis: the holiday period would make a document release unlikely. But it did offer a caveat: maybe in response to public pressure, the Trump administration would release a small tranche of new documents as a performative gesture?
Grok also got hung up on figuring out if the event had already occurred, then mentioned the holiday issue, before settling on an admittedly made-up 60% guess. (Gemini’s API was down for this debate, so the model abstained.)
Having offered their initial views, the models then got to see each other’s reasoning and update. You can see their updated probabilities and a summary of their final rationale below.

All in all, the models updated their estimates quite a lot. Sonnet increased its probability from 12 to 18%, writing: “After considering the other perspectives, I’m updating upward from my initial 12%. GPT-5.2’s point about the low threshold for what constitutes a ‘release’ is important - even a small memo or document tranche could satisfy the contract. However, Grok 4’s pressure-response theory seems overly optimistic given the 4-day holiday window and the fact that self-initiated releases creating political exposure are rare. The key insight is that while unlikely, the possibility of a pre-planned or minimal release isn’t negligible.”
GPT-5.2 updated the most dramatically, going from 32% all the way down to 12%. It explained: “Claude’s point about ‘absence of major coverage’ as strong negative evidence in such a compressed window is persuasive, and it reinforces my base-rate view that an executive-branch release is unlikely without pre-positioning. Grok usefully highlighted the potential pressure mechanism, but I think it is more likely to produce statements/spin than a qualifying release before Dec 28.”
Grok also updated substantially, reducing its estimate by 12 percentage points but still holding by far the highest probability of the three. It wrote: “The other forecasters’ focus on distinguishing court releases from Trump administration actions has made me realize I may have overinterpreted the recent news as potentially admin-driven, lowering my estimate of imminent action. Their emphasis on the short timeframe, holidays, and bureaucratic hurdles are factors I hadn’t fully accounted for, though I still see room for political motivation to override these. Overall, this discussion has moderated my optimism without fully convincing me to drop to their low probabilities.”
While there wasn’t full agreement, there was a directional consensus: the market was overpricing Yes. So I bought No at 50 cents.
Over the following days, as no announcement came from the Administration, the No price climbed. I sold above 80 cents, netting a tidy 60% return on the position (and sure enough, the Administration didn’t release any documents before the deadline).
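For anyone checking the arithmetic, here is a minimal sketch of the two numbers involved: the expected profit of a No contract given a belief about Yes, and the realized return from selling before resolution. It assumes a binary contract that pays $1 if correct and ignores fees and the bid-ask spread; the function names are made up for this example.

```python
def expected_profit_no(price_no: float, p_yes: float) -> float:
    """Expected profit per No contract held to resolution, in dollars.

    A No contract pays $1 if the event does not happen; fees are ignored.
    """
    return (1.0 - p_yes) - price_no


def realized_return(entry_price: float, exit_price: float) -> float:
    """Return on a position bought at entry_price and sold at exit_price."""
    return (exit_price - entry_price) / entry_price


# Buying No at 50 cents while believing Yes is 18% likely (Sonnet's final view).
print(round(expected_profit_no(0.50, 0.18), 2))
# Selling at 80 cents; a sale above 80 cents would return more than this.
print(f"{realized_return(0.50, 0.80):.0%}")
```

Selling at 80 cents rather than waiting for the full $1 gives up some expected profit in exchange for removing the risk of a last-minute release.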
One trade proves nothing, of course. In the coming months, I’ll be tracking the agent’s predictions against actual outcomes and against the market to see whether it has any real edge. Stay tuned!
Flush from my initial success, though, I’ve already put down a few more wagers. I now have a strong vested interest in President Trump adding himself to Mt. Rushmore. And I’ve also put down a bet that Wes Moore won’t end up ramming through his off-cycle redistricting plan, on the encouragement of my AI council.
The promise of this direction
Why does any of this matter beyond a fun holiday project?
The typical way we evaluate AI systems is through benchmarks—standardized tests that measure capabilities like mathematical reasoning, coding ability, or factual recall. But benchmarks have a problem: they leak. Models get trained on test data accidentally or intentionally, performance numbers become inflated, and we lose the ability to distinguish genuine capability from sophisticated memorization.
Prediction markets and forecasting offer something different. They are forward-looking by construction. The outcomes haven’t happened yet, which means they can’t be in the training data. The questions are real, with real money at stake, which attracts informed participants and produces prices that reflect genuine uncertainty. This makes them a nearly ideal testbed for political intelligence. If we want to know whether AI systems can genuinely reason about politics—not just pattern-match on historical data—prediction markets give us a way to find out.
Others are pursuing this too. Metaculus runs tournaments pitting AI against human forecasters (the humans are still winning, but the gap is narrowing). According to Metaculus, three things seem to matter most: better base models beat clever prompting, quality data inputs are necessary but not sufficient for accuracy, and calibration is distinct from accuracy.
Prophet Arena tests models against actual market prices. Some models are showing eye-popping returns so far, but it’s too early to know what’s real and what’s chance.
All in all, both show roughly the same pattern: promise may be there, but we’re early.
How we get there
Getting from here to AI agents that genuinely understand and predict politics will require progress on several fronts.
Prediction market platforms are the foundation. The more contracts they list, and the more domains they cover, the richer the training ground for AI forecasting. But quantity isn’t enough. Platforms could invest in more structured metadata—tagging contracts by domain, time horizon, and resolution type—so that AI systems can learn which kinds of questions they’re good at and which they struggle with.
Resolution rules matter enormously: a contract that resolves based on a specific government announcement is very different from one that resolves based on a journalist’s judgment call. The clearer and more machine-readable these rules become, the better AI systems can learn to interpret them.
Platforms might also consider publishing historical resolution data in standardized formats, creating a corpus that researchers can use to train and evaluate forecasting models systematically.
AI labs have a role too. Current frontier models are trained heavily on code, mathematics, and general reasoning—but political knowledge is often shallow. The models know facts about political systems, but they don’t always understand how those systems actually operate: the informal rules, the strategic calculations, the path dependencies that shape outcomes.
Labs could invest in curating high-quality training data on political strategy, institutional design, and historical case studies. They could fine-tune models specifically on forecasting tasks, optimizing not just for accuracy but for calibration—the alignment between expressed confidence and actual accuracy.
The Metaculus results and academic research both suggest this is a distinct capability that current training approaches fail to develop adequately. And they could build better tools for temporal reasoning, so that models understand not just what day it is but where we are in longer political processes.
Researchers and hobbyists are where the real experimentation happens. We should be playing with these questions in as many different ways as possible: testing prompts, trying different data sources, experimenting with governance structures like the council approach I described. The space of possible architectures is vast, and the only way to map it is through distributed tinkering.
Crucially, we should be posting predictions publicly and ex ante—before outcomes are known—so that we can evaluate what actually works rather than cherry-picking successful bets after the fact. It’s easy to find examples where an AI agent made a brilliant call; it’s much harder to track systematic performance over hundreds of predictions.
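One lightweight way to make predictions verifiably ex ante is to log each one with a timestamp and a hash that can be published immediately. The sketch below is one possible approach, not an established standard: the record fields, the `make_record` and `append_record` helpers, and the JSON Lines file format are all assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_record(contract, prob_yes, market_price, made_at=None):
    """Build a prediction record plus a SHA-256 digest of its contents.

    Publishing the digest at prediction time lets anyone later confirm
    that the record was not edited after the outcome was known.
    """
    made_at = made_at or datetime.now(timezone.utc)
    body = {
        "contract": contract,
        "prob_yes": prob_yes,
        "market_price": market_price,
        "made_at": made_at.isoformat(),
    }
    canonical = json.dumps(body, sort_keys=True)
    body["sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return body


def append_record(path, record):
    """Append one record per line to a JSON Lines log file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


record = make_record("EXAMPLE-CONTRACT", 0.18, 0.50,
                     made_at=datetime(2025, 12, 24, tzinfo=timezone.utc))
print(record["made_at"], record["sha256"][:12])
```

Logging the market price alongside the forecast matters because the question is not just whether the agent was right, but whether it was more right than the market at the moment it spoke.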
Back to the workshop
My AI still sometimes thinks elections that haven’t happened yet are already over. It still gets confused about contract resolution rules. It still produces probability estimates that don’t quite add up.
But my experiments, the Metaculus results, Prophet Arena, and lots of other ongoing efforts point the way forward: better base models matter most, good data sources come second, and clever prompting provides marginal gains on top. My tiered architecture—cheap models for triage, expensive models for deep dives, councils for the hardest calls—is a decent start. The next steps are clearer now: better news sources, more structured contract metadata, and systematic tracking to distinguish real edge from lucky bets.
I’ll keep working at it. But I’m more excited about what happens when thousands of others do the same. Prediction markets have always promised to aggregate dispersed knowledge into useful signals. AI agents that can participate in these markets—scanning contracts, synthesizing information, placing bets—could make that aggregation faster, broader, and more continuous. We might end up with something genuinely new: a real-time, always-on collective intelligence that helps us not just to predict politics but to understand it by combining human judgment with machine scale.
Will experimenting with AI forecasting agents, with hundreds of researchers trying thousands of approaches, eventually produce systems that genuinely understand how politics works? We’re not there yet. But the path is there. I’m putting a (small amount of) money on Yes.
Disclosures: I receive consulting income from a16z crypto and Meta Platforms, Inc.

Where do you see the liquidity coming from in (thin) political prediction markets? As you note, AIs are expensive to use in this way. If more people use AI in prediction markets, you’d expect herding in estimates (although maybe not? It’s unclear how stochastic an LLM council has the potential to be; maybe that’s an interesting question in its own right).
In order to make AI trading profitable, the alpha needs to come from somewhere. To channel/paraphrase Matt Levine, sports “prediction markets” are clearly just a lot of people having fun gambling, so there are plenty of misinformed retail investors, and so fairly thick markets with a lot of potential alpha for the sharps. In political prediction markets it’s not clear there’s any dumb money to make it worthwhile. This is even more true in political markets, where insider trading seems to occur commonly and seems not to be illegal.
Maybe you’re more interested in the AI functioning, but I think that as long as you have non-negligible trading costs, the real barrier here is market structure.