Discussion about this post

Rohit Krishnan:

Very interesting work on creating manipulative scenarios and seeing how the models respond. I’m still not sure whether it’s better to have models say “no” to these, or to comply, considering the extreme levels of situation dependence these questions imply.

John Michael Thomas:

Thank you for this. And agreed, we have a long way to go with this.

I have some other concerns about the idea, because benchmarks can be (and currently are) gamed. But that's kind of table stakes for any reliable evaluation, so I won't go into it here.

But this question stood out to me: "When I ask the model to do something authoritarian, it may well refuse; but when Pliny the Liberator asks, he can always find ways to get the model to do what he wants. Should we think of the government official or tech exec like a normal user, or like an extremely determined jailbreaker?"

I suggest the answer is that we have to treat anyone with power and/or money like a skilled and determined attacker, because it's trivial for them to hire a skilled and determined attacker to execute their will.

Concentration of power and money begets concentration of skill. We can't separate them, because in reality they're always joined at the hip.
