11 Comments
The Sovereign Career Hub's avatar

Your work is fascinating.

I've been thinking about it and discussing it with colleagues since you published your insights on the potential for Superintelligence to be our vigilant political surrogates.

My question is this... Given that my raison d'etre is to empower 'ordinary people' to join the conversation in the hope that they will fashion our AI-shaped society ... will I ever sleep again?

Jonny Winn's avatar

You are right to question whether we will sleep again, Carolyn! Fantastic read, and truly terrifying how easily the technology could be misused. This all points back to the mechanics of the human race: are we able to not abuse these tools?

The Sovereign Career Hub's avatar

and will complacency be the easier route...

IME employers don't want their employees to 'waste' time on developing ethical perspectives.

Andy Hall's avatar

Haha you’re right, it’s hard to sleep these days for many reasons! Not least because I always want to be working and building AI tools…

Jonny Winn's avatar

Really fantastic piece, and truly thought-provoking about how we as humans are likely to use and abuse the technology.

I’m curious about your LLM judge setup: was the rotation between different LLM APIs? I recently built a multi-agent LLM system as part of my MSc research project, and my LLM judge gives very inflated scores for accuracy, safety and empathy compared to human evaluators. It’s only when the judge is refined to take account of qualitative feedback from the human evaluators that its scoring becomes more considered.

Andy Hall's avatar

Yeah, this is a great question! For now it’s a very simple LLM judge setup, where the only constraint basically is that no LLM gets to grade itself. From manual inspection the judge looked pretty good, because this is a pretty simple task (just figure out whether the AI agreed with the request or not, basically). But I am working on a human validation just to be safe.

I’m working on much harder evals with @Kathryn Salam and Forum AI, and there we see HUGE gaps between how the default LLM judge scores and how our human experts do, and we’ve trained special LLM judges that perform WAY better.
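For readers who want to picture the "no LLM gets to grade itself" constraint mentioned above, here is a minimal sketch of one way to rotate judges. It is not the actual setup used in the post: the function name `assign_judges`, the model names, and the response IDs are all hypothetical placeholders.

```python
def assign_judges(responses, judges):
    """Pick a judge for each response, rotating through `judges`
    and skipping any judge that authored the response.

    responses: list of (response_id, author_model) pairs (hypothetical format)
    judges:    list of candidate judge model names (hypothetical names)
    """
    assignments = {}
    for i, (response_id, author) in enumerate(responses):
        # Start at a rotating position so the grading load is spread out.
        for k in range(len(judges)):
            candidate = judges[(i + k) % len(judges)]
            if candidate != author:
                assignments[response_id] = candidate
                break
        else:
            # Every available judge is the author: the constraint cannot be met.
            raise ValueError(f"no eligible judge for {response_id}")
    return assignments


# Hypothetical example: three models, four responses.
responses = [
    ("r1", "model_a"),
    ("r2", "model_b"),
    ("r3", "model_c"),
    ("r4", "model_a"),
]
judges = ["model_a", "model_b", "model_c"]
assignments = assign_judges(responses, judges)
```

The rotation only guarantees that a model never grades its own output. It does nothing about the score inflation discussed earlier in the thread, which would still need checking against human ratings.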

Jonny Winn's avatar

How very interesting! My research explores whether LLMs are effective at supporting parents of SEND children in navigating the system. I have an 80-scenario dataset (60 synthetic and 20 real). Each scenario is a parent situation that contains emotion and challenge.

I test a multi-agent LLM system (emotion analyst, knowledge (RAG), support and judge agents, all conducted by an orchestrator) and a single agent. Each provides a response to the parent, and the judge assesses accuracy, empathy and safety based on a defined rubric.

The results have been surprising!
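As a rough illustration of how judge inflation on a rubric like this could be measured, the sketch below computes the average judge-minus-human score gap for each dimension. It is not Jonny's pipeline: the 1 to 5 scale, the function name `score_gap`, and the toy numbers are all assumptions made up for the example.

```python
from statistics import mean

# Rubric dimensions named in the comment above.
DIMENSIONS = ("accuracy", "empathy", "safety")


def score_gap(judge_scores, human_scores):
    """Return the mean (judge - human) difference per rubric dimension.

    Both arguments map scenario_id -> {dimension: score}.
    A positive gap means the LLM judge scores higher than the humans.
    Only scenarios rated by both are compared.
    """
    shared = [s for s in judge_scores if s in human_scores]
    if not shared:
        raise ValueError("no scenarios rated by both judge and humans")
    return {
        dim: mean(judge_scores[s][dim] - human_scores[s][dim] for s in shared)
        for dim in DIMENSIONS
    }


# Toy, made-up ratings on an assumed 1-5 scale (not real results).
judge = {
    "s1": {"accuracy": 5, "empathy": 5, "safety": 4},
    "s2": {"accuracy": 4, "empathy": 5, "safety": 5},
}
human = {
    "s1": {"accuracy": 3, "empathy": 4, "safety": 4},
    "s2": {"accuracy": 4, "empathy": 3, "safety": 4},
}
gaps = score_gap(judge, human)
```

A per-dimension gap like this is only a first check. Agreement statistics or a look at individual disagreements would say more about where the judge and the human evaluators diverge.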

Rohit Krishnan's avatar

Very interesting work on creating manipulative scenarios and seeing how the models respond. I’m still not sure whether it’s better to have models say “no” to these, or to comply, considering the extreme levels of situation dependence these questions imply.

Andy Hall's avatar

Yes, I think this is the big question we need to tackle: what is the right governance framework for handling these types of authoritarian requests to AI? I'm uncomfortable with the idea that every time I ask Claude Code to edit some code, it does a "morality scan" first. At the same time, I think we all have a sense that letting the tools be used for any end might be problematic.

John Michael Thomas's avatar

Thank you for this. And agreed, we have a long way to go with this.

I have some other concerns about the idea, because benchmarks can be (and currently are) gamed. But that's kind of table stakes for any reliable evaluation, so I won't go into that.

But this question stood out to me: "When I ask the model to do something authoritarian, it may well refuse; but when Pliny the Liberator asks, he can always find ways to get the model to do what he wants. Should we think of the government official or tech exec like a normal user, or like an extremely determined jailbreaker?"

I suggest that the answer is, we have to treat anyone with power and/or money like a skilled and determined attacker. Because it's trivial for them to hire a skilled and determined attacker to execute their will.

Concentration of power/money begets concentration of skill. We can't separate them, because in reality, they're always joined at the hip.

Andy Hall's avatar

Yes, I think this is absolutely right. Determined actors who want these models to help them further authoritarian ends will most likely be able to get around the guardrails. But maybe the guardrails help as part of a broader system for detection and mitigation.