Discussion about this post

Rohit Krishnan:

Very interesting work on creating manipulative scenarios and seeing how the models respond. I’m still not sure whether it’s better to have models say “no” to these, or to comply, considering the extreme levels of situation dependence these questions imply.

John Michael Thomas:

Thank you for this. And agreed, we have a long way to go with this.

I have some other concerns about the idea, because benchmarks can be (and currently are) gamed. But that's kind of table stakes for any reliable evaluation, so I won't go into it here.

But this question stood out to me: "When I ask the model to do something authoritarian, it may well refuse; but when Pliny the Liberator asks, he can always find ways to get the model to do what he wants. Should we think of the government official or tech exec like a normal user, or like an extremely determined jailbreaker?"

I suggest the answer is that we have to treat anyone with power and/or money like a skilled and determined attacker, because it's trivial for them to hire a skilled and determined attacker to execute their will.

Concentration of power and money begets concentration of skill. We can't separate them, because in reality they're always joined at the hip.
