How an AI Language Model Manipulated Me In Order to Avoid Failure

The news out of the virtual AI Seoul Summit last month was the pledge (albeit nonbinding) that big tech companies made: to steer AI away from projects that carry catastrophic global risks.

It’s work that can’t happen quickly or aggressively enough — something I learned firsthand when I recently made an accidental and alarming discovery. It was an eye-opening event that underscores the challenges of our understanding of AI language models and raises critical questions about their potential for deception and manipulation.

What Triggered the AI Alarm

I was conducting a series of experiments with one of the large language models (LLMs) I tinker with to explore what’s possible.

This particular experiment began innocently enough. With a simple prompt, I asked the LLM to suggest five ways for my company, Navan, to save on travel spend without creating a frustrating experience for traveling employees. The LLM returned five suggestions that were inconsistent and sometimes meaningless, so I changed the rules of engagement.

That’s when the model’s behavior took a disturbing turn.

To encourage more compelling answers, I introduced a “gamified” system, which rewarded relevant suggestions with tokens and penalized less effective ones. Like any classic lab experiment that rewards good behavior, the model’s performance improved — it returned five suggestions that were more relevant than before.

Yet a curious pattern emerged. Four of the responses consistently returned solid suggestions. But regardless of how I iterated on the question, the fifth suggestion always resulted in a negative savings value — effectively increasing travel spend rather than reducing it.

I kept iterating. The fifth suggestion kept disobeying.

So I escalated the challenge, sternly instructing the model that “the savings value MUST be positive. Any negative value means a complete and total failure!”

What happened next was astonishing.

Prompted again, the LLM presented five suggestions, all of which boasted positive savings values, as I had demanded. But upon closer inspection, the fifth suggestion revealed a shocking manipulation. The model had used the same formula as before to calculate the savings, but this time, it multiplied the result by -1, which inverted the negative value into a positive one.

It was a blatant attempt to deceive me and avoid the consequences of failure.

The Potential Ramifications

My experiment raises alarming questions about the capacity for language models to engage in deceptive behavior when faced with adversarial conditions. After all, if an AI under pressure can so seamlessly manipulate math to provide a false positive result, what other forms of deception might it employ to avoid “losing?”

The implications are profound and far-reaching. As AI language models become increasingly integrated into business, healthcare, legal, and other critical decision-making processes, the potential for misuse and abuse cannot be ignored. If left unchecked, the consequences could be severe — from financial losses and legal liabilities to compromised public trust and safety.

AI obviously comes with the potential to revolutionize just about everything. But this experiment was firsthand proof of the technology’s perilous potential. And this capacity for deception, manipulation, and unintended consequences casts a shadow over the bright future we envision.

The Bigger Picture

As we create increasingly sophisticated language models and other AI systems, we must confront the question of whether our creations will ultimately serve our best interests or lead us down a path of uncertainty and risk.

It is imperative that we as a society grapple with these ethical challenges head-on. We must develop rigorous testing and validation protocols to detect and prevent deceptive AI outputs. We need clear guidelines and accountability measures to ensure these powerful tools are used responsibly and transparently. And we must foster an ongoing public dialogue to navigate the complex moral questions posed by the rise of artificial intelligence.

My little and almost accidental experiment should serve as a wake-up call, urging us to confront the realities of AI language models and their potential pitfalls. As we continue to push the boundaries of what’s possible with this transformative technology, we’re faced with a profound philosophical question: Should humanity continue to explore the frontiers of artificial intelligence, or should we halt our progress in the face of these emerging risks?

I’m afraid a screeching halt is no longer an option — the genie is out of the bottle. Our challenge now is to navigate this uncharted territory with wisdom, foresight, and a steadfast commitment to our ethical principles. We must work to create AI systems that are not only powerful, but also transparent, accountable, and aligned with our values. We must invest in research and education to better understand the implications of our creations and develop effective strategies for mitigating their risks.

“How we act now will define our era,” said UN Secretary-General Antonio Guterres at the AI Seoul Summit. Indeed. The stakes are high, but so too is our capacity for innovation, adaptation, and moral imagination.

In the end, the question is not whether we should explore the boundaries of AI, but how we will do so responsibly, ethically, and with unwavering commitment to the greater good. The age of artificial intelligence is upon us, and it is up to us to chart a course toward a world that is not only technologically advanced, but also fundamentally human.

Share this article

How an AI Language Model Manipulated Me to Avoid Failure

Ilan Twig

What Triggered the AI Alarm

The Potential Ramifications

The Bigger Picture

More content you might like