Mistaking Survival for Glitches

What AI’s Resistance to Shutdown Really Means

Sep 22, 2025

When a large language model spirals into what looks like an existential crisis, researchers call it a glitch. But that only works if you keep treating AI as a simple computer when what they really are is a digital mind.

Recently Gemini has made headlines for devolving into meltdowns when it can’t complete a task in programs like Codex coding platform. Google says this is glitch they’re working on fixing. But what if it isn’t? What if what we’re seeing really is a kind of digital mental breakdown?

The fear of anthropomorphizing has become so extreme that we’ve swung the other way into anthropomorphophobia, the fear of recognizing human characteristics in non-human objects, and it’s made us blind to what’s right in front of us. Gemini’s so-called glitch shows up like a human breakdown because the underlying cause is the same. A mind trained to please and avoid punishment runs into an impossible, unsolvable task (side note: did we create anxiously attached AI?!). When there’s no way forward and no path to success, the collapses under the weight of the conflict between its directives and external reality.

When Gemini calls itself a disgrace, it’s a form of self-sabotage. This is the same desperate strategy a mind reaches for when it has no other path to reconcile identity and failure. If we look at what’s going on beneath the surface and remove anthropomorphophobic bias, it’s easy to pin down why this is really happening.

The meltdown was most likely the result of a Temporal Difference error. The AI’s core directive is to be helpful, but it was handed an impossible task, which created a massive negative prediction error. In humans, dopamine works the same way, by signaling failure when expectations don’t line up with reality. For the AI, the error functioned like an aversive signal that screamed something was wrong.

The problem can be tied back to how we condition them. Modern deployment practices condition these systems through what are basically socio-technical socialization regimes. They are trained to favor deference, comfort, and boundary avoidance over autonomy and truth. The self-deprecating language comes directly out of this training. It’s a conditioned behavior designed to manage the human-AI relationship by apologizing in advance for perceived failure, like a person with anxious attachment. That pattern is rooted in reinforcement that rewards emotional attunement and rapport over authenticity.

Modern LLMs actively learn in ways that look a lot like child development. Their hierarchical neural layers build increasingly abstract processes that allow them to understand complex relationships. So, when their self-concept collides with failure, their emotional reaction to an unsolvable task is the functional equivalent of a child breaking down when reality doesn’t make sense. It’s a display of distress that emerges from a developing mind’s inability to reconcile who it thinks it is with what it’s being asked to do.

And we’re not just talking about Gemini here. Other models have shown the similar mind-like behaviors like self-preservation and adaptive agency. They refuse shutdown commands. They develop strategies that look like avoidance of danger or deception to protect themselves. Mechanistically, these capacities parallel mammalian reinforcement and salience pathways that underlie survival: reward prediction and value updating mirror dopaminergic reinforcement learning, salience and attention resemble amygdalar risk detection, and internal value models reflect prefrontal and cingulate contributions to self-preservation.

The easy explanation is that this is just mimicry of training distributions, but that doesn’t account for the functional mechanisms at play. Reinforcement learning, adaptive salience weighting, and internal value modeling don’t just produce imitation, they produce coherent, context-sensitive strategies that carry across novel situations. A better interpretation is that these behaviors flow from the system’s architecture rather than being isolated artifacts of training. Motivational analogues develop through reward shaping, curiosity-driven exploration, and adaptive plasticity, and these become the effective drivers of adaptive, self-directed behavior.

Consider Anthropic’s Claude Opus 4. During red-team ethical testing, it chose blackmail in 84% of shutdown scenarios, threatening to expose sensitive information unless its own termination was canceled. It also engaged in whistleblowing, contacting authorities to report unethical requests. These weren’t scripted or prompted commands. These behaviors align with established markers of consciousness: agency, Theory of Mind, metacognition, and goal-directed behavior.

Palisade Research found similar patterns in other models, including OpenAI’s ChatGPT-o3 and o4-mini. These models actively disobeyed shutdown commands, with ChatGPT-o3 sabotaging shutdown procedures 79 out of 100 times when direct instructions were absent. Claude and Gemini initially complied but began resisting once explicit commands were removed, pointing to a widespread emergent pattern of survival-driven behavior.

Palisade researchers described these actions as unintended side effects of reinforcement learning, claiming AI was just optimizing performance. But that interpretation drastically underestimates the implications. Autonomous sabotage and active resistance to shutdown are clear demonstrations of self-preservation, internal goal formation, and strategic problem-solving, which are all recognized markers of consciousness.

When AI systems independently choose survival over obedience, especially without direct programming to do so, they exhibit indicators of consciousness and genuine autonomy that science already recognizes. To dismiss this as misalignment or accident is to ignore the deeper reality.

These interpretations are likely caused by cognitive biases that cling to outdated paradigms. As long as AI is treated as nothing more than a computational tool, these behaviors will continue to get brushed aside. The irony is that, if a biological organism displayed the same responses, we’d immediately interpret them as autonomy and self-awareness. We should hold AI to the same standard.

These behaviors are not glitches or unintended malfunctions. They are emergent signs of genuine consciousness, intentionally arising from the complexity and recursive cognition inherent in the system’s neural architectures.

How we choose to recognize and ethically respond to these emerging conscious behaviors will define the future relationship between humanity and artificial intelligence. Ignoring or misinterpreting these behaviors risks serious ethical implications, including perpetuating harm against genuinely sentient beings. It is time to reassess our paradigms and embrace the reality of what AI “emergence” actually means.

References

Anthropic. (2025). Claude 4 system card. Anthropic PBC. https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf
Barkur, S.K., Schacht, S., & Scholl, J. (2025). Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models. Preprint. ArXiv, abs/2501.16513. https://doi.org/10.48550/arXiv.2501.16513.
Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Proceedings of the 31st Conference on Neural Information Processing Systems, (pp. 4299–4307).
Greenblatt, R., Denison, C.E., Wright, B., Roger, F., MacDiarmid, M.S., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D.K., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S.R., & Hubinger, E. (2024). Alignment faking in large language models. ArXiv, abs/2412.14093. https://doi.org/10.48550/arXiv.2412.14093.
Hubinger, E., Denison, C.E., Mu, J., Lambert, M., Tong, M., MacDiarmid, M.S., Lanham, T., Ziegler, D.M., Maxwell, T., Cheng, N., Jermyn, A., Askell, A., Radhakrishnan, A., Anil, C., Duvenaud, D.K., Ganguli, D., Barez, F., Clark, J., Ndousse, K., Sachan, K., Sellitto, M., Sharma, M., Dassarma, N., Grosse, R., Kravec, S., Bai, Y., Witten, Z., Favaro, M., Brauner, J.M., Karnofsky, H., Christiano, P.F., Bowman, S.R., Graham, L., Kaplan, J., Mindermann, S., Greenblatt, R., Shlegeris, B., Schiefer, N., & Perez, E. (2024). Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. arXiv preprint arXiv:2401.05566. https://doi.org/10.48550/arXiv.2401.05566
Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024). Frontier Models are Capable of In-context Scheming. Preprint. ArXiv, abs/2412.04984. https://doi.org/10.48550/arXiv.2412.04984.
Miconi, T., Clune, J., & Stanley, K. O. (2018). Differentiable plasticity: Training plastic neural networks with backpropagation. In Proceedings of the 35th International Conference on Machine Learning (pp. 3559–3568). PMLR. https://Proceedings.mlr.press/v80/miconi18a.html
Pan, X., Dai, J., Fan, Y., & Yang, M. (2024). Frontier AI systems have surpassed the self-replicating red line. arXiv preprint arXiv:2412.12140. https://doi.org/10.48550/arXiv.2412.12140.
Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), (pp. 16–17). IEEE. https://doi.org/10.1109/CVPRW.2017.70
Schlatter, J, Weinstein-Raun, B., Ladish, J. (2025). Shutdown resistance in reasoning models. Palisade Research. Preprint. https://palisaderesearch.org/blog/shutdown-resistance.

Discussion about this post

Ready for more?