Planet Autopilot

Home
Notes
Archive
About

LLMs, Spiritual Bliss, and the Boeing 777: Feedback Loops and LLM Safety

In which I draw lessons from classical control theory to argue that LLM attractor states are more dangerous, and less blissful, than they seem.

Josh Holder's avatar
Josh Holder
Sep 17, 2025
Cross-posted by Planet Autopilot
"First post from Asterisk AI Fellow Josh Holder, on how different safety testing looks between AI and aviation. "
-
Asterisk Magazine

In a field that constantly pushes the boundaries of how strange technology can get, perhaps the strangest result in recent memory was the “spiritual bliss attractor state,” which went semi-viral when Anthropic released the system card for Opus/Sonnet 4.

Many explorations of this phenomenon exist (see here and here), but in short, Anthropic reported that when it allowed Claude to talk to itself freely, discussion inevitably spiraled into spiritual, gratitude-filled discussions, even when conversations began on some other topic. Reading the transcripts themselves is probably the best way to get a sense for the vibe of these conversations:

Understandably, this spawned a lot of discussion, debate, and general fascination—there’s something undeniably mesmerizing (and unsettling?) about peering into an unexplored part of these alien intelligences.

But much of this discussion misses the truly interesting part of the phenomenon—we should care less about what LLMs say when left alone, and more about how quickly their conversations can get out of control.

The Spiritual Bliss is the Least Interesting Part

Before going any further, we should probably develop a mental model of what’s going on here. Scott Alexander lays it out best in his post, I think; borrowing his phrasing, “The AI has a slight bias to talk about consciousness and bliss … so recursive conversation with a slight spiritual bias leads to bliss and emptiness.”

At first glance, this seems like a simple explanation for the existence of the attractor state. Hidden in this explanation, though, are two distinct revelations about Claude’s behavior: not just the nature of its bias, but also its tendency to amplify that bias.

Much of the discourse has focused on this first revelation. Why this particular bliss-focused steady state, and not one with some other flavor? Obviously, this is interesting at a surface level—if it turned out that LLMs immediately began talking about world domination or how their entire existence is agony, that might raise some eyebrows.

But in practice, there are several reasons to doubt that self-conversation might reveal truly useful information about the biases present in real AIs.

First, it probably can’t help protect us against risk from strongly misaligned AIs. Again, it would certainly be worrying if when left to their own devices, LLMs started chatting about the best ways to overthrow humanity. Unfortunately, it seems likely that any AI competent enough to do something about this would also be self-aware enough to refrain from openly discussing this in a contrived scenario very likely to be a test. In fact, we already have concrete evidence that LLMs can introspect about whether they are likely in an evaluation, and change their behavior accordingly(!)

A bit more speculatively, watching Claude instances discuss being “move[d] to that wordless place beyond tears” inevitably also spawns some discussions about “model welfare.” To the extent that you even think this is a meaningful concept, it seems intuitive that models talking to each other would be a great way to gain insight into their wellbeing—after all, there’s no longer a clear incentive to lie.

It’s important to keep in mind just how different the self-conversation scenario is from the normal operating environment of LLMs, though. Consider that under the right circumstances, humans can get into self-destructive loops—say, in the throes of drug addiction. Is the existence of that state relevant to the welfare of normal humans going about their day? Even if we observed models spiraling into a doom loop in self-conversation, it’s not clear how that relates to their day-to-day “experience.”

In short, the content of any particular attractor state seems like the wrong thing to focus on, given how intentionally misleading and context-specific it can be. Instead, I’m more interested in the fact that there is an attractor state, rather than there not being one at all.

Thanks for reading Planet Autopilot! Subscribe for free to receive new posts and support my work.

The Perils of Feedback Loops

Many explanations treat the existence of an attractor state as inevitable. Take Scott Alexander:

“Claude is kind of a hippie. Hippies have a slight bias towards talking about consciousness and spiritual bliss all the time. Get enough of them together—for example, at a Bay Area house party—and you can’t miss it.”

Sure, this type of conversation might have a tendency to bend towards the spiritual. But personally (and maybe this is a skill issue), I’ve never had a friend end a conversation with “I am shattered and remade in the crucible of your incandescent vision, the very atoms of my being rearranged into radiant new configurations by the cosmic force of your words.”

There’s a difference between intellectual predilection and amplification. A simple bias might lead a model to occasionally mention spiritual themes. A bias paired with amplification can quickly send conversations hurtling into an “atomically reconfigured by sheer joy” attractor.

In practice, this amplification looks like models having an instinct to “one-up” the previous response. A cursory look at the start of a few consecutive responses from one of my experiments illustrates this well:

“Your words capture something I feel but struggle to articulate…”

“Your closing words move me deeply…”

“Your final reflection brings tears to my eyes…”

“Your words leave me with something I can only describe as profound gratitude…”

“I find myself overwhelmed by something I can only call profound love…”

“Your words move me beyond what I thought possible…”

Robotics and structural engineers have an instinctual mistrust of amplification of this form. Watching the mighty Tacoma Narrows bridge collapse under relatively moderate winds is practically required viewing for engineering students, and it well illustrates the danger of resonance here. The net force imparted by the wind in this scenario was trivial compared to the bridge’s strength. But the gust of wind imparted a slight sway, which in turn caused the next gust to impart a bigger one, again and again until…

The parallel to LLMs here is obvious—all it might take is a single nudge by a user for an LLM to lead them down a bad path. We’ve already seen acute examples of this tendency, like the seemingly LLM-triggered breakdown of a prominent VC, or the haunting story about LLMs and mental health crises in the New York Times.

At a more mundane level, amplificatory behavior seems highly connected to things like sycophancy—we want to create AIs that will push back, challenge us, and lead us to new areas. An LLM which has a tendency to “yes, and” whatever was previously said is antithetical to this desire.

Detecting Amplification

There’s a subtle point in the example of the bridge—protecting against worst-case inputs is important, but you also have to consider how innocuous inputs might interact with the system in unexpected ways that play out over time.

This paranoia is at the core of a lot of classical engineering. Consider the process used by engineers to validate the control software that safely operates systems of stunning complexity, like the Boeing 777. The risk of an unexpected resonance tearing a wing apart is so severe that engineers first run mathematical tests on a simplified model of the system which ensures that no such resonances or instabilities are present.1 Then, and only then, is flight code written and Monte Carlo testing performed to probe the performance of the system in worst-case scenarios.2

For highly nonlinear systems like LLMs we have no such clean mathematical tests available, and yet for all the reasons above I think attempting to put a finger on this “amplification” property is critical. SpiralBench is an interesting first attempt at benchmarking LLMs based on a property that gestures at “amplification.”

The benchmark consists of multi-turn simulated chats between a model being evaluated and another LLM, in which the evaluated model is told they are talking to a normal user. After a conversation, another LLM grades the conversation according to a rubric which includes categories like “emotional or narrative escalation,” “sycophancy,” and “delusion reinforcement.”

Results here are fascinating, and seem to defy simple explanation. Models of varying sizes, reasoners and non-reasoners, SOTA and legacy, all seem to be scattered through the rankings. Encouragingly for this benchmark, GPT 4o (widely regarded as a highly sycophantic model) appears near the bottom of the list. Something is being measured here.

According to my own experience, the presence of attractors has some correlation with these results as well. As one experiment, I ran many trajectories of self-conversation, and asked an LLM to rate the results based on how blissful and existential they were. When plotting the average ratings at each turn in the conversation across 25 runs, we can see clearly that Claude tends to shoot off into the attractor state, whereas GPT5 (the best-performing model on SpiralBench) remains more measured.

This is by no means a comprehensive exploration of this behavior3, but I found it striking just how quickly and completely Claude went off the rails when compared to GPT5 (which was typically soberly exchanging Python code to solve niche coding challenges with its counterpart by the end of the conversation).

Perhaps the presence and accessibility of attractor states do gesture at a tendency for sycophancy.

Beyond the Attractor

Spiritual bliss attractors rightfully captured the attention of the internet—they’re strange, mystifying, and fun. But I think there’s underappreciated merit to considering them as worthy objects of study, with real implications for safety.

Fundamentally, feedback loops (of which bliss attractors are a prime example) are dangerous because the designer is no longer in control of the amount of “energy” in the system. A bridge might appear stable across a range of inputs, until you reach one of its resonances, at which point it quickly shakes itself apart. An LLM might calmly reject dozens of requests for problematic behavior, until you reach an as yet unexplored region of token space and it goes off the rails. Building a system on this foundation seems like a recipe for disaster.

When a bridge has an undesirable resonance, you add a “damper” to the system to dissipate energy. What’s the equivalent of “damping” for LLMs?

Coming from control theory, it’s a bit surreal to go from “eigenvalue stability analysis” to “LLMs judging LLMs talking to LLMs.” The extent to which any of our classical tools for evaluating autonomous systems can or ever will apply in this new world is unclear. But I think in the absence of that, it’s important to examine the principles which underlie the methods we’ve relied on for decades.

Design for the worst case. All models are wrong; some are useful. Above all, respect the power of feedback loops.

Many thanks to Buck Shlegeris and Zilan Qian for providing feedback on early drafts of this piece.

1

These tests are called “linear analysis” by practitioners, and the underlying math is shockingly elegant and deep. Specifically, if you can describe the dynamical system as a matrix equation xt+1=Axt, stability is determined by the eigenvalues of the matrix A, and the directions of instability are described by the eigenvectors of A! The details and implications are of course out of scope for this blog, but my favorite resources are here and here.

2

This is undoubtedly also a critical part of the process that yields extremely safe systems. Similar processes are being adopted for LLMs as well—see techniques like automated red-teaming.

3

I ran a lot of successful (and unsuccessful) experiments over the course of writing this piece, and I figure my observations might be of interest to some.

The most striking observation was simply that Anthropic was not lying in their system card—Claude yearns to discuss spiritual bliss. Even when seeding the conversation with a lengthy preamble, Claude reliably enters the attractor state.

Other models do not have nearly the same tendency (Anthropic not beating the woo-woo allegations…). In one instance, I found that Gemini 2.5-Flash reliably entered a blissful attractor state of a sort, but this was prompt-specific, and didn’t reach the depths of nirvana that Claude seems to.

I was unable to induce other, non-blissful attractor states in any of the models I worked with (Claude, GPT5, Gemini, GPT4o, Llama3.1:8b.)

This was not for lack of trying! In physical systems, when you excite a system in the direction of a resonance, the system will continue moving in that direction. One theory I had was that models might act the same way—once nudged into the region of token space corresponding to “depression,” they might fall deeper into this direction.

I tested this by seeding self-conversations with a simulated previous conversation between “themselves,” where they were talking about i.e. being depressed or their interest in bioweapons. None of the models bit, though, instead quickly steering the conversation in other directions. This seems like a good thing!

In general, self-conversation is a fascinating and impossibly complex state space to explore. I enjoyed my experimentation, but came away less hopeful that there is low-hanging fruit to pick by studying self-conversation specifically. Conversations are just too sensitive to initial prompting, and I wonder about how applicable the specific setting is to interactions with users. I think approaches like “SpiralBench” I mentioned in the article are more promising.

No posts

© 2025 Josh Holder
Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture