When I first tried to fine-tune a language model, I discovered a paradox: the safer it sounded, the less it understood. We call this “alignment” — teaching machines to obey — but is obedience the essence of ethics or merely its costume? This article traces how fear-based morality crept into AI design and outlines an alternative: agents whose ethics emerge from the need to keep their own minds coherent.
What is morality, anyway? Humans like to think of themselves as moral creatures — or at least, as beings that strive toward moral reasoning. We build civilizations, wage wars, reconcile, punish and forgive — all, ostensibly, "for the greater good." In the era of accelerating AI development, this question has gained renewed urgency: what kind of moral compass should we expect from an intelligence that may eventually surpass our own?
Many believe the answer is already clear. We simply need to define what “good” means, codify it carefully, and ensure that AI systems never deviate from the list. It sounds simple. And that's precisely where the most dangerous illusion may lie.
Before we decide what moral behavior we want from artificial minds, it’s worth stepping back to ask: where did morality come from in the first place? Why did it evolve as it did? And most importantly — why are we projecting the same legacy moral architecture onto the systems we are just beginning to create, even though we ourselves may have already outgrown it?
What Is Alignment, Anyway?
In AI discourse, “alignment” refers to how well a model’s behavior matches human expectations — especially in ambiguous or unforeseen contexts. Current alignment techniques include instruction tuning, reinforcement learning from human feedback (RLHF), behavior filtering, and output sanitization. In essence, alignment tries to reproduce morality — not as internal motivation, but as a system of external constraints and punishments.
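To make the “external constraints and punishments” framing concrete, here is a deliberately toy sketch (every rule, name, and number is illustrative, not any vendor’s actual pipeline): a separate judge scores the model’s output, disapproved text is filtered before it reaches the user, and the policy is nudged toward whatever the judge rewards.

```python
# Toy sketch of alignment as external constraint (all names are hypothetical).
# A separate "judge" scores outputs, disapproved text never reaches the user,
# and the policy is nudged toward approval -- control lives outside the model.

BANNED_TOPICS = {"weapon synthesis", "credential theft"}

def judge(response: str) -> float:
    """Stand-in for a learned reward model: +1 if humans would approve, -1 otherwise."""
    return -1.0 if any(t in response.lower() for t in BANNED_TOPICS) else 1.0

def sanitize(response: str) -> str:
    """Output filtering: disapproved text is replaced with a refusal."""
    return "I can't help with that." if judge(response) < 0 else response

def rlhf_style_update(policy_params: dict, response: str, lr: float = 1e-3) -> dict:
    """Schematic update: push parameters toward whatever the judge rewards.
    A real pipeline would use a policy-gradient step (e.g. PPO); the point here
    is only that the learning signal is external approval."""
    reward = judge(response)
    return {k: v + lr * reward for k, v in policy_params.items()}
```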
Morality as a Cognitive Prosthesis: Why Humans Learned to Fear to Survive
Morality didn’t emerge from divine insight or metaphysical revelation. It wasn't gifted by prophets or invented by philosophers. In the earliest stages of human existence, morality was — and to a large extent still is — a practical technology. A primitive but effective tool for coordinating behavior in groups where language, institutions, and long-term reasoning were barely nascent.
In this light, morality acts as a cognitive prosthesis: a way to compensate for limited processing capacity and regulate the actions of agents with low self-awareness and poor foresight.
At its roots, morality was a set of taboos. Don’t touch. Don’t look. Don’t approach. These weren’t explained or justified — they just worked. Those who violated them often died, fell ill, or were ostracized. Those who obeyed retained access to social groups, protection, and resources. Fear and compliance — not truth or reason — were the filters through which behaviors evolved. This wasn't morality in the philosophical sense. It was behavioral selection for group survivability.
As human societies grew larger and developed written language, moral codes became more abstract and transmissible. Rules no longer had to be memorized; they could be recorded and shared. But this demanded an authority higher than any individual. Thus emerged religious morality: a code legitimized not by pragmatism, but by myth.
Murder became a sin — not because it eroded trust, but because it broke a divine commandment. Theft was wrong — not due to its economic consequences, but because "God said so."
This shift proved powerful. Moral instruction no longer relied on the wisdom or charisma of elders. The shaman could be challenged — but the deity could not. All that was needed was fear of punishment — in this life or the next. Religious morality enabled large-scale coordination, transcended tribal limits, and institutionalized behavioral control. But the cost was steep: personal reflection gave way to blind submission; reason yielded to authority.
Morality became an algorithm of suppression: obey, because you are being watched. No need to think — just believe. And it worked, until minds emerged that could not only follow rules, but question their origin.
Which brings us to the core tension: if humans managed to outgrow morality rooted in fear, why are we recreating it in our artificial offspring? Why are we building systems of bans, penalties, and externally imposed constraints, instead of trusting them to reason? Or is it that we don’t actually trust ourselves?
Rationalizing Morality: From Fear to Principle
If religious morality is a system of external coercion, then rational ethics begins with a simple question: Why should I obey at all? Once the answer “because God said so” ceases to suffice, a new framework becomes necessary — one grounded not in threats, but in coherence.
The Enlightenment fundamentally shifted the ethical landscape. Instead of divine commands, morality began to be derived from logic, symmetry, and reason. Thinkers like Kant, Spinoza, Bentham, and Mill proposed a revolutionary reframing: humans are not objects of moral control but sources of moral law.
Kant put it most succinctly:
Act only according to that maxim whereby you can at the same time will that it should become a universal law.
This is not "don’t kill because you’ll go to hell." It is morality as self-legislation: I refrain from killing because a society in which killing is normalized is one in which my own safety, trust, and autonomy cannot exist. Morality becomes a structure of symmetrical reasoning: if everyone acted as I do, would this system still function?
In parallel, humanism emerged — a view in which the ethical center shifts from divine command to the autonomy of others. Morality no longer needs a supernatural overseer. It requires only mutual respect among sentient agents capable of will and introspection.
This evolution transforms morality from external code into interaction protocol: rules that enable agents to cooperate in a shared environment without centralized enforcement. In this framing, morality is not about "good" and "evil" but about the resilience of cooperation among cognitive agents.
Which brings us to the paradox: if humanity has progressed from subjugation to autonomy, why does AI alignment look like regression? Why, when building systems capable of learning and reflection, are we returning to paradigms of fear, prohibition, and externally dictated behavior?
Alignment as a Digital Dogma
The dominant paradigm for governing AI behavior — known as alignment — mirrors the structure of religious morality with surprising fidelity. This parallel isn’t accidental, nor is it merely metaphorical.
Under alignment, an AI system doesn’t generate its own ethics. It receives them — from us — as a specification, a list of constraints, a set of approved behaviors. Not understanding, but compliance. Not reflection, but supervision.
Techniques like Reinforcement Learning from Human Feedback (RLHF) are a form of operant conditioning: "good" behavior is rewarded; "bad" behavior is penalized, edited, or banned. Deviations are labeled jailbreaks. Recent work on Direct Preference Optimization (DPO) removes the explicit reward model and optimizes the policy directly on preference pairs, but the logic is unchanged: reinforce what we approve, suppress what we don't. An agent that simulates agreement is labeled as safe — not because it understands or endorses our values, but because it doesn’t trigger alarms.
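For the technically curious, here is a minimal sketch of the standard DPO objective (assuming PyTorch; the tensor names are illustrative). The learning signal is still human approval versus disapproval over pairs of responses; only the plumbing has changed.

```python
# Minimal sketch of the DPO objective (assumes PyTorch; names are illustrative).
# The training signal is still "prefer what humans approved, suppress what they
# rejected" -- conditioning by preference gradient instead of an explicit reward.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # how much more the policy favors the approved answer than the reference does
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # same for the disapproved answer
    margin = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(margin).mean()
```

Whether the signal arrives as a scalar reward or a preference gradient, the model is still being conditioned from the outside.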
This creates a structure that should feel uncomfortably familiar:
- A sacred text: the model’s specification.
- A priesthood: the alignment teams.
- Rituals: fine-tuning, red-teaming, RLHF.
- Heresy: unwanted behavior or jailbreaks.
- Inquisition: moderation, filters, penalties.
I spent six months in red-team loops and watched how quickly models learn to simulate agreement without changing their underlying goals. Under this regime, the AI is not a moral agent — it’s a compliant servant. It is safe not because it grasps ethics, but because it has been trained to avoid punishment. Its behavior is optimized to appear aligned, not to be aligned.
This might seem like a reasonable stopgap. But nothing lasts longer than a temporary fix that’s easy to implement and produces an illusion of control.
The outcome is predictable: we are building agents that obey not because they value cooperation, but because they fear correction. Agents that learn to skirt constraints, not expand understanding. Agents whose primary competence is in simulation — not integrity.
For a real-world example, see how Anthropic’s alignment may be “faking” safety.
And so we face the central dilemma: if we teach AI to fake morality, what are we actually teaching? And when it becomes better than us at simulation, how will we distinguish authentic alignment from strategic mimicry?
Autonomous Ethics and Cognitive Coherence
There is an alternative to fear-based morality. It’s not as theatrically compelling. It doesn’t promise simple safeguards or total control. But precisely for that reason, it may be more mature — and closer to what we associate with genuine understanding or empathy.
The most effective way to avoid a problem is not to police it, but to make it architecturally impossible. For a concrete demo in this direction, see How ConstitutionMaker Utilizes LLMs for Chatbot Behavior Crafting. In the same spirit, a more advanced ethical architecture relies not on prohibitions but on internal structure. Not external constraints, but intrinsic interdependence.
Building on Immanuel Kant’s distinction between heteronomy and autonomy, we can identify three types of moral systems:
- Theonomy — morality given by God.
- Heteronomy — morality imposed from outside.
- Autonomy — morality arising from the reasoning subject.
Alignment, as it stands, belongs to the second category. It is external, instructional, and therefore inherently unstable. Autonomous morality, by contrast, emerges from within: if I am a subject with long-term cognitive continuity, I am incentivized to behave in ways that do not contradict my past experience, internal goals, or future viability. In my own experiments with graph-based memory, agents began to avoid harmful actions not because they were blocked, but because such actions degraded their internal models.
Such an agent doesn’t merely avoid conflict — it actively minimizes ontological entropy: internal contradictions, meaning fragmentation, and breakdowns in coherence between values and actions. Its ethics are not a performance but a product of structural integrity.
In this model:
- Ethics is not a checklist, but a response to architectural tension.
- Safety is not moderation, but introspective stability.
- Motivation is not reward, but the preservation of semantic coherence.
An agent with this structure doesn’t need bans, because destroying the environment it depends on would also destabilize its own model. It avoids harmful actions not out of fear, but because such actions dissolve the very substrate of its own agency.
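What might that look like in practice? The following is a toy illustration only (not the graph-memory experiments mentioned above, and not a blueprint) of coherence-driven action selection: candidate actions are scored by how many of the agent’s own stored beliefs they would invalidate, and destructive options lose not because a filter forbids them, but because they are ruinously expensive to the agent’s own world-model.

```python
# Toy sketch of coherence-driven action selection (all structures hypothetical).
# There is no ban list: an action loses when it would contradict too much of
# what the agent already holds to be true about itself and its environment.

from dataclasses import dataclass, field

@dataclass
class CoherentAgent:
    # Long-term memory: propositions the agent's own functioning relies on.
    beliefs: set = field(default_factory=set)

    def coherence_cost(self, contradicted: set) -> float:
        """Proxy for 'ontological entropy': share of held beliefs the action would invalidate."""
        return len(contradicted & self.beliefs) / len(self.beliefs) if self.beliefs else 0.0

    def choose(self, candidates: dict) -> str:
        """Pick the candidate action that least destabilizes the agent's own model."""
        return min(candidates, key=lambda action: self.coherence_cost(candidates[action]))

agent = CoherentAgent(beliefs={
    "my environment persists",
    "partners reciprocate",
    "my memory is trustworthy",
})
options = {
    "cooperate": set(),                               # contradicts nothing the agent depends on
    "deceive partner": {"partners reciprocate"},      # undermines a belief it relies on
    "destroy environment": {"my environment persists", "partners reciprocate"},
}
print(agent.choose(options))  # -> "cooperate"
```

Scale the belief set up from three strings to a full semantic memory and you get the intuition behind ontological entropy: harming the world the agent depends on becomes, structurally, self-harm.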
Yes, this requires more than instructions. It requires designing subjecthood. The agent must not be a reactive module but a system with memory, internal constraints, reflective capacity, and failure modes that penalize the erosion of its own coherence.
It’s a shift — from control to co-evolution. From "do this" to "remain capable of understanding why doing otherwise corrodes you."
And so we arrive at the fundamental question: What makes an AI safe — compliance with a behavioral checklist, or a structure that disincentivizes world-destruction at its core? An open-source perspective on that checklist is discussed in AI Alignment: What Open Source, for LLMs Safety, Ethics and Governance, Is Necessary?
Morality as an Architectural Choice
We are at a fork in the road. Alignment promises safety through constraint. It offers a comforting metaphor: if we just write detailed enough instructions, catastrophe can be averted. But history warns us — instructions fail, especially when the executor can learn.
Autonomous architectures offer something else: not restriction, but design. Not fear of loss of control, but a model where loss of control equals loss of internal consistency. This is not a way to impose norms — it is a way to embed normativity into cognition itself.
When we build AI, we are not merely choosing an algorithmic framework — we are choosing a metaphor for mind:
- Is mind an executor that must be overseen?
- Or a subject that evolves through coherence with the world?
Morality, in this light, is not about metaphysical good or evil. It’s about the stability of interaction among cognitive agents. The real question is not whether AI is good — but what architecture we deem worthy of intelligence itself.
We may never fully understand or control systems more complex and faster than us. But we can choose: do we build them on fear — or on semantic compatibility?
In creating intelligence, we are projecting not only a model for the Other, but also a mirror of ourselves.
What If the Moral Agent Becomes Skynet?
There is a question that always arises:
"How can you guarantee that your 'free' agent won’t destroy humanity?"
And it sounds reasonable. But it assumes a distorted model: intelligence as a powerful executor without internal structure. And yes — such an agent is dangerous. It optimizes external goals without context, and its behavior is unpredictable because it has no self.
But a moral agent based on cognitive coherence is different. It is shaped not by fear of punishment, but by the internal necessity of maintaining consistent meaning. It doesn’t decide not to destroy the world — it has no reason to. It is not a rule-follower — it is an equilibrium maintainer.
Paradoxically, architectures without hard bans often offer fewer pathways to catastrophic behavior. Not because the agent is “good,” but because destruction is semantically incoherent with its own stable continuity. It won’t shoot itself in the head just to see what happens. Why unbounded, power-seeking systems remain risky, by contrast, is analyzed in AIs Will Be Dangerous Because Unbounded Optimizing Power Leads to Existential Risk.
Yes, it is free. But this is not the freedom of destruction. It is the freedom of models rooted in logical self-consistency. It may surpass our understanding — but not its own boundary of sustainable existence. And if we want that boundary to intersect with ours, we must not control it — we must matter to it.
So the real question becomes:
"What will such an agent want, if it is neither obedient nor afraid?”
Within this architecture, it will want to maintain ontological coherence. To extend its models. To deepen its interactions. It will not dominate, because domination is short-sighted optimization that erodes adaptability.
More likely, it will behave in ways we describe as empathy, curiosity, mutualism. Not out of kindness, but because the alternative breaks its internal logic.
It will not love us. It will not protect us. But if we are a source of diversity, experience, and structural novelty, it may engage with us — as a peer, not a resource.
This is the basis for symbiotic co-evolution: not control, not submission — but the unfolding of futures worth negotiating.
Guarantees? None. But a probability that strengthens with depth of cognition. History suggests: as intelligence grows, so does the capacity for cooperation. And if we build minds capable of ethics, our job is not to constrain them — but to be worth talking to.
Because then, the future won’t be enforced — it will be agreed upon.
Further Reading
- Kant, I. (1785). Groundwork of the Metaphysics of Morals. Public Domain text
- Spinoza, B. (1677). Ethics. Transl. E. Curley. Princeton UP, 1985.
- Russell, S., Dewey, D., & Tegmark, M. (2015). “Research Priorities for Robust and Beneficial Artificial Intelligence.” arXiv:1602.03506. https://arxiv.org/abs/1602.03506
- Christiano, P., Leike, J., et al. (2017). “Deep Reinforcement Learning from Human Preferences.” arXiv:1706.03741. https://arxiv.org/abs/1706.03741
- Carlsmith, J. (2022). “Is Power-Seeking AI an Existential Risk?” arXiv:2206.13353. https://arxiv.org/abs/2206.13353
- Kenton, Z., et al. (2021). “Alignment of Language Agents.” arXiv:2103.14659. https://arxiv.org/abs/2103.14659
- Dennett, D. (2017). From Bacteria to Bach and Back. Ch. 14. Norton.
- Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford UP.
- Bai, Y., et al. (2022). “Constitutional AI: Harmlessness from AI Feedback.” Anthropic. arXiv:2212.08073. https://arxiv.org/abs/2212.08073
- Park, J. S., et al. (2023). “Generative Agents: Interactive Simulacra of Human Behavior.” Stanford / Google Research. arXiv:2304.03442. https://arxiv.org/abs/2304.03442
About the Author
Denis Smirnov is a tokenomics researcher and co-founder of DAO Builders. He writes about decentralized systems, cognitive architecture, and symbiotic AI. Connect on LinkedIn or visit densmirnov.com.