Essay Writing Service

Anthropic Design in the Creation of ‘Aligned’ AI Systems

Anthropic Design in the Creation of ‘Aligned’ AI Systems

Brood of hell, you’re not a mortal!

Shall the entire house go under?

Over threshold over portal

Streams of water rush and thunder.

Broom accurst and mean,

Who will have his will,

Stick that you have been,

Once again stand still!



  • Goethe

Chapter 1


  1.          ‘Software is eating the world…’ 

Artificial Intelligence (AI) has become a loaded term, conjuring up images of human-like embodied machines with superhuman-like capabilities, often married with a peculiarly human way of interpreting and reacting to the world. Once we begin to think of an intelligent agent (IA) we cannot help but attribute human characteristics. And this is perfectly natural and expected; we know of only one truly intelligent agent in our evolutionary history; and that is us. As such, we end up attributing these characteristics to AI that are not necessarily connected to or associated with intelligence per se. For instance: consciousness, normative reasoning, and personality. Such attributions only contribute to a specific and narrow type of intelligence, that of humans, with all its strengths and shortcomings. But such a narrow intelligence isn’t required or necessarily desired.

We can think of an AI as an optimization process that is highly capable in carrying out a specified goal or set of goals. Understood like this, we notice that AI systems are ubiquitous. Fictional portrayals of artificial intelligent agents that aid humans to achieve goals, such as HAL 9000 in 2001: A Space Odyssey or Asimov’s MULTIVAC, have found their real-life analogues in Apple’s Siri or Amazon’s Alexa. Financial markets rely on large data sets and computer algorithms that trade at speeds which humans could never achieve, with little to no guidance by a human operator (Lin, 2013). And autonomous vehicles (AV) have progressed significantly with manufacturers such as Tesla offering cars with full self-driving capabilities (, 2018).

These developments have contributed to a palpable enthusiasm within the tech community that we will one day create an AI that will be able to perform goals more generally, that is, we will create an Artificial General Intelligence (AGI). This isn’t the first time we have experienced this kind of AI optimism. Past AI research has often been characterised by periods of optimism, followed by periods of disappointment. But whether AI researchers remain optimistic or not we should assume that these artificial agents we are considering are realizable; considering how quick we are to disseminate new technologies and the potentially huge impact they could have to everyday life for humans and other living species (Bostrom, 2014; Yudkowsky 2008).

The increase in computational power coupled with more ingenious algorithms and novel systems, such as Artificial Neural Networks (ANN) and Machine Learning (ML), are proving incredibly effective at attaining goals in many tasks across many industries (Narula, 2018). But with increased intelligence comes increased potential catastrophe if the motivational system guiding its actions goes awry. Given the increasing deployment of AI systems and their increasing computational power and ability to affect the world it would be prudent to examine the motivational systems that guide them. Up until very recently, worries about potential misguided motivations in AI had been scant. Traditionally it was sufficient to give an AI a simple goal – e.g. check for plagiarism, filter for spam email, tidy house – without considering unintended consequences; because AIs were limited in their scope of affect. Now AIs are demonstrating more general capabilities and autonomy with their behaviour harder to predict and we are beginning to see some of the unexpected failure modes these systems can demonstrate. Microsoft’s chat-bot Tay which after only a few hours after its launch, began to post increasingly racist content, tweeting:  ‘bush did 9/11 and Hitler would have done a better job than the monkey we have now. donald trump is the only hope we’ve got.’ (Hunt, 2018). Demonstrating how an AI can act in a way that is undesirable or unexpected, even to the developers that built it. Such an error was fairly innocuous, the bot was indeed incredibly offensive and may have caused some distress to its Twitter audience, but it was incapable of actually acting on the world in a physical and embodied way. This may not be the case in the future. As AI systems become more intelligent, their domain of application will expand. To fully reap the benefits, this will expand to include AI interacting with the physical world.


1.2  The Sorcerer’s Apprentice and Perverse Instantiations


There are many tales of highly powerful magical entities possessing the power to perform or grant any goal or wish to an individual. Common amongst these tales is that they will perform the task or wish given to them with literalness; resulting in them being performed or granted in unexpected and undesirable ways. A good example of such a tale can be found in Goethe’s poem ‘The Sorcerer’s Apprentice’ (1797). The sorcerer departs his workshop, leaving his young apprentice with a number of chores to perform, one of which is to fill up a large cauldron. After growing bored of continually filling up buckets of water to fill up the cauldron, the apprentice decides to enchant a broom – using magic in which he is not fully trained – to become animated and carry the water for him.  This initially works very well, with the cauldron being filled by no effort of his own. But it soon becomes clear that the broom will not stop filling up the cauldron with water. The apprentice, cocksure in his magical abilities, soon realizes he has started something which he cannot stop.  He attempts to stop the broom’s maniacal pursuit of endlessly filling the cauldron with water by striking it with an axe causing it to split into two pieces. These two pieces form their own separate brooms with the same objective to fill the cauldron with water, but now at twice the speed. As the room fills with water the sorcerer returns and breaks the spell.

Such a scenario bears a curious resemblance to the value-alignment or value-loading problem of AI. Consider an AI with the final goal to reduce the amount of CO2 in the atmosphere. This AI will have access to a large repository of resources and computational power if it is to affect the chemical composition of the atmosphere. After we have set the objective for the AI, we might expect that using its superior intelligence and vast resources it will create a mechanism that buffers the amount of CO2 in the atmosphere to the desired level. We do not know exactly how, but we expect that the AI will. However, it may be that the AI could achieve its final goal to a greater degree by creating a super-virus that kills off a large portion of all CO2 emitting organism on Earth, including humans. The AI would have no preference, in say, killing mosquitoes over humans, unless specifically programmed for. More extremely, the AI would not have any preference for life at all, unless programmed for. And in many instances, it is most efficient and viable for an AI to do satisfy its goal in ways that to us seem unimaginable, what Bostrom calls a ‘Perverse Instantiation’ (2014: 146).

It seems then that the AI must have some conception of the things we value, and in some sense, what we mean by our instructions. This is a surprisingly difficult problem to solve. Much of natural language relies on some implicit knowledge shared between those communicating. Commands aren’t usually literal but implied. For instance, if a manager at a busy diner calls a worker to come in early and says “I need you in ASAP”. This is not meant literally. A literal rendition of this would have the worker drive through all red lights, at dangerously high speeds, running over any unfortunate soul that happens to be in his way. What is meant then is arrive as soon as possible within some sphere of common sense that is naturally assumed in human-to-human interactions.

It has been proposed that by programming some simple goal into AIs we can avoid these perverse instantiations. Hibbard (2001) suggests that we train an AI through a learning system to recognise positive facial gestures, laughter, voice tones and body language. The AI could then be programmed to execute actions that lead to “positive human emotions”. How could an AI that’s programmed to execute actions leading “positive human emotion” be undesirable? Setting aside the existential question of whether maximal pleasure is even desirable for humans, we can invoke the notion of the sorcerer’s apprentice to see how such a goal could lead to perverse instantiations (Bostrom, 2014: 146-147).

Goal: Induce positive human emotions.

Perverse Instantiation: paralyze facial muscles into a constant smile.

This is one viable path to induce positive emotions that could conceivably attain its goal to a higher degree than actually making the person smile. The usual means that humans use to induce positive human emotions in others is likely to be far more complex for an AI than physically forcing a smile onto a human face. The intuitive response is to stipulate not to paralyze facial muscles into a constant smile. So we have:

Goal: Induce positive human emotions without paralyzing facial muscles.

Perverse Instantiation: Manipulate parts of the brain controlling facial muscles to produce a constant smile.

Assigning the final goal to some behavioural manifestation of happiness doesn’t look promising. So what if we define the final goal in terms of the phenomenal state of being happy, assuming this can be represented computationally.

Goal: Induce a state of subjective happiness in humans.

Perverse Instantiations: Construct a dopamine drip device that puts humans in a permanent state of euphoria.

A common response is to say that it is obvious that this is not what we meant by the specified goal. But this assumes that the AI is like a servant who can look over the commands and figure out what was meant. This is another case of anthropomorphisizing the AI; computer programs do not function in this way. The AI does not receive the code and then act on it; the AI is the code (Yudkowsky, 2011: 1).


1.3  The Orthogonality Thesis

It is not only AI capabilities that we anthropomorphosize, we also tend to do the same with an AI’s motivations. In the example of an AI regulating CO2 levels in the atmosphere, we saw that if left to its own devices it could carry out the goal in a way that we would find deeply unsettling. This is because its goal is not aligned in some sense with the things that we value. Unless an AI is specified to value the things that we value, it will carry out its goal in any way that increases the likelihood of it being attained (Russell and Norvig, 2010:611). Intelligence and motivation are in some sense independent of each other; they do not naturally converge. Bostrom visualizes intelligence and motivation as two orthogonal axes on a graph, where any level of intelligence can be paired with any goal (some caveats)[1] (2014: 130). Current AI research seems to suggest this. AI methods such as Reinforcement Learning (RL) and Artificial Neural Networks (ANN) rely on artificial agents being rewarded for executing actions leading to states desired by the goal and punishes the agent when states are achieved that do not help realize its goal. No matter what goal is associated with the reward, the AI will pursue actions leading to the rewarding states. The main point to come away with is that high intelligence and efficiency do not guarantee desirable goals; we cannot ‘let the AI figure it out’.



1.4  Instrumental Convergence

The orthogonality thesis holds that any goal can be paired up with any level of intelligence. Therefore, intelligence cannot guarantee desirable goal-directed behaviour. Nevertheless, there may be certain instrumental goals that any sufficiently intelligent AI would pursue. Instrumental goals are those whose satisfaction leads to a greater probability of the final goal being realized. An obvious example of an instrumental goal is survival. If my final goal in life is to retire early in the suburbs then survival would be an instrumental goal; if I die I can’t retire. Similarly, if I wanted to become the world’s best cricket player then my survival would be instrumentally valuable; I cannot play cricket if I’m dead. The same may be expected of highly capable AI, but whereas humans place some final value on life, the AI need not. Being powered on is just a necessity of attaining its goal.  This presents a serious problem in AI-control – If we create a powerful AI with a misspecified goal then it will be very difficult to shut down and if it is of higher than human-level intelligence it may be impossible. The AI places all its value in the final goal being realized, so it will avoid anything that obstructs this, including being shut down by its creators. We can’t just ‘turn it off’.

Other AI Drives include goal-content integrity, cognitive enhancement, technological perfection and resource acquisition (Ohomundro, 2008). These drives contribute to the intelligence and capability of the AI, exasperating the control problem.





1.5       Chapter Summary


The arguments in this chapter can be summarised as followed:

  1. Any level of intelligence can be matched with any goal.
  2. If we do not correctly specify the goal that we want, the AI might find a way to maximally increase its satisfaction in ways that could be deeply undesirable to humans.
  3. Instrumental goals (such as self-preservation, cognitive enhancement, resource acquisition), that increase the likelihood of the final goal being realized, converge in AI.
  4. If we specify the goal incorrectly, then given 1-3, the AI might pursue the unwanted goal with increasing ability and competence.
  5. If the AI reaches above-human-intelligence, via 3, it will be impossible to shut down.

Chapter 2


2.1.1 How Do Humans Acquire Values?

In the previous chapter, we noted that values played an important role in determining the manner in which goals are attained.  Values constrain behaviour and limit the ways in which a goal can be carried out, acting as a compass that guides an agent into acceptable patterns of behaviour. In addition to instrumental values, there are terminal values which are valued in themselves, such as the value of life, which may serve as abstractions of previous situations in which rewards were received (Sotola, 2016). These terminal values are absent in AI unless programmed for; an AI does not see any intrinsic value in terminal values. Any values that automatically align are likely to be those outlined by instrumental convergence. Our value for life aligns with an AI’s, but whereas our value of life is terminal, the AI sees only the instrumental value of being alive.

It seems that we must try and inculcate human values in AI if we want it to carry it out a goal in a way that humans would find desirable. But the process by which an agent acquires these values isn’t well understood. Human values are vast and nuanced; the expression of which are multifarious, presenting themselves in various costumes from culture to culture. Although human values vary across the globe, there does seem to be, in some sense, a tacit, universal human value framework under which we act (Schwartz and Shalom, 1992). There has been considerable research into human values within the social sciences with many authors looking to establish a unifying theory of human values (Schwartz and Shalom, 1992) or a foundational theory of morality (Haidt, 2001). By looking at previous research and literature on human values, we can begin to think about which values (and by what process) we would want to instil in AIs.

2.1.2 Anthropomorphic Bias vs Anthropomorphic Design

Though I have warned of anthropomorphic bias in discussing AI capabilities, this shouldn’t be confused with constructing an AI with an anthropomorphic design. Anthropomorphic biases mislead us about the nature of intelligence and motivation in AIs, but designing AIs anthropomorphically may be beneficial; given that we want them to be aligned with human values. It may be that in order for an AI to align with our values we will have to build AI systems with important commonalities to the human mind. At the same time, designing an AI anthropomorphically does not guarantee human-like behaviour, but it might be a promising place to start.

It isn’t entirely clear how humans acquire values, but there are probably environmental and evolutionary factors at play, which likely interact. The following chapters examine how we acquire these values with the ultimate purpose of applying this knowledge to AI design. The first chapter looks at the evidence for innate human values and the evolutionary role these might play. The second section looks at how human acquire values through socialization and enculturation.



2.2.1 The Evolutionary Origin of Values and Morality


So far we have been discussing human values as being those things we value fundamentally, which influence the goals we pursue and the ways in which we pursue them. We notice that the latter part of this definition echoes the notion of a morality;whichseeks to establish trust and cooperation amongst agents by acting in ways that are appropriate, according to some tacitly agreed upon criteria, promoting optimal outcomes for everyone (Sinclair, 2012: 14). Values can be thought of as normatively thinner entities than moral codes or instructions; in some sense, these moral codes derive from human values; we want to act in ways that guarantee or safeguard them. The reason for not discussing the value-alignment problem in terms of morality is because morality often presupposes faculties that may not be present in an AI. Consciousness is often believed to play a role in moral reasoning and accountability [REFERENCE], but such notions do not concern us in the present discussion: we just want the AI to act beneficially to humans. Nonetheless, it would be useful to examine any psychological/biological/neurological literature that seeks to explain how humans develop moral sensibilities and the role they play; with the hope of gaining some insight as to how to implement a framework of human values or morality in artificial agents.

2.2.2 The Moral Foundations of Morality: A Top-Down Approach


Top-down approaches to morality seek to establish or discover a set of duties, principles, rules, or goals which allow us to make the countless moral decisions we make daily (Wallach, Allen and Smit, 2007). These principles can come from a variety of sources, including religious, legal, and philosophical. Examples of top-down ethical theories include The Ten Commandments, Sharia Law, consequentialism, deontology, Aristotle’s virtue theory, and Asimov’s Laws. A top-down ethic for an AI has often been criticized as impossible to implement, as it is too hard to specify fuzzy concepts such as happiness, fairness, or justice, and if we can specify them the unconstrained pursuit of which will result in perverse instantiations (Soares, 2016).  In addition, some laws or rules that have been directly specified in an AI may conflict with each other. This conflicting-laws criticism is also levelled at deontological theories in humans, but whereas human minds have an architecture that seems to allow contradictory beliefs and conflicting goals (Kurzban, 2010), the same cannot be said of our current state-of-art AIs. Humans seem to be particularly well-adapted at resolving conflicting goals and cognitive dissonances, with some arguing that such an architecture must be built into AI if they are to function reliably (Muraven, 2017).

Although misspecifying the value framework or a morality is a serious problem, one which can lead to wholly undesirable outcomes, we should not do away with top-down ethics altogether. There is growing evidence that morality may be an innate faculty of the human mind (de Waal, 1996; Cosmides and Tooby, 1992; Cowie, 1999); that there are certain predetermined modules of morality which serve as the foundations of all moral behaviour. Jonathan Haidt’s Moral Foundations Theory of Morality (2001) has identified six modules, each with a positive and negative dimension, that give rise to moral behaviour.







These modules act as immediate, non-deliberative responses to situations demanding moral action, having evolved to provide a (positive/negative) response to adaptive challenges presented by the current situation (Kuiper, 2016). The function of the positive and negative affective dimensions on a module is to direct behaviour towards optimal actions that increase the resolution of an adaptive challenge. The Care/Harm module, for example, has the role of directing protective and nurturing behaviours towards offspring whilst avoiding behaviours that may prove injurious. Knowledge that one’s offspring is in harm, triggers a negative emotional response, which is then reconciled by ensuring offspring is cared for. The positive emotional response associated with care generalizes to include caring for other people’s children and even other species. Care becomes something we value in itself, once generalised after enough cases (the same applying to the other modules).

2.2.3 Social Intuitionism

It is important to note that these responses are not the result of deliberative moral reasoning. We do not weigh up the pros and cons before the decision, and then execute the action that encourages the outcome we want. Instead, a person elicits an intuitive response, which is preselected, and then provides a moral judgement and reasoning for the response[2]. Moral reasoning, like other cognitive abilities, takes place over varying time scales (Kahneman, 2015). Joshua Greene (2001) conducted a study to identify which areas of the brain were activated in participants responding to moral dilemmas, and found that areas of the brain associated with emotion-regulation were activated initially, followed by a slower activation of areas of the brain associated with reasoning and planning. This evidence supports Haidt’s Social Intuitionism model which asserts that moral judgements are (1) primarily intuitive, (2) rationalized post hoc, (3) taken to influence other people and (4) influenced by others judgements and actions.

Image result for social intuitionist model

Figure 1: Haidt’s Social Intuitionist Model of morality (Haidt, 2001)

It might be slightly disappointing for some, that in times of moral decision-making, we ‘go with our gut’. We like to think that the decisions we make are the result of careful reasoning and deliberation. That what separates us from other animals is our ability to envision various outcomes based on various decisions and pick the one that matches up with our preferences. But, there is good reason for this intuition-first mechanism of moral decision-making. If morality is to be practical and flexible it must be able to react to demanding scenarios in real-time. An architecture that analyses and compares outcomes, choosing the most preferable, is slow and demanding. Such an architecture would prove useless and too taxing evolutionarily to be able to deal with adaptive challenges in the immediate environment. In a similar vein to how true act-utilitarianism requires too much calculation – by avoiding generalizations -basing every decision on what promotes the greatest good in a given situation. Our evolved morality has the structure of rule-utilitarianism, which applies broader rules that govern behaviour in new situations; relying on generalized, pattern-directed rules that have proven advantageous in the past.

Pattern-directed responses to moral situations may not always prove satisfactory. Sometimes an intuitive response will invoke a response that demands further deliberation. Such deliberation rarely takes place solely in the individual’s mind, but instead takes place within a community over a longer period of time, via links 3 and 4 in Figure 1. The interaction between individual moral judgement/reasoning with other individual’s intuition, creating an interplay which leads to the moral evolution of a society. We both influence and are influenced by other participants in a society, with our moral intuitions converging to create a stable and conforming society.

2.2.4 The Role of Signalling and its Application in AI Systems

Links 3 and 4 in Haidt’s Social Intuitionist model of morality illustrate the importance of signalling. Our moral behaviour is used to signal to other potential cooperative partners that we are trustworthy, at the same time, reading signals from other potential trustworthy partners. This kind of signalling could prove very useful in autonomous vehicles (AV). An AV must drive in such a way as to signal intentions; that it is behaving in such a way to as to avoid risks, even when these risks are unavoidable.  In the event that an AV does cause an accident, it will be clear that it was not intentional, that the AV was doing everything in its power to avoid risk. AGI’s will have to take part in this signalling also; we want the intentions of the AGI to be transparent. As AI systems advance they will have to play a more active role in signalling, by both reading and adapting their behaviour to those potential trustworthy partners around.  It may be that AI’s will have to worry about something approximating reputation; consistent untrustworthy behaviour will result in less participation, which in turn will result in the AI’s reward-function being activated less.

2.2.5 Applying Haidt’s Moral Modules to AI Design

Haidt’s Moral Foundation Theory can also aid us in the design of beneficial or ‘Friendly AI’ (Yudkowsky, 2001). We will need to implement some kind of initial value-framework or morality into an AI, and Haidt’s work may offer us a foundational value-system which can be programmed into the AI. To illustrate how this could be implemented into an AI, let us use the Fairness/Cheating module as an example. Envision an embodied AGI with the goal of retrieving groceries from the local supermarket. If the only thing guiding the AGI’s behaviour is the final goal, then we can expect that anything will be justified in the acquisition of it. An AGI retrieving groceries from a supermarket may cut the line, steal, and even push people out of the way if the only goal is the retrieval of groceries. This is because these behaviours, actually enable the final goal to be satisfied more easily and to a higher degree than acting in ways that correspond to the fairness module. However, if we designate a reward-value to actions that respect fairness in the acquisition of the final goal, then we can steer the AGI into patterns of behaviour that respects fairness, alongside any other modules we assign a reward-value to.

2.2.6 Mammalian Values

These moral modules, and the values which they protect, may have their origins earlier in evolutionary history than humans. Sarma (2003) argues that what we call human values can be decomposed into three separate categories: (1) mammalian values, (2) human cognition, and (3) several millennia of human social and cultural evolution; the interaction of which contribute to the nuanced aspects of human values.  Mammalian values are those motivational systems that drive behaviour. Panksepp and Biven (2012) have identified seven such systems common to mammals: seeking, rage, fear, lust, care, panic/grief, and play. The neural correlates corresponding to these motivational systems have also been identified in mammals; they show that there is a common framework of mammalian values, which has been identified in the affective circuits deep in the subcortical regions of the brain. In evolutionary terms, the subcortical region of the brain is far older than the neocortical regions, with human neocortical regions being the most developed of all species. This suggests that mammals share a common foundational value-system which drive our affective and moral proclivities. The interaction between these foundational mammalian values in conjunction with human cognition and incidental historical processes form human values. Sarma (2013) argues that the combination of mammalian values with AI cognition will result in human-like values, by giving the mammalian-motivated AI access to all of human history, it will begin to develop the more nuanced aspects of human values. Incidental historical processes, the third component of Sarma’s decomposition, may or may not play a role the development of values in AI. Aspects of contemporary human values, driven by incidental historic processes, might not be as meaningful in a world where many problems have been solved by a deeper understanding arising from mammalian values and AI cognition.

It isn’t clear how many of these moral modules or foundational values there are, nor is it clear how they interact. The lists provided by no means exhaustive lists; they merely outline the more basic and obvious examples of values within humans and mammals, which have been inferred behaviourally and whose structure have been determined neurologically

2.2.7 Intelligence and Conflicting Values


As I alluded to earlier, some contemporary values which appear in conflict across cultures may be resolved with increased intelligence. Sarma (2003) uses an example from negotiation theory to illustrate how increased intelligence can resolve prima facie conflicting values. ‘Principled Negotiation’ is a method that distinguishes between values and positions, in order to reach a mutually satisfactory outcome for those involved (Fisher, 1987). For example, a couple is arguing about where to go on holiday. The woman, who took archaeology at college, wants to go to Egypt to see the pyramids. The man, on the other hand, has a deep interest in wildlife and wants to go to Costa Rica. The woman’s preference is visiting a country with a rich archaeological history, whereas the man’s preference is visiting a country with outstanding natural beauty. These preferences are the values, which correspond to their respective positions. By negotiating from values as opposed to positions, we can begin to resolve the conflict. So in the case of the couple, it may be that Mexico may satisfy both parties; it has a rich archaeological history and is also a country of significant natural beauty.  Conflicts pertaining across cultures may be resolved in a similar manner (this is just one possible method of resolving conflicting preferences amongst many). What we perceive as conflicting values might, at closer inspection, be a matter of conflicting positions, which can be resolved by reframing the problem and finding mutual preferences. This ability to reframe the problem and uncover mutual preferences will be enhanced with increases in intelligence. A highly powerful AI might determine that many of the conflicts of values between cultures are a result of divergent positions, which are motivated by mutually compatible values.



2.3 Chapter Summary

  1. Given that we want an AI to be aligned with our values, we should design an AI’s motivational or value system anthropomorphically; That is, with human characteristics in mind.
  2. To design an AI’s value-system anthropomorphically, we must understand the way in which human values and morality are acquired.
  3. There is evidence that morality, and the values they protect, are the result of evolutionary adaptions. Which is to say, they are to some extent innate
  4. Given that there is evidence that our morals and values are to some extent innate, we should not rule out top-down approaches in the implementation of ethics in AI systems.
  5. The top-down implementation of the Moral Foundations Model of morality could be used in AI systems to guide the AI’s initially uncertain value-system
  6. Similar to Haidt’s moral foundations in humans, there may be initial motivational systems common to mammals – mammalian values.
  7. These mammalian values, in combination with human cognition and incidental historic processes, form human values.
  8. An AI programmed with mammalian values, with access to all human history, will develop human-like values
  9. Values that seem to be in conflict cross-culturally may be resolved with increases in intelligence. An advanced AI may be able to reframe the problem and find common avenues which satisfy the values of both parties.


Chapter 3


3.1 Learning Values through Culture

Though I have been arguing that much of human moral behaviour and values have been the result of evolutionary pressures and adaptations, there is no doubt that much of our behaviour is learned. This is most glaringly obvious in the divergence of morals, values, and norms from culture to culture. If all moral behaviour and values were predetermined then there would be no such diversity. Indeed, we would expect to see very little conflict and probably little in the way of a distinct culture[3]. It isn’t hard to see how our interaction and observation of others, influence the way in which we behave. Indeed, we have already seen how the observation of others can shape our moral intuitions in Haidt’s Social Intuitionism (2001).

The process by which we learn behaviour, through observation from others (usually people held in high regard or revered), is Social or Observational Learning (Bandura, 1977). In addition to Social Learning Theory, there may be other ways that we learn values and develop morality. Literature, art, music, and opera offer us a wealth of information relating to the values inherent in a society. By reading a story, we put ourselves in a different world where moral concepts and ideas can be ‘tested out’, and we have the ability to ‘think with someone else’s head, other than our own (Schopenhauer, DATE). We can learn social behaviour by both the direct or real observation of others, or through the indirect or fictitious observation of a representative world within a story.  These two mechanisms of enculturation enable us to develop the nuanced values and morality of a culture. Social Learning has the benefit of allowing, almost constant, imitation and adaption of behaviour from others. Whereas the process by which we adopt values through literature, allows us to understand the motivations of those we are learning from.

I wish to investigate these mechanisms of learned moral behaviour and their application in AI ethics and value-loading.


3.2 Social Learning Theory




3.3 Storytelling and Moral Development

Moral development is rarely discussed in terms of story-telling or narratives. Instead, we adopt a model of moral cognitive development based on Kohlberg (1971), which has had a significant and lasting effect on our interpretation of moral reasoning. This model assumes that moral life is the result of the development of increasingly abstract moral principles, which are communicated verbally. These developments take course over six developmental stages, progressing from stage 1 to stage 6. Each stage reflects a qualitatively different kind of moral reasoning, which is independent of particular moral content. We move through the stages as a result of a cognitive disequilibrium between the current stage and higher stage of moral reasoning. The experiential conflict caused by this disequilibrium forces us towards the upper stages, with our moral reasoning evolving as we develop increasingly abstract and sophisticated principles of moral reasoning. Such an approach to moral reasoning is reliant on propositional thinking.

This characterization of moral development excludes another mode of cognitive functioning, namely, narrative thinking. Bruner (1986) distinguishes between these two modes of cognitive functioning.

There are two modes of cognitive functioning, two modes of thought, each providing distinctive ways of ordering experience, or constructing reality. The two (though complementary) are irreducible to one another. Efforts to reduce one mode to the other or to ignore one at the expense of the other inevitably fail to capture the rich diversity of thought. (Bruner, 1986: 11)

Whereas there has been considerable research in understanding propositional thinking, little interest has been paid to the role narrative thinking plays in our moral decision-making. Narratives focus on people and the causes of their actions: their intentions, goals, and subjective experiences (Vitz, 1990: 710). It is by paying attention to contextual details, that literature enables us to come to a more vivid understanding of moral reasoning. Its strength is context sensitivity, whereas propositional thinking is context-independent, and for this reason better suited to science and philosophy.

Support for the functional use of narratives in moral reasoning has been provided by Robinson and Hawpe (1981)

First, where practical choice and actions are concerned, stories are better guides than rules or maxims. Rules and maxims state significant generalizations about experience but stories illustrate and explain what those summaries mean. The oldest form of moral literature is the parable; the most common form of informal instruction is the anecdote. Both forms enable us to understand generalizations about social order because they exemplify order in a contextualized account. Second, stories can also be used as tests of the validity of maxims and rules of thumb. That is, stories can be used as arguments. Stories are natural mediators between the particular and the general in human experience. We should try to improve and refine this mode of thinking, not eschew it. (p 124)

It becomes clear that much of our moral life can be understood in terms of narrative thinking, and that our moral reasoning has a qualitatively distinct character from propositional thinking. The astute reader will notice that this bears a resemblance to social intuitionism discussed in an earlier chapter, which claims that moral responses are (1) primarily intuitive, that is, non-rational and (2) rationalized post-hoc. It may be that by exciting the non-rational part of our thinking through stories or experience, we come to develop our moral-intuitions.

3.4 Narrative Intelligence and its Application in AI

Some AI researchers have noticed the potential usefulness of story-telling and narrative intelligence in AI. Reidl and Harrison (2015) argue that given that there are infinitely many undesirable outcomes in an open world, we should have the AI-system learn our values. Their approach to aligning AI systems with our values relies on using stories to communicate the values and norms of a society, to which the story belongs.  They claim that ‘stories are necessary reflections of the culture and society that they were produced in’ which ‘encode many types of sociocultural knowledge: commonly shared knowledge, social protocols, examples of proper and improper behaviour, and strategies for coping with adversity’ (Reidl and Harrison, 2015). The idea being that by immersing an AI in the stories of a given culture, the AI will “reverse engineer” those tacitly held values within a culture.

Reidl et al highlight the Scheherazade system’s (Li et al, 2013) ability represent a domain as plot graph, which include a set of events (plot points), precedence constraints (particular events must happen before), mutual exclusion constraints (events that preclude the occurrence of another cause the plot graph to branch), optional events, and events conditioned on whether optimal events have occurred. The plot points can rearrange parts of example stories. The relevance of the Scheherazade system to value-alignment is its ability to learn how a story about a topic can unfold without having previously coded knowledge of the topic.

Figure 2: Example of Plot Graph modelling a trip to a pharmacy (Reidl and Harrison, 2015)


Reidl and Harrison’s approach uses a reinforcement learning agent – providing the agent with a reward signal – to encourage it into optimal patterns of behaviour in the acquisition of its goal. The agent will receive positive rewards for taking actions humans would take, in the acquisition of the goal, and receives punishment for actions humans would not take (or that don’t lead to the goal). In order to generate a reward-signal, the agent must learn a plot-graph via crowdsourcing linear narrative examples of the way the topic would typically occur. By using many crowdsourced examples it can avoid noise introduced by unlikely sequences or outlier events. Details and events are often left out in stories, but by crowdsourcing many examples these details can be captured and a reliable plot-graph generated.

The next stage translates the plot-graph into a trajectory tree, where plot graphs are nodes, and arrows denote a legitimate step from one plot point to the next. All possible trajectories are generated from the plot graph, with the desired path being used to assign a reward signal. The agent tracks its current state as well as its progress through the trajectory tree. Performing actions that are the successor of the current node, result in the agent receiving a reward. Actions performed that are not the successor of the current node result in a small punishment.

Figure 3: Example of trajectory tree using pharmacy plot graph.



There are a number of technical challenges using trajectory trees to generate a reward-signal.







Whos values?







































Bandura, A. (1977). Social learning theory. Englewood Cliffs, NJ: Prentice Hall.


Bostrom, N. (2014). Superintelligence. Oxford: Oxford University Press.

Bruner, J. (1986) Actual minds, possible worlds. Cambridge, MA: Harvard University Press.

de Waal, F. B. E. (1996). Good natured: The origins of right and wrong in humans and

other animals. Cambridge, MA: Harvard University Press.

Cosmides, L. and Tooby, J. (1992). Cognitive adaptations for social exchange. In J. H.

Barkow, L. Cosmides, and J. Tooby (Eds.), The adapted mind: Evolutionary

psychology and the generation of culture (pp. 163-228). New York, NY: Oxford

Cowie, F. (1999). What’s within? Nativism reconsidered. Oxford: Oxford University


Fisher, R. and Ury, W. (1981). Getting to yes. Boston: Houghton Mifflin.

Goethe, JWV “The Sorcerer’s Apprentice”, Poem of Quotes,  [3 May 2018]

Greene, J. (2001). An fMRI Investigation of Emotional Engagement in Moral Judgment. Science, 293(5537), pp.2105-2108.

Haidt, J. (2001). The emotional dog and its rational tail: A social intuitionist approach to moral judgment. Psychological Review, 108(4), pp.814-834.

Hibbard, B. (2001). Super-intelligent machines. ACM SIGGRAPH Computer Graphics, 35(1), pp.13-15.


Hunt, E. (2018). Tay, Microsoft’s AI chatbot, gets a crash course in racism from Twitter. [online] the Guardian. Available at: [Accessed 14 May 2018].

Jain (1

Kahneman, D. (2015). Thinking, fast and slow. New York: Farrar, Straus and Giroux.

Kohlberg, L. (1971) Stages of moral development as a basis of moral education. In C. Beck, B. Crittendon, and E. Sullivan (eds), moral education: interdisciplinary approaches. Toronto, Canada: University of Toronto Press.

Kuipers, B. (2016), “Toward morality and ethics for robots”, Ethical and Moral Considerations in Non-Human Agents, AAAI Spring Symposium Series, Palo Alto, CA

Kurzban, R. (2010). Why everyone (else) is a hypocrite. Princeton, N.J.: Princeton University Press.

Li, B.; Lee-Urban, S.; Johnston.G,; and Reidl, M. O. (2013) Story generation with crowdsourced plot graphs. In proceedings of the 27th AAAI Conference on Artificial Intelligence

Lin, Tom C. W. (2013). The New Investor. 60 UCLA Law Review 678 ; Temple University Legal Studies Research Paper No. 2013-45, pp.

Muraven, M. (2017). Goal conflict in designing an autonomous artificial system.

Retrieved from:

Narula, G. (2018). Everyday Examples of Artificial Intelligence and Machine Learning. [online] TechEmergence. Available at: [Accessed 14 May 2018].

Nyholm, S. and Smids, J. (2016). The Ethics of Accident-Algorithms for Self-Driving Cars: an Applied Trolley Problem?. Ethical Theory and Moral Practice, 19(5), pp.1275-1289.

Omohundro, S. (2008). The basic AI drives. In P. Wang, B. Goertzel, and S. Franklin (eds.). Proceedings of the First AGI Conference, Vol. 171. Frontiers in Artificial Intelligence and Applications. Amsterdam: IOS Press.

J. Panksepp and L. Biven, The Archaeology of Mind: Neuroevolutionary Origins of Human Emotions. WW Norton & Company, 2012. (2018). Autopilot. [online] Available at: [Accessed 12 May 2018].

Riedl, M. O., & Harrison, B. (2015). Using stories to teach human values to artificial agents. Paper presented at the 2nd international workshop on AI, ethics, and society. <>.

Robinson, J. A., and Hawpe, L. (1986). Narrative Thinking as a Heuristic Process. In T.R. Sarbin (Ed), narrative psychology: The storied nature of human conduct. pp. 111-125

Russell, S. and Norvig, P. (n.d.). Artificial intelligence: A Modern Approach. 3rd ed. New Jersey: Pearson Education.

Schwartz and Shalom H. (1992). “Universals in the Content and Structure of Values: Theoretical Advances and Empirical Tests in 20 Countries”. Advances in Experimental Psychology. Vol. 25, pp. 1-65.

Sinclair, N. (2012). “Metaethics, Teleosemantics and the Function of Moral Judgements”. Biology and Philosophy, Vol. 27(5), pp. 639–662.

Sotola, K. (2016). “Defining human values for value learners”, Proc. AAAI Workshop on Artificial Intelligence AI Ethics and Society, pp. 113-123, 2017.

Vitz, P.C. (1985) A Critical review of Kohlberg’s model of moral development. Unpublished report for the Department of Education, Washington, DC.

Wallach, W., Allen, C. and Smit, I. (2007). Machine morality: bottom-up and top-down approaches for modelling human moral faculties. AI & SOCIETY, 22(4), pp.565-582.

Yudkowsky, E. [2001a] “Creating friendly AI,” .

Yudkowsky, E. (2011). “Complex value systems are required to realize valuable futures”. In J. Schmidhuber, K. R. Thorisson, & M. Looks (Eds.), Proceedings of the 4th conference on artificial general intelligence, AGI 2011 (pp. 388–393). Heidelberg: Springer.

[1] It is impossible for a very

2 Hume seems to have noticed this primacy of emotion over reason:  ‘Reason is, and ought only to be the slave of the passions, and can never pretend to any other office than to serve and obey them’(REFERENCE)

[3] Culture is ‘the complex and elaborate system of meaning and behaviour that defines the way of life for a group or society’ (Jain, 1996).

With Our Resume Writing Help, You Will Land Your Dream Job
Resume Writing Service, Resume101
Trust your assignments to an essay writing service with the fastest delivery time and fully original content.
Essay Writing Service, EssayPro
Nowadays, the PaperHelp website is a place where you can easily find fast and effective solutions to virtually all academic needs
Universal Writing Solution, PaperHelp
Professional Custom
Professional Custom Essay Writing Services
In need of qualified essay help online or professional assistance with your research paper?
Browsing the web for a reliable custom writing service to give you a hand with college assignment?
Out of time and require quick and moreover effective support with your term paper or dissertation?