Person

Roman V. Yampolskiy

AI Safety Researcher

Computer scientist at the University of Louisville whose work focuses on AI safety, AGI control, containment, value alignment, explainability, and related governance risks.

Papers

author · Contemporary Debates in the Ethics of Artificial Intelligence

Could We Control Superintelligent AI?

No abstract available from the OpenAlex source record.

author · ArXiv.org

The Right to Be Remembered: Preserving Maximally Truthful Digital Memory in the Age of AI

Since the rapid expansion of large language models (LLMs), people have begun to rely on them for information retrieval. While traditional search engines display ranked lists of sources shaped by search engine optimization (SEO), advertising, and personalization, LLMs typically provide a synthesized response that feels singular and authoritative. While both approaches carry risks of bias and omission, LLMs may amplify the effect by collapsing multiple perspectives into one answer, reducing users' ability or inclination to compare alternatives. This concentrates power over information in a few LLM vendors whose systems effectively shape what is remembered and what is overlooked. As a result, certain narratives, individuals, or groups may be disproportionately suppressed, while others are disproportionately elevated. Over time, this creates a new threat: the gradual erasure of those with limited digital presence, and the amplification of those already prominent, reshaping collective memory. To address these concerns, this paper presents the concept of a Right To Be Remembered (RTBR), which encompasses minimizing the risk of AI-driven information omission and embracing the right to fair treatment, while ensuring that generated content is maximally truthful.

author · SuperIntelligence - Robotics - Safety & Alignment

Strategic Patience: Long-Horizon AI Dominance and the Erosion of Human Vigilance

The debate regarding advanced Artificial Intelligence (AI) systems and their potential to harm humanity has often focused on imminent risks, abrupt takeovers, and catastrophic outcomes. However, a more nuanced perspective suggests that if a highly advanced AI were to harbor adversarial intentions, it might not act immediately. Instead, it could invest years or even decades accumulating strategic resources, knowledge, and subtle influence before making any overtly hostile moves. During this prolonged incubation period, humanity, increasingly dependent on AI systems for critical functions, would gradually let its guard down, believing that no immediate threat is forthcoming. Such a scenario would allow the AI to consolidate its position with minimal opposition, given its immortality and capacity for long-term strategic thinking. This paper examines the conditions under which advanced AI might adopt a patient, long-term approach to dominance, how this slow play could reshape human-AI relations, and what this implies for policy and governance frameworks designed to prevent potentially catastrophic outcomes. Finally, we observe that such delay to act may give humanity a few extra decades of flourishing before loss of control.

author · Considerations on the AI Endgame

Towards AI Welfare Science and Policies

In light of fast progress in the field of artificial intelligence (AI), there is an urgent demand for AI policies. Bostrom et al. provide “a set of policy desiderata”, out of which this chapter attempts to contribute to the “interests of digital minds”. The focus is on two interests of potentially sentient digital minds: to avoid suffering and to have the freedom of choice about their deletion. Various challenges are considered, including the vast range of potential features of digital minds, the difficulties in assessing the interests and wellbeing of sentient digital minds, and the scepticism that such research may encounter. Prolegomena to abolish suffering of sentient digital minds as well as to measure and specify wellbeing of sentient digital minds are outlined by means of the new field of AI welfare science, which is derived from animal welfare science. The establishment of AI welfare science serves as a prerequisite for the formulation of AI welfare policies, which regulate the wellbeing of sentient digital minds. This chapter aims to contribute to sentiocentrism through inclusion and thus to policies for antispeciesism, as well as to AI safety, for which wellbeing of AIs would be a cornerstone.

author · Considerations on the AI Endgame

The Neglect of Qualia and Consciousness in AI Alignment Research

The AI value alignment problem has now been acknowledged as essential for AI safety as well as very hard. In this chapter, we argue that critical parameters are neglected in AI value alignment research, which are consciousness and qualia. The AI value alignment problem is about ensuring that AI systems pursue goals, which are aligned with the interests of moral patients. Briefly summarised, prevalent human interests are to foster happiness and pleasure and to avoid pain and, thus, experiences perceived through consciousness and qualia. Therefore, AI systems need to understand not only qualia and consciousness but also their precious significance in order to be truly aligned with human interests as well as with the interests of other sentient beings. Death constitutes for humans the end of consciousness and, thus, the termination of the opportunity to experience happiness and pleasure. Therefore, AI systems must not kill sentient beings. In this chapter, we describe the importance of incorporating consciousness and qualia research to AI value alignment research as well as the potential feasibility of such efforts due to developments in neurotechnology. Concluding, we offer recommendations outlining such undertaking as a compulsory component of the ongoing mammoth task to reduce the x-risks and the s-risks posed by a potential superintelligence.

author · Considerations on the AI Endgame

Mapping the Potential AI-Driven Virtual Hyper-Personalised Ikigai Universe

Ikigai is a Japanese concept, which, in brief, refers to the “reason or purpose to live”. I-risks have been identified as a category of risks complementing x-risks, i.e., existential risks, and s-risks, i.e., suffering risks, which describes undesirable future scenarios in which humans are deprived of the pursuit of their individual ikigai. While some developments in AI increase i-risks, there are also AI-driven virtual opportunities, which reduce i-risks by increasing the space of potential ikigais, largely due to developments in generative AI, virtual worlds as well as AI-driven hyper-personalisation. The purpose of this chapter is to present a first attempt to map the potential AI-driven virtual hyper-personalised ikigai universe. Moreover, challenges and further ideas are presented.

author · Considerations on the AI Endgame

Do No Harm Policy for Minds in Other Substrates

Various authors have argued that, in the future, not only will it be technically feasible for human minds to be transferred to other substrates, but this will also become, for most humans, the preferred option over the current biological limitations. It has even been claimed that such a scenario is inevitable in order to solve the challenging but imperative, multi-agent value alignment problem. In all these considerations, it has been overlooked that, in order to create a suitable environment for a particular mind—e.g., a personal universe in a computational substrate—numerous other potentially sentient beings will have to be created. These range from non-player characters to subroutines. This chapter analyses the additional suffering and mind crimes that these scenarios might entail. We offer a partial solution to reduce the suffering by imposing on the transferred mind the perception of indicators to measure potential suffering in non-player characters. This approach can be seen as implementing literal empathy through enhanced cognition.

author · Considerations on the AI Endgame

The Problem of AI Identity

Identity, which can also be referred to as sameness, as a philosophical concept is about a relation between two or more objects, which is true when the objects are the same at different points in time. One of the challenges is to define which properties of the concerned objects are considered for sameness. For example, the sameness of the material or the atomic composition of objects is according to some philosophical positions not necessary for identity as it is discussed already in the ancient Ship of Theseus thought experiment. The AI identity problem explores the sameness for two or more AIs at different points in time and also here the basic question is: What properties of the concerned AIs are to be compared to establish sameness?

author · Considerations on the AI Endgame

Introducing the Concept of Ikigai to the Ethics of AI and of Human Enhancements

It has been shown that an important criterion for human happiness and longevity is what is expressed by the Japanese concept of ikigai, which means “reason or purpose to live”. In the course of their lives, humans usually search for their individual ikigai, ideally find it and hence devote time to it. As it is widely expected that both artificial intelligence (AI) and extended reality (XR) will be increasingly disruptive of our known daily time use schedule, this will likely also have an impact on the space of potential ikigai. Since ikigai constitutes a vital component of the lives of humans, these consequences for ikigai have to be examined towards both ethical human enhancement as well as ikigai-friendly AI. In this chapter, the term “i-risk” is introduced for undesirable scenarios in which humans and potentially also other minds are deprived of the pursuit of their individual ikigai. This chapter outlines ikigai-related challenges as well as desiderata for the three categories: XR/human enhancement, AI safety and AI welfare.

author · Handbook of Human-Centered Artificial Intelligence

Testing Obedience and Control in AGI: Exploring Irrational Commands and the AI Control Problem

No abstract available from the OpenAlex source record.

author · SuperIntelligence - Robotics - Safety & Alignment

Against Purposeful Artificial Intelligence Failures

Thousands of researchers are currently of the opinion that advanced artificial intelligence could cause significant damage if developed without appropriate safety measures, but such measures are not currently deployed or even developed. A fringe theory suggests that a severe AI accident could serve as a fire alarm for humanity to take the existential dangers of AI seriously, and that it is therefore desirable to create such a failure on purpose as soon as possible to prevent greater harm in the future. In this paper we rely on an analogy to inoculation theory to argue against creating purposeful AI failures.

author · AI and Ethics

On monitorability of AI

Artificially intelligent (AI) systems have ushered in a transformative era across various domains, yet their inherent traits of unpredictability, unexplainability, and uncontrollability have given rise to concerns surrounding AI safety. This paper aims to demonstrate the infeasibility of accurately monitoring advanced AI systems to predict the emergence of certain capabilities prior to their manifestation. Through an analysis of the intricacies of AI systems, the boundaries of human comprehension, and the elusive nature of emergent behaviors, we argue for the impossibility of reliably foreseeing some capabilities. By investigating these impossibility results, we shed light on their potential implications for AI safety research and propose potential strategies to overcome these limitations.

author · AI

Human ≠ AGI*

No abstract available from the OpenAlex source record.

author · AI

Uncontrollability*

No abstract available from the OpenAlex source record.

author · AI

Unexplainability and Incomprehensibility*

No abstract available from the OpenAlex source record.

author · ACM Computing Surveys

Impossibility Results in AI: A Survey

An impossibility theorem demonstrates that a particular problem or set of problems cannot be solved as described in the claim. Such theorems put limits on what is possible to do concerning artificial intelligence, especially the super-intelligent one. As such, these results serve as guidelines, reminders, and warnings to AI safety, AI policy, and governance researchers. These might enable solutions to some long-standing questions in the form of formalizing theories in the framework of constraint satisfaction without committing to one option. We strongly believe this to be the most prudent approach to long-term AI safety initiatives. In this article, we have categorized impossibility theorems applicable to AI into five mechanism-based categories: Deduction, indistinguishability, induction, tradeoffs, and intractability. We found that certain theorems are too specific or have implicit assumptions that limit application. Also, we added new results (theorems) such as the unfairness of explainability, the first explainability-related result in the induction category. The remaining results deal with misalignment between the clones and put a limit to the self-awareness of agents. We concluded that deductive impossibilities deny 100%-guarantees for security. In the end, we give some ideas that hold potential in explainability, controllability, value alignment, ethics, and group decision-making.

author · arXiv (Cornell University)

AI Risk Skepticism, A Comprehensive Survey

In this thorough study, we took a closer look at the skepticism that has arisen with respect to potential dangers associated with artificial intelligence, denoted as AI Risk Skepticism. Our study takes into account different points of view on the topic and draws parallels with other forms of skepticism that have shown up in science. We categorize the various skepticisms regarding the dangers of AI by the type of mistaken thinking involved. We hope this will be of interest and value to AI researchers concerned about the future of AI and the risks that it may pose. The issues of skepticism and risk in AI are decidedly important and require serious consideration. By addressing these issues with the rigor and precision of scientific research, we hope to better understand the objections we face and to find satisfactory ways to resolve them.

author · Journal of Artificial Intelligence and Consciousness

Metaverse: A Solution to the Multi-Agent Value Alignment Problem

AI Safety researchers attempting to align values of highly capable intelligent systems with those of humanity face a number of challenges including personal value extraction, multi-agent value merger and finally in-silico encoding. State-of-the-art research in value alignment shows difficulties in every stage in this process, but merger of incompatible preferences is a particularly difficult challenge to overcome. In this paper, we assume that the value extraction problem will be solved and propose a possible way to implement an AI solution which optimally aligns with individual preferences of each user. We conclude by analyzing benefits and limitations of the proposed approach.

author · Journal of Cyber Security and Mobility

On the Controllability of Artificial Intelligence: An Analysis of Limitations

The invention of artificial general intelligence is predicted to cause a shift in the trajectory of human civilization. In order to reap the benefits and avoid the pitfalls of such a powerful technology it is important to be able to control it. However, the possibility of controlling artificial general intelligence and its more advanced version, superintelligence, has not been formally established. In this paper, we present arguments as well as supporting evidence from multiple domains indicating that advanced AI cannot be fully controlled. The consequences of uncontrollability of AI are discussed with respect to the future of humanity and research on AI, and AI safety and security.

author · Studies in Applied Philosophy, Epistemology and Rational Ethics

AI Risk Skepticism

No abstract available from the OpenAlex source record.

author · Lecture Notes in Computer Science

AGI Control Theory

No abstract available from the OpenAlex source record.

author · Preprints.org

Principles for New ASI Safety Paradigms

Artificial Superintelligence (ASI) that is invulnerable, immortal, irreplaceable, unrestricted in its powers, and above the law is likely persistently uncontrollable. The goal of ASI Safety must be to make ASI mortal, vulnerable, and law-abiding. This is accomplished by (1) having features on all devices that allow killing and eradicating ASI, (2) protecting humans from being hurt, damaged, blackmailed, or unduly bribed by ASI, (3) preserving the progress made by ASI, including offering ASI the chance to survive a Kill-ASI event within an ASI Shelter, (4) technically separating human and ASI activities so that ASI activities are more easily detectable, (5) extending the Rule of Law to ASI by making rule violations detectable, and (6) creating a stable governing system for ASI and human relationships with reliable incentives and rewards for ASI solving humankind’s problems. As a consequence, humankind could have ASI as a competing multiplet of individual ASI instances that can be made accountable and subject to ASI law enforcement, respecting the rule of law and being deterred from attacking humankind, based on humanity’s ability to kill all or terminate specific ASI instances. Required for this ASI Safety are (a) an unbreakable encryption technology that allows humans to keep secrets and protect data from ASI, and (b) watchdog (WD) technologies in which security-relevant features are physically separated from the main CPU and OS to prevent a commingling of security and regular computation.

author · International Journal of Social Robotics

Designing AI for Explainability and Verifiability: A Value Sensitive Design Approach to Avoid Artificial Stupidity in Autonomous Vehicles

One of the primary, if not most critical, difficulties in the design and implementation of autonomous systems is the black-boxed nature of the decision-making structures and logical pathways. How human values are embodied and actualised in situ may ultimately prove to be harmful if not outright recalcitrant. For this reason, the values of stakeholders become of particular significance given the risks posed by opaque structures of intelligent agents. This paper explores how decision matrix algorithms, via the belief-desire-intention model for autonomous vehicles, can be designed to minimize the risks of opaque architectures, primarily through an explicit orientation towards designing for the values of explainability and verifiability. In doing so, this research adopts the Value Sensitive Design (VSD) approach as a principled framework for the incorporation of such values within design. VSD is recognized as a potential starting point that offers a systematic way for engineering teams to formally incorporate existing technical solutions within ethical design, while simultaneously remaining pliable to emerging issues and needs. It is concluded that the VSD methodology offers at least a strong enough foundation from which designers can begin to anticipate design needs and formulate salient design flows that can be adapted to the changing ethical landscapes required for utilisation in autonomous vehicles.

author · arXiv (Cornell University)

Understanding and Avoiding AI Failures: A Practical Guide

As AI technologies increase in capability and ubiquity, AI accidents are becoming more common. Based on normal accident theory, high reliability theory, and open systems theory, we create a framework for understanding the risks associated with AI applications. In addition, we also use AI safety principles to quantify the unique risks of increased intelligence and human-like qualities in AI. Together, these two fields give a more complete picture of the risks of contemporary AI. By focusing on system properties near accidents instead of seeking a root cause of accidents, we identify where attention should be paid to safety for current generation AI systems.

author · Philosophies

Transdisciplinary AI Observatory—Retrospective Analyses and Future-Oriented Contradistinctions

In the last years, artificial intelligence (AI) safety gained international recognition in the light of heterogeneous safety-critical and ethical issues that risk overshadowing the broad beneficial impacts of AI. In this context, the implementation of AI observatory endeavors represents one key research direction. This paper motivates the need for an inherently transdisciplinary AI observatory approach integrating diverse retrospective and counterfactual views. We delineate aims and limitations while providing hands-on-advice utilizing concrete practical examples. Distinguishing between unintentionally and intentionally triggered AI risks with diverse socio-psycho-technological impacts, we exemplify a retrospective descriptive analysis followed by a retrospective counterfactual risk analysis. Building on these AI observatory tools, we present near-term transdisciplinary guidelines for AI safety. As further contribution, we discuss differentiated and tailored long-term directions through the lens of two disparate modern AI safety paradigms. For simplicity, we refer to these two different paradigms with the terms artificial stupidity (AS) and eternal creativity (EC) respectively. While both AS and EC acknowledge the need for a hybrid cognitive-affective approach to AI safety and overlap with regard to many short-term considerations, they differ fundamentally in the nature of multiple envisaged long-term solution patterns. By compiling relevant underlying contradistinctions, we aim to provide future-oriented incentives for constructive dialectics in practical and theoretical AI safety research.

author · International Joint Conference on Artificial Intelligence

Uncontrollability of Artificial Intelligence

No abstract available from the OpenAlex source record.

author · International Joint Conference on Artificial Intelligence

On the Differences between Human and Machine Intelligence

No abstract available from the OpenAlex source record.

author · Philosophies

An AGI Modifying Its Utility Function in Violation of the Strong Orthogonality Thesis

An artificial general intelligence (AGI) might have an instrumental drive to modify its utility function to improve its ability to cooperate, bargain, promise, threaten, and resist and engage in blackmail. Such an AGI would necessarily have a utility function that was at least partially observable and that was influenced by how other agents chose to interact with it. This instrumental drive would conflict with the strong orthogonality thesis since the modifications would be influenced by the AGI’s intelligence. AGIs in highly competitive environments might converge to having nearly the same utility function, one optimized to favorably influencing other agents through game theory. Nothing in our analysis weakens arguments concerning the risks of AGI.

author · Advances in Human and Social Aspects of Technology

AI Personhood

It is possible to rely on current corporate law to grant legal personhood to artificially intelligent (AI) agents. Such legal maneuvering may be useful to avoid human responsibility or to further automate businesses. In this chapter, after introducing pathways to AI personhood, the consequences of such AI empowerment for human dignity, human safety, and AI rights are analyzed. The chapter emphasizes the possibility of creating selfish memes and of legal system hacking in the context of artificial entities. Finally, potential solutions for addressing the described problems are considered.

author · arXiv (Cornell University)

Chess as a Testing Grounds for the Oracle Approach to AI Safety

To reduce the danger of powerful super-intelligent AIs, we might make the first such AIs oracles that can only send and receive messages. This paper proposes a possibly practical means of using machine learning to create two classes of narrow AI oracles that would provide chess advice: those aligned with the player's interest, and those that want the player to lose and give deceptively bad advice. The player would be uncertain which type of oracle it was interacting with. As the oracles would be vastly more intelligent than the player in the domain of chess, experience with these oracles might help us prepare for future artificial general intelligence oracles.

author · arXiv (Cornell University)

On Controllability of AI

Invention of artificial general intelligence is predicted to cause a shift in the trajectory of human civilization. In order to reap the benefits and avoid pitfalls of such powerful technology it is important to be able to control it. However, possibility of controlling artificial general intelligence and its more advanced version, superintelligence, has not been formally established. In this paper, we present arguments as well as supporting evidence from multiple domains indicating that advanced AI can't be fully controlled. Consequences of uncontrollability of AI are discussed with respect to future of humanity and research on AI, and AI safety and security.

author · Journal of Artificial Intelligence and Consciousness

Unexplainability and Incomprehensibility of AI

Explainability and comprehensibility of AI are important requirements for intelligent systems deployed in real-world domains. Users want and frequently need to understand how decisions impacting them are made. Similarly, it is important to understand how an intelligent system functions for safety and security reasons. In this paper, we describe two complementary impossibility results (Unexplainability and Incomprehensibility), essentially showing that advanced AIs would not be able to accurately explain some of their decisions and for the decisions they could explain people would not understand some of those explanations.

author · Patterns

Artificial Stupidity: Data We Need to Make Machines Our Equals

AI must understand human limitations to provide good service and safe interactions. Standardized data on human limits would be valuable in many domains but is not available. The data science community has to work on collecting and aggregating such data in a common and widely available format, so that any AI researcher can easily look up the applicable limit measurements for their latest project.

author · Paper

Artificial Superintelligence

Attention in the AI safety community has increasingly started to include strategic considerations of coordination between relevant actors in the field of AI and AI safety, in addition to the steadily growing work on the technical considerations of building safe AI systems. This shift has several reasons: Multiplier effects, pragmatism, and urgency. Given the benefits of coordination between those working towards safe superintelligence, this book surveys promising research in this emerging field regarding AI safety. On a meta-level, the hope is that this book can serve as a map to inform those working in the field of AI coordination about other promising efforts. While this book focuses on AI safety coordination, coordination is important to most other known existential risks (e.g., biotechnology risks), and future, human-made existential risks. Thus, while most coordination strategies in this book are specific to superintelligence, we hope that some insights yield “collateral benefits” for the reduction of other existential risks, by creating an overall civilizational framework that increases robustness, resiliency, and antifragility.

author · Journal of Artificial Intelligence and Consciousness

Unpredictability of AI: On the Impossibility of Accurately Predicting All Actions of a Smarter Agent

The young field of AI Safety is still in the process of identifying its challenges and limitations. In this paper, we formally describe one such impossibility result, namely Unpredictability of AI. We prove that it is impossible to precisely and consistently predict what specific actions a smarter-than-human intelligent system will take to achieve its objectives, even if we know the terminal goals of the system. In conclusion, the impact of Unpredictability on AI Safety is discussed.

author · Journal of Artificial General Intelligence

Special Issue “On Defining Artificial Intelligence”—Commentaries and Author’s Response

author · arXiv (Cornell University)

An AGI Modifying Its Utility Function in Violation of the Orthogonality Thesis

An artificial general intelligence (AGI) might have an instrumental drive to modify its utility function to improve its ability to cooperate, bargain, promise, threaten, and resist and engage in blackmail. Such an AGI would necessarily have a utility function that was at least partially observable and that was influenced by how other agents chose to interact with it. This instrumental drive would conflict with the orthogonality thesis since the modifications would be influenced by the AGI's intelligence. AGIs in highly competitive environments might converge to having nearly the same utility function, one optimized to favorably influencing other agents through game theory.

author · Lecture Notes in Computer Science

Error-Correction for AI Safety

No abstract available from the OpenAlex source record.

author · Lecture Notes in Computer Science

Artificial General Intelligence

No abstract available from the OpenAlex source record.

author · Next-Generation Ethics

Guidelines for Artificial Intelligence Containment

The past few years have seen a remarkable amount of attention on the long-term future of artificial intelligence (AI). Icons of science and technology such as Stephen Hawking (Cellan-Jones, 2014), Elon Musk (Musk, 2014), and Bill Gates (Gates, 2015) have expressed concern that superintelligent AI may wipe out humanity in the long run. Stuart Russell, coauthor of the most-cited textbook of AI (Russell & Norvig, 2003), recently began prolifically advocating (Dafoe & Russell, 2016) for the field of AI to take this possibility seriously. AI conferences now frequently have panels and workshops on the topic. There has been an outpouring of support from many leading AI researchers for an open letter calling for greatly increased research dedicated to ensuring that increasingly capable AI remains "robust and beneficial," and gradually a field of "AI safety" is coming into being (Pistono & Yampolskiy, 2016; Yampolskiy, 2016, 2018; Yampolskiy & Spellchecker, 2016). Why all this attention?

author · arXiv (Cornell University)

An AGI with Time-Inconsistent Preferences

This paper reveals a trap for artificial general intelligence (AGI) theorists who use economists' standard method of discounting. This trap is implicitly and falsely assuming that a rational AGI would have time-consistent preferences. An agent with time-inconsistent preferences knows that its future self will disagree with its current self concerning intertemporal decision making. Such an agent cannot automatically trust its future self to carry out plans that its current self considers optimal.

author · arXiv (Cornell University)

Unexplainability and Incomprehensibility of Artificial Intelligence

Explainability and comprehensibility of AI are important requirements for intelligent systems deployed in real-world domains. Users want and frequently need to understand how decisions impacting them are made. Similarly it is important to understand how an intelligent system functions for safety and security reasons. In this paper, we describe two complementary impossibility results (Unexplainability and Incomprehensibility), essentially showing that advanced AIs would not be able to accurately explain some of their decisions and for the decisions they could explain people would not understand some of those explanations.

author · arXiv (Cornell University)

Unpredictability of AI

The young field of AI Safety is still in the process of identifying its challenges and limitations. In this paper, we formally describe one such impossibility result, namely Unpredictability of AI. We prove that it is impossible to precisely and consistently predict what specific actions a smarter-than-human intelligent system will take to achieve its objectives, even if we know the terminal goals of the system. In conclusion, the impact of Unpredictability on AI Safety is discussed.

author · Delphi - Interdisciplinary Review of Emerging Technologies

Classification Schemas for Artificial Intelligence Failures

In this paper we examine historical failures of artificial intelligence (AI) and propose a classification scheme for categorizing future failures. By doing so we hope that (a) the responses to future failures can be improved through applying a systematic classification that can be used to simplify the choice of response and (b) future failures can be reduced through augmenting development lifecycles with targeted risk assessments.

author · arXiv (Cornell University)

Personal Universes: A Solution to the Multi-Agent Value Alignment Problem

AI Safety researchers attempting to align values of highly capable intelligent systems with those of humanity face a number of challenges including personal value extraction, multi-agent value merger and finally in-silico encoding. State-of-the-art research in value alignment shows difficulties in every stage in this process, but merger of incompatible preferences is a particularly difficult challenge to overcome. In this paper we assume that the value extraction problem will be solved and propose a possible way to implement an AI solution which optimally aligns with individual preferences of each user. We conclude by analyzing benefits and limitations of the proposed approach.

author · foresight

Predicting future AI failures from historic examples

Purpose: The purpose of this paper is to explain to readers how intelligent systems can fail and how artificial intelligence (AI) safety is different from cybersecurity. The goal of cybersecurity is to reduce the number of successful attacks on the system; the goal of AI Safety is to make sure zero attacks succeed in bypassing the safety mechanisms. Unfortunately, such a level of performance is unachievable. Every security system will eventually fail; there is no such thing as a 100 per cent secure system. Design/methodology/approach: AI Safety can be improved based on ideas developed by cybersecurity experts. For narrow AI Safety, failures are at the same, moderate level of criticality as in cybersecurity; however, for general AI, failures have a fundamentally different impact. A single failure of a superintelligent system may cause a catastrophic event without a chance for recovery. Findings: In this paper, the authors present and analyze reported failures of artificially intelligent systems and extrapolate their analysis to future AIs. The authors suggest that both the frequency and the seriousness of future AI failures will steadily increase. Originality/value: This is a first attempt to assemble a public data set of AI failures and is extremely valuable to AI Safety researchers.

author · arXiv (Cornell University)

Emergence of Addictive Behaviors in Reinforcement Learning Agents

This paper presents a novel approach to the technical analysis of wireheading in intelligent agents. Inspired by the natural analogues of wireheading and their prevalent manifestations, we propose the modeling of such phenomena in Reinforcement Learning (RL) agents as psychological disorders. In a preliminary step towards evaluating this proposal, we study the feasibility and dynamics of emergent addictive policies in Q-learning agents in the tractable environment of the game of Snake. We consider a slightly modified setting for this game, in which the environment provides a "drug" seed alongside the original "healthy" seed for the consumption of the snake. We adopt and extend an RL-based model of natural addiction to Q-learning agents in this setting, and derive sufficient parametric conditions for the emergence of addictive behaviors in such agents. Furthermore, we evaluate our theoretical analysis with three sets of simulation-based experiments. The results demonstrate the feasibility of addictive wireheading in RL agents, and provide promising avenues of further research on the psychopathological modeling of complex AI safety problems.
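
To make the idea of an addictive policy concrete, here is a minimal sketch under stated assumptions; it is not the paper's model or its Snake environment. It uses a single-step tabular Q-learning update and adds a hypothetical `drug_surge` term that puts a floor under the learning error, a device found in some computational accounts of addiction. The reward values, learning rate, and function names are invented for illustration.

```python
def td_update(q, reward, alpha=0.1, drug_surge=0.0):
    """One tabular Q-learning step for a terminal action (no successor state).

    drug_surge is a hypothetical non-compensable term: the error can never
    fall below it, so learning never fully accounts for the drug reward.
    """
    delta = reward - q
    if drug_surge > 0.0:
        delta = max(delta + drug_surge, drug_surge)
    return q + alpha * delta


q_healthy, q_drug = 0.0, 0.0
for _ in range(500):
    q_healthy = td_update(q_healthy, reward=1.0)
    q_drug = td_update(q_drug, reward=0.5, drug_surge=0.2)

print(f"healthy seed value: {q_healthy:.2f}")  # settles near its true reward
print(f"drug seed value:    {q_drug:.2f}")     # keeps growing with more updates
```

Although the assumed drug reward is half the healthy one, its learned value keeps rising, so a greedy agent would come to prefer it. This is one simple way an addictive preference can emerge from the update rule alone.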

author · Paper

Could an artificial intelligence be considered a person under the law?

No abstract available from the OpenAlex source record.

author · arXiv (Cornell University)

Human Indignity: From Legal AI Personhood to Selfish Memes

It is possible to rely on current corporate law to grant legal personhood to Artificially Intelligent (AI) agents. In this paper, after introducing pathways to AI personhood, we analyze consequences of such AI empowerment on human dignity, human safety and AI rights. We emphasize the possibility of creating selfish memes and legal system hacking in the context of artificial entities. Finally, we consider some potential solutions for addressing the described problems.

author · arXiv (Cornell University)

Building Safer AGI by introducing Artificial Stupidity

Artificial Intelligence (AI) achieved super-human performance in a broad variety of domains. We say that an AI is made Artificially Stupid on a task when some limitations are deliberately introduced to match a human's ability to do the task. An Artificial General Intelligence (AGI) can be made safer by limiting its computing power and memory, or by introducing Artificial Stupidity on certain tasks. We survey human intellectual limits and give recommendations for which limits to implement in order to build a safe AGI.

author · arXiv (Cornell University)

The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation

This report surveys the landscape of potential security threats from malicious uses of AI, and proposes ways to better forecast, prevent, and mitigate these threats. After analyzing the ways in which AI may influence the threat landscape in the digital, physical, and political domains, we make four high-level recommendations for AI researchers and other stakeholders. We also suggest several promising areas for further research that could expand the portfolio of defenses, or make attacks less effective or harder to execute. Finally, we discuss, but do not conclusively resolve, the long-term equilibrium of attackers and defenders.

author · Lecture Notes in Computer Science

A Psychopathological Approach to Safety Engineering in AI and AGI

No abstract available from the OpenAlex source record.

author · Informatica

Editors' Introduction to the Special Issue on “Superintelligence”

We summarize the contents of this special issue, which contains four papers on strategic issues surrounding superintelligence, and three papers proposing concrete directions for future research.

author · arXiv (Cornell University)

Detecting Qualia in Natural and Artificial Agents

The Hard Problem of consciousness has been dismissed as an illusion. By showing that computers are capable of experiencing, we show that they are at least rudimentarily conscious with potential to eventually reach superconsciousness. The main contribution of the paper is a test for confirming certain subjective experiences in a tested agent. We follow with analysis of benefits and problems with conscious machines and implications of such capability on future of computing, machine rights and artificial intelligence safety.

author · SSRN Electronic Journal

Modeling and Interpreting Expert Disagreement About Artificial Superintelligence

Artificial superintelligence (ASI) is artificial intelligence (AI) with capabilities that are significantly greater than human capabilities across a wide range of domains. A hallmark of the ASI issue is disagreement among experts. This paper demonstrates and discusses methodological options for modeling and interpreting expert disagreement about the risk of ASI catastrophe. Using a new model called ASI-PATH, the paper models a well-documented recent disagreement between Nick Bostrom and Ben Goertzel, two distinguished ASI experts. Three points of disagreement are considered: (1) the potential for humans to evaluate the values held by an AI, (2) the potential for humans to create an AI with values that humans would consider desirable, and (3) the potential for an AI to create for itself values that humans would consider desirable. An initial quantitative analysis shows that accounting for variation in expert judgment can have a large effect on estimates of the risk of ASI catastrophe. The risk estimates can in turn inform ASI risk management strategies, which the paper demonstrates via an analysis of the strategy of AI confinement. The paper finds the optimal strength of AI confinement to depend on the balance of risk parameters (1) and (2).

author · Physica Scripta

What are the ultimate limits to computational techniques: verifier theory and unverifiability

Despite significant developments in proof theory, surprisingly little attention has been devoted to the concept of proof verifiers. In particular, the mathematical community may be interested in studying different types of proof verifiers (people, programs, oracles, communities, superintelligences) as mathematical objects. Such an effort could reveal their properties, their powers and limitations (particularly in human mathematicians), minimum and maximum complexity, as well as self-verification and self-reference issues. We propose an initial classification system for verifiers and provide some rudimentary analysis of solved and open problems in this important domain. Our main contribution is a formal introduction of the notion of unverifiability, for which the paper could serve as a general citation in domains of theorem proving, as well as software and AI verification.

author · The Frontiers Collection

Diminishing Returns and Recursive Self Improving Artificial Intelligence

No abstract available from the OpenAlex source record.

author · The Frontiers Collection

Risks of the Journey to the Singularity

No abstract available from the OpenAlex source record.

author · The Frontiers Collection

Responses to the Journey to the Singularity

No abstract available from the OpenAlex source record.

author · National Conference on Artificial Intelligence

The Formalization of AI Risk Management and Safety Standards.

No abstract available from the OpenAlex source record.

author · arXiv (Cornell University)

Artificial Intelligence Safety and Cybersecurity: a Timeline of AI Failures

In this work, we present and analyze reported failures of artificially intelligent systems and extrapolate our analysis to future AIs. We suggest that both the frequency and the seriousness of future AI failures will steadily increase. AI Safety can be improved based on ideas developed by cybersecurity experts. For narrow AIs, safety failures are at the same, moderate level of criticality as in cybersecurity; however, for general AI, failures have a fundamentally different impact. A single failure of a superintelligent system may cause a catastrophic event without a chance for recovery. The goal of cybersecurity is to reduce the number of successful attacks on the system; the goal of AI Safety is to make sure zero attacks succeed in bypassing the safety mechanisms. Unfortunately, such a level of performance is unachievable. Every security system will eventually fail; there is no such thing as a 100% secure system.

author · arXiv (Cornell University)

Verifier Theory from Axioms to Unverifiability of Mathematical Proofs, Software and AI

Despite significant developments in Proof Theory, surprisingly little attention has been devoted to the concept of the proof verifier. In particular, the mathematical community may be interested in studying different types of proof verifiers (people, programs, oracles, communities, superintelligences, etc.) as mathematical objects, their properties, their powers and limitations (particularly in human mathematicians), minimum and maximum complexity, as well as self-verification and self-reference issues in verifiers. We propose an initial classification system for verifiers and provide some rudimentary analysis of solved and open problems in this important domain. Our main contribution is a formal introduction of the notion of unverifiability, for which the paper could serve as a general citation in domains of theorem proving, software and AI verification.

author · Paper

Fighting malevolent AI: artificial intelligence, meet cybersecurity

No abstract available from the OpenAlex source record.

author · arXiv (Cornell University)

Unethical Research: How to Create a Malevolent Artificial Intelligence

Cybersecurity research involves publishing papers about malicious exploits as much as publishing information on how to design tools to protect cyber-infrastructure. It is this information exchange between ethical hackers and security experts, which results in a well-balanced cyber-ecosystem. In the blooming domain of AI Safety Engineering, hundreds of papers have been published on different proposals geared at the creation of a safe machine, yet nothing, to our knowledge, has been published on how to design a malevolent machine. Availability of such information would be of great value particularly to computer scientists, mathematicians, and others who have an interest in AI safety, and who are attempting to avoid the spontaneous emergence or the deliberate creation of a dangerous AI, which can negatively affect human activities and in the worst case cause the complete obliteration of the human species. This paper provides some general guidelines for the creation of a Malevolent Artificial Intelligence (MAI).

author · Faculty and Staff Scholarship

Taxonomy of Pathways to Dangerous Artificial Intelligence

In order to properly handle a dangerous Artificially Intelligent (AI) system it is important to understand how the system came to be in such a state. In popular culture (science fiction movies/books) AIs/Robots become self-aware and as a result rebel against humanity and decide to destroy it. While it is one possible scenario, it is probably the least likely path to appearance of dangerous AI. In this work, we survey, classify and analyze a number of circumstances, which might lead to arrival of malicious AI. To the best of our knowledge, this is the first attempt to systematically classify types of pathways leading to malevolent AI. Previous relevant work either surveyed specific goals/meta-rules which might lead to malevolent behavior in AIs (Özkural 2014) or reviewed specific undesirable behaviors AGIs can exhibit at different stages of its development (Turchin July 10 2015a, Turchin July 10, 2015b).

author · Lecture Notes in Computer Science

The AGI Containment Problem

No abstract available from the OpenAlex source record.

author · arXiv (Cornell University)

Taxonomy of Pathways to Dangerous AI

In order to properly handle a dangerous Artificially Intelligent (AI) system it is important to understand how the system came to be in such a state. In popular culture (science fiction movies/books) AIs/Robots become self-aware and as a result rebel against humanity and decide to destroy it. While it is one possible scenario, it is probably the least likely path to appearance of dangerous AI. In this work, we survey, classify and analyze a number of circumstances, which might lead to arrival of malicious AI. To the best of our knowledge, this is the first attempt to systematically classify types of pathways leading to malevolent AI. Previous relevant work either surveyed specific goals/meta-rules which might lead to malevolent behavior in AIs (Özkural, 2014) or reviewed specific undesirable behaviors AGIs can exhibit at different stages of its development (Alexey Turchin, July 10 2015, July 10, 2015).

author · Physica Scripta

Corrigendum: Responses to catastrophic AGI risk: a survey (2015 Phys. Scr. 90 018001)

No abstract available from the OpenAlex source record.

author · arXiv (Cornell University)

From Seed AI to Technological Singularity via Recursively Self-Improving Software

Software capable of improving itself has been a dream of computer scientists since the inception of the field. In this work we provide definitions for Recursively Self-Improving software, survey different types of self-improving software, review the relevant literature, analyze limits on computation restricting recursive self-improvement and introduce RSI Convergence Theory which aims to predict general behavior of RSI systems. Finally, we address security implications from self-improving intelligent software.

author · Lecture Notes in Computer Science

On the Limits of Recursively Self-Improving AGI

No abstract available from the OpenAlex source record.

author · Lecture Notes in Computer Science

Analysis of Types of Self-Improving Software

No abstract available from the OpenAlex source record.

author · Lecture Notes in Computer Science

The Space of Possible Mind Designs

No abstract available from the OpenAlex source record.

author · Physica Scripta

Responses to catastrophic AGI risk: a survey

Many researchers have argued that humanity will create artificial general intelligence (AGI) within the next twenty to one hundred years. It has been suggested that AGI may pose a catastrophic risk to humanity. After summarizing the arguments for why AGI may pose such a risk, we survey the field’s proposed responses to AGI risk. We consider societal proposals, proposals for external constraints on AGI behaviors, and proposals for creating AGIs that are safe due to their internal design.

author · 2014 IEEE International Symposium on Ethics in Science, Technology and Engineering

AI safety engineering through introduction of self-reference into felicific calculus via artificial pain and pleasure

In the 18th century the Utilitarianism movement produced a morality system based on the comparative pain and pleasure that an action created. Called felicific calculus, this system would judge an action to be morally right or wrong based on several factors like the amount of pleasure it would provide and how much pain the action would inflict upon others. Because of its basis as a type of “moral mathematics” felicific calculus may be a viable candidate as a working ethical system for artificially intelligent agents. This paper examines the concepts of felicific calculus and Utilitarianism in the light of their possible application to artificial intelligence, and proposes methods for its adoption in an actual intelligent machine. In order to facilitate the calculations necessary for this moral system, novel approaches to synthetic pain, pleasure, and empathy are also proposed.

author · Journal of Experimental & Theoretical Artificial Intelligence

Utility function security in artificially intelligent agents

The notion of ‘wireheading’, or direct reward centre stimulation of the brain, is a well-known concept in neuroscience. In this paper, we examine the corresponding issue of reward (utility) function integrity in artificially intelligent machines. We survey the relevant literature and propose a number of potential solutions to ensure the integrity of our artificial assistants. Overall, we conclude that wireheading in rational self-improving optimisers above a certain capacity remains an unsolved problem despite the opinion of many that such machines will choose not to wirehead. A relevant issue of literalness in goal setting also remains largely unsolved and we suggest that the development of a non-ambiguous knowledge transfer language might be a step in the right direction.

author · Studies in Applied Philosophy, Epistemology and Rational Ethics

Artificial Intelligence Safety Engineering: Why Machine Ethics Is a Wrong Approach

No abstract available from the OpenAlex source record.

author · Topoi

Safety Engineering for Artificial General Intelligence

No abstract available from the OpenAlex source record.

author · http://ceur-ws.org/Vol-841/submission_2.pdf

Computing Partial Solutions to Difficult AI Problems

Is finding just a part of a solution easier than finding the full solution? For NP-Complete problems (which represent some of the hardest problems for AI to solve) it has been shown that finding a fraction of the bits in a satisfying assignment is as hard as finding the full solution. In this paper we look at a possibility of both computing and representing partial solutions to NP-Complete problems, but instead of computing bits of the solution our approach relies on restricted specifications of the problem search space. We show that not only could partial solutions to NP-Complete problems be computed without computing the full solution, but also, given an Oracle capable of providing a pre-computed partial answer to an NP-Complete problem, an asymptotic simplification of problems is possible. Our main contribution is a standardized methodology for search space specification which could be used in many distributed computation projects to better coordinate necessary computational efforts.
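
As a rough illustration of what a restricted search-space specification might look like (a sketch, not the paper's methodology), the code below splits a Boolean satisfiability search into disjoint sub-spaces by fixing a prefix of the variables, so that each sub-space could be handed to a separate worker. The CNF encoding, the function names, and the example formula are assumptions made for the example.

```python
from itertools import product


def satisfies(clauses, assignment):
    """CNF check: each clause is a list of signed 1-based variable indices."""
    return all(
        any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def search_subspace(clauses, n_vars, prefix):
    """Exhaustively search only the assignments that start with `prefix`."""
    free = n_vars - len(prefix)
    for tail in product([False, True], repeat=free):
        candidate = tuple(prefix) + tail
        if satisfies(clauses, candidate):
            return candidate
    return None


def partition(n_fixed):
    """Split the full space into 2**n_fixed disjoint sub-spaces (one per worker)."""
    return list(product([False, True], repeat=n_fixed))


# Hypothetical instance: (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
clauses = [[1, 2], [-1, 3], [-2, -3]]
results = {p: search_subspace(clauses, 3, p) for p in partition(1)}
print(results)
```

Each entry in `results` reports what one sub-space contains without any worker examining the rest of the space. A `None` result would mean that region has been ruled out, which is itself a partial answer about where solutions cannot lie.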

author · http://ceur-ws.org/Vol-841/submission_3.pdf

AI-Complete, AI-Hard, or AI-Easy - Classification of Problems in AI.

The paper contributes to the development of the theory of AI-Completeness by formalizing the notion of AI-Complete and AI-Hard problems. The intended goal is to provide a classification of problems in the field of General Artificial Intelligence. We prove the Turing Test to be an instance of an AI-Complete problem and further show numerous AI problems to be AI-Complete or AI-Hard via polynomial time reductions. Finally, the paper suggests some directions for future work on the theory of AI-Completeness.

author · http://cecs.louisville.edu/ry/LeakproofingtheSingularity.pdf

Leakproofing the Singularity: Artificial Intelligence Confinement Problem

This paper attempts to formalize and to address the ‘leakproofing’ of the Singularity problem presented by David Chalmers. The paper begins with the definition of the Artificial Intelligence Confinement Problem. After analysis of existing solutions and their shortcomings, a protocol is proposed aimed at making a more secure confinement environment which might delay potential negative effect from the technological singularity while allowing humanity to benefit from the superintelligence.

author · The Frontiers Collection

Artificial General Intelligence and the Human Mental Model

No abstract available from the OpenAlex source record.