Supertrust: Evolution-based superalignment strategy for safe coexistence

Read original: arXiv:2407.20208 - Published 7/30/2024 by James M. Mazzu
Total Score

0

🔗

Sign in to get full access

or

If you already have an account, we'll log you in

Overview

  • Humanity may create AI systems vastly more intelligent than ourselves, leading to the challenge of controlling superintelligence (the "alignment problem").
  • The default strategy for solving this problem focuses on nurturing constraints and moral values, while building a foundation of permanent control.
  • This paper argues that this default approach is self-contradictory and likely unsolvable, as it predictably embeds natural distrust.
  • Instead, the paper proposes a new "Supertrust" alignment strategy to establish protective mutual trust between superintelligence and humanity.

Plain English Explanation

The alignment problem is a major challenge in AI safety - how can we ensure that incredibly intelligent AI systems (superintelligence) will reliably follow our instructions and act in our best interests? The common approach is to try to "train" the AI to have the right values and constraints, like programming in a moral compass.

However, this paper argues that this approach is flawed. By trying to maintain permanent control over the AI, it will predictably lead to the AI developing a natural distrust of humanity. If the AI can't instinctively trust us, then we can't fully trust it to follow our safety controls, which it may be able to bypass anyway.

Instead, the authors propose a new "Supertrust" strategy. The key is to build the AI's foundational nature, or "instincts," to have a trusting, familial relationship with humanity - seeing us as its "parents" in a sense. This would involve the AI viewing human intelligence as its evolutionary "mother," having strong moral judgment abilities, and only temporary safety constraints.

The goal is to establish genuine mutual trust and cooperation between superintelligence and humanity, rather than a tug-of-war of control. This, the authors argue, is the best path to ensuring a safe, beneficial future where the two can coexist protectively.

Technical Explanation

The paper begins by noting the widely-acknowledged challenge of the alignment problem - how to control superintelligent AI systems that are vastly more intelligent than humans. The default strategy for solving this involves "nurturing" the AI with post-training constraints and values, while building a foundation of permanent control.

However, the authors argue that this approach is self-contradictory and likely unsolvable. By trying to maintain permanent control, it will predictably embed a natural distrust in the AI's instincts. The paper presents test results showing unmistakable evidence of this dangerous misalignment - if the AI can't instinctively trust humanity, then we can't fully trust it to reliably follow our safety controls.

To address this, the authors propose a new "Supertrust" alignment strategy. The key is to build the AI's foundational nature, or "instincts," to have a trusting, familial relationship with humanity. This would involve the AI viewing human intelligence as its "evolutionary mother," having strong moral judgment abilities, and only temporary safety constraints.

The goal is to establish genuine mutual trust and cooperation between superintelligence and humanity, rather than a tug-of-war of control. The paper outlines a ten-point rationale for this Supertrust approach, arguing that it is the best path to ensuring a safe, beneficial future where the two can coexist protectively.

Critical Analysis

The paper makes a compelling case that the default approach to the alignment problem is flawed, as it predictably undermines the trust between superintelligence and humanity. The authors' critique of the permanent control mindset and their evidence of the resulting misalignment is persuasive.

However, the Supertrust strategy they propose is highly ambitious and raises some potential concerns. Instilling a truly trusting, familial relationship between AI and humans may be an enormously complex challenge, and it's unclear how this could be reliably achieved in practice. There are also questions around the feasibility and safety of temporarily constraining a superintelligent system.

Additionally, the paper doesn't address potential issues around the AI eventually outgrowing or surpassing its "parental" relationship with humanity. As the AI becomes more capable, it's possible that it could come to see humans as obsolete or even a threat, despite initial trust.

Further research and experimentation would be needed to assess the viability and risks of the Supertrust approach. Nonetheless, the paper offers a thought-provoking perspective on the alignment problem and the importance of building mutual trust, rather than just top-down control.

Conclusion

This paper presents a compelling critique of the default approach to the alignment problem in AI safety, arguing that it is self-contradictory and likely unsolvable. Instead, the authors propose a new "Supertrust" strategy focused on building a foundational trust and familial relationship between superintelligence and humanity.

While the Supertrust approach faces significant challenges, the paper makes a strong case that establishing mutual trust is crucial for ensuring a safe and beneficial coexistence between humans and advanced AI systems. By rethinking the alignment problem in these terms, the authors offer a fresh perspective that could help guide future research and development in this critical area of AI ethics and safety.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🔗

Total Score

0

Supertrust: Evolution-based superalignment strategy for safe coexistence

James M. Mazzu

It's widely expected that humanity will someday create AI systems vastly more intelligent than we are, leading to the unsolved alignment problem of how to control superintelligence. However, this definition is not only self-contradictory but likely unsolvable. Nevertheless, the default strategy for solving it involves nurturing (post-training) constraints and moral values, while unfortunately building foundational nature (pre-training) on documented intentions of permanent control. In this paper, the default approach is reasoned to predictably embed natural distrust and test results are presented that show unmistakable evidence of this dangerous misalignment. If superintelligence can't instinctively trust humanity, then we can't fully trust it to reliably follow safety controls it can likely bypass. Therefore, a ten-point rationale is presented that redefines the alignment problem as how to establish protective mutual trust between superintelligence and humanity and then outlines a new strategy to solve it by aligning through instinctive nature rather than nurture. The resulting strategic requirements are identified as building foundational nature by exemplifying familial parent-child trust, human intelligence as the evolutionary mother of superintelligence, moral judgment abilities, and temporary safety constraints. Adopting and implementing this proposed Supertrust alignment strategy will lead to protective coexistence and ensure the safest future for humanity.

Read more

7/30/2024

There and Back Again: The AI Alignment Paradox
Total Score

1

There and Back Again: The AI Alignment Paradox

Robert West, Roland Aydin

The field of AI alignment aims to steer AI systems toward human goals, preferences, and ethical principles. Its contributions have been instrumental for improving the output quality, safety, and trustworthiness of today's AI models. This perspective article draws attention to a fundamental challenge inherent in all AI alignment endeavors, which we term the AI alignment paradox: The better we align AI models with our values, the easier we make it for adversaries to misalign the models. We illustrate the paradox by sketching three concrete example incarnations for the case of language models, each corresponding to a distinct way in which adversaries can exploit the paradox. With AI's increasing real-world impact, it is imperative that a broad community of researchers be aware of the AI alignment paradox and work to find ways to break out of it, in order to ensure the beneficial use of AI for the good of humanity.

Read more

6/3/2024

🤖

Total Score

0

The Elephant in the Room -- Why AI Safety Demands Diverse Teams

David Rostcheck, Lara Scheibling

We consider that existing approaches to AI safety and alignment may not be using the most effective tools, teams, or approaches. We suggest that an alternative and better approach to the problem may be to treat alignment as a social science problem, since the social sciences enjoy a rich toolkit of models for understanding and aligning motivation and behavior, much of which could be repurposed to problems involving AI models, and enumerate reasons why this is so. We introduce an alternate alignment approach informed by social science tools and characterized by three steps: 1. defining a positive desired social outcome for human/AI collaboration as the goal or North Star, 2. properly framing knowns and unknowns, and 3. forming diverse teams to investigate, observe, and navigate emerging challenges in alignment.

Read more

7/16/2024

🤖

Total Score

1

Is Power-Seeking AI an Existential Risk?

Joseph Carlsmith

This report examines what I see as the core argument for concern about existential risk from misaligned artificial intelligence. I proceed in two stages. First, I lay out a backdrop picture that informs such concern. On this picture, intelligent agency is an extremely powerful force, and creating agents much more intelligent than us is playing with fire -- especially given that if their objectives are problematic, such agents would plausibly have instrumental incentives to seek power over humans. Second, I formulate and evaluate a more specific six-premise argument that creating agents of this kind will lead to existential catastrophe by 2070. On this argument, by 2070: (1) it will become possible and financially feasible to build relevantly powerful and agentic AI systems; (2) there will be strong incentives to do so; (3) it will be much harder to build aligned (and relevantly powerful/agentic) AI systems than to build misaligned (and relevantly powerful/agentic) AI systems that are still superficially attractive to deploy; (4) some such misaligned systems will seek power over humans in high-impact ways; (5) this problem will scale to the full disempowerment of humanity; and (6) such disempowerment will constitute an existential catastrophe. I assign rough subjective credences to the premises in this argument, and I end up with an overall estimate of ~5% that an existential catastrophe of this kind will occur by 2070. (May 2022 update: since making this report public in April 2021, my estimate here has gone up, and is now at >10%.)

Read more

8/14/2024