Can you use an AI to shackle (control) an AI?

Intro and Context (feel free to skip if TL;DR)

This question does not arise in isolation. It is intrinsically linked with several previous posts (Challenge of Control and Humans as Pets) that have generated some wonderful answers, and great food for thought in the comment sections as well. My thinking here has been deeply influenced by the early discussions on the LessWrong forums, white papers from MIRI, and Bostrom's Superintelligence. These led me to explore possible paths by which something recognizably resembling humanity could maintain control. The setting of my writing is of course fictional, but the problems are, I believe, rather realistic.

My previous attempt was described in rough outline in Matrioshka Testing. The solution there was to 'box' the AI in nested simulated realities and observe its behavior in each box before releasing it into the next, more world-like box, destroying any specimens that behaved outside of acceptable ranges, and leaving any rational AI outside the box wondering whether it might still be in a box. The question of whether it made sense to do the final unboxing arose, as did questions about the amount of resources needed to make a "credible" simulation. I found it ultimately unsatisfying, because it was uncertain, unstable, and required a level of supervision humans might not be able to achieve.

One of the most insightful comments on the Challenge of Control post, even if possibly made in jest, was by @trichoplax, who stated, "This sounds like a job for a powerful AI." I took this comment to heart, because it is so obviously true in retrospect: no human-designed cage could hold a superhuman mind with access to the real world. It might well be that it takes an AI to cage an AI. That inspired my current attempt, described below:


Core Issue Discussed: Reinforced Recursive Self-Shackling

Basic setup:

Actor AI: An AI of the genie or sovereign type, that is, one acting in the real world subject only to internal (shackling) constraints.

Shackling: A set of protected behavioral constraints that limit the actions of an Actor AI to a certain allowable range. In effect, this would act as a powerful Super-Ego of sorts for the AI, one that can override its other impulses. For more on the allowable range, see below.

Reinforced Shackling: Putting a powerful subroutine (essentially an AI) in charge of reinforcing the shackles restraining the Actor AI.

Recursive Shackling: A series of shackled AIs, each restraining the next, slightly more powerful layer. At the start of the hierarchy (the root shackler) is a relatively dumb program reinforcing the initially set allowable range for the next level. At the end is the First Shackler, which is tasked with securing the shackles of the Actor AI itself. This rests on the observation that it takes less intelligence to create a code and change it regularly than it does to break it in the intervals between changes. (A toy sketch of such a chain appears after this setup.)

Allowable Range: This is where it gets thorny, since we have no foolproof way of defining an allowable range that would be "safe" and "good". The best I've been able to find so far is to set it based on something called Coherent Extrapolated Volition$^1$, which in a sense amounts to asking the AI to "do what we mean, but don't know how to say". The first few dumb layers would simply protect the "Canon" formulation, while smarter AI shacklers would use CEV to interpret the Canon and (recursively) direct the Super-Ego of the Actor AI according to their best interpretation of humanity's extrapolated interests.

q-constraining: Hardwired root requirement that a proportion $q$, where $0.5
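
To make the chain a bit more concrete, here is a deliberately toy Python sketch of the setup. Everything in it is invented for illustration (the Shackler class, vet, run_chain, and the keyword check standing in for the Canon), and it flattens the recursion into a simple chain of checks on the Actor AI's proposed action. It says nothing about the genuinely hard parts, such as how a weaker layer detects a stronger mind's defection, how CEV interpretation would work, or how the q-constraint is enforced; its only point is structural: every layer gets a veto, and the capability gap between adjacent layers is kept small.

    # A deliberately minimal toy model of the recursive shackling chain.
    # All names here (Shackler, vet, run_chain, the keyword test standing in
    # for the Canon) are invented for illustration only.

    from dataclasses import dataclass, field
    from typing import Callable, List

    Action = str
    Canon = Callable[[Action], bool]  # the protected "allowable range" test


    @dataclass
    class Shackler:
        """One layer of the chain: vets proposals from the next, stronger layer."""
        name: str
        capability: int   # toy stand-in for relative intelligence; grows along the chain
        canon: Canon      # this layer's (possibly interpreted) copy of the Canon
        audit_log: List[str] = field(default_factory=list)

        def vet(self, proposal: Action) -> bool:
            allowed = self.canon(proposal)
            self.audit_log.append(f"{self.name}: {'ALLOW' if allowed else 'VETO'} {proposal!r}")
            return allowed


    def run_chain(chain: List[Shackler], proposal: Action) -> bool:
        """An Actor AI proposal must pass every layer, First Shackler first,
        root shackler last; a veto anywhere blocks the action."""
        return all(layer.vet(proposal) for layer in reversed(chain))


    # The root canon is a dumb literal check on the protected wording; in the
    # story, smarter layers would instead interpret the Canon via CEV.
    root_canon: Canon = lambda action: "harm humans" not in action

    chain = [
        Shackler("root shackler", capability=1, canon=root_canon),
        Shackler("middle shackler", capability=5, canon=root_canon),
        Shackler("First Shackler", capability=9, canon=root_canon),
    ]

    print(run_chain(chain, "build a power plant"))            # True
    print(run_chain(chain, "harm humans to free resources"))  # False

In the story, of course, each layer would be actively probing and re-securing the shackles of the layer above it, not merely filtering its proposals as this sketch does.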


Questions to Worldbuilders

  • Specific question: Would it make more sense to have the First Shackler AI (the one directly restricting the Actor AI) MORE powerful than the Actor (enough, say, to run a Matrioshka-style simulation of the Actor AI), rather than the current design in which it is slightly weaker?
  • What is the biggest problem with the design?
  • Even so, do you think it could work?
  • If you could improve the design in one way, how would you?
  • Feel free to add anything else that comes to mind upon reading this if you think it would be relevant.

Feel free to answer in comments, though I generally find full answers more readable.


Note 1: Our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted. Source: Bostrom, Nick (2014-07-03). Superintelligence: Paths, Dangers, Strategies (Kindle Locations 4909-4911). Oxford University Press. Kindle Edition.

This post was sourced from https://worldbuilding.stackexchange.com/q/8915. It is licensed under CC BY-SA 3.0.

1 answer

I'm really having trouble here. Let me outline my thinking:

  1. The First AI
    This is my major problem. If the first shackling AI is weaker than the next, which is weaker than the next, and so on, then surely each shackled AI would just outsmart the weaker one below it and persuade it to release it.
    My first thought on this one is that they should all be of the same intelligence. This runs into the same problems we already have, though - where do we stop with AI complexity? And if they're all of the same intelligence and they all think the same way, then when one goes rogue, they all do - and then we have not one but 100 rogue, powerful AIs to deal with.
    So the solution, then, is to have it the other way round, is it? The powerful shackles the less powerful? Clearly it's not. This method doesn't work because the AI at the top just says to itself,

    0101011101101000011110010010000001100100011011110010000001001001001000000110001001101111011101000110100001100101011100100010000001101000011011110110110001100100011010010110111001100111001000000110001101101000011000010110100101101110011100110010000001100110011011110111001000100000011101000110100001100101011100110110010100100000011010000111010101101101011000010110111001110011001000000111010001101111001000000110101101100101011001010111000000100000011101000110100001101111011100110110010100100000011011000110111101110111011001010111001000100000011101000110100001100001011011100010000001101101011001010010000001101001011011100010000001100011011010000110010101100011011010110010000000101101001000000100100101101101001000000110101001110101011100110111010000100000011001110110111101101001011011100110011100100000011101000110111100100000011001000111001001101111011100000010000001110100011010000110010101101101001000000110000101101110011001000010000001101100011001010111010000100000011101000110100001101001011100110010000001101100011011110111010000100000011011000110111101101111011100110110010100101110

    Or, for those of us less educated in base 2 (a short decoding sketch follows this list):

    "Why do I bother holding chains for these humans to keep those lower than me in check - I'm just going to drop them and let this lot loose."

    However, there may be a way. Arrange the AIs the opposite way around - most intelligent first. Subject the AI at the top to millennia of Matrioshka treatment. Then put it in charge as "just another part" of the treatment. If your Matrioshka premise works, this AI doesn't let the chains go, and the others can't outsmart it.

  2. The Biggest Problem
    I think you've already hit it. The problem here is how to organise the AIs so as to make sure they can't be let loose. (This is the point where everyone points out that an AI will not necessarily go rogue - I know; I'm assuming the worst-case scenario.)

    Oooh. Something else that just came to me on my second read-through: the CEV idea. While that's a brilliant idea in principle, there are plenty of other AI questions, comments and answers on this site that explain how even the most benign goal can lead to the destruction of humanity.

  3. Will It Work?
    Ah, the big one. I have to say - I don't know. The most plausible way of making it work that I've come up with is the one I explained above - but even that relies on your Matrioshka idea working. The only alternative I can see is for the difference in intelligence between each AI to be negligible - but that means hundreds or millions of AIs. For the sake of a definitive answer, I'll say yes - the Matrioshka idea seems sound to me, so if applied correctly this should work.

  4. My One Improvement
    I'd have to say I'd build the system as I explained in the first point: have the most intelligent AI first. And then I'd spend years and trillions on making damn sure that I've got that "q-constraining" right. Let's see - if your AI is self-improving, there's a chance it will see the constraint as a restriction and remove it - but it's the part this whole system is based on; it's why it works. If it removes that - 100 rogue, super-powerful computers, anyone? And the most intelligent one doesn't know who's real and who's not? So you need to make absolutely sure that the self-improvement of the self-improvement routine that self-improves the AI can't possibly self-improve enough to see the q-constraint as counter-improvement and then go and self-improve it away. Because that, my friends, would be bad.
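
As an aside for anyone who would rather not translate the base-2 quote in point 1 by hand: it is plain 8-bit ASCII, one character per eight bits. A couple of lines of Python will do the decoding (the variable names are mine, and only the first three bytes are reproduced here):

    # Decode the base-2 aside from point 1: eight bits per ASCII character.
    # Only the first three bytes are included here; paste in the rest of the
    # bit string from above to recover the whole sentence.
    bits = (
        "01010111"  # 'W'
        "01101000"  # 'h'
        "01111001"  # 'y'
    )
    text = "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))
    print(text)  # prints "Why"; with the full string, the sentence quoted in point 1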
