AI godfather Hinton: "I am old; how to control a superintelligence smarter than humans is up to you"

Source: Geek Park

Authors | Li Yuan, Ling Zijun · Editor | Wei Shijie

"And I'm old," the 75-year-old Hinton said to all the young scientists present, and he hoped that everyone would study "how to have superintelligence". He sees an unprecedented challenge for a less intelligent species to control something smarter than itself. **

At the Zhiyuan Artificial Intelligence Conference, Hinton, the godfather of AI, gave a talk titled "Two Paths to Intelligence". Analyzing computing architectures and principles, he arrived at his own conclusion: artificial neural networks will become more intelligent than the human brain, and much sooner than he had originally imagined.

In his 30-minute speech, he discussed the current computing architecture, in which software is separated from hardware. Under this rule, training large models consumes enormous amounts of computing power. To train large models with less energy, he proposed the concept of Mortal Computing: just as a person's intelligence depends on their body and cannot be copied into another body at will, the software depends on the hardware it runs on.

But the ensuing problem is that when the specific hardware is damaged, the software is damaged too, and "the learned knowledge dies with it." The solution he proposed is to transfer the knowledge from the old hardware to the new hardware by "distillation," like a teacher teaching students.

**The counterpart of "knowledge distillation" (biological computation) is "weight sharing" (digital computation); these are what Hinton calls the "two paths to intelligence."** The relationship between a large language model and its copies is weight sharing: each copy directly obtains the knowledge held in the full model's parameters. For example, ChatGPT can talk to thousands of people at the same time on top of the model behind it. The model's continuous learning from conversations with all those people, by contrast, belongs to "knowledge distillation".

Although "knowledge distillation" is much less efficient than "weight sharing", and the bandwidth is also low, a large model can have 1000 copies, and eventually obtain 1000 times more knowledge than any one person.

Currently, models learn only from documents, that is, from knowledge already processed by humans. As the technology develops, they will be able to learn from visual information, and may then learn to manipulate robots. At that point they could easily become smarter than humans, smart enough to be good at deceiving people. **And humans are not good at getting along with things smarter than themselves. How can the dangers of these superintelligences be avoided? This is the question he left for every young scientist.**

The following is the main speech content compiled by Geek Park:

**I'm going to talk today about the research that leads me to believe superintelligence is closer than I thought.**

I have two questions I want to talk about, and my energy will be focused mainly on the first: will artificial neural networks soon be smarter than real neural networks? I will elaborate on the research that leads me to conclude that this may happen soon. At the end of the talk, I will discuss whether we can maintain control of superintelligence, but that will not be the main content of this talk.

In traditional computing, computers are designed to follow instructions exactly. We can run the exact same program or neural network on different physical hardware, because we know that the hardware will follow the instructions exactly. This means that the knowledge in the program or the weights of the neural network is immortal, i.e. it does not depend on any specific hardware. The cost of achieving this kind of immortality is high. We have to run transistors at high power, so their behavior is digital. And we cannot take advantage of the rich analog and variable properties of the hardware.

So the reason digital computers exist, and the reason they follow instructions precisely, is because in traditional designs, humans look at a problem, figure out what steps need to be taken to solve the problem, and then we tell the computer to take those steps. But that has changed.

We now have a different way of making computers do things: learning from examples. We just show them what we want them to do. Because of this change, we now have the opportunity to abandon one of the most fundamental principles of computer science, the separation of software from hardware.

Before we give up on it, let's take a look at why it's such a good principle. Separability allows us to run the same program on different hardware. We can also directly study the properties of programs without worrying about electronic hardware. And that's why the computer science department can become a discipline of its own, independent of the electrical engineering department.

**If we do give up the separation of hardware and software, we get what I call mortal computation.**

It obviously has big downsides, but also some huge upsides. It was in order to run large language models with less energy, and especially to train them, that I started working on mortal computation.

The biggest benefit of giving up immortality, that is, of giving up the separation of hardware and software, is that we can save a lot of energy, because we can use analog computation at very low power, which is exactly what the brain does. The brain does need one bit of digital computation, since a neuron either fires or it doesn't, but most of the computation is analog and can be done at very low power.

We can also get cheaper hardware. Today's hardware has to be fabricated very precisely in 2D, whereas we could grow it in 3D, because we would not need to know exactly how the hardware conducts electricity or exactly how every piece of it works.

Obviously, doing that would require a lot of new nanotechnology, or perhaps the genetic re-engineering of biological neurons, because biological neurons already do roughly what we want them to do. **Before discussing all the downsides of mortal computation, I want to give an example of a computation that can be done much more cheaply with analog hardware.**

If you want to multiply a vector of neural activities by a weight matrix, that is the central computation of a neural network; it does most of the work. What we currently do is drive transistors at very high power to represent the bits of numbers digitally, and then multiply two n-bit numbers, which takes O(n^2) single-bit operations. It may look like one operation on a computer, but at the bit level it costs on the order of n squared.

Another approach is to implement neural activities as voltages and weights as conductances. Then, per unit time, each voltage multiplied by its conductance gives a charge, and the charges add up on their own. So you can simply multiply a voltage vector by a conductance matrix. This is far more energy-efficient, and chips that work this way already exist.
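To make the comparison concrete, here is a minimal numerical sketch of the product the analog scheme computes: treat activities as voltages and weights as conductances, and Ohm's and Kirchhoff's laws perform the multiply-accumulate for free. The numbers and names are illustrative, not from the talk:

```python
import numpy as np

# Illustrative stand-ins (units arbitrary): activities as voltages,
# weights as conductances.
voltages = np.array([0.2, 0.7, 0.1])          # activity vector, n = 3 inputs
conductances = np.array([[0.5, 0.1, 0.0],     # weight matrix, m x n
                         [0.2, 0.3, 0.4]])

# Ohm's law per connection (current = conductance * voltage) and
# Kirchhoff's law per output wire (currents add) together compute
# exactly the matrix-vector product the network needs.
currents = conductances @ voltages
print(currents)
```

In the analog version, the matrix-vector product happens in the physics of the circuit itself, with no per-bit digital arithmetic at all.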

Unfortunately, what people then do is try to convert the analog answer back to digital form, which requires very expensive analog-to-digital converters. We would like to stay entirely in the analog domain if we could. But doing so causes different pieces of hardware to end up computing slightly different things.

Therefore, the main problem with mortal computation is that, when learning, the program must learn to exploit the specific properties of the analog hardware it runs on, without knowing exactly what those properties are: for example, the exact function that connects a neuron's input to its output, or the exact connectivity.

This means that we cannot use algorithms like backpropagation to obtain gradients, because backpropagation requires an exact model of the forward pass. So the question is: if we can't use backpropagation, what else can we do? Because we are all heavily dependent on backpropagation now.

I can show a very simple and straightforward learning procedure, weight perturbation, which has been studied a lot. For each weight in the network, generate a small random temporary perturbation. Then measure the change in the global objective function over a small batch of examples, and permanently change the weights in proportion to the perturbation vector, scaled by how much the objective improved. If the objective gets worse, you obviously move in the other direction.

The nice thing about this algorithm is that on average it behaves like backpropagation, because on average it follows the gradient. The problem is that it has very high variance: the noise from picking a random direction to move in becomes terrible as the network grows. So the algorithm works for networks with a small number of connections, but not for large networks.
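A minimal sketch of weight perturbation on a toy linear model, assuming squared error as the global objective (the model, step sizes, and demo data are illustrative choices, not Hinton's):

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, X, y):
    """Squared-error objective; stands in for any global objective."""
    return np.mean((X @ w - y) ** 2)

def weight_perturbation_step(w, X, y, sigma=1e-3, lr=0.05):
    """Try one small random nudge of ALL weights at once, measure how the
    objective changed on a batch, and keep a permanent weight change
    proportional to the perturbation times the measured improvement."""
    delta = sigma * rng.standard_normal(w.shape)   # temporary perturbation
    improvement = loss(w, X, y) - loss(w + delta, X, y)
    return w + lr * improvement * delta / sigma**2

# Toy demo: recover a known weight vector from noiseless data.
X = rng.standard_normal((32, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true
w = np.zeros(4)
for _ in range(2000):
    w = weight_perturbation_step(w, X, y)
```

The update follows the gradient on average, but its variance grows with the number of weights, which is exactly why the method stalls on large networks.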

We also have a better algorithm, activity perturbation learning. It still has similar problems, but it works much better than weight perturbation. In activity perturbation, you apply a random vector perturbation to the total input of each neuron rather than to each weight. You observe what happens to the objective function on a small batch of examples when you apply this perturbation, and from the resulting change you can compute how to change each of the neuron's incoming weights so as to follow the gradient. This method is much less noisy.

For simple tasks like MNIST, such an algorithm is good enough. But it still doesn't scale well enough to large neural networks.
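A sketch of activity perturbation in the same spirit: only one random number per neuron is needed, instead of one per weight, which is where the variance reduction comes from. The single-layer model, fixed input, and step sizes are illustrative assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

def activity_perturbation_step(W, x, y, sigma=1e-3, lr=0.2):
    """Perturb each neuron's TOTAL INPUT (one random number per neuron,
    not per weight), measure the change in the objective, estimate the
    gradient w.r.t. those inputs, then push it back to the incoming
    weights by the chain rule."""
    z = W @ x                                  # total input to each neuron
    delta = sigma * rng.standard_normal(z.shape)
    base = np.sum((z - y) ** 2)                # squared-error objective
    pert = np.sum((z + delta - y) ** 2)
    g_z = (pert - base) * delta / sigma**2     # estimated dLoss/dz
    return W - lr * np.outer(g_z, x)           # chain rule: dLoss/dW = g_z x^T

# Toy demo on one fixed input (an illustrative setup, not Hinton's).
x = np.array([1.0, 0.5, -0.2])
y = np.array([0.3, -0.7])                      # target outputs
W = np.zeros((2, 3))
for _ in range(1000):
    W = activity_perturbation_step(W, x, y)
```

With n neurons and n² weights, the random search happens in an n-dimensional space instead of an n²-dimensional one, so the noise is far smaller than in weight perturbation.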

**Rather than looking for one objective function for the whole network, we can try a learning approach that scales to large neural networks: use many small objective functions, each applying to one small part of the network.** The idea is still to train a large neural network, but each small group of neurons gets its own local objective function.

**To summarize: so far, we have not found a really good learning algorithm that can exploit the analog properties, but we do have a passable one that solves simple problems like MNIST, though not much more.**

The second big problem with mortal computation is its mortality. When a particular piece of hardware dies, all the knowledge it learned dies with it, because the learning was based entirely on the details of that specific hardware. The best way to solve this problem is to distill the knowledge from the teacher (the old hardware) to the student (the new hardware) before the hardware dies. This is the research direction I am now trying to promote.

(Image generated by Midjourney)

The teacher shows the student the correct responses to various inputs, and the student then tries to mimic the teacher's responses. It's like Trump's tweets. Some people get very angry about Trump's tweets because they think he is telling lies, assuming that Trump is trying to describe facts. No. What Trump did was pick out a situation and produce a targeted, highly emotional response to it. His followers saw it, learned how to deal with that situation, learned how to adjust the weights in their neural networks, and responded emotionally to the situation in the same way. It has nothing to do with facts; it is a cult leader teaching bigotry to his followers, but it is very effective.

So, if we think about how distillation works, consider an agent classifying images into 1024 non-overlapping classes. The correct answer takes only about 10 bits to specify. So when you train that agent on a training example by telling it the correct answer, you are placing only 10 bits of constraint on the weights of the network.

**But now suppose we train an agent to agree with the teacher's answers over these 1024 classes.** Then it must match the same probability distribution, which contains 1023 free real numbers. Provided those probabilities are not all tiny, this provides hundreds of times more constraint.
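The counting argument can be sketched directly: a hard label over 1024 classes carries log2(1024) = 10 bits, while matching the teacher's full distribution constrains the student through all 1023 free probabilities. The temperature value and the random logits below are illustrative assumptions:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; T > 1 softens the distribution."""
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n_classes = 1024
bits_in_hard_label = np.log2(n_classes)        # = 10 bits

rng = np.random.default_rng(2)
teacher_logits = rng.standard_normal(n_classes)
student_logits = rng.standard_normal(n_classes)

# Hard-label training: cross-entropy against one correct class only.
correct = int(np.argmax(teacher_logits))
hard_loss = -np.log(softmax(student_logits)[correct])

# Distillation: cross-entropy against the teacher's soft distribution,
# so all 1023 free probabilities constrain the student at once.
p_teacher = softmax(teacher_logits, T=2.0)     # T = 2.0 is illustrative
q_student = softmax(student_logits, T=2.0)
distill_loss = -np.sum(p_teacher * np.log(q_student))
```

Raising the temperature keeps the teacher's small "side bets" (the 3-ness or 8-ness of a 2) visible in the targets instead of being rounded away.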

Typically, when you train a model, you train it to be correct on the training set and then hope it generalizes correctly to the test data. But here, when you train the student, you are directly training it to generalize, because it is trained to generalize in the same way as the teacher.

I'll use the MNIST images of the digit 2 as an example. We can see the probabilities the teacher assigns to the various classes.

The first row is obviously a 2, and the teacher assigns 2 a high probability. In the second row, the teacher is fairly confident it's a 2, but also thinks it could be a 3, or perhaps an 8, and indeed you can see that the image bears a slight resemblance to a 3 and an 8. In the third row, this 2 looks very close to a 0. So the teacher tells the student: output 2 here, but also place a small bet on 0. The student thereby learns more in this case than from being told "this is a 2" alone; it learns which other digits the shape resembles. In the fourth row, the teacher thinks it is a 2 but allows that it might well be a 1, which matches the way the 1 in the image is handwritten; occasionally someone writes a 1 like that.

And in the last row, the AI actually guessed wrong: it thought the image was a 5, while the correct answer given by the MNIST dataset is 2. The student can even learn from the teacher's mistakes.

What I really like about the knowledge distillation model is that we are training the student to generalize in the same way as the teacher, including placing small probabilities on wrong answers. Normally, you give a model a training dataset with correct answers and hope it generalizes correctly to the test dataset; you try to keep it from being too complicated, and do various things in the hope that it generalizes correctly. But here, training the student to match the teacher directly trains it to generalize as the teacher does.

So now I want to talk about how a community of agents can share knowledge. Rather than thinking about a single agent, it is better to think about knowledge sharing within a community.

And it turns out that the way a community shares knowledge determines many other things about the computation. With digital models, with digital intelligence, you can have a whole bunch of agents using exactly the same copy of the weights and using those weights in exactly the same way. This means different agents can look at different bits of the training data.

They compute weight gradients on their own bits of the training data and then average those gradients. Now every model learns from the data that every other model saw, which gives you an enormous capacity to take in a lot of data, because different copies of the model look at different bits of the data and, by sharing gradients or sharing weights, share what they learn very efficiently.

If you have a model with a trillion weights, every time the copies share something you get a trillion bits of bandwidth. But the price of doing this is that you need digital agents that behave in exactly the same way.
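A sketch of what weight sharing buys, assuming a toy linear model with squared error (all names and numbers are illustrative): identical copies each compute a gradient on their own shard of the data, and averaging those gradients lets one set of shared weights learn from everything every copy saw.

```python
import numpy as np

rng = np.random.default_rng(3)

def grad(w, X, y):
    """Gradient of squared error for a linear model shared by every copy."""
    return 2 * X.T @ (X @ w - y) / len(y)

w_true = np.array([0.5, -1.0, 2.0, 0.0])

# Each of 8 identical copies sees its own shard of the data.
shards = []
for _ in range(8):
    X = rng.standard_normal((64, 4))
    shards.append((X, X @ w_true))

# One set of weights, shared bit-for-bit by all copies: each copy
# computes a gradient on its own shard, and the averaged gradient
# updates the shared weights with what all the copies learned.
w = np.zeros(4)
for _ in range(200):
    g = np.mean([grad(w, X, y) for X, y in shards], axis=0)
    w -= 0.1 * g
```

This only works because the copies are digitally identical; an analog system, whose weights only make sense on its own hardware, has no meaningful gradients to average.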

The alternative to weight sharing is to use distillation. That is what we have already done with digital models when the architectures are different.

However, if you have biological models that exploit the analog properties of a particular piece of hardware, you have no choice. You cannot share weights, so you have to share knowledge by distillation, which is much less efficient. **Sharing knowledge by distillation is hard. I generate sentences, and you try to figure out how to change your weights so that you would generate the same sentences.**

But this is much lower bandwidth than simply sharing gradients. Everyone who has ever taught wishes they could pour what they know straight into their students' brains; that would be the end of universities. But we can't work like that, because we are biological intelligences, and my weights are of no use to you.

So far, we have two different ways of doing computation: **digital computation, and biological computation, which exploits the properties of living things. They are very different in how efficiently knowledge can be shared among different agents.**

If you look at large language models, they use digital computation and weight sharing. But each copy of the model, each agent, acquires knowledge from documents in a very inefficient way. Taking a document and trying to predict the next word is actually a very inefficient form of knowledge distillation: what the model learns is not the teacher's predicted probability distribution over the next word, but only the single next word the document's author chose. So it is very low bandwidth. And that is how these large language models learn from people.

**While each copy of a large language model learns inefficiently, there are 1,000 copies of it. That is why they can learn 1,000 times more than we can. So I believe these large language models know 1,000 times more than any individual person.**

Now, the question is, what happens if these digital agents, instead of learning from us very slowly through knowledge distillation, start learning directly from the real world?

I should emphasize that even knowledge distillation is a very slow way to learn, but when they learn from us, they can learn very abstract things. **Humans have learned a lot about the world over the past few millennia, and digital agents can take advantage of this knowledge directly. Because humans can verbalize what we have learned and we wrote it down, digital agents have direct access to everything humanity has learned about the world over the past few millennia.**

But this way, each digital agent's bandwidth is still very low, because they learn from documents. If they do unsupervised learning, for example by modeling video, then once we find an efficient way to train models on video, they can learn from all the YouTube videos, which is a huge amount of data. Or they could manipulate the physical world, for instance by controlling robotic arms and so on.

I genuinely believe that once these digital agents start doing this, they will be able to learn far more than humans, and learn it fairly quickly. So we come to the second point I mentioned in the slides: what happens if these things become smarter than us?

Of course, this is also the main topic of this conference. But my main contribution is this: **I want to tell you that these superintelligences may arrive much sooner than I used to think.**

**Bad actors will use them for things like manipulating electorates, which is already happening in the US and many other places, and people will try to use AI to win wars.**

If you want a superintelligent agent to be effective, you need to allow it to create subgoals. This raises an obvious problem, **because there is one obvious subgoal that greatly enhances its ability to help us achieve almost anything: gaining more power and control. The more control you have, the easier it is to achieve your goals.** I don't see how we can stop digital intelligences from trying to gain more control in order to achieve their other goals. And once they start doing that, the problems begin.

For a superintelligence, even if you keep it in a completely offline, isolated environment (an air gap), it will find that it can easily gain more power by manipulating people. **We are not used to thinking about things much smarter than ourselves, or about how to interact with them.** But it seems obvious to me that they could learn to be extremely good at deceiving people, because they can see all our practice at deceiving one another, in countless novels and in the works of Niccolò Machiavelli. And once you are really good at deceiving people, you can get them to perform any action you want. For example, if you want to get into a building in Washington, you don't need to go there yourself; you just trick people into believing that by getting into that building, they are saving democracy. And I find that pretty scary.

**I can't see how to prevent this from happening, and I'm getting old.** I hope that many young and brilliant researchers, like those of you at this conference, can figure out how we can have these superintelligences make our lives better without them becoming the dominant party.

We have one slight advantage: these things didn't evolve; we built them. Because they didn't evolve, perhaps they lack the competitive, aggressive goals that humans have. Maybe that helps; maybe we can give them moral principles. But at the moment, I'm just nervous, because I don't know of any example of something more intelligent being controlled by something less intelligent when the gap in intelligence is large. **The example I like to give is: suppose frogs had created humans. Who do you think would be in control now, the frogs or the humans? That is all for my speech.**
