Jason says they train robots in much the same way humans interact with the physical world: cameras on the robot perceive the scene, and a neural network outputs actions that command the robot's joints and motors. You collect input-output pairs, with camera images as the input and robot joint commands as the output, then use that data to train a model that controls the robot autonomously.
The way Jason describes it: humans have eyes that perceive the world through vision, and a brain that turns those image inputs into actions our arms and legs perform. Training robots works much the same way. They put cameras on the robots to perceive the physical world, then train a neural network to output actions that command the robot's joints and motors. To gather data, you control the robot through the task, folding clothes many, many times, and record input-output pairs of camera images in and robot joint commands out. That data is used to train a model, which can then control the robot autonomously. He stresses that the paradigm is fairly general and not specific to laundry folding. If you have data for packaging, cooking a meal, or cleaning a bedroom, the same algorithm can train models for those tasks, and you can combine all the different datasets into one very powerful model, much like language models today.
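The loop Jason describes (record images and joint commands while a person drives the robot, fit a model to those pairs, then let the model drive) resembles what is commonly called behavior cloning. The toy sketch below is not his team's code. It assumes a four-number "image", two joint commands, and a small linear model trained by gradient descent standing in for the neural network. All names, sizes, and the shared `expert` mapping are hypothetical, chosen only to keep the example runnable.

```python
import random

random.seed(0)

IMAGE_DIM = 4    # toy "camera image" of 4 numbers (hypothetical size)
NUM_JOINTS = 2   # hypothetical number of joint commands


def linear_map(weights, image):
    """Apply a NUM_JOINTS x IMAGE_DIM weight matrix to an image."""
    return [sum(w * px for w, px in zip(row, image)) for row in weights]


def collect_demonstrations(num_steps, expert):
    """Record (image, joint command) pairs while the 'operator' controls the robot.

    The expert matrix is a stand-in for the human operator's behavior.
    """
    data = []
    for _ in range(num_steps):
        image = [random.uniform(-1.0, 1.0) for _ in range(IMAGE_DIM)]
        data.append((image, linear_map(expert, image)))
    return data


def train_policy(dataset, epochs=50, learning_rate=0.1):
    """Fit the policy to the demonstrations with per-sample gradient descent."""
    weights = [[0.0] * IMAGE_DIM for _ in range(NUM_JOINTS)]
    for _ in range(epochs):
        for image, target in dataset:
            predicted = linear_map(weights, image)
            for j in range(NUM_JOINTS):
                error = predicted[j] - target[j]
                for i in range(IMAGE_DIM):
                    weights[j][i] -= learning_rate * error * image[i]
    return weights


# Assumption: both toy tasks share one image-to-action mapping. Real tasks
# would differ, and the camera image would tell the model which one it is in.
expert = [[random.uniform(-1.0, 1.0) for _ in range(IMAGE_DIM)]
          for _ in range(NUM_JOINTS)]
folding_data = collect_demonstrations(200, expert)
packaging_data = collect_demonstrations(200, expert)

# Combining datasets from different tasks is just concatenation here.
combined_data = folding_data + packaging_data
policy = train_policy(combined_data)

# Autonomous control: the trained policy maps a new image to joint commands.
new_image = [random.uniform(-1.0, 1.0) for _ in range(IMAGE_DIM)]
predicted_joints = linear_map(policy, new_image)
expected_joints = linear_map(expert, new_image)
max_error = max(abs(p - e) for p, e in zip(predicted_joints, expected_joints))
print("max joint error on a new image:", max_error)
```

In a real system the images would be camera frames, the model would be a neural network, and the demonstrations would be noisy, so the fit would not be this clean. The data flow is the part that carries over: pairs in, policy out, and more tasks added by extending the dataset.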