A better way to organize complex visual tasks | MIT News

MIT researchers have developed a generative artificial intelligence-driven method for planning long-horizon visual tasks, such as robot navigation, that succeeds more than twice as often as existing techniques.
Their method uses a specialized vision-language model to interpret the scene in an image and simulate the actions needed to reach the goal. A second model then translates those simulations into a standard language for formal planning problems and verifies the solution.
Finally, the system automatically generates a set of files that can be fed into classical planning software, which produces a plan to achieve the goal. This two-step process yielded plans with a success rate of about 70 percent, beating the best baseline methods, which succeed only about 30 percent of the time.
Importantly, the system can solve new problems it has never encountered before, making it well suited to real-world environments where conditions can change rapidly.
“Our framework combines the advantages of vision-language models, such as their ability to understand images, with the rigorous planning abilities of a formal solver,” said Yilun Hao, a graduate student in aeronautics and astronautics (AeroAstro) at MIT and lead author of an open-access paper on the approach. “It can take a single image and, through simulation, turn it into a reliable, long-horizon plan, which can be useful in many real-world applications.”
He is joined on the paper by Yongchao Chen, a graduate student at the MIT Laboratory for Information and Decision Systems (LIDS); Chuchu Fan, associate professor of AeroAstro and principal investigator at LIDS; and Yang Zhang, a research scientist at the MIT-IBM Watson AI Lab. The paper will be presented at the International Conference on Learning Representations.
Dealing with practical tasks
Over the past few years, Fan and her colleagues have studied the use of generative AI models for complex reasoning and planning, typically using large language models (LLMs) to process textual input.
But many real-world planning problems, such as robotic assembly and autonomous driving, involve visual inputs that LLMs cannot handle well on their own. The researchers wanted to extend their work into the visual domain using vision-language models (VLMs), powerful AI systems that can process both images and text.
But VLMs struggle to understand the spatial relationships between objects in a scene and often fail to reason correctly over multiple steps. This makes it difficult to use VLMs for long-horizon planning on their own.
On the other hand, scientists have developed robust formal planners that can produce effective long-horizon plans in complex scenarios. However, these solvers cannot process visual input, and they require expert knowledge to specify the problem in a language the solver can understand.
Fan and her team built an automated planning framework that takes the best of both approaches. The framework, called VLM-guided formal planning (VLMFP), uses two specialized VLMs that work together to convert a visual planning problem into ready-to-use files for formal planning software.
The researchers first fine-tuned a smaller model they call SimVLM to describe a scene in an image using natural language and to simulate sequences of actions in that scene. Then a much larger model, which they call GenVLM, uses the description from SimVLM to generate an initial set of files in a formal planning language known as the Planning Domain Definition Language (PDDL).
These files can be loaded directly into a classical PDDL solver, which produces a step-by-step plan to accomplish the task. GenVLM compares the solver's results with SimVLM's simulations and iteratively refines the PDDL files.
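The generate-solve-simulate loop described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' implementation: the function names (`describe_scene`, `generate_pddl`, `run_planner`, `simulate_plan`) and the toy one-dimensional navigation world are stand-ins for the real VLMs and solver.

```python
# Hypothetical sketch of VLMFP's iterative loop: describe, generate PDDL,
# solve, simulate, refine. The toy world is a robot on a number line.

def describe_scene(image):
    # SimVLM's first role: turn pixels into a scene description.
    # Here the "image" is already a symbolic toy state.
    return image

def generate_pddl(description, goal, feedback=None):
    # GenVLM's role: emit PDDL domain/problem files from the description.
    # Toy version: the "problem" is just a (start, goal) pair.
    return "toy-domain", (description["robot_at"], goal)

def run_planner(domain, problem):
    # The classical solver's role: produce a step-by-step plan.
    start, goal = problem
    return [f"move {i} {i + 1}" for i in range(start, goal)]

def simulate_plan(image, plan):
    # SimVLM's second role: simulate each action, check the goal.
    pos = image["robot_at"]
    for _step in plan:
        pos += 1  # each toy action moves the robot one cell forward
    return pos == image["goal"]

def plan_from_image(image, goal, max_rounds=3):
    """Iterate until the simulated plan reaches the goal, or give up."""
    description = describe_scene(image)
    domain, problem = generate_pddl(description, goal)
    for _ in range(max_rounds):
        plan = run_planner(domain, problem)
        if simulate_plan(image, plan):
            return plan
        # Feed the mismatch back to the generator and retry.
        domain, problem = generate_pddl(description, goal, feedback="mismatch")
    return None

print(plan_from_image({"robot_at": 0, "goal": 3}, goal=3))
# ['move 0 1', 'move 1 2', 'move 2 3']
```

The key design idea this mirrors is that neither model is trusted alone: the solver's plan is only accepted once the simulator confirms it reaches the goal.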
“The generator and the simulator cross-check each other until they agree on a result: a sequence of actions that actually achieves the goal,” Hao said.
Because GenVLM is a large generative AI model, it saw many examples of PDDL during training and learned how this structured language can encode many kinds of problems. That prior exposure helps the model generate accurate PDDL files.
A flexible approach
VLMFP generates two separate PDDL files. The first, a domain file, defines the environment, the valid actions, and the rules of the domain. The second, a problem file, describes the initial conditions and the goal of the specific problem at hand.
“One advantage of PDDL is that the domain file is the same for all instances in that domain. This makes our framework good at generalizing to unseen instances within the same domain,” Hao explained.
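To make the domain/problem split concrete, here is a pair of illustrative PDDL fragments. These are toy examples written for this explanation, not files from the paper: a minimal gripper-style domain shared across problems, and one problem file describing a specific scene.

```pddl
; --- domain file: environment, valid actions, and rules (reused across problems) ---
(define (domain toy-gripper)
  (:predicates (at ?obj ?loc) (holding ?obj) (handempty))
  (:action pick
    :parameters (?obj ?loc)
    :precondition (and (at ?obj ?loc) (handempty))
    :effect (and (holding ?obj)
                 (not (at ?obj ?loc))
                 (not (handempty)))))

; --- problem file: initial conditions and goal for one specific instance ---
(define (problem grab-ball)
  (:domain toy-gripper)
  (:objects ball table)
  (:init (at ball table) (handempty))
  (:goal (holding ball)))
```

A classical PDDL solver given these two files would search for an action sequence (here, a single `pick`) whose effects make the goal true, which is why the same domain file can be paired with many different problem files.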
For the system to generalize successfully, the researchers needed to carefully design sufficiently varied training data for SimVLM so the model learned to understand each problem and goal rather than memorize patterns in the scenes. When tested, SimVLM successfully described scenes, simulated actions, and detected when the goal was achieved in about 85 percent of cases.
Overall, the VLMFP framework achieved a success rate of about 60 percent across six 2D planning tasks and over 80 percent on two 3D tasks, including multi-robot interaction and robotic assembly. It also produced valid plans for more than 50 percent of scenarios it had never seen before, far surpassing baseline methods.
“Our framework can generalize when the rules change across contexts. This gives our system the flexibility to solve many kinds of vision-based planning problems,” Fan said.
In the future, the researchers want to extend VLMFP to handle more complex scenarios and to explore ways to identify and mitigate hallucinations in the VLMs' outputs.
“In the long run, generative AI models can act as agents and use the right tools to solve the most complex problems. But what counts as the right tools, and how do we integrate them? There’s still a long way to go, but by bringing vision-based planning into the picture, this work puts an important piece of the puzzle in place,” Fan said.
This work was funded, in part, by the MIT-IBM Watson AI Lab.


