Salesforce AI Introduces FOFPred: A Language-Driven Future Optical Flow Prediction Framework That Enables Advanced Robot Control and Video Generation

The Salesforce AI research team introduces FOFPred, a language-driven future optical flow prediction framework that connects a large vision language model with a diffusion transformer for dense motion prediction in robot control and video generation. FOFPred takes one or more images and a natural language instruction such as ‘move the bottle from right to left’ and predicts the next 4 optical flow frames, describing how each pixel is expected to move over time.

Future optical flow as a motion representation
Optical flow is the per-pixel displacement between two frames. FOFPred focuses on future optical flow, which means predicting dense displacement fields for upcoming frames given only the current observation and text, without access to any future images.
Future optical flow represents motion only. It removes static appearance and keeps just the pixel-level movement, so it is well suited as an intermediate representation for robot control policies and as a conditioning signal for video diffusion models. Compared to predicting future RGB frames, it reduces the complexity of the output distribution and avoids modeling texture and high-frequency details that are not needed for motion planning.
To connect to existing latent diffusion infrastructure, the research team encodes optical flow as RGB images. They map flow magnitude and direction from polar form to HSV channels, then convert to RGB. The scaling of each channel is tuned so that successive flow frames are smooth and video-like. The standard Flux.1 autoencoder then encodes and decodes these flow images.
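The article describes this polar-to-HSV mapping only at a high level, so the snippet below is a minimal sketch of a standard flow colouring in NumPy and Matplotlib, not the released code. The fixed `max_mag` normalisation is an assumption that stands in for the per-channel scaling the authors tune.

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb

def flow_to_rgb(flow: np.ndarray, max_mag: float) -> np.ndarray:
    """Encode a dense flow field (H, W, 2) as an RGB image in [0, 1].

    Flow direction maps to hue, magnitude to value, with saturation fixed
    at 1. Sharing max_mag across a clip keeps consecutive flow frames on
    the same colour scale, so the sequence looks smooth and video-like.
    """
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx)                      # angle in [-pi, pi]

    hue = (ang + np.pi) / (2.0 * np.pi)           # normalize to [0, 1]
    sat = np.ones_like(hue)
    val = np.clip(mag / max_mag, 0.0, 1.0)

    return hsv_to_rgb(np.stack([hue, sat, val], axis=-1))

# Example: a flow field where every pixel moves 3 px to the left.
flow = np.zeros((64, 64, 2), dtype=np.float32)
flow[..., 0] = -3.0
rgb = flow_to_rgb(flow, max_mag=10.0)             # (64, 64, 3), ready for the VAE
```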
A unified VLM and diffusion core
FOFPred uses a unified architecture that combines a frozen vision language model, a frozen VAE, and a trainable diffusion transformer. The pipeline is:
- Qwen2.5-VL serves as the vision language encoder, jointly encoding the caption with the visual input.
- The Flux.1 VAE encodes the input images, and the optical flow targets used during training, into latents.
- An OmniGen-style diffusion transformer (DiT) takes the encoded visual and textual features as conditioning inputs and generates the future flow latent sequence.
Only the DiT and small MLP projectors are trained. The Qwen2.5-VL and Flux.1 weights remain frozen, allowing the model to reuse image editing and multimodal generation capabilities from earlier work. Temporal modeling is added by extending the RoPE positional encoding and attention blocks from 2D spatial to full spatio-temporal form across the input and output frames. This provides full spatio-temporal attention without adding extra parameters, so the DiT can directly reuse OmniGen's image pre-training.
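To make the frozen-versus-trainable split concrete, here is an illustrative PyTorch wiring of the core. The module interfaces and hidden sizes are assumptions for the sketch, not the released Qwen2.5-VL / Flux.1 / OmniGen implementation.

```python
import torch
import torch.nn as nn

class FOFPredCoreSketch(nn.Module):
    """Illustrative wiring only: a frozen VLM and VAE feed a trainable
    projector and DiT. `vlm`, `vae`, and `dit` are placeholder modules
    with assumed interfaces."""

    def __init__(self, vlm, vae, dit, vlm_dim=3584, dit_dim=1536):
        super().__init__()
        self.vlm, self.vae, self.dit = vlm, vae, dit
        # Only the projector and the DiT receive gradients.
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, dit_dim), nn.GELU(), nn.Linear(dit_dim, dit_dim)
        )
        for module in (self.vlm, self.vae):
            for p in module.parameters():
                p.requires_grad_(False)

    def forward(self, images, instruction_tokens, noisy_flow_latents, timestep):
        with torch.no_grad():
            # Joint text + image conditioning from the frozen VLM.
            cond_tokens = self.vlm(images=images, tokens=instruction_tokens)
            # Frozen VAE latents of the current observation(s).
            image_latents = self.vae.encode(images)
        cond_tokens = self.proj(cond_tokens)
        # The DiT attends jointly over conditioning tokens, image latents,
        # and the noisy future-flow latents across all output frames.
        return self.dit(noisy_flow_latents, cond_tokens, image_latents, timestep)
```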


Training on web videos with camera-compensated optical flow
The main model is trained on web-scale human activity videos with paired captions. The research team uses the Something-Something V2 dataset and the EgoDex egocentric manipulation dataset, yielding nearly 500,000 video-caption pairs.
Training uses an end-to-end flow matching objective in latent space. Ground-truth optical flow sequences are first computed offline, then VAE-encoded and used as targets in the DiT's flow matching loss. During training the method applies classifier-free guidance dropout to both the text and visual conditions and randomly masks frames and views to improve robustness.
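Training code is not reproduced in the article, so the following is a generic latent flow matching step with random condition dropout that matches the described objective only at a sketch level; the model signature, the 10 percent dropout rate, and the masking details are assumptions.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, flow_latents, text_cond, visual_cond, p_drop=0.1):
    """One latent flow matching training step (sketch).

    flow_latents are the VAE-encoded future optical flow frames. The model
    predicts a velocity field; text and visual conditions are randomly
    dropped so classifier-free guidance can be used at sampling time.
    """
    b = flow_latents.size(0)
    x1 = flow_latents
    x0 = torch.randn_like(x1)                         # noise endpoint
    t = torch.rand(b, device=x1.device)               # per-sample time in [0, 1]
    t_b = t.view(b, *([1] * (x1.dim() - 1)))

    xt = (1.0 - t_b) * x0 + t_b * x1                  # linear interpolation path
    target_velocity = x1 - x0

    # Random condition dropout for classifier-free guidance.
    drop_text = torch.rand(b, device=x1.device) < p_drop
    drop_visual = torch.rand(b, device=x1.device) < p_drop

    pred_velocity = model(xt, t, text_cond, visual_cond,
                          drop_text=drop_text, drop_visual=drop_visual)
    return F.mse_loss(pred_velocity, target_velocity)
```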
An important contribution is the relative optical flow computation used to create clean training targets from egocentric videos. For each frame pair the method:
- computes dense optical flow with an off-the-shelf estimator,
- estimates camera motion with a homography using depth features,
- uses projective geometry to remove the camera motion and recover the flow vectors relative to the central object,
- filters frame pairs by keeping those whose top-k percentile flow magnitude exceeds a threshold, focusing training on segments with meaningful motion.
These steps run offline at low resolution for efficiency, and the final targets are then recomputed at the original resolution. Ablation studies show that static frame targets or raw flow without camera motion removal hurt downstream performance, while the camera-compensated relative flow targets give the best results. A simplified sketch of this pipeline follows.
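The exact ego-motion compensation pipeline is not released with the article, so the OpenCV sketch below illustrates the idea with ORB feature matching and a RANSAC homography rather than the depth-feature-based estimate the text mentions; the function names, matcher choice, and thresholds are all assumptions.

```python
import cv2
import numpy as np

def camera_flow_from_homography(H, height, width):
    """Flow induced purely by the estimated camera motion: where each pixel
    would move if the scene were static and transformed by homography H."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    pts = np.stack([xs.ravel(), ys.ravel()], axis=-1).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(pts, H).reshape(height, width, 2)
    return warped - np.stack([xs, ys], axis=-1)

def relative_flow(frame_a, frame_b, raw_flow, top_pct=90.0, motion_thresh=1.0):
    """Subtract estimated camera motion from a precomputed raw flow field and
    report whether the residual motion is large enough to keep the pair."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY) if frame_a.ndim == 3 else frame_a
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY) if frame_b.ndim == 3 else frame_b

    # Match features between the two frames and fit a homography with RANSAC.
    orb = cv2.ORB_create(2000)
    kp_a, des_a = orb.detectAndCompute(gray_a, None)
    kp_b, des_b = orb.detectAndCompute(gray_b, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    h, w = raw_flow.shape[:2]
    rel = raw_flow - camera_flow_from_homography(H, h, w)

    # Keep the pair only if its top-percentile motion exceeds the threshold.
    mag = np.linalg.norm(rel, axis=-1)
    keep = np.percentile(mag, top_pct) > motion_thresh
    return rel, keep
```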


Language-driven robotic manipulation
The first downstream use case is robot control. FOFPred is fine-tuned on robot caption data to predict future optical flow for both fixed and wrist-mounted cameras. On top of FOFPred, the research team attaches a diffusion policy head that takes the predicted flow and the robot state and outputs continuous actions. This setup follows prior diffusion policy work but uses predicted future optical flow instead of predicted RGB frames as the intermediate representation.
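Neither the policy head architecture nor its hyperparameters are given in the article, so this is only a minimal sketch of a denoising policy head conditioned on pooled flow features and the robot state; every size here (flow_dim, state_dim, act_dim, horizon) is an assumption.

```python
import torch
import torch.nn as nn

class FlowConditionedPolicy(nn.Module):
    """Sketch of a diffusion-style policy head conditioned on predicted flow.
    The MLP denoiser and all dimensions below are illustrative only."""

    def __init__(self, flow_dim=512, state_dim=14, act_dim=7, horizon=16, hidden=512):
        super().__init__()
        self.flow_enc = nn.Sequential(nn.Linear(flow_dim, hidden), nn.SiLU())
        self.state_enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.SiLU())
        self.time_emb = nn.Sequential(nn.Linear(1, hidden), nn.SiLU())
        self.denoiser = nn.Sequential(
            nn.Linear(act_dim * horizon + 3 * hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, act_dim * horizon),
        )
        self.act_dim, self.horizon = act_dim, horizon

    def forward(self, noisy_actions, t, flow_feat, robot_state):
        # noisy_actions: (B, horizon, act_dim); flow_feat: pooled FOFPred features.
        cond = torch.cat([self.flow_enc(flow_feat),
                          self.state_enc(robot_state),
                          self.time_emb(t.unsqueeze(-1).float())], dim=-1)
        x = torch.cat([noisy_actions.flatten(1), cond], dim=-1)
        eps = self.denoiser(x)
        return eps.view(-1, self.horizon, self.act_dim)   # predicted noise
```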
On the CALVIN ABCD benchmark, which evaluates long-horizon, zero-shot chains of 5 language-instructed manipulation tasks, FOFPred achieves an average chain length of 4.48. VPP reaches 4.33 and DreamVLA reaches 4.44 under the same protocol. FOFPred also achieves a Task 5 success rate of 78.7 percent, the best among reported methods. In the low-data setting with 10 percent of the CALVIN demonstrations, FOFPred still reaches an average chain length of 3.43, higher than 3.25 for VPP.
On RoboTwin 2.0, a dual-arm manipulation benchmark with 5 tasks that require both arms, FOFPred achieves an average success rate of 68.6 percent. The VPP baseline reaches 61.8 percent under comparable training settings. FOFPred improves the success rate on every task in the evaluated subset.


Motion-controlled text-to-video generation
The second downstream use case is motion control for text-to-video generation. The research team builds a two-stage pipeline by connecting FOFPred to the Go-with-the-Flow video diffusion model. FOFPred takes the first frame and the language description of the motion, predicts a sequence of future flow frames, and converts them into a dense motion field. Go-with-the-Flow then uses this motion field and the first frame to synthesize the final video, enforcing the specified motion pattern.
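The interface between FOFPred and Go-with-the-Flow is described only in prose, so the sketch below hides both stages behind placeholder callables and shows one naive way to chain per-frame flow into dense pixel trajectories; all names and the accumulation scheme are assumptions.

```python
import numpy as np

def flows_to_trajectories(flow_frames):
    """Chain per-frame flow fields into cumulative pixel trajectories,
    a naive stand-in for building the dense motion field (it does not
    resample the flow at the displaced positions)."""
    h, w, _ = flow_frames[0].shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    pos = np.stack([xs, ys], axis=-1)                 # (H, W, 2) pixel positions
    traj = [pos.copy()]
    for flow in flow_frames:
        pos = pos + flow                              # accumulate displacement
        traj.append(pos.copy())
    return np.stack(traj)                             # (T + 1, H, W, 2)

def text_to_video(first_frame, instruction, fofpred, go_with_the_flow):
    """Two-stage pipeline sketch: predict flow, then render the video.
    `fofpred` and `go_with_the_flow` are placeholder callables, not
    released APIs."""
    flow_frames = fofpred(images=[first_frame], instruction=instruction)
    motion_field = flows_to_trajectories(flow_frames)
    return go_with_the_flow(first_frame=first_frame, motion=motion_field,
                            prompt=instruction)
```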
On the motion-heavy Something-Something V2 benchmark, the FOFPred plus Go-with-the-Flow pipeline outperforms the CogVideoX baseline under the same conditions. The method achieves SSIM 68.4, PSNR 22.26, LPIPS 28.5, FVD 75.39, KVD 11.38, and motion fidelity 0.662, consistently better than CogVideoX across metrics. Importantly, FOFPred needs only a language instruction and a single frame at inference, while several controllable video generation frameworks require masks, object annotations, or trajectories as additional inputs.


Key Takeaways
- FOFPred reframes motion prediction as language-driven future optical flow prediction, generating 4 dense flow frames from one or more current images and a text instruction, providing a compact, appearance-free motion representation for downstream tasks.
- The model uses a unified VLM and diffusion core, with Qwen2.5-VL as a frozen vision language encoder, the Flux.1 VAE as a frozen image and flow encoder, and an OmniGen-style DiT as the only trained component, equipped with spatio-temporal RoPE-based attention.
- Training relies on large-scale web and egocentric video from Something-Something V2 and EgoDex, and builds relative optical flow targets by estimating ego-motion with a homography, removing camera-induced flow, and filtering for high-motion segments, which substantially improves downstream performance.
- For robot manipulation, FOFPred serves as the motion backbone for a diffusion policy head and achieves state-of-the-art or better results on CALVIN ABCD and RoboTwin 2.0, including a 4.48 average chain length on CALVIN and 68.6 percent average success on RoboTwin, outperforming the reported VPP and DreamVLA results.
- For text-to-video generation, connecting FOFPred to Go-with-the-Flow yields better SSv2 metrics than CogVideoX, with higher SSIM and PSNR, lower FVD and KVD, and improved motion fidelity, while requiring only a language instruction and one frame at inference, making FOFPred a reusable motion controller for both robotics and video synthesis pipelines.
Check out the Paper, Model, and Repo for more details.

Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.



