
Physical Intelligence Team Unveils MEM for Robots: A Multiscale Memory System That Gives Gemma 3-4B VLAs 15-Minute Context for Complex Tasks

Current robotics policies, especially Vision-Language-Action (VLA) models, often act on a single observation or a very short history. This ‘memory deficit’ makes long-horizon tasks, such as cleaning a kitchen or following a complex recipe, either computationally intractable or failure-prone. To address this, researchers from Physical Intelligence, Stanford, UC Berkeley, and MIT have presented Multi-Measurement Memory (MEM).

Dual-Scale Memory Architecture

MEM divides the robot’s memory into two different scales to balance semantic context against real-time control constraints.
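The dual-scale split can be sketched as a simple data structure: a bounded buffer of dense recent frames alongside an unbounded list of cheap language events. This is an illustrative sketch only; the class, method names, and buffer size are assumptions, not the paper's implementation.

```python
from collections import deque

class DualScaleMemory:
    """Illustrative sketch of MEM's two memory scales (names are hypothetical)."""

    def __init__(self, max_frames=16):
        # Short-term: dense visual observations, bounded to the last max_frames.
        self.frames = deque(maxlen=max_frames)
        # Long-term: compressed language summaries of semantic events.
        self.events = []

    def observe(self, frame):
        self.frames.append(frame)   # oldest frames fall off automatically

    def summarize(self, event):
        self.events.append(event)   # grows per semantic event, not per frame

mem = DualScaleMemory()
for t in range(40):
    mem.observe(f"frame_{t}")
mem.summarize("put away three bowls")
```

The point of the split: visual detail is expensive but only needed for the recent past, while language events are cheap enough to keep for the whole 15-minute horizon.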

(1) Short-Term Video Memory

For tasks that require fine-grained spatial awareness, such as operating a latch or adjusting a grasp, dense visual data is required. MEM uses an efficient video encoder that extends Vision Transformers (ViTs). To stay within a real-time latency budget of 380 ms, the architecture avoids full attention across all patches of all frames. Instead, it uses Space-Time Separable Attention: spatial attention within each frame, interleaved with causal temporal attention across frames every fourth layer.

This reduces the computational complexity from O(n²K²) to O(Kn² + nK²), where n is the number of patches per frame and K the number of time steps. By discarding tokens from previous time steps in higher layers, the model passes only the representation of the current view to the VLA backbone, keeping the token count constant relative to single-frame models.
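A minimal NumPy sketch of the separable pattern, assuming standard scaled dot-product attention (a stand-in, not the actual encoder): spatial attention mixes patches within each frame, and causal temporal attention mixes each patch position across frames, so no score matrix ever reaches the full (nK)×(nK) size.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(x):
    # x: (K, n, d) — each frame's n patches attend only within their own frame.
    w = softmax(x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1]))   # (K, n, n) scores
    return w @ x

def temporal_attention(x):
    # x: (K, n, d) — each patch position attends causally across the K frames.
    K, n, d = x.shape
    xt = x.transpose(1, 0, 2)                        # (n, K, d)
    scores = xt @ xt.swapaxes(-1, -2) / np.sqrt(d)   # (n, K, K) scores
    mask = np.triu(np.full((K, K), -np.inf), k=1)    # causal: no attending to future frames
    out = softmax(scores + mask) @ xt
    return out.transpose(1, 0, 2)

K, n, d = 16, 64, 8
x = np.random.randn(K, n, d)
y = temporal_attention(spatial_attention(x))
# Full space-time attention scores cost O(n^2 K^2); separable costs O(K n^2 + n K^2).
print(y.shape, (n * K) ** 2, K * n ** 2 + n * K ** 2)
```

For these toy sizes the score work drops from (64·16)² = 1,048,576 to 16·64² + 64·16² = 81,920 entries per layer, which is the asymptotic saving the paragraph above describes.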

(2) Long-Term Language Memory

To handle tasks of up to 15 minutes, MEM uses a language-based representation of semantic events. The system decomposes action prediction as:

$$\pi(a_t, m_{t+1}, l_{t+1} \mid o_t, m_t) = \pi_{HL}(m_{t+1}, l_{t+1} \mid o_t, m_t)\,\pi_{LL}(a_t \mid o_t, l_{t+1})$$

Here, the high-level policy (πHL) maintains a working language summary (mt) of past events and generates subtask instructions (lt+1) for a low-level policy (πLL). This language memory is trained on LLM-generated summaries that compress information (e.g., ‘I put away three bowls’ instead of per-object attributes), which reduces the risk of shifting the training target distribution.
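The decomposition above can be sketched as a two-level control loop. The function names and string formats here are purely illustrative stand-ins for πHL and πLL, not the paper's models.

```python
def high_level_policy(obs, memory):
    # Stand-in for pi_HL: summarize what just happened into the language memory
    # and emit the next subtask instruction l_{t+1}.
    event = f"saw {obs}"
    new_memory = memory + [event]     # compressed language history, not pixels
    instruction = f"handle {obs}"     # next subtask instruction
    return new_memory, instruction

def low_level_policy(obs, instruction):
    # Stand-in for pi_LL: conditions only on the current view and the instruction,
    # never on the full history — that is what keeps it real-time.
    return f"action({obs}, {instruction})"

memory, trace = [], []
for obs in ["bowl", "pan", "counter"]:
    memory, instr = high_level_policy(obs, memory)
    trace.append(low_level_policy(obs, instr))
```

The design point is that only the compact language memory crosses the time horizon; the low-level policy sees a fixed-size input at every control step.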

Implementation and Operation

The research team built MEM on top of the π0.6 VLA, which is initialized from the pretrained Gemma 3-4B model. The model was pre-trained on a diverse mix of robot demonstrations, vision-language tasks, and Internet video data.

Key Results:

  • Context adaptation: MEM enables robots to adapt manipulation strategies based on recent failures. In evaluations, this led to a +62% increase in success rate when opening refrigerators with unknown door directions and a +11% increase when picking up objects at varying heights.
  • Long-Horizon Tasks: The model successfully performed 15-minute tasks such as ‘Recipe Setup’ (retrieving ingredients from multiple locations) and ‘Kitchen Cleaning’ (washing dishes and wiping counters). Memory-less VLAs fail outright at these tasks.
  • Real-time performance: The video encoder lets the model process up to 16 frames of history (spanning ~1 minute) while staying within its real-time latency constraints on a single NVIDIA H100 GPU.
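Taking the figures above at face value, a quick back-of-envelope check (our arithmetic, not the paper's) shows how sparsely the short-term buffer samples time relative to the per-step latency budget:

```python
# Assumed numbers from the article: 16 frames spanning ~60 s, 380 ms latency budget.
frames, span_s, latency_budget_s = 16, 60.0, 0.380
stride_s = span_s / frames   # gap between sampled frames in the short-term buffer
print(f"stride ≈ {stride_s:.2f} s/frame vs. a {latency_budget_s * 1000:.0f} ms per-step budget")
```

In other words, the visual buffer is sampled roughly every 3.75 s, an order of magnitude coarser than the control-step budget, which is why 16 frames suffice to cover a minute of context.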

MEM shows that combining dense, short-term visual tokens with compressed, long-term linguistic abstractions allows VLAs to scale their ‘working memory’ without incurring prohibitive computational costs.


Check out the Paper for more technical details.

