
Physical Intelligence Team Unveils MEM for Robots: A Multiscale Memory System That Gives Gemma 3-4B VLAs 15-Minute Context for Complex Tasks

Current robotics policies, especially Vision-Language-Action (VLA) models, often act on a single observation or a very short history. This ‘memory deficit’ makes long-horizon tasks, such as cleaning a kitchen or following a complex recipe, either computationally intractable or failure-prone. To address this, researchers from Physical Intelligence, Stanford, UC Berkeley, and MIT have presented Multi-Measurement Memory (MEM).

Dual-Scale Memory Architecture

MEM divides the robot’s memory into two different scales to balance semantic context against real-time control constraints.
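The dual-scale split can be sketched as a simple data structure: a bounded buffer of dense recent frames alongside an unbounded list of cheap language events. This is an illustrative sketch only; the class, method names, and buffer size are assumptions, not the paper's implementation.

```python
from collections import deque

class DualScaleMemory:
    """Illustrative sketch of MEM's two memory scales (names are hypothetical)."""

    def __init__(self, max_frames=16):
        # Short-term: dense visual observations, bounded to the last max_frames.
        self.frames = deque(maxlen=max_frames)
        # Long-term: compressed language summaries of semantic events.
        self.events = []

    def observe(self, frame):
        self.frames.append(frame)   # oldest frames fall off automatically

    def summarize(self, event):
        self.events.append(event)   # grows per semantic event, not per frame

mem = DualScaleMemory()
for t in range(40):
    mem.observe(f"frame_{t}")
mem.summarize("put away three bowls")
```

The point of the split: visual detail is expensive but only needed for the recent past, while language events are cheap enough to keep for the whole 15-minute horizon.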

(1) Short-Term Video Memory

For tasks that require fine-grained spatial awareness, such as operating a latch or adjusting a grasp, dense visual data is required. MEM uses an efficient video encoder that extends Vision Transformers (ViTs). To stay within a real-time latency budget of 380 ms, the architecture avoids full attention across all patches of all frames. Instead, it uses Space-Time Separable Attention: spatial attention within each frame, interleaved with causal temporal attention across frames every fourth layer.

This reduces the computational complexity from O(n²K²) to O(Kn² + nK²), where n is the number of patches per frame and K the number of time steps. By discarding tokens from previous time steps in higher layers, the model passes only the representation of the current view to the VLA backbone, keeping the token count constant relative to single-frame models.
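A minimal NumPy sketch of the separable pattern, assuming standard scaled dot-product attention (a stand-in, not the actual encoder): spatial attention mixes patches within each frame, and causal temporal attention mixes each patch position across frames, so no score matrix ever reaches the full (nK)×(nK) size.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(x):
    # x: (K, n, d) — each frame's n patches attend only within their own frame.
    w = softmax(x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1]))   # (K, n, n) scores
    return w @ x

def temporal_attention(x):
    # x: (K, n, d) — each patch position attends causally across the K frames.
    K, n, d = x.shape
    xt = x.transpose(1, 0, 2)                        # (n, K, d)
    scores = xt @ xt.swapaxes(-1, -2) / np.sqrt(d)   # (n, K, K) scores
    mask = np.triu(np.full((K, K), -np.inf), k=1)    # causal: no attending to future frames
    out = softmax(scores + mask) @ xt
    return out.transpose(1, 0, 2)

K, n, d = 16, 64, 8
x = np.random.randn(K, n, d)
y = temporal_attention(spatial_attention(x))
# Full space-time attention scores cost O(n^2 K^2); separable costs O(K n^2 + n K^2).
print(y.shape, (n * K) ** 2, K * n ** 2 + n * K ** 2)
```

For these toy sizes the score work drops from (64·16)² = 1,048,576 to 16·64² + 64·16² = 81,920 entries per layer, which is the asymptotic saving the paragraph above describes.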

(2) Long-Term Language Memory

To handle tasks of up to 15 minutes, MEM uses a language-based representation of semantic events. The system decomposes action prediction as:

$$\pi(a_t, m_{t+1}, l_{t+1} \mid o_t, m_t) = \pi_{HL}(m_{t+1}, l_{t+1} \mid o_t, m_t)\,\pi_{LL}(a_t \mid o_t, l_{t+1})$$

Here, the high-level policy (πHL) maintains a working language summary (mt) of past events and generates subtask instructions (lt+1) for a low-level policy (πLL). This language memory is trained on LLM-generated summaries that compress information (e.g., ‘I put away three bowls’ instead of per-object attributes), which reduces the risk of shifting the training target distribution.
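The decomposition above can be sketched as a two-level control loop. The function names and string formats here are purely illustrative stand-ins for πHL and πLL, not the paper's models.

```python
def high_level_policy(obs, memory):
    # Stand-in for pi_HL: summarize what just happened into the language memory
    # and emit the next subtask instruction l_{t+1}.
    event = f"saw {obs}"
    new_memory = memory + [event]     # compressed language history, not pixels
    instruction = f"handle {obs}"     # next subtask instruction
    return new_memory, instruction

def low_level_policy(obs, instruction):
    # Stand-in for pi_LL: conditions only on the current view and the instruction,
    # never on the full history — that is what keeps it real-time.
    return f"action({obs}, {instruction})"

memory, trace = [], []
for obs in ["bowl", "pan", "counter"]:
    memory, instr = high_level_policy(obs, memory)
    trace.append(low_level_policy(obs, instr))
```

The design point is that only the compact language memory crosses the time horizon; the low-level policy sees a fixed-size input at every control step.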

Implementation and Operation

The research team built MEM on top of the π0.6 VLA, which is initialized from the pretrained Gemma 3-4B model. The model was pre-trained on a diverse mix of robot demonstrations, vision-language tasks, and Internet video data.

Key Results:

  • Context adaptation: MEM enables robots to adapt manipulation strategies based on recent failures. In evaluations, this led to a +62% increase in success rate when opening refrigerators with unknown door directions and a +11% increase when picking up objects at varying heights.
  • Long-Horizon Tasks: The model successfully performed 15-minute tasks such as ‘Recipe Setup’ (retrieving ingredients from multiple locations) and ‘Kitchen Cleaning’ (washing dishes and wiping counters). Memory-less VLAs fail outright at these tasks.
  • Real-time performance: The video encoder lets the model process up to 16 frames of history (spanning ~1 minute) while staying within its real-time latency constraints on a single NVIDIA H100 GPU.
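Taking the figures above at face value, a quick back-of-envelope check (our arithmetic, not the paper's) shows how sparsely the short-term buffer samples time relative to the per-step latency budget:

```python
# Assumed numbers from the article: 16 frames spanning ~60 s, 380 ms latency budget.
frames, span_s, latency_budget_s = 16, 60.0, 0.380
stride_s = span_s / frames   # gap between sampled frames in the short-term buffer
print(f"stride ≈ {stride_s:.2f} s/frame vs. a {latency_budget_s * 1000:.0f} ms per-step budget")
```

In other words, the visual buffer is sampled roughly every 3.75 s, an order of magnitude coarser than the control-step budget, which is why 16 frames suffice to cover a minute of context.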

MEM shows that combining dense, short-term visual tokens with compressed, long-term linguistic abstractions allows VLAs to scale their ‘working memory’ without incurring prohibitive computational costs.


Check out the Paper for more technical details.

