Liquid AI Releases LFM2.5-1.2B-Thinking: A 1.2B-Parameter Thinking Model That Fits in Under 1 GB On-Device

Liquid AI has released LFM2.5-1.2B-Thinking, a 1.2 billion parameter thinking model that runs fully on device and fits in roughly 900 MB of memory on a modern phone. What required a data center two years ago can now run offline on consumer hardware, with a focus on structured reasoning traces, tool use, and agentic tasks rather than open-ended conversation.
Position in the LFM2.5 family and core specs
LFM2.5-1.2B-Thinking is part of the LFM2.5 family of Liquid Foundation Models, which extends the previous LFM2 architecture with additional pre-training and multi-stage reinforcement learning for edge use.
The model is text-only and general-purpose, with the following configuration (a loading sketch follows the list):
- 1.17B parameters, reported as a 1.2B class model
- 16 layers: 10 double-gated LIV convolution blocks and 6 GQA blocks
- Training budget of 28T tokens
- Context length of 32,768 tokens
- Vocabulary size 65,536
- 8 languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish
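For orientation, here is a minimal sketch of how a model with this configuration could be loaded and queried through Hugging Face Transformers. The repository id and the requirement for a recent Transformers release with LFM2 support are assumptions, not details confirmed in the release notes.

```python
# Minimal sketch: loading the model with Hugging Face Transformers.
# The repo id "LiquidAI/LFM2.5-1.2B-Thinking" is an assumption based on the model
# name; check the Liquid AI organization on Hugging Face for the exact id, and use
# a recent Transformers release that includes LFM2 architecture support.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-1.2B-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")

# Build a chat prompt with the model's chat template and generate a response.
messages = [{"role": "user", "content": "What is 17 * 24? Show your reasoning."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```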
Thinking behavior and intended use cases
The ‘Thinking’ variant is trained to reason explicitly. During generation it produces an internal reasoning trace before the final answer. These traces are a series of intermediate steps the model uses to plan tool calls, verify partial results, and work through multi-step instructions.
The Liquid AI team recommends this model for agentic workflows, data extraction pipelines, and retrieval augmented generation flows where you want transparent reasoning and verifiable intermediate steps. A practical way to think about it is to use LFM2.5-1.2B-Thinking as the reasoning brain inside agents and tools, and to reach for other models when you need broad world knowledge or heavy code generation.
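Since the reasoning trace is the part you may want to log or audit separately from the answer, a small helper like the one below can split the two. This is a sketch that assumes the trace is wrapped in <think>...</think> delimiters, a common convention for thinking models; LFM2.5's actual special tokens may differ, so adjust the pattern to the model's chat template.

```python
import re

# Hypothetical helper: split a thinking model's raw output into the internal
# reasoning trace and the final answer. The <think>...</think> delimiters are an
# assumption; replace them with whatever markers LFM2.5's chat template uses.
def split_reasoning(raw_output: str) -> tuple[str, str]:
    match = re.search(r"<think>(.*?)</think>", raw_output, flags=re.DOTALL)
    if match is None:
        return "", raw_output.strip()
    trace = match.group(1).strip()
    answer = raw_output[match.end():].strip()
    return trace, answer

trace, answer = split_reasoning("<think>17 * 24 = 340 + 68 = 408</think>The answer is 408.")
print(answer)  # -> The answer is 408.
```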
Benchmarks against other 1B class models
The Liquid AI team evaluates LFM2.5-1.2B-Thinking against models of around 1B parameters on a set of reasoning and instruction following benchmarks.

Compared with LFM2.5-1.2B-Instruct, three metrics improve significantly: mathematical reasoning rises from approximately 63 to 88 on MATH 500, instruction following rises from approximately 61 to 69 on Multi-IF, and tool use rises from approximately 49 to 57 on BFCLv3.
LFM2.5-1.2B-Thinking is competitive with Qwen3-1.7B in thinking mode on most reasoning benchmarks while using about 40 percent fewer parameters and fewer output tokens on average. It also outperforms other 1B class baselines such as Granite-4.0-H-1B, Granite-4.0-1B, Gemma-3-1B-IT, and Llama-3.2-1B-Instruct on several of these tasks.
Training recipe and doom loop reduction
Thinking models often suffer from doom looping, where the model repeats pieces of its chain of thought instead of completing the answer. LFM2.5-1.2B-Thinking uses a multi-stage training pipeline to reduce this.
The process starts with mid-training that injects reasoning traces so the model learns the reason-first-then-respond pattern. This is followed by supervised fine-tuning on chain of thought data. After that, preference alignment and RLVR are applied. For preference alignment, the team generates 5 temperature-sampled candidates and 1 greedy candidate per prompt and uses an LLM judge to select preferred and rejected outputs, explicitly marking looping outputs as rejected. During RLVR they add an n-gram repetition penalty at the start of training. This reduced the doom loop rate from 15.74 percent earlier in training to 0.36 percent after RLVR on a set of representative prompts.
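The release does not publish the exact reward shaping, but the n-gram repetition penalty can be pictured as a simple term subtracted from the verifiable task reward. The sketch below is illustrative only: the n-gram size and the penalty weight are assumptions.

```python
from collections import Counter

def ngram_repetition_penalty(token_ids: list[int], n: int = 4) -> float:
    """Fraction of n-gram windows in a completion that repeat an earlier window."""
    if len(token_ids) < n:
        return 0.0
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values() if c > 1)
    return repeated / len(ngrams)

def shaped_reward(task_reward: float, token_ids: list[int], weight: float = 1.0) -> float:
    # Verifiable task reward minus a weighted repetition penalty; the weight is a guess.
    return task_reward - weight * ngram_repetition_penalty(token_ids)
```

A looping completion that keeps emitting the same 4-gram windows gets a penalty close to 1, so its shaped reward drops sharply relative to a clean completion.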
The result is a small thinking model that can lay out its reasoning without getting stuck in long repetitive outputs, which matters for interactive agents and on-device UX.
Inference performance and hardware footprint
The main design target is fast inference with a small memory footprint on CPUs and NPUs. LFM2.5-1.2B-Thinking decodes at around 239 tokens per second on an AMD CPU and around 82 tokens per second on a mobile NPU, while running in less than 1 GB of memory, with day-one support for llama.cpp, MLX, and vLLM.
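For CPU inference, the GGUF builds run through llama.cpp and its Python bindings. The sketch below uses llama-cpp-python; the GGUF file name is a placeholder, so substitute the quantized file actually published for this model.

```python
# Minimal sketch: running a quantized GGUF build on CPU via llama-cpp-python.
# The model_path value is a placeholder file name, not the official artifact name.
from llama_cpp import Llama

llm = Llama(model_path="lfm2.5-1.2b-thinking-q4_k_m.gguf", n_ctx=4096, n_threads=8)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Plan the steps to extract due dates from an invoice."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```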
The detailed hardware table uses a 1K-token prefill and 100 decode tokens per run and reports figures for LFM2.5-1.2B-Thinking across devices.


These numbers show that the model fits well under 1 GB on phones and embedded devices while maintaining useful throughput even at longer context lengths.
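A rough back-of-envelope calculation shows why the sub-1 GB figure is plausible; the quantization width here is an assumption, not the published packaging.

```python
# Rough footprint estimate for a 1.17B parameter model (bits per weight is assumed).
params = 1.17e9           # reported parameter count
bits_per_weight = 5       # roughly a 4-5 bit GGUF quantization
weights_mb = params * bits_per_weight / 8 / 1e6
print(f"~{weights_mb:.0f} MB of weights")  # ~731 MB, leaving headroom for KV cache and runtime
```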
Key Takeaways
- LFM2.5-1.2B-Thinking is a 1.17B parameter thinking model with a 32,768-token context length that runs in under 1 GB of memory on phones and laptops.
- The model is designed for transparent reasoning traces, agentic workflows, data extraction, and RAG.
- It achieves strong scores for a 1B class model, for example 87.96 on MATH 500 and 85.60 on GSM8K, and is competitive with Qwen3-1.7B in thinking mode while using fewer parameters.
- The training pipeline uses mid-training with reasoning traces, supervised fine-tuning, preference alignment with 5 sampled candidates and one greedy candidate, and RLVR with an n-gram repetition penalty, reducing doom loops from 15.74 percent to 0.36 percent.
- The model runs well on AMD and Qualcomm CPUs and NPUs with runtimes such as llama.cpp, FastFlowLM, and NexaML, is available in GGUF, ONNX, and MLX formats, and can be loaded from Hugging Face for on-device use (a minimal MLX sketch follows this list).
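On Apple silicon, the MLX builds can be driven through mlx-lm. This is a sketch under assumptions: the repository id below is a placeholder for whatever MLX-format repo Liquid AI publishes.

```python
# Minimal sketch: loading an MLX build via mlx-lm (pip install mlx-lm).
# The repo id below is a placeholder, not a confirmed repository name.
from mlx_lm import load, generate

model, tokenizer = load("LiquidAI/LFM2.5-1.2B-Thinking-MLX")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize this receipt into JSON fields."}],
    add_generation_prompt=True,
    tokenize=False,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```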
Hosting/Delivery Providers
You can access or host the model through the following providers and platforms:
Cloud & API Providers
Model Repositories (Self-hosted)
If you want to run the model in your own environment or infrastructure, the weights are available in multiple formats, including GGUF, ONNX, and MLX.



