Models Compression - Search News

Context compression finally works in production: new research cuts LLM input 16x without the accuracy hit

LCLMs compress LLM context before decode — 8.8x faster at 16x compression, beating every KV cache method tested. Open-sourced by NYU and Columbia.

Morning Overview on MSN

Google unveiled TurboQuant, a method that cuts the memory bottleneck slowing large AI models

Companies running large language models face a persistent bottleneck: the memory consumed by key-value caches during ...

The Manila Times

Nota AI Has Two MoE Quantization Papers Accepted at ICML 2026 Workshop, Demonstrating Global Competitiveness in Large-Scale AI Optimization

Two papers on MoE-specific quantization algorithms accepted at a workshop held in conjunction with ICML 2026Recognition ...

scmp.com

DeepSeek unveils multimodal AI model that uses visual perception to compress text input

DeepSeek on Monday released a new multimodal artificial intelligence model that can handle large and complex documents with significantly fewer tokens – the smallest unit of text that a model ...

Ars Technica

Running local models on Macs gets faster with Ollama’s MLX support

Ollama, a runtime system for operating large language models on a local computer, has introduced support for Apple’s open source MLX framework for machine learning. Additionally, Ollama says it has ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results