LLM quantization reduces the memory and compute cost of large language models while preserving as much of their accuracy as possible. Recent methods such as ReSpinQuant and MUXQ use layer-wise adaptation and mixed precision to handle activation outliers and improve inference efficiency. They report high accuracy at low overhead, which matters to developers deploying LLMs in resource-constrained environments. As demand for efficient inference grows, these strategies help make LLMs practical across platforms, including edge devices.
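To make the activation-outlier problem mentioned above concrete, here is a minimal sketch in plain Python. It is not taken from ReSpinQuant or MUXQ. It assumes simple symmetric per-tensor absmax quantization to 4 bits, and the names `quantize_absmax` and `mean_squared_error`, the synthetic Gaussian activations, and the outlier value 60.0 are all illustrative assumptions.

```python
import random


def quantize_absmax(values, bits=4):
    """Symmetric per-tensor quantization: snap to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        return list(values)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_squared_error(reference, approximation):
    return sum((r - a) ** 2 for r, a in zip(reference, approximation)) / len(reference)


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [rng.gauss(0.0, 1.0) for _ in range(256)]  # synthetic, illustrative data
with_outlier = activations + [60.0]  # one hypothetical outlier channel

# Error on the ordinary values when they are quantized on their own.
err_clean = mean_squared_error(activations, quantize_absmax(activations))

# Error on the same ordinary values when the outlier sets the shared scale.
dequantized = quantize_absmax(with_outlier)
err_outlier = mean_squared_error(activations, dequantized[:-1])

print(f"MSE without outlier: {err_clean:.4f}")
print(f"MSE with outlier:    {err_outlier:.4f}")
```

The single large value stretches the shared scale, so most ordinary activations collapse onto very few grid points. Mixed-precision and rotation-based approaches are ways of avoiding that shared-scale penalty.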
The MXFP4 microscaling format, which partitions tensors into blocks of 32 elements sharing an E8M0 scaling factor, has emerged as a promising substrate for efficient LLM inference, backed by native ha...
As large language models continue to scale, low-bit weight-only post-training quantization (PTQ) offers a practical solution to their memory-efficient deployment. Although block-wise PTQ is capable of...
Large language models (LLMs) have achieved outstanding performance across a wide range of natural language processing tasks, but their enormous parameter counts impose substantial memory and computatio...
Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achi...
We present Bielik-Q2-Sharp, the first systematic academic evaluation of extreme 2-bit quantization applied to a Polish large language model. Using Bielik-11B-v2.3-Instruct (11B parameters, Mistral arc...
Quantization techniques such as BitsAndBytes, AWQ, and GPTQ are widely used as standard methods for deploying large language models but often degrade accuracy when using low-bit representations, e.g....
Post-training quantization (PTQ) is essential for deploying LLMs under memory and bandwidth constraints. However, extreme low-bit quantization remains highly sensitive to activation outliers and aniso...
We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-Edu, and ask which input-activation sites dominate the error....
This technical note revisits the relationship between RaBitQ and TurboQuant under a unified comparison framework. We compare the two methods in terms of methodology, theoretical guarantees, and empiri...
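The MXFP4 layout described in the first abstract above (blocks of 32 elements sharing one E8M0 scale) can be sketched in plain Python as follows. This is a simplified illustration, not the method of any listed paper. It assumes the element format is FP4 E2M1 with magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, that the shared exponent is chosen as floor(log2(max |x|)) minus 2, and that ties round toward the smaller magnitude. The function names are hypothetical.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is stored separately).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # elements that share one scale
E2M1_EMAX = 2     # exponent of the largest power of two in E2M1 (4 = 2**2)


def quantize_block(block):
    """Return (shared_exponent, elements) for one block of up to 32 floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # E8M0 stores only a power-of-two exponent; clamp to its assumed range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    elements = []
    for v in block:
        magnitude = min(abs(v) / scale, FP4_E2M1_MAGNITUDES[-1])  # saturate at 6
        # Simplification: ties go to the smaller magnitude, not round-to-even.
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(m - magnitude))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements


def quantize_tensor(values):
    """Split a flat list into blocks of 32 and quantize each independently."""
    return [quantize_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]


def dequantize_tensor(blocks):
    return [e * 2.0 ** shared_exp for shared_exp, elements in blocks for e in elements]


# Two blocks whose values sit exactly on the grid at different scales.
base = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0] * 4
example = base + [-8.0 * v for v in base]
blocks = quantize_tensor(example)
restored = dequantize_tensor(blocks)
print([shared_exp for shared_exp, _ in blocks])
```

Because each block carries its own power-of-two scale, the second block (eight times larger in magnitude) is represented as precisely as the first. Values that fall between grid points are rounded, and values above 6 times the scale saturate.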
Canonical route: /topics
Agent Handoff
Canonical ID llm-quantization | Route /topic/llm-quantization
REST example
curl https://sciencetostartup.com/api/v1/agent-handoff/topic/llm-quantization
MCP example
{
  "tool": "search_papers",
  "arguments": {
    "query": "LLM Quantization",
    "cluster": "LLM Quantization"
  }
}
source_context
{
  "surface": "topic",
  "mode": "topic",
  "query": "LLM Quantization",
  "normalized_query": "llm-quantization",
  "route": "/topic/llm-quantization",
  "paper_ref": null,
  "topic_slug": "llm-quantization",
  "benchmark_ref": null,
  "dataset_ref": null
}
Use This Via API or MCP
Topic pages bundle paper counts, viability trends, author concentration, and top questions into one canonical surface your agents can reference before they open Signal Canvas or create a workspace.
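As a rough sketch of how an agent might assemble the REST URL and the MCP tool call shown on this page, the Python below rebuilds both from a topic name. The base URL, the `search_papers` tool name, and the argument keys are copied from the examples above. The helper names and the slug rule (lowercase, with non-alphanumeric runs replaced by hyphens) are assumptions inferred from the single `normalized_query` example, and the response format is not shown on this page, so no request is sent.

```python
import json
import re

# Base path taken from the REST example on this page.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/topic"


def normalize_query(query):
    """Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", query.lower()).strip("-")


def handoff_url(query):
    """Build the agent-handoff URL for a topic name."""
    return f"{BASE_URL}/{normalize_query(query)}"


def search_papers_call(query):
    """Mirror the MCP example payload; using the query as the cluster follows that example."""
    return {
        "tool": "search_papers",
        "arguments": {"query": query, "cluster": query},
    }


url = handoff_url("LLM Quantization")
payload = search_papers_call("LLM Quantization")
print(url)
print(json.dumps(payload, indent=2))
```

Keeping the URL and payload construction in small pure functions lets an agent check them offline before making any network call.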