Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/rethinking-language-model-scaling-under-transferable-hypersphere-optimization

Proof freshness: stale
Proof status: verified
Display score: 7/10
Last proof check: 2026-03-31
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 22
Source count: 4
Coverage: 83%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.

Agent Handoff

Canonical ID: rethinking-language-model-scaling-under-transferable-hypersphere-optimization
Route: /signal-canvas/rethinking-language-model-scaling-under-transferable-hypersphere-optimization

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/rethinking-language-model-scaling-under-transferable-hypersphere-optimization
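The same request can be sketched in Python using only the standard library. The endpoint URL is taken from the curl line above; the `Accept` header and the assumption that the response body is JSON are not confirmed by this page, and the helper names are hypothetical.

```python
import json
import urllib.request

# Base path copied from the curl example on this page.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"
PAPER_REF = "rethinking-language-model-scaling-under-transferable-hypersphere-optimization"


def build_handoff_request(paper_ref: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for one paper's handoff record."""
    return urllib.request.Request(
        f"{BASE_URL}/{paper_ref}",
        method="GET",
        headers={"Accept": "application/json"},  # assumption: the API returns JSON
    )


def fetch_handoff(paper_ref: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the body, assuming it is a JSON object."""
    with urllib.request.urlopen(build_handoff_request(paper_ref), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_handoff(PAPER_REF)` performs the network request; `build_handoff_request` alone is enough to inspect the URL offline.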

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "rethinking-language-model-scaling-under-transferable-hypersphere-optimization",
    "query_text": "Summarize Rethinking Language Model Scaling under Transferable Hypersphere Optimization"
  }
}
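For callers that assemble this payload programmatically, a minimal Python sketch follows. The tool name and argument keys are copied from the example above; the helper `build_mcp_call` is hypothetical, and how the payload is delivered to an MCP server is left out because this page does not describe it.

```python
import json


def build_mcp_call(paper_ref: str, title: str) -> dict:
    """Assemble the search_signal_canvas tool call shown in the MCP example."""
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            "query_text": f"Summarize {title}",
        },
    }


payload = build_mcp_call(
    "rethinking-language-model-scaling-under-transferable-hypersphere-optimization",
    "Rethinking Language Model Scaling under Transferable Hypersphere Optimization",
)
print(json.dumps(payload, indent=2))
```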

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "Rethinking Language Model Scaling under Transferable Hypersphere Optimization",
  "normalized_query": "2603.28743",
  "route": "/signal-canvas/rethinking-language-model-scaling-under-transferable-hypersphere-optimization",
  "paper_ref": "rethinking-language-model-scaling-under-transferable-hypersphere-optimization",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Evidence Receipt

Route status: stale

Claims: 8

References: 22

Proof: Verified

Freshness state: stale

Source paper: Rethinking Language Model Scaling under Transferable Hypersphere Optimization

PDF: https://arxiv.org/pdf/2603.28743v1

Repository: https://github.com/microsoft/ArchScale

Source count: 4

Coverage: 83%

Last proof check: 2026-03-31T20:30:19.492Z

Signal Canvas receipt window

Ready for execution: Rethinking Language Model Scaling under Transferable Hypersphere Optimization

/buildability/rethinking-language-model-scaling-under-transferable-hypersphere-optimization

Build Now (ready)

Subject: Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Verdict

Build Now

Verdict is Build Now because viability and implementation proof cleared the Wave 1 scaffold thresholds.

GitHub Code Pulse (cached)

Stars: 333
Last commit: 2026-03-31
Health: no value shown
Forks: no value shown

Claim map

Strong: 8 | Mixed: 0 | Weak: 0

Evidence (partial): A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, yielding $1.58\times$ compute efficiency over a strong Muon baseline at $6\times10^{21}$ FLOPs.
Implication (partial): Directly stated in the abstract with a specific numeric result.
Verification: partial
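To make the headline number concrete, the arithmetic can be sketched as follows. This assumes that "$1.58\times$ compute efficiency" means the Muon baseline would need 1.58 times the FLOPs to reach the same loss; the page does not spell out the definition, and the function name is hypothetical.

```python
def equivalent_baseline_flops(method_flops: float, efficiency_gain: float) -> float:
    """FLOPs the baseline would need to match the method, under the assumed definition."""
    return method_flops * efficiency_gain


hyperp_flops = 6e21      # compute budget quoted in the claim
efficiency_gain = 1.58   # efficiency multiplier quoted in the claim

baseline_flops = equivalent_baseline_flops(hyperp_flops, efficiency_gain)
saved_fraction = 1.0 - 1.0 / efficiency_gain  # share of baseline compute avoided

print(f"baseline-equivalent FLOPs: {baseline_flops:.3e}")
print(f"fraction of baseline compute saved: {saved_fraction:.1%}")
```

Under that reading, the baseline-equivalent budget is about $9.48\times10^{21}$ FLOPs, so roughly 37% of the baseline's compute is avoided.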
Evidence (partial): HyperP delivers transferable stability: all monitored instability indicators, including $Z$-values, output RMS, and activation outliers, remain bounded and non-increasing under training FLOPs scaling.
Implication (partial): Explicitly claimed in the abstract with clear metrics listed.
Verification: partial
Evidence (partial): The authors prove that weight decay is a first-order no-op on the Frobenius sphere.
Implication (partial): Stated as a proven theoretical result in the abstract.
Verification: partial
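The intuition behind this claim can be checked numerically. The sketch below is not the paper's proof; it assumes an update of the form $W \leftarrow \mathrm{Proj}(W - \eta(U + \lambda W))$, where $\mathrm{Proj}$ rescales back to a fixed Frobenius norm and $U$ is an arbitrary update direction. Under that assumption the decay term only rescales $W$ before projection, which is equivalent to using the step size $\eta/(1-\eta\lambda) = \eta + O(\eta^2)$. All names are hypothetical, and the matrix is flattened to a vector because the Frobenius norm equals the Euclidean norm of the flattened entries.

```python
import math
import random


def project(w, radius):
    """Rescale w so its Euclidean (Frobenius) norm equals radius."""
    norm = math.sqrt(sum(x * x for x in w))
    return [radius * x / norm for x in w]


def step(w, u, lr, wd, radius):
    """One assumed update: decoupled weight decay, then projection to the sphere."""
    raw = [x - lr * (g + wd * x) for x, g in zip(w, u)]
    return project(raw, radius)


def gap(lr, wd=0.1, dim=64, seed=0):
    """Distance between the projected steps taken with and without weight decay."""
    rng = random.Random(seed)
    radius = 1.0
    w = project([rng.gauss(0, 1) for _ in range(dim)], radius)
    u = [rng.gauss(0, 1) for _ in range(dim)]
    with_wd = step(w, u, lr, wd, radius)
    without_wd = step(w, u, lr, 0.0, radius)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(with_wd, without_wd)))


# If weight decay is a no-op at first order, shrinking lr by 10x should
# shrink the gap by about 100x (second-order scaling).
print(gap(1e-3) / gap(1e-4))
```

A ratio near 100 indicates the gap scales with $\eta^2$, so the effect of weight decay vanishes at first order in the step size for this assumed update.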
Evidence (partial): The authors find that the optimal learning rate follows the same data-scaling power law with the "magic exponent" 0.32 previously observed for AdamW.
Implication (partial): Directly stated in the abstract with a specific numeric exponent.
Verification: partial
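A power law of this kind can be sketched as a one-line rescaling rule. The functional form $\eta(D) = \eta_0\,(D/D_0)^{-0.32}$, including the negative sign (learning rate decreasing as the token count $D$ grows), is an assumption made here for illustration; this page gives only the exponent. The function name and the example values are hypothetical.

```python
def scaled_lr(base_lr: float, base_tokens: float, target_tokens: float,
              exponent: float = 0.32) -> float:
    """Rescale a tuned learning rate to a new token budget (assumed power-law form)."""
    return base_lr * (target_tokens / base_tokens) ** (-exponent)


# Hypothetical example: a rate tuned on 1e9 tokens, reused at 10x the data.
tuned_lr = 0.01
print(scaled_lr(tuned_lr, base_tokens=1e9, target_tokens=1e10))
```

Under the assumed form, each 10x increase in training tokens multiplies the learning rate by $10^{-0.32}$, a bit less than half.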
Evidence (partial): The authors also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities for improved granularity scaling.
Implication (partial): Explicitly proposed in the abstract with a stated purpose.
Verification: partial
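One way to see why a square-root gate could preserve output RMS is sketched below. This is an illustrative reading, not the paper's definition of SqrtGate: it assumes the gate probabilities over the active experts sum to 1 and that expert outputs are uncorrelated with unit RMS, so the mixed output has RMS $\sqrt{\sum_i w_i^2}$ for mixing weights $w_i$. All function names are hypothetical.

```python
import math


def output_rms_scale(weights):
    """RMS of sum_i w_i * y_i when the y_i are uncorrelated with unit RMS."""
    return math.sqrt(sum(w * w for w in weights))


def plain_gate(probs):
    """Use the gate probabilities directly as mixing weights."""
    return list(probs)


def sqrt_gate(probs):
    """Assumed square-root variant: the squared weights sum to sum(probs) = 1."""
    return [math.sqrt(p) for p in probs]


# Uniform gates over k active experts, for increasing granularity k.
for k in (2, 8, 64):
    probs = [1.0 / k] * k
    print(k, output_rms_scale(plain_gate(probs)), output_rms_scale(sqrt_gate(probs)))
```

With plain gating and uniform probabilities the scale shrinks as $1/\sqrt{k}$, while the square-root weights keep it at 1 for every $k$ under these assumptions.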
Evidence (partial): The authors show that hypersphere optimization enables substantially larger auxiliary load-balancing weights, yielding both strong performance and good expert balance.
Implication (partial): Directly stated in the abstract as a benefit of the method.
Verification: partial
Evidence (partial): Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale.
Implication (partial): Directly stated as a limitation of prior work in the abstract.
Verification: partial
Evidence (partial): The authors show that Depth-$\mu$P remains necessary.
Implication (partial): Stated as a finding in the abstract, though the exact necessity is a conclusion.
Verification: partial

Competitive landscape

No named competitor graph is public yet; the page still exposes the segment, adoption evidence, and score state so the commercial read is not blank.

Segment: Research market
Adoption evidence: No public code link in the paper record yet
Commercial read: score refresh pending
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified
