ARXIV:2606.09019 · UNCATEGORIZED · SUBMITTED 09 JUN · 03:26 UTC · FRESHNESS FRESH

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

Yejin Lee · Junwon Moon · Hyoeun Kim · Hyunjin Choi · Heeseung Kim · Kyuhong Shim · arXiv

ScienceToStartup currently rates this 0.0/10 on the public viability pass. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global…

Blocked on Code · Score 0.0 · Evidence unverified

Opportunity summary

Pain customer pain not on file

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones.

METHOD

Full abstract

Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.
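
The sequence-length arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `group_into_patches` and `kv_cache_reduction`, the right-padding of the last patch with `pad_id`, and the assumption that the global KV cache holds one entry per backbone position are all hypothetical. The learned compressor, the LoRA-adapted backbone, and the speaker-conditioned extractor are not modeled; only the grouping of consecutive codec tokens and the resulting position count are shown.

```python
from typing import List


def group_into_patches(tokens: List[int], patch_size: int, pad_id: int = 0) -> List[List[int]]:
    """Group consecutive codec tokens into fixed-size patches.

    Padding the final patch with `pad_id` is an assumption made for this sketch.
    """
    if patch_size < 1:
        raise ValueError("patch_size must be >= 1")
    patches = []
    for start in range(0, len(tokens), patch_size):
        patch = tokens[start:start + patch_size]
        # Right-pad a short final patch so every patch has the same length.
        patch = patch + [pad_id] * (patch_size - len(patch))
        patches.append(patch)
    return patches


def kv_cache_reduction(num_tokens: int, patch_size: int) -> float:
    """Fraction of global KV-cache entries saved.

    Assumes one cache entry per backbone position (hypothetical simplification).
    """
    num_patches = -(-num_tokens // patch_size)  # ceiling division
    return 1.0 - num_patches / num_tokens


codec_tokens = list(range(100, 112))  # 12 made-up codec token ids
patches = group_into_patches(codec_tokens, patch_size=4)
print(len(codec_tokens), "token positions ->", len(patches), "patch positions")
print("KV-cache reduction:", kv_cache_reduction(len(codec_tokens), patch_size=4))
```

Under these assumptions a patch size of 4 leaves the backbone with a quarter of the positions, which is consistent with the "up to 75%" KV-cache figure. The reported 1.8x speedup is smaller than 4x, and the abstract does not break down where the remaining time goes.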

RESULT

ScienceToStartup currently rates this 0.0/10 on the public viability pass. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up…

WHY NOW

Uncategorized moved forward this cycle; last verified June 2026. Public score 0.0/10.

Opportunity summary

Score 0.0

Pain customer pain not on file

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Competitive landscape

No named competitor graph is public yet; the page still exposes the segment, adoption evidence, and score state so the commercial read is not blank.

Segment

Uncategorized

Adoption evidence

No public code link in the paper record yet

Commercial read

0.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Page: https://sciencetostartup.com/paper/tldr-compressing-audio-tokens-for-efficient-autoregressive-text-to-speech

arXiv: https://arxiv.org/abs/2606.09019

Published: 2026-06-08T04:32:08.000Z

Research domain: Uncategorized
