ARXIV:2605.04312 · MULTI-AGENT GAMES · SUBMITTED 07 MAY · 20:29 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games

Connacher Murphy · arXiv

Agent Island: a dynamic, saturation-resistant benchmark from multiagent games to track LLM capabilities progress.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain Agent Island: a dynamic, saturation-resistant benchmark from multiagent games to track LLM capabilities progress.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

Agent Island: a dynamic, saturation-resistant benchmark from multiagent games to track LLM capabilities progress. We introduce Agent Island, a multiplayer simulation environment in which language-model agents compete in a game of interagent cooperation, conflict,…

METHOD

Full abstract

Static capabilities benchmarks suffer from saturation and contamination, making it difficult to track capabilities progress over time. We introduce Agent Island, a multiplayer simulation environment in which language-model agents compete in a game of interagent cooperation, conflict, and persuasion. The environment yields a dynamic benchmark designed to mitigate both saturation and contamination; new models can always outperform the current leading player in this winner-take-all game, and agents compete against other adaptive agents rather than face a fixed task set. We rank players with a Bayesian Plackett-Luce model, allowing us to quantify uncertainty in player skill. In 999 games involving 49 unique models, openai/gpt-5.5 dominates its peers with a posterior mean skill of 5.64, compared with 3.10 for the second-ranked model, openai/gpt-5.2, and 2.86 for the third-ranked model, openai/gpt-5.3-codex. We release the game logs as a dataset for analyses of model behavior. As an example, we investigate same-provider preference in final-round votes and find that models are 8.3 p.p. more likely to support a same-provider finalist than finalists from other providers. This preference is not uniform across providers: among separately estimated providers, the effect is strongest for OpenAI models and weakest for Anthropic models.
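To make the ranking model in the abstract more concrete, here is a minimal Python sketch of the Plackett-Luce likelihood for a single finishing order. It is an illustration only, not the paper's implementation: the function names are hypothetical, the Bayesian prior and posterior sampling are omitted, and the skill values quoted in the abstract are assumed to sit on a log scale, which the abstract does not state.

```python
import math
from itertools import permutations


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def plackett_luce_log_likelihood(ranking, log_skill):
    """Log-probability of one finishing order under Plackett-Luce.

    ranking   -- player names ordered from first place to last place
    log_skill -- dict mapping player name to a log-scale skill (assumed scale)
    """
    total = 0.0
    for position in range(len(ranking) - 1):
        remaining = [log_skill[p] for p in ranking[position:]]
        # The player at this position is "chosen" from those still remaining.
        total += log_skill[ranking[position]] - log_sum_exp(remaining)
    return total


def head_to_head_probability(skill_a, skill_b):
    """Probability that A finishes ahead of B when only the two are compared."""
    return 1.0 / (1.0 + math.exp(skill_b - skill_a))


# Posterior mean skills quoted in the abstract, treated here as log-scale values.
posterior_mean = {
    "openai/gpt-5.5": 5.64,
    "openai/gpt-5.2": 3.10,
    "openai/gpt-5.3-codex": 2.86,
}

top_vs_second = head_to_head_probability(
    posterior_mean["openai/gpt-5.5"], posterior_mean["openai/gpt-5.2"]
)
print(f"P(gpt-5.5 ahead of gpt-5.2) = {top_vs_second:.3f}")

# Sanity check: probabilities over all finishing orders sum to one.
total_probability = sum(
    math.exp(plackett_luce_log_likelihood(list(order), posterior_mean))
    for order in permutations(posterior_mean)
)
print(f"Sum over all orderings = {total_probability:.6f}")
```

Under the log-scale assumption, a skill gap of 5.64 versus 3.10 corresponds to the top model finishing ahead of the second-ranked model in roughly 93% of head-to-head comparisons. A full Bayesian treatment would place a prior on the skills and report posterior intervals rather than a single point estimate.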

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Per the abstract, models are 8.3 p.p. more likely to support a same-provider finalist than finalists from other providers. Code availability is flagged in the production record; the public repository link…
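The same-provider figure in the abstract is a gap in percentage points between two support rates. The sketch below shows that arithmetic on a raw difference in proportions. The record layout (`FinalVote`), the function names, and the toy votes are all hypothetical, and the paper may well use a different estimator (for example a regression with controls), so this is only an illustration of what a percentage-point gap measures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinalVote:
    voter_provider: str      # provider of the model casting the vote
    finalist_provider: str   # provider of the finalist being considered
    supported: bool          # True if the voter backed this finalist


def support_rate(votes):
    """Share of votes that supported the finalist; None if there are no votes."""
    votes = list(votes)
    if not votes:
        return None
    return sum(v.supported for v in votes) / len(votes)


def same_provider_gap_pp(votes):
    """Same-provider support rate minus cross-provider rate, in percentage points."""
    same = [v for v in votes if v.voter_provider == v.finalist_provider]
    other = [v for v in votes if v.voter_provider != v.finalist_provider]
    same_rate, other_rate = support_rate(same), support_rate(other)
    if same_rate is None or other_rate is None:
        raise ValueError("need both same-provider and cross-provider votes")
    return 100.0 * (same_rate - other_rate)


# Made-up toy records, NOT drawn from the released game logs.
toy_votes = [
    FinalVote("A", "A", True),
    FinalVote("A", "A", True),
    FinalVote("A", "A", False),
    FinalVote("A", "A", True),
    FinalVote("A", "B", True),
    FinalVote("B", "A", False),
    FinalVote("B", "A", True),
    FinalVote("A", "B", False),
]
gap = same_provider_gap_pp(toy_votes)
print(f"Same-provider gap: {gap:.1f} p.p.")
```

In the toy data the same-provider support rate is 75% and the cross-provider rate is 50%, so the gap is 25 p.p.; the 8.3 p.p. in the abstract would be read the same way.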

WHY NOW

Multi-Agent Games moved forward this cycle; last verified May 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score 7.0

Pain Agent Island: a dynamic, saturation-resistant benchmark from multiagent games to track LLM capabilities progress.

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Analysis summary

Agent Island: a dynamic, saturation-resistant benchmark from multiagent games to track LLM capabilities progress.

Competitive landscape

Agent Island: a dynamic, saturation-resistant benchmark from multiagent games to track LLM capabilities progress.

Segment

Multi-Agent Games

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

arXiv: https://arxiv.org/abs/2605.04312 · Published 2026-05-05 21:24 UTC
