ARXIV:2602.23974 · REINFORCEMENT LEARNING · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (verified) · PaperPack: 3 of 4 citation fields filled (partial) · Missing fields: authors · Proof status: unverified (partial)

Pessimistic Auxiliary Policy for Offline Reinforcement Learning

arXiv

Enhance offline reinforcement learning with a pessimistic auxiliary policy for reliable action sampling.

Blocked on code · Score 5.0 · Evidence unverified

Opportunity summary

Pain: Enhance offline reinforcement learning with a pessimistic auxiliary policy for reliable action sampling.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: Evidence unverified


PROBLEM

Offline reinforcement learning trains an agent from pre-collected data, but the learning process inevitably queries out-of-distribution (OOD) actions. Value estimates for those actions carry approximation errors, which accumulate through bootstrapped value updates and cause considerable overestimation. The paper proposes a pessimistic auxiliary policy that samples more reliable actions.
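Why noisy value estimates lead to overestimation can be seen in a toy setting. This sketch is illustrative only and is not taken from the paper: it assumes every action has the same true value (0) and that each estimate adds independent zero-mean Gaussian noise, standing in for approximation error on actions the dataset does not cover. Each estimate is unbiased on its own, yet the maximum over estimates is biased upward, and the bias grows with the number of actions maximized over. The function name `overestimation_gap` is hypothetical.

```python
import random
import statistics


def overestimation_gap(num_actions: int, noise_std: float, trials: int, seed: int = 0) -> float:
    """Mean of max_a Q_hat(a) minus max_a Q(a), where Q_hat = Q + zero-mean Gaussian noise."""
    rng = random.Random(seed)
    true_q = [0.0] * num_actions  # toy assumption: every action is equally good
    gaps = []
    for _ in range(trials):
        # Each estimate is unbiased on its own: true value plus zero-mean noise.
        q_hat = [q + rng.gauss(0.0, noise_std) for q in true_q]
        # A greedy target uses the max over estimates, not over true values.
        gaps.append(max(q_hat) - max(true_q))
    return statistics.fmean(gaps)


for n in (1, 10, 100):
    print(f"{n:>3} actions: mean overestimation {overestimation_gap(n, 1.0, 5_000):.2f}")
```

A value update that bootstraps from this maximum feeds the upward bias back into its own targets, which is the error accumulation the abstract describes.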

METHOD

Full abstract

Offline reinforcement learning aims to learn an agent from pre-collected datasets, avoiding unsafe and inefficient real-time interaction. However, inevitable access to out-of-distribution actions during the learning process introduces approximation errors, causing error accumulation and considerable overestimation. In this paper, we construct a new pessimistic auxiliary policy for sampling reliable actions. Specifically, we develop a pessimistic auxiliary strategy by maximizing the lower confidence bound of the Q-function. The pessimistic auxiliary strategy exhibits relatively high value and low uncertainty in the vicinity of the learned policy, which prevents the learned policy from sampling high-value actions with potentially high errors during learning. Because actions sampled from the pessimistic auxiliary strategy introduce less approximation error, error accumulation is alleviated. Extensive experiments on offline reinforcement learning benchmarks show that the pessimistic auxiliary strategy can effectively improve the efficacy of other offline RL approaches.
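The central step in the abstract, maximizing a lower confidence bound (LCB) of the Q-function near the learned policy, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes uncertainty comes from an ensemble of Q-functions, with LCB(s, a) = mean_k Q_k(s, a) - beta * std_k Q_k(s, a) (a common construction; the abstract does not specify one), and it replaces the trained auxiliary policy with a hypothetical sampling search over actions within `radius` of the learned policy's action. The helper names, `beta`, `radius`, and the toy ensemble are all illustrative.

```python
import random
import statistics
from typing import Callable, List, Sequence

# A Q-function maps (state, action) to a scalar estimate.
QFunction = Callable[[Sequence[float], Sequence[float]], float]


def lower_confidence_bound(q_ensemble: Sequence[QFunction], state: Sequence[float],
                           action: Sequence[float], beta: float) -> float:
    """Ensemble mean minus beta times the ensemble (population) standard deviation."""
    values = [q(state, action) for q in q_ensemble]
    return statistics.fmean(values) - beta * statistics.pstdev(values)


def pessimistic_auxiliary_action(q_ensemble: Sequence[QFunction], state: Sequence[float],
                                 policy_action: Sequence[float], beta: float = 1.0,
                                 radius: float = 0.1, num_candidates: int = 64,
                                 seed: int = 0) -> List[float]:
    """Hypothetical stand-in for a trained auxiliary policy: return the candidate
    near policy_action with the highest LCB."""
    rng = random.Random(seed)
    # Keep the learned policy's own action as a candidate, then perturb it locally.
    candidates: List[List[float]] = [list(policy_action)]
    for _ in range(num_candidates):
        candidates.append([a + rng.uniform(-radius, radius) for a in policy_action])
    return max(candidates, key=lambda a: lower_confidence_bound(q_ensemble, state, a, beta))


# Toy 1-D ensemble: the member average is a, and the spread between members grows as a**2.
ensemble: List[QFunction] = [
    lambda s, a, c=c: a[0] + c * a[0] ** 2 for c in (-2.0, 0.0, 2.0)
]

for beta in (0.0, 1.0):
    chosen = pessimistic_auxiliary_action(ensemble, [], [0.5], beta=beta, radius=0.3)
    print(f"beta={beta}: action {chosen[0]:.3f}, "
          f"LCB {lower_confidence_bound(ensemble, [], chosen, beta):.3f}")
```

With `beta=0.0` the search follows the ensemble mean, which in this toy keeps rising with the action and therefore lands where the members disagree most; with `beta=1.0` it settles on an action with a somewhat lower mean but much lower disagreement. That trade-off mirrors what the abstract attributes to the pessimistic auxiliary strategy: relatively high value and low uncertainty in the vicinity of the learned policy.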

RESULT

ScienceToStartup currently rates this paper 5.0/10 on its public viability pass. According to the abstract, extensive experiments on offline reinforcement learning benchmarks show that adding the pessimistic auxiliary strategy effectively improves the efficacy of other offline RL approaches.

WHY NOW

Reinforcement Learning moved forward this cycle; last verified April 2026. Public score 5.0/10.


Paper Pack

10.48550/arXiv.2602.23974

Pessimistic Auxiliary Policy for Offline Reinforcement Learning

Enhance offline reinforcement learning with a pessimistic auxiliary policy for reliable action sampling.


Source availability

PDF linked

The paper record includes a public PDF URL.

Extraction status

Derived fallback

Read summaries are estimated from adjacent metadata, not verified extraction rows.

Proof status

unverified

0 refs; 0 sources; 17% coverage.

What was readable

linked · on file · not materialized · derived fallback · 42 indexed · not indexed

Derived fallback: Estimated from adjacent evidence; not verified from source.

Viability

5.0

Time to MVP

MVP estimate missing

Commercial

No commercial flags on file


Claim map

Public claims are backed by the abstract until anchored extraction refreshes.

Strong 0 · Mixed 0 · Weak 4

Evidence (partial): Enhance offline reinforcement learning with a pessimistic auxiliary policy for reliable action sampling. However, inevitable access to out-of-distribution actions during the learning process introduces approximation errors, causing error accumulation and considerable overestimation.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not yet materialized a public claim row.
Verification: partial

Evidence (partial): Offline reinforcement learning aims to learn an agent from pre-collected datasets, avoiding unsafe and inefficient real-time interaction. However, inevitable access to out-of-distribution actions during the learning process introduces approximation errors, causing error accumulation and considerable overestimation.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not yet materialized a public claim row.
Verification: partial

Evidence (partial): ScienceToStartup currently rates this paper 5.0/10 on its public viability pass. Extensive experiments on offline reinforcement learning benchmarks show that the pessimistic auxiliary strategy can effectively improve the efficacy of other offline RL approaches.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not yet materialized a public claim row.
Verification: partial

Evidence (partial): Reinforcement Learning moved forward this cycle; last verified April 2026. Public score 5.0/10.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not yet materialized a public claim row.
Verification: partial

Constellation map

Paper-native neighborhood for concepts, methods, materials, markets, and competitors. Missing lanes stay labeled instead of disappearing behind commercialization gates.


Concepts

not indexed

Methods

Materials

PDF linked

Markets

Reinforcement Learning

Competitors

not indexed

Competitive landscape

Enhance offline reinforcement learning with a pessimistic auxiliary policy for reliable action sampling.

Segment

Reinforcement Learning

Adoption evidence

No public code link in the paper record yet

Commercial read

5.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Buzz

No indexed public discussion is attached to 2602.23974 yet. That is a visibility signal, not a blank module: the monitor is watching the public channels below.

Hacker News

Not indexed yet

Bluesky

Not indexed yet


References (42)

Channel-adaptive generative reconstruction and fusion for multi-sensor graph features in few-shot fault diagnosis. Peijie You, Lei Wang et al. (2026)
Inhibiting Error Exacerbation in Offline Reinforcement Learning With Data Sparsity. Fan Zhang, Malu Zhang et al. (2025)
AI-powered spatiotemporal imputation and prediction of chlorophyll-a concentration in coastal ecosystems. Fan Zhang, H. Kung et al. (2025)
NeoRL-2: Near Real-World Benchmarks for Offline Reinforcement Learning with Extended Realistic Scenarios. Songyi Gao, Zuolin Tu et al. (2025)
QFAE: Q-Function guided Action Exploration for offline deep reinforcement learning. Teng Pang, Guoqiang Wu et al. (2025)
Multiscale Channel Attention-Driven Graph Dynamic Fusion Learning Method for Robust Fault Diagnosis. Xin Zhang, Jie Liu et al. (2024)
Learning Latent Dynamic Robust Representations for World Models. Ruixiang Sun, Hongyu Zang et al. (2024)
HarmonyDream: Task Harmonization Inside World Models. Haoyu Ma, Jialong Wu et al. (2023)
Self-imitation guided goal-conditioned reinforcement learning. Yao Li, Yuhui Wang et al. (2023)
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning. Zhendong Wang, Jonathan J. Hunt et al. (2022)
Bootstrapped Transformer for Offline Reinforcement Learning. Kerong Wang, Hanye Zhao et al. (2022)
Offline Reinforcement Learning with Implicit Q-Learning. Ilya Kostrikov, Ashvin Nair et al. (2021)
Addressing Hindsight Bias in Multigoal Reinforcement Learning. Chenjia Bai, Lingxiao Wang et al. (2021)
Bellman-consistent Pessimism for Offline Reinforcement Learning. Tengyang Xie, Ching-An Cheng et al. (2021)
A Minimalist Approach to Offline Reinforcement Learning. Scott Fujimoto, S. Gu (2021)
Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning. Tengyang Xie, Nan Jiang et al. (2021)
Decision Transformer: Reinforcement Learning via Sequence Modeling. Lili Chen, Kevin Lu et al. (2021)
Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning. Yue Wu, Shuangfei Zhai et al. (2021)
Critic Regularized Regression. Ziyun Wang, Alexander Novikov et al. (2020)
Accelerating Online Reinforcement Learning with Offline Datasets. Ashvin Nair, Murtaza Dalal et al. (2020)

Showing 20 of 42 references

CITED BY

No citing papers are indexed in the public S2S graph yet. This is an explicit zero-signal state, not a hidden lookup.

Foundation

none indexed

Extension

Builds on this: Residuals-based Offline Reinforcement Learning · 4.0

Builds on this: Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parameteric Policies · 2.0

Commercially relevant

Higher viability: Offline Policy Optimization with Posterior Sampling · 7.0

Higher viability: Robust Regularized Policy Iteration under Transition Uncertainty · 7.0

Higher viability: ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization · 7.0

Higher viability: Robust Probabilistic Shielding for Safe Offline Reinforcement Learning · 7.0

Higher viability: Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration · 7.0

Higher viability: RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking · 7.0

Conflicting

Competing approach: Learning Optimal and Sample-Efficient Decision Policies with Guarantees · 5.0

Competing approach: Escaping Offline Pessimism: Vector-Field Reward Shaping for Safe Frontier Exploration · 5.0

Related Resources

Just-In-Time Reinforcement Learning(glossary)
Multi-Agent Reinforcement Learning(glossary)
Multi-Agent Test-Time Reinforcement Learning (MATTRL)(glossary)
How does PRISM improve reinforcement learning?(question)
What is the significance of reinforcement learning in AI?(question)
How does RetroAgent improve reinforcement learning?(question)
Reinforcement Learning – Use Cases(use_case)


Agent drawer

5 surfaces preserved for agents. Humans can ignore.

Developer contracts, payload previews, evidence maps, and run controls stay here instead of the Read, Build, and Track workspace.

Run context

Paper: 2602.23974
Route: /paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning
Active tab: read
Artifact: pessimistic-auxiliary-policy-for-offline-reinforcement-learning

Available agents

Read extractor
Build planner
Track monitor
Competitive mapper
Related-paper scout

API/MCP endpoints

REST paper pack API: /api/v1/paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning/paper-pack
REST build passport API: /api/v1/paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning/build-passport
REST OpenAPI: /api/openapi.json
MCP descriptor: /api/mcp
MCP resource: sciencetostartup://surfaces/paper-workspace
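The REST paths listed for this paper share one slug-based pattern. A small sketch of building absolute URLs from a slug, assuming every paper page follows the same templates with its own slug and that the paths are served from the host used in the Agent Handoff REST example (https://sciencetostartup.com); `paper_endpoints` is a hypothetical helper, not part of any published client.

```python
from typing import Dict
from urllib.parse import quote

BASE_URL = "https://sciencetostartup.com"  # host taken from the REST example on this page


def paper_endpoints(slug: str) -> Dict[str, str]:
    """Build absolute URLs from the path templates listed for this paper (assumed to generalize)."""
    s = quote(slug, safe="")  # slugs are normally URL-safe already; this guards odd input
    return {
        "paper_pack": f"{BASE_URL}/api/v1/paper/{s}/paper-pack",
        "build_passport": f"{BASE_URL}/api/v1/paper/{s}/build-passport",
        "agent_handoff": f"{BASE_URL}/api/v1/agent-handoff/paper/{s}",
        "openapi": f"{BASE_URL}/api/openapi.json",
        "mcp_descriptor": f"{BASE_URL}/api/mcp",
    }


for name, url in paper_endpoints("pessimistic-auxiliary-policy-for-offline-reinforcement-learning").items():
    print(f"{name}: {url}")
```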

Tool contracts

paper_pack, build_passport, opportunity_kernel, foresight, source_proof, evidence_state

Payload preview


{
  "contract_version": "paper-r2",
  "paper_id": "40af5f27-307c-4136-86fa-2f95f123036a",
  "arxiv_id": "2602.23974",
  "canonical_route": "/paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning",
  "active_tab": "synced from current hash by the drawer client",
  "selected_artifact": "pessimistic-auxiliary-policy-for-offline-reinforcement-learning",
  "endpoints": {
    "paper_pack": "/api/v1/paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning/paper-pack",
    "build_passport": "/api/v1/paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning/build-passport",
    "mcp_resource": "sciencetostartup://surfaces/paper-workspace"
  }
}

Schema validation

paper-r2 contract: present
JSON-LD twin: SSR emitted
OpenAPI path parity: /api/openapi.json
MCP resource parity: paper-workspace

Job trace

queued: drawer opened by user action
running: inspect or copy payload
succeeded: payload available in SSR
failed: route errors appear in evidence cards

Evidence map

sources used: page freshness, source proof anchors, JSON-LD
missing sources: exposed by PaperPack and EvidenceState chips
derived fallbacks: marked unverified before handoff

Page Freshness

Canonical route, proof status, last verified, refs, sources, and coverage.


Paper proof surface

Canonical route: /paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning

stale

Proof freshness: stale
Proof status: unverified
Display score: 5/10
Last proof check: 2026-04-02
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 0
Source count: 0
Coverage: 17%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.

OpenAlex: pending — this preprint is not yet indexed by OpenAlex.

Agent Handoff

Endpoint list, payload shape, route context, and copyable handoff data.


Pessimistic Auxiliary Policy for Offline Reinforcement Learning

Canonical ID: pessimistic-auxiliary-policy-for-offline-reinforcement-learning | Route: /paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning

MCP example

{
  "tool": "get_paper",
  "arguments": {
    "arxiv_id": "2602.23974"
  }
}

source_context

{
  "surface": "paper",
  "mode": "paper",
  "query": "Pessimistic Auxiliary Policy for Offline Reinforcement Learning",
  "normalized_query": "2602.23974",
  "route": "/paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning",
  "paper_ref": "pessimistic-auxiliary-policy-for-offline-reinforcement-learning",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Buildability Receipt

Verdict, compute envelope, blockers, signature state, and receipt links.

Paper proof page receipt window

Watch and verify: Pessimistic Auxiliary Policy for Offline Reinforcement Learning

/buildability/pessimistic-auxiliary-policy-for-offline-reinforcement-learning

Watch

Subject: Pessimistic Auxiliary Policy for Offline Reinforcement Learning

Verdict

Watch

Verdict is Watch because viability or proof quality is intermediate and should be re-evaluated before execution.

Time to first demo

Insufficient data

No first-demo timestamp, owner estimate, or elapsed demo receipt is attached to this surface.

Compute envelope

Structured compute envelope

Insufficient data

No data, compute, hardware, memory, latency, dependency, or serving requirement receipt is attached.

Evidence ids

Receipt path

/buildability/pessimistic-auxiliary-policy-for-offline-reinforcement-learning

Paper ref

pessimistic-auxiliary-policy-for-offline-reinforcement-learning

arXiv id

2602.23974

Freshness

Generated at

2026-04-02T02:30:40.136Z

Evidence freshness

stale

Last verification

2026-04-02T02:30:40.136Z

Sources

References

Coverage

17%

Hash state

Lineage hash

dfb67ef1fb31d9796613c7bf1a57f78eb2eb10c210611cefea54ce2cc57fa907

Canonical opportunity-kernel lineage hash.

Signature state

External signature

unsigned_external

No founder, registry, pilot, or production-adoption signature is attached to this receipt.

Verification

not_verified

Verification is blocked until an external signature is provided.

Blockers

Missing: repo_url
Missing: references
Missing: proof_status
Missing: distribution_readiness_scores
Missing: paper_extraction_scorecards
Unknown: distribution readiness has not been computed yet
Unknown: proof verification has not been recorded yet

Verification pending / evidence receipt incomplete

repo_url

references

Missing proof, requirement, signature, approval, adoption, or telemetry fields are blockers and must not be inferred.


Source Proof anchors

Visual citations from the paper document graph.

JSON-LD twin

The application/ld+json payload rendered for agents.

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "WebPage",
      "@id": "https://sciencetostartup.com/paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning#webpage",
      "url": "https://sciencetostartup.com/paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning",
      "name": "Pessimistic Auxiliary Policy for Offline Reinforcement Learning",
      "description": "Enhance offline reinforcement learning with a pessimistic auxiliary policy for reliable action sampling.",
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      }
    },
    {
      "@type": "ScholarlyArticle",
      "@id": "https://sciencetostartup.com/paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning#scholarlyArticle",
      "headline": "Pessimistic Auxiliary Policy for Offline Reinforcement Learning",
      "description": "Enhance offline reinforcement learning with a pessimistic auxiliary policy for reliable action sampling.",
      "url": "https://sciencetostartup.com/paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning",
      "sameAs": "https://arxiv.org/abs/2602.23974",
      "identifier": {
        "@type": "PropertyValue",
        "propertyID": "arXiv",
        "value": "2602.23974"
      },
      "isAccessibleForFree": true,
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      },
      "datePublished": "2026-02-27T12:34:20.000Z",
      "citation": [
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "20063011502ed777b857de4d8e9d75df5bb55337"
          },
          "url": "https://www.semanticscholar.org/paper/20063011502ed777b857de4d8e9d75df5bb55337"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "b2b10d67acce04b89cf0c003a631a85e18ee70ff"
          },
          "url": "https://www.semanticscholar.org/paper/b2b10d67acce04b89cf0c003a631a85e18ee70ff"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "089ff72ad5b2c25a1a4f96d1d207530b94946a0a"
          },
          "url": "https://www.semanticscholar.org/paper/089ff72ad5b2c25a1a4f96d1d207530b94946a0a"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "78446f5e8abac58765f09b271929a1b3edd60fe6"
          },
          "url": "https://www.semanticscholar.org/paper/78446f5e8abac58765f09b271929a1b3edd60fe6"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "bb6e9956108634fb0109d2360ff9febdf065311a"
          },
          "url": "https://www.semanticscholar.org/paper/bb6e9956108634fb0109d2360ff9febdf065311a"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "e70d671b3333e2f437e00f5dbab98c29d412bc80"
          },
          "url": "https://www.semanticscholar.org/paper/e70d671b3333e2f437e00f5dbab98c29d412bc80"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "f218a95e583841c6f668e360a90739d7e4f4610b"
          },
          "url": "https://www.semanticscholar.org/paper/f218a95e583841c6f668e360a90739d7e4f4610b"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "854d76ef9685b75fc6fb576609a95e12326e5309"
          },
          "url": "https://www.semanticscholar.org/paper/854d76ef9685b75fc6fb576609a95e12326e5309"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "2cbea7615ebecea2c414d8fbad47d5d258a5c3b4"
          },
          "url": "https://www.semanticscholar.org/paper/2cbea7615ebecea2c414d8fbad47d5d258a5c3b4"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "b7d27c5af2d314f6ec45b6d88984fb45220eb379"
          },
          "url": "https://www.semanticscholar.org/paper/b7d27c5af2d314f6ec45b6d88984fb45220eb379"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "348a855fe01f3f4273bf0ecf851ca688686dbfcc"
          },
          "url": "https://www.semanticscholar.org/paper/348a855fe01f3f4273bf0ecf851ca688686dbfcc"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "288e5a79e73742f9ead3bfa9463891717414f6fd"
          },
          "url": "https://www.semanticscholar.org/paper/288e5a79e73742f9ead3bfa9463891717414f6fd"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "e2ad21dae85950ab3631f65a0f142924c99fb9c4"
          },
          "url": "https://www.semanticscholar.org/paper/e2ad21dae85950ab3631f65a0f142924c99fb9c4"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "c879b25308026d6538e52b27bcf4fd3cb60855f3"
          },
          "url": "https://www.semanticscholar.org/paper/c879b25308026d6538e52b27bcf4fd3cb60855f3"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "d769ca62d90adc7e7869849a421426bdc54a32fb"
          },
          "url": "https://www.semanticscholar.org/paper/d769ca62d90adc7e7869849a421426bdc54a32fb"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "c1ad5f9b32d80f1c65d67894e5b8c2fdf0ae4500"
          },
          "url": "https://www.semanticscholar.org/paper/c1ad5f9b32d80f1c65d67894e5b8c2fdf0ae4500"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "2c23085488337c4c1b5673b8d0f4ac95bda73529"
          },
          "url": "https://www.semanticscholar.org/paper/2c23085488337c4c1b5673b8d0f4ac95bda73529"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "7acbdb961f67d50fef359066f2a1d7755cf16ee2"
          },
          "url": "https://www.semanticscholar.org/paper/7acbdb961f67d50fef359066f2a1d7755cf16ee2"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "0272b14dd471fe7b81df703af1b71d7600b77215"
          },
          "url": "https://www.semanticscholar.org/paper/0272b14dd471fe7b81df703af1b71d7600b77215"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "b405f4213700a32f9773a6f397fa80b3c8621457"
          },
          "url": "https://www.semanticscholar.org/paper/b405f4213700a32f9773a6f397fa80b3c8621457"
        }
      ],
      "additionalProperty": [
        {
          "@type": "PropertyValue",
          "propertyID": "viabilityScore",
          "value": 5
        },
        {
          "@type": "PropertyValue",
          "propertyID": "researchDomain",
          "value": "Reinforcement Learning"
        }
      ]
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        {
          "@type": "ListItem",
          "position": 1,
          "name": "Home",
          "item": "https://sciencetostartup.com"
        },
        {
          "@type": "ListItem",
          "position": 2,
          "name": "Reinforcement Learning",
          "item": "https://sciencetostartup.com/topics"
        },
        {
          "@type": "ListItem",
          "position": 3,
          "name": "Pessimistic Auxiliary Policy for Offline Reinforcement Learn",
          "item": "https://sciencetostartup.com/paper/pessimistic-auxiliary-policy-for-offline-reinforcement-learning"
        }
      ]
    }
  ]
}
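Agents reading the JSON-LD twin can pull identifiers directly from the `@graph` array. A minimal standard-library sketch, assuming the graph layout shown above (a top-level `ScholarlyArticle` node carrying an `arXiv` identifier and a `citation` list of `SemanticScholar` identifiers); the function names are hypothetical.

```python
import json
from typing import List, Optional


def _graph(jsonld_text: str) -> list:
    return json.loads(jsonld_text).get("@graph", [])


def arxiv_id(jsonld_text: str) -> Optional[str]:
    """Return the arXiv identifier of the top-level ScholarlyArticle node, if present."""
    for node in _graph(jsonld_text):
        ident = node.get("identifier", {})
        if node.get("@type") == "ScholarlyArticle" and ident.get("propertyID") == "arXiv":
            return ident.get("value")
    return None


def semantic_scholar_ids(jsonld_text: str) -> List[str]:
    """Collect SemanticScholar ids from the citation lists of top-level ScholarlyArticle nodes."""
    ids: List[str] = []
    for node in _graph(jsonld_text):
        if node.get("@type") != "ScholarlyArticle":
            continue
        for cited in node.get("citation", []):
            ident = cited.get("identifier", {})
            if ident.get("propertyID") == "SemanticScholar":
                ids.append(ident["value"])
    return ids


# Trimmed example in the same shape as the twin above.
sample = json.dumps({"@context": "https://schema.org", "@graph": [
    {"@type": "WebPage"},
    {"@type": "ScholarlyArticle",
     "identifier": {"@type": "PropertyValue", "propertyID": "arXiv", "value": "2602.23974"},
     "citation": [{"@type": "ScholarlyArticle", "identifier": {
         "@type": "PropertyValue", "propertyID": "SemanticScholar",
         "value": "20063011502ed777b857de4d8e9d75df5bb55337"}}]},
]})
print(arxiv_id(sample), semantic_scholar_ids(sample))
```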
