HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models


Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/handvqa-diagnosing-and-improving-fine-grained-spatial-reasoning-about-hands-in-vision-language-models


Proof freshness: stale
Proof status: unverified
Display score: 7/10
Last proof check: 2026-03-30
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 113
Source count: 3
Coverage: 50%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.
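The score dates listed above are consistent with a 30-day freshness window (2026-04-02 to 2026-05-02). The following is a minimal sketch of that arithmetic, not the site's actual logic: the 30-day length is inferred from the two dates, the function names are hypothetical, and the page does not state how the separate proof freshness window is computed.

```python
from datetime import date, timedelta

# Assumption: the score window is 30 days, inferred from the
# "Score updated" and "Score fresh until" dates on this page.
SCORE_WINDOW = timedelta(days=30)


def fresh_until(updated: date, window: timedelta = SCORE_WINDOW) -> date:
    """Return the last day on which a score updated on `updated` is fresh."""
    return updated + window


def is_stale(today: date, deadline: date) -> bool:
    """Assumption: the deadline day itself still counts as fresh."""
    return today > deadline


if __name__ == "__main__":
    deadline = fresh_until(date(2026, 4, 2))
    print(deadline.isoformat())
    print(is_stale(date(2026, 5, 3), deadline))
```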

Agent Handoff

Canonical ID handvqa-diagnosing-and-improving-fine-grained-spatial-reasoning-about-hands-in-vision-language-models | Route /signal-canvas/handvqa-diagnosing-and-improving-fine-grained-spatial-reasoning-about-hands-in-vision-language-models

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/handvqa-diagnosing-and-improving-fine-grained-spatial-reasoning-about-hands-in-vision-language-models
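The same request can be sketched in Python with only the standard library. This is a hedged illustration: the helper names are hypothetical, and it assumes the endpoint answers a plain GET with a JSON body, which this page does not confirm.

```python
import json
import urllib.request

# Base path copied from the curl example above.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"
CANONICAL_ID = (
    "handvqa-diagnosing-and-improving-fine-grained-spatial-reasoning-"
    "about-hands-in-vision-language-models"
)


def handoff_url(canonical_id: str) -> str:
    """Build the agent-handoff URL for a canonical paper ID."""
    return f"{BASE_URL}/{canonical_id}"


def fetch_handoff(canonical_id: str, timeout: float = 10.0):
    """GET the handoff document.

    Assumption: the response body is JSON. Adjust the decoding
    if the service returns something else.
    """
    with urllib.request.urlopen(handoff_url(canonical_id), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the URL; call fetch_handoff(CANONICAL_ID) to hit the network.
    print(handoff_url(CANONICAL_ID))
```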

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "handvqa-diagnosing-and-improving-fine-grained-spatial-reasoning-about-hands-in-vision-language-models",
    "query_text": "Summarize HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models"
  }
}
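A small helper can assemble this tool call for any paper. The sketch below only builds the payload shown above; the function name is hypothetical, and how the payload is delivered to an MCP server (client library, transport) is not specified on this page and is left out.

```python
import json


def build_search_call(paper_ref: str, title: str) -> dict:
    """Build a search_signal_canvas tool call in paper mode.

    The field names mirror the MCP example above; the "Summarize ..."
    query text follows the same pattern as that example.
    """
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            "query_text": f"Summarize {title}",
        },
    }


if __name__ == "__main__":
    call = build_search_call(
        "handvqa-diagnosing-and-improving-fine-grained-spatial-reasoning-"
        "about-hands-in-vision-language-models",
        "HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning "
        "about Hands in Vision-Language Models",
    )
    print(json.dumps(call, indent=2))
```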

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models",
  "normalized_query": "2603.26362",
  "route": "/signal-canvas/handvqa-diagnosing-and-improving-fine-grained-spatial-reasoning-about-hands-in-vision-language-models",
  "paper_ref": "handvqa-diagnosing-and-improving-fine-grained-spatial-reasoning-about-hands-in-vision-language-models",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Evidence Receipt

Route status: building

Claims: 12

References: 113

Proof: Verification pending

Freshness state: computing

Source paper: HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

PDF: https://arxiv.org/pdf/2603.26362v1

Source count: 3

Coverage: 50%

Last proof check: 2026-03-30T22:21:30.245Z

Signal Canvas receipt window

Watch and verify: HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

/buildability/handvqa-diagnosing-and-improving-fine-grained-spatial-reasoning-about-hands-in-vision-language-models


Subject: HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

Verdict

Watch

The verdict is Watch because viability or proof quality is intermediate; re-evaluate before execution.


GitHub Code Pulse

No public code linked for this paper yet.

Claim map

Strong 12, Mixed 0, Weak 0

Evidence (partial)
Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex and articulated hand poses.
Implication (partial)
This is a central theme of the abstract and is directly supported by the introduction of the HandVQA benchmark to address this issue.
Verification (partial)
Evidence (partial)
We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions.
Implication (partial)
The abstract explicitly states the creation and scale of the HandVQA benchmark and its purpose.
Verification (partial)
Evidence (partial)
our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions.
Implication (partial)
The abstract clearly defines the types of spatial relationships evaluated by the HandVQA benchmark.
Verification (partial)
Evidence (partial)
Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization.
Implication (partial)
The abstract summarizes the findings from evaluating these models on HandVQA, highlighting their failures.
Verification (partial)
Evidence (partial)
We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving accuracy of model on novel downstream tasks like hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).
Implication (partial)
The abstract provides specific quantitative improvements achieved through fine-tuning on HandVQA for downstream tasks.
Verification (partial)
Evidence (partial)
As Table 2 shows, base VLMs generally seem to perform poorly on distance pose descriptor with LLaVA and Qwen performing well below the 33.3% accuracy that would have been achieved via random choice. Even the MAE remains high for two of these base models with the lowest MAE being 1.208 for Qwen on the FreiHAND dataset.
Implication (partial)
Table 2 in the provided text explicitly shows accuracy below 33.3% for LLaVA and Qwen on distance descriptors and mentions high MAE.
Verification (partial)
Evidence (partial)
According to Table 2, the performance of base VLMs across datasets excluding FPHA is generally substantially higher than the accuracy of 25% that would have been achieved via random choice, with the lowest being 34.10% for DeepSeek on the InterHand2.6M
Implication (partial)
The text mentions that base VLMs perform better than random chance on angles but still implies a struggle, suggesting room for improvement.
Verification (partial)
Evidence (partial)
We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions.
Implication (partial)
The abstract explicitly states the creation and scale of the HandVQA benchmark, and the analysis section mentions 'over 1.6 million questions'.
Verification (partial)
Evidence (partial)
Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization.
Implication (partial)
The abstract and analysis sections clearly state the limitations found in current VLMs when evaluated on the HandVQA benchmark.
Verification (partial)
Evidence (partial)
VLMs struggle to grasp distance between joints. As Table 2 shows, base VLMs generally seem to perform poorly on distance pose descriptor with LLaVA and Qwen performing well below the 33.3% accuracy that would have been achieved via random choice.
Implication (partial)
The analysis section provides specific details about the poor performance of base VLMs on distance descriptors, including accuracy below random choice.
Verification (partial)
Evidence (partial)
We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving accuracy of model on novel downstream tasks like hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).
Implication (partial)
The abstract explicitly states the performance improvement on downstream tasks after fine-tuning with HandVQA, including specific percentage gains.
Verification (partial)
Evidence (partial)
HandVQA is constructed using precise 3D annotations from widely-used datasets: FreiHAND [70], InterHand2.6M [46], and FPHA [20].
Implication (partial)
The abstract and analysis section clearly state the origin of the 3D annotations used for constructing HandVQA.
Verification (partial)
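
The Table 2 claims in the map above compare model accuracy against random-choice baselines of 33.3% and 25%. Those figures match uniform guessing over 3 and 4 answer options respectively. The sketch below shows that arithmetic along with accuracy and MAE helpers. It is an illustration under assumptions, not the paper's evaluation code: the option counts are inferred from the quoted baselines, the function names are hypothetical, and the MAE helper assumes answer options are encoded as ordered integers, which this page does not confirm.

```python
from typing import Sequence


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 2:
        raise ValueError("a multiple-choice question needs at least 2 options")
    return 1.0 / num_options


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels exactly."""
    if not labels or len(preds) != len(labels):
        raise ValueError("preds and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def mean_absolute_error(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Mean |pred - label|, assuming options are ordinal integer classes."""
    if not labels or len(preds) != len(labels):
        raise ValueError("preds and labels must be non-empty and equal length")
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


if __name__ == "__main__":
    # Assumed option counts: 3 for distance questions, 4 for angle questions.
    print(f"{chance_accuracy(3):.1%}")
    print(f"{chance_accuracy(4):.1%}")
    # 34.10% (the lowest angle accuracy quoted above) versus the 25% baseline.
    print(0.3410 > chance_accuracy(4))
```

A score below `chance_accuracy(k)` on a balanced k-option task suggests a systematic bias rather than a mere lack of signal, which is why the sub-33.3% distance results are notable.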

Author intelligence and commercialization panels stay hidden until the proof receipt is verified, cites at least 3 references, includes at least 2 sources, and clears 50% coverage. The paper narrative and citation surfaces remain public while verification is pending.
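
The visibility rule above can be written as a single predicate. This is a minimal sketch with hypothetical names, not the site's implementation. It assumes "clears 50% coverage" means coverage of at least 50%; if the real threshold is strict, the last comparison changes to `>`.

```python
from dataclasses import dataclass


@dataclass
class ProofReceipt:
    verified: bool    # proof receipt has been verified
    references: int   # number of cited references
    sources: int      # number of distinct sources
    coverage: float   # coverage as a fraction, e.g. 0.5 for 50%


def panels_visible(receipt: ProofReceipt) -> bool:
    """Return True when all four gating conditions hold.

    Assumption: the coverage threshold is inclusive (>= 50%).
    """
    return (
        receipt.verified
        and receipt.references >= 3
        and receipt.sources >= 2
        and receipt.coverage >= 0.5
    )


if __name__ == "__main__":
    # Values from this page: unverified proof, 113 references,
    # 3 sources, 50% coverage.
    current = ProofReceipt(verified=False, references=113, sources=3, coverage=0.5)
    print(panels_visible(current))
```

With the values shown on this page, the unverified proof status alone keeps the panels hidden, regardless of the reference, source, and coverage counts.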
