 mvp


Direct Multi-view Multi-person 3D Pose Estimation

Neural Information Processing Systems

Multi-view multi-person 3D pose estimation aims to localize 3D skeleton joints for each person instance in a scene from multi-view camera inputs. It is a fundamental task that benefits many real-world applications (such as surveillance, sportscast, gaming and mixed reality) and is mainly tackled by reconstruction-based [6, 14, 4] and volumetric [40] approaches in previous literature, as shown in Fig. 1 (a) and (b).




MVP-Shapley: Feature-based Modeling for Evaluating the Most Valuable Player in Basketball

arXiv.org Artificial Intelligence

The burgeoning growth of the esports and multiplayer online gaming community has highlighted the critical importance of evaluating the Most Valuable Player (MVP). Establishing an explainable and practical MVP evaluation method is very challenging. In our study, we specifically focus on play-by-play data, which records related events during the game, such as assists and points. We aim to address the challenges by introducing a new MVP evaluation framework, denoted as MVP-Shapley, which leverages Shapley values. This approach encompasses feature processing, win-loss model training, Shapley value allocation, and MVP ranking determination based on players' contributions. Additionally, we optimize our algorithm to align with expert voting results from the perspective of causality. Finally, we substantiate the efficacy of our method through validation on the NBA dataset and the Dunk City Dynasty dataset, and we deploy it online in the industry.
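The Shapley value allocation and ranking steps named in this abstract can be illustrated with a minimal sketch. It is not the MVP-Shapley algorithm itself: the abstract allocates value through a win-loss model trained on play-by-play features, whereas here the players, the `CONTRIB` numbers and the `win_prob` function are hypothetical stand-ins chosen so that the exact Shapley values can be computed by brute force.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal
    contribution over every possible joining order."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        prev = value(coalition)
        for p in order:
            coalition = coalition | {p}
            cur = value(coalition)
            totals[p] += cur - prev  # marginal contribution of p
            prev = cur
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


# Hypothetical toy "win model" (an assumption, not from the paper):
# additive individual contributions plus a synergy bonus for A and B.
CONTRIB = {"A": 0.30, "B": 0.20, "C": 0.10}


def win_prob(coalition):
    base = sum(CONTRIB[p] for p in coalition)
    bonus = 0.10 if {"A", "B"} <= coalition else 0.0
    return base + bonus


phi = shapley_values(list(CONTRIB), win_prob)
# Rank players by their allocated contribution, highest first.
ranking = sorted(phi, key=phi.get, reverse=True)
print(phi, ranking)
```

In this toy setting the synergy bonus is split evenly between A and B, and the allocations sum to the value of the full team, which is the efficiency property that makes Shapley values attractive for attributing a win to individual players. Enumerating all orderings is only feasible for a handful of players; larger settings typically rely on sampling or model-specific approximations.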


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

To make the paper stronger, the authors can try to use the state-of-the-art (SOTA) Siamese CNN approach of [24] and see how it compares to the proposed methods. Q2: Please summarize your review in 1-2 sentences. This paper proposes to use an autoencoder for pose representation learning on face data. The problems are interesting and the experiments seem to suggest the usefulness of the approach.




Direct Multi-view Multi-person 3D Pose Estimation (Supplementary Material) Tao Wang

Neural Information Processing Systems

Figure S1: (a) Illustration of the proposed hierarchical query embedding and the input-dependent query adaptation schemes. It consists of a self-attention, a projective attention and a feed-forward network (FFN) with residual connections. Fig. S1 (a) illustrates our proposed hierarchical query ... The decoder of the MvP transformer consists of multiple decoder layers for regressing 3D joint locations progressively. Fig. S1 (b) demonstrates the detailed architecture of a decoder layer, ... Results are shown in Table S1. Table S1: Results of replacing camera ray directions with 2D coordinates in RayConv (columns: Positional Input, AP). We further investigate the effectiveness of the proposed projective attention by comparing it with the dense dot product attention, i.e., conducting ... Results are given in Table S2.


Direct Multi-view Multi-person 3D Pose Estimation Tao Wang

Neural Information Processing Systems

Notably, it achieves 92.3% AP ... Multi-view multi-person 3D pose estimation aims to localize 3D skeleton joints for each person instance in a scene from multi-view camera inputs. Additionally, we mitigate the commonly faced generalization issue by a simple query adaptation strategy.


A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs

arXiv.org Artificial Intelligence

Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video language models are susceptible to score inflation due to the presence of shortcut solutions based on superficial visual or textual cues. This paper mitigates the challenges in accurately assessing model performance by introducing the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for assessing the physical understanding of video language models. The benchmark comprises 55K high-quality multiple-choice video QA examples focusing on physical world understanding. Examples are curated from nine video data sources, spanning first-person egocentric and exocentric videos, robotic interaction data, and cognitive science intuitive physics benchmarks. To mitigate shortcut solutions that rely on superficial visual or textual cues and biases, each sample in MVP has a minimal-change pair: a visually similar video accompanied by an identical question but an opposing answer. To answer a question correctly, a model must provide correct answers for both examples in the minimal-change pair; as such, models that rely solely on visual or textual biases would achieve below-random performance. Human performance on MVP is 92.9%, while the best open-source state-of-the-art video-language model achieves 40.2%, compared to random performance at 25%.
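The paired scoring rule described in this abstract can be sketched as follows. This is a minimal illustration, not the benchmark's released evaluation code: the function name `paired_accuracy`, the example identifiers and the yes/no answer format are all assumptions made for the example.

```python
def paired_accuracy(predictions, answers, pairs):
    """Fraction of minimal-change pairs in which BOTH examples
    are answered correctly."""
    correct = 0
    for first, second in pairs:
        both_right = (
            predictions[first] == answers[first]
            and predictions[second] == answers[second]
        )
        if both_right:
            correct += 1
    return correct / len(pairs)


def plain_accuracy(predictions, answers):
    """Ordinary per-example accuracy, shown for comparison."""
    hits = sum(predictions[k] == answers[k] for k in answers)
    return hits / len(answers)


# Hypothetical data: each pair shares a question but has opposing answers.
answers = {"v1": "yes", "v1_pair": "no", "v2": "yes", "v2_pair": "no"}
pairs = [("v1", "v1_pair"), ("v2", "v2_pair")]

# A "shortcut" model that ignores the video and always says "yes".
biased = {k: "yes" for k in answers}

print(plain_accuracy(biased, answers))          # per-example score
print(paired_accuracy(biased, answers, pairs))  # paired score
```

The always-"yes" model gets half of the individual examples right but scores zero under paired scoring, because it can never be correct on both members of a pair whose answers are opposite. If each question had two answer options and guesses were independent, chance under this rule would be 0.5 × 0.5 = 0.25, which is consistent with the 25% random baseline quoted above; the two-option format is an assumption here.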