Wang, Robert
emg2pose: A Large and Diverse Benchmark for Surface Electromyographic Hand Pose Estimation
Salter, Sasha, Warren, Richard, Schlager, Collin, Spurr, Adrian, Han, Shangchen, Bhasin, Rohin, Cai, Yujun, Walkington, Peter, Bolarinwa, Anuoluwapo, Wang, Robert, Danielson, Nathan, Merel, Josh, Pnevmatikakis, Eftychios, Marshall, Jesse
Hands are the primary means through which humans interact with the world. Reliable and always-available hand pose inference could yield new and intuitive control schemes for human-computer interactions, particularly in virtual and augmented reality. Computer vision is effective but requires one or multiple cameras and can struggle with occlusions, limited field of view, and poor lighting. Wearable wrist-based surface electromyography (sEMG) presents a promising alternative as an always-available modality sensing muscle activities that drive hand motion. However, sEMG signals are strongly dependent on user anatomy and sensor placement, and existing sEMG models have required hundreds of users and device placements to effectively generalize. To facilitate progress on sEMG pose inference, we introduce the emg2pose benchmark, the largest publicly available dataset of high-quality hand pose labels and wrist sEMG recordings. emg2pose contains 2kHz, 16 channel sEMG and pose labels from a 26-camera motion capture rig for 193 users, 370 hours, and 29 stages with diverse gestures - a scale comparable to vision-based hand pose datasets. We provide competitive baselines and challenging tasks evaluating real-world generalization scenarios: held-out users, sensor placements, and stages. emg2pose provides the machine learning community a platform for exploring complex generalization problems, holding potential to significantly enhance the development of sEMG-based human-computer interactions.
HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos
Banerjee, Prithviraj, Shkodrani, Sindi, Moulon, Pierre, Hampali, Shreyas, Han, Shangchen, Zhang, Fan, Zhang, Linguang, Fountain, Jade, Miller, Edward, Basol, Selen, Newcombe, Richard, Wang, Robert, Engel, Jakob Julian, Hodan, Tomas
We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze or scene point clouds, as well as comprehensive ground-truth annotations including 3D poses of objects, hands, and cameras, and 3D models of hands and objects. In addition to simple pick-up/observe/put-down actions, HOT3D contains scenarios resembling typical actions in a kitchen, office, and living room environment. The dataset is recorded by two head-mounted devices from Meta: Project Aria, a research prototype of light-weight AR/AI glasses, and Quest 3, a production VR headset sold in millions of units. Ground-truth poses were obtained by a professional motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects. The evaluated multi-view methods, whose benchmarking is uniquely enabled by HOT3D, significantly outperform their single-view counterparts.
Experimental Design Using Interlacing Polynomials
Lau, Lap Chi, Wang, Robert, Zhou, Hong
Experimental design is a classical problem in statistics [ Puk06 ], which recently found wide applications from machine learning (e.g., active learning, feature selection, data summ arization) to numerical linear algebra (e.g., column subset selection, sparse least squares regression) t o graph algorithms (e.g., total effective resistance minimization, algebraic connectivity maximization). We ref er the reader to [ SX20, AZLSW21, NST22, LZ22b, LZ22a, LWZ23 ] and the references therein for additional background and related applications.
Analysis of Corrected Graph Convolutions
Wang, Robert, Baranwal, Aseem, Fountoulakis, Kimon
Machine learning for node classification on graphs is a prominent area driven by applications such as recommendation systems. State-of-the-art models often use multiple graph convolutions on the data, as empirical evidence suggests they can enhance performance. However, it has been shown empirically and theoretically, that too many graph convolutions can degrade performance significantly, a phenomenon known as oversmoothing. In this paper, we provide a rigorous theoretical analysis, based on the contextual stochastic block model (CSBM), of the performance of vanilla graph convolution from which we remove the principal eigenvector to avoid oversmoothing. We perform a spectral analysis for $k$ rounds of corrected graph convolutions, and we provide results for partial and exact classification. For partial classification, we show that each round of convolution can reduce the misclassification error exponentially up to a saturation level, after which performance does not worsen. For exact classification, we show that the separability threshold can be improved exponentially up to $O({\log{n}}/{\log\log{n}})$ corrected convolutions.
DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos
Parger, Mathias, Tang, Chengcheng, Twigg, Christopher D., Keskin, Cem, Wang, Robert, Steinberger, Markus
Convolutional neural network inference on video data requires powerful hardware for real-time processing. Given the inherent coherence across consecutive frames, large parts of a video typically change little. By skipping identical image regions and truncating insignificant pixel updates, computational redundancy can in theory be reduced significantly. However, these theoretical savings have been difficult to translate into practice, as sparse updates hamper computational consistency and memory access coherence; which are key for efficiency on real hardware. With DeltaCNN, we present a sparse convolutional neural network framework that enables sparse frame-by-frame updates to accelerate video inference in practice. We provide sparse implementations for all typical CNN layers and propagate sparse feature updates end-to-end - without accumulating errors over time. DeltaCNN is applicable to all convolutional neural networks without retraining. To the best of our knowledge, we are the first to significantly outperform the dense reference, cuDNN, in practical settings, achieving speedups of up to 7x with only marginal differences in accuracy.
MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos
Parger, Mathias, Tang, Chengcheng, Neff, Thomas, Twigg, Christopher D., Keskin, Cem, Wang, Robert, Steinberger, Markus
Convolutional neural network inference on video input is computationally expensive and requires high memory bandwidth. Recently, DeltaCNN managed to reduce the cost by only processing pixels with significant updates over the previous frame. However, DeltaCNN relies on static camera input. Moving cameras add new challenges in how to fuse newly unveiled image regions with already processed regions efficiently to minimize the update rate - without increasing memory overhead and without knowing the camera extrinsics of future frames. In this work, we propose MotionDeltaCNN, a sparse CNN inference framework that supports moving cameras. We introduce spherical buffers and padded convolutions to enable seamless fusion of newly unveiled regions and previously processed regions -- without increasing memory footprint. Our evaluation shows that we outperform DeltaCNN by up to 90% for moving camera videos.
Fast Algorithms for Directed Graph Partitioning Using Flows and Reweighted Eigenvalues
Lau, Lap Chi, Tung, Kam Chuen, Wang, Robert
We consider a new semidefinite programming relaxation for directed edge expansion, which is obtained by adding triangle inequalities to the reweighted eigenvalue formulation. Applying the matrix multiplicative weight update method to this relaxation, we derive almost linear-time algorithms to achieve $O(\sqrt{\log{n}})$-approximation and Cheeger-type guarantee for directed edge expansion, as well as an improved cut-matching game for directed graphs. This provides a primal-dual flow-based framework to obtain the best known algorithms for directed graph partitioning. The same approach also works for vertex expansion and for hypergraphs, providing a simple and unified approach to achieve the best known results for different expansion problems and different algorithmic techniques.
Experimental Design for Any $p$-Norm
Lau, Lap Chi, Wang, Robert, Zhou, Hong
We consider a general $p$-norm objective for experimental design problems that captures some well-studied objectives (D/A/E-design) as special cases. We prove that a randomized local search approach provides a unified algorithm to solve this problem for all $p$. This provides the first approximation algorithm for the general $p$-norm objective, and a nice interpolation of the best known bounds of the special cases.
Cheeger Inequalities for Directed Graphs and Hypergraphs Using Reweighted Eigenvalues
Lau, Lap Chi, Tung, Kam Chuen, Wang, Robert
We derive Cheeger inequalities for directed graphs and hypergraphs using the reweighted eigenvalue approach that was recently developed for vertex expansion in undirected graphs [OZ22,KLT22,JPV22]. The goal is to develop a new spectral theory for directed graphs and an alternative spectral theory for hypergraphs. The first main result is a Cheeger inequality relating the vertex expansion $\vec{\psi}(G)$ of a directed graph $G$ to the vertex-capacitated maximum reweighted second eigenvalue $\vec{\lambda}_2^{v*}$: \[ \vec{\lambda}_2^{v*} \lesssim \vec{\psi}(G) \lesssim \sqrt{\vec{\lambda}_2^{v*} \cdot \log (\Delta/\vec{\lambda}_2^{v*})}. \] This provides a combinatorial characterization of the fastest mixing time of a directed graph by vertex expansion, and builds a new connection between reweighted eigenvalued, vertex expansion, and fastest mixing time for directed graphs. The second main result is a stronger Cheeger inequality relating the edge conductance $\vec{\phi}(G)$ of a directed graph $G$ to the edge-capacitated maximum reweighted second eigenvalue $\vec{\lambda}_2^{e*}$: \[ \vec{\lambda}_2^{e*} \lesssim \vec{\phi}(G) \lesssim \sqrt{\vec{\lambda}_2^{e*} \cdot \log (1/\vec{\lambda}_2^{e*})}. \] This provides a certificate for a directed graph to be an expander and a spectral algorithm to find a sparse cut in a directed graph, playing a similar role as Cheeger's inequality in certifying graph expansion and in the spectral partitioning algorithm for undirected graphs. We also use this reweighted eigenvalue approach to derive the improved Cheeger inequality for directed graphs, and furthermore to derive several Cheeger inequalities for hypergraphs that match and improve the existing results in [Lou15,CLTZ18]. These are supporting results that this provides a unifying approach to lift the spectral theory for undirected graphs to more general settings.
Neural Correspondence Field for Object Pose Estimation
Huang, Lin, Hodan, Tomas, Ma, Lingni, Zhang, Linguang, Tran, Luan, Twigg, Christopher, Wu, Po-Chen, Yuan, Junsong, Keskin, Cem, Wang, Robert
We propose a method for estimating the 6DoF pose of a rigid object with an available 3D model from a single RGB image. Unlike classical correspondence-based methods which predict 3D object coordinates at pixels of the input image, the proposed method predicts 3D object coordinates at 3D query points sampled in the camera frustum. The move from pixels to 3D points, which is inspired by recent PIFu-style methods for 3D reconstruction, enables reasoning about the whole object, including its (self-)occluded parts. For a 3D query point associated with a pixel-aligned image feature, we train a fully-connected neural network to predict: (i) the corresponding 3D object coordinates, and (ii) the signed distance to the object surface, with the first defined only for query points in the surface vicinity. We call the mapping realized by this network as Neural Correspondence Field. The object pose is then robustly estimated from the predicted 3D-3D correspondences by the Kabsch-RANSAC algorithm. The proposed method achieves state-of-the-art results on three BOP datasets and is shown superior especially in challenging cases with occlusion. The project website is at: linhuang17.github.io/NCF.