Goto

From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos

Neural Information Processing Systems

Three-dimensional (3D) understanding of objects and scenes plays a key role in humans' ability to interact with the world and has been an active area of research in computer vision, graphics, and robotics. Large-scale synthetic and object-centric 3D datasets have been shown to be effective for training models with 3D understanding of objects. However, applying a similar approach to real-world objects and scenes is difficult due to a lack of large-scale data. Videos are a potential source of real-world 3D data, but finding diverse yet corresponding views of the same content has proven difficult at scale. Furthermore, standard videos come with fixed viewpoints, determined at the time of capture.


Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning

Hong, Kaifeng, Zhang, Yinglong, Hong, Xiaoying, Xia, Xuewen, Xu, Xing

arXiv.org Artificial Intelligence

Text-attributed graphs require models to effectively combine strong textual understanding with structurally informed reasoning. Existing approaches either rely on GNNs--limited by over-smoothing and hop-dependent diffusion--or employ Transformers that overlook graph topology and treat nodes as isolated sequences. We propose Odin (Oriented Dual-module INtegration), a new architecture that injects graph structure into Transformers at selected depths through an oriented dual-module mechanism. Unlike message-passing GNNs, Odin does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific Transformer layers, yielding low-, mid-, and high-level structural abstraction aligned with the model's semantic hierarchy. Because aggregation operates on the global [CLS] representation, Odin fundamentally avoids over-smoothing and decouples structural abstraction from neighborhood size or graph topology. We further establish that Odin's expressive power strictly contains that of both pure Transformers and GNNs. To make the design efficient in large-scale or low-resource settings, we introduce Light Odin, a lightweight variant that preserves the same layer-aligned structural abstraction for faster training and inference. Experiments on multiple text-rich graph benchmarks show that Odin achieves state-of-the-art accuracy, while Light Odin delivers competitive performance with significantly reduced computational cost. Together, Odin and Light Odin form a unified, hop-free framework for principled structure-text integration. The source code of this model has been released at https://github.com/hongkaifeng/Odin.
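As a rough illustration of the mechanism the abstract describes, the sketch below is hypothetical: the function name, the mixing weight `alpha`, and the mean-over-neighbors aggregation are assumptions for exposition, not Odin's actual implementation. It shows structure being injected only at selected Transformer depths, operating on each node's global [CLS] vector rather than on token states.

```python
import numpy as np

def inject_structure(cls_states, adj, inject_layers, layer_idx, alpha=0.5):
    """Hypothetical sketch: at selected Transformer depths, mix each node's
    global [CLS] vector with the mean of its neighbors' [CLS] vectors.
    cls_states: (num_nodes, dim) array of per-node [CLS] representations.
    adj: (num_nodes, num_nodes) binary adjacency matrix (float).
    inject_layers: set of layer indices where structure is injected.
    """
    if layer_idx not in inject_layers:
        return cls_states  # ordinary Transformer layer: no structural mixing
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0  # isolated nodes keep their own state
    neighbor_mean = adj @ cls_states / deg
    # Aggregating only the [CLS] summary at a few chosen depths (rather than
    # diffusing token states hop by hop) is what decouples structural
    # abstraction from neighborhood size in the description above.
    return (1 - alpha) * cls_states + alpha * neighbor_mean
```

In this reading, placing the injection points at shallow, middle, and deep layers would yield the low-, mid-, and high-level structural abstraction the abstract mentions.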





Reviews: Out-of-Distribution Detection using Multiple Semantic Label Representations

Neural Information Processing Systems

The paper studies out-of-distribution detection: we would like to detect samples that do not belong to the training distribution. To solve this problem, the authors propose reframing the prediction problem as regression, in which the sparse one-hot encoding of the labels in the prediction space is replaced by word embeddings. The presented method also has a flavour of ensemble methods, in the sense that a base feature representation (ResNet for image classification, LeNet for speech recognition) is used, and the features then branch off into multiple fully connected networks, each of which generates an embedding vector as the prediction. Each fully connected network is responsible for generating vectors in a specific embedding space. A total of five embedding spaces are used (Skip-gram, GloVe, and FastText, trained on independent text datasets).
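The review does not spell out how the per-space embedding predictions are turned into a detection score, so the sketch below is an assumed illustration, not the paper's scoring rule: average, over the embedding spaces, the cosine similarity between the predicted vector and its nearest known label embedding, and treat low agreement as evidence of an out-of-distribution sample.

```python
import numpy as np

def ood_score(pred_embs, label_embs_per_space):
    """Hypothetical scoring sketch (the paper's exact rule may differ):
    for each embedding space, measure how close the predicted vector is to
    its nearest known label embedding; average across spaces. A low value
    suggests the sample lies outside the training distribution.
    pred_embs: list of (dim_k,) predicted vectors, one per embedding space.
    label_embs_per_space: list of (num_labels, dim_k) label embeddings.
    """
    sims = []
    for pred, labels in zip(pred_embs, label_embs_per_space):
        pred = pred / (np.linalg.norm(pred) + 1e-12)
        labels = labels / (np.linalg.norm(labels, axis=1, keepdims=True) + 1e-12)
        sims.append(float((labels @ pred).max()))  # best cosine match here
    return sum(sims) / len(sims)
```

Averaging across several independently trained embedding spaces is where the ensemble flavour comes in: a sample must look in-distribution under every semantic representation at once to score highly.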


Image-based Novel Fault Detection with Deep Learning Classifiers using Hierarchical Labels

Sergin, Nurettin, Huang, Jiayu, Chang, Tzyy-Shuh, Yan, Hao

arXiv.org Artificial Intelligence

Many manufacturing systems are instrumented with image-sensing systems to monitor process performance and product quality. The low cost and rich information of image-based sensing have led to high-dimensional data streams that provide distinctive opportunities for performance improvement. In particular, accurate process monitoring and fault classification are among the benefits gained from the rich information these image sensors provide. In the literature, process monitoring often refers to the step of detecting and isolating abnormal samples in a given process. Normally, after process monitoring, fault classification is performed, and the isolated fault is classified into one or more known fault types. Fault classification is an essential step within the process monitoring loop, at which point the type of each detected and identified fault is determined (Chiang et al., 2001). Accurate fault classification can provide engineers with valuable information to isolate and diagnose system faults and anomalies, improving quality and maximizing system efficiency. However, fault classification in manufacturing systems typically assumes a fixed set of fault modes. As a result, an existing fault classification model may make overconfident decisions or fail silently, and at times dangerously, on new, unseen fault types.
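A minimal sketch of the failure mode described above, assuming a simple confidence-threshold rule (an illustration, not the paper's hierarchical-label method): a fixed-set classifier made novelty-aware by abstaining when its top predicted probability is low, instead of silently forcing an unseen fault into a known class.

```python
import numpy as np

def classify_or_flag_novel(probs, known_faults, threshold=0.9):
    """Illustrative baseline: flag a sample as a potential novel fault when
    the classifier's top probability falls below `threshold` (the threshold
    value is an assumption chosen for illustration).
    probs: (num_known_faults,) predicted class probabilities.
    known_faults: list of fault-type names, aligned with `probs`.
    """
    top = int(np.argmax(probs))
    if probs[top] < threshold:
        return "novel-fault-candidate"  # route to an engineer for diagnosis
    return known_faults[top]
```

Without such an abstain path, the model the paragraph describes would always return some known fault type, which is exactly the overconfident, silent-failure behavior on unseen fault modes.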