Wang, Jiaxu
Mamba as Decision Maker: Exploring Multi-scale Sequence Modeling in Offline Reinforcement Learning
Cao, Jiahang, Zhang, Qiang, Wang, Ziqing, Wang, Jiaxu, Cheng, Hao, Shao, Yecheng, Zhao, Wen, Han, Gang, Guo, Yijie, Xu, Renjing
Sequential modeling has demonstrated remarkable capabilities in offline reinforcement learning (RL), with Decision Transformer (DT) being one of its most notable and successful representatives. However, RL trajectories possess unique properties that distinguish them from conventional sequences (e.g., text or audio): (1) local correlation, where the next state is theoretically determined solely by the current state and action under the Markov Decision Process (MDP), and (2) global correlation, where each step's features are related to long-term historical information due to the time-continuous nature of trajectories. In this paper, we propose a novel action sequence predictor, named Mamba Decision Maker (MambaDM), where Mamba is expected to be a promising alternative for sequence modeling paradigms, owing to its efficient modeling of multi-scale dependencies. In particular, we introduce a novel mixer module that extracts and integrates both global and local features of the input sequence, effectively capturing interrelationships in RL datasets. Extensive experiments demonstrate that MambaDM achieves state-of-the-art performance on Atari and OpenAI Gym datasets. Furthermore, we empirically investigate the scaling laws of MambaDM, finding that increasing the model size does not improve performance, whereas doubling the dataset size yields up to a 33.7% score improvement on the Atari dataset. This paper delves into the sequence modeling capabilities of MambaDM in the RL domain, paving the way for future advancements in robust and efficient decision-making systems. Our code will be available at https://github.com/AndyCao1125/MambaDM.
EvGGS: A Collaborative Learning Framework for Event-based Generalizable Gaussian Splatting
Wang, Jiaxu, He, Junhao, Zhang, Ziyi, Sun, Mingyuan, Sun, Jingkai, Xu, Renjing
Event cameras offer promising advantages such as high dynamic range and low latency, making them well-suited for challenging lighting conditions and fast-moving scenarios. However, reconstructing 3D scenes from raw event streams is difficult because event data is sparse and does not carry absolute color information. To unlock their potential in 3D reconstruction, we propose the first event-based generalizable 3D reconstruction framework, called EvGGS, which reconstructs scenes as 3D Gaussians from only event input in a feedforward manner and can generalize to unseen cases without any retraining. This framework includes a depth estimation module, an intensity reconstruction module, and a Gaussian regression module. These submodules are connected in a cascade, and we train them collaboratively with a dedicated joint loss so that they reinforce one another. To facilitate related studies, we build a novel event-based 3D dataset with objects of various materials and calibrated labels of grayscale images, depth maps, camera poses, and silhouettes. Experiments show that jointly trained models significantly outperform those trained individually. Our approach outperforms all baselines in reconstruction quality and in depth and intensity prediction, while maintaining satisfactory rendering speed.
Physical Priors Augmented Event-Based 3D Reconstruction
Wang, Jiaxu, He, Junhao, Zhang, Ziyi, Xu, Renjing
3D neural implicit representations are a significant component of many robotic applications. However, reconstructing neural radiance fields (NeRF) from realistic event data remains a challenge due to the sparsity of the data and the lack of information when only event streams are available. In this paper, we utilize motion, geometry, and density priors behind event data to impose strong physical constraints that augment NeRF training. The proposed pipeline can directly benefit from those priors to reconstruct 3D scenes without additional inputs. Moreover, we present a novel density-guided patch-based sampling strategy for robust and efficient learning, which not only accelerates training but also helps express local geometry. More importantly, we establish the first large dataset for event-based 3D reconstruction, which contains 101 objects with various materials and geometries, along with ground-truth images and depth maps for all camera viewpoints, and which significantly facilitates other research in related fields. The code and dataset will be publicly available at https://github.com/Mercerai/PAEv3d.
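As a rough illustration of what density-guided patch sampling could look like, the sketch below draws patch locations with probability proportional to the total density inside each patch. It is not taken from the paper: it assumes that "density" means a per-pixel event count map (the abstract does not specify this), and the function name `sample_patch_corners` and its parameters are hypothetical.

```python
import random


def sample_patch_corners(density, patch_size, num_patches, seed=None):
    """Sample top-left patch corners, weighted by the density each patch covers.

    `density` is a 2D list of non-negative numbers (assumed here to be
    per-pixel event counts; this is an illustrative assumption).
    """
    rng = random.Random(seed)
    height, width = len(density), len(density[0])
    corners, weights = [], []
    # Enumerate every patch that fits fully inside the map.
    for y in range(height - patch_size + 1):
        for x in range(width - patch_size + 1):
            total = sum(
                density[y + dy][x + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            )
            corners.append((y, x))
            weights.append(total)
    # With no density anywhere, fall back to uniform sampling.
    if sum(weights) == 0:
        return [rng.choice(corners) for _ in range(num_patches)]
    return rng.choices(corners, weights=weights, k=num_patches)
```

Sampling whole patches rather than isolated pixels keeps neighbouring rays together, which is what lets a patch-based loss see local structure. Weighting by density concentrates those patches where events actually occurred.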
Chasing Day and Night: Towards Robust and Efficient All-Day Object Detection Guided by an Event Camera
Cao, Jiahang, Zheng, Xu, Lyu, Yuanhuiyi, Wang, Jiaxu, Xu, Renjing, Wang, Lin
The ability to detect objects in all lighting (i.e., normal-, over-, and under-exposed) conditions is crucial for real-world applications, such as self-driving. Traditional RGB-based detectors often fail under such varying lighting conditions. Therefore, recent works utilize novel event cameras to supplement or guide the RGB modality; however, these methods typically adopt asymmetric network structures that rely predominantly on the RGB modality, resulting in limited robustness for all-day detection. In this paper, we propose EOLO, a novel object detection framework that achieves robust and efficient all-day detection by fusing both RGB and event modalities. Our EOLO framework is built on a lightweight spiking neural network (SNN) to efficiently leverage the asynchronous property of events. Building on this, we first introduce an Event Temporal Attention (ETA) module to learn the high temporal information from events while preserving crucial edge information. Second, as different modalities exhibit varying levels of importance under diverse lighting conditions, we propose a novel Symmetric RGB-Event Fusion (SREF) module to effectively fuse RGB-Event features without relying on a specific modality, thus ensuring a balanced and adaptive fusion for all-day detection. In addition, to compensate for the lack of paired RGB-Event datasets for all-day training and evaluation, we propose an event synthesis approach based on randomized optical flow that allows for directly generating the event frame from a single exposure image. We further build two new datasets, E-MSCOCO and E-VOC, based on the popular benchmarks MSCOCO and PASCAL VOC. Extensive experiments demonstrate that our EOLO outperforms the state-of-the-art detectors, e.g., RENet, by a substantial margin (+3.74% mAP50) in all lighting conditions. Our code and datasets will be available at https://vlislab22.github.io/EOLO/
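The idea of generating an event frame from a single image with a random optical flow can be sketched with the standard linearized event model: under brightness constancy, a flow (u, v) changes log intensity L by roughly -(L_x u + L_y v), and an event fires where that change exceeds a contrast threshold. The code below is a minimal sketch under that assumption, not EOLO's actual synthesis procedure. The function name, the single global flow vector, and the `max_shift` and `threshold` defaults are all hypothetical.

```python
import math
import random


def synthesize_event_frame(image, flow=None, max_shift=2.0, threshold=0.1, seed=None):
    """Approximate a signed event frame from one grayscale image.

    `image` is a 2D list of intensities in [0, 1]. If `flow` is None, one
    random global flow vector (u, v) in pixels is drawn (an assumption made
    for illustration). Returns (events, flow) with events in {-1, 0, 1}.
    """
    rng = random.Random(seed)
    if flow is None:
        flow = (rng.uniform(-max_shift, max_shift), rng.uniform(-max_shift, max_shift))
    u, v = flow
    height, width = len(image), len(image[0])
    eps = 1e-3  # avoids log(0) on black pixels
    log_img = [[math.log(p + eps) for p in row] for row in image]
    events = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Finite-difference gradients of log intensity, clamped at borders.
            x0, x1 = max(x - 1, 0), min(x + 1, width - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, height - 1)
            gx = (log_img[y][x1] - log_img[y][x0]) / max(x1 - x0, 1)
            gy = (log_img[y1][x] - log_img[y0][x]) / max(y1 - y0, 1)
            # Brightness constancy: change in log intensity induced by the flow.
            delta = -(gx * u + gy * v)
            if delta >= threshold:
                events[y][x] = 1
            elif delta <= -threshold:
                events[y][x] = -1
    return events, flow
```

Calling `synthesize_event_frame(img, seed=0)` on any grayscale image returns a polarity map that is non-zero mainly along edges. This matches the intuition that events encode moving intensity boundaries rather than absolute brightness.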