Not enough data to create a plot.
Try a different view from the menu above.
Gao, Jun
Attentional Graph Meta-Learning for Indoor Localization Using Extremely Sparse Fingerprints
Yan, Wenzhong, Yin, Feng, Gao, Jun, Wang, Ao, Tian, Yang, Chen, Ruizhi
Fingerprint-based indoor localization is often labor-intensive due to the need for dense grids and repeated measurements across time and space. Maintaining high localization accuracy with extremely sparse fingerprints remains a persistent challenge. Existing benchmark methods primarily rely on the measured fingerprints, while neglecting valuable spatial and environmental characteristics. In this paper, we propose a systematic integration of an Attentional Graph Neural Network (AGNN) model, capable of learning spatial adjacency relationships and aggregating information from neighboring fingerprints, and a meta-learning framework that utilizes datasets with similar environmental characteristics to enhance model training. To minimize the labor required for fingerprint collection, we introduce two novel data augmentation strategies: 1) unlabeled fingerprint augmentation using moving platforms, which enables the semi-supervised AGNN model to incorporate information from unlabeled fingerprints, and 2) synthetic labeled fingerprint augmentation through environmental digital twins, which enhances the meta-learning framework through a practical distribution alignment, which can minimize the feature discrepancy between synthetic and real-world fingerprints effectively. By integrating these novel modules, we propose the Attentional Graph Meta-Learning (AGML) model. This novel model combines the strengths of the AGNN model and the meta-learning framework to address the challenges posed by extremely sparse fingerprints. To validate our approach, we collected multiple datasets from both consumer-grade WiFi devices and professional equipment across diverse environments. Extensive experiments conducted on both synthetic and real-world datasets demonstrate that the AGML model-based localization method consistently outperforms all baseline methods using sparse fingerprints across all evaluated metrics.
InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models
Lu, Yifan, Ren, Xuanchi, Yang, Jiawei, Shen, Tianchang, Wu, Zhangjie, Gao, Jun, Wang, Yue, Chen, Siheng, Chen, Mike, Fidler, Sanja, Huang, Jiahui
Previous methods for scene generation either suffer from limited scales or lack geometric and appearance Generating simulatable and controllable 3D scenes is an essential consistency along generated sequences. In contrast, task for a wide spectrum of applications, including we leverage the recent advancements in scalable 3D mixed reality, robotics, and the training and testing of autonomous representation and video models to achieve large dynamic vehicles (AV) [25, 33]. In particular, the requirements scene generation that allows flexible controls through HD of AV applications have introduced new challenges maps, vehicle bounding boxes, and text descriptions. First, for 3D generative models in driving scenarios, posing the we construct a map-conditioned sparse-voxel-based 3D following key desiderata: (1) fidelity and consistency, to generative model to unleash its power for unbounded voxel ensure that the generated scenes support photo-realistic rendering world generation. Then, we re-purpose a video model and while preserving consistent appearance and geometry ground it on the voxel world through a set of carefully designed for reliable and stable physics simulation; (2) largescale, pixel-aligned guidance buffers, synthesizing a consistent to generate scenes at map-level for traffic simulation; appearance. Finally, we propose a fast feed-forward and (3) controllability, to allow easy manipulation of the approach that employs both voxel and pixel branches to lift scene layout, appearance, and ego-car behaviors for curating the dynamic videos to dynamic 3D Gaussians with control-adversarial scenarios.
Interleaved-Modal Chain-of-Thought
Gao, Jun, Li, Yongqi, Cao, Ziqiang, Li, Wenjie
Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transitioning to vision-language models (VLMs), their text-only rationales struggle to express the fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named \textbf{Interleaved-modal Chain-of-Thought (ICoT)}, which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, the novel ICoT requires VLMs to enable the generation of fine-grained interleaved-modal content, which is hard for current VLMs to fulfill. Considering that the required visual information is usually part of the input image, we propose \textbf{Attention-driven Selection (ADS)} to realize ICoT over existing VLMs. ADS intelligently inserts regions of the input image to generate the interleaved-modal reasoning steps with ignorable additional latency. ADS relies solely on the attention map of VLMs without the need for parameterization, and therefore it is a plug-and-play strategy that can be generalized to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs of different architectures. Extensive evaluations of three benchmarks have shown that ICoT prompting achieves substantial performance (up to 14\%) and interpretability improvements compared to existing multimodal CoT prompting methods.
BSG4Bot: Efficient Bot Detection based on Biased Heterogeneous Subgraphs
Miao, Hao, Liu, Zida, Gao, Jun
The detection of malicious social bots has become a crucial task, as bots can be easily deployed and manipulated to spread disinformation, promote conspiracy messages, and more. Most existing approaches utilize graph neural networks (GNNs)to capture both user profle and structural features,achieving promising progress. However, they still face limitations including the expensive training on large underlying graph, the performance degration when similar neighborhood patterns' assumption preferred by GNNs is not satisfied, and the dynamic features of bots in a highly adversarial context. Motivated by these limitations, this paper proposes a method named BSG4Bot with an intuition that GNNs training on Biased SubGraphs can improve both performance and time/space efficiency in bot detection. Specifically, BSG4Bot first pre-trains a classifier on node features efficiently to define the node similarities, and constructs biased subgraphs by combining the similarities computed by the pre-trained classifier and the node importances computed by Personalized PageRank (PPR scores). BSG4Bot then introduces a heterogeneous GNN over the constructed subgraphs to detect bots effectively and efficiently. The relatively stable features, including the content category and temporal activity features, are explored and incorporated into BSG4Bot after preliminary verification on sample data. The extensive experimental studies show that BSG4Bot outperforms the state-of-the-art bot detection methods, while only needing nearly 1/5 training time.
SpaceMesh: A Continuous Representation for Learning Manifold Surface Meshes
Shen, Tianchang, Li, Zhaoshuo, Law, Marc, Atzmon, Matan, Fidler, Sanja, Lucas, James, Gao, Jun, Sharp, Nicholas
Meshes are ubiquitous in visual computing and simulation, yet most existing machine learning techniques represent meshes only indirectly, e.g. as the level set of a scalar field or deformation of a template, or as a disordered triangle soup lacking local structure. This work presents a scheme to directly generate manifold, polygonal meshes of complex connectivity as the output of a neural network. Our key innovation is to define a continuous latent connectivity space at each mesh vertex, which implies the discrete mesh. In particular, our vertex embeddings generate cyclic neighbor relationships in a halfedge mesh representation, which gives a guarantee of edge-manifoldness and the ability to represent general polygonal meshes. This representation is well-suited to machine learning and stochastic optimization, without restriction on connectivity or topology. We first explore the basic properties of this representation, then use it to fit distributions of meshes from large datasets. The resulting models generate diverse meshes with tessellation structure learned from the dataset population, with concise details and high-quality mesh elements. In applications, this approach not only yields high-quality outputs from generative models, but also enables directly learning challenging geometry processing tasks such as mesh repair.
DECIDER: A Dual-System Rule-Controllable Decoding Framework for Language Generation
Xu, Chen, Lan, Tian, Yu, Changlong, Wang, Wei, Gao, Jun, Ji, Yu, Dong, Qunxi, Qian, Kun, Li, Piji, Bi, Wei, Hu, Bin
Constrained decoding approaches aim to control the meaning or style of text generated by a Pre-trained Language Model (PLM) using specific target words during inference. However, these methods often guide plausible continuations by greedily selecting targets, which, while completing the task, may disrupt the natural patterns of human language generation. In this work, we propose a novel decoding framework, DECIDER, which enables us to program rules on how we complete tasks to control a PLM. Differing from previous work, our framework transforms the encouragement of target words into the encouragement of all words that satisfy the rule. Specifically, DECIDER is a dual system where a PLM is equipped with a First-OrderLogic (FOL) reasoner to express and evaluate the rules, and a decision function to merge the outputs from both systems to steer the generation. Experiments on CommonGen and PersonaChat demonstrate that DECIDER can effectively follow given rules to achieve generation tasks in a more human-like manner.
AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning
Gao, Jun, Qiao, Qian, Cao, Ziqiang, Wang, Zili, Li, Wenjie
In-context learning (ICL) facilitates Large Language Models (LLMs) exhibiting emergent ability on downstream tasks without updating billions of parameters. However, in the area of multi-modal Large Language Models (MLLMs), two problems hinder the application of multi-modal ICL: (1) Most primary MLLMs are only trained on single-image datasets, making them unable to read multi-modal demonstrations. (2) With the demonstrations increasing, thousands of visual tokens highly challenge hardware and degrade ICL performance. During preliminary explorations, we discovered that the inner LLM tends to focus more on the linguistic modality within multi-modal demonstrations to generate responses. Therefore, we propose a general and light-weighted framework \textbf{AIM} to tackle the mentioned problems through \textbf{A}ggregating \textbf{I}mage information of \textbf{M}ultimodal demonstrations to the dense latent space of the corresponding linguistic part. Specifically, AIM first uses the frozen backbone MLLM to read each image-text demonstration and extracts the vector representations on top of the text. These vectors naturally fuse the information of the image-text pair, and AIM transforms them into fused virtual tokens acceptable for the inner LLM via a trainable projection layer. Ultimately, these fused tokens function as variants of multi-modal demonstrations, fed into the MLLM to direct its response to the current query as usual. Because these fused tokens stem from the textual component of the image-text pair, a multi-modal demonstration is nearly reduced to a pure textual demonstration, thus seamlessly applying to any MLLMs. With its de facto MLLM frozen, AIM is parameter-efficient and we train it on public multi-modal web corpora which have nothing to do with downstream test tasks.
SelfCP: Compressing Over-Limit Prompt via the Frozen Large Language Model Itself
Gao, Jun, Cao, Ziqiang, Li, Wenjie
Long prompt leads to huge hardware costs when using transformer-based Large Language Models (LLMs). Unfortunately, many tasks, such as summarization, inevitably introduce long documents, and the wide application of in-context learning easily makes the prompt length explode. This paper proposes a Self-Compressor (SelfCP), which employs the target LLM itself to compress over-limit prompts into dense vectors while keeping the allowed prompts unmodified. Dense vectors are then projected into dense tokens via a learnable connector to make the same LLM unburden to understand. The connector is supervised-tuned under the language modeling objective of the LLM on relatively long texts selected from publicly accessed datasets, involving an instruction dataset to make SelfCP respond to various prompts, while the target LLM keeps frozen during training. We build the lightweight SelfCP upon 2 different backbones with merely 17M learnable parameters originating from the connector and a learnable embedding. Evaluation on both English and Chinese benchmarks demonstrate that SelfCP effectively substitutes 12$\times$ over-limit prompts with dense tokens to reduce memory costs and booster inference throughputs, yet improving response quality. The outstanding performance brings an efficient solution for LLMs to tackle long prompts without training LLMs from scratch.
Unifying Demonstration Selection and Compression for In-Context Learning
Gao, Jun, Cao, Ziqiang, Li, Wenjie
In-context learning (ICL) facilitates large language models (LLMs) exhibiting spectacular emergent capabilities in various scenarios. Unfortunately, introducing demonstrations easily makes the prompt length explode, bringing a significant burden to hardware. In addition, random demonstrations usually achieve limited improvements in ICL, necessitating demonstration selection among accessible candidates. Previous studies introduce extra modules to perform demonstration compression or selection independently. In this paper, we propose an ICL framework UniICL, which Unifies demonstration selection and compression, and final response generation via a single frozen LLM. Specifically, UniICL first projects actual demonstrations and inference text inputs into short virtual tokens, respectively. Then, virtual tokens are applied to select suitable demonstrations by measuring semantic similarity within latent space among candidate demonstrations and inference input. Finally, inference text inputs together with selected virtual demonstrations are fed into the same frozen LLM for response generation. Notably, UniICL is a parameter-efficient framework that only contains 17M trainable parameters originating from the projection layer. We conduct experiments and analysis over in- and out-domain datasets of both generative and understanding tasks, encompassing ICL scenarios with plentiful and limited demonstration candidates. Results show that UniICL effectively unifies $12 \times$ compression, demonstration selection, and response generation, efficiently scaling up the baseline from 4-shot to 64-shot ICL in IMDb with 24 GB CUDA allocation
Guiding ChatGPT to Generate Salient Domain Summaries
Gao, Jun, Cao, Ziqiang, Huang, Shaoyao, Qin, Luozheng, Ai, Chunhui
ChatGPT is instruct-tuned to generate general and human-expected content to align with human preference through Reinforcement Learning from Human Feedback (RLHF), meanwhile resulting in generated responses not salient enough. Therefore, in this case, ChatGPT may fail to satisfy domain requirements in zero-shot settings, leading to poor ROUGE scores. Inspired by the In-Context Learning (ICL) and retelling ability of ChatGPT, this paper proposes PADS, a \textbf{P}ipeline for \textbf{A}ssisting ChatGPT in \textbf{D}omain \textbf{S}ummarization. PADS consists of a retriever to retrieve similar examples from corpora and a rank model to rerank the multiple candidate summaries generated by ChatGPT. Specifically, given an inference document, we first retrieve an in-context demonstration via the retriever. Then, we require ChatGPT to generate $k$ candidate summaries for the inference document at a time under the guidance of the retrieved demonstration. Finally, the rank model independently scores the $k$ candidate summaries according to their quality and selects the optimal one. We extensively explore dense and sparse retrieval methods to select effective demonstrations for reference and efficiently train the rank model to reflect the quality of candidate summaries for each given summarized document. Additionally, PADS contains merely 400M trainable parameters originating from the rank model and we merely collect 2.5k data to train it. We evaluate PADS on five datasets from different domains, and the result indicates that each module in PADS is committed to effectively guiding ChatGPT to generate salient summaries fitting different domain requirements. Specifically, in the popular summarization dataset Gigaword, PADS achieves over +8 gain on ROUGE-L, compared with the naive ChatGPT in the zero-shot setting. \footnote{Our code are available at \url{https://github.com/jungao1106/PADS}}