SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Neural Information Processing Systems

Speech encompasses a wealth of information, including but not limited to content, paralinguistic cues, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles for task definition and model development, which in turn requires open-source datasets and metrics suitable for model evaluation.


OpenFilter: A Framework to Democratize Research Access to Social Media AR Filters

Neural Information Processing Systems

Augmented Reality (AR) filters on selfies have become very popular on social media platforms for a variety of applications, including marketing, entertainment, and aesthetics. Given the wide adoption of AR face filters and the importance of faces in our social structures and relations, there is increasing interest in the scientific community in analyzing the impact of such filters from psychological, artistic, and sociological perspectives. However, there are few quantitative analyses in this area, mainly due to the lack of publicly available datasets of facial images with applied AR filters. The proprietary, closed nature of most social media platforms does not allow users, scientists, and practitioners to access the code and the details of the available AR face filters. Scraping faces from these platforms to collect data is ethically unacceptable and should therefore be avoided in research.


Construction and Application of Materials Knowledge Graph in Multidisciplinary Materials Science via Large Language Model

Neural Information Processing Systems

Knowledge in materials science is widely dispersed across extensive scientific literature, posing significant challenges to the efficient discovery and integration of new materials. Traditional methods, often reliant on costly and time-consuming experimental approaches, further complicate rapid innovation. Addressing these challenges, the integration of artificial intelligence with materials science has opened avenues for accelerating the discovery process, though it also demands precise annotation, data extraction, and traceability of information. To tackle these issues, this article introduces the Materials Knowledge Graph (MKG), which utilizes advanced natural language processing techniques, integrated with large language models, to extract a decade's worth of high-quality research and systematically organize it into structured triples, yielding a graph with 162,605 nodes and 731,772 edges. MKG categorizes information into comprehensive labels such as Name, Formula, and Application, structured around a meticulously designed ontology, thus enhancing data usability and integration. By implementing network-based algorithms, MKG not only facilitates efficient link prediction but also significantly reduces reliance on traditional experimental methods. This structured approach not only streamlines materials research but also lays the groundwork for more sophisticated science knowledge graphs.
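
As a rough illustration of how triples of this kind can be organized into a graph and queried with a network-based link-prediction heuristic, a minimal sketch follows; the example triples, node labels, and common-neighbor scoring are assumptions made for illustration, not the MKG pipeline itself.

```python
# Illustrative sketch only: the triples, labels, and scoring heuristic below
# are assumptions, not the MKG construction described in the paper.
import networkx as nx

# Hypothetical extracted triples of the form (head, relation, tail).
triples = [
    ("lithium iron phosphate", "has_formula", "LiFePO4"),
    ("lithium iron phosphate", "has_application", "battery cathode"),
    ("lithium cobalt oxide", "has_application", "battery cathode"),
]

G = nx.Graph()
for head, relation, tail in triples:
    G.add_node(head, label="Name")
    G.add_node(tail, label="Application" if relation == "has_application" else "Formula")
    G.add_edge(head, tail, relation=relation)

# Simple network-based link prediction: rank unconnected material pairs by
# their number of shared neighbors (e.g., shared applications).
def common_neighbor_score(graph, u, v):
    return len(list(nx.common_neighbors(graph, u, v)))

print(common_neighbor_score(G, "lithium iron phosphate", "lithium cobalt oxide"))  # 1
```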


Efficient Projection-Free Algorithms for Saddle Point Problems

Neural Information Processing Systems

The Frank-Wolfe algorithm is a classic method for constrained optimization problems. It has recently been popular in many machine learning applications because its projection-free property leads to more efficient iterations. In this paper, we study projection-free algorithms for convex-strongly-concave saddle point problems with complicated constraints.
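
To make the projection-free property concrete, here is a minimal Frank-Wolfe sketch for a simple constrained minimization problem; the objective, the simplex constraint, and the step-size schedule are illustrative assumptions and not the saddle-point algorithms studied in the paper.

```python
# A minimal Frank-Wolfe sketch: each step calls a linear minimization oracle
# (LMO) instead of projecting onto the constraint set.
import numpy as np

def lmo_simplex(grad):
    """LMO over the probability simplex: a vertex minimizing <grad, s>."""
    s = np.zeros_like(grad)
    s[np.argmin(grad)] = 1.0
    return s

def frank_wolfe(grad_f, x0, num_iters=100):
    x = x0.copy()
    for t in range(num_iters):
        s = lmo_simplex(grad_f(x))          # linear subproblem, no projection
        gamma = 2.0 / (t + 2.0)             # standard step-size schedule
        x = (1.0 - gamma) * x + gamma * s   # stays feasible by convexity
    return x

# Example: minimize ||x - c||^2 over the simplex.
c = np.array([0.1, 0.7, 0.2])
x_star = frank_wolfe(lambda x: 2.0 * (x - c), np.ones(3) / 3.0)
print(x_star)
```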


7a53928fa4dd31e82c6ef826f341daec-AuthorFeedback.pdf

Neural Information Processing Systems

We greatly appreciate the reviewers' effort and helpful comments.

Comment 1: "The significance of the proposed method is not very clear..." It also has great theoretical significance in the optimization area. Though the convergence rate of this method could be suboptimal, it is a practical way to handle constraints where projection is expensive. In addition, [6] shows some examples of saddle point algorithms where projection onto the constraint sets is hard.

Comment 2: "Why do we consider nuclear norm constraint for this classification problem?"

Response 3: We find that this paper does not have Sections 5.4 and 5.6. Also, it is irrelevant to our paper.


One for All: Multi-Domain Joint Training for Point Cloud Based 3D Object Detection

Neural Information Processing Systems

The current trend in computer vision is to utilize one universal model to address a wide variety of tasks. Achieving such a universal model inevitably requires incorporating multi-domain data for joint training to learn across multiple problem scenarios. In point cloud based 3D object detection, however, such multi-domain joint training is highly challenging, because large domain gaps among point clouds from different datasets lead to a severe domain-interference problem. In this paper, we propose OneDet3D, a universal one-for-all model that addresses 3D detection across different domains, including diverse indoor and outdoor scenes, within the same framework and with only one set of parameters. We propose domain-aware partitioning in scatter and context, guided by a routing mechanism, to address the data-interference issue, and further incorporate the text modality for language-guided classification to unify the multi-dataset label spaces and mitigate the category-interference issue. The fully sparse structure and anchor-free head further accommodate point clouds with significant scale disparities. Extensive experiments demonstrate the strong universal ability of OneDet3D to utilize only one trained model for addressing almost all 3D object detection tasks (Figure 1).
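
The following PyTorch sketch illustrates, under invented module names and shapes, two of the ideas mentioned above: routing features through domain-specific normalization branches and classifying by similarity to text embeddings so that different label spaces can share one head. It is not the OneDet3D implementation.

```python
# Hedged sketch with assumed shapes and hyperparameters, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainAwareNorm(nn.Module):
    def __init__(self, channels, num_domains=2):
        super().__init__()
        self.branches = nn.ModuleList(nn.LayerNorm(channels) for _ in range(num_domains))
        self.router = nn.Linear(channels, num_domains)

    def forward(self, feats):                                 # feats: (N, C) point features
        weights = self.router(feats.mean(0)).softmax(-1)      # soft routing over domains
        return sum(w * branch(feats) for w, branch in zip(weights, self.branches))

class LanguageGuidedHead(nn.Module):
    def __init__(self, channels, text_dim=512):
        super().__init__()
        self.proj = nn.Linear(channels, text_dim)

    def forward(self, feats, text_embeds):                    # text_embeds: (num_classes, text_dim)
        f = F.normalize(self.proj(feats), dim=-1)
        t = F.normalize(text_embeds, dim=-1)
        return f @ t.T                                        # class logits via cosine similarity

feats = torch.randn(1024, 256)
logits = LanguageGuidedHead(256)(DomainAwareNorm(256)(feats), torch.randn(10, 512))
print(logits.shape)  # torch.Size([1024, 10])
```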


A Task descriptions and assumptions used for the different reward-shaping methods

Neural Information Processing Systems

This supplementary material provides additional results and discussion, as well as implementation details. Section A summarises the different tasks and the assumptions used in RIDE, EAGER, and ELLA. Section B gives more details about the training of the QA module and the agent, including how we built the training dataset for the QA module. Section C gathers several results on EAGER: comparison with behavioural cloning, generalisation capacity of the QA module, robustness results of EAGER... Section D contains a commented version of the EAGER algorithm. Table 1 describes the tasks used in the experiments, with an example and an indication of whether each was used to train the QA module or the agent.
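
For concreteness, here is a heavily simplified sketch of reward shaping driven by a question-answering module; the QA interface, the word-masking scheme, and the bonus weight are assumptions made for illustration and do not reproduce the exact EAGER algorithm.

```python
# Illustrative QA-based reward shaping: the QA callable, masking scheme, and
# bonus weight are assumptions, not the exact method from the paper.
def shaped_reward(env_reward, instruction, trajectory, qa_answer, bonus_weight=0.1):
    """Add a bonus for each masked instruction word the QA module recovers."""
    words = instruction.split()
    correct = 0
    for i, target in enumerate(words):
        masked = " ".join("<mask>" if j == i else w for j, w in enumerate(words))
        if qa_answer(masked, trajectory) == target:   # QA module guesses the masked word
            correct += 1
    return env_reward + bonus_weight * correct / max(len(words), 1)

# Toy stand-in QA module that always answers "ball".
toy_qa = lambda masked, trajectory: "ball"
print(shaped_reward(1.0, "pick up the ball", trajectory=[], qa_answer=toy_qa))  # 1.025
```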


L4GM: Large 4D Gaussian Reconstruction Model

Neural Information Processing Systems

We present L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input in a single feed-forward pass that takes only a second. Key to our success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. This dataset depicts 44K diverse objects with 110K animations rendered from 48 viewpoints, resulting in 12M videos with a total of 300M frames. We keep L4GM simple for scalability and build directly on top of LGM [49], a pretrained 3D Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview image input. L4GM outputs a per-frame 3D Gaussian Splatting representation from video frames sampled at a low fps and then upsamples the representation to a higher fps to achieve temporal smoothness. We add temporal self-attention layers to the base LGM to help it learn consistency across time, and utilize a per-timestep multiview rendering loss to train the model. The representation is upsampled to a higher frame rate by training an interpolation model that produces intermediate 3D Gaussian representations. We show that L4GM, although trained only on synthetic data, generalizes well to in-the-wild videos, producing high-quality animated 3D assets.
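
A hedged sketch of the temporal self-attention idea is given below: per-frame tokens are reshaped so that attention runs along the time axis. The tensor shapes and hyperparameters are illustrative assumptions, not L4GM's actual configuration.

```python
# Sketch of attending over the time axis of per-frame features; shapes are assumed.
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, tokens, dim) -- per-frame tokens from a base reconstruction model
        b, t, n, d = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b * n, t, d)    # attend over time for each token
        x = x + self.attn(self.norm(x), self.norm(x), self.norm(x))[0]
        return x.reshape(b, n, t, d).permute(0, 2, 1, 3)

frames = torch.randn(2, 8, 64, 256)            # 8 frames, 64 tokens each
print(TemporalSelfAttention()(frames).shape)   # torch.Size([2, 8, 64, 256])
```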


An Optimal Elimination Algorithm for Learning a Best Arm

Neural Information Processing Systems

We consider the classic problem of (ε, δ)-PAC learning a best arm, where the goal is to identify, with confidence 1 − δ, an arm whose mean is an ε-approximation to that of the highest-mean arm in a multi-armed bandit setting. This problem is one of the most fundamental problems in statistics and learning theory, yet somewhat surprisingly its worst-case sample complexity is not well understood. In this paper we propose a new approach for (ε, δ)-PAC learning a best arm. This approach leads to an algorithm whose sample complexity converges to exactly the optimal sample complexity of (ε, δ)-learning the mean of n arms separately, and we complement this result with a conditional matching lower bound.
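
To make the problem setup concrete, the sketch below implements a textbook-style successive-elimination baseline for (ε, δ)-PAC best-arm identification; it is not the optimal elimination algorithm proposed in the paper, and the confidence-radius schedule is a standard assumption.

```python
# Baseline successive elimination, for illustration of the problem setup only.
import math
import random

def successive_elimination(arms, epsilon=0.1, delta=0.1):
    """arms: list of callables, each returning a reward in [0, 1]."""
    active = list(range(len(arms)))
    means = [0.0] * len(arms)
    pulls = 0
    for r in range(1, 10**6):
        radius = math.sqrt(math.log(4 * len(arms) * r * r / delta) / (2 * r))
        for i in active:
            means[i] += (arms[i]() - means[i]) / r     # running mean over r pulls
            pulls += 1
        best = max(means[i] for i in active)
        active = [i for i in active if means[i] >= best - 2 * radius]
        if len(active) == 1 or 2 * radius <= epsilon:  # survivors are epsilon-optimal w.h.p.
            break
    return max(active, key=lambda i: means[i]), pulls

arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.6)]
print(successive_elimination(arms, epsilon=0.05, delta=0.05))
```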


6801fa3fd290229efc490ee0cf1c5687-Paper-Conference.pdf

Neural Information Processing Systems

Large Language Models (LLMs) have demonstrated impressive capabilities in textual understanding and generation, but they cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, empowering frozen LLMs to perform multiple audio tasks in a few-shot style without any parameter updates. Specifically, we propose a novel LLM-driven audio codec model, LLM-Codec, which transfers the audio modality into the textual space by representing audio tokens with words or sub-words from the LLM vocabulary, while maintaining high audio reconstruction quality. The key idea is to reduce the modality heterogeneity between text and audio by compressing the audio modality into the well-trained textual space of LLMs. Thus, the audio representation can be viewed as a new foreign language, and LLMs can learn this new foreign language from a few demonstrations. In experiments, we investigate the performance of the proposed approach across multiple audio understanding and generation tasks, e.g.
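
A minimal sketch of the underlying idea, assuming a hypothetical codec tokenizer and an arbitrary vocabulary subset, is shown below: discrete audio-codec ids are rendered as LLM vocabulary entries so an audio clip can be spliced into a few-shot text prompt. This is an illustration, not the LLM-Codec design.

```python
# Hedged illustration: the codec ids, vocabulary subset, and prompt format are
# assumptions, not the actual LLM-Codec token mapping.
def audio_tokens_to_text(codec_ids, vocab_subset):
    """Map each codec id to a reserved word/sub-word from the LLM vocabulary."""
    return " ".join(vocab_subset[i % len(vocab_subset)] for i in codec_ids)

def build_fewshot_prompt(examples, query_ids, vocab_subset):
    """examples: list of (codec_ids, label) demonstration pairs."""
    lines = []
    for ids, label in examples:
        lines.append(f"Audio: {audio_tokens_to_text(ids, vocab_subset)}\nLabel: {label}")
    lines.append(f"Audio: {audio_tokens_to_text(query_ids, vocab_subset)}\nLabel:")
    return "\n\n".join(lines)

vocab_subset = ["apple", "river", "stone", "cloud"]          # stand-in sub-words
demos = [([0, 2, 1, 3], "dog bark"), ([3, 3, 0, 1], "siren")]
print(build_fewshot_prompt(demos, [1, 0, 2, 2], vocab_subset))
```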