Inverse Language Modeling towards Robust and Grounded LLMs
Gabrielli, Davide, Sestito, Simone, Masi, Iacopo
The current landscape of defensive mechanisms for LLMs is fragmented and underdeveloped, unlike prior work on classifiers. To further promote adversarial robustness in LLMs, we propose Inverse Language Modeling (ILM), a unified framework that simultaneously 1) improves the robustness of LLMs to input perturbations and 2) enables native grounding by inverting model outputs to identify potentially toxic or unsafe input triggers. ILM transforms LLMs from static generators into analyzable and robust systems, potentially helping red teaming. ILM can lay the foundation for next-generation LLMs that are not only robust and grounded but also fundamentally more controllable and trustworthy. The code is publicly available at github.com/davegabe/pag-llm.
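The abstract describes inversion only at a high level; as a loose, toy illustration of the idea of inverting outputs back to input triggers, one can run gradient ascent on a soft input distribution through a miniature stand-in model. The linear "model", shapes, and step sizes below are all assumptions for illustration, not the authors' method:

```python
import numpy as np

# Toy stand-in for an LM: a single linear map from an input embedding
# to next-token logits. The real ILM framework inverts a full LLM.
rng = np.random.default_rng(0)
vocab, dim = 8, 4
E = rng.normal(size=(vocab, dim))        # input embedding table
W = rng.normal(size=(dim, vocab))        # "model": embedding -> logits

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def invert_output(target_token, steps=500, lr=0.5):
    """Gradient ascent on a soft input distribution q so the model's
    output assigns high probability to target_token."""
    q_logits = np.zeros(vocab)           # parameters of the soft input
    for _ in range(steps):
        q = softmax(q_logits)
        emb = q @ E                      # expected input embedding
        p = softmax(emb @ W)             # model output distribution
        g_emb = W[:, target_token] - W @ p   # d log p[target] / d emb
        g_q = E @ g_emb                      # chain through emb = q @ E
        g_logits = q * (g_q - q @ g_q)       # softmax Jacobian
        q_logits += lr * g_logits
    return int(np.argmax(softmax(q_logits)))

# The recovered "trigger" is the input token that best explains the output.
trigger = invert_output(target_token=3)
```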
CLAP: Clustering to Localize Across n Possibilities, A Simple, Robust Geometric Approach in the Presence of Symmetries
Fernandez, Gabriel I., Hou, Ruochen, Xu, Alex, Togashi, Colin, Hong, Dennis W.
Abstract-- In this paper, we present our localization method called CLAP, Clustering to Localize Across n Possibilities, which helped us win the RoboCup 2024 adult-sized autonomous humanoid soccer competition. In addition, our robot had to deal with varying lighting conditions, dynamic feature occlusions, noise from high-impact stepping, and mistaken features from bystanders and neighboring fields. Therefore, we needed an accurate and, most importantly, robust localization algorithm that would be the foundation for our path-planning and game-strategy algorithms. CLAP achieves these requirements by clustering estimated states of our robot from pairs of field features to localize its global position and orientation. Correct state estimates naturally cluster together, while incorrect estimates spread apart, making CLAP resilient to noise and incorrect inputs. CLAP is paired with a particle filter and an extended Kalman filter to improve consistency and smoothness. Tests of CLAP against other landmark-based localization methods showed similar accuracy. However, tests with increased false positive feature detection showed that CLAP outperformed other methods in terms of robustness, with very little divergence and few velocity jumps. Our localization performed well in competition, allowing our robot to shoot faraway goals and narrowly defend our own. Every year, the RoboCup Federation hosts a humanoid soccer competition in hopes of one day playing a live match of robots versus humans. To ensure a fair match, rules are put in place such that robots must be able to play autonomously, be of similar physiological proportions to a human, and only be equipped with sensors that have biological equivalents.
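The clustering idea can be sketched in a few lines: each pair of matched field features pins down a full (x, y, theta) pose hypothesis, correct pairs agree, and the densest cluster wins. The toy map, noise levels, and cluster radius below are illustrative assumptions, not the competition parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def pose_from_pair(A, B, a, b):
    """Pose implied by map landmarks A, B observed at robot-frame points a, b."""
    theta = np.arctan2(*(B - A)[::-1]) - np.arctan2(*(b - a)[::-1])
    theta = (theta + np.pi) % (2 * np.pi) - np.pi   # wrap to (-pi, pi]
    pos = A - rot(theta) @ a
    return np.array([pos[0], pos[1], theta])

true_pose = np.array([2.0, 1.0, 0.3])                # ground truth (x, y, theta)
landmarks = rng.uniform(-4, 4, size=(6, 2))          # known field features
obs = np.array([rot(-true_pose[2]) @ (L - true_pose[:2]) for L in landmarks])
obs += rng.normal(scale=0.02, size=obs.shape)        # measurement noise

hyps = [pose_from_pair(landmarks[i], landmarks[j], obs[i], obs[j])
        for i in range(6) for j in range(6) if i != j]
hyps += [rng.uniform(-4, 4, size=3) for _ in range(10)]  # bogus associations
hyps = np.array(hyps)

# Count neighbors within a radius in (x, y); average the densest cluster.
d = np.linalg.norm(hyps[:, None, :2] - hyps[None, :, :2], axis=-1)
counts = (d < 0.2).sum(axis=1)
best = hyps[d[np.argmax(counts)] < 0.2].mean(axis=0)
```

Correct pairs all imply nearly the same pose, so the densest cluster sits at the true state even though a quarter of the hypotheses here are random junk.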
Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions
Patel, Dhruvesh, Sahoo, Aishwarya, Amballa, Avinash, Naseem, Tahira, Rudner, Tim G. J., McCallum, Andrew
Autoregressive models (ARMs), which predict subsequent tokens one-by-one "from left to right," have achieved significant success across a wide range of sequence generation tasks. However, they struggle to accurately represent sequences that require satisfying sophisticated constraints or whose sequential dependencies are better addressed by out-of-order generation. Masked Diffusion Models (MDMs) address some of these limitations, but the process of unmasking multiple tokens simultaneously in MDMs can introduce incoherences, and MDMs cannot handle arbitrary infilling constraints when the number of tokens to be filled in is not known in advance. In this work, we introduce Insertion Language Models (ILMs), which learn to insert tokens at arbitrary positions in a sequence -- that is, they select jointly both the position and the vocabulary element to be inserted. By inserting tokens one at a time, ILMs can represent strong dependencies between tokens, and their ability to generate sequences in arbitrary order allows them to accurately model sequences where token dependencies do not follow a left-to-right sequential structure. To train ILMs, we propose a tailored network parameterization and use a simple denoising objective. Our empirical evaluation demonstrates that ILMs outperform both ARMs and MDMs on common planning tasks. Furthermore, we show that ILMs outperform MDMs and perform on par with ARMs in an unconditional text generation task while offering greater flexibility than MDMs in arbitrary-length text infilling. The code is available at: https://dhruveshp.com/projects/ilm .
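The decoding procedure, selecting a (position, token) pair jointly at each step, can be sketched with a toy scorer standing in for the trained network. The scorer, stop rule, and vocabulary below are illustrative assumptions, not the paper's parameterization:

```python
import random

# Decoding loop for an insertion language model (sketch). A real ILM
# jointly scores (position, token) with a neural network; here `score`
# is a toy stand-in that rewards keeping a digit sequence sorted.
VOCAB = list(range(10))
STOP = "<stop>"

def score(seq, pos, tok):
    """Toy joint score for inserting `tok` at `pos` (higher is better)."""
    if tok == STOP:
        return 0.0 if len(seq) >= 5 else -1e9   # forbid stopping too early
    new = seq[:pos] + [tok] + seq[pos:]
    return -sum(1 for a, b in zip(new, new[1:]) if a > b)  # penalize disorder

def generate(max_steps=10, seed=0):
    rng = random.Random(seed)
    seq = []
    for _ in range(max_steps):
        candidates = [(pos, tok) for pos in range(len(seq) + 1)
                      for tok in VOCAB + [STOP]]
        # Greedy joint choice of position and token; random tiebreak.
        pos, tok = max(candidates, key=lambda c: (score(seq, *c), rng.random()))
        if tok == STOP:
            break
        seq.insert(pos, tok)
    return seq

out = generate()
```

Because position and token are chosen together, the sequence can grow in any order while every intermediate state stays consistent with the scorer's constraints.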
Label-Context-Dependent Internal Language Model Estimation for CTC
Yang, Zijian, Phan, Minh-Nghia, Schlüter, Ralf, Ney, Hermann
Although connectionist temporal classification (CTC) has the label context independence assumption, it can still implicitly learn a context-dependent internal language model (ILM) due to modern powerful encoders. In this work, we investigate the implicit context dependency modeled in the ILM of CTC. To this end, we propose novel context-dependent ILM estimation methods for CTC based on knowledge distillation (KD) with theoretical justifications. Furthermore, we introduce two regularization methods for KD. We conduct experiments on Librispeech and TED-LIUM Release 2 datasets for in-domain and cross-domain evaluation, respectively. Experimental results show that context-dependent ILMs outperform the context-independent priors in cross-domain evaluation, indicating that CTC learns a context-dependent ILM. The proposed label-level KD with smoothing method surpasses other ILM estimation approaches, with more than 13% relative improvement in word error rate compared to shallow fusion.
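For context, ILM-corrected shallow fusion typically scores hypotheses as log p_CTC(y|x) - λ_ILM · log p_ILM(y) + λ_LM · log p_LM(y). A minimal sketch with toy per-token probabilities follows; the words, probabilities, and weights are illustrative, not numbers from the paper:

```python
import math

# Shallow fusion with internal-LM (ILM) correction (sketch). The toy
# dictionaries stand in for a CTC decoder, an estimated context-dependent
# ILM, and an external LM; the weights are illustrative.
def fused_score(log_p_ctc, log_p_ilm, log_p_ext, lam_ilm=0.3, lam_ext=0.5):
    """log p ~= log p_CTC(y|x) - lam_ilm*log p_ILM(y) + lam_ext*log p_LM(y)."""
    return log_p_ctc - lam_ilm * log_p_ilm + lam_ext * log_p_ext

# Two candidate tokens with per-model log-probabilities.
candidates = {
    "their": {"ctc": math.log(0.4),  "ilm": math.log(0.5), "ext": math.log(0.2)},
    "there": {"ctc": math.log(0.35), "ilm": math.log(0.1), "ext": math.log(0.6)},
}
best = max(candidates, key=lambda w: fused_score(
    candidates[w]["ctc"], candidates[w]["ilm"], candidates[w]["ext"]))
```

Subtracting the ILM term here flips the decision toward the word the external LM prefers, which is the mechanism behind the cross-domain gains the abstract reports.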
Fast and Robust Localization for Humanoid Soccer Robot via Iterative Landmark Matching
Hou, Ruochen, Zhu, Mingzhang, Nam, Hyunwoo, Fernandez, Gabriel I., Hong, Dennis W.
Accurate robot localization is essential for effective operation. Monte Carlo Localization (MCL) is commonly used with known maps but is computationally expensive due to landmark matching for each particle. Humanoid robots face additional challenges, including sensor noise from locomotion vibrations and a limited field of view (FOV) due to camera placement. This paper proposes a fast and robust localization method via iterative landmark matching (ILM) for humanoid robots. The iterative matching process improves the accuracy of landmark association without requiring MCL to match landmarks to particles. Pose estimation with an outlier removal process enhances robustness to measurement noise and faulty detections. Furthermore, an additional filter can be used to fuse inertial data from the inertial measurement unit (IMU) with pose data from localization. A comparison of ILM with Iterative Closest Point (ICP) shows that ILM is more robust to error in the initial guess and more reliably finds a correct matching. A comparison with Augmented Monte Carlo Localization (aMCL) shows that ILM is both much faster and more accurate. The proposed method's effectiveness is thoroughly evaluated through experiments and validated on the humanoid robot ARTEMIS during the RoboCup 2024 adult-sized soccer competition.
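The match-fit-prune loop described above can be sketched with a closed-form 2D rigid fit in place of the paper's full pose estimation; the toy map, thresholds, and noise-free data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_pose(src, dst):
    """Closed-form (Kabsch) 2D rigid transform mapping src onto dst."""
    cs, cd = src.mean(0), dst.mean(0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:             # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

def iterative_match(obs, landmarks, R, t, iters=10, outlier_thresh=0.5):
    for _ in range(iters):
        world = obs @ R.T + t            # detections under current pose
        idx = np.argmin(np.linalg.norm(world[:, None] - landmarks[None],
                                       axis=-1), axis=1)
        resid = np.linalg.norm(world - landmarks[idx], axis=1)
        # Drop faulty detections, but always keep at least half the points.
        keep = resid < max(outlier_thresh, 2 * np.median(resid))
        R, t = fit_pose(obs[keep], landmarks[idx[keep]])
    return R, t

landmarks = rng.uniform(-3, 3, size=(8, 2))          # known map
th, t_true = 0.1, np.array([0.2, -0.1])
R_true = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
obs = (landmarks - t_true) @ R_true                  # robot-frame detections
obs = np.vstack([obs, [10.0, 10.0]])                 # one faulty detection
R_est, t_est = iterative_match(obs, landmarks, np.eye(2), np.zeros(2))
```

The outlier pruning step is what keeps the fit from being dragged off by the bogus detection, mirroring the robustness-to-faulty-detections claim in the abstract.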
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
Xu, Zhenran, Wang, Longyue, Wang, Jifang, Li, Zhouyi, Shi, Senbao, Yang, Xue, Wang, Yiyu, Hu, Baotian, Yu, Jun, Zhang, Min
Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D virtual spaces. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) scriptwriting elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking. Further analysis reveals that FilmAgent, despite using the less advanced GPT-4o model, surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI's text-to-video model Sora and our FilmAgent in filmmaking.
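The iterative feedback-and-revision loop between agents can be sketched with rule-based stand-ins for the LLM calls; the roles, critique strings, and revision rules below are illustrative assumptions, not FilmAgent's actual prompts:

```python
# Sketch of a multi-agent critique-revise loop: one agent drafts a scene,
# a second critiques it, and the draft is revised until approved.
def screenwriter(draft, feedback):
    """Revise the draft to address each outstanding note."""
    for note in feedback:
        if note == "missing dialogue":
            draft["dialogue"] = ["ALICE: We need to leave. Now."]
        elif note == "no camera setup":
            draft["shots"] = ["medium close-up", "over-the-shoulder"]
    return draft

def director(draft):
    """Return a list of critiques; an empty list means approval."""
    notes = []
    if not draft.get("dialogue"):
        notes.append("missing dialogue")
    if not draft.get("shots"):
        notes.append("no camera setup")
    return notes

draft = {"outline": "Two friends argue before a storm."}
for _ in range(3):                        # bounded revision rounds
    feedback = director(draft)
    if not feedback:
        break
    draft = screenwriter(draft, feedback)
approved = not director(draft)
```

Bounding the number of rounds is one simple way such a loop can terminate even when the critic and writer never fully agree.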
An iterated learning model of language change that mixes supervised and unsupervised learning
Bunyan, Jack, Bullock, Seth, Houghton, Conor
The iterated learning model is an agent-based model of language change in which language is transmitted from a tutor to a pupil which itself becomes a tutor to a new pupil, and so on. Languages that are stable, expressive, and compositional arise spontaneously as a consequence of a language transmission bottleneck. Previous models have implemented an agent's mapping from signals to meanings using an artificial neural network decoder, but have relied on an unrealistic and computationally expensive process of obversion to implement the associated encoder, mapping from meanings to signals. Here, a new model is presented in which both decoder and encoder are neural networks, trained separately through supervised learning, and trained together through unsupervised learning in the form of an autoencoder. This avoids the substantial computational burden entailed in obversion and introduces a mixture of supervised and unsupervised learning as observed during human development.
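One tutor-to-pupil generation of the mixed regime can be sketched with single linear layers in place of the paper's neural networks; the dimensions, learning rate, and identity tutor are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d_meaning = d_signal = 4
W_tutor = np.eye(d_signal, d_meaning)     # tutor's fixed meaning->signal map

W_enc = rng.normal(scale=0.1, size=(d_signal, d_meaning))   # pupil encoder
W_dec = rng.normal(scale=0.1, size=(d_meaning, d_signal))   # pupil decoder
lr = 0.1

for step in range(2000):
    m = rng.normal(size=d_meaning)        # a random meaning
    if step % 2 == 0:
        # Supervised: imitate the tutor's signal, and decode it back.
        s_t = W_tutor @ m
        W_enc -= lr * np.outer(W_enc @ m - s_t, m)
        W_dec -= lr * np.outer(W_dec @ s_t - m, s_t)
    else:
        # Unsupervised: autoencode the meaning through the pupil's own
        # signal channel, training encoder and decoder together.
        s = W_enc @ m
        err = W_dec @ s - m
        W_dec -= lr * np.outer(err, s)
        W_enc -= lr * W_dec.T @ np.outer(err, m)
```

The supervised steps play the role of the transmission bottleneck (learning from the tutor's examples), while the autoencoder steps replace the expensive obversion procedure by training encoder and decoder jointly.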
Modeling language contact with the Iterated Learning Model
Bullock, Seth, Houghton, Conor
Contact between languages has the potential to transmit vocabulary and other language features; however, this does not always happen. Here, an iterated learning model is used to examine, in a simple way, the resistance of languages to change during language contact. Iterated learning models are agent-based models of language change; they demonstrate that languages that are expressive and compositional arise spontaneously as a consequence of a language transmission bottleneck. A recently introduced type of iterated learning model, the Semi-Supervised ILM, is used to simulate language contact. These simulations do not include many of the complex factors involved in language contact and do not model a population of speakers; nonetheless, the model demonstrates that the dynamics which lead languages in the model to spontaneously become expressive and compositional also cause a language to maintain its core traits even after mixing with another language.
Integrating Language Models into Direct Speech Translation: An Inference-Time Solution to Control Gender Inflection
Fucci, Dennis, Gaido, Marco, Papi, Sara, Cettolo, Mauro, Negri, Matteo, Bentivogli, Luisa
When translating words referring to the speaker, speech translation (ST) systems should not resort to default masculine generics nor rely on potentially misleading vocal traits. Rather, they should assign gender according to the speakers' preference. The existing solutions to do so, though effective, are hardly feasible in practice as they involve dedicated model re-training on gender-labeled ST data. To overcome these limitations, we propose the first inference-time solution to control speaker-related gender inflections in ST. Our approach partially replaces the (biased) internal language model (LM) implicitly learned by the ST decoder with gender-specific external LMs. Experiments on en->es/fr/it show that our solution outperforms the base models and the best training-time mitigation strategy by up to 31.0 and 1.6 points in gender accuracy, respectively, for feminine forms. The gains are even larger (up to 32.0 and 3.4) in the challenging condition where speakers' vocal traits conflict with their gender.
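The inference-time correction amounts to partially swapping the decoder's implicit LM for a gender-specific external one when scoring candidate tokens. A toy sketch of one such decision follows; the Italian inflections, probabilities, and weight are illustrative, not the paper's numbers:

```python
import math

# Partial internal-LM replacement (sketch): the ST decoder's score for a
# candidate token is corrected by subtracting a weighted estimate of its
# (biased) internal LM and adding a gender-specific external LM.
def corrected(log_p_st, log_p_ilm, log_p_gender_lm, lam=0.4):
    return log_p_st + lam * (log_p_gender_lm - log_p_ilm)

# en->it toy step: choosing the inflection of a speaker-referred adjective.
cands = {
    "stanco": {"st": math.log(0.6), "ilm": math.log(0.7), "fem_lm": math.log(0.1)},
    "stanca": {"st": math.log(0.4), "ilm": math.log(0.3), "fem_lm": math.log(0.9)},
}
pick = max(cands, key=lambda w: corrected(
    cands[w]["st"], cands[w]["ilm"], cands[w]["fem_lm"]))
```

The correction overturns the decoder's masculine-generic default in favor of the feminine form preferred by the external LM, without any retraining of the ST model.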
Towards Deep Learning Guided Autonomous Eye Surgery Using Microscope and iOCT Images
Kim, Ji Woong, Wei, Shuwen, Zhang, Peiyao, Gehlbach, Peter, Kang, Jin U., Iordachita, Iulian, Kobilarov, Marin
Recent advancements in retinal surgery have paved the way for a modern operating room equipped with a surgical robot, a microscope, and intraoperative optical coherence tomography (iOCT), a depth sensor widely used in retinal surgery. Integrating these tools raises the fundamental question of how to effectively combine them to enable surgical autonomy. In this work, we tackle this question by developing a unified framework that facilitates real-time autonomous surgical workflows leveraging these devices. The system features: (1) a novel imaging system that integrates the microscope and iOCT in real-time by dynamically tracking the surgical instrument via a small iOCT scanning region, providing real-time depth feedback; (2) implementation of convolutional neural networks (CNN) that automatically detect and segment task-relevant information for surgical autonomy; (3) intuitive selection of goal waypoints within both the microscope and iOCT views through simple mouse-click interactions; and (4) integration of model predictive control (MPC) for trajectory generation, ensuring patient safety by implementing safety-related kinematic constraints. The system's utility is demonstrated by automating subretinal injection (SI), a challenging procedure with high accuracy and depth perception requirements. We validate our system by conducting 30 successful SI trials on pig eyes, achieving mean needle insertion accuracy of 26 micrometers to various subretinal goals and mean duration of 55 seconds. Preliminary comparisons to a human operator performing SI in robot-assisted mode highlight the enhanced safety of our system. Project website is here: https://sites.google.com/view/eyesurgerymicroscopeoct/home
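As a greatly simplified stand-in for the MPC trajectory generation with safety constraints, the sketch below steps a needle tip toward a goal waypoint under a per-step speed limit and a maximum-depth clamp. The units, limits, and greedy controller are illustrative assumptions, not surgical parameters or the authors' MPC formulation:

```python
import numpy as np

def plan_step(tip, goal, max_step=0.05, max_depth=1.0):
    """One constrained step toward the goal waypoint."""
    d = goal - tip
    dist = np.linalg.norm(d)
    if dist > max_step:                   # speed (step-length) constraint
        d = d * (max_step / dist)
    new = tip + d
    new[2] = min(new[2], max_depth)       # depth safety constraint
    return new

tip = np.array([0.0, 0.0, 0.0])
goal = np.array([0.3, 0.1, 1.2])          # clicked goal deeper than allowed
traj = [tip]
for _ in range(200):
    tip = plan_step(tip, goal)
    traj.append(tip)
    if np.linalg.norm(tip - goal) < 1e-6:
        break
```

Even when the requested waypoint violates the depth limit, the constraint keeps every pose in the trajectory on the safe side, which is the role the kinematic constraints play in the full MPC formulation.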