Materials
Revolve: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization
Zhang, Peiyan, Jin, Haibo, Hu, Leyang, Li, Xinnuo, Kang, Liying, Luo, Man, Song, Yangqiu, Wang, Haohan
Recent advancements in large language models (LLMs) have significantly enhanced the ability of LLM-based systems to perform complex tasks through natural language processing and tool interaction. However, optimizing these LLM-based systems for specific tasks remains challenging, often requiring manual interventions like prompt engineering and hyperparameter tuning. Existing automatic optimization methods, such as textual feedback-based techniques (e.g., TextGrad), tend to focus on immediate feedback, analogous to using immediate derivatives in traditional numerical gradient descent. However, relying solely on such feedback can be limited when the adjustments made in response to this feedback are either too small or fluctuate irregularly, potentially slowing down or even stalling the optimization process. To overcome these challenges, more adaptive methods are needed, especially in situations where the system's response is evolving slowly or unpredictably. In this paper, we introduce REVOLVE, an optimization method that tracks how "R"esponses "EVOLVE" across iterations in LLM systems. By focusing on the evolution of responses over time, REVOLVE enables more stable and effective optimization by making thoughtful, progressive adjustments at each step. Experimental results demonstrate that REVOLVE outperforms competitive baselines, achieving a 7.8% improvement in prompt optimization, a 20.72% gain in solution refinement, and a 29.17% increase in code optimization. Additionally, REVOLVE converges in fewer iterations, resulting in significant computational savings. These advantages highlight its adaptability and efficiency, positioning REVOLVE as a valuable tool for optimizing LLM-based systems and accelerating the development of next-generation AI technologies. Code is available at: https://github.com/Peiyance/REVOLVE.
FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL
Koh, Woosung, Oh, Wonbeen, Kim, Siyeol, Shin, Suhin, Kim, Hyeongjin, Jang, Jaein, Lee, Junghyun, Yun, Se-Young
Multi-agent reinforcement learning has demonstrated significant potential in addressing complex cooperative tasks across various real-world applications. However, existing MARL approaches often rely on the restrictive assumption that the number of entities (e.g., agents, obstacles) remains constant between training and inference. This overlooks scenarios where entities are dynamically removed or added during the inference trajectory -- a common occurrence in real-world environments like search and rescue missions and dynamic combat situations. In this paper, we tackle the challenge of intra-trajectory dynamic entity composition under zero-shot out-of-domain (OOD) generalization, where such dynamic changes cannot be anticipated beforehand. Our empirical studies reveal that existing MARL methods suffer significant performance degradation and increased uncertainty in these scenarios. In response, we propose FlickerFusion, a novel OOD generalization method that acts as a universally applicable augmentation technique for MARL backbone methods. FlickerFusion stochastically drops out parts of the observation space, emulating being in-domain when inferenced OOD. The results show that FlickerFusion not only achieves superior inference rewards but also uniquely reduces uncertainty vis-\`a-vis the backbone, compared to existing methods. Benchmarks, implementations, and model weights are organized and open-sourced at flickerfusion305.github.io, accompanied by ample demo video renderings.
Deep Learning, Machine Learning, Advancing Big Data Analytics and Management
Hsieh, Weiche, Bi, Ziqian, Chen, Keyu, Peng, Benji, Zhang, Sen, Xu, Jiawei, Wang, Jinlang, Yin, Caitlyn Heqi, Zhang, Yichao, Feng, Pohsun, Wen, Yizhu, Wang, Tianyang, Li, Ming, Liang, Chia Xin, Ren, Jintao, Niu, Qian, Chen, Silin, Yan, Lawrence K. Q., Xu, Han, Tseng, Hong-Ming, Song, Xinyuan, Jing, Bowen, Yang, Junjie, Song, Junhao, Liu, Junyu, Liu, Ming
Advancements in artificial intelligence, machine learning, and deep learning have catalyzed the transformation of big data analytics and management into pivotal domains for research and application. This work explores the theoretical foundations, methodological advancements, and practical implementations of these technologies, emphasizing their role in uncovering actionable insights from massive, high-dimensional datasets. The study presents a systematic overview of data preprocessing techniques, including data cleaning, normalization, integration, and dimensionality reduction, to prepare raw data for analysis. Core analytics methodologies such as classification, clustering, regression, and anomaly detection are examined, with a focus on algorithmic innovation and scalability. Furthermore, the text delves into state-of-the-art frameworks for data mining and predictive modeling, highlighting the role of neural networks, support vector machines, and ensemble methods in tackling complex analytical challenges. Special emphasis is placed on the convergence of big data with distributed computing paradigms, including cloud and edge computing, to address challenges in storage, computation, and real-time analytics. The integration of ethical considerations, including data privacy and compliance with global standards, ensures a holistic perspective on data management. Practical applications across healthcare, finance, marketing, and policy-making illustrate the real-world impact of these technologies. Through comprehensive case studies and Python-based implementations, this work equips researchers, practitioners, and data enthusiasts with the tools to navigate the complexities of modern data analytics. It bridges the gap between theory and practice, fostering the development of innovative solutions for managing and leveraging data in the era of artificial intelligence.
Bridging Hard and Soft: Mechanical Metamaterials Enable Rigid Torque Transmission in Soft Robots
Carton, Molly, Kowalewski, Jakub F., Guo, Jiani, Alpert, Jacob F., Garg, Aman, Revier, Daniel, Lipton, Jeffrey Ian
Torque and continuous rotation are fundamental methods of actuation and manipulation in rigid robots. Soft robot arms use soft materials and structures to mimic the passive compliance of biological arms that bend and extend. This use of compliance prevents soft arms from continuously transmitting and exerting torques to interact with their environment. Here, we show how relying on patterning structures instead of inherent material properties allows soft robotic arms to remain compliant while continuously transmitting torque to their environment. We demonstrate a soft robotic arm made from a pair of mechanical metamaterials that act as compliant constant-velocity joints. The joints are up to 52 times stiffer in torsion than bending and can bend up to 45{\deg}. This robot arm can continuously transmit torque while deforming in all other directions. The arm's mechanical design achieves high motion repeatability (0.4 mm and 0.1{\deg}) when tracking trajectories. We then trained a neural network to learn the inverse kinematics, enabling us to program the arm to complete tasks that are challenging for existing soft robots such as installing light bulbs, fastening bolts, and turning valves. The arm's passive compliance makes it safe around humans and provides a source of mechanical intelligence, enabling it to adapt to misalignment when manipulating objects. This work will bridge the gap between hard and soft robotics with applications in human assistance, warehouse automation, and extreme environments.
An Automated Data Mining Framework Using Autoencoders for Feature Extraction and Dimensionality Reduction
Liang, Yaxin, Li, Xinshi, Huang, Xin, Zhang, Ziqi, Yao, Yue
This study proposes an automated data mining framework based on autoencoders and experimentally verifies its effectiveness in feature extraction and data dimensionality reduction. Through the encoding-decoding structure, the autoencoder can capture the data's potential characteristics and achieve noise reduction and anomaly detection, providing an efficient and stable solution for the data mining process. The experiment compared the performance of the autoencoder with traditional dimensionality reduction methods (such as PCA, FA, T-SNE, and UMAP). The results showed that the autoencoder performed best in terms of reconstruction error and root mean square error and could better retain data structure and enhance the generalization ability of the model. The autoencoder-based framework not only reduces manual intervention but also significantly improves the automation of data processing. In the future, with the advancement of deep learning and big data technology, the autoencoder method combined with a generative adversarial network (GAN) or graph neural network (GNN) is expected to be more widely used in the fields of complex data processing, real-time data analysis and intelligent decision-making.
3D Interaction Geometric Pre-training for Molecular Relational Learning
Lee, Namkyeong, Oh, Yunhak, Noh, Heewoong, Na, Gyoung S., Xu, Minkai, Wang, Hanchen, Fu, Tianfan, Park, Chanyoung
Molecular relational learning (MRL) focuses on understanding the interaction dynamics between molecules and has gained significant attention from researchers thanks to its diverse applications [20]. For instance, understanding how a medication dissolves in different solvents (medication-solvent interaction) is vital in pharmacy [30, 26, 3], while predicting the optical and photophysical properties of chromophores in various solvents (chromophore-solvent interaction) is essential for material discovery [16]. Because of the expensive time and financial costs associated with conducting wet lab experiments to test the interaction behavior of all possible molecular pairs [31], machine learning methods have been quickly embraced for MRL. Despite recent advancements in MRL, previous works tend to ignore molecules' 3D geometric information and instead focus solely on their 2D topological structures. However, in molecular science, the 3D geometric information of molecules (Figure 1 (a)) is crucial for understanding and predicting molecular behavior across various contexts, ranging from physical properties [1] to biological functions [10, 46]. This is particularly important in MRL, as geometric information plays a key role in molecular interactions by determining how molecules recognize, interact, and bind with one another in their interaction environment [34]. In traditional molecular dynamics simulations, explicit solvent models, which directly consider the detailed environment of molecular interaction, have demonstrated superior performance compared to implicit solvent models, which simplify the solvent as a continuous medium, highlighting the significance of explicitly modeling the complex geometries of interaction environments [47]. However, acquiring stereochemical structures of molecules is often very costly, resulting in limited availability of such 3D geometric information for downstream tasks [23].
Measuring Bias of Web-filtered Text Datasets and Bias Propagation Through Training
Mansour, Youssef, Heckel, Reinhard
We investigate biases in pretraining datasets for large language models (LLMs) through dataset classification experiments. Building on prior work demonstrating the existence of biases in popular computer vision datasets, we analyze popular open-source pretraining datasets for LLMs derived from CommonCrawl including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and DCLM-Baseline. Despite those datasets being obtained with similar filtering and deduplication steps, neural networks can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that popular pretraining datasets have their own unique biases or fingerprints. Those biases remain even when the text is rewritten with LLMs. Moreover, these biases propagate through training: Random sequences generated by models trained on those datasets can be classified well by a classifier trained on the original datasets.
Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback
Furuta, Hiroki, Zen, Heiga, Schuurmans, Dale, Faust, Aleksandra, Matsuo, Yutaka, Liang, Percy, Yang, Sherry
Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. This enables the model to refine its responses autonomously, eliminating extensive manual data collection. In this work, we investigate the use of feedback to enhance the object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively improve text-video alignment and realistic object interactions? We begin by deriving a unified probabilistic objective for offline RL finetuning of text-to-video models. This perspective highlights how design elements in existing algorithms like KL regularization and policy projection emerge as specific choices within a unified framework. We then use derived methods to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), but notice that they often fail to align with human perceptions of generation quality. To address this limitation, we propose leveraging vision-language models to provide more nuanced feedback specifically tailored to object dynamics in videos. Our experiments demonstrate that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions, as confirmed by both AI and human evaluations. Notably, we observe substantial gains when using reward signals derived from AI feedback, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.
VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning
Jiang, Zhihuan, Yang, Zhen, Chen, Jinhao, Du, Zhengxiao, Wang, Weihan, Xu, Bin, Tang, Jie
Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks by integrating textual and visual information to achieve visual understanding in complex scenarios. Despite the availability of several benchmarks aims to evaluating MLLMs in tasks from visual question answering to complex problem-solving, most focus predominantly on mathematics or general visual understanding tasks. This reveals a critical gap in current benchmarks, which often overlook the inclusion of other key scientific disciplines such as physics and chemistry. To address this gap, we meticulously construct a comprehensive benchmark, named VisScience, which is utilized to assess the multi-modal scientific reasoning across the three disciplines of mathematics, physics, and chemistry. This benchmark comprises 3,000 questions drawn from K12 education - spanning elementary school through high school - equally distributed across three disciplines, with 1,000 questions per discipline. The questions within VisScience span 21 distinct subjects and are categorized into five difficulty levels, offering a broad spectrum of topics within each discipline. With VisScience, we present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning. Experimental results demonstrate that closed-source MLLMs generally outperform open-source models. The best performance observed include a 53.4\% accuracy in mathematics by Claude3.5-Sonnet, 38.2\% in physics by GPT-4o, and 47.0\% in chemistry by Gemini-1.5-Pro. These results underscore the strengths and limitations of MLLMs, suggesting areas for future improvement and highlighting the importance of developing models that can effectively handle the diverse demands of multi-modal scientific reasoning.
Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds
Li, Shuangqi, Le, Hieu, Xu, Jingyi, Salzmann, Mathieu
Text-to-image diffusion models have demonstrated remarkable capability in generating realistic images from arbitrary text prompts. However, they often produce inconsistent results for compositional prompts such as "two dogs" or "a penguin on the right of a bowl". Understanding these inconsistencies is crucial for reliable image generation. In this paper, we highlight the significant role of initial noise in these inconsistencies, where certain noise patterns are more reliable for compositional prompts than others. Our analyses reveal that different initial random seeds tend to guide the model to place objects in distinct image areas, potentially adhering to specific patterns of camera angles and image composition associated with the seed. To improve the model's compositional ability, we propose a method for mining these reliable cases, resulting in a curated training set of generated images without requiring any manual annotation. By fine-tuning text-to-image models on these generated images, we significantly enhance their compositional capabilities. For numerical composition, we observe relative increases of 29.3% and 19.5% for Stable Diffusion and PixArt-{\alpha}, respectively. Spatial composition sees even larger gains, with 60.7% for Stable Diffusion and 21.1% for PixArt-{\alpha}.