Overview
Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions
Lee, Yu-Ang, Yi, Guan-Ting, Liu, Mei-Yi, Lu, Jui-Chao, Yang, Guan-Bo, Chen, Yun-Nung
Recent advancements in large language models (LLMs) and AI systems have led to a paradigm shift in the design and optimization of complex AI workflows. By integrating multiple components, compound AI systems have become increasingly adept at performing sophisticated tasks. However, as these systems grow in complexity, new challenges arise in optimizing not only individual components but also their interactions. While traditional optimization methods such as supervised fine-tuning (SFT) and reinforcement learning (RL) remain foundational, the rise of natural language feedback introduces promising new approaches, especially for optimizing non-differentiable systems. This paper provides a systematic review of recent progress in optimizing compound AI systems, encompassing both numerical and language-based techniques. We formalize the notion of compound AI system optimization, classify existing methods along several key dimensions, and highlight open research challenges and future directions in this rapidly evolving field. A list of surveyed papers is publicly available at https://github.com/MiuLab/AISysOpt-Survey.
A Survey on Stereotype Detection in Natural Language Processing
Cignarella, Alessandra Teresa, Giachanou, Anastasia, Lefever, Els
Stereotypes influence social perceptions and can escalate into discrimination and violence. While NLP research has extensively addressed gender bias and hate speech, stereotype detection remains an emerging field with significant societal implications. In this work is presented a survey of existing research, analyzing definitions from psychology, sociology, and philosophy. A semi-automatic literature review was performed by using Semantic Scholar. We retrieved and filtered over 6,000 papers (in the year range 2000-2025), identifying key trends, methodologies, challenges and future directions. The findings emphasize stereotype detection as a potential early-monitoring tool to prevent bias escalation and the rise of hate speech. Conclusions highlight the need for a broader, multilingual, and intersectional approach in NLP studies.
Measuring LLM Novelty As The Frontier Of Original And High-Quality Output
Padmakumar, Vishakh, Yueh-Han, Chen, Pan, Jane, Chen, Valerie, He, He
As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as originality with respect to model training data, but original outputs may be of low quality. In contrast, non-expert judges more reliably score quality but may favor memorized outputs, limiting the reliability of human preference as a metric. We introduce a new novelty metric for LLM generations that balances originality and quality -- the harmonic mean of the fraction of \ngrams unseen during training and a task-specific quality score. Using this framework, we identify trends that affect the novelty of generations from three families of open-data models (OLMo, OLMo-2, and Pythia) on three creative tasks: story completion, poetry writing, and creative tool use. We find that model-generated text from some base LLMs is less novel than human-written text from the internet. However, increasing model scale and post-training reliably improves novelty due to improvements in output quality. We also find that improving the base model at the same scale (\eg OLMo 7B to OLMo-2 7B) leads to higher novelty due to higher originality. Finally, we observe that inference-time methods, such as prompting and providing novel in-context examples, have a much smaller effect on novelty, often increasing originality at the expense of quality. This highlights the need for further research into more effective elicitation strategies as we use models for creative applications.
Downsized and Compromised?: Assessing the Faithfulness of Model Compression
Kamal, Moumita, Talbert, Douglas A.
In real-world applications, computational constraints often require transforming large models into smaller, more efficient versions through model compression. While these techniques aim to reduce size and computational cost without sacrificing performance, their evaluations have traditionally focused on the trade-off between size and accuracy, overlooking the aspect of model faithfulness. This limited view is insufficient for high-stakes domains like healthcare, finance, and criminal justice, where compressed models must remain faithful to the behavior of their original counterparts. This paper presents a novel approach to evaluating faithfulness in compressed models, moving beyond standard metrics. We introduce and demonstrate a set of faithfulness metrics that capture how model behavior changes post-compression. Our contributions include introducing techniques to assess predictive consistency between the original and compressed models using model agreement, and applying chi-squared tests to detect statistically significant changes in predictive patterns across both the overall dataset and demographic subgroups, thereby exposing shifts that aggregate fairness metrics may obscure. We demonstrate our approaches by applying quantization and pruning to artificial neural networks (ANNs) trained on three diverse and socially meaningful datasets. Our findings show that high accuracy does not guarantee faithfulness, and our statistical tests detect subtle yet significant shifts that are missed by standard metrics, such as Accuracy and Equalized Odds. The proposed metrics provide a practical and more direct method for ensuring that efficiency gains through compression do not compromise the fairness or faithfulness essential for trustworthy AI.