Government
Beyond Line-Level Filtering for the Pretraining Corpora of LLMs
Park, Chanwoo, Park, Suyoung, Ahn, Yelim, Kim, Jongmin, Park, Jongyeon, Lee, Jaejin
While traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, these basic methods can sometimes discard valuable content, negatively affecting downstream performance. In this paper, we introduce two methods-pattern-aware line-level deduplication (PLD) and pattern-aware trailing punctuation filtering (PTF)-by enhancing the conventional filtering techniques. Our approach not only considers line-level signals but also takes into account their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate these proposed methods by training small language models (1 B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on both SQuAD v1 and KorQuAD v1.
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures
Chang, Tyler A., Arnett, Catherine, Eldesokey, Abdelrahman, Sadallah, Abdelrahman, Kashar, Abeer, Daud, Abolade, Olanihun, Abosede Grace, Mohammed, Adamu Labaran, Praise, Adeyemi, Sharma, Adhikarinayum Meerajita, Gupta, Aditi, Iyigun, Afitab, Simplรญcio, Afonso, Essouaied, Ahmed, Chorana, Aicha, Eppa, Akhil, Oladipo, Akintunde, Ramesh, Akshay, Dorkin, Aleksei, Kondoro, Alfred Malengo, Aji, Alham Fikri, รetintaล, Ali Eren, Hanbury, Allan, Dembele, Alou, Niksarli, Alp, Arroyo, รlvaro, Bajand, Amin, Khanna, Amol, Chkhaidze, Ana, Condez, Ana, Mkhonto, Andiswa, Hoblitzell, Andrew, Tran, Andrew, Poulis, Angelos, Majumder, Anirban, Vacalopoulou, Anna, Wong, Annette Kuuipolani Kanahele, Simonsen, Annika, Kovalev, Anton, S, Ashvanth., Lana, Ayodeji Joseph, Kinay, Barkin, Alhafni, Bashar, Busole, Benedict Cibalinda, Ghanem, Bernard, Nathani, Bharti, ฤuriฤ, Biljana Stojanovska, Agbonile, Bola, Bergsson, Bragi, Fischer, Bruce Torres, Tutar, Burak, รฤฑnar, Burcu Alakuล, Kane, Cade J. Kanoniakapueo, Udomcharoenchaikit, Can, Arnett, Catherine, Helwe, Chadi, Nerella, Chaithra Reddy, Liu, Chen Cecilia, Nwokolo, Chiamaka Glory, Espaรฑa-Bonet, Cristina, Amol, Cynthia, Lee, DaeYeop, Arad, Dana, Dzenhaliou, Daniil, Pugacheva, Daria, Choi, Dasol, Abolade, Daud, Liu, David, Semedo, David, Popoola, Deborah, Mataciunas, Deividas, Nyaboke, Delphine, Kumar, Dhyuthy Krishna, Glรณria-Silva, Diogo, Tavares, Diogo, Goyal, Divyanshu, Lee, DongGeon, Anajemba, Ebele Nwamaka, Grace, Egonu Ngozi, Mickel, Elena, Tutubalina, Elena, Herranen, Elias, Anand, Emile, Habumuremyi, Emmanuel, Ajiboye, Emuobonuvie Maria, Yulianrifat, Eryawan Presma, Adenuga, Esther, Rudnicka, Ewa, Itiola, Faith Olabisi, Butt, Faran Taimoor, Thekkekara, Fathima, Haouari, Fatima, Tjiaranata, Filbert Aurelian, Laakom, Firas, Grasso, Francesca, Orabona, Francesco, Periti, Francesco, Solomon, Gbenga Kayode, Ngo, Gia Nghia, Udhehdhe-oze, Gloria, Martins, Gonรงalo, Challagolla, Gopi Naga Sai Ram, Son, Guijin, Abdykadyrova, Gulnaz, Einarsson, Hafsteinn, Hu, Hai, Saffari, Hamidreza, Zaidi, Hamza, Zhang, Haopeng, Shairah, Harethah Abu, Vuong, Harry, Kuulmets, Hele-Andra, Bouamor, Houda, Yu, Hwanjo, Debess, Iben Nyholm, Deveci, ฤฐbrahim Ethem, Hanif, Ikhlasul Akmal, Cho, Ikhyun, Calvo, Inรชs, Vieira, Inรชs, Manzi, Isaac, Daud, Ismail, Itzhak, Itay, Iuliia, null, Alekseenko, null, Belashkin, Ivan, Spada, Ivan, Zhelyazkov, Ivan, Brinton, Jacob, Isbarov, Jafar, ฤibej, Jaka, ฤuhel, Jan, Kocoล, Jan, Krito, Jauza Akbar, Purbey, Jebish, Mickel, Jennifer, Za, Jennifer, Kunz, Jenny, Jeong, Jihae, Dรกvalos, Jimena Tena, Lee, Jinu, Magalhรฃes, Joรฃo, Yi, John, Kim, Jongin, Chataignon, Joseph, Imperial, Joseph Marvin, Thevakumar, Jubeerathan, Land, Judith, Jiang, Junchen, Kim, Jungwhan, Sirts, Kairit, R, Kamesh, V, Kamesh, Tshinu, Kanda Patrick, Kukk, Kรคtriin, Ponkshe, Kaustubh, Huseynova, Kavsar, He, Ke, Buchanan, Kelly, Sarveswaran, Kengatharaiyer, Zaman, Kerem, Mrini, Khalil, Kyars, Kian, Kruusmaa, Krister, Chouhan, Kusum, Krishnakumar, Lainitha, Sรกnchez, Laura Castro, Moscoso, Laura Porrino, Choshen, Leshem, Sencan, Levent, รvrelid, Lilja, Alazraki, Lisa, Ehimen-Ugbede, Lovina, Thevakumar, Luheerathan, Thavarasa, Luxshan, Malik, Mahnoor, Keita, Mamadou K., Jangid, Mansi, De Santis, Marco, Garcรญa, Marcos, Suppa, Marek, D'Ciofalo, Mariam, Ojastu, Marii, Sikander, Maryam, Narayan, Mausami, Skandalis, Maximos, Mehak, Mehak, Bozkurt, Mehmet ฤฐlteriล, Workie, Melaku Bayu, Velayuthan, Menan, Leventhal, Michael, Marciลczuk, Michaล, Potoฤnjak, Mirna, Shafiei, Mohammadamin, Sharma, Mridul, Indoria, Mrityunjaya, Habibi, Muhammad Ravi Shulthan, Koliฤ, Murat, Galant, Nada, Permpredanun, Naphat, Maugin, Narada, Corrรชa, Nicholas Kluge, Ljubeลกiฤ, Nikola, Thomas, Nirmal, de Silva, Nisansa, Joshi, Nisheeth, Ponkshe, Nitish, Habash, Nizar, Udeze, Nneoma C., Thomas, Noel, Ligeti-Nagy, Noรฉmi, Coulibaly, Nouhoum, Faustin, Nsengiyumva, Buliaminu, Odunayo Kareemat, Ogundepo, Odunayo, Fejiro, Oghojafor Godswill, Funmilola, Ogundipe Blessing, God'spraise, Okechukwu, Samuel, Olanrewaju, Oluwaseun, Olaoye Deborah, Akindejoye, Olasoji, Popova, Olga, Snissarenko, Olga, Chiemezie, Onyinye Anulika, Kinay, Orkun, Tursun, Osman, Moses, Owoeye Tobiloba, Joshua, Oyelade Oluwafemi, Fiyinfoluwa, Oyesanmi, Gamallo, Pablo, Fernรกndez, Pablo Rodrรญguez, Arora, Palak, Valente, Pedro, Rupnik, Peter, Ekiugbo, Philip Oghenesuowho, Sahoo, Pramit, Prokopidis, Prokopis, Niau-Puhipau, Pua, Yahya, Quadri, Mignone, Rachele, Singhal, Raghav, Kadiyala, Ram Mohan Rao, Merx, Raphael, Afolayan, Rapheal, Rajalakshmi, Ratnavel, Ghosh, Rishav, Oji, Romina, Solis, Ron Kekeha, Guerra, Rui, Zawar, Rushikesh, Bashir, Sa'ad Nasir, Alzaabi, Saeed, Sandeep, Sahil, Batchu, Sai Pavan, Kantareddy, SaiSandeep, Pranida, Salsabila Zahirah, Buchanan, Sam, Rutunda, Samuel, Land, Sander, Sulollari, Sarah, Ali, Sardar, Sapkota, Saroj, Tautvaisas, Saulius, Sen, Sayambhu, Banerjee, Sayantani, Diarra, Sebastien, M, SenthilNathan., Lee, Sewoong, Shah, Shaan, Venkitachalam, Shankar, Djurabaeva, Sharifa, Ibejih, Sharon, Dutta, Shivanya Shomir, Gupta, Siddhant, Suรกrez, Silvia Paniagua, Ahmadi, Sina, Sukumar, Sivasuthan, Song, Siyuan, A., Snegha, Sofianopoulos, Sokratis, Simon, Sona Elza, Benฤina, Sonja, Gvasalia, Sophie, More, Sphurti Kirit, Dragazis, Spyros, Kaufhold, Stephan P., S, Suba., AlRashed, Sultan, Ranathunga, Surangika, Someya, Taiga, Pungerลกek, Taja Kuzman, Haklay, Tal, Jibril, Tasi'u, Aoyama, Tatsuya, Abashidze, Tea, Cruz, Terenz Jomar Dela, Blevins, Terra, Nikas, Themistoklis, Idoko, Theresa Dora, Do, Thu Mai, Chubakov, Tilek, Gargiani, Tommaso, Rathore, Uma, Johannesen, Uni, Ugwu, Uwuma Doris, Putra, Vallerie Alexandra, Kumar, Vanya Bannihatti, Jeyarajalingam, Varsha, Arzt, Varvara, Nedumpozhimana, Vasudevan, Ondrejova, Viktoria, Horbik, Viktoryia, Kummitha, Vishnu Vardhan Reddy, Diniฤ, Vuk, Sewunetie, Walelign Tewabe, Wu, Winston, Zhao, Xiaojing, Diarra, Yacouba, Nikankin, Yaniv, Mathur, Yash, Chen, Yixi, Li, Yiyuan, Xavier, Yolanda, Belinkov, Yonatan, Abayomi, Yusuf Ismail, Alyafeai, Zaid, Shan, Zhengyang, Tam, Zhi Rui, Tang, Zilu, Nadova, Zuzana, Abbasi, Baber, Biderman, Stella, Stap, David, Ataman, Duygu, Schmidt, Fabian, Gonen, Hila, Wang, Jiayi, Adelani, David Ifeoluwa
To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.
Low-N Protein Activity Optimization with FolDE
Roberts, Jacob B., Ji, Catherine R., Donnell, Isaac, Young, Thomas D., Pearson, Allison N., Hudson, Graham A., Keiser, Leah S., Wesselkamper, Mia, Winegar, Peter H., Ludwig, Janik, Klass, Sarah H., Sheth, Isha V., Ukabiala, Ezechinyere C., Astolfi, Maria C. T., Eysenbach, Benjamin, Keasling, Jay D.
Proteins are traditionally optimized through the costly construction and measurement of many mutants. Active Learning-assisted Directed Evolution (ALDE) alleviates that cost by predicting the best improvements and iteratively testing mutants to inform predictions. However, existing ALDE methods face a critical limitation: selecting the highest-predicted mutants in each round yields homogeneous training data insufficient for accurate prediction models in subsequent rounds. Here we present FolDE, an ALDE method designed to maximize end-of-campaign success. In simulations across 20 protein targets, FolDE discovers 23% more top 10% mutants than the best baseline ALDE method (p=0.005) and is 55% more likely to find top 1% mutants. FolDE achieves this primarily through naturalness-based warm-starting, which augments limited activity measurements with protein language model outputs to improve activity prediction. We also introduce a constant-liar batch selector, which improves batch diversity; this is important in multi-mutation campaigns but had limited effect in our benchmarks. The complete workflow is freely available as open-source software, making efficient protein optimization accessible to any laboratory.
Predicting Barge Tow Size on Inland Waterways Using Vessel Trajectory Derived Features: Proof of Concept
Agorku, Geoffery, Hernandez, Sarah, Hames, Hayley, Wagner, Cade
Accurate, real-time estimation of barge quantity on inland waterways remains a critical challenge due to the non-self-propelled nature of barges and the limitations of existing monitoring systems. This study introduces a novel method to use Automatic Identification System (AIS) vessel tracking data to predict the number of barges in tow using Machine Learning (ML). To train and test the model, barge instances were manually annotated from satellite scenes across the Lower Mississippi River. Labeled images were matched to AIS vessel tracks using a spatiotemporal matching procedure. A comprehensive set of 30 AIS-derived features capturing vessel geometry, dynamic movement, and trajectory patterns were created and evaluated using Recursive Feature Elimination (RFE) to identify the most predictive variables. Six regression models, including ensemble, kernel-based, and generalized linear approaches, were trained and evaluated. The Poisson Regressor model yielded the best performance, achieving a Mean Absolute Error (MAE) of 1.92 barges using 12 of the 30 features. The feature importance analysis revealed that metrics capturing vessel maneuverability such as course entropy, speed variability and trip length were most predictive of barge count. The proposed approach provides a scalable, readily implementable method for enhancing Maritime Domain Awareness (MDA), with strong potential applications in lock scheduling, port management, and freight planning. Future work will expand the proof of concept presented here to explore model transferability to other inland rivers with differing operational and environmental conditions.
SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability
Xu, Peiyang, Pan, Minzhou, Chen, Zhaorun, Yang, Shuang, Xiao, Chaowei, Li, Bo
With the rapid proliferation of digital media, the need for efficient and transparent safeguards against unsafe content is more critical than ever. Traditional image guardrail models, constrained by predefined categories, often misclassify content due to their pure feature-based learning without semantic reasoning. Moreover, these models struggle to adapt to emerging threats, requiring costly retraining for new threats. To address these limitations, we introduce SafeVision, a novel image guardrail that integrates human-like reasoning to enhance adaptability and transparency. Our approach incorporates an effective data collection and generation framework, a policy-following training pipeline, and a customized loss function. We also propose a diverse QA generation and training strategy to enhance learning effectiveness. SafeVision dynamically aligns with evolving safety policies at inference time, eliminating the need for retraining while ensuring precise risk assessments and explanations. Recognizing the limitations of existing unsafe image benchmarks, which either lack granularity or cover limited risks, we introduce VisionHarm, a high-quality dataset comprising two subsets: VisionHarm Third-party (VisionHarm-T) and VisionHarm Comprehensive(VisionHarm-C), spanning diverse harmful categories. Through extensive experiments, we show that SafeVision achieves state-of-the-art performance on different benchmarks. SafeVision outperforms GPT-4o by 8.6% on VisionHarm-T and by 15.5% on VisionHarm-C, while being over 16x faster. SafeVision sets a comprehensive, policy-following, and explainable image guardrail with dynamic adaptation to emerging threats.
MFiSP: A Multimodal Fire Spread Prediction Framework
Sathiyamoorthy, Alec, Zhou, Wenhao, Zhou, Xiangmin, Li, Xiaodong, Gondal, Iqbal
The 2019-2020 Black Summer bushfires in Australia devastated 19 million hectares, destroyed 3,000 homes, and lasted seven months, demonstrating the escalating scale and urgency of wildfire threats requiring better forecasting for effective response. Traditional fire modeling relies on manual interpretation by Fire Behaviour Analysts (FBAns) and static environmental data, often leading to inaccuracies and operational limitations. Emerging data sources, such as NASA's FIRMS satellite imagery and Volunteered Geographic Information, offer potential improvements by enabling dynamic fire spread prediction. This study proposes a Multimodal Fire Spread Prediction Framework (MFiSP) that integrates social media data and remote sensing observations to enhance forecast accuracy. By adapting fuel map manipulation strategies between assimilation cycles, the framework dynamically adjusts fire behavior predictions to align with the observed rate of spread. We evaluate the efficacy of MFiSP using synthetically generated fire event polygons across multiple scenarios, analyzing individual and combined impacts on forecast perimeters. Results suggest that our MFiSP integrating multimodal data can improve fire spread prediction beyond conventional methods reliant on FBAn expertise and static inputs.
Coordinated Autonomous Drones for Human-Centered Fire Evacuation in Partially Observable Urban Environments
Mendoza, Maria G., Kalanther, Addison, Bostwick, Daniel, Stephan, Emma, Maheshwari, Chinmay, Sastry, Shankar
Autonomous drone technology holds significant promise for enhancing search and rescue operations during evacuations by guiding humans toward safety and supporting broader emergency response efforts. However, their application in dynamic, real-time evacuation support remains limited. Existing models often overlook the psychological and emotional complexity of human behavior under extreme stress. In real-world fire scenarios, evacuees frequently deviate from designated safe routes due to panic and uncertainty. To address these challenges, this paper presents a multi-agent coordination framework in which autonomous Unmanned Aerial Vehicles (UAVs) assist human evacuees in real-time by locating, intercepting, and guiding them to safety under uncertain conditions. We model the problem as a Partially Observable Markov Decision Process (POMDP), where two heterogeneous UAV agents, a high-level rescuer (HLR) and a low-level rescuer (LLR), coordinate through shared observations and complementary capabilities. Human behavior is captured using an agent-based model grounded in empirical psychology, where panic dynamically affects decision-making and movement in response to environmental stimuli. The environment features stochastic fire spread, unknown evacuee locations, and limited visibility, requiring UAVs to plan over long horizons to search for humans and adapt in real-time. Our framework employs the Proximal Policy Optimization (PPO) algorithm with recurrent policies to enable robust decision-making in partially observable settings. Simulation results demonstrate that the UAV team can rapidly locate and intercept evacuees, significantly reducing the time required for them to reach safety compared to scenarios without UAV assistance.
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
Datta, Shrestha, Nahin, Shahriar Kabir, Chhabra, Anshuman, Mohapatra, Prasant
Agentic AI systems powered by large language models (LLMs) and endowed with planning, tool use, memory, and autonomy, are emerging as powerful, flexible platforms for automation. Their ability to autonomously execute tasks across web, software, and physical environments creates new and amplified security risks, distinct from both traditional AI safety and conventional software security. This survey outlines a taxonomy of threats specific to agentic AI, reviews recent benchmarks and evaluation methodologies, and discusses defense strategies from both technical and governance perspectives. We synthesize current research and highlight open challenges, aiming to support the development of secure-by-design agent systems.
Multi-Environment POMDPs: Discrete Model Uncertainty Under Partial Observability
Bovy, Eline M., Probine, Caleb, Suilen, Marnix, Topcu, Ufuk, Jansen, Nils
Multi-environment POMDPs (ME-POMDPs) extend standard POMDPs with discrete model uncertainty. ME-POMDPs represent a finite set of POMDPs that share the same state, action, and observation spaces, but may arbitrarily vary in their transition, observation, and reward models. Such models arise, for instance, when multiple domain experts disagree on how to model a problem. The goal is to find a single policy that is robust against any choice of POMDP within the set, i.e., a policy that maximizes the worst-case reward across all POMDPs. We generalize and expand on existing work in the following way. First, we show that ME-POMDPs can be generalized to POMDPs with sets of initial beliefs, which we call adversarial-belief POMDPs (AB-POMDPs). Second, we show that any arbitrary ME-POMDP can be reduced to a ME-POMDP that only varies in its transition and reward functions or only in its observation and reward functions, while preserving (optimal) policies. We then devise exact and approximate (point-based) algorithms to compute robust policies for AB-POMDPs, and thus ME-POMDPs. We demonstrate that we can compute policies for standard POMDP benchmarks extended to the multi-environment setting.
On the Societal Impact of Machine Learning
This PhD thesis investigates the societal impact of machine learning (ML). ML increasingly informs consequential decisions and recommendations, significantly affecting many aspects of our lives. As these data-driven systems are often developed without explicit fairness considerations, they carry the risk of discriminatory effects. The contributions in this thesis enable more appropriate measurement of fairness in ML systems, systematic decomposition of ML systems to anticipate bias dynamics, and effective interventions that reduce algorithmic discrimination while maintaining system utility. I conclude by discussing ongoing challenges and future research directions as ML systems, including generative artificial intelligence, become increasingly integrated into society. This work offers a foundation for ensuring that ML's societal impact aligns with broader social values.