MAGNET: A Multi-Graph Attentional Network for Code Clone Detection
Zhang, Zixian, Saber, Takfarinas
Code clone detection is a fundamental task in software engineering that underpins refactoring, debugging, plagiarism detection, and vulnerability analysis. Existing methods often rely on a single representation, such as an abstract syntax tree (AST), control flow graph (CFG), or data flow graph (DFG), which captures only partial aspects of code semantics. Hybrid approaches have emerged, but their fusion strategies are typically handcrafted and ineffective. In this study, we propose MAGNET, a multi-graph attentional framework that jointly leverages AST, CFG, and DFG representations to capture syntactic and semantic features of source code. MAGNET integrates residual graph neural networks with node-level self-attention to learn both local and long-range dependencies, introduces a gated cross-attention mechanism for fine-grained inter-graph interactions, and employs Set2Set pooling to fuse multi-graph embeddings into unified program-level representations. Extensive experiments on BigCloneBench and Google Code Jam demonstrate that MAGNET achieves state-of-the-art performance, with overall F1 scores of 96.5% and 99.2% on the two datasets, respectively. Ablation studies confirm the critical contributions of multi-graph fusion and each attentional component. Our code is available at https://github.com/ZixianReid/Multigraph_match
HACK: Hallucinations Along Certainty and Knowledge Axes
Simhi, Adi, Herzig, Jonathan, Itzhak, Itay, Arad, Dana, Gekhman, Zorik, Reichart, Roi, Barez, Fazl, Stanovsky, Gabriel, Szpektor, Idan, Belinkov, Yonatan
Hallucinations in LLMs present a critical barrier to their reliable usage. Existing research usually categorizes hallucinations by their external properties rather than by the LLMs' underlying internal properties. This external focus overlooks that hallucinations may require tailored mitigation strategies based on their underlying mechanism. We propose a framework for categorizing hallucinations along two axes: knowledge and certainty. Since parametric knowledge and certainty may vary across models, our categorization method involves a model-specific dataset construction process that differentiates between those types of hallucinations. Along the knowledge axis, we distinguish between hallucinations caused by a lack of knowledge and those occurring despite the model having knowledge of the correct response. To validate our framework along the knowledge axis, we apply steering mitigation, which relies on the existence of parametric knowledge to manipulate model activations. This addresses the lack of existing methods to validate knowledge categorization by showing a significant difference between the two hallucination types. We further analyze the distinct knowledge and hallucination patterns between models, showing that different hallucinations do occur despite shared parametric knowledge. Turning to the certainty axis, we identify a particularly concerning subset of hallucinations where models hallucinate with certainty despite having the correct knowledge internally. We introduce a new evaluation metric to measure the effectiveness of mitigation methods on this subset, revealing that while some methods perform well on average, they fail disproportionately on these critical cases. Our findings highlight the importance of considering both knowledge and certainty in hallucination analysis and call for targeted mitigation approaches that consider the factors underlying each hallucination.
Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean
Park, Chanwoo, Park, Suyoung, Kang, JiA, Park, Jongyeon, Kim, Sangho, Park, Hyunji M., Bae, Sumin, Kang, Mingyu, Lee, Jaejin
We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models -- two multilingual and two Korean-specialized -- show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.
Beyond Line-Level Filtering for the Pretraining Corpora of LLMs
Park, Chanwoo, Park, Suyoung, Ahn, Yelim, Kim, Jongmin, Park, Jongyeon, Lee, Jaejin
While traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, these basic methods can sometimes discard valuable content, negatively affecting downstream performance. In this paper, we introduce two methods, pattern-aware line-level deduplication (PLD) and pattern-aware trailing-punctuation filtering (PTF), that enhance the conventional filtering techniques. Our approach not only considers line-level signals but also takes into account their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate these proposed methods by training small language models (1B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on both SQuAD v1 and KorQuAD v1.
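To make the baseline concrete, the following is a minimal sketch of the two conventional line-level filters the abstract refers to, not of PLD or PTF themselves. The function names, the `max_occurrences` parameter, the choice to drop every copy of a repeated line, and the set of terminal punctuation marks are all assumptions made for illustration.

```python
from collections import Counter

# Assumed set of characters that count as "terminal punctuation".
TERMINAL_PUNCTUATION = (".", "!", "?", '"', "'")


def line_level_dedup(documents, max_occurrences=1):
    """Drop every line that occurs more than max_occurrences times corpus-wide.

    This is one simple variant of line-level deduplication (hypothetical
    helper); other variants keep the first copy of each repeated line.
    """
    counts = Counter(
        line.strip()
        for doc in documents
        for line in doc.splitlines()
        if line.strip()
    )
    cleaned = []
    for doc in documents:
        kept = [
            line
            for line in doc.splitlines()
            if line.strip() and counts[line.strip()] <= max_occurrences
        ]
        cleaned.append("\n".join(kept))
    return cleaned


def trailing_punctuation_filter(document):
    """Keep only the lines that end with a terminal punctuation mark."""
    kept = [
        line
        for line in document.splitlines()
        if line.strip().endswith(TERMINAL_PUNCTUATION)
    ]
    return "\n".join(kept)
```

Under these assumptions, `trailing_punctuation_filter("Recipe\nMix the flour.\n- 2 eggs")` keeps only `Mix the flour.`, discarding the heading and the list item. That is the kind of structurally useful content the abstract says purely line-level filters can lose.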
Learning Parameterized Skills from Demonstrations
Gupta, Vedant, Fu, Haotian, Luo, Calvin, Jiang, Yiding, Konidaris, George
We present DEPS, an end-to-end algorithm for discovering parameterized skills from expert demonstrations. Our method learns parameterized skill policies jointly with a meta-policy that selects the appropriate discrete skill and continuous parameters at each timestep. Using a combination of temporal variational inference and information-theoretic regularization methods, we address the challenge of degeneracy common in latent variable models, ensuring that the learned skills are temporally extended, semantically meaningful, and adaptable. We empirically show that learning parameterized skills from multitask expert demonstrations significantly improves generalization to unseen tasks. Our method outperforms multitask as well as skill learning baselines on both LIBERO and MetaWorld benchmarks. We also demonstrate that DEPS discovers interpretable parameterized skills, such as an object grasping skill whose continuous arguments define the grasp location.
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures
Chang, Tyler A., Arnett, Catherine, Eldesokey, Abdelrahman, Sadallah, Abdelrahman, Kashar, Abeer, Daud, Abolade, Olanihun, Abosede Grace, Mohammed, Adamu Labaran, Praise, Adeyemi, Sharma, Adhikarinayum Meerajita, Gupta, Aditi, Iyigun, Afitab, Simplício, Afonso, Essouaied, Ahmed, Chorana, Aicha, Eppa, Akhil, Oladipo, Akintunde, Ramesh, Akshay, Dorkin, Aleksei, Kondoro, Alfred Malengo, Aji, Alham Fikri, Çetintaş, Ali Eren, Hanbury, Allan, Dembele, Alou, Niksarli, Alp, Arroyo, Álvaro, Bajand, Amin, Khanna, Amol, Chkhaidze, Ana, Condez, Ana, Mkhonto, Andiswa, Hoblitzell, Andrew, Tran, Andrew, Poulis, Angelos, Majumder, Anirban, Vacalopoulou, Anna, Wong, Annette Kuuipolani Kanahele, Simonsen, Annika, Kovalev, Anton, S, Ashvanth., Lana, Ayodeji Joseph, Kinay, Barkin, Alhafni, Bashar, Busole, Benedict Cibalinda, Ghanem, Bernard, Nathani, Bharti, Đurić, Biljana Stojanovska, Agbonile, Bola, Bergsson, Bragi, Fischer, Bruce Torres, Tutar, Burak, Çınar, Burcu Alakuş, Kane, Cade J. Kanoniakapueo, Udomcharoenchaikit, Can, Arnett, Catherine, Helwe, Chadi, Nerella, Chaithra Reddy, Liu, Chen Cecilia, Nwokolo, Chiamaka Glory, España-Bonet, Cristina, Amol, Cynthia, Lee, DaeYeop, Arad, Dana, Dzenhaliou, Daniil, Pugacheva, Daria, Choi, Dasol, Abolade, Daud, Liu, David, Semedo, David, Popoola, Deborah, Mataciunas, Deividas, Nyaboke, Delphine, Kumar, Dhyuthy Krishna, Glória-Silva, Diogo, Tavares, Diogo, Goyal, Divyanshu, Lee, DongGeon, Anajemba, Ebele Nwamaka, Grace, Egonu Ngozi, Mickel, Elena, Tutubalina, Elena, Herranen, Elias, Anand, Emile, Habumuremyi, Emmanuel, Ajiboye, Emuobonuvie Maria, Yulianrifat, Eryawan Presma, Adenuga, Esther, Rudnicka, Ewa, Itiola, Faith Olabisi, Butt, Faran Taimoor, Thekkekara, Fathima, Haouari, Fatima, Tjiaranata, Filbert Aurelian, Laakom, Firas, Grasso, Francesca, Orabona, Francesco, Periti, Francesco, Solomon, Gbenga Kayode, Ngo, Gia Nghia, Udhehdhe-oze, Gloria, Martins, Gonçalo, Challagolla, Gopi Naga Sai Ram, Son, Guijin, Abdykadyrova, Gulnaz, 
Einarsson, Hafsteinn, Hu, Hai, Saffari, Hamidreza, Zaidi, Hamza, Zhang, Haopeng, Shairah, Harethah Abu, Vuong, Harry, Kuulmets, Hele-Andra, Bouamor, Houda, Yu, Hwanjo, Debess, Iben Nyholm, Deveci, İbrahim Ethem, Hanif, Ikhlasul Akmal, Cho, Ikhyun, Calvo, Inês, Vieira, Inês, Manzi, Isaac, Daud, Ismail, Itzhak, Itay, Iuliia, null, Alekseenko, null, Belashkin, Ivan, Spada, Ivan, Zhelyazkov, Ivan, Brinton, Jacob, Isbarov, Jafar, Čibej, Jaka, Čuhel, Jan, Kocoń, Jan, Krito, Jauza Akbar, Purbey, Jebish, Mickel, Jennifer, Za, Jennifer, Kunz, Jenny, Jeong, Jihae, Dávalos, Jimena Tena, Lee, Jinu, Magalhães, João, Yi, John, Kim, Jongin, Chataignon, Joseph, Imperial, Joseph Marvin, Thevakumar, Jubeerathan, Land, Judith, Jiang, Junchen, Kim, Jungwhan, Sirts, Kairit, R, Kamesh, V, Kamesh, Tshinu, Kanda Patrick, Kukk, Kätriin, Ponkshe, Kaustubh, Huseynova, Kavsar, He, Ke, Buchanan, Kelly, Sarveswaran, Kengatharaiyer, Zaman, Kerem, Mrini, Khalil, Kyars, Kian, Kruusmaa, Krister, Chouhan, Kusum, Krishnakumar, Lainitha, Sánchez, Laura Castro, Moscoso, Laura Porrino, Choshen, Leshem, Sencan, Levent, Øvrelid, Lilja, Alazraki, Lisa, Ehimen-Ugbede, Lovina, Thevakumar, Luheerathan, Thavarasa, Luxshan, Malik, Mahnoor, Keita, Mamadou K., Jangid, Mansi, De Santis, Marco, García, Marcos, Suppa, Marek, D'Ciofalo, Mariam, Ojastu, Marii, Sikander, Maryam, Narayan, Mausami, Skandalis, Maximos, Mehak, Mehak, Bozkurt, Mehmet İlteriş, Workie, Melaku Bayu, Velayuthan, Menan, Leventhal, Michael, Marcińczuk, Michał, Potočnjak, Mirna, Shafiei, Mohammadamin, Sharma, Mridul, Indoria, Mrityunjaya, Habibi, Muhammad Ravi Shulthan, Kolić, Murat, Galant, Nada, Permpredanun, Naphat, Maugin, Narada, Corrêa, Nicholas Kluge, Ljubešić, Nikola, Thomas, Nirmal, de Silva, Nisansa, Joshi, Nisheeth, Ponkshe, Nitish, Habash, Nizar, Udeze, Nneoma C., Thomas, Noel, Ligeti-Nagy, Noémi, Coulibaly, Nouhoum, Faustin, Nsengiyumva, Buliaminu, Odunayo Kareemat, Ogundepo, Odunayo, Fejiro, Oghojafor Godswill, 
Funmilola, Ogundipe Blessing, God'spraise, Okechukwu, Samuel, Olanrewaju, Oluwaseun, Olaoye Deborah, Akindejoye, Olasoji, Popova, Olga, Snissarenko, Olga, Chiemezie, Onyinye Anulika, Kinay, Orkun, Tursun, Osman, Moses, Owoeye Tobiloba, Joshua, Oyelade Oluwafemi, Fiyinfoluwa, Oyesanmi, Gamallo, Pablo, Fernández, Pablo Rodríguez, Arora, Palak, Valente, Pedro, Rupnik, Peter, Ekiugbo, Philip Oghenesuowho, Sahoo, Pramit, Prokopidis, Prokopis, Niau-Puhipau, Pua, Yahya, Quadri, Mignone, Rachele, Singhal, Raghav, Kadiyala, Ram Mohan Rao, Merx, Raphael, Afolayan, Rapheal, Rajalakshmi, Ratnavel, Ghosh, Rishav, Oji, Romina, Solis, Ron Kekeha, Guerra, Rui, Zawar, Rushikesh, Bashir, Sa'ad Nasir, Alzaabi, Saeed, Sandeep, Sahil, Batchu, Sai Pavan, Kantareddy, SaiSandeep, Pranida, Salsabila Zahirah, Buchanan, Sam, Rutunda, Samuel, Land, Sander, Sulollari, Sarah, Ali, Sardar, Sapkota, Saroj, Tautvaisas, Saulius, Sen, Sayambhu, Banerjee, Sayantani, Diarra, Sebastien, M, SenthilNathan., Lee, Sewoong, Shah, Shaan, Venkitachalam, Shankar, Djurabaeva, Sharifa, Ibejih, Sharon, Dutta, Shivanya Shomir, Gupta, Siddhant, Suárez, Silvia Paniagua, Ahmadi, Sina, Sukumar, Sivasuthan, Song, Siyuan, A., Snegha, Sofianopoulos, Sokratis, Simon, Sona Elza, Benčina, Sonja, Gvasalia, Sophie, More, Sphurti Kirit, Dragazis, Spyros, Kaufhold, Stephan P., S, Suba., AlRashed, Sultan, Ranathunga, Surangika, Someya, Taiga, Pungeršek, Taja Kuzman, Haklay, Tal, Jibril, Tasi'u, Aoyama, Tatsuya, Abashidze, Tea, Cruz, Terenz Jomar Dela, Blevins, Terra, Nikas, Themistoklis, Idoko, Theresa Dora, Do, Thu Mai, Chubakov, Tilek, Gargiani, Tommaso, Rathore, Uma, Johannesen, Uni, Ugwu, Uwuma Doris, Putra, Vallerie Alexandra, Kumar, Vanya Bannihatti, Jeyarajalingam, Varsha, Arzt, Varvara, Nedumpozhimana, Vasudevan, Ondrejova, Viktoria, Horbik, Viktoryia, Kummitha, Vishnu Vardhan Reddy, Dinić, Vuk, Sewunetie, Walelign Tewabe, Wu, Winston, Zhao, Xiaojing, Diarra, Yacouba, Nikankin, Yaniv, Mathur, Yash, Chen, Yixi, Li, 
Yiyuan, Xavier, Yolanda, Belinkov, Yonatan, Abayomi, Yusuf Ismail, Alyafeai, Zaid, Shan, Zhengyang, Tam, Zhi Rui, Tang, Zilu, Nadova, Zuzana, Abbasi, Baber, Biderman, Stella, Stap, David, Ataman, Duygu, Schmidt, Fabian, Gonen, Hila, Wang, Jiayi, Adelani, David Ifeoluwa
To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.
GraphNet: A Large-Scale Computational Graph Dataset for Tensor Compiler Research
Li, Xinqi, Liu, Yiqun, Jiang, Shan, Zheng, Enrong, Zheng, Huaijin, Dai, Wenhao, Deng, Haodong, Yu, Dianhai, Ma, Yanjun
We introduce GraphNet, a dataset of 2.7K real-world deep learning computational graphs with rich metadata, spanning six major task categories across multiple deep learning frameworks. To evaluate tensor compiler performance on these samples, we propose the benchmark metric Speedup Score S(t), which jointly considers runtime speedup and execution correctness under tunable tolerance levels, offering a reliable measure of general optimization capability. Furthermore, we extend S(t) to the Error-aware Speedup Score ES(t), which incorporates error information and helps compiler developers identify key performance bottlenecks. In this report, we benchmark the default tensor compilers, CINN for PaddlePaddle and TorchInductor for PyTorch, on computer vision (CV) and natural language processing (NLP) samples to demonstrate the practicality of GraphNet. The full construction pipeline with graph extraction and compiler evaluation tools is available at https://github.com/PaddlePaddle/GraphNet .
Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward
Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide LLMs to abstain, such as overall confidence or uncertainty scores on multiple sampled answers, which may result in an imprecise awareness of the model's own knowledge boundaries. To this end, we propose a novel reinforcement learning framework built on Fine-grained Semantic Confidence Reward (FiSCoRe), which guides LLMs to abstain via sample-specific confidence. Specifically, our method operates by sampling multiple candidate answers and conducting semantic clustering, then training the LLM to retain answers within high-confidence clusters and discard those within low-confidence ones, thereby promoting accurate post-hoc abstention. Additionally, we propose a new metric for evaluating the reliability of abstention fine-tuning tasks more comprehensively. Our method significantly enhances reliability on both in-domain and out-of-distribution benchmarks.
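The sample-then-cluster step can be illustrated with a small sketch. It is not the authors' implementation. It assumes that a cluster's confidence is simply its share of the sampled answers, that a fixed threshold separates "retain" from "abstain", and that a caller-supplied `equivalent` function stands in for whatever semantic-equivalence model is actually used. All names here are hypothetical.

```python
def cluster_answers(answers, equivalent):
    """Greedily group answers; each joins the first cluster whose
    representative (first member) it is equivalent to."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            if equivalent(answer, cluster[0]):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def sample_confidences(answers, equivalent):
    """Assign each sampled answer the relative size of its cluster
    (assumed confidence proxy, not the paper's definition)."""
    clusters = cluster_answers(answers, equivalent)
    confidences = []
    for answer in answers:
        cluster = next(c for c in clusters if equivalent(answer, c[0]))
        confidences.append(len(cluster) / len(answers))
    return confidences


def retain_or_abstain(answers, equivalent, threshold=0.5):
    """Label each sample: retain if its cluster confidence reaches the
    (assumed) threshold, otherwise abstain."""
    return [
        (answer, "retain" if confidence >= threshold else "abstain")
        for answer, confidence in zip(answers, sample_confidences(answers, equivalent))
    ]


def same_normalized_text(a, b):
    """Toy stand-in for a semantic-equivalence check."""
    return a.strip().lower() == b.strip().lower()
```

With the samples `["Paris", "paris", "Paris ", "Lyon"]` and `same_normalized_text`, the three Paris variants form one cluster with confidence 0.75 and are retained, while `Lyon` gets 0.25 and is marked for abstention. In the framework described above, per-sample signals of this kind would feed a reward rather than a hard filter.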
VOCALoco: Viability-Optimized Cost-aware Adaptive Locomotion
Wu, Stanley, Danesh, Mohamad H., Li, Simon, Yurchyk, Hanna, Abyaneh, Amin, Houssaini, Anas El, Meger, David, Lin, Hsiu-Chin
Recent advancements in legged robot locomotion have facilitated traversal over increasingly complex terrains. Despite this progress, many existing approaches rely on end-to-end deep reinforcement learning (DRL), which poses limitations in terms of safety and interpretability, especially when generalizing to novel terrains. To overcome these challenges, we introduce VOCALoco, a modular skill-selection framework that dynamically adapts locomotion strategies based on perceptual input. Given a set of pre-trained locomotion policies, VOCALoco evaluates their viability and energy consumption by predicting both the safety of execution and the anticipated cost of transport over a fixed planning horizon. This joint assessment enables the selection of policies that are both safe and energy-efficient, given the observed local terrain. We evaluate our approach on staircase locomotion tasks, demonstrating its performance in both simulated and real-world scenarios using a quadrupedal robot. Empirical results show that VOCALoco achieves improved robustness and safety during stair ascent and descent compared to a conventional end-to-end DRL policy.
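One plausible reading of the joint assessment is a two-stage rule: discard policies predicted to be unsafe, then pick the cheapest of the rest. The sketch below illustrates that reading only. The `PolicyEstimate` type, the `min_viability` threshold, and the fall-back to the safest policy are assumptions, not details taken from the abstract.

```python
from dataclasses import dataclass


@dataclass
class PolicyEstimate:
    """Predictions for one pre-trained policy over the planning horizon
    (hypothetical container)."""
    name: str
    viability: float          # predicted probability of safe execution
    cost_of_transport: float  # predicted energy cost; lower is better


def select_policy(estimates, min_viability=0.9):
    """Pick the cheapest policy among those deemed safe enough.

    If no policy reaches the (assumed) viability threshold, fall back to
    the one with the highest predicted viability.
    """
    viable = [e for e in estimates if e.viability >= min_viability]
    if viable:
        return min(viable, key=lambda e: e.cost_of_transport)
    return max(estimates, key=lambda e: e.viability)
```

For example, given a flat-ground walk with viability 0.55, a stair-climbing gait with viability 0.97 and cost 0.9, and a careful trot with viability 0.93 and cost 0.7, this rule selects the careful trot: it is safe enough and cheaper than the stair-climbing gait.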
Predicting Barge Tow Size on Inland Waterways Using Vessel Trajectory Derived Features: Proof of Concept
Agorku, Geoffery, Hernandez, Sarah, Hames, Hayley, Wagner, Cade
Accurate, real-time estimation of barge quantity on inland waterways remains a critical challenge due to the non-self-propelled nature of barges and the limitations of existing monitoring systems. This study introduces a novel method that uses Automatic Identification System (AIS) vessel tracking data to predict the number of barges in tow using Machine Learning (ML). To train and test the model, barge instances were manually annotated from satellite scenes across the Lower Mississippi River. Labeled images were matched to AIS vessel tracks using a spatiotemporal matching procedure. A comprehensive set of 30 AIS-derived features capturing vessel geometry, dynamic movement, and trajectory patterns was created and evaluated using Recursive Feature Elimination (RFE) to identify the most predictive variables. Six regression models, including ensemble, kernel-based, and generalized linear approaches, were trained and evaluated. The Poisson Regressor model yielded the best performance, achieving a Mean Absolute Error (MAE) of 1.92 barges using 12 of the 30 features. The feature importance analysis revealed that metrics capturing vessel maneuverability, such as course entropy, speed variability, and trip length, were most predictive of barge count. The proposed approach provides a scalable, readily implementable method for enhancing Maritime Domain Awareness (MDA), with strong potential applications in lock scheduling, port management, and freight planning. Future work will expand the proof of concept presented here to explore model transferability to other inland rivers with differing operational and environmental conditions.
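As a rough illustration of two of the maneuverability features named above and of the reported error metric, the sketch below computes a course entropy, a speed variability, and an MAE from plain Python lists. The abstract does not define these features, so the definitions here are assumptions: Shannon entropy over 30-degree heading bins and the population standard deviation of speed.

```python
import math
from collections import Counter
from statistics import pstdev


def course_entropy(courses_deg, bin_width_deg=30):
    """Shannon entropy (in bits) of the heading distribution, using
    fixed-width bins. The bin width is an assumed parameter."""
    bins = Counter(int((c % 360) // bin_width_deg) for c in courses_deg)
    total = len(courses_deg)
    return sum(-(n / total) * math.log2(n / total) for n in bins.values())


def speed_variability(speeds_knots):
    """Population standard deviation of reported speeds (assumed definition)."""
    return pstdev(speeds_knots)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between true and predicted barge counts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

A vessel holding a steady heading has a course entropy of 0 bits, while one whose headings spread evenly over four bins scores 2 bits. An MAE of 1.92, as reported above, means predictions are off by just under two barges on average.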