Fès-Meknès Region
MILP-SAT-GNN: Yet Another Neural SAT Solver
Cardillo, Franco Alberto, Khyari, Hamza, Straccia, Umberto
We propose a novel method that enables Graph Neural Networks (GNNs) to solve SAT problems by leveraging a technique developed for applying GNNs to Mixed Integer Linear Programming (MILP). Specifically, k-CNF formulae are mapped into MILP problems, which are then encoded as weighted bipartite graphs and subsequently fed into a GNN for training and testing. From a theoretical perspective: (i) we establish permutation and equivalence invariance results, demonstrating that the method produces outputs that are stable under reordering of clauses and variables; (ii) we identify a theoretical limitation, showing that for a class of formulae called foldable formulae, standard GNNs cannot always distinguish satisfiable from unsatisfiable instances; (iii) we prove a universal approximation theorem, establishing that with Random Node Initialization (RNI), the method can approximate SAT solving to arbitrary precision on finite datasets; that is, the GNN becomes approximately sound and complete on such datasets. Furthermore, we show that for unfoldable formulae, the same approximation guarantee can be achieved without the need for RNI. Finally, we conduct an experimental evaluation of our approach, which shows that, despite the simplicity of the neural architecture, the method achieves promising results.
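The abstract does not spell out the encoding, so the following is only a minimal sketch of one textbook way to turn CNF clauses into 0/1 linear constraints and then into a weighted bipartite graph. It assumes DIMACS-style integer literals and the standard rewriting of a clause such as (x1 OR NOT x2) into x1 + (1 - x2) >= 1; the function names and the edge-list representation are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Tuple

Clause = List[int]  # DIMACS-style literals: 3 means x3, -3 means NOT x3


def clause_to_constraint(clause: Clause) -> Tuple[Dict[int, int], int]:
    """Return (coeffs, rhs) encoding the clause as sum(coeffs[v] * x_v) >= rhs.

    A positive literal x contributes x, a negative literal contributes (1 - x).
    Moving the constants to the right-hand side gives rhs = 1 - (#negative literals).
    """
    coeffs: Dict[int, int] = {}
    negatives = 0
    for lit in clause:
        var = abs(lit)
        if lit > 0:
            coeffs[var] = coeffs.get(var, 0) + 1
        else:
            coeffs[var] = coeffs.get(var, 0) - 1
            negatives += 1
    return coeffs, 1 - negatives


def cnf_to_bipartite(clauses: List[Clause]) -> Tuple[List[Tuple[int, int, int]], List[int]]:
    """Build a weighted bipartite graph: constraint nodes on one side, variables on the other.

    Each edge is (constraint_index, variable, weight), where the weight is the
    variable's coefficient in that constraint. The right-hand sides are returned
    separately and could serve as constraint-node features.
    """
    edges: List[Tuple[int, int, int]] = []
    rhs: List[int] = []
    for i, clause in enumerate(clauses):
        coeffs, b = clause_to_constraint(clause)
        rhs.append(b)
        for var, weight in sorted(coeffs.items()):
            if weight != 0:  # x OR NOT x cancels out and needs no edge
                edges.append((i, var, weight))
    return edges, rhs


def satisfies(assignment: Dict[int, int], clauses: List[Clause]) -> bool:
    """Check a 0/1 assignment against the linear constraints (not the clauses directly)."""
    for clause in clauses:
        coeffs, b = clause_to_constraint(clause)
        if sum(w * assignment[v] for v, w in coeffs.items()) < b:
            return False
    return True


# (x1 OR NOT x2) AND (x2 OR x3)
formula = [[1, -2], [2, 3]]
edges, rhs = cnf_to_bipartite(formula)
```

In this sketch a 0/1 assignment satisfies the formula exactly when it satisfies every linear constraint, which is what allows a SAT instance to be handed to machinery designed for MILP graphs.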
ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content
Chandna, Bhavik, Aboujenane, Mariam, Naseem, Usman
Large Multimodal Models (LMMs) are increasingly vulnerable to AI-generated extremist content, including photorealistic images and text, which can be used to bypass safety mechanisms and generate harmful outputs. However, existing datasets for evaluating LMM robustness offer limited exploration of extremist content, often lacking AI-generated images, diverse image generation models, and comprehensive coverage of historical events, which hinders a complete assessment of model vulnerabilities. To fill this gap, we introduce ExtremeAIGC, a benchmark dataset and evaluation framework designed to assess LMM vulnerabilities against such content. ExtremeAIGC simulates real-world events and malicious use cases by curating diverse text- and image-based examples crafted using state-of-the-art image generation techniques. Our study reveals alarming weaknesses in LMMs, demonstrating that even cutting-edge safety measures fail to prevent the generation of extremist material. We systematically quantify the success rates of various attack strategies, exposing critical gaps in current defenses and emphasizing the need for more robust mitigation strategies.
Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Alwajih, Fakhraddin, Mekki, Abdellah El, Magdy, Samar Mohamed, Elmadany, Abdelrahim A., Nacar, Omer, Nagoudi, El Moatez Billah, Abdel-Salam, Reem, Atwany, Hanin, Nafea, Youssef, Yahya, Abdulfattah Mohammed, Alhamouri, Rahaf, Alsayadi, Hamzah A., Zayed, Hiba, Shatnawi, Sara, Sibaee, Serry, Ech-Chammakhy, Yasir, Al-Dhabyani, Walid, Ali, Marwa Mohamed, Jarraya, Imen, El-Shangiti, Ahmed Oumar, Alraeesi, Aisha, Al-Ghrawi, Mohammed Anwar, Al-Batati, Abdulrahman S., Mohamed, Elgizouli, Elgindi, Noha Taha, Saeed, Muhammed, Atou, Houdaifa, Yahia, Issam Ait, Bouayad, Abdelhak, Machrouh, Mohammed, Makouar, Amal, Alkawi, Dania, Mohamed, Mukhtar, Abdelfadil, Safaa Taher, Ounnoughene, Amine Ziad, Anfel, Rouabhia, Assi, Rwaa, Sorkatti, Ahmed, Tourad, Mohamedou Cheikh, Koubaa, Anis, Berrada, Ismail, Jarrar, Mustafa, Shehata, Shady, Abdul-Mageed, Muhammad
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce our dataset, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.
Towards Precision in Bolted Joint Design: A Preliminary Machine Learning-Based Parameter Prediction
Boujnah, Ines, Afifi, Nehal, Wettstein, Andreas, Matthiesen, Sven
Bolted joints are critical in engineering for maintaining structural integrity and reliability. Accurate prediction of parameters influencing their function and behavior is essential for optimal performance. Traditional methods often fail to capture the non-linear behavior of bolted joints or require significant computational resources, limiting accuracy and efficiency. This study addresses these limitations by combining empirical data with a feed-forward neural network to predict load capacity and friction coefficients. Leveraging experimental data and systematic preprocessing, including rescaling of the output variables to address scale discrepancies, the model effectively captures non-linear relationships and achieves 95.24% predictive accuracy. While limited dataset size and diversity restrict generalizability, the findings demonstrate the potential of neural networks as a reliable, efficient alternative for bolted joint design. Future work will focus on expanding datasets and exploring hybrid modeling techniques to enhance applicability.
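The abstract mentions rescaling output variables whose scales differ, but does not say which scheme was used. As a rough illustration only, the sketch below assumes min-max scaling of each target to [0, 1] and shows the inverse mapping needed to read predictions back in physical units. The numeric values and function names are hypothetical and are not taken from the study.

```python
from typing import List, Tuple


def fit_minmax(values: List[float]) -> Tuple[float, float]:
    """Return the (min, max) of a target column, used as scaling parameters."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant column cannot be min-max scaled")
    return lo, hi


def scale(values: List[float], lo: float, hi: float) -> List[float]:
    """Map values linearly so that lo -> 0.0 and hi -> 1.0."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(scaled: List[float], lo: float, hi: float) -> List[float]:
    """Invert scale(): map network outputs back to physical units."""
    return [s * (hi - lo) + lo for s in scaled]


# Hypothetical targets on very different scales.
load_capacity = [10000.0, 20000.0, 30000.0]   # e.g. newtons
friction_coeff = [0.10, 0.15, 0.20]           # dimensionless

load_lo, load_hi = fit_minmax(load_capacity)
fric_lo, fric_hi = fit_minmax(friction_coeff)

load_scaled = scale(load_capacity, load_lo, load_hi)
fric_scaled = scale(friction_coeff, fric_lo, fric_hi)
```

After scaling, both targets lie in the same range, so neither dominates a shared regression loss purely because of its units.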
Enhancing Table Representations with LLM-powered Synthetic Data Generation
Yang, Dayu, Monaikul, Natawut, Ding, Amanda, Tan, Bozhao, Mosaliganti, Kishore, Iyengar, Giri
In the era of data-driven decision-making, accurate table-level representations and efficient table recommendation systems are becoming increasingly crucial for improving table management, discovery, and analysis. However, existing approaches to tabular data representation often face limitations, primarily due to their focus on cell-level tasks and the lack of high-quality training data. To address these challenges, we first formulate a clear definition of table similarity in the context of data transformation activities within data-driven enterprises. This definition serves as the foundation for synthetic data generation, which requires a well-defined data generation process. Building on this, we propose a novel synthetic data generation pipeline that harnesses the code generation and data manipulation capabilities of Large Language Models (LLMs) to create a large-scale synthetic dataset tailored for table-level representation learning. Through manual validation and performance comparisons on the table recommendation task, we demonstrate that the synthetic data generated by our pipeline aligns with our proposed definition of table similarity and significantly enhances table representations, leading to improved recommendation performance.
A Systematic Review of NLP for Dementia - Tasks, Datasets and Opportunities
Peled-Cohen, Lotem, Reichart, Roi
The close link between cognitive decline and language has fostered long-standing collaboration between the NLP and medical communities in dementia research. To examine this, we reviewed over 200 papers applying NLP to dementia-related efforts, drawing from medical, technological, and NLP-focused literature. We identify key research areas, including dementia detection, linguistic biomarker extraction, caregiver support, and patient assistance, showing that half of all papers focus solely on dementia detection using clinical data. However, many directions remain unexplored: artificially degraded language models, synthetic data, digital twins, and more. We highlight gaps and opportunities around trust, scientific rigor, applicability, and cross-community collaboration, and showcase the diverse datasets encountered throughout our review: recorded, written, structured, spontaneous, synthetic, clinical, social-media-based, and more. This review aims to inspire more creative approaches to dementia research within the medical and NLP communities.
DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification
Mahdaouy, Abdelkader El, Lamsiyah, Salima, Idrissi, Meryem Janati, Alami, Hamza, Yartaoui, Zakaria, Berrada, Ismail
Detecting and classifying suspicious or malicious domain names and URLs is a fundamental task in cybersecurity. To leverage such indicators of compromise, cybersecurity vendors and practitioners often maintain and update blacklists of known malicious domains and URLs. However, blacklists frequently fail to identify emerging and obfuscated threats. Over the past few decades, there has been significant interest in developing machine learning models that automatically detect malicious domains and URLs, addressing the limitations of blacklist maintenance and updates. In this paper, we introduce DomURLs_BERT, a pre-trained BERT-based encoder adapted for detecting and classifying suspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the Masked Language Modeling (MLM) objective on a large multilingual corpus of URLs, domain names, and Domain Generation Algorithm (DGA) datasets. To assess the performance of DomURLs_BERT, we have conducted experiments on several binary and multi-class classification tasks involving domain names and URLs, covering phishing, malware, DGA, and DNS tunneling. The evaluation results show that the proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets. The pre-training dataset, the pre-trained DomURLs_BERT encoder, and the experiments' source code are publicly available.
Cryptocurrency Price Forecasting Using XGBoost Regressor and Technical Indicators
Hafid, Abdelatif, Ebrahim, Maad, Alfatemi, Ali, Rahouti, Mohamed, Oliveira, Diogo
The rapid growth of the stock market has attracted many investors due to its potential for significant profits. However, predicting stock prices accurately is difficult because financial markets are complex and constantly changing. This is especially true for the cryptocurrency market, which is known for its extreme volatility, making it challenging for traders and investors to make wise and profitable decisions. This study introduces a machine learning approach to predict cryptocurrency prices. Specifically, we use important technical indicators such as the Exponential Moving Average (EMA) and Moving Average Convergence Divergence (MACD) as inputs to train the XGBoost regressor model. We demonstrate our approach through an analysis focusing on the closing prices of the Bitcoin cryptocurrency. We evaluate the model's performance through various simulations, showing promising results that suggest its usefulness in aiding/guiding cryptocurrency traders and investors in dynamic market conditions. Over the past few years, the rapid expansion of the stock market has made it an appealing option for investors seeking high returns and easy access.
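To make the two indicators named above concrete, here is a small sketch of how EMA and MACD are commonly computed from a series of closing prices. It assumes the usual smoothing factor alpha = 2 / (span + 1), seeds the EMA with the first price (other conventions seed with a simple average), and uses the conventional 12/26/9 MACD spans. None of these settings is stated in the abstract, and the helper names are illustrative.

```python
from typing import List, Tuple


def ema(prices: List[float], span: int) -> List[float]:
    """Exponential moving average with alpha = 2 / (span + 1).

    Assumption: the series is seeded with the first price.
    """
    if not prices:
        return []
    alpha = 2.0 / (span + 1)
    out = [prices[0]]
    for price in prices[1:]:
        out.append(alpha * price + (1.0 - alpha) * out[-1])
    return out


def macd(
    prices: List[float], fast: int = 12, slow: int = 26, signal: int = 9
) -> Tuple[List[float], List[float]]:
    """Return (macd_line, signal_line).

    macd_line is the fast EMA minus the slow EMA;
    signal_line is an EMA of the macd_line itself.
    """
    fast_ema = ema(prices, fast)
    slow_ema = ema(prices, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    return macd_line, signal_line


# Hypothetical closing prices, for illustration only.
closes = [100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0]
ema_3 = ema(closes, span=3)
macd_line, signal_line = macd(closes)
```

Each indicator yields one value per time step, so the resulting columns can be stacked alongside the raw closing prices as a feature table for a regressor.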
Contingency Analysis of a Grid of Connected EVs for Primary Frequency Control of an Industrial Microgrid Using Efficient Control Scheme
Sabhahit, J. N., Solanke, S. S., Jadoun, V. K., Malik, H., Márquez, F. P. García, Pinar-Pérez, J. M.
After over a century of internal combustion engines ruling the transport sector, electric vehicles appear to be on the verge of gaining traction due to a slew of advantages, including lower operating costs and lower CO2 emissions. By using the Vehicle-to-Grid approach (or Grid-to-Vehicle when electric vehicles (EVs) are utilized as a load), EVs can operate as both a load and a source. Primary frequency regulation and congestion management are two essential capabilities that this technology adds to an industrial microgrid. Industrial microgrids are made up of different energy sources, such as wind farms and PV farms, together with storage systems and loads. EVs have gained a lot of interest as a technique for frequency management because of their ability to regulate quickly. Grid reliability depends on this quick reaction. Different contingencies, states of charge of the electric vehicles, and a varying number of EVs in an EV fleet are considered in this work, and a control scheme for frequency management is proposed. This control scheme enables bidirectional power flow, allowing for primary frequency regulation during the various scenarios that an industrial microgrid may encounter over the course of a 24-h period. The presented controller provides dependable frequency regulation support to the industrial microgrid during contingencies, as demonstrated by simulation results, achieving a more reliable system. Moreover, simulation results show that by increasing the number of EVs in a fleet for the Vehicle-to-Grid approach, an industrial microgrid's frequency can be enhanced even further.
Skin Cancer Segmentation and Classification Using Vision Transformer for Automatic Analysis in Dermatoscopy-based Non-invasive Digital System
Himel, Galib Muhammad Shahriar, Islam, Md. Masudul, Al-Aff, Kh Abdullah, Karim, Shams Ibne, Sikder, Md. Kabir Uddin
The development of cancer is triggered by alterations and mutations in the DNA. The majority of DNA changes responsible for cancer occur within specific regions known as genes. Among the various types of cancer, skin cancer ranks among the top five. If we disregard breast and prostate cancer, which are gender-dependent, skin cancer is the third largest cancer category. Based on the statistics released by the American Cancer Society (ACS) [1], there were 58,120 recorded cases of skin cancer among males and 39,490 cases among females. An intriguing observation is that the incidence of skin cancer rose steadily from 1992 to 2019, with a notable exception in 2020 [2]. This exception can be attributed to the understandable decrease in cases during the COVID-19 pandemic, as people were mostly confined to their homes. This decline is reasonable considering that exposure to ultraviolet (UV) radiation is a significant contributing factor to the development of skin cancer. More people are diagnosed with skin cancer each year in the U.S. than all other cancers combined [3].