AITopics

2503.23927

Country: Europe (1.00)

Genre: Research Report > Experimental Study (0.93)

Industry:

Health & Medicine (0.67)
Government (0.46)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Machine LearningApr-2-2025

Identifying Obfuscated Code through Graph-Based Semantic Analysis of Binary Code

Cohen, Roxane, David, Robin, Yger, Florian, Rossi, Fabrice

Protecting sensitive program content is a critical issue in various situations, ranging from legitimate use cases to unethical contexts. Obfuscation is one of the most used techniques to ensure such protection. Consequently, attackers must first detect and characterize obfuscation before launching any attack against it. This paper investigates the problem of function-level obfuscation detection using graph-based approaches, comparing algorithms, from elementary baselines to promising techniques like GNN (Graph Neural Networks), on different feature choices. We consider various obfuscation types and obfuscators, resulting in two complex datasets. Our findings demonstrate that GNNs need meaningful features that capture aspects of function semantics to outperform baselines. Our approach shows satisfactory results, especially in a challenging 11-class classification task and in a practical malware analysis example.

data mining, machine learning, obfuscation, (17 more...)

2504.01481

Country:

North America > United States (0.46)
Europe > France (0.29)

Genre: Research Report > New Finding (0.86)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.89)

arXiv.org Machine LearningMar-31-2025

Spatio-temporal Prediction of Fine-Grained Origin-Destination Matrices with Applications in Ridesharing

Yang, Run, Dai, Runpeng, Gao, Siran, Tang, Xiaocheng, Zhou, Fan, Zhu, Hongtu

Accurate spatial-temporal prediction of network-based travelers' requests is crucial for the effective policy design of ridesharing platforms. Having knowledge of the total demand between various locations in the upcoming time slots enables platforms to proactively prepare adequate supplies, thereby increasing the likelihood of fulfilling travelers' requests and redistributing idle drivers to areas with high potential demand to optimize the global supply-demand equilibrium. This paper delves into the prediction of Origin-Destination (OD) demands at a fine-grained spatial level, especially when confronted with an expansive set of local regions. While this task holds immense practical value, it remains relatively unexplored within the research community. To fill this gap, we introduce a novel prediction model called OD-CED, which comprises an unsupervised space coarsening technique to alleviate data sparsity and an encoder-decoder architecture to capture both semantic and geographic dependencies. Through practical experimentation, OD-CED has demonstrated remarkable results. It achieved an impressive reduction of up to 45% reduction in root-mean-square error and 60% in weighted mean absolute percentage error over traditional statistical methods when dealing with OD matrices exhibiting a sparsity exceeding 90%.

artificial intelligence, data mining, machine learning, (22 more...)

2503.24237

Country:

North America > United States (0.46)
Asia > China (0.28)

Genre: Research Report (1.00)

Industry:

Transportation > Infrastructure & Services (0.48)
Transportation > Ground > Road (0.34)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
(3 more...)

Vödisch, Niclas, Dodel, David, Schötz, Michael

FSOCO: The Formula Student Objects in Context Dataset

arXiv.org Artificial IntelligenceMar-31-2025

This paper presents the FSOCO dataset, a collaborative dataset for vision-based cone detection systems in Formula Student Driverless competitions. It contains human annotated ground truth labels for both bounding boxes and instance-wise segmentation masks. The data buy-in philosophy of FSOCO asks student teams to contribute to the database first before being granted access ensuring continuous growth. By providing clear labeling guidelines and tools for a sophisticated raw image selection, new annotations are guaranteed to meet the desired quality. The effectiveness of the approach is shown by comparing prediction results of a network trained on FSOCO and its unregulated predecessor. The FSOCO dataset can be found at https://fsoco.github.io/fsoco-dataset/.

artificial intelligence, dataset, machine learning, (13 more...)

doi: 10.4271/12-05-01-0003

2012.07139

Country:

Europe (1.00)
North America > United States (0.47)
North America > Canada (0.29)

Genre: Research Report (0.40)

Industry: Automobiles & Trucks (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Data Science (0.93)
(2 more...)

arXiv.org Artificial IntelligenceMar-30-2025

Beyond Unimodal Boundaries: Generative Recommendation with Multimodal Semantics

Zhu, Jing, Ju, Mingxuan, Liu, Yozen, Koutra, Danai, Shah, Neil, Zhao, Tong

Generative recommendation (GR) has become a powerful paradigm in recommendation systems that implicitly links modality and semantics to item representation, in contrast to previous methods that relied on non-semantic item identifiers in autoregressive models. However, previous research has predominantly treated modalities in isolation, typically assuming item content is unimodal (usually text). We argue that this is a significant limitation given the rich, multimodal nature of real-world data and the potential sensitivity of GR models to modality choices and usage. Our work aims to explore the critical problem of Multimodal Generative Recommendation (MGR), highlighting the importance of modality choices in GR nframeworks. We reveal that GR models are particularly sensitive to different modalities and examine the challenges in achieving effective GR when multiple modalities are available. By evaluating design strategies for effectively leveraging multiple modalities, we identify key challenges and introduce MGR-LF++, an enhanced late fusion framework that employs contrastive modality alignment and special tokens to denote different modalities, achieving a performance improvement of over 20% compared to single-modality alternatives.

large language model, machine learning, natural language, (18 more...)

2503.23333

Country: North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Data Science (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.67)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

Almeida, Tiago, Moreno, Plinio, Barata, Catarina

Prediction of 30-day hospital readmission with clinical notes and EHR information

High hospital readmission rates are associated with significant costs and health risks for patients. Therefore, it is critical to develop predictive models that can support clinicians to determine whether or not a patient will return to the hospital in a relatively short period of time (e.g, 30-days). Nowadays, it is possible to collect both structured (electronic health records - EHR) and unstructured information (clinical notes) about a patient hospital event, all potentially containing relevant information for a predictive model. However, their integration is challenging. In this work we explore the combination of clinical notes and EHRs to predict 30-day hospital readmissions. We address the representation of the various types of information available in the EHR data, as well as exploring LLMs to characterize the clinical notes. We collect both information sources as the nodes of a graph neural network (GNN). Our model achieves an AUROC of 0.72 and a balanced accuracy of 66.7\%, highlighting the importance of combining the multimodal information.

data mining, machine learning, natural language, (20 more...)

2503.2305

Country: North America > United States (0.93)

Genre: Research Report > New Finding (0.95)

Industry:

Health & Medicine > Health Care Technology > Medical Record (1.00)
Health & Medicine > Health Care Providers & Services (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Data Science > Data Mining (0.86)

Learning Structure-enhanced Temporal Point Processes with Gromov-Wasserstein Regularization

Wang, Qingmei, Wang, Fanmeng, Su, Bing, Xu, Hongteng

Real-world event sequences are often generated by different temporal point processes (TPPs) and thus have clustering structures. Nonetheless, in the modeling and prediction of event sequences, most existing TPPs ignore the inherent clustering structures of the event sequences, leading to the models with unsatisfactory interpretability. In this study, we learn structure-enhanced TPPs with the help of Gromov-Wasserstein (GW) regularization, which imposes clustering structures on the sequence-level embeddings of the TPPs in the maximum likelihood estimation framework.In the training phase, the proposed method leverages a nonparametric TPP kernel to regularize the similarity matrix derived based on the sequence embeddings. In large-scale applications, we sample the kernel matrix and implement the regularization as a Gromov-Wasserstein (GW) discrepancy term, which achieves a trade-off between regularity and computational efficiency.The TPPs learned through this method result in clustered sequence embeddings and demonstrate competitive predictive and clustering performance, significantly improving the model interpretability without compromising prediction accuracy.

artificial intelligence, data mining, machine learning, (15 more...)

2503.23002

Country:

Asia > China (0.51)
Europe (0.46)

Genre: Research Report > New Finding (0.35)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.55)

AuditVotes: A Framework Towards More Deployable Certified Robustness for Graph Neural Networks

Lai, Yuni, Zhu, Yulin, Sun, Yixuan, Wu, Yulun, Xiao, Bin, Li, Gaolei, Li, Jianhua, Zhou, Kai

Despite advancements in Graph Neural Networks (GNNs), adaptive attacks continue to challenge their robustness. Certified robustness based on randomized smoothing has emerged as a promising solution, offering provable guarantees that a model's predictions remain stable under adversarial perturbations within a specified range. However, existing methods face a critical trade-off between accuracy and robustness, as achieving stronger robustness requires introducing greater noise into the input graph. This excessive randomization degrades data quality and disrupts prediction consistency, limiting the practical deployment of certifiably robust GNNs in real-world scenarios where both accuracy and robustness are essential. To address this challenge, we propose \textbf{AuditVotes}, the first framework to achieve both high clean accuracy and certifiably robust accuracy for GNNs. It integrates randomized smoothing with two key components, \underline{au}gmentation and con\underline{dit}ional smoothing, aiming to improve data quality and prediction consistency. The augmentation, acting as a pre-processing step, de-noises the randomized graph, significantly improving data quality and clean accuracy. The conditional smoothing, serving as a post-processing step, employs a filtering function to selectively count votes, thereby filtering low-quality predictions and improving voting consistency. Extensive experimental results demonstrate that AuditVotes significantly enhances clean accuracy, certified robustness, and empirical robustness while maintaining high computational efficiency. Notably, compared to baseline randomized smoothing, AuditVotes improves clean accuracy by $437.1\%$ and certified accuracy by $409.3\%$ when the attacker can arbitrarily insert $20$ edges on the Cora-ML datasets, representing a substantial step toward deploying certifiably robust GNNs in real-world applications.

accuracy, artificial intelligence, machine learning, (17 more...)

2503.22998

Country:

Asia (0.68)
North America > United States > New York (0.28)

Genre: Research Report > New Finding (0.87)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

TRACE: Intra-visit Clinical Event Nowcasting via Effective Patient Trajectory Encoding

Liang, Yuyang, Chen, Yankai, Fang, Yixiang, Lakshmanan, Laks V. S., Ma, Chenhao

Electronic Health Records (EHR) have become a valuable resource for a wide range of predictive tasks in healthcare. However, existing approaches have largely focused on inter-visit event predictions, overlooking the importance of intra-visit nowcasting, which provides prompt clinical insights during an ongoing patient visit. To address this gap, we introduce the task of laboratory measurement prediction within a hospital visit. We study the laboratory data that, however, remained underexplored in previous work. We propose TRACE, a Transformer-based model designed for clinical event nowcasting by encoding patient trajectories. TRACE effectively handles long sequences and captures temporal dependencies through a novel timestamp embedding that integrates decay properties and periodic patterns of data. Additionally, we introduce a smoothed mask for denoising, improving the robustness of the model. Experiments on two large-scale electronic health record datasets demonstrate that the proposed model significantly outperforms previous methods, highlighting its potential for improving patient care through more accurate laboratory measurement nowcasting. The code is available at https://github.com/Amehi/TRACE.

machine learning, natural language, prediction, (16 more...)

doi: 10.1145/3701716.3715545

2503.23072

Country:

North America (1.00)
Asia > China > Guangdong Province (0.48)

Genre:

Overview (0.93)
Research Report (0.64)

Industry: Health & Medicine > Health Care Technology > Medical Record (0.96)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Data Science (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Bhattacharjee, Robi, Frohnapfel, Karolin, von Luxburg, Ulrike

How to safely discard features based on aggregate SHAP values

arXiv.org Machine LearningMar-29-2025

SHAP is one of the most popular local feature-attribution methods. Given a function f and an input x, it quantifies each feature's contribution to f(x). Recently, SHAP has been increasingly used for global insights: practitioners average the absolute SHAP values over many data points to compute global feature importance scores, which are then used to discard unimportant features. In this work, we investigate the soundness of this practice by asking whether small aggregate SHAP values necessarily imply that the corresponding feature does not affect the function. Unfortunately, the answer is no: even if the i-th SHAP value is 0 on the entire data support, there exist functions that clearly depend on Feature i. The issue is that computing SHAP values involves evaluating f on points outside of the data support, where f can be strategically designed to mask its dependence on Feature i. To address this, we propose to aggregate SHAP values over the extended support, which is the product of the marginals of the underlying distribution. With this modification, we show that a small aggregate SHAP value implies that we can safely discard the corresponding feature. We then extend our results to KernelSHAP, the most popular method to approximate SHAP values in practice. We show that if KernelSHAP is computed over the extended distribution, a small aggregate value justifies feature removal. This result holds independently of whether KernelSHAP accurately approximates true SHAP values, making it one of the first theoretical results to characterize the KernelSHAP algorithm itself. Our findings have both theoretical and practical implications. We introduce the Shapley Lie algebra, which offers algebraic insights that may enable a deeper investigation of SHAP and we show that randomly permuting each column of the data matrix enables safely discarding features based on aggregate SHAP and KernelSHAP values.

artificial intelligence, machine learning, shap value, (17 more...)

2503.23111

Genre: Research Report > New Finding (0.54)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.46)
Water & Waste Management > Water Management > Water Supplies & Services (0.46)

Technology:

Information Technology > Data Science (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)