AITopics

2405.13005

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > Illinois > Cook County > Maywood (0.04)
North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Rheumatology (1.00)
Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

arXiv.org Artificial Intelligence May-12-2024

Maximizing Information Gain in Privacy-Aware Active Learning of Email Anomalies

Chung, Mu-Huan Miles, Li, Sharon, Kongmanee, Jaturong, Wang, Lu, Yang, Yuhong, Giang, Calvin, Jerath, Khilan, Raman, Abhay, Lie, David, Chignell, Mark

Redacted emails satisfy most privacy requirements, but they make it more difficult to detect anomalous emails that may be indicative of data exfiltration. In this paper we develop an enhanced method of Active Learning using an information gain maximizing heuristic, and we evaluate its effectiveness in a real-world setting where only redacted versions of emails could be labeled by human analysts due to privacy concerns. In the first case study we examined how Active Learning should be carried out. We found that model performance was best when a single highly skilled (in terms of the labeling task) analyst provided the labels. In the second case study we used confidence ratings to estimate the labeling uncertainty of analysts and then prioritized instances for labeling based on the expected information gain (the difference between model uncertainty and analyst uncertainty) that would be provided by labeling each instance. We found that the information gain maximizing heuristic improved model performance over existing sampling methods for Active Learning. Based on the results obtained, we recommend that analysts be screened, and possibly trained, prior to implementation of Active Learning in cybersecurity applications. We also recommend that the information gain maximizing sampling method (based on expert confidence) be used in early stages of Active Learning, provided that well-calibrated confidence can be obtained. We also note that the expertise of analysts should be assessed prior to Active Learning, as we found that analysts with lower labeling skill had poorly calibrated (over-)confidence in their labels.
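The prioritization rule described in the abstract (expected information gain as model uncertainty minus analyst uncertainty) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: both uncertainties are measured as binary Shannon entropy, the analyst confidence is treated as an estimated probability that the label is correct, and the function names and sample values are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(model_prob: float, analyst_confidence: float) -> float:
    # Assumption: model uncertainty is the entropy of the predicted anomaly
    # probability; analyst uncertainty is the entropy of the estimated
    # probability that the analyst's label is correct.
    return binary_entropy(model_prob) - binary_entropy(analyst_confidence)


def rank_for_labeling(candidates):
    """Sort (id, model_prob, analyst_confidence) tuples, highest gain first."""
    return sorted(
        candidates,
        key=lambda c: expected_information_gain(c[1], c[2]),
        reverse=True,
    )


# Hypothetical pool of unlabeled emails.
candidates = [
    ("email_a", 0.50, 0.95),  # model unsure, analyst confident
    ("email_b", 0.50, 0.60),  # model unsure, analyst also unsure
    ("email_c", 0.90, 0.95),  # model already fairly sure
]
ranking = [c[0] for c in rank_for_labeling(candidates)]
```

Under these assumptions, instances where the model is uncertain but the analyst is expected to be confident are labeled first, which is the intuition behind the heuristic.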

acm trans, participant, publication date, (13 more...)

2405.0744

Country:

North America > Canada > Ontario > Toronto (0.14)
South America > Paraguay > Asunción > Asunción (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study > Negative Result (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Pieroni, Riccardo, Specchia, Simone, Corno, Matteo, Savaresi, Sergio Matteo

Multi-Object Tracking with Camera-LiDAR Fusion for Autonomous Driving

arXiv.org Artificial Intelligence May-12-2024

This paper presents a novel multi-modal Multi-Object Tracking (MOT) algorithm for self-driving cars that combines camera and LiDAR data. Camera frames are processed with a state-of-the-art 3D object detector, whereas classical clustering techniques are used to process LiDAR observations. The proposed MOT algorithm comprises a three-step association process, an Extended Kalman filter for estimating the motion of each detected dynamic obstacle, and a track management phase. The EKF motion model requires the current measured relative position and orientation of the observed object and the longitudinal and angular velocities of the ego vehicle as inputs. Unlike most state-of-the-art multi-modal MOT approaches, the proposed algorithm does not rely on maps or knowledge of the ego global pose. Moreover, it uses a 3D detector exclusively for cameras and is agnostic to the type of LiDAR sensor used. The algorithm is validated both in simulation and with real-world data, with satisfactory results.
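To illustrate why a tracker that works only in the ego frame needs the ego vehicle's longitudinal and angular velocities, here is a minimal Python sketch of ego-motion compensation for a static object. It is not the paper's EKF: the function name, the frame convention (x forward, y left), and the translate-then-rotate approximation over one time step are all assumptions made for illustration.

```python
import math


def predict_relative_position(x, y, v_ego, yaw_rate, dt):
    """Predict where a static object appears in the ego frame after dt seconds.

    Assumed convention: x points forward, y points left, yaw is positive
    counter-clockwise. The ego vehicle is approximated as moving v_ego * dt
    along its x axis and then rotating by yaw_rate * dt.
    """
    # Shift the object backwards by the distance the ego vehicle travelled.
    dx = x - v_ego * dt
    dy = y
    # Rotate by the negative of the ego heading change.
    dtheta = yaw_rate * dt
    c, s = math.cos(dtheta), math.sin(dtheta)
    x_new = c * dx + s * dy
    y_new = -s * dx + c * dy
    return x_new, y_new


# Object 10 m ahead, ego driving straight at 2 m/s for 0.5 s.
straight = predict_relative_position(10.0, 0.0, 2.0, 0.0, 0.5)
# Object 1 m ahead, ego turning left by 90 degrees on the spot.
turned = predict_relative_position(1.0, 0.0, 0.0, math.pi / 2, 1.0)
```

In the first case the object moves 1 m closer; in the second it ends up on the right-hand side of the vehicle, so no map or global pose is needed for this step.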

algorithm, cam, vehicle, (13 more...)

2403.04112

Country:

North America > United States > Tennessee > Davidson County > Nashville (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Netherlands > South Holland > The Hague (0.04)

Genre: Research Report (0.50)

Industry:

Automobiles & Trucks (0.85)
Transportation > Ground > Road (0.71)
Information Technology > Robotics & Automation (0.71)

Technology:

Information Technology > Sensing and Signal Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Meymandi, Arash Rasti, Hosseini, Zahra, Davari, Sina, Moshiri, Abolfazl, Rahimi-Golkhandan, Shabnam, Namdar, Khashayar, Feizi, Nikta, Tavakoli-Targhi, Mohamad, Khalvati, Farzad

Opportunities for Persian Digital Humanities Research with Artificial Intelligence Language Models; Case Study: Forough Farrokhzad

arXiv.org Artificial Intelligence May-10-2024

This study explores the integration of advanced Natural Language Processing (NLP) and Artificial Intelligence (AI) techniques to analyze and interpret Persian literature, focusing on the poetry of Forough Farrokhzad. Utilizing computational methods, we aim to unveil thematic, stylistic, and linguistic patterns in Persian poetry. Specifically, the study employs AI models including transformer-based language models for clustering of the poems in an unsupervised framework. This research underscores the potential of AI in enhancing our understanding of Persian literary heritage, with Forough Farrokhzad's work providing a comprehensive case study. This approach not only contributes to the field of Persian Digital Humanities but also sets a precedent for future research in Persian literary studies using computational techniques.

farrokhzad, poem, poetry, (14 more...)

2405.0676

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > India (0.06)
Asia > Middle East > Iran > Tehran Province > Tehran (0.05)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Taloma, Redemptor Jr Laceda, Pisani, Patrizio, Comminiello, Danilo

Concrete Dense Network for Long-Sequence Time Series Clustering

Time series clustering is fundamental in data analysis for discovering temporal patterns. Despite recent advancements, learning cluster-friendly representations is still challenging, particularly with long and complex time series. Deep temporal clustering methods have tried to integrate the canonical k-means into end-to-end training of neural networks, but they fall back on surrogate losses due to the non-differentiability of the hard cluster assignment, yielding sub-optimal solutions. In addition, the autoregressive strategy used in state-of-the-art RNNs is subject to error accumulation and slow training, while recent research findings have revealed that Transformers are less effective because time points lack semantic meaning, because the permutation invariance of attention discards chronological order, and because of their high computation cost. In light of these observations, we present LoSTer, a novel dense autoencoder architecture for the long-sequence time series clustering (LSTC) problem that is capable of optimizing the k-means objective via the Gumbel-softmax reparameterization trick and is designed specifically for accurate and fast clustering of long time series. Extensive experiments on numerous benchmark datasets and two real-world applications demonstrate the effectiveness of LoSTer over state-of-the-art RNN- and Transformer-based deep clustering methods.
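The Gumbel-softmax trick mentioned above replaces the non-differentiable hard cluster assignment with a relaxed, sampled one. The following Python sketch shows the general idea under stated assumptions: logits are taken to be negative squared distances to the centroids, the function names are hypothetical, and plain Python is used instead of an autodiff framework, so no gradients are computed. It is not LoSTer's implementation.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed (soft) one-hot sample from the categorical given by logits."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform: g = -log(-log(u)).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def relaxed_kmeans_loss(z, centroids, temperature, rng):
    """Soft assignment of embedding z and the resulting relaxed k-means loss."""
    dists = [squared_distance(z, c) for c in centroids]
    # Assumption: closer centroids get larger logits.
    assign = gumbel_softmax([-d for d in dists], temperature, rng)
    loss = sum(a * d for a, d in zip(assign, dists))
    return assign, loss


rng = random.Random(0)
centroids = [[0.0, 0.0], [5.0, 5.0]]
assign, loss = relaxed_kmeans_loss([0.2, -0.1], centroids, temperature=0.5, rng=rng)
```

As the temperature approaches zero the soft assignment approaches a one-hot vector, so the relaxed loss approaches the usual k-means objective while remaining differentiable in a framework that tracks gradients.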

dataset, loster, time series, (16 more...)

2405.05015

Country:

South America > Ecuador (0.04)
North America > United States > Arizona (0.04)
Europe > Italy > Lazio > Rome (0.04)

Genre:

Research Report > Promising Solution (0.34)
Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Vikrant, Vicky, P, Narodia Parth, Bhattacharjee, Kamalika

Gödel Number based Clustering Algorithm with Decimal First Degree Cellular Automata

In this paper, a decimal first degree cellular automata (FDCA) based clustering algorithm is proposed in which clusters are created based on reachability. Cyclic spaces are created, and configurations that are in the same cycle are treated as the same cluster. Here, real-life data objects are encoded into decimal strings using Gödel number based encoding. The benefit of the scheme is that it reduces the encoded string length while maintaining the properties of the features. Candidate CA rules are identified based on theoretical criteria such as self-replication and information flow. An iterative algorithm is developed to generate the desired number of clusters over three stages. The results of the clustering are evaluated with benchmark clustering metrics such as the Silhouette score, Davies-Bouldin index, Calinski-Harabasz index, and Dunn index. In comparison with existing state-of-the-art clustering algorithms, our proposed algorithm gives better performance.
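For readers unfamiliar with Gödel numbering, the Python sketch below shows the classical scheme: a sequence of positive integers is mapped to a single number by using each value as the exponent of the corresponding prime. This is the textbook construction only; the paper's exact encoding of data objects is not given in the abstract and may differ, and the function names are hypothetical.

```python
def first_primes(n):
    """Return the first n prime numbers by trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def godel_encode(values):
    """Encode positive integers (v1, ..., vn) as 2**v1 * 3**v2 * 5**v3 * ..."""
    number = 1
    for prime, value in zip(first_primes(len(values)), values):
        if value < 1:
            raise ValueError("classical Goedel numbering needs positive integers")
        number *= prime ** value
    return number


def godel_decode(number, length):
    """Recover the exponent of each of the first `length` primes."""
    values = []
    for prime in first_primes(length):
        exponent = 0
        while number % prime == 0:
            number //= prime
            exponent += 1
        values.append(exponent)
    return values


code = godel_encode([1, 2, 3])   # 2**1 * 3**2 * 5**3
decoded = godel_decode(code, 3)
```

Because prime factorization is unique, the encoding is lossless: several feature values collapse into one integer that can still be decoded exactly.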

algorithm, cellular automata, dataset, (12 more...)

2405.04881

Country:

North America > United States > Oregon > Multnomah County > Portland (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > India > Tamil Nadu (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Ntekouli, Mandani, Spanakis, Gerasimos, Waldorp, Lourens, Roefs, Anne

Explaining Clustering of Ecological Momentary Assessment Data Through Temporal and Feature Attention

In the field of psychopathology, Ecological Momentary Assessment (EMA) studies offer rich individual data on psychopathology-relevant variables (e.g., affect, behavior, etc.) in real time. EMA data is collected dynamically and represented as complex multivariate time series (MTS). Such information is crucial for a better understanding of mental disorders at the individual and group levels. More specifically, clustering individuals in EMA data facilitates uncovering and studying the commonalities as well as variations of groups in the population. Nevertheless, since clustering is an unsupervised task and true EMA grouping is not commonly available, the evaluation of clustering is quite challenging. An important aspect of evaluation is clustering explainability. Thus, this paper proposes an attention-based interpretable framework to identify the important time points and variables that play primary roles in distinguishing between clusters. A key part of this study is to examine ways to analyze, summarize, and interpret the attention weights as well as evaluate the patterns underlying the important segments of the data that differentiate across clusters. To evaluate the proposed approach, an EMA dataset of 187 individuals grouped in 3 clusters is used for analyzing the derived attention-based importance attributes. More specifically, this analysis provides the distinct characteristics at the cluster, feature, and individual levels. Such clustering explanations could be beneficial for generalizing existing concepts of mental disorders, discovering new insights, and even enhancing our knowledge at an individual level.

attention weight, cluster2, explanation, (14 more...)

2405.04854

Country:

Europe > Netherlands > Limburg > Maastricht (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
Asia (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.54)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Giakoumoglou, Nikolaos, Stathaki, Tania

A review on discriminative self-supervised learning methods

In the field of computer vision, self-supervised learning has emerged as a method to extract robust features from unlabeled data, where models derive labels autonomously from the data itself, without the need for manual annotation. This paper provides a comprehensive review of discriminative approaches to self-supervised learning within the domain of computer vision, examining their evolution and current status. Through an exploration of various methods including contrastive, self-distillation, knowledge distillation, feature decorrelation, and clustering techniques, we investigate how these approaches leverage the abundance of unlabeled data. Finally, we compare self-supervised learning methods on the standard ImageNet classification benchmark.

learning, representation, self-supervised learning, (15 more...)

2405.04969

Country:

North America > United States > New York (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Overview (1.00)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

arXiv.org Artificial Intelligence May-7-2024

Carbon Filter: Real-time Alert Triage Using Large Scale Clustering and Fast Search

Oliver, Jonathan, Batta, Raghav, Bates, Adam, Inam, Muhammad Adil, Mehta, Shelly, Xia, Shugao

"Alert fatigue" is one of the biggest challenges faced by the Security Operations Center (SOC) today, with analysts spending more than half of their time reviewing false alerts. Endpoint detection products raise alerts by pattern matching on event telemetry against behavioral rules that describe potentially malicious behavior, but they can suffer from high false-positive rates that distract from actual attacks. While alert triage techniques based on data provenance may show promise, these techniques can take over a minute to inspect a single alert, while EDR customers may face tens of millions of alerts per day; the current reality is that these approaches aren't nearly scalable enough for production environments. We present Carbon Filter, a statistical learning based system that dramatically reduces the number of alerts analysts need to manually review. Our approach is based on the observation that false alert triggers can be efficiently identified and separated from suspicious behaviors by examining the process initiation context (e.g., the command line) that launched the responsible process. Through the use of fast-search algorithms for training and inference, our approach scales to millions of alerts per day. By batching queries to the model, we observe a theoretical maximum throughput of 20 million alerts per hour. Based on the analysis of tens of millions of alerts from customer deployments, our solution resulted in a 6-fold improvement in the signal-to-noise ratio without compromising alert triage performance.

artificial intelligence, data mining, machine learning, (15 more...)

2405.04691

Country:

North America > United States > Illinois > Champaign County > Champaign (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Croatia > Zagreb County > Zagreb (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

arXiv.org Artificial Intelligence May-7-2024

Sora Detector: A Unified Hallucination Detection for Large Text-to-Video Models

Chu, Zhixuan, Zhang, Lei, Sun, Yichen, Xue, Siqiao, Wang, Zhibo, Qin, Zhan, Ren, Kui

The rapid advancement in text-to-video (T2V) generative models has enabled the synthesis of high-fidelity video content guided by textual descriptions. Despite this significant progress, these models are often susceptible to hallucination, generating content that contradicts the input text, which poses a challenge to their reliability and practical deployment. To address this critical issue, we introduce SoraDetector, a novel unified framework designed to detect hallucinations across diverse large T2V models, including the cutting-edge Sora model. Our framework is built upon a comprehensive analysis of hallucination phenomena, categorizing them based on their manifestation in the video content. Leveraging state-of-the-art keyframe extraction techniques and multimodal large language models, SoraDetector first evaluates the consistency between the extracted video content summary and the textual prompts, then constructs static and dynamic knowledge graphs (KGs) from frames to detect hallucination both in single frames and across frames. SoraDetector provides a robust and quantifiable measure of consistency and of static and dynamic hallucination. In addition, we have developed the Sora Detector Agent to automate the hallucination detection process and generate a complete video quality report for each input video. Lastly, we present a novel meta-evaluation benchmark, T2VHaluBench, meticulously crafted to facilitate the evaluation of advancements in T2V hallucination detection. Through extensive experiments on videos generated by Sora and other large T2V models, we demonstrate the efficacy of our approach in accurately detecting hallucinations. The code and dataset can be accessed via GitHub.

hallucination, hallucination detection, video, (14 more...)

2405.0418

Country: North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)

Genre:

Overview (0.67)
Research Report (0.64)
Workflow (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.87)