AITopics

2312.08029

Country:

North America > United States > New York (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

arXiv.org Artificial IntelligenceDec-13-2023

PECANN: Parallel Efficient Clustering with Graph-Based Approximate Nearest Neighbor Search

Yu, Shangdi, Engels, Joshua, Huang, Yihao, Shun, Julian

This paper studies density-based clustering of point sets. These methods use dense regions of points to detect clusters of arbitrary shapes. In particular, we study variants of density peaks clustering, a popular type of algorithm that has been shown to work well in practice. Our goal is to cluster large high-dimensional datasets, which are prevalent in practice. Prior solutions are either sequential, and cannot scale to large data, or are specialized for low-dimensional data. This paper unifies the different variants of density peaks clustering into a single framework, PECANN, by abstracting out several key steps common to this class of algorithms. One such key step is to find nearest neighbors that satisfy a predicate function, and one of the main contributions of this paper is an efficient way to do this predicate search using graph-based approximate nearest neighbor search (ANNS). To provide ample parallelism, we propose a doubling search technique that enables points to find an approximate nearest neighbor satisfying the predicate in a small number of rounds. Our technique can be applied to many existing graph-based ANNS algorithms, which can all be plugged into PECANN. We implement five clustering algorithms with PECANN and evaluate them on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading. Compared to the state-of-the-art FASTDP algorithm for high-dimensional density peaks clustering, which is sequential, our best algorithm is 45x-734x faster while achieving competitive ARI scores. Compared to the state-of-the-art parallel DPC-based algorithm, which is optimized for low dimensions, we show that PECANN is two orders of magnitude faster. As far as we know, our work is the first to evaluate DPC variants on large high-dimensional real-world image and text embedding datasets.

algorithm, dataset, neighbor, (13 more...)

2312.0394

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Finland > North Karelia > Joensuu (0.04)

Genre:

Overview (0.88)
Research Report (0.84)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Bogatskiy, Alexander, Hoffman, Timothy, Miller, David W., Offermann, Jan T., Liu, Xiaoyang

Explainable Equivariant Neural Networks for Particle Physics: PELICAN

arXiv.org Artificial IntelligenceDec-13-2023

PELICAN is a novel permutation equivariant and Lorentz invariant or covariant aggregator network designed to overcome common limitations found in architectures applied to particle physics problems. Compared to many approaches that use non-specialized architectures that neglect underlying physics principles and require very large numbers of parameters, PELICAN employs a fundamentally symmetry group-based architecture that demonstrates benefits in terms of reduced complexity, increased interpretability, and raw performance. We present a comprehensive study of the PELICAN algorithm architecture in the context of both tagging (classification) and reconstructing (regression) Lorentz-boosted top quarks, including the difficult task of specifically identifying and measuring the $W$-boson inside the dense environment of the Lorentz-boosted top-quark hadronic final state. We also extend the application of PELICAN to the tasks of identifying quark-initiated vs.~gluon-initiated jets, and a multi-class identification across five separate target categories of jets. When tested on the standard task of Lorentz-boosted top-quark tagging, PELICAN outperforms existing competitors with much lower model complexity and high sample efficiency. On the less common and more complex task of 4-momentum regression, PELICAN also outperforms hand-crafted, non-machine learning algorithms. We discuss the implications of symmetry-restricted architectures for the wider field of machine learning for physics.

architecture, dataset, pelican, (17 more...)

2307.16506

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)
Europe > Germany > Hamburg (0.04)

Genre: Research Report > New Finding (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Igneczi, Gergo, Dobay, Tamas

Typification of Driver Models Using Clustering Methods

The rapid development of automated driving systems in recent years has led to improvements in road safety and travel comfort. One typical function of these systems is Lane Keep Assist, which generally does not take human driving preferences into account. In our previous work, we have demonstrated that it is possible to implement a Lane Keep Assist function that is appropriate to human preferences using a trajectory planning algorithm based on a linear driving model. In our current work, we investigated how to separate the driving styles of individual drivers. We assumed that there are three driving styles: sporty, neutral and defensive. To prove these relations, clustering methods were applied to previously recorded measurements . Simulations with parameters describing the average behaviour of the classes (re-simulated with clustered types) showed that the resulting paths successfully classified drivers, that the 3 classes are distinct in their behaviour and that our model reproduces these behaviours.

driver model, modellek, mutatja, (14 more...)

2401.0864

Country:

Asia > China (0.04)
North America > United States > Virginia > Alexandria County > Alexandria (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(9 more...)

Genre: Research Report (0.40)

Industry:

Transportation > Ground > Road (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Amouei, Mohammadhossein, Rezvani, Mohsen, Fateh, Mansoor

RAT: Reinforcement-Learning-Driven and Adaptive Testing for Vulnerability Discovery in Web Application Firewalls

Abstract--Due to the increasing sophistication of web attacks, Web Application Firewalls (WAFs) have to be tested and updated regularly to resist the relentless flow of web attacks. In practice, using a brute-force attack to discover vulnerabilities is infeasible due to the wide variety of attack patterns. Thus, various black-box testing techniques have been proposed in the literature. However, these techniques suffer from low efficiency. This paper presents Reinforcement-Learning-Driven and Adaptive Testing (RAT), an automated black-box testing strategy to discover injection vulnerabilities in WAFs. In particular, we focus on SQL injection and Cross-site Scripting, which have been among the top ten vulnerabilities over the past decade. It then utilizes a reinforcement learning technique combined with a novel adaptive search algorithm to discover almost all bypassing attack patterns efficiently. We compare RAT with three state-of-the-art methods considering their objectives. The experiments show that RAT performs 33.53% and 63.16% on average better than its counterparts in discovering the most possible bypassing payloads and reducing the number of attempts before finding the first bypassing payload when testing well-configured WAFs, respectively. Thus, an enormous amount of private data of individuals and organizations is stored in web applications databases, making them tempting targets for attackers. A recent report reveals that web applications may experience up to 26 attacks per minute [1]. Moreover, according to Symantec's security report, 76% of websites are vulnerable to several attacks [2].

application, fragment, payload, (16 more...)

doi: 10.1109/TDSC.2021.3095417

2312.07885

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
Oceania > Australia (0.04)
Europe > United Kingdom > Wales (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.90)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.88)
(3 more...)

Simeone, Fernando, Chaves, Maik Olher, Esmin, Ahmed

Incremental hierarchical text clustering methods: a review

The growth in Internet usage has contributed to a large volume of continuously available data, and has created the need for automatic and efficient organization of the data. In this context, text clustering techniques are significant because they aim to organize documents according to their characteristics. More specifically, hierarchical and incremental clustering techniques can organize dynamic data in a hierarchical form, thus guaranteeing that this organization is updated and its exploration is facilitated. Based on the relevance and contemporary nature of the field, this study aims to analyze various hierarchical and incremental clustering techniques; the main contribution of this research is the organization and comparison of the techniques used by studies published between 2010 and 2018 that aimed to texts documents clustering. We describe the principal concepts related to the challenge and the different characteristics of these published works in order to provide a better understanding of the research in this field.

algorithm, dataset, hierarchy, (16 more...)

2312.07769

Country:

South America > Brazil (0.14)
North America > United States > New York > New York County > New York City (0.05)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)
(8 more...)

Genre: Research Report > New Finding (0.87)

Industry: Health & Medicine (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

A Clustering Framework for Unsupervised and Semi-supervised New Intent Discovery

Zhang, Hanlei, Xu, Hua, Wang, Xin, Long, Fei, Gao, Kai

New intent discovery is of great value to natural language processing, allowing for a better understanding of user needs and providing friendly services. However, most existing methods struggle to capture the complicated semantics of discrete text representations when limited or no prior knowledge of labeled data is available. To tackle this problem, we propose a novel clustering framework, USNID, for unsupervised and semi-supervised new intent discovery, which has three key technologies. First, it fully utilizes unsupervised or semi-supervised data to mine shallow semantic similarity relations and provide well-initialized representations for clustering. Second, it designs a centroid-guided clustering mechanism to address the issue of cluster allocation inconsistency and provide high-quality self-supervised targets for representation learning. Third, it captures high-level semantics in unsupervised or semi-supervised data to discover fine-grained intent-wise clusters by optimizing both cluster-level and instance-level objectives. We also propose an effective method for estimating the cluster number in open-world scenarios without knowing the number of new intents beforehand. USNID performs exceptionally well on several benchmark intent datasets, achieving new state-of-the-art results in unsupervised and semi-supervised new intent discovery and demonstrating robust performance with different cluster numbers.

learning, new intent discovery, representation, (14 more...)

doi: 10.1109/TKDE.2023.3340732

2304.07699

Country:

Asia > China > Beijing > Beijing (0.05)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Halawa, Mohamed Soliman, Díaz-Redondo, Rebeca P., Fernández-Vilas, Ana

KPIs-Based Clustering and Visualization of HPC jobs: a Feature Reduction Approach

arXiv.org Artificial IntelligenceDec-11-2023

High-Performance Computing (HPC) systems need to be constantly monitored to ensure their stability. The monitoring systems collect a tremendous amount of data about different parameters or Key Performance Indicators (KPIs), such as resource usage, IO waiting time, etc. A proper analysis of this data, usually stored as time series, can provide insight in choosing the right management strategies as well as the early detection of issues. In this paper, we introduce a methodology to cluster HPC jobs according to their KPI indicators. Our approach reduces the inherent high dimensionality of the collected data by applying two techniques to the time series: literature-based and variance-based feature extraction. We also define a procedure to visualize the obtained clusters by combining the two previous approaches and the Principal Component Analysis (PCA). Finally, we have validated our contributions on a real data set to conclude that those KPIs related to CPU usage provide the best cohesion and separation for clustering analysis and the good results of our visualization methodology.

artificial intelligence, data mining, machine learning, (18 more...)

doi: 10.1109/ACCESS.2021.3057427

2312.06534

Country:

Europe (0.28)
Africa > Middle East > Egypt (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Energy > Oil & Gas > Upstream (0.67)

Technology:

Information Technology > Sensing and Signal Processing (1.00)
Information Technology > Scientific Computing (1.00)
Information Technology > Information Management (1.00)
(5 more...)

Braylan, Alexander (a:1:{s:5:"en_US";s:29:"University of Texas at Austin";}) | Marabella, Madalyn | Alonso, Omar | Lease, Matthew

A General Model for Aggregating Annotations Across Simple, Complex, and Multi-Object Annotation Tasks

Journal of Artificial Intelligence ResearchDec-11-2023

Human annotations are vital to supervised learning, yet annotators often disagree on the correct label, especially as annotation tasks increase in complexity. A common strategy to improve label quality is to ask multiple annotators to label the same item and then aggregate their labels. To date, many aggregation models have been proposed for simple categorical or numerical annotation tasks, but far less work has considered more complex annotation tasks, such as those involving open-ended, multivariate, or structured responses. Similarly, while a variety of bespoke models have been proposed for specific tasks, our work is the first we are aware of to introduce aggregation methods that generalize across many, diverse complex tasks, including sequence labeling, translation, syntactic parsing, ranking, bounding boxes, and keypoints. This generality is achieved by applying readily available task-specific distance functions, then devising a task-agnostic method to model these distances between labels, rather than the labels themselves. This article presents a unified treatment of our prior work on complex annotation modeling and extends that work with investigation of three new research questions. First, how do complex annotation task and dataset properties impact aggregation accuracy? Second, how should a task owner navigate the many modeling choices in order to maximize aggregation accuracy? Finally, what tests and diagnoses can verify that aggregation models are specified correctly for the given data? To understand how various factors impact accuracy and to inform model selection, we conduct large-scale simulation studies and broad experiments on real, complex datasets. Regarding testing, we introduce the concept of unit tests for aggregation models and present a suite of such tests to ensure that a given model is not mis-specified and exhibits expected behavior. Beyond investigating these research questions above, we discuss the foundational concept and nature of annotation complexity, present a new aggregation model as a conceptual bridge between traditional models and our own, and contribute a new general semisupervised learning method for complex label aggregation that outperforms prior work.

annotation, annotation task, annotator, (16 more...)

Journal of Artificial Intelligence Research

doi: 10.1613/jair.1.14388

AI Access Foundation

14388

Journal of Artificial Intelligence Research

Country:

North America > United States > Texas > Travis County > Austin (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Middle East > Israel (0.04)
(9 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study > Negative Result (0.54)

Industry:

Health & Medicine (1.00)
Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.92)
(2 more...)

arXiv.org Artificial IntelligenceDec-11-2023

Using Analytics on Student Created Data to Content Validate Pedagogical Tools

Kos, John, Eaton, Kenneth, Zhang, Sareen, Dass, Rahul, Buckley, Stephen, An, Sungeun, Goel, Ashok

Conceptual and simulation models can function as useful pedagogical tools, however it is important to categorize different outcomes when evaluating them in order to more meaningfully interpret results. VERA is a ecology-based conceptual modeling software that enables users to simulate interactions between biotics and abiotics in an ecosystem, allowing users to form and then verify hypothesis through observing a time series of the species populations. In this paper, we classify this time series into common patterns found in the domain of ecological modeling through two methods, hierarchical clustering and curve fitting, illustrating a general methodology for showing content validity when combining different pedagogical tools. When applied to a diverse sample of 263 models containing 971 time series collected from three different VERA user categories: a Georgia Tech (GATECH), North Georgia Technical College (NGTC), and ``Self Directed Learners'', results showed agreement between both classification methods on 89.38\% of the sample curves in the test set. This serves as a good indication that our methodology for determining content validity was successful.

simulation, student, vera, (12 more...)

2312.06871

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.05)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.90)

Industry:

Education (1.00)
Health & Medicine (0.93)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)