AITopics

2412.08197

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Molinari, Marco, Shao, Victor, Tregubiak, Vladimir, Pandey, Abhimanyu, Mikolajczak, Mateusz, Pereira, Sebastian Kuznetsov Ryder Torres

Interpretable Company Similarity with Sparse Autoencoders

Determining company similarity is a vital task in finance, underpinning hedging, risk management, portfolio diversification, and more. Practitioners often rely on sector and industry classifications such as SIC codes and GICS codes to gauge similarity; the former are used by the U.S. Securities and Exchange Commission (SEC), and the latter are widely used by the investment community. Since these classifications can lack granularity and often need to be updated, using clusters of embeddings of company descriptions has been proposed as a potential alternative, but the lack of interpretability in token embeddings poses a significant barrier to adoption in high-stakes contexts. Sparse Autoencoders (SAEs) have shown promise in enhancing the interpretability of Large Language Models (LLMs) by decomposing LLM activations into interpretable features. We apply SAEs to company descriptions, obtaining meaningful clusters of equities in the process. We benchmark SAE features against SIC codes, Major Group codes, and embeddings. Our results demonstrate that SAE features not only replicate but often surpass sector classifications and embeddings in capturing fundamental company characteristics. This is evidenced by their superior performance in correlating monthly returns (a proxy for similarity) and in generating co-integration strategies with higher Sharpe ratios, which underscores deeper fundamental similarities among companies.

large language model, machine learning, natural language

2412.02605

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Virginia (0.04)
North America > Canada > Nova Scotia > Halifax Regional Municipality > Halifax (0.04)

Genre: Research Report > New Finding (0.68)

Industry:

Banking & Finance > Trading (1.00)
Government > Regional Government > North America Government > United States Government (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

An objective function for order preserving hierarchical clustering

Bakkelund, Daniel

We present a theory and an objective function for similarity-based hierarchical clustering of probabilistic partial orders and directed acyclic graphs (DAGs). Specifically, given elements $x \le y$ in the partial order, and their respective clusters $[x]$ and $[y]$, the theory yields an order relation $\le'$ on the clusters such that $[x]\le'[y]$. The theory provides a concise definition of order-preserving hierarchical clustering, and offers a classification theorem identifying the order-preserving trees (dendrograms). To determine the optimal order-preserving trees, we develop an objective function that frames the problem as a bi-objective optimisation, aiming to satisfy both the order relation and the similarity measure. We prove that the optimal trees under the objective are both order-preserving and exhibit high-quality hierarchical clustering. Since finding an optimal solution is NP-hard, we introduce a polynomial-time approximation algorithm and demonstrate that the method outperforms existing methods for order-preserving hierarchical clustering by a significant margin.

artificial intelligence, machine learning, relation

2109.04266

Country:

Europe > Norway > Eastern Norway > Oslo (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > Utah (0.04)

Genre: Research Report (0.40)

Industry: Government > Regional Government (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Bhanderi, Aniket, Bhatnagar, Raj

Incremental Gaussian Mixture Clustering for Data Streams

Analyzing very large data streams is an important problem in many application domains. In this paper we present an algorithm that finds clusters and anomalous data points in streaming datasets, and we demonstrate that it works effectively. Entropy minimization is used as the criterion for defining and updating clusters formed from a streaming dataset. As the clusters are formed, we also identify anomalous data points that lie far away from all known clusters. Using a number of 2-D datasets, we demonstrate the method's effectiveness at discovering clusters and identifying anomalous data points.

artificial intelligence, data mining, machine learning

2412.07217

Country:

North America > United States > Maryland > Montgomery County > Bethesda (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

A Real-time Degeneracy Sensing and Compensation Method for Enhanced LiDAR SLAM

Liao, Zongbo, Zhang, Xuanxuan, Zhang, Tianxiang, Li, Zhi, Zheng, Zhenqi, Wen, Zhichao, Li, You

LiDAR is widely used in Simultaneous Localization and Mapping (SLAM) and autonomous driving. LiDAR odometry is of great importance in multi-sensor fusion. However, in some unstructured environments, point cloud registration cannot constrain the poses of the LiDAR because of sparse geometric features, which leads to the degeneracy of multi-sensor fusion accuracy. To address this problem, we propose a novel real-time approach to sense and compensate for the degeneracy of LiDAR. First, this paper introduces a degeneracy factor with a clear meaning, which measures the degeneracy of LiDAR. Then, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering method adaptively perceives the degeneracy with better environmental generalization. Finally, the degeneracy perception results are used to fuse LiDAR and IMU, thus effectively resisting degeneracy effects. Experiments on our dataset show the method's high accuracy and robustness and validate our algorithm's adaptability to different environments and LiDAR scanning modalities.

artificial intelligence, information fusion, machine learning

2412.07513

Country:

Asia > China > Hubei Province > Wuhan (0.06)
Asia > China > Beijing > Beijing (0.05)
North America > United States > California > Alameda County > Berkeley (0.04)
North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.04)

Genre: Research Report (1.00)

Industry:

Transportation (0.34)
Information Technology > Robotics & Automation (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.48)

arXiv.org Machine Learning, Dec-10-2024

Two-way Node Popularity Model for Directed and Bipartite Networks

Jing, Bing-Yi, Li, Ting, Wang, Jiangzhou, Wang, Ya

There has been extensive research on community detection in directed and bipartite networks. However, these studies often fail to consider the popularity of nodes in different communities, which is a common phenomenon in real-world networks. To address this issue, we propose a new probabilistic framework called the Two-Way Node Popularity Model (TNPM). The TNPM also accommodates edges from different distributions within a general sub-Gaussian family. We introduce the Delete-One-Method (DOM) for model fitting and community structure identification, and provide a comprehensive theoretical analysis with novel technical tools for handling the sub-Gaussian generalization. Additionally, we propose the Two-Stage Divided Cosine Algorithm (TSDC) to handle large-scale networks more efficiently. Our proposed methods offer multi-fold advantages in terms of estimation accuracy and computational efficiency, as demonstrated through extensive numerical studies. We apply our methods to two real-world applications, uncovering interesting findings.

adjacency matrix, algorithm, matrix


2412.08051

Country:

North America > United States (0.14)
Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Data Science (0.68)

Hume, Jacob, Balzano, Laura

A Spectral Framework for Tracking Communities in Evolving Networks

arXiv.org Machine Learning, Dec-10-2024

Discovering and tracking communities in time-varying networks is an important task in network science, motivated by applications in fields ranging from neuroscience to sociology. In this work, we characterize the celebrated family of spectral methods for static clustering in terms of the low-rank approximation of high-dimensional node embeddings. From this perspective, it becomes natural to view the evolving community detection problem as one of subspace tracking on the Grassmann manifold. While the resulting optimization problem is nonconvex, we adopt a recently proposed block majorize-minimize Riemannian optimization scheme to learn the Grassmann geodesic which best fits the data. Our framework generalizes any static spectral community detection approach and leads to algorithms achieving favorable performance on synthetic and real temporal networks, including those that are weighted, signed, directed, mixed-membership, multiview, hierarchical, cocommunity-structured, bipartite, or some combination thereof. We show how to cast a wide variety of methods into our framework, and demonstrate greatly improved dynamic community detection results in all cases.

spectral, spectral framework, tracking community


2412.07378

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Middle East > Cyprus > Nicosia > Nicosia (0.04)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.66)
Leisure & Entertainment > Sports > Football (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Guérin, Axel, Chauvet, Pierre, Saubion, Frédéric

A Survey on Recent Advances in Self-Organizing Maps

The Self-Organising Map algorithm is a well-known approach for unsupervised learning, designed to distill a high-dimensional dataset into a more manageable, typically two-dimensional, representation. Imagine a dataset of p measured variables across n observations. A Self-Organising Map organises similar observations into groups and visually displays them on a map. This model, also known as a Kohonen map or Kohonen network, was introduced by Teuvo Kohonen [Koh82, Koh97]. Unlike conventional neural networks, which rely on error correction, SOM training relies on competitive principles. Kohonen drew inspiration from biological paradigms, in particular neural models [MP69] and Alan Turing's pioneering theories of morphogenesis [Tur52]. In essence, self-organising maps serve as powerful tools for dissecting and visualising complex data landscapes, facilitating a deeper understanding of the intricate structures and relationships that permeate multidimensional datasets. Self-organising maps, like most artificial neural network architectures, operate in two distinct modes: training and mapping.
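The competitive training and the two modes (training and mapping) described in this abstract can be illustrated with a minimal sketch. It is not taken from the survey: the function names `train_som` and `best_matching_unit`, the grid size, the linear decay schedules, and the Gaussian neighbourhood are all illustrative assumptions, chosen as one common textbook variant.

```python
import math
import random


def best_matching_unit(weights, x):
    """Mapping mode: return the grid index (row, col) of the unit closest to x."""
    return min(
        weights,
        key=lambda rc: sum((w - v) ** 2 for w, v in zip(weights[rc], x)),
    )


def train_som(data, rows=4, cols=4, epochs=30, lr0=0.5, sigma0=2.0, seed=0):
    """Training mode: fit a rows x cols map with competitive learning.

    All hyperparameters and decay schedules here are assumptions made for
    illustration, not values recommended by the survey.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per grid unit, initialised uniformly in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    samples = list(data)
    total_steps = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        rng.shuffle(samples)
        for x in samples:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays linearly
            sigma = sigma0 * (1.0 - frac) + 0.1     # neighbourhood radius shrinks
            # Competition: the closest unit wins.
            br, bc = best_matching_unit(weights, x)
            # Cooperation: the winner and its grid neighbours move toward x,
            # weighted by a Gaussian of their grid distance to the winner.
            for (r, c), w in weights.items():
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights
```

After training, calling `best_matching_unit(weights, x)` on a new observation places it on the map; observations that are close in the input space tend to land on the same or neighbouring units.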

artificial intelligence, machine learning, self-organizing map

2501.08416

Country:

Asia > China (0.28)
Oceania > Australia (0.28)
South America > Brazil (0.28)

Genre:

Summary/Review (1.00)
Research Report > Promising Solution (1.00)
Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Education (0.93)
Health & Medicine > Therapeutic Area (0.93)
Information Technology (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Ousat, Behzad, Shariatnasab, Mahshad, Schafir, Esteban, Chaharsooghi, Farhad Shirani, Kharraz, Amin

In-Application Defense Against Evasive Web Scans through Behavioral Analysis

arXiv.org Artificial Intelligence, Dec-9-2024

Web traffic has evolved to include both human users and automated agents, ranging from benign web crawlers to adversarial scanners such as those capable of credential stuffing, command injection, and account hijacking at the web scale. The financial costs of these adversarial activities are estimated to exceed tens of billions of dollars in 2023. In this work, we introduce WebGuard, a low-overhead in-application forensics engine, to enable robust identification and monitoring of automated web scanners, and help mitigate the associated security risks. WebGuard focuses on the following design criteria: (i) integration into web applications without any changes to the underlying software components or infrastructure, (ii) minimal communication overhead, (iii) capability for real-time detection, e.g., within hundreds of milliseconds, and (iv) attribution capability to identify new behavioral patterns and detect emerging agent categories. To this end, we have equipped WebGuard with multi-modal behavioral monitoring mechanisms, such as monitoring spatio-temporal data and browser events. We also design supervised and unsupervised learning architectures for real-time detection and offline attribution of human and automated agents, respectively. Information theoretic analysis and empirical evaluations are provided to show that multi-modal data analysis, as opposed to uni-modal analysis which relies solely on mouse movement dynamics, significantly improves time-to-detection and attribution accuracy. Numerical evaluations using real-world data collected via WebGuard show high accuracy within hundreds of milliseconds, with a communication overhead below 10 KB per second.

data mining, machine learning, pattern recognition

2412.07005

Country:

North America > United States > Florida > Hillsborough County > University (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Law Enforcement & Public Safety (0.86)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Vielhaben, Johanna, Bareeva, Dilyara, Berend, Jim, Samek, Wojciech, Strodthoff, Nils

Beyond Scalars: Concept-Based Alignment Analysis in Vision Transformers

arXiv.org Artificial Intelligence, Dec-9-2024

Vision transformers (ViTs) can be trained using various learning paradigms, from fully supervised to self-supervised. Diverse training protocols often result in significantly different feature spaces, which are usually compared through alignment analysis. However, current alignment measures quantify this relationship in terms of a single scalar value, obscuring the distinctions between common and unique features in pairs of representations that share the same scalar alignment. We address this limitation by combining alignment analysis with concept discovery, which enables a breakdown of alignment into single concepts encoded in feature space. This fine-grained comparison reveals both universal and unique concepts across different representations, as well as the internal structure of concepts within each of them. Our methodological contributions address two key prerequisites for concept-based alignment: 1) To describe the representation in terms of concepts that faithfully capture the geometry of the feature space, we define concepts as the most general structure they can possibly form, namely arbitrary manifolds, allowing hidden features to be described by their proximity to these manifolds. 2) To measure distances between concept proximity scores of two representations, we use a generalized Rand index and partition it for alignment between pairs of concepts. In a sanity check, we confirm the superiority of our novel concept definition for alignment analysis over existing linear baselines. The concept-based alignment analysis of representations from four different ViTs reveals that increased supervision correlates with a reduction in the semantic structure of learned representations.

artificial intelligence, machine learning, representation

2412.06639

Country:

Europe > United Kingdom > England > Staffordshire (0.04)
Oceania > New Zealand > South Island > Marlborough District > Blenheim (0.04)
North America > United States > Virginia (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)
Leisure & Entertainment > Sports (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)