AITopics

2410.02733

Country:

North America > United States > Virginia (0.04)
North America > United States > North Carolina > Mecklenburg County > Charlotte (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.89)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.50)

Das, Akanksha, Bhattacharyya, Malay

Density based Spatial Clustering of Lines via Probabilistic Generation of Neighbourhood

arXiv.org Artificial IntelligenceOct-3-2024

Density based spatial clustering of points in $\mathbb{R}^n$ has a myriad of applications in a variety of industries. We generalise this problem to the density based clustering of lines in high-dimensional spaces, keeping in mind there exists no valid distance measure that follows the triangle inequality for lines. In this paper, we design a clustering algorithm that generates a customised neighbourhood for a line of a fixed volume (given as a parameter), based on an optional parameter as a continuous probability density function. This algorithm is not sensitive to the outliers and can effectively identify the noise in the data using a cardinality parameter. One of the pivotal applications of this algorithm is clustering data points in $\mathbb{R}^n$ with missing entries, while utilising the domain knowledge of the respective data. In particular, the proposed algorithm is able to cluster $n$-dimensional data points that contain at least $(n-1)$-dimensional information. We illustrate the neighbourhoods for the standard probability distributions with continuous probability density functions and demonstrate the effectiveness of our algorithm on various synthetic and real-world datasets (e.g., rail and road networks). The experimental results also highlight its application in clustering incomplete data.

dataset, line segment, neighbourhood, (12 more...)

2410.0229

Country:

North America > United States (0.29)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)
Asia > India > West Bengal > Kolkata (0.04)

Genre: Research Report (0.64)

Industry: Transportation > Ground (0.49)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Xu, Shizhou, Strohmer, Thomas

WHOMP: Optimizing Randomized Controlled Trials via Wasserstein Homogeneity

arXiv.org Machine LearningOct-3-2024

We investigate methods for partitioning datasets into subgroups that maximize diversity within each subgroup while minimizing dissimilarity across subgroups. We introduce a novel partitioning method called the $\textit{Wasserstein Homogeneity Partition}$ (WHOMP), which optimally minimizes type I and type II errors that often result from imbalanced group splitting or partitioning, commonly referred to as accidental bias, in comparative and controlled trials. We conduct an analytical comparison of WHOMP against existing partitioning methods, such as random subsampling, covariate-adaptive randomization, rerandomization, and anti-clustering, demonstrating its advantages. Moreover, we characterize the optimal solutions to the WHOMP problem and reveal an inherent trade-off between the stability of subgroup means and variances among these solutions. Based on our theoretical insights, we design algorithms that not only obtain these optimal solutions but also equip practitioners with tools to select the desired trade-off. Finally, we validate the effectiveness of WHOMP through numerical experiments, highlighting its superiority over traditional methods.

experiment, subgroup, var, (14 more...)

arXiv.org Machine Learning

2409.18504

Country: North America > United States > California > Yolo County > Davis (0.14)

Genre:

Research Report > Strength High (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Data Science (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Arya Mazumdar, Soumyabrata Pal

Semisupervised Clustering, AND-Queries and Locally Encodable Source Coding

Neural Information Processing SystemsOct-2-2024, 22:01:46 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, probability, query, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Information Management (0.85)
Information Technology > Data Science (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Neural Information Processing SystemsOct-2-2024, 20:48:31 GMT

YASS: Yet Another Spike Sorter

Jin Hyung Lee, David E. Carlson, Hooshmand Shokri Razaghi, Weichi Yao, Georges A. Goetz, Espen Hagen, Eleanor Batty, E.J. Chichilnisky, Gaute T. Einevoll, Liam Paninski

Neural Information Processing Systems http://nips.cc/

algorithm, pipeline, spike, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > Norway > Eastern Norway > Oslo (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Workflow (0.46)
Research Report > New Finding (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Arya Mazumdar, Barna Saha

Query Complexity of Clustering with Side Information

Neural Information Processing SystemsOct-2-2024, 15:35:36 GMT

Suppose, we are given a set of n elements to be clustered into k (unknown) clusters, and an oracle/expert labeler that can interactively answer pair-wise queries of the form, "do two elements u and v belong to the same cluster?". The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we provide a rigorous theoretical study of this basic problem of query complexity of interactive clustering, and give strong information theoretic lower bounds, as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. However, obtaining an ideal similarity function is extremely challenging due to ambiguity in data representation, poor data quality etc., and this is one of the primary reasons that makes clustering hard.

algorithm, query, side information, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
North America > United States > Nevada > Clark County > Las Vegas (0.05)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(3 more...)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

HiReview: Hierarchical Taxonomy-Driven Automatic Literature Review Generation

Hu, Yuntong, Li, Zhuofeng, Zhang, Zheng, Ling, Chen, Kanjiani, Raasikh, Zhao, Boxin, Zhao, Liang

In this work, we present HiReview, a novel framework for hierarchical taxonomy-driven automatic literature review generation. With the exponential growth of academic documents, manual literature reviews have become increasingly labor-intensive and time-consuming, while traditional summarization models struggle to generate comprehensive document reviews effectively. Large language models (LLMs), with their powerful text processing capabilities, offer a potential solution; however, research on incorporating LLMs for automatic document generation remains limited. To address key challenges in large-scale automatic literature review generation (LRG), we propose a two-stage taxonomy-then-generation approach that combines graph-based hierarchical clustering with retrieval-augmented LLMs. First, we retrieve the most relevant sub-community within the citation network, then generate a hierarchical taxonomy tree by clustering papers based on both textual content and citation relationships. In the second stage, an LLM generates coherent and contextually accurate summaries for clusters or topics at each hierarchical level, ensuring comprehensive coverage and logical organization of the literature. Extensive experiments demonstrate that HiReview significantly outperforms state-of-the-art methods, achieving superior hierarchical organization, content relevance, and factual accuracy in automatic literature review generation tasks.

large language model, machine learning, natural language, (16 more...)

2410.03761

Country:

Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre:

Overview (1.00)
Research Report > Promising Solution (0.68)

Industry:

Education (0.48)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Tola, Astrit, Taiwo, Funmilola Mary, Akcora, Cuneyt Gurcan, Coskunuzer, Baris

TopER: Topological Embeddings in Graph Representation Learning

Graph embeddings play a critical role in graph representation learning, allowing machine learning models to explore and interpret graph-structured data. However, existing methods often rely on opaque, high-dimensional embeddings, limiting interpretability and practical visualization. In this work, we introduce Topological Evolution Rate (TopER), a novel, low-dimensional embedding approach grounded in topological data analysis. TopER simplifies a key topological approach, Persistent Homology, by calculating the evolution rate of graph substructures, resulting in intuitive and interpretable visualizations of graph data. This approach not only enhances the exploration of graph datasets but also delivers competitive performance in graph clustering and classification tasks. Our TopER-based models achieve or surpass state-of-the-art results across molecular, biological, and social network datasets in tasks such as classification, clustering, and visualization.

dataset, filtration function, graph, (15 more...)

2410.01778

Country:

North America > United States > Texas > Dallas County > Richardson (0.04)
North America > United States > Florida > Orange County > Orlando (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(2 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Integrating Protein Sequence and Expression Level to Analysis Molecular Characterization of Breast Cancer Subtypes

Sholehrasa, Hossein

Breast cancer's complexity and variability pose significant challenges in understanding its progression and guiding effective treatment. This study aims to integrate protein sequence data with expression levels to improve the molecular characterization of breast cancer subtypes and predict clinical outcomes. Using ProtGPT2, a language model designed for protein sequences, we generated embeddings that capture the functional and structural properties of proteins sequence. These embeddings were integrated with protein expression level to form enriched biological representations, which were analyzed using machine learning methods like ensemble K-means for clustering and XGBoost for classification. Our approach enabled successful clustering of patients into biologically distinct groups and accurately predicted clinical outcomes such as survival and biomarkers status, achieving high performance metrics, notably an F1 score of 0.88 for survival and 0.87 for biomarkers status prediction. Analysis of feature importance highlighted key proteins like KMT2C, GCN1, and CLASP2, linked to hormone receptor and Human Epidermal Growth Factor Receptor 2 (HER2) expression, which play a role in tumor progression and patient outcomes, respectively. Furthermore, protein-protein interaction networks and correlation analyses revealed the interdependence of proteins that may influence breast cancer subtype behaviors. These findings suggest that integrating protein sequence and expression data provides valuable insights into tumor biology and has significant potential to enhance personalized treatment strategies in breast cancer care.

breast cancer, expression level, protein, (13 more...)

2410.01755

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Kansas (0.04)
Europe > Switzerland (0.04)

Genre:

Research Report > Experimental Study (0.87)
Research Report > New Finding (0.66)

Industry:

Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Chucri, Charbel, Azouz, Rami, Ott, Joachim

Recursive Abstractive Processing for Retrieval in Dynamic Datasets

Recent retrieval-augmented models enhance basic methods by building a hierarchical structure over retrieved text chunks through recursive embedding, clustering, and summarization. The most relevant information is then retrieved from both the original text and generated summaries. However, such approaches face limitations with dynamic datasets, where adding or removing documents over time complicates the updating of hierarchical representations formed through clustering. We propose a new algorithm to efficiently maintain the recursive-abstractive tree structure in dynamic datasets, without compromising performance. Additionally, we introduce a novel post-retrieval method that applies query-focused recursive abstractive processing to substantially improve context quality. Our method overcomes the limitations of other approaches by functioning as a black-box post-retrieval layer compatible with any retrieval algorithm. Both algorithms are validated through extensive experiments on real-world datasets, demonstrating their effectiveness in handling dynamic data and improving retrieval performance.

algorithm, dataset, postqfrap, (16 more...)

2410.01736

Country:

North America > Canada (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Law (0.93)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
Information Technology > Data Science > Data Mining (0.93)
(3 more...)