- Asia > Afghanistan > Parwan Province > Charikar (0.05)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Afghanistan > Parwan Province > Charikar (0.05)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Maryland (0.04)
- Asia > Afghanistan > Parwan Province > Charikar (0.05)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Canada (0.04)
- Information Technology (0.93)
- Health & Medicine > Therapeutic Area > Oncology (0.46)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Hierarchical clustering with dot products recovers hidden tree structure
In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
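The merge rule described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a naive pure-Python version written for readability, and the function names `dot` and `average_dot_product_linkage`, the return format, and the toy points are all assumptions made for illustration. At each step it merges the two clusters whose members have the largest average pairwise dot product.

```python
from itertools import combinations


def dot(u, v):
    """Dot product of two equal-length numeric sequences."""
    return sum(a * b for a, b in zip(u, v))


def average_dot_product_linkage(points):
    """Naive agglomerative clustering (illustrative sketch only).

    Repeatedly merges the pair of clusters with the largest average
    pairwise dot product between their members. Returns the merges in
    order as (members_a, members_b, affinity) tuples, where members are
    tuples of point indices.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            total = sum(dot(points[i], points[j]) for i in a for j in b)
            affinity = total / (len(a) * len(b))
            # Strict comparison: on ties, the first pair found is kept.
            if best is None or affinity > best[0]:
                best = (affinity, a, b)
        affinity, a, b = best
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, affinity))
    return merges


# Hypothetical toy data: two points near the x-axis, two near the y-axis.
points = [(2.0, 0.0), (1.8, 0.2), (0.0, 1.0), (0.1, 0.9)]
merges = average_dot_product_linkage(points)
for left, right, affinity in merges:
    print(left, right, round(affinity, 3))
```

On this toy input the two x-axis points are merged first, then the two y-axis points, and the two resulting groups are joined last. The sketch recomputes every affinity on each pass, so it only suits very small inputs.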
Hierarchical clustering of complex energy systems using pretopology
Levy, Loup-Noe, Bosom, Jeremie, Guerard, Guillaume, Amor, Soufian Ben, Bui, Marc, Tran, Hai
This article addresses the following problem: how can energy consumption profiles be modeled and classified over a large distributed territory in order to optimize the management of buildings' consumption? Auditing thousands of buildings in depth, case by case, would require a massive amount of time and money as well as a significant number of qualified people. An automated method must therefore be developed to establish a relevant and effective recommendation system. To address this problem, pretopology is used to model the sites' consumption profiles, and a multi-criterion hierarchical classification algorithm that uses the properties of pretopological space has been developed in a Python library. To evaluate the results, three data sets are used: a generated set of points of various sizes in a 2D space, a generated set of time series, and a set of consumption time series from 400 real consumption sites of a French energy company. On the point data set, the algorithm is able to identify the clusters of points using their position in space and their size as parameters. On the generated time series, the algorithm is able to identify the time series clusters using Pearson's correlation, with an Adjusted Rand Index (ARI) of 1. Keywords: artificial intelligence, data analysis, clustering algorithms, pretopology
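For readers unfamiliar with the evaluation metric mentioned here, the following is a minimal sketch of the Adjusted Rand Index using only the Python standard library. It follows the usual contingency-table formula and is not taken from the authors' library; the function name `adjusted_rand_index` and the example labels are assumptions for illustration. An ARI of 1 means the predicted clusters match the reference clusters exactly, up to a renaming of labels.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if n < 2:
        return 1.0  # convention for degenerate input (assumption)
    # Pairs of items placed together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs of items placed together in each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # both labelings are trivial (all one cluster or all singletons)
    return (sum_cells - expected) / (max_index - expected)


# Same partition with swapped label names: perfect agreement.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Every predicted cluster mixes the two reference clusters.
mixed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect, mixed)
```

Because the index is corrected for chance, it can be negative when the agreement is worse than expected for random labelings, as in the second example.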
- North America > United States (0.14)
- Asia > China (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search
Hierarchical clustering is a data analysis method that has been used for decades. Despite its widespread use, the method has an underdeveloped analytical foundation. Having a well understood foundation would both support the currently used methods and help guide future improvements. The goal of this paper is to give an analytic framework to better understand observations seen in practice.
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Asia > Afghanistan > Parwan Province > Charikar (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > United Kingdom > England > Bristol (0.04)
- Oceania > Australia (0.04)
- North America > United States > Kansas (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
AI-Powered Detection of Inappropriate Language in Medical School Curricula
Salavati, Chiman, Song, Shannon, Hale, Scott A., Montenegro, Roberto E., Dori-Hacohen, Shiri, Murai, Fabricio
The use of inappropriate language (such as outdated, exclusionary, or non-patient-centered terms) in medical instructional materials can significantly influence clinical training, patient interactions, and health outcomes. Despite their reputability, many materials developed over past decades contain examples now considered inappropriate by current medical standards. Given the volume of curricular content, manually identifying instances of inappropriate use of language (IUL) and its subcategories for systematic review is prohibitively costly and impractical. To address this challenge, we conduct a first-in-class evaluation of small language models (SLMs) fine-tuned on labeled data and pre-trained LLMs with in-context learning on a dataset containing approximately 500 documents and over 12,000 pages. For SLMs, we consider: (1) a general IUL classifier, (2) subcategory-specific binary classifiers, (3) a multilabel classifier, and (4) a two-stage hierarchical pipeline for general IUL detection followed by multilabel classification. For LLMs, we consider variations of prompts that include subcategory definitions and/or shots. We found that both Llama-3 8B and 70B, even with carefully curated shots, are largely outperformed by SLMs. While the multilabel classifier performs best on annotated data, supplementing training with unflagged excerpts as negative examples boosts the specific classifiers' AUC by up to 25%, making them the most effective models for mitigating harmful language in medical curricula.
- North America > United States > Washington > King County > Seattle (0.14)
- North America > United States > Connecticut > Tolland County > Storrs (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Research Report > New Finding (1.00)
- Instructional Material (1.00)
- Health & Medicine > Consumer Health (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.68)
- Education > Educational Setting > Higher Education (0.64)