lcss
- Europe > France > Hauts-de-France > Nord > Lille (0.04)
- Asia > Japan > Honshū > Kantō > Ibaraki Prefecture > Tsukuba (0.04)
- North America > United States > California (0.04)
- (2 more...)
Local Curvature Smoothing with Stein's Identity for Efficient Score Matching
The training of score-based diffusion models (SDMs) is based on score matching. The challenge of score matching is that it includes a computationally expensive Ja-cobian trace. While several methods have been proposed to avoid this computation, each has drawbacks, such as instability during training and approximating the learning as learning a denoising vector field rather than a true score.
- Asia > Japan (0.14)
- Europe > France (0.14)
- North America > United States (0.14)
MemHunter: Automated and Verifiable Memorization Detection at Dataset-scale in LLMs
Wu, Zhenpeng, Lou, Jian, Zheng, Zibin, Chen, Chuan
Large language models (LLMs) have been shown to memorize and reproduce content from their training data, raising significant privacy concerns, especially with web-scale datasets. Existing methods for detecting memorization are largely sample-specific, relying on manually crafted or discretely optimized memory-inducing prompts generated on a per-sample basis, which become impractical for dataset-level detection due to the prohibitive computational cost of iterating over all samples. In real-world scenarios, data owners may need to verify whether a susceptible LLM has memorized their dataset, particularly if the LLM may have collected the data from the web without authorization. To address this, we introduce \textit{MemHunter}, which trains a memory-inducing LLM and employs hypothesis testing to efficiently detect memorization at the dataset level, without requiring sample-specific memory inducing. Experiments on models such as Pythia and Llama-2 demonstrate that \textit{MemHunter} can extract up to 40\% more training data than existing methods under constrained time resources and reduce search time by up to 80\% when integrated as a plug-in. Crucially, \textit{MemHunter} is the first method capable of dataset-level memorization detection, providing an indispensable tool for assessing privacy risks in LLMs that are powered by vast web-sourced datasets.
- Asia > China > Guangdong Province > Zhuhai (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
Local Curvature Smoothing with Stein's Identity for Efficient Score Matching
Osada, Genki, Shing, Makoto, Nishide, Takashi
The training of score-based diffusion models (SDMs) is based on score matching. The challenge of score matching is that it includes a computationally expensive Jacobian trace. While several methods have been proposed to avoid this computation, each has drawbacks, such as instability during training and approximating the learning as learning a denoising vector field rather than a true score. We propose a novel score matching variant, local curvature smoothing with Stein's identity (LCSS). The LCSS bypasses the Jacobian trace by applying Stein's identity, enabling regularization effectiveness and efficient computation. We show that LCSS surpasses existing methods in sample generation performance and matches the performance of denoising score matching, widely adopted by most SDMs, in evaluations such as FID, Inception score, and bits per dimension. Furthermore, we show that LCSS enables realistic image generation even at a high resolution of $1024 \times 1024$.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > France > Hauts-de-France > Nord > Lille (0.04)
- Asia > Japan > Honshū > Kantō > Ibaraki Prefecture > Tsukuba (0.04)
- (3 more...)
Evaluating the Effectiveness of Large Language Models in Representing and Understanding Movement Trajectories
This research focuses on assessing the ability of AI foundation models in representing the trajectories of movements. We utilize one of the large language models (LLMs) (i.e., GPT-J) to encode the string format of trajectories and then evaluate the effectiveness of the LLM-based representation for trajectory data analysis. The experiments demonstrate that while the LLM-based embeddings can preserve certain trajectory distance metrics (i.e., the correlation coefficients exceed 0.74 between the Cosine distance derived from GPT-J embeddings and the Hausdorff and Dynamic Time Warping distances on raw trajectories), challenges remain in restoring numeric values and retrieving spatial neighbors in movement trajectory analytics. In addition, the LLMs can understand the spatiotemporal dependency contained in trajectories and have good accuracy in location prediction tasks. This research highlights the need for improvement in terms of capturing the nuances and complexities of the underlying geospatial data and integrating domain knowledge to support various GeoAI applications using LLMs.
- North America > United States > Wisconsin > Dane County > Madison (0.05)
- Europe > Germany (0.04)
- Asia > China > Beijing > Beijing (0.04)
Elastic Similarity Measures for Multivariate Time Series Classification
Shifaz, Ahmed, Pelletier, Charlotte, Petitjean, Francois, Webb, Geoffrey I.
Elastic similarity measures are a class of similarity measures specifically designed to work with time series data. When scoring the similarity between two time series, they allow points that do not correspond in timestamps to be aligned. This can compensate for misalignments in the time axis of time series data, and for similar processes that proceed at variable and differing paces. Elastic similarity measures are widely used in machine learning tasks such as classification, clustering and outlier detection when using time series data. There is a multitude of research on various univariate elastic similarity measures. However, except for multivariate versions of the well known Dynamic Time Warping (DTW) there is a lack of work to generalise other similarity measures for multivariate cases. This paper adapts two existing strategies used in multivariate DTW, namely, Independent and Dependent DTW, to several commonly used elastic similarity measures. Using 23 datasets from the University of East Anglia (UEA) multivariate archive, for nearest neighbour classification, we demonstrate that each measure outperforms all others on at least one dataset and that there are datasets for which either the dependent versions of all measures are more accurate than their independent counterparts or vice versa. This latter finding suggests that these differences arise from a fundamental property of the data. We also show that an ensemble of such nearest neighbour classifiers is highly competitive with other state-of-the-art multivariate time series classifiers.
- Europe > United Kingdom > England (0.24)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (3 more...)
Learning Classifier Systems in a Nutshell
The first introductory textbook on LCS algorithms (authored by Will Browne and myself) will be published by'Springer' this fall: (link will be found here once it's available) To follow research and software developed by Ryan Urbanowicz PhD on rule-based machine learning methods or other topics, check out the following links.
- North America > United States > Pennsylvania (0.06)
- North America > United States > Michigan (0.06)
Review and Perspective for Distance Based Trajectory Clustering
Besse, Philippe, Guillouet, Brendan, Loubes, Jean-Michel, François, Royer
TRAJECTORY is a set of positional information for a moving object, ordered by time. This kind of multidimensional data is prevalent in many fields and applications, for example, to understand migration patterns by studying trajectories of animals, predict meteorology with hurricane data, improve athletes performance, etc. Our study is concentrated on vehicle trajectories within a road network. The growing use of GPS receivers and WIFI embedded mobile devices equipped with hardware for storing data enables us to collect a very large amount of data, that has to be analyzed in order to extract any relevant information. The complexity of the extracted data makes it a difficult challenge.
- North America > United States > California > San Francisco County > San Francisco (0.04)
- Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
ATNoSFERES revisited
Landau, Samuel, Sigaud, Olivier, Schoenauer, Marc
ATNoSFERES is a Pittsburgh style Learning Classifier System (LCS) in which the rules are represented as edges of an Augmented Transition Network. Genotypes are strings of tokens of a stack-based language, whose execution builds the labeled graph. The original ATNoSFERES, using a bitstring to represent the language tokens, has been favorably compared in previous work to several Michigan style LCSs architectures in the context of Non Markov problems. Several modifications of ATNoSFERES are proposed here: the most important one conceptually being a representational change: each token is now represented by an integer, hence the genotype is a string of integers; several other modifications of the underlying grammar language are also proposed. The resulting ATNoSFERES-II is validated on several standard animat Non Markov problems, on which it outperforms all previously published results in the LCS literature. The reasons for these improvement are carefully analyzed, and some assumptions are proposed on the underlying mechanisms in order to explain these good results.
- North America > United States > Michigan (0.25)
- North America > United States > Wisconsin > Dane County > Madison (0.14)
- North America > United States > New York (0.04)
- (4 more...)