AITopics | gzip

gzip

645e6bfdd05d1a69c5e47b20f0a91d46-AuthorFeedback.pdf

Neural Information Processing Systems | Oct-3-2025, 02:20:52 GMT

artificial intelligence, reviewer, vector, (15 more...)

Neural Information Processing Systems

Country: North America > United States (0.16)

Technology: Information Technology > Artificial Intelligence (0.35)

An Enhancement of Jiang, Z., et al.'s Compression-Based Classification Algorithm Applied to News Article Categorization

Benavides, Sean Lester C., Masapol, Cid Antonio F., Morano, Jonathan C., Cortez, Dan Michael A.

arXiv.org Artificial Intelligence | Feb-20-2025

This study enhances Jiang et al.'s compression-based classification algorithm by addressing its limitations in detecting semantic similarities between text documents. The proposed improvements focus on unigram extraction and optimized concatenation, eliminating reliance on compressing entire documents. By compressing extracted unigrams, the algorithm mitigates the sliding-window limitations inherent to gzip, improving compression efficiency and similarity detection. The optimized concatenation strategy replaces direct concatenation with the union of unigrams, reducing redundancy and improving the accuracy of Normalized Compression Distance (NCD) calculations. Experimental results across datasets of varying sizes and complexities demonstrate an average accuracy improvement of 5.73%, with gains of up to 11% on datasets containing longer documents. These improvements are more pronounced in datasets with high label diversity and complex text structures. The methodology achieves these results while maintaining computational efficiency, making it suitable for resource-constrained environments. This study provides a robust, scalable solution for text classification, emphasizing lightweight preprocessing techniques to achieve efficient compression, which in turn enables more accurate classification.
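The unigram-union idea described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the tokenization, the ordering of unigrams, or how they are serialized before compression, so the whitespace tokenizer, the sorted order, and the helper names `unigrams`, `clen_tokens`, and `ncd_unigram` are all assumptions.

```python
from __future__ import annotations

import gzip


def unigrams(text: str) -> set[str]:
    """Unique lowercased whitespace tokens (tokenization is an assumption)."""
    return set(text.lower().split())


def clen_tokens(tokens: set[str]) -> int:
    """gzip-compressed length of a token set, serialized in sorted order."""
    joined = " ".join(sorted(tokens))
    return len(gzip.compress(joined.encode("utf-8")))


def ncd_unigram(x: str, y: str) -> float:
    """NCD where the joint term compresses the union of unigrams
    instead of the direct concatenation of both documents."""
    ux, uy = unigrams(x), unigrams(y)
    cx, cy = clen_tokens(ux), clen_tokens(uy)
    cxy = clen_tokens(ux | uy)  # union replaces direct concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = "the cat sat on the mat"
    b = "stock markets fell sharply today"
    print(ncd_unigram(a, a), ncd_unigram(a, b))
```

Under these assumptions, two documents with the same vocabulary have a distance of exactly zero, because the union adds no new unigrams to compress.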

algorithm, classification, dataset, (15 more...)

arXiv.org Artificial Intelligence

2502.14444

Country: Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)

Genre: Research Report > Experimental Study (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.39)

gzip Predicts Data-dependent Scaling Laws

Pandey, Rohan

arXiv.org Artificial Intelligence | May-26-2024

Past work has established scaling laws that predict the performance of a neural language model (LM) as a function of its parameter count and the number of tokens it's trained on, enabling optimal allocation of a fixed compute budget. Are these scaling laws agnostic to training data, as some prior work suggests? We generate training datasets of varying complexities by modulating the syntactic properties of a PCFG, finding that 1) scaling laws are sensitive to differences in data complexity and that 2) gzip, a compression algorithm, is an effective predictor of how data complexity impacts scaling properties. We propose a new data-dependent scaling law for LMs that accounts for the training data's gzip-compressibility; its compute-optimal frontier increases in dataset size preference (over parameter count preference) as training data becomes harder to compress.
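As a rough illustration of what "gzip-compressibility" could mean in practice, the sketch below computes the ratio of compressed size to raw size for a byte string. The abstract does not define the measure, so this ratio and the function name `gzip_ratio` are assumptions; lower values mean the data is easier to compress.

```python
import gzip
import random


def gzip_ratio(data: bytes) -> float:
    """Compressed size divided by raw size (assumed definition).

    Values near 0 indicate highly redundant data; values near or above 1
    indicate data that gzip cannot shrink.
    """
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(gzip.compress(data)) / len(data)


if __name__ == "__main__":
    repetitive = b"ab" * 2048
    rng = random.Random(0)  # fixed seed so the comparison is repeatable
    noisy = bytes(rng.getrandbits(8) for _ in range(4096))
    print(gzip_ratio(repetitive), gzip_ratio(noisy))
```

In a setup like the one the abstract describes, such a ratio would be measured on a sample of the training corpus and used as an input to the scaling-law fit; how that fit is parameterized is not stated here.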

arxiv preprint arxiv, dataset, gzip, (14 more...)

arXiv.org Artificial Intelligence

2405.16684

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Gzip versus bag-of-words for text classification

Opitz, Juri

arXiv.org Artificial Intelligence | Aug-8-2023

KNN is a simple classifier that uses distance measurements between data points: for a given testing point, we calculate its distance to every other point from some labeled training set and check the labels of the K closest points (i.e., the K-Nearest Neighbors), predicting the label that we observe most frequently. Hence, it is straightforward to build a general text classifier from KNN if we can equip it with a sensible distance measure between documents. Interestingly, recent findings [4] suggest that we can exploit compression to assess the distance between two documents by comparing their individual compression lengths to the length of their compressed concatenation (we call this measurement gzip). With this approach, [4] show strong text classification performance across different data sets, sometimes achieving higher accuracy than trained neural classifiers such as BERT [2], especially in scenarios where only limited training data are available. Against this background, it is not surprising that gzip has quickly attracted a lot of attention.
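The compression-based distance described above can be sketched as follows. The abstract does not give a formula, so this uses the Normalized Compression Distance (NCD) as one common way to combine the three compressed lengths; the space used to join the two documents and the names `compressed_len` and `ncd` are assumptions.

```python
import gzip


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two documents.

    Compares the individual compressed lengths with the compressed
    length of the concatenation; smaller values mean more shared content.
    """
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + " " + y)  # joining with a space is an assumption
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = "the cat sat on the mat"
    b = "stock markets fell sharply today"
    print(ncd(a, a), ncd(a, b))
```

The intuition is that when two documents share substrings, the compressor can encode the second one largely as back-references to the first, so the concatenation compresses to little more than one document alone.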

machine learning, natural language, training data, (20 more...)

arXiv.org Artificial Intelligence

2307.15002

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.72)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Less is More: Parameter-Free Text Classification with Gzip

Jiang, Zhiying, Yang, Matthew Y. R., Tsirlin, Mikhail, Tang, Raphael, Lin, Jimmy

arXiv.org Artificial Intelligence | Dec-19-2022

Deep neural networks (DNNs) are often used for text classification tasks as they usually achieve high levels of accuracy. However, DNNs can be computationally intensive, with billions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that is easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training, pre-training, or fine-tuning, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings where labeled data are too scarce for DNNs to achieve a satisfying accuracy.
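A minimal sketch of the gzip plus k-nearest-neighbor combination might look like the following. It is an illustration under stated assumptions rather than the paper's code: the distance is the Normalized Compression Distance, the vote is a simple majority, and the names `knn_predict`, `train`, and `k` as well as the toy data are hypothetical.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """gzip-compressed length of the UTF-8 encoded text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance (assumed distance measure)."""
    cx, cy = clen(x), clen(y)
    return (clen(x + " " + y) - min(cx, cy)) / max(cx, cy)


def knn_predict(query: str, train: list, k: int = 3) -> str:
    """Label the query by majority vote among its k nearest training texts.

    train is a list of (text, label) pairs. No model is fitted; all the
    work happens at prediction time.
    """
    ranked = sorted(train, key=lambda pair: ncd(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    train = [
        ("the team won the football match with a late goal", "sports"),
        ("the striker scored a goal in the football match", "sports"),
        ("the central bank raised interest rates again", "finance"),
        ("investors fear the bank will raise interest rates", "finance"),
    ]
    print(knn_predict("a late goal won the football match", train, k=1))
```

Because every prediction compares the query against every training text, cost grows linearly with the size of the training set, which is the usual trade-off for a method that needs no training step.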

machine learning, natural language, text classification, (17 more...)

arXiv.org Artificial Intelligence

2212.0941

Country:

Asia > Japan (0.04)
North America > United States > Louisiana (0.04)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
