Naïve Bayes Classifier


Test-Time Steering for Lossless Text Compression via Weighted Product of Experts

Zhang, Qihang, Li, Muchen, Wang, Ziao, Liao, Renjie, Wang, Lele

arXiv.org Artificial Intelligence

Lossless compression techniques are crucial in an era of rapidly growing data. Traditional universal compressors like gzip offer low computational overhead, high speed, and broad applicability across data distributions. However, they often lead to worse compression rates than modern neural compressors, which leverage large-scale training data to model data distributions more effectively. Despite their advantages, neural compressors struggle to generalize to unseen data. To address this limitation, we propose a novel framework that performs Test-Time Steering via a Weighted Product of Experts (wPoE). At inference, our method adaptively combines a universal compression model with a pretrained neural language model, ensuring the compression rate is at least as good as that of the best individual model. Extensive experiments demonstrate that our approach improves the performance of text compression without requiring fine-tuning. Furthermore, it seamlessly integrates with any autoregressive language model, providing a practical solution for enhancing text compression across diverse data distributions.
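The abstract describes combining a universal compressor with a neural language model as a weighted product of experts at inference time. The paper's code is not reproduced here; the following is a minimal sketch of the weighted-product idea on toy next-symbol distributions, with all function names and numbers our own invention:

```python
import math

def wpoe_distribution(p_universal, p_neural, w):
    """Combine two next-symbol distributions as a weighted product of
    experts: q_i is proportional to p_universal_i^(1-w) * p_neural_i^w,
    renormalized to sum to 1."""
    q = [(pu ** (1.0 - w)) * (pn ** w)
         for pu, pn in zip(p_universal, p_neural)]
    z = sum(q)
    return [qi / z for qi in q]

def codelength_bits(dist, symbol):
    """Ideal arithmetic-coding cost of one symbol, in bits."""
    return -math.log2(dist[symbol])

# Toy 4-symbol alphabet: a flat "universal" model vs. a confident LM.
p_gzip_like = [0.25, 0.25, 0.25, 0.25]
p_lm = [0.70, 0.10, 0.10, 0.10]

# w=0 recovers the universal model and w=1 the language model, so a
# per-input choice of w can never do worse than the best single expert.
for w in (0.0, 0.5, 1.0):
    q = wpoe_distribution(p_gzip_like, p_lm, w)
    print(f"w={w}: cost of symbol 0 = {codelength_bits(q, 0):.3f} bits")
```

Since the two endpoints of the weight recover the individual experts, tuning the weight at test time gives the "at least as good as the best individual model" guarantee the abstract claims.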


Cascade of one-class classifier ensemble and dynamic naive Bayes classifier applied to the myoelectric-based upper limb prosthesis control with contaminated channels detection

Trajdos, Pawel, Kurzynski, Marek

arXiv.org Artificial Intelligence

Modern upper limb bioprostheses are typically controlled by sEMG signals using a pattern recognition scheme. Unfortunately, the sEMG signal is highly susceptible to contamination, which deteriorates the quality of the control system and reduces the usefulness of the prosthesis in the patient's everyday life. In this paper, the authors propose a new recognition system intended for sEMG-based control of a hand prosthesis with detection of contaminated sEMG signals. The originality of the proposed solution lies in the cooperation of two recognition systems working in a cascade structure: (1) an ensemble of one-class classifiers used to recognise contaminated signals, and (2) a naive Bayes classifier (NBC) that recognises the patient's intentions using the information about contamination produced by the ensemble. Although the NBC model is changed dynamically, the multiplicative form of its classification functions allows training to be performed in a one-shot procedure. Experimental studies were conducted using real sEMG signals. The results obtained confirm the hypothesis that the use of the one-class classifier ensemble and the dynamic NBC model leads to improved classification quality.
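The abstract notes that the multiplicative form of the naive Bayes classification functions is what lets the model be changed dynamically without retraining: a contaminated channel's factor can simply be dropped from the product. A hedged sketch of that idea (the interface and the toy numbers are ours, not the paper's):

```python
import math

def dynamic_nb_posterior(priors, likelihoods, contaminated):
    """Naive Bayes posterior that skips channels flagged as contaminated.
    priors: per-class priors; likelihoods[c][d]: class-conditional density
    of channel d under class c, evaluated at the observed signal;
    contaminated: per-channel booleans from the one-class ensemble."""
    log_post = [math.log(p) for p in priors]
    n_channels = len(contaminated)
    for c, lik in enumerate(likelihoods):
        for d in range(n_channels):
            if not contaminated[d]:          # multiplicative form: simply
                log_post[c] += math.log(lik[d])  # omit contaminated factors
    m = max(log_post)                        # stabilize before exponentiating
    post = [math.exp(lp - m) for lp in log_post]
    z = sum(post)
    return [p / z for p in post]

# Two intention classes, three sEMG channels; channel 1 flagged as bad,
# so its (misleading) likelihoods do not influence the decision.
priors = [0.5, 0.5]
likelihoods = [[0.8, 0.2, 0.7],   # class 0
               [0.1, 0.9, 0.6]]   # class 1
print(dynamic_nb_posterior(priors, likelihoods, [False, True, False]))
```

Because the remaining factors are unchanged, the per-channel densities can be trained once ("one-shot") and any subset of channels used at decision time.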


Deep Learning, Machine Learning, Advancing Big Data Analytics and Management

Hsieh, Weiche, Bi, Ziqian, Chen, Keyu, Peng, Benji, Zhang, Sen, Xu, Jiawei, Wang, Jinlang, Yin, Caitlyn Heqi, Zhang, Yichao, Feng, Pohsun, Wen, Yizhu, Wang, Tianyang, Li, Ming, Liang, Chia Xin, Ren, Jintao, Niu, Qian, Chen, Silin, Yan, Lawrence K. Q., Xu, Han, Tseng, Hong-Ming, Song, Xinyuan, Jing, Bowen, Yang, Junjie, Song, Junhao, Liu, Junyu, Liu, Ming

arXiv.org Artificial Intelligence

Advancements in artificial intelligence, machine learning, and deep learning have catalyzed the transformation of big data analytics and management into pivotal domains for research and application. This work explores the theoretical foundations, methodological advancements, and practical implementations of these technologies, emphasizing their role in uncovering actionable insights from massive, high-dimensional datasets. The study presents a systematic overview of data preprocessing techniques, including data cleaning, normalization, integration, and dimensionality reduction, to prepare raw data for analysis. Core analytics methodologies such as classification, clustering, regression, and anomaly detection are examined, with a focus on algorithmic innovation and scalability. Furthermore, the text delves into state-of-the-art frameworks for data mining and predictive modeling, highlighting the role of neural networks, support vector machines, and ensemble methods in tackling complex analytical challenges. Special emphasis is placed on the convergence of big data with distributed computing paradigms, including cloud and edge computing, to address challenges in storage, computation, and real-time analytics. The integration of ethical considerations, including data privacy and compliance with global standards, ensures a holistic perspective on data management. Practical applications across healthcare, finance, marketing, and policy-making illustrate the real-world impact of these technologies. Through comprehensive case studies and Python-based implementations, this work equips researchers, practitioners, and data enthusiasts with the tools to navigate the complexities of modern data analytics. It bridges the gap between theory and practice, fostering the development of innovative solutions for managing and leveraging data in the era of artificial intelligence.


Sampling Audit Evidence Using a Naive Bayes Classifier

Sheu, Guang-Yih, Liu, Nai-Ru

arXiv.org Artificial Intelligence

Taiwan's auditors have suffered from processing excessive audit data, including drawing audit evidence. This study advances sampling techniques by integrating machine learning with sampling. This integration helps avoid sampling bias, preserve randomness and variability, and target riskier samples. We first classify data into classes using a Naive Bayes classifier. Next, a user-based, item-based, or hybrid approach is employed to draw audit evidence. The representativeness index is the primary metric for measuring how representative the drawn evidence is. The user-based approach samples data symmetrically around the median of a class as audit evidence; it may be equivalent to a combination of monetary and variable sampling. The item-based approach performs asymmetric sampling based on posterior probabilities to obtain risky samples as audit evidence; it may be identical to a combination of non-statistical and monetary sampling. Auditors can hybridize the user-based and item-based approaches to balance representativeness and riskiness in selecting audit evidence. Three experiments show that sampling with machine learning integration has the benefits of drawing unbiased samples; handling complex patterns, correlations, and unstructured data; and improving efficiency in sampling big data. However, its limitations are the classification accuracy of the underlying machine learning algorithms and the range of prior probabilities.


Viewing the process of generating counterfactuals as a source of knowledge

Lemaire, Vincent, Boudec, Nathan Le, Guyomard, Victor, Fessant, Françoise

arXiv.org Artificial Intelligence

There are now many explainable AI methods for understanding the decisions of a machine learning model. Among these are methods based on counterfactual reasoning, which involve simulating feature changes and observing the impact on the prediction. This article proposes viewing this simulation process as a source of knowledge that can be stored and reused later in different ways. The process is illustrated for additive models and, more specifically, for the naive Bayes classifier, whose properties are shown to be particularly well suited to this purpose.


An Efficient Shapley Value Computation for the Naive Bayes Classifier

Lemaire, Vincent, Clérot, Fabrice, Boullé, Marc

arXiv.org Artificial Intelligence

Variable selection, or measuring the importance of input variables to a machine learning model, has become the focus of much research. It is no longer enough to have a good model; one must also explain its decisions. This is why so many intelligibility algorithms are available today. Among them, Shapley value estimation algorithms are intelligibility methods based on cooperative game theory. To our knowledge, there is no analytical formulation of Shapley values for the naive Bayes classifier. This article proposes an exact analytic expression of Shapley values in the special case of the naive Bayes classifier. We analytically compare this Shapley proposal to another frequently used indicator, the Weight of Evidence (WoE), and provide an empirical comparison of our proposal with (i) the WoE and (ii) KernelShap results on real-world datasets, discussing similarities and differences. The results show that our Shapley proposal for the naive Bayes classifier provides informative results with low algorithmic complexity, so it can be used on very large datasets with extremely low computation time.
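The paper's exact analytic Shapley expression is not given in the abstract, so it is not reproduced here. The Weight of Evidence baseline it compares against, however, is standard for binary naive Bayes: the posterior log-odds decompose additively into one term per feature. A small illustrative sketch (toy numbers ours):

```python
import math

def woe_decomposition(prior_c, lik_c, lik_not_c):
    """Per-feature Weight of Evidence for a binary naive Bayes model.
    Under naive Bayes the posterior log-odds decompose additively:
    log-odds(c | x) = log-odds(c) + sum_i log(P(x_i|c) / P(x_i|not c)),
    and each summand is the WoE contribution of feature i."""
    woe = [math.log(a / b) for a, b in zip(lik_c, lik_not_c)]
    log_odds = math.log(prior_c / (1 - prior_c)) + sum(woe)
    return woe, log_odds

# Toy model: two features, class prior 0.5. Feature 0 argues for the
# class (0.9 vs 0.3), feature 1 argues against it (0.2 vs 0.4).
woe, log_odds = woe_decomposition(0.5, [0.9, 0.2], [0.3, 0.4])
print(woe, log_odds)
```

Checking the identity directly: the unnormalized posteriors are 0.5 * 0.9 * 0.2 = 0.09 and 0.5 * 0.3 * 0.4 = 0.06, giving odds 1.5, and the decomposition indeed returns log(1.5).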


Naive Bayes algorithm. A Simple and Effective Approach for…

#artificialintelligence

Naive Bayes is a machine learning algorithm used for classification tasks. It is based on Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to it. The algorithm assumes that all of the features in the dataset are conditionally independent of each other given the class, which is why it is called "naive": the presence or absence of one feature is assumed not to affect the probability of the other features.

To classify a new data point, the algorithm first calculates the probability of the data point belonging to each class, then chooses the class with the highest probability as the prediction. These probabilities come from Bayes' theorem, which states that the probability of A given B equals the probability of B given A times the probability of A, divided by the probability of B.

For example, suppose we have a dataset with two classes: "spam" and "not spam." We can use Bayes' theorem to calculate the probability that a new email belongs to the "spam" class given that it contains the word "Viagra." We first need the probability of the word "Viagra" appearing in a "spam" email and the probability of it appearing in a "not spam" email. We then multiply the former by the overall probability of an email being "spam" and divide by the probability of the word "Viagra" appearing in any email. Once the probabilities for each class have been calculated, we choose the class with the highest probability as the prediction. Naive Bayes is a simple and effective algorithm for classification tasks, and it can be used with a variety of different types of data.
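The spam example above can be worked through numerically. The probabilities below are invented purely for illustration:

```python
# Worked version of the "Viagra" example: P(spam | word) via Bayes'
# theorem. All numbers here are made up for illustration.
p_spam = 0.4                 # prior: fraction of emails that are spam
p_word_given_spam = 0.30     # "Viagra" appears in 30% of spam emails
p_word_given_ham = 0.01      # ...and in 1% of legitimate emails

# P(word) by the law of total probability, then Bayes' theorem.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"P(spam | 'Viagra') = {p_spam_given_word:.3f}")  # ~0.952
```

Even with a modest prior of 0.4, the word's far higher frequency in spam pushes the posterior above 0.95, so the email would be classified as spam.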


Comparative Study of Sentiment Analysis for Multi-Sourced Social Media Platforms

Kapur, Keshav, Harikrishnan, Rajitha

arXiv.org Artificial Intelligence

The rise of the internet has altered how people express their ideas and thoughts, and a vast amount of data is now generated every second. This area of research attempts to determine the feelings or opinions of people expressed in social media posts. The dataset we used is a multi-source dataset drawn from the comment sections of various social networking sites such as Twitter and Reddit. Natural language processing techniques were employed to perform sentiment analysis on this dataset. In this paper, we provide a comparative analysis of lexicon-based, machine learning, and deep learning approaches: the machine learning algorithm used in this work is Naive Bayes, the lexicon-based approach is TextBlob, and the deep learning algorithm is LSTM.


Feature Engineering vs BERT on Twitter Data

Gani, Ryiaadh, Chalaguine, Lisa

arXiv.org Artificial Intelligence

In this paper, we compare the performance of traditional machine learning models using feature engineering and word vectors with that of the state-of-the-art language model BERT using word embeddings on three datasets. We also consider the time and cost efficiency of feature engineering compared to BERT. From our results, we conclude that using the BERT model was worth the time and cost trade-off for only one of the three datasets, where it significantly outperformed every traditional classifier using feature vectors instead of embeddings. On the other two datasets, BERT achieved increases of only 0.03 in accuracy and 0.05 in F1 score, which arguably does not justify the time and cost of GPU training.


Integrating question answering and text-to-SQL in Portuguese

José, Marcos Menon, José, Marcelo Archanjo, Mauá, Denis Deratani, Cozman, Fábio Gagliardi

arXiv.org Artificial Intelligence

Deep learning transformers have drastically improved systems that automatically answer questions in natural language. However, different questions demand different answering techniques; here we propose, build, and validate an architecture that integrates different modules to answer two distinct kinds of queries. Our architecture takes free-form natural language text and classifies it, routing it either to a neural question answering reasoner or to a natural-language-to-SQL parser. We implemented a complete system for the Portuguese language, using some of the main tools available for the language and translating training and testing datasets. Experiments show that our system selects the appropriate answering method with high accuracy (over 99%), thus validating a modular question answering strategy.