Collaborating Authors

 quanteda


The R package sentometrics to compute, aggregate and predict with textual sentiment

Ardia, David, Bluteau, Keven, Borms, Samuel, Boudt, Kris

arXiv.org Machine Learning

We provide a hands-on introduction to optimized textual sentiment indexation using the R package sentometrics. Textual sentiment analysis is increasingly used to unlock the potential information value of textual data. The sentometrics package implements an intuitive framework to efficiently compute sentiment scores of numerous texts, to aggregate the scores into multiple time series, and to use these time series to predict other variables. The workflow of the package is illustrated with a built-in corpus of news articles from two major U.S. journals to forecast the CBOE Volatility Index.
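The compute-then-aggregate part of this workflow can be illustrated with a minimal sketch. It is written in plain Python and does not use the sentometrics API. The toy lexicon, the example documents, and the function names `sentiment_score` and `aggregate_by_date` are all assumptions made for illustration, and the equal-weight daily mean is only one of many possible aggregation schemes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy lexicon (word -> polarity); not a lexicon shipped with sentometrics.
LEXICON = {"gain": 1.0, "strong": 1.0, "loss": -1.0, "fear": -1.0}


def sentiment_score(text):
    """Sum of lexicon polarities divided by the number of words in the text."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(LEXICON.get(w, 0.0) for w in words) / len(words)


def aggregate_by_date(docs):
    """Average the document scores per date; returns a date-sorted list of (date, score)."""
    by_date = defaultdict(list)
    for date, text in docs:
        by_date[date].append(sentiment_score(text))
    return [(date, mean(scores)) for date, scores in sorted(by_date.items())]


# Toy documents as (date, text) pairs; assumed data, not the package's built-in corpus.
docs = [
    ("2020-01-01", "strong gain"),
    ("2020-01-01", "fear of loss"),
    ("2020-01-02", "loss and fear"),
]
series = aggregate_by_date(docs)
```

The resulting `series` holds one sentiment value per date, which is the kind of time series that could then serve as a predictor in a forecasting model.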


Advancing Text Mining with R and quanteda

#artificialintelligence

The data that we usually use for text analysis is available in text formats (e.g., .txt). After reading in the data, we need to generate a corpus. A corpus is a type of dataset that is used in text analysis. It contains "a collection of text or speech material that has been brought together according to a certain set of predetermined criteria" (Shmelova et al. 2019, p. 33). These criteria are usually set by the researchers and follow from the guiding research question.


Data Science Tallahassee

#artificialintelligence

Dr. Mark Jack is an experienced Data Scientist and Associate Professor of Physics at Florida A&M University with several years of experience in computational modeling in particle physics, neuroscience, nanoscience and high-performance computing. He is a certified trainer in machine learning and statistical programming in R. He has spoken at several data science conferences, including the Global Big Data Conferences in Tampa, FL and Atlanta, GA. The creation of a corpus of documents from three text data files relies mostly on the R library 'quanteda'. It can quickly tokenize the corpus and clean up text features such as punctuation, numbers, white space and letter case. The processing time for the complete text data is considerable.
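The cleaning steps named here (dropping punctuation, numbers and white space, and lowercasing) can be sketched as follows. This is a plain Python illustration of the idea, not quanteda code. The two-document corpus and the `tokenize` helper are hypothetical, and keeping only runs of ASCII letters is a simplifying assumption.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    # The pattern [a-z]+ keeps runs of letters only, so punctuation,
    # digits and white space are all dropped in a single pass.
    return re.findall(r"[a-z]+", text.lower())


# A toy corpus as a mapping of document id -> raw text (assumed data).
corpus = {
    "doc1": "Hello, World! 2 texts.",
    "doc2": "  Text   mining in 2019  ",
}
tokens = {doc_id: tokenize(text) for doc_id, text in corpus.items()}
```

One consequence of this simple pattern is that words containing apostrophes or accented letters are split or dropped, so a real pipeline would need a more careful tokenizer.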