 wordcloud


LLM-Generated Negative News Headlines Dataset: Creation and Benchmarking Against Real Journalism

arXiv.org Artificial Intelligence

This research examines the potential of datasets generated by Large Language Models (LLMs) to support Natural Language Processing (NLP) tasks, aiming to overcome challenges related to data acquisition and privacy concerns associated with real-world data. Focusing on negative valence text, a critical component of sentiment analysis, we explore the use of LLM-generated synthetic news headlines as an alternative to real-world data. A specialized corpus of negative news headlines was created using tailored prompts to capture diverse negative sentiments across various societal domains. The synthetic headlines were validated by expert review and further analyzed in embedding space to assess their alignment with real-world negative news in terms of content, tone, length, and style. Key metrics such as correlation with real headlines, perplexity, coherence, and realism were evaluated. The synthetic dataset was benchmarked against two sets of real news headlines using evaluations including the Comparative Perplexity Test, Comparative Readability Test, Comparative POS Profiling, BERTScore, and Comparative Semantic Similarity. Results show the generated headlines match real headlines with the only marked divergence being in the proper noun score of the POS profile test.


Opinion Mining on Offshore Wind Energy for Environmental Engineering

arXiv.org Artificial Intelligence

In this paper, we conduct sentiment analysis on social media data to study mass opinion about offshore wind energy. We adapt three machine learning models, namely TextBlob, VADER, and SentiWordNet, because each provides different functions. TextBlob provides subjectivity analysis as well as polarity classification. VADER offers cumulative sentiment scores. SentiWordNet considers sentiments with reference to context and performs classification accordingly. Techniques in NLP are harnessed to gather meaning from the textual data in social media. Data visualization tools are suitably deployed to display the overall results. This work is much in line with citizen science and smart governance via involvement of mass opinion to guide decision support. It exemplifies the role of Machine Learning and NLP here.


Web scraping and text analysis in R and GGplot2 – A.Z. Andis Arietta

#artificialintelligence

I recently needed to learn text mining for a project at work. I generally learn more quickly with a real-world project. So, I turned to a topic I love: Wilderness, to see how I could apply the skills of text scrubbing and natural language processing. You can clone my Git repo for the project or follow along in the post below. The first portion of this post will cover web scraping, then text mining, and finally analysis and visualization.


Providing Insights for Open-Response Surveys via End-to-End Context-Aware Clustering

arXiv.org Artificial Intelligence

Teachers often conduct surveys in order to collect data from a predefined group of students to gain insights into topics of interest. When analyzing surveys with open-ended textual responses, it is extremely time-consuming, labor-intensive, and difficult to manually process all the responses into an insightful and comprehensive report. In the analysis step, traditionally, the teacher has to read each of the responses and decide on how to group them in order to extract insightful information. Even though it is possible to group the responses only using certain keywords, such an approach would be limited since it not only fails to account for embedded contexts but also cannot detect polysemous words or phrases and semantics that are not expressible in single words. In this work, we present a novel end-to-end context-aware framework that extracts, aggregates, and abbreviates embedded semantic patterns in open-response survey data. Our framework relies on a pre-trained natural language model in order to encode the textual data into semantic vectors. The encoded vectors then get clustered either into an optimally tuned number of groups or into a set of groups with pre-specified titles. In the former case, the clusters are then further analyzed to extract a representative set of keywords or summary sentences that serve as the labels of the clusters. In our framework, for the designated clusters, we finally provide context-aware wordclouds that demonstrate the semantically prominent keywords within each group. Honoring user privacy, we have successfully built the on-device implementation of our framework suitable for real-time analysis on mobile devices and have tested it on a synthetic dataset. Our framework reduces the costs at-scale by automating the process of extracting the most insightful information pieces from survey data.


Resume Screening using Deep Learning on Cainvas

#artificialintelligence

Resume screening is necessary when companies receive thousands of applications for different roles and need to find suitable matches. For this project, the dataset originally consists of 2 columns, Category and Resume, where Category denotes the field (e.g. Data Science, HR, Testing). By using value_counts on Category, we can find the frequency distribution of the categories present in our dataset. During pre-processing, we need to remove links, hashtags, URLs and similar tokens, as these are irrelevant in a resume. Further, using nltk, we also remove stopwords (e.g. words like 'are', 'the', 'or') that add no significance to the content.
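The pre-processing steps described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the article's own code: `Counter` stands in for pandas `value_counts`, the small `STOPWORDS` set stands in for the nltk stop-word list, and the function names and sample rows are hypothetical.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; nltk's English list would be much fuller.
STOPWORDS = {"are", "the", "or", "a", "an", "and", "in", "of", "to", "with"}


def clean_resume(text):
    """Remove links, hashtags, non-letters and stop words from resume text."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links / URLs
    text = re.sub(r"#\S+", " ", text)                    # hashtags
    text = re.sub(r"[^A-Za-z\s]", " ", text)             # punctuation, digits
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)


def category_counts(categories):
    """Frequency of each category, most common first."""
    return Counter(categories).most_common()


# Hypothetical sample rows in (Category, Resume) form.
rows = [
    ("Data Science", "Skills are Python and SQL, see https://example.com #hiring"),
    ("HR", "Experienced in the recruitment of engineers"),
    ("Data Science", "Worked with pandas or NumPy"),
]

print(category_counts(category for category, _ in rows))
print([clean_resume(resume) for _, resume in rows])
```

The order of the substitutions matters: links are stripped before the non-letter filter so that a URL is removed whole rather than broken into stray word fragments.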


Predicting the Difficulty of Texts Using Machine Learning and Getting a Visual Representation of…

#artificialintelligence

Text data is ubiquitous. It comes in many forms, such as posts, books, articles, and blogs. What is more interesting is that there is a subset of Artificial Intelligence called Natural Language Processing (NLP) that converts text into a form that can be used for machine learning. I know that sounds like a lot, but getting to know the details and the proper implementation of machine learning algorithms ensures that one learns the important tools in the process. Since newer and better libraries are being created for machine learning purposes, it makes sense to learn some of the state-of-the-art tools that can be used for predictions. I've recently come across a challenge on Kaggle about predicting the difficulty of a text.


Predicting Fake News using NLP and Machine Learning

#artificialintelligence

The ratio is disturbed from being 1:1 to 4:5 for genuine to fake news. It is seen that the median length is lower for fake articles but it also has loads of outliers. It is seen that they start from 0 which is concerning. It actually starts from 1 when I used .describe() to see the numbers. So I took a look at these texts and found that they are blank.
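A check for blank articles along these lines might look like the sketch below. The records and field names are hypothetical, and treating whitespace-only strings as blank is an assumption: it is one plausible way a text can have length 1 and still be empty of content, but the snippet does not say what the blank texts actually contained.

```python
# Hypothetical records; the real dataset's column names are not given.
articles = [
    {"label": "fake", "text": " "},
    {"label": "genuine", "text": "Council approves new budget."},
    {"label": "fake", "text": ""},
    {"label": "fake", "text": "Shocking cure found!"},
]


def is_blank(text):
    """True when the text is empty or contains only whitespace."""
    return not text.strip()


# Character lengths, as a length-based summary would see them.
lengths = [len(a["text"]) for a in articles]

# Split the records into blank and usable ones.
blank = [a for a in articles if is_blank(a["text"])]
kept = [a for a in articles if not is_blank(a["text"])]

print("lengths:", lengths)
print("blank articles:", len(blank), "kept:", len(kept))
```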


Spam Email Detection Using Machine Learning

#artificialintelligence

There are 4,825 ham and 747 spam messages. This indicates that the data is imbalanced, which needs to be fixed. The top ham message is "Sorry, I'll call later", whereas the top spam message is "Please call our customer service…"; they occurred 30 and 4 times, respectively. First, let's create separate dataframes for ham and spam messages, then convert each to a NumPy array and then to a list, to generate a WordCloud later. Since this is text data, it contains many unnecessary stopwords, such as articles and prepositions, which need to be removed.
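To make the imbalance concrete, the sketch below computes the spam share from the counts quoted above and then balances the classes by random undersampling. Undersampling is only one possible remedy and is an assumption here; the snippet does not say which method the article uses. The `undersample` helper is hypothetical and uses only the standard library.

```python
import random
from collections import Counter


def undersample(labels, seed=0):
    """Return sorted row indices that keep as many rows per class as the rarest class has."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_min))  # sample without replacement
    return sorted(keep)


# Label column rebuilt from the counts quoted in the text.
labels = ["ham"] * 4825 + ["spam"] * 747

counts = Counter(labels)
spam_share = counts["spam"] / len(labels)
print(f"spam share: {spam_share:.1%}")

balanced = [labels[i] for i in undersample(labels)]
print(Counter(balanced))
```

Undersampling discards most of the ham messages, so on a dataset this small, class weights or oversampling may be preferable; the choice depends on the model being trained.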


The discovery of wine's structural form

#artificialintelligence

Today I will present a guided tutorial for applying Kemp & Tenenbaum's brilliant "form discovery" algorithm to a wine dataset. Ultimately, this provides a data-driven map to choose wines from, based on our tastes. If you are, like me, fond of data science, machine learning, cognition and/or a wine lover, then you might find this post interesting. Actually, if you know of ways it could be improved I'd love to hear them! First of all, like every recipe, we'll start with a list of things we need: Essentially, in their work Kemp & Tenenbaum created an algorithm which finds the best structural representation for a dataset, without any assumption or indication about this dimension.


Develop Text into WordCloud in Python

#artificialintelligence

Word clouds, or tag clouds, are graphical representations of word frequency that give greater prominence to words that appear more frequently in a source text. The larger a word appears in the visual, the more common it is in the document(s). A word cloud is a data visualization technique for representing text data in which the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud. Word clouds are widely used for analyzing data from social network websites. To generate a word cloud in Python, the modules needed are matplotlib, pandas and wordcloud.
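The idea that word size tracks frequency can be illustrated without any plotting library. The sketch below counts words and maps each count linearly onto a font-size range. The function names and the linear scaling rule are assumptions for illustration; the wordcloud package applies its own scaling and layout logic.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords=frozenset()):
    """Count lower-cased words, skipping any in `stopwords`."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)


def font_sizes(freqs, min_size=10, max_size=60):
    """Map counts linearly to font sizes; the most frequent word gets max_size."""
    if not freqs:
        return {}
    lo, hi = min(freqs.values()), max(freqs.values())
    span = hi - lo
    if span == 0:  # every word is equally frequent
        return {word: max_size for word in freqs}
    scale = (max_size - min_size) / span
    return {word: round(min_size + (count - lo) * scale)
            for word, count in freqs.items()}


freqs = word_frequencies("Data data data cloud cloud word")
sizes = font_sizes(freqs)
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{word}: count={freqs[word]}, font size={size}")
```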