
Collaborating Authors

gensim


GenSim: A General Social Simulation Platform with Large Language Model based Agents

Tang, Jiakai, Gao, Heyang, Pan, Xuchen, Wang, Lei, Tan, Haoran, Gao, Dawei, Chen, Yushuo, Chen, Xu, Lin, Yankai, Li, Yaliang, Ding, Bolin, Zhou, Jingren, Wang, Jun, Wen, Ji-Rong

arXiv.org Artificial Intelligence

With the rapid advancement of large language models (LLMs), recent years have witnessed many promising studies on leveraging LLM-based agents to simulate human social behavior. While prior work has demonstrated significant potential across various domains, much of it has focused on specific scenarios involving a limited number of agents, and has lacked the ability to adapt when errors occur during simulation. To overcome these limitations, we propose a novel LLM-agent-based simulation platform called GenSim, which: (1) abstracts a set of general functions to simplify the simulation of customized social scenarios; (2) supports one hundred thousand agents to better simulate large-scale populations in real-world contexts; (3) incorporates error-correction mechanisms to ensure more reliable and long-term simulations. To evaluate our platform, we assess both the efficiency of large-scale agent simulations and the effectiveness of the error-correction mechanisms. To our knowledge, GenSim represents an initial step toward a general, large-scale, and correctable social simulation platform based on LLM agents, promising to further advance the field of social science.


A Comparative Study of Text Embedding Models for Semantic Text Similarity in Bug Reports

Patil, Avinash, Han, Kihwan, Jadon, Aryan

arXiv.org Artificial Intelligence

Bug reports are an essential aspect of software development, and it is crucial to identify and resolve them quickly to ensure the consistent functioning of software systems. Retrieving similar bug reports from an existing database can help reduce the time and effort required to resolve bugs. In this paper, we compared the effectiveness of semantic textual similarity methods for retrieving similar bug reports based on a similarity score. We explored several embedding models such as TF-IDF (baseline), FastText, Gensim, BERT, and ADA. We used the Software Defects Data containing bug reports for various software projects to evaluate the performance of these models. Our experimental results showed that BERT generally outperformed the rest of the models in terms of recall, followed by ADA, Gensim, FastText, and TF-IDF. Our study provides insights into the effectiveness of different embedding methods for retrieving similar bug reports and highlights the impact of selecting the appropriate one for this task. Our code is available on GitHub.
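The TF-IDF baseline the abstract mentions can be sketched without any libraries. The following is a minimal illustration (not the authors' actual code, and the sample reports are invented): each term is weighted by its frequency within a report and its rarity across reports, and reports are then compared by cosine similarity.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF weight dict for each tokenized document."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

reports = [
    "app crashes on startup after update".split(),
    "application crashes immediately on startup".split(),
    "login button unresponsive on settings page".split(),
]
vecs = tfidf_vectors(reports)
# The two crash reports share rare terms, so they score higher with each
# other than with the unrelated login report.
```

Real systems would add tokenization, stemming, and stopword handling, which is where the learned embeddings compared in the paper tend to pull ahead.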


GitHub - RaRe-Technologies/gensim: Topic Modelling for Humans

#artificialintelligence

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community. If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia. This software depends on NumPy and SciPy, two Python packages for scientific computing. You must have them installed prior to installing gensim.
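For readers unfamiliar with the Vector Space Model mentioned above, here is a dependency-free sketch of the bag-of-words step that gensim performs with its corpora.Dictionary and doc2bow utilities. The function names mirror gensim's for readability, but this is a stdlib imitation, not gensim's own code.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign a stable integer id to every distinct token, in first-seen order."""
    ids = {}
    for doc in tokenized_docs:
        for token in doc:
            ids.setdefault(token, len(ids))
    return ids

def doc2bow(doc, ids):
    """Represent a tokenized document as sorted sparse (token_id, count) pairs."""
    counts = Counter(ids[t] for t in doc if t in ids)
    return sorted(counts.items())

docs = [["human", "computer", "interaction"],
        ["computer", "survey", "interaction", "interaction"]]
dictionary = build_dictionary(docs)
bow = doc2bow(docs[1], dictionary)   # sparse vector for the second document
```

These sparse count vectors are the input that topic models such as LDA and similarity indexes then operate on.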


Hands-on intro to Language Processing (NLP)

#artificialintelligence

This article discusses three techniques that practitioners can use to start working effectively with natural language processing (NLP). It will also give good visibility to people interested in getting a sense of what NLP is about -- if you are an expert, please feel free to connect, comment, or suggest. At erreVol, we leverage similar tools to extract useful insights from transcripts of earnings reports of public corporations -- the interested reader can try the platform. Note that we will present lines of code for readers interested in replicating or using what is presented below; otherwise, feel free to skip those technical lines, as the reading should remain seamless.


Getting started with Gensim for basic NLP tasks – Analytics India Magazine

#artificialintelligence

Gensim is an open-source Python package for natural language processing with a special focus on topic modelling.


Using NLP to improve your Resume - KDnuggets

#artificialintelligence

Now you can read an overall summary of the job role and your existing Resume! Did you miss anything about the job role that is highlighted in the summary? Small nuanced details can help you sell yourself. Does your summarized document make sense and bring out your essential qualities? Perhaps a concise summary alone is not sufficient.
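As a rough illustration of the kind of extractive summarization the article relies on, here is a minimal word-frequency summarizer in plain Python. It is a simplified stand-in for the actual tooling, and the stopword list and sample text are invented for the example: sentences are scored by how frequent their words are across the whole text, and the top sentences are kept in order.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with", "my"}

def summarize(text, n_sentences=2):
    """Keep the n_sentences whose words are most frequent overall,
    preserving their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
    )
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)

text = ("Gensim builds topic models. Topic models summarize large corpora. "
        "My cat naps daily.")
summary = summarize(text, 2)   # drops the off-topic cat sentence
```

Production summarizers (e.g. TextRank-based ones) use a sentence graph instead of raw counts, but the intuition is the same: frequent, central content survives.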


Learn NLP the Stanford Way -- Lesson 2

#artificialintelligence

In the previous post, we introduced NLP. To find word meanings with the Python programming language, we used the NLTK package and worked our way into word embeddings using the gensim package and Word2vec. Since we only touched the Word2vec technique from a 10,000-foot overview, we are now going to dive deeper into the training method used to create a Word2vec model. Word2vec (Mikolov et al., 2013) [1][2] is not a singular technique or algorithm. It is actually a family of neural network architectures and optimization techniques that can learn good embeddings from large datasets.
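Before looking at the neural network itself, it helps to see the training data that the skip-gram variant of Word2vec consumes: (center, context) word pairs drawn from a sliding window over the text. The sketch below generates those pairs; it deliberately omits the subsampling and negative sampling that full Word2vec training also uses.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for skip-gram:
    each word is paired with every neighbor within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "quick", "brown", "fox"]
pairs = skipgram_pairs(sentence, window=1)
# window=1 yields 6 pairs: each interior word pairs with both neighbors,
# each boundary word with one.
```

The network is then trained so that the center word's embedding predicts its context words, which is what pushes words appearing in similar contexts close together in the vector space.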


Learn NLP the Stanford way -- Lesson 1

#artificialintelligence

The AI area of Natural Language Processing, or NLP, through its gigantic language models -- yes, GPT-3, I'm watching you -- presents what is perceived as a revolution in machines' capabilities to perform the most distinct language tasks. Due to that, public perception is split: some believe these new language models will pave the way to a Skynet type of technology, while others dismiss them as hype-fueled technologies that will sit on dusty shelves, or HDD drives, in little to no time. Motivated by this, I'm creating this series of stories that approaches NLP from scratch in a friendly way. To join me, you'll need a little experience with Python and Jupyter Notebooks, and for the most part, I won't even ask you to have anything installed on your machine. This series will differ dramatically from the Stanford course in the depth to which we'll approach statistics and calculus.


Time-based Sequence Model for Personalization and Recommendation Systems

Ishkhanov, Tigran, Naumov, Maxim, Chen, Xianjie, Zhu, Yan, Zhong, Yuan, Azzolini, Alisson Gusatti, Sun, Chonglin, Jiang, Frank, Malevich, Andrey, Xiong, Liang

arXiv.org Machine Learning

Recommendation systems play an important role in many e-commerce applications as well as search and ranking services [6, 15, 21, 26, 30, 31, 41, 48]. There are two main strategies for performing recommendations: content filtering and collaborative filtering. In content filtering, the user creates a profile based on their interests, while human experts create a profile for each product. An algorithm matches the two profiles and recommends the closest matches to the user. For example, this approach is taken by the Pandora Music Genome Project [29]. In collaborative filtering, recommendations are based only on the user's past behavior, from which future behavior is derived. The advantage of this approach is that it requires no external information and is not domain specific. The challenge is that, in the beginning, very few user-item interactions are available. For instance, this cold-start problem is addressed by Netflix by asking users for a few favorite movies when they create their profile for the first time [27].
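The collaborative-filtering idea described above can be sketched in a few lines: score each item a user has not seen by how similar the users who did interact with it are to the target user. This toy example (invented data and a simple user-based cosine similarity, purely illustrative and not the paper's time-based sequence model):

```python
import math

# Toy binary user-item interaction matrix.
interactions = {
    "alice": {"movie_a": 1, "movie_b": 1, "movie_c": 0},
    "bob":   {"movie_a": 1, "movie_b": 1, "movie_c": 1},
    "carol": {"movie_a": 0, "movie_b": 0, "movie_c": 1},
}

def cosine(u, v):
    """Cosine similarity between two users' interaction vectors."""
    dot = sum(u[i] * v[i] for i in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(user):
    """Score unseen items by the similarity of the users who saw them."""
    scores = {}
    for other, items in interactions.items():
        if other == user:
            continue
        sim = cosine(interactions[user], items)
        for item, seen in items.items():
            if seen and not interactions[user][item]:
                scores[item] = scores.get(item, 0.0) + sim
    return max(scores, key=scores.get) if scores else None

suggestion = recommend("alice")   # bob is similar to alice and saw movie_c
```

The cold-start problem is visible even here: a brand-new user has an all-zero vector, so every similarity is zero and nothing can be ranked, which is why services bootstrap with a few explicit favorites.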


Python Libraries for Natural Language Processing

#artificialintelligence

Natural Language Processing is considered one of the many critical aspects of making intelligent systems. By training your solution with data gathered from the real world, you can make it faster and more relevant to users, generating crucial insights about your customer base. In this article, we will be taking a look at how Python offers some of the most useful and powerful libraries for bringing the power of Natural Language Processing into your project, and where exactly they fit in. Often recognized as a professional-grade Python library for advanced Natural Language Processing, spaCy excels at incredibly large-scale information extraction tasks. Built using Python and Cython, spaCy combines the best of both languages, the convenience of Python and the speed of Cython, to deliver one of the best-in-class NLP experiences. Stanford CoreNLP is a suite of tools built for implementing Natural Language Processing in your project.