AITopics | readme file

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

LLM-based Content Classification Approach for GitHub Repositories by the README Files

Mehmood, Malik Uzair, Hussain, Shahid, Wang, Wen Li, Malik, Muhammad Usama

arXiv.org Artificial Intelligence | Jul-30-2025

GitHub is the world's most popular platform for storing, sharing, and managing code. Every GitHub repository has a README file associated with it. Following GitHub's recommendations, README files should contain project-related information that supports the usage and improvement of repositories. However, GitHub repository owners sometimes neglect these recommendations, which prevents a repository from reaching its full potential. This research posits that the comprehensiveness of a GitHub repository's README file significantly influences its adoption and utilization, with a lack of detail potentially hindering widespread engagement and impact within the research community. Large Language Models (LLMs) have shown great performance in many text-based tasks, including text classification, text generation, text summarization, and text translation. In this study, an approach is developed to fine-tune LLMs for automatically classifying different sections of GitHub README files. Three encoder-only LLMs are utilized: BERT, DistilBERT, and RoBERTa. These pre-trained models are fine-tuned on a gold-standard dataset consisting of 4226 README file sections. This approach outperforms current state-of-the-art methods and achieves an overall F1 score of 0.98. Moreover, we investigate Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA) and show that they offer an economical alternative to full fine-tuning without compromising much performance. The results demonstrate the potential of using LLMs in designing an automatic classifier for categorizing the content of GitHub README files. Consequently, this study contributes to the development of automated tools for GitHub repositories to improve their identification and potential usage.
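To give a rough sense of why LoRA can be cheaper than full fine-tuning, the sketch below counts trainable weights for a single square layer and applies the low-rank update W + (alpha / r) * B A to an input. It is not the authors' code: the layer width of 768 (the BERT-base hidden size) and the rank of 8 are assumptions chosen only for illustration, and the function names are hypothetical.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights for LoRA: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def lora_forward(W, A, B, x, alpha, rank):
    """Compute (W + (alpha / rank) * B @ A) @ x using plain Python lists."""
    scale = alpha / rank
    # Low-rank path: A @ x has length `rank`, then B @ (A @ x) has length d_out.
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    BAx = [sum(b * v for b, v in zip(row, Ax)) for row in B]
    # Frozen path: the pre-trained weight matrix applied to the same input.
    Wx = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return [wx + scale * bax for wx, bax in zip(Wx, BAx)]


if __name__ == "__main__":
    d, r = 768, 8  # assumed layer width and rank, for illustration only
    print(full_params(d, d))     # 589824
    print(lora_params(d, d, r))  # 12288
```

With these assumed sizes, the low-rank factors hold about 2% of the weights of the full matrix, which is the kind of saving the abstract refers to as an economical alternative.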

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2507.21899

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Uncovering Intention through LLM-Driven Code Snippet Description Generation

Nugroho, Yusuf Sulistyo, Salam, Farah Danisha, Reid, Brittany, Kula, Raula Gaikovina, Shimari, Kazumasa, Matsumoto, Kenichi

arXiv.org Artificial Intelligence | Jun-19-2025

Documenting code snippets is essential to pinpoint key areas where both developers and users should pay attention. Examples include usage examples and other Application Programming Interfaces (APIs), which are especially important for third-party libraries. With the rise of Large Language Models (LLMs), the key goal is to investigate the kinds of descriptions developers commonly use and evaluate how well an LLM, in this case Llama, can support description generation. We use NPM Code Snippets, consisting of 185,412 packages with 1,024,579 code snippets. From there, we use 400 code snippets (and their descriptions) as samples. First, our manual classification found that the majority of original descriptions (55.5%) highlight example-based usage. This finding emphasizes the importance of clear documentation, as some descriptions lacked sufficient detail to convey intent. Second, the LLM identified the majority of original descriptions as "Example" (79.75%), which agrees with the majority category in our manual classification and shows a propensity for generalization. Third, compared to the originals, the generated descriptions had an average similarity score of 0.7173, suggesting relevance but room for improvement. Scores below 0.9 indicate some irrelevance. Our results show that, depending on the task of the code snippet, the intention of the document may vary among usage instructions, installation instructions, and descriptive learning examples for any user of a library.
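The abstract does not say how the similarity score was computed. As one hypothetical stand-in, the sketch below scores an original description against a generated one with bag-of-words cosine similarity, which also ranges from 0 to 1 for word counts. The tokenizer and function names are assumptions made for illustration and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lower-case the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    original = "Install the package and import it in your project"
    generated = "Import the package into your project after you install it"
    print(round(cosine_similarity(original, generated), 4))
```

A count-based measure like this ignores word order and synonyms, so an embedding-based score would likely behave differently on paraphrased descriptions.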

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2506.15453

Country:

Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.05)
Asia > Indonesia (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories

Xiao, Yijia, Wang, Runhui, Kong, Luyang, Golac, Davor, Wang, Wei

arXiv.org Artificial Intelligence | Feb-11-2025

The increasing complexity of computer science research projects demands more effective tools for deploying code repositories. Large Language Models (LLMs), such as Anthropic Claude and Meta Llama, have demonstrated significant advancements across various fields of computer science research, including the automation of diverse software engineering tasks. To evaluate the effectiveness of LLMs in handling complex code development tasks of research projects, particularly for NLP/CV/AI/ML/DM topics, we introduce CSR-Bench, a benchmark for Computer Science Research projects. This benchmark assesses LLMs from various aspects including accuracy, efficiency, and deployment script quality, aiming to explore their potential in conducting computer science research autonomously. We also introduce a novel framework, CSR-Agents, that utilizes multiple LLM agents to automate the deployment of GitHub code repositories of computer science research projects. Specifically, by checking instructions from markdown files and interpreting repository structures, the model generates and iteratively improves bash commands that set up the experimental environments and deploy the code to conduct research tasks. Preliminary results from CSR-Bench indicate that LLM agents can significantly enhance the workflow of repository deployment, thereby boosting developer productivity and improving the management of developmental workflows.
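The generate-and-improve loop described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the CSR-Agents implementation: `propose` stands in for an LLM agent that drafts or revises a bash command, `run` stands in for whatever executes the command, and both names, as well as the attempt limit, are hypothetical.

```python
from typing import Callable, Optional, Tuple

# (exit_code, combined_output) as returned by the command executor.
RunResult = Tuple[int, str]


def refine_until_success(
    propose: Callable[[Optional[str], Optional[str]], str],
    run: Callable[[str], RunResult],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first command that exits with code 0, or None if none does.

    `propose(previous_command, error_output)` drafts a command; on the first
    call both arguments are None. Later calls receive the failed command and
    its output so the proposer can revise it.
    """
    command: Optional[str] = None
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        command = propose(command, feedback)
        exit_code, output = run(command)
        if exit_code == 0:
            return command
        feedback = output  # feed the error log into the next revision
    return None
```

Passing the executor in as a callable keeps the loop testable without touching a real shell; in practice `run` could wrap `subprocess.run` inside a sandboxed environment.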

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2502.06111

Country: North America > United States > California > Los Angeles County > Los Angeles (0.14)

Genre:

Research Report (0.82)
Workflow (0.55)

Industry:

Information Technology (0.68)
Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Free and Customizable Code Documentation with LLMs: A Fine-Tuning Approach

Chakrabarty, Sayak, Pal, Souradip

arXiv.org Artificial Intelligence | Dec-1-2024

Automated documentation of programming source code is a challenging task with significant practical and scientific implications for the developer community. We present a large language model (LLM)-based application that developers can use as a support tool to generate basic documentation for any publicly available repository. Over the last decade, several papers have been written on generating documentation for source code using neural network architectures. With the recent advancements in LLM technology, some open-source applications have been developed to address this problem. However, these applications typically rely on the OpenAI APIs, which incur substantial financial costs, particularly for large repositories. Moreover, none of these open-source applications offers a fine-tuned model or features that enable users to fine-tune one. Additionally, finding suitable data for fine-tuning is often challenging. Our application, available at https://pypi.org/project/readme-ready/, addresses these issues.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.00726

Country:

North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
North America > United States > Illinois > Cook County > Evanston (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.37)

Read between the lines -- Functionality Extraction From READMEs

Kumar, Prince, Tamilselvam, Srikanth, Garg, Dinesh

arXiv.org Artificial Intelligence | Mar-15-2024

While text summarization is a well-known NLP task, in this paper we introduce a novel and useful variant of it called functionality extraction from Git README files. Though this task is text2text generation at an abstract level, it has its own peculiarities and challenges that make existing text2text generation systems not very useful. The motivation behind this task stems from a recent surge in research and development activities around the use of large language models for code-related tasks, such as code refactoring and code summarization. We also release a human-annotated dataset called FuncRead and develop a battery of models for the task. Our exhaustive experimentation shows that small fine-tuned models beat any baseline models that can be designed using popular black-box or white-box large language models (LLMs) such as ChatGPT and Bard. Our best fine-tuned 7-billion-parameter CodeLlama model exhibits 70% and 20% gains in F1 score over ChatGPT and Bard, respectively.

functionality, readme file, starcoderbase-1b 0, (15 more...)

arXiv.org Artificial Intelligence

2403.10205

Country: Asia (0.04)

Genre: Research Report (0.50)

Industry: Energy (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.75)

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

Liu, Yuliang, Tang, Xiangru, Cai, Zefan, Lu, Junjie, Zhang, Yichi, Shao, Yanjun, Deng, Zexuan, Hu, Helan, Yang, Zengxian, An, Kaikai, Huang, Ruijun, Si, Shuzheng, Chen, Sheng, Zhao, Haozhe, Li, Zhengliang, Chen, Liang, Zong, Yiming, Wang, Yan, Liu, Tianyu, Jiang, Zhiwei, Chang, Baobao, Qin, Yujia, Zhou, Wangchunshu, Zhao, Yilun, Cohan, Arman, Gerstein, Mark

arXiv.org Artificial Intelligence | Nov-16-2023

Large language models have shown promising performance in code generation benchmarks. However, a considerable divide exists between these benchmark achievements and their practical applicability, primarily attributed to real-world programming's reliance on pre-existing libraries. Instead of evaluating how LLMs code from scratch, this work proposes a new evaluation setup in which LLMs use open-source libraries to finish machine learning tasks. Therefore, we propose ML-Bench, an expansive benchmark developed to assess the effectiveness of LLMs in leveraging existing functions in open-source libraries. It consists of 10044 samples spanning 130 tasks over 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked with generating code to accomplish the task. This requires comprehending long documents that interleave language and code, as well as understanding complex cross-file code structures, which introduces new challenges. Notably, while GPT-4 exhibits remarkable improvement over other LLMs, it manages to accomplish only 39.73% of the tasks, leaving considerable room for improvement. We address these challenges by proposing ML-Agent, designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, results in further improvements. Code, data, and models are available at https://ml-bench.github.io/.

arxiv preprint arxiv, language model leverage open-source library, repository, (9 more...)

arXiv.org Artificial Intelligence

2311.09835

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

An End-to-End System for Reproducibility Assessment of Source Code Repositories via Their Readmes

Akdeniz, Eyüp Kaan, Tekir, Selma, Hinnawi, Malik Nizar Asad Al

arXiv.org Artificial Intelligence | Oct-14-2023

Increased reproducibility of machine learning research has been a driving force for dramatic improvements in learning performance. The scientific community further fosters this effort by including reproducibility ratings in reviewer forms and considering them a crucial factor in the overall evaluation of papers. Accompanying source code is not sufficient to make a work reproducible. The shared code should meet the ML reproducibility checklist as well. This work aims to support reproducibility evaluations of papers with source code. We propose an end-to-end system that operates on the Readme file of a source code repository. The system checks the compliance of a given Readme with a template proposed by a widely used platform for sharing research source code. Our system generates scores based on a custom function that combines section scores. We also train a hierarchical transformer model to assign a class label to a given Readme. The experimental results show that the section similarity-based system performs better than the hierarchical transformer. Moreover, it has an advantage regarding explainability, since one can directly relate the score to the sections of Readme files.
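One way to picture a section similarity-based score is the sketch below, which matches a Readme's headings against a template and combines the per-section scores with a weighted sum. Everything specific here is an assumption made for illustration: the template section names, the weights, and the use of `difflib.SequenceMatcher` as the similarity measure are hypothetical and are not the template or the custom function used in the paper.

```python
import re
from difflib import SequenceMatcher

# Hypothetical template sections and weights (weights sum to 1.0).
TEMPLATE = {
    "requirements": 0.3,
    "training": 0.3,
    "evaluation": 0.2,
    "results": 0.2,
}


def headings(readme: str) -> list[str]:
    """Lower-cased texts of Markdown ATX headings ('# Title', '## Usage')."""
    found = re.findall(r"^#{1,6}[ \t]+(.+)$", readme, flags=re.MULTILINE)
    return [text.strip().lower() for text in found]


def section_score(name: str, found: list[str]) -> float:
    """Best fuzzy match in [0, 1] between a template section and any heading."""
    return max((SequenceMatcher(None, name, h).ratio() for h in found), default=0.0)


def readme_score(readme: str) -> float:
    """Weighted sum of per-section scores; 1.0 means every section matched."""
    found = headings(readme)
    return sum(weight * section_score(name, found) for name, weight in TEMPLATE.items())


if __name__ == "__main__":
    sample = "# My Paper\n\n## Requirements\n\n## Training\n"
    print(round(readme_score(sample), 3))
```

Because the total is a sum of named section scores, a low value can be traced back to the sections that are missing, which mirrors the explainability advantage mentioned in the abstract.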

consecutive 0, readme file, reproducibility, (14 more...)

arXiv.org Artificial Intelligence

2310.09634

Country:

Asia > Middle East > Republic of Türkiye > İzmir Province > İzmir (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Automatic Analysis of Available Source Code of Top Artificial Intelligence Conference Papers

Lin, Jialiang, Wang, Yingmin, Yu, Yao, Zhou, Yu, Chen, Yidong, Shi, Xiaodong

arXiv.org Artificial Intelligence | Sep-28-2022

Source code is essential for researchers to reproduce the methods and replicate the results of artificial intelligence (AI) papers. Some organizations and researchers manually collect AI papers with available source code to contribute to the AI community. However, manual collection is a labor-intensive and time-consuming task. To address this issue, we propose a method to automatically identify papers with available source code and extract their source code repository URLs. With this method, we find that 20.5% of regular papers of 10 top AI conferences published from 2010 to 2019 are identified as papers with available source code and that 8.1% of these source code repositories are no longer accessible. We also create the XMU NLP Lab README Dataset, the largest dataset of labeled README files for source code document research. Through this dataset, we have discovered that quite a few README files have no installation instructions or usage tutorials provided. Further, a large-scale comprehensive statistical analysis is made for a general picture of the source code of AI conference papers. The proposed solution can also go beyond AI conference papers to analyze other scientific papers from both journals and conferences to shed light on more domains.
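The URL extraction step can be illustrated with a small regular-expression sketch. This is not the authors' method: the list of hosting sites, the pattern, and the function name are assumptions made for illustration, and a real pipeline would also need to handle URLs broken across lines in PDF text and check that each link still resolves.

```python
import re

# Hypothetical pattern: host, owner, and repository name on common code hosts.
REPO_URL = re.compile(
    r"https?://(?:www\.)?(github\.com|gitlab\.com|bitbucket\.org)/([\w.-]+)/([\w.-]+)",
    re.IGNORECASE,
)


def extract_repo_urls(text: str) -> list[str]:
    """Return normalized repository URLs in order of first appearance."""
    urls: list[str] = []
    for host, owner, repo in REPO_URL.findall(text):
        repo = repo.rstrip(".")  # drop a sentence-ending period
        if repo.endswith(".git"):
            repo = repo[: -len(".git")]  # drop the clone suffix
        if not repo:
            continue
        url = f"https://{host.lower()}/{owner}/{repo}"
        if url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    paper_text = (
        "Our code is available at https://github.com/foo/bar. "
        "A mirror is hosted at http://github.com/foo/bar.git for convenience."
    )
    print(extract_repo_urls(paper_text))  # ['https://github.com/foo/bar']
```

Normalizing the scheme, host case, and `.git` suffix before deduplicating keeps one entry per repository even when a paper cites it in several forms.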

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1142/S0218194022500358

2209.14155

Country:

Asia > China > Fujian Province > Xiamen (0.04)
Asia > Taiwan (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Automatically Categorising GitHub Repositories by Application Domain

Zanartu, Francisco, Treude, Christoph, Cartaxo, Bruno, Borges, Hudson Silva, Moura, Pedro, Wagner, Markus, Pinto, Gustavo

arXiv.org Artificial Intelligence | Jul-30-2022

For example, there are limited means available to separate repositories containing engineered software projects from other repositories, such as personal projects or those that use GitHub for free cloud storage (Kalliamvakou et al., 2014; Munaiah et al., 2017). To make it easier for users to identify relevant repositories for their wide variety of use cases, GitHub has been adding features to its service, such as README files, topics tags, and showcases (where contributors describe, add keywords, and label their repository). However, these features are insufficient for many use cases. For example, while achieving generalizability of the results is the primary objective of many empirical papers, modern computing research is largely application domain independent (Capiluppi et al., 2020). Application domains are the sections of reality for which a software system is designed. Their importance relies on their serving as the starting point for actual state analysis and usually includes domain-specific language, meaning that developers in this domain think about their project in a specific way, with particular terms and concepts (Züllighoven, 2004). Application domains are not a feature currently implemented by GitHub to catalogue repositories. Previous work has found that repository quality indicators, such as object-oriented metrics, can be "extremely sensitive to application domains" (Capiluppi and Ajienka, 2019), and that the application domain is an important factor in predicting repository popularity (Borges et al., 2016). Furthermore, since documentation of GitHub repositories is often incomplete (Prana et al., 2019), information about the application domain of a repository can be crucial to gain a high-level understanding of its content and purpose.

machine learning, natural language, programming language, (17 more...)

arXiv.org Artificial Intelligence

2208.00269

Country:

South America > Brazil > Pernambuco (0.04)
Oceania > Australia > South Australia > Adelaide (0.04)
South America > Brazil > Pará (0.04)
(4 more...)

Genre: Research Report > New Finding (0.94)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

Terrasa

AAAI Conferences | Feb-8-2022, 11:12:32 GMT

Nowadays, most of the code hosting platforms for open-source projects consider the README file as the project cover. As it is the first piece of documentation seen by the project user or maintainer, such a document needs to be crafted with care. Documentation assist can be a useful tool to help documentation writers produce better documentation like README files. In this paper, we show how an abstract representation of a README file can help documentation assist tools provide better suggestions to writers. Our approach benefits from natural language processing tools and techniques to analyze the content of a README file. Using this model and the current cursor position within the document, our tool can suggest pieces of documentation, examples, and figures as well as structure improvements and update suggestions to the writer. Suggestions are presented as cards that can be selected to automatically enhance the document under writing.
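A very reduced version of the idea of an abstract README representation combined with the cursor position might look like the sketch below. It is not the tool described in the paper: the `Section` structure, the list of expected sections, and all function names are hypothetical, and the parser only understands Markdown `#` headings (it would also pick up `#` comment lines inside fenced code blocks).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Section:
    level: int  # number of leading '#' characters
    title: str
    start: int  # 1-based line number of the heading


def parse_sections(readme: str) -> list[Section]:
    """Build a flat outline of the README from its Markdown headings."""
    sections = []
    for number, line in enumerate(readme.splitlines(), start=1):
        match = re.match(r"(#{1,6})[ \t]+(.+)", line)
        if match:
            sections.append(Section(len(match.group(1)), match.group(2).strip(), number))
    return sections


def section_at(sections: list[Section], cursor_line: int) -> Optional[Section]:
    """Return the section whose heading most recently precedes the cursor."""
    current = None
    for section in sections:
        if section.start > cursor_line:
            break
        current = section
    return current


def missing_sections(sections, expected=("Installation", "Usage", "License")):
    """List expected section titles (a hypothetical set) absent from the outline."""
    present = {section.title.lower() for section in sections}
    return [name for name in expected if name.lower() not in present]
```

A suggestion engine could use `section_at` to decide which kind of card is relevant where the writer is typing, and `missing_sections` to propose structural additions.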

documentation, readme file, terrasa, (1 more...)

AAAI Conferences

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)
