AITopics

This practical experience report explores Neural Machine Translation (NMT) models' capability to generate offensive security code from natural language (NL) descriptions, highlighting the significance of contextual understanding and its impact on model performance. Our study employs a dataset comprising real shellcodes to evaluate the models across various scenarios, including missing information, necessary context, and unnecessary context. The experiments are designed to assess the models' resilience against incomplete descriptions, their proficiency in leveraging context for enhanced accuracy, and their ability to discern irrelevant information. The findings reveal that the introduction of contextual data significantly improves performance. However, the benefits of additional context diminish beyond a certain point, indicating an optimal level of contextual information for model training. Moreover, the models demonstrate an ability to filter out unnecessary context, maintaining high levels of accuracy in the generation of offensive security code. This study paves the way for future research on optimizing context use in AI-driven code generation, particularly for applications requiring a high degree of technical precision such as the generation of offensive code.

information, large language model, machine learning, (21 more...)

2408.02402

Country:

Europe > Denmark > Capital Region > Copenhagen (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(6 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Deshpande, Tejas, Kowtal, Nidhi, Joshi, Raviraj

Chain-of-Translation Prompting (CoTR): A Novel Prompting Technique for Low Resource Languages

This paper introduces Chain of Translation Prompting (CoTR), a novel strategy designed to enhance the performance of language models in low-resource languages. CoTR restructures prompts to first translate the input context from a low-resource language into a higher-resource language, such as English. The specified task like generation, classification, or any other NLP function is then performed on the translated text, with the option to translate the output back to the original language if needed. All these steps are specified in a single prompt. We demonstrate the effectiveness of this method through a case study on the low-resource Indic language Marathi. The CoTR strategy is applied to various tasks, including sentiment analysis, hate speech classification, subject classification and text generation, and its efficacy is showcased by comparing it with regular prompting methods. Our results underscore the potential of translation-based prompting strategies to significantly improve multilingual LLM performance in low-resource languages, offering valuable insights for future research and applications. We specifically see the highest accuracy improvements with the hate speech detection task. The technique also has the potential to enhance the quality of synthetic data generation for underrepresented languages using LLMs.

dataset, low-resource language, proceedings, (14 more...)

2409.04512

Country:

Asia > Singapore (0.04)
Asia > Indonesia > Bali (0.04)
North America > United States (0.04)
(4 more...)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)

Chizhov, Pavel, Arnett, Catherine, Korotkova, Elizaveta, Yamshchikov, Ivan P.

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Language models can largely benefit from efficient tokenization. However, they still mostly utilize the classical BPE algorithm, a simple and reliable method. This has been shown to cause such issues as under-trained tokens and sub-optimal compression that may affect the downstream performance. We introduce Picky BPE, a modified BPE algorithm that carries Figure 1: An example of a series of merges to produce a out vocabulary refinement during tokenizer token Kentucky. The pre-merge token frequencies are training. Our method improves vocabulary efficiency, demonstrated in corresponding circles. In the vanilla eliminates under-trained tokens, and BPE algorithm, entucky should also be stored in the does not compromise text compression. Our vocabulary, whereas it is redundant after the merge. In experiments show that our method does not this example, the IoS metric effectively captures the reduce the downstream performance, and in intermediate token, as IoS(entucky) T = 0.9.

computational linguistic, picky bpe, tokenizer, (13 more...)

2409.04599

Country:

North America > United States > Kentucky (0.24)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore (0.04)
(22 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Mamasaidov, Mukhammadsaid, Shopulatov, Abror

Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak

This study presents several contributions for the Karakalpak language: a FLORES+ devtest dataset translated to Karakalpak, parallel corpora for Uzbek-Karakalpak, Russian-Karakalpak and English-Karakalpak of 100,000 pairs each and open-sourced fine-tuned neural models for translation across these languages. Our experiments compare different model variants and training approaches, demonstrating improvements over existing baselines. This work, conducted as part of the Open Language Data Initiative (OLDI) shared task, aims to advance machine translation capabilities for Karakalpak and contribute to expanding linguistic diversity in NLP technologies.

dataset, karakalpak, translation, (13 more...)

2409.04269

Country:

North America > United States > Florida (0.04)
Europe > Belgium (0.04)
Asia > Uzbekistan > Karakalpakstan (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Heo, DongNyeong, Rim, Daniela Noemi, Choi, Heeyoul

N-gram Prediction and Word Difference Representations for Language Modeling

arXiv.org Artificial IntelligenceSep-5-2024

Causal language modeling (CLM) serves as the foundational framework underpinning remarkable successes of recent large language models (LLMs). Despite its success, the training approach for next word prediction poses a potential risk of causing the model to overly focus on local dependencies within a sentence. While prior studies have been introduced to predict future N words simultaneously, they were primarily applied to tasks such as masked language modeling (MLM) and neural machine translation (NMT). In this study, we introduce a simple N-gram prediction framework for the CLM task. Moreover, we introduce word difference representation (WDR) as a surrogate and contextualized target representation during model training on the basis of N-gram prediction framework. To further enhance the quality of next word prediction, we propose an ensemble method that incorporates the future N words' prediction results. Empirical evaluations across multiple benchmark datasets encompassing CLM and NMT tasks demonstrate the significant advantages of our proposed methods over the conventional CLM.

large language model, machine learning, natural language, (20 more...)

2409.03295

Country:

Asia > South Korea (0.04)
Europe > Belgium (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceSep-5-2024

Painful intelligence: What AI can tell us about human suffering

Hyvärinen, Aapo

This book uses the modern theory of artificial intelligence (AI) to understand human suffering or mental pain. Both humans and sophisticated AI agents process information about the world in order to achieve goals and obtain rewards, which is why AI can be used as a model of the human brain and mind. This book intends to make the theory accessible to a relatively general audience, requiring only some relevant scientific background. The book starts with the assumption that suffering is mainly caused by frustration. Frustration means the failure of an agent (whether AI or human) to achieve a goal or a reward it wanted or expected. Frustration is inevitable because of the overwhelming complexity of the world, limited computational resources, and scarcity of good data. In particular, such limitations imply that an agent acting in the real world must cope with uncontrollability, unpredictability, and uncertainty, which all lead to frustration. Fundamental in such modelling is the idea of learning, or adaptation to the environment. While AI uses machine learning, humans and animals adapt by a combination of evolutionary mechanisms and ordinary learning. Even frustration is fundamentally an error signal that the system uses for learning. This book explores various aspects and limitations of learning algorithms and their implications regarding suffering. At the end of the book, the computational theory is used to derive various interventions or training methods that will reduce suffering in humans. The amount of frustration is expressed by a simple equation which indicates how it can be reduced. The ensuing interventions are very similar to those proposed by Buddhist and Stoic philosophy, and include mindfulness meditation. Therefore, this book can be interpreted as an exposition of a computational theory justifying why such philosophies and meditation reduce human suffering.

buddhist meditation manual visuddhimagga, pattern analysis and machine intelligence, unpredictability and uncontrollability, (15 more...)

2205.15409

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.13)
North America > United States > New York (0.04)
(14 more...)

Genre:

Workflow (1.00)
Summary/Review (1.00)
Research Report (1.00)
(4 more...)

Industry:

Transportation (1.00)
Media (1.00)
Leisure & Entertainment > Games (1.00)
(13 more...)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(16 more...)

arXiv.org Artificial IntelligenceSep-4-2024

A Comparative Study of Pre-training and Self-training

Wang, Yiheng, Lin, Jiayu, Lin, Zuoquan

Pre-training and self-training are two approaches to semi-supervised learning. The comparison between pre-training and self-training has been explored. However, the previous works led to confusing findings: self-training outperforms pre-training experienced on some tasks in computer vision, and contrarily, pre-training outperforms self-training experienced on some tasks in natural language processing, under certain conditions of incomparable settings. We propose, comparatively and exhaustively, an ensemble method to empirical study all feasible training paradigms combining pre-training, self-training, and fine-tuning within consistent foundational settings comparable to data augmentation. We conduct experiments on six datasets, four data augmentation, and imbalanced data for sentiment analysis and natural language inference tasks. Our findings confirm that the pre-training and fine-tuning paradigm yields the best overall performances. Moreover, self-training offers no additional benefits when combined with semi-supervised pre-training.

large language model, machine learning, paradigm, (16 more...)

2409.02751

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Dominican Republic (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(7 more...)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

arXiv.org Artificial IntelligenceSep-4-2024

Creating Domain-Specific Translation Memories for Machine Translation Fine-tuning: The TRENCARD Bilingual Cardiology Corpus

Dogru, Gokhan

This article investigates how translation memories (TM) can be created by translators or other language professionals in order to compile domain-specific parallel corpora , which can then be used in different scenarios, such as machine translation training and fine-tuning, TM leveraging, and/or large language model fine-tuning. The article introduces a semi-automatic TM preparation methodology leveraging primarily translation tools used by translators in favor of data quality and control by the translators. This semi-automatic methodology is then used to build a cardiology-based Turkish -> English corpus from bilingual abstracts of Turkish cardiology journals. The resulting corpus called TRENCARD Corpus has approximately 800,000 source words and 50,000 sentences. Using this methodology, translators can build their custom TMs in a reasonable time and use them in their bilingual data requiring tasks.

translation, translator, website, (15 more...)

2409.02667

Country:

Asia > Middle East > Republic of Türkiye (0.14)
Europe > Netherlands > North Holland > Amsterdam (0.04)
North America > United States > New York (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Kowtal, Nidhi, Deshpande, Tejas, Joshi, Raviraj

A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations

arXiv.org Artificial IntelligenceSep-4-2024

Machine translation in low-resource language pairs faces significant challenges due to the scarcity of parallel corpora and linguistic resources. This study focuses on the case of English-Marathi language pairs, where existing datasets are notably noisy, impeding the performance of machine translation models. To mitigate the impact of data quality issues, we propose a data filtering approach based on cross-lingual sentence representations. Our methodology leverages a multilingual SBERT model to filter out problematic translations in the training data. Specifically, we employ an IndicSBERT similarity model to assess the semantic equivalence between original and translated sentences, allowing us to retain linguistically correct translations while discarding instances with substantial deviations. The results demonstrate a significant improvement in translation quality over the baseline post-filtering with IndicSBERT. This illustrates how cross-lingual sentence representations can reduce errors in machine translation scenarios with limited resources. By integrating multilingual sentence BERT models into the translation pipeline, this research contributes to advancing machine translation techniques in low-resource environments. The proposed method not only addresses the challenges in English-Marathi language pairs but also provides a valuable framework for enhancing translation quality in other low-resource language translation tasks.

dataset, machine translation, translation, (13 more...)

doi: 10.1109/I2CT61223.2024.10544178

2409.02712

Country: Asia > India > Maharashtra > Pune (0.05)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Shakil, Hassan, Farooq, Ahmad, Kalita, Jugal

Abstractive Text Summarization: State of the Art, Challenges, and Improvements

arXiv.org Artificial IntelligenceSep-3-2024

Specifically focusing on the landscape of abstractive text summarization, as opposed to extractive techniques, this survey presents a comprehensive overview, delving into state-of-the-art techniques, prevailing challenges, and prospective research directions. We categorize the techniques into traditional sequence-to-sequence models, pre-trained large language models, reinforcement learning, hierarchical methods, and multi-modal summarization. Unlike prior works that did not examine complexities, scalability and comparisons of techniques in detail, this review takes a comprehensive approach encompassing state-of-the-art methods, challenges, solutions, comparisons, limitations and charts out future improvements - providing researchers an extensive overview to advance abstractive summarization research. We provide vital comparison tables across techniques categorized - offering insights into model complexity, scalability and appropriate applications. The paper highlights challenges such as inadequate meaning representation, factual consistency, controllable text summarization, cross-lingual summarization, and evaluation metrics, among others. Solutions leveraging knowledge incorporation and other innovative strategies are proposed to address these challenges. The paper concludes by highlighting emerging research areas like factual inconsistency, domain-specific, cross-lingual, multilingual, and long-document summarization, as well as handling noisy data. Our objective is to provide researchers and practitioners with a structured overview of the domain, enabling them to better understand the current landscape and identify potential areas for further research and improvement.

abstractive text summarization, summarization, text summarization, (13 more...)

doi: 10.1016/j.neucom.2024.128255

2409.02413

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > India > NCT > New Delhi (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(13 more...)

Genre:

Research Report > Promising Solution (1.00)
Research Report > New Finding (1.00)
Overview (1.00)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(5 more...)