AITopics

2307.03131

Country:

Europe > Spain > Balearic Islands > Mallorca (0.08)
Asia > Vietnam > Hanoi > Hanoi (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Nie, Ercong, Liang, Sheng, Schmid, Helmut, Schütze, Hinrich

Cross-Lingual Retrieval Augmented Prompt for Low-Resource Languages

arXiv.org Artificial IntelligenceJul-10-2023

Multilingual Pretrained Language Models (MPLMs) have shown their strong multilinguality in recent empirical cross-lingual transfer studies. In this paper, we propose the Prompts Augmented by Retrieval Crosslingually (PARC) pipeline to improve the zero-shot performance on low-resource languages (LRLs) by augmenting the context with semantically similar sentences retrieved from a high-resource language (HRL) as prompts. PARC improves the zero-shot performance on three downstream tasks (binary sentiment classification, topic categorization and natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families in both unlabeled settings (+5.1%) and labeled settings (+16.3%). PARC-labeled also outperforms the finetuning baseline by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on one side, and the similarity between the high- and low-resource languages as well as the amount of low-resource pretraining data on the other side. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.

computational linguistic, large language model, machine learning, (22 more...)

2212.09651

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > Dominican Republic (0.04)
(8 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.87)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Avila, Jonathan E., Ward, Nigel G.

Towards cross-language prosody transfer for dialog

arXiv.org Artificial IntelligenceJul-9-2023

Speech-to-speech translation systems today do not adequately support use for dialog purposes. In particular, nuances of speaker intent and stance can be lost due to improper prosody transfer. We present an exploration of what needs to be done to overcome this. First, we developed a data collection protocol in which bilingual speakers re-enact utterances from an earlier conversation in their other language, and used this to collect an English-Spanish corpus, so far comprising 1871 matched utterance pairs. Second, we developed a simple prosodic dissimilarity metric based on Euclidean distance over a broad set of prosodic features. We then used these to investigate cross-language prosodic differences, measure the likely utility of three simple baseline models, and identify phenomena which will require more powerful modeling. Our findings should inform future research on cross-language prosody and the design of speech-to-speech translation systems capable of effective prosody transfer.

artificial intelligence, natural language, utterance, (17 more...)

2307.04123

Country:

North America > United States > Texas > El Paso County > El Paso (0.04)
North America > Mexico (0.04)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial IntelligenceJul-9-2023

A Cross-Linguistic Pressure for Uniform Information Density in Word Order

Clark, Thomas Hikaru, Meister, Clara, Pimentel, Tiago, Hahn, Michael, Cotterell, Ryan, Futrell, Richard, Levy, Roger

While natural languages differ widely in both canonical word order and word order flexibility, their word orders still follow shared cross-linguistic statistical patterns, often attributed to functional pressures. In the effort to identify these pressures, prior work has compared real and counterfactual word orders. Yet one functional pressure has been overlooked in such investigations: the uniform information density (UID) hypothesis, which holds that information should be spread evenly throughout an utterance. Here, we ask whether a pressure for UID may have influenced word order patterns cross-linguistically. To this end, we use computational models to test whether real orders lead to greater information uniformity than counterfactual orders. In our empirical study of 10 typologically diverse languages, we find that: (i) among SVO languages, real word orders consistently have greater uniformity than reverse word orders, and (ii) only linguistically implausible counterfactual orders consistently exceed the uniformity of real orders. These findings are compatible with a pressure for information uniformity in the development and usage of natural languages.

machine learning, natural language, word order, (17 more...)

2306.03734

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
(18 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)

Gladkoff, Serge, Han, Lifeng, Nenadic, Goran

Student's t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce

arXiv.org Artificial IntelligenceJul-9-2023

In natural language processing (NLP) we always rely on human judgement as the golden quality evaluation method. However, there has been an ongoing debate on how to better evaluate inter-rater reliability (IRR) levels for certain evaluation tasks, such as translation quality evaluation (TQE), especially when the data samples (observations) are very scarce. In this work, we first introduce the study on how to estimate the confidence interval for the measurement value when only one data (evaluation) point is available. Then, this leads to our example with two human-generated observational scores, for which, we introduce ``Student's \textit{t}-Distribution'' method and explain how to use it to measure the IRR score using only these two data points, as well as the confidence intervals (CIs) of the quality evaluation. We give quantitative analysis on how the evaluation confidence can be greatly improved by introducing more observations, even if only one extra observation. We encourage researchers to report their IRR scores in all possible means, e.g. using Student's \textit{t}-Distribution method whenever possible; thus making the NLP evaluation more meaningful, transparent, and trustworthy. This \textit{t}-Distribution method can be also used outside of NLP fields to measure IRR level for trustworthy evaluation of experimental investigations, whenever the observational data is scarce. Keywords: Inter-Rater Reliability (IRR); Scarce Observations; Confidence Intervals (CIs); Natural Language Processing (NLP); Translation Quality Evaluation (TQE); Student's \textit{t}-Distribution

artificial intelligence, evaluation, natural language, (17 more...)

2303.04526

Country:

Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (0.46)

Industry:

Health & Medicine (1.00)
Education (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

OmniForce: On Human-Centered, Large Model Empowered and Cloud-Edge Collaborative AutoML System

Xue, Chao, Liu, Wei, Xie, Shuai, Wang, Zhenfang, Li, Jiaxing, Peng, Xuyang, Ding, Liang, Zhao, Shanshan, Cao, Qiong, Yang, Yibo, He, Fengxiang, Cai, Bohua, Bian, Rongcheng, Zhao, Yiyan, Zheng, Heliang, Liu, Xiangyang, Liu, Dongkai, Liu, Daqing, Shen, Li, Li, Chang, Zhang, Shijin, Zhang, Yukang, Chen, Guanpu, Chen, Shixiang, Zhan, Yibing, Zhang, Jing, Wang, Chaoyue, Tao, Dacheng

Automated machine learning (AutoML) seeks to build ML models with minimal human effort. While considerable research has been conducted in the area of AutoML in general, aiming to take humans out of the loop when building artificial intelligence (AI) applications, scant literature has focused on how AutoML works well in open-environment scenarios such as the process of training and updating large models, industrial supply chains or the industrial metaverse, where people often face open-loop problems during the search process: they must continuously collect data, update data and models, satisfy the requirements of the development and deployment environment, support massive devices, modify evaluation metrics, etc. Addressing the open-environment issue with pure data-driven approaches requires considerable data, computing resources, and effort from dedicated data engineers, making current AutoML systems and platforms inefficient and computationally intractable. Human-computer interaction is a practical and feasible way to tackle the problem of open-environment AI. In this paper, we introduce OmniForce, a human-centered AutoML (HAML) system that yields both human-assisted ML and ML-assisted human techniques, to put an AutoML system into practice and build adaptive AI in open-environment scenarios. Specifically, we present OmniForce in terms of ML version management; pipeline-driven development and deployment collaborations; a flexible search strategy framework; and widely provisioned and crowdsourced application algorithms, including large models. Furthermore, the (large) models constructed by OmniForce can be automatically turned into remote services in a few minutes; this process is dubbed model as a service (MaaS). Experimental results obtained in multiple search spaces and real-world use cases demonstrate the efficacy and efficiency of OmniForce.

artificial intelligence, machine learning, natural language, (21 more...)

2303.00501

Country:

Asia > Middle East > Israel (0.14)
North America > United States (0.14)
Europe (0.14)

Genre:

Workflow (0.68)
Research Report > Promising Solution (0.45)

Industry:

Transportation > Ground > Road (1.00)
Information Technology > Security & Privacy (1.00)
Automobiles & Trucks (0.92)
(3 more...)

Technology:

Information Technology > Human Computer Interaction > Interfaces (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
(4 more...)

Goel, Arnav, Hira, Medha, Anand, Avinash, Bangar, Siddhesh, Shah, Dr. Rajiv Ratn

Advancements in Scientific Controllable Text Generation Methods

The previous work on controllable text generation is organized using a new schema we provide in this study. Seven components make up the schema, and each one is crucial to the creation process. To accomplish controlled generation for scientific literature, we describe the various modulation strategies utilised to modulate each of the seven components. We also offer a theoretical study and qualitative examination of these methods. This insight makes possible new architectures based on combinations of these components. Future research will compare these methods empirically to learn more about their strengths and utility.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

2307.05538

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia (0.04)

Genre: Research Report > Promising Solution (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Wanjawa, Barack, Wanzare, Lilian, Indede, Florence, McOnyango, Owen, Ombui, Edward, Muchemi, Lawrence

Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks

Indigenous African languages are categorized as under-served in Natural Language Processing. They therefore experience poor digital inclusivity and information access. The processing challenge with such languages has been how to use machine learning and deep learning models without the requisite data. The Kencorpus project intends to bridge this gap by collecting and storing text and speech data that is good enough for data-driven solutions in applications such as machine translation, question answering and transcription in multilingual communities. The Kencorpus dataset is a text and speech corpus for three languages predominantly spoken in Kenya: Swahili, Dholuo and Luhya. Data collection was done by researchers from communities, schools, media, and publishers. The Kencorpus' dataset has a collection of 5,594 items - 4,442 texts (5.6M words) and 1,152 speech files (177hrs). Based on this data, Part of Speech tagging sets for Dholuo and Luhya (50,000 and 93,000 words respectively) were developed. We developed 7,537 Question-Answer pairs for Swahili and created a text translation set of 13,400 sentences from Dholuo and Luhya into Swahili. The datasets are useful for downstream machine learning tasks such as model training and translation. We also developed two proof of concept systems: for Kiswahili speech-to-text and machine learning system for Question Answering task, with results of 18.87% word error rate and 80% Exact Match (EM) respectively. These initial results give great promise to the usability of Kencorpus to the machine learning community. Kencorpus is one of few public domain corpora for these three low resource languages and forms a basis of learning and sharing experiences for similar works especially for low resource languages.

artificial intelligence, machine learning, natural language, (21 more...)

2208.12081

Country:

Africa > East Africa (0.14)
Africa > Kenya > Nairobi City County > Nairobi (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
(20 more...)

Genre: Research Report (1.00)

Industry:

Education (1.00)
Media > News (0.93)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)
Health & Medicine > Therapeutic Area > Immunology (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(2 more...)

Zheng, Boyuan, Xia, Patrick, Yarmohammadi, Mahsa, Van Durme, Benjamin

Multilingual Coreference Resolution in Multiparty Dialogue

Existing multiparty dialogue datasets for entity coreference resolution are nascent, and many challenges are still unaddressed. We create a large-scale dataset, Multilingual Multiparty Coref (MMC), for this task based on TV transcripts. Due to the availability of gold-quality subtitles in multiple languages, we propose reusing the annotations to create silver coreference resolution data in other languages (Chinese and Farsi) via annotation projection. On the gold (English) data, off-the-shelf models perform relatively poorly on MMC, suggesting that MMC has broader coverage of multiparty coreference than prior datasets. On the silver data, we find success both using it for data augmentation and training from scratch, which effectively simulates the zero-shot cross-lingual setting.

computational linguistic, machine learning, natural language, (19 more...)

2208.01307

Country:

North America > Dominican Republic (0.05)
Europe > Slovenia (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(27 more...)

Genre: Research Report (0.82)

Industry:

Leisure & Entertainment (1.00)
Media (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.69)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

MIT Technology ReviewJul-7-2023, 11:30:00 GMT

AI-text detection tools are really easy to fool

Then each researcher wrote an additional text in Bosnian, Czech, German, Latvian, Slovak, Spanish, or Swedish. Those texts were passed through either the AI translation tool DeepL or Google Translate to translate them into English. The team then used ChatGPT to generate two additional texts each, which they slightly tweaked in an effort to hide that it'd been AI-generated. One set was edited manually by the researchers, who reordered sentences and exchanged words, while another was rewritten using an AI paraphrasing tool called Quillbot. In the end, they had 54 documents to test the detection tools on.

accuracy, additional text, ai-text detection tool

MIT Technology Review

Country: Oceania > Australia > South Australia (0.09)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.63)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.53)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.39)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.39)