 Machine Translation


Self-Training Vision Language BERTs with a Unified Conditional Model

arXiv.org Artificial Intelligence

Abstract: Natural language BERTs are trained with language corpus in a self-supervised manner. [...] Large-scale pretraining has become the dominating approach in various natural language processing tasks. The success of large-scale pretraining is due to the large amount of language training data available everywhere and the self-training algorithm. In this paper, we propose a self-training approach that allows pretraining VL-BERTs using unlabeled image data. Self-training is usually done by iterating the following three steps: 1) training with labeled data, 2) generating pseudo labels for unlabeled data, 3) mixing the labeled data and unlabeled [...] has shown its effectiveness in various tasks [4], [5], how to use it effectively in training vision language BERTs is not yet studied. However, the self-training of vision language BERTs is nontrivial due to the following reasons. First, although auto-encoding models [...] setting. Although these models can be finetuned to perform [...] Second, current common practice in vision language BERT pretraining uses various image descriptions to train, such as [...] Those image descriptions have significant differences, making it difficult for an unconditional model to learn to generate adequate pseudo captions for unlabeled images. Figure caption: An example of generated image descriptions. Given different condition flags, our proposed UCM model is able to generate diverse image descriptions, such as COCO captions, dense captions, and questions. The generated contents clearly have different styles.
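The three-step self-training loop named in this abstract (train on labeled data, generate pseudo labels, retrain on the mix) can be sketched generically. This is a minimal illustration, not the paper's UCM model: a toy 1-D nearest-centroid classifier stands in for the vision language BERT, and the names `fit_centroids`, `predict`, and `self_train` are hypothetical.

```python
from statistics import mean

def fit_centroids(points, labels):
    """Train step: the model is just the mean of each class."""
    return {c: mean(p for p, l in zip(points, labels) if l == c)
            for c in set(labels)}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda c: abs(centroids[c] - x))

def self_train(labeled_x, labeled_y, unlabeled_x, rounds=3):
    # Step 1: train a teacher on labeled data only.
    model = fit_centroids(labeled_x, labeled_y)
    for _ in range(rounds):
        # Step 2: the current model generates pseudo labels for unlabeled data.
        pseudo_y = [predict(model, x) for x in unlabeled_x]
        # Step 3: mix labeled and pseudo-labeled data and train a student,
        # which becomes the teacher for the next round.
        model = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)
    return model

model = self_train([0.0, 1.0, 9.0, 10.0], ["a", "a", "b", "b"],
                   [2.0, 3.0, 7.0, 8.0])
print(model)
```

In this toy setting the unlabeled points pull each centroid toward the middle of its cluster; a real system would typically also filter low-confidence pseudo labels, which the sketch omits.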


JCSE: Contrastive Learning of Japanese Sentence Embeddings and Its Applications

arXiv.org Artificial Intelligence

Contrastive learning is widely used for sentence representation learning. Despite this prevalence, most studies have focused exclusively on English and few concern domain adaptation for domain-specific downstream tasks, especially for low-resource languages like Japanese, which are characterized by insufficient target domain data and the lack of a proper training strategy. To overcome this, we propose a novel Japanese sentence representation framework, JCSE (derived from "Contrastive learning of Sentence Embeddings for Japanese"), that creates training data by generating sentences and synthesizing them with sentences available in a target domain. Specifically, a pre-trained data generator is finetuned to a target domain using our collected corpus. It is then used to generate contradictory sentence pairs that are used in contrastive learning for adapting a Japanese language model to a specific task in the target domain. Another problem of Japanese sentence representation learning is the difficulty of evaluating existing embedding methods due to the lack of benchmark datasets. Thus, we establish a comprehensive Japanese Semantic Textual Similarity (STS) benchmark on which various embedding models are evaluated. Based on this benchmark result, multiple embedding methods are chosen and compared with JCSE on two domain-specific tasks, STS in a clinical domain and information retrieval in an educational domain. The results show that JCSE achieves significant performance improvement surpassing direct transfer and other training strategies. This empirically demonstrates JCSE's effectiveness and practicability for downstream tasks of a low-resource language.
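To make the role of contradictory sentence pairs concrete, here is a minimal sketch of a contrastive (InfoNCE-style) loss in which a contradictory sentence acts as a hard negative. The abstract does not state JCSE's exact objective, so the loss form, the temperature value, and the toy 2-D "embeddings" below are all assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: low when the anchor is closer to the positive
    than to every negative (e.g. a generated contradictory sentence)."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

anchor = [1.0, 0.0]
near, far = [0.9, 0.1], [0.0, 1.0]

# Positive is near the anchor, contradictory negative is far: small loss.
good = contrastive_loss(anchor, near, [far])
# Roles swapped: the embedding space is wrong for this pair, large loss.
bad = contrastive_loss(anchor, far, [near])
print(good, bad)
```

Minimizing such a loss pushes the encoder to separate each sentence from its contradictory counterpart, which is why hard negatives from a domain-tuned generator can help adaptation.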


Language Embeddings Sometimes Contain Typological Generalizations

arXiv.org Artificial Intelligence

To what extent can neural network models learn generalizations about language structure, and how do we find out what they have learned? We explore these questions by training neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1295 languages. The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features obtained through annotation projection. We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most of our models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations. Careful attention to details in the evaluation turns out to be essential to avoid false positives. Furthermore, to encourage continued work in this field, we release several resources covering most or all of the languages in our data: (i) multiple sets of language representations, (ii) multilingual word embeddings, (iii) projected and predicted syntactic and morphological features, (iv) software to provide linguistically sound evaluations of language representations.


Prompting Large Language Model for Machine Translation: A Case Study

arXiv.org Artificial Intelligence

Research on prompting has shown excellent performance with little or even no supervised training across many tasks. However, prompting for machine translation is still under-explored in the literature. We fill this gap by offering a systematic study on prompting strategies for translation, examining various factors for prompt template and demonstration example selection. We further explore the use of monolingual data and the feasibility of cross-lingual, cross-domain, and sentence-to-document transfer learning in prompting. Extensive experiments with GLM-130B (Zeng et al., 2022) as the testbed show that 1) the number and the quality of prompt examples matter, where using suboptimal examples degrades translation; 2) several features of prompt examples, such as semantic similarity, show significant Spearman correlation with their prompting performance; yet, none of the correlations are strong enough; 3) using pseudo parallel prompt examples constructed from monolingual data via zero-shot prompting could improve translation; and 4) improved performance is achievable by transferring knowledge from prompt examples selected in other settings. We finally provide an analysis on the model outputs and discuss several problems that prompting still suffers from.
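The two factors the abstract examines, prompt template and demonstration example selection, can be illustrated with a small sketch. The template format (`English: ... / German: ...`) and the token-overlap similarity used for selection are assumptions chosen for brevity; they are not taken from the paper, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity, a crude stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_demos(source, pool, k=2):
    """Pick the k (source, target) pairs whose source side is most similar."""
    return sorted(pool, key=lambda pair: jaccard(source, pair[0]),
                  reverse=True)[:k]

def build_prompt(source, demos, src_lang="English", tgt_lang="German"):
    """Few-shot template: demonstrations first, then the open test input."""
    lines = []
    for src, tgt in demos:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)

pool = [
    ("The cat sleeps.", "Die Katze schläft."),
    ("I like green tea.", "Ich mag grünen Tee."),
    ("The dog sleeps.", "Der Hund schläft."),
]
demos = select_demos("The cat eats.", pool, k=2)
prompt = build_prompt("The cat eats.", demos)
print(prompt)
```

The resulting string would be passed to a language model, which is expected to continue after the final `German:` line; with `k=0` the same template yields a zero-shot prompt.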


HanoiT: Enhancing Context-aware Translation via Selective Context

arXiv.org Artificial Intelligence

Context-aware neural machine translation aims to use the document-level context to improve translation quality. However, not all words in the context are helpful. The irrelevant or trivial words may bring some noise and distract the model from learning the relationship between the current sentence and the auxiliary context. To mitigate this problem, we propose a novel end-to-end encoder-decoder model with a layer-wise selection mechanism to sift and refine the long document context. To verify the effectiveness of our method, extensive experiments and extra quantitative analysis are conducted on four document-level machine translation benchmarks. The experimental results demonstrate that our model significantly outperforms previous models on all datasets via the soft selection mechanism.


Learning a Formality-Aware Japanese Sentence Representation

arXiv.org Artificial Intelligence

While the way intermediate representations are generated in encoder-decoder sequence-to-sequence models typically allows them to preserve the semantics of the input sentence, input features such as formality might be left out. On the other hand, downstream tasks such as translation would benefit from working with a sentence representation that preserves formality in addition to semantics, so as to generate sentences with the appropriate level of social formality (for example, the difference between speaking to a friend and speaking with a supervisor). We propose a sequence-to-sequence method for learning a formality-aware representation for Japanese sentences, where sentence generation is conditioned on both the original representation of the input sentence and a side constraint which guides the sentence representation towards preserving formality information. Additionally, we propose augmenting the sentence representation with a learned representation of formality which facilitates the extraction of formality in downstream tasks. We address the lack of formality-annotated parallel data by adapting previous works on procedural formality classification of Japanese sentences. Experimental results suggest that our techniques not only help the decoder recover the formality of the input sentence, but also slightly improve the preservation of input sentence semantics.


An A.I. Translation Tool Can Help Save Dying Languages. But at What Cost?

Slate

Sanjib Chaudhary chanced upon StoryWeaver, a multilingual children's storytelling platform, while searching for books he could read to his 7-year-old daughter. Chaudhary's mother tongue is Kochila Tharu, a language with about 250,000 speakers in eastern Nepal. Languages with a relatively small number of speakers, like Kochila Tharu, do not have enough digitized material for linguistic communities to thrive: no Google Translate, no film or television subtitles, no online newspapers. In industry parlance, these languages are "underserved" and "underresourced." This is where StoryWeaver comes in.


XNLI 2.0: Improving XNLI dataset and performance on Cross Lingual Understanding (XLU)

arXiv.org Artificial Intelligence

Natural Language Processing systems are heavily dependent on the availability of annotated data to train practical models. Primarily, models are trained on English datasets. In recent times, significant advances have been made in multilingual understanding due to the steeply increasing necessity of working in different languages. One point that stands out is that, since there are now so many pre-trained multilingual models, we can utilize them for cross-lingual understanding tasks. Using cross-lingual understanding and Natural Language Inference, it is possible to train models whose applications extend beyond the training language. We can leverage the power of machine translation to skip the tiresome part of translating datasets from one language to another. In this work, we focus on improving the original XNLI dataset by re-translating the MNLI dataset into all 14 different languages present in XNLI, including the test and dev sets of XNLI, using Google Translate. We also perform experiments by training models in all 15 languages and analyzing their performance on the task of natural language inference. We then expand our boundary to investigate whether we could improve performance in low-resource languages such as Swahili and Urdu by training models in languages other than English.


NLP Startup Funding in 2022

#artificialintelligence

It's no secret that the commercial application of NLP technologies has exploded in recent years. From chatbots and virtual assistants to machine translation and sentiment analysis, NLP technologies are now being used in a wide variety of applications across a range of industries. With the increasing demand for technologies that can process human language, investors have been eager to get a piece of the action. In this article, we look at NLP start-up funding over the past year, identifying the applications and domains that have received investment. A version of this article will appear in the Journal of Natural Language Engineering in early 2023.


Music Playlist Title Generation Using Artist Information

arXiv.org Artificial Intelligence

Automatically generating or captioning music playlist titles given a set of tracks is of significant interest in music streaming services as customized playlists are widely used in personalized music recommendation, and well-composed text titles attract users and help their music discovery. We present an encoder-decoder model that generates a playlist title from a sequence of music tracks. While previous work takes track IDs as tokenized input for playlist title generation, we use artist IDs corresponding to the tracks to mitigate the issue from the long-tail distribution of tracks included in the playlist dataset. Also, we introduce a chronological data split method to deal with newly-released tracks in real-world scenarios. Comparing the track IDs and artist IDs as input sequences, we show that the artist-based approach significantly enhances the performance in terms of word overlap, semantic relevance, and diversity.
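A chronological data split, as mentioned in this abstract, holds out the most recent playlists so that evaluation includes tracks unseen at training time. The sketch below is one plausible reading: the `created` field, the example records, and the 75/25 ratio are hypothetical and not taken from the paper.

```python
from datetime import date

def chronological_split(playlists, train_frac=0.75):
    """Sort by creation date; the oldest fraction trains, the newest tests."""
    ordered = sorted(playlists, key=lambda p: p["created"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

playlists = [
    {"title": "late night drive", "created": date(2021, 6, 1)},
    {"title": "morning coffee", "created": date(2019, 3, 15)},
    {"title": "summer hits", "created": date(2020, 7, 4)},
    {"title": "study focus", "created": date(2018, 1, 20)},
]
train, test = chronological_split(playlists)
print([p["title"] for p in train], [p["title"] for p in test])
```

Unlike a random split, no test playlist predates a training playlist here, which mimics deployment, where a model trained on past data must title playlists containing newly released tracks.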