 Machine Translation


AI 50: America's Most Promising Artificial Intelligence Companies

#artificialintelligence

The Covid-19 pandemic was devastating for many industries, but it only accelerated the use of artificial intelligence across the U.S. economy. Amid the crisis, companies scrambled to create new services for remote workers and students, beef up online shopping and dining options, make customer call centers more efficient and speed development of important new drugs. Even as applications of machine learning and perception platforms become commonplace, a thick layer of hype and fuzzy jargon clings to AI-enabled software. That makes it tough to identify the most compelling companies in the space--especially those finding new ways to use AI that create value by making humans more efficient, not redundant. With this in mind, Forbes has partnered with venture firms Sequoia Capital and Meritech Capital to create our third annual AI 50, a list of private, promising North American companies that are using artificial intelligence in ways that are fundamental to their operations. To be considered, businesses must be privately held and utilizing machine learning (where systems learn from data to improve on tasks), natural language processing (which enables programs to "understand" written or spoken language) or computer vision (which relates to how machines "see"). AI companies incubated at, largely funded through or acquired by large tech, manufacturing or industrial firms aren't eligible for consideration. Our list was compiled through a submission process open to any AI company in the U.S. and Canada. The application asked companies to provide details on their technology, business model, customers and financials like funding, valuation and revenue history (companies had the option to submit information confidentially, to encourage greater transparency). Forbes received several hundred entries, of which nearly 400 qualified for consideration.
From there, our data partners applied an algorithm to identify 100 companies with the highest quantitative scores--and that also made diversity a priority. Next, a panel of expert AI judges evaluated the finalists to find the 50 most compelling companies (they were precluded from judging companies in which they have a vested interest). Among trends this year are what Sequoia Capital's Konstantine Buhler calls AI workbench companies--those building platforms tailored to different enterprises, including Dataiku, DataRobot, Domino Data and Databricks.


Family of Origin and Family of Choice: Massively Parallel Lexiconized Iterative Pretraining for Severely Low Resource Machine Translation

arXiv.org Artificial Intelligence

We translate a closed text that is known in advance into a severely low resource language by leveraging massive source parallelism. In other words, given a text in 124 source languages, we translate it into a severely low resource language using only ~1,000 lines of low resource data without any external help. Firstly, we propose a systematic method to rank and choose source languages that are close to the low resource language. We call the linguistic definition of language family Family of Origin (FAMO), and we call the empirical definition of higher-ranked languages using our metrics Family of Choice (FAMC). Secondly, we build an Iteratively Pretrained Multilingual Order-preserving Lexiconized Transformer (IPML) to train on ~1,000 lines (~3.5%) of low resource data. To translate named entities correctly, we build a massive lexicon table for 2,939 Bible named entities in 124 source languages; the table includes many entities that occur only once and covers more than 66 severely low resource languages. Moreover, we build a novel method of combining translations from different source languages into one. Using English as a hypothetical low resource language, we get a +23.9 BLEU increase over a multilingual baseline, and a +10.3 BLEU increase over our asymmetric baseline in the Bible dataset. We get a 42.8 BLEU score for Portuguese-English translation on the medical EMEA dataset. We also have good results for a real severely low resource Mayan language, Eastern Pokomchi.


The NLP Week: How NLTM can make India a world leader in Speech-to-Speech Translation

#artificialintelligence

India is a melting pot of multiple cultures, religions, diaspora and languages. Although 22 languages are recognised officially, more than 100 languages and dialects are spoken across the country. In the past decade, India has witnessed stupendous growth digitally - in 2019, the number of smartphone users in rural areas surpassed that of urban India. There is a burgeoning market for digital products, going well beyond borders of urban pockets. However, less than 1% of content on the Internet is in English.


Dataset Inference: Ownership Resolution in Machine Learning

arXiv.org Machine Learning

With increasingly more data and computation involved in their training, machine learning models constitute valuable intellectual property. This has spurred interest in model stealing, which is made more practical by advances in learning with partial, little, or no supervision. Existing defenses focus on inserting unique watermarks in a model's decision surface, but this is insufficient: the watermarks are not sampled from the training distribution and thus are not always preserved during model stealing. In this paper, we make the key observation that knowledge contained in the stolen model's training set is what is common to all stolen copies. The adversary's goal, irrespective of the attack employed, is always to extract this knowledge or its by-products. This gives the original model's owner a strong advantage over the adversary: model owners have access to the original training data. We thus introduce dataset inference, the process of identifying whether a suspected model copy has private knowledge from the original model's dataset, as a defense against model stealing. We develop an approach for dataset inference that combines statistical testing with the ability to estimate the distance of multiple data points to the decision boundary. Our experiments on CIFAR10, SVHN, CIFAR100 and ImageNet show that model owners can claim with confidence greater than 99% that their model (or dataset as a matter of fact) was stolen, despite only exposing 50 of the stolen model's training points. Dataset inference defends against state-of-the-art attacks even when the adversary is adaptive. Unlike prior work, it does not require retraining or overfitting the defended model.
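
The statistical-testing idea in this abstract can be illustrated with a minimal sketch. It assumes each data point has already been reduced to a scalar estimate of its distance to the suspect model's decision boundary (how that estimate is obtained is not shown), and that a model trained on the owner's data tends to place those points farther from the boundary than unseen points. The function name `welch_t`, the sample values, and the use of a Welch t-statistic are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import math
import statistics


def welch_t(train_dists, unseen_dists):
    """Welch t-statistic for the hypothesis mean(train_dists) > mean(unseen_dists)."""
    m1, m2 = statistics.mean(train_dists), statistics.mean(unseen_dists)
    v1, v2 = statistics.variance(train_dists), statistics.variance(unseen_dists)
    # Standard error of the difference in means, without assuming equal variances.
    se = math.sqrt(v1 / len(train_dists) + v2 / len(unseen_dists))
    return (m1 - m2) / se


# Hypothetical distance estimates (made-up numbers, for illustration only).
owner_train = [2.1, 2.4, 1.9, 2.6, 2.3, 2.2]
unseen = [1.2, 1.5, 1.1, 1.4, 1.3, 1.6]

t_stat = welch_t(owner_train, unseen)
print(f"t = {t_stat:.2f}")
```

A large positive statistic would count as evidence that the suspect model treats the owner's training points differently from unseen points; a value near zero would give no grounds for an ownership claim.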


Google translation AI botches legal terms

#artificialintelligence

Translation tools from Google and other companies could be contributing to significant misunderstanding of legal terms with conflicting meanings such as "enjoin," according to research due to be presented at an academic workshop. Google's translation software turns an English sentence about a court enjoining violence, or banning it, into one in the Indian language of Kannada that implies the court ordered violence, according to the new study. "Enjoin" can refer to either promoting or restraining an action. Mistranslations also arise with other contronyms, or words with contradictory meanings depending on context, including "all over," "eventual" and "garnish," the paper said. Google said machine translation "is still just a complement to specialized professional translation" and that it is "continually researching improvements, from better handling ambiguous language, to mitigating bias, to making large quality gains for under-resourced languages."


Demystify Optimization Challenges in Multilingual Transformers

arXiv.org Artificial Intelligence

Multilingual Transformer improves parameter efficiency and crosslingual transfer. How to effectively train multilingual models has not been well studied. Using multilingual machine translation as a testbed, we study optimization challenges from loss landscape and parameter plasticity perspectives. We found that imbalanced training data causes task interference between high and low resource languages, characterized by nearly orthogonal gradients for major parameters and an optimization trajectory mostly dominated by high resource languages. We show that local curvature of the loss surface affects the degree of interference, and that existing heuristics of data subsampling implicitly reduce the sharpness, although they still face a trade-off between high and low resource languages. We propose a principled multi-objective optimization algorithm, Curvature Aware Task Scaling (CATS), which improves both optimization and generalization, especially for low resource languages. Experiments on TED, WMT and OPUS-100 benchmarks demonstrate that CATS advances the Pareto front of accuracy while being efficient to apply to massive multilingual settings at the scale of 100 languages.
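
The phrase "nearly orthogonal gradients" can be made concrete with a small sketch that measures the cosine similarity between two gradient vectors: a value close to 0 means the two tasks pull a shared parameter in almost unrelated directions. The function name `cosine_similarity` and the gradient values are hypothetical, and this is only one plausible way to quantify the interference the abstract describes.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine of the angle between two gradient vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    return dot / (norm1 * norm2)


# Hypothetical flattened gradients of one shared parameter block (made-up values).
grad_high_resource = [0.9, 0.1, 0.0, -0.2]
grad_low_resource = [0.1, -0.8, 0.3, 0.05]

cos = cosine_similarity(grad_high_resource, grad_low_resource)
print(f"cosine similarity = {cos:.3f}")
```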


Can Latent Alignments Improve Autoregressive Machine Translation?

arXiv.org Artificial Intelligence

Latent alignment objectives such as CTC and AXE significantly improve non-autoregressive machine translation models. Can they improve autoregressive models as well? We explore the possibility of training autoregressive machine translation models with latent alignment objectives, and observe that, in practice, this approach results in degenerate models. We provide a theoretical explanation for these empirical results, and prove that latent alignment objectives are incompatible with teacher forcing.


Back-Training excels Self-Training at Unsupervised Domain Adaptation of Question Generation and Passage Retrieval

arXiv.org Artificial Intelligence

In this paper, we propose a new domain adaptation method called back-training, a superior alternative to self-training. While self-training results in synthetic training data of the form quality inputs aligned with noisy outputs, back-training results in noisy inputs aligned with quality outputs. Our experimental results on unsupervised domain adaptation of question generation and passage retrieval models from the Natural Questions domain to the machine learning domain show that back-training outperforms self-training by a large margin: 9.3 BLEU-1 points on generation, and 7.9 accuracy points on top-1 retrieval. We release MLQuestions, a domain-adaptation dataset for the machine learning domain containing 50K unaligned passages and 35K unaligned questions, and 3K aligned passage and question pairs. Our data and code are available at https://github.com/McGill-NLP/MLQuestions
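
The contrast between the two kinds of synthetic data can be sketched for the question generation direction (passage as input, question as output). The helper names `self_training_pairs` and `back_training_pairs` and the two toy models are hypothetical stand-ins for illustration; they are not taken from the released code.

```python
def self_training_pairs(target_inputs, forward_model):
    """Real target-domain inputs paired with model-generated (noisy) outputs."""
    return [(x, forward_model(x)) for x in target_inputs]


def back_training_pairs(target_outputs, backward_model):
    """Model-produced (noisy) inputs paired with real target-domain outputs."""
    return [(backward_model(y), y) for y in target_outputs]


# Toy stand-ins for trained models (hypothetical, for illustration only).
def toy_question_generator(passage):
    return "What is " + passage.split()[0].lower() + "?"


def toy_passage_retriever(question):
    corpus = [
        "Dropout randomly zeroes activations.",
        "Momentum smooths gradient updates.",
    ]
    words = set(question.lower().strip("?").split())
    # Return the passage sharing the most words with the question.
    return max(corpus, key=lambda p: len(words & set(p.lower().strip(".").split())))


passages = ["Dropout randomly zeroes activations."]
questions = ["How does momentum change gradient updates?"]

self_pairs = self_training_pairs(passages, toy_question_generator)
back_pairs = back_training_pairs(questions, toy_passage_retriever)
print(self_pairs)
print(back_pairs)
```

In the self-training pair the passage is real and the question is machine-generated; in the back-training pair the question is real and the passage is machine-retrieved, so the model's training target stays clean.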