Northern accents are dying out and could DISAPPEAR BY 2066

Daily Mail - Science & tech

From the approachable Geordie dialect to the instantly recognisable Liverpool lilt, many of England's most distinctive accents are from the north. But a new study has warned that northern accents could all but disappear in just 45 years. Using physics modelling, researchers from the Universities of Portsmouth and Cambridge predicted how accents are likely to change across England by 2066. Their findings suggest that northern accents could be replaced with 'posh' south eastern pronunciations. However, certain north-south differences are predicted to remain - we will continue to disagree about the pronunciation of 'bath', according to the researchers.

Scientists Are Using AI to Decode Whale Language


When you dive into the ocean, the physiology of your body changes. As you go deeper into the water, your heart rate slows. In an environment that is seemingly hostile to its survival, the body becomes remarkably efficient at keeping you alive. The mammalian dive reflex, more romantically termed the "Master Switch of Life" by its discoverer, the physiologist Per Scholander, helped shape how we view our relationship to the water. If our bodies were so at home in the ocean, scientists wondered, what did that say about our evolutionary history?

wav2vec Unsupervised: Speech recognition without supervision


Whether it's giving directions, answering questions, or carrying out requests, speech recognition makes life easier in countless ways. But today the technology is available for only a small fraction of the thousands of languages spoken around the globe. This is because high-quality systems need to be trained with large amounts of transcribed speech audio. Transcribed recordings of English-language novels, for example, will do little to help machines learn to understand a Basque speaker ordering food off a menu or a Tagalog speaker giving a business presentation. This is why we developed wav2vec Unsupervised (wav2vec-U), a way to build speech recognition systems that require no transcribed data at all.

AraCOVID19-MFH: Arabic COVID-19 Multi-label Fake News and Hate Speech Detection Dataset

Artificial Intelligence

Along with the COVID-19 pandemic, an "infodemic" of false and misleading information has emerged and has complicated the COVID-19 response efforts. Social networking sites such as Facebook and Twitter have contributed largely to the spread of rumors, conspiracy theories, hate, xenophobia, racism, and prejudice. To combat the spread of fake news, researchers around the world have made, and are still making, considerable efforts to build and share COVID-19 related research articles, models, and datasets. This paper releases "AraCOVID19-MFH", a manually annotated multi-label Arabic COVID-19 fake news and hate speech detection dataset. Our dataset contains 10,828 Arabic tweets annotated with 10 different labels. The labels have been designed to consider some aspects relevant to the fact-checking task, such as the tweet's check-worthiness, positivity/negativity, and factuality. To confirm our annotated dataset's practical utility, we used it to train and evaluate several classification models and reported the obtained results. Though the dataset is mainly designed for fake news detection, it can also be used for hate speech detection, opinion/news classification, dialect identification, and many other tasks.

How is Artificial Intelligence Challenging the Translation Industry?


Language is perhaps the most defining factor of humankind. What makes humans different from other animals on the planet is our ability to speak out and communicate via framed words and sentences. The language of a population is one of the most defining factors across countries and nationalities, regions, and cultures. It can define the history, sociocultural situation, and even geographic diversity. From ancient times, there has been a trend for people to understand the language of one another. History traces back to Greeks and Romans traveling all across the world to discover, decipher and translate languages to find out the cultural, political, and social situations from one era to another.

A Linguistic Guide to Assassin's Creed: Valhalla


Invading my own country has been one of the most surreal experiences of playing Assassin's Creed: Valhalla, and the variety of languages included in the game makes it one of the most thought-provoking. Assassin's Creed is an award-winning historical action game series known for putting players in the middle of transformative events in history. Valhalla is set during the Viking invasions of Britain, during which the main character, Eivor, and their brother Sigurd embark on a quest to conquer a new land. They travel by boat from their native country Norway to a place that is home to new Viking settlers, eager to forge their own legacy of glory. This gave me an outsider's perspective of my own country, eavesdropping on everyday conversations in busy settlements and deciphering the origin of war cries on mountainsides.

Sorry, Cannot Understand the Language. Wait Chatbots can.


If you utter these words, a human probably would not understand, but maybe a chatbot will. We need to accept the fact that the majority of the world's population does not speak English and only a small portion of people have English as their native language. Thus, it becomes important for the chatbot market and manufacturers to address the multi-lingual aspect. According to a Research and Markets report, the global chatbot market size is expected to grow to USD 10.5 billion by 2026 at a CAGR of 23.5% during 2020-2026. Language acts as a great barrier for rural and non-English speaking populations trying to acquire assistance through chatbots.
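
To put the growth figure in context, the short Python sketch below works backwards from the quoted numbers to the market size they imply for 2020. It assumes the CAGR compounds annually over six periods (2020 to 2026); the function name `implied_base` is made up for this illustration, and the result is a back-of-envelope estimate, not a figure taken from the report.

```python
def implied_base(final_value: float, cagr: float, years: int) -> float:
    """Return the starting value implied by a final value and a compound annual growth rate.

    Rearranges final = base * (1 + cagr) ** years to solve for base.
    """
    return final_value / (1.0 + cagr) ** years


# Figures quoted in the text: USD 10.5 billion by 2026 at a 23.5% CAGR.
# Assumption: six annual compounding periods between 2020 and 2026.
base_2020 = implied_base(final_value=10.5, cagr=0.235, years=6)

print(f"Implied 2020 market size: USD {base_2020:.2f} billion")
```

Under these assumptions the implied 2020 base comes out just under USD 3 billion, so the forecast amounts to the market growing roughly three and a half times in six years.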

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Artificial Intelligence

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
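
The 50% threshold mentioned in the abstract can be illustrated with a small Python sketch. It assumes each corpus was audited by sampling sentences and counting how many a reviewer judged acceptable. The `CorpusAudit` class, the `flag_low_quality` function, and the sample counts are all hypothetical and are not the authors' tooling or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CorpusAudit:
    """Hypothetical record of one manually audited corpus sample."""
    name: str
    acceptable: int  # sampled sentences judged acceptable
    sampled: int     # total sentences sampled

    @property
    def acceptable_fraction(self) -> float:
        # Treat an empty sample as 0.0 to avoid dividing by zero.
        return self.acceptable / self.sampled if self.sampled else 0.0


def flag_low_quality(audits: List[CorpusAudit], threshold: float = 0.5) -> List[str]:
    """Return names of corpora whose acceptable fraction is below the threshold."""
    return [a.name for a in audits if a.acceptable_fraction < threshold]


# Made-up counts for illustration only; these are not results from the paper.
sample_audits = [
    CorpusAudit("corpus-a", acceptable=12, sampled=100),
    CorpusAudit("corpus-b", acceptable=87, sampled=100),
    CorpusAudit("corpus-c", acceptable=0, sampled=50),
]

flagged = flag_low_quality(sample_audits)
print(flagged)
```

Keeping the raw counts, not just the fraction, lets a reader see how large each sample was before trusting the flag.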