mother tongue
Mitigating Language Barriers in Education: Developing Multilingual Digital Learning Materials with Machine Translation
Poláková, Lucie, Popel, Martin, Kloudová, Věra, Novák, Michal, Anisimova, Mariia, Balhar, Jiří
The EdUKate project combines digital education, linguistics, translation studies, and machine translation to develop multilingual learning materials for Czech primary and secondary schools. Launched through collaboration between a major Czech academic institution and the country's largest educational publisher, the project is aimed at translating up to 9,000 multimodal interactive exercises from Czech into Ukrainian, English, and German for an educational web portal. It emphasizes the development and evaluation of a direct Czech-Ukrainian machine translation system tailored to the educational domain, with special attention to processing formatted content such as XML and PDF and handling technical and scientific terminology. We present findings from an initial survey of Czech teachers regarding the needs of non-Czech-speaking students and describe the system's evaluation and implementation on the web portal. All resulting applications are freely available to students, educators, and researchers.
Boli: A dataset for understanding stuttering experience and analyzing stuttered speech
Batra, Ashita, Narang, Mannas, Sharma, Neeraj Kumar, Das, Pradip K
There is a growing need for diverse, high-quality stuttered speech data, particularly in the context of Indian languages. This paper introduces Project Boli, a multilingual stuttered speech dataset designed to advance scientific understanding and technology development for individuals who stutter, particularly in India. The dataset (a) comprises anonymized metadata (gender, age, country, mother tongue) and responses to a questionnaire about how stuttering affects participants' daily lives, (b) captures both read speech (using the Rainbow Passage) and spontaneous speech (through image description tasks) for each participant, and (c) includes detailed annotations of five stutter types: blocks, prolongations, interjections, sound repetitions, and word repetitions. We present a comprehensive analysis of the dataset, including the data collection procedure, a summary of the experiences of people who stutter, severity assessment of stuttering events, and technical validation of the collected data. The dataset is released under open access to further speech technology development.
Can We Reverse In-Context Knowledge Edits?
Youssef, Paul, Zhao, Zhixue, Schlötterer, Jörg, Seifert, Christin
In-context knowledge editing (IKE) enables efficient modification of large language model (LLM) outputs without parameter changes and at zero cost. However, it can be misused to manipulate responses opaquely, e.g., to insert misinformation or offensive content. Such malicious interventions could be incorporated into high-level wrapped APIs where the final input prompt is not shown to end-users. To address this issue, we investigate the detection and reversal of IKE-edits. First, we demonstrate that IKE-edits can be detected with high accuracy (F1 > 80%) using only the top-10 output probabilities of the next token, even in a black-box setting, e.g., proprietary LLMs with limited output information. Further, we introduce the novel task of reversing IKE-edits using specially tuned reversal tokens. We explore using both continuous and discrete reversal tokens, achieving over 80% accuracy in recovering original, unedited outputs across multiple LLMs. Our continuous reversal tokens prove particularly effective, with minimal impact on unedited prompts. Through analysis of output distributions, attention patterns, and token rankings, we provide insights into IKE's effects on LLMs and how reversal tokens mitigate them. This work represents a significant step towards enhancing LLM resilience against potential misuse of in-context editing, improving their transparency and trustworthiness.
CollabEdit: Towards Non-destructive Collaborative Knowledge Editing
Zheng, Jiamu, Zhang, Jinghuai, Du, Tianyu, Zhang, Xuhong, Yin, Jianwei, Lin, Tao
Collaborative learning of large language models (LLMs) has emerged as a new paradigm for utilizing private data from different parties to guarantee efficiency and privacy. Meanwhile, Knowledge Editing (KE) for LLMs has also garnered increased attention due to its ability to manipulate the behaviors of LLMs explicitly, yet the collaborative KE case (in which knowledge edits of multiple parties are aggregated in a privacy-preserving and continual manner) remains unexamined. To this end, this manuscript presents the first investigation of collaborative KE, in which we start by carefully identifying three unique challenges therein: knowledge overlap, knowledge conflict, and knowledge forgetting. We then propose a non-destructive collaborative KE framework, COLLABEDIT, which employs a novel model merging mechanism to mimic the global KE behavior while preventing a severe performance drop. Extensive experiments on two canonical datasets demonstrate the superiority of COLLABEDIT compared to other destructive baselines, and the results shed light on addressing the three collaborative KE challenges and on future applications.
Shaping the Future of Endangered and Low-Resource Languages -- Our Role in the Age of LLMs: A Keynote at ECIR 2024
Isidore of Seville is credited with the adage that it is language that gives birth to a people, and not the other way around, underlining the profound role played by language in the formation of cultural and social identity. Today, of the more than 7100 languages listed, a significant number are endangered. Since the 1970s, linguists, information seekers and enthusiasts have helped develop digital resources and automatic tools to support a wide range of languages, including endangered ones. The advent of Large Language Model (LLM) technologies holds both promise and peril. They offer unprecedented possibilities for the translation and generation of content and resources, key elements in the preservation and revitalisation of languages. They also present a threat of homogenisation, cultural oversimplification and the further marginalisation of already vulnerable languages. The talk this paper is based on proposed an initiatory journey, exploring the potential paths and partnerships between technology and tradition, with a particular focus on the Occitan language. Occitan is a language from Southern France and parts of Spain and Italy that played a major cultural and economic role, particularly in the Middle Ages. It is now endangered according to UNESCO. The talk critically examined how human expertise and artificial intelligence can work together to offer hope for preserving the linguistic diversity that forms the foundation of our global and especially our European heritage, while addressing some of the ethical and practical challenges that accompany the use of these powerful technologies. This paper is based on the keynote I gave at the 46th European Conference on Information Retrieval (ECIR 2024). As an alternative to reading this paper, a video of the talk is available online. Date: 26 March 2024.
Bad grammar is so maddening it activates the 'fight or flight' response within the human body, study finds
For many, bad grammar can be maddening. Now experts have discovered it really does cause a physical reaction – and even affects our heart rate. Instances of bad grammar can include mixing up tenses within a sentence, confusing the singular and plural, using a double negative or misusing a comma. Examples of the pet peeve include 'We don't need no education', 'I ate porridge for breakfast and drink milk' or 'Anna and Mike is going skiing'. Researchers from the University of Birmingham recruited 41 British English-speaking adults who listened to 40 English speech samples, half of which contained grammatical errors.
The Workers Behind AI Rarely See Its Rewards. This Indian Startup Wants to Fix That
In the shade of a coconut palm, Chandrika tilts her smartphone screen to avoid the sun's glare. It is early morning in Alahalli village in the southern Indian state of Karnataka, but the heat and humidity are rising fast. As Chandrika scrolls, she clicks on several audio clips in succession, demonstrating the simplicity of the app she recently started using. At each tap, the sound of her voice speaking her mother tongue emerges from the phone. Before she started using this app, 30-year-old Chandrika (who, like many South Indians, uses the first letter of her father's name, K., instead of a last name) had just 184 rupees ($2.25) in her bank account. But in return for around six hours of work spread over several days in late April, she received 2,570 rupees ($31.30). That's roughly the same amount she makes in a month of working as a teacher at a distant school, after the cost of the three buses it takes her to get there and back. Just by reading text aloud in her native language of Kannada, spoken by around 60 million people mostly in central and southern India, Chandrika has used this app to earn an hourly wage of about $5, nearly 20 times the Indian minimum. And in a few days, more money will arrive--a 50% bonus, awarded once the voice clips are validated as accurate. Chandrika's voice can fetch this sum because of the boom in artificial intelligence (AI). Right now, cutting edge AIs--for example, large language models like ChatGPT--work best in languages like English, where text and audio data is abundant online.
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Hoelscher-Obermaier, Jason, Persson, Julia, Kran, Esben, Konstas, Ioannis, Barez, Fazl
Recent model editing techniques promise to mitigate the problem of memorizing false or outdated associations during LLM training. However, we show that these techniques can introduce large unwanted side effects which are not detected by existing specificity benchmarks. We extend the existing CounterFact benchmark to include a dynamic component and dub our benchmark CounterFact+. Additionally, we extend the metrics used for measuring specificity by a principled KL divergence-based metric. We use this improved benchmark to evaluate recent model editing techniques and find that they suffer from low specificity. Our findings highlight the need for improved specificity benchmarks that identify and prevent unwanted side effects.
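As a rough illustration of what a KL divergence-based specificity check could look like, the sketch below compares a model's next-token distribution before and after an edit on a prompt that the edit should not affect. The abstract does not spell out how the metric is computed, so the direction of the divergence, the function name `kl_divergence`, and the toy four-token distributions are all assumptions made for illustration, not the authors' implementation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) in nats for two discrete distributions.

    Both inputs are sequences of probabilities over the same vocabulary.
    Entries of q are floored at eps to avoid division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with p_i = 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total


# Hypothetical next-token distributions over a 4-token vocabulary,
# for a prompt that is unrelated to the edited fact.
before_edit = [0.70, 0.20, 0.05, 0.05]
after_edit = [0.10, 0.20, 0.65, 0.05]

drift = kl_divergence(before_edit, after_edit)       # large value: side effect
unchanged = kl_divergence(before_edit, before_edit)  # zero: no side effect
```

Under these assumptions, a value near zero suggests the edit left the unrelated prompt alone, while a large value flags the kind of unwanted side effect the benchmark is meant to surface.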
Ukrainians who grew up speaking Russian learn a new mother tongue
Oleksandr Zahalskyy spent most of his life speaking only Russian. Born in 1960 in what was then the Soviet Union, Zahalskyy hails from the largely Russian-speaking Ukrainian city of Kherson. Now, at 63 and living in the capital, Kyiv, Zahalskyy and his wife Natasha are in the midst of a difficult but voluntary transition – making the Ukrainian language their own. "At first, we thought we needed to know our national language, but with the start of this full-scale war, the feeling changed from 'I have to' to 'I want to'," Zahalskyy told Al Jazeera by phone. The invasion Russia launched on February 24 last year, which started the biggest war in Europe since 1945, is seen by many Ukrainians as an attempt to wipe them out – and their culture, language and way of life.
Making computer science research more accessible in India
Imagine that you are teaching a technical subject to children in a small village. They are eager to learn, but you face a problem: There are few resources to educate them in their mother tongue. This is a common experience in India, where the quality of textbooks written in many local languages pales in comparison to those written in English. To address educational inequality, the Indian government launched an initiative in 2020 that would improve the quality of these resources for hundreds of millions of people, but its implementation remains a massive undertaking. Siddhartha Jayanti, an MIT PhD student in electrical engineering and computer science (EECS) who is an affiliate of MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and Google Research, encountered this problem first-hand when teaching students in India about math, science, and English.