 labelled training data


cantnlp@LT-EDI-2024: Automatic Detection of Anti-LGBTQ+ Hate Speech in Under-resourced Languages

Wong, Sidney G. -J., Durward, Matthew

arXiv.org Artificial Intelligence

This paper describes our system for detecting homophobia and transphobia in social media comments, developed as part of the shared task at LT-EDI-2024. We took a transformer-based approach to develop our multiclass classification model for ten language conditions (English, Spanish, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Tulu, and Telugu). We introduced synthetic and organic instances of script-switched language data during domain adaptation to mirror the linguistic realities of social media language as seen in the labelled training data. Our system ranked second for Gujarati and Telugu with varying levels of performance for other language conditions. The results suggest incorporating elements of paralinguistic behaviour such as script-switching may improve the performance of language detection systems, especially in the case of under-resourced language conditions.


cantnlp@LT-EDI-2023: Homophobia/Transphobia Detection in Social Media Comments using Spatio-Temporally Retrained Language Models

Wong, Sidney G. -J., Durward, Matthew, Adams, Benjamin, Dunn, Jonathan

arXiv.org Artificial Intelligence

This paper describes our multiclass classification system developed as part of the LT-EDI@RANLP-2023 shared task. We used a BERT-based language model to detect homophobic and transphobic content in social media comments across five language conditions: English, Spanish, Hindi, Malayalam, and Tamil. We retrained a transformer-based cross-language pretrained language model, XLM-RoBERTa, with spatially and temporally relevant social media language data. We also retrained a subset of models with simulated script-mixed social media language data with varied performance. We developed the best performing seven-label classification system for Malayalam based on weighted macro averaged F1 score (ranked first out of six) with variable performance for other language and class-label conditions. We found the inclusion of this spatio-temporal data improved the classification performance for all language and task conditions when compared with the baseline. The results suggest that transformer-based language classification systems are sensitive to register-specific and language-specific retraining.


What You Need to Know About Machine Learning in 2023

#artificialintelligence

Machine learning is the process of enabling computers to tackle different kinds of tasks that have, until now, been carried out by people. Machine learning algorithms help automate self-driving cars, translate speech, and execute many other tasks. Machine learning technology is driving an explosion in the field of artificial intelligence. Let us see what exactly machine learning is. Machine learning is a type of artificial intelligence that allows software applications to become more accurate at predicting outcomes without being explicitly programmed.


Supervised vs Unsupervised Learning Explained - Seldon

#artificialintelligence

Machine learning is already an important part of how modern organisations and services function. Whether in social media platforms, healthcare, or finance, machine learning models are deployed in a variety of settings. But the steps needed to train and deploy a model will differ depending on the task at hand and the data that's available. Supervised and unsupervised learning are two different approaches to machine learning. They differ in the way the models are trained and the condition of the training data that's required.


Cost-effective speech-to-text with weakly- and semi-supervised training

AIHub

Voice assistants equipped with speech-to-text technology have seen a major boost in performance and usage, thanks to the new powerful machine learning methods based on deep neural networks. These methods follow a supervised learning approach, requiring large amounts of paired speech-text data to train the best performing speech-to-text transcription models. After collecting large amounts of relevant and diverse spoken utterances, the complex and intensive task of annotating and labelling of the collected speech data awaits. To get a feel for a typical scenario, let's look at some estimates. On average a typical user query, for example "Do you have the Christmas edition with Santa?", would last for about 3 seconds.


Twin Neural Network Regression is a Semi-Supervised Regression Algorithm

Wetzel, Sebastian J., Melko, Roger G., Tamblyn, Isaac

arXiv.org Artificial Intelligence

Twin neural network regression (TNNR) is a semi-supervised regression algorithm: it can be trained on unlabelled data points as long as other, labelled anchor data points are present. TNNR is trained to predict differences between the target values of two different data points rather than the targets themselves. By ensembling predicted differences between the targets of an unseen data point and all training data points, it is possible to obtain a very accurate prediction for the original regression problem. Since any loop of predicted differences should sum to zero, loops can be supplied to the training data, even if the data points within the loops are themselves unlabelled. Semi-supervised training significantly improves TNNR performance, which is already state of the art.
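
The two ideas in this abstract (ensembling predicted differences against labelled anchors, and the loop-sum-to-zero constraint) can be illustrated with a minimal sketch. This is not the authors' implementation: `tnnr_predict`, `loop_residual`, and `toy_diff_model` are hypothetical names, and the toy difference model is a hand-written stand-in for a trained twin network, chosen so that it is exact for the target y = 2x + 1.

```python
from statistics import mean
from typing import Callable, Sequence

Vector = Sequence[float]
# A difference model takes two inputs (x_i, x_j) and estimates y_i - y_j.
DiffModel = Callable[[Vector, Vector], float]


def tnnr_predict(
    diff_model: DiffModel,
    x_new: Vector,
    anchors_x: Sequence[Vector],
    anchors_y: Sequence[float],
) -> float:
    """Ensemble prediction: average (predicted difference + anchor label)
    over all labelled anchor points."""
    estimates = [
        diff_model(x_new, x_anchor) + y_anchor
        for x_anchor, y_anchor in zip(anchors_x, anchors_y)
    ]
    return mean(estimates)


def loop_residual(diff_model: DiffModel, loop: Sequence[Vector]) -> float:
    """Sum of predicted differences around a closed loop of points.
    A consistent model gives zero, so this can act as a training penalty
    even when none of the points in the loop carry labels."""
    n = len(loop)
    return sum(diff_model(loop[i], loop[(i + 1) % n]) for i in range(n))


def toy_diff_model(x_i: Vector, x_j: Vector) -> float:
    """Hypothetical stand-in for a trained twin network (assumption):
    exact differences for the target y = 2 * x[0] + 1."""
    return 2.0 * (x_i[0] - x_j[0])


# Labelled anchors drawn from y = 2x + 1.
anchors_x = [(0.0,), (1.0,), (3.0,)]
anchors_y = [1.0, 3.0, 7.0]

prediction = tnnr_predict(toy_diff_model, (2.0,), anchors_x, anchors_y)
residual = loop_residual(toy_diff_model, [(0.5,), (2.5,), (4.0,)])
print(prediction, residual)
```

With a real, imperfect network the per-anchor estimates would disagree slightly, and their spread gives a rough indication of prediction uncertainty; the loop residual would be non-zero and could be added to the training loss.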


Machine Learning

#artificialintelligence

Machine learning algorithms all aim to learn and improve their accuracy as they process more datasets. One way that we can classify the tasks that machine learning algorithms solve is by how much feedback is presented to the system. In some scenarios, the computer is provided with a significant amount of labelled training data, which is called supervised learning. In other cases, no labelled data is provided and this is known as unsupervised learning. Lastly, in semi-supervised learning, some labelled training data is provided, but most of the training data is unlabelled.
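
The three settings described above can be told apart by looking at how many training examples carry a label. The following is a simplified sketch: the function name `learning_setting` and the use of `None` to mark an unlabelled example are assumptions made for illustration, and the rule treats only a fully labelled dataset as supervised.

```python
from typing import Optional, Sequence


def learning_setting(labels: Sequence[Optional[str]]) -> str:
    """Name the learning setting implied by a dataset's labels.
    `None` marks an unlabelled example (an illustrative convention)."""
    labelled = sum(label is not None for label in labels)
    if labelled == 0:
        return "unsupervised"
    if labelled == len(labels):
        return "supervised"
    return "semi-supervised"


print(learning_setting(["cat", "dog", "cat"]))
print(learning_setting([None, None, None]))
print(learning_setting(["cat", None, None, None]))
```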


Facebook Is Giving Away This Speech Recognition Model For Free

#artificialintelligence

Researchers at Facebook AI recently introduced and open-sourced a new framework for self-supervised learning of representations from raw audio data known as wav2vec 2.0. The company claims that this framework can enable automatic speech recognition models with just 10 minutes of transcribed speech data. Neural network models have gained much traction over the last few years due to their applications across various sectors. The models work with the help of vast quantities of labelled training data. However, most of the time, it is more challenging to gather labelled data than unlabelled data.


Automatic Speech Transcription And Speaker Recognition Simultaneously Using Apple AI

#artificialintelligence

Last year, Apple faced several controversies regarding its speech recognition technology. To provide quality control for the company's voice assistant Siri, Apple asked its contractors to regularly listen to confidential voice recordings as part of the "Siri Grading Program". The company later apologised and published a statement announcing changes to the Siri grading program. This year, the tech giant has been putting a number of researchers to work on speech recognition technology to upgrade its voice assistant. Recently, the researchers at Apple developed an AI model which can perform automatic speech transcription and speaker recognition simultaneously.


Machine Learning – Introduction to Supervised Learning Vinod Sharma's Blog

#artificialintelligence

Supervised learning is a blessing in this era of machines. It helps to map inputs to outputs. It uses labelled training data, made up of a set of training examples, to infer a function. The majority of practical machine learning to date uses supervised learning. AILabPage defines Machine Learning as "A focal point where business, data and experience meets emerging technology and decides to work together".