Extended Parallel Corpus for Amharic-English Machine Translation

arXiv.org Artificial Intelligence

This paper describes the acquisition, preprocessing, segmentation, and alignment of an Amharic-English parallel corpus. It will be useful for machine translation of an under-resourced language, Amharic. The corpus is larger than previously compiled corpora; it is released for research purposes. We trained neural machine translation and phrase-based statistical machine translation models using the corpus. In the automatic evaluation, neural machine translation models outperform phrase-based statistical machine translation models.


HumAID: Human-Annotated Disaster Incidents Data from Twitter with Deep Learning Benchmarks

arXiv.org Artificial Intelligence

Social networks are widely used for information consumption and dissemination, especially during time-critical events such as natural disasters. Despite its significantly large volume, social media content is often too noisy for direct use in any application. Therefore, it is important to filter, categorize, and concisely summarize the available content to facilitate effective consumption and decision-making. To address such issues, automatic classification systems have been developed using supervised modeling approaches, thanks to earlier efforts on creating labeled datasets. However, existing datasets are limited in different aspects (e.g., size, presence of duplicates) and are less suitable for supporting more advanced and data-hungry deep learning models. In this paper, we present a new large-scale dataset with ~77K human-labeled tweets, sampled from a pool of ~24 million tweets across 19 disaster events that happened between 2016 and 2019. Moreover, we propose a data collection and sampling pipeline, which is important for social media data sampling for human annotation. We report multiclass classification results using classic and deep learning (fastText and transformer) based models to set the ground for future studies. The dataset and associated resources are publicly available at https://crisisnlp.qcri.org/humaid_dataset.html


What the Hell Are You Supposed to Do With Your Vaccine Card?

Slate

The joy, anxiety, and anticipation of getting a COVID vaccine in America culminate, quite anticlimactically, with a piece of white cardstock. Some have already lost their vaccine cards or never got them to begin with. Others have their names misspelled and crossed out on them. Many are having trouble reconciling how something so simple--and easily forged--can carry such import and weight. The White House has recently clarified that there will be no federal vaccine passport.


Council Post: AI's Role In Analyzing Shifting Sentiments Around Companies

#artificialintelligence

Despite only being early in the year, significant events have already taken place in 2021. Mass vaccinations for Covid-19 have begun around the world, and new strains of the disease have surfaced in the United Kingdom, South Africa and Brazil. For companies, this news has had a direct impact on their ability to conduct business while further placing their pandemic response under the public microscope. How companies are being talked and written about is changing as the pandemic unfolds, and these nuances could reveal more than simply how effective an organization's marketing department is. What if shifts in sentiment could help traders make more informed financial decisions?


Three keys to working effectively on Artificial Intelligence projects

#artificialintelligence

On many occasions, the greatest impediments to creating Artificial Intelligence solutions do not lie in the capacity of highly qualified teams, but in establishing an effective way of working between the different professional profiles involved in the life cycle of analytical models. This is one of the main tasks we are currently tackling at BBVA AI Factory. It is a task guided by three concepts: simplify, accelerate and reuse. My first direct contact with the AI Factory was in April 2020, in the middle of lockdown. I found myself with a team of data scientists who were extremely competent in creating AI models, but who needed to continue to push for common working guidelines in order to deal with the complexity – both organisational and technical – that exists in the Engineering domain.


Detection of marine litter using deep learning

AIHub

Researchers at the University of Barcelona have developed an open access, deep learning-based web app that will enable the detection and quantification of floating plastics in the sea with a reliability of over 80%. Floating sea macro-litter is a threat to the conservation of marine ecosystems worldwide. According to UNESCO, plastic debris causes the deaths of more than a million seabirds every year, as well as more than 100,000 marine mammals. Eroded fragments, known as micro-plastics, are now prevalent across the food chain. The largest density of floating litter is found in the great ocean gyres (systems of circular currents) with litter being caught and spun in these vast cycles.


Adaptive Clustering of Robust Semantic Representations for Adversarial Image Purification

arXiv.org Artificial Intelligence

Deep Learning models are highly susceptible to adversarial manipulations that can lead to catastrophic consequences. One of the most effective methods to defend against such disturbances is adversarial training, but it comes at the cost of generalization to unseen attacks and transferability across models. In this paper, we propose a robust defense against adversarial attacks, which is model agnostic and generalizable to unseen adversaries. Initially, with a baseline model, we extract the latent representations for each class and adaptively cluster the latent representations that share a semantic similarity. We obtain the distributions for the clustered latent representations and, from their originating images, we learn semantic reconstruction dictionaries (SRD). We adversarially train a new model, constraining the latent space representation to minimize the distance between the adversarial latent representation and the true cluster distribution. To purify the image, we decompose the input into low- and high-frequency components. The high-frequency component is reconstructed based on the most adequate SRD from the clean dataset. In order to evaluate the most adequate SRD, we rely on the distance between robust latent representations and semantic cluster distributions. The output is a purified image with no perturbation. Image purification on CIFAR-10 and ImageNet-10 using our proposed method improved the accuracy by more than 10% compared to state-of-the-art results.
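
The abstract does not say how the input is split into low- and high-frequency components. As a rough illustration only, the sketch below assumes a simple box blur as the low-pass filter and takes the residual as the high-frequency part; the function names, the `radius` parameter, and the toy image are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def box_blur(image: Image, radius: int = 1) -> Image:
    """Low-pass filter: mean over a (2*radius+1)^2 window, clipped at the borders."""
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            rows = range(max(0, y - radius), min(height, y + radius + 1))
            cols = range(max(0, x - radius), min(width, x + radius + 1))
            window = [image[j][i] for j in rows for i in cols]
            blurred[y][x] = sum(window) / len(window)
    return blurred


def split_frequencies(image: Image, radius: int = 1) -> Tuple[Image, Image]:
    """Return (low, high) such that low + high reproduces the input pixel by pixel."""
    low = box_blur(image, radius)
    high = [
        [pixel - smooth for pixel, smooth in zip(row, low_row)]
        for row, low_row in zip(image, low)
    ]
    return low, high


# Toy 3x3 image with a single bright pixel in the centre.
image = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
low, high = split_frequencies(image)
```

In this toy case the sharp central spike ends up almost entirely in `high`, which is the component the abstract says is reconstructed from a dictionary; any other low-pass filter (Gaussian, Fourier mask) could be substituted for the box blur.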


Deep learning for prediction of complex geology ahead of drilling

arXiv.org Machine Learning

During a geosteering operation, the well path is intentionally adjusted in response to the new data acquired while drilling. To achieve consistent high-quality decisions, especially when drilling in complex environments, decision support systems can help cope with high volumes of data and interpretation complexities. They can assimilate the real-time measurements into a probabilistic earth model and use the updated model for decision recommendations. Recently, machine learning (ML) techniques have enabled a wide range of methods that redistribute computational cost from on-line to off-line calculations. In this paper, we introduce two ML techniques into the geosteering decision support framework. Firstly, a complex earth model representation is generated using a Generative Adversarial Network (GAN). Secondly, a commercial extra-deep electromagnetic simulator is represented using a Forward Deep Neural Network (FDNN). The numerical experiments demonstrate that the combination of the GAN and the FDNN in an ensemble randomized maximum likelihood data assimilation scheme provides real-time estimates of complex geological uncertainty. This yields a reduction of geological uncertainty ahead of the drill-bit from the measurements gathered behind and around the well bore.


Heuristics2Annotate: Efficient Annotation of Large-Scale Marathon Dataset For Bounding Box Regression

arXiv.org Artificial Intelligence

Annotating a large-scale in-the-wild person re-identification dataset, especially one of marathon runners, is a challenging task. The variations in scenarios such as camera viewpoints, resolution, occlusion, and illumination make the problem non-trivial. Manually annotating bounding boxes in such large-scale datasets is cost-inefficient. Additionally, due to crowdedness and occlusion in the videos, aligning the identity of runners across multiple disjoint cameras is a challenge. We collected a novel large-scale in-the-wild video dataset of marathon runners. The dataset consists of hours of recording of thousands of runners captured using 42 hand-held smartphone cameras and covering real-world scenarios. Due to the presence of crowdedness and occlusion in the videos, the annotation of runners becomes a challenging task. We propose a new scheme for tackling the challenges in the annotation of such a large dataset. Our technique reduces the overall cost of annotation in terms of time as well as budget. We perform an fps analysis to reduce the effort and time of annotation. We investigate several annotation methods for efficiently generating tight bounding boxes. Our results show that interpolating bounding boxes between keyframes is the most efficient of the bounding box generation methods compared and is 3x faster than the naive baseline method. We introduce a novel way of aligning the identity of runners in disjoint cameras. Our inter-camera alignment tool, integrated with the state-of-the-art person re-id system, proves to be sufficient and effective in aligning runners across multiple cameras with non-overlapping views. Our proposed annotation framework reduces the annotation cost of the dataset by a factor of 16, while also effectively aligning 93.64% of the runners in the cross-camera setting.
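
Interpolating bounding boxes between keyframes can be sketched as follows. This is a minimal illustration that assumes plain linear interpolation of box corners between manually annotated frames; the abstract does not state which interpolation scheme the authors use, and the names `interpolate_boxes` and `annotated` are hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[float, ...]  # (x_min, y_min, x_max, y_max) in pixels


def interpolate_boxes(keyframes: Dict[int, Box]) -> Dict[int, Box]:
    """Fill in a box for every frame between consecutive annotated keyframes."""
    frames = sorted(keyframes)
    result: Dict[int, Box] = {}
    for start, end in zip(frames, frames[1:]):
        box_a, box_b = keyframes[start], keyframes[end]
        span = end - start
        for frame in range(start, end):
            t = (frame - start) / span  # 0.0 at the first keyframe, approaching 1.0
            result[frame] = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))
    if frames:
        # The last keyframe is not covered by the loop above, so copy it as-is.
        result[frames[-1]] = keyframes[frames[-1]]
    return result


# Two hand-annotated keyframes, four frames apart: the runner moves 20 px right.
annotated = {0: (10.0, 20.0, 50.0, 100.0), 4: (30.0, 20.0, 70.0, 100.0)}
boxes = interpolate_boxes(annotated)
```

Here an annotator labels only frames 0 and 4, and frames 1 to 3 are generated automatically, which is where the saving in annotation time would come from under this assumption.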


A Heuristic-driven Uncertainty based Ensemble Framework for Fake News Detection in Tweets and News Articles

arXiv.org Artificial Intelligence

The significance of social media has increased manifold in the past few decades as it helps people from even the most remote corners of the world to stay connected. With the advent of technology, digital media has become more relevant and widely used than ever before and, along with this, there has been a resurgence in the circulation of fake news and tweets that demand immediate attention. In this paper, we describe a novel Fake News Detection system that automatically identifies whether a news item is "real" or "fake", as an extension of our work in the CONSTRAINT COVID-19 Fake News Detection in English challenge. We have used an ensemble model consisting of pre-trained models followed by a statistical feature fusion network, along with a novel heuristic algorithm that incorporates various attributes present in news items or tweets, such as source, username handles, URL domains and authors, as statistical features. Our proposed framework has also quantified reliable predictive uncertainty along with proper class output confidence level for the classification task. We have evaluated our results on the COVID-19 Fake News dataset and FakeNewsNet dataset to show the effectiveness of the proposed algorithm on detecting fake news in short news content as well as in news articles. We obtained a best F1-score of 0.9892 on the COVID-19 dataset, and an F1-score of 0.9073 on the FakeNewsNet dataset.
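
As a generic illustration of combining several classifiers and attaching an uncertainty estimate to the result, the sketch below averages class probabilities (soft voting) and uses the entropy of the averaged distribution as the uncertainty score. Both choices are assumptions made for the example: the abstract does not describe the authors' fusion network or their uncertainty measure, and the probabilities shown are made up.

```python
import math
from typing import List


def soft_vote(model_probs: List[List[float]]) -> List[float]:
    """Average the per-class probabilities predicted by each model."""
    n_models = len(model_probs)
    return [sum(class_probs) / n_models for class_probs in zip(*model_probs)]


def predictive_entropy(probs: List[float]) -> float:
    """Entropy in nats: 0 for a certain prediction, log(k) for a uniform one over k classes."""
    return -sum(p * math.log(p) for p in probs if p > 0)


labels = ["real", "fake"]
# Hypothetical outputs of three pre-trained classifiers for one news item.
model_probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]

avg = soft_vote(model_probs)
prediction = labels[avg.index(max(avg))]
uncertainty = predictive_entropy(avg)
```

A downstream system could, for example, route items whose `uncertainty` is close to `math.log(2)` to a human reviewer instead of trusting the automatic label.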