Collaborating Authors

Inductive Learning

Postdoc in Artificial Intelligence for Structural Bioinformatics


This position will investigate the application of a range of supervised learning techniques, representations and data augmentation strategies to the discovery of bioactive molecules in ultra-large libraries for selected therapeutic targets. The project will exploit large volumes of protein structure data, including recently available Alphafold2 structures. Selection criteria - Essential • A PhD awarded in an area relevant to the project. The CRCM is also affiliated to the private cancer hospital Institut Paoli Calmettes, the CNRS and Aix-Marseille University. The successful candidate will have a 3-year contract with a gross monthly salary of up to €2,900 gross monthly (depending on experience).

Google AI Introduces FLAN, A Language Model with Instruction Fine-Tuning


Google AI recently introduced their new Natural Language Processing (NLP) model, known as Fine-tuned LAnguage Net (FLAN), which explores a simple technique called instruction fine-tuning, or instruction tuning for short. In general, fine-tuning requires a large number of training examples, along with stored model weights for each downstream task which is not always practical, particularly for large models. FLAN's instruction fine-tuning technique involves fine-tuning a model not to solve a specific task, but to also make it more amenable to solving NLP tasks in particular. FLAN is fine-tuned on a large set of varied instructions that use a simple and intuitive description of the task, such as "Classify this movie review as positive or negative," or "Translate this sentence to Danish." Creating a dataset of instructions from scratch to fine-tune the model would take a considerable amount of resources.

What is Hybrid Natural Language Understanding?


We find it in everything from emails to videos to business documents and beyond. However, as pervasive as language data is to the enterprise, organizations struggle to maximize its value. Not only is there an incredible amount of language data available to and contained within organizations, but an exponentially increasing volume of it, as well. There is no ignoring the importance of language to the enterprise ecosystem. Organizations are listening, as 42% have already adopted natural language processing (NLP) systems while 26% plan to within the next year, according to IBM's Global AI Adoption Index 2021.

How to do Deep Learning on Graphs with Graph Convolutional Networks


In the previous post, I gave a high-level introduction to GCNs and showed how a nodes representation is updated based on its neighbors representation. In this post, we first gain a deeper understanding of the aggregation performed during the rather simple graph convolutions discussed in the previous post. Then we move on to a recently published graph convolutional propagation rule and I show how to implement and use it for semi-supervised learning on a community prediction task in Zachary's Karate Club, a small social network. As shown below, the GCN is able to learn latent feature representations for each node that separates the two communities into two reasonably cohesive and separated clusters despite using only one training example for each community.

Semi-supervised learning made simple


Semi-supervised learning is a machine learning technique of deriving useful information from both labelled and unlabelled data. Before doing this tutorial, you should have basic familiarity with supervised learning on images with PyTorch. We will omit reinforcement learning here and concentrate on the first two types. In supervised learning, our data consists of labelled objects. A machine learning model is tasked with learning how to assign labels (or values) to objects.

Machine learning: Types (part-3)


Inference refers to reaching an outcome or decision. There are different paradigms for inference that may be used as a framework for understanding how some machine learning algorithms work or how some learning problems may be approached. Some examples of approaches to learning are inductive, deductive, and transductive learning and inference. Inductive learning involves using evidence to determine the outcome. Inductive reasoning refers to using specific cases to determine general outcomes, ex- specific to general.



Boosting refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, one of the most popular is AdaBoost (Adaptive Boosting). The way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This is the technique used by AdaBoost.

Data, Assemble: Leveraging Multiple Datasets with Heterogeneous and Partial Labels Artificial Intelligence

The success of deep learning relies heavily on large datasets with extensive labels, but we often only have access to several small, heterogeneous datasets associated with partial labels, particularly in the field of medical imaging. When learning from multiple datasets, existing challenges include incomparable, heterogeneous, or even conflicting labeling protocols across datasets. In this paper, we propose a new initiative--"data, assemble"--which aims to unleash the full potential of partially labeled data and enormous unlabeled data from an assembly of datasets. To accommodate the supervised learning paradigm to partial labels, we introduce a dynamic adapter that encodes multiple visual tasks and aggregates image features in a question-and-answer manner. Furthermore, we employ pseudo-labeling and consistency constraints to harness images with missing labels and to mitigate the domain gap across datasets. From proof-of-concept studies on three natural imaging datasets and rigorous evaluations on two large-scale thorax X-ray benchmarks, we discover that learning from "negative examples" facilitates both classification and segmentation of classes of interest. This sheds new light on the computer-aided diagnosis of rare diseases and emerging pandemics, wherein "positive examples" are hard to collect, yet "negative examples" are relatively easier to assemble. As a result, besides exceeding the prior art in the NIH ChestXray benchmark, our model is particularly strong in identifying diseases of minority classes, yielding over 3-point improvement on average. Remarkably, when using existing partial labels, our model performance is on-par (p>0.05) with that using a fully curated dataset with exhaustive labels, eliminating the need for additional 40% annotation costs.

Improved pathogenicity prediction for rare human missense variants


The success of personalized genomic medicine depends on our ability to assess the pathogenicity of rare human variants, including the important class of missense variation. There are many challenges in training accurate computational systems, e.g., in finding the balance between quantity, quality, and bias in the variant sets used as training examples and avoiding predictive features that can accentuate the effects of bias. Here, we describe VARITY, which judiciously exploits a larger reservoir of training examples with uncertain accuracy and representativity. To limit circularity and bias, VARITY excludes features informed by variant annotation and protein identity. To provide a rationale for each prediction, we quantified the contribution of features and feature combinations to the pathogenicity inference of each variant.

Paint4Poem: A Dataset for Artistic Visualization of Classical Chinese Poems Artificial Intelligence

In this work we propose a new task: artistic visualization of classical Chinese poems, where the goal is to generatepaintings of a certain artistic style for classical Chinese poems. For this purpose, we construct a new dataset called Paint4Poem. Thefirst part of Paint4Poem consists of 301 high-quality poem-painting pairs collected manually from an influential modern Chinese artistFeng Zikai. As its small scale poses challenges for effectively training poem-to-painting generation models, we introduce the secondpart of Paint4Poem, which consists of 3,648 caption-painting pairs collected manually from Feng Zikai's paintings and 89,204 poem-painting pairs collected automatically from the web. We expect the former to help learning the artist painting style as it containshis most paintings, and the latter to help learning the semantic relevance between poems and paintings. Further, we analyze Paint4Poem regarding poem diversity, painting style, and the semantic relevance between poems and paintings. We create abenchmark for Paint4Poem: we train two representative text-to-image generation models: AttnGAN and MirrorGAN, and evaluate theirperformance regarding painting pictorial quality, painting stylistic relevance, and semantic relevance between poems and paintings.The results indicate that the models are able to generate paintings that have good pictorial quality and mimic Feng Zikai's style, but thereflection of poem semantics is limited. The dataset also poses many interesting research directions on this task, including transferlearning, few-shot learning, text-to-image generation for low-resource data etc. The dataset is publicly available.(