 one-hot representation


Reviews: FRAGE: Frequency-Agnostic Word Representation

Neural Information Processing Systems

The core idea is to add an adversarial loss, a technique that, as the submission notes, has recently been widely used in NLP papers. The authors define two frequency-based domains: major words and rare words. The proposed approach is easy to use and seems helpful in improving the accuracy of several NLP tasks.


A Bayesian Flow Network Framework for Chemistry Tasks

Tao, Nianze, Abe, Minori

arXiv.org Artificial Intelligence

In this work, we introduce ChemBFN, a language model that handles chemistry tasks based on Bayesian flow networks working on discrete data. A new accuracy schedule is proposed to improve the sampling quality by significantly reducing the reconstruction loss. We show evidence that our method is appropriate for generating molecules with satisfactory diversity even when a smaller number of sampling steps is used. A classifier-free guidance method is adapted for conditional generation. It is also worthwhile to point out that after generative training, our model can be fine-tuned on regression and classification tasks with state-of-the-art performance, which opens the gate to building all-in-one models in a single-module style. Our model has been open sourced at https://github.com/Augus1999/bayesian-flow-network-for-chemistry.


transformers.html

#artificialintelligence

Finally the discomfort of not knowing what makes them tick grew too great for me. Transformers were introduced in this 2017 paper as a tool for sequence transduction--converting one sequence of symbols to another. The most popular examples of this are translation, as in English to German. They have also been modified to perform sequence completion--given a starting prompt, carry on in the same vein and style. They have quickly become an indispensable tool for research and product development in natural language processing.

Before we start, just a heads-up. We're going to be talking a lot about matrix multiplications and touching on backpropagation (the algorithm for training the model), but you don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation. This isn't a short journey, but I hope you'll be glad you came.

In the beginning were the words. Our first step is to convert all the words to numbers so we can do math on them. Imagine that our goal is to create the computer that responds to our voice commands. It's our job to build the transformer that converts (or transduces) a sequence of sounds to a sequence of words. We start by choosing our vocabulary, the collection of symbols that we are going to be working with in each sequence. In our case, there will be two different sets of symbols, one for the input sequence to represent vocal sounds and one for the output sequence to represent words.

For now, let's assume we're working with English. There are tens of thousands of words in the English language, and perhaps another few thousand to cover computer-specific terminology. That would give us a vocabulary size that is the better part of a hundred thousand. One way to convert words to numbers is to start counting at one and assign each word its own number. Then a sequence of words can be represented as a list of numbers. For example, consider a tiny language with a vocabulary size of three: files, find, and my. Each word could be swapped out for a number, perhaps files 1, find 2, and my 3. Then the sentence "Find my files", consisting of the word sequence [ find, my, files ], could be represented instead as the sequence of numbers [2, 3, 1].

This is a perfectly valid way to convert symbols to numbers, but it turns out that there's another format that's even easier for computers to work with: one-hot encoding. In one-hot encoding a symbol is represented by an array of mostly zeros, the same length as the vocabulary, with only a single element having a value of one. Another way to think about one-hot encoding is that each word still gets assigned its own number, but now that number is an index to an array. Here is our example above, in one-hot notation. So the sentence "Find my files" becomes a sequence of one-dimensional arrays, which, after you squeeze them together, starts to look like a two-dimensional array.
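The two encodings described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (word_to_number, one_hot) are made up for this example, and the choice to put word number n at array position n - 1 is an assumption, since the text only says the number acts as an index.

```python
# The three-word vocabulary from the example; list order fixes each word's number.
vocabulary = ["files", "find", "my"]

# Integer encoding: start counting at one (files 1, find 2, my 3).
word_to_number = {word: i + 1 for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a list as long as the vocabulary: all zeros except a single one."""
    vector = [0] * len(vocabulary)
    # Assumption: word number n lights up position n - 1 (zero-based lists).
    vector[word_to_number[word] - 1] = 1
    return vector


sentence = ["find", "my", "files"]

# The sentence as a list of numbers.
numbers = [word_to_number[word] for word in sentence]

# The sentence as stacked one-hot rows, i.e. a two-dimensional array.
matrix = [one_hot(word) for word in sentence]

print(numbers)
for row in matrix:
    print(row)
```

Each row of matrix has exactly one nonzero entry, so the stacked rows form a grid with one row per word in the sentence and one column per word in the vocabulary.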


Knowledge transfer across cell lines using Hybrid Gaussian Process models with entity embedding vectors

Hutter, Clemens, von Stosch, Moritz, Bournazou, Mariano Nicolas Cruz, Butté, Alessandro

arXiv.org Machine Learning

To date, a large number of experiments are performed to develop a biochemical process. The generated data is used only once, to make decisions for development. If we could exploit data from already developed processes to make predictions for a novel process, we could significantly reduce the number of experiments needed. Processes for different products exhibit differences in behaviour; typically only a subset behave similarly. Therefore, effective learning on process data spanning multiple products requires a sensible representation of the product identity. We propose to represent the product identity (a categorical feature) by embedding vectors that serve as input to a Gaussian Process regression model. We demonstrate how the embedding vectors can be learned from process data and show that they capture an interpretable notion of product similarity. The improvement in performance is compared to traditional one-hot encoding on a simulated cross-product learning task. All in all, the proposed method could make significant reductions in wet-lab experiments possible.


Getting Started with TensorFlow 2 - KDnuggets

#artificialintelligence

But wait… What is TensorFlow? TensorFlow is a deep learning framework by Google, which released its second version in 2019. It is one of the world's most famous deep learning frameworks, widely used by industry specialists and researchers. TensorFlow v1 was difficult to use and understand as it was less Pythonic, but v2 was released with Keras fully synchronized with tf.keras, making it easy to use, easy to learn, and simple to understand. Remember, this is not a post on deep learning, so I expect you to be aware of deep learning terms and the basic ideas behind them.


Semantic Relatedness and Taxonomic Word Embeddings

Kacmajor, Magdalena, Kelleher, John D., Klubicka, Filip, Maldonado, Alfredo

arXiv.org Artificial Intelligence

This paper connects a series of papers dealing with taxonomic word embeddings. It begins by noting that there are different types of semantic relatedness and that different lexical representations encode different forms of relatedness. A particularly important distinction within semantic relatedness is that of thematic versus taxonomic relatedness. Next, we present a number of experiments that analyse taxonomic embeddings that have been trained on a synthetic corpus that has been generated via a random walk over a taxonomy. These experiments demonstrate how the properties of the synthetic corpus, such as the percentage of rare words, are affected by the shape of the knowledge graph the corpus is generated from. Finally, we explore the interactions between the relative sizes of natural and synthetic corpora on the performance of embeddings when taxonomic and thematic embeddings are combined.


Drug cell line interaction prediction

Liu, Pengfei

arXiv.org Machine Learning

Understanding the phenotypic drug response on cancer cell lines plays a vital role in anti-cancer drug discovery and re-purposing. The Genomics of Drug Sensitivity in Cancer (GDSC) database provides open data for researchers in phenotypic screening to test their models and methods. Previously, most research in these areas has started from the fingerprints or features of drugs, instead of their structures. In this paper, we introduce a model for phenotypic screening, which is called twin Convolutional Neural Network for drugs in SMILES format (tCNNS). tCNNS is comprised of CNN input channels for drugs in SMILES format and cancer cell lines respectively. Our model achieves $0.84$ for the coefficient of determination ($R^2$) and $0.92$ for Pearson correlation ($R_p$), which are significantly better than previous works \cite{ammad2014integrative,haider2015copula,menden2013machine}. Besides these statistical metrics, tCNNS also provides some insights into phenotypic screening.


Word2Vec and FastText Word Embedding with Gensim – Towards Data Science

#artificialintelligence

A traditional way of representing words is the one-hot vector, which is essentially a vector with only one target element being 1 and the others being 0. The length of the vector is equal to the size of the total unique vocabulary in the corpora. Conventionally, these unique words are encoded in alphabetical order. Namely, you should expect the one-hot vectors for words starting with "a" to have their "1" at a lower index, and those for words beginning with "z" to have it at a higher index. Though this representation of words is simple and easy to implement, there are several issues. First, you cannot infer any relationship between two words given their one-hot representation.
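As a rough illustration of the alphabetical convention and of the first issue, here is a minimal pure-Python sketch. The toy corpus and the helper names (one_hot, dot) are hypothetical and not taken from the article; the dot product is used here only as one simple stand-in for a similarity measure.

```python
# Hypothetical toy corpus, invented for this illustration.
corpus = ["zebra", "apple", "mango", "apple"]

# Unique words in alphabetical order, so "a" words get lower indices than "z" words.
vocabulary = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Vector as long as the vocabulary with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


apple, mango, zebra = one_hot("apple"), one_hot("mango"), one_hot("zebra")

# Any two different words have a dot product of zero, so the encoding
# cannot say whether "apple" is closer to "mango" than to "zebra".
print(index)
print(dot(apple, mango), dot(apple, zebra), dot(apple, apple))
```

Every pair of distinct words is equally dissimilar under this encoding, which is the gap that learned embeddings such as Word2Vec and FastText aim to fill.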