Click to learn more about author Rosaria Silipo. Automatic machine translation has been a popular subject for machine learning algorithms. After all, if machines can detect topics and understand texts, translation should be just the next step. Machine translation can be seen as a variation of natural language generation. In a previous project, we worked on the automatic generation of fairy tales (see "Once upon a Time … by LSTM Network").
Text Generation is a task in Natural Language Processing (NLP) in which text is generated with some constraints such as initial characters or initial words. We come across this task in our day-to-day applications such as character/word/sentence predictions while typing texts in Gmail, Google Docs, Smartphone keyboard, and chatbot. Understanding of text generation forms the base to advanced NLP tasks such as Neural Machine Translation. This article discusses the text generation task to predict the next character given its previous characters. It employs a recurrent neural network with LSTM layers to achieve the task. The deep learning process will be carried out using TensorFlow's Keras, a high-level API.
Today, we will provide a walkthrough example of how you can apply character-based text generation using RNN and more particularly GRU models in TensorFlow. We will run it on colab and as training dataset we will take the "Alice's Adventures in Wonderland". In another post we explained how you can apply word-based text generation. Feel free to compare the two approaches. I wrote this article after finishing the course Natural Language Processing in Tensorflow by Coursera and deeplearning.ai.
We present a novel method of performing spelling correction on short input strings, such as search queries or individual words. At its core lies a procedure for generating artificial typos which closely follow the error patterns manifested by humans. This procedure is used to train the production spelling correction model based on a transformer architecture. This model is currently served in the HubSpot product search. We show that our approach to typo generation is superior to the widespread practice of adding noise, which ignores human patterns. We also demonstrate how our approach may be extended to resource-scarce settings and train spelling correction models for Arabic, Greek, Russian, and Setswana languages, without using any labeled data.
In the last few years, many different methods have been focusing on using deep recurrent neural networks for natural language generation. The most widely used sequence-to-sequence neural methods are word-based: as such, they need a pre-processing step called delexicalization (conversely, relexicalization) to deal with uncommon or unknown words. These forms of processing, however, give rise to models that depend on the vocabulary used and are not completely neural. In this work, we present an end-to-end sequence-to-sequence model with attention mechanism which reads and generates at a character level, no longer requiring delexicalization, tokenization, nor even lowercasing. Moreover, since characters constitute the common "building blocks" of every text, it also allows a more general approach to text generation, enabling the possibility to exploit transfer learning for training. These skills are obtained thanks to two major features: (i) the possibility to alternate between the standard generation mechanism and a copy one, which allows to directly copy input facts to produce outputs, and (ii) the use of an original training pipeline that further improves the quality of the generated texts. We also introduce a new dataset called E2E+, designed to highlight the copying capabilities of character-based models, that is a modified version of the well-known E2E dataset used in the E2E Challenge. We tested our model according to five broadly accepted metrics (including the widely used bleu), showing that it yields competitive performance with respect to both character-based and word-based approaches.