Goto

Collaborating Authors

Machine Learning


iiot bigdata_2022-05-27_03-56-20.xlsx

#artificialintelligence

The graph represents a network of 1,040 Twitter users whose tweets in the requested range contained "iiot bigdata", or who were replied to or mentioned in those tweets. The network was obtained from the NodeXL Graph Server on Friday, 27 May 2022 at 11:01 UTC. The requested start date was Friday, 27 May 2022 at 00:01 UTC and the maximum number of tweets (going backward in time) was 7,500. The tweets in the network were tweeted over the 2-day, 18-hour, 9-minute period from Tuesday, 24 May 2022 at 05:50 UTC to Thursday, 26 May 2022 at 23:59 UTC. Additional tweets that were mentioned in this data set were also collected from prior time periods.


A Deep Dive into NLP Tokenization and Encoding with Word and Sentence Embeddings

#artificialintelligence

Introduction An estimated 80% of all organizational data is unstructured text. Despite this, even among data savvy organizations, most of it goes completely ignored. Why? Numeric data is easy. Just toss it in a standard scaler, throw it at a model, and see what happens. You may not get good results, but you'll get something, and you can always iterate from there. Text data is hard. Before it can be entered into any mathematical model, it must first be encoded into numbers. How do we turn words into numbers? From simple 0's and 1's, to multidimensional embeddings, NLP has made incredible progress in the past decade. But using these cutting-edge strategies means rethinking our data. It means expanding our minds into complex higher-dimensional spaces that often defy our intuition. I still struggle with the basic 4 dimensions of our physical world. When I first heard about 768-dimension embeddings, I feared my brain would escape from my ear. If you can relate, if you want to truly master the tricky subject of NLP encoding, this article is for you. This is a deep dive: over 8,000 words long. Don't be afraid to bookmark this article and read it in pieces. There is a lot to cover. We will start with basic One-Hot encoding, move on to word2vec word and sentence embeddings, build our own custom embeddings using R, and finally, work with the cutting-edge BERT model and its contextual embeddings. Like Frodo on the way to Mordor, we have a long and challenging journey before us. Let's get started. Our Experiment The Lord of the Rings and Star Wars are two iconic franchises that need no introduction. Every day thousands of passionate fans discuss these fictional worlds and characters online. For example: msg_id: 27054 source: /r/lotr Not quite. Radagast alerts the Eagles to scout around and report to him and to Gandalf. Gwaihir is near Orthanc later and sees Gandalf and so goes down to him. Radagast is not aware of Saruman's Treachery at any point. The above comment doesn't specifically mention Lord of the Rings, but we humans can easily infer which franchise the author is talking about. Another example: msg_id: 17480 Source: /r/StarWars “For over a thousand generations the Jedi knights were the guardians of peace and justice in the old republic. Before the dark times. Before the empire.”- Ben Kenobi to Luke Skywalker. Again, no specific mention of Star Wars, but we humans know what's up. We can easily classify these comments. And in this experiment, we are going to train a machine to do the same thing. Can different encoding methods help the machine gain better insights into the text? Do some methods result in better predictions than others? How do these different encoding methods compare, all else being equal? Meet the Model Our classifier is based on the StepByStep class by Daniel Voigt Godoy. It is an extremely simple model: [dm_code_snippet]#define our model model = nn.Sequential( nn.Linear(X, 1) ) loss_fn = nn.BCEWithLogitsLoss() optimizer = optim.Adam(model.parameters(), lr=0.01) # train it our_model = StepByStep(model, loss_fn, optimizer) our_model.set_loaders(train_loader, val_loader) our_model.train(100) [/dm_code_snippet] For each individual comment, we will send X inputs (features) into a dense linear layer with a single output. The exact number of inputs will differ depending on encoding method, but they will all follow this same architecture: Each input (Xn) is connected to our output by its own weight (Wn). 
These weights, plus a special bias weight, are the trainable parameters of the model. We begin with random values. With each iteration these parameters (weights) are modified slightly. This is what is happening behind-the-scenes when we say a model is training and learning. Mathematically speaking, we simply adjust each weight by the partial derivative of our loss function with respect to the given weight, multiplied by our learning rate (η). [latexpage] \[w_n\prime=w_n - \eta\frac{\partial \text{loss function}}{\partial w_n}\] Our model uses the Adam optimizer which adjusts our learning rate (η) as it trains. We will use the initial learning rate 0.01. In general practice, that learning rate is really big; we would normally use something closer to 0.0003. That just leaves our loss function, the heart of any neural network. Since this is a binary classification task, we will use log loss. In general terms, this loss function measures the probability of belonging to the positive class . In our specific case we use BCEWithLogitsLoss which returns logits that must be passed through a sigmoid function to squish them back into these probabilities. [latexpage] \[S(x) = \frac{1}{1+e^{-x}}\] The sigmoid function squishes logits back into a 0%-100% range For any comments with a probability > 50%, we will classify them as positive. Here we treat Lord of the Rings as our positive class, but that is an arbitrary choice. The math works the same either way. Same model, same data, different encoding methods. Here we go. Our Data On Reddit we find the /r/StarWars and /r/lotr subreddits. We wrote a simple reddit scraper using praw (google colab) to pull comments from each. We pulled about 50,000 comments total, from the hottest content around March 1, 2022. A significant number of these comments were extremely short. They did not contain any useful information. Neither a human nor machine could tell you which subreddit they came from. We randomly selected 10,000 of these 50,000 total comments, each containing at least 13 words (not counting stop words e.g. the, is, a, etc.). Exactly half of these comments came from /r/StarWars, and the other half from /r/lotr. This allows us to avoid the notorious imbalanced classes issue. If we used a coin flip to guess, or always picked the same franchise, we would expect to be right 50% of the time. We used this R script to label and organize the data and break it into 5 equally sized 50/50 groups so we could put this data on github. When training and testing models, we always want to remain mindful of data leakage. We cannot allow any information from outside the training dataset to leak into the model. To that end, we will use groups 1, 2 and 3 as our training data (60%), use group 4 as our validation data (20%), and leave group 5 as our testing data (20%). This also helps keep everything as apples-to-apples as possible. Each method will be encoded, trained, validated and tested using the exact same samples. At this point our data looks like this: We keep the original raw_text exactly as written, plus a clean_text version, converted to lowercase, with all punctuation and other special characters removed. Now how do we turn this data into input for our model? One-Hot Encoding We will begin with One-Hot encoded inputs. This will allow us to gain some insight into what our model is doing, and define some metrics for our model's performance. This will also allow us to clearly illustrate the difference between tokenization and encoding. 
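Before we get to the encodings themselves, here is a compact, runnable sketch of the classifier and decision rule described above. The feature count, the toy batch and the labels below are placeholders for illustration, not values taken from the original notebooks.

[dm_code_snippet]
import torch
import torch.nn as nn
import torch.optim as optim

n_features = 300                      # e.g. One-Hot (300); 768 for the wider inputs
model = nn.Sequential(nn.Linear(n_features, 1))
loss_fn = nn.BCEWithLogitsLoss()      # sigmoid + log loss applied to raw logits
optimizer = optim.Adam(model.parameters(), lr=0.01)

# a toy batch: 4 comments, n_features inputs each, with binary labels
X = torch.rand(4, n_features)
y = torch.tensor([[1.], [0.], [1.], [0.]])   # 1 = Lord of the Rings (positive class)

# one training step
optimizer.zero_grad()
logits = model(X)                     # raw scores, any real number
loss = loss_fn(logits, y)
loss.backward()
optimizer.step()

# at prediction time, squish logits into probabilities and apply the 50% boundary
probs = torch.sigmoid(logits)         # S(x) = 1 / (1 + e^-x)
preds = (probs > 0.5).long()          # 1 = Lord of the Rings, 0 = Star Wars
[/dm_code_snippet]

Everything that follows only changes what goes into X; the linear layer, loss and decision boundary stay the same.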
Tokenizing with Unigrams

Tokenization is the process of turning a full text string into smaller chunks we call tokens. There are many different ways we might tokenize our text. We could use bigrams (luke skywalker) or trigrams (gandalf the grey), or tokenize parts of a word, or even individual characters. Our options seem infinite. Here we will tokenize into unigrams, using the tidytext package in R. To avoid data leakage we will use only our training data here.

[dm_code_snippet]
############################################################
# load dependencies
library(tidyverse)
library(tidytext)

############################################################
# load our TRAINING data from github (groups 1, 2, 3)
comments <- read.csv("https://raw.githubusercontent.com/DataJenius/NLPEncodingExperiment/main/data/comments/selected/selected_reddit_comments_group1.csv") %>%
  rbind(read.csv("https://raw.githubusercontent.com/DataJenius/NLPEncodingExperiment/main/data/comments/selected/selected_reddit_comments_group2.csv")) %>%
  rbind(read.csv("https://raw.githubusercontent.com/DataJenius/NLPEncodingExperiment/main/data/comments/selected/selected_reddit_comments_group3.csv"))
[/dm_code_snippet]

We will tokenize on clean_text and also filter out any tokens that are stop words (common words such as the and is and a).

[dm_code_snippet]
############################################################
# define our stopwords and make unigrams from clean_text
words_to_ignore <- stop_words$word   # common stop words (the, is, a, ...)
unigrams <- comments %>%
  unnest_tokens(token, clean_text) %>%
  filter(!(token %in% words_to_ignore))
[/dm_code_snippet]

Even after filtering, we are left with a very long dataframe, with one row for each occurrence of each token: Which of these tokens do we care about? How many documents (comments) does each token appear in?

[dm_code_snippet]
############################################################
# find most important tokens according to document frequency
tokens_by_df <- unigrams %>%
  group_by(token) %>%
  summarise(total_df=length(unique(msg_id))) %>%
  arrange(desc(total_df))
[/dm_code_snippet]

Here is a list of the top 20 tokens by document frequency (df): Some of our most frequent tokens (e.g. story, tolkien, episode) seem to offer insight into these two groups. Let's visualize this: We see that some tokens (e.g. "trilogy", "characters", "story") are used roughly equally by both franchises. Some tokens (e.g. "frodo", "books", "tolkien") are used more often in /r/lotr comments, and others (e.g. "mandalorian", "luke", "episode") are used more often in /r/StarWars comments. But how do we get this information to our classification model?

Binary Encoding

One-Hot encoding is simple. If a comment includes a given token, we mark that with a 1. If a comment does not include a given token, we mark that with a 0. The only tricky part is deciding how many tokens to include. How much dimensionality do we want? We begin with our top 300 tokens according to the document frequency of the training data. We then use that vocabulary to encode all 10,000 comments. At this point our data looks like this: We have added a column for each of our top 300 tokens (starting with absolutely, action, actors and ending with youre). For each comment, if it includes a given token we see a 1, otherwise a 0. We can see this One-Hot encoding results in a sparse matrix; most of what it contains are 0's. You can find the R script used to one-hot encode the data, and the data itself, on github. We started with a raw text string, tokenized it into clean tokens, and encoded these tokens into actual numbers (0 or 1) that can be passed into a neural network as input.
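If you prefer to see the idea in Python, here is a minimal sketch of the same binary-presence encoding. The six-token vocabulary is invented purely for illustration; the actual encoding script, linked above, is written in R.

[dm_code_snippet]
# one-hot (binary presence) encoding against a fixed vocabulary -- a toy sketch
vocab = ["luke", "skywalker", "gandalf", "frodo", "tolkien", "episode"]

def one_hot_encode(clean_text, vocab):
    tokens = set(clean_text.lower().split())
    return [1 if token in tokens else 0 for token in vocab]

print(one_hot_encode("gandalf the grey meets frodo", vocab))
# [0, 0, 1, 1, 0, 0] -- both token order and repeat counts are lost
[/dm_code_snippet]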
Limitations of One-Hot Encoding Consider the following made up comments: example: 1 I love Luke Skywalker, but I hate Gandalf the Grey. example: 2 I love Gandalf the Grey, but I hate Luke Skywalker. As humans we can see these two examples have opposite meanings. Sadly, our One-Hot encodings do not show any difference: The order of tokens is ignored. Furthermore, only tokens that appear in our pre-defined vocabulary will appear here. Stop words like I, the, but, and less common tokens like Grey, simply disappear. We find 38 of these 10,000 comments are One-Hot encoded with all 0's; they don't contain any of the top 300 tokens. Another issue: we only match exact tokens. For the sake of this encoding, love and loved are two entirely different things, even though, as humans, we understand they are just different forms of the same word. We also ignore how often these tokens are used. How will our model do given these limitations? One-Hot (300) To train and use the model we turn back to python (google colab). We will pass the model 300 inputs, 0's or 1's, for each of our top 300 tokens. This results in 301 total trainable parameters (including the bias parameter which is independent of our inputs). We train for 100 epochs using our training and validation data: We see the model struggles to learn. As we proceed through 100 epochs the model overfits on the training data; the gap between training loss and validation loss becomes wide, and stays wide. How well does the model do given One-Hot (300) encoding on data it has never seen before (our testing data, group 5)? In terms of accuracy we got 91.3%. This means the model correctly classified 1,826 out of 2,000 comments. Not bad. In terms of precision we got 91.2%. This means that of all the comments our model thought were from Lord of the Rings, 91.2% of them were actually from Lord of the Rings. It also means 8.8% were really from Star Wars. In terms of recall we got 91.4%. This means that of the 1,000 Lord of the Rings comments in our training data (exactly half there are), our model found 914 of them. You can find all 2,000 individual predictions on github. One-Hot (768) Can we make the model perform better by including more tokens? What if we use the top 768 tokens instead? The data looks pretty much the same, except now we have another 468 columns of 0's and 1's. You can find the full data here on github. And you can train and test the model with python (google colab). The only change to the model is more inputs. Now we have 769 total trainable parameters. We train for the same 100 epochs: We see the model struggles to learn even more than before. The gap between training loss and validation loss gets wider as we train the model longer and longer. We find accuracy improves (+0.35%) to 91.55%. We also find precision improves (+1.1%) to 92.3%, but recall decreases (0.7%) to 90.7%. You can find all 2,000 individual predictions on github if you'd like to take a closer look. Obviously we can keep throwing more and more tokens at our model, but the law of diminishing returns is against us. Is there a better way to encode the information this text contains? word2vec Our approach to NLP encoding took a radical leap forward with the 2013 paper Efficient Estimation of Word Representations in Vector Space. Instead of representing each token with a simple 0 or 1, word2vec proposed representing each token as a vector instead. 
[latexpage] \[\vec{king} = \begin{bmatrix}0.125976562\\0.0297851562\\0.00860595703\\...\\0.251953125\end{bmatrix}\] We can use cosine similarity to compare how similar any two equal-length vectors are to each other: [latexpage] \[\text{CosSim}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \thinspace||\vec{b}||} = \frac{\sum_i^na_ib_i}{\sqrt{\sum_i^na_i^2}\sqrt{\sum_i^nb_i^2}}\] This allows us find the tokens that are most similar to our given token (king in this example): This also allows us to do the following: [latexpage] \[\vec{king}+\vec{woman}-\vec{man} = \text{???}\] Behind-the-scenes this is just vector addition and subtraction. We create a new vector: [latexpage] \[\begin{bmatrix}0.125976562\\0.0297851562\\0.00860595703\\...\\0.251953125\end{bmatrix} + \begin{bmatrix}0.243164062\\-0.0771484375\\-0.103027344\\...\\-0.0294189453\end{bmatrix} - \begin{bmatrix}0.32617188\\0.13085938\\0.03466797\\...\\0.02099609\end{bmatrix} = \begin{bmatrix}0.04296875\\-0.178222656\\-0.129089355\\...\\0.201538086\end{bmatrix}\] We can then use cosine similarity to find which token in our existing vocabulary is most similar to this new hybrid vector: [latexpage] \[\vec{king}+\vec{woman}-\vec{man} = \vec{queen}\] These word2vec vectors are clearly more informative and flexible than our old One-Hot encoding. Can we use them to improve the performance of our model? word2vec Tokenization The original word2vec was trained on roughly 100 billion words from a collection of articles curated by Google News. It contains a vocabulary of 3 million unique tokens. We will tokenize on the same examples from above to see how it compares: example: 1 I love Luke Skywalker, but I hate Gandalf the Grey. example: 2 I love Gandalf the Grey, but I hate Luke Skywalker. We find that word2vec's vocabulary includes unigrams like our One-Hot encoding, but also and parts of words: example: 1 tokenized: word2vec i love luke skywalker [UNK] but i hate ganda ##l ##f the gre ##y [UNK] example: 2 tokenized: word2vec i love ganda ##l ##f the gre ##y [UNK] but i hate luke skywalker [UNK] We see that word2vec's vocabulary does not include punctuation; both the comma and period have been replaced with [UNK], which represents unknown tokens. We also see word2vec's vocabulary does not include the name Gandalf and instead tokenizes it using two word fragments (ganda + ##l + ##f). These word fragment tokens allow word2vec to cover a much greater variety in language. word2vec Word Embeddings For each token in word2vec's vocabulary, we have a 300-dimension word embedding, a vector with a series of 300 numbers. [latexpage] \[\vec{skywalker} = \begin{bmatrix}0.105957031\\0.04296875\\0.0810546875\\...\\0.171875\end{bmatrix}\] For special tokens (e.g. [UNK]) we simply use a vector of 300 zeros: [latexpage] \[\vec{[UNK]} = \begin{bmatrix}0\\0\\0\\...\\0\end{bmatrix}\] Once we have tokenized our text, we simply grab the corresponding pre-trained word2vec word embedding for each of them. This includes our word fragment tokens like ##l and ##f. word2vec Sentence Embeddings We have 10,000 comments, each with a variable number of tokens, each of which is associated to a 300-dimension word embedding. How do we turn this mess into useable input for our model? The first thing we will do is deal with the fact that our comments have variable numbers of tokens. We will set a max token length of 500, and truncate any comments that are longer than that. For comments shorter than 500 tokens, we will pad them with vectors of 300 zeros. 
For example: example: 2 I love Gandalf the Grey, but I hate Luke Skywalker. This is tokenized by word2vec like so: example: 2 tokenized: word2vec i love ganda ##l ##f the gre ##y [UNK] but i hate luke skywalker [UNK] And encoded like so: We now have exactly 500 columns left-to-right, one for each token in this comment, plus as many [PAD] tokens as needed. We also have exactly 300 rows, one for each of our 300 dimensions. There are many different ways we might combine these token-based word embeddings into full sentence (comment) embeddings. One simple method is to take the average of each dimension, the row-wise mean of the above dataframe. If we take those averages and transpose them (rotate 90°) we get something that resembles our One-Hot encoding from above. At this point our data for all comments looks like this: Now we have one row for each comment, plus an additional 300 columns (our features X). However, instead of a simple 0 or 1 for a select vocabulary of tokens, we now have the average of each dimension of our 300-dimension word embeddings. What was once a sparse matrix full of 0's is now dense with information. Because these word2vec embeddings are pre-trained, we can encode all of our data at the same time without any fear of data leakage. You can find the full data here on github. And you can extract these embeddings yourself with python (google colab). Limitations of word2vec In theory, word2vec offers some advantages over One-Hot. For example, One-Hot ignores how often tokens are used. With word2vec, repeated tokens serve as a sort of weighted average in our sentence embeddings. We also noted that, in terms of One-Hot encoding, the tokens love and loved are completely different things. Not true of these word2vec embeddings: We see that the most similar tokens to loved not only include love but also loves, loving, adore and other synonyms and antonyms. This also shows how embeddings like these are able to deal with common slang, typos, and other eccentricities of natural language. However, as with our One-Hot encoding, the order of tokens is still ignored. The I love/hate Luke Skywalker, but I love/hate Gandalf the Grey. examples still result in identical encodings. Furthermore, any tokens that do not appear in word2vec's 3,000,000 token vocabulary will be encoded as zeros and lost. And the biggest limitation to these word2vec embeddings? They were trained almost 10 years ago, on data not very similar to our Reddit comments. Consider the tokens most similar to luke: It appears luke is associated to other common first names, nothing specific to our Jedi hero. And the tokens most similar to skywalker? I'm not sure what these tokens have in common. Is this new word2vec encoding truly better than the old One-Hot 0's and 1's? word2vec (300) To train and use the model we turn back to python (google colab). We will pass the model 300 inputs, the means of each of our 300 dimensions of word2vec embeddings for each comment. This results in 301 total trainable parameters (including the bias). We train for 100 epochs using our training and validation data: We see the model is now learning. The gap between training loss and validation loss is small, and both continue to decrease as the model trains longer and longer. However, compared to our One-Hot inputs with 300 dimensions, after 100 epochs, we find accuracy decreases (-2.1%) to 89.2%. We also find precision drops (-0.97%) to 90.25%, and recall decreases (-2.8%) to 87.9%. 
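As a concrete picture of the mean-pooling step described above, here is a small sketch. The random vectors simply stand in for real word2vec lookups, and the padded zero vectors are included in the average, mirroring the row-wise mean of the 500-column layout described earlier.

[dm_code_snippet]
import numpy as np

dim, max_len = 300, 500
rng = np.random.default_rng(0)

# pretend lookup: one 300-dimension vector per real token, zeros for [UNK]/[PAD]
token_vectors = [rng.normal(size=dim) for _ in range(4)]      # 4 real tokens
padding = [np.zeros(dim)] * (max_len - len(token_vectors))    # pad out to 500 slots

matrix = np.stack(token_vectors + padding)    # shape (500, 300)
sentence_embedding = matrix.mean(axis=0)      # dimension-wise average -> shape (300,)
print(sentence_embedding.shape)
[/dm_code_snippet]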
You can find all 2,000 individual predictions on github if you’d like to take a closer look. We continued training for 1,000 epochs: Even after 1,000 epochs we found accuracy failed to beat our One-Hot inputs (-0.15%) with 91.15%. We find an improvement in precision (+0.47%) to 91.69%, but recall decreases (-0.2%) to 90.5%. You can find all 2,000 individual predictions on github. Does this mean all these fancy vector encodings are a waste of time? Custom Embeddings The name word2vec usually refers to the pre-trained 300-dimensional word embeddings used above . However, word2vec can also refer the actual model used to train these embeddings in the first place. We aren't stuck with these pre-trained vectors. We can build our own using the word2vec model and our own specific text. Sure, we could just reuse the word2vec model, but what fun is that? In order to truly dig in and understand these embeddings, we are going to make our own, from scratch, using R. The Theory The theory behind word2vec is deceptively simple: You shall know a word by the company it keeps.John Rupert Firth At the heart of any neural network we find a loss function, and word2vec is no exception: [latexpage] \[J(\theta) = -\frac{1}{T} \sum_{t=1}^T \sum_{j=(-m)}^m log\bigg(p(w_{t+j}|w_t;\theta)\bigg) \text{ where } j \neq 0\] Although this formula appears intimidating at first, it can be reduced to some very simple ideas. (T) is the total number of tokens in our corpus. We loop through each of these tokens (t) one-by-one. At each token we look at a window of size (m) to see which tokens are our neighbors. To use an example, imagine the text: I think machine learning is super cool. Let us assume m=2 and t=5 for the sake of this example. This means we have reached the 5th token (is): [latexpage] \[\text{ machine } \text{ learning }\frac{\text{ is }}{} \text{ super } \text{ cool }\] Our m window of 2 means we look at the two words in front of is, and the two words that follow. Note that m=2 is an arbitrary choice, determined by the researcher. One heuristic is that smaller window sizes (2-15) lead to embeddings where high similarity scores between two embeddings indicates that the words are interchangeable. [...] Larger window sizes (15-50, or even more) lead to embeddings where similarity is more indicative of relatedness of the words.Jay Alammar, The Illustrated Word2vec Recall this term of our loss function: [latexpage] \[\sum_{j=(-m)}^m log\bigg(p(w_{t+j}|w_t;\theta)\bigg) \text{ where } j \neq 0\] Here this simply means: [latexpage] \[log\bigg(p(\text{machine}|\text{is};\theta)\bigg) + log\bigg(p(\text{learning}|\text{is};\theta)\bigg) + log\bigg(p(\text{super}|\text{is};\theta)\bigg) + log\bigg(p(\text{cool}|\text{is};\theta)\bigg)\] It all comes down to basic probability. How likely is a context token given a specific target token? Let us simplify the above math. Let A equal the number of comments in which a target token appears together with a context token. Let B equal the number of comments in which the target token appears overall. Our probability is now a simple A/B. Let us illustrate with an example: [latexpage] \[p(\text{luke}|\text{skywalker}) = \frac{A}{B} = \frac{50}{103} \approx 48.5\%\] The token skywalker appears in 103 of our training comments. The tokens skywalker and luke appear together in 50 of our training comments. Therefore the probability of seeing the context token luke given the target token skywalker is roughly 50%. 
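The A/B probability is simple to compute directly from a document-term view of the corpus. A toy sketch with three invented comments (the counts for luke and skywalker above come from the real training data, not from this toy):

[dm_code_snippet]
# p(context | target) = (# comments containing both) / (# comments containing target)
comments = {
    101: {"luke", "skywalker", "jedi"},
    102: {"skywalker", "saga"},
    103: {"gandalf", "grey"},
}

def context_prob(target, context, comments):
    has_target = [tokens for tokens in comments.values() if target in tokens]
    both = [tokens for tokens in has_target if context in tokens]
    return len(both) / len(has_target)      # A / B

print(context_prob("skywalker", "luke", comments))   # 1/2 = 0.5
[/dm_code_snippet]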
Embeddings from Scratch

We can create custom embeddings with less than 200 lines of R code. Actually, we could do it more efficiently than that, but this code is intentionally verbose for the sake of a teaching example. Because we are considering the entire comment regardless of length, our m window is now variable: the length of each individual comment. This results in the largest possible window, the largest possible m, and yields embeddings where similarity is more indicative of the relatedness of the words. To avoid data leakage, we will build these custom embeddings based on our training data only:

[dm_code_snippet]
############################################################
# load our TRAINING data from github (groups 1, 2, 3)
comments <- read.csv("https://raw.githubusercontent.com/DataJenius/NLPEncodingExperiment/main/data/comments/selected/selected_reddit_comments_group1.csv") %>%
  rbind(read.csv("https://raw.githubusercontent.com/DataJenius/NLPEncodingExperiment/main/data/comments/selected/selected_reddit_comments_group2.csv")) %>%
  rbind(read.csv("https://raw.githubusercontent.com/DataJenius/NLPEncodingExperiment/main/data/comments/selected/selected_reddit_comments_group3.csv"))
[/dm_code_snippet]

Again we tokenize into unigrams. B is simply the document frequency from before. We limit our vocabulary to tokens that have appeared in at least 2 comments:

[dm_code_snippet]
############################################################
# define vocabulary that appears in at least 2 documents
# we have 10,161 unique tokens (our vocabulary)
vocab <- unigrams %>%
  group_by(token) %>%
  summarise(B_total_occurrences=length(unique(msg_id))) %>%
  arrange(desc(B_total_occurrences)) %>%
  filter(B_total_occurrences >= 2) %>%
  rename(target_token=token)
[/dm_code_snippet]

We can get A from the pairwise_count function: the number of times each context token appears with a given target token.

[dm_code_snippet]
############################################################
# define all pairs of our 10,161 vocab words
library(widyr)  # provides pairwise_count()
vocab.pair <- unigrams %>%
  # only consider tokens in vocab
  filter(token %in% vocab$target_token) %>%
  # find A: the occurrences of both tokens together
  pairwise_count(token, msg_id, diag=FALSE) %>%  # exclude the diagonal (a token with itself is always 100%)
  rename(target_token=item1, context_token=item2, A_pair_occurrences=n)
[/dm_code_snippet]

We join the two together and calculate prob as A/B: Next we cast this probability into a matrix where each row represents a target token and each column represents a context token. We have 10,161 rows and columns, one for each token in our vocabulary: Once again we have a sparse matrix. If we take the row for a single target token and transpose it, what we are left with is a de facto 10,161-dimension word embedding. Each dimension represents the probability of each of the 10,161 potential context tokens: There are many different methods of dimensionality reduction we might use, including PCA, t-SNE and SVD, just to name a few. Here we will use IRLBA in order to reduce from 10,161 dimensions down to 300 dimensions, the same number our word2vec embeddings have: Our sparse matrix has become dense with information, but the trade-off is interpretability. We had 10,161 rows, each corresponding to a specific context token probability. Once reduced to 300, our dimensions no longer refer to a specific token. This makes it very hard to define what a given dimension is capturing in terms a human can understand. We can, however, understand the essence of this new custom embedding: we have preserved the most important information contained in the above probability matrix.

Custom Sentence Embeddings

How do we use these new custom embeddings to encode our full comments?
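Before answering that, a quick note on the reduction step itself. IRLBA computes a truncated SVD of a large sparse matrix in R; the same idea can be sketched in Python with scikit-learn's TruncatedSVD. The random sparse matrix below merely stands in for the real 10,161 × 10,161 probability matrix.

[dm_code_snippet]
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# stand-in for the sparse target-token x context-token probability matrix
prob_matrix = sparse_random(10161, 10161, density=0.001, random_state=42)

svd = TruncatedSVD(n_components=300, random_state=42)
embeddings = svd.fit_transform(prob_matrix)   # one dense 300-dimension row per token
print(embeddings.shape)                       # (10161, 300)
[/dm_code_snippet]

Each row of the reduced matrix becomes the custom word embedding for one vocabulary token.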
Our method will be similar to our word2vec sentence embeddings above, with one notable difference. Instead of padding and truncating our comments to a standard number of tokens, we will simply consider all the tokens in each comment that can be found in our vocabulary. Any tokens not in our vocabulary will be ignored. Any comment without any known tokens will be encoded with 300 zeros instead. Our custom vocabulary still includes 10,161 different unigrams. Each of these tokens is associated to one of the 300-dimension custom word embeddings we made above. For example: example: 1 I love Luke Skywalker, but I hate Gandalf the Grey. This is tokenized based on our custom vocabulary like so: example: 1 tokenized: custom love luke skywalker hate gandalf grey And encoded like so: To reduce this into a sentence embedding, once again we will take the average per dimension (the row-wise mean) and transpose it. At this point our data for all comments looks like this: You can find the full encoded data here on github. And you can encode this data for yourself using this R script. Limitations of Custom The order of tokens is still being ignored. The “I love/hate Luke Skywalker, but I love/hate Gandalf the Grey.” examples will still result in identical encodings. Furthermore, any tokens that do not appear in our custom 10,161 unigram vocabulary will be ignored. However, these custom embeddings seem to capture the essence of our comments far better than the pre-trained word2vec embeddings. Consider the most similar tokens for luke and gandalf: These similar tokens make a lot more sense now. Do we finally have an encoding that can beat the old One-Hot 0’s and 1’s? Custom (300) To train and use the model we turn back to python (google colab). We will pass the model 300 inputs, the means of each of our 300 dimensions of custom embeddings for each comment. This results in 301 total trainable parameters (including the bias). We train for 100 epochs using our training and validation data: We see the model is learning. Both training loss and validation loss continue to decrease as the model trains longer and longer. We find considerable improvement compared to our One-Hot inputs with 300 dimensions. After 100 epochs, accuracy increases (+3.3%) to 94.6%, precision increases (+3.2%) to 94.42%, and recall increases (+3.4%) to 94.8%. You can find all 2,000 individual predictions on github. We continued training for 1,000 epochs: This training/validation loss curve isn't as pretty now. After around 400 epochs it struggles to learn further, and starts to unlearn slightly as we approach 1,000 epochs. Despite this, we see another jump in performance. Accuracy increases another +1.15% to 95.75%, while precision increases another +1.56% to 95.98%, and recall increases another +0.7% to 95.5%. Again you can find all 2,000 individual predictions on github. It looks like our custom embeddings are pretty powerful. Can we do better? Custom (768) What happens if we increase our custom embeddings to 768 dimensions? The data looks pretty much the same, except now we have another 468 columns of dense custom embedding inputs. You can find the full data here on github. And you can train and test the model with python (google colab). The only change to the model is more inputs. Now we have 769 total trainable parameters. We train for the same 100 epochs: Once again we see the model is learning; training and validation loss decrease as the model trains longer and longer. Given these additional dimensions, we find more improvement. 
Compared to our custom embeddings with 300 dimensions, after 100 epochs each, accuracy increases (+0.75%) to 95.35%, precision increases (+0.98%) to 95.4%, and recall increases (+0.5%) to 94.8%. You can find all 2,000 individual predictions on github. We continued training for 1,000 epochs: Again the model starts to “unlearn” around the 400th epoch. Performance remains largely unchanged. We are unable to beat 300 dimensions with 768 dimensions after 1,000 epochs. We could keep adding dimensions and training epochs, but once again we are facing the law of diminishing returns. If we want to improve performance further, we will need a different strategy. BERT BERT (Bidirectional Encoder Representations from Transformers) combined the unsupervised pre-training of a transformer (OpenAI GPT) with embeddings that consider the bidirectional context of tokens (eLMo). When released in 2019, BERT was capable of producing state-of-the-art results for 11 different NLP tasks. Here we are concerned with only one of those tasks: classification. BERT: Pre-training of Deep Bidirectional Transformers forLanguage Understanding BERT Masking One of the things BERT does is use masks in order to predict a token given the context of surrounding tokens. For example: example: 1 model: BERT masked: love I [MASK] Luke Skywalker, but I hate Gandalf the Grey. How does this differ from word2vec? Until now, all of our encoding schemes have ignored the order of our tokens. A word2vec embedding trained with a window of m=9 would include love, hate, Luke Skywalker and Gandalf the Grey, but have no additional context beyond that. By comparison, BERT is bidirectional. In other words, not only does BERT consider the order of tokens, it considers them both left-to-right and right-to-left. example: 1 model: BERT advantage: bidirectional I love Luke Skywalker, but I hate Gandalf the Grey. example: 2 model: BERT advantage: bidirectional I love Gandalf the Grey, but I hate Luke Skywalker. At long last, the “I love/hate Luke Skywalker, but I love/hate Gandalf the Grey.” examples will result in different encodings, based on the specific order of the tokens. BERT Tokenization Let's take a closer look at the BERT tokenizer: example: 2 I love Gandalf the Grey, but I hate Luke Skywalker. This is tokenized by BERT like so: example: 2 tokenized: BERT [CLS] i love gan ##dal ##f the grey , but i hate luke sky ##walker . [SEP] The first thing we notice about BERT tokenization is the special [CLS] token added to the start of our comment. This is going to be critical in the next step. We also find [SEP] tokens at the end of our comments. Unlike word2vec, BERT tokenization includes punctuation. Our comma and period remain. We find that text tokenized by BERT, like word2vec, includes parts of words. However, unlike word2vec, BERT uses the WordPiece algorithm in order to define these subword tokens. Intuitively, WordPiece is slightly different to BPE in that it evaluates what it loses by merging two symbols to ensure it’s worth it.HuggingFace Documentation BERT has a vocabulary of only 30,522 unique tokens, compared to word2vec's vocabulary of 3 million. Despite this, we find that the BERT tokenizer with WordPiece is more versatile. It can handle pretty much whatever you throw at it: example: 3 Captain Smurgleblorp snorged his 😜 on Blursday. example: 3 tokenized: BERT [CLS] captain sm ##urg ##le ##bl ##or ##p s ##nor ##ged his [UNK] on blur ##sd ##ay . 
[SEP] As before, when it encounters something unknown (in this case our 😜 emoji) you will see the special [UNK] token in its place. These tokens look good. But how do we get our embeddings? BERT Sentence Embeddings For our word2vec and custom sentence embeddings above, our process was roughly the same: Tokenize our text and extract all known tokensEncode each known token with the corresponding word embeddingUse the dimension-wise average as the sentence embedding BERT, like word2vec above, uses truncation and padding to turn comments (each with a variable number of tokens) into a standard size tensor. However, this is where the similarities end. Illustration by Yang Gao BERT uses 12 transformer layers and 12 self-attention heads in order to focus on different sections of each text (reading both left-to-right and right-to-left). The resulting embeddings are based (in part) on the order of tokens, and the exact position of the token in question. This is how BERT is able to train and learn 768-dimension contextual embeddings for each token. This includes the special [CLS] token mentioned above. We will use this special [CLS] embedding, rather than a dimensional average, for our downstream task (predicting which franchise a comment belongs to). As we see below, this is exactly what the BertForSequenceClassification model does: [dm_code_snippet]BertForSequenceClassification( (bert): BertModel( (embeddings): BertEmbeddings( (word_embeddings): Embedding(30522, 768, padding_idx=0) (position_embeddings): Embedding(512, 768) (token_type_embeddings): Embedding(2, 768) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): BertEncoder( ** 12 layers of self-attention goodness ** ) (pooler): BertPooler( (dense): Linear(in_features=768, out_features=768, bias=True) (activation): Tanh() ) ) (dropout): Dropout(p=0.1, inplace=False) (classifier): Linear(in_features=768, out_features=2, bias=True) ) [/dm_code_snippet] We used this python script (google colab) to extract the output of the BertPooler above. Note that we used the pre-trained (pt) bert-based-uncased model to get these. Like the our word2vec input above, this BERTpt input is not based on any of our own comment data. These embeddings are based on what BERT saw back in 2019 when it was originally trained. [dm_code_snippet]bert_model = TFBertModel.from_pretrained( 'bert-base-uncased', trainable=False, num_labels=2) [/dm_code_snippet] At this point our data is almost identical to custom (768), just with different values for each dimension. It looks like this: You can find the full encoded data here on github. BERTpt (768) To train and use the model we turn back to python (google colab). We will pass the model 768 inputs, the pre-trained BERT [CLS] embeddings for each comment. This results in 769 total trainable parameters (including the bias). Because the values of these BERTpt embeddings are significantly larger that the others, we reduced our initial learning rate (η) to 0.0003 to compensate. Then we train for 100 epochs using our training and validation data: We see that our same simple classification model is able to learn from these fancy BERT embeddings too. Both training loss and validation loss decrease as the model trains. How well do these new BERTpt embeddings do as input? In a word: disappointing. The BERTpt input fails to outperform the one-hot or word2vec inputs. Custom still reigns supreme. You can find all 2,000 individual predictions on github. 
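For readers who want to reproduce the extraction step, here is a minimal sketch using the HuggingFace transformers library. It uses the PyTorch BertModel rather than the TFBertModel class shown above, and the single example comment and max length are just for illustration.

[dm_code_snippet]
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

comment = "I love Gandalf the Grey, but I hate Luke Skywalker."
inputs = tokenizer(comment, return_tensors="pt", padding="max_length",
                   truncation=True, max_length=128)

with torch.no_grad():
    outputs = bert(**inputs)

cls_embedding = outputs.pooler_output    # shape (1, 768): pooled [CLS] representation
print(cls_embedding.shape)
[/dm_code_snippet]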
Does this mean using BERT is a waste of our time? Not so fast. Fine-tuning BERT As mentioned above, BERT uses 12 transformers and 12 self-attention heads in order to generate these embeddings. However, the above pre-trained encoding didn't take full advantage of that. The excellent Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) by Jay Alammar can help provide readers with some additional context. For our purposes the key point is this: we can fine-tune BERT. BERT: Pre-training of Deep Bidirectional Transformers forLanguage Understanding Unlike prior attempts at transfer learning, BERT allows us to fine-tune using all 109,483,778 of its trainable parameters. This results in the best possible representations of our specific text. Once again we remain mindful of data leakage. We restrict this tuning to our training and validation data only, and keep our test data unseen. We used this python script (google colab) to fine-tune BERT for a single epoch. We then extracted these fine-tuned (BERTft) embeddings for all 10,000 comments for use in our own model. While extracting these embeddings you will encounter the biggest limitation of BERT. Compared to our simple model, which trains and predicts in under five minutes, BERT can feel painfully slow. It took roughly 4 hours to complete a single epoch of fine-tuning. Why so long? Because we are fine-tuning 109,483,778 trainable parameters. That's a lot of math. Our end result is almost identical to custom (768) and BERTpt (768). All that changes is the specific values for each dimension. It looks like this: You can find the full encoded data here on github. BERTft (768) To train and use the model we turn back to python (google colab). We will pass the model 768 inputs, the fine-tuned BERT [CLS] embeddings for each comment. This results in 769 total trainable parameters (including the bias). We reduce our initial learning rate (η) to 0.00001 and train for 100 epochs using our training and validation data: Both training loss and validation loss decrease as the model trains, but learning seems to level off after around 18 epochs. Now the million dollar question: how do fine-tuned BERT embeddings compare to our best custom embeddings? We find accuracy improves (+0.65%) to 96.4%, precision improves (+0.61%) to 96.59%, and recall increases (+0.7%) to 96.2%. You can find all 2,000 individual predictions on github if you’d like to take a closer look. Eat your heart out, custom. It looks like BERT is the new sheriff in town. Comparing our Methods Here is a quick recap of the 10 different methods we have tried: Beyond these statistics, we can gain some more intuition towards these different input methods by looking at some specific examples. Easy Examples Recall that our model predictions include the probability of belonging to the positive class (Lord of the Rings). This means we can look beyond accuracy, precision and recall, and see how confident the model was for each answer. Consider this comment (from Star Wars): msg_id: 17480 Source: /r/StarWars “For over a thousand generations the Jedi knights were the guardians of peace and justice in the old republic. Before the dark times. Before the empire.”- Ben Kenobi to Luke Skywalker. This is how each method responded: Our decision boundary is 50% (the dotted line). Everything over that line is classified as Lord of the Rings. Because this example comes from Star Wars, here we want all of these probabilities to be as low as possible. 
The closer a method is to 0%, the more confident it is that this is a Star Wars comment. We can see that although our word2vec methods got this one right, they were the least right of the 10 methods tested. Compare to One-Hot and BERT which are much closer to 0%. Consider this comment (from Lord of the Rings): msg_id: 27417 Source: /r/lotr Start with The Hobbit: 1. Watching The Hobbit first doesn't really spoil anything in The Lord of the Rings, whereas watching The Lord of the Rings first does spoil major plot elements in The Hobbit. 2. The Hobbit is shorter and you can start with the extended editions: those extra 12 minutes in the first film aren't going to make or break anyone's viewing experience. This is how each method responded: Because this comment is from our positive class (Lord of the Rings) here we want to see all of these probabilities as high as possible. The closer a method is to 100%, the more confident it is that this is a “Lord of the Rings” comment. These comments are both rich with franchise-specific terms like Jedi, Ben Kenobi, Luke Skywalker, Hobbit and even The Lord of the Rings. We can understand why all methods got these right- these were some easy ones. Not As Easy Examples Consider this comment (from Lord of the Rings): msg_id: 30774 Source: /r/lotr I’m halfway through the Two Towers right now (third reading). I hope you make it!Props for trying at 11. I tried it when I was in grade 2. I don’t remember if I made it through, but I once found copies of the Ent’s songs in and old journal from that time haha This is how each method responded: We humans recognize Two Towers as the title of one of the Lord of the Rings books. Why did our One-Hot encodings struggle here? Because the critical token towers does not appear in the vocabularies of One-Hot (300) or One-Hot (768). And the pre-trained word2vec and BERT embeddings don't recognize the significance of Two Towers here. Compare to our custom (300) embeddings: Not only does our custom vocabulary include towers, it associates that token to gimli, aragorn and legolas. Presumably our fine-tuned BERT embeddings were able to find a similar pattern, and come to a similar conclusion. Our next comment (from Star Wars) dares to discuss the important issues: Does Jabba the Hutt reproduce asexually? msg_id: 18052 Source: /r/StarWars Yes it literally does. Asexual reproduction. That user isn’t talking about the Hutt’s sexual orientation, they’re talking about how they reproduce. But if they reproduced asexually, they would not have males and females This is how each method responded: Similar to above, the critical token hutts does not appear in the vocabularies of One-Hot (300) or One-Hot (768). Likewise, the pre-trained word2vec and BERT embeddings don't recognize this as a character from Star Wars. Even our custom (300) and custom (768) embeddings disagree here. Let's take a closer look at the most similar tokens to hutts according to both custom (300) and custom (768): Both embeddings associate back to jabba, but only custom (768) contains enough dimensionality to also associate with asexual and orientation. Note that both custom embeddings (300 and 768) include the same 10,161 token vocabulary. It takes custom (300) 1,000 epochs to get it right. With the added dimensionality, custom (768) figures it out after only 100. Where BERT Prevails Consider this comment (from “Star Wars”): msg_id: 733 Source: /r/StarWars Their subscriber base surely expanded due to Mando and its great word of mouth. 
If BoBF were as good as Mando and built similar goodwill, it would expand their subscriber base even more. BoBF's bad rep might not have shrunk it much, but they lost out on all the new subscribers they could have gained. Sure, there are diminishing returns after each additional good show, but they'd still be getting more money than they would by producing crap, and it's not like slightly better writing would have added much in production costs. This is how each method responded: This comment uses slang (Mando, BoBF) to refer to the shows The Mandalorian and The Book of Boba Fett. The token mando appears in all of our One-Hot and custom vocabularies. bobf appears in our custom and One-Hot (768) vocabularies. Despite this, only our fine-tuned BERT embeddings resulted in a confident right answer. Consider this comment (from “Lord of the Rings”): msg_id: 42884 Source: /r/lotr Thinking about it now, I wonder if it was because they had some crumb of power because suddenly the world was giving them attention.And it just went to their heads. They needed to flex that they should be considered as an absolute authority.Makes me think of an old boss. Couldn’t handle a woman who said no. He grew up being rejected by them, but now that he was a manager, he just lost his shit whenever his control was threatened.I think it was the same thing. Finally they were an expert at something, and they just had to show everyone in the most awkward way possible.Most people just say “The book was better”, like that’s some sort of secret surprise to anyone.That’s a given. With everything. Since ever. But they just need to let everyone know they read the book. This is how each method responded: Here is an interesting case where only our most simple encoding (One-Hot with top 300 tokens) and our most complicated encoding (fine-tuned BERT) got the right answer. And they are both pretty confident about it. What is going on here? Out of our 298 training examples that included the token book, 70.5% of them came from Lord of the Rings. Our One-Hot (300) encoding focuses on book because (a) it only looks for 300 tokens in the first place, and (b) this is the most relevant token in this example. Our fine-tuned BERT encoding also focuses on the token book, despite having more than 100 million additional trainable parameters. This is the power of self-attention, and a good example of how BERT learns which parts of a text are the most relevant. Given the length of this comment, our other encoding methods get overwhelmed with too much information. Once we look for 768 One-Hot encoded tokens, or take the dimension-wise averages of all the tokens this comment contains, our encodings lose their focus on book, and confuse the model. Sentence Similarity When looking at our word embeddings above, we used cosine similarity to rank how close tokens are to each other. [latexpage] \[\text{CosSim}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \thinspace||\vec{b}||} = \frac{\sum_i^na_ib_i}{\sqrt{\sum_i^na_i^2}\sqrt{\sum_i^nb_i^2}}\] Can we use this same function to see how similar our sentence embeddings are to each other? Yes we can. Custom Similarity Consider the following comment (from “Lord of the Rings”): msg_id: 28126 Embedding: custom (768) Source: /r/lotr If I am correct, Sam notes Gandalf's true power when he effortlessly lifts I can't remember what. But the frail old man is acf even as Gandalf the Grey. 
If we compare the cosine similarities of our custom (768) sentence embeddings, we find this is the most similar comment: msg_id: 28180 Embedding: custom (768) Source: /r/lotr Is it actually lore that their power is restrained? As in externally. I thought that they chose not use their power, though a portion of gandalf's true power does slip through a couple of times. I think Sam notes his strength when he lifts frodo up to Glorfindel's horse.Edit: it was in Moria after Frodo got stabbed by the troll. Gandalf wasn't present when Glorfindel picked him up after weathertop. cosine similarity: 0.6641756 For the sake of comparison, here is another comment which also contains the name Gandalf, but has a much lower similarity score: msg_id: 24423 Embedding: custom (768) Source: /r/lotr gandalf better not appear in this, that makes no sense, but I like the lotr universe becuase its set themed to the somewhat forgotten ancient culture of the British Isles. cosine similarity: 0.05298240 This comment mentions Gandalf, but not his true power. We can see how this is less similar to the original comment. BERT Contextual Embeddings Consider the following examples: He is over 7 feet tall. Did you measure in feet or meters? In these examples, feet refers to a measurement of 12 inches. After playing basketball my feet hurt. Her feet are in need of a pedicure. In these examples, feet refers to the things attached to the bottom of your legs. This is a common phenomenon in natural language: the same words can have different meanings, depending on their context. Let us compare to this example: My feet are too big for these shoes. When we convert these into pre-trained BERT embeddings and compare cosine similarity, we see something amazing: The BERT embeddings for feet (with toes) are more similar than the embeddings for feet (12 inches). The 12 transformers and self-attention heads of BERT are powerful enough to produce more informative embeddings. Consider the following: He failed his driver test because he can't parallel park. Where do you park this big truck? In these examples, park refers to the act of putting a vehicle to rest. Turing Park is full of trees and flowers. We should walk over to the park today. In these examples, park refers to a place you might want to go. Let us compare to this example: I took the kids to the park so they could play. Again we convert these into pre-trained BERT embeddings and compare cosine similarity, and again we see something amazing: The BERT embeddings for park (noun) are more similar than the embeddings for park (verb). Once again BERT demonstrates the power of context in understanding natural language. Conclusion Back in the ancient times, before 2013, we usually encoded basic unigram tokens using simple 1's and 0's in a process called One-Hot encoding. word2vec improved things by expanding these 1's and 0's into full vectors (aka word embeddings). BERT improved things further by using transformers and self-attention heads to create full contextual sentence embeddings. Beyond use in this classification example, these embeddings are powerful. They allow us to do math with words: [latexpage] \[\vec{king}+\vec{woman}-\vec{man} = \vec{queen}\] These embeddings also allow us to find the meaning of specific tokens, and compare the similarity of full sentences (comments). We can use BERT embeddings for many other tasks too, such as extracting a questions/answers model, or summarizing key points of text. 
This article has only scratched the surface of BERT's full capabilities. When training models, it is often tempting to toss in more features, train for more epochs, and hope for the best. This article demonstrates the diminishing returns such an approach yields. Instead, we must think carefully about the information our text contains and how we can represent that information numerically. As the experiment above shows, how we choose to encode our data can make all the difference.


Artificial intelligence learns 'song' of coral reefs

#artificialintelligence

England: According to new research, artificial intelligence (AI) can track the health of coral reefs by learning the "song of the reef." The research has been published in the journal Ecological Indicators. Coral reefs have a complex soundscape, and even experts must conduct painstaking analyses to measure reef health from sound recordings. In the study, University of Exeter scientists trained a computer algorithm using multiple recordings of healthy and degraded reefs, allowing the machine to learn the difference. The computer then analysed a host of new recordings and successfully identified reef health 92 per cent of the time.


Implementing Particle Swarm Optimization in Tensorflow

#artificialintelligence

Originally published on Towards AI.


Toward Reduction in False-Positive Thyroid Nodule Biopsies with a Deep Learning–based Risk Stratification System Using US Cine-Clip Images

#artificialintelligence

The Cine-CNNTrans achieved an average AUC of 0.88 ± 0.10 for classifying benign versus malignant thyroid nodules. The Cine-CNNTrans showed higher AUC than the Static-2DCNN (P = .03). For aggregating framewise outputs into nodulewise scores, the Cine-CNNTrans tended toward higher AUC compared with the Cine-CNNAvePool (P = .17). Our system tended toward higher AUC than the Cine-Radiomics and the ACR TI-RADS level, though the difference did not achieve statistical significance (P = .16).


Machine Learning at the Edge

#artificialintelligence

I'm really excited to talk about advances in federated learning at the edge with you. When I think about the edge, I often think about small embedded devices, IoT, other types of things that might have a small computer in them, and I might not even realize that. I recently learned that these little scooters that are all over my city in Berlin, Germany, and maybe even yours as well, that they are collecting quite a lot of data and sending it. When I think about the data they might be collecting, and when I put on my data science and machine learning hat, and I think about the problems that they might want to solve, they might want to know about maintenance. They might want to know about road and weather conditions. They might want to know about driver performance. Really, the ultimate question they're trying to answer is this last one, which is, is this going to result in some problem for the scooter, or for the human, or for the other things around the scooter and the human? These are the types of questions we ask when we think about data and machine learning. When we think about it on the edge, or with embedded small systems, this often becomes a problem because traditional machine learning needs quite a lot of extra information to answer these questions. Let's take a look at a traditional machine learning system and investigate how it might go about collecting this data and answering this question. First, all the data would have to be aggregated and collected into a data lake. It might need to be standardized, or munged, or cleaned, or something done with it beforehand. Then, eventually, that data is pulled usually by a data science team or by scripts written by data engineering, or data scientists on the team.


GitHub - jason718/awesome-self-supervised-learning: A curated list of awesome self-supervised methods

#artificialintelligence

Self-Supervised Learning has become an exciting direction in the AI community. Predicting What You Already Know Helps: Provable Self-Supervised Learning. For self-supervised learning, Rationality implies generalization, provably. Can Pretext-Based Self-Supervised Learning Be Boosted by Downstream Data? FAIR Self-Supervision Benchmark [pdf] [repo]: various benchmark (and legacy) tasks for evaluating the quality of visual representations learned by various self-supervision approaches.


Stock Forecast Based On a Predictive Algorithm

#artificialintelligence

This Stock Pickers forecast is designed for investors and analysts who need predictions of the best utilities stocks to buy across the industry. Package Name: Utilities Stocks. Recommended Positions: Long. Forecast Length: 7 Days (5/18/22 – 5/25/22). I Know First Average: 3.35%. For this 7-day forecast the algorithm successfully predicted 9 out of 10 movements. The highest trade return came from CDZI, at 15.43%. Further notable returns came from PNW and NRG at 4.36% and 3.65%, respectively. The package had an overall average return of 3.35%, providing investors with a premium of 6.04% over the S&P 500's return of -2.69% during the same period.


Machine learning helps distinguishing diseases - Innovation Origins

#artificialintelligence

Nowadays doctors define and diagnose most diseases on the basis of symptoms. However, that does not necessarily mean that the illnesses of patients with similar symptoms will have identical causes or demonstrate the same molecular changes. In biomedicine, one often speaks of the molecular mechanisms of a disease. This refers to changes in the regulation of genes, proteins or metabolic pathways at the onset of illness. The goal of stratified medicine is to classify patients into various subtypes at the molecular level in order to provide more targeted treatments, writes the Technical University of Munich in a press release.


Artificial Intelligence in China

#artificialintelligence

In the fifth of a series of blogs from our global offices, we provide an overview of key trends in artificial intelligence in China. What is China's strategy for Artificial Intelligence? In March 2021, the Chinese government released the Outline of the 14th Five-Year Plan of the National Economic and Social Development of the People's Republic of China and Vision 2035. This includes more than 50 references to "[artificial] intelligence", reflecting China's aim to develop a new generation of information technology powered by artificial intelligence. Specifically, China intends to drive industry through science and technology projects to develop cutting-edge fundamental theories and algorithms, create specialized chips and build open-source algorithm platforms such as deep learning frameworks.