Several factors have led to the increase in interest in this field, which is heavily influenced by techniques from speech processing. One major factor is the recent availability of large online text collections. Another is a disillusionment with traditional AI-based approaches to parsing and natural language processing (NLP). Charniak is recognized as a distinguished contributor to what he calls traditional AI NLP, which is why it is all the more significant that in the Preface, when speaking of his recent transition to the statistical approach, he writes: "… few, if any, consider the traditional study of language from an artificial-intelligence point of view a 'hot' area of research. A great deal of work is still done on specific NLP problems, from grammatical issues to stylistic considerations, but for me at least it is increasingly hard to believe that it will shed light on broader problems, since it has steadfastly refused to do so in the past."
I saved the data using numpy's np.save() function. Each row has length 2992, which corresponds to the length of the longest recording (136) multiplied by the number of variables (22).

I used numpy's linear interpolator to fill in the gaps: the code generates a scaffold of NaN values, randomly chooses a list of indices from the scaffold equal in length to the sign being interpolated, fills the real values into the scaffold at those random indices, and then interpolates the missing values.

The pipeline specifies three steps: scaling using StandardScaler(), principal components analysis (PCA) using the PCA() function, and finally a linear support vector classifier (LinearSVC()). Once the PCA algorithm has reduced the dimensionality of the data to just those components that capture the most variance, the data is used to train a linear support vector classification (SVC) model.
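The steps above can be sketched roughly as follows. This is a minimal illustration, not the original code: the function and variable names (stretch_sign, signs, X, y), the synthetic random data and labels, the file name "signs.npy", and the 95% variance threshold for PCA are all my assumptions; only the 136 × 22 = 2992 shape, the NaN-scaffold interpolation idea, and the StandardScaler → PCA → LinearSVC pipeline come from the text.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

MAX_LEN, N_VARS = 136, 22  # longest recording x variables per frame

def stretch_sign(sign, rng):
    """Scatter a short recording across random rows of a NaN scaffold
    of length MAX_LEN, then linearly interpolate the missing rows."""
    scaffold = np.full((MAX_LEN, N_VARS), np.nan)
    # randomly chosen (sorted) indices, as many as the sign has frames
    idx = np.sort(rng.choice(MAX_LEN, size=len(sign), replace=False))
    scaffold[idx] = sign
    x = np.arange(MAX_LEN)
    for col in range(N_VARS):
        good = ~np.isnan(scaffold[:, col])
        # fill NaNs by linear interpolation between the real values
        scaffold[:, col] = np.interp(x, x[good], scaffold[good, col])
    return scaffold.ravel()  # one row of length 136 * 22 = 2992

rng = np.random.default_rng(0)
# hypothetical data: 40 recordings of varying length, 4 fake classes
signs = [rng.normal(size=(rng.integers(20, 136), N_VARS)) for _ in range(40)]
X = np.vstack([stretch_sign(s, rng) for s in signs])
y = rng.integers(0, 4, size=len(signs))
np.save("signs.npy", X)  # persist the padded rows, as in the text

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),  # keep ~95% of the variance
    ("svc", LinearSVC()),
])
pipe.fit(X, y)
```

Sorting the chosen indices keeps np.interp's x-coordinates monotonic; frames that fall before the first or after the last chosen index are simply held at the nearest real value, which is np.interp's default extrapolation behaviour.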