In this notebook I train convolutional and recurrent neural networks to classify sequences. The sequences are obtained by vectorizing messages from the 20NewsGroup dataset, which contains 20,000 messages taken from 20 newsgroups.
The messages have several headers at the top. One of them is the category field, which states the label of the observation. The label may also be present in other header fields, namely Followup-to or Subject.
> things which are eternal. Jesus is a subset of God. Therefore
>:> Jesus belongs to the set of things which are eter
When reading the files, any header with a mention of the corresponding label is thus removed. Furthermore, the Path and Xref headers and the headers that contain e-mail addresses are also removed.
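The header filtering described above could be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the function name `strip_headers`, the e-mail regular expression, and the assumption that headers are separated from the body by the first blank line are all mine, and folded (multi-line) headers are not handled.

```python
import re

# Loose pattern for anything that looks like an e-mail address (an assumption).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DROPPED_HEADERS = {"path", "xref"}


def strip_headers(message: str, label: str) -> str:
    """Drop Path/Xref headers and headers mentioning the label or an e-mail."""
    # Headers end at the first blank line; everything after it is the body.
    head, sep, body = message.partition("\n\n")
    kept = []
    for line in head.split("\n"):
        name = line.partition(":")[0].strip().lower()
        if name in DROPPED_HEADERS:
            continue
        if label.lower() in line.lower():
            continue
        if EMAIL_RE.search(line):
            continue
        kept.append(line)
    # The body is left untouched, even if it mentions the label.
    return "\n".join(kept) + sep + body


raw = (
    "Path: a!b!c\n"
    "From: joe@example.com\n"
    "Newsgroups: alt.atheism\n"
    "Subject: Re: hello\n"
    "Lines: 3\n"
    "\n"
    "Body mentions alt.atheism here."
)
cleaned = strip_headers(raw, "alt.atheism")
```

Only the headers are filtered here, so the label can still leak through the body text if a message happens to quote its own newsgroup name.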
The texts are split to form a training data set and a validation set.
I use train_test_split with stratification so that the categories are distributed proportionately across both sets. The last observation is left out of the training dataset so that it can be used for an example prediction at the end.
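A minimal sketch of such a split with scikit-learn's `train_test_split` is shown here on toy data. The 20% validation size, the `random_state`, and the choice to remove the last observation before splitting are assumptions for illustration; the notebook may do this differently.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the messages and their labels (two balanced classes).
texts = [f"message {i}" for i in range(21)]
labels = [i % 2 for i in range(20)] + [0]

# Keep the last observation aside for the example prediction at the end.
held_out_text, held_out_label = texts[-1], labels[-1]
texts, labels = texts[:-1], labels[:-1]

# stratify=labels keeps the class proportions equal in both splits.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)
```

With 20 balanced observations and a 20% validation size, each split keeps the one-to-one class ratio of the full data.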
I have written a vectorizer to transform the texts into sequences. The method is simple: first an object is instantiated, set in this case with a maximum length of 200 tokens per sequence and a maximum of 20,000 tokens in the vocabulary.
Then the fit method of the object takes the training array as an argument and builds the vocabulary, which is used every time an array is passed to the transform method. Transforming an array means tokenizing each string and mapping its tokens to indices in the vocabulary.
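The fit/transform behaviour described above can be illustrated with a small stand-in class. The class name `SequenceVectorizer`, the tokenizing regular expression, and the reserved ids (0 for padding, 1 for unknown tokens) are assumptions, not the notebook's implementation.

```python
import re
from collections import Counter


class SequenceVectorizer:
    """Hypothetical stand-in for the notebook's vectorizer."""

    PAD, UNK = 0, 1  # assumed ids for the two special tokens

    def __init__(self, max_len=200, max_tokens=20000):
        self.max_len = max_len
        self.max_tokens = max_tokens
        self.vocab = {}

    @staticmethod
    def tokenize(text):
        # Lowercase and keep runs of letters, digits and apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def fit(self, texts):
        # Keep the most frequent tokens; ids 0 and 1 stay reserved.
        counts = Counter(tok for text in texts for tok in self.tokenize(text))
        most_common = counts.most_common(self.max_tokens)
        self.vocab = {tok: i + 2 for i, (tok, _) in enumerate(most_common)}
        return self

    def transform(self, texts):
        sequences = []
        for text in texts:
            ids = [self.vocab.get(tok, self.UNK) for tok in self.tokenize(text)]
            ids = ids[: self.max_len]                      # truncate long texts
            ids += [self.PAD] * (self.max_len - len(ids))  # pad short ones
            sequences.append(ids)
        return sequences


vectorizer = SequenceVectorizer(max_len=5, max_tokens=3)
vectorizer.fit(["the cat sat", "the cat ran", "the dog"])
sequences = vectorizer.transform(["the cat flew"])
```

Fitting only on the training array means that words seen only in the validation set map to the unknown id, which mirrors what happens with unseen text at prediction time.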
The resulting arrays of sequences, along with their corresponding target lists, are then passed to TwentyNewsDataset to instantiate two torch.utils.data.Dataset objects, which are used to train the neural networks.
Increasing the maximum sequence length may reduce overfitting to the training set and thus raise validation accuracy. For performance reasons I have set a relatively low length.
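For reference, a dataset wrapper of this kind usually needs only three methods. The body below is a guess at what TwentyNewsDataset might look like, not the notebook's definition; the batch size is likewise an arbitrary choice.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TwentyNewsDataset(Dataset):
    """Assumed shape of the dataset: padded id sequences plus integer targets."""

    def __init__(self, sequences, targets):
        self.sequences = torch.as_tensor(sequences, dtype=torch.long)
        self.targets = torch.as_tensor(targets, dtype=torch.long)

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return self.sequences[idx], self.targets[idx]


# Toy data: three sequences of length 4 and their class indices.
train_ds = TwentyNewsDataset([[2, 3, 0, 0], [4, 1, 5, 0], [6, 7, 8, 9]], [0, 1, 2])
train_loader = DataLoader(train_ds, batch_size=2, shuffle=False)
first_sequences, first_targets = next(iter(train_loader))
```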
Modelling
In this section four models are trained, two convolutional networks and two recurrent networks, using these functions.
All of the models share a similar workflow: a multidimensional representation of the sequences, a process of learning high-level features in the data, and a final classification phase.
For this reason I have written a simple class that I will instantiate in every model and which will gather the different modules to be used.
Embedding
The following class will be used to apply a multidimensional transformation to the sequences in the forward pass. The backward pass will only alter the weights of the embeddings if these are not pretrained.
As I use pretrained weights in one of the convolutional neural networks later, I download a GloVe index (822 MB) with all the weights and create a tensor with the shape (20002, 100), that is, (N_TOKENS, GLOVE_EMBEDDING_DIM). In other words, a matrix with one row for every word in the vocabulary plus two special tokens, and the corresponding 100 dimensions for each token.
If the GloVe index does not contain a token from the vocabulary, every dimension of that token will equal zero.
More than 90% of the tokens have been assigned pretrained weights. Some of the tokens that were not found in the GloVe index are shown above.
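Building such a matrix can be sketched as follows, assuming the GloVe index is a text file with one token per line followed by its space-separated vector. The function name `build_embedding_matrix` and the toy three-dimensional vectors are illustrative only; with the real index the lines would come from the downloaded file and `dim` would be 100.

```python
import torch
from torch import nn


def build_embedding_matrix(vocab, glove_lines, n_tokens, dim):
    """Return an (n_tokens, dim) tensor and the number of tokens found."""
    # Rows default to zero, so tokens missing from GloVe keep an all-zero vector.
    matrix = torch.zeros(n_tokens, dim)
    found = 0
    for line in glove_lines:
        word, *values = line.rstrip().split(" ")
        idx = vocab.get(word)
        if idx is not None and len(values) == dim:
            matrix[idx] = torch.tensor([float(v) for v in values])
            found += 1
    return matrix, found


# Toy vocabulary (ids 0 and 1 reserved for the special tokens) and toy vectors.
vocab = {"the": 2, "cat": 3, "zzz": 4}
glove_lines = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6", "dog 0.7 0.8 0.9"]
matrix, found = build_embedding_matrix(vocab, glove_lines, n_tokens=5, dim=3)

# freeze=True keeps the pretrained weights fixed during the backward pass.
embedding = nn.Embedding.from_pretrained(matrix, freeze=True, padding_idx=0)
```

Here "zzz" plays the role of a vocabulary token that is absent from GloVe, so its row stays at zero.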
Classifier
I have defined two classifiers, one with one fully connected layer and another with two. Both take a dropout argument, which randomly zeroes out a fraction of the elements of the classifier's input tensor.
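The two classifier heads might look roughly like this. The class names, the hidden size, and the ReLU between the two linear layers are assumptions; only the layer counts and the dropout on the input come from the description above.

```python
import torch
from torch import nn


class OneLayerClassifier(nn.Module):
    """Dropout on the input followed by a single fully connected layer."""

    def __init__(self, in_features, n_classes, dropout=0.5):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(in_features, n_classes)

    def forward(self, x):
        return self.fc(self.dropout(x))


class TwoLayerClassifier(nn.Module):
    """Dropout on the input followed by two fully connected layers."""

    def __init__(self, in_features, hidden, n_classes, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(in_features, hidden),
            nn.ReLU(),  # assumed non-linearity between the two layers
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)


features = torch.randn(8, 64)  # a batch of 8 feature vectors
one_layer_logits = OneLayerClassifier(64, 20)(features)
two_layer_logits = TwoLayerClassifier(64, 32, 20)(features)
```

Both heads return raw logits, which is the form `nn.CrossEntropyLoss` expects.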
CNN
Two simple models with convolutional networks are trained:
The first version is a network with three one-dimensional convolution layers with max pooling, which takes 100-dimensional embeddings with pretrained weights as input and has a two-layer classifier at the output.
The second version is a network with four one-dimensional convolution layers with max pooling, a 128-dimensional input embedding layer, and a one-layer classifier at the output.
I have set L2 regularization, and 50% and 40% dropout in the respective classifiers, to try to address overfitting to the training set. This works better in the second convolutional network.
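As a rough illustration of the second version, the sketch below stacks four Conv1d and max-pooling blocks on a 128-dimensional embedding and ends with a one-layer classifier with 40% dropout. The channel count, kernel size, global max pooling over time, learning rate, and weight decay value are all assumptions; L2 regularization is expressed here through the optimizer's `weight_decay` argument, which is one common way to apply it in PyTorch.

```python
import torch
from torch import nn


class ConvClassifier(nn.Module):
    def __init__(self, n_tokens=20002, emb_dim=128, channels=64,
                 n_layers=4, n_classes=20, dropout=0.4):
        super().__init__()
        self.embedding = nn.Embedding(n_tokens, emb_dim, padding_idx=0)
        blocks, in_ch = [], emb_dim
        for _ in range(n_layers):
            blocks += [
                nn.Conv1d(in_ch, channels, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool1d(2),  # halves the sequence length
            ]
            in_ch = channels
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.Dropout(dropout), nn.Linear(channels, n_classes)
        )

    def forward(self, x):
        # (batch, length) -> (batch, emb_dim, length), as Conv1d expects.
        x = self.embedding(x).transpose(1, 2)
        x = self.features(x)       # length 200 -> 100 -> 50 -> 25 -> 12
        x = x.max(dim=2).values    # global max pooling over time
        return self.classifier(x)


model = ConvClassifier()
# weight_decay adds an L2 penalty on the weights (value chosen arbitrarily).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
logits = model(torch.randint(0, 20002, (8, 200)))
```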
The hyperparameters are listed at the top of each respective notebook cell.
Lastly, I have trained two recurrent neural networks:
One with a bidirectional one-layer LSTM and a self-attention mechanism.
Another with a bidirectional one-layer GRU and a self-attention mechanism.
Both have an embedding layer of 128 dimensions and a one-layer classifier. However, I have only set dropout in the LSTM.
We achieve a better validation accuracy with the recurrent networks, although there also appears to be notable overfitting, as we can see on the graphs below during the learning process.
Increasing the regularization does not remedy this issue. A possible solution is therefore to increase the length of the sequences.
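The two recurrent models could share a single class like the one sketched below. The exact attention used in the notebook is not shown here, so this sketch uses a simple learned attention pooling over the recurrent outputs, which is one common reading of "self-attention" for sequence classification. The class name, hidden size, and the 0.5 dropout value for the LSTM are assumptions.

```python
import torch
from torch import nn


class AttentionRNN(nn.Module):
    def __init__(self, n_tokens=20002, emb_dim=128, hidden=64,
                 n_classes=20, cell="lstm", dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(n_tokens, emb_dim, padding_idx=0)
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        self.rnn = rnn_cls(emb_dim, hidden, num_layers=1,
                           batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)  # one attention score per step
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        mask = x != 0                                 # True on real tokens
        outputs, _ = self.rnn(self.embedding(x))      # (batch, length, 2*hidden)
        scores = self.score(torch.tanh(outputs)).squeeze(-1)
        # Padding positions get zero weight; assumes each row has a real token.
        scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=1)        # (batch, length)
        context = (weights.unsqueeze(-1) * outputs).sum(dim=1)
        return self.fc(self.dropout(context))


lstm_model = AttentionRNN(cell="lstm", dropout=0.5)  # dropout value assumed
gru_model = AttentionRNN(cell="gru")                 # no dropout, as in the text

batch = torch.randint(1, 20002, (4, 200))
batch[:, 150:] = 0                                   # simulate padded tails
lstm_logits, gru_logits = lstm_model(batch), gru_model(batch)
```

The weighted sum replaces the usual "last hidden state" summary, so every time step can contribute to the classification.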
Predicting
We can see below the workflow used to make a single prediction:
'Subject: Re: After 2000 years, can we say that Christian Morality is\nDate: 24 Apr 1993 14:03:44 -0700\nOrganization: EIT\nLines: 126\nNNTP-Posting-Host: squick.eitech.com\n\n>#>Ordinarily, it is also a *value* judgement, though it needn\'t be (one \n>#>could "do science" without believing it was worth a damn in any context, \n>#>though that hardly seems sensible).\n>#No, you\'re just overloading the word "value" again. It is an\n>#estimation of probability of correctness, not an estimation of "worth."\n>#Sh'
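A single-prediction workflow for a held-out message such as this one can be sketched as follows. The helper `predict_one` and the toy model, vectorizer, and category list are stand-ins so that the snippet runs on its own; in the notebook the trained network, the fitted vectorizer, and the 20 newsgroup names would take their place.

```python
import torch
from torch import nn


def predict_one(model, vectorize, text, categories):
    """Return (predicted category, probability) for a single raw text."""
    model.eval()                # disable dropout
    with torch.no_grad():       # no gradients needed at inference time
        batch = torch.tensor(vectorize([text]), dtype=torch.long)  # (1, length)
        probs = torch.softmax(model(batch), dim=1)[0]
    idx = int(probs.argmax())
    return categories[idx], float(probs[idx])


# Toy stand-ins (hypothetical, untrained): ids are derived from word lengths.
def toy_vectorize(texts, max_len=5):
    rows = []
    for text in texts:
        ids = [len(word) % 9 + 1 for word in text.split()][:max_len]
        rows.append(ids + [0] * (max_len - len(ids)))
    return rows


toy_model = nn.Sequential(nn.Embedding(10, 4), nn.Flatten(), nn.Linear(20, 3))
toy_categories = ["alt.atheism", "sci.space", "rec.autos"]
label, prob = predict_one(toy_model, toy_vectorize, "can we say that", toy_categories)
```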