Long Short-Term Memory

In our previous MLP experiment, we saw that a neural network could extract more from averaged Word2Vec embeddings than a simpler Logistic Regression. However, averaging throws away crucial information: the order of words in a sentence.

This time, we’re diving deeper into neural networks with Long Short-Term Memory (LSTM) units. They are specifically designed to process sequences, remembering context and understanding how word order contributes to meaning. Let’s see if harnessing this sequential power can push our accuracy even further.

Data Preparation

import kagglehub
df = kagglehub.dataset_load(
    kagglehub.KaggleDatasetAdapter.PANDAS,
    'jp797498e/twitter-entity-sentiment-analysis',
    'twitter_training.csv',
    pandas_kwargs={'encoding': 'ISO-8859-1'},
)

df = df[df.columns[[2, 3]]]
df.columns = ['sentiment', 'text']

df = df.dropna()  # drop missing rows before casting, otherwise NaN becomes the string 'nan'
df['text'] = df['text'].astype(str)
df['sentiment'] = df['sentiment'].astype(str)

df = df.loc[df['sentiment'] != 'Irrelevant']
display(df)
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
x_train = train['text']
y_train = train['sentiment']
x_test = test['text']
y_test = test['sentiment']

Tokenization

Before our neural network can understand text, we need to convert sentences into a numerical format it can process. In our MLP approach, we tokenized the text and then immediately averaged the word vectors. The LSTM pipeline is a bit more involved.

The first step is tokenization, where we break down each sentence into individual words or sub-word units called “tokens.” Then, we’ll build a vocabulary of all unique tokens in our training data and assign a unique integer ID to each one, transforming our text into sequences of numbers.

from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['text'])
display(tokenizer.texts_to_sequences(["Hello World"]))
[[737, 162]]
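
To see the mapping behind these numbers, we can peek at the fitted tokenizer’s vocabulary (a quick check, not part of the pipeline itself):

display(len(tokenizer.word_index))          # vocabulary size
display(tokenizer.word_index.get('hello'))  # 737, matching the sequence above
display(tokenizer.word_index.get('world'))  # 162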

Data Padding

The next step is called padding. LSTMs (like most neural networks) require input sequences to be of a uniform length. Since our sentences naturally vary, we need to pad our integer sequences to a fixed length.

This involves either truncating longer sequences or adding special tokens (usually zeros) to shorter sequences until they all reach a predetermined maximum length. We can use some reasonable number here, and adjust it later if needed.

maxlen = 96

from tensorflow.keras.utils import pad_sequences
x_train = pad_sequences(tokenizer.texts_to_sequences(x_train), maxlen=maxlen)
x_test = pad_sequences(tokenizer.texts_to_sequences(x_test), maxlen=maxlen)

display(x_train)
array([[    0,     0,     0, ...,   357,   155,   163],
       [    0,     0,     0, ...,    25,  1136,    10],
       [    0,     0,     0, ...,   827,  3072,  1119],
       ...,
       [    0,     0,     0, ...,   469,     7,  5669],
       [    0,     0,     0, ...,    36,    16, 14165],
       [    0,     0,     0, ...,     3,   121,    10]], dtype=int32)
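
To see exactly what the padding does, here is a tiny illustration on made-up sequences - shorter inputs get leading zeros, longer ones are truncated from the front (both happen at the start by default):

display(pad_sequences([[1, 2, 3]], maxlen=5))           # [[0, 0, 1, 2, 3]]
display(pad_sequences([[1, 2, 3, 4, 5, 6]], maxlen=5))  # [[2, 3, 4, 5, 6]]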

Embedding Layer

Next, we need to transform these padded sequences into meaningful representations that capture their semantic relationships. That’s where the embedding layer comes in. To build it, we may use our existing Word2Vec model.

import kagglehub
path = kagglehub.dataset_download(
    'leadbest/googlenewsvectorsnegative300', 
    path='GoogleNews-vectors-negative300.bin.gz'
)

from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format(path, binary=True)

Think of it as a sophisticated, trainable lookup table: for each integer ID in our sequence, the embedding layer looks up the corresponding dense vector. This way the network works with the semantic meaning of words rather than arbitrary indices.

import numpy as np
# One row per token ID (row 0 is reserved for padding), one column per Word2Vec dimension.
embedding_matrix_shape = (len(tokenizer.word_index) + 1, wv.vector_size)
embedding_matrix = np.zeros(shape=embedding_matrix_shape)
for word, index in tokenizer.word_index.items():
    if word in wv:
        # Copy the pre-trained vector for words Word2Vec knows about.
        embedding_matrix[index] = wv.get_vector(word)
    else:
        # Out-of-vocabulary words keep an all-zero vector.
        embedding_matrix[index] = np.zeros(wv.vector_size)
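
As a quick sanity check, a row of this matrix should now match the corresponding pre-trained vector (assuming the probe word below exists in both the tokenizer vocabulary and the Word2Vec model):

probe = 'hello'  # hypothetical probe word; any in-vocabulary word works
assert np.allclose(embedding_matrix[tokenizer.word_index[probe]], wv.get_vector(probe))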

Label Encoding

Our sentiment labels are strings, so the last preparation step is to one-hot encode them with LabelBinarizer - one binary column per class - matching the softmax output of the network.

from sklearn.preprocessing import LabelBinarizer 
import pandas as pd

encoder = LabelBinarizer()
encoder.fit(df['sentiment'])

y_train_encoded = pd.DataFrame(encoder.transform(y_train))
y_test_encoded = pd.DataFrame(encoder.transform(y_test))
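
To see what the encoder produces, we can inspect its classes and encode a single label (a small check; LabelBinarizer orders the columns by sorted class name):

display(encoder.classes_)                 # the three sentiment labels
display(encoder.transform(['Positive']))  # one-hot row for a single label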

Building and Training the Model

Here comes the most interesting part. With our text now represented as sequences of dense semantic vectors (thanks to the embedding layer), we can introduce the star of this experiment - the Long Short-Term Memory (LSTM) layer.

Unlike a simple Dense layer that processes all its inputs at once, an LSTM processes each vector in our sequence one at a time. Internally, each LSTM unit contains a sophisticated set of “gates” – an input gate, a forget gate, and an output gate – along with a memory cell.

These gates learn to control the flow of information: what to remember from previous steps, what to discard, and what new information from the current word’s vector is worth writing into the memory cell. This is what lets an LSTM capture context and dependencies across the entire sequence.
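
To make the gates a little less abstract, here is a minimal NumPy sketch of a single LSTM time step - a conceptual illustration only, not Keras’ exact implementation or weight layout:

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W stacks the input weights, U the recurrent weights, b the biases
    # for all four internal transformations (shapes 4u x d, 4u x u, 4u).
    z = W @ x_t + U @ h_prev + b   # pre-activations for all gates at once
    i, f, o, g = np.split(z, 4)    # input gate, forget gate, output gate, candidate memory
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g       # forget part of the old memory, write new information
    h_t = o * np.tanh(c_t)         # expose part of the memory as the hidden state
    return h_t, c_t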

from tensorflow.keras import layers, Sequential
num_classes = len(encoder.classes_)
model = Sequential([
    layers.Embedding(
        weights=[embedding_matrix],
        input_dim=embedding_matrix_shape[0], 
        output_dim=embedding_matrix_shape[1],
        trainable=True, # allow optimizer to tweak this layer too
        mask_zero=True, # ignore padding zeroes
    ),
    layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    layers.Dropout(0.2),
    layers.Dense(num_classes, activation='softmax'),
])

We can compile and train our model now, but before we do, let’s add a regularisation technique called early stopping. Essentially, it’s a callback that watches the training process and stops it once the monitored metric stops improving. That helps reduce overfitting and saves compute when training gets stuck at the same accuracy for too long.

from tensorflow.keras.callbacks import EarlyStopping
earlystop = EarlyStopping(monitor='val_accuracy', patience=5, restore_best_weights=True)

Also, we could try tweaking the optimizer learning rate - the default one might be too high when fine-tuning pre-trained embeddings.

from tensorflow.keras.optimizers import Adam
optimizer = Adam(learning_rate=0.0001)

Everything is ready - let’s start the training process.

from tensorflow import device
with device('/CPU:0'):
    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(x_train, y_train_encoded, epochs=35, batch_size=128, validation_split=0.1, callbacks=[earlystop]) 
Output
Epoch 1/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 98s 280ms/step - accuracy: 0.4975 - loss: 1.0252 - val_accuracy: 0.6669 - val_loss: 0.8161
Epoch 2/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 96s 275ms/step - accuracy: 0.6696 - loss: 0.8008 - val_accuracy: 0.7111 - val_loss: 0.7120
Epoch 3/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 95s 272ms/step - accuracy: 0.7129 - loss: 0.7021 - val_accuracy: 0.7466 - val_loss: 0.6414
Epoch 4/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 97s 281ms/step - accuracy: 0.7468 - loss: 0.6350 - val_accuracy: 0.7747 - val_loss: 0.5757
Epoch 5/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 97s 279ms/step - accuracy: 0.7780 - loss: 0.5642 - val_accuracy: 0.7934 - val_loss: 0.5168
Epoch 6/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 102s 294ms/step - accuracy: 0.8068 - loss: 0.4949 - val_accuracy: 0.8144 - val_loss: 0.4728
Epoch 7/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 100s 289ms/step - accuracy: 0.8251 - loss: 0.4517 - val_accuracy: 0.8306 - val_loss: 0.4352
Epoch 8/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 100s 288ms/step - accuracy: 0.8432 - loss: 0.4051 - val_accuracy: 0.8424 - val_loss: 0.4055
Epoch 9/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 103s 296ms/step - accuracy: 0.8585 - loss: 0.3652 - val_accuracy: 0.8515 - val_loss: 0.3836
Epoch 10/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 95s 275ms/step - accuracy: 0.8708 - loss: 0.3330 - val_accuracy: 0.8614 - val_loss: 0.3566
Epoch 11/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 99s 285ms/step - accuracy: 0.8769 - loss: 0.3160 - val_accuracy: 0.8695 - val_loss: 0.3366
Epoch 12/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 97s 279ms/step - accuracy: 0.8879 - loss: 0.2840 - val_accuracy: 0.8736 - val_loss: 0.3243
Epoch 13/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 96s 277ms/step - accuracy: 0.8954 - loss: 0.2648 - val_accuracy: 0.8801 - val_loss: 0.3098
Epoch 14/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 99s 285ms/step - accuracy: 0.9040 - loss: 0.2467 - val_accuracy: 0.8819 - val_loss: 0.2973
Epoch 15/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 96s 277ms/step - accuracy: 0.9057 - loss: 0.2360 - val_accuracy: 0.8872 - val_loss: 0.2875
Epoch 16/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 97s 278ms/step - accuracy: 0.9116 - loss: 0.2204 - val_accuracy: 0.8882 - val_loss: 0.2793
Epoch 17/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 96s 277ms/step - accuracy: 0.9164 - loss: 0.2075 - val_accuracy: 0.8863 - val_loss: 0.2786
Epoch 18/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 95s 275ms/step - accuracy: 0.9193 - loss: 0.1988 - val_accuracy: 0.8928 - val_loss: 0.2685
Epoch 19/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 99s 285ms/step - accuracy: 0.9260 - loss: 0.1858 - val_accuracy: 0.8951 - val_loss: 0.2658
Epoch 20/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 99s 285ms/step - accuracy: 0.9264 - loss: 0.1805 - val_accuracy: 0.8997 - val_loss: 0.2587
Epoch 21/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 99s 286ms/step - accuracy: 0.9298 - loss: 0.1721 - val_accuracy: 0.8967 - val_loss: 0.2626
Epoch 22/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 98s 283ms/step - accuracy: 0.9327 - loss: 0.1649 - val_accuracy: 0.8997 - val_loss: 0.2547
Epoch 23/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 97s 281ms/step - accuracy: 0.9375 - loss: 0.1566 - val_accuracy: 0.9034 - val_loss: 0.2527
Epoch 24/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 101s 291ms/step - accuracy: 0.9369 - loss: 0.1530 - val_accuracy: 0.9001 - val_loss: 0.2558
Epoch 25/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 100s 287ms/step - accuracy: 0.9400 - loss: 0.1461 - val_accuracy: 0.9030 - val_loss: 0.2515
Epoch 26/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 102s 293ms/step - accuracy: 0.9453 - loss: 0.1354 - val_accuracy: 0.9034 - val_loss: 0.2505
Epoch 27/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 98s 283ms/step - accuracy: 0.9426 - loss: 0.1358 - val_accuracy: 0.9052 - val_loss: 0.2506
Epoch 28/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 98s 282ms/step - accuracy: 0.9433 - loss: 0.1349 - val_accuracy: 0.9094 - val_loss: 0.2502
Epoch 29/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 93s 267ms/step - accuracy: 0.9463 - loss: 0.1289 - val_accuracy: 0.9090 - val_loss: 0.2476
Epoch 30/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 94s 270ms/step - accuracy: 0.9466 - loss: 0.1253 - val_accuracy: 0.9074 - val_loss: 0.2496
Epoch 31/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 96s 277ms/step - accuracy: 0.9491 - loss: 0.1197 - val_accuracy: 0.9066 - val_loss: 0.2509
Epoch 32/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 98s 283ms/step - accuracy: 0.9494 - loss: 0.1176 - val_accuracy: 0.9094 - val_loss: 0.2500
Epoch 33/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 98s 283ms/step - accuracy: 0.9511 - loss: 0.1133 - val_accuracy: 0.9098 - val_loss: 0.2493
Epoch 34/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 99s 286ms/step - accuracy: 0.9519 - loss: 0.1120 - val_accuracy: 0.9094 - val_loss: 0.2564
Epoch 35/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 102s 293ms/step - accuracy: 0.9506 - loss: 0.1130 - val_accuracy: 0.9119 - val_loss: 0.2541

Result

from sklearn.metrics import classification_report
with device('/CPU:0'):
    y_pred_probs = model.predict(x_test, verbose=False)
    y_pred_labels = np.argmax(y_pred_probs, axis=1)
    y_true_labels = np.argmax(y_test_encoded.to_numpy(), axis=1)
    print(classification_report(y_true_labels, y_pred_labels, target_names=encoder.classes_))
              precision    recall  f1-score   support

    Negative       0.92      0.93      0.92      4605
     Neutral       0.89      0.91      0.90      3594
    Positive       0.92      0.89      0.91      4140

    accuracy                           0.91     12339
   macro avg       0.91      0.91      0.91     12339
weighted avg       0.91      0.91      0.91     12339
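
As a final spot check, we can push a made-up tweet (a hypothetical example sentence) through the same preprocessing and let the model label it:

sample = ['I absolutely love this game, best purchase this year']
sample_seq = pad_sequences(tokenizer.texts_to_sequences(sample), maxlen=maxlen)
probs = model.predict(sample_seq, verbose=False)
display(encoder.classes_[np.argmax(probs, axis=1)][0])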

Conclusion

We reached a final accuracy of 91% by combining LSTM units with pre-trained Word2Vec embeddings, letting the model take advantage of sequential information.

This demonstrates the value of sequence-aware models over simple embedding averaging on this dataset, with results competitive with the heavily optimized CountVectorizer approach. Further improvements would likely require more advanced architectures such as transformers.

But for now... This is more than enough for this class of tasks.