
Multilayer Perceptron

In our previous attempt, we tried using Word2Vec to improve our sentiment classification, but instead of a higher score, we got a much, much worse result.

That happened because our existing architecture (logistic regression) was a poor fit for the new, and seemingly much better, vectorization approach. But what if we change the architecture itself?

That gives us a nice opportunity to try a different paradigm called deep learning - a branch of machine learning and artificial intelligence that uses artificial neural networks to process data and learn patterns. These networks, inspired by the structure of the human brain, are built with multiple layers of interconnected nodes (neurons) that allow them to identify complex relationships and make predictions.
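To make the idea of "layers of interconnected neurons" concrete, here is a minimal standalone sketch (not part of our pipeline, with arbitrarily chosen sizes) of how a single dense layer transforms an input vector:

import numpy as np

# A toy dense layer: every input feature connects to every neuron
rng = np.random.default_rng(0)
x = rng.normal(size=3)        # an input vector with 3 features
W = rng.normal(size=(3, 4))   # weights: 3 inputs -> 4 neurons
b = np.zeros(4)               # one bias per neuron

# Forward pass: linear combination followed by a non-linear activation (ReLU)
h = np.maximum(0, x @ W + b)
print(h.shape)                # (4,) - this output would feed the next layer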

Data Preparation

import kagglehub
df = kagglehub.dataset_load(
    kagglehub.KaggleDatasetAdapter.PANDAS,
    'jp797498e/twitter-entity-sentiment-analysis',
    'twitter_training.csv',
    pandas_kwargs={'encoding': 'ISO-8859-1'},
)

df = df[df.columns[[2, 3]]]
df.columns = ['sentiment', 'text']

# Drop missing rows first (after astype(str), NaN would become the literal string 'nan')
df = df.dropna()
df['text'] = df['text'].astype(str)
df['sentiment'] = df['sentiment'].astype(str)

df = df.loc[df['sentiment'] != 'Irrelevant']
display(df)
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
x_train = train['text']
y_train = train['sentiment']
x_test = test['text']
y_test = test['sentiment']

Semantic Vectorization

Our vectorization routine also remains exactly the same. We are doing this so we can see how the change of approach affects the final result with the same data.

import kagglehub
path = kagglehub.dataset_download(
    'leadbest/googlenewsvectorsnegative300', 
    path='GoogleNews-vectors-negative300.bin.gz'
)

from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format(path, binary=True)

import numpy as np
from gensim.utils import simple_preprocess
def vectorize(text):
    # Tokenize and normalize the text, then average the word vectors
    tokens = simple_preprocess(text.lower(), deacc=True)
    token_vectors = [wv.get_vector(x) for x in tokens if x in wv]
    if token_vectors:
        return np.mean(token_vectors, axis=0)
    else:
        # No known words - fall back to a zero vector
        return np.zeros(wv.vector_size)

x_train = np.array(train['text'].map(vectorize).tolist())
x_test = np.array(test['text'].map(vectorize).tolist())
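As a quick, optional sanity check, we can vectorize a short made-up sentence and confirm it turns into a single vector of the expected size:

# Each text becomes one fixed-size vector - the mean of its word embeddings
sample = vectorize('I really enjoyed this game')
print(sample.shape)  # should print (300,), i.e. wv.vector_size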

Label Encoding

Before we proceed further, we need to transform our training labels as well! That is because neural networks do not work with text directly - instead, we need to encode our labels, turning them into a mathematical representation.

from sklearn.preprocessing import LabelBinarizer 
import pandas as pd

encoder = LabelBinarizer()
encoder.fit(df['sentiment'])

y_train_encoded = pd.DataFrame(encoder.transform(y_train))
y_test_encoded = pd.DataFrame(encoder.transform(y_test))

This method is called “one-hot encoding”: each category is assigned its own binary column. We have three categories, so each encoded label is a vector of length three.

display(y_train_encoded)
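To see the mapping explicitly, we can ask the encoder how it encodes each of its classes (the column order follows encoder.classes_, which is sorted alphabetically):

# Each class maps to a binary vector with a single 1 in "its" column,
# e.g. Negative -> [1 0 0]
for label, row in zip(encoder.classes_, encoder.transform(encoder.classes_)):
    print(label, row)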

Building and Training the Model

Now, let’s design our model structure. This time we will use a multilayer perceptron. As its name suggests, it is a neural network consisting of multiple layers of neurons, which allows it to learn complex, non-linear relationships in the data (unlike a linear classifier such as logistic regression).

For this task, we will use three types of layers: input (receives our source data and passes it on), dense (a simple layer of fully connected neurons), and dropout (a special layer that helps against overfitting by randomly disabling part of the previous layer’s outputs during training).

from tensorflow.keras import layers, Sequential
num_classes = len(encoder.classes_)
model = Sequential([
    layers.Input(shape=(wv.vector_size,)),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.1),
    layers.Dense(num_classes, activation='softmax')
])

Originally, there were only three layers - input, hidden, and output. Stacking extra hidden (and dropout) layers lifted the original 82% accuracy to 85%, thanks to the increased model capacity and, potentially, more hierarchical feature learning.
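For comparison, the original three-layer variant might have looked something like the sketch below (the hidden size here is an assumption). Calling summary() on both models shows how the parameter count - and with it the capacity - grows:

# A hypothetical minimal MLP: input -> one hidden layer -> output
baseline = Sequential([
    layers.Input(shape=(wv.vector_size,)),
    layers.Dense(128, activation='relu'),  # single hidden layer (assumed size)
    layers.Dense(num_classes, activation='softmax')
])
baseline.summary()  # compare the parameter count with the deeper model above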

We can compile and train our model now. Sticking to the CPU is a good idea here - for such a small model, a GPU may actually slow training down due to data transfer overhead.

from tensorflow import device
with device('/CPU:0'):
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(x_train, y_train_encoded, epochs=50, batch_size=32, validation_split=0.1) 
Output
Epoch 1/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.5869 - loss: 0.9017 - val_accuracy: 0.6696 - val_loss: 0.7869
Epoch 2/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.6685 - loss: 0.7735 - val_accuracy: 0.6872 - val_loss: 0.7375
Epoch 3/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.6939 - loss: 0.7171 - val_accuracy: 0.7052 - val_loss: 0.7021
Epoch 4/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.7148 - loss: 0.6740 - val_accuracy: 0.7016 - val_loss: 0.7010
Epoch 5/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.7309 - loss: 0.6325 - val_accuracy: 0.7322 - val_loss: 0.6555
Epoch 6/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.7548 - loss: 0.5895 - val_accuracy: 0.7411 - val_loss: 0.6228
Epoch 7/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.7678 - loss: 0.5611 - val_accuracy: 0.7486 - val_loss: 0.5982
Epoch 8/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.7821 - loss: 0.5269 - val_accuracy: 0.7601 - val_loss: 0.5915
Epoch 9/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.7953 - loss: 0.4972 - val_accuracy: 0.7737 - val_loss: 0.5538
Epoch 10/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8037 - loss: 0.4788 - val_accuracy: 0.7887 - val_loss: 0.5278
Epoch 11/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8150 - loss: 0.4510 - val_accuracy: 0.7903 - val_loss: 0.5115
Epoch 12/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8231 - loss: 0.4320 - val_accuracy: 0.7996 - val_loss: 0.4945
Epoch 13/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8309 - loss: 0.4142 - val_accuracy: 0.7855 - val_loss: 0.5317
Epoch 14/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8358 - loss: 0.4042 - val_accuracy: 0.8025 - val_loss: 0.4998
Epoch 15/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8400 - loss: 0.3931 - val_accuracy: 0.7936 - val_loss: 0.5041
Epoch 16/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8457 - loss: 0.3788 - val_accuracy: 0.8027 - val_loss: 0.4869
Epoch 17/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8539 - loss: 0.3640 - val_accuracy: 0.8081 - val_loss: 0.4843
Epoch 18/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8553 - loss: 0.3572 - val_accuracy: 0.8183 - val_loss: 0.4635
Epoch 19/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8581 - loss: 0.3484 - val_accuracy: 0.8193 - val_loss: 0.4679
Epoch 20/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8600 - loss: 0.3459 - val_accuracy: 0.8219 - val_loss: 0.4627
Epoch 21/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8650 - loss: 0.3300 - val_accuracy: 0.8144 - val_loss: 0.4668
Epoch 22/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8681 - loss: 0.3256 - val_accuracy: 0.8177 - val_loss: 0.4691
Epoch 23/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8676 - loss: 0.3193 - val_accuracy: 0.8288 - val_loss: 0.4486
Epoch 24/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8727 - loss: 0.3138 - val_accuracy: 0.8290 - val_loss: 0.4497
Epoch 25/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8758 - loss: 0.3067 - val_accuracy: 0.8258 - val_loss: 0.4596
Epoch 26/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8791 - loss: 0.2995 - val_accuracy: 0.8268 - val_loss: 0.4502
Epoch 27/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8753 - loss: 0.3049 - val_accuracy: 0.8316 - val_loss: 0.4420
Epoch 28/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8820 - loss: 0.2944 - val_accuracy: 0.8363 - val_loss: 0.4429
Epoch 29/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8884 - loss: 0.2815 - val_accuracy: 0.8369 - val_loss: 0.4258
Epoch 30/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8830 - loss: 0.2848 - val_accuracy: 0.8284 - val_loss: 0.4361
Epoch 31/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8872 - loss: 0.2814 - val_accuracy: 0.8298 - val_loss: 0.4392
Epoch 32/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8884 - loss: 0.2810 - val_accuracy: 0.8355 - val_loss: 0.4335
Epoch 33/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8928 - loss: 0.2675 - val_accuracy: 0.8379 - val_loss: 0.4373
Epoch 34/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8929 - loss: 0.2620 - val_accuracy: 0.8351 - val_loss: 0.4385
Epoch 35/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8930 - loss: 0.2675 - val_accuracy: 0.8400 - val_loss: 0.4436
Epoch 36/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8914 - loss: 0.2655 - val_accuracy: 0.8422 - val_loss: 0.4392
Epoch 37/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8992 - loss: 0.2545 - val_accuracy: 0.8450 - val_loss: 0.4247
Epoch 38/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9012 - loss: 0.2507 - val_accuracy: 0.8397 - val_loss: 0.4308
Epoch 39/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8994 - loss: 0.2528 - val_accuracy: 0.8355 - val_loss: 0.4373
Epoch 40/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8966 - loss: 0.2538 - val_accuracy: 0.8385 - val_loss: 0.4302
Epoch 41/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8987 - loss: 0.2495 - val_accuracy: 0.8351 - val_loss: 0.4460
Epoch 42/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9024 - loss: 0.2456 - val_accuracy: 0.8434 - val_loss: 0.4288
Epoch 43/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9011 - loss: 0.2450 - val_accuracy: 0.8406 - val_loss: 0.4337
Epoch 44/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9022 - loss: 0.2450 - val_accuracy: 0.8503 - val_loss: 0.4281
Epoch 45/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9063 - loss: 0.2323 - val_accuracy: 0.8357 - val_loss: 0.4530
Epoch 46/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9034 - loss: 0.2400 - val_accuracy: 0.8495 - val_loss: 0.4268
Epoch 47/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9055 - loss: 0.2352 - val_accuracy: 0.8523 - val_loss: 0.4225
Epoch 48/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9064 - loss: 0.2317 - val_accuracy: 0.8436 - val_loss: 0.4283
Epoch 49/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9064 - loss: 0.2362 - val_accuracy: 0.8452 - val_loss: 0.4243
Epoch 50/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9077 - loss: 0.2266 - val_accuracy: 0.8470 - val_loss: 0.4272
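If you capture the return value of model.fit (which we did not do above), you can also plot the learning curves and see where validation accuracy starts to flatten out. A minimal sketch, assuming history = model.fit(...):

import matplotlib.pyplot as plt

def plot_history(history):
    # Plot training vs. validation accuracy per epoch
    plt.plot(history.history['accuracy'], label='train accuracy')
    plt.plot(history.history['val_accuracy'], label='validation accuracy')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.legend()
    plt.show()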

Result

from sklearn.metrics import classification_report
with device('/CPU:0'):
    y_pred_probs = model.predict(x_test, verbose=False)
    y_pred_labels = np.argmax(y_pred_probs, axis=1)
    y_true_labels = np.argmax(y_test_encoded.to_numpy(), axis=1)
    print(classification_report(y_true_labels, y_pred_labels, target_names=encoder.classes_))
              precision    recall  f1-score   support

    Negative       0.86      0.86      0.86      4471
     Neutral       0.85      0.80      0.82      3666
    Positive       0.83      0.87      0.85      4202

    accuracy                           0.85     12339
   macro avg       0.85      0.84      0.84     12339
weighted avg       0.85      0.85      0.85     12339
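As a usage sketch, classifying a new, made-up tweet combines the same pieces we already have: vectorize the text, run the model, and map the most probable column back to its label:

# Hypothetical example text - not taken from the dataset
text = 'This game is absolutely amazing, I love it'
probs = model.predict(np.array([vectorize(text)]), verbose=False)
print(encoder.classes_[np.argmax(probs)])  # predicted sentiment label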

Conclusion

This experiment yielded an 85% accuracy, confirming that a deeper, non-linear model can extract more signal from these embeddings than a linear classifier. The addition of hidden layers and dropout underscores the “more capacity, better results” principle, at least to a point.

While this is a significant improvement over the 65% logistic regression baseline with the same features, it still trails the 91% n-gram model. This suggests that to fully leverage Word2Vec’s semantic richness, architectures capable of processing sequences are the necessary next step.