Sentiment analysis is the process of analyzing digital text to determine the emotional tone of a message: positive, negative, or neutral. Essentially, it helps us understand the writer's attitude toward a particular topic or product, which makes it useful for many applications, such as processing customer reviews or social media comments.
This example demonstrates how to implement a simple sentiment classifier using logistic regression. It is surprising how well such a relatively simple model performs on this class of tasks.
Data Preparation
Let's start with Stanford's IMDB sentiment dataset. Its official split is 50/50, so we are going to have 25,000 samples for training and 25,000 samples for validation.
from datasets import load_dataset
import numpy as np
train, test = load_dataset('stanfordnlp/imdb', split=['train', 'test'])
class_names = train.features['label'].names
x_train = np.array(train['text'])
y_train = np.array(train['label'])
x_test = np.array(test['text'])
y_test = np.array(test['label'])
Let’s take a quick peek at our data before going further.
display(train.to_pandas())
Building and Training the Model
Now that we're done with the data, it's time to build the classification pipeline. The first step is vectorization: the process of turning strings into numerical vectors so that we can manipulate them mathematically. The simplest version of this is called a count vectorizer.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer_example = vectorizer.fit_transform(['Hello World! Hello!', 'Bye World']).toarray()
display(vectorizer.get_feature_names_out(), vectorizer_example)
array(['bye', 'hello', 'world'], dtype=object)
array([[0, 2, 1],
[1, 0, 1]])
The idea is pretty simple. First, the vectorizer scans all the text to build a vocabulary of all the unique words it finds. Then, for each document, it creates a numerical list (a vector) where each number describes how many times a specific word from that vocabulary appears in that document.
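To make this concrete, here is a minimal sketch that reproduces the same counts by hand (assuming the vectorizer's default behaviour of lowercasing and keeping only word tokens of two or more characters):
import re
docs = ['Hello World! Hello!', 'Bye World']
# Tokenize roughly the way CountVectorizer does by default:
# lowercase and keep word tokens of length >= 2.
tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in docs]
# Build the vocabulary from all unique tokens, sorted alphabetically.
vocabulary = sorted({token for tokens in tokenized for token in tokens})
# For each document, count how often each vocabulary word occurs.
counts = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
print(vocabulary)  # ['bye', 'hello', 'world']
print(counts)      # [[0, 2, 1], [1, 0, 1]]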
But in our specific case, we might take a look at another approach called TF-IDF. It measures a word’s importance to a document by multiplying its frequency in that document (term frequency) by a penalty for how common the word is across all documents (inverse document frequency).
This approach helps to essentially eliminate common terms from the classification process, emphasizing words that are uniquely relevant to the text.
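To see this down-weighting in action, we can rerun the toy example with a TF-IDF vectorizer (a minimal sketch; the exact numbers depend on scikit-learn's default smoothing and L2 normalization):
from sklearn.feature_extraction.text import TfidfVectorizer
toy_vectorizer = TfidfVectorizer()
toy_example = toy_vectorizer.fit_transform(['Hello World! Hello!', 'Bye World']).toarray()
# 'world' appears in both documents, so its weight ends up lower than the
# weights of 'bye' and 'hello', which each appear in only one document.
display(toy_vectorizer.get_feature_names_out(), toy_example.round(2))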
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
Now, let’s define a classifier (no fancy configuration here yet) and stick everything into an elegant pipeline. That is going to be our final model architecture.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
pipeline = Pipeline([
('vectorizer', vectorizer),
('classifier', classifier),
])
We could start training our model right away, but...
It would not be ideal in terms of its hyperparameters: the values that define how our pipeline works. These are settings we choose before training, like the regularization strength or how our text vectorizer processes words. They significantly affect how the model learns and how well it ultimately performs.
Manually trying every possible combination of these hyperparameters would be incredibly tedious. Instead, we can use automated hyperparameter tuning techniques. One such technique is randomized search - it randomly samples different combinations of hyperparameters from a pre-defined grid.
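The parameter names in the grid below follow scikit-learn's step__parameter naming convention for pipelines; we can list everything that is tunable in our pipeline like this:
# Every key returned here can be referenced in the search grid,
# e.g. 'classifier__C' or 'vectorizer__ngram_range'.
print(sorted(pipeline.get_params().keys()))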
We may tune the following parameters:
- Classifier C: the inverse regularization strength of the LogisticRegression classifier. Smaller values mean stronger regularization (less prone to overfitting), while bigger values mean weaker regularization (able to capture more nuance in noisy data).
- Vectorizer ngram_range: this is crucial for capturing context! Instead of just looking at individual words (unigrams), n-grams allow us to treat sequences of words as single features. Going beyond unigrams often significantly improves performance in text tasks by providing more contextual information to the model, but it also increases the vocabulary size.
- Vectorizer max_df: maximum document frequency - ignore terms that appear in more than this fraction of the documents. Smaller values exclude more common terms (good for noise reduction), but values that are too small may throw away important common signals (underfitting).
- Vectorizer min_df: minimum document frequency - ignore terms that appear in fewer than this number of documents. Smaller values may lead to huge, noisy vocabularies, while bigger ones may discard specific signals.
param_grid = {
'classifier__C': [0.1, 1, 10],
'vectorizer__ngram_range': [(1, 1), (1, 2)],
'vectorizer__max_df': [0.85, 0.90, 0.95, 1.0],
'vectorizer__min_df': [1, 2, 3, 5],
}
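For context, this grid defines 3 × 2 × 4 × 4 = 96 possible combinations, while the randomized search below samples only 10 of them (RandomizedSearchCV's default n_iter):
from itertools import product
# Total number of hyperparameter combinations described by the grid.
n_combinations = len(list(product(*param_grid.values())))
print(n_combinations)  # 96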
Everything is ready - let’s train our model now.
%%capture --no-stdout
from sklearn.model_selection import RandomizedSearchCV
# Sample 10 hyperparameter combinations (the default n_iter) and evaluate each
# with 5-fold cross-validation, using all available CPU cores.
cv = RandomizedSearchCV(pipeline, param_grid, random_state=0, n_jobs=-1, verbose=3)
cv.fit(x_train, y_train)
Output
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 4/5] END classifier__C=0.1, vectorizer__max_df=1.0, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.844 total time= 10.9s
[CV 1/5] END classifier__C=0.1, vectorizer__max_df=1.0, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.831 total time= 11.0s
[CV 3/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.848 total time= 11.1s
[CV 5/5] END classifier__C=0.1, vectorizer__max_df=1.0, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.840 total time= 11.1s
[CV 2/5] END classifier__C=0.1, vectorizer__max_df=1.0, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.844 total time= 11.1s
[CV 3/5] END classifier__C=0.1, vectorizer__max_df=1.0, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.832 total time= 11.2s
[CV 1/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.848 total time= 11.1s
[CV 2/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.842 total time= 11.6s
[CV 1/5] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.832 total time= 8.5s
[CV 2/5] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.843 total time= 8.6s
[CV 5/5] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.841 total time= 8.4s
[CV 3/5] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.830 total time= 8.5s
[CV 4/5] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.846 total time= 8.6s
[CV 5/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.859 total time= 9.1s
[CV 4/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.865 total time= 9.7s
[CV 2/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.840 total time= 7.5s
[CV 1/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.851 total time= 7.8s
[CV 3/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.844 total time= 7.3s
[CV 1/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 2);, score=0.872 total time= 16.5s
[CV 5/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.857 total time= 7.7s
[CV 4/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.863 total time= 7.9s
[CV 4/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 2);, score=0.887 total time= 16.8s
[CV 3/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 2);, score=0.869 total time= 17.1s
[CV 5/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 2);, score=0.879 total time= 17.3s
[CV 2/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 2);, score=0.881 total time= 17.5s
[CV 1/5] END classifier__C=0.1, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 1);, score=0.830 total time= 7.9s
[CV 2/5] END classifier__C=0.1, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 1);, score=0.844 total time= 7.8s
[CV 3/5] END classifier__C=0.1, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 1);, score=0.828 total time= 8.2s
[CV 5/5] END classifier__C=0.1, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 1);, score=0.839 total time= 8.5s
[CV 4/5] END classifier__C=0.1, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 1);, score=0.845 total time= 8.7s
[CV 1/5] END classifier__C=10, vectorizer__max_df=0.9, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.850 total time= 8.6s
[CV 1/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 2);, score=0.873 total time= 28.5s
[CV 3/5] END classifier__C=10, vectorizer__max_df=0.9, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.846 total time= 8.0s
[CV 2/5] END classifier__C=10, vectorizer__max_df=0.9, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.846 total time= 8.2s
[CV 4/5] END classifier__C=10, vectorizer__max_df=0.9, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.864 total time= 8.1s
[CV 2/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 2);, score=0.877 total time= 34.7s
[CV 5/5] END classifier__C=10, vectorizer__max_df=0.9, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.858 total time= 9.1s
[CV 5/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 2);, score=0.883 total time= 30.9s
[CV 2/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.863 total time= 8.0s
[CV 1/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.865 total time= 8.2s
[CV 3/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.859 total time= 7.6s
[CV 4/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.874 total time= 7.7s
[CV 3/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 2);, score=0.874 total time= 36.0s
[CV 5/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.872 total time= 7.4s
[CV 4/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 2);, score=0.887 total time= 38.7s
[CV 1/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.849 total time= 6.8s
[CV 2/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.841 total time= 5.3s
[CV 3/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.846 total time= 5.4s
[CV 4/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.864 total time= 5.5s
[CV 5/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.856 total time= 4.9s
Result
from sklearn.metrics import classification_report
prediction = cv.best_estimator_.predict(x_test)
print(classification_report(y_test, prediction, target_names=class_names))
precision recall f1-score support
neg 0.90 0.90 0.90 12500
pos 0.90 0.90 0.90 12500
accuracy 0.90 25000
macro avg 0.90 0.90 0.90 25000
weighted avg 0.90 0.90 0.90 25000
from pandas import DataFrame
best_params_table = DataFrame.from_dict(cv.best_params_.items())
display(best_params_table.style.hide(axis=0).hide(axis=1))
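As a quick sanity check, we can also try the tuned pipeline on a couple of made-up reviews (the sentences below are purely illustrative, and the exact probabilities depend on the fitted model):
examples = [
    'A beautifully shot film with a story that stayed with me for days.',
    'Two hours of my life I will never get back - dull and predictable.',
]
predictions = cv.best_estimator_.predict(examples)
probabilities = cv.best_estimator_.predict_proba(examples)
for text, label, probability in zip(examples, predictions, probabilities):
    print(f'{class_names[label]} ({probability.max():.2f}): {text}')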
Conclusion
Our logistic regression model reached an accuracy of 90%.
This demonstrates the effectiveness of classical machine learning techniques for this type of text classification task. One of the key factors contributing to this performance was the use of n-grams to capture local context.
While this approach is already highly effective, future improvements could involve exploring more complex vectorization techniques, experimenting with more advanced text preprocessing (like stemming or lemmatization), or even moving to deep learning models for better sequence understanding.