Sentiment analysis, also known as opinion mining, is the process of computationally identifying and categorizing opinions expressed in a piece of text. Essentially, it helps us understand if the writer’s attitude towards a particular topic, product, or entity is positive, negative, or neutral.
This example demonstrates how to implement a simple sentiment classifier using logistic regression. For a relatively simple model, it performs surprisingly well on this class of tasks.
Data Preparation
Let’s start by importing a sentiment training dataset from Kaggle.
import kagglehub
df = kagglehub.dataset_load(
kagglehub.KaggleDatasetAdapter.PANDAS,
'jp797498e/twitter-entity-sentiment-analysis',
'twitter_training.csv',
pandas_kwargs={'encoding': 'ISO-8859-1'},
)
display(df)
Now we can prepare a cleaning function. It is pretty simple: it converts text to lowercase and removes common words (stopwords) that usually carry little value for sentiment analysis.
from gensim.parsing.preprocessing import remove_stopwords
def clean(text):
    return remove_stopwords(text.lower())
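To make the effect of this cleaning step concrete, here is a minimal pure-Python sketch of the same idea. The TOY_STOPWORDS set and the clean_sketch function are hypothetical stand-ins used only for illustration; gensim ships a much larger stopword list, so its output on the same sentence may differ.

```python
# Hypothetical, deliberately tiny stopword list (an assumption for this sketch;
# gensim's real list is much larger).
TOY_STOPWORDS = {"the", "is", "a", "this", "and", "of", "to"}


def clean_sketch(text: str) -> str:
    """Lowercase the text and drop every token found in TOY_STOPWORDS."""
    tokens = text.lower().split()
    kept = [token for token in tokens if token not in TOY_STOPWORDS]
    return " ".join(kept)


cleaned = clean_sketch("This game is AWESOME and the graphics look great")
print(cleaned)  # game awesome graphics look great
```

The words that survive are the ones most likely to carry sentiment or topic information.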
Let’s clean our dataset and take a look at it again.
# Pin column names (copy to avoid modifying a view of the original frame)
df = df[df.columns[[2, 3]]].copy()
df.columns = ['sentiment', 'text']
# Drop missing rows first: astype(str) would turn NaN into the string 'nan'
df = df.dropna()
# Pin column data types
df['text'] = df['text'].astype(str)
df['sentiment'] = df['sentiment'].astype(str)
# Clean data
df = df.loc[df['sentiment'] != 'Irrelevant'].copy()
df['text'] = df['text'].map(clean)
# Check data
display(df)
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
x_train = train['text']
y_train = train['sentiment']
x_test = test['text']
y_test = test['sentiment']
Building and Training the Model
Alright, we are done with our data. Now let’s build the classification pipeline.
We need to vectorize our data first, that is, turn strings into numbers so that we can manipulate them mathematically. The easiest way to do so is a tool called a count vectorizer.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
First, it scans all the text to build a dictionary (vocabulary) of all unique words it finds. Then, for each sentence, it creates a numerical list (vector) where each number represents how many times a specific word from that dictionary appears in that sentence.
example_vector = vectorizer.fit_transform(['Hello World! Hello!', 'Bye World']).toarray()
display(vectorizer.get_feature_names_out(), example_vector)
array(['bye', 'hello', 'world'], dtype=object)
array([[0, 2, 1],
       [1, 0, 1]])
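The two steps described above (build a vocabulary, then count) can be sketched in a few lines of plain Python. This is only an illustration, not how scikit-learn implements it internally: the tokenize and count_vectorize helpers are hypothetical names, and the sketch returns dense lists where CountVectorizer returns a sparse matrix.

```python
import re
from collections import Counter

# The default token pattern of CountVectorizer: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def count_vectorize(documents):
    """Return (sorted vocabulary, one count vector per document)."""
    tokenized = [tokenize(doc) for doc in documents]
    # Step 1: the vocabulary is the sorted set of all unique tokens.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    # Step 2: count how often each vocabulary word occurs in each document.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = count_vectorize(["Hello World! Hello!", "Bye World"])
print(vocabulary)  # ['bye', 'hello', 'world']
print(vectors)     # [[0, 2, 1], [1, 0, 1]]
```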
Sounds easy, right? Next, we will define a scaler.
It rescales numerical features (like word counts) so they have a similar range. A standard scaler typically shifts each feature to a mean of 0 and scales it to a standard deviation of 1. Here we pass with_mean=False, so features are only scaled to unit variance: centering would destroy the sparsity of the count matrix, and StandardScaler refuses to center sparse input. Scaling helps logistic regression learn better because it prevents features with naturally larger values from dominating the learning process, so all features contribute more equally.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
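As a rough illustration of what scaling without centering does, the sketch below divides every column by its population standard deviation and leaves constant columns untouched. The scale_without_centering helper is a hypothetical name, and it works on dense lists rather than on the sparse matrices the real pipeline uses.

```python
from statistics import pstdev


def scale_without_centering(rows):
    """Divide each column by its population standard deviation.

    Columns with zero variance are divided by 1 (left unchanged),
    which mirrors how scikit-learn treats constant features.
    """
    columns = list(zip(*rows))
    scales = [pstdev(column) or 1.0 for column in columns]
    return [[value / scale for value, scale in zip(row, scales)] for row in rows]


# The count matrix from the vectorizer example above.
counts = [[0, 2, 1], [1, 0, 1]]
scaled = scale_without_centering(counts)
print(scaled)  # [[0.0, 2.0, 1.0], [2.0, 0.0, 1.0]]
```

Note that zeros stay zeros, which is exactly why this variant keeps a sparse matrix sparse.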
Finally, let’s define a classifier (no fancy configuration here yet) and put everything into an elegant pipeline. That will be our final model architecture.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver='lbfgs', max_iter=1500)
pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('scaler', scaler),
    ('classifier', classifier),
])
We could start training our model right away, but...
It would not be ideal in terms of its hyperparameters, the values that define how our pipeline works. These are settings we choose before training, like the regularization strength (C) or how our text vectorizer filters words (min_df, max_df). They significantly affect how the model learns and how well it ultimately performs.
Manually trying every possible combination of these hyperparameters would be incredibly tedious. Instead, we can use automated hyperparameter tuning techniques. For example, GridSearchCV systematically tries out every combination of hyperparameters from a grid we define (RandomizedSearchCV samples a fixed number of combinations instead).
Crucially, to evaluate how good each combination is without peeking at our final test set, it uses cross-validation. This means for each hyperparameter set, it splits the training data into several “folds”, trains the model on some folds, and tests it on a remaining fold, repeating this process so every fold gets to be a test set. By averaging the performance across these folds, we get a more reliable score for that hyperparameter combination, helping us pick the best ones for our final model.
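The fold mechanics can be sketched as follows. This is a simplified illustration with a hypothetical k_fold_indices helper: it splits indices in order, without shuffling or stratification, whereas GridSearchCV uses stratified folds by default for classifiers.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=7, n_folds=3))
for train, validation in folds:
    print(f"train={train} validation={validation}")
# train=[3, 4, 5, 6] validation=[0, 1, 2]
# train=[0, 1, 2, 5, 6] validation=[3, 4]
# train=[0, 1, 2, 3, 4] validation=[5, 6]
```

Every sample lands in a validation fold exactly once, so the averaged score reflects the whole training set.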
param_grid = {
    'classifier__C': [0.1, 1, 10],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__max_df': [0.85, 0.90, 0.95, 1.0],
    'vectorizer__min_df': [1, 2, 3, 5],
}
Let’s dissect those hyperparameters:
- Classifier C: Inverse regularization strength of the LogisticRegression classifier. Smaller values mean stronger regularization (less prone to overfitting), bigger values mean weaker regularization (able to capture more nuances in the data).
- Vectorizer ngram_range: This is crucial for capturing context! Instead of just looking at individual words (unigrams), n-grams allow us to consider sequences of words as single features; (1, 2) means unigrams and bigrams. Using n-grams beyond unigrams often improves performance in text tasks by providing more contextual information to the model, but it also increases the vocabulary size.
- Vectorizer max_df: Maximum document frequency. Ignore terms that appear in more than this proportion of documents (0.85 means more than 85% of them). Smaller values exclude more common terms (good for noise reduction), but values that are too small may discard important common signals (underfitting).
- Vectorizer min_df: Minimum document frequency. Ignore terms that appear in fewer than min_df documents. Smaller values may lead to huge, noisy vocabularies; bigger values may discard specific signals.
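To illustrate the vectorizer settings, here is a small sketch of n-gram extraction and document-frequency filtering on made-up documents. The word_ngrams and filter_by_document_frequency helpers are hypothetical names for this illustration; they follow the documented meaning of ngram_range, min_df (an absolute count) and max_df (a proportion), not scikit-learn's actual code.

```python
from collections import Counter


def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams whose length is within the inclusive range."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def filter_by_document_frequency(documents, min_df=1, max_df=1.0):
    """Keep terms found in at least min_df documents (a count) and in
    at most max_df of all documents (a proportion)."""
    doc_freq = Counter()
    for grams in documents:
        doc_freq.update(set(grams))  # count each term once per document
    max_count = max_df * len(documents)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


# Toy documents, invented for this example.
docs = [
    word_ngrams("love game".split()),
    word_ngrams("hate game".split()),
    word_ngrams("game crashed".split()),
]
print(docs[0])  # ['love', 'game', 'love game']

kept = filter_by_document_frequency(docs, min_df=1, max_df=0.9)
print(kept)
# ['crashed', 'game crashed', 'hate', 'hate game', 'love', 'love game']
```

The word "game" occurs in every document, so max_df=0.9 drops it, while bigrams such as "game crashed" remain available as features.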
Here comes the moment of truth: we can finally start training our model, with cross-validation running on top.
from joblib import parallel_backend
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipeline, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1, verbose=2)
with parallel_backend('multiprocessing'):
    grid.fit(x_train, y_train)
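To get a feel for how much work this search involves, the snippet below counts the combinations in the grid and the resulting number of model fits. The grid is repeated under the name grid_copy so the snippet runs on its own, and n_folds mirrors the cv=3 setting used above.

```python
from itertools import product

# Copy of the tutorial's grid, repeated so this snippet is self-contained.
grid_copy = {
    'classifier__C': [0.1, 1, 10],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__max_df': [0.85, 0.90, 0.95, 1.0],
    'vectorizer__min_df': [1, 2, 3, 5],
}
n_folds = 3  # matches cv=3 in the GridSearchCV call

# Every combination is one candidate; each candidate is fitted once per fold.
combinations = list(product(*grid_copy.values()))
n_fits = len(combinations) * n_folds

print(len(combinations))  # 96
print(n_fits)             # 288
```

With the default refit behaviour, GridSearchCV additionally fits the best combination once more on the full training set, which is what grid.best_estimator_ exposes.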
Result
from sklearn.metrics import classification_report
prediction = grid.best_estimator_.predict(x_test)
print(classification_report(y_test, prediction))
print(f'Best parameters: {grid.best_params_}')
              precision    recall  f1-score   support

    Negative       0.94      0.91      0.92      4505
     Neutral       0.92      0.89      0.90      3582
    Positive       0.89      0.93      0.91      4252

    accuracy                           0.91     12339
   macro avg       0.91      0.91      0.91     12339
weighted avg       0.91      0.91      0.91     12339
Best parameters: {'classifier__C': 0.1, 'vectorizer__max_df': 0.85, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 2)}
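For readers unfamiliar with the columns of the report, the sketch below computes precision, recall, and F1 for a single class from scratch. The per_class_metrics helper is a hypothetical name, and the labels are made up for the example; they are not taken from the dataset.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels, for illustration only.
toy_true = ["Positive", "Positive", "Negative", "Neutral", "Negative", "Positive"]
toy_pred = ["Positive", "Negative", "Negative", "Neutral", "Positive", "Positive"]

precision, recall, f1 = per_class_metrics(toy_true, toy_pred, "Positive")
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 0.67 0.67 0.67
```

Precision answers "of everything predicted as this class, how much was right?", recall answers "of everything that truly belongs to this class, how much was found?", and F1 is their harmonic mean.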
Conclusion
Our final logistic regression model reached an accuracy of 91% on the held-out test set. This was achieved by preprocessing the data, constructing a scikit-learn pipeline, and running an exhaustive hyperparameter search via GridSearchCV.
Key factors contributing to this performance included the use of unigrams and bigrams (ngram_range of (1, 2)) and the tuning of regularization and vectorizer frequency cutoffs. This demonstrates the effectiveness of classical machine learning techniques for this type of text classification task.