Building a Twitter Sentiment Classifier with NLTK and Scikit-Learn



Ever wondered how machines understand whether a tweet is positive or negative?

In this blog, I’ll walk you step-by-step through building a Sentiment Classifier using tweets from real people, the classic NLTK toolkit, and a Logistic Regression model. We’ll go from loading the data all the way to evaluating our model.

No prior ML expertise needed.

We’ll train a model to classify tweets as positive or negative. Think of it as a simple version of what powers Twitter/X sentiment analysis dashboards or brand monitoring tools.

Here’s what you’ll learn:

  • How to load and process real tweet data
  • What “tokenization”, “stemming”, and “stopword removal” mean (and why they matter)
  • How to extract meaningful features from text
  • How to train and evaluate a Logistic Regression model
  • A full, working sentiment classifier by the end

NLTK comes with a ready-to-use Twitter dataset containing 5,000 positive and 5,000 negative tweets. First, we import our libraries and load this dataset:

import nltk
from nltk.corpus import twitter_samples

nltk.download('twitter_samples')  # one-time download of the corpus
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Before feeding text to a model, we need to clean it. Tweets are messy — emojis, URLs, hashtags, retweets — all of that needs processing.

Here’s our plan:

  • Remove links, stock tickers
  • Strip out handles and hashtags
  • Lowercase the text
  • Remove stopwords like “is”, “and”, etc.
  • Stem words (e.g., “running” → “run”)

Here’s our process_tweet function:

import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

nltk.download('stopwords')  # the stopword list also needs a one-time download

def process_tweet(tweet):
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')

    tweet = re.sub(r'\$\w*', '', tweet)                 # strip stock tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)              # strip the retweet marker
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # strip URLs
    tweet = re.sub(r'#', '', tweet)                     # keep the word, drop the '#'

    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = [stemmer.stem(word) for word in tweet_tokens
                    if word not in stopwords_english and word not in string.punctuation]

    return tweets_clean

This function turns a noisy tweet like

👉#FollowFriday @tushar_elric for being top engaged members in my community this week :)

into 👉 ['followfriday','top','engag','member','commun','week',':)']

Notice how :) is retained in the output—it signals sentiment and adds emotional context.
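
You can reproduce that with a single call:

print(process_tweet('#FollowFriday @tushar_elric for being top engaged members in my community this week :)'))
# ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']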

We convert tweets into a DataFrame with a label (1 for positive, 0 for negative), combine both classes, shuffle the rows, and split into train/test:
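The two labeled DataFrames aren't shown in the original snippet; a minimal sketch of how they might be built (assuming pandas is imported as pd, with the tweet/label column names used below):

import pandas as pd

# Label the two classes: 1 = positive, 0 = negative
df_pos = pd.DataFrame({'tweet': positive_tweets, 'label': 1})
df_neg = pd.DataFrame({'tweet': negative_tweets, 'label': 0})
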

raw_tweets = pd.concat([df_pos, df_neg], ignore_index=True).sample(frac=1, random_state=42)

Then apply our process_tweet function:

raw_tweets['processed_tokens'] = raw_tweets['tweet'].apply(process_tweet)

# Split it

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    raw_tweets[['processed_tokens', 'label']],
    test_size=0.2,
    stratify=raw_tweets['label'],
    random_state=42
)

This part is crucial. We're building a frequency dictionary: a map from each word to how often it appears in positive tweets and in negative tweets.

import numpy as np

def build_freqs(tweets, ys):
    freqs = {}
    for y, tweet in zip(np.squeeze(ys).tolist(), tweets):
        tokens = process_tweet(' '.join(tweet))
        for word in tokens:
            if word not in freqs:
                freqs[word] = [0, 0]
            freqs[word][1 - int(y)] += 1  # index 0: positive count, index 1: negative count
    return freqs

Wonder why it's 1 - int(y) instead of just int(y)? 🤔 Each entry stores [positive_count, negative_count], so a positive tweet (y = 1) increments index 1 - 1 = 0, and a negative tweet (y = 0) increments index 1 - 0 = 1. That ordering is exactly what lets us unpack pos, neg = freqs[word] later.
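
The post never shows these helpers being wired up, so here's a small connecting snippet (assuming the train_df/test_df column names from the split above):

# Pull token lists and labels back out of the split DataFrames
train_x = train_df['processed_tokens'].tolist()
train_y = train_df['label'].to_numpy()
test_x = test_df['processed_tokens'].tolist()
test_y = test_df['label'].to_numpy()

# Build the frequency dictionary from training data only, to avoid test-set leakage
freqs = build_freqs(train_x, train_y)
print(freqs.get('happi'))  # -> [pos_count, neg_count]; the stem 'happi' should skew positive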

Now that we have word frequencies, we turn each tweet into a 3-element vector: a bias term, the total positive-class frequency of its words, and the total negative-class frequency.

def extract_features(tweet, freqs):
    x = np.zeros((1, 3))
    x[0, 0] = 1  # bias term
    for word in process_tweet(tweet):
        if word in freqs:
            pos, neg = freqs[word]
            x[0, 1] += pos
            x[0, 2] += neg
    return x
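
As a quick illustration (placeholder values, since the sums depend on your training split):

sample = '#FollowFriday @tushar_elric for being top engaged members in my community this week :)'
print(extract_features(sample, freqs))
# -> array([[1., P, N]]) where P and N are the summed positive/negative counts of its words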

Now we stack up all vectors and train:

from sklearn.linear_model import LogisticRegression

X_train = np.vstack([extract_features(' '.join(tweet), freqs) for tweet in train_x])
X_test = np.vstack([extract_features(' '.join(tweet), freqs) for tweet in test_x])

model = LogisticRegression()
model.fit(X_train, train_y.ravel())
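
Since the feature space is just [bias, positive_sum, negative_sum], the model stays easy to inspect; an optional sanity check:

# We'd expect a positive learned weight on the positive-frequency feature
# and a negative one on the negative-frequency feature (exact values vary by split)
print(model.coef_, model.intercept_)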

We use standard metrics:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(test_y, y_pred))
print("Precision:", precision_score(test_y, y_pred))
print("Recall:", recall_score(test_y, y_pred))
print("F1 Score:", f1_score(test_y, y_pred))

'''
Accuracy: 0.82
Precision: 0.84
Recall: 0.80
F1 Score: 0.82
'''
# Not bad at all for a simple, interpretable model!

What we built is a solid sentiment classifier using basic NLP and classical ML. It’s not deep learning, but it gets the job done — and teaches you a lot along the way.

You now understand:

  • How to clean and tokenize tweets
  • How to build word frequency dictionaries
  • How to extract features for classification
  • How to train and evaluate a model

Want to take this further? Here’s what you could try next:

  • Use TF-IDF or Word2Vec instead of raw word counts (see the TF-IDF sketch below)
  • Train a Naive Bayes classifier for comparison
  • Build a web interface where users input tweets
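
For example, a TF-IDF variant could be sketched like this (a hypothetical follow-up, reusing raw_tweets from earlier; the hyperparameters are illustrative, not tuned):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_tr, X_te, y_tr, y_te = train_test_split(
    raw_tweets['tweet'], raw_tweets['label'],
    test_size=0.2, stratify=raw_tweets['label'], random_state=42
)

# The default word-level analyzer is used here for brevity; a custom
# tokenizer (like process_tweet) could be passed via the `tokenizer` argument.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                    LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # accuracy on the held-out tweets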
