Example
Identifying spam emails
(Classifying emails as inbox or spam)
Let us work through one complete example of logistic regression for spam detection.
In this example, each email is either delivered to the inbox or sent to spam.
In a real spam filter, the equation used to calculate the score z is not chosen by hand.
It is learned from data during training.
During training, the model is shown thousands of emails that are already labelled
as Spam or Not spam.
For each email, the computer extracts features (numbers), predicts a label,
compares the prediction with the correct label, and measures the error.
When the prediction is wrong, the model adjusts the constants in its formula.
This adjustment is done using optimisation techniques.
The most common one is gradient descent,
where the model repeatedly reduces its errors step by step.
Other optimisation methods are also used in practice,
but the idea is always the same: improve the model by learning from errors.
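The training loop described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular spam filter: the toy data, the learning rate, the step count, and the names `sigmoid` and `train` are all assumptions made for the example. It fits a formula of the form z = b + w·x by batch gradient descent.

```python
import math

def sigmoid(z):
    """Map any real score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, learning_rate=0.1, steps=5000):
    """Fit z = b + w * x by batch gradient descent on the log loss."""
    b, w = 0.0, 0.0  # start with both constants at zero
    n = len(xs)
    for _ in range(steps):
        # Prediction error (p - y) for every training email.
        errors = [sigmoid(b + w * x) - y for x, y in zip(xs, ys)]
        # Average gradient of the log loss with respect to b and w.
        grad_b = sum(errors) / n
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        # Move the constants a small step in the direction that reduces the error.
        b -= learning_rate * grad_b
        w -= learning_rate * grad_w
    return b, w

# Hypothetical toy data: spam-word counts and labels (1 = spam, 0 = not spam).
xs = [0, 0, 1, 2, 3, 3, 4, 4, 5, 6, 6, 7]
ys = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

b, w = train(xs, ys)
print(f"z = {b:.2f} + {w:.2f}x")
```

The constants learned from this tiny data set will not match the ones used in this example. Those depend entirely on the emails used for training.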
After training on many spam and normal emails, the model settles on values that work well.
That is how an equation like the one below is obtained.
z = −4 + 1.2x
Here, the numbers −4 and 1.2 come from training.
They reflect patterns learned from real email data, not human guesswork.
A list of spam words
free, win, winner, prize, offer, click, buy, cheap, discount, deal,
urgent, limited, bonus, cash, money, gift, reward, now, claim, jackpot
For each email, we count how many of these words appear.
Let x be that count.
The model does not understand the “meaning” of the message.
It only uses this number.
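Counting the listed words might look like the sketch below. The function name `count_spam_words` is hypothetical, and the matching rule is an assumption: whole words only, case-insensitive, with every occurrence counted.

```python
import re

SPAM_WORDS = {
    "free", "win", "winner", "prize", "offer", "click", "buy", "cheap",
    "discount", "deal", "urgent", "limited", "bonus", "cash", "money",
    "gift", "reward", "now", "claim", "jackpot",
}

def count_spam_words(text):
    """Return x: how many whole words in the text are on the spam list."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word in SPAM_WORDS)

print(count_spam_words("Meeting at 3 pm. Agenda attached."))    # 0
print(count_spam_words("Win cash prize now. Click to claim."))  # 6
```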
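The score z is converted into a probability p between 0 and 1 by the sigmoid (logistic) function:
$$p = \frac{1}{1 + e^{-z}}$$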
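As a quick check of the arithmetic, the sketch below computes z and p for a few word counts using the constants −4 and 1.2 from this example. The names `BIAS`, `WEIGHT`, and `spam_probability` are chosen for illustration only.

```python
import math

BIAS = -4.0    # constant term in z = -4 + 1.2x
WEIGHT = 1.2   # multiplier on the spam-word count x

def spam_probability(x):
    """Return the score z and the probability p for a word count x."""
    z = BIAS + WEIGHT * x
    p = 1.0 / (1.0 + math.exp(-z))
    return z, p

for x in (0, 3, 6):
    z, p = spam_probability(x)
    print(f"x = {x}: z = {z:.1f}, p = {p:.2f}")
# x = 0: z = -4.0, p = 0.02
# x = 3: z = -0.4, p = 0.40
# x = 6: z = 3.2, p = 0.96
```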
Decision rule (strict filter): choose a threshold t (example: t = 0.9)
If p ≥ t → Spam (class 1) → sent to Spam
If p < t → Not spam (class 0) → delivered to Inbox
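The decision rule is a single comparison. In this small sketch, the function name `classify` and its default argument are illustrative choices. The threshold 0.9 is the one from this example.

```python
def classify(p, threshold=0.9):
    """Return 'Spam' when p reaches the threshold, otherwise 'Inbox'."""
    return "Spam" if p >= threshold else "Inbox"

print(classify(0.96))  # Spam
print(classify(0.88))  # Inbox
print(classify(0.90))  # Spam, because the rule uses p >= t
```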
10 emails analysed
(x → z → p → inbox/spam)
| Email (short) | x | z | p | Result |
|---|---|---|---|---|
| Meeting at 3 pm. Agenda attached. | 0 | −4.0 | ≈ 0.02 | Inbox |
| Invoice sent. Please review and reply. | 0 | −4.0 | ≈ 0.02 | Inbox |
| Limited offer. Click now for discount. | 5 | 2.0 | ≈ 0.88 | Inbox |
| Urgent: claim your reward now. | 4 | 0.8 | ≈ 0.69 | Inbox |
| Buy cheap deal today. Limited bonus. | 5 | 2.0 | ≈ 0.88 | Inbox |
| Win cash prize now. Click to claim. | 6 | 3.2 | ≈ 0.96 | Spam |
| Congratulations winner! Claim your gift. | 3 | −0.4 | ≈ 0.40 | Inbox |
| Jackpot prize! Win money now. | 5 | 2.0 | ≈ 0.88 | Inbox |
| Free bonus cash. Limited time. Click now. | 6 | 3.2 | ≈ 0.96 | Spam |
| Claim jackpot reward. Win prize now. | 6 | 3.2 | ≈ 0.96 | Spam |
Key idea:
The steps are always the same:
count words → get x → compute z → convert to p → compare with threshold.
With a strict threshold like 0.9, only emails with very high probability are sent to spam.
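To see how much the threshold matters, the sketch below finds the smallest word count that sends an email to spam under two thresholds. The value 0.5 is not part of this example and is included only for comparison. The function names and the `max_count` limit are likewise assumptions.

```python
import math

def spam_probability(x, bias=-4.0, weight=1.2):
    """Probability p for a spam-word count x, using z = -4 + 1.2x."""
    return 1.0 / (1.0 + math.exp(-(bias + weight * x)))

def min_count_for_spam(threshold, max_count=20):
    """Smallest word count x whose probability reaches the threshold."""
    for x in range(max_count + 1):
        if spam_probability(x) >= threshold:
            return x
    return None  # threshold not reached within max_count

for t in (0.5, 0.9):
    print(f"t = {t}: spam from x = {min_count_for_spam(t)}")
# t = 0.5: spam from x = 4
# t = 0.9: spam from x = 6
```

Under these assumptions, a threshold of 0.5 would already flag emails with four listed words, while the strict threshold of 0.9 waits until six.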