Which Customers Are Similar?
Understanding Customer Segmentation Through Real Data
Not all customers behave the same. Some buy frequently and spend a lot. Some buy occasionally. Some bought once and disappeared. K-Means clustering on RFM features finds the natural groups — so every action you take is targeted, not guesswork.
200 real customers · K-Means clustering · Runnable Python notebook
Section 01 — The Business Problem
Not all customers behave the same
Suppose an ecommerce store has 100,000 customers. Some buy frequently and spend a lot. Some buy once a year. Some bought once and disappeared after a promotional discount. Sending the same email campaign to all of them means wasted budget, missed revenue, and a poor experience for everyone.
Three customers, three very different stories
Last bought 5 days ago. 35 orders. $8,500 spent.
Last bought 130 days ago. 8 orders. $3,800 spent.
Last bought 75 days ago. 1 order. $150 spent.
The same discount email sent to all three is irrelevant to at least two of them.
Same campaign for everyone
- A discount email goes to every customer, including your most loyal ones who would have bought anyway
- Win-back messages land in the inbox of someone who bought last week
- High-value customers receive the same message as first-time buyers
- Budget is spent on customers already planning to buy
With segmentation
- Champions get early product access and loyalty rewards
- At-Risk customers receive personalised win-back campaigns
- Occasional buyers get first-repeat-purchase incentives
- Every message is relevant: higher open rates, better ROI
The question
- Which of your 100,000 customers are actually similar to each other?
- How do you find the natural groups without manually labelling them?
- The answer is already in your transaction data. You just need to look.
Section 02 — The Dataset
Three features. 200 customers.
Customer segmentation starts with RFM — three features computed directly from transaction history. Every customer gets a score on each dimension. Together, those three numbers capture most of what matters about customer behaviour.
Recency
Lower = more engaged
How many days since their last purchase?
Frequency
Higher = more loyal
How many orders have they placed?
Monetary
Higher = more valuable
How much have they spent in total?
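As a minimal sketch, the three features could be derived from an order table with pandas roughly as follows. The column names (`customer_id`, `order_date`, `order_total`), the toy rows, and the snapshot date are assumptions for illustration, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical order history; column names and values are illustrative only.
orders = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B", "C", "C"],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-03-01",
        "2023-11-20",
        "2024-01-15", "2024-02-28",
    ]),
    "order_total": [120.0, 80.0, 200.0, 150.0, 40.0, 60.0],
})

# Reference date against which recency is measured (an assumed "today").
snapshot = pd.Timestamp("2024-03-06")

# One row per customer: last purchase date, order count, total spend.
rfm = orders.groupby("customer_id").agg(
    last_order=("order_date", "max"),
    frequency=("order_date", "count"),
    monetary=("order_total", "sum"),
)

# Recency = whole days between the snapshot and the last purchase.
rfm["recency"] = (snapshot - rfm["last_order"]).dt.days
rfm = rfm[["recency", "frequency", "monetary"]]
print(rfm)
```

The snapshot date matters: recency is always relative to the day the features are computed, so the same customer drifts upward in recency every day they do not buy.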
Sample — 6 of 200 customers
| Customer | Recency | Frequency | Monetary | Segment |
|---|---|---|---|---|
| C001 | 5 days | 35 orders | $8,500 | Champion |
| C038 | 25 days | 15 orders | $2,000 | Loyal |
| C087 | 130 days | 8 orders | $3,800 | At-Risk |
| C133 | 75 days | 1 order | $150 | Occasional |
| C107 | 170 days | 12 orders | $4,800 | At-Risk |
| C172 | 65 days | 2 orders | $180 | Occasional |
These 6 customers look very different. How do we find others like each of them across 200 customers? That is what K-Means does.
Section 03 — Explore the Segments
Section 04 — What Surprised Us
Patterns the data revealed
These findings come from running the actual analysis on our curated 200-customer dataset. Every stat here is computed from the data — nothing invented.
At-Risk customers spend 3.7× as much per order
Average order value: At-Risk $471 vs Loyal $129. Customers who have gone quiet were not casual buyers — they were high-value ones. The silence is not indifference; it is an opportunity. A win-back campaign for this segment is not a cost — it is a recovery play.
$471 avg order value
15% of customers generate 45% of revenue
Champions are the smallest segment by count (30 of 200), yet they account for $218K of $485K total revenue. That is a Pareto-style concentration, though less extreme than the classic 80/20 split. Protect this segment above all others: losing one Champion costs far more than losing one Occasional buyer.
45% from 15% of customers
$155K in recoverable At-Risk revenue
At-Risk customers collectively represent $155K in historical spend — and they have proven they will spend. The question is whether anyone noticed they left. This segment has the highest average order value of any group. The recovery opportunity is larger than it looks.
$155K at-risk revenue
Section 05 — How It Works
How K-Means finds similar customers
K-Means is an iterative algorithm. It does not know anything about “Champions” or “At-Risk” customers — it only sees numbers. It finds groups of customers that are close to each other in RFM space. We name those groups afterwards.
Normalise the features
Monetary values range from $150–$8,500. Recency from 3–200 days. Before clustering, we scale each feature to the same range (0–1) using Min-Max scaling so no single dimension dominates the distance calculation.
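A small sketch of Min-Max scaling, applied to the six sample customers from the table above. The formula is x' = (x - min) / (max - min) per column; note that scaling only these six rows uses their own minima and maxima, so the values will differ from scaling the full 200-customer dataset.

```python
import numpy as np

# The six sample customers: recency (days), frequency (orders), monetary ($).
X = np.array([
    [5, 35, 8500],
    [25, 15, 2000],
    [130, 8, 3800],
    [75, 1, 150],
    [170, 12, 4800],
    [65, 2, 180],
], dtype=float)

# Per-column minimum and maximum.
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Min-Max scaling: every column now lies in [0, 1].
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled.round(3))
```

scikit-learn's `MinMaxScaler` performs the same transformation with its default settings and also remembers the fitted minima and maxima, which is convenient when new customers have to be scaled later.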
Pick K starting points (centroids)
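The algorithm places K starting points in 3D RFM space, one for each cluster it is looking for. With K=4, we are asking: find 4 natural groups. The starting positions affect convergence speed and can also change the final result, because K-Means only converges to a local optimum. Running several initialisations and keeping the best one (scikit-learn's n_init) guards against a poor start.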
Assign every customer to the nearest centroid
Each customer is assigned to whichever centroid is closest, measured by Euclidean distance in 3D RFM space. Every customer belongs to exactly one cluster at each step.
Move centroids to the centre of their group
Each centroid moves to the average position of all customers assigned to it. A centroid with many high-frequency, low-recency customers moves toward that region of space.
Repeat until stable
Steps 3 and 4 repeat. The clusters stop changing when no customer switches groups between iterations. On this 200-customer dataset, convergence happens within 10–15 iterations.
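The loop described above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the notebook's code (which uses scikit-learn): it initialises centroids by picking K random data points, which is one common choice, and it runs on synthetic points that are assumed to be already scaled to roughly 0–1.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means loop; returns (labels, centroids, iterations used)."""
    rng = np.random.default_rng(seed)
    # Step 2: start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for iteration in range(1, max_iter + 1):
        # Step 3: Euclidean distance from every point to every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 5: stop once no point switches cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids, iteration

# Synthetic stand-in for scaled RFM data: four blobs of 50 points each.
rng = np.random.default_rng(42)
centres = np.array([
    [0.1, 0.9, 0.9],
    [0.2, 0.4, 0.3],
    [0.8, 0.3, 0.5],
    [0.5, 0.05, 0.05],
])
X = np.vstack([c + rng.normal(scale=0.03, size=(50, 3)) for c in centres])

labels, centroids, n_iter = kmeans(X, k=4)
print("iterations:", n_iter)
print("cluster sizes:", np.bincount(labels, minlength=4))
```

Changing `seed` changes the starting centroids, which is an easy way to see how initialisation influences the clusters that come out.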
Choosing K=4
K=4 was chosen because it produces four distinct, interpretable segments. In practice, the elbow method plots inertia (sum of squared distances to centroids) against different values of K. The “elbow” — where adding another cluster stops reducing inertia significantly — guides K selection. The silhouette score provides a second check: how similar each customer is to its own cluster versus others.
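A hedged sketch of how the elbow and silhouette checks might look with scikit-learn. The data here is synthetic (four well-separated blobs standing in for scaled RFM features), and the range of K values tried is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled RFM features: four blobs of 50 points.
rng = np.random.default_rng(0)
centres = np.array([
    [0.1, 0.9, 0.9],
    [0.2, 0.4, 0.3],
    [0.8, 0.3, 0.5],
    [0.5, 0.05, 0.05],
])
X = np.vstack([c + rng.normal(scale=0.03, size=(50, 3)) for c in centres])

# For each candidate K, record inertia (elbow method) and silhouette score.
results = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    print(f"K={k}  inertia={results[k][0]:.3f}  silhouette={results[k][1]:.3f}")

# The K with the highest silhouette score is one reasonable candidate.
best_k = max(results, key=lambda k: results[k][1])
print("best K by silhouette:", best_k)
```

On real customer data the elbow is rarely this clean, so the two metrics are best read together with the interpretability of the resulting segments.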
Section 06 — The 4 Segments
The 4 segments discovered
Named after what the data shows, not what we hoped for. Each segment has a distinct RFM profile that points directly at the right business action.
Champions
High spend, frequent, recent
30 customers
$218,300 total · 15% of customers
These customers drive a disproportionate share of revenue.
Loyal Buyers
Regular buyers, steady spend
50 customers
$94,900 total · 25% of customers
Reliable revenue base. Likely to slide into At-Risk if engagement drops.
At-Risk
Previously high-value, now inactive
40 customers
$154,600 total · 20% of customers
Highest average order value of any segment — these customers know how to spend.
Occasional
Low frequency, low spend
80 customers
$17,550 total · 40% of customers
Large segment by count, small by revenue. Expensive to replace through new acquisition if they churn.
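The headline shares can be rechecked from the counts and totals on the four segment cards. The short script below uses only those quoted figures.

```python
# Customer counts and revenue totals as quoted on the segment cards above.
segments = {
    "Champions": (30, 218_300),
    "Loyal Buyers": (50, 94_900),
    "At-Risk": (40, 154_600),
    "Occasional": (80, 17_550),
}

total_customers = sum(n for n, _ in segments.values())
total_revenue = sum(revenue for _, revenue in segments.values())

summary = {}
for name, (n, revenue) in segments.items():
    summary[name] = {
        "customer_share": n / total_customers,
        "revenue_share": revenue / total_revenue,
        "spend_per_customer": revenue / n,
    }
    print(
        f"{name:<13} {summary[name]['customer_share']:.0%} of customers, "
        f"{summary[name]['revenue_share']:.0%} of revenue, "
        f"${summary[name]['spend_per_customer']:,.0f} per customer"
    )
```

The totals come to 200 customers and $485,350, which gives Champions 15% of customers and about 45% of revenue, matching the figures quoted earlier.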
Section 07 — Production Architecture
How a company would build this
Running K-Means in a notebook is one thing. Serving segment labels in real time to a marketing platform that reaches millions of customers is another. Here is a practical architecture.
Customer Events
Event stream / Database
Purchases, logins, clicks: every customer interaction is an event with customer_id, timestamp, and order details.
Data Warehouse
BigQuery / Redshift / Snowflake
Events land in a warehouse. Orders are aggregated per customer to compute raw totals: last purchase date, order count, total spend.
Feature Pipeline
dbt / Airflow / Spark
A daily job computes RFM features per customer: recency in days, frequency count, monetary total. Output: one row per customer_id.
Segmentation Job
Python · scikit-learn · weekly batch
K-Means runs on the normalised RFM features with K=4. Each customer is assigned a cluster label. Centroids are stored for future assignment of new customers.
Customer Profile Store
Redis / Postgres / DynamoDB
The segment label is written back to a fast lookup store indexed by customer_id. Every downstream system reads from here, so nothing is re-clustered at request time.
Downstream Systems
Marketing / Product / CRM
Marketing platform, product recommendation engine, CRM, and support tools all consume the segment label to tailor their behaviour per customer.
Refresh note: RFM features should update daily. Segment re-assignment can run weekly — segments are stable unless a major cohort shifts. Champions dropping to Loyal is an early warning signal worth monitoring separately.
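Storing centroids means a customer can be assigned between weekly runs without re-clustering. A minimal sketch of that lookup follows; the scaling bounds and centroid coordinates are made-up placeholders (only the recency and monetary ranges echo figures quoted earlier), and `assign_segment` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Artefacts a weekly job might persist. All values are illustrative assumptions.
SEGMENT_NAMES = ["Champions", "Loyal Buyers", "At-Risk", "Occasional"]
FEATURE_MIN = np.array([3.0, 1.0, 150.0])      # recency, frequency, monetary
FEATURE_MAX = np.array([200.0, 35.0, 8500.0])
CENTROIDS = np.array([                          # in scaled 0-1 space
    [0.03, 0.90, 0.85],
    [0.13, 0.40, 0.20],
    [0.70, 0.25, 0.45],
    [0.35, 0.03, 0.01],
])

def assign_segment(rfm_row, feature_min=FEATURE_MIN, feature_max=FEATURE_MAX,
                   centroids=CENTROIDS):
    """Scale one raw (recency, frequency, monetary) row and return the
    index of the nearest stored centroid."""
    scaled = (np.asarray(rfm_row, dtype=float) - feature_min) / (feature_max - feature_min)
    # Values outside the range seen at fit time are clipped to [0, 1].
    scaled = np.clip(scaled, 0.0, 1.0)
    return int(np.linalg.norm(centroids - scaled, axis=1).argmin())

for row in [(5, 35, 8500), (130, 8, 3800), (75, 1, 150)]:
    print(row, "->", SEGMENT_NAMES[assign_segment(row)])
```

With scikit-learn, persisting the fitted scaler and `KMeans` model and calling `predict` achieves the same thing; the explicit version above just makes the nearest-centroid step visible.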
Practical considerations
RFM refresh frequency
Update daily. A customer who just made a large purchase should move segments quickly.
Segment re-assignment
Run K-Means weekly. Segments are stable — Champions rarely flip overnight. But a full monthly batch misses emerging At-Risk signals.
Champions dropping to Loyal
An early warning signal. Recency is increasing while frequency is flat. Trigger a proactive retention campaign before they reach At-Risk.
Cold-start for new customers
New customers have no history. Default to Occasional. After 2–3 orders, recompute and reassign. Avoid using day-1 data for Champions criteria.
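Two of these considerations reduce to simple rules on top of the cluster labels. The sketch below is one possible shape for them; the function names and the three-order threshold are assumptions for illustration (the text only says 2–3 orders).

```python
def effective_segment(order_count, clustered_segment, min_orders=3):
    """Return 'Occasional' until a customer has enough orders for the
    clustered label to be trusted (cold-start rule)."""
    if clustered_segment is None or order_count < min_orders:
        return "Occasional"
    return clustered_segment

def champion_downgrades(previous, current):
    """Customer ids labelled Champions in the previous run and
    Loyal Buyers in the current run (early warning signal)."""
    return sorted(
        customer_id
        for customer_id, segment in current.items()
        if previous.get(customer_id) == "Champions" and segment == "Loyal Buyers"
    )

# Hypothetical labels from two consecutive weekly runs.
previous = {"c1": "Champions", "c2": "Champions", "c3": "Loyal Buyers"}
current = {"c1": "Loyal Buyers", "c2": "Champions", "c3": "At-Risk", "c4": "Occasional"}

print(champion_downgrades(previous, current))
print(effective_segment(order_count=1, clustered_segment="Champions"))
```

Keeping these rules outside the clustering job means the threshold can be tuned without refitting the model.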
Section 08 — AI Opportunities
Segments make AI smarter
A segment label is a context signal. When an AI system knows a customer is At-Risk rather than a Champion, its responses — recommendations, copy, timing — become dramatically more relevant. The AI does not replace segmentation; it consumes it.
AI Marketing Assistant
Questions it can answer
- “Which segment should receive this campaign?”
- “Which At-Risk customers are most likely to respond to a discount?”
- “Draft personalised subject lines for each segment”
- “Which Champions haven't received a loyalty offer in 90 days?”
Personalised Onboarding (SaaS)
Questions it can answer
- “Which onboarding path matches this customer's usage profile?”
- “Which features should we highlight for this segment?”
- “Is this customer's behaviour tracking toward Champions or Occasional?”
- “What intervention has historically moved Occasional users to Loyal?”
Churn Early Warning
Questions it can answer
- “Which Loyal customers are trending toward At-Risk?”
- “What changed in the last 30 days for customers now classified At-Risk?”
- “What intervention historically reduces churn for this segment?”
- “Show me Champions whose recency has increased by more than 10 days this month”
The AI does not replace segmentation — it consumes the segment labels. Segment → context → better AI output. Without the segment signal, the same AI question returns a generic answer. With it, the response is specific, actionable, and relevant.
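Questions like the last one in the churn list ultimately resolve to a filter over the profile store. A hedged pandas sketch, assuming a hypothetical table that holds each customer's segment plus recency at the start of the month and today (customer ids and values are invented):

```python
import pandas as pd

# Hypothetical profile-store extract; ids and numbers are invented.
profiles = pd.DataFrame({
    "customer_id": ["x1", "x2", "x3", "x4"],
    "segment": ["Champions", "Champions", "Champions", "Loyal Buyers"],
    "recency_month_start": [4, 6, 2, 20],
    "recency_now": [5, 21, 14, 45],
})

# How many days recency has grown since the start of the month.
profiles["recency_drift"] = profiles["recency_now"] - profiles["recency_month_start"]

# Champions whose recency has increased by more than 10 days this month.
flagged = profiles[
    (profiles["segment"] == "Champions") & (profiles["recency_drift"] > 10)
]
print(flagged[["customer_id", "recency_drift"]])
```

An AI assistant sitting on top of the store would translate the natural-language question into a query of this kind; the segment column is what makes the question answerable at all.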
Section 09 — Apply to Your Business
What data do you have?
RFM segmentation is not limited to ecommerce. Anywhere you have engagement records — features used, leads touched, sessions completed — the same technique applies. The features change; the pattern does not.
Ecommerce
Dataset: Order history
“Which customers are actually similar to each other?”
customer_id, order_date, order_total
Days since last order, order count, total spend
Targeted campaigns, personalised recommendations
SaaS
Dataset: Feature usage logs
“Which customers use the product similarly?”
user_id, feature_name, last_used, session_count
Last login, feature breadth, usage depth
Personalised onboarding, feature discovery, upgrade triggers
CRM / Sales
Dataset: Lead activity history
“Which leads behave similarly in the funnel?”
lead_id, last_activity, touchpoint_count, deal_value
Days since last touch, engagement count, deal size
Prioritised outreach, differentiated follow-up cadence
Education
Dataset: Learning activity logs
“Which students learn similarly?”
student_id, last_session, session_count, progress_pct
Days since last session, session frequency, completion depth
Adaptive curriculum, early intervention for at-risk learners
The question to ask yourself
What are the “customers” in your system — users, leads, students, accounts? Can you express their behaviour as recency, frequency, and value? If yes, K-Means will find the natural groups. The segments are already there. You just have not looked.
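To illustrate that only the features change, here is a sketch of the SaaS variant using the fields listed above (`user_id`, `feature_name`, `last_used`, `session_count`). The rows and the snapshot date are invented, and the output column names are assumptions; the result is one row per user that could be scaled and clustered exactly like RFM.

```python
import pandas as pd

# Hypothetical feature-usage log: one row per user and feature.
usage = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u3", "u3"],
    "feature_name": ["reports", "export", "alerts", "reports", "reports", "export"],
    "last_used": pd.to_datetime([
        "2024-03-01", "2024-02-20", "2024-03-04",
        "2023-12-01",
        "2024-02-10", "2024-02-12",
    ]),
    "session_count": [12, 3, 7, 1, 4, 2],
})

snapshot = pd.Timestamp("2024-03-06")  # assumed "today"

profile = usage.groupby("user_id").agg(
    last_used=("last_used", "max"),               # recency analogue
    feature_breadth=("feature_name", "nunique"),  # frequency analogue
    usage_depth=("session_count", "sum"),         # value analogue
)
profile["days_since_last_use"] = (snapshot - profile["last_used"]).dt.days
profile = profile[["days_since_last_use", "feature_breadth", "usage_depth"]]
print(profile)
```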
Section 10 — From Segments to Decisions
A segment without an action is just a label
The value of segmentation is not the segment name. It is the decision that follows. Each segment profile points at a specific action — and the same business data supports all four simultaneously.
| Segment | Profile | Recommended Action |
|---|---|---|
| Champions | 9 days recency · 32 orders · $7,277 spend | Early product access, loyalty programme, VIP support tier |
| Loyal Buyers | 29 days recency · 15 orders · $1,898 spend | Volume discounts, subscription offers, cross-sell recommendations |
| At-Risk | 140 days silent · $471 avg order value | Win-back email sequence, personalised incentive, exit survey |
| Occasional | 2 orders · $219 total spend | First repeat-purchase incentive, product discovery campaign |
Note on sequencing: Start with Champions (protect your top revenue), then At-Risk (highest recovery potential). Occasional buyers are the largest segment by count but the smallest by revenue — they are a long-term activation project, not an immediate priority.
References & Further Reading
Go deeper
Where this playbook ends, these resources begin.
Run the full analysis yourself
Part A reproduces the exact 200-customer results on this page. Part B runs the same K-Means analysis on the UCI Online Retail Dataset — 541,909 real transactions. Every finding can be reproduced and challenged.
Deep Reading
scikit-learn KMeans Documentation
scikit-learn contributors · 2024
The implementation used in the notebook. Covers the algorithm, n_init for stability, random_state for reproducibility, and the inertia / silhouette score metrics for K selection.
scikit-learn.org
Mining of Massive Datasets, Chapter 7: Clustering
Leskovec, Rajaraman, Ullman (Stanford) · 2020
Free textbook chapter covering K-Means, BFR algorithm, and clustering at scale. The definitive reference for taking this beyond a notebook and into distributed systems.