Interactive Playbook · Machine Learning · Segmentation · RFM

Which Customers Are Similar?

Understanding Customer Segmentation Through Real Data

Not all customers behave the same. Some buy frequently and spend a lot. Some buy occasionally. Some bought once and disappeared. K-Means clustering on RFM features finds the natural groups — so every action you take is targeted, not guesswork.

200 customers

4 segments discovered

3 RFM features

Runnable Python notebook


✓ Why treating all customers the same wastes budget

✓ What RFM features reveal about customer behaviour

✓ How K-Means finds natural customer groups

✓ What each segment means and what action to take

✓ How to build this into a production system

✓ Where AI consumes segment labels for better output

Section 01 — The Business Problem

Not all customers behave the same

Suppose an ecommerce store has 100,000 customers. Some buy frequently and spend a lot. Some buy once a year. Some bought once and disappeared after a promotional discount. Sending the same email campaign to all of them means wasted budget, missed revenue, and a poor experience for everyone.

Three customers, three very different stories

C001 · VIP

Last bought 5 days ago. 35 orders. $8,500 spent.

C087 · Gone quiet

Last bought 130 days ago. 8 orders. $3,800 spent.

C133 · First-timer

Last bought 75 days ago. 1 order. $150 spent.

The same discount email sent to all three is irrelevant to at least two of them.

Same campaign for everyone

▸ A discount email goes to every customer, including your most loyal ones who would have bought anyway
▸ Win-back messages land in the inbox of someone who bought last week
▸ High-value customers receive the same message as first-time buyers
▸ Budget is spent on customers already planning to buy

With segmentation

▸ Champions get early product access and loyalty rewards
▸ At-Risk customers receive personalised win-back campaigns
▸ Occasional buyers get first-repeat-purchase incentives
▸ Every message is relevant: higher open rates, better ROI

The question

▸ Which of your 100,000 customers are actually similar to each other?
▸ How do you find the natural groups without manually labelling them?
▸ The answer is already in your transaction data; you just need to look.

Section 02 — The Dataset

Three features. 200 customers.

Customer segmentation starts with RFM — three features computed directly from transaction history. Every customer gets a score on each dimension. Together, those three numbers capture most of what matters about customer behaviour.

Recency

Lower = more engaged

How many days since their last purchase?

Frequency

Higher = more loyal

How many orders have they placed?

Monetary

Higher = more valuable

How much have they spent in total?
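As a minimal sketch, the three features can be computed from a raw order table with pandas. The column names (customer_id, order_date, order_total) follow the ecommerce fields listed later in this playbook; the sample orders and the snapshot date are made up for illustration and are not the 200-customer dataset.

```python
import pandas as pd

# Hypothetical order history: one row per order (illustrative values only).
orders = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B", "C", "C"],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-03-01",
        "2023-11-20", "2024-01-15", "2024-02-28",
    ]),
    "order_total": [120.0, 80.0, 200.0, 150.0, 60.0, 40.0],
})


def compute_rfm(orders: pd.DataFrame, snapshot_date: pd.Timestamp) -> pd.DataFrame:
    """Return one row per customer with recency (days), frequency, monetary."""
    rfm = orders.groupby("customer_id").agg(
        last_order=("order_date", "max"),    # most recent purchase
        frequency=("order_date", "count"),   # number of orders
        monetary=("order_total", "sum"),     # total spend
    )
    # Recency is measured back from a fixed snapshot date so runs are repeatable.
    rfm["recency"] = (snapshot_date - rfm["last_order"]).dt.days
    return rfm[["recency", "frequency", "monetary"]]


rfm = compute_rfm(orders, snapshot_date=pd.Timestamp("2024-03-06"))
print(rfm)
```

Measuring recency against an explicit snapshot date, rather than the current time, keeps the features stable when the job is re-run on the same data.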

Sample — 6 of 200 customers

Customer	Recency	Frequency	Monetary	Segment
C001	5 days	35 orders	$8,500	Champion
C038	25 days	15 orders	$2,000	Loyal
C087	130 days	8 orders	$3,800	At-Risk
C133	75 days	1 order	$150	Occasional
C107	170 days	12 orders	$4,800	At-Risk
C172	65 days	2 orders	$180	Occasional

These 6 customers look very different. How do we find others like each of them across 200 customers? That is what K-Means does.

Section 03 — Explore the Segments


Section 04 — What Surprised Us

Patterns the data revealed

These findings come from running the actual analysis on our curated 200-customer dataset. Every stat here is computed from the data — nothing invented.

Counterintuitive

At-Risk customers spend 3.7× as much per order as Loyal Buyers

Average order value: At-Risk $471 vs Loyal $129. Customers who have gone quiet were not casual buyers — they were high-value ones. The silence is not indifference; it is an opportunity. A win-back campaign for this segment is not a cost — it is a recovery play.

$471 avg order value
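The playbook does not spell out how average order value is calculated. A common definition, assumed here, is total spend divided by total orders within a segment. The sketch below applies it to the six sample customers from Section 02, so the numbers differ from the full-dataset figures of $471 and $129.

```python
import pandas as pd

# The six sample customers shown in Section 02.
sample = pd.DataFrame({
    "customer": ["C001", "C038", "C087", "C133", "C107", "C172"],
    "frequency": [35, 15, 8, 1, 12, 2],
    "monetary": [8500, 2000, 3800, 150, 4800, 180],
    "segment": ["Champion", "Loyal", "At-Risk", "Occasional", "At-Risk", "Occasional"],
})

# Assumed definition: AOV = total spend / total orders, per segment.
totals = sample.groupby("segment")[["monetary", "frequency"]].sum()
totals["aov"] = totals["monetary"] / totals["frequency"]

print(totals["aov"].round(2))
```

Even on this small sample, the At-Risk rows show a higher spend per order than the Loyal row, which is the same direction as the full-dataset finding.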

Pareto principle

15% of customers generate 45% of revenue

Champions, the smallest segment by count (30 of 200), account for $218K of $485K total revenue. This is the Pareto pattern: a small share of customers drives a disproportionate share of revenue. Protect this segment above all others. Losing one Champion is not like losing one Occasional buyer.

45% from 15% of customers
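The shares can be checked directly from the segment totals reported in Section 06. In this short sketch the customer counts are derived from the stated percentages of the 200-customer dataset.

```python
# Segment revenue totals as reported in Section 06; customer counts are
# derived from the stated percentages of the 200-customer dataset.
segments = {
    "Champions":    {"customers": 30, "revenue": 218_300},
    "Loyal Buyers": {"customers": 50, "revenue": 94_900},
    "At-Risk":      {"customers": 40, "revenue": 154_600},
    "Occasional":   {"customers": 80, "revenue": 17_550},
}

total_customers = sum(s["customers"] for s in segments.values())
total_revenue = sum(s["revenue"] for s in segments.values())

for name, s in segments.items():
    customer_share = 100 * s["customers"] / total_customers
    revenue_share = 100 * s["revenue"] / total_revenue
    print(f"{name:<13} {customer_share:4.0f}% of customers, "
          f"{revenue_share:4.1f}% of revenue")
```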

Recovery opportunity

$155K in recoverable At-Risk revenue

At-Risk customers collectively represent $155K in historical spend — and they have proven they will spend. The question is whether anyone noticed they left. This segment has the highest average order value of any group. The recovery opportunity is larger than it looks.

$155K at-risk revenue

Section 05 — How It Works

How K-Means finds similar customers

K-Means is an iterative algorithm. It does not know anything about “Champions” or “At-Risk” customers — it only sees numbers. It finds groups of customers that are close to each other in RFM space. We name those groups afterwards.

Normalise the features

Monetary values range from $150 to $8,500 and recency from 3 to 200 days. Before clustering, we scale each feature to the same range (0–1) using Min-Max scaling so that no single dimension dominates the distance calculation.

Pick K starting points (centroids)

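The algorithm places K random points in 3D RFM space, one for each cluster it is looking for. With K=4, we are asking: find 4 natural groups. The starting positions affect convergence speed and, because K-Means only finds a local optimum, can occasionally change the final result. Running several initialisations and keeping the best one (the n_init setting in scikit-learn) guards against a poor start.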

Assign every customer to the nearest centroid

Each customer is assigned to whichever centroid is closest, measured by Euclidean distance in 3D RFM space. Every customer belongs to exactly one cluster at each step.

Move centroids to the centre of their group

Each centroid moves to the average position of all customers assigned to it. A centroid with many high-frequency, low-recency customers moves toward that region of space.

Repeat until stable

The assign and move steps repeat until the clusters stop changing, that is, until no customer switches groups between iterations. On this 200-customer dataset, convergence happens within 10–15 iterations.
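The steps above can be sketched in plain NumPy. This is an illustrative version, not the notebook's code (which uses scikit-learn). It assumes starting centroids are drawn from the data points themselves, and the demo data is synthetic.

```python
import numpy as np


def min_max_scale(X):
    """Rescale every column to the 0 to 1 range (assumes no constant column)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)


def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means (Lloyd's algorithm) on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Pick K distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for iteration in range(1, max_iter + 1):
        # Assign: Euclidean distance from every point to every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no customer switched cluster: converged
        labels = new_labels
        # Move: each centroid goes to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                centroids[j] = members.mean(axis=0)
    return labels, centroids, iteration


# Synthetic raw RFM rows (recency days, orders, spend): two obvious groups.
rng = np.random.default_rng(42)
X_raw = np.vstack([
    rng.normal(loc=(9, 32, 7300), scale=(2, 1, 100), size=(20, 3)),
    rng.normal(loc=(75, 2, 220), scale=(2, 1, 100), size=(20, 3)),
])
X_demo = min_max_scale(X_raw)
labels, centroids, n_iter = kmeans(X_demo, k=2)
print(n_iter, centroids.round(2))
```

The loop stops as soon as an assignment pass changes no labels, which is the "repeat until stable" condition described above.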

Choosing K=4

K=4 was chosen because it produces four distinct, interpretable segments. In practice, the elbow method plots inertia (sum of squared distances to centroids) against different values of K. The “elbow” — where adding another cluster stops reducing inertia significantly — guides K selection. The silhouette score provides a second check: how similar each customer is to its own cluster versus others.
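A minimal sketch of both checks with scikit-learn follows. The data is a synthetic stand-in for scaled RFM features (four tight groups in the unit cube), so the exact numbers say nothing about the playbook's dataset. KMeans, inertia_ and silhouette_score are the scikit-learn names; everything else is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled RFM data: four blobs in the unit cube.
rng = np.random.default_rng(0)
centres = np.array([
    [0.05, 0.90, 0.90],
    [0.15, 0.40, 0.25],
    [0.70, 0.20, 0.45],
    [0.40, 0.03, 0.02],
])
X = np.vstack([rng.normal(c, 0.03, size=(50, 3)) for c in centres])

results = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = {
        "inertia": model.inertia_,                          # elbow method input
        "silhouette": silhouette_score(X, model.labels_),   # higher is better
    }
    print(k, round(results[k]["inertia"], 3), round(results[k]["silhouette"], 3))

# Second check: the K with the highest mean silhouette score.
best_k = max(results, key=lambda k: results[k]["silhouette"])
print("best K by silhouette:", best_k)
```

Inertia always falls as K grows, so it is read for the bend in the curve rather than its minimum. The silhouette score has a genuine peak, which makes it a useful tie-breaker.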

Section 06 — The 4 Segments

The 4 segments discovered

Named after what the data shows, not what we hoped for. Each segment has a distinct RFM profile that points directly at the right business action.

Champions

High spend, frequent, recent

30 customers

R ~9 days

F ~32 orders

M ~$7.3K

AOV ~$231

Revenue share: 45.0%

$218,300 total · 15% of customers

These customers drive a disproportionate share of revenue.

Action: Early access to new products, loyalty rewards, VIP support.

Loyal Buyers

Regular buyers, steady spend

50 customers

R ~29 days

F ~15 orders

M ~$1.9K

AOV ~$129

Revenue share: 19.6%

$94,900 total · 25% of customers

Reliable revenue base, but liable to slip into At-Risk if engagement drops.

Action: Personalised recommendations, volume discounts, subscription offers.

At-Risk

Previously high-value, now inactive

40 customers

R ~140 days

F ~8 orders

M ~$3.9K

AOV ~$471

Revenue share: 31.9%

$154,600 total · 20% of customers

Highest average order value of any segment — these customers know how to spend.

Action: Win-back campaigns, personalised re-engagement, survey to understand churn reason.

Occasional

Low frequency, low spend

80 customers

R ~75 days

F ~2 orders

M ~$219

AOV ~$130

Revenue share: 3.6%

$17,550 total · 40% of customers

Large segment by count, small by revenue. The cost of acquiring these customers is not recovered if they churn after one or two orders.

Action: Activation campaigns, first repeat-purchase incentives, product discovery nudges.
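K-Means returns numbered clusters, not names, so a naming step has to follow. The sketch below is one hypothetical way to automate it from the centroid profiles. The rules (most orders, longest silence, lowest spend) are illustrative assumptions, not the notebook's logic, and they assume exactly four clusters. The demo centroids use the approximate segment profiles quoted in this playbook.

```python
import numpy as np


def name_clusters(centroids: np.ndarray) -> dict:
    """Map cluster index -> segment name from centroid positions.

    `centroids` has one row per cluster with columns
    (recency_days, frequency, monetary) in original units.
    The rules are illustrative assumptions and expect exactly 4 clusters.
    """
    remaining = set(range(len(centroids)))
    names = {}

    def take(name, column, pick):
        # Choose the remaining cluster with the max/min value in one column.
        idx = pick(remaining, key=lambda i: centroids[i, column])
        names[idx] = name
        remaining.remove(idx)

    take("Champions", 1, max)    # most orders
    take("At-Risk", 0, max)      # longest time since last purchase
    take("Occasional", 2, min)   # lowest total spend
    names[remaining.pop()] = "Loyal Buyers"  # whatever is left
    return names


# Approximate segment profiles from this playbook, in arbitrary cluster order.
centroids = np.array([
    [140, 8, 3900],   # cluster 0
    [75, 2, 219],     # cluster 1
    [9, 32, 7277],    # cluster 2
    [29, 15, 1898],   # cluster 3
])
names = name_clusters(centroids)
print(names)
```

Cluster indices can change between runs, so rules tied to the centroid profile keep the names stable where a fixed index-to-name table would not.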

Section 07 — Production Architecture

How a company would build this

Running K-Means in a notebook is one thing. Delivering segment labels in real time to a marketing platform that handles millions of customers is another. Here is a practical architecture.

Customer Events

Event stream / Database

Purchases, logins, clicks — every customer interaction is an event with customer_id, timestamp, and order details.

Data Warehouse

BigQuery / Redshift / Snowflake

Events land in a warehouse. Orders are aggregated per customer to compute raw totals: last purchase date, order count, total spend.

Feature Pipeline

dbt / Airflow / Spark

A daily job computes RFM features per customer: recency in days, frequency count, monetary total. Output: one row per customer_id.

Segmentation Job

Python · scikit-learn · weekly batch

K-Means runs on the normalised RFM features with K=4. Each customer is assigned a cluster label. The centroids and the fitted scaling parameters are stored so that new customers can be assigned later without re-clustering.

Customer Profile Store

Redis / Postgres / DynamoDB

The segment label is written back to a fast lookup store indexed by customer_id. Every downstream system reads from here — no re-clustering at request time.

Downstream Systems

Marketing / Product / CRM

Marketing platform, product recommendation engine, CRM, support tools — all consume the segment label to tailor their behaviour per customer.
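The Segmentation Job and Customer Profile Store stages can be sketched together as follows. This is a toy version under stated assumptions: the RFM rows are synthetic, an in-memory SQLite table stands in for Redis, Postgres or DynamoDB, and the table and function names are hypothetical.

```python
import sqlite3

import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# --- Segmentation job (batch) ----------------------------------------------
# Synthetic RFM rows (recency days, orders, spend), loosely shaped like the
# four segment profiles in this playbook. Not the real dataset.
rng = np.random.default_rng(1)
profiles = [(9, 32, 7300), (29, 15, 1900), (140, 8, 3900), (75, 2, 220)]
X_train = np.vstack([
    rng.normal(loc=p, scale=(2, 1, 100), size=(40, 3)) for p in profiles
])
customer_ids = [f"C{i:03d}" for i in range(1, len(X_train) + 1)]

# Scaling and clustering live in one pipeline so that new customers are
# scaled with the same min/max the centroids were fitted on.
segmenter = make_pipeline(
    MinMaxScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=42),
)
cluster_ids = segmenter.fit_predict(X_train)

# --- Customer profile store (SQLite as a stand-in) --------------------------
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_segment ("
    "customer_id TEXT PRIMARY KEY, cluster_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT OR REPLACE INTO customer_segment VALUES (?, ?)",
    [(cid, int(label)) for cid, label in zip(customer_ids, cluster_ids)],
)
conn.commit()


def get_cluster(customer_id):
    """Single-key lookup used by downstream systems; no clustering here."""
    row = conn.execute(
        "SELECT cluster_id FROM customer_segment WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0] if row else None


# Between batch runs: assign a new customer to the nearest stored centroid.
new_cluster = int(segmenter.predict(np.array([[12, 30, 7000]]))[0])
print(get_cluster("C001"), new_cluster)
```

In a real deployment the fitted pipeline would be persisted between runs (for example with joblib) and the store would hold the segment name in place of the raw cluster index.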

Refresh note: RFM features should update daily. Segment re-assignment can run weekly — segments are stable unless a major cohort shifts. Champions dropping to Loyal is an early warning signal worth monitoring separately.

Practical considerations

RFM refresh frequency

Update daily. A customer who just made a large purchase should move segments quickly.

Segment re-assignment

Run K-Means weekly. Segments are stable — Champions rarely flip overnight. But a full monthly batch misses emerging At-Risk signals.

Champions dropping to Loyal

An early warning signal. Recency is increasing while frequency is flat. Trigger a proactive retention campaign before they reach At-Risk.

Cold-start for new customers

New customers have no history. Default to Occasional. After 2–3 orders, recompute and reassign. Avoid using day-1 data for Champions criteria.
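Two of these considerations translate into small rules around the model output. The sketch below assumes each run produces a table with customer_id, segment, recency and frequency columns. The three-order threshold is an assumption taken from the upper end of the "2–3 orders" guidance, and the function names are hypothetical.

```python
import pandas as pd

MIN_ORDERS_FOR_CLUSTERING = 3  # assumption based on the "2-3 orders" guidance


def apply_cold_start(segments: pd.DataFrame) -> pd.DataFrame:
    """Label customers with too little history as 'Occasional'."""
    out = segments.copy()
    too_new = out["frequency"] < MIN_ORDERS_FOR_CLUSTERING
    out.loc[too_new, "segment"] = "Occasional"
    return out


def champion_downgrades(previous: pd.DataFrame, current: pd.DataFrame) -> pd.DataFrame:
    """Customers labelled Champions last run and Loyal Buyers this run."""
    merged = previous.merge(current, on="customer_id", suffixes=("_prev", "_curr"))
    moved = (
        (merged["segment_prev"] == "Champions")
        & (merged["segment_curr"] == "Loyal Buyers")
    )
    return merged.loc[moved, ["customer_id", "recency_prev", "recency_curr"]]


# Illustrative snapshots from two consecutive runs.
previous = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "segment": ["Champions", "Champions", "Loyal Buyers"],
    "recency": [5, 8, 30],
    "frequency": [35, 28, 15],
})
current = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4"],
    "segment": ["Champions", "Loyal Buyers", "Loyal Buyers", "Champions"],
    "recency": [6, 24, 31, 1],
    "frequency": [35, 28, 15, 1],
})

current = apply_cold_start(current)          # C4 has one order: Occasional
alerts = champion_downgrades(previous, current)
print(alerts)
```

The downgrade list is what a proactive retention campaign would be triggered from, before those customers drift further toward At-Risk.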

Section 08 — AI Opportunities

Segments make AI smarter

A segment label is a context signal. When an AI system knows a customer is At-Risk rather than a Champion, its responses — recommendations, copy, timing — become dramatically more relevant. The AI does not replace segmentation; it consumes it.

AI Marketing Assistant

Questions it can answer

› “Which segment should receive this campaign?”
› “Which At-Risk customers are most likely to respond to a discount?”
› “Draft personalised subject lines for each segment”
› “Which Champions haven't received a loyalty offer in 90 days?”

Personalised Onboarding (SaaS)

Questions it can answer

› “Which onboarding path matches this customer's usage profile?”
› “Which features should we highlight for this segment?”
› “Is this customer's behaviour tracking toward Champions or Occasional?”
› “What intervention has historically moved Occasional users to Loyal?”

Churn Early Warning

Questions it can answer

› “Which Loyal customers are trending toward At-Risk?”
› “What changed in the last 30 days for customers now classified At-Risk?”
› “What intervention historically reduces churn for this segment?”
› “Show me Champions whose recency has increased by more than 10 days this month”

The AI does not replace segmentation — it consumes the segment labels. Segment → context → better AI output. Without the segment signal, the same AI question returns a generic answer. With it, the response is specific, actionable, and relevant.
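One simple way to pass the segment signal along is to prepend it to the request before it reaches a language model. The sketch below only builds the prompt text and makes no model call. The guidance strings paraphrase the recommended actions in this playbook, and all names are hypothetical.

```python
# Hypothetical mapping from segment label to guidance for the AI system,
# paraphrasing the recommended actions in this playbook.
SEGMENT_GUIDANCE = {
    "Champions": "Offer early product access and loyalty rewards.",
    "Loyal Buyers": "Suggest personalised recommendations or a subscription.",
    "At-Risk": "Write a win-back message and ask why they stopped buying.",
    "Occasional": "Encourage a first repeat purchase with a small incentive.",
}


def build_prompt(task: str, customer_id: str, segment: str) -> str:
    """Prepend segment context to a task before sending it to a language model."""
    guidance = SEGMENT_GUIDANCE.get(segment, "No segment guidance available.")
    return (
        f"Customer {customer_id} is in the '{segment}' segment.\n"
        f"Guidance: {guidance}\n"
        f"Task: {task}"
    )


prompt = build_prompt("Draft an email subject line.", "C087", "At-Risk")
print(prompt)
```

The same task string produces a different prompt for each segment, which is the "segment → context → better AI output" chain in miniature.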

Section 09 — Apply to Your Business

What data do you have?

RFM segmentation is not limited to ecommerce. Anywhere you have engagement records — features used, leads touched, sessions completed — the same technique applies. The features change; the pattern does not.

🛒

Ecommerce

Dataset: Order history

“Which customers are actually similar to each other?”

Dataset: customer_id, order_date, order_total

RFM: Days since last order, order count, total spend

Decision: Targeted campaigns, personalised recommendations

Every ecommerce store with more than a few months of data has these patterns. Champions exist in your data — you just have not found them yet.

⚙️

SaaS

Dataset: Feature usage logs

“Which customers use the product similarly?”

Dataset: user_id, feature_name, last_used, session_count

RFM equiv: Last login, feature breadth, usage depth

Decision: Personalised onboarding, feature discovery, upgrade triggers

Power users and at-risk churners have distinct usage fingerprints. Segmenting by usage RFM surfaces both early — while action is still possible.

📊

CRM / Sales

Dataset: Lead activity history

“Which leads behave similarly in the funnel?”

Dataset: lead_id, last_activity, touchpoint_count, deal_value

RFM equiv: Days since last touch, engagement count, deal size

Decision: Prioritised outreach, differentiated follow-up cadence

High-value cold leads and active small leads need different attention. Segmenting by engagement RFM tells reps which ones to prioritise today.

🎓

Education

Dataset: Learning activity logs

“Which students learn similarly?”

Dataset: student_id, last_session, session_count, progress_pct

RFM equiv: Days since last session, session frequency, completion depth

Decision: Adaptive curriculum, early intervention for at-risk learners

Students at risk of dropping out show a recognisable RFM pattern before they stop logging in entirely. Early identification means early intervention.

The question to ask yourself

What are the “customers” in your system — users, leads, students, accounts? Can you express their behaviour as recency, frequency, and value? If yes, K-Means will find the natural groups. The segments are already there. You just have not looked.

Section 10 — From Segments to Decisions

A segment without an action is just a label

The value of segmentation is not the segment name. It is the decision that follows. Each segment profile points at a specific action — and the same business data supports all four simultaneously.

Segment	Profile	Recommended Action
Champions	9 days recency · 32 orders · $7,277 spend	Early product access, loyalty programme, VIP support tier
Loyal Buyers	29 days recency · 15 orders · $1,898 spend	Volume discounts, subscription offers, cross-sell recommendations
At-Risk	140 days silent · $471 avg order value	Win-back email sequence, personalised incentive, exit survey
Occasional	2 orders · $219 total spend	First repeat-purchase incentive, product discovery campaign

Note on sequencing: Start with Champions (protect your top revenue), then At-Risk (highest recovery potential). Occasional buyers are the largest segment by count but the smallest by revenue — they are a long-term activation project, not an immediate priority.

References & Further Reading

Go deeper

Where this playbook ends, these resources begin.

Run the full analysis yourself

Part A reproduces the exact 200-customer results on this page. Part B runs the same K-Means analysis on the UCI Online Retail Dataset, which contains 541,909 rows of real transaction data. Every finding can be reproduced and challenged.


Dataset

UCI Online Retail Dataset

Dr Daqing Chen, London South Bank University · 2015

Real transaction data from a UK-based online retailer (2010–2011). 541,909 rows, 25,900 transactions. Used in Part B of the companion notebook to verify that the same segmentation patterns emerge on real retail data.

archive.ics.uci.edu

Original Research

Recency, Frequency, Monetary Value Analysis

Hughes, A.M. · 1994

The paper that formalised RFM scoring for direct mail marketing. Introduced the idea that recency, frequency, and monetary value predict customer response better than demographic data — an insight that holds in modern ecommerce.


Deep Reading

scikit-learn KMeans Documentation

scikit-learn contributors · 2024

The implementation used in the notebook. Covers the algorithm, n_init for stability, random_state for reproducibility, and the inertia / silhouette score metrics for K selection.

scikit-learn.org

Mining of Massive Datasets — Chapter 7: Clustering

Leskovec, Rajaraman, Ullman (Stanford) · 2020

Free textbook chapter covering K-Means, BFR algorithm, and clustering at scale. The definitive reference for taking this beyond a notebook and into distributed systems.

