The dataset consists of the sales records of a UK-based online retail company between 2010 and 2011 (Source). E-commerce data is rarely found in publicly available sources because it contains customer-specific information; thanks to the UCI Machine Learning Repository, this dataset was published for public use.
The dataset is a single table with 8 columns and 541,909 rows. The columns, as used throughout this notebook, are: InvoiceNo (invoice number), StockCode (product code), Description (product name), Quantity (number of items in the transaction), InvoiceDate (invoice date and time), UnitPrice (price per unit), CustomerID (customer identifier) and Country (customer's country).
As the dataset comes from an e-commerce website, the main goal of the project is to understand how customers' behaviour changes. With this goal in mind, an RFM (Recency, Frequency, Monetary) analysis is useful for describing each customer's purchasing pattern; customers are then separated into clusters defined by their RFM-based segmentation.
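As a reminder of what the three RFM features mean, here is a minimal toy sketch (the toy frame below is purely illustrative; the real computation on the retail data appears later in the notebook):
import pandas as pd
toy = pd.DataFrame({"CustomerID": [1, 1, 2],
                    "days_since_purchase": [5, 40, 10],
                    "amount": [20.0, 35.0, 12.5]})
rfm_toy = toy.groupby("CustomerID").agg(
    Recency=("days_since_purchase", "min"),      # days since the most recent purchase
    Frequency=("days_since_purchase", "count"),  # number of orders placed
    Monetary=("amount", "sum"))                  # total amount spent
print(rfm_toy)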
To get successful results from the project, six steps were assumed, as shown below:
a. Data Cleaning
b. Data Analysis
c. Data Visualization
d. Data Modeling
e. Cluster Labelling
f. Conclusion
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
import calendar
data = 'data.csv'
df = pd.read_csv(data, encoding = 'unicode_escape')
df.head()
df.shape
#Checking if there are any missing values
df.isna().sum()
df = df.dropna()
df.isna().sum()
df.info()
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df["Amount"] = df['Quantity'] * df['UnitPrice']
df
df.info()
df.describe()
plt.figure(figsize=(8, 6))
sns.distplot(df["Quantity"])
plt.show()
There are some cancelled invoices in the dataset. Because cancelled transactions are out of the scope of our analysis, we decided to remove them, together with their matching original transactions, using the function shown below.
# Cancelled invoices have negative Quantity. Within each (StockCode, |Quantity|) group,
# drop the cancellation row and the row immediately preceding it (its original transaction).
def remove_cancelled_transactions(df):
    trans_neg = df.Quantity < 0
    return df.loc[~(trans_neg | trans_neg.shift(-1))]

groups = [df.StockCode, df.Quantity.abs()]
df_revized = df.groupby(groups, as_index=False, group_keys=False) \
              .apply(remove_cancelled_transactions)
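For illustration, here is how the pairing behaves on a hypothetical three-row frame (toy data, not part of the original dataset): the purchase/cancellation pair for StockCode "A" is removed, while the unrelated "B" row survives.
toy = pd.DataFrame({"StockCode": ["A", "A", "B"],
                    "Quantity": [6, -6, 4]})
toy_groups = [toy.StockCode, toy.Quantity.abs()]
result = toy.groupby(toy_groups, as_index=False, group_keys=False).apply(remove_cancelled_transactions)
print(result)  # only the StockCode "B" row remains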
plt.figure(figsize=(8, 6))
sns.distplot(df_revized["Quantity"])
plt.show()
df_revized.Quantity.sort_values()
The dataset covers the sales between the first date, 2010-12-01 08:26:00, and the last date, 2011-06-23 15:26:00.
print("first date: ", df_revized["InvoiceDate"].min(),"\n", "last date: ", df_revized["InvoiceDate"].max())
df_revized.head()
df_revized["year"], df_revized["month"] = df_revized["InvoiceDate"].dt.year, df_revized["InvoiceDate"].dt.month
sales = df_revized.groupby(["year","month"])["Amount"].sum() \
.reset_index().rename({"Amount":"TotalSales"},axis=1)
sales
new_rows = pd.DataFrame({"year":[2010,2010],
"month": [4,11],
"TotalSales": [38123.21,184980.04]},
index = [98,99]) # arbitrary indexes
# insert the row in the sales table
sales = pd.concat([new_rows, sales]) \
.sort_values(by=["year","month"]).reset_index(drop=True)
sales["month"] = sales["month"].apply(lambda x: calendar.month_abbr[x])
# combine month and year
sales["month"] = sales["month"].astype(str) + " " + sales["year"].astype(str)
# drop the redundant year column
sales = sales.drop("year", axis = 1)
sales = sales[0:23]
trace = go.Scatter(
x = sales["month"],
y = sales["TotalSales"],
mode = "lines+markers",
name = "TotalSales",
line = dict(width = 4),
marker = dict(
size = 10,
color = "rgba(120, 26, 120, 0.8)"
),
hovertemplate = " %{x}<br>£%{y:,.0f} <extra></extra>",
)
line_data = [trace]
layout = dict(
title = "Total Sales by Month",
titlefont = dict(size = 20),
margin=dict(l=10, r=50, b=10, t=70, pad=0),
xaxis= dict(title= "Month",ticklen = 5,zeroline = False),
yaxis= dict(title= "Total Sales", tickformat = ",.0f", tickprefix="£")
)
fig = dict(data = line_data, layout = layout)
iplot(fig)
The chart above shows the sales trend of all products on the e-commerce site during the period covered by the dataset. The trend, generally rising at the end of 2010, dips slightly at the beginning of 2011 but then continues to increase gradually towards the middle of the year, with slight fluctuations throughout.
There are 3651 unique products sold on this e-commerce website.
df_revized["StockCode"].nunique()
The most popular products, ranked by number of orders, are shown below:
df_revized["StockCode"].value_counts().head()
items = df_revized['Description'].value_counts().head()
print(items)
fig,ax = plt.subplots(1,1,figsize=(15,5))
color = plt.cm.copper(np.linspace(0, 1, 40)) #importance: color
df_revized['Description'].value_counts().head(40).plot.bar(color = color)
plt.title('Frequency of most bought items Description wise', fontsize=25)
plt.ylabel("Counts")
plt.show()
# orders by country
fig,ax = plt.subplots(1,1,figsize=(15,5))
stats_country = df_revized.groupby(['Country']).Quantity.agg([np.sum])
stats_country.sort_values(by='sum',ascending=False).plot(kind='bar',ax=ax).set_title('Count of orders by country')
plt.yscale("log")
plt.grid()
stats_country.sort_values(by='sum',ascending=False).head()
The distribution of countries by number of products ordered is shown above. The top three countries are the U.K., the Netherlands and Germany.
plt.figure()
df_revized['Country'].value_counts()[:5].plot(kind = 'pie', autopct='%1.1f%%')
plt.title('Top 5 countries based on transactions', weight='bold')
plt.show()
When the transactions are counted on a country basis, the distribution in the pie chart above is obtained. The first three places are taken by England, the Netherlands and Ireland.
fig,ax = plt.subplots(1,1,figsize=(15,5))
temp = df_revized.copy()
temp['Amount'] = temp['Quantity']*temp['UnitPrice']
stats_country = temp.groupby(['Country']).Amount.agg([np.sum, np.mean])
stats_country.sort_values(by='sum',ascending=False).plot(kind='bar',ax=ax).set_title('Amount of orders by country')
plt.yscale("log")
plt.grid()
stats_country.sort_values(by='mean',ascending=False).head()
Examining the total and average turnover of orders on a country basis produces the chart above. Although the highest average turnover per country is achieved in the Netherlands, the bulk of the total turnover is again generated in England.
stats_cust = df_revized.groupby(['CustomerID'])
for key, value in stats_cust:
    if len(value.Country.unique()) > 1:
        print(key, value.Country.unique())
Looking at the customer-country matches, a total of 8 different customers were identified who ordered from more than one country.
df_revized.info()
df_revized.shape
df.InvoiceDate.sort_values(ascending=False).head(10)
print("Min:{}; Max:{}".format(min(df.InvoiceDate), max(df.InvoiceDate)))
import datetime
start_date = max(df.InvoiceDate)  # most recent invoice date (basis for the snapshot date)
start_date
end_date = min(df.InvoiceDate)    # earliest invoice date
end_date
The snapshot date is defined as the day after the date of the last sale.
snapshot_date = start_date + datetime.timedelta(days=1)
snapshot_date
df_revized
The number of unique InvoiceNo values was calculated.
df_revized["InvoiceNo"].nunique()
df_revized = df_revized[df_revized["CustomerID"] != 13256]
In order to build the RFM table, each column was calculated according to the formulas in the chunk below.
df_rfm = df_revized.groupby("CustomerID").agg({"InvoiceDate": lambda x: (snapshot_date - x.max()).days,
"InvoiceNo": "count",
"Amount": "sum"})
df_rfm
df_rfm.rename(columns={"InvoiceNo": "Frequency",
"InvoiceDate": "Recency",
"Amount": "Monetary"}, inplace=True)
df_rfm.head()
df_rfm.describe()
heatmap_all = sns.heatmap(df_rfm.corr(), vmin=-1, vmax=1, annot=True)
An attempt was made to normalise the Recency, Frequency and Monetary features using four different transformations, one by one: square root, reciprocal, log, and Box-Cox. None of them produced a strictly normal distribution, but the values closest to a normal distribution were obtained with the Box-Cox method.
sns.distplot(df_rfm["Recency"])
plt.show()
sns.distplot(df_rfm["Frequency"])
plt.show()
sns.distplot(df_rfm["Monetary"])
plt.show()
from scipy import stats
shapiro_Recency= stats.shapiro(df_rfm.Recency)
shapiro_Monetary= stats.shapiro(df_rfm.Monetary)
shapiro_Frequency= stats.shapiro(df_rfm.Frequency)
print("Shapiro Score of Recency: {}".format(shapiro_Recency))
print("Shapiro Score of Monetary: {}".format(shapiro_Monetary))
print("Shapiro Score of Frequency: {}".format(shapiro_Frequency))
df_rfm_log = np.log(df_rfm)
plt.subplot(3,1,1) ; sns.distplot(df_rfm_log["Recency"])
plt.subplot(3,1,2) ; sns.distplot(df_rfm_log["Monetary"])
plt.subplot(3,1,3) ; sns.distplot(df_rfm_log["Frequency"])
from scipy import stats
shapiro_Recency_log= stats.shapiro(df_rfm_log.Recency)
shapiro_Monetary_log= stats.shapiro(df_rfm_log.Monetary)
shapiro_Frequency_log= stats.shapiro(df_rfm_log.Frequency)
print("Shapiro Score of Recency: {}".format(shapiro_Recency_log))
print("Shapiro Score of Monetary: {}".format(shapiro_Monetary_log))
print("Shapiro Score of Frequency: {}".format(shapiro_Frequency_log))
df_rfm_sqrt = (df_rfm)**(0.5)
plt.subplot(3,1,1) ; sns.distplot(df_rfm_sqrt["Recency"])
plt.subplot(3,1,2) ; sns.distplot(df_rfm_sqrt["Monetary"])
plt.subplot(3,1,3) ; sns.distplot(df_rfm_sqrt["Frequency"])
from scipy import stats
print(stats.shapiro(df_rfm_sqrt["Recency"]))
print(stats.shapiro(df_rfm_sqrt["Monetary"]))
print(stats.shapiro(df_rfm_sqrt["Frequency"]))
df_rfm_recip = 1/df_rfm
plt.subplot(3,1,1) ; sns.distplot(df_rfm_recip["Recency"])
plt.subplot(3,1,2) ; sns.distplot(df_rfm_recip["Monetary"])
plt.subplot(3,1,3) ; sns.distplot(df_rfm_recip["Frequency"])
from scipy import stats
print(stats.shapiro(df_rfm_recip["Recency"]))
print(stats.shapiro(df_rfm_recip["Monetary"]))
print(stats.shapiro(df_rfm_recip["Frequency"]))
df_rfm_boxcox = df_rfm.copy()
from scipy.stats import boxcox
for i in df_rfm_boxcox:
    # lmbda=None lets boxcox estimate the optimal lambda for each column
    df_rfm_boxcox[i], lmbda = boxcox(df_rfm_boxcox[i], lmbda=None)
plt.subplot(3,1,1) ; sns.distplot(df_rfm_boxcox["Recency"])
plt.subplot(3,1,2) ; sns.distplot(df_rfm_boxcox["Monetary"])
plt.subplot(3,1,3) ; sns.distplot(df_rfm_boxcox["Frequency"])
from scipy import stats
print(stats.shapiro(df_rfm_boxcox["Recency"]))
print(stats.shapiro(df_rfm_boxcox["Monetary"]))
print(stats.shapiro(df_rfm_boxcox["Frequency"]))
The Shapiro-Wilk test was used to check whether the RFM columns are normally distributed. Because none of the three was, four different transformations were tried to bring them closer to normality. With the fourth technique, Box-Cox, the distributions finally look approximately normal.
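As a quick reminder of how the test output is read (a minimal sketch; the 0.05 threshold is the conventional choice and is an assumption here, not taken from the original text):
from scipy import stats
stat, p_value = stats.shapiro(df_rfm_boxcox["Recency"])
# the null hypothesis is that the sample comes from a normal distribution;
# a p-value below 0.05 suggests rejecting normality
print("looks normal" if p_value > 0.05 else "not normal", "(p = {:.4f})".format(p_value))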
Min-max scaling was used to scale the RFM columns to the range 0 to 1; for each column x it computes (x - x.min()) / (x.max() - x.min()).
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df_rfm_boxcox)
df_rfm_normalized = scaler.transform(df_rfm_boxcox)
df_rfm_normalized = pd.DataFrame(data=df_rfm_normalized, index=df_rfm.index, columns=df_rfm.columns)
df_rfm_normalized.head()
print(df_rfm_normalized.describe().round(2))
First, the elbow method is applied for k from 1 to 10: for each k, the within-cluster sum of squared errors (the KMeans inertia) is plotted. Three looks like a reasonable choice, but the silhouette score graph can make the number of clusters clearer.
from sklearn.cluster import KMeans
sse = {}
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=1)
    kmeans.fit(df_rfm_normalized)
    sse[k] = kmeans.inertia_  # within-cluster sum of squared distances
plt.title("The Elbow Method")
plt.xlabel("k")
plt.ylabel("SSE")
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()
def optimal_kmeans(dataset, start=2, end=11):
    '''
    Calculate the silhouette score over a range of k-means cluster counts.
    INPUT:
        dataset : dataframe. Dataset for k-means to fit.
        start   : int. Start of the range of cluster counts to test.
        end     : int. End (exclusive) of the range of cluster counts to test.
    OUTPUT:
        Printed values and a line plot of the silhouette score.
    '''
    # Lists to store values for plotting
    n_clu = []
    km_ss = []
    # Loop over candidate numbers of clusters
    for n_clusters in range(start, end):
        # Create cluster labels
        kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=1)
        labels = kmeans.fit_predict(dataset)
        # Calculate model performance
        silhouette_avg = round(silhouette_score(dataset, labels, random_state=1), 3)
        # Append score to lists
        km_ss.append(silhouette_avg)
        n_clu.append(n_clusters)
        print("No. Clusters: {}, Silhouette Score: {}, Change from Previous Cluster: {}".format(
            n_clusters,
            silhouette_avg,
            (km_ss[n_clusters - start] - km_ss[n_clusters - start - 1]).round(3)))
        # Plot the graph at the end of the loop
        if n_clusters == end - 1:
            plt.figure(figsize=(6.47, 3))
            plt.title('Silhouette Score')
            sns.pointplot(x=n_clu, y=km_ss)
            plt.tight_layout()
            plt.show()
A silhouette score graph was plotted for cluster counts from 2 to 15. The score shows slight local increases at 6 and 8 clusters.
from sklearn.metrics import silhouette_score
optimal_kmeans(df_rfm_normalized, start=2, end=16)
def kmeans_pca(df, n_clusters):
    # Fit k-means and get the cluster assignments and centers
    kmeans = KMeans(n_clusters=n_clusters, n_init=20, random_state=1)
    kmeans.fit(df)
    centers = kmeans.cluster_centers_
    clusters = kmeans.predict(df)

    # Project the data to 2-D with PCA
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2, random_state=1)
    pca.fit(df)
    df_rfm_pca = pd.DataFrame(pca.transform(df), columns=["PCA1", "PCA2"])
    df_rfm_pca["Clusters"] = clusters

    # Project the cluster centers with the same PCA fitted on the data
    # (re-fitting PCA on the centers would place them in a different coordinate system)
    centers_pca = pca.transform(centers)

    plt.figure(figsize=(14, 10))
    sns.scatterplot(x=df_rfm_pca["PCA1"], y=df_rfm_pca["PCA2"], hue=clusters, s=25, palette="brg")
    sns.scatterplot(x=centers_pca[:, 0], y=centers_pca[:, 1], marker="o", s=200, alpha=0.8)
    plt.xlabel("PCA1")
    plt.ylabel("PCA2")
    plt.show()
kmeans_pca(df_rfm_normalized, 6)
kmeans_pca(df_rfm_normalized, 8)
The PCA method was used to display the dataset in the 2-D plots shown above.
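One way to sanity-check how much information the 2-D projection keeps (this check is an addition, not part of the original analysis) is the explained variance ratio:
from sklearn.decomposition import PCA
pca_check = PCA(n_components=2, random_state=1)
pca_check.fit(df_rfm_normalized)
print(pca_check.explained_variance_ratio_)  # share of variance captured by each of the two components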
kmeans = KMeans(n_clusters=8, n_init=20, random_state=1)
df_rfm_k8_stat = kmeans.fit(df_rfm_normalized)
labels = kmeans.labels_
centers= kmeans.cluster_centers_
clusters= kmeans.predict(df_rfm_normalized)
centers.shape
df_rfm_normalized.shape
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=1)
pca.fit(df_rfm_normalized)
df_rfm_pca = pca.transform(df_rfm_normalized)
df_rfm_pca.shape
df_rfm_pca = pd.DataFrame(df_rfm_pca, columns=["PCA1", "PCA2"])
df_rfm_pca["Clusters"] = clusters
df_rfm_pca
The PCA projection and pair plots were produced for all of the important cluster numbers (3, 6 and 8).
important_clusters = [3, 6, 8]
from sklearn.decomposition import PCA
for i in important_clusters:
    kmeans = KMeans(n_clusters=i, n_init=20, random_state=1)
    df_rfm_k = kmeans.fit(df_rfm_normalized)
    centers = kmeans.cluster_centers_
    clusters = kmeans.predict(df_rfm_normalized)
    pca = PCA(n_components=2, random_state=1)
    pca.fit(df_rfm_normalized)
    df_rfm_pca = pd.DataFrame(pca.transform(df_rfm_normalized), columns=["PCA1", "PCA2"])
    df_rfm_pca["Clusters"] = clusters
    # project the cluster centers with the same PCA fitted on the data
    centers_pca = pca.transform(centers)
    plt.figure(figsize=(10, 70))
    plt.subplot(10, 1, i)
    plt.scatter(df_rfm_pca["PCA1"], df_rfm_pca["PCA2"], c=clusters, s=25, cmap='brg')
    plt.scatter(centers_pca[:, 0], centers_pca[:, 1], c='black', marker="o", s=200, alpha=0.8)
    plt.title('Number of Clusters: {}'.format(i))
After comparing the plots for 3, 6 and 8 clusters, it was concluded that the 8-cluster solution gives the most accurate and explainable separation. Therefore, 8 was chosen as the number of clusters for this dataset.
With the aim of assigning meaningful labels to the clusters, another set of k-means runs was performed, as shown in the chunks below, to better understand the RFM clusters. These chunks produce pair plots for each pair of RFM features (Recency-Monetary, Frequency-Monetary and Frequency-Recency).
df_rfm_normalized_multiple = df_rfm_normalized.copy()
# k means on the Recency-Monetary pair
important_clusters = [3, 6, 8]
for i in important_clusters:
    kmeans = KMeans(n_clusters=i, random_state=1)
    df_rfm_normalized_multiple['cluster'] = kmeans.fit_predict(df_rfm_normalized[['Recency', 'Monetary']])
    clusters = kmeans.predict(df_rfm_normalized[['Recency', 'Monetary']])
    # get centroids
    centroids = kmeans.cluster_centers_
    cen_x = [j[0] for j in centroids]
    cen_y = [j[1] for j in centroids]
    # add centroid coordinates to df_rfm_normalized_multiple (one entry per cluster, not just the first three)
    df_rfm_normalized_multiple['cen_x'] = df_rfm_normalized_multiple.cluster.map({k: cen_x[k] for k in range(i)})
    df_rfm_normalized_multiple['cen_y'] = df_rfm_normalized_multiple.cluster.map({k: cen_y[k] for k in range(i)})
    ##### PLOT #####
    plt.figure(figsize=(10, 70))
    plt.subplot(10, 1, i)
    plt.scatter(df_rfm_normalized_multiple.Recency, df_rfm_normalized_multiple.Monetary, c=clusters, cmap="brg", alpha=0.6, s=10)
    plt.scatter(cen_x, cen_y, marker='^', c="black", s=100, alpha=0.5)
    # title and labels
    plt.title('\nRecency - Monetary\nNumber of Clusters: {}'.format(i))
    plt.xlabel('Recency')
    plt.ylabel('Monetary')
# k means on the Frequency-Monetary pair
important_clusters = [3, 6, 8]
for i in important_clusters:
    kmeans = KMeans(n_clusters=i, random_state=1)
    df_rfm_normalized_multiple['cluster'] = kmeans.fit_predict(df_rfm_normalized[['Frequency', 'Monetary']])
    clusters = kmeans.predict(df_rfm_normalized[['Frequency', 'Monetary']])
    # get centroids
    centroids = kmeans.cluster_centers_
    cen_x = [j[0] for j in centroids]
    cen_y = [j[1] for j in centroids]
    # add centroid coordinates to df_rfm_normalized_multiple (one entry per cluster)
    df_rfm_normalized_multiple['cen_x'] = df_rfm_normalized_multiple.cluster.map({k: cen_x[k] for k in range(i)})
    df_rfm_normalized_multiple['cen_y'] = df_rfm_normalized_multiple.cluster.map({k: cen_y[k] for k in range(i)})
    ##### PLOT #####
    plt.figure(figsize=(10, 70))
    plt.subplot(10, 1, i)
    plt.scatter(df_rfm_normalized_multiple.Frequency, df_rfm_normalized_multiple.Monetary, c=clusters, cmap="brg", alpha=0.6, s=10)
    plt.scatter(cen_x, cen_y, marker='^', c="black", s=100, alpha=0.5)
    # title and labels
    plt.title('\nFrequency - Monetary\nNumber of Clusters: {}'.format(i))
    plt.xlabel('Frequency')
    plt.ylabel('Monetary')
# k means on the Frequency-Recency pair
important_clusters = [3, 6, 8]
for i in important_clusters:
    kmeans = KMeans(n_clusters=i, random_state=1)
    df_rfm_normalized_multiple['cluster'] = kmeans.fit_predict(df_rfm_normalized[['Frequency', 'Recency']])
    clusters = kmeans.predict(df_rfm_normalized[['Frequency', 'Recency']])
    # get centroids
    centroids = kmeans.cluster_centers_
    cen_x = [j[0] for j in centroids]
    cen_y = [j[1] for j in centroids]
    # add centroid coordinates to df_rfm_normalized_multiple (one entry per cluster)
    df_rfm_normalized_multiple['cen_x'] = df_rfm_normalized_multiple.cluster.map({k: cen_x[k] for k in range(i)})
    df_rfm_normalized_multiple['cen_y'] = df_rfm_normalized_multiple.cluster.map({k: cen_y[k] for k in range(i)})
    ##### PLOT #####
    plt.figure(figsize=(10, 70))
    plt.subplot(10, 1, i)
    plt.scatter(df_rfm_normalized_multiple.Frequency, df_rfm_normalized_multiple.Recency, c=clusters, cmap="brg", alpha=0.6, s=10)
    plt.scatter(cen_x, cen_y, marker='^', c="black", s=100, alpha=0.5)
    # title and labels
    plt.title('\nFrequency - Recency\nNumber of Clusters: {}'.format(i))
    plt.xlabel('Frequency')
    plt.ylabel('Recency')
df_rfm_k8_stat = df_rfm.assign(Cluster = labels)
df_rfm_k8_stat.groupby(["Cluster"]).agg({
"Recency": ["mean", "std"],
"Monetary": ["mean", "std"],
"Frequency": ["mean", "std", "count"],
}).round(0)
df_rfm_normalized_k8 = df_rfm_normalized.copy()
df_rfm_normalized_k8["Cluster"] = df_rfm_k8_stat["Cluster"]
df_rfm_melt = pd.melt(df_rfm_normalized_k8.reset_index(),
id_vars= ["CustomerID", "Cluster"],
value_vars= ["Recency", "Monetary", "Frequency"],
var_name= "Attribute",
value_name= "Value")
plt.title("Snake Plot of Standardized Variables")
sns.lineplot(x="Attribute", y="Value", hue="Cluster", data=df_rfm_melt)
important_clusters = [3, 6, 8]
from sklearn.decomposition import PCA
for i in important_clusters:
    kmeans = KMeans(n_clusters=i, n_init=20, random_state=1)
    df_rfm_k = kmeans.fit(df_rfm_normalized)
    labels = kmeans.labels_
    centers = kmeans.cluster_centers_
    clusters = kmeans.predict(df_rfm_normalized)
    pca = PCA(n_components=2, random_state=1)
    pca.fit(df_rfm_normalized)
    df_rfm_pca = pd.DataFrame(pca.transform(df_rfm_normalized), columns=["PCA1", "PCA2"])
    df_rfm_pca["Clusters"] = clusters
    # project the cluster centers with the same PCA fitted on the data
    centers_pca = pca.transform(centers)
    plt.figure(figsize=(14, 10))
    sns.scatterplot(x=df_rfm_pca["PCA1"], y=df_rfm_pca["PCA2"], hue=clusters, s=25, palette="brg")
    sns.scatterplot(x=centers_pca[:, 0], y=centers_pca[:, 1], marker="o", s=200, alpha=0.8)
    plt.xlabel("PCA1")
    plt.ylabel("PCA2")
    plt.show()
    # cluster summary statistics for this number of clusters
    df_rfm_k = df_rfm.assign(Cluster=labels)
    df_rfm_summary = df_rfm_k.groupby(["Cluster"]).agg({
        "Recency": ["mean", "std"],
        "Monetary": ["mean", "std", "max"],
        "Frequency": ["mean", "std", "count"],
    }).round(0)
    print('\t\t\tNumber of Cluster {}\n'.format(i))
    print(df_rfm_summary)
    print("\n")
After numerous trials, the cluster labels were determined based on the visualizations above (the 2-D PCA, snake and pair plots) together with the cluster statistics.
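As a sketch of how such labels can be attached, the mapping below uses placeholder segment names (these names are illustrative assumptions, not the labels chosen in the project; the real mapping should follow from the cluster statistics above):
# hypothetical segment names for the 8 clusters -- illustrative only
cluster_labels = {0: "Champions", 1: "Loyal", 2: "Potential Loyalists", 3: "New Customers",
                  4: "Need Attention", 5: "At Risk", 6: "Hibernating", 7: "Lost"}
df_rfm_k8_stat["Segment"] = df_rfm_k8_stat["Cluster"].map(cluster_labels)
df_rfm_k8_stat.groupby("Segment")[["Recency", "Frequency", "Monetary"]].mean().round(1)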
As a result of all these studies and analyses, our suggested actions for the following groups are presented below.