Back to Projects

UK-ecommerce Retention Analysis

2024 E-commerce Analytics SQL, Cohort Analysis

1. Overview

1-1. Problem

1-2. Analytical Method: Cohort / Retention

Definition of retained customer

We defined retained customer as customers who are registered to the service and have ordered from the service more than once.

We have limited our customer data to be within UK.

Defining Cohort

From the service launch in December 2010 to November 2011, we plan to analyze retention based on purchase orders by grouping 'first-time purchasers' into the same cohorts at monthly intervals.

SQL query for customer's purchase cycle (Reference page)

View SQL Query
SELECT MEDIAN(customer_avg_buying_cycle) AS avg_buying_cycle
FROM (
    SELECT customer_id
        , CASE WHEN order_count - 1 == 0 THEN 0
                ELSE order_date_diff / (order_count - 1)
        END AS customer_avg_buying_cycle
    FROM (
        SELECT customer_id
            , EXTRACT('epoch' FROM DATE_TRUNC('day', last_order_date - first_order_date))/3600/24 AS order_date_diff
            , order_count
        FROM (
            SELECT customer_id
                , MIN(order_date) AS first_order_date
                , MAX(order_date) AS last_order_date
                , COUNT(DISTINCT order_id) AS order_count
            FROM retail_cleaned
            GROUP BY 1
        ) AS step1
    ) AS step2
) AS step3

Retention Chart


Global market share of the business Visualization of the sales proportions for each country
View Retention Chart SQL Query
WITH retail_cleaned AS (
    SELECT *
         , MIN(order_month) OVER(PARTITION BY customer_id) AS first_order_month
    FROM retail
    WHERE country = 'United Kingdom'
    AND order_date NOT LIKE '2011-12%'
    AND category != '삭제'
    AND customer_id IS NOT NULL
    AND unit_price >= 0.01
)
SELECT first_order_month
      ,month0
      ,ROUND(IF(month1 = 0, 0, month1 / month0 * 100) ,2) AS month1_pct
      ,ROUND(IF(month2 = 0, 0, month2 / month0 * 100) ,2) AS month2_pct
      ,ROUND(IF(month3 = 0, 0, month3 / month0 * 100) ,2) AS month3_pct
      ,ROUND(IF(month4 = 0, 0, month4 / month0 * 100) ,2) AS month4_pct
      ,ROUND(IF(month5 = 0, 0, month5 / month0 * 100) ,2) AS month5_pct
      ,ROUND(IF(month6 = 0, 0, month6 / month0 * 100) ,2) AS month6_pct
      ,ROUND(IF(month7 = 0, 0, month7 / month0 * 100) ,2) AS month7_pct
      ,ROUND(IF(month8 = 0, 0, month8 / month0 * 100) ,2) AS month8_pct
      ,ROUND(IF(month9 = 0, 0, month9 / month0 * 100) ,2) AS month9_pct
      ,ROUND(IF(month10 = 0, 0, month10 / month0 * 100) ,2) AS month10_pct
      ,ROUND(IF(month11 = 0, 0, month11 / month0 * 100) ,2) AS month11_pct
FROM (
        SELECT LEFT(first_order_month,10) AS first_order_month
                ,COUNT(DISTINCT customer_id) AS month0
                ,COUNT(DISTINCT CASE WHEN DATE_ADD(first_order_month, INTERVAL 1 MONTH) = order_month THEN customer_id END) AS month1
                ,COUNT(DISTINCT CASE WHEN DATE_ADD(first_order_month, INTERVAL 2 MONTH) = order_month THEN customer_id END) AS month2
                ,COUNT(DISTINCT CASE WHEN DATE_ADD(first_order_month, INTERVAL 3 MONTH) = order_month THEN customer_id END) AS month3
                ,COUNT(DISTINCT CASE WHEN DATE_ADD(first_order_month, INTERVAL 4 MONTH) = order_month THEN customer_id END) AS month4
                ,COUNT(DISTINCT CASE WHEN DATE_ADD(first_order_month, INTERVAL 5 MONTH) = order_month THEN customer_id END) AS month5
                ,COUNT(DISTINCT CASE WHEN DATE_ADD(first_order_month, INTERVAL 6 MONTH) = order_month THEN customer_id END) AS month6
                ,COUNT(DISTINCT CASE WHEN DATE_ADD(first_order_month, INTERVAL 7 MONTH) = order_month THEN customer_id END) AS month7
                ,COUNT(DISTINCT CASE WHEN DATE_ADD(first_order_month, INTERVAL 8 MONTH) = order_month THEN customer_id END) AS month8
                ,COUNT(DISTINCT CASE WHEN DATE_ADD(first_order_month, INTERVAL 9 MONTH) = order_month THEN customer_id END) AS month9
                ,COUNT(DISTINCT CASE WHEN DATE_ADD(first_order_month, INTERVAL 10 MONTH) = order_month THEN customer_id END) AS month10
                ,COUNT(DISTINCT CASE WHEN DATE_ADD(first_order_month, INTERVAL 11 MONTH) = order_month THEN customer_id END) AS month11
        FROM retail_cleaned
        GROUP BY first_order_month
        ORDER BY first_order_month 
     )retention_pct

1-3. Hypotheses


  • Reason for not choosing sales amount:
    • It would be better to analyze based on the order volume considering the fact that there are instances of bulk purchases from a single location
  • Order volume vs Order quantity:
    • We thought securing many different customers purchasing the item was more significant compared to securing one customer who purchases a lot of items

  • We thought customers who bought the items in CHRISTMAS category in their first purchase are likely to be temporarily acquire customers who only used our service to buy seasonal products.

  • As the customer proportion for CANDLE category products in the 2010-12 cohort was as high as the customer proportion for CHRISTMAS category products, we tried to verify whether this has an impact on customer retention.

2. Results

2-1. The Top 5 Subcategories Based on Order Volume

Order volume → Based on the number of customers who placed orders

<Group A> (2010-12 Cohort) <Group B> (Cohort excluding 2010-12)
subcategoryOrder volumeCustomers (proportion) subcategoryOrder volumeCustomers (proportion)
CANDLE2960627 (77.03%) CANDLE39061909 (62.22%)
PARTY DECOR2238534 (65.6%) CHRISTMAS31241761 (57.4%)
DRINKWARE2099541 (66.46%) PARTY DECOR28541537 (50.1%)
CHRISTMAS2039613 (75.31%) BAKEWARE26551439 (46.9%)
LUNCHWARE1954392 (48.16%) DRINKWARE25681441 (46.97%)

When comparing the top 5 subcategories of the two cohorts based on the order quantity, there wasn't a significant difference in the ranking list. However, there was a meaningful difference when we compared the customer proportions.

  • For CANDLE category products, the customer proportion in Group A is approximately 15% higher than in Group B
  • For PARTY DECOR category products, the customer proportion in Group A is approximately 15% higher than in Group B
  • For CHRISTMAS category products, the customer proportion in Group A is approximately 18% higher than in Group B
  • For DRINKWARE category products, the customer proportion in Group A is approximately 20% higher than in Group B

→ We observed that the CANDLE, PARTY DECOR, CHRISTMAS, and DRINKWARE category products, which were purchased by a large number of orders and a high proportion of customers in Group A, had a different impact on retention compared to Group B.

2-2. Christmas Related Item in First Purchase

Christmas related item retention
cohortmonth0month1_pctmonth2_pctmonth3_pctmonth4_pctmonth5_pctmonth6_pctmonth7_pctmonth8_pctmonth9_pctmonth10_pctmonth11_pct
christmas134623.418.516.8615.0115.915.3813.5212.3312.6311.3714.86
non_christmas253620.6222.1222.4423.9421.7319.7617.5117.2714.7910.738.04

The cohort that purchased CHRISTMAS items in their first purchase has lower retention compared to the cohort that did not.

  • The retention rate for the CHRISTMAS cohort is at max 23.4%, with an average of 15.43%.
  • The retention rate for the NON_CHRISTMAS cohort is at max 23.94%, with an average of 18.09%.

2-3. Previously Purchased Candle Related Items

Candle related item retention
cohortmonth0month1_pctmonth2_pctmonth3_pctmonth4_pctmonth5_pctmonth6_pctmonth7_pctmonth8_pctmonth9_pctmonth10_pctmonth11_pct
candle253626.0625.0825.4325.8725.2422.2420.4319.4417.5514.4713.64
non_candle134613.1512.9311.2211.379.2910.78.028.257.434.314.31

The retention of the cohort that has previously purchased CANDLE products is noticeably higher than the cohort that has not.

  • The retention rate for CANDLE cohort has an average of 21.4%
  • The retention rate for NON_CANDLE cohort has an average of 9.18%

3. Action Plan


4. About the Dataset

Column Information

Original column nameNew column nameDescription
InvoiceNoorder_idInvoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
StockCodeproduct_idProduct (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
DescriptiondescriptionProduct (item) name. Nominal.
QuantityquantityThe quantities of each product (item) per transaction. Numeric.
InvoiceDateorder_dateInvoice Date and time. Numeric, the day and time when each transaction was generated.
UnitPriceunit_priceUnit price. Numeric, Product price per unit in sterling.
CustomerIDcustomer_idCustomer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
CountrycountryCountry name. Nominal, the name of the country where each customer resides.
Addedorder_monthOrder date expressed in monthly intervals.
Addedfirst_order_monthFirst order date expressed in monthly intervals.
Addedtotal_amountTotal amount of the item including the quantity and unit price.

5. EDA HISTORY

5-1. Data Cleaning → Write a csv file with the cleaned data

  1. Dropped cancelled order values
  2. Dropped rows where quantity & unit_price < 0
  3. Dropped abnormal description values

R code

View R Code
library(dplyr)
library(data.table)
library(lubridate)
library(stringr)
library(clipr)
library(usethis)
library(readxl)
library(writexl)
library(dlookr)

retail <- read_xlsx("Online Retail.xlsx")
retail

retail <- retail %>%
   rename(order_id = "InvoiceNo", product_id = "StockCode",
          description = "Description", subcategory = "Subcategory",
          category = "Category", order_date = "InvoiceDate",
          unit_price = "UnitPrice", customer_id = "CustomerID",
          country = "Country", quantity = "Quantity")

retail

# drop abnormal description values
desc_drop <- c("add stock to allocate online orders",
               ...
               "20713")

retail_df <- retail %>%
   filter(description != desc_drop)

# drop cancelled order values
retail_df <- retail_df %>%
   filter(!str_detect(order_id, '^C'))

# drop rows where quantity & unit_price < 0
retail_df <- retail_df %>%
   filter(unit_price > 0,
         quantity > 0)

# chec
retail_df <- retail_df %>%
   mutate(order_month = format(as.Date(order_date), "%Y-%m")) %>%
   mutate(total_amount = quantity * unit_price)retail_df %>%
   filter(str_detect(order_id, '^C'))

# Save to csv file
write.csv(retail_df, "retail.csv", row.names = F)

5-2. Categorizing Products

Faced a challenge where it is difficult to identify unique differentiation among customers using existing information.

Categorized a total of 4,211 unique product names (description)

BeforeAfter
4,211 uncategorized productsCategorized to 22 categories, and 255 subcategories

5-3. Final EDA

Completed the final filtering conditions to extract only the necessary information for our analysis.

  1. customer_id : drop null values → filter customers who did not register to our service
  2. country : United Kingdom → limiting dataset within the UK
  3. Filter all purchse data in 2011-12 → Only 9 days worth of data exists in the dataset
  4. category : drop abnormal values
  5. unit_price : filter abnormal values that is below the lowest possible unit price
  6. Add first_order_month column
SELECT *
     , MIN(order_month) OVER(PARTITION BY customer_id) AS first_order_month
FROM retail
WHERE country = 'United Kingdom'
AND order_date NOT LIKE '2011-12%'
AND category != '삭제'
AND customer_id IS NOT NULL
AND unit_price > 0.01