I generated Ads, Leads, and Sales data as shown below.
This is the generated Ads Data:
campaign_id | source | date | impressions | clicks | spend | leads | cpl | conversions |
---|---|---|---|---|---|---|---|---|
Google_Camp1 | Google | 2024-01-01 | 8270 | 485 | 955.64 | 485 | 1.97 | 310 |
Google_Camp2 | Google | 2024-01-01 | 6501 | 185 | 496.54 | 185 | 2.68 | 118 |
Google_Camp3 | Google | 2024-01-01 | 3902 | 301 | 272.83 | 301 | 0.91 | 192 |
TikTok_Camp1 | TikTok | 2024-01-01 | 3339 | 310 | 984.18 | 310 | 3.17 | 198 |
TikTok_Camp2 | TikTok | 2024-01-01 | 6729 | 265 | 507.96 | 265 | 1.92 | 169 |
TikTok_Camp2 | TikTok | 2024-12-31 | 2043 | 168 | 188.46 | 168 | 1.12 | 107 |
TikTok_Camp3 | TikTok | 2024-12-31 | 2821 | 63 | 364.66 | 63 | 5.79 | 40 |
Facebook_Camp1 | Facebook | 2024-12-31 | 1477 | 122 | 569.51 | 122 | 4.67 | 78 |
Facebook_Camp2 | Facebook | 2024-12-31 | 3198 | 200 | 248.76 | 200 | 1.24 | 128 |
Facebook_Camp3 | Facebook | 2024-12-31 | 5679 | 291 | 440.48 | 291 | 1.51 | 186 |
This is the generated Leads Data:
lead_date | customer_id | sku_interest | campaign_id | source |
---|---|---|---|---|
2024-01-01 | CUST0000001 | SKU008 | Google_Camp1 | Google |
2024-01-01 | CUST0000002 | SKU005 | Google_Camp1 | Google |
2024-01-01 | CUST0000003 | SKU007 | Google_Camp1 | Google |
2024-01-01 | CUST0000004 | SKU010 | Google_Camp1 | Google |
2024-01-01 | CUST0000005 | SKU003 | Google_Camp1 | Google |
2024-12-31 | CUST0985854 | SKU001 | Facebook_Camp3 | Facebook |
2024-12-31 | CUST0985855 | SKU010 | Facebook_Camp3 | Facebook |
2024-12-31 | CUST0985856 | SKU004 | Facebook_Camp3 | Facebook |
2024-12-31 | CUST0985857 | SKU005 | Facebook_Camp3 | Facebook |
2024-12-31 | CUST0985858 | SKU006 | Facebook_Camp3 | Facebook |
And this is the Sales Data:
transaction_id | purchase_date | customer_id | sku | price | qty |
---|---|---|---|---|---|
TXN0000001 | 2024-01-01 | CUST0000016 | SKU009 | 466.90 | 1 |
TXN0000001 | 2024-01-01 | CUST0000016 | SKU006 | 71.64 | 1 |
TXN0000002 | 2024-01-18 | CUST0000016 | SKU001 | 211.47 | 2 |
TXN0000002 | 2024-01-18 | CUST0000016 | SKU007 | 226.60 | 2 |
TXN0000003 | 2024-03-03 | CUST0000016 | SKU002 | 222.75 | 1 |
TXN1132878 | 2024-12-31 | CUST0985722 | SKU005 | 244.67 | 1 |
TXN1132879 | 2024-12-31 | CUST0985722 | SKU008 | 198.79 | 1 |
TXN1132879 | 2024-12-31 | CUST0985722 | SKU010 | 312.23 | 1 |
TXN1132880 | 2024-12-31 | CUST0985722 | SKU003 | 305.59 | 1 |
TXN1132881 | 2024-12-31 | CUST0985650 | SKU003 | 460.04 | 1 |
From the three generated tables, I want to perform the following analyses (a data-loading sketch follows this list):
- Descriptive Analysis
- Churn Analysis
- RFM
- ARIMA
- Behaviour Funnel
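Throughout these analyses I work with the three tables as pandas dataframes. A minimal loading sketch is shown below; the CSV file names are assumptions about where the generated data is stored.
import pandas as pd

# File names are placeholders; adjust to wherever the generated tables live
df_ads = pd.read_csv('ads_data.csv', parse_dates=['date'])
df_leads = pd.read_csv('leads_data.csv', parse_dates=['lead_date'])
df_purchase = pd.read_csv('sales_data.csv', parse_dates=['purchase_date'])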
1. Descriptive Analysis
Total Impressions: 18235179
Total Clicks: 985858
Total Spend: 1796318.45
Total Unique Leads: 985858
Total unique customers purchase: 629389
Total Unique Purchases (transactions): 1132881
Total Revenue: 506630165.74
Overall Conversion Rate (purchases/leads): 114.91%
Average CPL: 1.82
The descriptive analysis indicates the presence of repeat purchases. This is reflected in the overall conversion rate of 114.91%, meaning there are approximately 15% more transactions than leads. In other words, for every 100 leads acquired, there are an average of 115 transactions, suggesting that some leads made multiple purchases.
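For reproducibility, here is a minimal sketch of how these summary figures could be computed from the dataframes loaded earlier; the revenue line assumes revenue = price × qty, which may differ from how the original total was derived.
# Summary metrics (assumes df_ads, df_leads, df_purchase from the loading step)
total_impressions = df_ads['impressions'].sum()
total_clicks = df_ads['clicks'].sum()
total_spend = df_ads['spend'].sum()
total_leads = df_leads['customer_id'].nunique()
unique_buyers = df_purchase['customer_id'].nunique()
total_transactions = df_purchase['transaction_id'].nunique()
total_revenue = (df_purchase['price'] * df_purchase['qty']).sum()  # assumption: revenue = price * qty
conversion_rate = total_transactions / total_leads                 # purchases / leads
avg_cpl = total_spend / total_leads                                # blended cost per lead
print(f"Overall Conversion Rate: {conversion_rate:.2%}, Average CPL: {avg_cpl:.2f}")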
2. Churn Analysis
To perform the Churn Analysis, I need a benchmark for when a customer is considered churned. I define churn as having a last purchase more than 90 days before the most recent purchase date in the data.
from datetime import timedelta
# Use the latest purchase date in the data as the reference point
reference_date = df_purchase['purchase_date'].max()
last_purchase = df_purchase.groupby('customer_id')['purchase_date'].max().reset_index()
# Churned = last purchase more than 90 days before the reference date
churn_threshold = reference_date - timedelta(days=90)
last_purchase['churn'] = last_purchase['purchase_date'] < churn_threshold
The result obtained from the code is attached below.
Out of 629,389 unique purchasing customers, 63.4% have churned, meaning churned customers outnumber active ones by nearly two to one.
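For reference, a minimal sketch of how this churn rate can be read from the last_purchase table built above:
# Fraction of customers whose churn flag is True
churn_rate = last_purchase['churn'].mean()
print(f"Churned customers: {churn_rate:.1%} of {len(last_purchase):,} unique buyers")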
To prevent further churn, especially before customers cross the 90-day window, we recommend regular retention initiatives, such as special events around payday or exclusive promotions (e.g., Black Friday-style sales) at least once a month to recapture their attention.
Before launching campaigns, conducting behavioral analysis or customer surveys on churned users is essential to better understand their preferences, ensuring that the campaign is relevant, personalized, and impactful.
3. RFM
First, I need to compute the Recency, Frequency, and Monetary values for each customer. I chose the maximum purchase date in the data as the reference point for Recency.
reference_date = df_purchase['purchase_date'].max()
# Recency: days since last purchase; Frequency: unique transactions; Monetary: sum of line-item prices (qty is not applied here)
rfm = df_purchase.groupby('customer_id').agg({
    'purchase_date': lambda x: (reference_date - x.max()).days,
    'transaction_id': 'nunique',
    'price': 'sum'
}).rename(columns={'purchase_date': 'Recency', 'transaction_id': 'Frequency', 'price': 'Monetary'})
After obtaining the Recency, Frequency, and Monetary values, I divided each of them into 4 score categories (quartiles).
# Recency: lower is better, so the most recent quartile gets the highest score
rfm['R_score'] = pd.qcut(rfm['Recency'], 4, labels=[4, 3, 2, 1]).astype(int)

# Frequency and Monetary are ranked first; fall back to equal-width bins if quartile edges collide
try:
    rfm['F_score'] = pd.qcut(rfm['Frequency'].rank(method='dense'), 4, labels=[1, 2, 3, 4]).astype(int)
except ValueError:
    rfm['F_score'] = pd.cut(rfm['Frequency'], bins=4, labels=[1, 2, 3, 4]).astype(int)
try:
    rfm['M_score'] = pd.qcut(rfm['Monetary'].rank(method='dense'), 4, labels=[1, 2, 3, 4]).astype(int)
except ValueError:
    rfm['M_score'] = pd.cut(rfm['Monetary'], bins=4, labels=[1, 2, 3, 4]).astype(int)

rfm['RFM_Score'] = rfm['R_score'].astype(str) + rfm['F_score'].astype(str) + rfm['M_score'].astype(str)
I applied pd.qcut directly to Recency because it is a continuous variable with a wide range of values, which makes it suitable for quantile-based binning without additional processing.
Frequency and Monetary need more care: Frequency is discrete with a relatively small set of possible values, and Monetary is often highly skewed and unevenly distributed.
To split them into 4 meaningful categories, I first applied rank() to spread their discrete or skewed distributions onto a more continuous-like scale, and then applied pd.qcut to the ranked values to obtain balanced quartile bins.
Next, I created customer segments based on the Recency, Frequency, and Monetary scores.
def segment_rfm(row):
    # Rules are checked top-down; the first match determines the segment
    if row['RFM_Score'] == '444':
        return 'Champions'
    elif row['R_score'] >= 3 and row['F_score'] >= 3:
        return 'Loyal Customers'
    elif row['R_score'] >= 3 and row['M_score'] >= 3:
        return 'Big Spenders'
    elif row['R_score'] == 4:
        return 'Recent Customers'
    elif row['F_score'] >= 3:
        return 'Frequent Buyers'
    else:
        return 'At Risk'

rfm['Segment'] = rfm.apply(segment_rfm, axis=1)
The result of the segmentation is attached below.
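For reference, a minimal sketch of how the segment distribution discussed below can be summarized from the rfm table:
# Percentage of customers per segment, largest first
segment_share = rfm['Segment'].value_counts(normalize=True).mul(100).round(1)
print(segment_share)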
54% of our customers fall into the At Risk segment, meaning more than half of our customer base is at risk of churning or has significantly reduced engagement. This is a critical warning sign that immediate retention efforts are necessary to prevent revenue loss.
Recommendations:
Personalized engagement: Reach out with surveys or request feedback to make these customers feel valued and heard.
Targeted promotions: Offer incentives like discounts, vouchers, or free products to encourage reactivation.
Loyalty programs: Develop reward systems or communities to strengthen customer attachment and long-term loyalty.
For customers outside the At Risk segment:
Upselling and cross-selling: Promote new products with attractive offers such as buy-one-get-one or free samples.
Regular communication: Send relevant and engaging content tailored to their profile to maintain ongoing interest.
4. ARIMA
The first step before modeling the data with ARIMA is to run the Augmented Dickey-Fuller (ADF) test and check its test statistic and p-value.
This determines whether the series is stationary. In my case, the series is daily total clicks, summed across campaigns.
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

# Aggregate clicks to a daily series with an explicit daily frequency; missing days become 0
df_ads['date'] = pd.to_datetime(df_ads['date'])
daily_clicks = df_ads.groupby('date')['clicks'].sum().asfreq('D')
daily_clicks = daily_clicks.fillna(0)

# Augmented Dickey-Fuller test for stationarity
result = adfuller(daily_clicks)
print('ADF Statistic:', result[0])
print('p-value:', result[1])
The results of the ADF test statistic and p-value are as follows:
ADF Statistic: -12.054420491695716
p-value: 2.553072882245295e-22
Since the p-value is well below 0.05, we reject the null hypothesis of a unit root and conclude that the series is already stationary, so no transformation or differencing (d = 0) is needed.
Because no differencing or transformation is needed, the next step is to analyze the ACF and PACF plots to determine the appropriate orders of the Auto-Regressive (AR) and Moving Average (MA) components.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
plot_acf(daily_clicks, ax=axes[0])
axes[0].set_title('ACF')
plot_pacf(daily_clicks, ax=axes[1])
axes[1].set_title('PACF')
plt.show()
The code produced the following ACF and PACF plots:
From the plots, both the ACF and PACF cut off sharply after lag 0, with no pronounced spikes at higher lags. Therefore, the candidate models are ARIMA(0,0,0), ARIMA(1,0,0), ARIMA(0,0,1), and ARIMA(1,0,1).
With several candidate models identified, the next step is to fit each of them and determine the most suitable one.
This involves evaluating the models on statistical metrics such as AIC, BIC, and residual diagnostics in order to select the model that best fits the data; a short comparison sketch is given below.
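A minimal sketch of how the four candidates could be fitted and compared on these criteria, using the daily_clicks series built earlier:
# Fit each candidate order and collect the information criteria used for model selection
candidate_orders = [(0, 0, 0), (1, 0, 0), (0, 0, 1), (1, 0, 1)]
for order in candidate_orders:
    fit = ARIMA(daily_clicks, order=order).fit()
    print(f"ARIMA{order}: AIC={fit.aic:.3f}  BIC={fit.bic:.3f}  LogLik={fit.llf:.3f}")
    # fit.summary() additionally reports coefficient p-values and residual diagnostics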
ARIMA Model Comparison
Metric | ARIMA(0,0,0) | ARIMA(1,0,0) | ARIMA(0,0,1) | ARIMA(1,0,1) |
---|---|---|---|---|
Dependent Variable | clicks | clicks | clicks | clicks |
Observations | 366 | 366 | 366 | 366 |
Log Likelihood | -2894.813 | -2894.794 | -2894.797 | -2894.340 |
AIC | 5793.627 | 5795.589 | 5795.595 | 5796.680 |
BIC | 5801.432 | 5807.297 | 5807.303 | 5812.290 |
HQIC | 5796.729 | 5800.241 | 5800.247 | 5802.883 |
Coefficients
Variable | ARIMA(0,0,0) | ARIMA(1,0,0) | ARIMA(0,0,1) | ARIMA(1,0,1) |
---|---|---|---|---|
const | 82,150 | 82,150 | 82,400 | 82,340 |
ar.L1 | - | -0.1076 | - | 0.3605 |
ma.L1 | - | - | -0.9976 | -1.0000 |
sigma² | 14,220,000 | 14,680,000 | 9,616,000 | 9,245,000 |
Coefficient Significance (P>|z|)
Variable | ARIMA(0,0,0) | ARIMA(1,0,0) | ARIMA(0,0,1) | ARIMA(1,0,1) |
---|---|---|---|---|
const | 0.000 | 0.000 | 0.000 | 0.000 |
ar.L1 | - | 0.852 | - | 0.193 |
ma.L1 | - | - | 0.876 | 0.241 |
sigma² | 0.000 | 0.000 | 0.000 | 0.000 |
Residual Diagnostics
Test | ARIMA(0,0,0) | ARIMA(1,0,0) | ARIMA(0,0,1) | ARIMA(1,0,1) |
---|---|---|---|---|
Ljung-Box (Q) | 0.04 | 0.00 | 0.00 | 0.32 |
Prob(Q) (p-value) | 0.85 | 0.99 | 0.99 | 0.57 |
Jarque-Bera (JB) | 13.22 | 13.22 | 13.22 | 13.78 |
Prob(JB) (p-value) | 0.00 | 0.00 | 0.00 | 0.00 |
Heteroskedasticity (H) | 0.92 | 0.92 | 0.92 | 0.93 |
Prob(H) | 0.63 | 0.64 | 0.64 | 0.69 |
Skewness | 0.46 | 0.46 | 0.46 | 0.47 |
Kurtosis | 3.11 | 3.11 | 3.11 | 3.11 |
Of the four models, ARIMA(0,0,0) is the best choice because it has the lowest AIC and BIC among all candidates. Although its log likelihood is slightly lower than that of the other models, the difference is minimal and does not outweigh its advantage in AIC and BIC. This is consistent with the coefficient significance table: the ar.L1 and ma.L1 terms are not statistically significant (p > 0.05) in any of the larger models.
The drawback of this model is its relatively large sigma² value, which indicates greater residual variance and therefore wider forecast confidence intervals. However, given that the primary selection criterion is minimizing AIC and BIC, ARIMA(0,0,0) remains the preferred choice among the models evaluated.
Based on the ARIMA modeling results, no significant trend component was detected in the series. This is likely because the data covers only a single year of daily observations, which is too short to reveal long-term trends or annual seasonal patterns.
The forecast is produced using the selected ARIMA(0,0,0) model:
# Fit the selected model and forecast the next 9 days
model = ARIMA(daily_clicks, order=(0, 0, 0))
model_fit = model.fit()
forecast = model_fit.forecast(steps=9)
The forecast produced by the ARIMA(0,0,0) model is:
Date | Predicted Mean |
---|---|
2025-01-01 | 2693.601056 |
2025-01-02 | 2693.601056 |
2025-01-03 | 2693.601056 |
2025-01-04 | 2693.601056 |
2025-01-05 | 2693.601056 |
2025-01-06 | 2693.601056 |
2025-01-07 | 2693.601056 |
2025-01-08 | 2693.601056 |
2025-01-09 | 2693.601056 |
forecast_index = pd.date_range(start=daily_clicks.index[-1] + pd.Timedelta(days=1), periods=9, freq='D')
plt.figure(figsize=(12, 5))
plt.plot(daily_clicks.index, daily_clicks, label='Actual')
plt.plot(forecast_index, forecast, label='Forecast', marker='o', linestyle='--', color='orange')
plt.title('Daily Clicks Forecast')
plt.xlabel('Date')
plt.ylabel('Clicks')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
The plot of the actual data together with the ARIMA(0,0,0) forecast is attached below.
For more reliable detection of trends and seasonality, it is generally recommended to use at least two full years of data (for example, 24 or more monthly observations) so that long-term trends and recurring seasonal patterns can be captured.
Alternatively, models such as SARIMA or Prophet are better suited to identifying trend and seasonality components, especially when the data is expected to contain recurring patterns or long-term directional movements (an illustrative sketch follows).
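As an illustration only, a weekly-seasonal SARIMA could be fitted with statsmodels' SARIMAX; the orders below are placeholders, not a tuned specification.
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical weekly seasonality (s = 7) for the daily clicks series; orders are assumptions
sarima = SARIMAX(daily_clicks, order=(1, 0, 1), seasonal_order=(1, 0, 1, 7)).fit(disp=False)
print(sarima.summary())
sarima_forecast = sarima.forecast(steps=9)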
5. Behaviour Funnel
For the behaviour funnel, I use impressions, clicks, leads, and transactions; a short computation sketch follows these figures.
Impressions: 18235179
Clicks: 985858 (CTR: 5.41%)
Leads: 985858 (Lead Conversion: 100.00%)
Purchases: 1132881 (Purchase Conversion: 114.91%)
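A minimal sketch of how these funnel figures could be computed, assuming the same dataframes as before:
# Each funnel stage and its conversion rate relative to the previous stage
impressions = df_ads['impressions'].sum()
clicks = df_ads['clicks'].sum()
leads = df_leads['customer_id'].nunique()
purchases = df_purchase['transaction_id'].nunique()
print(f"Impressions: {impressions:,}")
print(f"Clicks: {clicks:,} (CTR: {clicks / impressions:.2%})")
print(f"Leads: {leads:,} (Lead Conversion: {leads / clicks:.2%})")
print(f"Purchases: {purchases:,} (Purchase Conversion: {purchases / leads:.2%})")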
The funnel visualization is attached below.
The significant gap between impressions and clicks indicates an opportunity for improvement. To increase engagement, we should create more interactive ads that truly resonate with the audience—ads that reflect their interests and follow current trends.
For example, if our product targets Gen Z, the ad content must align with trends popular among this group. Additionally, fine-tuning ad settings such as age, gender, and other demographics will help us reach the right customers more effectively.
By tailoring both the creative content and the targeting settings, we can increase click-through rates and ultimately generate more leads.