Building Custom Attribution Models with SQL and Python
Learn how to build custom attribution models using SQL and Python to understand which marketing channels actually drive conversions, not just which ones get the last click.

Imagine you're a restaurant owner who asks every customer, "How did you hear about us?" The last person always says, "I walked by and saw your sign." So you think, "Great! My sign is doing all the work."
But what you don't see is that they read a review two weeks ago, saw a friend's Instagram post last week, and got a recommendation three days ago. The sign just happened to be the last thing before they walked in.
That's exactly what happens with most marketing attribution tools. They give all the credit to the last click, even though the real story is much more interesting.
Let me show you how to build your own custom attribution model using SQL and Python. You'll see the complete picture of how customers actually find you, not just the last step.
Why Build Your Own Attribution Model
Standard tools like Google Analytics give you preset options. That's fine for basic reporting, but here's the problem: you can't see under the hood.
When GA4 says "this channel drove 40% of conversions," you're trusting their formula. You can't adjust it for your business. You can't test different assumptions. You're stuck with their rules.
Building a custom attribution model means you control the logic. You decide how to distribute credit across touchpoints. You can test different approaches and see which one matches reality for your business.
Here's what you gain:
- Full transparency: You know exactly how credit gets assigned
- Flexibility: Adjust the model as your marketing changes
- Better budget decisions: See which channels actually assist conversions, not just close them
- Real user journeys: Track individual paths, not just aggregated stats
The Basic Building Blocks
Every custom attribution model needs three things:
- Raw journey data: A record of every touchpoint for every user
- A stitching process: Logic to group touchpoints into complete journeys
- An attribution method: Rules for dividing credit among touchpoints
Let's build each piece step by step.
Step 1: Getting Your Journey Data Ready
First, you need data that shows user journeys. This usually comes from:
- Google Analytics BigQuery exports (raw event data)
- Your marketing database (UTM-tagged sessions)
- Web analytics platforms like Snowplow (event streams)
Your data should look something like this:
| user_id | timestamp | channel | converted |
|---|---|---|---|
| user_123 | 2024-01-05 | Organic Search | No |
| user_123 | 2024-01-08 | Email | No |
| user_123 | 2024-01-12 | Direct | Yes |
Here's a simple SQL query to structure your data:
```sql
SELECT
  user_id,
  session_timestamp,
  CASE
    WHEN utm_source = 'google' AND utm_medium = 'organic' THEN 'Organic Search'
    WHEN utm_medium = 'email' THEN 'Email'
    WHEN utm_source IS NULL THEN 'Direct'
    ELSE CONCAT(utm_source, ' - ', utm_medium)
  END AS channel,
  CASE WHEN transaction_id IS NOT NULL THEN 1 ELSE 0 END AS converted
FROM sessions
WHERE user_id IS NOT NULL
ORDER BY user_id, session_timestamp
```
This gives you a clean list of every touchpoint for every user.
Step 2: Building Complete User Journeys
Now you need to group these touchpoints into meaningful journeys. A journey typically starts at the first touchpoint and ends at a conversion (or a cutoff point if they don't convert).
Here's SQL code to create journey paths:
```sql
WITH user_paths AS (
  SELECT
    user_id,
    STRING_AGG(channel, ' > ' ORDER BY session_timestamp) AS journey_path,
    MAX(converted) AS had_conversion,
    COUNT(*) AS touchpoint_count
  FROM cleaned_sessions
  WHERE session_timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  GROUP BY user_id
)
SELECT
  journey_path,
  COUNT(*) AS journey_count,
  SUM(had_conversion) AS conversions
FROM user_paths
WHERE had_conversion = 1
GROUP BY journey_path
ORDER BY conversions DESC
```
This creates paths like: Organic Search > Email > Direct
You can now see the actual sequences people take before converting.
Step 3: Implementing Attribution Methods
Now comes the interesting part: deciding how to distribute credit. Let me show you four common approaches, from simple to sophisticated.
Position-Based Attribution
This method gives 40% credit to the first touchpoint, 40% to the last, and splits the remaining 20% among middle touches.
Why? Because the first touch introduced your brand, the last touch closed the deal, and the middle touches kept the conversation going.
Here's Python code to calculate it:
```python
import pandas as pd

def position_based_attribution(journey_path):
    """
    Distributes credit: 40% first, 40% last, 20% split across middle touches.
    A two-touch journey is split 50/50 so credit still sums to 1.
    """
    channels = journey_path.split(' > ')
    num_channels = len(channels)
    if num_channels == 1:
        return {channels[0]: 1.0}
    if num_channels == 2:
        credits = [0.5, 0.5]
    else:
        middle_credit = 0.20 / (num_channels - 2)
        credits = [0.40] + [middle_credit] * (num_channels - 2) + [0.40]
    attribution = {}
    for channel, credit in zip(channels, credits):
        # Accumulate, in case the same channel appears more than once
        attribution[channel] = attribution.get(channel, 0) + credit
    return attribution

# Apply to your data (journey_sql and connection come from Step 2)
df = pd.read_sql_query(journey_sql, connection)

attribution_results = []
for _, row in df.iterrows():
    credits = position_based_attribution(row['journey_path'])
    for channel, credit in credits.items():
        attribution_results.append({
            'channel': channel,
            'credit': credit * row['conversions']
        })

results_df = pd.DataFrame(attribution_results)
final_attribution = results_df.groupby('channel')['credit'].sum().reset_index()
print(final_attribution)
```
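As a sanity check, the credits for any journey should sum to 1. Here is a standalone restatement of the 40/40/20 rule (the helper name `position_credits` is just for this sketch), applied to a four-touch path:

```python
def position_credits(path):
    # 40% first, 40% last, 20% split across middle touches
    # (a two-touch journey is split 50/50 so credit still sums to 1)
    chs = path.split(' > ')
    n = len(chs)
    if n == 1:
        return {chs[0]: 1.0}
    if n == 2:
        shares = [0.5, 0.5]
    else:
        shares = [0.40] + [0.20 / (n - 2)] * (n - 2) + [0.40]
    credits = {}
    for ch, share in zip(chs, shares):
        credits[ch] = credits.get(ch, 0) + share  # accumulate repeats
    return credits

c = position_credits('Organic Search > Email > Paid Search > Direct')
print(c)  # the two middle touches get 0.10 each
print(round(sum(c.values()), 10))  # 1.0
```

If any journey's credits don't total 1.0, some conversions are silently leaking out of your report.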
Time-Decay Attribution
This gives more credit to recent touchpoints. The logic: touches closer to conversion probably had more influence.
```python
def time_decay_attribution(journey_path, days_between_touches, half_life=7):
    """
    More recent touches get exponentially more credit.
    half_life: days for credit to decay by 50%
    days_between_touches: gap in days between each touch and the next,
    so it holds one fewer entry than the journey has touches; the final
    touch is treated as 0 days before conversion.
    """
    channels = journey_path.split(' > ')
    num_channels = len(channels)
    if num_channels == 1:
        return {channels[0]: 1.0}

    # Calculate decay weights
    weights = []
    for i in range(num_channels):
        days_before_conversion = sum(days_between_touches[i:])
        weight = 2 ** (-days_before_conversion / half_life)
        weights.append(weight)

    # Normalize to sum to 1
    total_weight = sum(weights)
    normalized_weights = [w / total_weight for w in weights]

    # Assign to channels, accumulating repeated channels
    attribution = {}
    for channel, weight in zip(channels, normalized_weights):
        attribution[channel] = attribution.get(channel, 0) + weight
    return attribution
```
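To make the decay concrete, here is the weighting for a three-touch journey whose touches landed 14, 7, and 0 days before conversion, assuming the default half_life of 7:

```python
half_life = 7
days_out = [14, 7, 0]  # days before conversion for each touch

weights = [2 ** (-d / half_life) for d in days_out]  # [0.25, 0.5, 1.0]
total = sum(weights)                                 # 1.75
normalized = [w / total for w in weights]
print([round(w, 3) for w in normalized])             # [0.143, 0.286, 0.571]
```

The final touch ends up with four times the weight of the touch two half-lives out, which matches the intuition that recent touches matter most.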
Linear Attribution
The simplest approach: every touchpoint gets equal credit. If there are 5 touches, each gets 20%.
```python
def linear_attribution(journey_path):
    """
    Equal credit to all touchpoints
    """
    channels = journey_path.split(' > ')
    credit_per_channel = 1.0 / len(channels)
    attribution = {}
    for channel in channels:
        attribution[channel] = attribution.get(channel, 0) + credit_per_channel
    return attribution
```
Data-Driven Attribution (Markov Chains)
This is the most sophisticated approach. It calculates each channel's actual contribution by simulating what happens if you remove that channel entirely.
Think of it like this: if removing "Email" from all journeys causes conversions to drop 25%, then Email deserves 25% of the credit.
Here's how to implement it:
```python
from collections import defaultdict

def build_markov_graph(journeys_df):
    """
    Build transition probabilities between channels
    """
    transitions = defaultdict(lambda: defaultdict(int))
    for _, row in journeys_df.iterrows():
        path = ['Start'] + row['journey_path'].split(' > ') + ['Conversion']
        # Count transitions
        for i in range(len(path) - 1):
            from_state = path[i]
            to_state = path[i + 1]
            transitions[from_state][to_state] += row['conversions']

    # Convert to probabilities
    graph = {}
    for from_state, to_states in transitions.items():
        total = sum(to_states.values())
        graph[from_state] = {k: v / total for k, v in to_states.items()}
    return graph

def calculate_removal_effect(graph, channel_to_remove):
    """
    Calculate conversion probability without a specific channel
    """
    # Create modified graph without the channel
    modified_graph = {
        k: {tk: tv for tk, tv in v.items() if tk != channel_to_remove}
        for k, v in graph.items() if k != channel_to_remove
    }
    # Renormalize probabilities
    for from_state in modified_graph:
        total = sum(modified_graph[from_state].values())
        if total > 0:
            modified_graph[from_state] = {
                k: v / total for k, v in modified_graph[from_state].items()
            }
    # Calculate conversion probability (simplified simulation)
    # In production, you'd run a proper Markov chain simulation
    return simulate_conversions(modified_graph)

def markov_attribution(journeys_df):
    """
    Calculate attribution using removal effect
    """
    graph = build_markov_graph(journeys_df)
    channels = list(set(ch for path in journeys_df['journey_path']
                        for ch in path.split(' > ')))
    baseline_conversions = simulate_conversions(graph)
    attribution = {}
    for channel in channels:
        conversions_without = calculate_removal_effect(graph, channel)
        removal_effect = baseline_conversions - conversions_without
        attribution[channel] = removal_effect / baseline_conversions
    # Normalize to sum to 1
    total = sum(attribution.values())
    return {k: v / total for k, v in attribution.items()}
```
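The code above calls simulate_conversions without defining it. One way to fill that gap (a sketch, not the only option) is to solve for the probability of reaching the 'Conversion' state from 'Start' by fixed-point iteration over the transition graph:

```python
def simulate_conversions(graph, max_iter=200, tol=1e-9):
    """
    Probability of eventually reaching 'Conversion' from 'Start'.
    graph: {state: {next_state: probability}}, as built above.
    States stranded by a channel removal (no outgoing edges) contribute 0.
    """
    states = set(graph) | {t for outs in graph.values() for t in outs}
    p = {s: 0.0 for s in states}
    p['Conversion'] = 1.0
    for _ in range(max_iter):
        delta = 0.0
        for s in states:
            if s == 'Conversion':
                continue
            # Expected conversion probability over this state's transitions
            new_p = sum(prob * p[t] for t, prob in graph.get(s, {}).items())
            delta = max(delta, abs(new_p - p[s]))
            p[s] = new_p
        if delta < tol:
            break
    return p.get('Start', 0.0)

# Toy graph: half of users hit a dead end after a removal, half convert
toy = {
    'Start': {'Email': 0.5, 'Display': 0.5},
    'Email': {'Conversion': 1.0},
    'Display': {},
}
print(simulate_conversions(toy))  # 0.5
```

This handles cycles (for example Email → Email → Conversion) because the iteration converges geometrically rather than walking paths explicitly.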
Step 4: Handling Real-World Complexity
Real journey data gets messy. Here are common issues and how to handle them:
Multiple Conversions Per User
If a user converts twice, you need to decide: do you split their journey at the first conversion or treat it as one long path?
```python
def split_journeys_at_conversion(df):
    """
    Create separate journeys for each conversion
    """
    result = []
    for user_id, group in df.groupby('user_id'):
        current_journey = []
        for _, row in group.sort_values('timestamp').iterrows():
            current_journey.append(row['channel'])
            if row['converted'] == 1:
                result.append({
                    'user_id': user_id,
                    'journey_path': ' > '.join(current_journey),
                    'conversions': 1
                })
                current_journey = []  # Start fresh journey
    return pd.DataFrame(result)
```
Lookback Windows
You don't want to include a touchpoint from 6 months ago. Set a reasonable window (typically 30 days for e-commerce, 90 days for B2B):
```sql
WHERE session_timestamp >= DATE_SUB(conversion_date, INTERVAL 30 DAY)
  AND session_timestamp <= conversion_date
```
Duplicate Channels
If someone visits via "Organic Search" three times in a row, do you count it once or three times? Usually once for each continuous session:
```python
def deduplicate_consecutive_channels(journey_path):
    """
    Organic > Organic > Email becomes Organic > Email
    """
    channels = journey_path.split(' > ')
    deduplicated = [channels[0]]
    for channel in channels[1:]:
        if channel != deduplicated[-1]:
            deduplicated.append(channel)
    return ' > '.join(deduplicated)
```
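A quick check of the collapsing behavior (the helper is restated compactly here so the snippet runs on its own):

```python
def dedupe(path):
    chans = path.split(' > ')
    kept = [chans[0]]
    for ch in chans[1:]:
        if ch != kept[-1]:  # drop consecutive repeats only
            kept.append(ch)
    return ' > '.join(kept)

print(dedupe('Organic Search > Organic Search > Email'))
# Organic Search > Email
print(dedupe('Email > Organic Search > Email'))
# Email > Organic Search > Email  (non-consecutive repeats are kept)
```

Note that non-consecutive repeats survive on purpose: returning to a channel later in the journey is a real touchpoint, not a page refresh.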
Comparing Different Attribution Models
The best way to understand attribution is to run multiple models side by side:
```python
def compare_attribution_models(journeys_df):
    """
    Run all models and compare results
    """
    models = {
        'Linear': linear_attribution,
        'Position-Based': position_based_attribution,
        # Time-decay needs per-touch timing, which the aggregated journey
        # table doesn't carry; assume uniform 1-day gaps for comparison.
        'Time-Decay': lambda p: time_decay_attribution(
            p, [1] * (len(p.split(' > ')) - 1)
        ),
    }
    results = []
    for model_name, model_func in models.items():
        for _, row in journeys_df.iterrows():
            credits = model_func(row['journey_path'])
            for channel, credit in credits.items():
                results.append({
                    'Model': model_name,
                    'Channel': channel,
                    'Credit': credit * row['conversions']
                })
    comparison_df = pd.DataFrame(results)
    pivot = comparison_df.pivot_table(
        index='Channel',
        columns='Model',
        values='Credit',
        aggfunc='sum'
    )
    return pivot

print(compare_attribution_models(journeys_df))
```
This creates a table showing how each channel performs under different assumptions. You'll often find dramatically different results.
For example:
- Direct traffic gets 50% credit under last-click but only 15% under position-based
- Organic Search gets 10% credit under last-click but 35% under position-based
This tells you that Organic Search is actually introducing most customers, even though Direct is closing them.
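Last-click itself is trivial to express in the same credit-dictionary form, which makes these comparisons easy to reproduce (first-click included for symmetry):

```python
def last_click_attribution(journey_path):
    # 100% of the credit to the final touch before conversion
    return {journey_path.split(' > ')[-1]: 1.0}

def first_click_attribution(journey_path):
    # 100% of the credit to the touch that introduced the brand
    return {journey_path.split(' > ')[0]: 1.0}

path = 'Organic Search > Email > Direct'
print(last_click_attribution(path))   # {'Direct': 1.0}
print(first_click_attribution(path))  # {'Organic Search': 1.0}
```

Drop either function into the models dict in compare_attribution_models to see them in the same pivot table.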
Making This Work at Scale
If you have millions of sessions, the Python code above will be slow. Here are optimization strategies:
Use SQL for Heavy Lifting
Push as much processing as possible into your database:
```sql
-- journey_data is assumed to hold one row per (user, touchpoint), with
-- channel_position (1-based) and total_channels precomputed, e.g. via
-- ROW_NUMBER() and COUNT() window functions over cleaned_sessions.
CREATE TEMP FUNCTION PositionBasedCredit(channel_position INT64, total_channels INT64)
RETURNS FLOAT64 AS (
  CASE
    WHEN total_channels = 1 THEN 1.0
    WHEN total_channels = 2 THEN 0.5  -- split 50/50 so credit sums to 1
    WHEN channel_position = 1 THEN 0.4
    WHEN channel_position = total_channels THEN 0.4
    ELSE 0.2 / (total_channels - 2)
  END
);

WITH attributed_journeys AS (
  SELECT
    channel,
    SUM(PositionBasedCredit(channel_position, total_channels) * conversions) AS attributed_conversions
  FROM journey_data
  GROUP BY channel
)
SELECT * FROM attributed_journeys
ORDER BY attributed_conversions DESC;
```
Use Parallel Processing
For large datasets in Python, use libraries like Dask:
```python
import dask.dataframe as dd
import pandas as pd

# Read large CSV with Dask
ddf = dd.read_csv('large_journey_data.csv')

# Apply attribution partition by partition, in parallel
def apply_attribution(partition):
    results = []
    for _, row in partition.iterrows():
        credits = position_based_attribution(row['journey_path'])
        for channel, credit in credits.items():
            results.append({
                'channel': channel,
                'credit': credit * row['conversions']
            })
    return pd.DataFrame(results)

# meta tells Dask the output schema without running the function eagerly
attributed = ddf.map_partitions(
    apply_attribution,
    meta={'channel': 'object', 'credit': 'float64'}
).compute()
```
Validating Your Attribution Model
How do you know if your custom attribution model is accurate? Test it:
Holdout Validation
Split your data into training (80%) and test (20%) sets. Build your model on training data, then check if it predicts test data accurately:
```python
from sklearn.model_selection import train_test_split

# random_state makes the split reproducible across reruns
train_df, test_df = train_test_split(journeys_df, test_size=0.2, random_state=42)

# Build attribution on training data
train_attribution = compare_attribution_models(train_df)

# Compare predicted vs actual on test data
# (Implementation depends on your specific goals)
```
Budget Reallocation Test
The real test: does changing your budget based on the model improve results?
Start small. Pick one channel the model says is undervalued. Increase its budget by 20% for one month. Did conversions increase?
Common Pitfalls to Avoid
After building dozens of these models, I've seen the same mistakes repeatedly:
Over-complicating the model: Start simple. Linear or position-based attribution is often good enough. Only add complexity if you need it.
Ignoring offline channels: If you run TV ads or send direct mail, you need to track those too. Use promo codes or unique landing pages to connect offline to online.
Trusting the model blindly: Attribution models show correlations, not always causation. A channel might look valuable just because high-intent users happen to visit it last.
Not updating regularly: User behavior changes. Rebuild your model quarterly to capture new patterns.
What This Means for Your Marketing
Here's the practical takeaway: custom attribution models help you answer questions like:
- "Should we increase our email budget or our paid search budget?"
- "Is our content marketing actually driving sales, or just engagement?"
- "Which channels work together? Which ones overlap?"
You can't answer these with last-click attribution. You need to see the complete journey.
The models I've shown you give you that visibility. You don't need a data science PhD. You just need basic SQL and Python skills, plus the willingness to look deeper than default reports.
Getting Started Today
If you want to build your first custom attribution model, here's what I recommend:
- Export 90 days of session data from your analytics platform
- Start with linear attribution—it's simple and often reveals surprising insights
- Compare it to your current reporting—look for big differences
- Test one budget change based on what you learn
- Measure the results after 30 days
You'll quickly see which channels are truly driving value versus which ones just get credit by default.
Need help setting this up for your business? We build custom attribution models for companies that need to understand their marketing ROI at a deeper level. We can work with your existing data sources and create models that match your specific customer journey.
The best attribution model is the one that helps you make better decisions. Sometimes that's a simple position-based model. Sometimes it's a sophisticated data-driven approach. But it's always one you understand and can explain to your team.
Because when you know the real story of how customers find you, you can invest your budget where it actually matters.