Building Custom Attribution Models with SQL and Python
Learn how to build custom attribution models using SQL and Python to understand which marketing channels actually drive conversions, not just which ones get the last click.

Imagine you're a restaurant owner who asks every customer, "How did you hear about us?" The last person always says, "I walked by and saw your sign." So you think, "Great! My sign is doing all the work."
But what you don't see is that they read a review two weeks ago, saw a friend's Instagram post last week, and got a recommendation three days ago. The sign just happened to be the last thing before they walked in.
That's exactly what happens with most marketing attribution tools. They give all the credit to the last click, even though the real story is much more interesting.
Let me show you how to build your own custom attribution model using SQL and Python. You'll see the complete picture of how customers actually find you, not just the last step.
Why Build Your Own Attribution Model
Standard tools like Google Analytics give you preset options. That's fine for basic reporting, but here's the problem: you can't see under the hood.
When GA4 says "this channel drove 40% of conversions," you're trusting their formula. You can't adjust it for your business. You can't test different assumptions. You're stuck with their rules.
Building a custom attribution model means you control the logic. You decide how to distribute credit across touchpoints. You can test different approaches and see which one matches reality for your business.
Here's what you gain:
- Full transparency: You know exactly how credit gets assigned
- Flexibility: Adjust the model as your marketing changes
- Better budget decisions: See which channels actually assist conversions, not just close them
- Real user journeys: Track individual paths, not just aggregated stats
The Basic Building Blocks
Every custom attribution model needs three things:
- Raw journey data: A record of every touchpoint for every user
- A stitching process: Logic to group touchpoints into complete journeys
- An attribution method: Rules for dividing credit among touchpoints
Let's build each piece step by step.
Step 1: Getting Your Journey Data Ready
First, you need data that shows user journeys. This usually comes from:
- Google Analytics BigQuery exports (raw event data)
- Your marketing database (UTM-tagged sessions)
- Web analytics platforms like Snowplow (event streams)
Your data should look something like this:
| user_id | timestamp | channel | converted |
|---|---|---|---|
| user_123 | 2024-01-05 | Organic Search | No |
| user_123 | 2024-01-08 | Email | No |
| user_123 | 2024-01-12 | Direct | Yes |
Here's a simple SQL query to structure your data:
```sql
SELECT
  user_id,
  session_timestamp,
  CASE
    WHEN utm_source = 'google' AND utm_medium = 'organic' THEN 'Organic Search'
    WHEN utm_medium = 'email' THEN 'Email'
    WHEN utm_source IS NULL THEN 'Direct'
    ELSE CONCAT(utm_source, ' - ', utm_medium)
  END AS channel,
  CASE WHEN transaction_id IS NOT NULL THEN 1 ELSE 0 END AS converted
FROM sessions
WHERE user_id IS NOT NULL
ORDER BY user_id, session_timestamp
```
This gives you a clean list of every touchpoint for every user.
Step 2: Building Complete User Journeys
Now you need to group these touchpoints into meaningful journeys. A journey typically starts at the first touchpoint and ends at a conversion (or a cutoff point if they don't convert).
Here's SQL code to create journey paths:
```sql
WITH user_paths AS (
  SELECT
    user_id,
    STRING_AGG(channel, ' > ' ORDER BY session_timestamp) AS journey_path,
    MAX(converted) AS had_conversion,
    COUNT(*) AS touchpoint_count
  FROM cleaned_sessions
  WHERE session_timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  GROUP BY user_id
)
SELECT
  journey_path,
  COUNT(*) AS journey_count,
  SUM(had_conversion) AS conversions
FROM user_paths
WHERE had_conversion = 1
GROUP BY journey_path
ORDER BY conversions DESC
```
This creates paths like: Organic Search > Email > Direct
You can now see the actual sequences people take before converting.
Step 3: Implementing Attribution Methods
Now comes the interesting part: deciding how to distribute credit. Let me show you four common approaches, from simple to sophisticated.
Position-Based Attribution
This method gives 40% credit to the first touchpoint, 40% to the last, and splits the remaining 20% among middle touches.
Why? Because the first touch introduced your brand, the last touch closed the deal, and the middle touches kept the conversation going.
Here's Python code to calculate it:
```python
import pandas as pd

def position_based_attribution(journey_path):
    """
    Distributes credit: 40% first, 40% last, 20% split across middle touches.
    A two-touch journey is split 50/50 so credit still sums to 1.
    """
    channels = journey_path.split(' > ')
    num_channels = len(channels)
    if num_channels == 1:
        return {channels[0]: 1.0}
    if num_channels == 2:
        credits = [0.5, 0.5]
    else:
        middle_credit = 0.20 / (num_channels - 2)
        credits = [0.40] + [middle_credit] * (num_channels - 2) + [0.40]
    attribution = {}
    for channel, credit in zip(channels, credits):
        # Accumulate, in case the same channel appears more than once
        attribution[channel] = attribution.get(channel, 0) + credit
    return attribution

# Apply to your data (journey_sql and connection come from Step 2)
df = pd.read_sql_query(journey_sql, connection)

attribution_results = []
for _, row in df.iterrows():
    credits = position_based_attribution(row['journey_path'])
    for channel, credit in credits.items():
        attribution_results.append({
            'channel': channel,
            'credit': credit * row['conversions']
        })

results_df = pd.DataFrame(attribution_results)
final_attribution = results_df.groupby('channel')['credit'].sum().reset_index()
print(final_attribution)
```
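As a sanity check, the credits for any journey should sum to 1. Here is a standalone restatement of the 40/40/20 rule (the helper name `position_credits` is just for this sketch), applied to a four-touch path:

```python
def position_credits(path):
    # 40% first, 40% last, 20% split across middle touches
    # (a two-touch journey is split 50/50 so credit still sums to 1)
    chs = path.split(' > ')
    n = len(chs)
    if n == 1:
        return {chs[0]: 1.0}
    if n == 2:
        shares = [0.5, 0.5]
    else:
        shares = [0.40] + [0.20 / (n - 2)] * (n - 2) + [0.40]
    credits = {}
    for ch, share in zip(chs, shares):
        credits[ch] = credits.get(ch, 0) + share  # accumulate repeats
    return credits

c = position_credits('Organic Search > Email > Paid Search > Direct')
print(c)  # the two middle touches get 0.10 each
print(round(sum(c.values()), 10))  # 1.0
```

If any journey's credits don't total 1.0, some conversions are silently leaking out of your report.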
Time-Decay Attribution
This gives more credit to recent touchpoints. The logic: touches closer to conversion probably had more influence.
```python
def time_decay_attribution(journey_path, days_between_touches, half_life=7):
    """
    More recent touches get exponentially more credit.
    half_life: days for credit to decay by 50%
    days_between_touches: gap in days between each touch and the next,
    so it holds one fewer entry than the journey has touches; the final
    touch is treated as 0 days before conversion.
    """
    channels = journey_path.split(' > ')
    num_channels = len(channels)
    if num_channels == 1:
        return {channels[0]: 1.0}

    # Calculate decay weights
    weights = []
    for i in range(num_channels):
        days_before_conversion = sum(days_between_touches[i:])
        weight = 2 ** (-days_before_conversion / half_life)
        weights.append(weight)

    # Normalize to sum to 1
    total_weight = sum(weights)
    normalized_weights = [w / total_weight for w in weights]

    # Assign to channels, accumulating repeated channels
    attribution = {}
    for channel, weight in zip(channels, normalized_weights):
        attribution[channel] = attribution.get(channel, 0) + weight
    return attribution
```
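To make the decay concrete, here is the weighting for a three-touch journey whose touches landed 14, 7, and 0 days before conversion, assuming the default half_life of 7:

```python
half_life = 7
days_out = [14, 7, 0]  # days before conversion for each touch

weights = [2 ** (-d / half_life) for d in days_out]  # [0.25, 0.5, 1.0]
total = sum(weights)                                 # 1.75
normalized = [w / total for w in weights]
print([round(w, 3) for w in normalized])             # [0.143, 0.286, 0.571]
```

The final touch ends up with four times the weight of the touch two half-lives out, which matches the intuition that recent touches matter most.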
Linear Attribution
The simplest approach: every touchpoint gets equal credit. If there are 5 touches, each gets 20%.
```python
def linear_attribution(journey_path):
    """
    Equal credit to all touchpoints
    """
    channels = journey_path.split(' > ')
    credit_per_channel = 1.0 / len(channels)
    attribution = {}
    for channel in channels:
        attribution[channel] = attribution.get(channel, 0) + credit_per_channel
    return attribution
```
Data-Driven Attribution (Markov Chains)
This is the most sophisticated approach. It calculates each channel's actual contribution by simulating what happens if you remove that channel entirely.
Think of it like this: if removing "Email" from all journeys causes conversions to drop 25%, then Email deserves 25% of the credit.
Here's how to implement it:
```python
from collections import defaultdict

def build_markov_graph(journeys_df):
    """
    Build transition probabilities between channels
    """
    transitions = defaultdict(lambda: defaultdict(int))
    for _, row in journeys_df.iterrows():
        path = ['Start'] + row['journey_path'].split(' > ') + ['Conversion']
        # Count transitions
        for i in range(len(path) - 1):
            from_state = path[i]
            to_state = path[i + 1]
            transitions[from_state][to_state] += row['conversions']

    # Convert to probabilities
    graph = {}
    for from_state, to_states in transitions.items():
        total = sum(to_states.values())
        graph[from_state] = {k: v / total for k, v in to_states.items()}
    return graph

def calculate_removal_effect(graph, channel_to_remove):
    """
    Calculate conversion probability without a specific channel
    """
    # Create modified graph without the channel
    modified_graph = {
        k: {tk: tv for tk, tv in v.items() if tk != channel_to_remove}
        for k, v in graph.items() if k != channel_to_remove
    }
    # Renormalize probabilities
    for from_state in modified_graph:
        total = sum(modified_graph[from_state].values())
        if total > 0:
            modified_graph[from_state] = {
                k: v / total for k, v in modified_graph[from_state].items()
            }
    # Calculate conversion probability (simplified simulation)
    # In production, you'd run a proper Markov chain simulation
    return simulate_conversions(modified_graph)

def markov_attribution(journeys_df):
    """
    Calculate attribution using removal effect
    """
    graph = build_markov_graph(journeys_df)
    channels = list(set(ch for path in journeys_df['journey_path']
                        for ch in path.split(' > ')))
    baseline_conversions = simulate_conversions(graph)
    attribution = {}
    for channel in channels:
        conversions_without = calculate_removal_effect(graph, channel)
        removal_effect = baseline_conversions - conversions_without
        attribution[channel] = removal_effect / baseline_conversions
    # Normalize to sum to 1
    total = sum(attribution.values())
    return {k: v / total for k, v in attribution.items()}
```
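The code above calls simulate_conversions without defining it. One way to fill that gap (a sketch, not the only option) is to solve for the probability of reaching the 'Conversion' state from 'Start' by fixed-point iteration over the transition graph:

```python
def simulate_conversions(graph, max_iter=200, tol=1e-9):
    """
    Probability of eventually reaching 'Conversion' from 'Start'.
    graph: {state: {next_state: probability}}, as built above.
    States stranded by a channel removal (no outgoing edges) contribute 0.
    """
    states = set(graph) | {t for outs in graph.values() for t in outs}
    p = {s: 0.0 for s in states}
    p['Conversion'] = 1.0
    for _ in range(max_iter):
        delta = 0.0
        for s in states:
            if s == 'Conversion':
                continue
            # Expected conversion probability over this state's transitions
            new_p = sum(prob * p[t] for t, prob in graph.get(s, {}).items())
            delta = max(delta, abs(new_p - p[s]))
            p[s] = new_p
        if delta < tol:
            break
    return p.get('Start', 0.0)

# Toy graph: half of users hit a dead end after a removal, half convert
toy = {
    'Start': {'Email': 0.5, 'Display': 0.5},
    'Email': {'Conversion': 1.0},
    'Display': {},
}
print(simulate_conversions(toy))  # 0.5
```

This handles cycles (for example Email → Email → Conversion) because the iteration converges geometrically rather than walking paths explicitly.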
Step 4: Handling Real-World Complexity
Real journey data gets messy. Here are common issues and how to handle them:
Multiple Conversions Per User
If a user converts twice, you need to decide: do you split their journey at the first conversion or treat it as one long path?
```python
def split_journeys_at_conversion(df):
    """
    Create separate journeys for each conversion
    """
    result = []
    for user_id, group in df.groupby('user_id'):
        current_journey = []
        for _, row in group.sort_values('timestamp').iterrows():
            current_journey.append(row['channel'])
            if row['converted'] == 1:
                result.append({
                    'user_id': user_id,
                    'journey_path': ' > '.join(current_journey),
                    'conversions': 1
                })
                current_journey = []  # Start fresh journey
    return pd.DataFrame(result)
```
Lookback Windows
You don't want to include a touchpoint from 6 months ago. Set a reasonable window (typically 30 days for e-commerce, 90 days for B2B):
```sql
WHERE session_timestamp >= DATE_SUB(conversion_date, INTERVAL 30 DAY)
  AND session_timestamp <= conversion_date
```
Duplicate Channels
If someone visits via "Organic Search" three times in a row, do you count it once or three times? Usually once for each continuous session:
```python
def deduplicate_consecutive_channels(journey_path):
    """
    Organic > Organic > Email becomes Organic > Email
    """
    channels = journey_path.split(' > ')
    deduplicated = [channels[0]]
    for channel in channels[1:]:
        if channel != deduplicated[-1]:
            deduplicated.append(channel)
    return ' > '.join(deduplicated)
```
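A quick check of the collapsing behavior (the helper is restated compactly here so the snippet runs on its own):

```python
def dedupe(path):
    chans = path.split(' > ')
    kept = [chans[0]]
    for ch in chans[1:]:
        if ch != kept[-1]:  # drop consecutive repeats only
            kept.append(ch)
    return ' > '.join(kept)

print(dedupe('Organic Search > Organic Search > Email'))
# Organic Search > Email
print(dedupe('Email > Organic Search > Email'))
# Email > Organic Search > Email  (non-consecutive repeats are kept)
```

Note that non-consecutive repeats survive on purpose: returning to a channel later in the journey is a real touchpoint, not a page refresh.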
Comparing Different Attribution Models
The best way to understand attribution is to run multiple models side by side:
```python
def compare_attribution_models(journeys_df):
    """
    Run all models and compare results
    """
    models = {
        'Linear': linear_attribution,
        'Position-Based': position_based_attribution,
        # Time-decay needs per-touch timing, which the aggregated journey
        # table doesn't carry; assume uniform 1-day gaps for comparison.
        'Time-Decay': lambda p: time_decay_attribution(
            p, [1] * (len(p.split(' > ')) - 1)
        ),
    }
    results = []
    for model_name, model_func in models.items():
        for _, row in journeys_df.iterrows():
            credits = model_func(row['journey_path'])
            for channel, credit in credits.items():
                results.append({
                    'Model': model_name,
                    'Channel': channel,
                    'Credit': credit * row['conversions']
                })
    comparison_df = pd.DataFrame(results)
    pivot = comparison_df.pivot_table(
        index='Channel',
        columns='Model',
        values='Credit',
        aggfunc='sum'
    )
    return pivot

print(compare_attribution_models(journeys_df))
```
This creates a table showing how each channel performs under different assumptions. You'll often find dramatically different results.
For example:
- Direct traffic gets 50% credit under last-click but only 15% under position-based
- Organic Search gets 10% credit under last-click but 35% under position-based
This tells you that Organic Search is actually introducing most customers, even though Direct is closing them.
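Last-click itself is trivial to express in the same credit-dictionary form, which makes these comparisons easy to reproduce (first-click included for symmetry):

```python
def last_click_attribution(journey_path):
    # 100% of the credit to the final touch before conversion
    return {journey_path.split(' > ')[-1]: 1.0}

def first_click_attribution(journey_path):
    # 100% of the credit to the touch that introduced the brand
    return {journey_path.split(' > ')[0]: 1.0}

path = 'Organic Search > Email > Direct'
print(last_click_attribution(path))   # {'Direct': 1.0}
print(first_click_attribution(path))  # {'Organic Search': 1.0}
```

Drop either function into the models dict in compare_attribution_models to see them in the same pivot table.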
Making This Work at Scale
If you have millions of sessions, the Python code above will be slow. Here are optimization strategies:
Use SQL for Heavy Lifting
Push as much processing as possible into your database:
```sql
-- journey_data is assumed to hold one row per (user, touchpoint), with
-- channel_position (1-based) and total_channels precomputed, e.g. via
-- ROW_NUMBER() and COUNT() window functions over cleaned_sessions.
CREATE TEMP FUNCTION PositionBasedCredit(channel_position INT64, total_channels INT64)
RETURNS FLOAT64 AS (
  CASE
    WHEN total_channels = 1 THEN 1.0
    WHEN total_channels = 2 THEN 0.5  -- split 50/50 so credit sums to 1
    WHEN channel_position = 1 THEN 0.4
    WHEN channel_position = total_channels THEN 0.4
    ELSE 0.2 / (total_channels - 2)
  END
);

WITH attributed_journeys AS (
  SELECT
    channel,
    SUM(PositionBasedCredit(channel_position, total_channels) * conversions) AS attributed_conversions
  FROM journey_data
  GROUP BY channel
)
SELECT * FROM attributed_journeys
ORDER BY attributed_conversions DESC;
```
Use Parallel Processing
For large datasets in Python, use libraries like Dask:
```python
import dask.dataframe as dd
import pandas as pd

# Read large CSV with Dask
ddf = dd.read_csv('large_journey_data.csv')

# Apply attribution partition by partition, in parallel
def apply_attribution(partition):
    results = []
    for _, row in partition.iterrows():
        credits = position_based_attribution(row['journey_path'])
        for channel, credit in credits.items():
            results.append({
                'channel': channel,
                'credit': credit * row['conversions']
            })
    return pd.DataFrame(results)

# meta tells Dask the output schema without running the function eagerly
attributed = ddf.map_partitions(
    apply_attribution,
    meta={'channel': 'object', 'credit': 'float64'}
).compute()
```
Validating Your Attribution Model
How do you know if your custom attribution model is accurate? Test it:
Holdout Validation
Split your data into training (80%) and test (20%) sets. Build your model on training data, then check if it predicts test data accurately:
```python
from sklearn.model_selection import train_test_split

# random_state makes the split reproducible across reruns
train_df, test_df = train_test_split(journeys_df, test_size=0.2, random_state=42)

# Build attribution on training data
train_attribution = compare_attribution_models(train_df)

# Compare predicted vs actual on test data
# (Implementation depends on your specific goals)
```
Budget Reallocation Test
The real test: does changing your budget based on the model improve results?
Start small. Pick one channel the model says is undervalued. Increase its budget by 20% for one month. Did conversions increase?
Common Pitfalls to Avoid
After building dozens of these models, I've seen the same mistakes repeatedly:
Over-complicating the model: Start simple. Linear or position-based attribution is often good enough. Only add complexity if you need it.
Ignoring offline channels: If you run TV ads or send direct mail, you need to track those too. Use promo codes or unique landing pages to connect offline to online.
Trusting the model blindly: Attribution models show correlations, not always causation. A channel might look valuable just because high-intent users happen to visit it last.
Not updating regularly: User behavior changes. Rebuild your model quarterly to capture new patterns.
What This Means for Your Marketing
Here's the practical takeaway: custom attribution models help you answer questions like:
- "Should we increase our email budget or our paid search budget?"
- "Is our content marketing actually driving sales, or just engagement?"
- "Which channels work together? Which ones overlap?"
You can't answer these with last-click attribution. You need to see the complete journey.
The models I've shown you give you that visibility. You don't need a data science PhD. You just need basic SQL and Python skills, plus the willingness to look deeper than default reports.
Getting Started Today
If you want to build your first custom attribution model, here's what I recommend:
- Export 90 days of session data from your analytics platform
- Start with linear attribution—it's simple and often reveals surprising insights
- Compare it to your current reporting—look for big differences
- Test one budget change based on what you learn
- Measure the results after 30 days
You'll quickly see which channels are truly driving value versus which ones just get credit by default.
Need help setting this up for your business? We build custom attribution models for companies that need to understand their marketing ROI at a deeper level. We can work with your existing data sources and create models that match your specific customer journey.
The best attribution model is the one that helps you make better decisions. Sometimes that's a simple position-based model. Sometimes it's a sophisticated data-driven approach. But it's always one you understand and can explain to your team.
Because when you know the real story of how customers find you, you can invest your budget where it actually matters.