Random Forest and Ensemble Methods for YouTube Brand Lift Forecasting

Google’s YouTube Brand Lift Study (or BLS) is a tool for quantifying how effectively a YouTube video campaign generates brand awareness, ad recall, consideration, favorability, purchase intent, or brand interest. In this article, I’ll walk through the business use cases and challenges for BLS. While I won’t be sharing any results from our team’s BLS forecasts, I will deep-dive into why the ensemble nature of Random Forests is an ideal fit for the special business requirements of working with BLS data.

The article is intended for technical analysts and data scientists, particularly those also within the digital marketing and advertising industry. A thank you to Operam’s paid search team for providing the business domain knowledge behind this article and answering the countless questions I’ve asked along the way.

How the Brand Lift Study Works

Although clients approach their marketing methodology differently, when planning their own campaigns, Operam’s Paid Search and Social team will typically start with a media brief for the customer segments the product might appeal to — a theatrical marketing campaign for the film The Disaster Artist, for instance, may highlight comedy fans, cinephiles, and niche comedy genre fans as targetable audiences.

Overview of the Brand Lift Study Workflow

A YouTube Brand Lift Study is used for multiple purposes:

1. Determine which customer segments respond best to creatives:

  • Are Rap and Hip Hop fans responding more favorably to a particular creative (a 2009 study claimed that 30% of an online ad’s performance is attributed to media placement and 70% to the specific creative)?
  • Given a limited media budget of X, which audiences will yield the best ROI (as measured by greatest increase in % lift)?

Retargeting lifted users is not directly supported by Google Ads, but obtaining a concrete, measurable signal for how well an audience responds to your creatives allows a paid search team to design a custom strategy for reaching these segments.

2. Determine which metrics are the strongest indicators of effective brand lift campaigns:

  • Our campaigns have racked up lots of views, but does this translate into actual lift in brand awareness, or even brand consideration?
  • Is headroom lift, relative lift, or effective lift the metric we should be showing to our clients? Or is it a custom calculated metric (such as cost per headroom lift, cost per lifted user, or cost per earned view)?

The study begins by creating two mutually exclusive audiences: a randomized control group who are part of the targeted audience but have not seen the ad yet, and a target exposed group that is served the YouTube video ad. A campaign’s effectiveness can typically be quantified using several metrics:

  • Absolute Lift: % positive in exposed group − % positive in control group. This is the percentage-point difference between the exposed and control groups’ positive responses (for recall, interest, consideration, etc.). If the share of positive responses increases from 10% to 30%, the absolute lift is 30% − 10% = 20 percentage points.
  • Relative Lift: % positive in exposed group / % positive in control group − 1. This measures the percentage difference between the exposed and control groups. In the same example, the relative lift is 30% / 10% − 1 = 200%.
  • Headroom Lift: absolute lift / (1 − % positive in control group). This expresses the lift as a share of the audience that was not already responding positively. In the same example, the headroom lift is 20% / 90% ≈ 22.2%.
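
The three definitions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name lift_metrics and its argument names are hypothetical and not part of any Google Ads API.

```python
import math


def lift_metrics(control_rate: float, exposed_rate: float) -> dict:
    """Compute lift metrics from positive-response rates expressed as fractions (0-1)."""
    if not (0.0 <= control_rate < 1.0 and 0.0 <= exposed_rate <= 1.0):
        raise ValueError("rates must be fractions, with control_rate strictly below 1")
    absolute = exposed_rate - control_rate
    # Relative lift is undefined when nobody in the control group responds positively.
    relative = exposed_rate / control_rate - 1 if control_rate > 0 else math.nan
    headroom = absolute / (1 - control_rate)
    return {
        "absolute_lift": absolute,
        "relative_lift": relative,
        "headroom_lift": headroom,
    }


# The 10% -> 30% example from the text.
example = lift_metrics(control_rate=0.10, exposed_rate=0.30)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Headroom lift is the only one of the three that accounts for how much room was left to improve, which is why it can rank audiences differently from absolute lift when baseline awareness varies.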

Brand Lift fills an interesting niche within a digital marketer’s toolkit specifically because it utilizes first-party data. Instead of attempting to infer whether or not a campaign actually made an impact through complicated and indirect attribution models, we can directly ask the consumer.

The closest technical parallel would be the difference between implicit ratings (inferring a user’s predicted interest in a specific product from the number of clicks, visits, comments, etc.) and explicit ratings within collaborative filtering recommendation models.

Technical Challenge #1: Limited Datasets, High Dimensionality, and Multicollinearity

From an analytics and data science perspective, the first consideration when working with YouTube Brand Lift Study data is the lack of data. Survey data volume is significantly smaller in scope than traditional paid or organic social media metrics — we are often working with hundreds, not hundreds of millions of rows of metrics. This places certain constraints and guidelines around the feature engineering, data processing, and model selection process.

One of the central tenets of statistics and machine learning is the concept of the Bias-Variance tradeoff, which states, in part, that as a machine learning model’s complexity increases, its bias (the average difference between its prediction and the true value) tends to decrease while the variance of its predictions will increase. This means that, for many models, the overall error can be decomposed as follows:

Decomposition of overall error into three components: 1) bias, 2) variance, and 3) irreducible error.
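
The formula itself is not reproduced in this text. For reference, the standard textbook form of the decomposition under squared-error loss is given here; the notation is an assumption and may differ from the original figure. Here y = f(x) + ε is the observed target, ε is zero-mean noise with variance σ², and f̂(x) is the fitted model's prediction, with expectations taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```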

For the rest of this article, we utilize a Kaggle sample sales conversion dataset of Facebook ad campaigns contributed by an anonymous organization to illustrate some of the more technical concepts. Let’s use this dataset to see how the breakdown of bias/variance affects the quality of our insights by exploring its relationship with model complexity. We’ll perform some initial preprocessing to arrive at our features (X) and target array (y):

Initial script to load in the sales conversion data, drop columns, create dummy variables, and split into feature set (X) and target (y). I dropped Total Conversions since it would not be available at the time of prediction.
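
The original script is not reproduced in this text, so the following is only a minimal sketch of the preprocessing described. The column names (ad_id, xyz_campaign_id, fb_campaign_id, age, gender, interest, Total_Conversion, Approved_Conversion) and the choice of Approved_Conversion as the target are assumptions about the Kaggle file, and the small toy frame stands in for the real CSV so that the snippet runs on its own.

```python
import pandas as pd


def build_features(df: pd.DataFrame, target: str = "Approved_Conversion"):
    """Drop identifiers and leaky columns, one-hot encode categoricals, split X and y."""
    # Assumed column names; adjust them to match the actual CSV header.
    drop_cols = ["ad_id", "xyz_campaign_id", "fb_campaign_id", "Total_Conversion"]
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])
    y = df[target]
    X = pd.get_dummies(
        df.drop(columns=[target]),
        columns=["age", "gender", "interest"],  # interest IDs are categories, not magnitudes
        dtype=float,
    )
    return X, y


# Toy stand-in for the real data, e.g. pd.read_csv(<path to the Kaggle file>).
toy = pd.DataFrame({
    "ad_id": [1, 2, 3, 4],
    "xyz_campaign_id": [916, 916, 936, 936],
    "fb_campaign_id": [103916, 103917, 103920, 103928],
    "age": ["30-34", "35-39", "30-34", "45-49"],
    "gender": ["M", "F", "M", "F"],
    "interest": [15, 16, 15, 20],
    "Impressions": [7350, 17861, 693, 4259],
    "Clicks": [1, 2, 0, 1],
    "Spent": [1.43, 1.82, 0.0, 1.25],
    "Total_Conversion": [2, 2, 1, 1],
    "Approved_Conversion": [1, 0, 0, 0],
})

X, y = build_features(toy)
print(X.columns.tolist())
```

Treating the interest ID as a categorical column is what inflates the feature count: every distinct ID becomes its own dummy column.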

Perhaps the most salient point to note here is the large number of dimensions: 47, in fact, driven in large part by the variety of dummified (one-hot encoded) targeting interest IDs.

We’ll use an Ordinary Least Squares regression model for our experiments with model complexity, since it is one of the simpler, more derivable, and more recognizable statistical models. If the Gramian matrix of your features (X transposed dot X) is invertible, a quick and convenient way of finding the parameters β (the array of coefficients/weights for each feature) is to express β in matrix form: β = (XᵀX)⁻¹Xᵀy.

We can use a few NumPy linear algebra functions to solve for β:

β = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y).flatten()
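
One practical caveat: with many one-hot encoded columns the Gramian matrix can be singular or nearly so, in which case the explicit inverse fails or becomes numerically unstable. As a hedged alternative, the sketch below compares the normal-equation solution with np.linalg.lstsq on synthetic data. The variable names and the synthetic data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: an intercept column plus three random features.
X_demo = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
true_beta = np.array([1.0, 2.0, -3.0, 0.5])
y_demo = X_demo @ true_beta + rng.normal(scale=0.1, size=200)

# Normal equations, as in the one-liner above.
beta_normal = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Least-squares solver: same answer here, but it also copes with a singular
# Gramian by returning the minimum-norm solution instead of raising an error.
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.round(beta_normal, 3))
print(np.round(beta_lstsq, 3))
```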

Now that we have a simple OLS model up and running, let’s start tinkering with its complexity via regularization. Keep in mind that model complexity can manifest itself in a variety of ways, including:

  • The number of hidden layers / hidden nodes within a deep learning neural network architecture
  • For decision tree algorithms, trees that are allowed to grow to an extremely large maximum depth
  • The number of features in the data

Without proper handling, models built off of YouTube Brand Lift data are especially susceptible to overfitting due to several factors:

  1. An overabundance of features to consider relative to the number of surveys in the dataset. An individual YouTube Brand Lift study itself will yield a large number of dimensions: spend, creative length (for theatrical marketing, often this is the film’s trailer duration), clickthrough rates, 75% video completion rates, impressions, views, earned views, and earned playlist additions, to mention just a few.
  2. Highly correlated features: 25% video completion rates are clearly correlated with 50% video completion rates, as are the earned metrics (earned views, earned subscribers, etc.) with each other. Absolute metrics (like the total number of impressions or views) will almost certainly be correlated with the spend.
  3. Additional dimensionality from incorporating cross-channel data. Since our platform integrates data from a variety of different connectors and data sources, a data scientist working on the platform is afforded great power along with great responsibility:

Working with integrated data sources means there is a unique opportunity to incorporate paid social metrics (i.e., link clicks, conversions) with organic data (number of Facebook comments, number of retweets, panel brand awareness, number of viewable impressions, etc.) all at the same time. However, it also means there is an increased burden to avoid overfitting and surfacing insights that are false positives.

An extremely high-level representation of data flow from individual data connectors to an integrated cross-channel view provided by Panoramic, Operam’s data management platform. Our platform joins together disparate data sources that have traditionally been analyzed in isolation (e.g., a model built only off of Facebook paid campaign metrics), which can result in a significant expansion of the available feature space for model learning.

To gain some intuition behind how these additional dimensions will affect model behavior, we’ll tune our simple OLS model using L2 ridge regression (Tikhonov regularization) to simulate “constraining” or “muting” features. When we compute our loss function, we can incorporate a right-hand term that penalizes the model for a large magnitude of β: the loss becomes ‖y − Xβ‖² + λ‖β‖².

If many or all of the 47 coefficients in our model are extremely large in value, our model is likely overly complex (and over-fitted), and will be penalized accordingly. Conversely, if most of our 47 coefficients are at or close to 0, our model complexity is low: we are using only a small subset of features to make our prediction. In essence, λ can be thought of as the inverse of model complexity:

Solving for β, with the inclusion of an L2 λ regularization parameter. When λ is 0, the model reverts to a vanilla OLS model.
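
As a minimal sketch, the standard closed-form ridge solution β = (XᵀX + λI)⁻¹Xᵀy can be written with NumPy as follows. The function name ridge_beta and the synthetic data are illustrative assumptions. Note that this simple version also penalizes the intercept if X contains a constant column.

```python
import numpy as np


def ridge_beta(X, y, lam):
    """Closed-form L2-regularized least squares; lam=0 gives plain OLS."""
    n_features = X.shape[1]
    # Solving the linear system is more stable than forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([4.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)

# The coefficient vector shrinks toward zero as the penalty grows.
norms = []
for lam in (0.0, 1.0, 10.0, 50.0):
    beta = ridge_beta(X_demo, y_demo, lam)
    norms.append(np.linalg.norm(beta))
    print(f"lambda={lam:5.1f}  ||beta||={norms[-1]:.3f}")
```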

Let’s now explore what happens if we increase λ from 0 to 50 or in other words, make our model less and less complex:

If we plot bias and variance, we notice immediately their inverse relationship relative to the model’s complexity:

Plot of fitted OLS regression model prediction bias and variance using a variable λ regularization hyperparameter to tune model complexity. Visualization code available here.
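
The original visualization code is not included in this text. The sketch below shows one way such a curve could be estimated; it is not the author's code. It uses synthetic data with a known true signal (an assumption needed to measure bias), keeps the design matrix fixed, redraws only the noise, refits ridge for each λ, and then measures squared bias and variance of the predictions at fixed test points.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_test, n_features, n_repeats = 50, 100, 20, 200

true_beta = rng.normal(size=n_features)
X_train = rng.normal(size=(n_train, n_features))   # fixed design
X_test = rng.normal(size=(n_test, n_features))
f_test = X_test @ true_beta                         # noiseless truth at the test points


def ridge_beta(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


lambdas = [0.0, 1.0, 5.0, 10.0, 25.0, 50.0]
bias_sq, variance = [], []
for lam in lambdas:
    preds = np.empty((n_repeats, n_test))
    for r in range(n_repeats):
        # A fresh noisy copy of the targets plays the role of a new training set.
        y_train = X_train @ true_beta + rng.normal(size=n_train)
        preds[r] = X_test @ ridge_beta(X_train, y_train, lam)
    mean_pred = preds.mean(axis=0)
    bias_sq.append(np.mean((mean_pred - f_test) ** 2))
    variance.append(np.mean(preds.var(axis=0)))

for lam, b, v in zip(lambdas, bias_sq, variance):
    print(f"lambda={lam:5.1f}  bias^2={b:.3f}  variance={v:.3f}")
```

Plotting bias_sq and variance against lambdas (for example with matplotlib) should show the two curves moving in opposite directions as the penalty grows.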

Think about the least complex model we could make for our Brand Lift studies — we could simply predict the average headroom lift, or relative lift. For a classification problem, we would predict the mode (the most frequent class target). Regardless of what happens to the features — we drop certain features, we scale other features, etc. — our predictions will not change. However, our actual predictions are likely to be quite biased — simply predicting the mode/mean typically won’t result in a very accurate model.

In this way, we’ll have an extremely low variance, high bias model. Conversely, highly complex models will tend to have low bias but higher variance: they will be extremely sensitive to noisy perturbations of the data, and will adjust their parameters to “refit” (i.e., overfit) to this noise.

Technical Challenge #2: False Positive Signals from Looking at Statistical Analysis in Isolation

An overfitted Brand Lift model can lead to strange and contradictory insights (for example, the seemingly paradoxical insight that decreasing earned views and spend will result in an increase in forecasted relative lift). However, even when we use simple statistical models for our marketing analysis, we run the risk of surfacing false positives.

A common technical metric of interest for marketers is the correlation (usually referencing the Pearson correlation) between two metrics. Everyone, from the high-level CMO to the media buyer actually managing the campaigns, is interested in answering the question:

Which metrics correlate the most with higher box office/increased social engagement/decreased churn rate/etc.?

However, interpreting insights purely off of a single correlation statistic can be deceiving. For instance, looking directly at the correlations for Clicks, we might conclude that being a 45 to 49-year-old user appears to correlate the most with the number of clicks out of all the available dimensions. However, Clicks are almost perfectly correlated with Spend (0.99), and it certainly looks as if the organization simply spent more money targeting that particular age group.

An example correlation matrix heatmap highlighting each dimension’s linear relationship (correlation) with Clicks, our target of interest. If we interpret only these numbers in a vacuum, we may again encounter the same issue of overfitting, this time by ignoring the interactions between different metrics within the dataset. Code to generate visualization.

Any analysis that interprets only correlations when examining feature importance (attempting to answer how significantly a digital KPI translates into the desired outcome) is limited by the fact that it will consider each dimension in a vacuum, without accounting for the inter-relationships between dimensions (often synonymous with multi-collinearity).
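
To make the pitfall concrete, here is a small synthetic illustration (not the Kaggle data). Clicks are generated from spend alone, but more budget is allocated to one age group. The raw correlation between the age-group dummy and clicks looks strong, while the partial correlation after removing the linear effect of spend is close to zero. All variable names and coefficients are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000

is_45_49 = rng.integers(0, 2, size=n)                       # hypothetical age-group dummy
spend = 10 + 20 * is_45_49 + rng.normal(scale=3, size=n)    # more budget on that group
clicks = 0.6 * spend + rng.normal(scale=1, size=n)          # clicks depend on spend only

df = pd.DataFrame({"is_45_49": is_45_49, "spend": spend, "clicks": clicks})
raw_corr = float(df.corr().loc["is_45_49", "clicks"])


def residualize(target, control):
    """Remove the linear effect of `control` from `target`."""
    slope, intercept = np.polyfit(control, target, deg=1)
    return target - (slope * control + intercept)


partial_corr = float(np.corrcoef(
    residualize(df["is_45_49"], df["spend"]),
    residualize(df["clicks"], df["spend"]),
)[0, 1])

print(f"raw correlation:     {raw_corr:.2f}")
print(f"partial correlation: {partial_corr:.2f}")
```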

Technical Deep-Dive into an Initial Solution: Bootstrap Aggregating (Bagging)

In our previous example, we had 1147 data points (n = 1147). Let’s say we were trying to find the median number of impressions. We’ll perform the following experiment:

  1. Randomly draw one data point from our dataset, record its value, and then put it back (replace it) into the dataset.
  2. Repeat n (1147) times.
  3. Find the median of these 1147 impression values (since we replaced the sample data points, some will likely be duplicates). Let’s say we call finding the median a generic aggregation function g(x).
  4. Repeat Steps 1–3 B times, where B is the number of times you want to perform your bootstrap sampling. This yields B sample medians.
  5. Aggregate the B bootstrap results into your overall prediction. We’ll call this make_pred(). In our case, we’ll simply average our sample medians for our final prediction: np.mean(sample_medians).
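
The procedure above can be sketched as follows. Since the real impression values are not included here, a synthetic log-normal sample of the same size stands in for them; bootstrap_statistic is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 1147 impression counts (an assumption for illustration).
impressions = rng.lognormal(mean=10, sigma=1.5, size=1147)


def bootstrap_statistic(data, g=np.median, B=500, rng=rng):
    """Draw B bootstrap samples of len(data) with replacement and apply g to each."""
    n = len(data)
    stats = np.empty(B)
    for b in range(B):
        sample = rng.choice(data, size=n, replace=True)
        stats[b] = g(sample)
    return stats


sample_medians = bootstrap_statistic(impressions)
final_prediction = np.mean(sample_medians)          # the make_pred() step

# Share of distinct original points that land in one bootstrap sample.
idx = rng.integers(0, len(impressions), size=len(impressions))
unique_fraction = np.unique(idx).size / len(impressions)

print(f"sample median:        {np.median(impressions):.1f}")
print(f"bootstrap prediction: {final_prediction:.1f}")
print(f"unique fraction:      {unique_fraction:.3f}")
```

The unique fraction should land near 1 − 1/e ≈ 0.632, which is why roughly a third of the data is left out of any single bootstrap sample.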

On average, about 63% (1 − 1/e) of the distinct data points in the original dataset will be included in a bootstrap sample. If we plot errors and variances, we’ll notice that both converge as B grows. This is partly a consequence of the Law of Large Numbers: with a large enough number of bootstrapped dataset means, the mean of these dataset means will approach the true mean, and the random error component is “smoothed” out through the aggregation (averaging).

This is the principle behind bagging: instead of finding the mean or median of bootstrap samples, fit a model to each bootstrap sample, and then aggregate these models’ predictions to get a final prediction. In essence, we are swapping our g(x) function from median() to model.predict(). The net effect is similar: we can reduce the variance of our final predictions by aggregating our fitted models’ individual predictions.

Let’s try to visualize how bagging would affect how a model fits our social media dataset. We will compare the decision boundaries from a single fitted tree instantiated from sklearn.tree.DecisionTreeRegressor with a bagged model of 200 individual learners by predicting the number of conversions we can expect from a specific ad. First we need to build a bootstrap_fit() function for our bagged model:
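
The original bootstrap_fit() is not reproduced in this text. The sketch below shows what such a helper might look like, together with a hypothetical bootstrap_predict() companion that averages the individual trees. The synthetic two-feature data is only there to make the snippet runnable and to compare a single tree against the bagged ensemble on held-out points.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def bootstrap_fit(X, y, n_estimators=200, random_state=0):
    """Fit one fully grown tree per bootstrap resample of (X, y)."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)            # sample rows with replacement
        tree = DecisionTreeRegressor(random_state=int(rng.integers(0, 2**31 - 1)))
        models.append(tree.fit(X[idx], y[idx]))
    return models


def bootstrap_predict(models, X):
    """Aggregate by averaging the individual trees' predictions."""
    return np.mean([m.predict(X) for m in models], axis=0)


# Synthetic stand-in: two scaled features and a noisy non-linear target.
rng = np.random.default_rng(0)
X_all = rng.uniform(-1, 1, size=(400, 2))
y_all = np.sin(3 * X_all[:, 0]) + X_all[:, 1] ** 2 + rng.normal(scale=0.3, size=400)
X_tr, y_tr, X_te, y_te = X_all[:300], y_all[:300], X_all[300:], y_all[300:]

single_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
bagged = bootstrap_fit(X_tr, y_tr)

single_mse = np.mean((single_tree.predict(X_te) - y_te) ** 2)
bagged_mse = np.mean((bootstrap_predict(bagged, X_te) - y_te) ** 2)
print(f"single tree MSE: {single_mse:.3f}")
print(f"bagged MSE:      {bagged_mse:.3f}")
```

Evaluating bootstrap_predict on a regular grid of points (for instance one built with np.meshgrid) is one way to produce the kind of contour map discussed next.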

From here, we can create a bootstrap_contour_predict function that will generate a contour map of decision boundaries for our forecast:

When we visualize these decision boundaries, we see that the bagged model captures this relationship a bit more organically, smoothing out some of the straight lines indicative of overfit pockets present in a single decision tree.

Contour map for a regression model to forecast the number of conversions. The data was initially decomposed into two principal components using PCA and then scaled. Notice that the bagged model’s decision boundaries are significantly more “smoothed” out, suggesting it is more likely to generalize to unseen samples (i.e., not overfit the data). I’ve adapted a helper function called plot_decision_boundaries originally authored by Anand Chitipothu to include a bootstrap aggregation option (triggered by setting bootstrap=True).

A similar pattern is found when mapping the boundaries of a classification model. We can fit the same data on impressions and clicks to predict whether or not the ad was targeted to male or female audiences:

Decision boundaries for a single decision tree (left) vs. bootstrap aggregated predictions (right). Yellow spaces are predicted male-targeted ads; purple spaces are predicted female-targeted ads. Again, notice the smoothed decision boundaries for the bagged model.

De-Correlating Bootstrap Models

We’ve seen how bagging will reduce ensemble variance, but it is not enough to simply aggregate models — we need to ensure the bootstrapped models are as independent of each other as possible. We’ll use the following definitions to begin unpacking this intuition:

The ensemble prediction variance var(x) (which we plotted earlier), can be formally decomposed into two parts, dependent upon the value of ρ(x):

Adapted from Gilles Louppe’s PhD dissertation (page 66 — section 4.2.3, “Bias-variance decomposition of an ensemble”).
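
The formula is not reproduced in this text. As best it can be restated from the cited section, and with the notation adapted to this article (B for the number of bootstrapped models), the decomposition reads as below. Here σ²(x) is assumed to denote the prediction variance of a single bootstrapped model at x, and ρ(x) the correlation between the predictions of two such models.

```latex
\operatorname{Var}(x) \;=\; \rho(x)\,\sigma^2(x) \;+\; \frac{1 - \rho(x)}{B}\,\sigma^2(x)
```

The second term vanishes as B grows, while the first term does not depend on B at all. With ρ(x) = 0 the ensemble variance is σ²(x)/B; with ρ(x) = 1 it stays at σ²(x) no matter how many models are averaged.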

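The more we can reduce the correlation ρ(x) between different bootstrapped predictions, the closer the ensemble prediction variance comes to shrinking in proportion to 1/B, where B is the number of bootstrapped models (which we have full control of, and can push → ∞). Whatever correlation remains sets a floor on the variance that adding more models cannot remove.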

Random Forest

Random Forest algorithms combine the two machine learning techniques discussed: bootstrap aggregating (bagging) and de-correlating the ensembled models. The sklearn.ensemble.RandomForestRegressor API and class constructor parameters reveal how it differs from a traditional DecisionTreeRegressor.

The API for scikit-learn’s RandomForestRegressor.
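
Since the API screenshot is not reproduced here, one way to inspect the constructor yourself is to compare the default parameters of the two estimators. This is only a quick sketch; the exact parameter set and defaults depend on the installed scikit-learn version, so no output is listed.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

forest_params = RandomForestRegressor().get_params()
tree_params = DecisionTreeRegressor().get_params()

# Parameters that exist on the forest but not on a single tree.
forest_only = sorted(set(forest_params) - set(tree_params))
print(forest_only)

# Defaults for the parameters discussed in this section.
for name in ("n_estimators", "max_depth", "max_features", "bootstrap", "oob_score"):
    print(f"{name} = {forest_params[name]!r}")
```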

Note that the max_depth default is None: the bootstrapped trees are allowed to grow arbitrarily deep. We are intentionally overfitting each model, in the belief that averaging their predictions will yield a robust overall prediction. Arguably the most important hyper-parameter to tune is max_features: while a traditional decision tree picks the best feature to split on at each node from all available features, a Random Forest constrains its trees to choose among a random subset of K features at each node when finding the best split. This K is a hyper-parameter that should be tuned between 1 and M, the number of dimensions in the original training set:

Tuning K from 1 to M (the original number of features) will provide you with different flavors of Random Forest models.
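
A minimal sketch of such a sweep is shown below, scoring each K with the forest's out-of-bag R² so that no separate holdout split is needed. The data comes from scikit-learn's make_regression and is purely illustrative; which K wins depends on the dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in with M = 12 features, only 4 of which carry signal.
X, y = make_regression(
    n_samples=300, n_features=12, n_informative=4, noise=10.0, random_state=0
)

oob_r2 = {}
for k in (1, 3, 6, 12):
    forest = RandomForestRegressor(
        n_estimators=200,
        max_features=k,      # K: number of candidate features per split
        oob_score=True,      # score each row using only trees that did not see it
        random_state=0,
    )
    oob_r2[k] = forest.fit(X, y).oob_score_

best_k = max(oob_r2, key=oob_r2.get)
for k, score in oob_r2.items():
    print(f"K={k:2d}  OOB R^2={score:.3f}")
print("best K:", best_k)
```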

Notice that the default value for max_features in RandomForestRegressor is ‘auto’, which is actually n_features! (In more recent scikit-learn releases the ‘auto’ option has been removed and the default is written as 1.0, which likewise means all features.) This means that unless you specify a different value, by default the Random Forest implemented for regression problems is a bagged ensemble tree model. Since we’ve repeatedly emphasized the need to de-correlate bootstrap model predictions, this might come as a bit of a surprise. However, research has empirically suggested that setting K = M will indeed result in the lowest overall error for regression problems.

The impact of K, or max_features (x-axis), on Random Forest model error (y-axis) for 12 classification and 12 regression problems, as reported in P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3–42, 2006. Notice that for classification problems, the optimal K is typically close to 1, but for regression problems (with the exception of Abalone), setting K equal to the original number of features yields the lowest error.

YouTube Brand Lift data, and likely many other digital platform datasets, typically conform to these patterns: the increase in bias significantly outstrips any gains from reduced variance when we select a K less than M:

  • Even if we set K (max_features) equal to 1, the bootstrapped models’ variance will not be reduced even close to linearly (remember, in our optimal case, var(x) should be reduced by a factor of B). Our original high-dimensional dataset includes many highly correlated variables. For instance, even if one bootstrap model selects the ad’s 25% view completion rate at a particular node to split on, another model may select 50% view completion rate, overall views, or another similarly correlated metric. Thus, the trees themselves will never be grown truly independently.
  • As a result, bias increases at a significantly faster rate than variance decreases as max_features approaches 1. There are lots of “irrelevant” features when predicting headroom or relative lift: a particular ad campaign may generate high volume in terms of absolute metrics, but be otherwise completely forgettable. When K is set close to 1, the tree is unable to use its split criterion (Gini impurity, entropy, MSE, etc.) to weed out these irrelevant features, resulting in illogical splits and increased variance at each tree node’s “cut points”.
  • Explainability must be preserved: Although it’s tempting to preprocess the data into PCA components (which are orthogonal to each other and thus uncorrelated), decomposing the original features to reduce multi-collinearity is unfortunately not an option. As important as a forecast for brand lift is, media buyers and marketing executives almost always want to know which metrics are the most important levers to pull.

Beyond Forecasts: Feature Importances, Built-In Holdout Evaluation, And Ad Proximities

One of the primary use cases for a Brand Lift predictive model was to reduce the lead time between study initialization and feedback, in the form of lift metrics. Sometimes, media buyers would have to wait for days, or even a week, to figure out what their projected lift would be from a YouTube campaign. However, in March 2019, Google unveiled Brand Lift 2.0, which offers more visible metrics, such as Lifted Users (unique reach × absolute lift) and Cost Per Lifted User (spend / lifted users). Moreover, it allowed users to view real-time lift results within the Google Ads user interface, provided that their campaigns met a certain threshold of spend. This provides users with the ability to (theoretically) run campaigns continuously and optimize as results roll in.

With this new version of YouTube BLS providing more real-time insight capabilities than before, the big questions are evolving from “What is my projected lift?” to “What explains why this campaign succeeded in lifting awareness and consideration?”

Luckily, given how important explainability is, Random Forests provide feature importances (accessible via random_forest.feature_importances_) that are arguably easier to reason about than most parametric models, with the exception of OLS coefficients. I find feature importances to be conveniently client-facing because:

  • They sum to 1, so they can be presented as percentages that are easily interpretable by clients.
  • The importance values do not have wild fluctuations or negative values symptomatic of overfit linear models (ie., one coefficient is an extremely large negative value and another coefficient is an extremely large positive value).
  • The feature importance tends to effectively capture hierarchical relationships amongst features. Indeed, even though we might find in an isolated correlation analysis that earned subscribers has the highest linear correlation with headroom lift, a Random Forest’s feature importance will generally identify if those earned subscribers are influenced by another, ultimately more significant, variable (such as earned views).
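
As a small illustration of the first two points, the sketch below fits a forest on synthetic data and reads feature_importances_. The feature names and coefficients are invented for the example. Keep in mind that these impurity-based importances are computed on the training data, so it can be worth cross-checking them with a second method such as permutation importance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Hypothetical campaign metrics; only the first two influence the target.
features = pd.DataFrame({
    "earned_views": rng.normal(size=n),
    "spend": rng.normal(size=n),
    "creative_length": rng.normal(size=n),
})
lift = (
    3.0 * features["earned_views"]
    + 0.5 * features["spend"]
    + rng.normal(scale=0.5, size=n)
)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(features, lift)

importances = pd.Series(
    forest.feature_importances_, index=features.columns
).sort_values(ascending=False)

# Importances are non-negative and sum to 1, so they read naturally as percentages.
print((importances * 100).round(1).astype(str) + "%")
```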

Moreover, clients often want to know how a particular YouTube video ad campaign performs relative to other data points, either to build a peer group for benchmarking and further competitive analysis, or to intuitively see if their ad is performing in the same “league” as prior successful campaigns. We can use Random Forests to construct a proximity matrix, where we count the number of times two data points end up in the same terminal leaf node. This count is then divided by B, the number of bootstrapped estimator trees, to get a proximity score between 0 and 1. We can then use this to identify similar performing ads:

Code to generate proximity matrix from fitted Random Forest models.
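
The original code is not reproduced in this text. A minimal sketch of the idea follows, assuming a fitted scikit-learn forest: its apply() method returns the leaf index each sample reaches in each tree, which is all that is needed. The helper names proximity_matrix and most_similar are hypothetical, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def proximity_matrix(forest, X):
    """Fraction of trees in which each pair of rows lands in the same leaf."""
    leaves = forest.apply(X)                     # shape: (n_samples, n_trees)
    n_samples, n_trees = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        column = leaves[:, t]
        prox += column[:, None] == column[None, :]
    return prox / n_trees


def most_similar(prox, i, k=5):
    """Indices of the k rows with the highest proximity to row i (excluding i)."""
    order = np.argsort(prox[i])[::-1]
    return [int(j) for j in order if j != i][:k]


X, y = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)
forest = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=5, random_state=0
).fit(X, y)

prox = proximity_matrix(forest, X)
print(most_similar(prox, i=0, k=3))
```

Setting min_samples_leaf above 1 is a design choice for this sketch: with fully grown trees most leaves hold a single training row, which makes the proximities between training points very sparse.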

For instance, the most similar pair of ads targeted females with the same interest IDs, accumulated a nearly identical number of impressions, and resulted in no final approved conversions:

If we want to give clients a general sense of what level their ads are performing, we can use this same proximity score to provide a peer group of similar performance ads relative to the target (in this case, Approved Conversions):

An example of using Random Forest’s proximity matrix to identify similar performing ads relative to ultimate Approved Conversions performance. The target ad is highlighted in blue, and the most similar ads returned are listed underneath.

After we employ multi-dimensional scaling or another dimensionality reduction technique to visualize this proximity matrix, we can see that the proximity scores are correctly grouping high-performing ads together (those with X or more approved conversions):

Proximity matrix mined from Random Forest visualized using Spectral Embedding dimensionality reduction. (Left) Orange data points are ads yielding 2 or more approved conversions. (Right) 5 or more conversions. Visualization code is available here.
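
The visualization code is not included in this text. As a hedged sketch of the embedding step only, a proximity matrix can be passed to scikit-learn's SpectralEmbedding as a precomputed affinity. The data below is synthetic, and the proximity computation is repeated inline so that the snippet runs on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.manifold import SpectralEmbedding

X, y = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)
forest = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=5, random_state=0
).fit(X, y)

# Proximity: share of trees in which two rows fall into the same leaf.
leaves = forest.apply(X)                                   # (n_samples, n_trees)
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Treat the proximity matrix as a similarity graph and embed it in 2-D.
embedding = SpectralEmbedding(
    n_components=2, affinity="precomputed", random_state=0
).fit_transform(prox)

print(embedding[:5])
```

The two embedding columns can then be drawn as a scatter plot, with points colored by whether the target exceeds a chosen threshold.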

Surfacing similar-performing entities (in this case, YouTube ads, creatives, audiences, etc.) often mirrors how media buyers and marketers evaluate their campaigns. It’s often human nature to identify a known quantity — some previous campaign or project, and use its performance as the frame of reference for another campaign:

  • A more intelligent “similarity” score: often we are concerned not so much with pure similarity in terms of explicit feature values, but similarity in terms of behavior relative to some defined goal. For instance, we don’t care so much about finding campaigns targeted to the same audiences, with similar budgets, etc. We care far more about surfacing campaigns — and their different combinations of targeting, objectives, optimization goals, bid strategies, etc. — that result in similar behavior (usually a large value for headroom lift, or PSR, or some KPI of interest). A Random Forest’s proximity score can often be an improvement over a simple distance-metric computed similarity score.
  • A more targeted missing values imputation strategy: traditional imputation strategies often involve dropping the record entirely, imputing the mean/median, or using a neighbor-based strategy such as K-Nearest Neighbors to derive an imputed value from an aggregation (usually an average) of the record’s most similar neighbors. Since Random Forest is a supervised learning model with a target, we can use proximity scores to iteratively impute missing values, field by field, in each iteration switching the target to the next field to impute.
  • Peer group analysis: Finding the campaigns with the highest proximity scores allows us to provide a distribution of outcomes for clients as a frame of reference during their evaluation of a campaign. Is this campaign performing in its expected peer group relative to some target metric of interest, or are all its most similar performers an entirely different targeting interest, audience demographic, or brand?

Final Thoughts

This article mixed different business and technical use cases for Random Forests and the YouTube Brand Lift results that Operam’s data science team works with as part of the Panoramic data management platform. As BLS itself has evolved from v1.0 to v2.0, the dominant use cases for a Random Forest predictive model have also adapted: since Google Ads is increasingly able to provide the real-time insight into lift results that our models once served, we can now focus more closely on using the Random Forest model as a starting point for finding the most important features of high-performing campaigns and surfacing similar-performing “comparable” campaigns in an intelligent and unique way.