What Makes an Influential Campaign?

Disclaimer: The views, thoughts, and opinions expressed in the text belong solely to the author, and not necessarily to the author's employer, organization, committee or other group or individual.

Highlights

In this project, Adam Rosenberg (adam17@stanford.edu) and I have been working on Google's Political Advertisement dataset, which is publicly available on Google BigQuery. Our main purpose is to discover which features relate to the influence of a political campaign. Additionally, we would like to see how ads are targeted at various groups of people based on demographics, geographic location, and chronological information about ad campaigns.

By carrying out steps including analysing the data structures, querying, cleaning, visualising statistical results, and finally predicting political campaign influence using machine learning methods, we have reached the following conclusions:

In terms of advertisers, we have learned that in this political ad dataset, the advertiser launching the campaign doesn't have much to do with how many people their ad campaign touches. More specifically, the number of ads, total money spent, and average ad cost of an advertiser tell us very little about how influential their campaign will be.

In terms of political campaigns, we've learned that a campaign's duration, money spent, and number of ads all lead to better results. The timing of a campaign can also influence whether or not people see it. Additionally, we have learned that careful targeting of an ad campaign leads to better results.

In terms of the characteristics of ads in a political campaign, we have learned that:
• Playing it safe doesn't pay: ads not targeted at a specific group made fewer impressions. We saw this with both gender and age.
• The youth isn't listening: we found that ads targeted at young voters received few impressions.
• People hate reading: non-text forms of advertisements were much more effective at reaching voters.
• The advertiser doesn't tell us much: this may be because the dataset contains advertiser information from only a single election cycle.

Stay tuned if you are interested in how we came to these conclusions.

1. Analysis of data structures

We've been working on Google's Political Advertisement dataset, created in response to transparency concerns after the 2016 election. The dataset begins with the election cycle that came to a close in 2018.

When first looking at the data, we came up with the following questions:
• How is the dataset organised?
• What is the data quality like, and is there redundant data?
• What are the relationships between tables?
• Are there keys in the Functional Dependency sense among the tables and what are they?
• Are the tables normalised in some way (Boyce-Codd Normal Form) or are they denormalised?
If you are curious about the answers to these questions, please click and read.

2. Exploration of data sets

In this part, we use SQL to gather information from the dataset. Given spend_usd and some other variables that we are going to engineer, we will try to answer what makes a political campaign influential. For our purpose, we define an influential campaign as one with a large total number of impressions.
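As a toy illustration of this definition, the labelling step can be sketched as below. The field names (campaign_id, impressions) and the 100k threshold are assumptions made for illustration, not the dataset's actual schema; the real work was done with SQL on BigQuery.

```python
# Hypothetical sketch: label a campaign "influential" when its total
# impressions exceed a chosen threshold. Names and numbers are invented.
from collections import defaultdict

def label_influential(ads, threshold):
    """Sum impressions per campaign and flag campaigns at or above the threshold."""
    totals = defaultdict(int)
    for campaign_id, impressions in ads:
        totals[campaign_id] += impressions
    return {cid: total >= threshold for cid, total in totals.items()}

ads = [("c1", 50_000), ("c1", 70_000), ("c2", 5_000)]
print(label_influential(ads, threshold=100_000))  # {'c1': True, 'c2': False}
```

The same aggregation is a one-line GROUP BY in SQL; the sketch just makes the "influential" cutoff explicit.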

To investigate what contributes most to a campaign's impressions, we are going to create plots for the following variables to see how they relate to impressions. These variables have been grouped into different categories for easier understanding. Some of them can be extracted directly from the tables; others need to be engineered.
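For a numeric variable such as campaign duration, one simple way to quantify its relationship to impressions alongside a plot is the Pearson correlation coefficient. A minimal sketch with made-up numbers (not values from the dataset):

```python
# Minimal Pearson correlation sketch; the durations and impressions
# below are invented for demonstration only.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

durations = [10, 30, 60, 90, 5]                  # hypothetical campaign lengths (days)
impressions = [1_000, 4_000, 9_000, 12_000, 500]  # hypothetical impression counts
print(round(pearson(durations, impressions), 3))  # → 0.996
```

A coefficient near +1 or -1 suggests a feature worth keeping; near 0 (as we saw for the advertiser features) suggests it can be dropped.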

In summary, we have found that:
• Targeting the states where the most money was spent was strongly related to impressions. This was likely because the races in those states were especially competitive.
• Age targeting made for campaigns with more impressions (campaigns not targeted by age did worst), but targeting young people did not seem to work.
• Advertiser data was not useful: it correlated poorly with impressions, so we decided not to use these features.
• BLACK WEEKs were a good predictor, though campaigns with 7 black weeks did anomalously well.
• Campaigns using video and image ads did better than those using text ads. Lastly, the closer a campaign was to the election, the better it did.
If you are interested in how we engineered the features and which ones we fed into the prediction models, click here.

3. Predictions based on explorations

Having engineered several features in the previous part, we build a logistic regression model using BigQuery to predict the impressions of a political campaign from these features. We randomly split the dataset into three sets using a hash function: a training set (80%), a validation set (10%), and a test set (10%). On the test set, the model achieves a 40.0% precision score, a 38.9% recall score, 37.4% accuracy, and a 36.3% F1 score.
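The hash-based split and the reported metrics can be sketched as follows. The campaign IDs and labels here are made up, and the actual model was trained inside BigQuery, not with this code:

```python
# Sketch of a deterministic 80/10/10 split via hashing, plus the four
# evaluation metrics reported above. All inputs are invented examples.
import hashlib

def split_bucket(campaign_id):
    """Assign a row to train/validation/test by hashing its ID into 10 buckets."""
    h = int(hashlib.md5(campaign_id.encode()).hexdigest(), 16) % 10
    if h < 8:
        return "train"                         # buckets 0-7: 80%
    return "validation" if h == 8 else "test"  # bucket 8: 10%, bucket 9: 10%

def metrics(y_true, y_pred):
    """Precision, recall, accuracy, and F1 for binary labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1
```

Hashing the campaign ID (rather than sampling at random) makes the split reproducible: the same row always lands in the same set, which matters when re-running queries.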

4. Conclusions

What have we learned?

We learned that in this political ad dataset, the advertiser launching the campaign doesn't have much to do with how many people their ad campaign touches. The number of ads, total money spent, and average ad cost of an advertiser tell us very little about how influential their campaign will be.

We've learned that a campaign's duration, money spent, and number of ads all lead to better results. The timing of a campaign can also influence whether or not people see it.

Additionally, we have learned that careful targeting of an ad campaign leads to better results. We have also learned that denormalised databases, while more difficult to work with, can still be used to draw insight from. Broader targeting means less return: ads aimed at everyone by age or gender underperformed, and young people proved hard to reach.

What conclusions can we make, or not make, about the dataset?

• Playing it safe doesn't pay: ads not targeted at a specific group made fewer impressions. We saw this with both gender and age.
• The youth isn't listening: we found that ads targeted at young voters received few impressions.
• People hate reading: non-text forms of advertisements were much more effective at reaching voters.
• The advertiser doesn't tell us much: this may be because the dataset contains advertiser information from only a single election cycle.

What is obvious, and what did we not expect to see?

It seemed obvious that targeted campaigns would do better, and that more ads can get you more views. I thought we were going to see the advertiser running a campaign have more influence; I figured groups like the DCCC, which run a lot of ads and spend a ton of money, would have an edge over others in the number of impressions created.

With more time, we would have explored the geographic features more thoroughly. The irregular form of that data made it hard, and while we got close, we came up short in the end.
