What Makes an Influential Campaign?

Disclaimer: The views, thoughts, and opinions expressed in the text belong solely to the author, and not necessarily to the author's employer, organization, committee or other group or individual.

Highlights

In this project, Adam Rosenberg (adam17@stanford.edu) and I have been working on Google's Political Advertisement dataset, which is publicly available on Google BigQuery. Our main purpose is to discover which features relate to the influence of a political campaign. Additionally, we would like to see how ads are targeted at various groups of people based on demographics, geographic location, and chronological information about ad campaigns.

By carrying out steps including analysing the data structures, querying, cleaning, visualising statistical results, and finally predicting political campaign influence using machine learning methods, we have reached the following conclusions:

In terms of advertisers, we have learned that in this political ad dataset, the advertiser launching the campaign doesn't have much to do with how many people their ad campaign touches. More specifically, the number of ads, total money spent, and average ad cost of an advertiser tell us very little about how influential their campaign will be.

In terms of political campaigns, we've learned that a campaign's duration, money spent, and number of ads all lead to better results. The timing of a campaign can also influence whether or not people see it. Additionally, we have learned that careful targeting of an ad campaign leads to better results.

In terms of the characteristics of ads in a political campaign, we have learned that:
• Playing it safe doesn't pay: ads not targeted at a specific group made fewer impressions. We saw this with both gender and age.
• The youth isn't listening: we found that ads targeted at young voters received few impressions.
• People hate reading: non-text forms of advertisements were much more effective at reaching voters.
• The advertiser doesn't tell us much: this may be because the dataset contains advertiser information from only a single election cycle.

Stay tuned if you are interested in how we came to these conclusions.

1. Analysis of data structures

We've been working on Google's Political Advertisement dataset, created in response to transparency concerns after the 2016 election. The dataset begins with the election cycle that came to a close in 2018.

When first looking at the data, we came up with the following questions:
• How is the dataset organised?
• What is the data quality like, and is there redundant data?
• What are the relationships between tables?
• Are there keys in the Functional Dependency sense among the tables and what are they?
• Are the tables normalised in some way (Boyce-Codd Normal Form) or are they denormalised?
If you are curious about the answers to these questions, please click and read.

2. Exploration of data sets

In this part, we use SQL to gather information from the dataset. Given spend_usd and some other variables that we are going to engineer, we will try to answer what makes a political campaign influential. For our purpose, we define an influential campaign as one with a large total number of impressions.
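As a toy illustration of this definition, the labelling step can be sketched as below. The field names (campaign_id, impressions) and the 100k threshold are assumptions made for illustration, not the dataset's actual schema; the real work was done with SQL on BigQuery.

```python
# Hypothetical sketch: label a campaign "influential" when its total
# impressions exceed a chosen threshold. Names and numbers are invented.
from collections import defaultdict

def label_influential(ads, threshold):
    """Sum impressions per campaign and flag campaigns at or above the threshold."""
    totals = defaultdict(int)
    for campaign_id, impressions in ads:
        totals[campaign_id] += impressions
    return {cid: total >= threshold for cid, total in totals.items()}

ads = [("c1", 50_000), ("c1", 70_000), ("c2", 5_000)]
print(label_influential(ads, threshold=100_000))  # {'c1': True, 'c2': False}
```

The same aggregation is a one-line GROUP BY in SQL; the sketch just makes the "influential" cutoff explicit.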

To investigate what contributes most to a campaign's impressions, we are going to create plots for the following variables to see how they relate to impressions. These variables have been grouped into different categories for easier understanding. Some of them can be extracted directly from the tables; others need to be engineered.
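For a numeric variable such as campaign duration, one simple way to quantify its relationship to impressions alongside a plot is the Pearson correlation coefficient. A minimal sketch with made-up numbers (not values from the dataset):

```python
# Minimal Pearson correlation sketch; the durations and impressions
# below are invented for demonstration only.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

durations = [10, 30, 60, 90, 5]                  # hypothetical campaign lengths (days)
impressions = [1_000, 4_000, 9_000, 12_000, 500]  # hypothetical impression counts
print(round(pearson(durations, impressions), 3))  # → 0.996
```

A coefficient near +1 or -1 suggests a feature worth keeping; near 0 (as we saw for the advertiser features) suggests it can be dropped.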

In summary, we have found that:
• Targeting the states where the most money was spent was strongly related to impressions. This was likely because the races in those states were especially competitive.
• Age targeting made for campaigns with more impressions (campaigns not targeted by age did worst), but targeting young people did not seem to work.
• Advertiser data was not useful: it correlated poorly with impressions, so we decided not to use these features.
• BLACK WEEKs were a good predictor, though campaigns with 7 black weeks did anomalously well.
• Campaigns using video and image ads did better than those using text ads. Lastly, the closer a campaign was to the election, the better it did.
If you are interested in how we engineered the features and which ones we fed into the prediction models, click here.

3. Predictions based on explorations

Having engineered several features in the previous part, we build a logistic regression model using BigQuery to predict the impressions of a political campaign from these features. We randomly split the dataset into three sets using a hash function: a training set (80%), a validation set (10%), and a test set (10%). On the test set, the model achieves a 40.0% precision score, a 38.9% recall score, 37.4% accuracy, and a 36.3% F1 score.
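The hash-based split and the reported metrics can be sketched as follows. The campaign IDs and labels here are made up, and the actual model was trained inside BigQuery, not with this code:

```python
# Sketch of a deterministic 80/10/10 split via hashing, plus the four
# evaluation metrics reported above. All inputs are invented examples.
import hashlib

def split_bucket(campaign_id):
    """Assign a row to train/validation/test by hashing its ID into 10 buckets."""
    h = int(hashlib.md5(campaign_id.encode()).hexdigest(), 16) % 10
    if h < 8:
        return "train"                         # buckets 0-7: 80%
    return "validation" if h == 8 else "test"  # bucket 8: 10%, bucket 9: 10%

def metrics(y_true, y_pred):
    """Precision, recall, accuracy, and F1 for binary labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1
```

Hashing the campaign ID (rather than sampling at random) makes the split reproducible: the same row always lands in the same set, which matters when re-running queries.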

4. Conclusions

What have we learned?

We learned that in this political ad dataset, the advertiser launching the campaign doesn't have much to do with how many people their ad campaign touches. The number of ads, total money spent, and average ad cost of an advertiser tell us very little about how influential their campaign will be.

We've learned that a campaign's duration, money spent, and number of ads all lead to better results. The timing of a campaign can also influence whether or not people see it.

Additionally, we have learned that careful targeting of an ad campaign leads to better results. We have also learned that denormalised databases, while more difficult to work with, can still be used to draw insight from. Broader targeting means less return: ads aimed at everyone by age or gender underperformed, and young people proved hard to reach.

What conclusions can we make, or not make, about the dataset?

• Playing it safe doesn't pay: ads not targeted at a specific group made fewer impressions. We saw this with both gender and age.
• The youth isn't listening: we found that ads targeted at young voters received few impressions.
• People hate reading: non-text forms of advertisements were much more effective at reaching voters.
• The advertiser doesn't tell us much: this may be because the dataset contains advertiser information from only a single election cycle.

What is obvious, and what did we not expect to see?

It seemed obvious that targeted campaigns would do better, and that more ads can get you more views. I thought we were going to see the advertiser running a campaign have more influence; I figured groups like the DCCC, which run a lot of ads and spend a ton of money, would have an edge over others in the number of impressions created.

With more time, we would have explored the geographic features more thoroughly. The irregular form of that data made it hard, and while we got close, we came up short in the end.
