(1) Advertisers matter.
Each campaign contains different ads launched by different advertisers, and each advertiser has its own total number of creatives launched and total amount of money spent on ads. We therefore extracted three features to evaluate whether an advertiser is good or not:
• The total number of creatives by each advertiser
• The total amount of money spent by each advertiser
• Money spent per creative by each advertiser
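The three advertiser features above can be sketched as a simple aggregation. This is only an illustration: the row layout (advertiser, creative_id, spend_usd) and the function name advertiser_features are assumptions made for the example, not the dataset's confirmed schema or the code we actually ran.

```python
from collections import defaultdict

# Hypothetical rows: (advertiser, creative_id, spend_usd).
# The field names are assumptions made for this illustration.
rows = [
    ("A", "c1", 100.0),
    ("A", "c2", 300.0),
    ("B", "c3", 50.0),
]


def advertiser_features(rows):
    """Return {advertiser: (n_creatives, total_spend, spend_per_creative)}."""
    creatives = defaultdict(set)
    spend = defaultdict(float)
    for advertiser, creative_id, spend_usd in rows:
        creatives[advertiser].add(creative_id)  # distinct creatives only
        spend[advertiser] += spend_usd
    return {
        adv: (len(ids), spend[adv], spend[adv] / len(ids))
        for adv, ids in creatives.items()
    }


features = advertiser_features(rows)
print(features["A"])  # (2, 400.0, 200.0)
```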
Figure 1 plots these three features against the total number of impressions per campaign.
Figure 1: Advertiser characteristics vs the total number of impressions per campaign
(2) Ads matter.
Each campaign has its own number of ads, and these ads differ in the amount of money spent and, as a result, in the time span they cover. We therefore engineered three features to reflect how ads influence the campaign:
• The number of ads that a campaign contains
• The total number of days covered by all ads (summing the days of each ad and counting duplicate dates only once)
• The total amount of money spent on all ads that a campaign contains
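The second feature, days covered with duplicate dates counted once, is the least obvious of the three. A minimal sketch, assuming each ad is given as an inclusive (start, end) pair of dates; the function name covered_days and the sample dates are hypothetical:

```python
from datetime import date, timedelta


def covered_days(ad_ranges):
    """Count distinct calendar days covered by inclusive (start, end) ranges."""
    days = set()
    for start, end in ad_ranges:
        current = start
        while current <= end:
            days.add(current)  # a set keeps each date only once
            current += timedelta(days=1)
    return len(days)


# Two hypothetical ads in one campaign; they overlap on Oct 4 and Oct 5.
ads = [
    (date(2018, 10, 1), date(2018, 10, 5)),
    (date(2018, 10, 4), date(2018, 10, 8)),
]
n_days = covered_days(ads)
print(n_days)  # 8, not 10, because the overlapping dates are counted once
```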
Figure 2 plots these three features against the total number of impressions per campaign.
Figure 2: Ad characteristics vs the total number of impressions per campaign
(3) Ads matter a lot.
We now examine whether the types of ads that a campaign contains contribute to its impressions. In general, an ad can combine up to three types: Text, Image and Video.
We label Text as 1, Image as 2 and Video as 4, so that Text+Image is 3, Text+Video is 5, Image+Video is 6 and Text+Image+Video is 7. Each number thus represents a unique combination of ad types for a campaign.
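The labelling above is a bitmask: each type owns one bit, so the sum (equivalently, the bitwise OR) of the labels identifies the combination. A small sketch of the idea follows; the helper names encode_ad_types and decode_ad_types are made up for this example.

```python
# One bit per ad type, matching the labels in the text.
TYPE_BITS = {"Text": 1, "Image": 2, "Video": 4}


def encode_ad_types(types):
    """Combine a collection of type names into a single code from 0 to 7."""
    code = 0
    for name in types:
        code |= TYPE_BITS[name]  # OR, so repeated types do not double count
    return code


def decode_ad_types(code):
    """Recover the type names from a code."""
    return [name for name, bit in TYPE_BITS.items() if code & bit]


print(encode_ad_types(["Text", "Video"]))  # 5
print(decode_ad_types(6))                  # ['Image', 'Video']
```

Using OR rather than plain addition means a campaign with several Text ads still gets the Text bit set only once.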
Figure 3 shows how the total number of impressions per campaign varies with the types of ads that a campaign contains.
Figure 3: Types of ads vs the total number of impressions per campaign
(4) Time matters.
Each campaign, regardless of how long its individual ads last, has its own time features that we can investigate. We directly extracted the following features to see how chronological data contributes to the impressions of a campaign, as shown in Figure 4:
• The duration of the campaign (counting duplicate dates)
• The month in which the campaign starts
• The month in which the campaign ends
Figure 4: Chronological features vs the total number of impressions per campaign
(5) Time matters a lot.
It looks like time matters a lot! To discover more about the relationship between time features and the impressions of a campaign, we defined a new concept called Black Week.
Definition of BLACK WEEK: a week in which too much money was spent, suggesting fierce competition, as on Black Friday in the US. By "too much", we mean more than the average weekly spend_usd during that election cycle.
Thus, given start_date and end_date, we can find how many BLACK WEEKs each campaign contains. Figure 5 shows how the number of black weeks influences the campaign.
Figure 5: BLACK WEEKs vs the total number of impressions per campaign
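The BLACK WEEK count can be sketched as follows. This is an illustration under stated assumptions, not our exact query: weekly spend is assumed to be keyed by the Monday that starts each week, all weeks are assumed to belong to one election cycle, and the names find_black_weeks and count_black_weeks are hypothetical.

```python
from datetime import date, timedelta


def week_start(d):
    """Monday of the week containing d (assumes weeks start on Monday)."""
    return d - timedelta(days=d.weekday())


def find_black_weeks(weekly_spend):
    """Weeks whose spend exceeds the average weekly spend of the cycle."""
    average = sum(weekly_spend.values()) / len(weekly_spend)
    return {week for week, spend in weekly_spend.items() if spend > average}


def count_black_weeks(start_date, end_date, black_weeks):
    """Number of black weeks that overlap [start_date, end_date]."""
    count = 0
    week = week_start(start_date)
    while week <= end_date:
        if week in black_weeks:
            count += 1
        week += timedelta(days=7)
    return count


# Hypothetical weekly spend_usd for one election cycle (average is 4000).
weekly_spend = {
    date(2018, 10, 15): 1000.0,
    date(2018, 10, 22): 5000.0,
    date(2018, 10, 29): 6000.0,
}
black_weeks = find_black_weeks(weekly_spend)
print(count_black_weeks(date(2018, 10, 20), date(2018, 11, 2), black_weeks))  # 2
```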
(6) Age groups matter.
As Google advertising and marketing allows users to fine-tune demographic targets, we found age targets to be a bit all over the place. But with a little creative feature engineering, we can garner insights from this data without the risk of building a model that overfits on targets that appear in only a campaign or two.
In general, the age_targeting data are pretty miscellaneous. We therefore grouped them into 5 brackets based on their similarity:
• Everyone: Targeted at all voting ages. Age targets grouped here cover all or nearly all targetable ages, as well as campaigns not targeted at any age.
• No Babies: These are the campaigns targeted at all voters except the youngest bracket or two.
• Young People: These are the campaigns targeted at the youngest voter age brackets.
• No Boomers Here: These are the campaigns that target young and middle-aged voters but cut off at 54, excluding baby boomers and those older.
• The Wise Ones: These campaigns are targeted at the older voting demographics.
Figure 6 shows how the total number of impressions varies across these 5 age groups.
Figure 6: Age groups vs the total number of impressions per campaign
(7) Gender groups matter.
While most of the campaigns in the dataset are not targeted at a particular gender, we found two interesting trends in gender targeting. Untargeted campaigns tend to have more impressions overall, while gender-targeted campaigns tend to create more impressions per ad. Gender targeting thus gives us a feature that relates to the scope of a campaign, as shown in Figure 7.
Figure 7: Gender groups vs the total number of impressions per campaign
(8) Location, Location, Location!
Location matters! Where is the campaign launched?
• We can tell which state or district has the fiercest competition by calculating how much money has been spent in each state per day per advertiser, or simply by using the total amount directly.
• We can also tell which state promises the largest return on investment by calculating impressions/spend_usd.
• If geo data is available, the scale of the campaign (statewide, district, or particular zip codes) can serve as a feature.
This is the trickiest part! We spent 50% of our time figuring this out because we expected a map visualisation would be cool and, most importantly, because we do believe location matters. Despite our efforts in data cleaning and feature engineering (for example, mapping zip codes to districts, thanks to https://github.com/OpenSourceActivismTech/us-zipcodes-congress/blob/master/zccd.csv), we found the results very disappointing. Thus, in order not to disappoint you, let's skip this part!
3. Predictions based on explorations
Having engineered several features in the previous part, we built a logistic regression model using BigQuery to predict the impressions of a political campaign from these features. We randomly split the dataset into 3 sets using a hash function: a training set (80%), a validation set (10%) and a test set (10%). On the test set, the model achieves 40.0% precision, 38.9% recall, 37.4% accuracy and a 36.3% F1 score.
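The hash-based 80/10/10 split can be illustrated outside BigQuery as well. The sketch below is an assumption-laden stand-in: the text does not record which hash function or key column was used, so it hashes a hypothetical campaign ID string with MD5 from Python's hashlib and buckets the result modulo 10.

```python
import hashlib


def assign_split(campaign_id):
    """Deterministically map an ID to train (80%), validation (10%) or test (10%)."""
    digest = hashlib.md5(campaign_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10  # roughly uniform bucket from 0 to 9
    if bucket < 8:
        return "train"
    if bucket == 8:
        return "validation"
    return "test"


# Hypothetical campaign IDs, used only to show the split in action.
splits = {cid: assign_split(cid) for cid in (f"campaign-{i}" for i in range(1000))}
counts = {name: list(splits.values()).count(name) for name in ("train", "validation", "test")}
print(counts)
```

Because the assignment depends only on the ID, the same campaign always lands in the same set, which keeps the split reproducible across runs.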
4. Conclusions
What have we learned?
We learned that in this political ad dataset, the advertiser launching the campaign doesn't have much to do with how many people their ad campaign touches. The number of ads, total money spent, and average ad cost of an advertiser tell us very little about how influential their campaign will be.
We've learned that a campaign's duration, money spent, and number of ads all contribute to better results. The timing of a campaign can also influence whether or not people see it.
Additionally, we have learned that careful targeting of an ad campaign leads to better results: broader targeting, less return (the Everyone group in age and gender, and young people are hard to reach). We have also learned that denormalized databases, while more difficult to work with, can still be used to draw insight.
What conclusions can we make, or not make, about the dataset?
• Playing it safe doesn't pay: Ads not targeted at a specific group made fewer impressions. We saw this with gender and age.
• The youth isn't listening: We found ads targeted at young voters to have low numbers of impressions.
• People hate reading: Non-text forms of advertisements were much more effective at reaching voters.
• Advertiser doesn't tell us much: this may have to do with the dataset only containing advertiser information from a single election cycle.
What is obvious, and what did we not expect to see?
It seemed obvious that targeted campaigns would do better, and that more ads can get you more views. I thought we were going to see the advertiser running a campaign have more influence: I figured the DNCCC and similar groups that run a lot of ads and spend a ton of money would have an edge over others in the number of impressions created.
With more time, we would have gone over the geo features more thoroughly. The irregular form of the data made it hard, and while we got close, we came up short in the end.