5  TLYCS Portland trial: Brief report

Background, data input, brief report

5.1 The trial

In December 2021, TLYCS ran a YouTube advertising campaign in a single city, involving ‘donation advice’. Households in the top 10% of household income were targeted with (one of) three categories of videos. One of the ultimate goals was to get households to sign up for a ‘concierge’ personal donor-advising service.

The details are presented in our Gitbook HERE.

Quick takeaways

There were very few signups for the concierge advising service. (About 16 in December 2021, only 1 from Portland.)

We consider a ‘difference in differences’, comparing the year-on-year changes in visits to TLYCS during this period for Portland vs. other comparison cities.

This comparison yields a ‘middle estimate cost’ of $37.7 per additional visitor to the site. This seems relatively expensive. We could look into this further to build a more careful model and consider statistical bounds, if such work is warranted.
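Concretely, this middle estimate divides the ad spend by Portland’s year-on-year uptick in users, net of the ‘counterfactual’ uptick implied by the comparison cities (the inputs are derived below):

\[
\text{cost per additional visitor} \approx \frac{\text{spend}}{(\text{Users}_{2021} - \text{Users}_{2020}) - g_{\text{comp}} \times \text{Users}_{2020}} = \frac{4000}{(306 - 144) - 0.388 \times 144} \approx 37.7
\]

where \(g_{\text{comp}}\) is the comparison cities’ proportional year-on-year uptick.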

5.2 Capturing data

I (David Reinstein) did a manual ‘create report and download’ in TLYCS’s Google Analytics, basically as described here.

The Google Analytics report and its parameters can be accessed here, if you have access.

Report:

  • 3-31 December 2021 vs prior year, same dates
  • All North American cities with at least one user
  • Counts: Users, sessions, certain types of conversions (see below)

I downloaded this as an Excel spreadsheet to the private eamt_actual_data repo:

eamt_actual_data/tlycs/tlycs_dec_2021_vs_2020_by_city_n_america.xlsx

5.3 Input and clean data1

Reading TLYCS trial data
# Earlier read attempts, kept for reference:
#tl21 <- read_excel(here::here("data_do_not_commit", "tlycs", "tlycs_dec_2021_vs_2020_by_city_n_america.xlsx"), sheet = "Dataset1") #this should have worked?
#tl21 <- readxl::read_excel("~/githubs/eamt_actual_data/tlycs/tlycs_dec_2021_vs_2020_by_city_n_america.xlsx", sheet = "Dataset1")

# Read the Google Analytics export (sheet "Dataset1") from the project folder
tl21 <- readxl::read_excel(here::here("tlycs_portland_trial_2021", "tlycs_dec_2021_vs_2020_by_city_n_america.xlsx"), sheet = "Dataset1") %>%
  dplyr::as_tibble()
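
As footnote 1 notes, this path-based read is not portable. For those with access, one could instead fetch the file directly from the private eamt_actual_data GitHub repo; a sketch (untested, assuming a personal access token in the GITHUB_PAT environment variable, and with a guessed raw URL):

Code
# Sketch: read the xlsx straight from the private repo (URL and branch are guesses)
library(httr)

raw_url <- "https://raw.githubusercontent.com/ORG/eamt_actual_data/main/tlycs/tlycs_dec_2021_vs_2020_by_city_n_america.xlsx" # replace ORG

tf <- tempfile(fileext = ".xlsx")
GET(
  raw_url,
  add_headers(Authorization = paste("token", Sys.getenv("GITHUB_PAT"))),
  write_disk(tf, overwrite = TRUE)
)
tl21 <- readxl::read_excel(tf, sheet = "Dataset1")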

We add companion data on US city sizes.2

input cities data
us_cities_pop <- read.csv(here::here("tlycs_portland_trial_2021", "uscities_pop.csv")) %>%
  dplyr::select(name, rank, usps, pop2022) %>%
  mutate(name = if_else(name == "New York City", "New York", name)) %>% # match Google Analytics' naming
  as_tibble()

tl21 <- left_join(
  tl21,
  us_cities_pop,
  by = c("City" = "name")
) %>%
  mutate(
    # pop2022 is missing for Canadian cities, so flag the large ones by name
    pop_gt250k = pop2022 > 250000 | str_detect(City, "Windsor|Oshawa|Halifax|Victoria|Kitchener|Hamilton|Quebec|Winnipeg|Ottawa|Edmonton|Calgary|Vancouver|Montreal|Toronto")
  ) %>%
  dplyr::select(City, `Date Range`, Users, pop2022, pop_gt250k, everything())

Below, we:

  • Choose the ‘largest Portland’, as this must be Portland, OR
  • Drop other city duplicates to enable lagging/leading more easily
  • Create lags and ‘differences across years’ features

Create ‘year’ feature
tl21 <- tl21 %>% mutate(
  year = case_when(
    str_detect(`Date Range`, "2021") ~ 2021,
    str_detect(`Date Range`, "2020") ~ 2020
  )
)
Table of ‘Portland’ cities
# Re-labels the 'correct' Portland (we think), as it's a common city name.^[And there are two other 'Portlands', each with a small number of sessions]

tl21 %>%
  filter(City=="Portland") %>%
  dplyr::select(City, year, Users, pop2022, pop_gt250k) %>%
  .kable() %>%
  .kable_styling()
City      year  Users  pop2022  pop_gt250k
Portland  2021      1  666,453  TRUE
Portland  2020      0  666,453  TRUE
Portland  2021      8  666,453  TRUE
Portland  2020      6  666,453  TRUE
Portland  2021    306  666,453  TRUE
Portland  2020    144  666,453  TRUE
relabel portland
#Presumably the largest one is Portland, Oregon, and we'll re-label it as such.

tl21 <- tl21 %>%
  mutate(
    City = case_when(
      City == "Portland" & Users > 20 ~ "Portland_OR",
      City == "Portland" ~ "Other_Portland",
      TRUE ~ City
    )
  )
drop_dup_cities
tl21 <- tl21 %>%
  rowwise() %>%
  mutate(key = paste(sort(c(City, year)), collapse = "")) %>% # one key per City-year pair
  distinct(key, .keep_all = TRUE) %>%
  dplyr::select(-key) %>%
  ungroup()
Create lag and difference features
# Time-aware lag: returns the value of x from `n` periods earlier,
# matching on `time` rather than on row position (gaps yield NA)
tlag <- function(x, n = 1L, time) {
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

tl21 <- tl21 %>%
  group_by(City) %>%
  mutate(
    lag_users = tlag(Users, 1, time = year),
    diff_users = Users - lag_users,
    prop_diff_users = diff_users / lag_users
  ) %>%
  ungroup()
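
A quick illustration of why we lag on time rather than position (a toy example, not from our data): if a year is missing, a positional lag would silently pair the wrong years, while tlag returns NA.

Code
# Toy check: 2021 is absent, so the 2022 row's lag is NA rather than the 2020 value
tlag(c(10, 20, 40), n = 1, time = c(2019, 2020, 2022))
#> [1] NA 10 NA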

5.4 Casual/simple ‘uptick’ analysis

Below, for the comparable periods in 2020 and 2021, we give the total number of cities in the sample, the share with a positive number of user clicks, and the mean, median, 80th percentile (P80), and standard deviation of the number of clicks. We do this for a few subsets of cities: first for unique user visits, and then for total numbers of sessions.
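
The sumtab helper used below is defined in our setup code elsewhere; a minimal sketch of what such a helper might look like (our guess at a reimplementation, not the project’s actual definition):

Code
# Sketch of a sumtab-style summary helper (assumed; not the actual project definition)
sumtab <- function(df, var, byvar, caption = "") {
  df %>%
    dplyr::group_by({{ byvar }}) %>%
    dplyr::summarise(
      N = dplyr::n(),
      `share > 0` = mean({{ var }} > 0, na.rm = TRUE),
      Mean = mean({{ var }}, na.rm = TRUE),
      Median = median({{ var }}, na.rm = TRUE),
      P80 = quantile({{ var }}, 0.8, na.rm = TRUE),
      `Std.dev.` = sd({{ var }}, na.rm = TRUE)
    ) %>%
    knitr::kable(caption = caption, digits = 2)
}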

tabs of session by year
tl21 %>% sumtab(Users, year, caption="All N. Amer. cities")
All N. Amer. cities
year  N     share > 0  Mean   Median  P80  Std.dev.
2020  3006  0.6        9.38   1       4    (296.34)
2021  3006  1.0        13.48  2       4    (425.33)
tabs of session by year
tl21 %>% filter(City != "Portland_OR") %>% sumtab(Users, year, caption="N. Amer. Cities other than Portland")
N. Amer. Cities other than Portland
year  N     share > 0  Mean  Median  P80  Std.dev.
2020  3004  0.6        3.94  1       4    (21.01)
2021  3004  1.0        5.64  2       4    (30.79)
tabs of session by year
tl21 %>% filter(pop2022 > 250000) %>% sumtab(Users, year, caption="All N. Amer. Cities with pop. > 250k") # To fix: filter on over 20 users for 2020 ONLY
All N. Amer. Cities with pop. > 250k
year  N   share > 0  Mean   Median  P80   Std.dev.
2020  92  0.848      47.46  17.0    47.0  (99.21)
2021  92  1.000      70.04  19.5    80.2  (147.88)
tabs of session by year
tl21 %>% sumtab(Sessions, year, caption="Sessions by year, all")
Sessions by year, all
year  N     share > 0  Mean   Median  P80  Std.dev.
2020  3006  0.6        11.75  1       4    (370.61)
2021  3006  1.0        16.80  2       5    (528.67)

Note that all measures generally show an increase from year to year.

Upticks for Oregon, other
OR_users_21 <- tl21 %>% dplyr::filter(City == "Portland_OR" & year == 2021) %>% dplyr::pull(Users)
OR_users_20 <- tl21 %>% dplyr::filter(City == "Portland_OR" & year == 2020) %>% dplyr::pull(Users)

OR_uptick <- OR_users_21 - OR_users_20

nonOR <- tl21 %>% dplyr::filter(City != "Portland_OR") %>% dplyr::select(Users, year)
# Exclude Portland_OR here too, to match the 'other than Portland' comparison in the text
nonOR_gt250k <- tl21 %>% dplyr::filter(City != "Portland_OR" & pop_gt250k == TRUE) %>% dplyr::select(Users, year)

nonOR_users_21 <- mean(nonOR$Users[nonOR$year == 2021])
nonOR_users_20 <- mean(nonOR$Users[nonOR$year == 2020])

nonOR_uptick <- (nonOR_users_21 - nonOR_users_20)/nonOR_users_20

nonOR_gt250k_users_21 <- mean(nonOR_gt250k$Users[nonOR_gt250k$year == 2021])
nonOR_gt250k_users_20 <- mean(nonOR_gt250k$Users[nonOR_gt250k$year == 2020])

nonOR_gt250k_uptick <- (nonOR_gt250k_users_21 - nonOR_gt250k_users_20)/nonOR_gt250k_users_20

Bounding: Maximum impact/minimum cost (subject to random variation)

(This duplicates the hand-calculated results in our Gitbook HERE.)3

We have…

  • a lower bound on cost of $13.07 per user ($10.28 per visit) if ‘all visits were generated by the ad’, and

  • a somewhat more reasonable $24.69 cost per user if Portland was the ‘same as last year’. But it seems most reasonable to allow Portland to have similar trends to other cities.
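
A minimal reconstruction of these hand calculations, assuming the roughly USD 4,000 ad spend used in the cost-per-user figure below (the $10.28-per-visit figure would use session counts rather than user counts, not computed here):

Code
ad_spend <- 4000 # approximate campaign cost (USD); see the cost-per-user calculation below

# Lower bound: attribute all 306 Portland users in Dec 2021 to the ad
ad_spend / OR_users_21 # ~ $13.07 per user

# 'Same as last year': attribute only the year-on-year uptick (306 - 144 = 162) to the ad
ad_spend / (OR_users_21 - OR_users_20) # ~ $24.69 per user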

Difference in differences comparison to other cities

Guiding assumptions:

  • the cities used are fairly representative

  • ‘uptick as a percentage’ is unrelated to city size/visits last year

  • all the cities in the comparison group are ‘informative to the counterfactual’ in proportion to their total number of sessions.4


Thus:

For Portland, the year-on-year uptick in users is (306 − 144)/144 = 112.5%.

For ‘all North American cities other than Portland (with greater than 250,000 people)’:

The average is 46.5 users in the 2020 period and 64.5 users in the 2021 period, an uptick of (64.5 − 46.5)/46.5 = about 38.8%.5

38.8% uptick \(\times\) 144 = 55.9 ‘counterfactual uptick’ in users for Portland

162 − 55.9 ≈ 106 ‘uptick relative to counterfactual’

Code
or_uptick_vs_cfl_gt250 <- OR_users_21 - OR_users_20 - nonOR_gt250k_uptick * OR_users_20 # Portland's uptick net of the counterfactual uptick

USD 4000/106 ≈ $37.7 cost per user
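
The same figure in code, reusing the ad_spend (approximately USD 4,000) assumed in the bounding sketch above:

Code
cost_per_user <- ad_spend / or_uptick_vs_cfl_gt250 # ~ USD 37.7 per additional user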

This seems realistic as a first pass.


Plotting the ‘difference in differences’

Scatterplot of unique users by city, both time periods, US cities with 500k–1.5 million population, Portland highlighted

users_by_year
(
  users_by_year <- tl21 %>%
    filter(pop2022 >= 500000 & pop2022 < 1500000) %>%
    ggplot() +
    aes(x = pop2022, y = Users, label = City) +
    geom_point(aes(shape = factor(year)), size = 3, alpha = 0.4) +
    geom_path(aes(group = City)) +
    geom_label(data = subset(tl21, City %in% c('Portland_OR', 'Boston', 'Seattle', 'San Francisco', 'Atlanta', 'San Diego', 'Indianapolis', 'Fort Worth', 'Sacramento', 'Denver', 'San Jose', 'Las Vegas') & year == 2020), alpha = 0.3, size = 2.5) +
    scale_x_continuous(trans = 'log10') +
    theme_minimal()
)

users_by_year
#users_by_year %>%  plotly::ggplotly()

Plotting increases in users by population…

increase_by_size
(
  increase_by_size <- tl21 %>%
    filter(pop2022 >= 500000 & pop2022 < 1500000 & year == 2021) %>%
    ggplot() +
    aes(x = pop2022, y = diff_users, label = City) +
    geom_point(size = 3, alpha = 0.4) +
    geom_label(data = subset(tl21, City %in% c('Portland_OR', 'Boston', 'Seattle', 'San Francisco', 'Atlanta', 'San Diego', 'Indianapolis', 'Fort Worth', 'Sacramento', 'Denver', 'San Jose', 'Las Vegas')), alpha = 0.3, size = 2.5) +
    scale_x_continuous(trans = 'log10') +
    theme_minimal()
)

Plotting proportional increases in users by 2020 users…

Code
(
  prop_increase_by_size <- tl21 %>%
    filter(pop2022 >= 500000 & pop2022 < 1500000 & year == 2021 & lag_users > 0) %>%
    ggplot() +
    aes(x = lag_users, y = prop_diff_users, label = City) +
    geom_point(size = 3, alpha = 0.4) +
    geom_smooth(method = 'lm') +
    geom_text(hjust = 0, vjust = 0, alpha = .5) +
    scale_x_continuous(trans = 'log10') +
    theme_minimal() +
    ggtitle("Mid-sized cities, proportional changes") +
    xlab("Users in 2020") +
    ylab("Proportional change in users")
)

5.5 Next steps, if warranted

The above comparisons are crude and have limitations:

  • All cities are weighted equally, no matter their size or similarity to Portland
  • Some year-to-year idiosyncratic variation may be unrelated to trends or to the trial. We have not ‘quantified this uncertainty’

If we want a more precise estimate and careful CIs,6 we can build an explicit model and simulation. But I want to know the value of precision here before I dig deeper.
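
If we do proceed, a cheap first step would be a placebo (permutation) check, sketched below under our assumptions: treat each large comparison city in turn as a ‘placebo Portland’, compute its uptick net of the counterfactual implied by the remaining cities, and see where Portland’s roughly 106-user net uptick falls in that distribution. A rough sketch, not a worked-out model:

Code
# Placebo check (sketch): how unusual is Portland's net uptick among comparison cities?
library(dplyr)
library(tidyr)

wide <- tl21 %>%
  filter(pop_gt250k == TRUE) %>%
  select(City, year, Users) %>%
  pivot_wider(names_from = year, values_from = Users, names_prefix = "y")

placebo_upticks <- sapply(seq_len(nrow(wide)), function(i) {
  others <- wide[-i, ]
  g <- (mean(others$y2021, na.rm = TRUE) - mean(others$y2020, na.rm = TRUE)) /
    mean(others$y2020, na.rm = TRUE) # counterfactual growth from the other cities
  (wide$y2021[i] - wide$y2020[i]) - g * wide$y2020[i] # city i's uptick net of counterfactual
})

# Share of cities with a net uptick at least as large as Portland's
mean(placebo_upticks >= placebo_upticks[wide$City == "Portland_OR"], na.rm = TRUE)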


  1. Note: the code below reads directly from the folder structure and is thus not portable. Ideally, we would modify this to read it in directly from the private Github repo for those with access.↩︎

  2. Downloaded from this link, accessed on 17 Mar 2022; we also add the 15 largest Canadian cities (hand-input).↩︎

  3. The present very simple section is hand-input for now…↩︎

  4. I’m waving the hands a bit here; this is not precise.↩︎

  5. This is very similar to the result if we look at all cities, which yields an uptick of 43.1%.↩︎

  6. Confidence, credible, or ‘consistent’ intervals, depending on the model and terminology.↩︎