5  TLYCS Portland trial: Brief report

Background, data input, brief report

5.1 The trial

In December 2021, TLYCS ran a YouTube advertising campaign in a single city, involving ‘donation advice’. Households in the top 10% of household income were targeted with (one of) three categories of videos. One of the ultimate goals was to get households to sign up for a ‘concierge’ personal donor-advising service.

The details are presented in our Gitbook HERE.

Quick takeaways

There were very few signups for the concierge advising service. (About 16 in December 2021, only 1 from Portland.)

We consider a ‘difference in differences’, comparing the year-on-year changes in visits to TLYCS during this period for Portland vs. other comparison cities.

This comparison yields a ‘middle estimate cost’ of $37.7 per additional visitor to the site. This seems relatively expensive. We could look into this further to build a more careful model and consider statistical bounds, if such work is warranted.
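Concretely, this middle estimate divides the ad spend by Portland’s year-on-year uptick in users, net of the ‘counterfactual’ uptick implied by the comparison cities (the inputs are derived below):

\[
\text{cost per additional visitor} \approx \frac{\text{spend}}{(\text{Users}_{2021} - \text{Users}_{2020}) - g_{\text{comp}} \times \text{Users}_{2020}} = \frac{4000}{(306 - 144) - 0.388 \times 144} \approx 37.7
\]

where \(g_{\text{comp}}\) is the comparison cities’ proportional year-on-year uptick.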

5.2 Capturing data

I (David Reinstein) did a manual ‘create report and download’ in TLYCS’s Google Analytics, basically as described here.

The Google Analytics report and its parameters can be accessed here, if you have access.

Report:

  • 3-31 December 2021 vs prior year, same dates
  • All North American cities with at least one user
  • Counts: Users, sessions, certain types of conversions (see below)

I downloaded this as an Excel spreadsheet to the private eamt_actual_data repo:

eamt_actual_data/tlycs/tlycs_dec_2021_vs_2020_by_city_n_america.xlsx

5.3 Input and clean data1

Reading TLYCS trial data
# Earlier read attempts, kept for reference:
#tl21 <- read_excel(here::here("data_do_not_commit", "tlycs", "tlycs_dec_2021_vs_2020_by_city_n_america.xlsx"), sheet = "Dataset1") #this should have worked?
#tl21 <- readxl::read_excel("~/githubs/eamt_actual_data/tlycs/tlycs_dec_2021_vs_2020_by_city_n_america.xlsx", sheet = "Dataset1")

# Read the Google Analytics export (sheet "Dataset1") from the project folder
tl21 <- readxl::read_excel(here::here("tlycs_portland_trial_2021", "tlycs_dec_2021_vs_2020_by_city_n_america.xlsx"), sheet = "Dataset1") %>%
  dplyr::as_tibble()
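
As footnote 1 notes, this path-based read is not portable. For those with access, one could instead fetch the file directly from the private eamt_actual_data GitHub repo; a sketch (untested, assuming a personal access token in the GITHUB_PAT environment variable, and with a guessed raw URL):

Code
# Sketch: read the xlsx straight from the private repo (URL and branch are guesses)
library(httr)

raw_url <- "https://raw.githubusercontent.com/ORG/eamt_actual_data/main/tlycs/tlycs_dec_2021_vs_2020_by_city_n_america.xlsx" # replace ORG

tf <- tempfile(fileext = ".xlsx")
GET(
  raw_url,
  add_headers(Authorization = paste("token", Sys.getenv("GITHUB_PAT"))),
  write_disk(tf, overwrite = TRUE)
)
tl21 <- readxl::read_excel(tf, sheet = "Dataset1")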

We add companion data on US city sizes.2

input cities data
us_cities_pop <- read.csv(here::here("tlycs_portland_trial_2021", "uscities_pop.csv")) %>%
  dplyr::select(name, rank, usps, pop2022) %>%
  mutate(name = if_else(name == "New York City", "New York", name)) %>% # match Google Analytics' naming
  as_tibble()

tl21 <- left_join(
  tl21,
  us_cities_pop,
  by = c("City" = "name")
) %>%
  mutate(
    # pop2022 is missing for Canadian cities, so flag the large ones by name
    pop_gt250k = pop2022 > 250000 | str_detect(City, "Windsor|Oshawa|Halifax|Victoria|Kitchener|Hamilton|Quebec|Winnipeg|Ottawa|Edmonton|Calgary|Vancouver|Montreal|Toronto")
  ) %>%
  dplyr::select(City, `Date Range`, Users, pop2022, pop_gt250k, everything())

Below, we:

  • Choose the ‘largest Portland’, as this must be Portland, OR
  • Drop other city duplicates to enable lagging/leading more easily
  • Create lags and ‘differences across years’ features

Create ‘year’ feature
tl21 <- tl21 %>% mutate(
  year = case_when(
    str_detect(`Date Range`, "2021") ~ 2021,
    str_detect(`Date Range`, "2020") ~ 2020
  )
)
Table of ‘Portland’ cities
# Re-labels the 'correct' Portland (we think), as it's a common city name.^[And there are two other 'Portlands', each with a small number of sessions]

tl21 %>%
  filter(City=="Portland") %>%
  dplyr::select(City, year, Users, pop2022, pop_gt250k) %>%
  .kable() %>%
  .kable_styling()
City      year  Users  pop2022  pop_gt250k
Portland  2021      1  666,453  TRUE
Portland  2020      0  666,453  TRUE
Portland  2021      8  666,453  TRUE
Portland  2020      6  666,453  TRUE
Portland  2021    306  666,453  TRUE
Portland  2020    144  666,453  TRUE
relabel portland
#Presumably the largest one is Portland, Oregon, and we'll re-label it as such.

tl21 <- tl21 %>%
  mutate(
    City = case_when(
      City == "Portland" & Users > 20 ~ "Portland_OR",
      City == "Portland" ~ "Other_Portland",
      TRUE ~ City
    )
  )
drop_dup_cities
tl21 <- tl21 %>%
  rowwise() %>%
  mutate(key = paste(sort(c(City, year)), collapse = "")) %>% # one key per City-year pair
  distinct(key, .keep_all = TRUE) %>%
  dplyr::select(-key) %>%
  ungroup()
Create lag and difference features
# Time-aware lag: returns the value of x from `n` periods earlier,
# matching on `time` rather than on row position (gaps yield NA)
tlag <- function(x, n = 1L, time) {
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

tl21 <- tl21 %>%
  group_by(City) %>%
  mutate(
    lag_users = tlag(Users, 1, time = year),
    diff_users = Users - lag_users,
    prop_diff_users = diff_users / lag_users
  ) %>%
  ungroup()
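
A quick illustration of why we lag on time rather than position (a toy example, not from our data): if a year is missing, a positional lag would silently pair the wrong years, while tlag returns NA.

Code
# Toy check: 2021 is absent, so the 2022 row's lag is NA rather than the 2020 value
tlag(c(10, 20, 40), n = 1, time = c(2019, 2020, 2022))
#> [1] NA 10 NA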

5.4 Casual/simple ‘uptick’ analysis

Below, for the comparable periods in 2020 and 2021, we give the total number of cities in the sample, the share with a positive number of user clicks, and the mean, median, 80th percentile (P80), and standard deviation of the number of clicks. We do this for a few subsets of cities: first for unique user visits, and then for total numbers of sessions.
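
The sumtab helper used below is defined in our setup code elsewhere; a minimal sketch of what such a helper might look like (our guess at a reimplementation, not the project’s actual definition):

Code
# Sketch of a sumtab-style summary helper (assumed; not the actual project definition)
sumtab <- function(df, var, byvar, caption = "") {
  df %>%
    dplyr::group_by({{ byvar }}) %>%
    dplyr::summarise(
      N = dplyr::n(),
      `share > 0` = mean({{ var }} > 0, na.rm = TRUE),
      Mean = mean({{ var }}, na.rm = TRUE),
      Median = median({{ var }}, na.rm = TRUE),
      P80 = quantile({{ var }}, 0.8, na.rm = TRUE),
      `Std.dev.` = sd({{ var }}, na.rm = TRUE)
    ) %>%
    knitr::kable(caption = caption, digits = 2)
}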

tabs of session by year
tl21 %>% sumtab(Users, year, caption="All N. Amer. cities")
All N. Amer. cities
year  N     share > 0  Mean   Median  P80  Std.dev.
2020  3006  0.6        9.38   1       4    (296.34)
2021  3006  1.0        13.48  2       4    (425.33)
tabs of session by year
tl21 %>% filter(City != "Portland_OR") %>% sumtab(Users, year, caption="N. Amer. Cities other than Portland")
N. Amer. Cities other than Portland
year  N     share > 0  Mean  Median  P80  Std.dev.
2020  3004  0.6        3.94  1       4    (21.01)
2021  3004  1.0        5.64  2       4    (30.79)
tabs of session by year
tl21 %>% filter(pop2022 > 250000) %>% sumtab(Users, year, caption="All N. Amer. Cities with pop. > 250k") # To fix: filter on over 20 users for 2020 ONLY
All N. Amer. Cities with pop. > 250k
year  N   share > 0  Mean   Median  P80   Std.dev.
2020  92  0.848      47.46  17.0    47.0  (99.21)
2021  92  1.000      70.04  19.5    80.2  (147.88)
tabs of session by year
tl21 %>% sumtab(Sessions, year, caption="Sessions by year, all")
Sessions by year, all
year  N     share > 0  Mean   Median  P80  Std.dev.
2020  3006  0.6        11.75  1       4    (370.61)
2021  3006  1.0        16.80  2       5    (528.67)

Note that all measures generally show an increase from year to year.

Upticks for Oregon, other
OR_users_21 <- tl21 %>% dplyr::filter(City == "Portland_OR" & year == 2021) %>% dplyr::pull(Users)
OR_users_20 <- tl21 %>% dplyr::filter(City == "Portland_OR" & year == 2020) %>% dplyr::pull(Users)

OR_uptick <- OR_users_21 - OR_users_20

nonOR <- tl21 %>% dplyr::filter(City != "Portland_OR") %>% dplyr::select(Users, year)
# Exclude Portland_OR here too, to match the 'other than Portland' comparison in the text
nonOR_gt250k <- tl21 %>% dplyr::filter(City != "Portland_OR" & pop_gt250k == TRUE) %>% dplyr::select(Users, year)

nonOR_users_21 <- mean(nonOR$Users[nonOR$year == 2021])
nonOR_users_20 <- mean(nonOR$Users[nonOR$year == 2020])

nonOR_uptick <- (nonOR_users_21 - nonOR_users_20)/nonOR_users_20

nonOR_gt250k_users_21 <- mean(nonOR_gt250k$Users[nonOR_gt250k$year == 2021])
nonOR_gt250k_users_20 <- mean(nonOR_gt250k$Users[nonOR_gt250k$year == 2020])

nonOR_gt250k_uptick <- (nonOR_gt250k_users_21 - nonOR_gt250k_users_20)/nonOR_gt250k_users_20

Bounding: Maximum impact/minimum cost (subject to random variation)

(This duplicates the hand-calculated results in our Gitbook HERE.)3

We have…

  • a lower bound on cost of $13.07 per user ($10.28 per visit) if ‘all visits were generated by the ad’, and

  • a somewhat more reasonable $24.69 cost per user if Portland was the ‘same as last year’. But it seems most reasonable to allow Portland to have similar trends to other cities.
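
A minimal reconstruction of these hand calculations, assuming the roughly USD 4,000 ad spend used in the cost-per-user figure below (the $10.28-per-visit figure would use session counts rather than user counts, not computed here):

Code
ad_spend <- 4000 # approximate campaign cost (USD); see the cost-per-user calculation below

# Lower bound: attribute all 306 Portland users in Dec 2021 to the ad
ad_spend / OR_users_21 # ~ $13.07 per user

# 'Same as last year': attribute only the year-on-year uptick (306 - 144 = 162) to the ad
ad_spend / (OR_users_21 - OR_users_20) # ~ $24.69 per user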

Difference in differences comparison to other cities

Guiding assumptions:

  • the cities used are fairly representative

  • ‘uptick as a percentage’ is unrelated to city size/visits last year

  • all the cities in the comparison group are ‘informative to the counterfactual’ in proportion to their total number of sessions.4


Thus:

For Portland, the year-on-year uptick in users is (306 − 144)/144 = 112.5%.

For ‘all North American cities other than Portland (with greater than 250,000 people)’:

The average is 46.5 users in the 2020 period and 64.5 users in the 2021 period, an uptick of (64.5 − 46.5)/46.5 = about 38.8%.5

38.8% uptick \(\times\) 144 = 55.9 ‘counterfactual uptick’ in users for Portland

162 − 55.9 ≈ 106 ‘uptick relative to counterfactual’

Code
or_uptick_vs_cfl_gt250 <- OR_users_21 - OR_users_20 - nonOR_gt250k_uptick * OR_users_20 # Portland's uptick net of the counterfactual uptick

USD 4000/106 ≈ $37.7 cost per user
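
The same figure in code, reusing the ad_spend (approximately USD 4,000) assumed in the bounding sketch above:

Code
cost_per_user <- ad_spend / or_uptick_vs_cfl_gt250 # ~ USD 37.7 per additional user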

This seems realistic as a first pass.


Plotting the ‘difference in differences’

Scatterplot of unique users by city, both time periods, US cities with 500k–1.5 million population, Portland highlighted

users_by_year
(
  users_by_year <- tl21 %>%
    filter(pop2022 >= 500000 & pop2022 < 1500000) %>%
    ggplot() +
    aes(x = pop2022, y = Users, label = City) +
    geom_point(aes(shape = factor(year)), size = 3, alpha = 0.4) +
    geom_path(aes(group = City)) +
    geom_label(data = subset(tl21, City %in% c('Portland_OR', 'Boston', 'Seattle', 'San Francisco', 'Atlanta', 'San Diego', 'Indianapolis', 'Fort Worth', 'Sacramento', 'Denver', 'San Jose', 'Las Vegas') & year == 2020), alpha = 0.3, size = 2.5) +
    scale_x_continuous(trans = 'log10') +
    theme_minimal()
)

users_by_year
#users_by_year %>%  plotly::ggplotly()

Plotting increases in users by population…

increase_by_size
(
  increase_by_size <- tl21 %>%
    filter(pop2022 >= 500000 & pop2022 < 1500000 & year == 2021) %>%
    ggplot() +
    aes(x = pop2022, y = diff_users, label = City) +
    geom_point(size = 3, alpha = 0.4) +
    geom_label(data = subset(tl21, City %in% c('Portland_OR', 'Boston', 'Seattle', 'San Francisco', 'Atlanta', 'San Diego', 'Indianapolis', 'Fort Worth', 'Sacramento', 'Denver', 'San Jose', 'Las Vegas')), alpha = 0.3, size = 2.5) +
    scale_x_continuous(trans = 'log10') +
    theme_minimal()
)

Plotting proportional increases in users by 2020 users…

Code
(
  prop_increase_by_size <- tl21 %>%
    filter(pop2022 >= 500000 & pop2022 < 1500000 & year == 2021 & lag_users > 0) %>%
    ggplot() +
    aes(x = lag_users, y = prop_diff_users, label = City) +
    geom_point(size = 3, alpha = 0.4) +
    geom_smooth(method = 'lm') +
    geom_text(hjust = 0, vjust = 0, alpha = .5) +
    scale_x_continuous(trans = 'log10') +
    theme_minimal() +
    ggtitle("Mid-sized cities, proportional changes") +
    xlab("Users in 2020") +
    ylab("Proportional change in users")
)

5.5 Next steps, if warranted

The above comparisons are crude and have limitations:

  • All cities are weighted equally, no matter their size or similarity to Portland
  • Some year-to-year idiosyncratic variation may be unrelated to trends or to the trial. We have not ‘quantified this uncertainty’

If we want a more precise estimate and careful CIs,6 we can build an explicit model and simulation. But I want to know the value of precision here before I dig deeper.
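
If we do proceed, a cheap first step would be a placebo (permutation) check, sketched below under our assumptions: treat each large comparison city in turn as a ‘placebo Portland’, compute its uptick net of the counterfactual implied by the remaining cities, and see where Portland’s roughly 106-user net uptick falls in that distribution. A rough sketch, not a worked-out model:

Code
# Placebo check (sketch): how unusual is Portland's net uptick among comparison cities?
library(dplyr)
library(tidyr)

wide <- tl21 %>%
  filter(pop_gt250k == TRUE) %>%
  select(City, year, Users) %>%
  pivot_wider(names_from = year, values_from = Users, names_prefix = "y")

placebo_upticks <- sapply(seq_len(nrow(wide)), function(i) {
  others <- wide[-i, ]
  g <- (mean(others$y2021, na.rm = TRUE) - mean(others$y2020, na.rm = TRUE)) /
    mean(others$y2020, na.rm = TRUE) # counterfactual growth from the other cities
  (wide$y2021[i] - wide$y2020[i]) - g * wide$y2020[i] # city i's uptick net of counterfactual
})

# Share of cities with a net uptick at least as large as Portland's
mean(placebo_upticks >= placebo_upticks[wide$City == "Portland_OR"], na.rm = TRUE)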


  1. Note: the code below reads directly from the folder structure and is thus not portable. Ideally, we would modify this to read it in directly from the private Github repo for those with access.↩︎

  2. Downloaded from this link, accessed on 17 Mar 2022; we also add the 15 largest Canadian cities (hand-input).↩︎

  3. The present very simple section is hand-input for now…↩︎

  4. I’m waving the hands a bit here; this is not precise.↩︎

  5. This is very similar to the result if we look at all cities, which yields an uptick of 43.1%.↩︎

  6. Confidence, credible, or ‘consistent’ intervals, depending on the model and terminology.↩︎