p(y) with a 5% margin of error? What if we want to split this up by age, sex, race, and party? Assume ≈ 100 age, 2 sex, 5 race, 3 party Jake Hofman (Columbia University) Intro to Counting February 1, 2019 4 / 30
to small sample sizes or sparsity (e.g., ∼ 100 age × 2 sex × 5 race × 3 party = 3,000 groups, but typical surveys collect ∼ 1,000s of responses) Jake Hofman (Columbia University) Intro to Counting February 1, 2019 5 / 30
observations into larger, but fewer, groups (e.g., bin age into a few groups: 18-29, 30-49, 50-64, 65+) Jake Hofman (Columbia University) Intro to Counting February 1, 2019 5 / 30
well from small samples (e.g., ﬁt a model: support ∼ β0 + β1age + β2age2 + . . .) Jake Hofman (Columbia University) Intro to Counting February 1, 2019 5 / 30
so we can just count and divide to make estimates via relative frequencies (e.g., with ∼ 1M responses, we have 100s per group and can estimate support within a few percentage points) Jake Hofman (Columbia University) Intro to Counting February 1, 2019 6 / 30
computationally challenging at large scales (1M is easy, 1B a bit less so, 1T gets interesting) Jake Hofman (Columbia University) Intro to Counting February 1, 2019 8 / 30
on a single machine But the same ideas extend to counting at large scales on many machines (Hadoop, Spark, etc.) Jake Hofman (Columbia University) Intro to Counting February 1, 2019 9 / 30
Load dataset into memory • Split: Arrange observations into groups of interest • Apply: Compute distributions and statistics within each group • Combine: Collect results across groups 1http://bit.ly/splitapplycombine Jake Hofman (Columbia University) Intro to Counting February 1, 2019 10 / 30
each observation as (group, value): place value in bucket for corresponding group for each group: apply a function over values in bucket output group and result Jake Hofman (Columbia University) Intro to Counting February 1, 2019 12 / 30
each observation as (group, value): place value in bucket for corresponding group for each group: apply a function over values in bucket output group and result Useful for computing arbitrary within-group statistics when we have required memory (e.g., conditional distribution, median, etc.) Jake Hofman (Columbia University) Intro to Counting February 1, 2019 12 / 30
5 Rating Number of ratings group by rating value for each group: count # ratings Jake Hofman (Columbia University) Intro to Counting February 1, 2019 16 / 30
9,000 Movie Rank CDF group by movie id for each group: count # ratings sort by group size cumulatively sum group sizes Jake Hofman (Columbia University) Intro to Counting February 1, 2019 20 / 30
id for each group: compute median movie rank 0 2,000 4,000 6,000 8,000 100 10,000 User eccentricity Number of users Jake Hofman (Columbia University) Intro to Counting February 1, 2019 22 / 30
levels Observations Movielens 100K 10K 10 10M Netﬂix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Jake Hofman (Columbia University) Intro to Counting February 1, 2019 23 / 30
levels Observations Movielens 100K 10K 10 10M Netﬂix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Sampling? Unreliable estimates for rare groups Jake Hofman (Columbia University) Intro to Counting February 1, 2019 23 / 30
levels Observations Movielens 100K 10K 10 10M Netﬂix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Random access from disk? 1000x more storage, but 1000x slower2 2Numbers every programmer should know Jake Hofman (Columbia University) Intro to Counting February 1, 2019 23 / 30
levels Observations Movielens 100K 10K 10 10M Netﬂix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Streaming Read data one observation at a time, storing only needed state Jake Hofman (Columbia University) Intro to Counting February 1, 2019 23 / 30
value): if new group: initialize result update result for corresponding group as function of existing result and current value for each group: output group and result Jake Hofman (Columbia University) Intro to Counting February 1, 2019 24 / 30
value): if new group: initialize result update result for corresponding group as function of existing result and current value for each group: output group and result Useful for computing a subset of within-group statistics with a limited memory footprint (e.g., min, mean, max, variance, etc.) Jake Hofman (Columbia University) Intro to Counting February 1, 2019 24 / 30
id]++ for each group: totals[movie id] / counts[movie id] 1 2 3 4 5 Mean Rating by Movie Density Jake Hofman (Columbia University) Intro to Counting February 1, 2019 26 / 30
(group, value): histogram[group][value]++ for each group: compute result as a function of histogram output group and result Jake Hofman (Columbia University) Intro to Counting February 1, 2019 27 / 30
(group, value): histogram[group][value]++ for each group: compute result as a function of histogram output group and result We can recover arbitrary statistics if we can aﬀord to store counts of all distinct values within in each group Jake Hofman (Columbia University) Intro to Counting February 1, 2019 27 / 30
Statistics N Small dataset Yes General V*G Small distributions Yes General G Small # groups No Combinable V Small # outcomes No No 1 Large # both No No N = total number of observations G = number of distinct groups V = largest number of distinct values within group Jake Hofman (Columbia University) Intro to Counting February 1, 2019 28 / 30
N ∼ 100M ratings G ∼ 20K movies V ∼ 10 half-star values V *G ∼ 200K, store per-group histograms for arbitrary statistics (scales to arbitrary N, if you’re patient) Jake Hofman (Columbia University) Intro to Counting February 1, 2019 29 / 30
N ∼ 10B ratings G ∼ 1B videos V ∼ 10 half-star values V *G ∼ 10B, fails because per-group histograms are too large to store in memory G ∼ 1B, but no (exact) calculation for streaming median Jake Hofman (Columbia University) Intro to Counting February 1, 2019 29 / 30
N ∼ 10B ratings G ∼ 1B videos V ∼ 10 half-star values G ∼ 1B, use streaming to compute combinable statistics Jake Hofman (Columbia University) Intro to Counting February 1, 2019 29 / 30
Statistics N Small dataset Yes General V*G Small distributions Yes General G Small # groups No Combinable V Small # outcomes Yes General 1 Large # both No Combinable N = total number of observations G = number of distinct groups V = largest number of distinct values within group Jake Hofman (Columbia University) Intro to Counting February 1, 2019 30 / 30