Sampling for Big Data: content summary

…per item:
– Fine if computation time < inter-arrival time
– Otherwise build up computation backlog O(N)
◊ Better: "skip counting"
– Find random index m(n) of next selection > n
– Distribution: Prob[m(n) ≤ m] = 1 − (1 − p_{n+1})(1 − p_{n+2})⋯(1 − p_m)
◊ Expected number of selections from the stream is k + Σ_{k<m≤N} p_m = k + Σ_{k<m≤N} k/m = O(k(1 + ln(N/k)))
◊ Vitter '85 provided an algorithm with this average running time (a sketch of sampling m(n) appears after this section)

Reservoir Sampling via Order Sampling
◊ Order sampling, a.k.a. bottom-k sampling, min-hashing
◊ Uniform sampling of a stream into a reservoir of size k
◊ Each arrival n: generate a one-time random value r_n ~ U[0,1]
– r_n also known as hash, rank, tag…
◊ Store the k items with the smallest random tags (heap-based sketch below)
⇒ Each item has the same chance of holding the least tag, so the sample is uniform
⇒ Fast to implement via a priority queue
⇒ Can run on multiple input streams separately, then merge

Handling Weights
◊ So far: uniform sampling from a stream using a reservoir
◊ Extend to non-uniform sampling from weighted streams
– Easy case: k = 1
– Sampling probability p(n) = x_n / W_n where W_n = Σ_{i=1}^{n} x_i
◊ k > 1 is harder
– Can have elements with large weight: would be sampled with probability > 1?
◊ A number of different weighted order-sampling schemes have been proposed to realize desired distributional objectives
– Rank r_n = f(u_n, x_n) for some function f and u_n ~ U[0,1]
– k-mins sketches [Cohen 1997], bottom-k sketches [Cohen, Kaplan 2007]
– [Rosén 1972], weighted random sampling [Efraimidis, Spirakis 2006]
– Order PPS sampling [Ohlsson 1990, Rosén 1997]
– Priority sampling [Duffield, Lund, Thorup 2007], [Alon, Duffield, Lund, Thorup 2005]

Weighted Random Sampling
◊ Weighted random sampling [Efraimidis, Spirakis 06] generalizes min-wise sampling (sketch below)
– For each item, draw r_n uniformly at random in [0,1]
– Compute the 'tag' of an item as r_n^(1/x_n)
– Keep the items with the k largest tags (equivalently, the k smallest values of −ln(r_n)/x_n, matching the min-wise convention)
– Can prove the correctness of the exponential sampling distribution
◊ Can also be made efficient via skip-counting ideas

Priority Sampling
◊ Each item x_i is given priority z_i = x_i / r_i, with r_i uniform random in (0,1]
◊ Maintain a reservoir of the k+1 items (x_i, z_i) of highest priority (sketch below)
◊ Estimation
– Let z* = the (k+1)st highest priority
– Top-k priority items: weight estimate x'_i = max{x_i, z*}
– All other items: weight estimate zero
◊ Statistics and bounds
– x'_i is unbiased; zero covariance: Cov[x'_i, x'_j] = 0 for i ≠ j
– Relative variance for any subset sum ≤ 1/(k−1) [Szegedy 2006]

Priority Sampling in Databases
◊ One-time sample preparation
– Compute priorities of all items, sort in decreasing priority order
□ No discard
◊ Sample and estimate
– Estimate any subset sum X(S) = Σ_{i∈S} x_i by X'(S) = Σ_{i∈S'} x'_i for some S' ⊆ S
– Method: select items in decreasing priority order
◊ Two variants: bounded variance or bounded complexity
1. S' = first k items from S: relative variance bounded ≤ 1/(k−1)
□ x'_i = max{x_i, z*} where z* = (k+1)st highest priority in S
2. S' = items from S among the first k overall: execution time O(k)
□ x'_i = max{x_i, z*} where z* = (k+1)st highest priority overall [Alon et al., 2005]
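The skip-counting distribution above can be sampled by direct inversion. The following is a minimal Python sketch (the function name and the choice p_j = k/j, the standard reservoir inclusion probability, are my assumptions): it scans forward until the CDF passes a uniform draw, so it illustrates the target distribution rather than reproducing the O(1)-per-selection rejection technique of Vitter '85.

```python
import random

def next_selection(n, k):
    """Sample the index m(n) > n of the next reservoir selection by
    inverting Prob[m(n) <= m] = 1 - prod_{j=n+1..m} (1 - p_j),
    assuming the usual reservoir inclusion probability p_j = k/j."""
    u = random.random()            # uniform quantile to invert
    survive = 1.0                  # running product of (1 - p_j)
    m = n
    while True:
        m += 1
        survive *= 1.0 - k / m     # one more item skipped
        if 1.0 - survive >= u:     # CDF reaches u: item m is the next pick
            return m
```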
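A minimal heap-based sketch of order (bottom-k) sampling, using Python's heapq; the function names are mine. Keeping the tags alongside the items is what makes per-stream samples mergeable.

```python
import heapq
import random

def bottom_k_sample(stream, k):
    """Keep the k items with the smallest random tags. heapq is a
    min-heap, so tags are negated to evict the largest kept tag fast."""
    heap = []                                  # entries: (-tag, item)
    for item in stream:
        tag = random.random()                  # one-time rank r_n ~ U[0,1]
        if len(heap) < k:
            heapq.heappush(heap, (-tag, item))
        elif tag < -heap[0][0]:                # smaller than largest kept tag
            heapq.heapreplace(heap, (-tag, item))
    return sorted((-neg, item) for neg, item in heap)   # (tag, item) pairs

def merge_bottom_k(samples, k):
    """Merge per-stream samples by keeping the k smallest tags overall."""
    return sorted(pair for sample in samples for pair in sample)[:k]
```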
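A sketch of the Efraimidis-Spirakis tag rule under the largest-tags convention (the function name is mine): an item of weight x gets tag r^(1/x), which is stochastically closer to 1 the heavier the item is.

```python
import heapq
import random

def weighted_order_sample(stream, k):
    """Weighted random sampling after [Efraimidis, Spirakis 06]:
    keep the k items with the largest tags r ** (1/x)."""
    heap = []                                  # min-heap of (tag, item)
    for item, x in stream:                     # stream of (item, weight) pairs
        tag = random.random() ** (1.0 / x)
        if len(heap) < k:
            heapq.heappush(heap, (tag, item))
        elif tag > heap[0][0]:                 # beats the smallest kept tag
            heapq.heapreplace(heap, (tag, item))
    return heap                                # (tag, item) pairs
```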
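A sketch of priority sampling with its subset-sum estimator, assuming the stream holds more than k items (names are mine):

```python
import heapq
import random

def priority_sample(stream, k):
    """Priority sampling: priority z = x / r with r ~ U(0,1]; keep the
    k+1 highest priorities and use the (k+1)st as the threshold z*."""
    heap = []                                  # min-heap of (z, x, item)
    for item, x in stream:                     # stream of (item, weight) pairs
        z = x / (1.0 - random.random())        # 1 - U[0,1) is uniform on (0,1]
        if len(heap) < k + 1:
            heapq.heappush(heap, (z, x, item))
        elif z > heap[0][0]:
            heapq.heapreplace(heap, (z, x, item))
    z_star = heap[0][0]                        # (k+1)st highest priority
    top_k = sorted(heap, reverse=True)[:k]     # drop the threshold item
    # Unbiased weight estimates: x' = max(x, z*) for the kept top-k items,
    # zero for everything else; subset sums are estimated by summing x'.
    return [(item, max(x, z_star)) for z, x, item in top_k]
```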
Making Stream Samples Smarter
◊ Observation: we see the whole stream, even if we can't store it
– Can keep more information about sampled items if they repeat
– Simple information: if an item is sampled, count all its repeats
◊ Counting Samples [Gibbons, Matias 98] (sketch below)
– Sample new items with fixed probability p, count repeats as c_i
– Unbiased estimate of an item's total count: 1/p + (c_i − 1)
◊ Sample and Hold [Estan, Varghese 02]: generalizes to weighted keys
– A new key with weight b is sampled with probability 1 − (1 − p)^b
◊ Lower variance compared with independent sampling
– But the sample size will grow as pn
◊ Adaptive sample and hold: reduce p when needed
– "Sticky sampling": geometric decreases in p [Manku, Motwani 02]
– Much subsequent work tuning the decrease in p to maintain sample size

Sketch Guided Sampling
◊ Go further: avoid sampling the heavy keys so much
– Uniform sampling will pick from the heavy keys again and again
◊ Idea: use an oracle to tell when a key is heavy [Kumar, Xu 06] (sketch below)
– Adjust the sampling probability accordingly
◊ Can use a "sketch" data structure to play the role of the oracle
– Like a hash table with collisions; tracks approximate frequencies
– E.g. (counting) Bloom filters, Count-Min sketch
◊ Track the probability with which each key was sampled, and use Horvitz-Thompson (HT) estimators
– Set the probability of sampling a key with (estimated) weight w to 1/(1 + εw) for a parameter ε: it decreases as w increases
– Decreasing ε improves accuracy but increases the sample size

Challenges for Smart Stream Sampling
◊ Current router constraints
– Flow tables maintained in fast, expensive SRAM
□ To support per-packet key lookup at line rate
◊ Implementation requirements
– Sample and Hold: still needs a per-packet lookup
– Sampled NetFlow: (uniform) sampling reduces the lookup rate
□ Easier to implement despite inferior statistical properties
◊ Long development times to realize new sampling algorithms
◊ Similar concerns affect sampling in other applications
– Processing large amounts of data requires awareness of the hardware
– Uniform sampling means no coordination is needed in a distributed setting

Future for Smarter Stream Sampling
◊ Software Defined Networking
– Current: proprietary software running on…
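A minimal sketch of Counting Samples, i.e. Sample and Hold with unit weights (b = 1); the function name is mine:

```python
import random

def sample_and_hold(stream, p):
    """Counting Samples [Gibbons, Matias 98]: sample a new key with
    probability p; once a key is held, count every later repeat."""
    counts = {}                                  # key -> c_i since being sampled
    for key in stream:
        if key in counts:
            counts[key] += 1                     # held keys count all repeats
        elif random.random() < p:
            counts[key] = 1                      # key newly sampled into table
    # Unbiased estimate of each held key's true count: 1/p + (c_i - 1)
    return {key: 1.0 / p + (c - 1) for key, c in counts.items()}
```

For a weighted arrival of weight b, the new-key test becomes random.random() < 1 - (1 - p) ** b, matching the Sample and Hold inclusion probability above.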
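A simplified sketch of sketch-guided sampling: a small Count-Min sketch acts as the heaviness oracle, each packet of a key with estimated weight w is sampled with probability 1/(1 + εw), and the HT weight 1/prob is kept for unbiased sum estimation. The class, function names, and table sizes are my assumptions, not the exact construction of [Kumar, Xu 06].

```python
import hashlib
import random

class CountMin:
    """Minimal Count-Min sketch, used here only as the heaviness oracle."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def update(self, key):
        for row, col in self._cells(key):
            self.table[row][col] += 1

    def estimate(self, key):                    # overestimates under collisions
        return min(self.table[row][col] for row, col in self._cells(key))

def sketch_guided_sample(stream, eps=0.1):
    oracle, sample = CountMin(), []
    for key in stream:
        w = oracle.estimate(key)                # approximate weight seen so far
        prob = 1.0 / (1.0 + eps * w)            # heavier keys sampled less often
        if random.random() < prob:
            sample.append((key, 1.0 / prob))    # keep Horvitz-Thompson weight
        oracle.update(key)
    return sample
```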