Sampling for Big Data: content summary
… per item:
  – Fine if computation time < interarrival time
  – Otherwise build up computation backlog O(N)
◊ Better: “skip counting”
  – Find random index m(n) > n of the next selection after item n
  – Distribution: $\Pr[m(n) \le m] = 1 - (1 - p_{n+1})(1 - p_{n+2}) \cdots (1 - p_m)$
◊ Expected number of selections from stream is $k + \sum_{k < m \le N} p_m = k + \sum_{k < m \le N} k/m = O(k(1 + \ln(N/k)))$
◊ Vitter ’85 provided an algorithm with this average running time

Sampling for Big Data: Reservoir Sampling via Order Sampling
◊ Order sampling, a.k.a. bottom-k sample, min-hashing
◊ Uniform sampling of stream into reservoir of size k
◊ Each arrival n: generate one-time random value $r_n \sim U[0,1]$
  – $r_n$ also known as hash, rank, tag…
◊ Store k items with the smallest random tags
◊ Each item has the same chance of least tag, so uniform
◊ Fast to implement via priority queue
◊ Can run on multiple input streams separately, then merge

Handling Weights
◊ So far: uniform sampling from a stream using a reservoir
◊ Extend to non-uniform sampling from weighted streams
  – Easy case: k = 1
  – Sampling probability $p(n) = x_n / W_n$ where $W_n = \sum_{i=1}^{n} x_i$
◊ k > 1 is harder
  – Can have elements with large weight: would they be sampled with probability 1?
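The easy case k = 1 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stream is taken to be an iterable of (item, weight) pairs with positive weights, and the names `weighted_sample_one` and `rng` are hypothetical, not from the slides. Each arrival n replaces the held item with probability x_n / W_n, so item i ends up held with probability x_i / W_N once N items have been seen.

```python
import random


def weighted_sample_one(stream, rng=None):
    """Return one item from (item, weight) pairs, chosen with
    probability proportional to its weight; None for an empty stream."""
    rng = rng or random.Random()
    chosen = None
    total = 0.0  # running total weight W_n
    for item, weight in stream:
        if weight <= 0:
            continue  # assumption: non-positive weights are skipped
        total += weight
        # Replace the held item with probability x_n / W_n.
        # For the first item this ratio is 1, so it is always taken.
        if rng.random() < weight / total:
            chosen = item
    return chosen
```

Only the running total and a single item are stored, so memory use is constant regardless of stream length.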
◊ A number of different weighted order-sampling schemes have been proposed to realize desired distributional objectives
  – Rank $r_n = f(u_n, x_n)$ for some function f and $u_n \sim U[0,1]$
  – k-mins sketches [Cohen 1997], Bottom-k sketches [Cohen Kaplan 2020]
  – [Rosen 1972], Weighted random sampling [Efraimidis Spirakis 2006]
  – Order PPS Sampling [Ohlsson 1990, Rosen 1997]
  – Priority Sampling [Duffield Lund Thorup 2020], [Alon+DLT 2020]

Weighted random sampling
◊ Weighted random sampling [Efraimidis Spirakis 06] generalizes min-wise
  – For each item draw $r_n$ uniformly at random in range [0,1]
  – Compute the ‘tag’ of an item as $r_n^{1/x_n}$
  – Keep the items with the k largest tags (equivalently, the k smallest values of $-\ln(r_n)/x_n$, which are exponentially distributed with rate $x_n$)
  – Can prove the correctness of the exponential sampling distribution
◊ Can also make efficient via skip counting ideas

Priority Sampling
◊ Each item $x_i$ given priority $z_i = x_i / r_i$ with $r_i$ uniform random in (0,1]
◊ Maintain reservoir of k+1 items $(x_i, z_i)$ of highest priority
◊ Estimation
  – Let z* = (k+1)st highest priority
  – Top-k priority items: weight estimate $x'_i = \max\{x_i, z^*\}$
  – All other items: weight estimate zero
◊ Statistics and bounds
  – $x'_i$ unbiased; zero covariance: $\mathrm{Cov}[x'_i, x'_j] = 0$ for i ≠ j
  – Relative variance for any subset sum ≤ 1/(k−1) [Szegedy, 2020]

Priority Sampling in Databases
◊ One Time Sample Preparation
  – Compute priorities of all items, sort in decreasing priority order
    □ No discard
◊ Sample and Estimate
  – Estimate any subset sum $X(S) = \sum_{i \in S} x_i$ by $X'(S) = \sum_{i \in S'} x'_i$ for some $S' \subseteq S$
  – Method: select items in decreasing priority order
◊ Two variants: bounded variance or complexity
  1. S' = first k items from S: relative variance bounded ≤ 1/(k−1)
    □ $x'_i = \max\{x_i, z^*\}$ where z* = (k+1)st highest priority in S
  2. S' = items from S in first k: execution time O(k)
    □ $x'_i = \max\{x_i, z^*\}$ where z* = (k+1)st highest priority [Alon et al., 2020]

Making Stream Samples Smarter
◊ Observation: we see the whole stream, even if we can’t store it
  – Can keep more information about sampled items if repeated
  – Simple information: if item sampled, count all repeats
◊ Counting Samples [Gibbons & Matias 98]
  – Sample new items with fixed probability p, count repeats as $c_i$
  – Unbiased estimate of total count: $1/p + (c_i - 1)$
◊ Sample and Hold [Estan & Varghese 02]: generalize to weighted keys
  – New key with weight b sampled with probability $1 - (1 - p)^b$
◊ Lower variance compared with independent sampling
  – But sample size will grow as pn
◊ Adaptive sample and hold: reduce p when needed
  – “Sticky sampling”: geometric decreases in p [Manku, Motwani 02]
  – Much subsequent work tuning decrease in p to maintain sample size

Sketch Guided Sampling
◊ Go further: avoid sampling the heavy keys as much
  – Uniform sampling will pick from the heavy keys again and again
◊ Idea: use an oracle to tell when a key is heavy [Kumar Xu 06]
  – Adjust sampling probability accordingly
◊ Can use a “sketch” data structure to play the role of oracle
  – Like a hash table with collisions, tracks approximate frequencies
  – E.g. (Counting) Bloom Filters, Count-Min Sketch
◊ Track probability with which key is sampled, use Horvitz-Thompson (HT) estimators
  – Set probability of sampling key with (estimated) weight w as $1/(1 + \varepsilon w)$ for parameter ε: decreases as w increases
  – Decreasing ε improves accuracy, increases sample size

Challenges for Smart Stream Sampling
◊ Current router constraints
  – Flow tables maintained in fast expensive SRAM
    □ To support per-packet key lookup at line rate
◊ Implementation requirements
  – Sample and Hold: still need per-packet lookup
  – Sampled NetFlow: (uniform) sampling reduces lookup rate
    □ Easier to implement despite inferior statistical properties
◊ Long development times to realize new sampling algorithms
◊ Similar concerns affect sampling in other applications
  – Processing large amounts of data needs awareness of hardware
  – Uniform sampling means no coordination needed in distributed setting

Future for Smarter Stream Sampling
◊ Software Defined Networking
  – Current: proprietary software running on …
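As a rough illustration of the Priority Sampling slide, the Python sketch below keeps the k+1 highest priorities in a min-heap and returns the weight estimates max{x_i, z*} for the top k items. It assumes positive weights supplied as a list; the name `priority_sample`, its return shape (a dict from item index to estimate) and the `rng` parameter are hypothetical choices made for this example, not part of the slides.

```python
import heapq
import random


def priority_sample(weights, k, rng=None):
    """Return {index: weight estimate} for a priority sample of size k."""
    rng = rng or random.Random()
    heap = []  # min-heap of (priority, index, weight); holds at most k+1 entries
    for i, x in enumerate(weights):
        r = 1.0 - rng.random()  # uniform in (0, 1], avoids division by zero
        z = x / r               # priority z_i = x_i / r_i
        if len(heap) < k + 1:
            heapq.heappush(heap, (z, i, x))
        elif z > heap[0][0]:
            heapq.heapreplace(heap, (z, i, x))
    if len(heap) <= k:
        # Fewer than k+1 items seen: nothing was discarded, weights are exact.
        return {i: x for _, i, x in heap}
    z_star = heap[0][0]  # the (k+1)st highest priority
    # heap[0] is the minimum, so heap[1:] are the k highest-priority items.
    return {i: max(x, z_star) for _, i, x in heap[1:]}
```

A subset sum X(S) is then estimated by adding the returned estimates for the indices in S; indices missing from the result contribute zero.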
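The Counting Samples scheme from the Making Stream Samples Smarter slide can be sketched as follows. This is a simplified illustration with a fixed probability p and no adaptive reduction of p; the function names `counting_sample` and `estimate_count` are hypothetical, and the stream is assumed to be an iterable of hashable keys.

```python
import random


def counting_sample(stream, p, rng=None):
    """Sample new keys with probability p; once a key is held, count every repeat."""
    rng = rng or random.Random()
    counts = {}
    for key in stream:
        if key in counts:
            counts[key] += 1   # key already held: count the repeat exactly
        elif rng.random() < p:
            counts[key] = 1    # key enters the sample at this occurrence
    return counts


def estimate_count(counts, key, p):
    """Estimate a key's total count as 1/p + (c - 1); keys not sampled get 0."""
    c = counts.get(key)
    return 0.0 if c is None else 1.0 / p + (c - 1)
```

The 1/p term accounts for the occurrences seen before the key entered the sample, while the repeats counted afterwards are exact.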