[rakutentechconf2013] [d-3_2] Counting Big Data by Streaming Algorithms

Counting Big Data by Streaming Algorithms 2013/10/26 @ Rakuten Technology Conference 2013 Rakuten Institute of Technology, Rakuten, Inc., Yusaku Kaneta http://www.rakuten.co.jp/

Upload: rakuten-inc

Post on 23-Jan-2015

Category:

Technology


DESCRIPTION

Rakuten Technology Conference 2013 "Counting Big Data by Streaming Algorithms" Yusaku Kaneta (Rakuten)

TRANSCRIPT

  • 1. Counting Big Data by Streaming Algorithms 2013/10/26 @ Rakuten Technology Conference 2013 Rakuten Institute of Technology, Rakuten, Inc., Yusaku Kaneta http://www.rakuten.co.jp/

  • 2. Who am I? Yusaku Kaneta (@yusakukaneta). Joined Rakuten in April 2012; Rakuten Institute of Technology (RIT). Interests: string processing (esp. pattern matching), hardware design using FPGA, bitwise tricks & techniques. Love TAOCP 7.1.3 & Hacker's Delight.
  • 3. Problem: Count Big Data. Counting is a fundamental operation in data analysis. Big data is difficult even just to count, because counting needs a huge amount of memory. E.g., 400GB+ is needed for one year of access logs.
  • 4. Batch Processing. Batch processing can solve this. Two issues: high latency, and the requirement for a cluster of machines (= high cost).
  • 5. Our Goals. 1. Reduce memory (cost reduction). 2. Reduce latency (quick business decisions). 3. Achieve high accuracy (correct business decisions).
  • 6. Our Approach: streaming algorithms. They can fulfill all our goals! They have become common in Web companies; see the paper on Google's PowerDrill and the code of Twitter's Algebird for examples of how they are used. Keys: limited memory, low latency, theoretical guarantee for accuracy.
  • 7. Streaming Algorithm Library. RIT internally provides a C library for streaming algorithms, libsketch. Three advantages: memory efficient, high speed, high accuracy. It also comes with language bindings.
  • 8. Why C? Our target: Python & Ruby users, for data analysis and for stream processing. But most existing libraries are written in Scala (algebird), Java (stream-lib), ... This is one reason why our library is written in C: it is easy to incorporate C libraries in Python & Ruby.
  • 9. Application
  • 10. Count Query in Rakuten. Example: We want to know... 1. How many unique users checked an item in one day (month, or year)? 2. How many products were sold in one day (month, or year)? Streaming algorithms for these queries: 1. HyperLogLog algorithm. 2. Count-Min Sketch algorithm.
  • 12. Problem: Unique Item Count. Naive approach: use a dict in Python: dict[key] += 1. This can require a large amount of memory. Streaming algorithm: HyperLogLog counts unique items approximately and needs only a fixed amount of memory. Google recently proposed an improved version of HyperLogLog, called HyperLogLog++.
  • 13. HyperLogLog. Basic ideas: hash function, harmonic mean, stochastic averaging.
  • 14. HyperLogLog Algorithm. Keys: 1. Set i to the upper bits of the hash value. 2. Let j = (# leading 0s in the lower bits) + 1, and set A[i] to max(j, A[i]). Example: hash(Item1) = 0001 00000110 (upper bits 0001, lower bits 00000110), so $i = (0001)_2 = 1$ and $j = 5 + 1 = 6$, and A[1] becomes 6. 3. Estimate # unique items from $E = 1/\sum_i 2^{-A[i]}$. (In practice, we use heuristics for corrections.)
  • 15. Demo: Naive vs. HyperLogLog.
  • 16. Performance. Task: count unique items in an item set. Memory efficient (1%): 1193MB for the naive approach vs. 5MB for HyperLogLog. High speed (4x): 419sec vs. 108sec. High accuracy (-1%): 100% vs. 99%. This data set is small, but we are using HyperLogLog for bigger data.
  • 17. Conclusion. Streaming algorithms in Rakuten: we are using them for data analysis. We have an internal C library with bindings, covering HyperLogLog, Count-Min Sketch, and so on. Future: we plan to implement other algorithms.
  • 18. Reference. HyperLogLog & HyperLogLog++: [Flajolet et al., AOFA 2007], [Heule et al., EDBT 2013]. Count-Min Sketch: [Cormode, Muthukrishnan, J. Algorithms, 2005]. An excellent slide by Alex Smola: http://alex.smola.org/teaching/berkeley2012/slides/3_Streams.pdf. AK TECH BLOG by Aggregate Knowledge: http://blog.aggregateknowledge.com/. Stream-lib by Clearspring: https://github.com/clearspring/stream-lib
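The three steps on the "HyperLogLog Algorithm" slide (index from the upper hash bits, rank from the lower bits, harmonic-mean estimate) can be sketched in Python roughly as follows. This is an illustrative toy, not libsketch or its API, which the slides do not show. The names `index_and_rank` and `HyperLogLog`, the use of SHA-1 truncated to 64 bits as the hash function, and the choice of 12 index bits are all assumptions made for the example. The scaling constant and the small-range correction follow the standard estimator in [Flajolet et al., AOFA 2007]; the slides only say that heuristics are used for corrections.

```python
import hashlib
import math


def index_and_rank(h, total_bits, p):
    """Split a hash value into (i, j) as on the slide.

    i = the upper p bits; j = (# leading 0s in the lower bits) + 1.
    """
    lower_bits = total_bits - p
    i = h >> lower_bits
    w = h & ((1 << lower_bits) - 1)
    j = lower_bits - w.bit_length() + 1
    return i, j


class HyperLogLog:
    """Toy HyperLogLog counter (hypothetical class, for illustration only)."""

    HASH_BITS = 64

    def __init__(self, p=12):
        self.p = p                    # number of upper bits used as index i
        self.m = 1 << p               # number of entries in array A
        self.A = [0] * self.m
        # Bias-correction constant from Flajolet et al. (for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Assumption: SHA-1 truncated to 64 bits stands in for the hash function.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        i, j = index_and_rank(h, self.HASH_BITS, self.p)   # step 1
        self.A[i] = max(j, self.A[i])                       # step 2

    def estimate(self):
        # Step 3: harmonic-mean estimate, scaled by alpha * m^2.
        e = self.alpha * self.m * self.m / sum(2.0 ** -a for a in self.A)
        zeros = self.A.count(0)
        if e <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            e = self.m * math.log(self.m / zeros)
        return e


if __name__ == "__main__":
    # The worked example from the slide: hash(Item1) = 0001 00000110.
    print(index_and_rank(0b000100000110, 12, 4))

    hll = HyperLogLog()
    for n in range(100000):
        hll.add("user-%d" % n)
        hll.add("user-%d" % n)        # duplicates do not change the sketch
    print(round(hll.estimate()))
```

With 12 index bits the array has 4096 entries regardless of how many items are added, which is the "fixed amount of memory" property the slides rely on. The expected relative error of the standard estimator is about 1.04/sqrt(4096), or roughly 1.6%.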