Scalding at Etsy
Posted 27-Jan-2015
Description: How Scalding came to be used for analysis at Etsy.
1. So hey everybody, my name is Dan McKinley.
2. I'm visiting from LA.
3. I worked for Etsy for 6.5 years, mostly from Brooklyn. In an office considerably less sparse than this one, I assure you. Mea culpa, that's "worked" in the past tense. I quit to join a startup last month, after signing up to give this talk. But I left on very good terms, so I'm still doing it.
4. This talk is about Scalding, and how we wound up using it at Etsy.
5. When I was writing this talk, this passage from Douglas Adams kept popping into my brain. I do feel like we had Scalding thrust upon us at Etsy, rather than choosing it intentionally. Which is not the same as saying that I was personally unhappy with it, exactly. I was not. This is the character that went on to try to insult every being in the cosmos in alphabetical order. So I'm not sure if that counts as intentional allegory about the Scala community.
6. The first thing I wanted to do was give an overview of how Etsy uses Scalding now.
7. This is hopefully the only Strata-esque slide in the talk. Don't run for the exits or anything. What I want to communicate with it is that, in the abstract, we aggregate logs from the live site and put them on HDFS. From there we crunch them to build internal tooling and features. For live features we're putting job outputs into MySQL shards; for backend tools we typically use a BI database (Vertica) to fill the same need.
8. Scalding gets used at all points on the Hadoop side. Parsing logs, generating recommendations and ranking datasets, and business intelligence are all either done in Scalding or will be ported to Scalding very shortly.
9. There are a bunch of ways that people use analytics at Etsy. The way you get your answers depends on the kind of question you're asking.
10. I'll go through some examples. This is a simple one. Let's say you just want to know how many shops open up a day.
11. That's a pretty common question. And so somebody's thought of it way before you, and they've put it on a dashboard. So you can just go look at the dashboard.
12. Another kind of question is about how an A/B test you're running is doing.
13. We do a lot of A/B testing at Etsy, so much so that we've built our own A/B analyzer frontend called Catapult. For most questions relating to variants in A/B tests you can go to that.
14. Then there are slightly more complicated questions. Like, how many of the top sellers sell vintage goods? Maybe you're the first person to ever ask such a question.
15. But people have thought of questions that are kind of similar to it before. And in most of those cases you can go ask the BI database.
16. And then there are questions that are even farther out there. Cases where you're probably the first person to ask not just this specifically, but also probably the first person to ask any question even similar to it. Like this one: Etsy gets traffic to items that are sold. How often could we redirect that traffic to items that have close tags and titles?
17. That's the kind of thing you'd use Scalding to answer today. We have the data in theory, but we haven't normalized it and put it in BI. Or maybe it's too big to fit in BI.
18. A very common kind of novel question relates to debugging A/B tests.
19. We do a ton of that with Scalding too.
20. I conceptualize our data universe as having three domains.
21. There are questions we've anticipated, questions we didn't anticipate, and then there are permanent systems.
22. Like I said, we have tooling support for the first domain. And we use Scalding for the second two.
23. That's questions where the data needed to get an answer is in a relatively raw form, which I'll wave my hands at and call analysis. And then we also build features and systems with Scalding, which is more like what I'd call engineering. We do work for ranking, for recommendations, and so on in Scalding.
24. Let me give you some idea of how big of a thing this is.
25. It's pretty big, I guess. When I quit we had about 800 Scalding jobs in source control.
And if everyone is like me, there are probably twice as many in working directories, not committed. Only about 90 of those, though, run as part of our nightly batch process.
26. 58 people had written Scalding jobs.
27. And 14 of them figured out how to use Algebird. Etsy's engineering team, by the way, is like 150 programmers.
28. This histogram showing how many jobs people have written is about what you'd expect. There's a small group of people like me who have written a ton of jobs, and most people have written one or two jobs.
29. And the way it breaks down across the domains is like this. Most of the people using Scalding are using it to answer analytics questions. The experts tend to be the people building systems with Scalding.
30. So why would we pick Scalding?
31. Well, we didn't really pick it on purpose. It was an accident.
32. To explain how that accident happened, I guess I first have to explain how we got started with analytics.
33. And that was kind of an accident too. We didn't necessarily set out to build something to replace Google Analytics.
34. What we did do was buy an advertising startup called Adtuitive back in 2009.
35. And those guys brought something with them called cascading.jruby. For our purposes you can consider this to be pretty close to Pig, but using JRuby.
36. This is a really simple example of a job written in cascading.jruby. Hopefully you'll just believe me that the Java equivalent would be Byzantine.
37. The thing we wanted to get out of that acquisition was this feature: the paid promoted listings that you see when you search on Etsy. In the beginning we pretty much just wanted to build whatever we needed to have this.
38. But to do that we needed things like impression logging and frontend feedback. So we started collecting event beacons from our frontend.
39. And shipped those beacon logs to HDFS and turned them into event logs.
40. And we sessionized the event logs and made visit logs out of them.
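In today's terms, that sessionization step might be sketched in Scalding roughly like this. This is a hedged illustration only: the event schema, the 30-minute inactivity cutoff, and the source/sink names are assumptions for the sketch, not the actual pipeline (which at the time was written in cascading.jruby, and which serializes the full event sequence into the visit tuple rather than a summary).

```scala
import com.twitter.scalding._

// Hypothetical event schema; the real logs carry much more than this.
case class Event(userId: String, timestampMs: Long, eventType: String)

class SessionizeEvents(args: Args) extends Job(args) {
  val SessionGapMs = 30 * 60 * 1000L // assumed inactivity timeout between visits

  // Split one user's time-ordered events into visits wherever the gap
  // between consecutive events exceeds the cutoff.
  def toVisits(events: List[Event]): List[List[Event]] =
    events.foldLeft(List.empty[List[Event]]) {
      case ((cur @ (last :: _)) :: rest, ev)
          if ev.timestampMs - last.timestampMs < SessionGapMs =>
        (ev :: cur) :: rest // same visit: extend the current session
      case (sessions, ev) =>
        List(ev) :: sessions // gap exceeded (or first event): start a new visit
    }.map(_.reverse).reverse

  TypedPipe.from(TypedTsv[(String, Long, String)]("event_logs"))
    .map { case (u, t, e) => Event(u, t, e) }
    .groupBy(_.userId)
    .sortBy(_.timestampMs) // order each user's events by time
    .toList
    .toTypedPipe
    .flatMap { case (user, events) => toVisits(events).map(v => (user, v)) }
    // Sketch-only sink: emit (user, visit start, event count) per visit.
    .map { case (user, visit) => (user, visit.head.timestampMs, visit.length) }
    .write(TypedTsv[(String, Long, Int)]("visit_logs"))
}
```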
41. That decision to make a table for visits, with a row per user session, turned out to be important. Our data is stored as serialized sequences of events inside Cascading tuples.
42. So even though we just wanted this feature, well, what the hell did we just do? We just started building an analytics system, I guess.
43. The next thing we knew, we had a proprietary tool for analyzing A/B tests. Go figure.
44. By 2013 we definitely had our own giant analytics stack. It was built, racked, and debugged. And it was right about then that Scalding blew the whole thing to smithereens.
45. The thing that caused this was that we had hired Avi Bryant, who some of you may know as one of the authors of Scalding. And something of a group theory crank. And just an all-around amazing smart guy.
46. And as an amazing smart guy, when Avi joined Etsy he had some cover to get a little rogue with things.
47. And what he did with that cover was add Scalding to the build. And then he started trying to make things with it. Etsy's not bureaucratic in any way I understand the word, but in theory there's supposed to be at least some discussion before you start using a new framework. That didn't happen at all with Scalding.
48. And immediately after this, he up and quit. So the force of his intellect and personality doesn't explain Scalding's runaway success. If that's all it was about, everyone would have stopped using it the minute he left. But the opposite of that happened.
49. About a year ago we had this giant cascading.jruby system, which was starting to get mature.
50. But by last October the official policy was to rewrite the few pieces that were left in Scalding.
51. There's a technical reason this happened, which I think is interesting, but at the same time it's pretty simple.
52. I think it's simple enough that I can show it to you in a couple of examples. Let's say that we want to count how many visits searched for any given search term.
53. In other words, we want to find every search and every visit, and produce a table like this: search terms to the number of visits that entered them.
54. The cascading.jruby job is really simple and straightforward. It looks like this. Don't worry about understanding it or anything; the point is that it's short and easy.
55. And the equivalent Scalding job is also really short and simple.
56. Conceptually they're both just doing this.
57. You unroll the search events, then you grab the search terms out of them, then you just group and count.
58. And both Scalding and cascading.jruby manage to factor that into one MapReduce step. And in this case they both perform identically.
59. But you can start to see the difference if you add just one more layer of complexity. Let's say that we wanted to count up the search terms again, but this time relate them to purchases that happen after them in visits.
60. Like this. We want a table showing how many visits searched for a thing, and another column giving how many of those visits bought something.
61. In this case the Scalding job is not that much more complicated. It's still just about this long.
62. And Scalding manages to get this done in one MapReduce step again. It's just unrolling the searches out of the visits like it was before, and grouping with a sum.
63. The jruby job, on the other hand, no longer fits on the slide. It's in this gist if anyone wants to look at it.
64. I can show you what it does schematically. You make two branches, one for the searches and one for the purchases. Then you cross join them and filter that shit down. And then you wind up with a branch for conversions per search term and a branch for visits per term, and you join those back together to get your answer.
65. So the pure cascading.jruby solution is more complicated. And it also turns out to be a lot slower, too. Cascading doesn't have a query optimizer, and this might be a lot closer if
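A Scalding job along the lines of the unroll/group/count described above might look roughly like this, using the typed API. This is a hedged sketch, not Etsy's actual code: the `TypedTsv` paths, the one-row-per-visit layout, and the comma-separated terms field are all assumptions.

```scala
import com.twitter.scalding._

// Hypothetical layout: one row per visit, (visitId, comma-separated search terms).
class SearchTermVisitCounts(args: Args) extends Job(args) {
  TypedPipe.from(TypedTsv[(String, String)]("visit_logs"))
    .flatMap { case (_, terms) =>
      terms.split(',').toList.distinct // unroll the searches; count each term once per visit
    }
    .groupBy(identity) // group by search term...
    .size              // ...and count the visits
    .write(TypedTsv[(String, Long)]("search_term_visit_counts"))
}
```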
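And the purchase-relating variant, again as a hedged sketch under the same assumed visit layout (here extended with a purchased flag; for simplicity it relates terms to any purchase in the visit rather than checking that the purchase came after the search, and the pairwise summing leans on Algebird's tuple semigroup):

```scala
import com.twitter.scalding._

// Hypothetical layout: (visitId, comma-separated search terms, did the visit purchase?).
class SearchTermConversions(args: Args) extends Job(args) {
  TypedPipe.from(TypedTsv[(String, String, Boolean)]("visit_logs"))
    .flatMap { case (_, terms, purchased) =>
      // Unroll the searches out of the visit, emitting (visits, conversions) counters.
      terms.split(',').toList.distinct.map { term =>
        (term, (1L, if (purchased) 1L else 0L))
      }
    }
    .sumByKey // still one map/reduce step: sum both counters per search term
    .write(TypedTsv[(String, (Long, Long))]("search_term_conversions"))
}
```

The point from the talk holds in the sketch: adding the purchase column only changes what gets emitted per search, not the shape of the job.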