storm and clojure - clojure meetup january 2015
Using Storm with Clojure
Clojure Meetup NYC, Jan 12th 2015
Alex Kehayias, CTO, Shareablee
@alexkehayias
Agenda
• Best Practices
• Patterns
• Deployment
• Testing
• Reusable Bolts/Spouts
• Open sourcing stuff!
Best Practices
• Keep bolt/spout code small; move logic into smaller, testable functions
• Don’t hardcode configuration; use Storm’s built-in Config. Use stateful bolts and pass configuration to any function that needs it
• Don’t block within bolts; control flow in the spout, not in the bolts
• Test bolts and spouts individually; it’s much harder to debug a whole topology all at once
Best Practices
• Find common code and make reusable bolts/spouts for future topologies
• Mind your dependencies (and the Storm classpath)
• Be mindful of memory usage within a bolt; resources are shared and you could get OOM killed
• It’s OK to make async network calls inside of bolts if the majority of the time is spent waiting
Patterns
Pass-through Data
Problem:
I need some information for some logic in a specific bolt, but not all bolts need the information.
Solution:
• By convention, leave the first item in input and output tuples as a hashmap
• Use the “meta” field to coordinate any downstream logic
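As a minimal sketch of this convention (the function names and the :export key are hypothetical, not from the talk), bolt logic can be written as plain functions that destructure the tuple values, work on the payload, and forward the meta hashmap untouched as the first item:

```clojure
(require '[clojure.string :as string])

;; Hypothetical bolt logic: the first tuple value is the pass-through
;; meta hashmap, the second is the payload this bolt actually works on.
(defn process
  "Upper-case the payload and return the output tuple values,
  forwarding the meta hashmap unchanged as the first item."
  [[meta-map payload]]
  [meta-map (string/upper-case payload)])

;; A downstream bolt can branch on meta without the bolts in between
;; knowing anything about the :export key (an assumed, illustrative key).
(defn needs-export?
  [[meta-map _]]
  (true? (:export meta-map)))
```

Because every bolt keeps the same first slot, a reusable bolt never needs to know which keys a particular topology stores in it.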
Retries
Problem:
I have a side effect that can potentially fail, but can be retried and succeed. I want to re-enqueue the message, but don’t want to litter my bolts with the same code to re-enqueue a message (especially with reusable bolts!).
Solution:
• Use the tuple ID specified by the spout to hold all information about the message
• Use the fail method of a spout to take the encoded tuple ID and turn it into a message that can be re-enqueued
• To prevent infinite retries, use the meta field to increment the number of retries and if it exceeds a threshold drop the message on the floor
Retries
(defn message-to-tuple-id
  [message]
  (pr-str message))

(defn tuple-id-to-message
  [tuple-id]
  ;; Disable #= evaluation while reading the tuple ID back in
  (binding [*read-eval* false]
    (read-string tuple-id)))
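The retry threshold described in the bullets could look roughly like the sketch below. It assumes the message is a vector whose first item is the meta hashmap, that retries are counted under a :retries key, and that the tuple ID was produced with pr-str; max-retries and retry-message are hypothetical names. It reads the ID with clojure.edn/read-string, which never evaluates code, in place of the read-string call shown in the slide.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical threshold; pick whatever suits the side effect being retried.
(def max-retries 3)

(defn retry-message
  "Given the tuple ID of a failed tuple, return the message to re-enqueue
  with its retry count incremented, or nil once the threshold is exceeded."
  [tuple-id]
  (let [[meta-map & payload] (edn/read-string tuple-id)
        retries (inc (get meta-map :retries 0))]
    (when (<= retries max-retries)
      (into [(assoc meta-map :retries retries)] payload))))
```

A spout's fail method would call something like this with the failed tuple ID and re-enqueue the result only when it is non-nil, so the bolts themselves carry no retry code.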
Accumulating State
Problem:
I want to process tuples and output to csv files, but I don’t want millions of tiny files. I want to accumulate results and then export them in larger batches. The data will be too large to hold in memory so I will need to flush to disk.
Solution:
• Use a stateful bolt and hold paths to temp files
• Use java.io to handle creating unique files to avoid file system collisions
• Storm process state is not shared so writing is safe
• Use a tick tuple (built in to Storm) to periodically flush the files and store them
Accumulating State
(if (tick-tuple? tuple)
  ;; On a tick: emit each accumulated file and clear it from state
  (doseq [[k v] (:tmp-files @state)]
    (let [csv-content (slurp v)
          output [k csv-content]]
      (when-not (empty? csv-content)
        (emit-bolt! collector output :anchor tuple)
        (io/delete-file v)
        (swap! state dissoc-in [:tmp-files k]))))
  ;; Otherwise: append the incoming rows to the partition's temp file
  (let [{:keys [partition-key coll]} tuple
        [file-path created] (get-or-create-file state partition-key)
        csv-content (coll->csv coll)]
    (spit file-path csv-content :append true)))
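The slide does not show get-or-create-file. One way such a helper might be written, assuming state is an atom holding a :tmp-files map from partition key to file path, is sketched below; the file prefix and suffix are arbitrary choices. java.io.File/createTempFile guarantees a unique file name, which covers the collision concern above.

```clojure
(defn get-or-create-file
  "Return [file-path created?] for partition-key. On first use, create a
  unique temp file and record its path under :tmp-files in the state atom."
  [state partition-key]
  (if-let [path (get-in @state [:tmp-files partition-key])]
    [path false]
    (let [tmp  (java.io.File/createTempFile "csv-export-" ".csv")
          path (.getAbsolutePath tmp)]
      (swap! state assoc-in [:tmp-files partition-key] path)
      [path true])))
```

The check-then-create step is not atomic, which is acceptable only because, as noted above, the bolt's state is not shared with other processes.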
Going to Production
Deployment
• The Storm classpath supersedes your jar, even when deps are included. It will work locally, but not in production, even if you include the dep and exclude the conflicting version
• Out-of-date Apache httpcore/httpclient (updated in the 0.9.2 release)
• storm-starter is out of date
Configuration
• Storm will pass a Config instance to all stateful bolts
• Upon topology submit, choose which configuration you want to use
• Always use the config hashmap for things like host names, tables, queues, etc., so you can work in different environments
bash$ storm jar target/my-project-standalone.jar my.main.Class config/local.config.properties my-topo-name
Configuration
(defn load-properties-from-file
  "Convert a .properties file into a map."
  [file-name]
  (with-open [^java.io.Reader reader (clojure.java.io/reader file-name)]
    (let [props (java.util.Properties.)]
      (.load props reader)
      (-> (into {} props)
          convert-bool-strings))))

(defn -main
  [& [properties-file topo-name]]
  (let [topo-name (or topo-name "default")
        conf-path (or properties-file "config/local.config.properties")
        ;; Load the environment-specific settings chosen at submit time
        properties (load-properties-from-file conf-path)
        topo (create-topology topo-name properties)
        conf (merge-config! (new Config) properties)]
    (submit-remote-topology topo-name conf topo)))
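The convert-bool-strings helper is referenced but not shown in the slides. A plausible sketch, assuming its only job is to turn the strings "true" and "false" from the .properties file into real booleans, might be:

```clojure
(defn convert-bool-strings
  "Replace the string values \"true\" and \"false\" with booleans,
  leaving every other value untouched."
  [m]
  (into {}
        (for [[k v] m]
          [k (case v
               "true"  true
               "false" false
               v)])))
```

Without a step like this, a setting such as topology.debug=true would reach the Config as the string "true" instead of a boolean.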
Logging
• Reuse Storm’s logging: backtype.storm.log
• Use topology.debug=true to log each tuple being emitted and received
• Don’t use prn or prn-str. You can overflow the write buffer and silently block all tuples!
How to Test Storm
Testing
• Use with-simulated-time-local-cluster and complete-topology to assert the output streams of specific bolts
• Use with-local-cluster for integration tests
• Use with-redefs to assert side effects
• Use with-redefs to mock out bolts or side effects so you can test behavior
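The with-redefs technique needs no Storm at all once the logic lives in plain functions. In the sketch below, store and export! are hypothetical stand-ins for a side-effecting call and the bolt logic that uses it; the test swaps store for a function that records its arguments in an atom.

```clojure
(require '[clojure.test :refer [deftest is run-tests]])

;; Hypothetical side-effecting function that a bolt would call.
(defn store [location content]
  (throw (ex-info "No network access in tests" {:location location})))

;; Hypothetical bolt logic, kept in a plain function so it can be tested alone.
(defn export! [partition-key csv]
  (store (str "batch/" partition-key) csv))

(deftest export-stores-under-batch-prefix
  (let [calls (atom [])]
    ;; Replace the real side effect with one that records its arguments
    (with-redefs [store (fn [location content]
                          (swap! calls conj [location content]))]
      (export! "ym=201501" "a,b\n1,2"))
    (is (= [["batch/ym=201501" "a,b\n1,2"]] @calls))))
```

The redefinition only lasts for the body of with-redefs, so assertions on the recorded calls can safely run after it.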
Testing - Bolts
(storm/defspout mock-spout ["f1" "f2"]
  [collector]
  nil)

(defn mock-topology []
  (storm/topology
    {"spout" (storm/spout-spec mock-spout)}
    {"parse" (storm/bolt-spec {"spout" :shuffle} my-bolt)}))

(def test-user-tuple
  [{} {:body "hello" :status 200}])

(deftest test-my-bolt
  (with-simulated-time-local-cluster [cluster]
    (let [results (complete-topology
                    cluster
                    (mock-topology)
                    :mock-sources {"spout" [test-user-tuple]})
          ;; read-tuples takes the component id used in the topology ("parse")
          result (read-tuples results "parse" "my-stream")
          expected ["foo" "bar"]]
      (is (= (first result) expected)))))
Testing - Topologies
(deftest test-audience-topology
  (let [test-results (atom [])
        tick-counter (atom 0)
        conf (conf/merge-config! (new Config) test-conf)
        {:keys [rmq-conn
                rmq-ch
                rmq-exchange
                rmq-queue]} (rmq/make-connection conf audience-conf)]
    (with-redefs [;; Don't make any network calls, mock out responses
                  http-bolt mock-http-bolt
                  ;; Mock this out so we can assert the results
                  store (mock-store test-results)
                  ;; There is no control over bolt specific configs so
                  ;; overwrite the conditional to export results
                  ;; without having to wait the full amount of time
                  tick-tuple? (fn [tuple]
                                (if (> @tick-counter 5)
                                  true
                                  (do (swap! tick-counter inc) false)))]
      ;; Results from this run of the cluster go into the
      ;; test-results atom for assertions
      (st/with-local-cluster [cluster]
        (rmq/publish-task @rmq-ch rmq-exchange rmq-queue test-task)
        (st/submit-local-topology (:nimbus cluster)
                                  "testing"
                                  conf
                                  (mk-audience-topology test-conf))
        (Thread/sleep 5000)
        (rmq/close-connection @rmq-conn))
      ;; Make assertions about the end output of the tuple
      (doseq [[location content] @test-results]
        (is (.contains location "batch/facebook/1.0/actions/ym=201501/")
            "Test there is a correct location for the data")
        (is (seq content) "Test content is not nil")))))
Testing – Uncaught Exceptions
• Shell bolts can run forever; kill the processes manually
• Tests will end with “Tests failed” and no stack trace
• Tests will cause nREPL in Emacs to exit
Code Reuse
What do we need?
• Ability to reuse and combine bolts and spouts in meaningful ways
• Abstractions for common use cases
• Work on bolts in isolation
Reusable Bolts
In Practice:
• Separate libraries for core bolts/spouts that are fully tested and version controlled
• Find the core functionality needed, avoid custom code
• Include a pass-through field in the input tuple to allow state to be held
• Use releases to peg dependencies to
Reusable Bolts
elasticsearch-bolt: Write JSON documents into Elasticsearch
https://github.com/shareablee/elasticsearch-bolt
http-bolt: Make HTTP requests and return the results
https://github.com/shareablee/http-bolt
csv-export-bolt: Accumulate data to csv format and periodically emit csv string results
https://github.com/shareablee/csv-export-bolt
archive-bolt: Store data in S3
https://github.com/shareablee/archive-bolt
Reusable Bolts
What’s next:
• cassandra-bolt
• rabbitmq-spout
• redis-bolt