Storm and Clojure - Clojure Meetup January 2015

Using Storm with Clojure, Clojure Meetup NYC, Jan 12th 2015, Alex Kehayias, CTO, Shareablee, @alexkehayias


Page 1: Storm and Clojure - Clojure Meetup January 2015

Using Storm with Clojure

Clojure Meetup NYC, Jan 12th 2015

Alex Kehayias, CTO, Shareablee

@alexkehayias

Page 2: Storm and Clojure - Clojure Meetup January 2015

Agenda

• Best Practices

• Patterns

• Deployment

• Testing

• Reusable Bolts/Spouts

• Open sourcing stuff!

Page 3: Storm and Clojure - Clojure Meetup January 2015

Best Practices

• Keep bolt/spout code small; move logic into smaller, testable functions

• Don’t hardcode configuration; use Storm’s built-in Config. Use stateful bolts and pass configuration to any function that needs it

• Don’t block within bolts; control flow in the spout, not in the bolts

• Test bolts and spouts individually; it’s much harder to debug a whole topology all at once
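A minimal sketch of the first and last points above, not code from the talk. The logic lives in a pure function and the bolt body only wires it to an emitter. The names `summarize-response` and `handle-tuple` are hypothetical, and the `emit!` argument stands in for a call such as `emit-bolt!` so the example runs without a Storm cluster.

```clojure
(defn summarize-response
  "Pure logic: turn an HTTP-style response map into the values to emit."
  [{:keys [status body]}]
  (if (= 200 status)
    [:ok (count body)]
    [:error status]))

(defn handle-tuple
  "Thin bolt body: apply the pure function, then hand the result to emit!.
  In a real bolt, emit! would wrap emit-bolt! and the collector (assumption)."
  [emit! [meta-map payload]]
  (emit! [meta-map (summarize-response payload)]))

;; Usage with a fake emitter that records what would have been emitted.
(def emitted (atom []))
(handle-tuple #(swap! emitted conj %) [{} {:status 200 :body "hello"}])
```

Because `summarize-response` takes and returns plain data, it can be tested with ordinary `clojure.test` assertions and no cluster.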

Page 4: Storm and Clojure - Clojure Meetup January 2015

Best Practices

• Find common code and make reusable bolts/spouts for future topologies

• Mind your dependencies (and the Storm classpath)

• Be mindful of memory usage within a bolt; resources are shared and the process could get OOM-killed

• It’s OK to make async network calls inside bolts if the majority of the time is spent waiting

Page 5: Storm and Clojure - Clojure Meetup January 2015

Patterns

Page 6: Storm and Clojure - Clojure Meetup January 2015

Pass-through Data

Problem:

I need some information for some logic in a specific bolt, but not all bolts need the information.

Solution:

• By convention, reserve the first item of every input and output tuple for a hashmap that each bolt passes along

• Use this “meta” field to coordinate any downstream logic
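One way to picture the convention is the sketch below. It is an illustration under assumptions, not code from the talk: `wrap-passthrough`, `tagged?` and the `:source` key are hypothetical names, and a tuple is modeled as a plain two-item vector of `[meta payload]`.

```clojure
(defn wrap-passthrough
  "Lift a payload-only function so it accepts and returns [meta payload]
  vectors. The meta map is forwarded untouched."
  [f]
  (fn [[meta-map payload]]
    [meta-map (f payload)]))

(defn tagged?
  "Downstream logic can branch on the meta map without reading the payload."
  [[meta-map _] tag]
  (= tag (:source meta-map)))

;; A bolt that only cares about the payload still forwards the meta map.
(def parse-length (wrap-passthrough count))

(parse-length [{:source :facebook :retries 0} "hello"])
```

Bolts that do not need the extra information ignore the first item, while a later bolt can read it with something like `tagged?`.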

Page 7: Storm and Clojure - Clojure Meetup January 2015

Retries

Problem:

I have a side effect that can potentially fail, but can be retried and succeed. I want to re-enqueue the message, but don’t want to litter my bolts with the same code to re-enqueue a message (especially with reusable bolts!).

Solution:

• Use the tuple ID specified by the spout to hold all information about the message

• Use the fail method of a spout to take the encoded tuple ID and turn it into a message that can be re-enqueued

• To prevent infinite retries, use the meta field to count retries; if the count exceeds a threshold, drop the message on the floor

Page 8: Storm and Clojure - Clojure Meetup January 2015

Retries

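(defn message-to-tuple-id
  "Encode the whole message as the tuple ID."
  [message]
  (pr-str message))

(defn tuple-id-to-message
  "Decode a tuple ID back into the original message."
  [tuple-id]
  ;; Disable #= evaluation when reading the string back
  (binding [*read-eval* false]
    (read-string tuple-id)))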
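The retry limit described on the previous slide could be sketched as follows. This is an assumption-laden illustration: the message is taken to be a vector whose first item is the meta map, `max-retries`, `retry-message` and the `:retries` key are hypothetical, and `clojure.edn/read-string` is swapped in for the `*read-eval*` binding because it never evaluates code.

```clojure
(require '[clojure.edn :as edn])

(def max-retries 3) ; hypothetical threshold

(defn message->tuple-id [message]
  (pr-str message))

(defn tuple-id->message [tuple-id]
  (edn/read-string tuple-id))

(defn retry-message
  "Given the tuple ID handed to the spout's fail method, return the message
  to re-enqueue with its retry count incremented, or nil to drop it."
  [tuple-id]
  (let [[meta-map & payload] (tuple-id->message tuple-id)
        retries (inc (get meta-map :retries 0))]
    (when (<= retries max-retries)
      (into [(assoc meta-map :retries retries)] payload))))

;; First failure: the message comes back with :retries set to 1.
(retry-message (message->tuple-id [{} {:url "http://example.com"}]))
```

A spout's fail handler would call `retry-message` and re-enqueue the result when it is non-nil, so no bolt needs its own retry code.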

Page 9: Storm and Clojure - Clojure Meetup January 2015

Accumulating State

Problem:

I want to process tuples and output to CSV files, but I don’t want millions of tiny files. I want to accumulate results and then export them in larger batches. The data will be too large to hold in memory, so I will need to flush to disk.

Solution:

• Use a stateful bolt and hold paths to temp files

• Use java.io to handle creating unique files to avoid file system collisions

• Storm process state is not shared so writing is safe

• Use a tick tuple (built in to Storm) to periodically flush the files and store them

Page 10: Storm and Clojure - Clojure Meetup January 2015

Accumulating State

(if (tick-tuple? tuple)
  ;; Tick tuple: emit each accumulated file as one batch, then clean up
  (doseq [[k v] (:tmp-files @state)]
    (let [csv-content (slurp v)
          output [k csv-content]]
      (when-not (empty? csv-content)
        (emit-bolt! collector output :anchor tuple)
        (io/delete-file v)
        ;; dissoc-in is a helper, not part of clojure.core
        (swap! state dissoc-in [:tmp-files k]))))
  ;; Data tuple: append its rows to the temp file for its partition
  (let [{:keys [partition-key coll]} tuple
        [file-path created] (get-or-create-file state partition-key)
        csv-content (coll->csv coll)]
    (spit file-path csv-content :append true)))
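The slide does not show `get-or-create-file` or `dissoc-in`. The sketch below is one possible shape for them, written to match how they are called above; the bodies, the temp-file prefix and the return value `[path created?]` are assumptions rather than the speaker's code.

```clojure
(import 'java.io.File)

(defn get-or-create-file
  "Return [path created?] for partition-key. On first use, create a unique
  temp file and record its path under [:tmp-files partition-key] in the
  state atom."
  [state partition-key]
  (if-let [path (get-in @state [:tmp-files partition-key])]
    [path false]
    ;; File/createTempFile guarantees a unique name, avoiding collisions
    (let [path (.getAbsolutePath (File/createTempFile "csv-export-" ".csv"))]
      (swap! state assoc-in [:tmp-files partition-key] path)
      [path true])))

(defn dissoc-in
  "Remove the value at the nested key path ks from map m."
  [m ks]
  (if-let [parent-path (butlast ks)]
    (update-in m parent-path dissoc (last ks))
    (dissoc m (last ks))))
```

The check-then-create step is not atomic, which is acceptable only under the slide's premise that the bolt's state is not shared with other processes.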

Page 11: Storm and Clojure - Clojure Meetup January 2015

Going to Production

Page 12: Storm and Clojure - Clojure Meetup January 2015

Deployment

• The Storm classpath takes precedence over your jar, even when your dependencies are bundled in it. A conflicting dependency can work locally but fail in production, even if you include the dep yourself and exclude the copy Storm pulls in

• Storm ships an out-of-date Apache httpcore/httpclient (updated in the 0.9.2 release)

• storm-starter is out of date

Page 13: Storm and Clojure - Clojure Meetup January 2015

Configuration

• Storm will pass a Config instance to all stateful bolts

• Upon topology submit, choose which configuration you want to use

• Always use the config hashmap for things like host names, tables, queues, etc. so you can work in different environments

bash$ storm jar target/my-project-standalone.jar my.main.Class config/local.config.properties my-topo-name

Page 14: Storm and Clojure - Clojure Meetup January 2015

Configuration

(defn load-properties-from-file
  "Convert a .properties file into a map."
  [file-name]
  (with-open [^java.io.Reader reader (clojure.java.io/reader file-name)]
    (let [props (java.util.Properties.)]
      (.load props reader)
      ;; convert-bool-strings is a helper defined elsewhere
      (-> (into {} props)
          convert-bool-strings))))

(defn -main
  "Entry point: optional args are the properties file and the topology name."
  [& [properties-file topo-name]]
  (let [topo-name (or topo-name "default")
        conf-path (or properties-file "config/local.config.properties")
        properties (load-properties-from-file conf-path)
        topo (create-topology topo-name properties)
        conf (merge-config! (new Config) properties)]
    (submit-remote-topology topo-name conf topo)))
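`convert-bool-strings` is used above but not shown. A plausible sketch, assuming it only needs to turn the strings "true" and "false" into booleans because every value read from a `.properties` file is a string:

```clojure
(defn convert-bool-strings
  "Replace the string values \"true\" and \"false\" with booleans and leave
  every other value unchanged. Hypothetical implementation."
  [m]
  (into {}
        (for [[k v] m]
          [k (case v
               "true" true
               "false" false
               v)])))

(convert-bool-strings {"topology.debug" "true"
                       "rabbitmq.host" "localhost"})
```

Numeric settings would still arrive as strings with this approach, so a real helper may need further conversions.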

Page 15: Storm and Clojure - Clojure Meetup January 2015

Logging

• Reuse Storm’s logging namespace, backtype.storm.log

• Use topology.debug=true to log each tuple being emitted and received

• Don’t use prn or prn-str. You can overflow the write buffer and silently block all tuples!

Page 16: Storm and Clojure - Clojure Meetup January 2015

How to Test Storm

Page 17: Storm and Clojure - Clojure Meetup January 2015

Testing

• Use with-simulated-time-local-cluster and complete-topology to assert the output streams of specific bolts

• Use with-local-cluster for integration tests

• Use with-redefs to assert side effects

• Use with-redefs to mock out bolts or side effects so you can test behavior
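The `with-redefs` points can be tried without a cluster. In this sketch `store!`, `export-batch` and `run-with-captured-store` are hypothetical names; the real `store!` throws to show that the redefinition, not the original, runs during the test.

```clojure
(defn store!
  "Side effect we do not want to run in a unit test (hypothetical)."
  [location content]
  (throw (ex-info "would write to external storage" {:location location})))

(defn export-batch
  "Logic under test: builds a location and delegates to store!."
  [partition-key content]
  (store! (str "batch/" partition-key "/") content))

(defn run-with-captured-store
  "Call f with store! redefined to record its arguments; return the calls."
  [f]
  (let [calls (atom [])]
    (with-redefs [store! (fn [location content]
                           (swap! calls conj [location content]))]
      (f))
    @calls))

;; The recorded calls can then be asserted on with clojure.test's `is`.
(run-with-captured-store #(export-batch "facebook" "a,b\n"))
```

`with-redefs` changes the var's root binding for every thread and restores it when the body exits, which is why it also reaches code running inside a local cluster's worker threads.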

Page 18: Storm and Clojure - Clojure Meetup January 2015

Testing - Bolts

(storm/defspout mock-spout ["f1" "f2"]
  [collector]
  nil)

(defn mock-topology []
  (storm/topology
   {"spout" (storm/spout-spec mock-spout)}
   {"parse" (storm/bolt-spec {"spout" :shuffle} my-bolt)}))

(def test-user-tuple
  [{} {:body "hello" :status 200}])

(deftest test-my-bolt
  (with-simulated-time-local-cluster [cluster]
    (let [results (complete-topology
                   cluster
                   (mock-topology)
                   ;; Feed the test tuple in place of the real spout output
                   :mock-sources {"spout" [test-user-tuple]})
          ;; Read what the "parse" bolt emitted on "my-stream"
          result (read-tuples results "parse" "my-stream")
          expected ["foo" "bar"]]
      (is (= (first result) expected)))))

Page 19: Storm and Clojure - Clojure Meetup January 2015

Testing - Topologies

(deftest test-audience-topology
  (let [test-results (atom [])
        tick-counter (atom 0)
        conf (conf/merge-config! (new Config) test-conf)
        {:keys [rmq-conn
                rmq-ch
                rmq-exchange
                rmq-queue]} (rmq/make-connection conf audience-conf)]
    (with-redefs [;; Don't make any network calls, mock out responses
                  http-bolt mock-http-bolt
                  ;; Mock this out so we can assert the results
                  store (mock-store test-results)
                  ;; There is no control over bolt specific configs so
                  ;; overwrite the conditional to export results
                  ;; without having to wait the full amount of time
                  tick-tuple? (fn [tuple]
                                (if (> @tick-counter 5)
                                  true
                                  (do (swap! tick-counter inc) false)))]
      ;; Results from this run of the cluster go into the
      ;; test-results atom for assertions
      (st/with-local-cluster [cluster]
        (rmq/publish-task @rmq-ch rmq-exchange rmq-queue test-task)
        (st/submit-local-topology (:nimbus cluster)
                                  "testing"
                                  conf
                                  (mk-audience-topology test-conf))
        (Thread/sleep 5000)
        (rmq/close-connection @rmq-conn))
      ;; Make assertions about the end output of the tuple
      (doseq [[location content] @test-results]
        (is (.contains location "batch/facebook/1.0/actions/ym=201501/")
            "Test there is a correct location for the data")
        (is (seq content) "Test content is not nil")))))
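`mock-store` is not shown on the slide. Judging from the `doseq` destructuring above, it probably returns a function that records `[location content]` pairs; the sketch below assumes that two-argument signature for `store`, which the source does not confirm.

```clojure
(defn mock-store
  "Return a stand-in for store that records [location content] pairs in the
  results atom instead of writing anywhere. Hypothetical implementation."
  [results]
  (fn [location content]
    (swap! results conj [location content])
    nil))

;; Calling the mock appends one pair to the atom.
(def recorded (atom []))
((mock-store recorded) "batch/facebook/1.0/actions/ym=201501/part-0" "a,b")
```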

Page 20: Storm and Clojure - Clojure Meetup January 2015

Testing – Uncaught Exceptions

• Shell bolts can run forever; manually kill the processes

• Tests will end with “Tests failed” and no stacktrace

• Tests will cause nREPL in Emacs to exit

Page 21: Storm and Clojure - Clojure Meetup January 2015

Code Reuse

Page 22: Storm and Clojure - Clojure Meetup January 2015

What do we need?

• Ability to reuse and combine bolts and spouts in meaningful ways

• Abstractions for common use cases

• Work on bolts in isolation

Page 23: Storm and Clojure - Clojure Meetup January 2015

Reusable Bolts

In Practice:

• Separate libraries for core bolts/spouts that are fully tested and version controlled

• Find the core functionality needed, avoid custom code

• Include a pass-through field in the input tuple to allow state to be held

• Use releases to peg dependencies to

Page 24: Storm and Clojure - Clojure Meetup January 2015

Reusable Bolts

elasticsearch-bolt: Write JSON documents into Elasticsearch

https://github.com/shareablee/elasticsearch-bolt

http-bolt: Make HTTP requests and return the results

https://github.com/shareablee/http-bolt

csv-export-bolt: Accumulate data to CSV format and periodically emit CSV string results

https://github.com/shareablee/csv-export-bolt

archive-bolt: Store data in S3

https://github.com/shareablee/archive-bolt

Page 25: Storm and Clojure - Clojure Meetup January 2015

Reusable Bolts

What’s next:

• cassandra-bolt

• rabbitmq-spout

• redis-bolt

Page 26: Storm and Clojure - Clojure Meetup January 2015

Thank you!

( we’re hiring )

[email protected]

@alexkehayias

https://github.com/shareablee