Plugin-based software design with Ruby and RubyGems
Sadayuki Furuhashi Founder & Software Architect
RubyKaigi 2015
A little about me…
Sadayuki Furuhashi
github: @frsyuki
Founder & Software Architect
Fluentd - Unified log collection infrastructure
Embulk - Plugin-based parallel ETL
MessagePack - It's like JSON. but fast and small.
What’s Plugin Architecture?
Benefits of Plugin Architecture
> Plugins bring many features
> Plugins keep core software simple
> Plugins are easy to test
> Plugins build an active developer community
> “…if it’s designed well”.
How to design plugin architecture?
How did I design plugin architecture?
Today’s topic
> Plugin Architecture Design Patterns
> Plugin Architecture of Fluentd
> Plugin Architecture of Embulk
> Pitfalls & Challenges
Plugin Architecture Design Patterns
a) Traditional Extensible Software Architecture
b) Plugin-based Software Architecture
Traditional Extensible Software Architecture
[Diagram: Host Application with Plugins attached]
Register plugins to extension points
To add more extensibility, add more extension points.
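A minimal sketch of the traditional style, to make "extension points" concrete. The class name `HostApplication` and the two extension points (`:before_save`, `:after_save`) are hypothetical and chosen only for illustration; the point is that the host owns the control flow and plugins can act only where the host has provided a hook.

```ruby
# Hypothetical host application with two fixed extension points.
class HostApplication
  EXTENSION_POINTS = [:before_save, :after_save].freeze

  attr_reader :saved

  def initialize
    @handlers = Hash.new { |hash, point| hash[point] = [] }
    @saved = []
  end

  # Plugins register a block at one of the predefined extension points.
  def register(point, &handler)
    unless EXTENSION_POINTS.include?(point)
      raise ArgumentError, "unknown extension point: #{point}"
    end
    @handlers[point] << handler
    self
  end

  # The host calls plugins only at the places it has decided to expose.
  def save(text)
    text = @handlers[:before_save].reduce(text) { |acc, handler| handler.call(acc) }
    @saved << text
    @handlers[:after_save].each { |handler| handler.call(text) }
    text
  end
end

app = HostApplication.new
log = []
app.register(:before_save) { |text| text.strip }
app.register(:after_save) { |text| log << "saved #{text.length} bytes" }
app.save("  hello  ")
```

Anything a plugin wants to do outside `save` would require a new extension point in the host, which is the limitation this slide points at.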
Plugin-based software architecture
[Diagram: Application made of a Core surrounded by many Plugins]
Plugin-based software architecture
• Application as a network of plugins.
  > Plugins: provide features.
  > Core: framework to implement plugins.
• More flexibility != More complexity.
• Application must be designed to be modular.
  > It’s hard to design :(
  > Optimizing performance is difficult :(
    • Loosely-coupled API often makes performance worse.
Design Pattern 1: Dependency Injection
Each component publishes API:
A component is an interface or a class.
[Diagram: Core composed of classes and interfaces]
Design Pattern 1: Dependency Injection
When application runs:
A DI container replaces objects with plugins when application runs
[Diagram: Core whose classes are replaced by Plugins]
Replace classes with mocks for unit tests
Design Pattern 1: Dependency Injection
Testing the application
[Diagram: Core whose components are replaced by dummies around a single Plugin]
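A small sketch of what "replace classes with mocks" can look like with plain constructor injection in Ruby. `Uploader` and `RecordingStore` are hypothetical names made up for this example; no DI container is involved, the test simply passes the dummy in.

```ruby
# Hypothetical component that depends on "something with a #store method".
class Uploader
  def initialize(store:)
    @store = store
  end

  # Stores each line (stripped) and returns how many lines were handled.
  def upload(lines)
    lines.each { |line| @store.store(line.strip) }
    lines.size
  end
end

# Dummy used only in tests: records what it was asked to store.
class RecordingStore
  attr_reader :stored

  def initialize
    @stored = []
  end

  def store(data)
    @stored << data
  end
end

dummy = RecordingStore.new
count = Uploader.new(store: dummy).upload([" a \n", "b"])
```

Because `Uploader` never names a concrete store class, the same code runs against a real backend in production and against the dummy in a unit test.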
Dependency Injection (Java)
public interface Store {
    void store(String data);
}

public class Module {
    @Inject
    Module(Store store) {
        store.store("data");
    }
}

public class DummyStore implements Store {
    public void store(String data) { }
}

// Binding module (Guice style). Fully qualified because this example
// also defines its own class named Module.
public class MainModule implements com.google.inject.Module {
    public void configure(Binder binder) {
        binder.bind(Store.class)
              .to(DummyStore.class);
    }
}
interface → implementation mapping
From source code, implementation is black box. It’s replaced at runtime.
Dependency Injection (Ruby)
Ruby? (What’s a good way to use DI in Ruby?) (Please tell me if you know)
Dependency Injection (Ruby)
class Module
  def initialize(store: DummyStore.new)
    store.store("data")
  end
end

class DummyStore
  def store(data)
  end
end

injector = Injector.new.
  bind(store: DBStore)
object = injector.get(Module)

class DBStore
  def initialize(db: DBM.new)
    @db = db
  end

  def store(data)
    @db.insert(data)
  end
end

injector = Injector.new.
  bind(store: DBStore).
  bind(db: SqliteDBImpl)
object = injector.get(Module)
I want to do this:
Keyword arguments
{:keyword => class} mapping at runtime
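One possible way to get the wished-for `Injector`, sketched under assumptions: it is not an existing library, and it resolves dependencies only by keyword-argument name using `instance_method(:initialize).parameters`. The consumer class is called `App` here instead of `Module`, because `Module` is a Ruby built-in class; `ArrayDB` is a made-up stand-in for a database.

```ruby
# Hypothetical injector: {:keyword => class} mapping resolved at runtime.
class Injector
  def initialize
    @bindings = {}
  end

  # bind(store: DBStore) registers a mapping and returns self for chaining.
  def bind(**mapping)
    @bindings.merge!(mapping)
    self
  end

  # Builds klass, recursively building every bound keyword argument.
  # Unbound keywords fall back to the default written in the constructor.
  def get(klass)
    kwargs = {}
    klass.instance_method(:initialize).parameters.each do |kind, name|
      next unless [:key, :keyreq].include?(kind)
      kwargs[name] = get(@bindings[name]) if @bindings.key?(name)
    end
    kwargs.empty? ? klass.new : klass.new(**kwargs)
  end
end

class DummyStore
  def store(data); end
end

class ArrayDB # stand-in for a real database
  attr_reader :rows

  def initialize
    @rows = []
  end

  def insert(data)
    @rows << data
  end
end

class DBStore
  attr_reader :db

  def initialize(db: ArrayDB.new)
    @db = db
  end

  def store(data)
    @db.insert(data)
  end
end

class App
  attr_reader :store

  def initialize(store: DummyStore.new)
    @store = store
  end
end

app = Injector.new.bind(store: DBStore).bind(db: ArrayDB).get(App)
app.store.store("data")
```

Binding by keyword name is the simplest thing that works for a sketch; it breaks down as soon as two classes use the same keyword for different roles, which is one reason the question on the slide is still open.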
Design Pattern 2: Dynamic Plugin Loader
Core calls Plugin Loader to load plugins
[Diagram: Core → Plugin Loader → Plugins]
Design Pattern 2: Dynamic Plugin Loader
Plugins also call Plugin Loader. Plugins create an ecosystem.
[Diagram: Core and Plugins → Plugin Loader → more Plugins]
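A rough sketch of a dynamic plugin loader, assuming a naming convention in which a plugin of kind `output` and type `stdout` lives in a file `myapp/plugin/output_stdout.rb` on the load path and registers itself when required. The `PluginLoader` module, the `myapp/plugin/` prefix, and the `StdoutOutput` class are all hypothetical.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical loader: a registry plus "require by naming convention".
module PluginLoader
  REGISTRY = {}

  # Plugins call this from their own file when they are required.
  def self.register(kind, type, klass)
    REGISTRY[[kind.to_sym, type.to_s]] = klass
  end

  # Finds a plugin class, requiring its file on first use.
  def self.lookup(kind, type)
    key = [kind.to_sym, type.to_s]
    unless REGISTRY.key?(key)
      begin
        require "myapp/plugin/#{kind}_#{type}"
      rescue LoadError
        raise ArgumentError, "unknown #{kind} plugin: #{type}"
      end
    end
    REGISTRY.fetch(key) { raise ArgumentError, "#{kind} plugin #{type} did not register itself" }
  end

  # Core and plugins alike create plugins through the loader.
  def self.new_plugin(kind, type, *args)
    lookup(kind, type).new(*args)
  end
end

# Demo: write a plugin file into a temporary directory and load it by name.
Dir.mktmpdir do |dir|
  plugin_dir = File.join(dir, "myapp", "plugin")
  FileUtils.mkdir_p(plugin_dir)
  File.write(File.join(plugin_dir, "output_stdout.rb"), <<~PLUGIN)
    class StdoutOutput
      def emit(record)
        "out: \#{record}"
      end
    end
    PluginLoader.register(:output, "stdout", StdoutOutput)
  PLUGIN
  $LOAD_PATH.unshift(dir)
  PluginLoader.new_plugin(:output, "stdout").emit("hello")
end
```

Since `PluginLoader.new_plugin` is an ordinary public method, a plugin can call it too, which is what lets plugins build on other plugins.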
Design Pattern 3: Combination
[Diagram: Core whose classes are replaced by Plugins, plus a Plugin Loader that loads further Plugins]
Dependency Injection + Plugin Loader
Plugin Architecture Design Patterns
a) Traditional Extensible Software Architecture
b) Plugin-based Software Architecture
  > Dependency Injection (DI)
  > Dynamic Plugin Loader
  > Combination of those
There are trade-offs
  > Choose the best solution for each project
Plugin Architecture of Fluentd
What’s Fluentd?
> Data collector for unified logging layer
  > Streaming data transfer based on JSON
  > Written in C & Ruby
> Plugin Marketplace on RubyGems
  > http://www.fluentd.org/plugins
> Working in production
  > http://www.fluentd.org/testimonials
Deployment of Fluentd
The problems around log collection…
Solution: N × M → N + M plugins
# logs from a file
<source>
  type tail
  path /var/log/httpd.log
  pos_file /tmp/pos_file
  format apache2
  tag web.access
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to ES and S3
<match web.*>
  type copy
  <store>
    type elasticsearch
    logstash_format true
  </store>
  <store>
    type s3
    bucket s3-event-archive
  </store>
</match>

<match metrics.*>
  type nagios
  host watch-server
</match>
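To illustrate how `<match>` sections like the ones above could route events by tag, here is a simplified sketch. It is not Fluentd's implementation: the `EventRouter` class below is hypothetical, it supports only `*` as "exactly one tag part", and it assumes the first matching rule wins.

```ruby
# Hypothetical tag-based router: pattern → output, first match wins.
class EventRouter
  def initialize
    @rules = []
  end

  # Registers an output block for a tag pattern such as "web.*".
  def match(pattern, &output)
    parts = pattern.split(".").map do |part|
      part == "*" ? "[^.]+" : Regexp.escape(part)
    end
    @rules << [Regexp.new("\\A" + parts.join("\\.") + "\\z"), output]
    self
  end

  # Sends the record to the first output whose pattern matches the tag.
  # Returns false when no rule matches.
  def emit(tag, record)
    _pattern, output = @rules.find { |pattern, _| pattern.match?(tag) }
    return false if output.nil?
    output.call(tag, record)
    true
  end
end

web = []
metrics = []
router = EventRouter.new
router.match("web.*") { |tag, record| web << [tag, record] }
router.match("metrics.*") { |tag, record| metrics << [tag, record] }
router.emit("web.access", { "path" => "/" })
router.emit("metrics.cpu", { "load" => 0.5 })
```

The router knows nothing about files, Elasticsearch, or S3; inputs only produce tagged records and outputs only consume them, which is how N inputs and M outputs need N + M plugins rather than N × M integrations.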
Example: Simple forwarding
Example: HA & High performance
- HA (fail over)
- Load-balancing
- Choice of at-most-once or at-least-once
Example: Realtime search + Batch Analytics combo
All data
Hot data
Plugin Architecture of Fluentd
[Diagram: Fluentd Core (EventRouter, Plugin Loader) connected to Input, Filter, Buffer, and Output plugins]
Plugin Marketplace using RubyGems.org
$ gem install fluent-plugin-s3
[Diagram: RubyGems.org → /gems/ → Plugin Loader]
Fluentd’s Plugin Architecture
• Fluentd is a plugin-based event collector.
  > Fluentd core: takes care of message routing between plugins.
  > Plugins: do all other things!
• 300+ plugins released on RubyGems.org
• Fluentd loads plugins using Gem API.
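A hedged sketch of what "loading plugins using Gem API" can involve. It uses only standard RubyGems calls (`Gem::Specification` enumeration and `Gem.find_files`); the helper names are hypothetical, and the `fluent/plugin/<kind>_<type>` path is an assumption about the file naming convention rather than a statement of how Fluentd resolves plugins internally.

```ruby
require "rubygems"

# Names of installed gems whose name starts with the given prefix,
# e.g. "fluent-plugin-".
def installed_plugin_gems(prefix)
  Gem::Specification.map(&:name).select { |name| name.start_with?(prefix) }.uniq.sort
end

# Feature path for a plugin, assuming a "<kind>_<type>" naming convention.
def plugin_feature(kind, type)
  "fluent/plugin/#{kind}_#{type}"
end

# Files in installed gems (and on the load path) that provide the plugin.
def find_plugin_files(kind, type)
  Gem.find_files("#{plugin_feature(kind, type)}.rb")
end

installed = installed_plugin_gems("fluent-plugin-")
candidates = find_plugin_files("in", "tail")
```

With this approach, installing a gem is all it takes to make a plugin discoverable; the core needs no list of known plugins.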
Plugin Architecture of Embulk
Embulk: Open-source Bulk Data Loader written in Java & JRuby
Amazon S3
MySQL
FTP
CSV Files
Access Logs
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
Reliable framework :-)
Parallel execution, transaction, auto guess, …and many by plugins.
Demo
Use case 1: Sync MySQL to Elasticsearch
embulk-input-mysql
embulk-filter-kuromoji
embulk-output-elasticsearch
MySQL
kuromoji
Elasticsearch
Use case 2: Load from S3 to Analytics
csv.gz on S3 → Treasure Data, BigQuery, Redshift
embulk-input-s3 + embulk-decoder-gzip + embulk-parser-csv
embulk-output-td, embulk-output-bigquery, embulk-output-redshift
embulk-executor-mapreduce
Use case 3: Embulk as a Service at Treasure Data
REST API to load/export data to/from Treasure Data
Input Output
Embulk’s Plugin Architecture
[Diagram, built up over three slides: Embulk Core and an Executor Plugin with Guess, Filter, and Output plugins; FileInput with Parser and Decoder plugins; FileOutput with Formatter and Encoder plugins]
Embulk’s Plugin Architecture
[Diagram: Embulk Core with PluginManager, InjectedPluginSource (Java Runtime), and JRubyPluginLoader (JRuby Runtime); plugin types shown: Executor Plugin, InputPlugin, ParserPlugin, FilterPlugin, FormatterPlugin, OutputPlugin]
Plugin Marketplace using RubyGems.org
$ embulk gem install embulk-input-oracle
[Diagram: RubyGems.org → /gems/ → JRubyPluginLoader (JRuby Runtime) → PluginManager in Embulk Core (Java Runtime)]
Plugin Package Structure
embulk-input-s3.gem
+- build.gradle
|
+- src/main/java/org/embulk/input/s3
|   \- S3FileInputPlugin.java            <- Java source files
|      AwsCredentials.java
|
+- classpath/
|   \- embulk-input-s3-0.2.6.jar         <- Compiled jar file
|      aws-java-sdk-s3-1.10.33.jar       <- All dependent jar files
|      httpclient-4.3.6.jar
|
+- lib/embulk/input/
    \- s3.rb                             <- Ruby script to load the jar files
Embulk Plugin Load Sequence
Java
org.embulk.cli.Main.main(String[] args) {
  org.jruby.Main.main(
    "embulk.jar!/embulk/command/embulk_bundle.rb", args);
}
JRuby
Bundler.setup_environment
Embulk::Runner = Embulk::Runner.new(
  .embulk.EmbulkEmbed::Bootstrap.new.initialize)
Embulk::Runner.run(ARGV)
Java
org.embulk.exec.BulkLoader.run(…)
org.embulk.plugin.PluginManager.newPlugin(…)
{
  jruby = org.jruby.embed.ScriptingContainer()
  rubyObj = jruby.runScriptlet("Embulk::Plugin")
  jruby.callMethod(rubyObj, "new_java_input", "s3")
}
Embulk Plugin Load Sequence
Java
org.embulk.plugin.PluginManager.newPlugin(…)
JRuby
def new_java_input(type)
  rubyPluginClass = lookup(:input, type)
  return rubyPluginClass.new_java
end
Embulk Plugin Load Sequence
JRuby
def new_java
  jars = Dir["classpath/**/*.jar"]
  factory = org.embulk.embulk.plugin.PluginClassLoaderFactory.new
  classloader = factory.create(jars)
  return classloader.loadClass("org.embulk.input.s3.S3FileInputPlugin")
end
Java
PluginClassLoaderFactory.create(URL[] jarPaths) {
  return new PluginClassLoader(jarPaths);
}
Embulk
• Embulk is a plugin-based parallel bulk data loader.
• Guess plugins suggest which plugins are necessary and how to configure them.
• Executor plugins run plugins in parallel.
• Embulk core takes care of message passing between plugins.
• Embulk loads plugins using JRuby and Gem API.
./embulk.jar
$ ./embulk.jar guess example.yml
executable jar!
Header of embulk.jar
: <<BAT
@echo off
setlocal
set this=%~f0
set java_args=
rem ...
java %java_args% -jar %this% %args%
exit /b %ERRORLEVEL%
BAT
# ...
exec java $java_args -jar "$0" "$@"
exit 127
PK...
embulk.jar is a shell script
argument of “:” command (heredoc). “:” is a command that does nothing.
#!/bin/sh is optional. Empty first line means a shell script.
java -jar $0
shell script exits here (following data is ignored)
embulk.jar is a bat file
.bat exits here (following lines are ignored)
“:” means a comment-line
embulk.jar is a jar file
jar (zip) format ignores headers (file entries are in footer)
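A small sketch of why a script header can sit in front of a jar: a zip reader locates the archive through the end-of-central-directory record, which is found by scanning backwards from the end of the file. The example builds the smallest valid zip (an empty archive, which is just that 22-byte record), prepends a shell header, and shows that the record is still found. The helper names are made up for this illustration, and a real jar contains file entries as well.

```ruby
# Signature of the zip end-of-central-directory (EOCD) record.
EOCD_SIGNATURE = "PK\x05\x06".b

# Smallest valid zip: an EOCD record describing an archive with no entries.
# Fields after the signature: disk numbers, entry counts, central directory
# size and offset, and comment length, all zero (18 bytes).
def empty_zip
  EOCD_SIGNATURE + [0, 0, 0, 0, 0, 0, 0].pack("vvvvVVv")
end

# Byte offset of the last EOCD signature, searching from the end of the data.
def find_eocd_offset(bytes)
  bytes.rindex(EOCD_SIGNATURE)
end

header = "exec java -jar \"$0\" \"$@\"\nexit 127\n".b
polyglot = header + empty_zip
offset = find_eocd_offset(polyglot)
```

Here `offset` equals `header.bytesize`: the leading script text does not stop the reader from finding the archive metadata at the end of the file.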
Pitfalls & Challenges
Pitfalls & Challenges
• Plugin version conflicts
• Performance impact due to loosely-coupled API
Plugin Version Conflicts
[Diagram: Embulk Core on one Java Runtime; embulk-input-s3.jar needs aws-sdk.jar v1.9 while embulk-output-redshift.jar needs aws-sdk.jar v1.10: version conflicts!]
Multiple Classloaders in JVM
[Diagram: Embulk Core on one Java Runtime; Class Loader 1 holds embulk-input-s3.jar with aws-sdk.jar v1.9, Class Loader 2 holds embulk-output-redshift.jar with aws-sdk.jar v1.10: isolated environments]
Version conflicts in a JRuby Runtime
[Diagram: Embulk Core on the Java Runtime with one JRuby Runtime; embulk-input-sfdc.gem needs httpclient 2.5.0 while embulk-input-marketo.gem needs httpclient v2.6.0: version conflicts!]
Multiple JRuby Runtime?
[Diagram: Fluentd Core; fluent-plugin-sql.gem (activerecord ~> 3.4) in Sub VM 1? and fluent-plugin-presto.gem ? (activerecord ~> 4.2) in Sub VM 2?: isolated environments?]
Version conflicts in Fluentd
[Diagram: Fluentd Core on one CRuby Runtime; fluent-plugin-sql.gem needs activerecord ~> 3.4 while fluent-plugin-presto.gem ? needs activerecord ~> 4.2: version conflicts!]
Challenges
• Version conflict is not completely solved.
  • Java can use multiple ClassLoaders
  • I haven’t figured out how to do the same thing in Ruby
• I don’t have clear ideas to solve performance impact
  • Write more code to learn?
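To make the Ruby side of the version conflict concrete, this sketch uses the standard `Gem::Requirement` and `Gem::Version` classes to check whether any single version can satisfy every plugin's requirement. The helper name and the list of available versions are made up for illustration.

```ruby
require "rubygems"

# Returns the available versions that satisfy every requirement string.
def versions_satisfying_all(requirements, available)
  reqs = requirements.map { |requirement| Gem::Requirement.new(requirement) }
  available.map { |version| Gem::Version.new(version) }
           .select { |version| reqs.all? { |req| req.satisfied_by?(version) } }
end

available = %w[3.4.0 4.2.0 4.2.5]

# Two plugins in one runtime asking for "~> 3.4" and "~> 4.2":
conflict = versions_satisfying_all(["~> 3.4", "~> 4.2"], available)

# Two plugins whose requirements overlap:
compatible = versions_satisfying_all(["~> 4.2", ">= 4.2.1"], available)
```

`conflict` comes back empty. Only one version of a gem can be activated in a Ruby process, so two plugins with disjoint requirements cannot be loaded together unless they are isolated, which is the part that remains unsolved for Ruby.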
Wrapping Up
“How did I build Plugin Architecture?”
• I built Fluentd using dynamic plugin loader.
  • “Plugin calls Plugins”
  • Most of features are provided by the ecosystem of plugins.
• I built Embulk using combination of:
  • Dependency Injection,
  • JRuby to implement a Dynamic Plugin Loader,
  • Java VM and nested ClassLoaders to load multiple versions of plugins.
• But some problems are not solved yet:
  • Version conflicts in a Ruby VM.
  • Design patterns of plugins AND high performance.
What’s Next?
• You build plugin-based software architecture!
  • And you’ll tell me how you did it :-)
• I’m working on another project: a distributed workflow engine
  • Java VM + Python
Thank You!
Sadayuki Furuhashi
Founder & Software Architect