Min Tu, Pradhan Cadabam
Gobblin Configuration Management
Gobblin Meetup June 2016
Agenda
1. Current Solutions and Motivation – Why we built Gobblin config?
2. Architecture – Gobblin config internals
3. Retention Example – How retention is configured using Gobblin config?
Job Configs vs. Dataset Configs

Copy Job
Source: /events/loginEvent, /events/logoutEvent
Destination: /events/loginEvent - 700, /events/logoutEvent - 777
- Permission for loginEvent 700
- Permission for logoutEvent 777

Option 1: One job per dataset
- Too many jobs
- Long whitelist
- Difficult to maintain

Copy Job 1
dest.permission = 700
whitelist = loginEvent

Copy Job 2
dest.permission = 777
whitelist = logoutEvent

Option 2: Prefix
- Too many configs
- Cannot have a single config for all datasets with the same permissions

Copy Job with prefix
loginEvent.dest.permission = 700
logoutEvent.dest.permission = 777
Data Life Cycle Management Configs
[Diagram: a Conversion Job reads /events/loginEvent_Avro and writes /events/loginEvent_Orc; a Copy Job copies /events/loginEvent_Orc from source to destination; Retention Jobs run on the /events/loginEvent_Orc paths]
• Shared configs across jobs
• Destination path of conversion job is source path of copy job
• Retention job works on destination path of copy job
• Dataset needs to be enabled in all jobs
Other Motivations
• New version of configs should be deployable without deploying new binaries
• Should be easy to roll back to previous stable version of configs
• Config changes should have an audit trail
• Complex value types and substitution resolution support
Dataset Configuration Management
At a very high level, we extend Typesafe Config with:
• Abstraction of a Config Store
• Config versioning
• Support for logical "import" URIs
• Ability to traverse the "import" relationships
Architecture
[Diagram, top to bottom]
Client Application
ConfigClient API
ConfigStore API
HadoopFS Store | HiveMetaStore Adapter | MySQL Adapter | Zookeeper Adapter | …
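Data Model
[Diagram: Config Store]
Dataset config key (URI): /events
Dataset config key (URI): /events/loginEvent
  Key1: value1
  Key2: value2
  …
  KeyM: valueM
Tag config key (URI): /tags
Tag config key (URI): /tags/highPriority
  keyA: valueX
  keyB: valueY
Edges: a dataset config key "imports" a tag config key, and the tag config key is "imported by" the dataset config key. Each key has an implicit import of its parent (/events/loginEvent from /events, /tags/highPriority from /tags).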
HOCON format
• Supports Java properties files
• Supports JSON files
• Value substitution
• "+=" syntax to append elements to arrays, e.g. path += "/bin"
• …
gobblin.retention : {
  selection {
    timeBased.lookbackTime=3y
  }
}
Using Configs in code
ConfigClient client =
ConfigClient.createConfigClient(VersionStabilityPolicy policy);
Config config = client.getConfig(URI uri);
Collection<URI> imports = client.getImports(URI dataset, boolean recursive);
Collection<URI> importedBy = client.getImportedBy(URI tag, boolean recursive);
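The two traversal calls above can be illustrated with a small in-memory graph. This is only a sketch under stated assumptions: ImportGraphSketch, addImport and demo are hypothetical names, the class is not Gobblin's ConfigClient, and implicit parent imports are recorded here as ordinary edges for simplicity.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a config store's import graph (not Gobblin code).
class ImportGraphSketch {
    // config key -> keys it imports directly
    private final Map<URI, List<URI>> imports = new LinkedHashMap<>();

    void addImport(URI from, URI to) {
        imports.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Direct imports of key; with recursive=true, also the imports of those imports.
    Collection<URI> getImports(URI key, boolean recursive) {
        Set<URI> result = new LinkedHashSet<>();
        Deque<URI> pending = new ArrayDeque<>(imports.getOrDefault(key, List.of()));
        while (!pending.isEmpty()) {
            URI next = pending.removeFirst();
            // result.add returns false for keys already seen, which also guards against cycles
            if (result.add(next) && recursive) {
                pending.addAll(imports.getOrDefault(next, List.of()));
            }
        }
        return result;
    }

    // Keys that import the given key; with recursive=true, also their importers.
    Collection<URI> getImportedBy(URI key, boolean recursive) {
        Set<URI> result = new LinkedHashSet<>();
        Deque<URI> pending = new ArrayDeque<>();
        pending.add(key);
        while (!pending.isEmpty()) {
            URI target = pending.removeFirst();
            for (Map.Entry<URI, List<URI>> e : imports.entrySet()) {
                if (e.getValue().contains(target) && result.add(e.getKey()) && recursive) {
                    pending.add(e.getKey());
                }
            }
        }
        return result;
    }

    // Graph built from keys named in this deck.
    static ImportGraphSketch demo() {
        ImportGraphSketch g = new ImportGraphSketch();
        g.addImport(URI.create("/events/shareEvent"), URI.create("/tags/highPriority")); // explicit
        g.addImport(URI.create("/events/shareEvent"), URI.create("/events"));            // implicit parent
        g.addImport(URI.create("/tags/highPriority"), URI.create("/tags"));              // implicit parent
        return g;
    }

    public static void main(String[] args) {
        ImportGraphSketch g = demo();
        System.out.println(g.getImports(URI.create("/events/shareEvent"), true));
        System.out.println(g.getImportedBy(URI.create("/tags"), true));
    }
}
```

With recursive set to false, only the direct neighbours are returned; with true, the walk continues until no new keys are found.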
Config lifecycle at LinkedIn
Example of a config store on HDFS
ROOT
├── _CONFIG_STORE                  // contents = latest non-rolled-back version
    ├── 1.0.53                     // version directory
        ├── events
        │   ├── main.conf
        │   ├── loginEvent
        │   │   ├── main.conf      // configuration file for /events/loginEvent
        │   │   └── includes.conf  // specify import links for /events/loginEvent
        │   ├── shareEvent
        │   │   └── includes.conf
        │   └── clickEvent
        │       └── includes.conf
        │
        └── tags
            ├── highPriority
            │   ├── main.conf      // configuration file for /tags/highPriority
            │   └── includes.conf  // specify import links for /tags/highPriority
            ├── blacklist
            └── 10Days
Retention
• Deleting data that is not required
• Most common retention policy is to delete data older than some days
Example
• Retention policy of 10 days for loginEvent
• Retention policy of 30 days for logoutEvent

Before Retention
├── events
    ├── loginEvent
    │   ├── 2016-06-20.avro
    │   └── 2016-06-25.avro
    └── logoutEvent
        ├── 2016-05-10.avro
        └── 2016-06-10.avro

After Retention
├── events
    ├── loginEvent
    │   └── 2016-06-25.avro
    └── logoutEvent
        └── 2016-06-10.avro
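The before/after example can be reproduced with a few lines of date arithmetic. The sketch below is illustrative only: RetentionSketch and selectForDeletion are hypothetical names, the run date of 2016-07-01 is an assumption (the slide does not state one), and this is not how Gobblin's retention policies are implemented.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical time-based selection: pick files older than N days.
class RetentionSketch {
    // Returns the files whose date is strictly before (today - retentionDays).
    static List<String> selectForDeletion(List<String> fileNames, int retentionDays, LocalDate today) {
        LocalDate cutoff = today.minusDays(retentionDays);
        List<String> toDelete = new ArrayList<>();
        for (String name : fileNames) {
            // File names are assumed to look like 2016-06-20.avro
            LocalDate date = LocalDate.parse(name.substring(0, name.lastIndexOf('.')));
            if (date.isBefore(cutoff)) {
                toDelete.add(name);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2016, 7, 1); // assumed run date
        // 10 days for loginEvent, 30 days for logoutEvent, as in the example
        System.out.println(selectForDeletion(List.of("2016-06-20.avro", "2016-06-25.avro"), 10, today));
        System.out.println(selectForDeletion(List.of("2016-05-10.avro", "2016-06-10.avro"), 30, today));
    }
}
```

Under the assumed run date, the selected files are the ones missing from the "After Retention" listing.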
More complex use cases in Production
• Default retention policy of 30 days for all events
• Retention policy of 10 days for loginEvent
• Blacklist retention for clickEvent
• 3 years retention for high priority events like shareEvent
● “events” is the common parent block for “shareEvent”, “loginEvent”, “logoutEvent”, “clickEvent”
● Each block implicitly imports configs from its parent block; for example, “logoutEvent” implicitly imports “events” (dashed lines)
● Any block can explicitly import any other block (solid lines)
● A child block overrides any key-value pairs specified in the parent block
Retention Config
● “logoutEvent” inherits the default retention of 30 days from implicit import, “events”
logoutEvent 30 Days
● “loginEvent” inherits the default retention of 30 days from implicit import, “events”
● “loginEvent” defines a 10 days policy which overrides the 30 days inherited from “events”
loginEvent 10 Days
● “shareEvent” explicitly imports a high priority tag which has retention of 3 years
● “clickEvent” explicitly imports blacklist tag which disables retention for “clickEvent”
Retention Config for share/clickEvent
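The override behaviour described on these slides can be sketched as a layered merge: parent values first, then explicit imports, then the block's own values. ConfigResolutionSketch and its methods are hypothetical, this is not Gobblin's resolution code, and the ordering among several explicit imports (later import wins here) is an assumption the slides do not cover. The sketch also has no protection against import cycles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical resolver: own values > explicit imports > implicit parent.
class ConfigResolutionSketch {
    static final String LOOKBACK = "gobblin.retention.selection.timeBased.lookbackTime";

    private final Map<String, Map<String, String>> own = new LinkedHashMap<>();
    private final Map<String, List<String>> explicitImports = new LinkedHashMap<>();

    void put(String configKey, String name, String value) {
        own.computeIfAbsent(configKey, k -> new LinkedHashMap<>()).put(name, value);
    }

    void addImport(String configKey, String imported) {
        explicitImports.computeIfAbsent(configKey, k -> new ArrayList<>()).add(imported);
    }

    // Parent of /events/loginEvent is /events; top-level keys have no parent.
    static String parentOf(String configKey) {
        int slash = configKey.lastIndexOf('/');
        return slash <= 0 ? null : configKey.substring(0, slash);
    }

    // Lowest precedence is applied first so later layers overwrite it.
    Map<String, String> resolve(String configKey) {
        Map<String, String> resolved = new LinkedHashMap<>();
        String parent = parentOf(configKey);
        if (parent != null) {
            resolved.putAll(resolve(parent));                 // implicit import
        }
        for (String imported : explicitImports.getOrDefault(configKey, List.of())) {
            resolved.putAll(resolve(imported));               // explicit import
        }
        resolved.putAll(own.getOrDefault(configKey, Map.of())); // own values win
        return resolved;
    }

    // Store populated with the values from the retention example.
    static ConfigResolutionSketch demo() {
        ConfigResolutionSketch store = new ConfigResolutionSketch();
        store.put("/events", LOOKBACK, "30d");
        store.put("/events/loginEvent", LOOKBACK, "10d");
        store.put("/tags/highPriority", LOOKBACK, "3y");
        store.addImport("/events/shareEvent", "/tags/highPriority");
        return store;
    }

    public static void main(String[] args) {
        ConfigResolutionSketch store = demo();
        for (String key : List.of("/events/logoutEvent", "/events/loginEvent", "/events/shareEvent")) {
            System.out.println(key + " -> " + store.resolve(key).get(LOOKBACK));
        }
    }
}
```

The blacklist tag for clickEvent is left out because the deck does not show which key it sets.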
├── events
│   ├── main.conf              // Default 30 Days
│   ├── loginEvent
│   │   └── main.conf          // 10 Days
│   ├── shareEvent
│   │   └── includes.conf      // Import /tags/highPriority
│   └── clickEvent
│       └── includes.conf      // Import /tags/blacklist
│
└── tags
    ├── highPriority
    │   └── main.conf          // Define 3 Years retention
    └── blacklist
HDFS Config store
Retention Config Examples

/events/main.conf
gobblin.retention : {
  dataset : {
    finder.class=gobblin.data.management.retention.CleanableDatasetFinder
    pattern="/events/*"
  }
  selection {
    policy.class = gobblin.data.management.SelectBeforeTimeBasedSelectionPolicy
    timeBased.lookbackTime=30d
  }
  version : {
    finder.class=gobblin.data.management.DateTimeDatasetVersionFinder
  }
}

/tags/highPriority/main.conf
gobblin.retention : {
  selection {
    timeBased.lookbackTime=3y
  }
}
Supported Policies
• SelectBeforeTimeBasedSelectionPolicy
• NewestKSelectionPolicy
• DailyDependentHourlyPolicy
• CombineSelectionPolicy
More policies: http://gobblin.readthedocs.io/en/latest/data-management/Gobblin-Retention/
Future work
• Config stores other than the HDFS-based config store
• Improve tooling, validation and UI for config store deployment
Questions