
Post on 09-Jul-2020


Tigres: Template Interfaces for Agile Parallel Data-Intensive Science

Lavanya Ramakrishnan

[email protected]

http://tigres.lbl.gov



(CS Biased) View of Workflow Challenges: Gene2Life Molecular Biology Analysis

•  Mostly simple sequential workflow
•  Repetitive
•  Tracking
   –  Provenance, metadata, etc.
•  “Iterative”
   –  Swap programs and data sets
•  Desktop to HPC/Cloud

[Figure: Gene2Life workflow. Input: nucleotide or amino acid. Stages: search (blast), alignment (clustalw), analysis (dnapars, protpars), visualization (drawgram). Output: tree files (.ps and .pdf).]


(CS Biased) View of Workflow Challenges: MotifNetwork

•  Mid-sized “compute-intensive” workflow
•  Mix of single processor and multiprocessor tasks
•  Intermediate data formatting/logic
•  Move
•  Share

People use ad-hoc scripts, keep notes in text files and encode metadata in file names

[Figure: MotifNetwork workflow. Tasks: Pre Interproscan, Interproscan, Post Interproscan, Motif. Annotations: N=135, 256 processors. Stage labels: data preparation, analysis, aggregation, processing.]

Source: Jeff Tilson


Big Data is here …

•  Larger volumes of data
•  More dynamic content
•  Significant variety in data types
•  Large amounts of unstructured data
•  Increased rate of data arrival
•  Need for faster data processing rates
•  …

… It is not getting easier


MapReduce and Hadoop Ecosystem

[Figure: Map and Reduce phases]

•  Computation performed on large volumes of data in parallel
•  Provides scaling, data locality, fault tolerance
•  Higher-level tools have evolved for specific data analysis
•  There are challenges in using MapReduce/Hadoop for scientific workflows
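
For readers unfamiliar with the model named on this slide, here is a minimal single-process sketch of the map and reduce phases in plain Python. It is not Hadoop code and it does not distribute any work. The names `map_phase` and `reduce_phase` and the word-count task are illustrative assumptions, not anything from the talk.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1


def reduce_phase(pairs):
    """Reduce: group the pairs by key and sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    lines = ["blast clustalw", "blast dnapars"]
    print(reduce_phase(map_phase(lines)))
```

A real MapReduce framework would run many map tasks close to the data and shuffle the pairs to reducers by key. The sketch only shows the shape of the two user-supplied functions.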


Tigres: Design templates for common scientific workflow patterns

[Figure labels: Base Tigres templates; "LightSrc" Domain templates; Application "LightSrc-1"; Application "LightSrc-2"; Create and Debug; Share; Scale up]

Implement templates as a library in an existing language


Tigres Templates

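[Figure: the four base templates]

•  Sequence: Task1 ... TaskN, one after another
•  Parallel: Task1 ... TaskN, side by side
•  Split: one Task followed by Task1 ... Taskn
•  Merge: Task1 ... Taskn followed by one task (Tasko)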


Key Aspects of Tigres

•  Targeted for large-scale data-intensive workflows
   –  Motivated by “MapReduce” model
   –  No centralized managed model
•  Library model embedded in existing languages such as Python and C
   –  “Extend current scripting/programming tools”
   –  API-based, embedded in code
•  Light-weight execution framework
   –  “As easy to run as an MPI program on an HPC resource”
   –  No persistent services
•  Scientist-Centered Design Process
   –  Get feedback from user before writing all the code


Tigres Design Process

[Figure labels: Design; Execution Environment; API; Implementation; Optimizations; Scientist-Centered Design Process]


Create a workflow

1.  Define input types
2.  Define task
3.  Assign input values
4.  Repeat 1-3 above for other tasks
5.  Create appropriate input arrays
6.  Create appropriate task arrays
7.  Create (and run) the template

[Figure: example workflow with tasks Task1, Task2, Task3, Task6, Task40 ... Task45 and Task50 ... Task55. Input annotations for Task1: input1_task1 {Type: Object_a}, input2_task1 {Type: int}]
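
The seven steps can be walked through with a few lines of plain Python. This is a minimal sketch under stated assumptions: the `Task` class and `run_sequence` function are hypothetical stand-ins written for this example, not the Tigres library, and the two toy tasks are invented for illustration.

```python
# Hypothetical stand-ins that follow the steps on the slide; NOT the Tigres library.
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Task:
    name: str
    func: Callable[..., Any]
    input_types: List[type]


def run_sequence(name, task_array, input_array):
    """Run the tasks in order, each with its own input values; return all outputs."""
    outputs = []
    for task, values in zip(task_array, input_array):
        # Check the values against the declared input types before running.
        for value, expected in zip(values, task.input_types):
            if not isinstance(value, expected):
                raise TypeError(f"{name}/{task.name}: expected {expected.__name__}")
        outputs.append(task.func(*values))
    return outputs


# Steps 1-3: define input types, define the task, assign input values.
types_1 = [int, str]
task_1 = Task("repeat", lambda n, s: s * n, types_1)
values_1 = [2, "ab"]

# Step 4: repeat steps 1-3 for another task.
types_2 = [str]
task_2 = Task("shout", lambda s: s.upper(), types_2)
values_2 = ["done"]

# Steps 5-6: create the input array and the task array.
input_array = [values_1, values_2]
task_array = [task_1, task_2]

# Step 7: create and run the template.
outputs = run_sequence("my seq", task_array, input_array)
print(outputs)
```

Declaring the input types separately from the values is what lets the runner reject a mismatched input before any task executes.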


Templates

•  Sequence ( name, task_array, input_array )
   –  e.g., output[ ] = Sequence("my seq", task_array_12, input_array_12)
•  Parallel ( name, task_array, input_array )
   –  e.g., output[ ] = Parallel("abc", task_array_12, input_array_12)
•  Split ( name, split_task, split_input_values, task_array, task_array_in )
   –  e.g., Split("split", task_x1, input_value_1, spl_t_arr, spl_i_arr)
•  Merge ( name, task_array, input_array, merge_task, merge_input_values )
   –  e.g., Merge("merge", syn_t_arr, syn_i_arr, task_x1, input_value_1)
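
The sketch below shows one way the four templates could behave, using plain Python callables as tasks. The semantics are assumptions read off the slide diagrams, and these functions are hypothetical helpers rather than the Tigres API. In particular, passing the split task's output to each parallel task, and the list of parallel outputs to the merge task, is a choice made for this example.

```python
# Assumed semantics; hypothetical helpers, not the Tigres API.
from concurrent.futures import ThreadPoolExecutor


def sequence(name, task_array, input_array):
    """Run each task with its own inputs, one after another."""
    return [task(*args) for task, args in zip(task_array, input_array)]


def parallel(name, task_array, input_array):
    """Run each task with its own inputs concurrently; outputs keep task order."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task, *args)
                   for task, args in zip(task_array, input_array)]
        return [future.result() for future in futures]


def split(name, split_task, split_input_values, task_array, input_array):
    """Run one task first, then fan out: each parallel task gets its output."""
    head = split_task(*split_input_values)
    fanned_inputs = [(head, *args) for args in input_array]
    return parallel(name, task_array, fanned_inputs)


def merge(name, task_array, input_array, merge_task, merge_input_values):
    """Run the parallel tasks, then fan in: one task gets the list of outputs."""
    partial = parallel(name, task_array, input_array)
    return merge_task(partial, *merge_input_values)


def square(x):
    """Toy task: square a number."""
    return x * x


def total(values, offset):
    """Toy merge task: sum the parallel outputs and add an offset."""
    return sum(values) + offset


if __name__ == "__main__":
    print(sequence("my seq", [square, square], [(2,), (3,)]))
    print(merge("merge", [square, square], [(2,), (3,)], total, (1,)))
```

Because every template takes a task array plus a matching input array, swapping `sequence` for `parallel` changes how the tasks run without changing how they are declared.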


Scientist-Centered Design Process

•  Use Google Docs for interactive step-by-step exercises with a “facilitator” and a “human compiler”
   –  white/black board didn’t work
•  Preparation
   –  terminology, basic template, example, exercise
   –  15 minutes preparation time
•  Testers (~6)
   –  Developers, web design/UI staff, application scientists


Impact of Scientist-Centered Design

•  Concept understanding by user
•  Changes to nomenclature
•  Support in C also important

Priorities for first prototype:

•  Desktop to NERSC
•  Monitoring
•  Intermediate state management

[Figure labels: Design; Execution Environment; API; Implementation; Optimizations; Scientist-Centered Design Process]


What did we learn?

•  Documentation clarity was key
   –  Majority of our participants “coded by example”
•  Nomenclature was important
   –  Confusion with initial terminology
•  Keep API simple
   –  Dependencies/output: two different styles within and outside a template can be confusing
•  Support extended API
   –  Optional parameters, different programming styles
•  Execution semantics were important
   –  Monitoring, logging

It took days for first stub implementation rather than weeks (or months)!


Summary

•  “Scientist-friendly” programming API to manage workflows

•  Plan to test API with different user groups


Tigres Team

•  Core team
   –  Deb Agarwal (PI), Lavanya Ramakrishnan, Daniel Gunter
   –  Gilberto Pastorello, Valerie Hendrix, Ryan Rodriguez
•  CS research groups
   –  Shane Canon
   –  John Shalf
•  Science research groups
   –  Cosmology: Alex Kim, Rollin Thomas, Stephen Bailey
   –  Gamma Ray: Dan Chivers
   –  Advanced Light Source: Dula Parkinson
   –  HEP: Paolo Calafiura
   –  Materials: Kristin Persson


Website: http://tigres.lbl.gov
Contact: [email protected]


Monitoring

•  Initialize
   –  init (tigres-destination, user-destination)
•  User Logging
   –  setLevel(level): level is an enumeration from FATAL up to TRACE
   –  write(level, name, key-value pairs)
•  Query
   –  getStatus(type, names)
   –  getInfo(name, key-value-pairs)
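
To make the shape of these calls concrete, here is a minimal in-memory sketch. It is a hypothetical stand-in, not the Tigres monitoring API. The `Monitor` class, the snake_case method names, and the levels between FATAL and TRACE are assumptions for this example, since the slide names only the two endpoints. `getStatus` is left out because the slide does not say what it returns.

```python
# Hypothetical in-memory stand-in for the calls listed on the slide.
from enum import IntEnum


class Level(IntEnum):
    # Only FATAL and TRACE come from the slide; the levels between are assumed.
    FATAL = 0
    ERROR = 1
    WARN = 2
    INFO = 3
    DEBUG = 4
    TRACE = 5


class Monitor:
    def __init__(self, tigres_destination, user_destination):
        # The two destinations stand for where framework and user records go.
        self.tigres_destination = tigres_destination
        self.user_destination = user_destination
        self.level = Level.INFO
        self.records = []

    def set_level(self, level):
        """Keep only records at this level or more severe."""
        self.level = level

    def write(self, level, name, **key_values):
        """Store a named record of key-value pairs if its level passes the filter."""
        if level <= self.level:
            self.records.append({"level": level, "name": name, **key_values})

    def get_info(self, name, **key_values):
        """Return the records with this name whose fields match all given pairs."""
        return [record for record in self.records
                if record["name"] == name
                and all(record.get(k) == v for k, v in key_values.items())]


monitor = Monitor("tigres.log", "user.log")
monitor.set_level(Level.INFO)
monitor.write(Level.INFO, "task1", state="done", rows=42)
monitor.write(Level.TRACE, "task1", detail="dropped at INFO level")
print(monitor.get_info("task1", state="done"))
```

Storing each record as free-form key-value pairs keeps user logging flexible while still allowing simple queries by name and field.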


Input and Task

•  InputTypes ( name, types[ ] )
   –  e.g., input_type1 = InputTypes("Types1", {"int", "string"})
•  InputValues ( name, values[ ] )
   –  e.g., input_value_1 = InputValues("Values1", {1, "hello"})
•  InputArray ( name, input_values[ ] )
   –  e.g., input_array_12 = InputArray("Array12", {input_value_1, input_value_2})
•  Task ( name, type, impl_name, input_types, env )
   –  e.g., task_f1 = Task("A", FUNCTION, "myfunc", input_type1)
•  TaskArray ( name, task[ ] )
   –  e.g., task_array_xy = TaskArray("xy", {task_f1, task_x1})