TRANSCRIPT
End-to-End Data Science and Machine Learning for Telcos Telstra's Use Case — Animesh Singh — Tim Osborne — Adam Makarucha
Think 2020 / DOC ID / May, 2020 / © 2020 IBM Corporation
Session 6123
CODAIT
Improving the Enterprise AI Lifecycle in Open Source
Center for Open Source Data & AI Technologies
CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise.
We contribute to and advocate for the open-source technologies that are foundational to IBM’s AI offerings.
30+ open-source developers!
Enterprise Machine Learning
The Machine Learning Lifecycle
Perception
*Source: Hidden Technical Debt in Machine Learning Systems
In reality, ML code is only a tiny part of the overall platform
And the ML workflow spans teams …
Data prep: Data ingestion • Data cleansing • Data analysis & transformation • Data validation • Data splitting
Model creation: Building a model • Model validation • Training at scale • Training optimization
Rollout: Deploying • Serving • Monitoring & Logging • Explainability • Finetune & improvements
Cross-cutting: Dataflow and Workflow Orchestration • Marketplace (AI Hub) • Data consistency (versioning) • Feature Engineering • Edge & Cloud
And it is much more complex…
End to end ML on Kubernetes?
First, can you become an expert in ...
● Containers ● Packaging ● Kubernetes service endpoints ● Persistent volumes ● Scaling ● Immutable deployments ● GPUs, Drivers & the GPL ● Cloud APIs ● DevOps ● ...
We need a platform. Enter Kubeflow
Artifacts: Prepared Data → Prepared and Analyzed Data; Untrained Model → Trained Model → Deployed Model
Libraries and CLIs (focus on end users): Arena, kfctl, kubectl, fairing
Systems (combine multiple services): katib, pipelines, notebooks
Low-level APIs / services (single function): TFJob, PyTorchJob, Jupyter CR, Seldon CR, kube-bench
• End-to-end ML platform on Kubernetes, focused on multiple aspects of the model lifecycle
• Originated at Google, and has grown to have a large community of developers
• Google, IBM, Cisco, Red Hat, Intel, Microsoft and others contributing
• IBM is the 2nd-largest contributor in terms of overall commits, with IBM maintainers (committers/reviewers) in Katib (HPO + training), KFServing, Manifests, Pipelines, etc.
Additional components (not all shown): Metadata, Orchestration, Pipelines CR, Argo, Study Job, MPI CR, Spark Job, Model DB, TFX, IAM, Scheduling (some developed by Kubeflow, others developed outside Kubeflow)
Kubeflow: https://github.com/kubeflow
Jupyter Notebooks
Workflow Building: Kale, Fairing, TFX, Airflow, +
Pipelines: KF Pipelines
Tools: HP Tuning, Tensorboard
Serving: KFServing, Seldon Core, TFServing, +
Training Operators: Tensorflow, Pytorch, XGBoost, +
Metadata; Monitoring: Prometheus
Data Management: Versioning, Reproducibility, Secure Sharing
Develop (Kubeflow Jupyter Notebooks)
Data Scientist
- Self-service Jupyter Notebooks provide faster model experimentation
- Simplified configuration of CPU/GPU, RAM, Persistent Volumes
- Faster model creation with training operators, TFX, magics, workflow automation (Kale, Fairing)
- Simplify access to external data sources (using stored secrets)
- Easier protection, faster restoration & sharing of “complete” notebooks
IT Operator
- Profile Controller, Istio, Dex enable secure RBAC to notebooks, data & resources
- Smaller base container images for notebooks, fewer crashes, faster to recover
Distributed Model Training and HPO (TFJob, PyTorch Job, MPI Job, Katib, …)
Addresses one of the key goals for the model-builder persona: distributed model training and hyperparameter optimization for TensorFlow, PyTorch, etc. Common problems in HP optimization:
– Overfitting
– Wrong metrics
– Too few hyperparameters
Katib: a fully open source, Kubernetes-native hyperparameter tuning service
– Inspired by Google Vizier
– Framework agnostic
– Extensible algorithms
– Simple integration with other Kubeflow components
Kubeflow also supports distributed MPI-based training using Horovod
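A Katib experiment is itself a Kubernetes custom resource. As a rough sketch of what that looks like, the snippet below builds a random-search Experiment manifest as a plain Python dict; the trial image is omitted and the metric name, parameter names, and ranges are hypothetical, not from this deck:

```python
# Sketch of a Katib v1beta1 Experiment manifest built as a plain dict.
# The objective metric and parameter ranges are illustrative placeholders;
# a real experiment also needs a trialTemplate referencing your training image.

def make_experiment(name: str, max_trials: int = 12) -> dict:
    """Build a Katib Experiment spec for random-search hyperparameter tuning."""
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": "kubeflow"},
        "spec": {
            "objective": {
                "type": "maximize",
                "objectiveMetricName": "accuracy",  # metric your training job logs
            },
            "algorithm": {"algorithmName": "random"},
            "maxTrialCount": max_trials,
            "parallelTrialCount": 3,
            "parameters": [
                {
                    "name": "learning_rate",
                    "parameterType": "double",
                    "feasibleSpace": {"min": "0.001", "max": "0.1"},
                },
                {
                    "name": "batch_size",
                    "parameterType": "int",
                    "feasibleSpace": {"min": "16", "max": "128"},
                },
            ],
        },
    }

exp = make_experiment("mnist-random-search")
```

On a cluster this dict would be submitted with `kubectl apply` or the Kubernetes Python client's `CustomObjectsApi`; Katib then launches trials and tracks the objective metric per trial.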
KFServing
● Founded by Google, Seldon, IBM, Bloomberg and Microsoft
● Part of the Kubeflow project
● Focus on the 80% use cases: single-model rollout and update
● KFServing 1.0 goals:
○ Serverless ML inference
○ Canary rollouts
○ Model explanations
○ Optional pre/post processing
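An InferenceService is likewise a custom resource. The sketch below builds a minimal v1alpha2 manifest as a dict, including the optional canary-traffic split mentioned in the goals above; the service name and model URI are placeholders, not a real deployment:

```python
# Sketch of a KFServing v1alpha2 InferenceService manifest as a plain dict.
# The name and storage URI are placeholders for illustration only.

def make_inference_service(name: str, storage_uri: str, canary_percent: int = 0) -> dict:
    """Build an InferenceService spec for a TensorFlow model, optionally canaried."""
    spec = {
        "default": {
            "predictor": {
                "tensorflow": {"storageUri": storage_uri}
            }
        }
    }
    if canary_percent:
        # Canary rollout: route a slice of traffic to a newer model version.
        spec["canaryTrafficPercent"] = canary_percent
    return {
        "apiVersion": "serving.kubeflow.org/v1alpha2",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": spec,
    }

svc = make_inference_service("flowers", "gs://kfserving-samples/models/tensorflow/flowers")
```

Applying this manifest asks KFServing to stand up a serverless endpoint (scale-to-zero via Knative) in front of the stored model.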
Kubeflow Pipelines
§ Containerized implementations of ML Tasks
§ Pre-built components: Just provide params or code snippets (e.g. training code)
§ Create your own components from code or libraries
§ Use any runtime, framework, data types
§ Attach k8s objects - volumes, secrets
§ Specification of the sequence of steps
§ Specified via Python DSL
§ Inferred from data dependencies on input/output
§ Input Parameters
§ A “Run” = Pipeline invoked w/ specific parameters
§ Can be cloned with different parameters
§ Schedules
§ Invoke a single run or create a recurring scheduled pipeline
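The "containerized implementations of ML tasks" above can be sketched as a component specification: an image plus a command with declared inputs and outputs. The image name, command, and parameter names below are illustrative placeholders, not a published component:

```python
# Sketch of a containerized pipeline component, expressed as the kind of
# specification Kubeflow Pipelines works with. Image and command are
# hypothetical; a real component points at your own training container.

def make_component(name: str, image: str, command: list,
                   inputs: list, outputs: list) -> dict:
    """Describe one containerized ML task with declared inputs/outputs."""
    return {
        "name": name,
        "inputs": [{"name": i} for i in inputs],
        "outputs": [{"name": o} for o in outputs],
        "implementation": {
            "container": {
                "image": image,
                "command": command,
            }
        },
    }

train = make_component(
    "dnn-trainer",
    "example.com/ml/trainer:latest",   # placeholder image
    ["python", "train.py"],
    inputs=["train_data", "learning_rate"],
    outputs=["model"],
)
```

The declared inputs and outputs are what lets the pipeline engine infer the step sequence from data dependencies, as the bullets above describe.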
Define Pipeline with Python SDK
@dsl.pipeline(name='TaxiCabClassificationPipelineExample')
def taxi_cab_classification(
        output_dir,
        project,
        train_data='gs://bucket/train.csv',
        evaluation_data='gs://bucket/eval.csv',
        target='tips',
        learning_rate=0.1,
        hidden_layer_size='100,50',
        steps=3000):
    tfdv = TfdvOp(train_data, evaluation_data, project, output_dir)
    preprocess = PreprocessOp(train_data, evaluation_data, tfdv.outputs['schema'], project, output_dir)
    training = DnnTrainerOp(preprocess.output, tfdv.outputs['schema'], learning_rate, hidden_layer_size, steps, target, output_dir)
    tfma = TfmaOp(training.output, evaluation_data, tfdv.outputs['schema'], project, output_dir)
    deploy = TfServingDeployerOp(training.output)
Compile and Submit Pipeline Run
dsl.compile(taxi_cab_classification, 'tfx.tar.gz')
run = client.run_pipeline(
    'tfx_run', 'tfx.tar.gz',
    params={'output': 'gs://dpa22', 'project': 'my-project-33'})
From Single Apps to Complete Platform
Dec 2017 – Introduce Kubeflow: JupyterHub, TFJob, TFServing
May 2018 – Kubeflow 0.1: Argo, Ambassador, Seldon
Aug 2018 – Kubeflow 0.2: Katib (HP tuning), Kubebench, PyTorch
Oct 2018 – Kubeflow 0.3: kfctl.sh, TFJob v1alpha2
Jan 2019 – Kubeflow 0.4: Pipelines, JupyterHub UI refresh, TFJob & PyTorch beta
April 2019 – Kubeflow 0.5: KFServing, Fairing, Jupyter WebApp + CR
Jul 2019 – Kubeflow 0.6: Metadata, Kustomize, multi-user support
Sep 2019 – Contributor Summit
November 2019 – Kubeflow 0.7: Pipelines+, KFServing v0.2, kfctl refactor
March 2020 – Kubeflow 1.0: production-ready stable components
Phases: Individual Applications → Connecting Apps and Metadata → Productionisation & Hardening
Telstra AI Lab - (TAIL) - Configuration
• Kubernetes – 1.15
• Spectrum Scale CSI Driver
• MetalLB for Load Balancing
• Istio 1.3.1 for ingress
• Kubeflow – 1.0.1
• Jupyter Notebook images are IBM's multi-architecture PowerAI images (https://hub.docker.com/r/ibmcom/powerai/tags)
Telstra AI Lab - (TAIL)
Mixed architecture: 2x IBM Power9 AC922 nodes, 4x Cisco Intel nodes
Telstra AI Lab - (TAIL)
237.6 TFLOPS of single-precision GPU performance
Telstra AI Lab - (TAIL): Compute
4x NVLink'ed Nvidia V100 GPUs
4x PCIe Nvidia V100 GPUs
64x Power9 cores
68x Intel cores
Telstra AI Lab - (TAIL): AC922
[Diagram: AC922 node with two Power9 CPUs linked to the V100 GPUs over 150 GB/s NVLink connections]
Large Model Support: able to train models that exceed GPU memory.
Distributed Deep Learning: linear scaling for deep-learning training across multiple GPU-enabled nodes.
Supports open-source DL frameworks: Tensorflow, Pytorch and Caffe are all supported and optimized.
Telstra AI Lab - (TAIL): Configuration
• Tainted nodes with a node selector, so the Power nodes only do data science
• Kubeflow running on x86
• x86 nodes can be used to run other components, such as databases, microservices, etc.
Telstra AI Lab - (TAIL): Challenges
• Enterprise proxy and internal host names
  • Running a squid proxy that routes to the enterprise proxy to enable access to docker.io, github.com, pypi.org, etc.
  • Configure HostAliases in notebooks
• Getting data into the cluster
  • Provisioned a Minio object storage instance in each user namespace, accessible via the Kubeflow endpoint
• User over-provisioning of cores / PVCs
  • Locked defaults and created reasonable limits
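"Locked defaults and reasonable limits" maps naturally onto Kubernetes LimitRange objects applied per user namespace. A minimal sketch follows; the object name, namespace, and the CPU/memory figures are illustrative, not TAIL's actual limits:

```python
# Sketch of a Kubernetes LimitRange that caps per-container CPU and memory
# in a user namespace. All figures are illustrative placeholders.

def make_limit_range(namespace: str) -> dict:
    """Build a LimitRange giving containers sane defaults and a hard ceiling."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "notebook-limits", "namespace": namespace},
        "spec": {
            "limits": [{
                "type": "Container",
                "default": {"cpu": "2", "memory": "8Gi"},         # default limit
                "defaultRequest": {"cpu": "1", "memory": "4Gi"},  # default request
                "max": {"cpu": "8", "memory": "32Gi"},            # hard ceiling
            }]
        },
    }

lr = make_limit_range("user-alice")  # "user-alice" is a placeholder namespace
```

With this in place, a notebook pod that omits resource requests still gets bounded defaults, and no single user can request more than the ceiling.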
Telstra AI Lab - (TAIL): Successes
• Easy to select the Power platform with configuration options in the notebook server
• Added open source code to enable node selector, tolerations, and hostAliases
• Using Kubeflow-Kale to simplify pipelining of code
  • Significantly simplifies the adoption of pipelines and conversion of code
  • First code conversion took ~1 day; optimisation of the code took ~2 weeks
• Significant performance improvements thanks to the available compute and software tools
  • First use case went from a run time of 15 hours down to 2 hours
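The node selector, tolerations, and hostAliases mentioned among the successes are ordinary pod-spec fields. A hedged sketch of the overrides that would steer a notebook onto the Power nodes follows; the taint key, IP, and host name are placeholders, not TAIL's real values:

```python
# Sketch of pod-spec overrides for a notebook targeting the Power nodes:
# a nodeSelector for the ppc64le architecture, a toleration matching the
# taint on those nodes, and hostAliases for internal host names.
# Taint key/value, IP, and hostnames are placeholders.

def power_notebook_overrides() -> dict:
    """Pod-spec fields to schedule a notebook on tainted Power nodes."""
    return {
        "nodeSelector": {"kubernetes.io/arch": "ppc64le"},
        "tolerations": [{
            "key": "dedicated",        # placeholder taint key
            "operator": "Equal",
            "value": "data-science",
            "effect": "NoSchedule",
        }],
        "hostAliases": [{
            "ip": "10.0.0.10",         # placeholder internal address
            "hostnames": ["repo.internal.example.com"],
        }],
    }

spec = power_notebook_overrides()
```

Merged into the notebook pod template, these fields keep general workloads off the GPU nodes while letting notebooks resolve internal host names behind the enterprise proxy.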
Telstra AI Lab - (TAIL) – Future state
• Red Hat OpenShift – 4.3
• GPU Operator
• Kubeflow Operator
• Extending the compute
• Integrate feature stores and streaming technologies
• Integrate with CI/CD tools (Tekton Pipelines)