
Page 1: OpenShift: Practical Experience

OpenShift: Practical Experience

Dr Malcolm Beattie

IBM UK Systems Lab Services

3 November 2020

Session 1AB

Page 2: OpenShift: Practical Experience

Agenda
• Z-specific differences for OpenShift
• Capacity planning
• Persistent Volumes with NFS and the Local Storage Operator
• Adding an O/S configuration file to a node
• Troubleshooting: Logging into a node
• Copying files to/from containers and images
• Interactive use of containers and images
• Example images for interactive use (distros and languages)
• Interesting Open Source images and software
• Questions

Page 3: OpenShift: Practical Experience

Z-specific differences for OpenShift
• OpenShift releases on Z now ship at pretty much the same time as x86 and Power
  – Earlier 4.x releases were a bit behind; Z has now caught up and should stay that way
• Not many differences/restrictions for Z (and Power) compared to x86
  – Still a few, but these should close over time and then stay that way
• GA dates
  – 4.2 on 16 Oct 2019 (Jan 2020 on Z)
  – 4.5 on 14 Jul 2020 (Aug 2020 on Z)
  – 4.6 on 27 Oct 2020 (same date on Z)

Page 4: OpenShift: Practical Experience

Z Restrictions in OCP 4.6
• IBM Z and LinuxONE restrictions from the Release Notes for OCP 4.6
  – the subset that I think is most likely to be of interest
• Some Technology Preview features are not included
  – CSI volume snapshots
  – OpenShift Pipelines
• OCP features that are unsupported
  – Multus CNI plug-in
  – CSI volume cloning
  – NVMe
  – Persistent storage using Fibre Channel (unclear what this means - see soon)

Page 5: OpenShift: Practical Experience

Z Restrictions/introductions in OCP 4.6
• Persistent shared storage must be provisioned using NFS
• Persistent non-shared storage must be provisioned using local storage, like iSCSI, FC, or LSO with DASD/FCP [sic]
• Worker nodes must run Red Hat Enterprise Linux CoreOS (RHCOS)
• Supported features newly introduced in OCP 4.6
  – Persistent storage using iSCSI
  – Persistent storage using local volumes (Local Storage Operator)
  – OpenShift Do (odo)

Page 6: OpenShift: Practical Experience

Capacity Planning

Page 7: OpenShift: Practical Experience

Capacity Planning: how many nodes?
● The default cluster install gets you to create
  – 3 master nodes
  – 2 (or more) worker nodes
● There absolutely must be 3 master nodes (or 5)
  – if 2 of 3 master nodes are down, you are toast until/unless you can go through the tricky recovery method from the etcd data backup you definitely took, didn't you?
● Splitting out some internal pieces from worker nodes onto separate nodes marked as “infrastructure” can be good
  – HAProxy load balancers used for ingresses/routes, the image registry and metric gathering do not attract OCP vCPU licence requirements
  – Cluster performance is much, much less variable done that way (measured by Boeblingen)
● Nodes are just virtual machines; we're used to this on Z

Page 8: OpenShift: Practical Experience

Capacity Planning: CPU
● OCP is heavyweight
● OCP does a lot of things automatically to support an application environment; it comes with 100-150 pods running doing...
  – ...health checks on itself, applications and networking; checking updates for itself, operators and applications; gathering detailed metrics on itself and applications; alert checking and management; rotating certificates throughout the cluster; checking, updating, caching and pruning images, logs, operators and catalogs; reconciling differences between current and desired configurations
● Even without applications running, it will use 1-1.5 IFLs
  – This does not mean that applications use more CPU than they would otherwise
  – In fact, Boeblingen measured CPU usage as the same as running outside containers
  – This makes sense because a “container” really is just a process running in Linux
  – Work is continuing on making OCP more CPU-efficient for its own use

Page 9: OpenShift: Practical Experience

Capacity Planning: CPU
● Recommended minimum provision for a cluster is 3 IFLs
  – that gives 6 “vCPUs”, “threads” or “logical cores” depending on your preferred terminology
  – these can be overcommitted (of course) between your nodes
● You must give at least 3 vCPUs to each node; the official minimum is 4
  – This has nothing to do with actual consumable CPU capacity
  – This is because OCP/K8s allow pods to specify a minimum number of “milliCPUs” to fence off for their use, even if they don't use the actual CPU capacity
  – System pods shipped with OCP do specify these and are over-generous on Z, since a Z IFL provides much more capacity than an x86 core
  – Without enough vCPUs on the node, OCP will refuse to schedule user pods even though actual CPU utilisation is nowhere near full: the pods will stall forever in the scheduler queue
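For illustration, that fencing is expressed as resource requests in the pod spec; a minimal sketch (the pod name, image and values here are hypothetical, not the actual OCP system pod defaults):

apiVersion: v1
kind: Pod
metadata:
  name: request-demo
spec:
  containers:
  - name: app
    image: registry.example.com/team/myapp:latest
    resources:
      requests:
        cpu: 200m       # reserves 0.2 vCPU of the node's allocatable capacity for scheduling
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi

The scheduler adds up the requests of all pods on a node against the node's vCPU count, regardless of how busy those vCPUs actually are - which is why too few vCPUs blocks scheduling even on an idle node.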

Page 10: OpenShift: Practical Experience

Capacity Planning: Memory
● Official minimum for each master node is 16GB
  – Reducing that somewhat does not make things stop working
● Official minimum for each worker node is 8GB
  – Give the worker nodes enough for your application workloads
  – Once you have enough worker nodes for resilience, on Z you are more likely to want to scale up (give a node more memory and CPU) than to scale across multiple nodes, unless you have other good reasons to do so

Page 11: OpenShift: Practical Experience

Capacity Planning: Networking
● OCP, like most microservice environment designs, uses Software Defined Networking (SDN) to make things much, much easier for admins/devops and developers
  – pods and services get their own IP addresses
  – you don't care whether pods are in the same Linux instance, in different Linux instances on the same server/CEC or in different servers/CECs
● That SDN has some performance overhead
  – Some overhead can be reduced with further work (it's software) and that's happening
  – Some overhead can be removed completely by configuring deployments to use underlying networking hardware directly (OSA, HiperSockets, RoCE, ...) but the trade-off may be that the cluster cannot schedule them as flexibly or handle the HA/resilience for you
  – One way to handle multiple network technologies more nicely is the Multus CNI, but that's not available yet for OCP on Z as of 4.6
  – Z hardware is great at high-throughput/low-latency networking, but don't make assumptions about how it interacts with OCP/SDN

Page 12: OpenShift: Practical Experience

Capacity Planning: Disk
● Two kinds of OCP/K8s disk storage: Ephemeral and Persistent Volumes
● Ephemeral
  – Used within pod containers for the (main) filesystem, formed from the image plus a “copy-on-write” layer on top so the container sees a read/write root filesystem
  – Provisioned via subdirectories of the node's single install disk
  – Contents vanish when a pod is stopped or restarted
● Persistent Volumes (PVs) (note: not the same as an LVM PV - Physical Volume)
  – Used to provide persistent disk storage for pods
  – Pods request one via a PersistentVolumeClaim (PVC) for a given size and storage class
  – There are many different ways a node can get at storage to use for PVs, depending on where the cluster lives: Fibre Channel, iSCSI, NFS, Local Volume, HostPath, Spectrum Scale, Red Hat OpenShift Container Storage, AWS EBS, GCE Persistent Disk, Cinder, Manila, Azure Disk/File, VMware vSphere, ...

Page 13: OpenShift: Practical Experience

Capacity Planning: Ephemeral Disk
● Provisioned via subdirectories of the node's own (main) disk
● RHCOS only supports installation onto a single disk device
  – Must give the node a single “big enough” disk (DASD or FCP)
  – Official minimum supported disk size is 120GB (so EAV needed if DASD)
  – Smaller seems OK (e.g. mod54)
  – If using something much smaller (e.g. mod27), you need to tune the log/metric configuration to keep only a day or so, and the node gets busy trying to prune itself: stay with mod54 or bigger

Page 14: OpenShift: Practical Experience

Capacity Planning: Persistent Volumes
● PVs are usually consumed as “mounted filesystems”; it's only recently that OCP/K8s introduced ways for pods to “see” a block device, and that is uncommon
● A few ways of configuring PVs have the pod “reach out” and access the storage itself, but this is rare, not recommended, and the pod needs to be privileged
● Usually, the node is configured to reach out to the storage in one of these ways:
  – Linux can implicitly do so without special configuration (e.g. NFS)
  – the cluster admin places configuration files in /etc (e.g. with iSCSI, or for udev rules to bring DASD devices or FCP devices online) - done with machineconfig objects - see later
  – an operator is installed in the cluster to do it automatically based on appropriate CRD (Custom Resource Definition) objects
● Then PVs are created either
  – manually by the cluster admin; or
  – automatically by an operator that is installed (“dynamic provisioning”)

Page 15: OpenShift: Practical Experience

Adding a PV using NFS
● Configure an NFS server (e.g. a Linux guest, whether the bastion host or not)
● Ensure you export with root_squash:
  /srv/ocpnfs ocp*.foo.com(rw,root_squash)
● Create /srv/ocpnfs/pvnnn for each PV
  – owned by user root (uid 0)
  – owned by group nfsnobody (gid 65534)
  – permissions octal 2775 (rwxrwsr-x)
● The “capacity” value in the yaml need have no relation to actual capacity: it just decides whether the user's request can bind against this PV
● Set spec.storageClassName if you want to avoid random PVCs binding to it
● Example PV definition:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv010
spec:
  accessModes:
  - ReadWriteOnce
  - ReadWriteMany
  - ReadOnlyMany
  capacity:
    storage: 100Gi
  nfs:
    path: /srv/ocpnfs/pv010
    server: nfsguest.foo.com
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
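For illustration, a minimal PVC that could bind against a PV like this might look as follows (the claim name and namespace are hypothetical; add spec.storageClassName to match if you set one on the PV):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myapp-data        # hypothetical claim name
  namespace: myproject    # hypothetical namespace
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Gi       # binds if <= the "capacity" declared on the PV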

Page 16: OpenShift: Practical Experience

Using Local Storage Operator (LSO)
● Install LSO in a project (Operators > OperatorHub > Local Storage)
  – only arrived for Z in OCP 4.6
● LSO deals with block devices on the node; you can use LVM to create LVM Logical Volumes (LVs) for LSO to use as OCP Persistent Volumes (PVs) - see the sketch after the example below
● LSO watches for you to create “LocalVolume” objects in that namespace and, for each one it finds, it
  – goes to all the nodes matching the filter
  – makes a filesystem on all the device paths specified
  – auto-creates corresponding PVs with the specified storageclass
● Example LocalVolume definition:

apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: vglocalvol
  namespace: local-storage
spec:
  logLevel: Normal
  managementState: Managed
  storageClassDevices:
  - devicePaths:
    - /dev/vglocalvol/locvol1
    - /dev/vglocalvol/locvol2
    - /dev/vglocalvol/locvol3
    fsType: ext4
    storageClassName: locvol
    volumeMode: Filesystem
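As a sketch of the LVM preparation mentioned above, the logical volumes named in devicePaths could be created from the node's spare DASD/FCP block devices like this (the /dev/dasdb1 and /dev/dasdc1 device nodes and the 50G sizes are hypothetical):

pvcreate /dev/dasdb1 /dev/dasdc1              # make the block devices LVM physical volumes
vgcreate vglocalvol /dev/dasdb1 /dev/dasdc1   # volume group name matching the example above
lvcreate -L 50G -n locvol1 vglocalvol         # logical volumes referenced by devicePaths
lvcreate -L 50G -n locvol2 vglocalvol
lvcreate -L 50G -n locvol3 vglocalvol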

Page 17: OpenShift: Practical Experience

Adding an O/S config file to a node

• Each node runs RHCOS, maintained/updated by the cluster itself
• Any changes you make to its filesystem (e.g. /etc) may be lost
• Some valid configuration changes need files placed there
  – HyperPAV device alias definitions: add files in /etc/udev/rules.d
  – settings for RPS (Receive Packet Steering): add files in /etc/sysctl.d
• OpenShift uses machineconfig objects to build an RHCOS node
  – There is a machineconfigpool object for each node type, e.g. “master”, “worker”, “infra”
  – For each nodetype, OCP finds all machineconfigs labelled machineconfiguration.openshift.io/role=nodetype
  – and renders their contents into one big machineconfig object to apply to those nodes
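For example, you can inspect the pools and the rendered results from the CLI (cluster-admin needed; exact output varies by cluster):

oc get machineconfigpool                      # one pool per node type: master, worker, ...
oc get machineconfig | grep rendered-worker   # the combined, rendered config for worker nodes
oc describe machineconfigpool worker          # shows which rendered config is currently applied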

Page 18: OpenShift: Practical Experience

Adding a machineconfig
● “role: nodetype” will get this applied to all nodetype nodes
● For name_here, use the format “nn-my-description” where the 2-digit nn affects the order of rendering
● For pathname_here put the path where you want the file to be placed
● The source URL can be an inline data URL, and thus needs URL encoding, e.g.
  data:,line%20one%0Aline%20two%0A
● Use your favourite URL encoder, e.g.
  perl -MURI::Escape -ne 'print uri_escape($_)'
● Template:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: name_here
  labels:
    machineconfiguration.openshift.io/role: nodetype
spec:
  config:
    ...
    storage:
      files:
      - filesystem: root
        mode: 420
        path: pathname_here
        contents:
          source: 'data:,url_encoded_contents_here'
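As a sketch of rolling one out (the filename is hypothetical): save the YAML, apply it, and watch the machine config pool update while the affected nodes are drained and rebooted in turn:

oc apply -f 41-worker-hyperpav.yaml
oc get machineconfigpool worker -w   # wait until UPDATED becomes True
oc get nodes                         # each node is cordoned and rebooted in turn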

Page 19: OpenShift: Practical Experience

Example: adding HyperPAV aliases

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 41-worker-hyperpav-a00-a04
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 2.2.0
    networkd: {}
    passwd: {}
    storage:
      files:
      - filesystem: root
        mode: 420
        path: "/etc/udev/rules.d/41-dasd-eckd-0.0.0a00.rules"
        contents:
          source: 'data:,%23%20Generated%20by%20chzdev%0AACTION%3D%3D%22add%22%2C%20SUBSYSTEM%3D%3D%22ccw%22%2C%20KERNEL%3D%3D%220.0.0a00%22%2C%20DRIVER%3D%3D%22dasd-eckd%22%2C%20GOTO%3D%22cfg%5Fdasd%5Feckd%5F0.0.0a00%22%0AACTION%3D%3D%22add%22%2C%20SUBSYSTEM%3D%3D%22drivers%22%2C%20KERNEL%3D%3D%22dasd-eckd%22%2C%20TEST%3D%3D%22%5Bccw%2F0.0.0a00%5D%22%2C%20GOTO%3D%22cfg%5Fdasd%5Feckd%5F0.0.0a00%22%0AGOTO%3D%22end%5Fdasd%5Feckd%5F0.0.0a00%22%0A%0ALABEL%3D%22cfg%5Fdasd%5Feckd%5F0.0.0a00%22%0AATTR%7B%5Bccw%2F0.0.0a00%5Donline%7D%3D%221%22%0A%0ALABEL%3D%22end%5Fdasd%5Feckd%5F0.0.0a00%22%0A'
        verification: {}
      - filesystem: root
        ...

Page 20: OpenShift: Practical Experience

Troubleshooting

Page 21: OpenShift: Practical Experience

Troubleshooting: logging into a node
● Sometimes you want a shell on the node itself
● Preferred method is
  oc debug node/nodename
  – Starts up a pod scheduled onto the node...
  – ...but with most of the real namespaces of the underlying node...
  – ...and the root filesystem of the node itself mounted on /host in the container
  – so “chroot /host” and it feels pretty much like a normal root shell on the node
● If things are very broken, this may not work
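A typical session looks something like this (the node name is hypothetical; commands run inside the debug pod are indented):

oc debug node/worker-1.ocp.foo.com
  chroot /host
  journalctl -u kubelet --since "1 hour ago"   # recent kubelet logs on the node
  crictl ps                                    # running containers as seen by CRI-O
  exit                                         # leave the chroot...
  exit                                         # ...then the debug pod, which is then deleted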

Page 22: OpenShift: Practical Experience

Troubleshooting: logging into a node
● If “oc debug node/nodename” does not work then you can ssh into the node from your bastion host:
  ssh core@nodename
● Use username "core", not root, not your own username
● Authenticate with the ssh private key whose public key you put as the value for sshKey in the install-config.yaml used at cluster install time
● If you created a separate private key identity file for that (e.g. with ssh-keygen), then give its filename in the ssh command:
  ssh -i /path/to/the/id_rsa core@nodename
● Can “sudo -i” to get an interactive root shell
● Using ssh to a node “marks” the node, so support folks may ask why you needed to log in in a non-preferred way

Page 23: OpenShift: Practical Experience

Copying files from/to containers
● You can copy a file from any running pod to your local workstation:
  oc cp mypod:/path/to/file.foo .
● or copy a file from your workstation to any running pod:
  oc cp file.foo mypod:/put/the/file/as/file.foo
● Note that a file you copy into a pod, into a filesystem not mounted from a PV, will be ephemeral and thus vanish at pod restart/deletion
● All copying is subject (of course) to your cluster permissions
● There are options to copy multiple files, directory hierarchies etc - see the sketch below
● Can also extract files from an image without needing to start a container:
  oc image extract imagename --path /get/this/directory:.
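For example (pod and path names are hypothetical), whole directory trees can be copied with oc cp, or kept in sync with oc rsync:

oc cp mypod:/var/log/myapp ./myapp-logs   # copy a directory tree out of the pod
oc rsync ./config-dir mypod:/tmp/         # push a local directory into the pod
oc rsync mypod:/data ./data --watch       # re-sync continuously as files change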

Page 24: OpenShift: Practical Experience

Interactive use of containers

Page 25: OpenShift: Practical Experience

Interactive use of containers and images
● For real workloads on OCP, applications are deployed into pods and they listen for network connections via services/routes/ingresses
● Logs, live-streamed and historical (until pruned), are available:
  – Web console: Workloads, find and click the pod, go to the “Logs” tab
  – CLI using “oc logs ...”
● However, interactive creation of pods and access to pods is often really useful for troubleshooting and trying things out
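A few common forms of “oc logs” (resource names are hypothetical):

oc logs -f deploy/myapp               # follow the live log stream
oc logs mypod --previous              # logs from the previous (restarted) container instance
oc logs mypod -c sidecar --tail=100   # last 100 lines of a specific container in the pod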

Page 26: OpenShift: Practical Experience

Interactive shells on existing pods
● From the web console: Workloads, find the pod, click the Terminal tab; or
● From the CLI “oc rsh”, e.g.
  oc rsh pod/foo
  oc rsh deploy/foo (finds latest pod of deployment)
  oc rsh job/foo (finds latest pod instance of batch job)
● All the above find an existing pod, start up a new shell process running /bin/sh in that pod and connect it to your terminal/browser
  – Can tweak things like the shell name, terminal details, container name within the pod and such like, but these are rarely needed - see the example below
● A manual version of the same thing is
  oc exec -it pod/foo -- /bin/sh
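For instance (pod and container names are hypothetical):

oc rsh -c sidecar pod/foo                 # open the shell in a specific container of the pod
oc rsh --shell=/bin/bash pod/foo          # use bash instead of /bin/sh, if the image has it
oc exec -it pod/foo -c sidecar -- env     # run a single command rather than a shell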

Page 27: OpenShift: Practical Experience

Start a debug pod based on a resource
● Instead of starting a shell/process in an existing pod, you may want to
  – start up a new pod “just like” a desired deployment/daemonset/build/...
  – but instead of going ahead and running the program, just get an interactive shell in the pod instead
● For this, there is the CLI command
  oc debug resource/foo
● This works for many different resource types: anything that creates a pod (e.g. a deployment or an image stream tag) or can host pods (e.g. a node)
  – Command-line options can tweak things like the user it runs as or which container in the pod to use - for example, see below
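A couple of illustrative forms (the deployment and container names are hypothetical):

oc debug deploy/myapp              # new pod cloned from the deployment, shell instead of its program
oc debug deploy/myapp --as-root    # run the debug shell as root inside that pod
oc debug deploy/myapp -c sidecar   # pick a specific container from the pod template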

Page 28: OpenShift: Practical Experience

Start an interactive pod from an image
● This is basic OCP/K8s usage but often is just what you need
  oc run --rm -it mypod1 --image=imagename
● This fetches image imagename from a registry (if not cached)
  – can be one your cluster knows by default such as quay.io (Red Hat Quay) and docker.io (Docker Hub), one you've installed via an operator (such as cp.icr.io, the IBM Cloud Registry for Cloud Paks), or one you add as a prefix to imagename
● then starts a pod running it, connected to your terminal
● You can add
  --command -- /bin/sh
  if the image would normally start its own program and you want a shell instead (assuming the image filesystem has a /bin/sh)

Page 29: OpenShift: Practical Experience

Images, images, images

Page 30: OpenShift: Practical Experience

Example images for interactive use
● RHEL UBI8 (Universal Base Image)
  oc run --rm -it sh1 --image=ubi8/ubi
  If you don't see a command prompt, try pressing enter.
  [root@sh1 /]# cat /etc/redhat-release
  Red Hat Enterprise Linux release 8.2 (Ootpa)
● Ubuntu
  oc run --rm -it sh1 --image=ubuntu
  If you don't see a command prompt, try pressing enter.
  root@sh1:/# head -2 /etc/os-release
  NAME="Ubuntu"
  VERSION="20.04.1 LTS (Focal Fossa)"
● SLES
  oc run --rm -it sh1 --image=registry.suse.com/suse/sle15:15.1
  If you don't see a command prompt, try pressing enter.
  sh1:/ # head -2 /etc/os-release
  NAME="SLES"
  VERSION="15-SP1"

Page 31: OpenShift: Practical Experience

More images
● OCP has proper facilities for development and CI/CD with buildconfigs, Source-to-Image (S2I) and so on...
● ...but sometimes you just want to do a quick build/compile of something that nobody's got around to publishing a Z executable for yet...
● E.g. for Go (a.k.a. “golang” when naming needs to avoid false positives):
  oc run --rm -it sh1 --image=golang
  If you don't see a command prompt, try pressing enter.
  root@sh1:/go# go version
  go version go1.15.3 linux/s390x

Page 32: OpenShift: Practical Experience

More images
● There is a lot of open source software (as well as proprietary software like IBM Cloud Paks) providing interesting, useful functionality (in my opinion) with container images that are either
  – automatically built on Z already (an increasing number); or
  – buildable out-of-the-box on Z
● OCP and modern CI/CD methods make having a Z image much more transparent than with pre-container software
  – all the image names we used (ubi8, ubuntu, sle15, golang, java) were arch-independent thanks to OCI manifests
  – for any images that need to be branded differently for a given architecture (e.g. “clefos”), OCP has flexible imagestreams that let you hide any naming problems

Page 33: OpenShift: Practical Experience

Interesting Open Source images
● Gitea
  – gitea.io - “Git with a cup of tea”
  – Your own Git repositories with web access for inspecting, cloning, browsing, documentation etc - like GitLab, GitHub or BitBucket...
  – ...but running within your own OCP cluster, and can easily be used for push/pull/clone/build of images for your cluster
● RabbitMQ
  – rabbitmq.com - “Messaging that just works”
  – Message broker with an easy web interface and support for many/most languages and message protocols

Page 34: OpenShift: Practical Experience

Interesting Open Source images
● Node-RED
  – nodered.org - Low-code programming for event-driven applications
  – Web UI for building event-driven programs with inputs and outputs over the network from messaging and TCP/IP, and with built-in and NodeJS processing
● MinIO
  – min.io - Kubernetes Native, High Performance Object Storage
  – Serves up S3-compatible object storage held in persistent storage on your cluster and has a simple web interface for browsing, uploading and downloading objects

Page 35: OpenShift: Practical Experience

Interesting Open Source images
● Benthos
  – benthos.dev - The stream processor for mundane tasks
  – Like a cross between CMS Pipelines and z/OS DFSORT for filtering and munging data (not the sorting part), but for cloud-native processing: a single executable
  – inputs and outputs come from files (CSV, tar, message-per-file), message protocols (Kafka, AMQP, MQTT), SQL, HTTP, Redis, S3, SQS, GCP Cloud, TCP/IP sockets, web sockets, HDFS, ...
  – processing can (un)compress, archive (tar, zip, binary, lines, json_array), manipulate and filter fields (JSON-style) and XML, and cache keys/values (e.g. in Redis)
  – processing has a full optimised language (“bloblang”), or a simpler syntax for “awk”-like or simpler field processing, or supports plugins and subprocesses if needed
  – declarative configuration for the processing, which handles batching, streaming, parallelism and throttling, with a web interface for metrics and progress etc

Page 36: OpenShift: Practical Experience

Questions?
● Questions?
● Thank you
● My contact details:
  Malcolm Beattie
  Linux and IBM Z Technical Consultant
  IBM UK Systems Lab Services
  [email protected]

Page 37: OpenShift: Practical Experience

Please submit your session feedback!
Do it online at http://conferences.gse.org.uk/2020/feedback/1AB

• This session is 1AB

Page 38: OpenShift: Practical Experience

GSE UK Conference 2020 Charity

• The GSE UK Region team hope that you find this presentation, and the others that follow, useful and that they help to expand your knowledge of z Systems.

• Please consider showing your appreciation by kindly donating a small sum to our charity this year, NHS Charities Together. Follow the link below or scan the QR Code:

http://uk.virginmoneygiving.com/GuideShareEuropeUKRegion