ATLAS HK Tier-2 Site Setup & Storage Research in CUHK By Roger Wong & Runhui Li


Page 1: ATLAS HK Tier-2 Site Setup & Storage Research in CUHK By Roger Wong & Runhui Li

ATLAS HK Tier-2 Site Setup & Storage Research in CUHK

By Roger Wong & Runhui Li

Page 2: Roadmap

• ATLAS HK Tier-2 Site Setup
  – Presented by Roger Wong ([email protected]), Research Computing Team, Information Technology Services Centre, The Chinese University of Hong Kong
• Storage Research in CUHK

Page 3: Major Tasks

Install software components:
• HTCondor
• ARC CE + EGIIS
• DPM
• Frontier Squid
• Client

Page 4: Install Software Components

HTCondor
• Completed

ARC CE + EGIIS
• Basic configuration completed

DPM
• Basic installation completed (all-in-one node)
• Works with protocols such as RFIO and XROOT

Squid
• Completed

Client
• Can access ARC CE, DPM, and Frontier Squid

Page 5: Install Testing Cluster

• 10 VMs
  – One EGIIS (for testing the registration process with the CERN grid)
  – One ARC CE node with HTCondor manager and submit roles
    • Two HTCondor worker nodes
  – One ARC CE node with HTCondor manager, submit, and execute roles
  – One DPM head node
    • Two DPM disk nodes
  – One Squid server
  – One client
• All 10 servers carry production host certificates obtained under AP Grid PMA
• Would like to try connecting to the CERN grid now (yet to be discussed with counterparts in Lyon)
  – Need to tune configuration parameters
  – Need to sort out all outstanding issues

Page 6: Conduct Tender for Production Cluster

• Preliminary specification
  – 1,000+ cores
  – 1 PB storage
• Target: finalize the cluster specification after the testing cluster is connected to the CERN grid in "test" mode

Page 7: Upgrade Testing Cluster into Production Cluster

• Replace VMs in the testing cluster with physical machines (PMs)
  – Replace ARC CE, HTCondor worker nodes, and Squid
• Add more HTCondor worker nodes
• Reinstall DPM on PMs with storage devices

Page 8: Tentative Timeline

• Connect testing cluster to the CERN grid in "test" mode (by 2015)
• Conduct tender for production cluster (Jan 2016)
• Put cluster into production (H2 2016)

Page 9: Roadmap

• ATLAS HK Tier-2 Site Setup
• Storage Research in CUHK
  – Led by Professor Patrick P. C. Lee ([email protected])
  – Presented by Runhui Li ([email protected]), Advanced Networking and System Research Lab, Department of Computer Science and Engineering

Page 10: Storage Research in CUHK

Build dependable storage systems with fault tolerance, recovery, security, and performance in mind.

Techniques:
• Erasure coding: provide fault tolerance via "controlled" redundancy (e.g., RAID)
• Deduplication: remove content-level "uncontrolled" redundancy
• Security: ensure data confidentiality and integrity against attacks

Targeted architectures:
• Clouds, data centers, disk arrays, SSDs

Approach:
• Build prototypes, backed by experiments and theoretical analysis
• Open-source software: http://www.cse.cuhk.edu.hk/~pclee

Page 11: Storage Research in CUHK

[Figure: a map of techniques (erasure coding, deduplication, security) against targeted platforms (cloud, data center, disk array, SSD) and workloads (backup, MapReduce, streaming, primary I/O); the group's focus spans big data and file and storage systems.]

Page 12: Motivation

• Distributed storage systems are widely deployed to provide scalable storage by striping data across multiple nodes
• Failures are common

[Figure: storage nodes connected over a LAN]

Page 13: Replication vs. Erasure Coding

Solution: add redundancy
• Replication
• Erasure coding

Enterprises (e.g., Google, Azure, Facebook) are moving to erasure coding to shrink storage footprints in the face of explosive data growth
• e.g., 3-way replication has 200% storage overhead; erasure coding can reduce the overhead to 33% → over 50% operational cost saving [Huang, ATC'12] (see the sketch below)
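
The arithmetic behind those figures is worth making explicit. A minimal sketch; the (16, 12) parameters are an assumption chosen to match the 33% figure (Azure's LRC in [Huang, ATC'12] keeps 12 data and 4 parity fragments):

```python
# Storage overhead of replication vs. (n, k) erasure coding, where n is the
# total number of chunks and k of them carry data.

def replication_overhead(copies: int) -> float:
    """Extra storage as a fraction of the original data size."""
    return float(copies - 1)          # 3 copies -> 2 extra -> 200%

def erasure_overhead(n: int, k: int) -> float:
    """(n, k) coding stores n/k times the data, so (n - k)/k is overhead."""
    return (n - k) / k

print(f"3-way replication : {replication_overhead(3):.0%} overhead")
print(f"(16, 12) erasure  : {erasure_overhead(16, 12):.0%} overhead")
# -> 200% vs. 33%: the saving that motivates the move to erasure coding
```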


Page 14: Background: Erasure Coding

• Divide the file into k data chunks (each with multiple blocks)
• Encode the data chunks into additional parity chunks
• Distribute the data/parity chunks across n nodes
• Fault tolerance: any k out of n nodes can recover the file data

Example with (n, k) = (4, 2), where "+" denotes XOR. The file is divided into blocks A, B, C, D and encoded:

Node 1 (data):   A, B
Node 2 (data):   C, D
Node 3 (parity): A+C, B+D
Node 4 (parity): A+D, B+C+D
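
A runnable sketch of exactly this layout ("+" being bitwise XOR); it also checks the fault-tolerance claim by rebuilding every block after both data nodes are lost:

```python
# The slide's (n, k) = (4, 2) example: blocks A-D striped as data chunks
# {A, B} and {C, D}, parity chunks {A+C, B+D} and {A+D, B+C+D}.

def xor(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

A, B, C, D = b"AAAA", b"BBBB", b"CCCC", b"DDDD"
node1, node2 = (A, B), (C, D)              # data nodes
node3 = (xor(A, C), xor(B, D))             # parity node: A+C, B+D
node4 = (xor(A, D), xor(B, C, D))          # parity node: A+D, B+C+D

# Worst case: both data nodes fail; decode from the parity nodes alone.
ac, bd = node3
ad, bcd = node4
c = xor(bd, bcd)        # (B+D) + (B+C+D) = C
d = xor(ac, ad, c)      # (A+C) + (A+D) = C+D; adding C leaves D
a = xor(ac, c)          # (A+C) + C = A
b = xor(bd, d)          # (B+D) + D = B
assert (a, b, c, d) == (A, B, C, D)
```

The same elimination works for every other pair of nodes, which is what "any k out of n" means in practice.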

Page 15: Erasure Coding

Key advantage:
• Reduces storage space while providing high fault tolerance

Challenges:
• Data chunk updates require parity chunk updates → expensive updates
• k chunks are needed to recover a lost chunk → expensive recovery

Our work: mitigate the performance overhead of erasure coding while preserving its storage efficiency

Page 16: CodFS

• Object-based distributed file system
  – Splits a large file into smaller segments that are striped across different storage nodes
• Erasure coding
  – Each segment is independently encoded with erasure coding for fault tolerance
• Decoupled metadata and data management
  – Metadata updates are off the critical path
• Lightweight recovery
  – Monitors the health of storage nodes and triggers recovery when needed

"Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage", USENIX FAST 2014

Page 17: CodFS Solves the Update Problem

Novelty: parity logging with reserved space
• Puts parity deltas in a reserved space next to the parity chunks to eliminate disk seeks in parity updates
• Predicts and reclaims the reserved space in a workload-aware manner
• Mitigates both network and disk I/O in updates and recovery

[Figure: a data node ships a delta ∆A; parity nodes apply ∆P = f(∆A) and ∆Q = g(∆A).]
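
A minimal sketch of the reserved-space idea, assuming a linear code (so a parity delta is a function of the shipped data delta); the class below is illustrative, not CodFS's actual interface:

```python
# Parity logging with reserved space: deltas are appended next to the parity
# chunk (a sequential write, no read-modify-write seek) and folded in lazily.

def xor(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

class ParityChunk:
    def __init__(self, parity: bytes, reserved_slots: int):
        self.parity = parity
        self.reserved_slots = reserved_slots   # workload-aware prediction sets this
        self.log = []                          # deltas living in the reserved space

    def append_delta(self, delta: bytes):
        if len(self.log) == self.reserved_slots:
            self.merge()                       # reserved space exhausted: reclaim it
        self.log.append(delta)

    def merge(self):
        for delta in self.log:
            self.parity = xor(self.parity, delta)
        self.log.clear()

    def current(self) -> bytes:                # parity as seen by recovery
        p = self.parity
        for delta in self.log:
            p = xor(p, delta)
        return p

# A data node updates chunk A and ships only the delta; for XOR parity,
# ΔP = ΔA (a Reed-Solomon code would scale the delta by a coefficient).
old_A, new_A, B = b"\x00\x01", b"\x10\x01", b"\x0f\x0f"
P = ParityChunk(parity=xor(old_A, B), reserved_slots=4)   # toy parity P = A + B
P.append_delta(xor(old_A, new_A))
assert P.current() == xor(new_A, B)
```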

Page 18: CodFS: I/O Workflow

[Figure: a client and an MDS above a set of OSDs, one acting as primary and the rest as secondaries; the client's segment reaches the primary OSD, is encoded into chunks, and the chunks are distributed to the secondary OSDs (steps 1–4).]

MDS: metadata server; OSD: object storage device
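
A toy rendering of the write path implied by the diagram; the class names, method names, and exact step ordering are a hedged reading, not the real CodFS API:

```python
# Write path sketch: (1) client asks the MDS which OSD is primary for the
# segment, (2) ships the whole segment there, (3) the primary encodes it
# into chunks, (4) chunks are striped across the primary and secondaries.

class OSD:
    def __init__(self, name):
        self.name, self.chunks = name, {}
    def store(self, seg_id, chunk):
        self.chunks[seg_id] = chunk

class PrimaryOSD(OSD):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
    def write(self, seg_id, segment):
        half = len(segment) // 2                       # 3. encode: 2 data chunks
        d1, d2 = segment[:half], segment[half:]
        parity = bytes(a ^ b for a, b in zip(d1, d2))  #    + 1 XOR parity chunk
        for osd, chunk in zip([self] + self.secondaries, [d1, d2, parity]):
            osd.store(seg_id, chunk)                   # 4. distribute chunks

class MDS:
    """Metadata stays off the data path: it only maps segment -> primary."""
    def __init__(self, primaries):
        self.primaries = primaries
    def lookup(self, seg_id):
        return self.primaries[hash(seg_id) % len(self.primaries)]

secondaries = [OSD("osd2"), OSD("osd3")]
mds = MDS([PrimaryOSD("osd1", secondaries)])
primary = mds.lookup("seg-0")        # 1. metadata lookup
primary.write("seg-0", b"ABCD")      # 2. segment sent to the primary OSD
assert secondaries[1].chunks["seg-0"] == bytes(a ^ b for a, b in zip(b"AB", b"CD"))
```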

Page 19: CodFS Implementation

CodFS architecture:
• Exploits parallelization across nodes and within each node
• Provides a file system interface based on FUSE

OSD: modular design

Page 20: Results

Aggregate read/write throughput:
• Achieves several hundred megabytes per second
• Network bound

Page 21: Projects on Erasure Coding

Mixed failures
• STAIR codes: a general, space-efficient erasure code for tolerating both device failures and latent sector errors [FAST'14, TOS'14]
• I/O-efficient integrity checking against silent data corruptions [MSST'14]

Efficient updates
• CodFS: enhanced parity logging to reduce network and disk I/Os [FAST'14]

Efficient recovery
• NCCloud: reduce bandwidth for archival storage [FAST'12, INFOCOM'13, TC'14]
• I/O-efficient recovery schemes for erasure codes [MSST'12, DSN'12, TC'14, TPDS'14]

Integration of erasure coding and Hadoop
• CORE: regenerating code deployment in HDFS [MSST'13, TC'15]
• Degraded-first scheduling: MapReduce on erasure-coded storage [DSN'14]
• Encoding-aware replication: efficient transition from replication to erasure coding on HDFS [DSN'15]

Modeling of SSD RAID
• Stochastic model to capture reliability changes as SSDs age [SRDS'13, TC]

Page 22: Projects on Deduplication

LiveDFS: Linux kernel-space deduplication file system [Middleware'11]
• Extends a Linux file system with deduplication
• Follows the Linux file system layout
• Deployed as a kernel driver module

CloudVS: tunable version control for virtual machine images on OpenStack [NOMS'12, TSC'15]
• Extends Eucalyptus with deduplication
• Tunable trade-off between storage efficiency and performance

RevDedup: reverse deduplication with GB/s-scale read/write throughput [APSys'13, TOS'15]
• Efficient hybrid of inline and out-of-line deduplication

Page 23: Projects on Security

FADE: secure access control and assured deletion for cloud storage [SecureComm'10, ICPP Workshop'11, TDSC'12]

FMSR-DIP: remote data checking for regenerating codes [SRDS'12, TPDS'14]

Cryptographic deduplication for cloud storage [TPDS'14, TPDS'15]

CDStore: unifying erasure coding, deduplication, and security via convergent dispersal [HotStorage'14, USENIX ATC'15]
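
The common thread of the last two items is deduplication over encrypted data. A toy illustration of convergent encryption, the idea that convergent dispersal generalizes: the key is derived from the content itself, so identical plaintexts encrypt identically and remain deduplicable (XOR keystream used for brevity, not production crypto):

```python
# Convergent encryption sketch: key = H(content), so two users storing the
# same chunk produce the same ciphertext and the server can deduplicate it
# without ever seeing the plaintext or the key.
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    out = b""
    ctr = 0
    while len(out) < n:                      # SHA-256 in counter mode (toy)
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def convergent_encrypt(chunk: bytes) -> bytes:
    key = hashlib.sha256(chunk).digest()     # key derived from the content
    return bytes(a ^ b for a, b in zip(chunk, keystream(key, len(chunk))))

assert convergent_encrypt(b"same data") == convergent_encrypt(b"same data")
```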

Page 24: Discussion

Page 25: Connecting to CERN Grid

By Roger Wong

Page 26: Connecting to CERN Grid (1)

• Registered in GOCDB or OIM
• Step 0: required services
  – SE
    • CUHK: set up SRMv2.2 and configure the necessary space tokens
    • Lyon: configure FTS channels
  – CE and WNs
    • Not that many in the testing cluster when first connecting to the Tier-1 site
    • Will add many more WNs within 6 months
  – CVMFS
    • What does CUHK need to do? Just ensure our client has CVMFS installed?
  – Squid
    • Install default and fail-over Squid servers (see the sanity-check sketch after this list)
    • Manual fail-over?
• Question
  – Could CUHK transfer in data from more than one Tier-1 site?
• Outstanding items
  – Separate DPM head node and disk nodes
  – SRMv2.2 configuration
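
As a sanity check before connecting, a client node could verify that the CVMFS repository is mounted and that the Squid proxy answers. Hostnames and URLs below are placeholders, not the site's actual configuration:

```python
# Client-side checks: CVMFS mount visible, Squid proxy forwarding HTTP.
import os
import urllib.request

def cvmfs_mounted(repo: str = "/cvmfs/atlas.cern.ch") -> bool:
    # autofs mounts the repository on first access; listing triggers it
    return os.path.isdir(repo) and bool(os.listdir(repo))

def squid_reachable(proxy: str = "http://squid.example.org:3128",
                    url: str = "http://frontier.example.org/") -> bool:
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy}))
    try:
        opener.open(url, timeout=10)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    print("CVMFS mounted :", cvmfs_mounted())
    print("Squid reachable:", squid_reachable())
```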

Page 27: Connecting to CERN Grid (2)

• Step 1: register the site in AGIS
  – Register an "Atlas Site" with the site name used in GOCDB/OIM
  – Is the site name just CUHK?
• Step 2: register the storage in DDM
  – Register "DDM Endpoints" corresponding to the space tokens in AGIS
  – CUHK provides:
    • SE name
    • Space token availability
    • Email address of the responsible person
    • seinfo
    • FTS channel information
  – Lyon:
    • Open a DDM Ops Savannah ticket
      » Include the DDM endpoint in SiteServices and DeletionServices
      » Validate the transfer and deletion steps with one dataset
      » DDM endpoints will appear in DaTRI after 24 hours
    • Fill in all the information

Page 28: Connecting to CERN Grid (3)

• Step 3: set up a Squid
  – Register the Squid in AGIS, together with the Frontier services it should look up
• Step 4: Panda queues
  – CUHK provides:
    • CE name and queue name
    • vmem size per job slot
    • Available disk size (workdir) per job slot
    • Wall-time limit, if any
  – Lyon:
    • Register "Panda Site", "Panda Resources", and the associated "Panda Queues" in AGIS
• Question
  – Will the ARC CE queue become a Panda queue automatically once CUHK registers as a Panda site, i.e., is no extra setup at CUHK necessary?

Page 29: Connecting to CERN Grid (4)

• Step 4 (continued): Panda queues
  – Panda site
    • Usually "Atlas Site" == GOCDB/OIM site name
  – Panda resource
    • Associated with the "Panda site"
    • Production jobs: usually "Panda site" == "Atlas site" == GOCDB/OIM
  – Panda queue
    • Associated with the "Panda resource"
    • Usually the same name as the "Panda resource"
    • Associates the CE and the queue
    • Set the queue status to "test"

Page 30: Connecting to CERN Grid (5)

• Step 5: ATLAS SW installation/validation system
  – After the "Panda queues" are configured, contact [email protected] to start automatic software installation/validation

Page 31: Connecting to CERN Grid (6)

• ATLAS functional tests
  – DDM FT: tests storage and connectivity stability
  – SAM test: tests CE and storage stability
• Step 6: perform the data transfer functional test
  – Lyon: include the site in DDM FT (T1 → site, and Sonar)

Page 32: Connecting to CERN Grid (7)

• Step 7: perform the production functional test
• Step 8: perform the analysis functional test
  – Contact atlas-adc-hammercloud-support and provide the "Panda resource" name
  – The site should be set automatically in the HC DB within hours
  – Test jobs should appear within 24 hours

Page 33: Connecting to CERN Grid (8)

• Step 9: analysis activity
  – Add the site to the PanDA database (for pathena/prun analysis jobs)
  – The site appears in the PanDA Cloud Monitor
  – Run GangaRobot jobs with a success rate > 95% for 10 days
• Step 10: production activity
  – The site goes online after a few jobs have run successfully