
Upload: virtualtech-japan-inc

Post on 06-Jan-2017


Page 1: NTT DOCOMO case study, OpenStack Summit 2015 Tokyo talk: "After One Year of OpenStack Cloud Operation (NTT DOCOMO)"

Copyright©2015 NTT DOCOMO, INC. All rights reserved.

After One Year of OpenStack Cloud Operation (NTT DOCOMO)

Ken Igarashi, NTT DOCOMO Inc.

Asako Ishigaki, NTT Software

Akihiro Motoki, NEC

Page 2

About Us

Ken Igarashi
○ Leading the OpenStack project at NTT DOCOMO
○ One of the first members to propose OpenStack Bare Metal Provisioning (currently called "Ironic") - bit.ly/1stuN2E

Asako Ishigaki
○ Engineer, NTT Software
○ Developing OpenStack log collection and analytics tools

Akihiro Motoki
○ Senior Research Engineer, NEC
○ Core developer of Neutron and Horizon

Page 3

Our Project

organization

Page 4

[Timeline figure. Axis marks: 2014-6, 2014-8, 2014-11, 2015-2, 2015-5, 2015-8, 2015-11. Phases, each with the figure shown in parentheses on the slide:]

• Scalable Test using 100 nodes (10)
• System Design (8)
• Recovery Tests (12)
• Racking and Cabling (14)
• 24/7 support (14)
• User Support (+x)

Page 5

o Team Rules (Culture)
  Focusing on using OpenStack instead of developing OpenStack
    Think about how to use it. Don't think "OpenStack can't do XXXX."
  Reducing Opex / Promoting Automation
    Operation tools
    • "Anything that a human needs to do more than twice must be automated."
    Reduce operators through HA and self-healing.

Page 6

o Tools
  Ansible, Python, Shell Script
  CI/CD
  • pep-8
  • Ansible-lint
  • Install

Deployment Procedure: Spec Writing, Test, Review, Production

5200+ deployments (2015)

2000+ patches (2015)

Page 7

Operation

HA / Automation

Page 8

o OpenStack Configuration (http://bit.ly/1DbJPUO)
  Double redundancy for hardware
  Triple redundancy for software

[Diagram labels: VMs; Nova; OpenStack APIs; Neutron Agents; LB (x2); RabbitMQ; MySQL (Galera) with DB1, DB2, DB3, DB4 and an Arbitrator; Zabbix; MaaS; PXE, DNS, DHCP]


Page 10

o Deployment
  CMDB Registration

Page 11

o Choose playbooks for Ansible
  Dynamic Inventory

[Diagram label: Ansible]
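A minimal sketch of what an Ansible dynamic inventory can look like, assuming the inventory is generated from the CMDB registered in the previous step (the slides do not spell out that link). The CMDB rows and their field names are hypothetical. The output shape, groups with a "hosts" list plus a "_meta.hostvars" map, is the JSON structure Ansible expects from an inventory script called with --list.

```python
import json
from typing import Dict, List

# Hypothetical CMDB rows; field names are assumptions for this sketch.
CMDB_ROWS: List[Dict[str, str]] = [
    {"hostname": "compute01", "role": "compute", "ip": "192.0.2.11"},
    {"hostname": "compute02", "role": "compute", "ip": "192.0.2.12"},
    {"hostname": "network01", "role": "network", "ip": "192.0.2.21"},
]


def build_inventory(rows: List[Dict[str, str]]) -> dict:
    """Group hosts by role and attach per-host variables."""
    inventory: dict = {"_meta": {"hostvars": {}}}
    for row in rows:
        group = inventory.setdefault(row["role"], {"hosts": []})
        group["hosts"].append(row["hostname"])
        # ansible_host tells Ansible which address to connect to.
        inventory["_meta"]["hostvars"][row["hostname"]] = {"ansible_host": row["ip"]}
    return inventory


def inventory_json(rows: List[Dict[str, str]]) -> str:
    """Text an inventory script would print when called with --list."""
    return json.dumps(build_inventory(rows), indent=2, sort_keys=True)
```

Playbooks can then target a group such as "compute" without a hand-maintained hosts file.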

Page 12

o Deployments
  Common: network, account, logging, Zabbix agent, drivers/firmware x 37
  OpenStack: Nova, Swift, Neutron, ……. x 62
  HA Configuration

[Diagram labels: compile, initial update, setup, kernel, driver, firmware, filesystem, development environment, Install HDD Driver]

Page 13

o Operation x 31
  Common: process restart, log collection
  OpenStack Operation: usage, VM migration/backup, user add/delete/quota change
  OpenStack Monitoring: health check tools

Per-host instance check
• Launch instances on given node(s)
• Boot success, instance log
• Metadata retrieval, login prompt, SSH access
• Optionally, test volume attach and its read/write access
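The per-host instance check reads as an ordered list of probes where a later probe is pointless once an earlier one fails. A minimal sketch of such a runner follows. The step names come from the slide. The probe callables are placeholders, since the transcript does not show how each probe is implemented.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_checks(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run probes in order and stop at the first failure."""
    report: List[str] = []
    for name, probe in steps:
        if not probe():
            report.append(f"FAILED: {name}")
            return False, report
        report.append(f"ok: {name}")
    return True, report


def example_steps(with_volume: bool = False) -> List[Step]:
    """Placeholder probes; real ones would call the cloud APIs and SSH."""
    steps: List[Step] = [
        ("launch instance on node", lambda: True),
        ("boot succeeded / instance log", lambda: True),
        ("metadata retrieval", lambda: True),
        ("login prompt", lambda: True),
        ("SSH access", lambda: True),
    ]
    if with_volume:
        steps.append(("volume attach and read/write", lambda: True))
    return steps
```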

Page 14

o 2015/10/27 4:40pm - 5:20pm, Heian (New Takanawa)
  "What are operators doing behind the Cloud?"

Page 15

Monitoring System

monitoring

Page 16

o Monitoring System

[Diagram labels: VMs, Swift, Cinder, Nova, RabbitMQ, Neutron Agents, Databases, Fluentd, Elasticsearch, Zabbix, Kibana. Coverage labels: "Weekday daytime" and "24h / 365d".]

Page 17

[Diagram labels: VMs, Swift, Cinder, Nova, RabbitMQ, Neutron Agents, Databases; Memory, CPU, Network, HDD]

            Monitoring Items   Self Healing
General     1,970              25
OpenStack   3,957              59

Page 18

o RabbitMQ
  Configuration
  • 3 node cluster
  • cluster_partition_handling, autoheal
  Monitoring
  • Split brain check: "rabbitmqctl eval '[N||{partitions,N}<-rabbit_mnesia:status()].'"
  • Port check (5672, 25672)
  • Process check: beam.smp, rabbitmq-server
  • At least one node running (1/3):
    {OpenStack-RabbitMQ:grpsum["HostG-RabbitMQ","net.tcp.service[tcp,,25672]",last,0].count(#3,0,"eq")}=3
    {OpenStack-RabbitMQ:grpsum["HostG-RabbitMQ","proc.num[beam.smp]",last,0].count(#3,0,"eq")}=3
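As we read the two trigger expressions, grpsum adds up the latest value of an item across the host group, and count(#3,0,"eq")=3 is true when the last three sums were all 0, meaning no node in the group reported the port or process for three checks in a row. A minimal Python sketch of that logic follows, not Zabbix itself. The split-brain helper assumes a healthy cluster prints an empty partition list; that output format is an assumption, not something shown on the slide.

```python
from typing import Sequence


def grpsum(latest_per_host: Sequence[int]) -> int:
    """Sum of the latest item value over all hosts in the group."""
    return sum(latest_per_host)


def trigger_fires(sum_history: Sequence[int], window: int = 3) -> bool:
    """True when the last `window` group sums all equal 0."""
    recent = list(sum_history)[-window:]
    return len(recent) == window and all(value == 0 for value in recent)


def partitions_reported(eval_output: str) -> bool:
    """Assumption: '[]' or '[[]]' means no partition was reported."""
    compact = "".join(eval_output.split())
    return compact not in ("[]", "[[]]")
```

With three nodes, a single surviving node keeps the sum at 1 and the trigger stays quiet, which matches the "at least one node running (1/3)" wording.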

Page 19

o MySQL
  Configuration
  • 4 nodes + 1 arbitrator
  Monitoring: cluster check
  • wsrep_local_recv_queue
  • wsrep_local_send_queue
  • wsrep_flow_control_paused
  • wsrep_local_commits

[Diagram labels: LB, R/W, Arbitrator]
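A minimal sketch of how the four wsrep status variables could be turned into a cluster check. The variable names are the Galera status variables listed on the slide. The thresholds are hypothetical values chosen only for illustration, and the slides do not say which limits the team uses.

```python
from typing import Dict, List

# Hypothetical limits; not taken from the talk.
THRESHOLDS: Dict[str, float] = {
    "wsrep_local_recv_queue": 10.0,
    "wsrep_local_send_queue": 10.0,
    "wsrep_flow_control_paused": 0.1,
}


def galera_warnings(status: Dict[str, float]) -> List[str]:
    """Return one message per variable that is missing or over its limit."""
    warnings: List[str] = []
    for name, limit in THRESHOLDS.items():
        value = status.get(name)
        if value is None:
            warnings.append(f"{name}: missing")
        elif value > limit:
            warnings.append(f"{name}: {value} > {limit}")
    return warnings


def commits_stalled(previous_commits: int, current_commits: int) -> bool:
    """wsrep_local_commits is a counter; no growth between two samples
    suggests the node is not committing writes."""
    return current_commits <= previous_commits
```

The input dictionary would come from SHOW GLOBAL STATUS LIKE 'wsrep_%' on each node.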

Page 20

o MySQL Cluster

[Diagram labels: MySQL Client, Master, Slave, Galera, send_queue, recv_queue, Commit, Replication, Disk, OK]

Wait until OK is received from replication

Page 21

o MySQL Cluster Freeze

[Same diagram as on Page 20, with a 👿 marker]

Wait until OK is received from replication

• Disk failure: 😀 (removed from cluster)
• Disk speed throttling: 😢
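A toy model of why throttling hurts more than outright failure, assuming (as the "wait until OK from replication" note suggests) that a commit completes only when the slowest node still in the cluster has answered. The node names and latency numbers are made up for illustration.

```python
from typing import Dict, Optional


def commit_latency_ms(disk_latency_ms: Dict[str, Optional[float]]) -> float:
    """Commit time is bounded by the slowest active node.

    None marks a failed node that has been removed from the cluster.
    """
    active = [v for v in disk_latency_ms.values() if v is not None]
    if not active:
        raise ValueError("no active nodes left in the cluster")
    return max(active)


healthy = {"db1": 5.0, "db2": 6.0, "db3": 5.0}
disk_failure = {"db1": 5.0, "db2": None, "db3": 5.0}     # node dropped out
throttled = {"db1": 5.0, "db2": 5000.0, "db3": 5.0}      # node still a member
```

A failed disk takes its node out of the picture, while a throttled disk keeps its node in the cluster and drags every commit down to its speed.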


Page 23

Prohibited Actions while MySQL Cluster Freeze

○ Prohibit some self-healing actions
  Do not restart some OpenStack processes
    – neutron-plugin-openvswitch-agent
  Do not reboot network nodes
    – lose network reachability (can't recreate network namespaces)

[Callouts on the slide: "All the VMs lose connections"; "Solved in Liberty?"]
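A minimal sketch of a guard that a self-healing tool could consult before acting, assuming it already knows whether the MySQL cluster is frozen. The action and target identifiers are hypothetical. The slide names only the agent and the node type.

```python
# Hypothetical (action, target) pairs derived from the two rules on the slide.
PROHIBITED_WHILE_DB_FROZEN = frozenset({
    ("restart_process", "neutron-plugin-openvswitch-agent"),
    ("reboot_node", "network"),
})


def may_self_heal(action: str, target: str, db_cluster_frozen: bool) -> bool:
    """Block the listed actions only while the DB cluster is frozen."""
    if db_cluster_frozen and (action, target) in PROHIBITED_WHILE_DB_FROZEN:
        return False
    return True
```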

Page 24

o Throttling happens during DB backup
  Limit Backup Node
  Backup Method
  • LOCK TABLES FOR BACKUP (online)
  • 1. Take the node from the cluster (Donor/Desynced)
    2. Lock the DB and do the backup (FLUSH TABLES WITH READ LOCK)
    3. Return the node to the cluster (wsrep_desync=OFF)
       – wsrep_local_recv_queue
       – wsrep_local_commits

[Diagram labels: LB, R/W, Limit Backup Node]
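The three numbered steps can be sketched as a statement sequence. This is an illustration under assumptions, not the team's script: run_sql and dump are injected callables (hypothetical names), and run_sql must reuse one database session because the read lock belongs to the session that took it.

```python
from typing import Callable


def backup_desynced_node(run_sql: Callable[[str], None],
                         dump: Callable[[], None]) -> None:
    """Desync the node, lock it, back it up, then return it to the cluster."""
    # 1. Take the node out of flow control (Donor/Desynced state).
    run_sql("SET GLOBAL wsrep_desync = ON")
    # 2. Lock the DB and take the backup.
    run_sql("FLUSH TABLES WITH READ LOCK")
    try:
        dump()
    finally:
        # 3. Release the lock and rejoin, even if the dump failed.
        run_sql("UNLOCK TABLES")
        run_sql("SET GLOBAL wsrep_desync = OFF")
```

After the last statement an operator would presumably watch wsrep_local_recv_queue and wsrep_local_commits, as the slide lists them under step 3, before sending traffic back to the node.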

Page 25

Log Analytics

Kibana

Page 26

Purpose of Log Analytics

(1) Detect critical system failures: we have to recover immediately
(2) Detect malicious access: we need to notify users
(3) Detect non-critical errors: better to be fixed as soon as possible
(4) Find errors/warnings that have no service impact: we want to filter them out next time

Page 27

Treasure Hunt in The Ocean of Logs

○ e.g. logs of a day
  Total: 100 GB, 80M lines
  Sum of critical, error and warning logs: 200K lines
  The meaningful logs are far fewer:
  (1) 0 critical failures (2) 0 malicious accesses (3) 6 non-critical failures (4) 6 ignorable failures

[Chart "Breakdown of Logs". Legend: Critical, Error, Warning, Info, Debug, Other. Values shown: 0%, 0%, 1%, 30%, 39%, 30%]

[Second chart. Legend: HW, OS, OpenStack backend, OpenStack, Operation tools. Values shown: 0%, 24%, 23%, 49%, 3%]

Page 28

Log Analytics Based on White/Black List

○ We analyze logs to enhance our black list and white list.
○ Logs found in our black list are sent to Zabbix.

[Diagram labels: Logs, black list, white list, Zabbix, Kibana, trash, analyze, add, expand, reduce]

Page 29

How to Adopt Black/White List Using Fluentd

[Diagram labels: Compute Node, Control Node, Network Node, LB, UTM, fluentd, rsyslog, Log Server (Fluentd, Elasticsearch, zabbix_sender), Zabbix (alerts), Kibana (graphs), refer]

• Add "ignorable" flag according to white list
• Put metadata to create graphs from the logs
• Notify Zabbix according to black list

Page 30

How to Adopt Black/White List Using Fluentd

Incoming log lines:
1. syslog    10:01 crit: hardware failure
2. rsyslog   10:03 warn: IDS: from x.x.x.x
3. api.log   10:04 ERROR: invalid request format

Records on the Log Server (Fluentd, Elasticsearch, zabbix_sender):

             1                  2                   3
path:        syslog             rsyslog             api.log
timestamp:   10:01              10:03               10:04
severity:    crit               warn                ERROR
item:        -                  ids                 ignore
source_ip:   -                  x.x.x.x             -
message:     hardware failure   IDS: from x.x.x.x   invalid request format

Zabbix: "hardware failure"
Kibana (refers to Elasticsearch): IDS graph, crit graph
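A minimal Python sketch of the enrichment step shown on this slide: each raw line becomes a record with the fields path, timestamp, severity, item, source_ip and message. The real pipeline runs inside Fluentd; this only mirrors the logic. The simplified "path time severity: message" line format, the IDS pattern and the one-entry white list are assumptions made for the example.

```python
import re
from typing import Dict, Optional

# Assumed simplified line format: "<path> <HH:MM> <severity>: <message>".
LINE = re.compile(
    r"^(?P<path>\S+) (?P<timestamp>\d{2}:\d{2}) (?P<severity>\w+): (?P<message>.*)$"
)
IDS = re.compile(r"^IDS: from (?P<source_ip>\S+)$")
# Hypothetical one-entry white list for the example.
WHITE_LIST = [re.compile(r"^invalid request format$")]


def enrich(raw: str) -> Optional[Dict[str, str]]:
    """Parse one line and add the item / source_ip metadata."""
    match = LINE.match(raw)
    if match is None:
        return None
    record = match.groupdict()
    record["item"] = "-"
    record["source_ip"] = "-"
    ids = IDS.match(record["message"])
    if ids:
        record["item"] = "ids"
        record["source_ip"] = ids.group("source_ip")
    elif any(p.search(record["message"]) for p in WHITE_LIST):
        record["item"] = "ignore"  # the "ignorable" flag
    return record
```

The item field then drives what Kibana graphs (ids) and what the dashboards hide (ignore).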

30

Page 31

Example of Our White List (with Juno)

1. • Count response codes and understand the trend. That's enough.

   ^keystonemiddleware\.auth_token \[\-\] Unable to find authentication token in headers$

2. • This ERROR means the user's operation was denied due to quota.
   • It has no impact on our system. Should it be an INFO log?

   ^nova\.api\.openstack \[[^\]]*\] Caught error: VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds allowed Gigabytes quota\..*$

3. • This WARNING is caused by the presence of SHUTOFF instances.
   • It is a commonplace condition and needs to be ignored.

   ^nova.scheduler.host_manager \[[^\]]+\] Host has more disk space than database expected .*$
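The three white-list patterns can be exercised directly with Python's re module, as a minimal sketch. The patterns are copied from the slide. The sample messages in the usage note are invented to show the matching and are not real log output.

```python
import re

WHITE_LIST = [
    re.compile(r"^keystonemiddleware\.auth_token \[\-\] "
               r"Unable to find authentication token in headers$"),
    re.compile(r"^nova\.api\.openstack \[[^\]]*\] Caught error: "
               r"VolumeSizeExceedsAvailableQuota: Requested volume or snapshot "
               r"exceeds allowed Gigabytes quota\..*$"),
    re.compile(r"^nova.scheduler.host_manager \[[^\]]+\] "
               r"Host has more disk space than database expected .*$"),
]


def is_ignorable(message: str) -> bool:
    """True when the message matches any white-list pattern."""
    return any(pattern.search(message) for pattern in WHITE_LIST)
```

For example, is_ignorable("keystonemiddleware.auth_token [-] Unable to find authentication token in headers") is true, while a made-up message such as "nova.compute.manager [req-1] Instance failed to spawn" is not matched and stays visible.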

Page 32

Effect of Our White List

○ We succeeded in reducing the logs to be analyzed. In other words, many meaningless logs have high log levels.

Without White List: 160K
With White List: 37
(99.98% reduction)

1 year ago: we couldn't analyze all logs in a day.
Today: we can analyze all logs in 2-3 hours a day!
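The 99.98% figure follows from the two counts on this slide, taking "160K" as roughly 160,000 lines. A short check:

```python
def reduction_percent(before: int, after: int) -> float:
    """Share of lines removed, as a percentage."""
    return (1 - after / before) * 100


# 160K is rounded on the slide, so the result is approximate.
print(f"{reduction_percent(160_000, 37):.2f}%")
```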

Page 33

Example of Our Black List

1. Warning alert
   • This message indicates a disk problem on a Compute node.

   ^kernel: \[[^\]]*\] XXXXX.*hardware failure\.$

2. Information alert
   • Corosync needs to clean up its resources.

   ^pengine: warning: unpack_rsc_op:Processing failed op monitor for .*$

3. Information alert
   • A full backup of MySQL failed once.

   ^mysql_fullbackup\[\d+\]:\sFailed\sto\sMySQL\sfullbackup.*$
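A minimal sketch of the black-list side: match a message, pick an alert level, and build a zabbix_sender command line. The patterns are copied from the slide, including the XXXXX placeholder as-is. Pairing the alert levels with the patterns in slide order is an assumption. The server name, host name and item key are hypothetical. The flags -z, -s, -k and -o are the standard zabbix_sender options for server, host, key and value.

```python
import re
from typing import List, Optional

# (pattern, alert level); level order is assumed from the slide layout.
BLACK_LIST = [
    (re.compile(r"^kernel: \[[^\]]*\] XXXXX.*hardware failure\.$"), "warning"),
    (re.compile(r"^pengine: warning: unpack_rsc_op:"
                r"Processing failed op monitor for .*$"), "information"),
    (re.compile(r"^mysql_fullbackup\[\d+\]:\sFailed\sto\sMySQL\sfullbackup.*$"),
     "information"),
]


def zabbix_sender_args(message: str, host: str,
                       server: str = "zabbix.example.com") -> Optional[List[str]]:
    """Return the argv for zabbix_sender, or None if nothing matches."""
    for pattern, level in BLACK_LIST:
        if pattern.search(message):
            key = f"log.alert.{level}"  # hypothetical item key
            return ["zabbix_sender", "-z", server, "-s", host,
                    "-k", key, "-o", message]
    return None
```

The returned list can be handed to subprocess.run on a host where zabbix_sender is installed.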

Page 34

Demonstration with Kibana

○ 3 dashboards

OpenStack All Logs Error Logs Critical Logs Warning Logs IDS

Page 35

Trademarks

○ Kibana is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
○ Elasticsearch is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
○ logstash is a trademark of Elasticsearch BV.

Page 36

o Presentation - Operation
  2015/10/27 4:40pm - 5:20pm, Heian (New Takanawa)
  "What are operators doing behind the Cloud?"

o Exhibition
  NEC Booth (H4)
    28 (Wed.) 10:45-13:00, 16:30-18:30; 29 (Thu.) 9:00-14:00
  NTT Group Booth (S14)
    28 (Wed.) 13:15-16:15 "Touch and Feel! NTT DOCOMO's Cloud Operation"

Page 37

NEC NTT

Page 38

Thank you for your attention.