expectations to machine learning for network … l various network logs are gathered by nmss and...

47
Copyright©2016 NTT corp. All Rights Reserved. Expectations to Machine Learning for Network Management NML-RG meeting, IRTF/IETF 97 (Soeul) November, 2016 Kohei Shiomoto (NTT)

Upload: vuongdien

Post on 30-Apr-2018

220 views

Category:

Documents


2 download

TRANSCRIPT

Copyright©2016NTTcorp.AllRightsReserved.

ExpectationstoMachineLearningforNetworkManagement

NML-RGmeeting,IRTF/IETF97(Soeul)November,2016

KoheiShiomoto(NTT)

1Copyright©2016NTTcorp.AllRightsReserved.

Outline

n Networkmanagementissues(3min)n Data-drivenapproach(3min)n Twoexamplesofourpractice(9min)

1. SYSLOGanalytics2. TroubleTicketanalytics

ExplainonlyourgoalandkeyideaSkipthroughseveralslidesonmath

n Discussion

2Copyright©2016NTTcorp.AllRightsReserved.

ChallengesinICTManagement

n Diversifiedapplications&services/ModernWebtrafficn Streaming,browsing,SNS,e-commerce,e-Health,…

n HighexpectationsforAvailability&Qualityn 99.9999%availability,highresolution,lownoise,

stallingfree,quickresponse,…n ComplexICTsystemstructure

n Devices,software,protocols,…n Interactionofplayers

n Customer,ISPs,CDN,…n Communicationsgetencrypted.https://…n ...

3Copyright©2016NTTcorp.AllRightsReserved.

Disaggregation-whywepursue?

n Verticallyintegratedsystemn Networkdevices(router,switch,middle-box,

etc.)havebeenverticallyintegratedsystem.n Thoseverticallyintegratedsystemsconsistof

manyhardwareandsoftwarecomponentsprovidedbydifferentcomponentvendors.

n Theendoflifeofsomecomponentscouldriskthelifeofentiresystem.

n Addingnewfunctionalitiesisundercontrolofsystemvendor.

n Disaggregationcouldsolvetheissues.

4Copyright©2016NTTcorp.AllRightsReserved.

SDN&NFV

Software-drivenControlSoftware-DefinedNetworking(SDN)

• Routing&Signaling,trafficengineering/steering,• CommodityL2/L3Switch-Routerhardware

NetworkFunctionVirtualization(NFV)• Middlebox (NAT,FW,IDS/IPS,VPN,CDN,…)• CommodityX86machinehardware

DKMMx86NIC DK

MMx86NIC

TCAMNPNICNIC

HA middle HA middle HA middle

DKMMx86NIC DK

MMx86NIC

TCAMNPNICNIC

HA middle HA middle HA middle

5Copyright©2016NTTcorp.AllRightsReserved.

Costofdisaggregation-Increaseofcomplexity

n Disaggregationbringsaboutanumberofbenefits:costreduction,elasticcapacity,quickdeploymentofnewfunctions.

n Butthosebenefitscouldnotbeobtainedwithoutsacrifice.

n Networkdevice,whichisdisaggregatedintocomponents,introducesfurthercomplexitycausedbyinteractionsamongcomponents.

n Eachcomponentisfrequentlyandquicklyreplacedwithneweroneasnewfeaturesaredevelopedandreleased.

6Copyright©2016NTTcorp.AllRightsReserved.

Howtodealwithcomplexsystem?

n Powerfulnetworkmanagementparadigmisrequiredtodealwithcomplexity.

n Wecouldnotrelyontraditionalmechanism-drivenapproach,whichpiecestogetheraccuratemechanismsofindividualcomponents.

n Wehavetorelyonaholisticdata-drivenapproachtomodelanentiresystembyanalyzingrelationshipbetweeninputsandoutputs.n ConventionalMechanism-drivenapproach

nGivenunderstandingprecisemechanismsofcomponents,buildupamodelofentiresystem.

n Towards Data-drivenapproachnGivendata,infertherelationshipbetweeninputsandoutputs.nMachinelearningisakey.

7Copyright©2016NTTcorp.AllRightsReserved.

Varietiesofdatacanbeused.

n Trafficloadn Performancen Syslogn Troubleticketsn SNSmessages(e.g.Twitter)n …

Numerical,text,…

Copyright©2016NTTcorp.AllRightsReserved.

Spatio-TemporalFactorizationofLogDataforUnderstandingNetworkEvents

Tatsuaki Kimura,Keisuke Ishibashi,Tatsuya Mori,HiroshiSawada,Tsuyoshi Toyono,Ken Nishimatsu,AkioWatanabe,Akihiro Shimoda,Kohei Shiomoto

SYSLOGAnalytics

Background

l VariousnetworklogsaregatheredbyNMSsandmonitored– Switch,router,RADIUSsever,…– syslog,serverlog,alarm,SNMPtrap,…– logscontainusefulinformationforNWtroubleshooting

ISPIPTV VoIP

Syslog Serverlog

Example[router,switchsyslog](Cisco):%TRACKING-5-STATE:1interfaceFa0/0line-protocolUp->Down%LINK-3-UPDOWN:InterfaceFastEthernet 0/9,changedstatetodown%SYS-5-CONFIGI:Configuredfromconsolebyvty2(10.11.11.11)

[NMSalarm]:[エラータイプ 100]リンクダウンが起こりました.ホスト名:hostA IPアドレス:10.11.1.1プロトコル3[エラータイプ 100]パケットロスを検出しました.type:xxxx,host:hostB,IPアドレス:10.11.11.11<優先度:低>オペレータがログインしました.fromyyyy,host:hostC,2011/11/11

Background

l VariousnetworklogsaregatheredbyNMSsandmonitored– Switch,router,RADIUSsever,…– syslog,serverlog,alarm,SNMPtrap,…– logscontainusefulinformationforNWtroubleshooting

l Diverseandmassiveamountsoflogs– multiplevenders,multipleservices,complexnetworkevents– over1,000,000ofmessages/day

⇒ Analyzinglogshasbecomeseriousproblem

ISPIPTV VoIP

Syslog Serverlog

11Copyright©2016NTTcorp.AllRightsReserved.

Ourresearchgoal

Miningnetwork(NW)eventinformation fromlargeanddiverseNWlogdataNetworkevents=spatialandtemporalpatternsoflogmessagesmessagesassociatedwithinitializationofvariousprocesscausedbyrouterrebooteventmultiplelayerflapscausedbyL-1flap(L-2,OSPFre-convergence,BGPflap,...)virtualpathdis-connectionrelatedtophysicalmachinefailure

linkdown

L-1flaprooterreboot

HardwareFailure

MemoryError

IFflap

2012-1-1T00:00:00 %TRACKING-5-STATE: 1 interface Fa0/0 line-protocol Up->Down 2012-1-1T00:00:00 %LINK-3-UPDOWN: Interface FastEthernet 0/9, changed state to down 2012-1-1T00:00:00 %SYS-5-CONFIG I: Configured from console by vty2 (10.11.11.11)2012-1-1T01:11:00 msg [100]: STP: VLAN 1 Port 38 STP State -> DISABLED (PortDown)2012-1-1T01:11:00 msg [101]: System: Interface ethernet 38, state down2012-1-1T03:00:00 msg [200] : STP: VLAN 100 Port 22 STP State -> DISABLED (PortDown)2012-1-1T03:00:00 msg [201] : System: Interface ethernet 22, state down2012-1-1T00:00:00 %SYS-5-CONFIG I: Configured from console by vty2 (10.11.11.11)2012-1-1T10:30:00 System: Interface ethernet 1, state down2012-1-1T10:30:00 System: Interface ethernet 1, state up2012-1-1T10:30:00 System: Interface ethernet 2, state down2012-1-1T12:00:00 init: alarm-control (PID 111) terminate signal sent2012-1-1T12:00:00 init: bslockd (PID 124 ) terminate signal sent2012-1-1T12:00:00 init: ce-l2tp-service (PID 123 ) terminate signal sent2012-1-1T12:00:00 init: chassis-control (PID 1111 ) terminate signal sent2012-1-1T12:00:00 init: disk-monitoring (PID 7082 ) terminate signal sent2012-1-1T00:00:00 %SYS-5-CONFIG I: Configured from console by vty2 (10.11.11.11)2012-1-1T15:45:10 msg [200] : STP: VLAN 100 Port 22 STP State -> DISABLED (PortDown)2012-1-1T15:45:10 msg [201] : System: Interface ethernet 22, state down2012-1-1T16:12:40 System: Interface ethernet 1, state down2012-1-1T16:12:40 System: Interface ethernet 1, state up2012-1-1T16:12:40 System: Interface ethernet 2, state down2012-1-1T20:30:00 init: alarm-control (PID 111) terminate signal sent2012-1-1T20:30:00 init: bslockd (PID 124 ) terminate signal sent2012-1-1T20:30:00 init: ce-l2tp-service (PID 123 ) terminate signal sent2012-1-1T20:30:00 init: chassis-control (PID 1111 ) terminate signal sent2012-1-1T20:30:00 init: class-of-service (PID 11112) terminate signal sent

12Copyright©2016NTTcorp.AllRightsReserved.

WhyminingNWevents?

Automatically constructingdomainknowledgeofNWoperators(withoutrequiringskillsandexperience)

Manypossibleapplications:ObtainingnewalarmrulesQuickunderstandingofrootcauseofproblemsMayhelpindetectionof“silentfailure”

linkdown

L-1flaprooterreboot

HardwareFailure

MemoryError

IFflap

13Copyright©2016NTTcorp.AllRightsReserved.

Keyidea(observation)

Logdatacanbeconsideredasrank-3tensortypesoflogmessages(templates),hostname,time

Observedlogdata= ‘superposition’ofNWevents

ExtractionofNWeventsproblem⇒ TensorFactorization problem

linkflapevent

linkflapevent

cronjob

cronjobA

B

C

msg1

msg2

msg1

msg2

msg3

msg4

msg5

msg6

msg4

msg5

msg6

msg1 msg2 msg3 msg4 msg5 msg6

A ◯ ◯ △ △ △

B ◯ ◯ ◯

C △ △ △

H: hosts

J: time-windowsI: templates

14Copyright©2016NTTcorp.AllRightsReserved.

Challenges&SummaryofContributions[1]Unstructuredandmassivelogmessages

Morethan1,000,000lines/dayFormatsoflogmessagesdependonvendororservice

⇒Wepresent StatisticalTemplateExtraction(STE)・ automaticallyextractprimarytemplates fromlargelogdata

[2]Logdataare verycomplexUnderlyingnetworkeventsoccurallaroundnetworkNetworkeventsspanacrossseverallocations,networklayers,andservices

⇒Wepresent LogTensorFactorization(LTF)• extractspatialandtemporalpatterns oflogdata• basedonNonngegativeTensorFactorization(NTF)approach

15Copyright©2016NTTcorp.AllRightsReserved.

STEMotivation

Rawlogmessagescannot beuseddirectlylogmessagescontainvariousparameters(IPaddress,hostname,PID,…,etc.)Tocorrelatelogmessages,weneedtoknowlogtemplates =messageswithoutparameters

%TRACKING-5-STATE: 1 interface Fa0/0 line-protocol Up->Down %LINK-3-UPDOWN: Interface FastEthernet 0/9, changed state to down %SYS-5-CONFIG I: Configured from console by vty2 (10.11.11.11)

%TRACKING-5-STATE: * interface * line-protocol Up->Down %LINK-3-UPDOWN: Interface FastEthernet *, changed state to down %SYS-5-CONFIG I: Configured from console by * (*)

16Copyright©2016NTTcorp.AllRightsReserved.

STEMethodology

1.Scoringfrequencyofwords amongsimilarmessagesparameterwords appearinfrequentlycomparedtotemplatewords ineachposition

2.Clusteringscore,anddetermineparameterwords foreachmessagethresholdsforscoreofparameter wordsdifferdependingonlogmessagesdensity-based clusteringalgorithm(DBSCAN)

raw log messages: <189> security telnet connection 15720 with 10.7.11.11 broken<189> security telnet connection 18340 with 10.8.9.123 broken

①Scoring: If word appears in P-th position in log that contains L words:

Score(word,P,L) = Pr(word | P,L)1 2 3 4 5 6 7

<189> security telnet 15720 with 10.7.11.11 broken

<189> security telnet 18340 with 10.8.9.123 broken

log template:<189> security telnet connection * with * broken

Remove

②Clustering scores (DBSCAN): Distance between each cluster is > δ

0 1

17Copyright©2016NTTcorp.AllRightsReserved.

Spatial&temporalpatternswewanttoextract

Definition 1 [Template Group]Ø groupoftemplatesthattendtoco-occur:l = (i1, i2, ...)

ü representseventatindividualhostü e.g.linkflap:linkdown +linkup

reboot:processinitializationmessages

Definition 2 [Network Event]Ø setoftuples<host,templategroups>thattendtoco-occur

e = {(h1, l1), (h2, l2), ...} (h1, h2∈H, l1, l2∈L)ü spatialextension oftemplategroupsü e.g.linkdowneventamongneighboringhosts

host h1

template groups

host h2

host h3

tem

plat

eA

tem

plat

eB

tem

plat

eC

tem

plat

eD

l1 l2 l3 l4

Hierarchicalcorrelationsareobservedinlogdata

18Copyright©2016NTTcorp.AllRightsReserved.

LTF

LogTensor X (I × H × J)xihj: # of occurrences of template i at host h, time window j

• templates:1,..., i,..., I• hosts: 1,..., h,..., H• timewindows: 1,..., j,..., J (logdataare partitioned)

• LTFfactorizesX into1. Log template matrix V = [vl] (I × L)

2. Network event tensor Z = [zlk] (L × K × H)

3. Weight matrix W = [wk] (K × J)* K: # of network events. L: # of template groups (given)

H: hosts

J: time-windowsI: templates

X Z WV≅

I

L: templategroups

JKH

LK: network events

19Copyright©2016NTTcorp.AllRightsReserved.

IntuitiveinterpretationsofLTF

template group matrix Vl-th row of matrix V represents l-th template group ⇒ corresponds to Template Group

network event tensor Zk-th slice of Z represents which template groups occur at which hosts⇒corresponds to Network Event

VI: templates

L: template groups

tem

plat

eA

tem

plat

eB

tem

plat

eC

tem

plat

eD

1.0

0.0

Z

L: templategroups

K: network events

host A

L: template groups

H: hostshost Bhost C

20Copyright©2016NTTcorp.AllRightsReserved.

LTFProblemFormulation

Optimizationproblemwithnonnegativeconstraintsoneachtensor

Objectivefunction=KL-divergencebetweenX andV, Z, W

Algorithm(multiplicativeupdaterules;typeofEMalgorithm)

simpleanditerativeform

Formulation&Algorithm

LTFProblem

21Copyright©2016NTTcorp.AllRightsReserved.

DataSet5millionlinesoflogs(capturedinsmallnetwork,5months)

Evaluationmetrics=effectivenesscalculate#ofextractedtemplates5,000,000logs⇒ 2,000~3,000templates(lessthan1%)

Falsepositivewords⇒ 0.07~0.08%of1,000,000words

STE Evaluation

22Copyright©2016NTTcorp.AllRightsReserved.

DataSetover600,000linesof1-daylogdatadimensionsareroughly 100 (I) × 150 (H) × 150 (J)

EvaluationMetricsexpressivepower:HowwelldoesLTFfittorealdata?⇒ usewellknownmeasure‘averagetestlog-likelihood’

• predictionpowerofrandomlymaskedelements• highervaluemeansbetterfittodata

LTF Evaluation Metrics

23Copyright©2016NTTcorp.AllRightsReserved.

LTF Evaluation Results

averagetestlog-likelihoodwithdifferentKandLweusednormalNTFmodelasbaseline*L:#oftemplategroups,K:#ofnetworkevents

LTFfitsbettertorealdatathancurrentNTF

24Copyright©2016NTTcorp.AllRightsReserved.

Casestudyresults(ExamplesofoutputofLTF)1/2

Neighboringlinkflapevent

CoreRouter A

EdgeRouter B(neighbor)

Interfaceup/down

Interfaceup/downlineprotocolup/down

zikh vil

25Copyright©2016NTTcorp.AllRightsReserved.

Tunnelingpathdisconnectionevent

CoreRouter

EdgeRouter AModulereloadeventvirtualpathdisconnection

EdgeRouter A

Tunnelingpathdisconnection

...

Casestudyresults(ExamplesofoutputofLTF)2/2

Morethan50connections

26Copyright©2016NTTcorp.AllRightsReserved.

SummaryofSYLOGAnalytics

PresentedSTEExtractingprimarytemplatesfromnoisylogmessagesCompressing5,000,000linesto3000templates

Presented LTFModelinggenerationoflogsasrank-3tensorandfactorizingintotemplategroupsandnetworkeventsMuchbetterthancurrentNTFCancorrectlyextracthiddencomplexnetworkevents

Copyright©2016NTTcorp.AllRightsReserved.

WorkflowExtractionforServiceOperationusingMultipleUnstructuredTroubleTickets

AkioWatanabe,KeisukeIshibashi,TsuyoshiToyono,Tatsuaki Kimura,Keishiro Watanabe,YoichiMatsuo,Kohei ShiomotoNTTNetworkTechnologyLaboratoriesApril25,NOMS2016

TroubleTicketAnalytics

Background

• Wewouldliketospecifytroubleshootingprocess– MTTRstronglydependsonthetimeuntildecidingtheresolution

• Whyprocessisnotdefined?– variousprocessforthousandsoffailures– requiringtacitknowledge– includingdomainrule

Detection Isolation Resolutiondecision Resolution

Troubleshootingprocedureforsystemmanagement

Research targets: Specify Troubleshooting

What is the cause?

What should we do?

Present understanding of process

• Operators search the resolution from trouble tickets– amount of valuable knowledge about failures

• Much information are written by natural language

Ourgoal

• Automatically specifying troubleshooting process from trouble tickets

• Generating Workflow; graphical flowchart of actions for each failure

summarizingmultipletroubletickets

clarifyingdeciderules

Two challenges

• Finding action sequences for each trouble ticket

• Finding isolating action; that have multiple transitions to resolutions

Seeingdetectederrors

CheckingNetworkConnectivity

Checkingsystemlog

Senderrorlogstovendor

CloseTicket

Replacingmodule

Seeingdetectederrors

CheckingNetworkConnectivity

Checkingsystemlog

Senderrorlogstovendor

CloseTicket

ReplacingmoduleCloseTicket

ReplacingCable

Seeingdetectederrors

CheckingNetworkConnectivity

Checkingsystemlog

Senderrorlogstovendor

CloseTicket

ReplacingmoduleCloseTicket

ReplacingCable

Seeingdetectederrors

Checking Network Connectivity

Approach overview

• 3 steps for extract workflow from multiple trouble tickets– 1. extract only action sentences from trouble tickets

– 2. align the same messages in different tickets

– 3. find operational change as a branch

wedetectederror

verifiedconnectivity

systemdetectederror

connectivityisOK

1.labeling2.alignment

werebootedserverwereplacedmodule

3.findingisolatingaction

1. Action Sentence Labeling

• Extracting sentences about (operator/system’s) actions– Append sentences to labels indicating if is written about actions or

not

• Supervised learning from labeled texts

– Naive Bayes are used as classifier

– Character-2gram is used as features

10:00Managementsystemdetectedthefollowingerror

10:00:00Routerxxxwentdown

ConnectionOkay

Wecheckedtherouterlog.

#loginr001#showsystem

11:30Rebootmessagewasdetected.

12:00WesentsystemlogstoAInc.2/210:00receivedreportMailfrom...

Action

ActionAction

ActionAction

Other

OtherOther

Other

Errorcountinuing,wedecidedonreplacement.

11:15Replacementbegan.ActionOther

11:30Replacementdone.ActionAction

2. Action Alignment

• Aligning sentences describing the same action

consider aligned numbers as action sequence

Action sentences1 Actionsentences2 Actionsentences31 Systemdetected temporary

errorError isdetected. Error: Nodedown

2 Weverified connectivity.3 PingisOK Ping NG4 Manyerrors wereinlog Thereisnoerror inlog

1 1 1

3

4

2 3

Formulation as maximal matching Problem

• Corresponding similar sentences

– maximize the sum of similarities of aligned sentences

• Solving by multiple sequence alignment (MSA) method

– MUSCLE algorithm [18] is used for aligning multiple sequences of sentences

– dice coefficient as similarity of sentence pair

3. Isolating Action Searching

• Finding isolating action i.e. branch of transitions to two resolutions

• Problem: many imitate branches caused by loss or noise in action sequence

checkIProute

checksession

checkIProutereplacemodule poweroff

pulloutcable

insertnewmodule

bootmodule

checkconnectivity

checkifaddresscanresolve checkIProute

findisolatingaction=branchoftransitions

imitatebranchbynoise imitatebranchbylossofaction

Key observation

• the following actions of isolating action can be clustered by resolutions

followingactions

isolating action

seems to have two kinds of resolution

imitate branch

Method for finding isolating action

• 1. dividing the following actions using Spectral Clusteringfor each actions

• 2. choosing the action that has the best clustering score

+

similarity of clustered sentences

dissimilarity of not clustered sentences

Scoreoftheaction

Experiments

• Dataset: practical trouble tickets for network system– Written by Japanese

• primitive linguistic preprocessing are executed

– Separated into subsets by detected error

• given parameter– the threshold of similarity for alignment– the number of isolating actions

thenumberoftickets resolutions/oftickets(i) 5 (A) turnonbreaker(x2)

(B) wait& see(x3)

(ii) 4 (A)replacemodule(x2)(B)replaceportinterface(x1)(C)replacecable(x1)

(iii) 3 (A) poweroutage(x2)(B) replace ONU(x1)

(iv) 29 (A) wait&see(x12)(B) detail loganalysis(x15)(C) replacemodule(x2)

Quantitative evaluation

• We compared obtained result with ground truth• Comparison of

– alignment result & manually appended action ID– clustering result & manually checked true resolution for each ticket– words of extracted isolating actions & isolating action in true operation

document

• miss of alignment is limited the case in

– exchange of order

• miss of isolating action searching is caused by

– the loss of isolating action

– multiple (three or more) isolations

Case study result

• frequent actions are the same with the actions in manual document

• causes are described into the next actions of isolating actions

Summary of Trouble Ticket Analytics

• Proposed extracting method for workflow automatically from multiple trouble tickets

– Action Sentence Labeling

– Action Alignment

– Isolating Action Searching

• Future works

– relaxing the limitation of order of actions

– finding multiple isolating actions

43Copyright©2016NTTcorp.AllRightsReserved.

ExpectationstoMachineLearning

n CorrelationandCausalityInferencen AnomalyDetectionn RootCauseAnalysisn KnowledgeDiscovery

44Copyright©2016NTTcorp.AllRightsReserved.

Diagnosing&Trouble-shooting

n Predictionn Detectionn RootCauseAnalysisn Recovery

45Copyright©2016NTTcorp.AllRightsReserved.

Concludingremarks- Data-drivenapproach

n Disaggregateverticallyintegratedsystemintocomponentstoachievesustainablehealthygrowth.

n Hardtounderstandprecisemechanismsofeverycomponentofentiresystem.

n Measureandcollectbigdataoninputsandoutputsofthesystemtoinfertherelationshipbetweenthem.

n Mathematicaltools,e.g.,machinelearningareavailablehere.

n Keytosuccessisinter-playbetweenmathematicsandnetworkengineering.

46Copyright©2016NTTcorp.AllRightsReserved.

Thankyouforyourattention