conf2016 final it83125 randyllongmire surescripts … · 2017-10-08 · using the rest api with ......

30
Copyright © 2016 Splunk Inc. Randyl Longmire Senior Operations Engineer, Surescripts LLC Finding Straw in a Hay Field The Art of DevOps Log Farming

Upload: others

Post on 25-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

Copyright©2016Splunk Inc.

RandylLongmireSeniorOperationsEngineer,SurescriptsLLC

FindingStrawinaHayFieldTheArtofDevOpsLogFarming

Page 2: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

Disclaimer

2

Duringthecourseofthispresentation,wemaymakeforwardlookingstatementsregardingfutureeventsortheexpectedperformanceofthecompany.Wecautionyouthatsuchstatementsreflectourcurrentexpectationsandestimatesbasedonfactorscurrentlyknowntousandthatactualeventsorresultscoulddiffermaterially.Forimportantfactorsthatmaycauseactualresultstodifferfromthose

containedinourforward-lookingstatements,pleasereviewourfilingswiththeSEC.Theforward-lookingstatementsmadeinthethispresentationarebeingmadeasofthetimeanddateofitslivepresentation.Ifreviewedafteritslivepresentation,thispresentationmaynotcontaincurrentoraccurateinformation.Wedonotassumeanyobligationtoupdateanyforwardlookingstatementswemaymake.Inaddition,anyinformationaboutourroadmapoutlinesourgeneralproductdirectionandissubjecttochangeatanytimewithoutnotice.Itisforinformationalpurposesonlyandshallnot,beincorporatedintoanycontractorothercommitment.Splunkundertakesnoobligationeithertodevelopthefeaturesor

functionalitydescribedortoincludeanysuchfeatureorfunctionalityinafuturerelease.

Page 3: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

Agenda

3

IntroductionsWhereintheDevOpscyclethissessionisfocusedTurningthe‘hayfield’oflogentriesintoavaluableresourceusingSplunksoftwareQueries,transactions,alerts,andautomationSummaryWhat’snext?

Whatarewedoinghere?

Page 4: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

WhoisSurescripts?SurescriptsisHowHealthcareGetsConnected.

Anationwidehealthinformationnetworksecurelyconnectingdoctors’offices,hospitals,pharmacists,andhealthplansthroughanintegratedandtechnologyneutralplatform.

• Wepartnerwithmorethan700EHRapplicationsusedbyover900,000healthcareprofessionalsandmorethan1,000hospitals,impactingmorethan270millioninsuredlives.

• Weprocessmorethan6billiontransactionseachyear,includingnearly700millionmedicationhistories,morethan1billione-prescriptionsandnearly10millionclinicalmessages.

Page 5: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

WhoisRandylLongmire?

• SeniorOperationsEngineer(10yrs.)atSurescripts• BornandraisedintheNorthwestUnitedStates• 14yearsinHealthcareTechnology• 20+yearsinComputerSupport,SystemsandOperations

The‘ProblemResolver’

Page 6: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

TheScopeofthisSession

6

ServerandsoftwaredeploymentsMonitoringofserverandapplicationhealthTroubleshootingandproblemresolution– System/OSErrors– ApplicationandDatabaseErrors– HTTP/SMTPCommunicationFailures

WhichelementsoftheDevOpscyclearewefocusingon?

Page 7: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

7

WearealertedtoanHTTP500erroronasinglesiteUsingatexteditororlogparser,wemanuallysearchforanythingthatlookslikeanerroraroundthetimethatitwasreportedWethenmanuallycorrelatethiserrorwithlogsfromrelatedsystemsaroundthesametimestamp

Scenario1:ErrorTroubleshootingTroubleshootingtheoldway

Page 8: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

8

Theanatomyofthehaystack

Page 9: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

9

Modernmethodsoffindinganeedleinthehaystack

Page 10: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

10

Typicalmethodoffindinganeedleinthehaystack

Page 11: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

11

Fromahaystacktoahayfield

SurescriptsHostedWebApps• 1800servers(haystacks)• 68millionlogentriesdaily

Page 12: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

FindingStrawinaHayFieldWithSplunk Software

12

Queryscopecanrangefromveryfocusedtoverygeneral

Querytofindthestring“TimeoutException”inasinglelogtypeonasingleserver:

Querytofindthestring“TimeoutException”inanylogonanyserverwithinthe‘webApps’index:

Page 13: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

UsingTimechart toGraphtheHistory

13

Querytofindthestring“TimeoutException”andvisualizethefrequencythroughtime:

Usingthesamequery,wecannowlookintothepast

Page 14: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

Scenario2:Alerts

14

NowthatwehavethepowerofSplunk queriesavailable,wecanusethemtocreateproactivealerts.WhenthesameTimeoutExceptionerroroccurs,wecannowbealertedtoitimmediatelyaswellastriggerotheractions.

UsingAlertsforproactivemonitoring

Page 15: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

15

UsingAlertsforErrorDetection

Step1:DefinetheQuery

Page 16: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

16

UsingAlertsforErrorDetection

Step2:SettheAlertTypeandTriggerConditionScheduledorReal-TimeTriggerbasedonresultcounts

Page 17: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

17

UsingAlertsforErrorDetection

Step3:SpecifyTriggerActionsLogeventsSendanEmailRunascriptOpenaServiceNow incidentPOSTtoawebhook URLetc

Page 18: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

Scenario3–Automation

18

TheProblem:– Partnerapplicationhasversiondependencieswithourapplication– Whenonesideupgrades,theconnectivityisbrokenuntiltheotherside

upgradesSolutionbeforeSplunk– Supportticketisopenedrequestinganupgradeonorafteracertaindate– ConnectivitywouldbebrokenforanywherefromhourstodaysSolutionwithSplunk– PowerShellscriptrunsqueryagainstSplunkAPIevery30minutes– Whenanunsupportedversionerrorisdetectedinthelogsfromanyofthe1800

servers,theupgradeforthatserverisqueuedautomatically

UsingPowerShellwiththeSplunkRESTAPI

Page 19: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

19

#region Variables# Splunk Server Address

[string]$SplunkServer = "https://splunk.example.com:8089"# Splunk API Username

[string]$SplunkAPIUser = "splunkAPI"# Limit the number of results

[int]$resultLimit = 1000# Splunk Search String

[string]$SearchString = "search index=""webapps"" source=""*webApp.log"" ""unsupported ver*"" | stats count as ErrorCount by host | head $resultLimit"# Value for time frame from now to search

[string]$incrementValue = "-5m" # -5m,-5h,-5d, etc# Seconds to wait for the job to complete

[int]$timeLimit = 60 # Generate hashtable of body contents used to perform the search

$RestBody = @{search=$SearchStringoutput_mode="json"earliest_time="$incrementValue"}

#endregion Variables

Step0:DeclarethesearchqueryandparametersUsingtheRESTAPIwithPowerShell

Page 20: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

20

# Get a session Key from Splunk API#region GetSessionKey

$object = @{"username" = $SplunkAPIUser"password" = $(GetSecret $SplunkAPIUser)}try {

$token = Invoke-RestMethod -Uri "$SplunkServer/services/auth/login/" -Body$object -Method Post}catch {

log "Error getting Splunk login token. $($_.Exception)"exit

}$header = @{"Authorization" = "Splunk $($token.response.sessionKey)"}

#endregion GetSessionKey

Step1:GetanAPISessionKeyUsingtheRESTAPIwithPowerShell

Page 21: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

21

# Submit the search job#region SubmitJob

try { $JobID = (Invoke-RestMethod -Method Post -Uri

"$SplunkServer/services/search/jobs/" -Headers $header -Body$restBody -ErrorAction Stop).sid}catch {

log "Error submitting Query to Splunk API. Please check your search parameters and try again."

log "Error Detail: $_"exit

}#endRegion SubmitJob

Step2:SubmitthesearchjobUsingtheRESTAPIwithPowerShell

Page 22: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

22

$JobStatus = Invoke-RestMethod -Method Post -Uri "$SplunkServer/services/search/jobs/$jobID"-Headers $header -ErrorAction Stop# Wait for job to completeWhile (((($JobStatus.entry.content.dict.key | where {$_.name -eq "dispatchState"})."#text") -ne "DONE") -and (!($timeOut))){

If (((New-TimeSpan $startTime (Get-Date))).totalSeconds -gt $timeLimit) {log "Timeout exceeded!"# Delete the job and exittry {

$JobDelete = Invoke-RestMethod -Method DELETE -Uri"$SplunkServer/services/search/jobs/$jobID" -Headers $header -ErrorAction Stop

exit}catch {

exit}

}sleep -Seconds 1$JobStatus = Invoke-RestMethod -Method Post -Uri

"$SplunkServer/services/search/jobs/$jobID" -Headers $header -ErrorAction Stop}

Step3:WaitforthejobtocompleteUsingtheRESTAPIwithPowerShell

Page 23: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

23

# Get search results from the job######################

try{$JobResults = Invoke-RestMethod -Method Get -Uri

"$SplunkServer/services/search/jobs/$jobID/results?output_mode=json&count=0" -Headers $header -ErrorAction Stop

log "$($JobResults.results.Count) Search Results received"}catch {

log "Could not obtain search results from Splunk API. $_"}

# Delete the jobtry {

$JobDelete = Invoke-RestMethod -Method DELETE -Uri"$SplunkServer/services/search/jobs/$jobID" -Headers $header -ErrorAction Stop

log "Job Deleted successfully"}catch {

log "Could not delete Splunk job ($JobID). $_"}

}

Step4:GettheresultsUsingtheRESTAPIwithPowerShell

Page 24: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

24

# Process the search results# We could also limit this list by only returning results with an ErrorCount over a specified number[array]$results = $JobResults.results.host

ForEach($hostname in $results) {# Do some automation based on the results, in our case queue the

server for an upgrade.}

Step5:ProcesstheresultsUsingtheRESTAPIwithPowerShell

Page 25: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

BonusScenario– LogReadability

25

UsingtransactionstocreatereadableSMTPlogs

TheProblem:– CustomerreportsanSMTPmessagewassentbutneverdelivered

SolutionbeforeSplunk– SupportticketisopenedreportingSMTPdetailsformissingmessage– ManuallysearchingandparsingSMTPlogsforhours,iftheystillexist

SolutionwithSplunk– Usea‘transaction’querytodisplayallcommunicationthreadsfromsender’sIP

Page 26: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

ReadingSMTPcommunicationbeforeSplunk

26

BeforeSplunk:

Page 27: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

ConvertingSMTPlogsintoreadabletransactions

27

index=webApps sourcetype=iis c_ip="74.125.69.27"|transactionc_ip startswith=EHLOendswith=QUITmaxspan=4s

AfterSplunk:

Page 28: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

Summary

28

BeforeSplunk– Logfilesweremoreofamanagementtaskthanausefultool– Themanualprocessoflogparsingwastediousandtime-consuming

WithSplunk– LogfilesareanempoweringresourceacrossallaspectsofDevOps– Queriescantargetabroadscopeorlaserfocusforerroridentificationand

troubleshooting– Alertsprovidepro-activemonitoringandautomation– Timecharts enablegraphingfordashboardsandhistoricaldata– TheAPIopensthepowerofSplunk tocountlessotherapplications

Page 29: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

What’sNext?

29

BeCreative– Splunk’s applicationsexpandwithyourimagination

BeCollaborative– UsetheSplunk communitytools

BeAdventurous– Discovernewcommandsandmethodsandwaystheycanbeapplied

BeInspired– Adaptandtransformexistingsolutionsintonewandexcitingtools

Page 30: conf2016 FINAL IT83125 RandylLongmire Surescripts … · 2017-10-08 · Using the REST API with ... # Do some automation based on the results, in our case queue the server for an

THANKYOU