lost art of troubleshooting by leon fayer

Upload: devopsdays-baltimore

Post on 12-Apr-2017

TRANSCRIPT

lost art of troubleshooting
Leon Fayer
http://fayerplay.com
@papa_fire

{me}
20+ years breaking & fixing
dev, architect, [DevOps]
vp @ OmniTI
fix other people's

why troubleshooting?

cloud ruined everything
it really did

Most reliable way to fix Windows problems (1997)
when in doubt - reboot

DevOps mantra for managing cloud-based systems (2017)
destroy and rebuild

old McDonald had a farm

old McDonald lost a farm
due to mad cow disease

troubleshooting - a form of problem solving

problem solving - ability to fix things that you know nothing about

why is problem solving important?

because systems are complex

because of Murphy's law

because someone is always watching

{disclaimer}

wishful thinking

reality

where to begin?

replicate

isolate

fix?

what's the problem?

it's broken!

understanding

understand problem

we can't support 100s req/min
we need to scale better!

we can't support 100s req/min
we need to scale better!
improve performance

performance problem

perceived problem

actual problem
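
One way to see why "100s req/min" reads as a performance problem rather than a scaling problem is to turn the rate into a per-request time budget. This is a minimal sketch in Python; the figure of 300 requests per minute, the worker counts, and both function names are illustrative assumptions, not numbers or code from the talk.

```python
def requests_per_second(requests_per_minute: float) -> float:
    """Convert a per-minute request rate to a per-second rate."""
    return requests_per_minute / 60.0


def budget_ms_per_request(requests_per_minute: float, workers: int = 1) -> float:
    """Average time each request may take if `workers` each handle one request at a time."""
    return 60_000.0 * workers / requests_per_minute


if __name__ == "__main__":
    rate = 300  # hypothetical stand-in for "100s req/min"
    print(f"{rate} req/min = {requests_per_second(rate):.1f} req/s")
    print(f"budget on one worker: {budget_ms_per_request(rate):.0f} ms per request")
```

At a handful of requests per second, a slow request path is often a more likely culprit than a lack of capacity, which is one reading of the shift from the perceived problem to the actual one.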

understand business

I don't give a **** if the datacenter is on fire
as long as I am still making money

what does it mean to you?

sales

content

content

ad revenue

every technical decision powers a business need

understand impact

is there a lesser of two evils?

sometimes breaking = fixing

80% now > 100% tomorrow

incremental improvements

anatomy of a problem

[diagram labels, built up over four slides: problem, norm, acceptable, norm; fix, fix, fix, fix]

what have we learned?

understanding of
  what's important
  cause and effect
  largest impact
  acceptable risk

what not to do

don't assume

I didn't build it

it's not documented

it passed the tests

works in dev

everything looks right

don't feed your ego
solve the problem

ask for help

tools

logging
monitoring
profiling

logging
  actionable
  concise
  parsable

[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 16:46:31] AbandonedReservation successfully enqueued.
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Parsed args
[2017-02-01 18:57:02] Posting to API
[2017-02-01 18:57:02] Initializing args
[2017-02-01 18:57:02] Loading reservation_form_data
[2017-02-01 18:57:03] Reservation Form Data loaded successfully
[2017-02-01 18:57:03] Appending campaign info
[2017-02-01 18:57:03] Reservation name: [Some very very very long name]
[2017-02-01 18:57:03] Using code = SUPERHERO:20171007 for this instance.
[2017-02-01 18:57:03] Setting currency to US Dollar
[2017-02-01 18:57:03] Appending marketing info
[2017-02-01 18:57:03] Have a non-sku source_code
[2017-02-01 18:57:03] Marketing: setting campaignid = GOOGLE
[2017-02-01 18:57:03] Appending match rule = Match Rule
[2017-02-01 18:57:03] Appending user info
[2017-02-01 18:57:03] Appending order info
[2017-02-01 18:57:03] Fetching cost range for item_id = 975, sku_id = 4871
[2017-02-01 18:57:03] Determining actual cost table
[2017-02-01 18:57:03] Appending comment notes
[2017-02-01 18:57:03] Appending abandoned flag
[2017-02-01 18:57:03] API GET data: {
    token = gEcre26reWrAdEnufE3HesVupRepahuDumapHuyap2evufreWrufraBebre7u4a6
    contact.address1_city = New York
    contact.address1_country = USA
    contact.address1_line1 = 123 test lane
    contact.address1_line2 =
    contact.address1_postalcode = 12345
    contact.address1_stateorprovince =
    contact.emailaddress1 = [email protected]
    contact.firstname = joe
    contact.lastname = smith
    contact.mobilephone = 1234567890
    ...
}
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0

useful information

[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:03] API GET data:
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0

information I need

[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
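
The log above follows a `[timestamp] message` pattern, which is what makes it parsable. Below is a minimal sketch in Python of pulling out only the lines that matter. The function names and the choice to key on the `Queuing UUID:`, `Processing UUID:` and `ERROR:` prefixes are assumptions made for illustration, and the sketch handles a single job only.

```python
import re
from datetime import datetime

# Matches the "[YYYY-MM-DD HH:MM:SS] message" shape used in the log above.
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.*)$")

# A few lines copied from the log on the slide.
SAMPLE = """\
[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:03] Setting currency to US Dollar
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
"""


def parse(text):
    """Yield (timestamp, message) for every line that matches the log format."""
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            yield stamp, match.group(2)


def summarize(text):
    """Report how long the job sat in the queue and which errors it logged."""
    queued = started = None
    errors = []
    for stamp, message in parse(text):
        if message.startswith("Queuing UUID:"):
            queued = stamp
        elif message.startswith("Processing UUID:"):
            started = stamp
        elif message.startswith("ERROR:"):
            errors.append((stamp, message))
    wait = None
    if queued is not None and started is not None:
        wait = (started - queued).total_seconds()
    return {"queue_wait_seconds": wait, "errors": errors}


if __name__ == "__main__":
    summary = summarize(SAMPLE)
    print("queue wait (s):", summary["queue_wait_seconds"])
    for stamp, message in summary["errors"]:
        print(stamp, message)
```

For the sample job this reports 7831 seconds (about 2 hours 10 minutes) waiting in the queue and a single error, the curl timeout.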

monitoring
  all inclusive
  business-first
  correlatable

what's the problem?

it's broken!

[slide build, one item added at a time: revenue, user performance, database load, decline rate]
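
"Correlatable" can be read as: the metrics share a timeline, so a dip in one can be lined up against the others. The following is a rough sketch in Python using a hand-rolled Pearson correlation. All series values are made up for illustration and are not from the talk, and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same five points in time.
revenue = [100, 100, 60, 20, 100]
metrics = {
    "decline rate": [1, 1, 5, 9, 1],
    "database load": [41, 40, 41, 41, 42],
}


def rank_by_correlation(target, candidates):
    """Sort candidate metrics by how strongly they move with (or against) the target."""
    scored = [(name, pearson(target, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    for name, r in rank_by_correlation(revenue, metrics):
        print(f"{name}: r = {r:+.2f}")
```

In this made-up data the decline rate moves exactly opposite to revenue while database load does not move with it at all, which would point the investigation at declined transactions first. Correlation only suggests where to look; it does not establish the cause.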

profiling

when you have the what
but still have no idea why

#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Request enters Apache in zone www4: remember its URI and reset the timers. */
::ap_process_request:process-request-entry
/zonename == "www4"/
{
    self->uri = copyinstr(arg1);
    self->runtime = 0;
    self->waittime = 0;
    self->oncpu = timestamp;
}

/* Thread leaves the CPU: bank the time it just spent running. */
sched:::off-cpu
/self->uri != 0/
{
    self->runtime += timestamp - self->oncpu;
    self->offcpu = timestamp;
}

/* Thread is back on the CPU: bank the time it spent waiting. */
sched:::on-cpu
/self->uri != 0/
{
    self->oncpu = timestamp;
    self->waittime += timestamp - self->offcpu;
}

/* Request returns: add this hit to the per-URI totals (timestamp is in nanoseconds). */
::ap_process_request:process-request-return
/self->uri != 0/
{
    @duration[self->uri] = sum(self->runtime);
    @waiting[self->uri] = sum(self->waittime);
    @count[self->uri] = count();
}

/* Every five minutes: print the top 10 of each aggregation, then clear it. */
:::tick-5min
{
    printf("\n%Y\n", walltimestamp);
    printf("\nTOTAL TIME SPENT ON CPU BY ALL HITS ON THIS URL\n");
    trunc(@duration,10); printa(@duration); trunc(@duration);

    printf("\n\nNUMBER OF HITS\n");
    trunc(@count,10); printa(@count); trunc(@count);

    printf("\n\nTOTAL TIME SPENT OFF-CPU BY ALL HITS TO THIS URL\n");
    trunc(@waiting,10); printa(@waiting); trunc(@waiting);
}

TOTAL TIME SPENT OFF-CPU BY ALL HITS TO THIS URL

  /directory                        6850049
  /api/map_search                   7341249
  /api/mobile/get_all_items         7980925
  /m/                               9124747
  /m/directory                      9175345
  /api/mobile/get_profile          11729556
  /api/holiday_feed                12603853
  /api/mobile/get_all_widgets      15043481
  /api/get_item/60693              19773404
  /m/events/all                    26165132
  /api/all_items                   27362330
  /api/mobile/get_all_events      368584344

/api/mobile/get_all_events 368584344

/api/get_item/60693 19773404
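
The off-CPU totals above are easier to act on once they are ranked and put in proportion. This is a minimal sketch in Python that parses the `url value` pairs from the slide and computes each URL's share of the total. The function names are hypothetical, and the values are treated as nanoseconds because DTrace's `timestamp` counts nanoseconds.

```python
# Off-CPU totals copied from the DTrace output on the slide.
DTRACE_OUTPUT = """\
/directory 6850049
/api/map_search 7341249
/api/mobile/get_all_items 7980925
/m/ 9124747
/m/directory 9175345
/api/mobile/get_profile 11729556
/api/holiday_feed 12603853
/api/mobile/get_all_widgets 15043481
/api/get_item/60693 19773404
/m/events/all 26165132
/api/all_items 27362330
/api/mobile/get_all_events 368584344
"""


def parse_aggregation(text):
    """Turn 'url value' lines into a {url: nanoseconds} dict, skipping anything else."""
    totals = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            totals[parts[0]] = int(parts[1])
    return totals


def rank(totals):
    """Return (url, nanoseconds, share_of_total) tuples, largest first."""
    grand_total = sum(totals.values())
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(url, ns, ns / grand_total) for url, ns in ordered]


if __name__ == "__main__":
    for url, ns, share in rank(parse_aggregation(DTRACE_OUTPUT))[:3]:
        print(f"{url:32} {ns:>12} ns  {share:6.1%}")
```

On these numbers /api/mobile/get_all_events accounts for roughly 70% of all off-CPU time, more than thirteen times the next URL, which is why it stands out as the place to dig first.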

down the rabbit hole

troubleshooting is
  required skill
  educational
  iterative
  frustrating
  rewarding

questions?
