Post on 26-Mar-2015
Software Testing Doesn’t Scale
James Hamilton
JamesRH@microsoft.com
Microsoft SQL Server
2
Overview
- The Problem:
  - S/W size & complexity inevitable
  - Short cycles reduce S/W reliability
  - S/W testing is the real issue
  - Testing doesn’t scale
    - trading complexity for quality
- Cluster-based solution
  - The Inktomi lesson
  - Shared-nothing cluster architecture
  - Redundant data & metadata
  - Fault isolation domains
3
S/W Size & Complexity Inevitable
- Successful S/W products grow large
- # features used by a given user small
  - But union of per-user feature sets is huge
- Reality of commodity, high volume S/W
  - Large feature sets
  - Same trend as consumer electronics
- Example mid-tier & server-side S/W stack:
  - SAP: ~47 MLOC
  - DB: ~2 MLOC
  - NT: ~50 MLOC
- Testing all feature interactions impossible
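A rough way to see why testing all feature interactions is out of reach is to count them. The sketch below is illustrative only: the feature counts are hypothetical and do not come from the slides. It compares the number of feature pairs with the number of possible feature combinations.

```python
from math import comb


def pairwise_interactions(n_features: int) -> int:
    """Number of distinct feature pairs: n choose 2."""
    return comb(n_features, 2)


def feature_combinations(n_features: int) -> int:
    """Number of on/off combinations of n independent features: 2**n."""
    return 2 ** n_features


if __name__ == "__main__":
    # Hypothetical feature counts, chosen only to show the growth rate.
    for n in (10, 100, 1000):
        digits = len(str(feature_combinations(n)))
        print(f"{n} features: {pairwise_interactions(n)} pairs, "
              f"a {digits}-digit number of combinations")
```

Even pairwise coverage grows quadratically, and full combination coverage grows exponentially, which is the growth the slide is pointing at.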
4
Short Cycles Reduce S/W Reliability
- Reliable TP systems typically evolve slowly & conservatively
- Modern ERP systems can go through 6+ minor revisions/year
- Many e-commerce sites change even faster
  - Fast revisions a competitive advantage
- Current testing and release methodology:
  - As much testing as dev time
  - Significant additional beta-cycle time
- Unacceptable choice:
  - reliable but slow evolving or fast changing yet unstable and brittle
5
Testing the Real Issue
- 15 yrs ago test teams tiny fraction of dev group
- Now test teams of similar size as dev & growing rapidly
- Current test methodology improving incrementally:
  - Random grammar driven test case generation
  - Fault injection
  - Code path coverage tools
- Testing remains effective at feature testing
- Ineffective at finding inter-feature interactions
- Only a tiny fraction of Heisenbugs found in testing (www.research.microsoft.com/~gray/Talks/ISAT_Gray_FT_Avialiability_talk.ppt)
- Beta testing because test known to be inadequate
- Test team growth scales exponentially with system complexity
- Test and beta cycles already intolerably long
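As an illustration of random grammar driven test case generation, here is a minimal sketch. The toy SQL-like grammar, the table and column names, and the `generate` helper are all assumptions made for this example; they are not the generator used by any real test team.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to a list of alternatives,
# and each alternative is a sequence of terminals and nonterminals.
GRAMMAR = {
    "<query>": [["SELECT ", "<cols>", " FROM ", "<table>", "<where>"]],
    "<cols>": [["*"], ["<col>"], ["<col>", ", ", "<cols>"]],
    "<col>": [["id"], ["name"], ["price"]],
    "<table>": [["orders"], ["customers"]],
    "<where>": [[""], [" WHERE ", "<col>", " ", "<op>", " ", "<value>"]],
    "<op>": [["="], ["<"], [">"]],
    "<value>": [["0"], ["42"], ["NULL"]],
}


def generate(symbol, rng, depth=0, max_depth=8):
    """Expand `symbol` into a random string derived from GRAMMAR."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest alternative so that
        # recursive rules such as <cols> are forced to terminate.
        alternatives = [min(alternatives, key=len)]
    choice = rng.choice(alternatives)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in choice)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so a failing case can be replayed
    for _ in range(5):
        print(generate("<query>", rng))
```

A generator like this exercises feature combinations nobody thought to write by hand, but it still only samples the space of interactions rather than covering it.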
6
The Inktomi Lesson
- Inktomi web search engine (SIGMOD’98)
- Quickly evolving software:
  - Memory leaks, race conditions, etc. considered normal
  - Don’t attempt to test & beta until quality high
- System availability of paramount importance
  - Individual node availability unimportant
- Shared nothing cluster
- Exploit ability to fail individual nodes:
  - Automatic reboots avoid memory leaks
  - Automatic restart of failed nodes
  - Fail fast: fail & restart when redundant checks fail
  - Replace failed hardware weekly (mostly disks)
  - Dark machine room
  - No panic midnight calls to admins
- Mask failures rather than futile attempt to avoid
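The fail-fast idea can be sketched in a few lines. This is not Inktomi's code: the `Node` class, its redundant `count` field, and the `supervise` loop are hypothetical names invented for this example. The node keeps a redundant copy of one piece of state, fails as soon as the copies disagree, and the supervisor restarts it instead of trying to repair it.

```python
class NodeFailure(Exception):
    """Raised when a node's redundant self-check fails."""


class Node:
    def __init__(self, name):
        self.name = name
        self.restarts = 0
        self.items = []
        self.count = 0  # redundant copy of len(self.items)

    def check(self):
        # Explicit check rather than `assert`, so that running Python
        # with -O does not strip it from the shipped configuration.
        if self.count != len(self.items):
            raise NodeFailure(
                f"{self.name}: count {self.count} != {len(self.items)}")

    def add(self, item):
        self.items.append(item)
        self.count += 1
        self.check()

    def restart(self):
        # Throw away in-memory state instead of attempting a repair.
        self.items = []
        self.count = 0
        self.restarts += 1


def supervise(node, operations):
    """Run each operation; on a failed check, restart the node and go on."""
    for op in operations:
        try:
            op(node)
        except NodeFailure:
            node.restart()


def corrupt(node):
    """Simulate a memory-corruption bug for demonstration purposes."""
    node.count += 5
    node.check()


if __name__ == "__main__":
    n = Node("n1")
    supervise(n, [lambda x: x.add("a"), corrupt, lambda x: x.add("b")])
    print(n.restarts, n.items)
```

In a real cluster the restarted node would refill its state from other copies of the data; this single-node sketch only shows the detect, fail, and restart cycle.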
7
Apply to High Value TP Data?
- Inktomi model:
  - Scales to 100’s of nodes
  - S/W evolves quickly
  - Low testing costs and no-beta requirement
- Exploits ability to lose individual node without impacting system availability
  - Ability to temporarily lose some data w/o significantly impacting query quality
- Can’t lose data availability in most TP systems
  - Redundant data allows node loss w/o data availability lost
- Inktomi model with redundant data & metadata a solution to exploding test problem
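A back-of-the-envelope sketch of why redundant copies matter. It assumes node failures are independent, which real clusters only approximate, and the failure probability used is a made-up figure rather than a measurement from the talk.

```python
def data_unavailable_prob(node_down_prob: float, replicas: int) -> float:
    """Probability that every copy of a data item is down at once,
    assuming each node is down independently with the same probability."""
    return node_down_prob ** replicas


def data_availability(node_down_prob: float, replicas: int) -> float:
    """Probability that at least one copy is reachable."""
    return 1.0 - data_unavailable_prob(node_down_prob, replicas)


if __name__ == "__main__":
    p = 0.01  # hypothetical: each node is down 1% of the time
    for k in (1, 2, 3):
        print(f"{k} copies: availability {data_availability(p, k):.6f}")
```

Under that independence assumption, each extra copy multiplies the chance of losing access by the per-node failure probability, so unreliable nodes can still add up to highly available data.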
8
Connection Model/Architecture
[Diagram labels: Client, Server Node, Server Cloud]
- All data & metadata multiply redundant
- Shared nothing
- Single system image
- Symmetric server nodes
- Any client connects to any server
- All nodes SAN-connected
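One way to picture "any client connects to any server" is a client that treats the nodes as interchangeable and moves on when one is down. The `ServerNode` class and `send` function below are hypothetical stand-ins for this sketch, not part of any real client library.

```python
import random


class ServerNode:
    """Hypothetical symmetric server node."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def handle(self, request):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def send(request, nodes, rng=random):
    """Try the nodes in random order; any live node can serve the request."""
    for node in rng.sample(nodes, len(nodes)):
        try:
            return node.handle(request)
        except ConnectionError:
            continue  # this node is down, try another one
    raise ConnectionError("no server node available")


if __name__ == "__main__":
    cloud = [ServerNode("n1", up=False), ServerNode("n2"), ServerNode("n3")]
    print(send("q1", cloud))
```

Because the nodes are symmetric, the client needs no knowledge of where the data lives, and a down node costs it one retry rather than an outage.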
9
Compilation & Execution Model
[Diagram labels: Client, Server Cloud, Server Thread]
- Server Thread: Lex analyze, Parse, Normalize, Optimize, Code generate, Query execute
- Query execution on many subthreads synchronized by root thread
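The root-thread and subthread arrangement can be sketched with a thread pool. The partitioned list and the counting query are assumptions for illustration; a real engine would run compiled plan fragments on separate nodes rather than Python callables in one process.

```python
from concurrent.futures import ThreadPoolExecutor


def execute_partition(rows, predicate):
    """Subthread work: scan one partition and return a partial count."""
    return sum(1 for row in rows if predicate(row))


def execute_query(partitions, predicate):
    """Root thread: start one task per partition, wait for all of them,
    then combine the partial results into the final answer."""
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(
            pool.map(lambda part: execute_partition(part, predicate),
                     partitions))
    return sum(partials)


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5], [6]]  # hypothetical partitions
    print(execute_query(data, lambda x: x % 2 == 0))
```

The root thread does no scanning itself; it only synchronizes the subthreads and merges their results.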
10
Node Loss/Rejoin
[Diagram labels: Client, Server Cloud]
- Execution in progress
- Lose node
  - Recompile
  - Re-execute
- Rejoin
  - Node local recovery
  - Rejoin cluster
  - Recover global data at rejoining node
  - Rejoin cluster
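The lose node, recompile, re-execute sequence can be sketched as a retry loop around planning. Everything here is hypothetical: the `replicas` map from partition to holder nodes, the `read` callback, and the `NodeLost` exception are names invented for this example.

```python
class NodeLost(Exception):
    """Raised by a read when the node it targets has dropped out."""

    def __init__(self, node):
        super().__init__(node)
        self.node = node


def compile_plan(live_nodes, replicas):
    """For each partition, pick the first live node that holds a copy."""
    plan = {}
    for partition, holders in replicas.items():
        candidates = [n for n in holders if n in live_nodes]
        if not candidates:
            raise RuntimeError(f"no live copy of {partition}")
        plan[partition] = candidates[0]
    return plan


def run_query(live_nodes, replicas, read):
    """Execute; if a node is lost mid-query, recompile and re-execute."""
    live = set(live_nodes)
    while True:
        plan = compile_plan(live, replicas)
        try:
            return {p: read(node, p) for p, node in plan.items()}
        except NodeLost as lost:
            live.discard(lost.node)  # drop the node from the next plan


if __name__ == "__main__":
    copies = {"p1": ["a", "b"], "p2": ["b", "c"]}
    failed = {"a"}

    def read(node, partition):
        if node in failed:
            raise NodeLost(node)
        return f"{partition}@{node}"

    print(run_query({"a", "b", "c"}, copies, read))
```

The sketch throws away partial results and reruns the whole query, which matches the simple recompile and re-execute wording on the slide; resuming mid-plan would be an optimization beyond what the slide states.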
11
Redundant Data Update Model
[Diagram labels: Client, Server Cloud]
- Updates are standard parallel plans
- Optimizer knows all redundant data paths
- Generated plan updates all
- No significant new technology
  - Like materialized view & index updates today
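A minimal sketch of a plan that updates every redundant copy. The `replicas` map and the in-memory `stores` dictionaries are hypothetical stand-ins for the optimizer's catalog and the nodes' storage. The steps run sequentially here for simplicity, whereas the slide describes them as parallel plans.

```python
def compile_update_plan(partition, replicas):
    """Emit one update step per node that holds a copy of the partition."""
    return [(node, partition) for node in replicas[partition]]


def apply_update(plan, stores, key, value):
    """Apply the same write at every step of the plan."""
    for node, partition in plan:
        stores[node][partition][key] = value


if __name__ == "__main__":
    replicas = {"p1": ["a", "b"]}
    stores = {"a": {"p1": {}}, "b": {"p1": {}}, "c": {"p2": {}}}
    apply_update(compile_update_plan("p1", replicas), stores, "k", 1)
    print(stores)
```

This mirrors how an index or materialized view is maintained today: the extra copies are just additional targets in the generated plan.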
12
Fault Isolation Domains
- Trade single-node perf for redundant data checks:
  - Fairly common…but complex error recovery is even more likely to be wrong than original forward processing code
  - Many of the best redundant checks are compiled out of “retail versions” when shipped (when needed most)
- Fail fast rather than attempting to repair:
  - Bring down node for mem-based data structure faults
  - Never patch inconsistent data…other copies keep system available
- If anything goes wrong “fire” the node and continue:
  - Attempt node restart
  - Auto-reinstall O/S, DB and recreate DB partition
  - Mark node “dead” for later replacement
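The three-step escalation for a "fired" node can be sketched as follows. The `restart` and `reinstall` callbacks are hypothetical; each is assumed to return True when the node comes back healthy.

```python
def fire_node(node, restart, reinstall):
    """Escalate through the recovery steps and report which one ended it.

    `node` is a plain dict here; `restart` and `reinstall` are
    caller-supplied callables that return True on success.
    """
    if restart(node):
        return "restarted"
    if reinstall(node):  # reinstall O/S and DB, recreate the DB partition
        return "reinstalled"
    node["state"] = "dead"  # leave it for later hardware replacement
    return "dead"


if __name__ == "__main__":
    n = {"name": "n7", "state": "failed"}
    print(fire_node(n, restart=lambda x: False, reinstall=lambda x: False))
    print(n)
```

None of the steps tries to diagnose or patch the fault, so the recovery path stays short and simple enough to be trusted, which is the point the slide makes about complex error recovery.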
13
Summary
- 100 MLOC of server-side code and growing:
  - Can’t fight it & can’t test it … quality will continue to decline if we don’t do something different
- Can’t afford 2 to 3 year dev cycle
- 60’s large system mentality still prevails:
  - Optimizing precious machine resources is false economy
- Continuing focus on single-system perf dead wrong:
  - Scalability & system perf rather than individual node performance
- Why are we still incrementally attacking an exponential problem?
- Any reasonable alternatives to clusters?