![Page 1: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/1.jpg)
Systems Seminar Schedule
Monday, 18 February, 4pm:
– “New Wine in Old Bottles” - Douglas Thain
4 March:
– No seminar: Paradyn/Condor Week
Tuesday, 19 March, 3pm:
– “The Microsoft .NET System” - Mike Litzkow
Tuesday, 2 April, 3pm:
– “Condor and the Grid” - Miron Livny
Monday, 15 April, 4pm:
– “Exploiting Gray-Box Knowledge of Buffer-Cache Management” - Nathan Burnett
Monday, 29 April, 4pm:
– “Bridging the Information Gap in Storage Protocol Stacks” - Tim Denehy
![Page 2: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/2.jpg)
New Wine in Old Bottles:
Java on Condor
Douglas Thain
University of Wisconsin
18 February 2002
![Page 3: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/3.jpg)
Abstract
We have added Java support to Condor. I’ll tell you how it works and how to use it. There are some nifty features for end users.
Adding this code forced us to think about the fundamental problem of coupling systems and representing errors.
A lesson: One must consider the scope of an error as well as its detail.
![Page 4: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/4.jpg)
Disclaimer:
This is still rough around the edges.
(Someone had to go first!)
![Page 5: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/5.jpg)
Outline
Why Java and Condor?
Architecture
Initial Experience
A Little Error Theory
Changes for the Better
Conclusions
![Page 6: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/6.jpg)
Java for Scientific Computing
Java is emerging as a tool for large scale (Grande) scientific computing.
– More accessible to domain scientists.
– Simplified porting.
– Faster development, debugging.
User communities are forming:
– ACM Java Grande Conference
– The Java Grande Forum
A. Globus, E. Langhirt, M. Livny, R. Ramamurthy, M. Solomon, and S. Traugott. JavaGenes and Condor: Cycle-Scavenging genetic algorithms. ACM Conf on Java Grande, 2000.
![Page 7: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/7.jpg)
Limitations
Java floating point and complex arithmetic do not yet satisfy all of the scientific community.
– Arguments continue between industry and academia.
Java programs are still slower than comparable programs in C/C++/Fortran.
– WAT compilers and JIT compilers are catching up.
– You choose: 2x slowdown vs 5x machines.
Can we really harness 5x machines while still maintaining platform independence?
![Page 8: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/8.jpg)
Condor for Scientific Computing
Condor creates a high-throughput computing system on a community of computers.
A high-throughput computing system seeks to maximize the amount of work done over a long period of time.
A community of computers may be any collection of machines that agree to work together.
![Page 9: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/9.jpg)
Condor Enables Ordinary Users
[Figure: a condor schedd holding a queue of jobs, and many condor startd machines (each with CPU and RAM) running those jobs. Pool labels: INFN Central Manager, UWCS Central Manager.]
![Page 10: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/10.jpg)
226 Condor Pools
5576 Condor Hosts
Top 10 Condor Pools:
[Chart: host counts for the top 10 pools on an axis from 0 to 800; the individual bar values are not recoverable.]
![Page 11: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/11.jpg)
The Hype:
Java:
– “Write once, run anywhere!”
Condor:
– “Submit once, run everywhere!”
The Grid:
– Uniform, dependable, consistent, pervasive, and inexpensive computing.
![Page 12: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/12.jpg)
The Reality
Coupling systems is not trivial!
The easy part:
– Putting java in front of the program name.
The tricky parts:
– Java installation messes.
– Unavailable file systems.
– Distinguishing program errors from environmental errors.
![Page 13: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/13.jpg)
Outline
Why Java and Condor?
Architecture
Initial Experience
A Little Error Theory
Changes for the Better
Conclusions
![Page 14: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/14.jpg)
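[Figure: Condor architecture. Components: the MatchMaker; the schedd with Job Policies and the Home File System; the startd with Machine Policies; the shadow (forked by the schedd); the starter (forked by the startd); The Job (forked by the starter). Protocols: Matchmaking, Claiming, Activation, and Execution (between shadow and starter).]
Creates the execution environment.
Exports the details, policy, and I/O services.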
![Page 15: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/15.jpg)
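[Figure: Java Universe architecture. The shadow runs an I/O Server with access to the Home File System. The starter runs an I/O Proxy and forks the JVM, which holds the Wrapper, the I/O Library, and The Job. Links: Secure Remote I/O between the I/O Server and the I/O Proxy; Local I/O (Chirp) between the I/O Library and the I/O Proxy; Local System Calls.]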
![Page 16: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/16.jpg)
User Interface

condor_status -java

Name           JavaVendor   Ver    State    Activity  LoadAv  Mem
aish.cs.wisc.  Sun Microsy  1.2.2  Owner    Idle      0.000   249
anfrom.cs.wis  Sun Microsy  1.2.2  Owner    Idle      0.030   249
babe.cs.wisc.  Sun Microsy  1.2.2  Claimed  Busy      1.120   123
...

             Machines  Owner  Claimed  Unclaimed  Matched  Preempting
INTEL/LINUX       514    101      408          5        0           0
Total             514    101      408          5        0           0
![Page 17: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/17.jpg)
User Interface

universe = java
executable = Main.class
jar_files = MyLibrary.jar
input = infile
output = outfile
arguments = Main 1 2 3
queue

condor_submit
![Page 18: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/18.jpg)
I/O Interface
Input, output, and error files are automatically transferred to/from the execution site.
Any other named files may be transferred as well.
To do online I/O without transferring whole files, you must make small changes to the code:
– FileInputStream -> ChirpInputStream
– FileOutputStream -> ChirpOutputStream
![Page 19: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/19.jpg)
[Figure: software stack with layers labeled Application, Java Standard Libraries, Chirp I/O Library, Java Virtual Machine, JNI, C Standard Library, and Operating System.]
We added a new library on top of existing interfaces. The user must call new constructors.
Java symbols are fully qualified, so transparent replacement of classes is not possible.
Could replace native methods in the JVM, but this ties us to open-source JVMs.
Could trap real system calls, but these are complex (asynchronous, nonblocking, threaded) and may be difficult to distinguish from the JVM’s own operations.
![Page 20: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/20.jpg)
Outline
Why Java and Condor?
Architecture
Initial Experience
A Little Error Theory
Changes for the Better
Conclusions
![Page 21: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/21.jpg)
Initial Experience
Bad news: Nearly any unexpected failure would cause the job to be returned to the user:
– Out of memory at execution site.
– Java misconfigured at execution site.
– I/O proxy can’t initialize.
– Home file system offline.
![Page 22: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/22.jpg)
Initial Experience
Although this was correct in some sense -- the information was true -- it was very frustrating.
Users want to know when their program fails by design (NullPointerException), but not if it fails due to the environment.
What did we do wrong?
![Page 23: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/23.jpg)
Outline
Why Java and Condor?
Architecture
Initial Experience
A Little Error Theory
Changes for the Better
Conclusions
![Page 24: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/24.jpg)
A Little Error Theory
Build on standard definitions from fault-tolerance and programming languages.
Some brief examples to get the idea.
Return to Condor and use the theory to understand our design mistakes.
![Page 25: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/25.jpg)
Fault Tolerance Terminology
Failure
– An externally-visible deviation from specifications.
Error
– An internal data state that leads to a failure.
Fault
– An external event that creates an error.
A. Avizienis and J.C. Laprie. Dependable computing: From concepts to design diversity. Proceedings of the IEEE 74(5), May 1986.
![Page 26: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/26.jpg)
Example
Client: “What is sqrt(4)?”
Server: “Hmm, sqrt(4) is...”
FAULT
Server: “Hmm, sqrt(9) is...” (ERROR)
Server: “Answer: 3” (FAILURE)
![Page 27: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/27.jpg)
Implicit errors
– The system claims to have reached a valid result, but an auditor claims it is invalid. Example: sqrt(3)==2
Explicit errors
– The system tells us it cannot complete the desired action. Example: file not found.
Escaping errors
– The system detects an error, but has no method of reporting it, so it escapes by an alternate route. Example: core dump, kernel panic.
John B. Goodenough. Exception handling: issues and a proposed notation. CACM 18(12), December 1975.
K. Ekanadham and A. Bernstein. Some new transitions in hierarchical level structures. Operating Systems Review 12(4), 1978.
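The three kinds of error can be put side by side in a few lines of Java. This is only an illustrative sketch: the class `ErrorKinds` and its methods are hypothetical names invented here, and modeling an escaping error as an unchecked `java.lang.Error` is an assumption made for the example, not something taken from Condor.

```java
import java.io.FileNotFoundException;
import java.util.Set;

class ErrorKinds {
    /** Implicit error: claims a valid result that an auditor would reject. */
    static int buggySqrt(int x) {
        // Rounding up is the (deliberate) bug: sqrt(3) comes back as 2.
        return (int) Math.ceil(Math.sqrt(x));
    }

    /** Explicit error: the interface itself says the action may not complete. */
    static String open(Set<String> existing, String name) throws FileNotFoundException {
        if (!existing.contains(name)) {
            throw new FileNotFoundException(name);
        }
        return "handle:" + name;
    }

    /** Escaping error: the signature has no way to report it, so it leaves by another route. */
    static int load(int[] memory, int address) {
        if (address < 0 || address >= memory.length) {
            throw new Error("cannot load address " + address);
        }
        return memory[address];
    }

    public static void main(String[] args) {
        System.out.println("implicit: buggySqrt(3) = " + buggySqrt(3));
        try {
            open(Set.of("infile"), "missing");
        } catch (FileNotFoundException e) {
            System.out.println("explicit: file not found: " + e.getMessage());
        }
        try {
            load(new int[4], 99);
        } catch (Error e) {
            System.out.println("escaping: " + e.getMessage());
        }
    }
}
```

Only the explicit case shows up in a method signature; the caller of `buggySqrt` has no way to notice anything went wrong.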
![Page 28: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/28.jpg)
[Figure: a Program issues a load to the Virtual Memory System, which is backed by Physical Memory and a Backing Store. The Program ends in a Normal Exit or an Abnormal Exit, reported to its Parent Process.]
load data
– Could return a default value, but that creates an implicit error.
– Would like to return an explicit error, but a load instruction has no exit code.
– Escaping error: Tell the parent that the program could not complete.
![Page 29: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/29.jpg)
Interface Contracts
int load( int address );
The implementor must either compute a result that conforms to the contract or cause an escaping error.
C. Hoare. An axiomatic basis for computer programming. CACM 12(10):576-580, October 1969.
B. Meyer. Object-Oriented Software Construction. Prentice Hall, 1997.
![Page 30: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/30.jpg)
Exceptions
int open( String filename ) throws FileNotFound, AccessDenied;
A language with exceptions provides more structure to the contract. A declared exception is an explicit error. Yet, escaping errors are still possible.
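A rough Java rendering of that contract is sketched below. The exception names `FileNotFound` and `AccessDenied` come from the slide; the `VirtualFileSystem` interface, the `DiskOffline` error, and the toy file names are assumptions made up for illustration. The caller handles the declared exceptions, while the undeclared `DiskOffline` passes straight through it.

```java
class ContractDemo {
    // Declared (explicit) errors named on the slide.
    static class FileNotFound extends Exception { FileNotFound(String m) { super(m); } }
    static class AccessDenied extends Exception { AccessDenied(String m) { super(m); } }
    // Hypothetical escaping error: not part of the declared contract.
    static class DiskOffline extends Error { DiskOffline(String m) { super(m); } }

    interface VirtualFileSystem {
        int open(String filename) throws FileNotFound, AccessDenied;
    }

    // Toy implementation: the file names are arbitrary triggers for each outcome.
    static final VirtualFileSystem FS = filename -> {
        if (filename.equals("data")) return 3;
        if (filename.equals("secret")) throw new AccessDenied(filename);
        if (filename.equals("offline")) throw new DiskOffline(filename);
        throw new FileNotFound(filename);
    };

    /** The caller deals with every error the contract declares, and nothing else. */
    static String tryOpen(String filename) {
        try {
            return "fd " + FS.open(filename);
        } catch (FileNotFound | AccessDenied e) {
            return "explicit error: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryOpen("data"));
        System.out.println(tryOpen("missing"));
        try {
            tryOpen("offline");
        } catch (DiskOffline e) {
            // Plays the role of the parent process in this sketch.
            System.out.println("escaped past the caller: " + e.getMessage());
        }
    }
}
```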
![Page 31: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/31.jpg)
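[Figure: a Program calls open on the Virtual File System, which is backed by Memory and Disk. At the INTERFACE level, open yields Success, FileNotFound, or AccessDenied. At the IMPLEMENTATION level, the possible errors are MemoryCorrupt, DiskOffline, and PigeonLost. The Program ends in a Normal Exit or an Abnormal Exit, reported to its Parent Process.]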
![Page 32: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/32.jpg)
Error Scope
In order to be accepted by end users, a distributed system must be able to distinguish between errors computed by the program and errors forced upon it by the environment.
We use the term scope to draw the distinction.
![Page 33: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/33.jpg)
Error Scope
The scope of an error is the portion of the system that it invalidates.
An error must be delivered to the process responsible for managing that scope.

Error                    Scope        Handler
FileNotFound             File         Calling Function
RPC Disconnect           Process      Parent Process
Cache Coherency Problem  Machine      Hypervisor or Operator
PVM Node Crash           PVM Cluster  Parent Process
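The table reads naturally as a lookup from error to scope to handler. The sketch below encodes exactly those four rows; the class `ScopeRouting`, the `Scope` enum, and the `route` method are hypothetical names chosen for illustration, not part of any Condor interface.

```java
import java.util.Map;

class ScopeRouting {
    /** Each scope names the party responsible for managing it. */
    enum Scope {
        FILE("calling function"),
        PROCESS("parent process"),
        MACHINE("hypervisor or operator"),
        PVM_CLUSTER("parent process");

        final String handler;

        Scope(String handler) {
            this.handler = handler;
        }
    }

    /** The four example rows from the table. */
    static final Map<String, Scope> SCOPE_OF = Map.of(
            "FileNotFound", Scope.FILE,
            "RPC Disconnect", Scope.PROCESS,
            "Cache Coherency Problem", Scope.MACHINE,
            "PVM Node Crash", Scope.PVM_CLUSTER);

    /** Returns who should receive the error, based on its scope rather than its detail. */
    static String route(String error) {
        Scope scope = SCOPE_OF.get(error);
        if (scope == null) {
            throw new IllegalArgumentException("unknown error: " + error);
        }
        return scope.handler;
    }

    public static void main(String[] args) {
        for (String error : SCOPE_OF.keySet()) {
            System.out.println(error + " -> " + route(error));
        }
    }
}
```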
![Page 34: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/34.jpg)
Error Detail
The detail of an error describes in phenomenological terms the cause of the error.
In the right hands, the detail is useful. In the wrong hands, the detail can be misleading.
Suppose open returns AccessDenied...
– File is not accessible - Ok.
– Library containing ‘open’ is not accessible - Problem!
![Page 35: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/35.jpg)
Lessons
Principle 1:
– A routine must not generate an implicit error as a result of receiving an explicit error.
Principle 2:
– An escaping error converts a potential implicit error into an explicit error at a higher level.
Principle 3:
– An escaping error must be propagated to the program that manages the error’s scope.
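Principle 1 is the easiest of the three to violate in everyday code. The sketch below contrasts a routine that swallows an explicit error and returns a plausible default with one that keeps the error explicit. The `Source` interface and both method names are assumptions invented for this example.

```java
import java.io.IOException;

class PrincipleOne {
    /** Hypothetical data source whose read may fail explicitly. */
    interface Source {
        int read() throws IOException;
    }

    /** Violates Principle 1: an explicit error becomes a valid-looking 0. */
    static int readOrZero(Source source) {
        try {
            return source.read();
        } catch (IOException e) {
            return 0; // the caller cannot tell this from real data
        }
    }

    /** Respects Principle 1: the explicit error stays explicit. */
    static int readStrict(Source source) throws IOException {
        return source.read();
    }

    public static void main(String[] args) {
        Source broken = () -> { throw new IOException("home file system offline"); };
        System.out.println("readOrZero gives " + readOrZero(broken));
        try {
            readStrict(broken);
        } catch (IOException e) {
            System.out.println("readStrict reports: " + e.getMessage());
        }
    }
}
```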
![Page 36: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/36.jpg)
Outline
Why Java and Condor?
Architecture
Initial Experience
A Little Error Theory
Changes for the Better
Conclusions
![Page 37: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/37.jpg)
Java and Condor Revisited
What did we do wrong?
We focussed on error detail without considering error scope.
![Page 38: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/38.jpg)
![Page 39: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/39.jpg)
Java and Condor Revisited
To fix the system, we revisited the notion of error scope throughout.
Two examples:
– JVM exit code
– I/O errors
![Page 40: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/40.jpg)
JVM Exit Code

Detail                                 Scope            Exit Code
Program exited by completing main      Program          0
Program exited through System.exit(x)  Program          x
Exception: Null pointer.               Program          1
Exception: Out of memory.              Virtual Machine  1
Exception: Java Misconfigured.         Remote Resource  1
Exception: Home file system offline.   Local Resource   1
Exception: Program image corrupt.      Job              1
![Page 41: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/41.jpg)
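[Figure: the shadow (with the Home File System) and the starter, with the JVM forked below the starter. The JVM holds the Wrapper, the I/O Library, and The Job. Labels: JVM Result; Result File (Result of Execution Attempt + Result of Program, if any); Starter Result + Program Result.]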
![Page 42: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/42.jpg)
I/O Error Scope
All Java I/O operations throw a single exception type -- IOException.
Our mistake: convert all detected errors into IOExceptions and pass them to the program.
Makes sense for FileNotFound, but not for ProxyUnavailable or CredentialsExpired.
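One way to picture the corrected behavior is an I/O library that sorts each condition by scope before deciding how to raise it. The sketch below is illustrative only: the `ProxyStatus` values, `OutOfScopeError`, and `check` are hypothetical names, and this is not the actual Chirp library code.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

class IoScopeSketch {
    /** Hypothetical outcomes an I/O proxy might report. */
    enum ProxyStatus { OK, FILE_NOT_FOUND, PROXY_UNAVAILABLE, CREDENTIALS_EXPIRED }

    /** Hypothetical escaping error for conditions outside the program's scope. */
    static class OutOfScopeError extends Error {
        OutOfScopeError(String detail) {
            super(detail);
        }
    }

    /** Program-scope problems become IOExceptions; everything else escapes as an Error. */
    static void check(ProxyStatus status, String path) throws IOException {
        switch (status) {
            case OK:
                return;
            case FILE_NOT_FOUND:
                // The program asked for a file that does not exist: its own business.
                throw new FileNotFoundException(path);
            default:
                // The program can do nothing useful about these conditions.
                throw new OutOfScopeError(status + " while accessing " + path);
        }
    }

    public static void main(String[] args) {
        for (ProxyStatus status : ProxyStatus.values()) {
            try {
                check(status, "infile");
                System.out.println(status + ": no error");
            } catch (IOException e) {
                System.out.println(status + ": IOException for the program");
            } catch (OutOfScopeError e) {
                System.out.println(status + ": escapes the program");
            }
        }
    }
}
```

A program that catches `IOException` never sees `PROXY_UNAVAILABLE` or `CREDENTIALS_EXPIRED` in this sketch, so it cannot mistake them for a problem with its own files.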
![Page 43: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/43.jpg)
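[Figure: the starter and the JVM, which holds the Wrapper, the I/O Library, and The Job. The I/O Library connects to the I/O Proxy. Two error paths are labeled Error Inside Program Scope and Error Outside Program Scope. Other labels: JVM Result; Result File (Result of Execution Attempt + Result of Program, if any).]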
![Page 44: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/44.jpg)
Outline
Why Java and Condor?
Architecture
Initial Experience
A Little Error Theory
Changes for the Better
Conclusions
![Page 45: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/45.jpg)
Conclusion
We started building the Java Universe with some naive assumptions about errors.
On encountering practical difficulties, we thought more abstractly about errors and developed the notion of scope and detail.
By routing errors according to their scope, we made the system more robust and usable.
![Page 46: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/46.jpg)
Food for Thought
There isn’t always an easy way to propagate an error to the scope handler.
– Escaping error to parent process: Raise a POSIX signal.
– Escaping error to the starter: Throw a Java Error, trapped by the Wrapper, placed in file, read after process exits.
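The second route can be sketched in miniature. Everything here is a hypothetical illustration rather than Condor's actual wrapper: `WrapperSketch`, `EnvironmentError`, the one-line result format, and `shouldReturnToUser` are invented names. The idea is only that the wrapper records the scope of the outcome in a file, and a starter-like parent reads that file instead of trusting the exit code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class WrapperSketch {
    /** Hypothetical marker for errors outside the program's scope. */
    static class EnvironmentError extends Error {
        EnvironmentError(String detail) {
            super(detail);
        }
    }

    /** Runs the job and records whether the outcome belongs to the program or the environment. */
    static void runAndRecord(Runnable job, Path resultFile) throws IOException {
        String line;
        try {
            job.run();
            line = "scope=program result=normal-exit";
        } catch (EnvironmentError | OutOfMemoryError e) {
            line = "scope=environment detail=" + e.getMessage();
        } catch (Throwable t) {
            // Anything else (for example a NullPointerException) is the program's own failure.
            line = "scope=program result=exception detail=" + t.getClass().getSimpleName();
        }
        Files.write(resultFile, List.of(line));
    }

    /** What a starter-like parent might decide after the process exits. */
    static boolean shouldReturnToUser(Path resultFile) throws IOException {
        return Files.readAllLines(resultFile).get(0).startsWith("scope=program");
    }

    public static void main(String[] args) throws IOException {
        Path resultFile = Files.createTempFile("result", ".txt");

        runAndRecord(() -> { throw new NullPointerException(); }, resultFile);
        System.out.println("null pointer, return to user? " + shouldReturnToUser(resultFile));

        runAndRecord(() -> { throw new EnvironmentError("home file system offline"); }, resultFile);
        System.out.println("file system offline, return to user? " + shouldReturnToUser(resultFile));
    }
}
```

In this sketch a job whose result file says `scope=environment` would be a candidate for running again elsewhere rather than being handed back to the user.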
![Page 47: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/47.jpg)
Food for Thought
The mere use of exceptions in a program does not imply disciplined error management.
For example, throws IOException is a very vague statement about an interface. What is an implementor allowed to throw?
– Can open() throw FileNotFound? (Probably.)
– Can read() throw FileNotFound? (Asking for trouble.)
– What about ConnectionRefused?
![Page 48: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/48.jpg)
Food for Thought
A contract can govern more than simply the interface specification.
Consider this self-cleaning program:
fd = open("file");
unlink("file");
close(fd);
Works on UNIX, fails on WinNT.
Can an interface (code+docs) really state all the necessary semantic information? Should it?
![Page 49: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/49.jpg)
Deployment
As of February 14th, the Java Universe is running on 515 RedHat 7.2 machines.
Will be rolled out as part of Condor 6.3.2 on all platforms in the regular release schedule.
Sun JDK 1.2.2 on UNIX machines.
Sun JDK 1.3.2 on WinNT machines.
“Is the Java Universe available on my machine?”
– condor_status -java
![Page 50: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/50.jpg)
[Figure: deployment sites labeled c2 cluster, tux lab, istat, and skywalker.cs.wisc.edu.]
![Page 51: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/51.jpg)
Acknowledgements
Although we may take credit (or blame) for the most recent changes, the Condor architecture has dealt with errors for many years. Much credit goes to the core designers, esp. Mike Litzkow, Todd Tannenbaum, and Derek Wright.
![Page 52: Systems Seminar Schedule](https://reader036.vdocuments.site/reader036/viewer/2022070500/56816866550346895ddec3ea/html5/thumbnails/52.jpg)
More Info:
The Condor Project:
– http://www.cs.wisc.edu/condor
These slides:
– http://www.cs.wisc.edu/~thain
Douglas Thain
– [email protected]
Questions now?