source control howto

8/14/2019 Source Control HOWTO


Source Control HOWTO

by Eric Sink (http://www.ericsink.com/)

Original tutorial source: http://www.ericsink.com/scm/source_control.html

I am writing a series of articles explaining how to do source control and the best practices thereof. See below for links to the individual chapters in this series. The Introduction explains my motivations and goals for writing this series.

Please note: This is a work in progress. I plan to be adding new chapters over time, and I may also be revising the existing chapters as I go along.

Printer-friendly version: Sorry folks, but I currently do not have this material available in a form which is more suitable for paper. I am planning to eventually publish this material as a book. When that happens, a link will appear here.

Chapter 0: Introduction (scm_intro.html)

Our universities don't teach people how to do source control. Our employers don't teach people how to do source control. SCM tool vendors don't teach people how to do source control. We need some materials that explain how source control is done. My goal for this series of articles is to create a comprehensive guide to help meet this need.

Chapter 1: Basics (scm_basics.html)

Our discussion of source control must begin by defining the basic terms and describing the basic operations.

Chapter 2: Checkins (scm_checkins.html)

In this chapter, I will explore the various situations wherein a repository is modified, starting with the simplest case of a single developer making a change to a single file.

Chapter 3: File Merge (scm_file_merge.html)

Many software teams have discovered that the tradeoff here is worth the trouble. Concurrent development can bring substantial gains in the productivity of a team. The extra effort to deal with merge situations is usually a small price to pay.

Chapter 4: Repositories (scm_repositories.html)

A file system is two-dimensional: its space is defined by directories and files. In contrast, a repository is three-dimensional: it exists in a continuum defined by directories, files and time. An SCM repository contains every version of your source code that has ever existed. The additional dimension creates some rather interesting challenges in the architecture of a repository and the decisions about how it manages data.

Chapter 5: Working Folders (scm_working_folders.html)

The repository is the official archive of our work. We treat our repository with great respect. In contrast, we treat our working folder with very little regard. It exists for the purpose of being abused. Our working folder starts out worthless, nothing more than a copy of the repository. If it is destroyed, we have lost nothing, so we run risky experiments which endanger its life.

Chapter 6: History (scm_history.html)

There is nothing endearing about a development team that can't find something when they need it. A good SCM tool must do more than just keep every version of everything. It must also provide ways of searching and viewing and sorting and organizing and finding all that stuff.



Chapter 7: Branches (scm_branches.html)

Nelly has a friend who has a cousin with a neighbor who knows somebody whose life completely fell apart after they tried using the branch and merge features of their source control tool. So Nelly refuses to use branching at all.

Chapter 8: Merge Branches (scm_merge_branches.html)

Successfully using the branching and merging features of your source control tool is first a matter of attitude on the part of the developer. No matter how much help the source control tool provides, it is not as smart as you are. You are responsible for doing the merge. Think of the tool as a tool, not as a consultant.

Chapter 9: Source Control Integration with IDEs (scm_ide_integration.html)

Just as a spice rack belongs near the stove, source control should always be available where the developer is working.

Here's a list of chapters I am thinking about writing:

- A chapter on integration with bug-tracking and automated builds
- A chapter on common mistakes people make when using source control
- A chapter on remote access (client/server, binary deltas, security, ...)
- A chapter on importing from one source control tool to another
- A chapter on cross-platform issues
- A chapter on writing custom tools which access a source control server
- A chapter on miscellaneous stuff that doesn't fit anywhere else (share, pin, cloak, shadow folders, email notifications, browser-based clients, keyword expansion, ...)



Best Practice: Use source control

Some surveys indicate that 70% of software teams do not use any kind of source control tool. I cannot imagine how they cope.

Throughout this series of articles, I will be sprinkling Best Practices that will appear in sidebar boxes like this one. These boxes will contain pithy and practical tips for developers and managers using SCM tools.

Chapter 0: Introduction

This is part of an online book called Source Control HOWTO (source_control.html), a best practices guide on source control, version control, and configuration management.

What is source control?

Sometimes we call it "version control". Sometimes we call it "SCM", which stands for either "software configuration management" or "source code management". Sometimes we call it "source control". I use all these terms interchangeably and make no distinction between them (for now anyway -- configuration management actually carries more advanced connotations I'll discuss later).

By any of these names, source control is an important practice for any software development team. The most basic element in software development is our source code. A source control tool offers a system for managing this source code.

There are many source control tools, and they are all different. However, regardless of which tool you use, it is likely that your source control tool provides some or all of the following basic features:

- It provides a place to store your source code.
- It provides a historical record of what you have done over time.
- It can provide a way for developers to work on separate tasks in parallel, merging their efforts later.
- It can provide a way for developers to work together without getting in each others' way.

HOWTO

My goal for this series of articles is to help people learn how to do source control. I work for SourceGear, a developer tools ISV. We sell an SCM tool called Vault (http://www.sourcegear.com/vault/). Through the experience of selling and supporting this product, I have learned something rather surprising:

Nobody is teaching people how to do source control.

Our universities often don't teach people how to do source control. We graduate with Computer Science degrees. We know more than we'll ever need to know about discrete math, artificial intelligence and the design of virtual memory systems. But many of us enter the workforce with no knowledge of how to use any of the basic tools of software development, including bug-tracking, unit testing, code coverage, source control, or even IDEs.

Our employers don't teach people how to do source control. In fact, many employers provide their developers with no training at all.

SCM tool vendors don't teach people how to do source control. We provide documentation on our products, but the help and the manuals usually amount to simple explanations of the program's menus and dialogs. We sort of assume that our customers come to us with a basic background.

Here at SourceGear, our product is positioned specifically as a replacement for SourceSafe. We assume that everyone who buys Vault already knows how to use SourceSafe. However, experience is teaching us that this assumption is often untrue. One of the most common questions received by our support team is from users asking for a solid explanation of the basics of source control.

We need some materials that explain how source control is done. My goal for this series of articles is to create a comprehensive guide to help meet this need.

Somewhat tool-specific

Ideally, a series of articles on the techniques of source control would be tool-neutral, applicable to any of the available SCM tools. It simply makes sense to teach the basic skills without teaching the specifics of any single tool. We learn the basic skills of writing before we learn to use a word processor.

However, in the case of SCM tools, this tool-agnostic approach is somewhat difficult to achieve. Unlike writing, source control is simply not done without the assistance of specialized tools. With no tools at all, the methods of source control are not practical.

Complicating matters further is the fact that not all source control tools are alike. There are at least dozens of SCM tools available, but there is no standard set of features or even a standard terminology. The word "checkout" has different meanings for CVS and SourceSafe. The word "branch" has very different semantics for Subversion and PVCS.

So I will keep the tool-neutral ideal in mind as I write, but my articles will often be somewhat tool-specific. Vault is the tool I know best, since I have played a big part in its design and coding. Furthermore, I freely acknowledge that I have a business incentive to talk about my own product. Although I will often mention other SCM tools, the articles in this series will use the terminology of Vault.

The world's most incomplete list of SCM tools

Several SCM tools that I mention in this series are listed below, with hyperlinks for more information.

- Vault (http://www.sourcegear.com/vault/). Our product. 'Nuff said.
- SourceSafe (http://msdn.microsoft.com/vstudio/previous/ssafe/). Microsoft. Old. Loved. Hated.
- Subversion (http://subversion.tigris.org/). Open source. New. Neato.
- CVS (https://www.cvshome.org/). Open source. Old. Reliable. Dusty.
- Perforce (http://www.perforce.com/). Commercial. A competitor of SourceGear, but one that I admire.
- BitKeeper (http://www.bitkeeper.com/). Commercial. Uses a distributed repository architecture, so I won't be talking about this one much.
- Arch (http://www.gnu.org/software/gnu-arch/). Open source. Distributed repository architecture. Again, I spend most of my words here on tools with a centralized server.

This is a very incomplete list. There are many SCM tools, and I am not interested in trying to produce and maintain an accurate listing of them all.

Audience

I am writing about source control for programmers and web developers.

When we apply some of the concepts of source control to the world of traditional documents, the result is called "document management". I'm not writing about any of those usage scenarios.

When we apply some of the concepts of source control to the world of graphic design, the result is called "asset management". I'm not writing about any of those usage scenarios.

My audience here is the group of people who deal primarily with source code files or HTML files.

Warnings about my writing style

First of all, let me say a thing or two about political correctness. Through these articles, I will occasionally find the need for gender-specific pronouns. In such situations, I generally try to use the male and female variants of the words with approximately equal frequency.

Second of all, please accept my apologies if my dry sense of humor ever becomes a distraction from the material. I am writing about source control and trying to make it interesting. That's like writing about sex and trying to make it boring, so please cut me some slack if I try to make you chuckle along the way.

Looking Ahead



Source control is a large topic, so there is much to be said. I plan for the chapters of this series to be sorted very roughly from the very basic to the very advanced. In the next chapter, I'll start by defining the most fundamental terminology of source control.



Chapter 1: Basics


A tale of two trees

Our discussion of source control must begin by defining the basic terms and describing the basic operations. Let's start by defining two important terms: repository and working folder.

An SCM tool provides a place to store your source code. We call this place a repository. The repository exists on a server machine and is shared by everyone on your team.

Each individual developer does her work in a working folder, which is located on a desktop machine and accessed using a client.

Each of these things is basically a hierarchy of folders. A specific file in the repository is described by its path, just like we describe a specific file on the file system of your local machine. In Vault and SourceSafe, a repository path starts with a dollar sign. For example, the path for a file might look like this:

$/trunk/src/myLibrary/hello.cs

The workflow of a developer is an infinite loop which looks something like this:

- Copy the contents of the repository into a working folder.
- Make changes to the code in the working folder.
- Update the repository to incorporate those changes.
- Repeat.

I've omitted certain details like staff meetings and vacations, but this loop essentially describes the life of a developer who is working with an SCM tool. The repository is the official place where all completed work is stored. A task is not considered to be completed until the repository contains the result of that task.

Best Practice: Don't break the tree

The benefit of working folders is mostly lost if the contents of the repository become "broken". At all times, the contents of the repository should be in a state which allows everyone on the team to continue to work. If a developer checks in some code which won't build or won't pass the test suite, the entire team grinds to a halt.

Many teams have some sort of a social penalty which is applied to developers who break the tree. I'm not talking about anything severe, just a little incentive to remind developers to be careful. For example, require the guilty party to put a dollar in a glass jar. (Use the money to take the team to go see a movie after the product is shipped.) Another idea is to require the guilty developer to make the coffee every morning. The point is to make the developer feel embarrassed, but not punished.

Let's imagine for a moment what life would be like without this distinction between working folder and repository. In a single-person team, the situation could be described as tolerable. However, for any plurality of developers, things can get very messy.

I've seen people try it. They store their code on a file server. Everyone uses Windows file sharing and edits the source files in place. When somebody wants to edit main.cpp, they shout across the hall and ask if anybody else is using that file. Their Ethernet is saturated most of the time because the developers are actually compiling on their network drives. When we sell our source control tool to someone in this situation, I feel like an ER doctor. I go home that night with a feeling of true contentment, because I know that I have saved a life.

With an SCM tool, working on a multi-person team is much simpler. Each developer has a working folder which is a private workspace. He can make changes to his working folder without adversely affecting the rest of the team.

Terminology note: Not all SCM tools use the exact terms I am using here. Many systems use the word "directory" instead of "folder". Some SCM tools, including SourceSafe, use the word "database" instead of "repository". In the context of Vault, these two words have a different meaning. Vault allows multiple repositories to exist within a single SQL database. For this reason, I use the word "database" only when I am referring to the SQL database.

In and Out

The repository exists on a server machine which is far away from the desktop machine containing the working folder where the developer does her work. The word "far" in the previous sentence is intended to mean anything from a few centimeters to thousands of kilometers. The physical distance doesn't really matter. The SCM tool provides the ability to communicate between the client and the server over TCP/IP, whether the network is a local Ethernet or an Internet connection to another continent.

Because of this separation between working folder and repository, the most frequently used features of an SCM tool are the ones which help us move things back and forth between them. Let's define some terms:

- Add: A repository starts out completely empty, so we need to "Add" things to it. Using the "Add Files" command in Vault you can specify files or folders on your desktop machine which will be added to the repository.
- Get: When we copy things from the repository to the working folder, we call that operation "Get". Note that this operation is usually used when retrieving files that we do not intend to edit. The files in the working folder will be read-only.
- Checkout: When we want to retrieve files for the purpose of modifying them, we call that operation "Checkout". Those files will be marked writable in our working folder. The SCM server will keep a record of our intent.
- Checkin: When we send changes back to the repository, we call that operation "Checkin". Our working files will be marked back to read-only and the SCM server will update the repository to contain new versions of the changed files.

Note that these definitions are merely starting points. The descriptions above correspond to the behavior of SourceSafe and Vault (with its default settings). However, we will see later that other tools (such as CVS) work somewhat differently, and Vault can optionally be configured in a mode which matches the behavior of CVS.
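The four operations defined above can be sketched as a toy in-memory model in Python. This is only an illustration under simplifying assumptions: the names ToyRepository and WorkingFile are hypothetical and are not the API of Vault, SourceSafe or any other tool, everything lives in memory rather than on a server, and locking is ignored entirely.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingFile:
    """A file in the working folder (hypothetical model)."""
    content: str
    writable: bool = False


@dataclass
class ToyRepository:
    """In-memory stand-in for a repository; not a real SCM API."""
    versions: dict = field(default_factory=dict)     # path -> list of contents
    checked_out: dict = field(default_factory=dict)  # path -> user name

    def add(self, path, content):
        # Add: the repository starts empty, so files must be added to it.
        self.versions[path] = [content]

    def get(self, path):
        # Get: copy the latest version to the working folder, read-only.
        return WorkingFile(self.versions[path][-1], writable=False)

    def checkout(self, path, user):
        # Checkout: record the user's intent and hand back a writable copy.
        self.checked_out[path] = user
        return WorkingFile(self.versions[path][-1], writable=True)

    def checkin(self, path, user, working):
        # Checkin: store a new version and mark the working file read-only.
        if self.checked_out.get(path) != user:
            raise PermissionError(f"{user} has not checked out {path}")
        self.versions[path].append(working.content)
        del self.checked_out[path]
        working.writable = False
        return len(self.versions[path])  # the new version number


repo = ToyRepository()
repo.add("$/trunk/src/myLibrary/hello.cs", "// version 1")
working = repo.checkout("$/trunk/src/myLibrary/hello.cs", "jane")
working.content = "// version 2"
new_version = repo.checkin("$/trunk/src/myLibrary/hello.cs", "jane", working)
```

After the last line the repository holds two versions of the file, and the working copy is read-only again until the next checkout.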

Terminology note: Some SCM tools use these words a bit differently. Vault and SourceSafe use the word



Chapter 2: Checkins


In this chapter, I will explore the various situations wherein a repository is modified, starting with the simplest case of a single developer making a change to a single file.

Editing a single file

Consider the simple situation where a developer needs to make a change to one source file. This case is obviously rather simple:

1. Checkout the file
2. Edit the working file as needed
3. Checkin the file

I won't talk much about step 2 here, as it doesn't really involve the SCM tool directly. Editing the file usually involves the use of some other tools, like an integrated development environment (IDE).

But I do want to explore steps 1 and 3 in greater detail.

Step 1: Checkout

Checking out a file has two basic effects:

- On the server, the SCM tool will remember the fact that you have the file checked out so that others may be informed.
- On your client, the SCM tool will prepare your working file for editing by changing it to be writable.

The server side of checkout

File checkouts are a way of communicating your intentions to others. When you have a file checked out, other users can be aware and avoid making changes to that file until you are done with it. The checkout status of a file is usually displayed somewhere in the user interface of the SCM client application. For example, in the following screendump from Vault, users can see that I have checked out libsgdcore.cpp:



Best Practice: Use checkouts and locks carefully

It is best to use checkouts and locks only when you need them. A checkout discourages others from modifying a file, and a lock prevents them from doing so. You should therefore be careful to use these features only when you actually need them.

Don't checkout files just because you think you might need to edit them.

Don't checkout whole folders. Checkout the specific files you need.

Don't checkout hundreds or thousands of files at one time.

Don't hold exclusive locks any longer than necessary.

Don't go on vacation while holding exclusive locks on files.

This screendump also hints at the fact there are actually two kinds of checkouts. The issue here is the question of whether two people can checkout a file at the same time. The answer varies across SCM tools. Some SCM tools can be configured to behave either way.

Sometimes the SCM tool will allow multiple people to checkout a file at the same time. SourceSafe and Vault both offer this capability as an option. When this "multiple checkouts" feature is used, things can get a bit more complicated. I'll talk more about this later.

If the SCM tool prevents anyone else from checking out a file which I have checked out, then my checkout is "exclusive" and may be described as a "lock". In the screendump above, the user interface is indicating that I have an exclusive lock on libsgdcore.cpp. Vault will allow no one else to checkout this file.
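The difference between exclusive and multiple checkouts can be sketched as a small server-side table. This is a minimal illustration, not a description of how Vault or SourceSafe track checkouts internally; the CheckoutTable class, its allow_multiple flag and the try_checkout helper are all hypothetical names chosen for the example.

```python
class CheckoutTable:
    """Server-side record of who has which file checked out (toy model)."""

    def __init__(self, allow_multiple=False):
        self.allow_multiple = allow_multiple
        self.holders = {}  # path -> set of user names

    def checkout(self, path, user):
        holders = self.holders.setdefault(path, set())
        others = holders - {user}
        if others and not self.allow_multiple:
            # Exclusive mode: the first checkout acts as a lock.
            raise RuntimeError(f"{path} is locked by {sorted(others)[0]}")
        holders.add(user)

    def release(self, path, user):
        # Called when the user checks in or undoes the checkout.
        self.holders.get(path, set()).discard(user)


def try_checkout(table, path, user):
    """Return True if the checkout succeeded, False if the file was locked."""
    try:
        table.checkout(path, user)
        return True
    except RuntimeError:
        return False
```

With the default exclusive table, a second user's try_checkout on libsgdcore.cpp returns False until the first user releases the file. With allow_multiple=True both checkouts succeed, and any overlapping edits have to be reconciled later.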

The client side of checkout

On the client side, the effect of a checkout is quite simple: If necessary, the latest version of the file is retrieved from the server. The working file is then made writable, if it was not in that state already.

All of the files in a working folder are made read-only when the SCM tool retrieves them from the repository. A file is not made writable until it is checked out. This prevents the developer from accidentally editing a file.
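On most file systems, the read-only convention described here comes down to a permission bit on the working file. The following Python sketch uses the standard os.chmod and stat calls to flip that bit; the helper names make_read_only, make_writable, is_writable and demo are hypothetical, and how any particular SCM client actually marks its files is an assumption not covered by this text.

```python
import os
import stat
import tempfile

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(path):
    # What a client might do after a Get or a Checkin.
    os.chmod(path, os.stat(path).st_mode & ~WRITE_BITS)


def make_writable(path):
    # What a client might do on Checkout.
    os.chmod(path, os.stat(path).st_mode | stat.S_IWUSR)


def is_writable(path):
    # Inspects the owner write bit only, not the effective access rights.
    return bool(os.stat(path).st_mode & stat.S_IWUSR)


def demo():
    """Return the writable flag after a simulated Get and a simulated Checkout."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "hello.cs")
        with open(path, "w") as handle:
            handle.write("// working file\n")
        make_read_only(path)
        after_get = is_writable(path)
        make_writable(path)
        after_checkout = is_writable(path)
        return after_get, after_checkout
```

An editor that respects the read-only bit will refuse to save over such a file, which is what protects the developer from accidental edits.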

Best Practice: Explain your checkins completely

Every SCM tool provides a way to associate a comment when checking changes into the repository. This comment is important. If we consistently use good checkin comments, our repository's history contains not only every change we have ever made, but it also contains an explanation of why those changes happened. These kinds of records can be invaluable later as we forget things.

I believe developers should be encouraged to enter checkin comments which are as long as necessary to explain what is going on. Don't just type "minor change". Tell us what the minor change was. Don't just tell us "fixed bug 1234". Tell us what bug 1234 is and tell us a little bit about the changes that were necessary to fix it.

Undoing a checkout

Normally, a checkout ends when a checkin happens. However, sometimes we checkout a file and subsequently decide that we did not need to do so. When this happens, we "undo the checkout". Most SCM tools have a command which offers this functionality. On the server side, the command will remove the checkout and release any exclusive lock that was being held. On the client side, Vault offers the user three choices for how the working file should be treated:

- Revert: Put the working file back in the state it was in when I checked it out. Any changes I made while I had the file checked out will be lost.
- Leave: Leave the working file alone. This option will effectively leave the file in a state which we call "Renegade". It is a bad idea to edit a file without checking it out. When I do so, Vault notices my transgression and chastises me by letting me know that the file is "Renegade".
- Delete: Delete the working file.

I usually prefer to work with "Revert" as my option for how the Undo Check Out command behaves.
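The three client-side choices can be sketched as a small Python function. This is a rough illustration only: the UndoOption enum, the undo_checkout function and the status labels it returns are hypothetical, and the rule that a file left alone counts as renegade only when it has actually been edited is an assumption made for the example.

```python
from enum import Enum


class UndoOption(Enum):
    REVERT = "revert"
    LEAVE = "leave"
    DELETE = "delete"


def undo_checkout(working_folder, path, checked_out_content, option):
    """Apply one client-side undo option and return a status label.

    working_folder maps path -> current content of the working file.
    checked_out_content is the content the file had at checkout time.
    """
    if option is UndoOption.REVERT:
        # Any edits made since the checkout are discarded.
        working_folder[path] = checked_out_content
        return "unmodified"
    if option is UndoOption.LEAVE:
        # The file stays as it is; edits now exist without a checkout.
        edited = working_folder[path] != checked_out_content
        return "renegade" if edited else "unmodified"
    if option is UndoOption.DELETE:
        del working_folder[path]
        return "missing"
    raise ValueError(f"unknown option: {option!r}")
```

Only Revert and Delete touch the working file. Leave keeps the edits on disk, which is why the file ends up flagged.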

Step 3: Checkin

After the file is checked out, the developer proceeds to make her changes. She edits the file and verifies that her change is correct. Having completed all this, she is ready to submit her changes to the repository. Doing so will make her change permanent and official. Submitting her changes to the repository is the operation we call "checkin".

The process of a checkin isn't terribly complicated:

1. The new version of the file is sent to the SCM server where it is stored.
2. The version number of the file in the repository is incremented by one.
3. The file is no longer considered to be checked out or locked.
4. The working file on the client side is made read-only again.

One issue does deserve special mention. Most SCM tools ask the user to enter a comment when making a checkin. This comment will be stored in the repository forever along with the changes being submitted. The comment provides a place for the developer to explain what was changed and why the change was made.

The following screendump shows the checkin dialog box from Vault:



Best Practice: Group your checkins logically

I recommend that each transaction you check into the repository should correspond to one task. A "task" might be a bug fix or a feature. Include all of the repository changes which were necessary to complete that task, and nothing else. Avoid fixing multiple bugs in a single checkin transaction.

Just as the version number of a file is incremented when we modify it, these folder-level changes cause the version number of a folder to be incremented. If we ask for the previous version of a folder, we can still retrieve it just the way it was before. The renamed file will be back to the old name. The deleted file will reappear exactly where it was before.

It may bother you to realize that the "delete" command in your SCM tool doesn't actually delete anything. However, you'll get used to it.
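The idea that folder-level changes create new folder versions, and that a delete never destroys anything, can be illustrated with a small Python sketch. It assumes a folder is stored as a series of complete snapshots, which is a simplification chosen for the example and says nothing about how Vault actually stores its data; the VersionedFolder class and its methods are hypothetical.

```python
class VersionedFolder:
    """Toy folder in which every change produces a new folder version."""

    def __init__(self):
        self._snapshots = [{}]  # snapshot at index i is folder version i + 1

    @property
    def version(self):
        return len(self._snapshots)

    def _commit(self, listing):
        self._snapshots.append(listing)
        return self.version

    def add(self, name, content):
        listing = dict(self._snapshots[-1])
        listing[name] = content
        return self._commit(listing)

    def rename(self, old_name, new_name):
        listing = dict(self._snapshots[-1])
        listing[new_name] = listing.pop(old_name)
        return self._commit(listing)

    def delete(self, name):
        # Nothing is destroyed: older snapshots still contain the file.
        listing = dict(self._snapshots[-1])
        del listing[name]
        return self._commit(listing)

    def listing(self, version=None):
        index = (self.version if version is None else version) - 1
        return dict(self._snapshots[index])


folder = VersionedFolder()
folder.add("libsgdcore.h", "header text")      # creates folder version 2
folder.rename("libsgdcore.h", "headerfile.h")  # creates folder version 3
folder.delete("headerfile.h")                  # creates folder version 4
```

Asking for folder.listing(3) brings the deleted file back under its renamed name, and folder.listing(2) shows it under the old name, exactly as the text describes.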

Atomic transactions

I've been talking mostly about the simple case of making a change to a single source code file. However, most programming tasks require us to make multiple repository changes. Perhaps we need to edit more than one file to accomplish our task. Perhaps our task requires more than just file modifications, but also folder-level changes like the addition of new files or the renaming of a file.

When faced with a complex task that requires several different operations, we would like to be able to submit all the related changes together in a single checkin operation. Although tools like SourceSafe and CVS do not offer this capability, some source control systems (like Vault and Subversion) do include support for "atomic transactions".

The concept is similar to the behavior of atomic transactions in a SQL database. The Vault server guarantees that all operations within a transaction will stay together. Either they will all succeed, or they will all fail. It is impossible for the repository to end up in a state with only half of the operations done. The integrity of the repository is assured.

To ensure that a transaction can contain all kinds of operations, Vault supports the notion of a pending change set. Essentially, the Vault client keeps a running list of changes you have made which are waiting to be sent to the server. When you invoke the Delete command, not only will it not actually delete anything, but it doesn't even send the command to the server. It merely adds the Delete operation to the pending change set, so that it can be sent later as part of a group.

In the following screen dump, my pending change set contains three operations. I have modified libsgdcore.cpp. I have renamed libsgdcore.h to headerfile.h. And I have deleted libsgdcore_diff_file.c.



Note that these operations have not actually happened yet. They won't happen unless I submit them to the server, at which time they will take place as a single atomic transaction.

Vault persists the pending change set between sessions. If I shut down my Vault client and turn off my computer, next time I launch the Vault client the pending change set will contain the same items it does now.
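The combination of a pending change set and an all-or-nothing commit can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the repository is modeled as a plain dictionary of path to content, history and persistence between sessions are left out, and the names PendingChangeSet and commit are hypothetical rather than part of Vault.

```python
class PendingChangeSet:
    """Client-side list of operations waiting to be sent (toy model)."""

    def __init__(self):
        self.operations = []

    def modify(self, path, content):
        self.operations.append(("modify", path, content))

    def rename(self, old_path, new_path):
        self.operations.append(("rename", old_path, new_path))

    def delete(self, path):
        # Nothing is sent anywhere yet; the operation is only queued.
        self.operations.append(("delete", path))


def commit(repository, change_set):
    """Apply every pending operation, or none of them.

    repository is a dict mapping path -> content. All work happens on a
    copy, so an exception leaves the real repository untouched.
    """
    staged = dict(repository)
    for operation, *args in change_set.operations:
        if operation == "modify":
            path, content = args
            if path not in staged:
                raise KeyError(path)
            staged[path] = content
        elif operation == "rename":
            old_path, new_path = args
            if new_path in staged:
                raise ValueError(f"{new_path} already exists")
            staged[new_path] = staged.pop(old_path)
        elif operation == "delete":
            (path,) = args
            del staged[path]
    # Only now does the repository change, in a single step.
    repository.clear()
    repository.update(staged)
    change_set.operations.clear()
```

Queuing the three operations from the example above (modify libsgdcore.cpp, rename libsgdcore.h to headerfile.h, delete libsgdcore_diff_file.c) changes nothing until commit is called. If any one of them fails, the repository dictionary is left exactly as it was.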

The Church of "Edit-Merge-Commit"

Up until now, I have explained everything about checkouts and checkins in a very "matter of fact" fashion. I have claimed that working files are always read-only until they are checked out, and I have claimed that files are always checked out before they are checked in. I have made broad generalizations and I have explained things in terms that sound very absolute.

I lied.

In reality, there are two very distinct doctrines for how this basic interaction with an SCM tool can work. I have been describing the doctrine I call "checkout-edit-checkin". Reviewing the simple case when a developer needs to modify a single file, the practice of this faith involves the following steps:

1. Checkout the file
2. Edit the working file as needed
3. Checkin the file

Followers of the "checkout-edit-checkin" doctrine are effectively submitting to live according to the following rules:

- Files in the working folder are read-only unless they are checked out.
- Developers must always checkout a file before editing it. Therefore, the entire team always knows who is editing which files.
- Checkouts are made with exclusive locks, so only one developer can checkout a file at one time.

This approach is the default behavior for SourceSafe and for Vault. However, CVS doesn't work this way at



have been actively using edit-merge-commit as my development style for over five years, and I cannot remember a situation where automerge produced an incorrect file. Experience has made me a believer.

Looking Ahead

In the next chapter, I will be talking in greater detail about the process of merging two modified versions of a file.



Chapter 3: File Merge


How did we get ourselves into this mess?

There are several reasons why we may need to merge two modified versions of a file:

- When using "edit-merge-commit" (sometimes called "optimistic locking"), it is possible for two developers to edit the same file at the same time.
- Even if we use "checkout-edit-checkin", we may allow multiple checkouts, resulting once again in the possibility of two developers editing the same file.
- When merging between branches, we may have a situation where the file has been modified in both branches.

In other words, this mess only happens when people are working in parallel. If we serialize the efforts of our team by never branching and never allowing two people to work on a module at the same time, we can avoid ever facing the need to merge two versions of a file.

However, we want our developers to work concurrently. Think of your team as a multithreaded piece of software, each developer running in its own thread. The key to high performance in a multithreaded system is to maximize concurrency. Our goal is to never have a thread which is blocked on some other thread.

So we embrace concurrent development, but the threading metaphor continues to apply. Multithreaded programming can sometimes be a little bit messy, and the same can be said of a multithreaded software team. There is a certain amount of overhead involved in things like synchronization and context switching. This overhead is inevitable. If your team is allowing concurrent development to happen, it will periodically face a situation where two versions of a file need to be merged into one.

In rare cases, the situation can be properly resolved by simply choosing one version of the file over the other. However, most of the time, we actually need to merge the two versions to create a new version.

What do we do about it?

Let's carefully state the problem as follows: We have two versions of a file, each of which was derived from the same common ancestor. We sometimes call this common ancestor the "original" file. Each of the other versions is merely the result of someone applying a set of changes to the original. What we want to create is a new version of the file which is conceptually equivalent to starting with the original and applying both sets of changes. We call this process "merging".

The difficulty of doing this merge varies greatly for different types of files. How would we perform a merge of two Excel spreadsheets? Two PNG images? Two files which have digital signatures? In the general case, the only way to merge two modified versions of a file is to have a very smart person carefully construct a new copy of the file which properly incorporates the correct elements from each of the other two.

However, in software and web development there is a special case which is very common. As luck would have it, most source code files are plain text files with an average of less than 80 characters per line. Merging files of this kind is vastly simpler than the general case. Many SCM tools contain special features to assist with this sort of a merge. In fact, in a majority of these cases, the two files can be automatically merged without requiring the manual effort of a developer.
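A line-based three-way merge of this kind can be sketched in Python using the standard difflib module. This is a simplified illustration of the general idea (find the lines all three versions agree on, then decide each stretch in between), not the algorithm used by Vault or any particular tool; the function names sync_regions and merge3 and the conflict markers are assumptions made for the example.

```python
from difflib import SequenceMatcher


def sync_regions(base, mine, theirs):
    """Yield stretches of base that are unchanged in both other versions.

    Each region is (base_start, base_end, mine_start, mine_end,
    theirs_start, theirs_end), in terms of line indexes.
    """
    a_blocks = SequenceMatcher(None, base, mine, autojunk=False).get_matching_blocks()
    b_blocks = SequenceMatcher(None, base, theirs, autojunk=False).get_matching_blocks()
    ia = ib = 0
    while ia < len(a_blocks) and ib < len(b_blocks):
        a_base, a_pos, a_len = a_blocks[ia]
        b_base, b_pos, b_len = b_blocks[ib]
        start = max(a_base, b_base)
        end = min(a_base + a_len, b_base + b_len)
        if start < end:
            size = end - start
            a_sub = a_pos + (start - a_base)
            b_sub = b_pos + (start - b_base)
            yield start, end, a_sub, a_sub + size, b_sub, b_sub + size
        if a_base + a_len < b_base + b_len:
            ia += 1
        else:
            ib += 1
    # Sentinel region so that trailing changes are processed too.
    yield len(base), len(base), len(mine), len(mine), len(theirs), len(theirs)


def merge3(base, mine, theirs):
    """Merge two lists of lines derived from base; return (lines, conflicts)."""
    merged, conflicts = [], 0
    iz = ia = ib = 0
    for z0, z1, a0, a1, b0, b1 in sync_regions(base, mine, theirs):
        base_chunk, a_chunk, b_chunk = base[iz:z0], mine[ia:a0], theirs[ib:b0]
        if a_chunk == base_chunk:
            merged += b_chunk            # only "theirs" changed here (or nobody)
        elif b_chunk == base_chunk or a_chunk == b_chunk:
            merged += a_chunk            # only "mine" changed, or both agree
        else:
            conflicts += 1               # both changed differently: needs a human
            merged += ["<<<<<<< mine"] + a_chunk + ["======="] + b_chunk + [">>>>>>> theirs"]
        merged += base[z0:z1]            # the stretch everyone agrees on
        iz, ia, ib = z1, a1, b1
    return merged, conflicts
```

When the two developers touched different lines, merge3 returns the combined file with zero conflicts, which is the automatic case described above. When both changed the same line in different ways, the sketch marks the spot and leaves the decision to a person.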

An example

Let's call our two developers Jane and Joe. Both of them have retrieved version 4 of the same file and both of them are working on making changes to it.

One of these developers will checkin before the other one. Let's assume it is Jane who gets there first. When Jane tries to checkin her changes, nothing unusual will happen. The current version of the file is 4, and that was the version she had when she started making her changes. In other words, version 4 was her baseline for these changes. Since her baseline matches the current version, there is no merge necessary. Her changes are checked in, and a new version of the file is created in the repository. After her checkin, the current version of the file is now 5.

Best Practice: Keep the repository in sight

This example happens to involve the need to merge only a single checkin. Since Joe's baseline is 4 and the current repository version is 5, Joe is only 1 version out of date. If the repository version were 25 instead of 5, then Joe would be 21 versions out of date instead of just 1, but the technique is the same. No matter how old his baseline is, Joe still needs to retrieve the latest version and do a three-way merge. However, the older his baseline, the more likely he is to encounter conflicts in the merge.

Keep in touch with the repository. Update your working folder as often as you can without interrupting your own work. Commit your work to the repository as often as you can without breaking the build. It isn't wise to let the distance between your working folder and the repository grow too large.

The responsibility for merging is going to fall upon Joe. When he tries to checkin his changes, the SCM toolwill protest. His baseline version is 4, but the current version in the repository is now 5. If Joe is allowed tocheckin his version of the file, the changes made by Jane in version 5 will be lost. Therefore, Joe will not beallowed to checkin this file until he convinces the SCM tool that he has merged Jane's version 5 changes intohis working copy of the file.
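The check that blocks Joe's checkin can be sketched in a few lines of Python. This is only an illustration of the rule, not how Vault or any other tool is actually implemented; the `Repository` class, its `checkin` method and the `NeedsMerge` exception are hypothetical names invented for this example.

```python
class NeedsMerge(Exception):
    """Raised when a checkin is based on an out-of-date version."""


class Repository:
    """Hypothetical single-file repository that tracks a version number."""

    def __init__(self, contents, version=1):
        self.contents = contents
        self.version = version

    def checkin(self, new_contents, baseline):
        # Refuse the checkin if someone else has checked in since the
        # developer retrieved the file. Accepting it would silently
        # discard the intervening changes.
        if baseline != self.version:
            raise NeedsMerge(
                f"baseline is {baseline}, current version is {self.version}"
            )
        self.contents = new_contents
        self.version += 1
        return self.version


repo = Repository("original text", version=4)

# Jane's baseline (4) matches the current version, so her checkin succeeds.
jane_version = repo.checkin("Jane's text", baseline=4)

# Joe also started from version 4, but the repository is now at version 5.
try:
    repo.checkin("Joe's text", baseline=4)
    joe_blocked = False
except NeedsMerge:
    joe_blocked = True
```

Once Joe has merged and his baseline is 5, the same `checkin` call would go through and create version 6.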

Vault reports this situation by setting the status on this file to be "Needs Merge", as shown in the screen dump below:

In order to resolve this situation, Joe effectively needs to do a three-way comparison between the following three versions of the file:

Version 4 (the baseline from which he and Jane both started)
Version 5 (Jane's version)
Joe's working file (containing his own changes)

Version 4 is the common ancestor for both Joe's version and Jane's version of the file. By running a diff between version 4 and version 5, Joe can see exactly what changes Jane made. He can use this information to apply those changes to his own version of the file. Once he has done so, he can credibly claim that his version is a merge of his changes and Jane's.

Strictly speaking, Joe is responsible for whatever changes Jane made, regardless of how difficult the merge may be. He must perform the changes to his file that Jane would have made if she had started with his file instead of with version 4. In theory, this could be very difficult:

What happens if Jane changed some of the same lines that Joe changed, but in different ways?
What happens if Jane's changes are functionally incompatible with Joe's?
What happens if Jane made a change to a C# function which Joe has deleted?
What happens if Jane changed 80 percent of the lines in the file?
What happens if Jane and Joe each changed 80 percent of the lines in the file, but each did so for entirely different reasons?
What happens if Jane's intent was not clear and she cannot be reached to ask questions?

All of these situations are possible, and all of them are Joe's responsibility. He must incorporate Jane's changes into his file before he can checkin a version 6.

Best Practice: Only use "automerge on get"

It is widely accepted that SCM tools should only attempt automerge on the "get" of a file. In other words, when Joe realizes that he must merge in the changes Jane made between version 4 and version 5, he will tell his SCM client application to "get" version 5 and attempt to automatically merge it into his working file. CVS, Subversion and Vault all function in this manner.

Unfortunately, SourceSafe attempts to "automerge on checkin". This is just a really bad idea. When Joe tries to checkin his changes, SourceSafe attempts the automerge. If it believes that it has succeeded, then his changes are checked in and version 6 is created. However, it is possible that Joe never examined version 6, or even compiled it. The repository now contains a file which has never existed in the working folder of any developer on earth. Its contents have never been seen by human eyes, and it has never been run through a compiler. Automerge is safe, but it's not that safe.

It is much better to "automerge on get". This way, the developer can (and should) examine the file after the automerge has happened. This simple change makes it easier to trust automerge. Instead of trying to do the developer's job, automerge simply becomes a tool which the developer can use to get his job done faster.

In certain rare situations, Joe may examine Jane's changes and realize that his version needs nothing from Jane's version 5. Maybe Jane's change simply isn't relevant anymore. In these cases, the merge isn't needed, and Joe can simply declare the merge to be resolved without actually doing anything. This decision remains subject to Joe's judgment.

However, most of the time it will be necessary for the merge to actually happen. In these cases, Joe has the following options:

Attempt to automerge
Use a visual merge tool
Redo one set of changes by hand

Each of these will be explained further in the sections below.

Attempt to automerge

As I mentioned above, a surprising number of cases can be easily handled automatically. Most source control tools include the ability to attempt an automatic merge. The algorithm uses all three of the involved versions of the file and attempts to safely produce a merged version.

The reason that automerge is so safe in practice is that the algorithm is extremely conservative. Automerge will refuse to produce a merged version if Joe's changes and Jane's changes appear to be in conflict. In the most obvious case, if Joe and Jane both modified the same line, automerge will detect this "conflict" and refuse to proceed. In other cases, automerge may fail with conflicts if two changes are too close to each other.
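To make the idea of a conservative three-way merge concrete, here is a minimal Python sketch built on the standard `difflib` module. It is not the algorithm used by Vault or any particular SCM tool; the function names `changed_hunks` and `automerge` are invented for this example. It also simplifies in one way that real tools usually do not: if Jane and Joe happen to make the identical change, it still reports a conflict.

```python
from difflib import SequenceMatcher


class MergeConflict(Exception):
    """Raised when the two sets of changes overlap or touch."""


def changed_hunks(base, other):
    """Return (start, end, new_lines) for each region of base that other changed."""
    matcher = SequenceMatcher(None, base, other, autojunk=False)
    return [
        (i1, i2, other[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def automerge(base, jane, joe):
    """Apply both sets of changes to base, or refuse if they are too close."""
    hunks = sorted(
        changed_hunks(base, jane) + changed_hunks(base, joe),
        key=lambda hunk: (hunk[0], hunk[1]),
    )
    merged = []
    position = 0       # next unread line of base
    previous_end = -1  # where the last applied hunk ended in base
    for start, end, new_lines in hunks:
        # Conservative rule: hunks that overlap, or are directly adjacent,
        # are treated as a conflict rather than guessed at.
        if start <= previous_end:
            raise MergeConflict(f"changes collide near base line {start + 1}")
        merged.extend(base[position:start])
        merged.extend(new_lines)
        position = end
        previous_end = end
    merged.extend(base[position:])
    return merged


base = ["alpha", "beta", "gamma", "delta", "epsilon"]
jane = ["ALPHA", "beta", "gamma", "delta", "epsilon"]
joe = ["alpha", "beta", "gamma", "delta", "EPSILON"]
merged = automerge(base, jane, joe)

# Both developers edit the same line in different ways: automerge refuses.
try:
    automerge(base, ["alpha", "beta", "GAMMA!", "delta", "epsilon"],
              ["alpha", "beta", "Gamma?", "delta", "epsilon"])
    conflict_detected = False
except MergeConflict:
    conflict_detected = True
```

The sketch refuses whenever two hunks touch, which mirrors the "too close to each other" behavior described above. Its job is to stay safe, not to be clever.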

Use a visual merge tool

In cases where automerge cannot automatically resolve conflicts, we can use a visual merge tool to make the job easier. These tools provide a visual display which shows all three files and highlights exactly what has changed. This makes it much easier for the developer to perform the merge, since she can zero in on the conflicts very quickly.

There are several excellent visual merge tools available, including Guiffy (http://www.guiffy.com/) and Araxis Merge (http://www.araxis.com/). The following screen dump is from "SourceGear DiffMerge", the visual merge tool which is included with Vault. (Please note: sometimes I have to reduce the size of screen dumps to make them fit. In those cases, you can click on the image to see it at full resolution.)



(screendumps/scm_diffmerge_1.gif)

This picture is typical of other three-way visual merge applications. The left pane shows Jane's version of the file. The right pane shows Joe's version. The center pane shows the original file, the common ancestor from which they both started to make changes. As you can see, Jane and Joe have each inserted a one-line comment. By right-clicking on each change, the developer can choose whether to apply that change to the middle pane. In this example, the two changes don't conflict. There is no reason that the resulting file cannot incorporate both changes.

The following picture shows an example of changes which are conflicting.

(screendumps/scm_diffmerge_2.gif)

Both Jane and Joe have tried to change the wording of this comment. In the original file, the word used in the comment was "Global". Jane decided to change this word to "Worldwide", but Joe has changed it to the word "Rampant". These two changes are conflicting, as indicated by the yellow background color being used to display them. Automerge cannot automatically handle cases like these. Only a human being can decide which change to keep.

The visual merge tool makes it easy to handle this situation. I can decide which change I want to keep and apply it to the center pane.



Best Practice: Give concurrent development a try

Many teams avoid all forms of concurrent development. Their entire team uses "checkout-edit-checkin" with exclusive locks, and they never branch.

For some small teams, this approach works just fine. However, the larger your team, the more frequently a developer becomes "blocked" by having to wait for someone else.

Modern source control systems are designed to make concurrent development easy. Give them a try.

Since his baseline version now matches the current version of the file, the Vault server will now allow Joe to do his checkin.

Worth the trouble

I hope I have not scared you away from concurrent development by explaining the gory details of merging files. In fact, my goal is quite the opposite.

Remember that easily-resolved merges are the most common case. Automerge handles a large percentage of situations with no problems at all. A large percentage of the remaining cases can be easily handled with a visual merge tool. The difficult situations are rare, and can still be handled easily by a developer who is patient and careful.

Many software teams have discovered that the tradeoff here is worth the trouble. Concurrent development can bring substantial gains in the productivity of a team. The extra effort to deal with merge situations is usually a small price to pay.

Looking Ahead

In the next chapter I will be discussing the concept of a repository in a lot more detail.



Chapter 4: Repositories


Cars and clocks

In previous chapters I have mentioned the concept of a repository, but I haven't said much further about it. In this chapter, I want to provide a lot more detail. Please bear with me as I spend a little time talking about how an SCM tool works "under the hood". I am doing this because an SCM tool is more like a car than a clock.

An SCM tool is not like a clock. Clock users have no need to know how a clock works inside. We just want to know what time it is. Those who understand the inner workings of a clock cannot tell time any more skillfully than the rest of us.

An SCM tool is more like a car. Lots of people do use cars without knowing how they work. However, people who really understand cars tend to get better performance out of them.

Rest assured that this book is still a "HOWTO". My goal here remains to create a practical explanation of how to do source control. However, I believe that you can use an SCM tool more effectively if you know a little bit about what's happening inside.

Repository = File System * Time

A repository is the official place where you store all your source code. It keeps track of all your files, as well as the layout of the directories in which they are stored. It resides on a server where it can be shared by all the members of your team.

But there has to be more. If the definition in the previous paragraph were the whole story, then an SCM repository would be no more than a network file system. A repository is much more than that. A repository contains history.

A file system is two-dimensional: its space is defined by directories and files. In contrast, a repository is three-dimensional: it exists in a continuum defined by directories, files and time. An SCM repository contains every version of your source code that has ever existed. The additional dimension creates some rather interesting challenges in the architecture of a repository and the decisions about how it manages data.

How do we store all those old versions of everything?

As a first guess, let's not be terribly clever. We need to store every version of the source tree. Why not just keep a complete copy of the entire tree for every change that has happened?

We obviously use Vault as the SCM tool for our own development of Vault. We began development of Vault in the fall of 2001. In the summer of 2002, we started "dogfooding". On October 25th, 2002, we abandoned our repository history and started a fresh repository for the core components of Vault. Since that day, this tree has been modified 4,686 times.

This repository contains approximately 40 MB of source code. If we chose to store the entire tree for every change, those 4,686 copies of the source tree would consume approximately 183 GB, without compression. At today's prices for disk space, this option is worth considering.

However, this particular repository is just not very large. We have several others as well, but the sum total of all the code we have ever written still doesn't qualify as "large". Many of our Vault customers have trees which are a lot bigger.

As an example, consider the source tree for OpenOffice.org. This tree is approximately 634 MB. Based on their claim of 270 developers and the fact that their repository is almost four years old, I'm going to conservatively estimate that they have made perhaps 20,000 checkins. So, if we used the dumb approach of storing a full copy of their tree for every change, we'll need around 12 TB of disk space. That's 12 terabytes (http://dictionary.reference.com/search?q=terabytes).
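The arithmetic behind both estimates is simple enough to check with a few lines of Python. The function name `full_copy_storage_mb` is made up for this illustration, and the conversion assumes binary units (1 GB = 1024 MB, 1 TB = 1024 GB), which is what reproduces the figures quoted in the text.

```python
def full_copy_storage_mb(tree_size_mb, checkins):
    """Disk space (in MB) needed to keep one full copy of the tree per checkin."""
    return tree_size_mb * checkins


# Vault core tree: about 40 MB, modified 4,686 times.
vault_gb = full_copy_storage_mb(40, 4686) / 1024

# OpenOffice.org: about 634 MB, with an estimated 20,000 checkins.
openoffice_tb = full_copy_storage_mb(634, 20000) / (1024 * 1024)

print(f"Vault core tree: about {vault_gb:.0f} GB")
print(f"OpenOffice.org:  about {openoffice_tb:.0f} TB")
```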

At this point, the argument that "disk space is cheap" starts to break down. The disk space for 12 TB of data is cheaper than it has ever been in the history of the planet. But this is mission critical data. We have to consider things like performance and backups and RAID and administration. The cost of storing 12 TB of ultra-important data is more than just the cost of the actual disk platters.

So we actually do have an incentive to store this information a bit more efficiently. Fortunately, there is an obvious reason why this is going to be easy to do. We observe that tree N is often not terribly different from tree N-1. By definition, each version of the tree is derived from its predecessor. A checkin might be as simple as a one-line fix to a single file. All of the other files are unchanged, so we don't really need to store another copy of them.

So, we don't want to store the full contents of the tree for every single change. Instead, we want a way to store a tree represented as a set of changes to another tree. We call this a "delta".

Delta direction

As we decide to store our repositories using deltas, we must be concerned about performance. Retrieving a tree which is in a deltified representation requires more effort than retrieving one which is stored in full. For example, let's suppose that version 1 of the tree is stored in full, but every subsequent revision is represented as a delta from its predecessor. This means that in order to retrieve version 4,686, we must first retrieve version 1 and then apply 4,685 deltas. Obviously, this approach would mean that retrieving some versions will be faster than others. When using this approach we say that we are using "forward deltas", because each delta expresses the set of changes from one version to the next.

We observe that not all versions of the tree are equally likely to be retrieved. For example, version 83 of the Vault tree is not special in any way. It is likely that we have not retrieved that version in over a year. I suspect that we will never retrieve it again. However, we retrieve the latest version of the tree many times per day. In fact, as a broad generalization, we can say that at any given moment, the most recent version of the tree is probably the most likely one to be needed.

The simplistic use of forward deltas delivers its worst performance for the most common case. Not good.

Another idea is to use "reverse deltas". In this approach, we store the most recent tree in full. Every other tree N is represented as a set of differences from tree N+1. This approach delivers its best performance for the most common case, but it can still take an awfully long time to retrieve older trees.

Some SCM tools use some sort of a compromise design. In one approach, instead of storing just one full tree and representing every other tree as a delta, we sprinkle a few more full trees along the way. For example, suppose that we store a full tree for every 10th version. This approach uses more disk space, but the SCM server never has to apply more than 9 deltas to retrieve any tree.
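The cost of each scheme can be compared by counting how many deltas must be applied to retrieve a given version. The following Python sketch does only that counting; the function names are invented for this example. For the compromise design it assumes the full trees sit at versions 1, 11, 21 and so on, which is one possible reading of "every 10th version".

```python
def forward_deltas_needed(version):
    """Version 1 is stored in full; each later version is a delta from its predecessor."""
    return version - 1


def reverse_deltas_needed(version, latest):
    """The latest version is stored in full; tree N is a delta from tree N+1."""
    return latest - version


def deltas_with_full_trees(version, interval=10):
    """Forward deltas, with a full tree at versions 1, 1+interval, 1+2*interval, ...

    The placement of the full trees is an assumption made for this sketch.
    """
    return (version - 1) % interval


latest = 4686
print("forward, latest version: ", forward_deltas_needed(latest))
print("reverse, latest version: ", reverse_deltas_needed(latest, latest))
print("reverse, version 83:     ", reverse_deltas_needed(83, latest))
print("full trees, worst case:  ",
      max(deltas_with_full_trees(v) for v in range(1, latest + 1)))
```

Forward deltas make the latest tree the most expensive one to retrieve, reverse deltas make it free, and periodic full trees cap the cost for every version at the price of extra disk space.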

What is a delta?

I've been throwing around this concept of deltas, but I haven't stopped to describe them.

A tree is a hierarchy of folders and files. A delta is the difference between two trees. In theory, those two trees do not need to be related. However, in practice, the only reason we calculate the difference between them is because one of them is derived from the other. Some developer started with tree N and made one or more changes, resulting in tree N+1.

We can think of the delta as a set of changes. In fact, many SCM tools use the term "changeset" for exactly this purpose. A changeset is merely a list of the changes which express the difference between two trees.

For example, let's suppose that Wilbur starts with tree N and makes the following changes:

1. He deletes $/top/subfolder/foo.c because it is no longer needed.
2. He edits $/top/subfolder/Makefile to remove foo.c from the list of file names.
3. He edits $/top/bar.c to remove all the calls to the functions in foo.c.
4. He renames $/top/hello.c and gives it the new name hola.c.
5. He adds a new file called feature_creep.c to $/top/.
6. He edits $/top/Makefile to add feature_creep.c to the list of filenames.
7. He moves $/top/subfolder/readme.txt into $/top.

At this point, he commits all of these changes to the repository as a single transaction. When the SCM server stores this delta, it must remember all of these changes.

For changeset item 1 above, the delete of foo.c is easily represented. We simply remember that foo.c existed in tree N but does not exist in tree N+1.

For changeset item 4, the rename of hello.c is a bit more complex. To handle renames, we need each object in the repository to have an identifier which never changes, even when the name or location of the item changes.

For changeset item 7, the move of readme.txt is another example of why repositories need IDs for each item. If we simply remember every item by its path, we cannot remember the occasions when that path changes.

Changeset item 5 is going to be a lot bulkier than some of the other items here. For this item we need to remember that tree N+1 has a file called feature_creep.c which was never present in tree N. However, a full representation of this changeset item needs to contain the entire contents of that file.

Changeset items 2, 3 and 6 represent situations where a file which already existed has been modified in some way. We could handle these items the same way as item 5, by storing the entire contents of the new version of the file. However, we will be happier if we can do deltas at the file level just as we are doing deltas at the tree level.
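Wilbur's changeset can be sketched as plain data to show why stable identifiers matter. In this Python illustration a tree is just a dictionary from object ID to path; the numeric IDs, the tuple layout and the `apply_changeset` function are all hypothetical, and file contents are left out entirely (a real tool would also store the new file for an add and a file delta for each edit).

```python
def apply_changeset(tree, changeset):
    """Return a new tree (dict of object ID -> path) with the changes applied."""
    result = dict(tree)
    for change in changeset:
        kind, object_id = change[0], change[1]
        if kind == "delete":
            del result[object_id]
        elif kind in ("add", "rename", "move"):
            result[object_id] = change[2]  # the new path for this object
        elif kind == "edit":
            pass  # path unchanged; a file delta would be stored here
        else:
            raise ValueError(f"unknown change: {kind}")
    return result


# Tree N, keyed by made-up object IDs that never change.
tree_n = {
    1: "$/top/subfolder/foo.c",
    2: "$/top/subfolder/Makefile",
    3: "$/top/bar.c",
    4: "$/top/hello.c",
    5: "$/top/Makefile",
    6: "$/top/subfolder/readme.txt",
}

# Wilbur's seven changes, committed as a single changeset.
wilbur = [
    ("delete", 1),
    ("edit", 2),
    ("edit", 3),
    ("rename", 4, "$/top/hola.c"),
    ("add", 7, "$/top/feature_creep.c"),
    ("edit", 5),
    ("move", 6, "$/top/readme.txt"),
]

tree_n_plus_1 = apply_changeset(tree_n, wilbur)
```

Because object 4 keeps its ID across the rename, the history of hola.c can still be traced back to hello.c. A tree keyed by path alone could only record a delete and an unrelated add.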

File deltas

A file delta merely expresses the difference between two files. Once again, the reason we calculate a file delta is because we believe it will be smaller than the file itself, usually because one of the files is derived from the other.

For text files, a well-known approach to the file delta problem is to compare line-by-line and output a list of lines which have been inserted, deleted or changed. This is the same kind of result which is produced by the Unix 'diff' command. The bad news is that this approach only works for text files. The good news is that software developers and web developers have a lot of text files.
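A line-oriented comparison of this kind is easy to try out with Python's standard `difflib` module, which produces output in the same style as the Unix 'diff' command. The two tiny file versions below are invented for the example.

```python
import difflib

old = [
    "int main()\n",
    "{\n",
    '    puts("hello");\n',
    "    return 0;\n",
    "}\n",
]
new = [
    "int main()\n",
    "{\n",
    '    puts("hola");\n',
    "    return 0;\n",
    "}\n",
]

# unified_diff yields header lines, then context lines prefixed with a
# space, removed lines prefixed with "-" and added lines prefixed with "+".
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="hello.c (old)", tofile="hello.c (new)")
)
print("".join(diff_lines), end="")
```

Only the one changed line appears with "-" and "+" markers; the surrounding lines are shown as context.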

CVS and Perforce use this approach for repository storage. Text files are deltified using a line-oriented diff. Binary files are not deltified at all, although Perforce does reduce the penalty somewhat by compressing them.

Subversion and Vault are examples of tools which use binary file deltas for repository storage. Vault uses a file delta algorithm called VCDiff, as described in RFC 3284 (http://www.faqs.org/rfcs/rfc3284.html). This algorithm is byte-oriented, not line-oriented. It outputs a list of byte ranges which have been changed. This means it can handle any kind of file, binary or text. As an ancillary benefit, the VCDiff algorithm compresses the data at the same time.

Binary deltas are a critical feature for some SCM tool users, especially in situations where the binary files are large. Consider the case where a user checks out a 10 MB file, changes a few bytes, and checks it back in. In CVS, the size of the repository will increase by 10 MB. In Subversion and Vault, the repository will only grow by a small amount.
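The payoff of a byte-oriented delta can be shown with a toy example. The Python sketch below is not VCDiff and does not follow RFC 3284; it simply uses the standard `difflib.SequenceMatcher` on byte strings to build a list of "copy" and "insert" steps. The function names and the small pseudo-random test file are assumptions made for the illustration.

```python
import random
from difflib import SequenceMatcher


def make_delta(old, new):
    """Describe new as a list of ("copy", start, end) and ("insert", data) steps."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))        # reuse bytes already in old
        elif tag in ("replace", "insert"):
            delta.append(("insert", new[j1:j2]))  # bytes that exist only in new
        # "delete" needs no step: those bytes are simply never copied
    return delta


def apply_delta(old, delta):
    """Rebuild the new file from the old file plus the delta."""
    out = bytearray()
    for step in delta:
        if step[0] == "copy":
            out += old[step[1]:step[2]]
        else:
            out += step[1]
    return bytes(out)


# A small stand-in for a large binary file: 4096 pseudo-random bytes.
rng = random.Random(0)
old_file = bytes(rng.getrandbits(8) for _ in range(4096))

# Change two bytes in the middle (flipping bits guarantees they differ).
changed = bytes(b ^ 0xFF for b in old_file[2000:2002])
new_file = old_file[:2000] + changed + old_file[2002:]

delta = make_delta(old_file, new_file)
inserted_bytes = sum(len(step[1]) for step in delta if step[0] == "insert")
rebuilt = apply_delta(old_file, delta)
```

The delta carries only the handful of bytes that actually changed, plus a few copy instructions, yet it is enough to rebuild the whole new file from the old one.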

Deltas and diffs are different

Please note that I make a distinction between the terms "delta" and "diff".

A "delta" is the difference between two versions. If we have one full file and a delta, then we can construct the other full file. A delta is used primarily because it is smaller than the full file, not because it is useful for a human being to read. The purpose of a delta is efficiency. When deltas are done at the level of bytes instead of textual lines, that efficiency becomes available to all kinds of files, not just text files.

A "diff" is the human-readable difference between two versions of a text file. It is usually line-oriented, but really cool visual diff tools can also highlight the specific characters on a line which differ. The purpose of a diff is to show a developer exactly what has changed between two versions of a file. Diffs are really useful for text files, because human beings tend to read text files. Most human beings don't read binary files, and human-readable diffs of binary files are similarly uninteresting.

As mentioned above, some SCM tools use binary deltas for repository storage or to improve performance over slow network lines. However, those tools also support textual diffs. Deltas and diffs serve two distinct purposes, both of which are important. It is merely coincidence that some SCM tools use textual diffs as their repository deltas.

Best Practice: Checkin all the canonical stuff, and nothing else

Although you can store anything you want in a repository, that doesn't mean you should. The best practice here is to store everything which is necessary to do a build, and nothing else. I call this "the canonical stuff".

To put this another way, I recommend that you do not store any file which is automatically generated. Checkin your hand-edited source code. Don't checkin EXEs and DLLs. If you use a code generation tool, checkin the input file, not the generated code file. If you generate your product documentation in several different formats, checkin the original format, the one that you manually edit.

If you have two files, one of which is automatically generated from the other, then you just don't need to checkin both of them. You would in effect be managing two expressions of the same thing. If one of them gets out of sync with the other, then you have a problem.

The evolution of source control technology

At this point I should admit that I have presented a somewhat idealized view of the world. Not all SCM tools work the way I have described. In fact, I have presented things exactly backwards, discussing tree-wide deltas before file deltas. That is not the way the history of the world unfolded.

Prehistoric ancestors of modern programmers had to live with extremely primitive tools. Early version control systems like RCS only handled file deltas. There was no way for the system to remember folder-level operations like adding, renaming or deleting files.

Over time, the design of SCM tools matured. CVS is probably the most popular source control tool in the world today. It was originally developed as a set of wrappers around RCS which essentially provided support for some folder-level operations. Although CVS still has some important limitations, it was a big step forward.

Today, several modern source control systems are designed around the notion of tree-wide deltas. By accurately remembering every possible operation which can happen to a repository, these tools provide a truly complete history of a project.

What can be stored in a repository?

People sometimes ask us what kind of things can be stored in a repository. In general, the answer is: "Any file". It is true that I am focusing on tools which are designed for software developers and web developers. However, those tools don't really care what kind of file you store inside them. Vault doesn't care. Perforce, Subversion and CVS don't care. Any of these tools will gratefully accept any file you want to store.

If you will be storing a lot of binary files, it is helpful to know how your SCM tool handles them. A tool which uses binary deltas in the repository may be a better choice.

If all of your files are binary, you may want to explore other solutions. Tools like Vault and Subversion were designed for programmers. These products contain features designed specifically for use with source code, including diff and automerge. You can use these systems to store all of your Excel spreadsheets, but they are probably not the best tool for the job. Consider exploring "document management" systems instead.

How is the repository itself stored?

We need to descend through one more layer of abstraction before we turn our attention back to more practical matters. So far I have been talking about how things are stored and managed within a repository, but I have not broached the subject of how the repository itself is stored.

A repository must store every version of every file. It must remember the hierarchy of files and folders for every version of the tree. It must remember metadata, information about every file and folder. It must remember checkin comments, explanations provided by the developer for each checkin. For large trees and trees with very many revisions, this can be a lot of data that needs to be managed efficiently and reliably. There are several different ways of approaching the problem.

RCS kept one archive file for every file being managed. If your file was called "foo.c" then the archive file was called "foo.c,v". Usually these archive files were kept in a subdirectory of the working directory, just one level down. RCS files were plain text; you could just look at them with any editor. Inside the file you would find a bunch of metadata and a full copy of the latest version of the file, plus a series of line-oriented file deltas, one for each previous version. (Please forgive me for speaking of RCS in the past tense. Despite all the fond memories, that particular phase of my life is over.)

CVS uses a similar design, albeit with a lot more capabilities. A CVS repository is distinct, completely separate from the working directory, but it still uses ",v" files just like RCS. The directory structure of a CVS repository contains some additional metadata.

When managing larger and larger source trees, it becomes clear that the storage challenges of a repository are exactly the same as the storage challenges of a database. For this reason, many SCM tools use an actual database as the backend data store. Subversion uses Berkeley DB. Vault uses SQL Server 2000. The benefit of this approach is enormous, especially for SCM tools which support atomic transactions. Microsoft has invested lots of time and money to ensure that SQL Server is a safe place to store important information. Data corruption simply doesn't happen. All of the ultra-tricky details of transactions are handled by the underlying database.

Perforce uses somewhat of a hybrid approach, storing all of the metadata in a database but keeping all of the actual file contents in RCS files. This approach trades some safety for speed. Since Perforce manages its own archive files, it has to take responsibility for all the strange things that threaten to corrupt them. On the other hand, writing a file is a bit faster than writing a blob into a SQL database. Perforce has the reputation of being one of the fastest SCM tools.

Best Practice: Use separate repositories for things which are truly separate

Most SCM tools offer the ability to have multiple distinct repositories. Vault can even host multiple repositories on the same Vault server. People often ask us when this capability should be used.

In general, you should store related items in the same repository. Start a separate repository only in situations where the contents of the two are completely unrelated. In a small ISV, it may be quite logical to have only one repository which contains every project.

Managing repositories

Creating a source control repository is kind of a special event. It's a little bit like adopting a cat. People often get a cat without realizing the animal is going to be around for 10-20 years. Your repository may have similar longevity, or even longer.

Shortly after SourceGear was founded in 1997, we created a SourceSafe repository. Over seven years later, that repository is still in use, almost every day. (Along with a whole bunch of legacy projects, it contains the source code for SourceOffSite. We never migrated that project to Vault because we wanted the SourceOffSite developers to continue eating their own dogfood.)

That repository is well over a gigabyte in size (which is actually rather small, but then SourceGear has never been a very big company). It contains thousands of files, thousands of checkins, and has been backed up thousands of times.

Treat your repository well and it will serve you well:

Obviously you should do regular backups. That repository contains everything your fussy and expensive programmers have ever created. Don't risk losing it.

Just for fun, take an hour this week and check your backup to see if it actually works. It's shocking how many people are doing daily backups that cannot actually be restored when they are needed.

Put your repository on a reliable server. If your repository goes down, your entire team is blocked from doing work. Disk drives like to fail, so use RAID. Power supplies like to fail, so get a server with redundant power supplies. The electrical grid likes to fail, so get a good Uninterruptible Power Supply (UPS).

Be conservative in the way your SCM server machine is managed. Don't put anything on that machine that doesn't need to be there. Don't feel the need to install every single Service Pack on the day it gets released. I've been shocked how many times one of our servers went south simply because we installed a service pack or hotfix from Windows Update. Obviously I want our machines to be kept current with the latest security fixes, but I've been burned too many times not to be cautious. Install those patches on some other machine before you put them on critical servers.

Keep your SCM server inside a firewall. If you need to allow your developers to access the repository from home, carefully poke a hole, but leave everything else as tight as you can. Make sure your developers are using some sort of bulk encryption. Vault uses SSL. Tools like Perforce, CVS and Subversion can be tunneled through ssh or something similar.

This brief list of tips is hardly a complete guide for administrators. I am merely trying to describe the level of care and caution which should be used for your SCM repository.

Best Practice: Never obliterate anything that was real work

The purist in me wants to recommend that nothing should ever be obliterated. However, my pragmatist side prevails. There are situations where obliterate is not sinful.

However, obliterate should never be used to delete actual work. Don't obliterate a file simply because you discovered it to be a bad idea. Don't obliterate a file simply because you don't need it anymore. Obliterate is for situations where something in the repository should never have been there at all. For example, if you accidentally checkin a gigabyte of MP3s alongside your C++ include files, obliterate is a justifiable choice.

Undo

As I have mentioned, one of the best things about source control is that it contains your entire history. Everyversion of everything is stored. Nothing is ever deleted.

However, sometimes this benefit can be a real pain. What if I made a mistake and checked in something that should not be checked in? My history contains something I would rather forget. I want to pretend that it never happened. Isn't there some way to really delete from a repository?

In general, the recommended way to fix a problem is to checkin a new version which fixes it. Try not to worry about the fact that your repository contains a full history of the error. Your mistakes are a part of your past. Accept them and move on with your life.

However, most SCM tools do provide one or more ways of dealing with this situation. First, there is a command I call "rollback". This command is essentially an "undo" for revisions of a file. For example, let's say that a certain file is at version 7 and we want to go back to version 6. In Vault, we select version 6 and choose the Rollback command.

To be fair, I should admit that the rollback command is not always destructive. In some SCM tools, the rollback feature really does make version 7 disappear forever. Vault's rollback is non-destructive. It simply creates a version 8 which is identical to version 6. The designers of Vault are fanatical purists, or at the very least, one of them is.
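The non-destructive flavor of rollback can be sketched in a few lines of Python. This is only a toy model under my own assumptions, not how Vault is implemented; the FileHistory class and its method names are invented for illustration.

```python
class FileHistory:
    """Toy model of one file's revision history (hypothetical names)."""

    def __init__(self):
        self._versions = []  # index 0 holds version 1

    def checkin(self, contents):
        """Store a new version and return its version number."""
        self._versions.append(contents)
        return len(self._versions)

    def get(self, version):
        """Return the contents stored for the given version number."""
        return self._versions[version - 1]

    @property
    def latest(self):
        return len(self._versions)

    def rollback(self, version):
        """Non-destructive rollback: check in a copy of an older version."""
        return self.checkin(self.get(version))


history = FileHistory()
for n in range(1, 8):
    history.checkin(f"contents of version {n}")

new_version = history.rollback(6)
print(new_version)                        # 8
print(history.get(8) == history.get(6))   # True
print(history.get(7))                     # version 7 is still in the history
```

A destructive rollback would instead remove the last entry from the list, which is exactly the kind of history rewriting this approach avoids.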

As a concession to those who are less fanatical, Vault does support a way to truly destroy things in a repository. We call this feature "obliterate". I believe Subversion and Perforce use the same term. The obliterate command is the only way to delete something and make it truly gone forever.

In my original spec for Vault, I had decided that we would not implement any form of destructive delete. We eventually decided to compromise and implement this command, but I really wanted to discourage its use. SourceSafe makes it far too easy to rewrite history and pretend that something never happened. In the Delete dialog box, SourceSafe includes a checkbox called "Destroy Permanently". This is an atrocious design decision, roughly equivalent to leaving a sledgehammer next to the server machine so that people can bash the hard disks with it every once in a while. This checkbox is almost irresistible. It simply begs to be checked, even though it is very rarely the right thing to do.

When we first designed the obliterate command for Vault, I wanted its user interface to somehow make the user feel guilty. I argued that the obliterate dialog box should include a photograph of a 75-year-old Catholic nun scowling and holding a yardstick.

The rest of the team agreed that we should discourage people from using this command, but in the end, we settled on a less graphical approach. In Vault, the obliterate command is available only in the Admin client, not the regular client people use every day. In effect, we made the obliterate command available, but inconvenient. People who really need to obliterate can find the command and get it done. Everyone else has to think twice before they try to rewrite history and pretend something never happened.

Kimchi again?

Recently when I asked my fifth grade daughter what she had learned in school, she proudly informed me that "everyone in Korea eats kimchi at every meal, every day". In the world of a ten-year-old, things are simpler. Rules don't have exceptions. Generalizations always apply.

This is how we learn. We understand the basic rules first and see the finer points later. First we learn that memory leaks are impossible in the CLR. Later, when our app consumes all available RAM, we learn more.

My habit as I write these chapters is to first present the basics in a "matter of fact" fashion, rarely acknowledging that there are exceptions to my broad generalizations. I did this during the chapter on checkins, failing to mention "edit-merge-commit" until I had thoroughly explored "checkout-edit-checkin".

In this chapter, I have written everything from the perspective of just one specific architecture. SCM tools like Vault, Perforce, CVS and Subversion are based on the concept of a centralized server which hosts a single repository. Each client has a working folder. All clients contact the same server.

I confess that not all SCM tools work this way. Tools like BitKeeper (http://www.bitkeeper.com/) and Arch (http://www.gnu.org/software/gnu-arch/) are based on the concept of distributed repositories. Instead of one repository, there can be several, or even many. Things can be retrieved or committed to any repository at any time. The repositories are synchronized by migrating changesets from one repository to another. This results in a merge situation which is not altogether different from merging branches.

From the perspective of this SCM geek, distributed repositories are an attractive concept. Admittedly, they are advanced and complex, requiring a bit more of a learning curve on the part of the end user. But for the power user, this paradigm for source control is very cool.

Having no experience in the implementation of these systems, I will not be explaining their behavior in any detail. Suffice it to say that this approach is similar in some ways, but very different in others. This series of articles will continue to focus on the more mainstream architecture for source control.

Looking ahead

In this chapter, I discussed the details of repositories. In the next chapter, I'll go back over to the client side and dive into the details of working folders.



Chapter 5: Working Folders

The joy of indifference

CVS calls it a sandbox. Subversion calls it a working directory. Vault calls it a working folder. By any of these names, a working folder is a directory hierarchy on the developer's client machine. It contains a copy of the contents of a repository folder. The very basic workflow of using source control involves three steps:

1. Update the working folder so that it exactly matches the latest contents of the repository.
2. Make some changes to the working folder.
3. Checkin (or commit) those changes to the repository.

The repository is the official archive of our work. We treat our repository with great respect. We are extremely careful about what gets checked in. We buy backup disks and RAID arrays and air conditioners and whatever it takes to make sure our precious repository is always comfortable and happy.

In contrast, we treat our working folder with very little regard. It exists for the purpose of being abused. Our working folder starts out worthless, nothing more than a copy of the repository. If it is destroyed, we have lost nothing, so we run risky experiments which endanger its life. We attempt code changes which we are not sure will ever work. Sometimes the contents of our working folder won't even compile, much less pass the test suite. Sometimes our code changes turn out to be a Really Bad Idea, so we simply discard the entire working folder and get a new one.

But if our code changes turn out to be useful, things change in a very big way. Our working folder suddenly has value. In fact, it is quite precious. The only copy of our most recent efforts is sitting on a crappy, laptop-grade hard disk which gets physically moved four times a day and never gets backed up. The stress of this situation is almost intolerable. We want to get those changes checked in to the repository as quickly as possible.

Once we do, we breathe a sigh of relief. Our working folder has once again become worthless, as it should be.

Best Practice: Don't let your working folder become too valuable

Checkin your work to the repository as often as you can without breaking the build.

Hidden state information

Once again I need to spend some time explaining grungy details of how SCM tools work. I don't want to repeat the analogy I used in the last chapter, so the following line of "code" should suffice:

Response.Write(previousChapter.Section["Cars and Clocks"]);

Let's suppose I have a brand new working folder. In other words, I started with nothing at all and I retrieved the latest versions from the repository. At this moment, my new working folder is completely in sync with the contents of the repository. But that condition is not likely to last for long. I will be making changes to some of the files in my working folder, so it will be "newer" than the repository. Other developers may be checking in their changes to the repository, thus making my working folder "out of date". My working folder is going to be new and old at the same time. Things are going to get confusing. The SCM tool is responsible for keeping track of everything. In fact, it must keep track of the state of each file individually.

Best Practice: Use non-working folders when you are not working

SCM tools need this "hidden state information" so they can efficiently keep track of things as you make changes to your working folder. However, sometimes you want to retrieve files from the repository with no plan of making changes to them. For example, if you are retrieving files to make a source tarball, or for the purpose of doing an automated build, you don't really need the hidden state information at all.

Your SCM tool probably has a way to retrieve things "plain", without writing the hidden state information anywhere. I call this a "non-working folder". In Vault, this is done automatically whenever you retrieve files to a destination which is not configured as the working folder, although I sometimes wish we had made this functionality a completely separate command.

For housekeeping purposes, the SCM tool usually keeps a bit of extra information on the client side. When a file is retrieved, the SCM client stores its contents in the corresponding working file, but it also records certain information for later. Examples:

Your SCM tool may record the timestamp on the working file, so that it can later detect if you have modified it.

It may record the version number of the repository file that was retrieved, so that it may later know the starting point from which you began to make your changes.

It may even tuck away a complete copy of the file that was retrieved, so that it can show you a diff without accessing the server.

I call this information "hidden state information". Its exact location depends on which SCM tool you are using. Subversion hides it in invisible subdirectories in your working directory. Vault can work similarly, but by default it stores hidden state information in the current user's "Application Data" directory.
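To make the idea concrete, here is a minimal Python sketch of what a client might record when it retrieves a file, and how it could later use that record. Everything here is an assumption made for illustration: the HiddenState class, the retrieve and is_modified functions, and the choice to hold the baseline in memory do not correspond to any particular SCM tool.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class HiddenState:
    """What a hypothetical SCM client remembers about one retrieved file."""
    version: int      # repository version that was retrieved
    mtime_ns: int     # timestamp of the working file right after retrieval
    baseline: bytes   # complete copy of the retrieved contents


def retrieve(path, contents, version):
    """Write the working file and return the state recorded alongside it."""
    with open(path, "wb") as f:
        f.write(contents)
    return HiddenState(version, os.stat(path).st_mtime_ns, contents)


def is_modified(path, state):
    """Cheap timestamp check first, then compare against the baseline copy."""
    if os.stat(path).st_mtime_ns == state.mtime_ns:
        return False
    with open(path, "rb") as f:
        return f.read() != state.baseline


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "foo.cpp")
    state = retrieve(path, b"long size();\n", version=7)
    before_edit = is_modified(path, state)

    with open(path, "wb") as f:              # the developer edits the file
        f.write(b"short size();\n")
    later = state.mtime_ns + 10 * 10**9      # pretend the edit came 10 s later
    os.utime(path, ns=(later, later))
    after_edit = is_modified(path, state)

print(before_edit, after_edit)
```

The timestamp is only a shortcut. Because the baseline copy is available locally, the client can confirm a change (or produce a diff) without contacting the server.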

Working file states

Because of the changes happening on both the client and the server, a working file can be in one of several possible states. SCM tools typically have some way of displaying the state of each file to the user. Vault shows file states in the main window. CVS shows them in response to the 'cvs status' command.

The table below shows the possible states for a working file. The column on the left shows my particular name for each of these states, which through no coincidence is the name that Vault uses. The column on the far right shows the name shown by the 'cvs status' command. However, the terminology doesn't really matter. One way or another, your SCM tool is probably keeping track of all these things and can tell you the state of any file in your working folder hierarchy.

Refresh

In order to keep all this file status information current, the SCM client must have ways of staying up to date with everything that is happening. Whenever something changes in the working folders or in the repository, the SCM client wants to know.

Changes in the working folders on the client side are relatively easy. The SCM client can quickly scan files in the working folders to determine what has changed. On some operating systems, the client can register to be notified of changes to any file.

Notification of changes on the server can be a bit trickier. The Vault client periodically queries the server to ask for the latest version of the repository tree structure. Most of the time, the server will simply respond that "nothing has changed". However, when something has in fact changed, the client receives a list of things which have changed since the last time that client asked for the tree structure.

For example, let's assume Laura retrieves the tree structure and is informed that foo.cpp is at version 7. Later, Wilbur checks in a change to foo.cpp and creates version 8. The next time Laura's Vault client performs a refresh, it will ask the server if there is anything new. The server will send down a list, informing her client that foo.cpp is now at version 8. The actual bits for foo.cpp will not be sent until Laura specifically asks for them. For now, we just want the client to have enough information so that it can inform Laura that her copy of foo.cpp is now "Old".
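The Laura and Wilbur scenario can be modeled with a small Python sketch. This is not Vault's protocol; the Server and Client classes, the changes_since method and the "Current" label are hypothetical names chosen for this example. Only the "Old" label comes from the text above.

```python
class Server:
    """Toy server: tracks the latest version number of each file."""

    def __init__(self):
        self.revision = 0      # bumped on every checkin
        self._log = []         # (revision, path, new_version)
        self._versions = {}

    def checkin(self, path):
        self.revision += 1
        self._versions[path] = self._versions.get(path, 0) + 1
        self._log.append((self.revision, path, self._versions[path]))

    def changes_since(self, revision):
        """Return {path: latest version} for files changed after `revision`."""
        return {path: version
                for rev, path, version in self._log if rev > revision}


class Client:
    def __init__(self, server):
        self.server = server
        self.seen_revision = 0
        self.latest = {}       # what the server has, as of the last refresh
        self.retrieved = {}    # versions actually present in the working folder

    def refresh(self):
        """Ask only for what changed; no file contents are transferred."""
        changes = self.server.changes_since(self.seen_revision)
        self.latest.update(changes)
        self.seen_revision = self.server.revision
        return changes

    def get_latest(self, path):
        self.retrieved[path] = self.latest[path]

    def status(self, path):
        return "Old" if self.retrieved[path] < self.latest[path] else "Current"


server = Server()
for _ in range(7):
    server.checkin("foo.cpp")      # foo.cpp reaches version 7

laura = Client(server)
laura.refresh()
laura.get_latest("foo.cpp")        # Laura now holds version 7

server.checkin("foo.cpp")          # Wilbur creates version 8
news = laura.refresh()
print(news, laura.status("foo.cpp"))
quiet = laura.refresh()            # nothing has changed since the last refresh
```

Note that refresh only moves version numbers around. Laura's working file stays at version 7 until she explicitly asks for the new contents.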

Operations that involve a working folder

OK, let's go back to speaking a bit more about practical matters. In terms of actual usage, most interaction with your SCM tool happens in and around your working folder. The following operations are the basic things I can do to a working folder:



In the following sections, I will cover each of these operations in a bit more detail.

Make the changes

The primary thing you do to a working folder is make changes to it.

In an idealized world, it would be really nice if the SCM tool didn't have to be involved at all. The developer would simply work, making all kinds of changes to the working folder while the SCM tool eavesdrops, keeping an accurate list of every change that has been made.

Unfortunately, this perfect world isn't quite available. Most operations on a working folder cannot be automatically detected by the SCM client. They must be explicitly indicated by the user. Examples:

It would be unwise for the SCM client to notice that a file is "Missing" and automatically assume it should be deleted from the repository.

Automatically inferring an "Add" operation is similarly unsafe. We don't want our SCM tool automatically adding any file which happens to show up in our working folder.

Rename and move operations also cannot be reliably divined by mere observation of the result. If I rename foo.cpp to bar.cpp, how can my SCM client know what really happened? As far as it can tell, I might have deleted foo.cpp and added bar.cpp as a new file.

All of these so-called "folder-level" operations require the user to explicitly give a command to the SCM tool. The resulting operation is added to the pending change set, which is the list of all changes that are waiting to be committed to the repository.
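The rename problem is easy to demonstrate. In the Python sketch below, the observe function and the tuple layout of the pending entries are my own invention for illustration; real tools store their pending change sets differently.

```python
def observe(before, after):
    """All a client can see by scanning: which names vanished, which appeared."""
    return sorted(before - after), sorted(after - before)


before = {"foo.cpp", "main.cpp"}
after = {"bar.cpp", "main.cpp"}
vanished, appeared = observe(before, after)
print(vanished, appeared)   # ['foo.cpp'] ['bar.cpp']

# Two different intentions produce exactly that observation, so the user
# has to say which one is meant. Each explicit command becomes an entry
# in the pending change set.
pending_if_renamed = [("rename", "foo.cpp", "bar.cpp")]
pending_if_replaced = [("delete", "foo.cpp"), ("add", "bar.cpp")]
```

The two pending change sets lead to different repository histories: a rename keeps the history of foo.cpp attached to bar.cpp, while a delete followed by an add starts bar.cpp with no history at all.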

However, it just so happens that in the most common case, our "eavesdropping" ideal is available. Developers who use the edit-merge-commit model typically do not issue any explicit command telling the SCM tool of their intention to edit a file. The files in their working folder are left in a writable state, so they simply open their text editor or their IDE and begin making changes. At the appropriate time, the SCM tool will notice the change and add that file to the pending change set.

Users who prefer "checkout-edit-checkin" actually have a somewhat more consistent rule for their work. The SCM tool must be explicitly informed of all changes to the working folder. All files in their working folder are usually marked read-only. The SCM tool's Checkout command not only informs the server of the checkout request, but it also flips the bit on the working file to make it writable.
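The "flips the bit" step can be sketched in Python as follows. This is a simplified illustration, not any tool's real Checkout command: the function names are hypothetical, and a local set stands in for the server's record of checked-out files.

```python
import os
import stat
import tempfile


def make_read_only(path):
    """Mark a working file read-only, as it would be after retrieval."""
    os.chmod(path, stat.S_IREAD)


def checkout(path, checked_out):
    """Record the checkout (the set stands in for the server) and
    make the working file writable."""
    checked_out.add(path)
    os.chmod(path, stat.S_IREAD | stat.S_IWRITE)


def is_writable(path):
    """Report whether the owner-write bit is set on the file."""
    return bool(os.stat(path).st_mode & stat.S_IWRITE)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "foo.cpp")
    with open(path, "w") as f:
        f.write("int main() { return 0; }\n")

    make_read_only(path)
    before = is_writable(path)

    checked_out = set()
    checkout(path, checked_out)
    after = is_writable(path)

print(before, after)
```

The read-only bit is what reminds the developer to tell the SCM tool before editing: most editors will complain when asked to save over a read-only file.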

Review changes

One of the most important features provided by a working folder is the ability to review all of the changes I have made. For SCM tools that do keep track of a pending change set (Vault, Perforce, Subversion), this is the place to start. The following screen dump shows the pending change set pane from the Vault client, which is showing me that I have currently made two changes in my working folder:

[Screen dump: screendumps/scm_pending_5.gif]

The pending change set view shows all kinds of changes, including adds, deletes, renames, moves, and modified files. It is helpful to keep an eye on the pending change set as I work, verifying that I have not forgotten anything.

However, for the case of a modified file, this visual display only shows me which files have changed. To really review my changes, I need to actually look inside the modified files. For this, I invoke a diff tool. The following screen dump is from a popular Windows diff tool called Beyond Compare (http://www.scootersoftware.com/):



Best Practice: Run diff just before you checkin, every time

Never checkin your changes without giving them a quick review in some sort of a diff tool.

This picture is fairly typical of the visual diff tool genre, showing both files side-by-side and highlighting the parts that are different. There are quite a few tools like this. The following screen dump is from the visual diff tool which is provided with Vault:

The left panel shows version 21 of sgdmgui_props.cpp, which is the current version in the repository. The right panel shows my working file. The colored regions show exactly what has changed:

On line 33 I changed the type of this function from long to short.

At line 35 I inserted a one-line comment.



Best Practice: Be careful with undo

When you tell your SCM client to undo the changes you have made to a file, those changes will be lost. If your working folder has become valuable, be careful with it.

Note that SourceGear's diff tool shows inserted lines by drawing lines in the center gap to indicate exactly where the insertion occurs. In contrast, Beyond Compare is showing a dead region on the left side across from the inserted line on the right. This particular issue is a matter of personal preference. The latter approach does have the benefit that identical lines are always across from each other.

Both of these tools do a nice job on the modification to line 33, showing exactly which part of the line was changed. Most of the recent visual diff tools support this ability to highlight intraline differences.

Visual diff tools are indispensable. They give me a way to quickly review exactly what has changed. I strongly recommend you make a habit of reviewing all of your changes just before you checkin. You can catch a lot of silly mistakes by taking the time to be sure that your changes look the way you think they look.
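Even without a visual tool, the same kind of review is possible from a script. The Python sketch below uses the standard difflib module to compare a repository version against a working file. The file contents are made up for this example, loosely echoing the two changes described above (a return type changed from long to short, and a one-line comment inserted); they are not the real sgdmgui_props.cpp.

```python
import difflib

# Hypothetical contents, invented for illustration.
repository_version = [
    "long get_count()\n",
    "{\n",
    "    return count;\n",
    "}\n",
]
working_file = [
    "short get_count()\n",
    "{\n",
    "    // count never exceeds 100\n",
    "    return count;\n",
    "}\n",
]

diff = list(difflib.unified_diff(
    repository_version, working_file,
    fromfile="repository", tofile="working folder",
))
print("".join(diff), end="")
```

Lines starting with "-" exist only in the repository version and lines starting with "+" exist only in the working file. A unified diff like this does not highlight intraline differences the way the visual tools do, so the changed line simply appears once as removed and once as added.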

Undo changes

Sometimes I make changes which I simply don't intend to keep. Perhaps I tried to fix a bug and discovered that my fix introduced five new bugs that are worse than the one I started with. Or perhaps I just changed my mind. In any case, a very nice feature of a working folder is the ability to undo.

In the case of a folder-level operation, perhaps the Undo command should actually be called "Nevermind". After all, the operation is pending. It hasn't happened yet. I'm not really saying that I want to Undo something which has already happened. Rather, I am just saying that I no longer want to do something that I previously said I did.

For example, if I tell the Vault client to delete a file, the file isn't really deleted until I commit that change to the repository. In the meantime, it is merely waiting around in my pending change set. If I then tell the Vault client to Undo this operation, the only thing that actually has to happen is to remove it from my pending change set.

In the case of a modified file, the Undo command simply overwrites the working file with the "baseline" version, the one that I last retrieved. Since Vault has been keeping a copy of this baseline version, it merely needs to copy this baseline file from its place in the hidden state information over the working file.

For users who use the checkout-edit-checkin style of development, closely related here is the need to undo a checkout. This is essentially similar to undoing the changes in a file, but involves the extra step of informing the server that I no longer want the file to be checked out.
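Both kinds of undo described in this section are simple enough to sketch in Python. As before, this is an illustration under my own assumptions rather than Vault's code: the function names are hypothetical, the pending change set is a plain list, and a file named foo.cpp.baseline stands in for the copy kept in the hidden state information.

```python
import os
import shutil
import tempfile


def undo_pending(pending, change):
    """Undo a folder-level operation: drop it from the pending change set."""
    pending.remove(change)


def undo_edit(working_path, baseline_path):
    """Undo edits to a file: overwrite it with the stored baseline copy."""
    shutil.copyfile(baseline_path, working_path)


with tempfile.TemporaryDirectory() as folder:
    working = os.path.join(folder, "foo.cpp")
    baseline = os.path.join(folder, "foo.cpp.baseline")  # stand-in for hidden state
    for p in (working, baseline):
        with open(p, "w") as f:
            f.write("long size();\n")

    with open(working, "w") as f:        # an edit I decide not to keep
        f.write("short size();\n")
    undo_edit(working, baseline)
    with open(working) as f:
        restored = f.read()

pending = [("delete", "old.cpp"), ("add", "new.cpp")]
undo_pending(pending, ("delete", "old.cpp"))
print(repr(restored), pending)
```

Neither function contacts a server. Undoing a checkout would add one more step, telling the server that the file is no longer checked out.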

Digression: Your skillet is not a working folder

Source control tools have been a daily part of my life for well over a decade. I can't imagine doing software development without them. In fact, I have developed habits that occasionally threaten my mental health.

Things would be so much easier if the concept of a working folder were available in other areas of life:

"Hmmm. I can't remember which of these pool chemicals I have already done. Luckily, I can just diff against the version of the pool water from an hour ago and see exactly what changes I have made."

"Boy am I glad I remembered to set the read-only bit on my front lawn to remind me that I'm not supposed to cut the grass until a week after the fertilizer was applied."

"No worries -- if I accidentally put too much pepper on this chicken, I can just revert to the latest version in the repository."

Unfortunately, SCM tools are unique. When I make a mistake in my woodshop, I can't undo it. Only in software development do I have the luxury of a working folder. It's a place where I can work without constantly worrying about making a mistake. It's a place where I can work without having to be
