TO KNOW OURSELVES

THE U.S. DEPARTMENT OF ENERGY AND THE HUMAN GENOME PROJECT

JULY 1996


Contents

FOREWORD . . . 2

THE GENOME PROJECT: WHY THE DOE? . . . 4
A bold but logical step

INTRODUCING THE HUMAN GENOME . . . 6
The recipe for life
    Some definitions . . . 6
    A plan of action . . . 8

EXPLORING THE GENOMIC LANDSCAPE . . . 10
Mapping the terrain
    Two giant steps: Chromosomes 16 and 19 . . . 12
    Getting down to details: Sequencing the genome . . . 16
    Shotguns and transposons . . . 20
    How good is good enough? . . . 26
    Sidebar: Tools of the Trade . . . 17
    Sidebar: The Mighty Mouse . . . 24

BEYOND BIOLOGY . . . 27
Instrumentation and informatics
    Smaller is better: And other developments . . . 27
    Dealing with the data . . . 30

ETHICAL, LEGAL, AND SOCIAL IMPLICATIONS . . . 32
An essential dimension of genome research

Foreword

AT THE END OF THE ROAD in Little Cottonwood Canyon, near Salt Lake City, Alta is a place of near-mythic renown among skiers. In time it may well assume similar status among molecular geneticists. In December 1984, a conference there, co-sponsored by the U.S. Department of Energy, pondered a single question: Does modern DNA research offer a way of detecting tiny genetic mutations and, in particular, of observing any increase in the mutation rate among the survivors of the Hiroshima and Nagasaki bombings and their descendants? In short, the answer was, Not yet. But in an atmosphere of rare intellectual fertility, the seeds were sown for a project that would make such detection possible in the future: the Human Genome Project.

In the months that followed, much deliberation and debate ensued. But in 1986, the DOE took a bold and unilateral step by announcing its Human Genome Initiative, convinced that its mission would be well served by a comprehensive picture of the human genome. The immediate response was considerable skepticism: skepticism about the scientific community’s technological wherewithal for sequencing the genome at a reasonable cost and about the value of the result, even if it could be obtained economically.

Things have changed. Today, a decade later, a worldwide effort is under way to develop and apply the technologies needed to completely map and sequence the human genome, as well as the genomes of several model organisms. Technological progress has been rapid, and it is now generally agreed that this international project will produce the complete sequence of the human genome by the year 2005.

And what is more important, the value of the project also appears beyond doubt. Genome research is revolutionizing biology and biotechnology, and providing a vital thrust to the increasingly broad scope of the biological sciences. The impact that will be felt in medicine and health care alone, once we identify all human genes, is inestimable. The project has already stimulated significant investment by large corporations and prompted the creation of new companies hoping to capitalize on its profound implications.

But the DOE’s early, catalytic decision deserves further comment. The organizers of the DOE’s genome initiative recognized that the information the project would generate, both technological and genetic, would contribute not only to a new understanding of human biology, but also to a host of practical applications in the biotechnology industry and in the arenas of agriculture and environmental protection. A 1987 report by a DOE advisory committee provided some examples. The committee foresaw that the project could ultimately lead to the efficient production of biomass for fuel, to improvements in the resistance of plants to environmental stress, and to the practical use of genetically engineered microbes to neutralize toxic wastes. The Department thus saw far more to the genome project than a promised tool for assessing mutation rates. For example, understanding the human genome will have an enormous impact on our ability to assess, individual by individual, the risk posed by environmental exposures to toxic agents. We know that genetic differences make some of us more susceptible, and others more resistant, to such agents. Far more work must be done before we understand the genetic basis of such variability, but this knowledge will directly address the DOE’s long-term mission to understand the effects of low-level exposures to radiation and other energy-related agents, especially the effects of such exposure on cancer risk. And the genome project is a long stride toward such knowledge.

The Human Genome Project has other implications for the DOE as well. In 1994, taking advantage of new capabilities developed by the genome project, the DOE formulated the Microbial Genome Initiative to sequence the genomes of bacteria of likely interest in the areas of energy production and use, environmental remediation and waste reduction, and industrial processing. As a result of this initiative, we already have complete sequences for two microbes that live under extreme conditions of temperature and pressure. Structural studies are under way to learn what is unique about the proteins of these organisms, the aim being ultimately to engineer these microbes and their enzymes for such practical purposes as waste control and environmental cleanup. (DOE-funded genetic engineering of a thermostable DNA polymerase has already produced an enzyme that has captured a large share of the several-hundred-million-dollar DNA polymerase market.)

And other little-studied microbes hint at even more intriguing possibilities. For instance, Deinococcus radiodurans is a species that prospers even when exposed to huge doses of ionizing radiation. This microbe has an amazing ability to repair radiation-induced damage to its DNA. Its genome is currently being sequenced with DOE support, with the hope of understanding and ultimately taking practical advantage of its unusual capabilities. For example, it might be possible to insert foreign DNA into this microbe that allows it to digest toxic organic components found in highly radioactive waste, thus simplifying the task of further cleanup. Another approach might be to introduce metal-binding proteins onto the microbe’s surface that would scavenge highly radioactive isotopes out of solution.

Biotechnology, fueled in part by insights reaped from the genome project, will also play a significant role in improving the use of fossil-based resources. Increased energy demands, projected over the next 50 years, require strategies to circumvent the many problems associated with today’s dominant energy systems. Biotechnology promises to help address these needs by upgrading the fuel value of our current energy resources and by providing new means for the bioconversion of raw materials to refined products, not to mention offering the possibility of entirely new biomass-based energy sources.

We have thus seen only the dawn of a biological revolution. The practical and economic applications of biology are destined for dramatic growth. Health-related biotechnology is already a multibillion-dollar success story, and is still far from reaching its potential. Other applications of biotechnology are likely to beget similar successes in the coming decades. Among these applications are several of great importance to the DOE. We can look to improvements in waste control and an exciting era of environmental bioremediation; we will see new approaches to improving energy efficiency; and we can even hope for dramatic strides toward meeting the fuel demands of the future. The insights, the technologies, and the infrastructure that are already emerging from the genome project, together with advances in fields such as computational and structural biology, are among our most important tools in addressing these national needs.

Aristides A. N. Patrinos
Director, Human Genome Project
U.S. Department of Energy

The Genome Project: Why the DOE?

A BOLD BUT LOGICAL STEP

THE BIOSCIENCES RESEARCH community is now embarked on a program whose boldness, even audacity, has prompted comparisons with such visionary efforts as the Apollo space program and the Manhattan project. That life scientists should conceive such an ambitious project is not remarkable; what is surprising, at least at first blush, is that the project should trace its roots to the Department of Energy.

For close to a half-century, the DOE and its governmental predecessors have been charged with pursuing a deeper understanding of the potential health risks posed by energy use and by energy-production technologies, with special interest focused on the effects of radiation on humans. Indeed, it is fair to say that most of what we know today about radiological health hazards stems from studies supported by these government agencies. Among these investigations are long-standing studies of the survivors of the atomic bombings of Hiroshima and Nagasaki, as well as any number of experimental studies using animals, cells in culture, and nonliving systems. Much has been learned, especially about the consequences of exposure to high doses of radiation. On the other hand, many questions remain unanswered; in particular, we have much to learn about how low doses produce their insidious effects. When present merely in low but significant amounts, toxic agents such as radiation or mutagenic chemicals work their mischief in the most subtle ways, altering only slightly the genetic instructions in our cells. The consequences can be heritable mutations too slight to produce discernible effects in a generation or two but, in their persistence and irreversibility, deeply troublesome nonetheless.

Until recently, science offered little hope for detecting at first hand these tiny changes to the DNA that encodes our genetic program. Needed was a tool that could detect a change in one “word” of the program, among perhaps a hundred million. Then, in 1984, at a meeting convened jointly by the DOE and the International Commission for Protection Against Environmental Mutagens and Carcinogens, the question was first seriously asked: Can we, should we, sequence the human genome? That is, can we develop the technology to obtain a word-by-word copy of the entire genetic script for an “average” human being, and thus to establish a benchmark for detecting the elusive mutagenic effects of radiation and cancer-causing toxins? Answering such a question was not simple. Workshops were convened in 1985 and 1986; the issue was studied by a DOE advisory group, by the Congressional Office of Technology Assessment, and by the National Academy of Sciences; and the matter was debated publicly and privately among biologists themselves. In the end, however, a consensus emerged that we should make a start.

Adding impetus to the DOE’s earliest interest in the human genome was the Department’s stewardship of the national laboratories, with their demonstrated ability to conduct large multidisciplinary projects, just the sort of effort that would be needed to develop and implement the technological know-how needed for the Human Genome Project. Biological research programs already in place at the national labs benefited from the contributions of engineers, physicists, chemists, computer scientists, and mathematicians, working together in teams. Thus, with the infrastructure in place and with a particular interest in the ultimate results, the Department of Energy, in 1986, was the first federal agency to announce and to fund an initiative to pursue a detailed understanding of the human genome.

Of course, interest was not restricted to the DOE. Workshops had also been sponsored by the National Institutes of Health, the Cold Spring Harbor Laboratory, and the Howard Hughes Medical Institute. In 1988 the NIH joined in the pursuit, and in the fall of that year, the DOE and the NIH signed a memorandum of understanding that laid the foundation for a concerted interagency effort.

The basis for this community-wide excitement is not hard to comprehend. The first impulse behind the DOE’s commitment was only one of many reasons for coveting a deeper insight into the human genetic script. Defective genes directly account for an estimated 4000 hereditary human diseases, maladies such as Huntington disease and cystic fibrosis. In some such cases, a single misplaced letter among three billion can have lethal consequences. For most of us, though, even greater interest focuses on the far more common ailments in which altered genes influence but do not prescribe. Heart disease, many cancers, and some psychiatric disorders, for example, can emerge from complicated interplays of environmental factors and genetic misinformation.

The first steps in the Human Genome Project are to develop the needed technologies, then to “map” and “sequence” the genome. But in a sense, these well-publicized efforts aim only to provide the raw material for the next, longer strides. The ultimate goal is to exploit those resources for a truly profound molecular-level understanding of how we develop from embryo to adult, what makes us work, and what causes things to go wrong. The benefits to be reaped stretch the imagination. In the offing is a new era of molecular medicine characterized not by treating symptoms, but rather by looking to the deepest causes of disease. Rapid and more accurate diagnostic tests will make possible earlier treatment for countless maladies. Even more promising, insights into genetic susceptibilities to disease and to environmental insults, coupled with preventive therapies, will thwart some diseases altogether. New, highly targeted pharmaceuticals, not just for heritable diseases, but for communicable ailments as well, will attack diseases at their molecular foundations. And even gene therapy will become possible, in some cases actually “fixing” genetic errors. All of this in addition to a new intellectual perspective on who we are and where we came from.

The Department of Energy is proud to be playing a central role in propelling us toward these noble goals.


Introducing the Human Genome

THE RECIPE FOR LIFE

FOR ALL THE DIVERSITY of the world’s five and a half billion people, full of creativity and contradictions, the machinery of every human mind and body is built and run with fewer than 100,000 kinds of protein molecules. And for each of these proteins, we can imagine a single corresponding gene (though there is sometimes some redundancy) whose job it is to ensure an adequate and timely supply. In a material sense, then, all of the subtlety of our species, all of our art and science, is ultimately accounted for by a surprisingly small set of discrete genetic instructions. More surprising still, the differences between two unrelated individuals, between the man next door and Mozart, may reflect a mere handful of differences in their genomic recipes, perhaps one altered word in five hundred. We are far more alike than we are different. At the same time, there is room for near-infinite variety.

It is no overstatement to say that to decode our 100,000 genes in some fundamental way would be an epochal step toward unraveling the manifold mysteries of life.

SOME DEFINITIONS

The human genome is the full complement of genetic material in a human cell. (Despite five and a half billion variations on a theme, the differences from one genome to the next are minute; hence, we hear about the human genome, as if there were only one.) The genome, in turn, is distributed among 23 sets of chromosomes, which, in each of us, have been replicated and re-replicated since the fusion of sperm and egg that marked our conception. The source of our personal uniqueness, our full genome, is therefore preserved in each of our body’s several trillion cells. At a more basic level, the genome is DNA, deoxyribonucleic acid, a natural polymer built up of repeating nucleotides, each consisting of a simple sugar, a phosphate group, and one of four nitrogenous bases. The hierarchy of structure from chromosome to nucleotide is shown in Figure 1. In the chromosomes, two DNA strands are twisted together into an entwined spiral (the famous double helix) held together by weak bonds between complementary bases, adenine (A) in one strand to thymine (T) in the other, and cytosine to guanine (C–G). In the language of molecular genetics, each of these linkages constitutes a base pair. All told, if we count only one of each pair of chromosomes, the human genome comprises about three billion base pairs.

The specificity of these base-pair linkages underlies all that is wonderful about DNA. First, replication becomes straightforward. Unzipping the double helix provides unambiguous templates for the synthesis of daughter molecules: One helix begets two with near-perfect fidelity. Second, by a similar template-based process, depicted in Figure 2, a means is also available for producing a DNA-like messenger to the cell cytoplasm. There, this messenger RNA, the faithful complement of a particular DNA segment, directs the synthesis of a particular protein. Many subtleties are entailed in the synthesis of proteins, but in a schematic sense, the process is elegantly simple.

Every protein is made up of one or more polypeptide chains, each a series of (typically) several hundred molecules known as amino acids, linked by so-called peptide bonds. Remarkably, only 20 different kinds of amino acids suffice as the building blocks for all human proteins. The synthesis of a protein chain, then, is simply a matter of specifying a particular sequence of amino acids. This is the role of the messenger RNA. (The same nitrogenous bases are at work in RNA as in DNA, except that uracil takes the place of the DNA base thymine.) Each linear sequence of three bases (both in RNA and in DNA) corresponds uniquely to a single amino acid. The RNA sequence AAU thus dictates that the amino acid asparagine should be added to a polypeptide chain, GCA specifies alanine, and so on. A segment of the chromosomal DNA that directs the synthesis of a single type of protein constitutes a single gene.
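The three-bases-to-one-amino-acid rule described above can be sketched as a simple table lookup. The following Python fragment is only an illustration: the function names are hypothetical, and the table holds just a handful of the 64 codons (AAU and GCA come from the text; the others are filled in from the standard genetic code).

```python
# Partial codon table: a few of the 64 codons, chosen for illustration only.
CODON_TO_AMINO_ACID = {
    "AAU": "Asn",  # asparagine (named in the text)
    "AAC": "Asn",
    "GCA": "Ala",  # alanine (named in the text)
    "GCC": "Ala",
    "UCU": "Ser",  # serine
    "AUG": "Met",  # methionine
}


def dna_to_rna(dna):
    """Rewrite a DNA sequence in RNA letters (uracil replaces thymine)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Read an mRNA string three bases at a time and look up each codon.

    Codons missing from the partial table are reported as "?".
    Trailing bases that do not fill a whole codon are ignored.
    """
    mrna = mrna.upper()
    usable = len(mrna) - len(mrna) % 3
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TO_AMINO_ACID.get(codon, "?") for codon in codons]


print(translate("AAUGCAGCC"))  # ['Asn', 'Ala', 'Ala']
```

Note that each codon maps to exactly one amino acid, while two different codons (GCA and GCC) can map to the same one.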

[Figure 1 labels: Nucleus; Chromosomes; Separating strands of DNA; Daughter helix (two); A single nucleotide, showing phosphate, deoxyribose, and thymine, with hydrogen, carbon, and oxygen indicated.]

FIGURE 1. SOME DNA DETAILS. Apart from reproductive gametes, each cell of the human body contains 23 pairs of chromosomes, each a packet of compressed and entwined DNA. Every strand of the DNA is a huge natural polymer of repeating nucleotide units, each of which comprises a phosphate group, a sugar (deoxyribose), and a base (either adenine, thymine, cytosine, or guanine). Every strand thus embodies a code of four characters (A’s, T’s, C’s, and G’s), the recipe for the machinery of human life. In its normal state, DNA takes the form of a highly regular double-stranded helix, the strands of which are linked by hydrogen bonds between adenine and thymine (A–T) and between cytosine and guanine (C–G). Each such linkage is said to constitute a base pair; some three billion base pairs constitute the human genome. It is the specificity of these base-pair linkages that underlies the mechanism of DNA replication illustrated here. Each strand of the double helix serves as a template for the synthesis of a new strand, the nucleotide sequence of which is strictly determined. Replication thus produces twin daughter helices, each an exact replica of its sole parent.
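The template mechanism in this caption can be illustrated with a toy model in which a helix is just a pair of strings. This is a sketch under simplifying assumptions: the names `complement` and `replicate` are hypothetical, strands are paired position by position, and strand direction and copying errors are ignored.

```python
# Watson-Crick pairing rules from the caption: A with T, C with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the partner strand, base by base (position-wise, not reversed)."""
    return "".join(PAIR[base] for base in strand.upper())


def replicate(strand_a, strand_b):
    """Unzip a two-strand helix and build a new partner for each old strand."""
    daughter_1 = (strand_a, complement(strand_a))
    daughter_2 = (complement(strand_b), strand_b)
    return daughter_1, daughter_2


parent = ("ATCGGC", complement("ATCGGC"))
d1, d2 = replicate(*parent)
print(d1 == parent and d2 == parent)  # True
```

Because each base admits only one partner, either old strand alone is enough to rebuild the whole helix, which is why both daughters match the parent.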

A PLAN OF ACTION

In 1990 the Department of Energy and the National Institutes of Health developed a joint research plan for their genome programs, outlining specific goals for the ensuing five years. Three years later, emboldened by progress that was on track or even ahead of schedule, the two agencies put forth an updated five-year plan. Improvements in technology, together with the experience of three years, allowed an even more ambitious prospect.

In broad terms, the revised plan includes goals for genetic and physical mapping of the genome, DNA sequencing, identifying and locating genes, and pursuing further developments in technology and informatics. To a large extent, the following pages are devoted to a discussion of just what these goals mean, and what part the DOE is playing in pursuing them. In addition, the plan emphasizes the continuing importance of the ethical, legal, and social implications of genome research, and it underscores the critical roles of scientific training, technology transfer, and public access to research data and materials. Most of the goals focus on the human genome, but the importance of continuing research on widely studied “model organisms” is also explicitly recognized.

Among the scientific goals of human genome research, several are especially notable, as they provide clear milestones for future progress. In reciting them, however, it is important to note an underlying assumption of adequate research support. Such support is obviously crucial if the joint plan is to succeed. Some of the central goals for 1993–98 follow:

◆ Complete a genetic linkage map at a resolution of two to five centimorgans by 1995. As discussed on page 10, this goal was far surpassed by the fall of 1994.

◆ Complete a physical map at a resolution of 100 kilobases by 1998. This implies a genome map with 30,000 “signposts,” separated by an average of 100,000 base pairs. Further, each signpost will be a sequence-tagged site, a stretch of DNA with a unique and well-defined DNA sequence. Such a map will greatly facilitate “production sequencing” of the entire genome. By the end of 1995, molecular biologists were halfway to this goal: A physical map was announced with 15,000 sequence-tagged signposts. Physical mapping is discussed on pages 10–16.

◆ By 1998 develop the capacity to sequence 50 million base pairs per year in long continuous segments. Adequate fiscal investment and continuing progress beyond 1998 should then produce a fully sequenced human genome by the year 2005 or earlier. Sequencing is the subject of pages 16–26.

◆ Develop efficient methods for identifying and locating known genes on physical maps or sequenced DNA. The goals here are less quantifiable, but the aim is central to the Human Genome Project: to home in on and ultimately to understand the most important human genes, namely, the ones responsible for serious diseases and those crucial for healthy development and normal functions.

◆ Pursue technological developments in areas such as automation and robotics. A continuing emphasis on technological advance is critical. Innovative technologies, such as those described on pages 27–30, are the necessary underpinnings of future large-scale sequencing efforts.

◆ Continue the development of database tools and software for managing and interpreting genome data. This is the area of informatics, discussed on pages 30–31. The challenge is not so much the volume of data, but rather the need to mount a system compatible with researchers around the world, and one that will allow scientists to contribute new data and to freely interrogate the existing databases. The ultimate measure of success will be the ease with which biologists can fruitfully use the information produced by the genome project.

◆ Continue to explore the ethical, legal, and social implications of genome research. Much emphasis continues to be placed on issues of privacy and the fair use of genetic information. New goals focus on defining additional pertinent issues and developing policy responses to them, disseminating policy options regarding genetic testing services, fostering greater acceptance of human genetic variation, and enhancing public and professional education that is sensitive to sociocultural and psychological issues. This side of the genome project is discussed on pages 32–33.
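The arithmetic behind the physical-mapping and sequencing goals listed above can be checked in a few lines. This is only a back-of-the-envelope sketch: the constants are the round figures quoted in the text, and the variable names are illustrative.

```python
GENOME_BP = 3_000_000_000          # about three billion base pairs (from the text)
SIGNPOST_SPACING_BP = 100_000      # 100-kilobase resolution goal
SIGNPOSTS_BY_END_1995 = 15_000     # sequence-tagged sites announced by the end of 1995
RATE_1998_BP_PER_YEAR = 50_000_000 # sequencing capacity targeted for 1998

# One signpost per 100,000 base pairs across the whole genome.
signposts_needed = GENOME_BP // SIGNPOST_SPACING_BP

# Progress toward the mapping goal at the end of 1995.
fraction_mapped = SIGNPOSTS_BY_END_1995 / signposts_needed

# Time to sequence everything if capacity stayed at the 1998 target.
years_at_1998_rate = GENOME_BP / RATE_1998_BP_PER_YEAR

print(signposts_needed)    # 30000
print(fraction_mapped)     # 0.5
print(years_at_1998_rate)  # 60.0
```

The last figure suggests why the plan counts on continuing progress beyond 1998: at a constant 50 million base pairs per year, three billion base pairs would take about 60 years, so reaching the 2005 target presumably requires capacity to keep growing well past the 1998 level.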

[Figure 2 labels: Nucleus; DNA; Messenger RNA (mRNA); mRNA moves out of nucleus; Cytoplasm; Ribosome; Codon; Transfer RNA; Amino acids; Growing protein chain (Ala, Ala, Ser, Asn).]

FIGURE 2. FROM GENES TO PROTEINS. In the cell nucleus, RNA is produced by transcription, in much the same way that DNA replicates itself. RNA, however, substitutes the sugar ribose for deoxyribose and the base uracil for thymine, and is usually single-stranded. One form of RNA, messenger RNA or mRNA, conveys the DNA recipe for protein synthesis to the cell cytoplasm. There, bound temporarily to a cytoplasmic particle known as a ribosome, each three-base codon of the mRNA links to a specific form of transfer RNA (tRNA) containing the complementary three-base sequence. This tRNA, in turn, transfers a single amino acid to a growing protein chain. Each codon thus unambiguously directs the addition of one amino acid to the protein. On the other hand, the same amino acid can be added by different codons; in this illustration, the mRNA sequences GCA and GCC are both specifying the addition of the amino acid alanine (Ala).

Page 13: T K OURSELVES - ORNL · 2 T THE END OF THE ROAD in Little Cottonwood Canyon, near Salt Lake City, Alta is a place of near-mythic renown among skiers. In time it may well assume similar

10

NE OF THE CENTRAL GOALS ofthe Human Genome Projectis to produce a detailed “map”of the human genome. But,just as there are topographic

maps and political maps and highway maps ofthe United States, so there are different kindsof genome maps, the variety of whichis suggested in Figure 3. One type, a geneticlinkage map, is based on careful analysesof human inheritance patterns. It indicates

for each chromosome thewhereabouts of genes orother “heritable markers,”with distances measured incentimorgans, a measure ofrecombination frequency.During the formation ofsperm and egg cells, a processof genetic recombination—or“crossing over”—occurs inwhich pieces of genetic mate-rial are swapped betweenpaired chromosomes. Thisprocess of chromosomalscrambling accounts for thedifferences invariably seeneven in siblings (apart from

identical twins). Logically, the closer twogenes are to each other on a single chromo-some, the less likely they are to get split upduring genetic recombination. When theyare close enough that the chances of beingseparated are only one in a hundred, theyare said to be separated by a distance ofone centimorgan.

Exploring theGenomic Landscape

M A P P I N G T H E T E R R A I N

The role of human pedigrees nowbecomes clear. By studying family trees andtracing the inheritance of diseases and physi-cal traits, or even unique segments of DNAidentifiable only in the laboratory, geneticistscan begin to pin down the relative positionsof these genetic markers. By the end of 1994,a comprehensive map was available thatincluded more than 5800 such markers,including genes implicated in cystic fibrosis,myotonic dystrophy, Huntington disease,Tay-Sachs disease, several cancers, and manyother maladies. The average gap betweenmarkers was about 0.7 centimorgan.

Other maps are known as physical maps,so called because the distances between fea-tures are measured not in genetic terms, butin “real” physical units, typically, numbers ofbase pairs. A close analogy can thus bedrawn between physical maps and the roadmaps familiar to us all. Indeed, the analogycan be extended further. Just as small-scaleroad maps may show only large cities andindicate distances only between major fea-tures, so a low-resolution physical mapincludes only a relative sprinkling of chromo-somal landmarks. A well-known low-resolu-tion physical map, for example, is the familiarchromosomal map, showing the distinctivestaining patterns that can be seen in the lightmicroscope. Further, by a process known asin situ hybridization, specific segments of DNAcan be targeted in intact chromosomes byusing complementary strands synthesized inthe laboratory. These laboratory-made“probes” carry a fluorescent or radioactive

Just as there are topographic maps and political maps and highway maps, so there are different kinds of genome maps.


[Figure 3 graphic: chromosome 19 shown, from top to bottom, as a low-resolution physical map, a genetic linkage map with ordered markers (including INSR, RYR1, APOC2, and DM) and distances in centimorgans (21.5, 10.8, 9.9, 10.5, 17.9, 18.0), overlapping clones, a restriction map, and a base sequence. Labeled disease genes: insulin-resistant diabetes (INSR), familial hypercholesterolemia (LDLR), pseudoachondroplasia (COMP), myotonic dystrophy (DM), hemolytic anemia (GPI), and malignant hyperthermia (RYR1).]

FIGURE 3. GENOMIC GEOGRAPHY. The human genome can be mapped in a number of ways. The familiar and reproducible banding pattern of the chromosomes constitutes one kind of physical map, and in many cases, the positions of genes or other heritable markers have been localized to one band or another. More useful are genetic linkage maps, on which the relative positions of markers have been established by studying how frequently the markers are separated during a natural process of chromosomal shuffling called genetic recombination. The cryptically coded ordered markers near the top of this figure are physically mapped to specific regions of chromosome 19; some of them also constitute a low-resolution genetic linkage map. (Hundreds of genes and other markers have been mapped on chromosome 19; only a few are indicated here. See Figure 5 for a display of mapped genes.) A higher-resolution physical map might describe, as shown here, the cutting sites (the short vertical lines) for certain DNA-cleaving enzymes. The overlapping fragments that allow such a map to be constructed are then the resources for obtaining the ultimate physical map, the base-pair sequence for the human genome. At the bottom of this figure is an example of output from an automatic sequencing machine.


label, which can then be detected and thus pinpointed on a specific region of the chromosome. Figure 4 shows some results of fluorescence in situ hybridization (FISH). Of particular interest are probes known as cDNA (for complementary DNA), which are synthesized by using molecules of messenger RNA as templates. These molecules of cDNA thus hybridize to “expressed” chromosomal regions, that is, regions that directly dictate the synthesis of proteins. However, a physical map that depended only on in situ hybridization would be a fairly coarse one. Fluorescent tags on intact chromosomes cannot be resolved into separate spots unless they are two to five million base pairs apart.

Fortunately, means are also available to produce physical maps of much higher resolution, analogous to large-scale county maps that show every village and farm road, and indicate distances at a similar level of detail. Just such a detailed physical map is one that emerges from the use of restriction enzymes, DNA-cleaving enzymes that serve as highly selective microscopic scalpels (see “Tools of the Trade,” pages 17–19). A typical restriction enzyme known as EcoRI, for example, recognizes the DNA sequence GAATTC and selectively cuts the double helix at that site. One use of these handy tools involves cutting up a selected chromosome into small pieces, then cloning and ordering the resulting fragments. The cloning, or copying, process is a product of recombinant DNA technology, in which the natural reproductive machinery of a “host” organism (a bacterium or a yeast, for example) replicates a “parasitic” fragment of human DNA, thus producing the multiple copies needed for further study (see “Tools of the Trade”). By cloning enough such fragments, each overlapping the next and together spanning long segments (or even the entire length) of the chromosome, workers can eventually produce an ordered library of clones. Each contiguous block of ordered clones is known as a contig (a small one is shown in Figure 3), and the resulting map is a contig map. If a gene can be localized to a single fragment within a contig map, its physical location is thereby accurately pinned down. Further, these conveniently sized clones become resources for further studies by researchers around the world, as well as the natural starting points for systematic sequencing efforts.

TWO GIANT STEPS: CHROMOSOMES 16 AND 19

One of the signal achievements of the DOE genome effort so far is the successful physical mapping of chromosomes 16 and 19. The high-resolution chromosome 19 map, constructed at the Lawrence Livermore National Laboratory, is based on restriction fragments cloned in cosmids, synthetic cloning “vectors” modeled after bacteria-infecting viruses known as bacteriophages. Like a phage, a cosmid hijacks the cellular machinery of a bacterium to mass-produce its own genetic material, together with any “foreign” human DNA that has been smuggled into it. The foundation of the chromosome 19 map is a large set of cosmid contigs that were assembled by automated analysis of overlapping

FIGURE 4. FISHING FOR GENES. Fluorescence in situ hybridization (FISH) probes are strands of DNA that have been labeled with fluorescent dye molecules. The probes bind uniquely to complementary strands of chromosomal DNA, thus pinpointing the positions of target DNA sequences. In this example, one probe, whose fluorescence signal is shown in red, binds specifically to a gene (DSRAD) that codes for an important RNA-modifying enzyme. A second probe, whose signal appears in green, binds to a marker sequence whose location was already known. The previously unknown location of the DSRAD gene was thus accurately mapped to a narrow region on the long arm of chromosome 1.


[Figure 5 graphic: ideogram of chromosome 19 with bands p13.3, p13.2, p13.1, p12, q12, q13.1, q13.2, q13.3, and q13.4, with mapped gene symbols listed above and below the chromosome, among them INSR, LDLR, COMP, RYR1, APOC2, DM, and ERCC2.]

FIGURE 5. AN EMERGING GENE MAP. More than 250 genes have already been mapped to chromosome 19. Those listed on the lower half of this illustration have been assigned to specific cosmids and (except for those marked with asterisks) have been ordered on the Livermore physical map. Their positions are therefore known with far greater accuracy than shown here. The genes listed above the chromosome have been mapped to larger regions of the chromosome (or merely localized to chromosome 19 generally) and have not yet been assigned to cosmids in the Livermore database. The text mentions several of the most important genes mapped so far. Others include INSR, which codes for an insulin receptor and is involved in adult-onset diabetes; LDLR, a gene for a low-density lipoprotein receptor involved in hypercholesterolemia; and ERCC2, a DNA repair gene implicated in one form of xeroderma pigmentosum.


FIGURE 6. MAPPING CHROMOSOME 16. This much-reduced physical map of the short arm of human chromosome 16 summarizes the progress made at Los Alamos toward a complete map of the chromosome. A legible, fully detailed map of the chromosome is more than 15 feet long; only a few features of the map can be described here. Just below the schematic chromosome, the black arrowheads and the vertical lines extending the full length of the page signify “breakpoints” and indicate the portions of the chromosome maintained in separate cell cultures. The cultured portions typically extend from a breakpoint to one end of the chromosome. These breakpoints establish the framework for the Los Alamos mapping effort. Within this framework, some 700 megaYACs (shown in black) provide low-resolution coverage for essentially the entire chromosome. Smaller flow-sorted YACs (light blue, red, and black), together with about 4000 cosmids, assembled into about 500 cosmid contigs (blue and red), establish high-resolution coverage for 60% of the chromosome. Sequence-tagged sites (STSs) are shown as colored vertical lines above the megaYACs, and genes (green) and genetic markers (pink) that have been localized only to the breakpoint map are shown near the bottom. Also shown are cloned and uncloned disease regions, as well as those markers whose analogs have been identified among mouse chromosomes (see “The Mighty Mouse,” pages 24–25).

but unordered restriction fragments. These contigs span an estimated 54 million base pairs, more than 95 percent of the chromosome, excluding the centromere.
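The idea of grouping overlapping clones into contigs can be sketched in a few lines. This is a simplified illustration, not the Livermore procedure: it assumes each clone's start and end coordinates are already known, whereas the real mapping had to infer overlaps from shared restriction fragments. The clone names and coordinates are hypothetical.

```python
def build_contigs(clones):
    """Group clones into contigs of mutually overlapping neighbors.

    `clones` is a list of (name, start, end) tuples with start < end.
    Assumption: coordinates are known in advance (hypothetical data).
    """
    contigs = []
    for name, start, end in sorted(clones, key=lambda c: c[1]):
        # A clone joins the current contig only if it truly overlaps it.
        if contigs and start < contigs[-1]["end"]:
            contigs[-1]["end"] = max(contigs[-1]["end"], end)
            contigs[-1]["clones"].append(name)
        else:
            contigs.append({"start": start, "end": end, "clones": [name]})
    return contigs


if __name__ == "__main__":
    # Hypothetical cosmid positions, in thousands of base pairs.
    cosmids = [("c3", 100, 140), ("c1", 0, 40), ("c4", 130, 170), ("c2", 30, 75)]
    for contig in build_contigs(cosmids):
        print(contig["start"], contig["end"], contig["clones"])
```

In this toy data set the four clones fall into two contigs separated by a gap, which is the situation a mapping effort tries to close by finding clones that bridge the gap.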

Most of the contigs have been mapped by fluorescence in situ hybridization to visible chromosomal bands. Further, more than 200 cosmids have been more accurately ordered along the chromosome by a high-resolution FISH technique in which the distances between cosmids are determined with a resolution of about 50,000 base pairs. This ordered FISH map, with cosmid reference points separated by an average of 230,000 base pairs, provides the essential framework to which other cosmid contigs can be anchored. Moreover, the EcoRI restriction sites have been mapped on more than 45 million base pairs of the overall cosmid map. Over 450 genes and genetic markers have also been localized on this map, of which nearly 300 have been incorporated into the ordered map. Figure 5 shows the locations of the mapped genes. Among these genes is the one responsible for the most common form of adult muscular dystrophy (DM), which was identified in 1992 by an international consortium that included Livermore scientists. A second important disease gene (COMP), responsible for a form of dwarfism known as pseudoachondroplasia, has also been identified. And yet another gene, one linked to a form of congenital kidney disease, has been localized to a single contig spanning one million base pairs, but has not yet been precisely pinpointed. About 2000 other genes are likely to be found eventually on chromosome 19.

In a similar effort, the Los Alamos National Laboratory Center for Human Genome Studies has completed a highly integrated map of chromosome 16, a chromosome that contains genes linked to blood disorders, a second form of kidney disease, leukemia, and breast and prostate cancers. A readable display of this integrated map covers a sheet of paper more than 15 feet long; a portion of it, much reduced and showing only some of its central features, is reproduced here as Figure 6. The framework for the Los Alamos effort is yet another kind of map, a “cytogenetic breakpoint map” based on 78 lines of cultured cells, each a hybrid that contains mouse chromosomes and a fragment of human chromosome 16. Natural breakpoints in chromosome 16 are thus identified, leading to a breakpoint map that divides the chromosome into segments whose lengths average 1.1 million base pairs. Anchored to this framework are a low-resolution contig map based on YAC clones and a high-resolution contig map based largely on cosmids (for more on YACs, yeast artificial chromosomes, see “Tools of the Trade,” pages 17–19). The low-resolution map, comprising 700 YACs from a library constructed by the Centre d’Etude du Polymorphisme Humain (CEPH), provides practically complete coverage of the chromosome, except the highly repetitive DNA in the centromere region. The high-resolution map comprises some 4000 cosmid clones, assembled into about 500 contigs covering 60 percent of the chromosome. In addition, it includes 250 smaller YAC clones that have been merged with the cosmid contig map. The cosmid contig map


[Figure 6 graphic: short arm of chromosome 16 with bands p13.3, p13.2, p12.3, p12.1, and p11.2. Legend: breakpoints; STSs; low-resolution coverage (mega YACs); high-resolution coverage (flow-sorted YACs, contigs, cosmids); disease regions; cloned genes; genetic markers; markers homologous to mouse (mouse chromosome labels 11, 17, 16, and 7).]


is an especially important step forward, since it is a “sequence-ready” map. It is based on bacterial clones that are ideal substrates for DNA sequencing, and further, these clones have been restriction mapped to allow identification of a minimum set of overlapping clones for a large-scale sequencing effort.

The high- and low-resolution maps have been tied together by sequence-tagged sites (STSs), short but unique stretches of DNA sequence. They have also been integrated into the breakpoint map, and with genetic maps developed at the Adelaide Children’s Hospital and by CEPH. The integrated map also includes a transcription map of 1000 sequenced exons (expressed fragments of genes) and more than 600 other markers developed at other laboratories around the world.

GETTING DOWN TO DETAILS: SEQUENCING THE GENOME

Ultimately, though, these physical maps and the clones they point to are mere stepping stones to the most visible goal of the genome project, the string of three billion characters (A’s, T’s, C’s, and G’s) representing the sequence of base pairs that defines our species. Included, of course, would be the sequence for every gene, as well as the sequences for stretches of DNA whose functions we don’t yet know (but which may be involved in such little-understood processes as orchestrating gene expression in different parts of our bodies, at different times of our lives). Should anyone undertake to print it all out, the result would fill several hundred volumes the size of a big-city phone book.

Only the barest start has been made in taking this dramatic step in the Human Genome Project. Several hundred million base pairs have been sequenced and archived in databases, but the great majority of these are from short “sequence tags” on cloned fragments. Only about 30 million base pairs of human DNA (roughly one percent of the total) have been sequenced in longer stretches, the longest being about 685,000 base pairs long. Even more daunting is the realization that we will eventually need to sequence many parts of the genome many times, thus to reveal differences that indicate various forms of the same gene.

Hence, as with so many human enterprises, the challenge of sequencing the genome is largely one of doing the job cheaper and faster. At the beginning of the project, the cost of sequencing a single base pair was between $2 and $10, and one researcher could produce between 20,000 and 50,000 base pairs of continuous, accurate sequence in a year. Sequencing the genome by the year 2005 would therefore likely cost $10–20 billion and require a dedicated cadre of at least 5000 workers. Clearly, a major effort in technology development was called for, an effort that would drive the cost well below $1 per base pair and that would allow automation of the sequencing process. From the beginning, therefore, the DOE has emphasized programs to pave the way for expeditious and economical sequencing efforts: programs to develop new technologies, including new cloning vectors, and to establish suitable resources for sequencing, including clone libraries and libraries of expressed sequences.

Efforts to develop new cloning vectors have been especially productive. YACs remain a classic tool for cloning large fragments of human DNA, but they are not perfect. Some regions of the genome, for example, resist cloning in YACs, and others are prone to rearrangement. New vectors such as bacterial artificial chromosomes (BACs), P1 phages, and P1-derived artificial cloning systems (PACs) have thus been devised to address these problems. These new approaches are critical for ensuring that the entire genome can be faithfully represented in clone libraries, without the danger of deletions, rearrangements, or spurious insertions.

These maps are mere stepping stones to the string of three billion characters (A’s, T’s, C’s, and G’s) that defines our species.
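The cost and workforce estimates quoted in the sequencing discussion ($2 to $10 per base pair, 20,000 to 50,000 base pairs per researcher per year, three billion base pairs in all) can be checked with simple arithmetic. The sketch below is a rough reconstruction under one stated assumption that is not in the text: the work is spread over 15 years. The constant and function names are illustrative only.

```python
GENOME_BP = 3_000_000_000            # "three billion characters"
COST_PER_BP = (2, 10)                # dollars, at the start of the project
BP_PER_WORKER_YEAR = (20_000, 50_000)
PROJECT_YEARS = 15                   # assumption, not stated in the text


def cost_range():
    """Total cost in dollars at the low and high per-base prices."""
    low, high = COST_PER_BP
    return GENOME_BP * low, GENOME_BP * high


def workforce_range(years=PROJECT_YEARS):
    """Workers needed if the job is spread evenly over `years` years."""
    slowest, fastest = BP_PER_WORKER_YEAR
    return GENOME_BP / fastest / years, GENOME_BP / slowest / years


if __name__ == "__main__":
    low_cost, high_cost = cost_range()
    print(f"cost: ${low_cost / 1e9:.0f} to ${high_cost / 1e9:.0f} billion")
    few, many = workforce_range()
    print(f"workers: {few:.0f} to {many:.0f}")
```

Under these assumptions the totals bracket the figures in the text: the quoted $10–20 billion falls inside the computed cost range, and “at least 5000 workers” falls inside the computed workforce range.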


Tools of the Trade

Over the next decade, as molecular biologists tackle the task of sequencing the human genome on a massive scale, any number of innovations can be expected in mapping and sequencing technologies. But several of the central tools of molecular genetics are likely to stay with us, much improved perhaps, but not fundamentally different. One such tool is the class of DNA-cutting proteins known as restriction enzymes. These enzymes, the first of which were discovered in the late 1960s, cleave double-stranded DNA molecules at specific recognition sites, usually four or six nucleotides long. For example, a restriction enzyme called EcoRI recognizes the single-strand sequence GAATTC and invariably cuts the double helix as shown in the illustration on the right.

When digested with a particular restriction enzyme, then, identical segments of human DNA yield identical sets of restriction fragments. On the other hand, DNA from the same genomic region of two different people, with their subtly different genomic sequences, can yield dissimilar sets of fragments, which then produce different patterns when sorted according to size.

This leads directly to discussion of a second essential tool of modern molecular genetics, gel electrophoresis, for it is by electrophoresis that DNA fragments of different sizes are most often separated. In classical gel electrophoresis, electrically charged macromolecules are caused to migrate through a polymeric gel under the influence of an imposed static electric field. In time the molecules sort themselves by size, since the smaller ones move more rapidly through the gel than do larger ones. In 1984 a further advance was made with the invention of pulsed-field gel electrophoresis, in which the strength and direction of the applied field are varied rapidly, thus allowing DNA strands of more than 50,000 base pairs to be separated.

DIGESTING DNA. Isolated from various bacteria, restriction enzymes serve as microscopic scalpels that cut DNA molecules at specific sites. The enzyme EcoRI, for example, cuts double-stranded DNA only where it finds the sequence GAATTC. The resulting fragments can then be separated by gel electrophoresis. The electrophoresis pattern itself can be of interest, since variations in the pattern from a given chromosomal region can sometimes be associated with variations in genetic traits, including susceptibilities to certain diseases. Knowledge of the cutting sites also yields a kind of physical map known as a restriction map.

[Figure graphic: chromosomal DNA digested with EcoRI restriction enzyme, showing the cutting site within the recognition sequence, the resulting restriction fragments, and their separation by gel electrophoresis.]
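The action of a restriction enzyme, and the size sorting that follows, can be mimicked on a single strand written as text. This is a minimal sketch: it treats only one strand, uses a made-up 20-base example sequence, and relies on the well-established fact that EcoRI cuts between the G and the first A of GAATTC. The function names are illustrative.

```python
def ecori_digest(sequence, site="GAATTC", cut_offset=1):
    """Cut `sequence` at every occurrence of the recognition site.

    cut_offset=1 places the cut after the G of GAATTC, as EcoRI does.
    Only one strand is modeled here (a simplification).
    """
    sequence = sequence.upper()
    fragments = []
    start = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[start:])
    return fragments


def gel_order(fragments):
    """Fragment lengths sorted smallest first (the fastest movers in a gel)."""
    return sorted(len(fragment) for fragment in fragments)


if __name__ == "__main__":
    dna = "AAGAATTCCCGGGAATTCTT"  # hypothetical sequence with two EcoRI sites
    pieces = ecori_digest(dna)
    print(pieces)
    print(gel_order(pieces))
```

A single base change that destroys or creates a GAATTC site changes the list of fragment lengths, which is why two people's DNA can give different electrophoresis patterns.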


A third necessary tool is some means of DNA “amplification.” The classic example is the cloning vector, which may be circular DNA molecules derived from bacteria or from bacteriophages (viruslike parasites of bacteria), or artificial chromosomes constructed from yeast or bacterial genomic DNA. The characteristic all these vectors share is that fragments of “foreign” DNA can be inserted into them, whereby the inserted DNA is replicated along with the rest of the vector as the host reproduces itself. A yeast artificial chromosome, or YAC, for instance, is constructed by assembling the essential functional parts of a natural yeast chromosome (DNA sequences that initiate replication, sequences that mark the ends of the chromosomes, and sequences required for chromosome separation during cell division), then splicing in a fragment of human DNA. This engineered chromosome is then reinserted into a yeast cell, which reproduces the YAC during cell division, as if it were part of the yeast’s normal complement of chromosomes. The result is a colony of yeast cells, each containing a copy, or clone, of the same fragment of human DNA. One of the important achievements of the Human Genome Project has been to establish several libraries of such cloned fragments, using several different vectors (bacterial artificial chromosomes, P1 phages, and P1-derived cloning systems), that cover the entire human genome.

Another way of amplifying DNA is the polymerase chain reaction, or PCR. This enzymatic replication technique requires that initiators, or PCR primers, be attached as short complementary strands at the ends of the separated DNA fragments to be replicated. An enzyme then completes the synthesis of the complementary strands, thus doubling the amount of DNA originally present. Again and again, the strands can be separated and the polymerase reaction repeated, so effectively, in fact, that DNA can be amplified by 100,000-fold in less than three hours. As with cloning vectors, the result is a large collection of copies of the original DNA fragment.

When a clone library can be ordered (that is, when the relative positions on the human chromosomes can be established for all the fragments), one then has the perfect resource for achieving the project’s central goal, sequencing the human genome. How the sequencing is actually done can be illustrated by the most popular method in current use, the Sanger procedure, which is depicted schematically on the facing page. The first step is to prime each identical DNA strand in a preparation of cloned fragments. The preparation is then divided into four portions, each of which contains a different reaction-terminating nucleotide, together with the usual reagents for replication. In one batch, the replication reaction always produces complementary strands that end with A; in another, with G; and so on. Gel electrophoresis is used to sift the resulting products according to size, allowing one to infer the exact nucleotide sequence for the original DNA strand.

SPELLING OUT THE ANSWER. In the much-automated Sanger sequencing method, the single-stranded DNA to be sequenced is “primed” for replication with a short complementary strand at one end. This preparation is then divided into four batches, and each is treated with a different replication-halting nucleotide (depicted here with a diamond shape), together with the four “usual” nucleotides. Each replication reaction then proceeds until a reaction-terminating nucleotide is incorporated into the growing strand, whereupon replication stops. Thus, the “C” reaction produces new strands that terminate at positions corresponding to the G’s in the strand being sequenced. (Note that when long strands are being sequenced the concentration of the reaction-terminating nucleotide must be carefully chosen, so that a “normal” C is usually paired with a G; otherwise, replication would typically stop with the first or second G.) Gel electrophoresis, with one lane per reaction mixture, is then used to separate the replication products, from which the sequence of the original single strand can be inferred.

One of the important achievements of the project has been to establish several libraries of cloned fragments covering the entire human genome.
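The 100,000-fold amplification quoted for the polymerase chain reaction follows from repeated doubling. The sketch below counts how many doublings are needed, under the idealized assumption (not stated in the text) that every cycle exactly doubles the amount of DNA; real reactions fall somewhat short of that. The function name is illustrative.

```python
def cycles_needed(fold_amplification):
    """Smallest number of perfect doublings that reaches the target fold."""
    cycles = 0
    copies = 1
    while copies < fold_amplification:
        copies *= 2   # idealized: each cycle doubles the DNA present
        cycles += 1
    return cycles


if __name__ == "__main__":
    n = cycles_needed(100_000)
    print(n, 2 ** n)
```

Under perfect doubling, 17 cycles suffice, since 2 to the 17th power is 131,072. Fitting that many cycles into three hours allows roughly ten minutes per cycle.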


[Figure graphic: the Sanger sequencing method. The strand to be sequenced is primed; four reaction mixtures are prepared, each including a different replication-stopping nucleotide; the replication products (those of the “C” reaction are shown) are separated by gel electrophoresis; and the sequence is read as the complement of the bands containing labeled strands.]
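The logic of the Sanger method can be traced with a short simulation. This is a deliberately simplified sketch: it ignores strand direction, the length of the primer, and labeling chemistry, and it assumes that every possible terminated product is present in each reaction. The function names are illustrative; the 14-base example strand is the one shown in the figure.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sanger_lanes(template):
    """Return the product lengths expected in each of the four reactions.

    The lane for terminator X holds the lengths at which the newly made
    strand ends in X, that is, where the template carries X's complement.
    """
    new_strand = [COMPLEMENT[base] for base in template]
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read bands from shortest to longest, then take the complement."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    new_strand = "".join(base_at_length[n] for n in sorted(base_at_length))
    template = "".join(COMPLEMENT[base] for base in new_strand)
    return new_strand, template


if __name__ == "__main__":
    strand = "ATTCAGCAGGACTA"
    lanes = sanger_lanes(strand)
    print(lanes["C"])          # lengths of products from the "C" reaction
    print(read_gel(lanes)[1])  # recovered sequence of the original strand
```

The products of the “C” reaction end at exactly the positions where the original strand has a G, as the caption describes, and reading all four lanes together recovers the original strand.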


Marked progress is also evident in the development of sequencing technologies, though all of those in widespread current use are still based on methods developed in 1977 by Allan Maxam and Walter Gilbert and by Frederick Sanger and his coworkers (see “Tools of the Trade,” pages 17–19). Both of these methods rely on gel-based electrophoresis systems to separate DNA fragments, and recent advances in commercial systems include increasing the number of gel lanes, decreasing run times, and enhancing the accuracy of base identification. As a result of such improvements, a standard sequencing machine can now turn out raw, unverified sequences of 50,000 to 75,000 bases per day.

Equally important to the sequencing goals of the genome project is a rational system for organizing and distributing the material to be sequenced. The DOE’s commitment to such resources dates back to 1984, when it organized the National Laboratory Gene Library Project. Based on cell- and chromosome-sorting technologies developed at Livermore and Los Alamos, libraries of clones were established for each of the human chromosomes, and the individual clones are widely available for mapping and for isolating genes. These clones were invaluable in such notable “gene hunts” as the successful searches for the cystic fibrosis and Huntington disease genes. More recently, as more efficient vectors have become available, complete human DNA libraries have been established using BACs, PACs, and YACs.

Another critical resource is being assembled in an effort known as I.M.A.G.E. (Integrated Molecular Analysis of Genomes and their Expression), cofounded by the Livermore Human Genome Center. The aim is a master set of mapped and sequenced human cDNA, representing the expressed parts of the human genome. By early 1996, I.M.A.G.E. had distributed over 250,000 partial and complete cDNA clones, most of them with one or both ends sequenced to provide unique identifiers. These identifiers, expressed sequence tags (ESTs), are usually 300–500 base pairs each. Twenty-five hundred genes have also been newly mapped as part of this coordinated effort.

SHOTGUNS AND TRANSPOSONS

Such advances as these, in both technology development and the assembly of resource libraries, have brought much nearer the day when “production sequencing” can begin. A great deal of variety remains, however, in the approaches available to sequencing the human genome, and it is not yet clear which will prove the most efficient and most cost-effective way to read long stretches of DNA over the next decade. One of the available choices, for example, is between “shotgun” and “directed” strategies. Another is the degree of redundancy: that is, how many times must a given strand be sequenced to ensure acceptable confidence in the result?

Shotgun sequencing derives its name from the randomly generated DNA fragments that are the objects of scrutiny. Many copies of a single large clone are broken into pieces of perhaps 1500 base pairs, either by restriction enzymes or by physical shearing. Each fragment is then separately cloned, and a convenient portion of it sequenced. A computational assembly process then compares the terminal sequences of the many fragments and, by finding overlaps that indicate neighboring fragments, constructs an ordered library for the parent clone. The members of this ordered library can then be sequenced from end to end to yield a complete sequence for the parent. The statistics involved in taking this approach require that many copies of the original clone be randomly fragmented, if no gaps are to be tolerated in the final sequence. A benefit is that the final sequence is highly reliable; the main disadvantage is that the same sequence must be done many times (in the many overlapping fragments). Nevertheless, shotgun

Advances have brought much nearer the day when “production sequencing” can begin.


sequencing has been the primary means for generating most of the genomic sequence data in public DNA databases. This includes the longest contiguous fragment of sequenced human DNA, from the human T-cell receptor beta region, of about 685,000 base pairs, a product of DOE-supported work at the University of Washington.
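The computational assembly step in shotgun sequencing, finding overlaps between fragment ends and joining neighbors, can be sketched with a simple greedy procedure. This is a toy illustration, not the software used by any of the groups mentioned: it assumes error-free reads, a non-repetitive sequence, and no read wholly contained inside another. The function names, the minimum overlap of three bases, and the example reads are all hypothetical.

```python
def overlap(left, right, min_len):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    n = overlap(left, right, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # remaining reads do not overlap: gaps in coverage
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    fragments = ["CTAGGCTTAACG", "ATTCAGCAGG", "CAGGACTAGG"]
    print(greedy_assemble(fragments))
```

If the random fragments happen to leave part of the parent clone uncovered, the procedure stops with more than one piece, which is why the text notes that many copies must be fragmented if no gaps are to be tolerated.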

The shotgun strategy is also being used at the Genome Therapeutics Corporation and The Institute for Genomic Research (TIGR), as part of the DOE-supported Microbial Genome Initiative. Genome Therapeutics has sequenced 1.8 million base pairs of Methanobacterium thermoautotrophicum, a bacterium important in energy production and bioremediation, and TIGR has successfully sequenced the complete genomes of three free-living bacteria, Haemophilus influenzae (1,830,137 base pairs; an effort supported mostly by private funds), Mycoplasma genitalium (580,070 base pairs), and Methanococcus jannaschii (1,739,933 base pairs).

The alternative to shotgun sequencing is a directed approach, in which one seeks to sequence the target clone from end to end with a minimum of duplication. The essence of this approach is embodied in a technique known as primer walking. Starting at one end of a single large fragment, one replicates a stretch of DNA (say, 400 base pairs long) that can be sequenced in one run. With the sequence for this first segment in hand, the next stretch of DNA, just overlapping the first, is then tackled in the same way. In principle, one can thus “walk” the entire length of the original clone. Unfortunately, this conceptually simple approach has been historically beset with disadvantages, mainly the expense and inconvenience of custom-synthesizing a primer as the necessary starting point for each sequencing step. The widely automated Sanger sequencing method involves a DNA replication step that must be “primed” by a DNA fragment that is complementary to 15 to 20 base pairs of the strand to be sequenced (see “Tools of the Trade,” pages 17–19). Until recently, making these primers was an expensive and time-consuming business, but recent innovations have made

Exploring the Genomic Landscape

21

Large clone

Subclones

Select ordered set of subclones

Sequence regions on both sides of transposonsSelect subclones

to yield minimum tiling path

3000 bp

3000 bp+ transposons ( )

FIGURE 7. TAKING A DIRECTED APPROACH. One directed sequencingstrategy exploits a naturally occurring genetic element known as a transposon. The startingpoint is an ordered set of subclones, each about 3000 base pairs long, derived from a muchlarger clone (say, a YAC). For each subclone, a preparation is then made in which transposonsinsert themselves randomly into the subclone—on average, one transposon in each 3000-base-pair strand. The positions of the transposons are mapped, and a set of strands is selected suchthat the insertion points are about 300 base pairs apart. Sequencing then proceeds in bothdirections from the transposon insertion points, using the known transposon sequence as aprimer. The full set of overlapping regions yields the sequence for the entire subclone, and thesequences of the full set of subclones yield the sequence for the larger original clone.

primer walking, and similar directed strate-gies, more and more economically feasible.
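As a rough illustration of the primer-walking loop described in the text, the following Python sketch simulates a walk along a toy clone. The function name `primer_walk`, the 18-base primer length, and the randomly generated 3000-base template are assumptions made for the example; only the 400-base run length is taken from the text, and real primer design involves considerations that are ignored here.

```python
import random

def primer_walk(template, read_len=400, primer_len=18):
    """Simulate primer walking along a clone.

    Returns the reconstructed sequence and the list of custom primers
    that had to be synthesized (one per step after the first run).
    """
    # First run: assumed to be primed from known vector sequence at the clone end.
    known = template[:read_len]
    primers = []
    while len(known) < len(template):
        # Design the next primer from the last bases already sequenced.
        primer = known[-primer_len:]
        primers.append(primer)
        # Assumption: the primer anneals only at the end of the known region,
        # so the next run reads the bases that immediately follow it.
        start = len(known)
        known += template[start:start + read_len]
    return known, primers

if __name__ == "__main__":
    random.seed(0)  # arbitrary seed so the toy clone is reproducible
    clone = "".join(random.choice("ATCG") for _ in range(3000))
    walked, primers = primer_walk(clone)
    print(walked == clone, len(primers))
```

Under these assumptions a 3000-base clone takes eight sequencing runs and therefore seven custom primers, which is the per-step synthesis cost the text identifies as the historical drawback of the method.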

One way to deal with the primer bottleneck, for example, is to use sets of very short fragments to prime the next sequencing step. As an illustration, the four nucleotides (A, T, C, and G) can be ordered in more than 68 billion ways to create an 18-base primer, an imposing set of possibilities. But it is eminently practical to create a library of the 4096 possible 6-base primers. Three of these “6-mers” can be matched to the end of the


Berkeley researchers are interested in a region of about two million base pairs that is implicated in 15 to 20 percent of all primary breast carcinomas. As an example of the kind of output these efforts produce, Figure 8 shows a stretch of sequence data from chromosome 5.

Researchers supported by the DOE at the University of Utah are also pursuing the use of directed sequencing. In addition, they have developed a methodology for “multiplex” DNA sequencing, which offers a way of increasing throughput with either shotgun or directed approaches. By attaching a unique identifying sequence to each sequencing sample in a mixture of, say, 50 such samples, the entire mixture can be analyzed in a single electrophoresis lane. The 50 samples can be resolved sequentially by probing, first, for bands containing the first identifier, then for bands containing the second, and so forth. In a similar way, multiplexing can also be used for mapping. The Utah group is now able to map almost 5000 transposons in a single experiment, and they are using multiplexing in concert with a directed sequencing strategy to sequence the 1.8 million base pairs of the thermophilic microbe Pyrococcus furiosus and two important regions of human chromosome 17.

The completed physical maps of chromosomes 16 and 19, with their extensive coverage in many different kinds of cloning vectors, are especially ripe for large-scale sequencing. Los Alamos scientists have therefore begun sequencing chromosome 16, focusing special effort on locating the estimated 3000 expressed genes on that chromosome and using those sites as starting points for directed genomic sequencing. A region of 60,000 base pairs has already been sequenced around the adult polycystic kidney gene, and good starts have been made in mapping other genes. Interestingly, even random sequencing has led to the identification of gene DNA in over 15 percent of the samples, confirming the apparent high density of genes on this chromosome. Between chromosome 16 and the short arm of chromosome 5, another Los Alamos target, the genome center there

fragment to be sequenced, thus serving as an 18-base primer. This modular primer technology, developed at the Brookhaven National Laboratory, is currently being applied to Borrelia burgdorferi, the organism that causes Lyme disease; a 34,000-base-pair fragment has already been sequenced.
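The arithmetic behind the modular primer idea can be checked directly. The short Python sketch below counts the possible primers of each length and splits an 18-base primer into three 6-mers drawn from a complete library. The function names and the example primer are illustrative assumptions, not part of the Brookhaven method.

```python
from itertools import product

BASES = "ATCG"

def count_primers(length):
    """Number of distinct primers of the given length (4 choices per base)."""
    return len(BASES) ** length

def hexamer_library():
    """Every possible 6-base primer, as a set of strings."""
    return {"".join(p) for p in product(BASES, repeat=6)}

def split_into_hexamers(primer18):
    """Split an 18-base primer into three consecutive 6-mers."""
    if len(primer18) != 18:
        raise ValueError("expected an 18-base primer")
    return [primer18[i:i + 6] for i in range(0, 18, 6)]

if __name__ == "__main__":
    print(count_primers(18))        # 68719476736, i.e. more than 68 billion
    print(count_primers(6))         # 4096
    library = hexamer_library()
    parts = split_into_hexamers("AGGTCATCGATTCACGGA")  # arbitrary example
    print(parts, all(p in library for p in parts))
```

Because the library contains every 6-mer, any 18-base priming site can be covered by three stock pieces, so no custom synthesis is needed for each step.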

Another directed approach uses a naturally occurring genetic element called a transposon, which insinuates itself more or less randomly in longer DNA strands. This predilection for random insertion and the fact that the transposon’s DNA sequence is well known are the keys to the sequencing strategy depicted schematically in Figure 7. The largest clones are broken into smaller subclones (each of about 3000 base pairs), which then become the targets of the transposons. Multiple copies of each subclone are exposed to the transposons, and reaction conditions are controlled to yield, on average, a single insertion in each 3000-base-pair strand. The individual strands are then analyzed to yield, for each, the approximate position of the inserted transposon. By mapping these positions, a “minimum tiling path” can be determined for each subclone; that is, a set of strands can be identified whose transposon insertions are roughly 300 base pairs apart. In this set of strands, the region around each transposon is then sequenced, using the inserted transposons as starting points. The known transposon sequence allows a single primer to be used for sequencing the full set of overlapping regions.
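One simple way to picture the selection of a minimum tiling path is as a greedy walk over the mapped insertion points. The sketch below is an assumption about how such a selection could be done, not a description of the software actually used: `minimum_tiling_path`, its parameters, and the sample positions are all hypothetical. At each step it picks the farthest insertion that is still within the allowed gap of the previous one.

```python
def minimum_tiling_path(insertions, clone_len=3000, max_gap=300):
    """Pick as few insertion sites as possible so that consecutive sites
    (and both clone ends) are never more than `max_gap` bases apart."""
    sites = sorted(insertions)
    path = []
    last = 0          # start of the subclone
    i = 0
    while clone_len - last > max_gap:
        best = None
        # Scan every site within reach and remember the farthest one.
        while i < len(sites) and sites[i] - last <= max_gap:
            best = sites[i]
            i += 1
        if best is None:
            raise ValueError(f"no insertion within {max_gap} bp of position {last}")
        path.append(best)
        last = best
    return path

if __name__ == "__main__":
    mapped = [50, 120, 280, 310, 560, 600, 850, 900, 1150, 1400,
              1450, 1700, 1980, 2000, 2250, 2300, 2550, 2800, 2820]
    print(minimum_tiling_path(mapped))
```

With the sample positions above, 10 of the 19 mapped insertions are enough to span the 3000-base subclone. A gap in the map that exceeds the limit raises an error, signaling that more insertions would have to be mapped.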

At the Lawrence Berkeley National Laboratory, this technique has been used to sequence over 1.5 million base pairs of DNA on human chromosomes 5 and 20, as well as over three million base pairs from the fruit fly Drosophila melanogaster. On chromosome 5, interest focuses on a region of three million base pairs that is rich in growth factor and receptor genes; whereas, on chromosome 20,


has produced almost two million base pairs of human DNA sequence.

A parallel effort is under way at Livermore on chromosome 19 and other targeted genomic regions. Using a shotgun approach, researchers there have completed over 1.3 million bases of genomic sequence. Initially, they are attacking two major regions of chromosome 19: one of about two million base pairs, containing several genes involved in DNA repair and replication, and another of approximately one million base pairs, containing a kidney disease gene. The Livermore scientists are making use of the I.M.A.G.E. cDNA resource to sequence the cDNA from these regions, along with the associated segments of the genome. In addition, Livermore scientists have targeted DNA repair gene regions throughout the genome and, in many cases, have done comparative sequencing of these genes in other

FIGURE 8. SEQUENCE DATA: THE FINAL PRODUCT. The ultimate description of the genome, though only a prelude to full understanding, is the base-pair sequence. This computer display shows results from the use of transposons at Berkeley. The array of triangles represents the transposons inserted into a 3000-base-pair subclone; the 11 selected by the computer to build a minimum tiling path are shown below the heaviest black line. The subclone segments sequenced by using these 11 starting points are depicted by the horizontal lines; the arrowheads indicate the sequencing directions. The expanded region between bases 2042 and 2085 is covered by three sequencing reactions, which produced the three traces at the bottom of the figure. Above the traces, the results are summarized, together with a consensus sequence (just below the numbers).

Continues on p. 26


The Mighty Mouse

The human genome is not so very different from that of chimpanzees or mice, and it even shares many common elements with the genome of the lowly fruit fly. Obviously, the differences are critical, but so are the similarities. In particular, genetic experiments on other organisms can illuminate much that we could not otherwise learn about homologous human genes, that is, genes that are basically the same in the two species.

In some cases, the connection between a newly identified human gene and a known health disorder can be quickly established. More often, however, clear links between cloned genes and human hereditary diseases or disease susceptibilities are extremely elusive. Diseases that are modified by other genetic predispositions, for example, or by environment, diet, and lifestyle can be exceedingly difficult to trace in human families. The same holds for very rare diseases and for genetic factors contributing to birth defects and other developmental disorders. By contrast, disorders such as these can sometimes be followed relatively easily in animal systems, where uniform genetic backgrounds and controlled breeding schemes can be used to avoid the variability that often confounds human population studies. As a consequence, researchers looking for clues to the causes of many complex health problems are focusing more and more attention on model animal systems.

Among such systems, which range in complexity from yeast and bacteria to mammals, the most prominent is the mouse. Because of its small size, high fertility rate, and experimental manipulability, the mouse offers great promise in studying the genetic causes and pathological progress of ailments, as well as understanding the genetic role in disease susceptibility. In pursuing such studies, the DOE is exploiting several resources, among them the experimental mouse genetics facility at the Oak Ridge National Laboratory. Initially established for genetic risk assessment and toxicology studies, the Oak Ridge facility is one of the world’s largest. Mutant strains there express a variety of inherited developmental and health disorders, ranging from dwarfism and limb deformities to sickle cell anemia, atherosclerosis, and unusual susceptibilities to cancer.

Most of these existing mutant strains have arisen from random alterations of genes, caused by the same processes that occur naturally in all living populations. However, other, more directed means of gene alteration are also available. So-called transgenic methods, which have been developed and refined over the past 15 years, allow DNA sequences engineered in the laboratory to be introduced directly into the genomes of mouse embryos. The embryos are subsequently transferred to a foster mother, where they develop into mice carrying specifically designed alterations in a particular gene. The differences in form, basic health, fertility, and longevity produced by these “designer mutations” then allow researchers to study the effects of genetic defects that can mimic those found in human patients. The payoff can be clues that aid in the design of drugs and other treatments for the human diseases.

The Human Genome Center at Berkeley is using mice for similar purposes. In vivo libraries of overlapping human genome fragments (each 100,000 to 1,000,000 base pairs long) are being propagated in transgenic mice. The region of chromosome 21 responsible for Down syndrome, for example, is now almost fully represented in a panel of transgenic mice. Such libraries have several uses. For example, the precise biochemical means by which identified genes produce their effects can be studied in detail, and new genes can be recognized by analyzing the effects of particular genome fragments on the transgenic animals. In such ways, the promise of the massive effort to map and sequence the human genome can be translated into the kind of biological knowledge coveted by pharmaceutical designers and medical researchers.

Adding to the potential value of mutant mice as models for human genetic disease is growing evidence of similarities between mouse and human genes. Indeed, practically every human gene appears to have a counterpart in the mouse genome. Furthermore, the related mouse and human genes often share very similar DNA sequences and the same basic biological function. If we imagine that the 23 pairs of human chromosomes were shattered into smaller blocks (to yield a total of, say, 150 pieces, ranging in size from very small bits containing just a few genes to whole chromosome arms), those pieces could be reassembled to produce a serviceable model of the mouse genome. This mouse genome jigsaw puzzle is shown in the accompanying illustration. Thanks to this mouse-human genomic homology, a newly located gene on a human chromosome can often lead to a confident prediction of where a closely related gene will be found in the mouse, and vice versa.

Thus, a crippling heritable muscle disorder in mice maps to a location on the mouse X chromosome that is closely analogous to the map location for the X-linked human Duchenne muscular dystrophy gene (DMD). Indeed, we now know that these two similar diseases are caused by the mouse and human versions of the same gene. Although mutations in the mouse mdx gene produce a muscle disease that is less severe than the heartbreaking, fatal disease resulting from the DMD mutation in humans, the two genes produce proteins that function in very similar ways and that are clearly required for normal muscle development and function in the corresponding species. Likewise, the discovery of a mouse gene associated with pigmentation, reproductive, and blood cell defects was the crucial key to uncovering the basis for a human disease known as the piebald trait. Owing to such close human-mouse relationships as these, together with the benefits of transgenic technologies, the mouse offers enormous potential in identifying new human genes, deciphering their complex functions, and even treating genetic diseases. ❖

[Illustration panels: Mouse chromosomes; Human chromosomes]

OF MICE AND MEN. The genetic similarity (or homology) of superficially dissimilar species is amply demonstrated here. The full complement of human chromosomes can be cut, schematically at least, into about 150 pieces (only about 100 are large enough to appear in this illustration), then reassembled into a reasonable approximation of the mouse genome. The colors of the mouse chromosomes and the numbers alongside indicate the human chromosomes containing homologous segments. This piecewise similarity between the mouse and human genomes means that insights into mouse genetics are likely to illuminate human genetics as well.


species, especially the mouse. Such comparative sequencing has identified conserved sequence elements that might act as regulatory regions for these genes and has also assisted in the identification of gene function (see “The Mighty Mouse,” pages 24–25).

HOW GOOD IS GOOD ENOUGH?

The goal of most sequencing to date has been to guarantee an error rate below 1 in 10,000, sometimes even 1 in 100,000. However, the difference between one human being and another is more like one base pair in five hundred, so most researchers now agree that one error in a thousand is a more reasonable standard. To assure a higher level of confidence, and perhaps to uncover important individual differences, the most biologically or medically important regions would still be sequenced more exhaustively, but using this lowered standard would greatly reduce the cost of acquiring sequence data for the bulk of human DNA.
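To put these rates side by side, the following Python sketch scales them to a whole genome. The genome length of roughly three billion base pairs is an assumption supplied for the example, and `expected_events` is a hypothetical helper; the rates themselves are the ones quoted above.

```python
GENOME_BP = 3_000_000_000  # assumption: roughly three billion base pairs

def expected_events(one_in, length_bp=GENOME_BP):
    """Expected number of events that occur once per `one_in` bases."""
    return length_bp / one_in

if __name__ == "__main__":
    differences = expected_events(500)   # person-to-person variation
    for one_in in (100_000, 10_000, 1_000):
        errors = expected_events(one_in)
        print(f"1 error in {one_in:>7,}: {errors:>11,.0f} errors, "
              f"{errors / differences:.3f} per true difference")
```

At one error in a thousand, sequencing errors would number about half as many as the natural differences between two individuals, which is the scale on which the relaxed standard is being judged.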

With this philosophy in mind, Los Alamos scientists have begun a project to determine the cost and throughput of a low-redundancy sequencing strategy known as sample sequencing (SASE, or “sassy”). Clones are selected from the high-resolution Los Alamos cosmid map, then physically broken into 3000-base-pair subclones, much as in other sequencing approaches. In contrast to, say, shotgun sequencing, though, only a small random set of the subclones is then selected for sequencing. Sequence fragments already known (end sequences, sequence-tagged sites, and so forth) are used as the starting points. The result is sequence coverage for about 70 percent of the original cosmid clone, enough to allow identification of genes and ESTs, thus pinpointing the most critical targets for later, more thorough sequencing efforts. Further, the SASE-derived sequences provide enough information for researchers elsewhere to pursue just such comprehensive efforts, using whole genomic DNA. In addition, the cost of SASE sequencing is only one-tenth the cost of obtaining a complete sequence, and a genomic region can be “sampled” ten times as fast.
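A back-of-the-envelope model helps show why about 70 percent coverage is cheap. The sketch below uses the standard idealization (often associated with Lander and Waterman) that reads land uniformly at random, in which case the covered fraction at redundancy c is 1 - exp(-c). This model is an assumption brought in for illustration; it is not taken from the SASE project, which starts from known sites rather than purely random positions.

```python
import math

def fraction_covered(redundancy):
    """Expected fraction of a clone covered at a given sequencing redundancy,
    assuming reads are placed uniformly at random."""
    return 1.0 - math.exp(-redundancy)

def redundancy_for(fraction):
    """Redundancy needed to reach a target covered fraction (same assumption)."""
    return -math.log(1.0 - fraction)

if __name__ == "__main__":
    print(f"redundancy for 70% coverage: {redundancy_for(0.70):.2f}")
    for c in (1, 3, 7, 10):
        print(f"{c:>2}-fold redundancy covers {fraction_covered(c):.4%}")
```

Under this idealized model, 70 percent coverage needs only about 1.2-fold redundancy, a small fraction of the raw sequencing behind a finished shotgun project, which is broadly in line with the cost savings described above.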

As the first major target of SASE analysis, Los Alamos scientists chose a cosmid contig of four million base pairs at the end (the telomere) of the short arm of chromosome 16. By early 1996, over 1.4 million base pairs had been sequenced, and a gene, EST, or suspected coding region had been located on every cosmid sampled.

In addition, Los Alamos is building on the SASE effort by using SASE sequence data as the basis for an efficient primer walking strategy for detailed genomic sequencing. The first application of this strategy, to a telomeric region on the long arm of chromosome 7, proved to be as efficient as typical shotgun sequencing, but it required only two- to threefold redundancy to produce a complete sequence, in contrast to the seven- to tenfold redundancy required in shotgun approaches. The resulting 230,000-base-pair sequence is the second-longest stretch of contiguous human DNA sequence ever produced.
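The practical meaning of these redundancy figures can be estimated with simple arithmetic. The sketch below converts redundancy into a number of sequencing runs for the 230,000-base-pair region, assuming a run length of 400 bases (the illustrative figure used earlier for a single run). The helper `runs_needed` is hypothetical, and actual run lengths would vary.

```python
import math

def runs_needed(target_bp, redundancy, read_len=400):
    """Sequencing runs required to read `target_bp` at a given redundancy."""
    return math.ceil(target_bp * redundancy / read_len)

if __name__ == "__main__":
    target = 230_000
    for label, fold in (("primer walking, low", 2), ("primer walking, high", 3),
                        ("shotgun, low", 7), ("shotgun, high", 10)):
        print(f"{label:<22}{fold:>3}-fold: {runs_needed(target, fold):>5} runs")
```

On these assumptions the directed strategy needs roughly 1150 to 1725 runs, against roughly 4025 to 5750 for a shotgun approach covering the same region.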

In a sense, though, even a complete genome sequence (the ultimate physical map) is only a start in understanding the human genome. The deepest mystery is how the potential of 100,000 genes is regulated and controlled, how blood cells and brain cells are able to perform their very different functions with the same genetic program, and how these and countless other cell types arise in the first place from a single undifferentiated egg cell. A first step toward solving these subtle mysteries, though, is a more complete physical picture of the master molecules that lie at the heart of it all.


Beyond Biology
INSTRUMENTATION AND INFORMATICS

From the start, it has been clear that the Human Genome Project would require advanced instrumentation and automation if its mapping and sequencing goals were to be met. And here, especially, the DOE’s engineering infrastructure and tradition of instrumentation development have been crucial contributors to the international effort. Significant DOE resources have been committed to innovations in instrumentation, ranging from straightforward applications of automation to improve the speed and efficiency of conventional laboratory protocols (see, for example, Figure 9a) to the development of technologies on the cutting edge, technologies that might potentially increase mapping and sequencing efficiencies by orders of magnitude.

On the first of these fronts, genome researchers are seeing significant improvements in the rate, efficiency, and economy of large-scale mapping and sequencing efforts as a result of improved laboratory automation tools. In many cases, commercial robots have simply been mechanically reconfigured and reprogrammed to perform repetitive tasks, including the replication of large clone libraries, the pooling of libraries as a prelude to various assays, and the arraying of clone libraries for hybridization studies. In other cases, custom-designed instruments have proved more efficient. A notable illustration is the world’s fastest cell and chromosome sorter, developed at Livermore and now being commercialized, which is used to sort human chromosomes for chromosome-specific libraries. Other examples include a high-speed, robotics-compatible thermal cycler developed at Berkeley, which greatly accelerates PCR amplifications, and instruments developed at Utah for automated hybridization in multiplex sequencing schemes.

SMALLER IS BETTER, AND OTHER DEVELOPMENTS

Beyond “mere” automation are efforts aimed at more fundamental enhancements of established techniques. In particular, a number of DOE-supported efforts aim at improved versions of the automated gel-based Sanger sequencing technique. For example, in place of the conventional slab gels, ultrathin gels, less than 0.1 millimeter thick, can be used to obtain 400 bases of sequence from each lane in an hour’s run, a fivefold improvement in throughput over conventional systems. Even greater speedups are seen when arrays of 0.1-millimeter capillaries are used as the separation medium. Both of these approaches exploit higher electric field strengths to increase DNA mobility and to reduce analysis times. And Livermore scientists are looking beyond even capillaries, to sequencing arrays of rigid glass microchannels, supplemented by automated gel and sample loading.

The capillary approach is especially ripe for further development. Challenges include providing uniform excitation over arrays of 50 to 100 capillaries and then efficiently detecting the fluorescence emitted by labeled samples. Technologies under investigation include fiber-optic arrays, scanning confocal microscopy, and cooled CCD cameras. Some of this effort has already been transferred to the private sector, and tenfold improvements in speed, economy, and efficiency are projected in future commercial instruments.

FIGURE 9. FASTER, SMALLER, CHEAPER. Innovations in automation and instrumentation promise not only the virtues of speed, reduced size, and economy, but also a reduction in the drudgery of repetition. The examples shown here illustrate three technological advances. (a) One of the tediously repetitive tasks of molecular genetics is transferring randomly plated bacterial colonies, as seen in the foreground video image, to microtitre array plates. An automated colony picker robot developed at Berkeley, then modified at Livermore, can pick 1000 colonies per hour and place them in array plates such as the one being examined here by a Livermore researcher. (b) Photolithographic techniques inspired by the semiconductor industry are the basis for preparing high-density oligonucleotide arrays. Shown here is a 1.28 × 1.28 cm array of more than 10,000 different nucleotide sequences (probes), which was then incubated with a cloned fragment (the target) from the genome of the HIV-1 virus. If the fluorescently labeled target contained a region complementary to a sequence in the array, the target hybridized with the probe, the extent of the hybridization depending on the extent of the match. This false-color image depicts different levels of detected fluorescence from the bound target fragments. Techniques such as this may ultimately be used in sequencing applications, as well as in exploring genetic diversity, probing for mutations, and detecting specific pathogens. Photo courtesy of Affymetrix. (c) Sequencing based on the detection of fluorescence from single molecules is being pursued at Los Alamos. The strand of DNA to be sequenced is replicated using nucleotides linked to a fluorescent tag, a different tag for each of the four nucleotides. The tagged strand is then attached to a polystyrene bead suspended in a flowing stream of water, and the nucleotides are enzymatically detached, one at a time. Laser-excited fluorescence then yields the nucleotide sequence, base by base. Much development remains to be done on this technique, but success promises a cheaper, faster approach to sequencing, one that might be applicable to intact cosmid clones 40,000 bases long.

The move toward miniaturization is afoot elsewhere as well. Building on experiences in the electronics industry, several DOE-supported groups are exploring ways to adapt high-resolution photolithographic methods to the manipulation of minuscule quantities of biological reagents, followed by assays performed on the same “chip.” Current thrusts of this “nanotechnology” approach include the design of microscopic electrophoresis systems and ultrasmall-volume, high-speed thermal cycling systems for PCR. A miniaturized, computer-controlled PCR device under development at Livermore operates on 9-volt batteries and might ultimately lead to arrays of thousands of individually controlled micro-PCR chambers.

Another miniaturization effort aims at the fabrication of high-density combinatorial arrays of custom oligomers (short chains of nucleotides), which would make feasible large-scale hybridization assays, including sequencing by hybridization. This innovative technique uses short oligomers that pair up with corresponding sequences of DNA. The oligomers are placed on an array by a process similar to that of making silicon chips for electronics. Successful matches between oligomers and genomic DNA are then detected by fluorescence, and the application of sophisticated statistical analyses reassembles the target sequence. This same technology has already been used for genetic screening and cDNA fingerprinting. Figure 9b illustrates a DOE-supported application of high-density oligonucleotide arrays to the detection of mutations in the HIV-1 genome. Similar approaches can be envisioned to understand differences in patterns of gene expression: Which genes are active (which are producing mRNA) in which cells? Which are active at different times during an organism’s development? Which are active, or inactive, in disease?
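The reassembly step in sequencing by hybridization can be illustrated with a deliberately idealized sketch. The Python code below assumes error-free detection and a target in which every overlap between neighboring oligomers is unique. The functions `spectrum` and `reassemble` and the short test sequence are hypothetical, and real data would call for the statistical treatment mentioned in the text.

```python
def spectrum(seq, k):
    """Set of all k-base oligomers that occur in `seq` (the array 'hits')."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def reassemble(kmers):
    """Rebuild a sequence by chaining oligomers that overlap by k-1 bases."""
    kmers = set(kmers)
    k = len(next(iter(kmers)))
    suffixes = {m[1:] for m in kmers}
    # The first oligomer is the one whose leading k-1 bases follow no other.
    starts = [m for m in kmers if m[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("cannot identify a unique starting oligomer")
    seq = starts[0]
    remaining = kmers - {seq}
    while remaining:
        tail = seq[-(k - 1):]
        candidates = [m for m in remaining if m[:-1] == tail]
        if len(candidates) != 1:
            raise ValueError("overlap is ambiguous; longer oligomers are needed")
        seq += candidates[0][-1]
        remaining.remove(candidates[0])
    return seq

if __name__ == "__main__":
    target = "AGGTCATCGATTCAC"          # arbitrary 15-base example
    print(reassemble(spectrum(target, 5)) == target)
```

With 5-base oligomers the example target is recovered exactly. With 4-base oligomers the same target contains a repeated 3-base overlap (TCA), so the chain becomes ambiguous and the sketch raises an error, which hints at why oligomer length and repeats matter for this approach.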

Sequencing by hybridization is only one of several forward-looking ideas for revolutionizing sequencing technology. In spite of continuing improvements to sequencers based on the classic methods, it is nonetheless desirable to explore altogether new approaches, with an eye to simplifying sample preparation, reducing measurement times, increasing the length of the strands that can be analyzed in a single run, and facilitating interpretation of the results. Over the course of the past few years, several alternative approaches to direct sequencing have been explored, including atomic-resolution molecular scanning, single-molecule detection of individual bases, and mass spectrometry of DNA fragments.

All of these alternatives look promising in the long term, but mass spectrometry has perhaps demonstrated the greatest near-term potential. Mass spectrometry measures the masses of ionized DNA fragments by recording their time-of-flight in vacuum. It would therefore replace traditional gel electrophoresis as the last step in a conventional sequencing scheme. Routine application of this technique still lies in the future, but fragments of up to 500 bases have been analyzed, and practical systems based on high-resolution mass separations of DNA fragments of fewer than 100 bases are currently being developed at several universities and national laboratories.
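The link between flight time and fragment length follows from basic physics: an ion of mass m and charge q accelerated through a voltage V gains kinetic energy qV, so its flight time over a tube of length L is L·sqrt(m / (2qV)). The sketch below applies that relation. The 1-meter tube, the 20-kilovolt accelerating voltage, the single charge, and the average of about 330 daltons per nucleotide are all assumptions chosen for illustration, not parameters from the instruments under development.

```python
import math

DALTON_KG = 1.66053907e-27            # kilograms per dalton
ELEMENTARY_CHARGE = 1.602176634e-19   # coulombs
NUCLEOTIDE_DA = 330.0                 # assumption: rough average mass per base

def flight_time(n_bases, tube_m=1.0, volts=20_000.0, charge=1):
    """Idealized time of flight (seconds) for a DNA fragment of n_bases."""
    mass = n_bases * NUCLEOTIDE_DA * DALTON_KG
    energy = charge * ELEMENTARY_CHARGE * volts      # kinetic energy gained
    velocity = math.sqrt(2.0 * energy / mass)
    return tube_m / velocity

if __name__ == "__main__":
    for n in (50, 100, 101, 500):
        print(f"{n:>3} bases: {flight_time(n) * 1e6:8.2f} microseconds")
```

Because flight time grows only with the square root of mass, adding one base to a 100-base fragment changes the time by about half a percent, and the relative change shrinks as fragments get longer. That offers one way to see why separating short fragments is more tractable than separating long ones.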

Another innovative sequencing method is under investigation at Los Alamos. As depicted in Figure 9c, each of the four bases (A, T, C, G) in a single strand of DNA receives a different fluorescent label, then the bases are enzymatically detached, one at a time. The characteristic fluorescence is detected by a laser system, thereby yielding the sequence, base by base. This approach is beset by major technical challenges, and direct sequencing has not yet been achieved. But the potential benefits are great, and much of the instrumentation for sensitive detection of fluorescence signals has already proved useful for molecular sizing in mapping applications.

DEALING WITH THE DATA

Among the less visible challenges of the Human Genome Project is the daunting prospect of coping with all the data that success implies. Appropriate information systems are needed not only during data acquisition, but also for sophisticated data analysis and for the management and public distribution of unprecedented quantities of biological information. Further, because much of the challenge is interpreting genomic data and making the results available for scientific and technological applications, the challenge extends not just to the Human Genome Project, but also to the microbial genome program and to public- and private-sector programs focused on areas such as health effects, structural biology, and environmental remediation. Efforts in all these areas are the mandate of the DOE genome informatics program, whose products are already widely used in genome laboratories, general molecular biology and medical laboratories, biotechnology companies, and biopharmaceutical companies around the world.

The roles of laboratory data acquisitionand management systems include the con-struction of genetic and physical maps, DNAsequencing, and gene expression analysis.These systems typically comprise databasesfor tracking biological materials and experi-mental procedures, software for controllingrobots or other automated systems, andsoftware for acquiring laboratory data andpresenting it in useful form. Among suchsystems are physical mapping databases

FIGURE 10. GENE HUNTS. Genes, the regions that actually code for proteins, constitute only a small fraction, perhaps10%, of the human genome. Thus, even with sequence in hand, finding the genes is yet another daunting step away. One tooldeveloped to help in the hunt is GRAIL, a computer program developed at Oak Ridge that uses heuristics based on existing data,together with artificial neural networks, to identify likely genes. Coding and noncoding regions of the genome differ in many subtlerespects—for example, the frequency with which certain short sequences appear. Further, particular landmarks are known to character-ize the boundaries of many genes. In the example shown here, GRAIL has searched for likely genes in both strands of a 3583-base-pair sequence. The results are shown at the upper left. The upper white trace indicates five possible exons (coding regions within asingle gene) in one strand, whereas the lower white trace suggests two possible exons in the other strand. However, the lower tracescores worse on other tests, leading to a candidate set of exons shown by the five green rectangles. By refining this set further, GRAILthen produces the final gene model shown in light blue. The lower part of the figure zeros in on the end of the candidate exon outlinedin yellow, thus providing a detailed look at one of the differences between the preliminary and final models. The sequence is shown inviolet, together with the amino acids it codes for, in yellow. The preliminary model thus begins with the sequence GTCGCA. . . , whichcodes for the amino acids valine and alanine. In fact, though, almost all genes begin with the amino acid methionine, a feature of thefinal gene model. At the upper right, GRAIL displays the results of a database search for sequences similar to the final five-exon genemodel. Close matches were found among species as diverse as soybean and the nematode Caenorhabditis elegans.


developed at Livermore and Los Alamos, robot control software developed at Berkeley and Livermore, and DNA sequence assembly software developed at the University of Arizona. These systems are the keys to efficient, cost-effective data production in both DOE laboratories and the many other laboratories that use them.

The interpretation of map and sequence data is the job of data analysis systems. These systems typically include task-specific computational engines, together with graphics and user-friendly interfaces that invite their use by biologists and other non–computer scientists. The genome informatics program is the world leader in developing automated systems for identifying genes in DNA sequence data from humans and other organisms, supporting efforts at Oak Ridge National Laboratory and elsewhere. The Oak Ridge–developed GRAIL system, illustrated in Figure 10, is a world-standard gene identification tool. In 1995 alone, more than 180 million base pairs of DNA were analyzed with GRAIL.

A third area of informatics reflects, in a sense, the ultimate product of the Human Genome Project: information readily available to the scientific and lay communities.

Public resource databases must provide data and interpretive analyses to a worldwide research and development community. As this community of researchers expands and as the quantity of data grows, the challenges of maintaining accessible and useful databases likewise increase. For example, it is critical to develop scientific databases that “interoperate,” sharing data and protocols so that users can expect answers to complex questions that demand information from geographically distributed data resources. As the genome project continues to provide data that interlink structural and functional biochemistry, molecular, cellular, and developmental biology, physiology and medicine, and environmental science, such interoperable databases will be the critical resources for both research and technology development. The DOE genome informatics program is crucial to the multiagency effort to develop just such databases. Systems now in place include the Genome Database of human genome map data at Johns Hopkins University, the Genome Sequence DataBase at the National Center for Genome Resources in Santa Fe, and the Molecular Structure Database at Brookhaven National Laboratory.


Ethical, Legal, and Social Implications

AN ESSENTIAL DIMENSION OF GENOME RESEARCH

THE HUMAN GENOME PROJECT is rich with promise, but also fraught with social implications. We expect to learn the underlying causes of thousands of genetic diseases, including sickle cell anemia, Tay-Sachs disease, Huntington disease, myotonic dystrophy, cystic fibrosis, and many forms of cancer, and thus to predict the likelihood of their occurrence in any individual. Likewise, genetic information might be used to predict sensitivities to various industrial or environmental agents. The dangers of misuse and the potential threats to personal privacy are not to be taken lightly.

In recognition of these important issues, both the DOE and the National Institutes of Health devote a portion of their resources to studies of the ethical, legal, and social implications (ELSI) of human genome research. Perhaps the most critical of social issues are the questions of privacy and fair use of genetic information. Most observers agree that personal knowledge of genetic susceptibility can be expected to serve us well, opening the door to more accurate diagnoses, preventive intervention, intensified screening, lifestyle changes, and early and effective treatment. But such knowledge has another side, too: the risk of anxiety, unwelcome changes in personal relationships, and the danger of stigmatization. Consider, for example, the impact of information that is likely to be incomplete and indeterminate (say, an indication of a 25 percent increase in the risk of cancer). And further, if handled carelessly, genetic information could threaten us with discrimination by potential employers and insurers. Other issues are perhaps less immediate than these personal concerns, but they are no less challenging. How, for example, are the “products” of the Human Genome Project to be patented and commercialized? How are the judicial, medical, and educational communities, not to mention the public at large, to be effectively educated about genetic research and its implications?

To confront all these issues, the NIH-DOE Joint Working Group on Ethical, Legal, and Social Implications of Human Genome Research was created in 1990 to coordinate ELSI policy and research between the two agencies. One focus of DOE activity has been to foster educational programs aimed both at private citizens and at policy-makers and educators. Fruits of these efforts include radio and television documentaries, high school curricula and other educational material, and science museum displays. In addition, the DOE has concentrated on issues associated with privacy and the confidentiality of genetic information, on workplace and commercialization issues (especially screening for susceptibilities to environmental or workplace agents), and on the implications of research findings regarding the interactions among multiple genes and environmental influences.


Whereas the issues raised by modern genome research are among the most challenging we face, they are not unprecedented. Issues of privacy, knotty questions of how knowledge is to be commercialized, problems of dealing with probabilistic risks, and the imperatives of education have all been confronted before. As usual, defensible perspectives and reasonable arguments, even precious rights, exist on opposing sides of every issue. It is a balance that must be sought. Accordingly, further study is needed, as well as continuing efforts to promote public awareness and understanding, as we strive to define policies for the intelligent use of the profound knowledge we seek about ourselves.

THE AGE OF DISCOVERY was the age of da Gama, Columbus, and Magellan, an era when European civilization reached out to the Far East and thus filled many of the voids in its map of the world. But in a larger sense, we have never ceased from our exploration and discovery. Science has been unstinting over the ages in its efforts to complete our intellectual picture of the universe. In this century, our explorations have extended from the subatomic to the cosmic, as we have mapped the heavens to their farthest reaches and charted the properties of the most fleeting elementary particles. Nor have we neglected to look inward, seeking, as it were, to define the topography of the human body. Beginning with the first modern anatomical studies in the sixteenth century, we have added dramatically to our picture of human anatomy, physiology, and biochemistry. The Human Genome Project is thus the next stage in an epic voyage of discovery, a voyage that will bring us to a profound understanding of human biology.

In an important way, though, the genome project is very different from many of our exploratory adventures. It is spurred by a conviction of practical value, a certainty that human benefits will follow in the wake of success. The product of the Human Genome Project will be an enormously rich biological database, the key to tracking down every human gene, and thus to unveiling, and eventually to subverting, the causes of thousands of human diseases. The sequence of our genome will ultimately allow us to unlock the secrets of life’s processes, the biochemical underpinnings of our senses and our memory, our development and our aging, our similarities and our differences.

It has further been said that the Human Genome Project is guaranteed to succeed: Its goal is nothing more assuming than a sequence of three billion characters. And we have a very good idea of how to read those characters. Unlike perilous voyages or searches for unknown subatomic particles, this venture is assured of its goal. But beyond a detailed picture of human DNA, no one can predict the form success will take. The genome project itself offers no promises of cancer cures or quick fixes for Alzheimer’s disease, no detailed understanding of genius or schizophrenia. But if we are ever to uncover the mysteries of carcinogenesis, if we are ever to know how biochemistry contributes to mental illness and dementia, if we ever hope to really understand the processes of growth and development, we must first have a detailed map of the genetic landscape. That’s what the Human Genome Project promises. In a way, it’s a rather prosaic step, but what lies beyond is breathtaking.


This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain correct information, neither the United States Government nor any agency thereof, nor The Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or The Regents of the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof, or The Regents of the University of California.

Ernest Orlando Lawrence Berkeley National Laboratory is an equal opportunity employer.

Prepared for the U.S. Department of Energy under Contract No. DE-AC03-76SF00098. PUB-773/July 1996.

Printed on recycled paper.

ACKNOWLEDGMENTS

This booklet was prepared at the request of the U.S. Department of Energy, Office of Health and Environmental Research, as an overview of the Human Genome Project, especially the role of the DOE in this international, multiagency effort. Though edited and produced at the Lawrence Berkeley National Laboratory, this account aims to reflect the full scope of the DOE-sponsored effort. In pursuit of this goal, the contributions of many have been essential. Within the Department of Energy, David A. Smith deserves special mention. He managed the DOE Human Genome Program until his retirement this year, and he was the principal catalyst of this effort to summarize its achievements. Also contributing program descriptions, illustrations, advice, and criticism: at DOE, Daniel W. Drell; at Berkeley, Michael Palazzolo, Christopher H. Martin, Sylvia Spengler, David Gilbert, Joseph M. Jaklevic, Eddy Rubin, Kerrie Whitelaw, and Manfred Zorn; at Lawrence Livermore National Laboratory, Anthony Carrano, Gregory G. Lennon, and Linda Ashworth; at Los Alamos National Laboratory, Robert K. Moyzis and Larry Deaven; at Oak Ridge National Laboratory, Lisa Stubbs; at the National Center for Genome Resources, Christopher Fields; and at Affymetrix, Robert J. Lipshutz. Behind the scenes, many others no doubt had a hand.

DOUGLAS VAUGHAN

Editor

Design: Debra Lamfers Design
Illustrations: Marilee Bailey


The World Wide Web offers the easiest path to current news about the Human Genome Project. Good places to start include the following:

• DOE Human Genome Program—http://www.er.doe.gov/production/oher/hug_top.html

• NIH National Center for Human Genome Research—http://www.nchgr.nih.gov

• Human Genome Management Information System at Oak Ridge National Laboratory—http://www.ornl.gov/TechResources/Human_Genome/home.html

• Lawrence Berkeley National Laboratory Human Genome Center—http://www-hgc.lbl.gov/GenomeHome.html

• Lawrence Livermore National Laboratory Human Genome Center—http://www-bio.llnl.gov/bbrp/genome/genome.html

• Los Alamos National Laboratory Center for Human Genome Studies—http://www-ls.lanl.gov/LSwelcome.html

• The Genome Database at Johns Hopkins University School of Medicine—http://gdbwww.gdb.org/

• The National Center for Genome Resources—http://www.ncgr.org/
