ashg2015 grc-pruitt

RefSeq curation and annotation of the reference human genome GRCh38 Kim D. Pruitt National Center for Biotechnology Information National Library of Medicine National Institutes of Health www.ncbi.nlm.nih.gov/refseq/

Upload: genome-reference-consortium

Post on 17-Jan-2017



RefSeq Background

• RefSeq provides:
  • Human genome annotation
  • Known transcripts & proteins (manually curated)
  • Model transcripts & proteins (annotation pipeline)

• Collaborations:
  • Genome Reference Consortium (GRC)
  • HUGO Gene Nomenclature Committee (HGNC)
  • Consensus CDS (CCDS) Collaboration (HAVANA curators)
  • RefSeqGene/Locus Reference Genomic (LRG)/LSDB

RefSeq: www.ncbi.nlm.nih.gov/refseq/ Gene: www.ncbi.nlm.nih.gov/gene/

An NCBI project to provide reference sequence standards that incorporate current knowledge.

Archaea – Bacteria – Eukaryotes – Virus

Curation support of genic regions of the reference human assembly

• RefSeqGene and LRG collaboration
  • Genomic and cDNA standards for clinical reporting
  • Report potential issues to the GRC

• Consensus CDS collaboration
  • Stabilized human CDS annotation
  • Report potential issues to the GRC

• RefSeq
  • Curation of genes, transcript & protein records
  • Report potential issues to the GRC
  • Review GRC patch updates for gene annotation impact

Genome annotation leverages curation + computation

Genes:
• Type, location, length

Sequence:
• Accuracy, length
• Alternate splice products
• Functional annotation

Evidence-based genome annotation pipeline
• Align curated RefSeqs
• Align transcripts, proteins
• Align RNA-Seq
• Filter best alignments
• Build model RefSeqs
• Assign accessions, GeneID

Manual Curation
• Sequence - Literature

                 Transcripts   Proteins
Known RefSeqs    50,540        39,363
Model RefSeqs    112,735       60,599

Annotated Genes   Count
Protein-coding    20,576
Non-coding        18,037
Pseudogene        12,474

Transition from GRCh37 to GRCh38
• Identify gene/sequence differences vs. GRCh38
• Automatic update at synonymous mismatches
• Curation review of remainder
• >5,100 Known RefSeq transcripts updated since October 2013
• 47,031 Known RefSeqs identical to genome
• 2,916 intentionally retain a mismatch or indel
• ~600 pending
• ~132 genes merged
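The "automatic update at synonymous mismatches" step can be illustrated with a small sketch. This is not the NCBI implementation; it only shows how a substitution between a transcript CDS and the genome-derived CDS could be labeled synonymous or not using the standard genetic code. The function name `classify_cds_mismatches` and the toy sequences are hypothetical, and the sketch assumes substitutions only (no indels).

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_cds_mismatches(transcript_cds, genome_cds):
    """Label each differing codon as 'synonymous' or 'nonsynonymous'.

    Hypothetical helper: assumes both CDS strings are ungapped, in frame,
    and the same length (substitutions only, no indels).
    """
    if len(transcript_cds) != len(genome_cds) or len(transcript_cds) % 3:
        raise ValueError("expected equal-length, in-frame CDS sequences")
    results = []
    for i in range(0, len(transcript_cds), 3):
        t_codon = transcript_cds[i:i + 3].upper()
        g_codon = genome_cds[i:i + 3].upper()
        if t_codon == g_codon:
            continue
        same_aa = CODON_TABLE[t_codon] == CODON_TABLE[g_codon]
        label = "synonymous" if same_aa else "nonsynonymous"
        # Report the 1-based codon number with both codons and the label.
        results.append((i // 3 + 1, t_codon, g_codon, label))
    return results


# Toy sequences (not real RefSeq data): GCT and GCC both encode alanine.
print(classify_cds_mismatches("ATGGCTTAA", "ATGGCCTAA"))
```

Under this sketch, synonymous differences would be candidates for automatic update, while the rest would go to curation review, as the slide describes.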

[Figure: number of updates per quarter, 2013 Q1 to 2015 Q3; axis range 0 to 1200. * marks GRCh38, 12/24/2013]

Updating RefSeq to match GRCh38

• Post GRCh38 review:
  • NM_173477 updated to match genome (NM_173477.4)
  • Model RefSeq XM_005257026.1 promoted to Known RefSeq
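The accessions on this slide encode both the record status and the sequence version. As a minimal sketch, the helper below splits an accession into its parts, relying on the RefSeq prefix convention (NM_/NR_/NP_ for curated Known records, XM_/XR_/XP_ for pipeline Model records). The function name `describe_accession` is hypothetical, and only these six prefixes are handled.

```python
import re

# RefSeq prefixes handled here: N* are curated ("Known") records,
# X* are annotation-pipeline ("Model") records.
PREFIX_INFO = {
    "NM": ("Known", "mRNA"),
    "NR": ("Known", "non-coding RNA"),
    "NP": ("Known", "protein"),
    "XM": ("Model", "mRNA"),
    "XR": ("Model", "non-coding RNA"),
    "XP": ("Model", "protein"),
}
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def describe_accession(accession):
    """Split 'NM_173477.4' into accession, version, status and molecule."""
    match = ACCESSION_RE.match(accession)
    if not match or match.group(1) not in PREFIX_INFO:
        raise ValueError(f"unsupported accession: {accession!r}")
    prefix, number, version = match.groups()
    status, molecule = PREFIX_INFO[prefix]
    return {
        "accession": f"{prefix}_{number}",
        "version": int(version) if version else None,
        "status": status,
        "molecule": molecule,
    }


for acc in ("NM_173477.4", "XM_005257026.1"):
    print(describe_accession(acc))
```

The version suffix (".4", ".1") increments when the sequence changes, which is why the genome-matching update appears as NM_173477.4.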

[Figure: alignment panels for GRCh38 and GRCh37]

RefSeq curation & genome maintenance

GRCh37 issue:
• SCX duplication
• MROH1 split

GRCh38 update:
• Gap closed
• MROH1 complete
• One SCX gene

[Figure: GRCh38 and GRCh37 views of the region, with the gap labeled]

RefSeq curation & genome maintenance

• POLR2A (GeneID:5430): NM_000937.4 has a 2 nt deletion relative to GRCh38

• The deletion is retained in the RefSeq because it maintains the correct reading frame

[Figure: alignment of NM_000937.4 to GRCh38]

RefSeq curation & genome maintenance

• RefSeq reported this sequence issue to the GRC

GRCh38 ALT LOCI and PATCHES

• Pre-Patch & ALT review
• Polymorphic pseudogenes
• Haplotype & CNV variation

• ALT-specific RefSeq records
• Curator-stored placement data

• Evidence-based genome annotation pipeline
• Manual Curation

• Assembly-ALT alignments
• Alignment quality reports

• Subsequent genome annotation build corrects the annotation

• Interim alignment updates

Polymorphic pseudogenes

• RefSeq provides different transcripts to represent the protein-coding gene versus the pseudogene

• Curators store assembly placement information (chromosome versus ALT) in a local database

• The annotation pipeline uses this information to ensure correct annotation

An example – GSTT cluster on chromosome 22:

Assembly Unit    GSTT1    GSTT2    GSTT2B   GSTTP1   GSTTP2
GRCh38 chr22     null     pseudo   coding   pseudo   null
ALT_REF_LOCI_1   coding   coding   coding   pseudo   pseudo
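The curator-stored placement data for the GSTT cluster can be pictured as a simple lookup keyed by assembly unit and gene. The sketch below is illustrative only: the dictionary layout and the function `expected_annotation` are hypothetical, not the structure of the curators' local database, and it assumes that "null" means there is no copy of the gene to annotate on that assembly unit. The values are transcribed from the GSTT table on this slide.

```python
# Placement status per assembly unit, transcribed from the GSTT table.
PLACEMENT = {
    "GRCh38 chr22": {
        "GSTT1": "null", "GSTT2": "pseudo", "GSTT2B": "coding",
        "GSTTP1": "pseudo", "GSTTP2": "null",
    },
    "ALT_REF_LOCI_1": {
        "GSTT1": "coding", "GSTT2": "coding", "GSTT2B": "coding",
        "GSTTP1": "pseudo", "GSTTP2": "pseudo",
    },
}


def expected_annotation(assembly_unit, gene):
    """Return the gene type to annotate on a unit, or None if absent.

    Assumption: "null" is read as "no copy on this assembly unit".
    """
    status = PLACEMENT[assembly_unit][gene]
    if status == "null":
        return None
    return "protein-coding" if status == "coding" else "pseudogene"


for unit in PLACEMENT:
    print(unit, "GSTT2 ->", expected_annotation(unit, "GSTT2"))
```

A lookup of this kind lets the same gene receive a coding transcript on one assembly unit and a pseudogene annotation on another.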

GSTT* variation, chromosome 22

• Copy number variation of glutathione-S-transferase theta genes is associated with digestive tract cancers and more

• Accurate gene annotation is important to downstream users

[Figure: GSTT region on GRCh38 chr22 and GRCh38 ALT. Labels: pseudogene; chr22 = null allele; coding allele]

ulcerative colitis - laryngeal cancer - esophageal cancer - colorectal cancer

GSTT2 polymorphism

[Figure: GSTT2 on GRCh38 chr22 and GRCh38 ALT. Labels: AT splice donor, premature stop codon; GT splice donor, stop codon; GRCh38 chr22 GSTT2 pseudogene]

Data access
• Genes:
  • <…ncbi root url…>/gene/
  • ftp://ftp.ncbi.nlm.nih.gov/gene/
  • NCBI YouTube ‘Download genomic sequence for a gene’
    • https://www.youtube.com/watch?v=RHz2nZbzjpA
• RefSeq transcripts and proteins:
  • Links from NCBI Gene
  • Nucleotide/protein query:
    • human[organism] + use facets to specify RefSeq and molecule type
  • ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/
• NCBI Genome Annotation
  • Links from NCBI Assembly or Genome resources
    • <ncbi>/assembly/ or <ncbi>/genome/
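The slide describes the web route: query human[organism], then narrow with facets. A scripted equivalent is possible through NCBI E-utilities, which the slides do not mention; the sketch below is therefore an assumption about one workable approach, not the presenter's method. It only builds the ESearch request URL and performs no network call. The function name `refseq_search_url` is hypothetical; `srcdb_refseq[prop]` and `biomol_mrna[prop]` are Entrez property filters standing in for the RefSeq and molecule-type facets.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def refseq_search_url(db="nuccore", molecule_filter="biomol_mrna[prop]", retmax=20):
    """Build an ESearch URL for human RefSeq records of one molecule type.

    Hypothetical helper: the filters approximate the web facets named on
    the slide (RefSeq source database and molecule type).
    """
    term = f"human[organism] AND srcdb_refseq[prop] AND {molecule_filter}"
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(refseq_search_url())
```

The returned URL can be fetched with any HTTP client; for bulk retrieval, the FTP paths listed on the slide are the more direct route.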

Data access to annotated genome

[Figure labels: Gene; Assembly details]

Genome FTP formats
• FASTA
  • genome, transcripts, proteins
• GenBank file format
  • genome, transcripts, proteins
• GFF genome annotation
• Feature table
  • features and locations in tabular format
• AGP, Assembly details & statistics
• Repeat masker results
• Md5checksums
• Documentation
  • README files
  • <ncbi>/genome/doc/ftpfaq/
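Of the formats listed, GFF is the one most often parsed directly. As a minimal sketch, the code below reads GFF3 text (nine tab-separated columns, with key=value attributes in the last column) and counts features by type. The three sample records are invented for illustration and are not real annotation; the function name `parse_gff3` is hypothetical.

```python
from collections import Counter

# Invented sample records, not real annotation data.
SAMPLE_GFF3 = """\
##gff-version 3
chr22\tExample\tgene\t100\t900\t.\t+\t.\tID=gene0;Name=GENE_A
chr22\tExample\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0
chr22\tExample\tpseudogene\t1500\t2100\t.\t-\t.\tID=gene1;Name=GENE_B
"""


def parse_gff3(text):
    """Yield one dict per feature line, skipping comments and directives."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        seqid, source, ftype, start, end, score, strand, phase, attrs = cols
        attributes = dict(
            pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
        )
        yield {
            "seqid": seqid,
            "type": ftype,
            "start": int(start),
            "end": int(end),
            "strand": strand,
            "attributes": attributes,
        }


features = list(parse_gff3(SAMPLE_GFF3))
counts = Counter(feature["type"] for feature in features)
print(counts)
```

For real files, compressed input and URL-escaped attribute values would also need handling, which this sketch omits.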

Acknowledgements

RefSeq Curators
Eric Cox, Catherine Farrell, Tamara Goldfarb, Tripti Gupta, Vinita Joardar, Vamsi Kodali, Kelly McGarvey, Mike Murphy, Nuala O'Leary, Shashi Pujar, Bhanu Rajput, Sanjida Rangwala, Lillian Riddick, Dave Webb, Matt Wright

Susan Hiatt

Annotation pipeline
Paul Kitts, Terence Murphy, Francoise Thibaud-Nissen

www.ncbi.nlm.nih.gov/refseq/

Collaborators
• Elspeth Bruford (HGNC)
• Jen Harrow (HAVANA)
• Locus-Specific Databases
• Expert databases
• Individual scientists

NCBI Posters & Booth 2405