welch wordifier bosc2009


Upload: bosc

Post on 28-Nov-2014


Page 1: Welch Wordifier Bosc2009

"Junk" DNA Proves to be Highly Valuable (1)

• What was once thought of as DNA with zero value in plants--dubbed "junk" DNA--may turn out to be key in helping scientists improve the control of gene expression in transgenic crops. (2)

• Cooper and collaborators investigated "junk" DNA in the model plant Arabidopsis thaliana, using a computer program to find short segments of DNA that appeared as molecular patterns… These linked patterns are called pyknons…

• This discovery in plants illustrates that the link between coding DNA and junk DNA crosses higher orders of biology and suggests a universal genetic mechanism at play that is not yet fully understood.

1 - Alfredo Flores, June 2, 2009; http://www.ars.usda.gov/is/pr/2009/090602.htm
2 - Bret Cooper, Soybean Genomics and Improvement Laboratory, Agricultural Research Service, USDA.

Page 2: Welch Wordifier Bosc2009

The genome genes

Functional elements?

Functional Elements: 90%?? Junk: 10%??

“Perhaps it is time to bid farewell to the term ‘junk’ DNA – we knew not your true nature.” (Regulatory RNAs and the demise of ‘junk’ DNA. Genome Biology 2006, 7:328)

"...a certain amount of hubris was required for anyone to call any part of the genome 'junk,' given our level of ignorance." (Francis Collins, 2006)

Page 3: Welch Wordifier Bosc2009

Copyright ©2006 by the National Academy of Sciences

Rigoutsos, Isidore et al. (2006) Proc. Natl. Acad. Sci. USA 103, 6605-6610

Fig. 1. Pyknons in the 3' UTRs of the apoptosis inhibitor birc4 (shown above the horizontal line) and nine other genes

Page 4: Welch Wordifier Bosc2009
Page 5: Welch Wordifier Bosc2009
Page 6: Welch Wordifier Bosc2009

WordSeeker

A Software Suite for Discovery and Characterization of Genomic Words and Genome-Wide Patterns

Page 7: Welch Wordifier Bosc2009

www.word-seeker.org

Page 8: Welch Wordifier Bosc2009

word discovery methods

[Diagram: taxonomy of word discovery methods]
Main branches: sequence-driven (alignment-based); pattern-driven (enumerative)
Category labels: exhaustive; optimized; deterministic optimization; probabilistic optimization; preprocess; combine; short patterns; heuristic; exact
Methods and tools shown: AlignAce; MEME; suffix tree; Weeder; YMF; Teiresias; WordSeeker; WINNOWER

GuhaThakurta D., Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res. 2006 Jul 19;34(12):3585-98. Print 2006. Review.
Sandve GK, Drabløs F., A survey of motif discovery methods in an integrated framework. Biol Direct. 2006 Apr 6;1:11.

Page 9: Welch Wordifier Bosc2009

WORDIFIER

sequence(s) words

scientist scientist

The WORDIFIER Pattern for Functional and Regulatory Genomics

Page 10: Welch Wordifier Bosc2009

OWEF: An Open Source Word Enumeration Framework for Bioinformatics

Kyle Kurz, Lonnie R. Welch, Frank Drews, Lee Nau, Jens Lichtenberg

Ohio University School of EECS Bioinformatics Laboratory

Page 11: Welch Wordifier Bosc2009

Motivation

• Create a robust Motif Discovery framework using abstracted core algorithms

• Use a modular design, allowing new methods and algorithms to be implemented quickly and easily
– Abstract C++ classes
– Easily extensible

• Support the Scientific Discovery process

Page 12: Welch Wordifier Bosc2009

Approach

General Framework

Abstract Word Counting Class
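
The slide names an abstract word counting class but does not show its members. As a minimal sketch of what such an abstraction could look like, the block below defines a hypothetical `WordCounter` interface and one possible backend, `MapWordCounter`. All class and method names here are assumptions made for illustration and are not taken from the OWEF source.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical interface: the slides mention an "Abstract Word Counting
// Class" without listing its members, so these method names are assumptions.
class WordCounter {
public:
    virtual ~WordCounter() = default;
    // Record every overlapping word of length k found in seq.
    virtual void addSequence(const std::string& seq, std::size_t k) = 0;
    // Number of times word has been recorded so far.
    virtual std::size_t count(const std::string& word) const = 0;
    // Number of distinct words recorded so far.
    virtual std::size_t distinctWords() const = 0;
};

// One possible backend: an ordered map from word to occurrence count.
// Another backend (for example a suffix tree) could implement the same
// interface without changing the calling code.
class MapWordCounter : public WordCounter {
public:
    void addSequence(const std::string& seq, std::size_t k) override {
        assert(k > 0);
        for (std::size_t i = 0; i + k <= seq.size(); ++i) {
            ++counts_[seq.substr(i, k)];
        }
    }
    std::size_t count(const std::string& word) const override {
        auto it = counts_.find(word);
        return it == counts_.end() ? 0 : it->second;
    }
    std::size_t distinctWords() const override { return counts_.size(); }

private:
    std::map<std::string, std::size_t> counts_;
};
```

Code written against `WordCounter` does not need to know which data structure stores the counts, which is one way to read the "modular design" goal stated in the Motivation slide.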

Page 13: Welch Wordifier Bosc2009

Project Information

• Project:
– http://bio-s1.cs.ohiou.edu/~wordseek/download/
• Open Source License:
– GNU General Public License (GPL v3)
• Language:
– C++
• Applications:
– Currently in final testing phase
• Future Work:
– Will provide backend for WordSeeker tool at Ohio University and Ohio Supercomputer Center
– Will be used to fully analyze the Arabidopsis thaliana genome

Page 14: Welch Wordifier Bosc2009

Open Source Implementation of Batch Extraction for Coding and Non-Coding Sequences

Jens Lichtenberg, Lonnie R. Welch

Bioinformatics Laboratory, School of EECS, Ohio University

Page 15: Welch Wordifier Bosc2009

Motivation

• Regulatory Genomics tools return and operate on lists of Gene Symbols (e.g. STAT5A, Cd59a, Slc35f4)

• To our knowledge, there is no currently supported, open source “tool” that allows extraction of specific non-coding sequences for any organism

• Ensembl API provides limited functionality

Page 16: Welch Wordifier Bosc2009

Approach

[Flowchart]
Input: Gene Symbol; Promoter length
Boxes shown: Set up repository; connect to Ensembl database; Retrieve Gene Adaptor; create gene object; Retrieve Upstream Adaptor; Retrieve Promoter; Retrieve 5’UTR; Retrieve 3’UTR; Retrieve Exons; Retrieve Introns
Output: Output Files
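
The promoter step in this flow can be illustrated in isolation. The sketch below is not the Perl tool and does not use the Ensembl API: it assumes the chromosome sequence is already held in memory as a string and that gene coordinates are known. The `Gene` struct, the `promoter` function, and the coordinate convention (0-based, end exclusive) are all hypothetical choices made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical gene record; the actual tool obtains gene data from Ensembl.
struct Gene {
    std::size_t start;  // 0-based, inclusive, on the forward strand
    std::size_t end;    // 0-based, exclusive, on the forward strand
    bool forward;       // true for the + strand, false for the - strand
};

// Reverse complement of a DNA string; symbols other than A, C, G, T
// (for example N) are left unchanged.
std::string reverseComplement(const std::string& s) {
    std::string out(s.rbegin(), s.rend());
    for (char& c : out) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'T': c = 'A'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            default: break;
        }
    }
    return out;
}

// Return up to `length` bases immediately upstream of the gene, read in the
// gene's own orientation. The result is shorter when the gene lies closer
// than `length` bases to the end of the chromosome.
std::string promoter(const std::string& chrom, const Gene& g, std::size_t length) {
    assert(g.start <= g.end && g.end <= chrom.size());
    if (g.forward) {
        std::size_t from = g.start >= length ? g.start - length : 0;
        return chrom.substr(from, g.start - from);
    }
    // For a minus-strand gene, upstream lies after `end` on the forward strand.
    std::size_t to = std::min(chrom.size(), g.end + length);
    return reverseComplement(chrom.substr(g.end, to - g.end));
}
```

For example, with `chrom = "AACCGGTTAC"`, a forward-strand gene at [4, 8) and a promoter length of 3 yields the three bases just before position 4.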

Page 17: Welch Wordifier Bosc2009

Project Information

• Project:
– http://opensource.msseeker.org
– GNU General Public License (GPL)
• Language:
– Perl
• Integrated in WordSeeker motif discovery tool of Ohio University Bioinformatics Lab
• Future Work:
– Connection to Genbank repository information
– Release into BioPerl or CPAN

Page 18: Welch Wordifier Bosc2009

Acknowledgements

Collaborators
• Thomas Bitterman, OSC
• Laura Elnitski, NHGRI
• Susan Evans, OU
• Matt Geisler, SIU
• Erich Grotewold, OSU
• Edwin Jacox, NHGRI
• Stephen S. Lee, U. Idaho
• Pooja M. Majmudar, OU
• Paul Morris, BGSU
• Chase Nelson, Oberlin
• Eric Stockinger, OSU
• Sarah Wyatt, OU
• Alper Yilmaz, OSU
• Jeffrey Parvin, OSU
• Kun Huang, OSU
• Thomas Mitchell, OSU
• Kengo Morohashi, OSU
• Rebecca Lamb, OSU
• John Finer, OSU

WordSeeker Team and Former Members of the team
• Lonnie Welch
• Jens Lichtenberg
• Rami Alouran
• Frank Drews
• Kyle Kurz
• Xiaoyu Liang
• Lee Nau
• Matt Wiley
• Razvan Bunescu
• Joshua D. Welch
• Klaus Ecker
• Mohit Alam
• Nathaniel George
• Dazhang Gu
• Eric Petri
• Josiah Seaman
• Kaiyu Shen

Page 19: Welch Wordifier Bosc2009

a pattern “describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use the solution a million times over, without ever doing it the same way twice [1].”

[1] C. Alexander, S. Ishikawa, and M. Silverstein, A Pattern Language: Towns, Buildings, Construction. Oxford University Press, 1977.

Page 20: Welch Wordifier Bosc2009

Alexander Pattern Format

1. Picture – a representative example
2. Introductory paragraph – sets the context
3. Headline – the essence of the problem in one or two sentences
4. Body
• empirical background of the pattern
• evidence for its validity
• range of different ways the pattern can be manifested
5. Solution
• relationships which are required to solve the stated problem in the stated context
• stated in the form of an instruction, so that you know exactly what you need to do to build the pattern
6. Diagram – shows the solution, with labels to indicate its main components
7. A paragraph which ties the pattern to all those smaller patterns in the language which are needed to complete this pattern, to embellish it, to fill it out…

Page 21: Welch Wordifier Bosc2009

Picture, Introduction, Headline

With the availability of the genomic sequences of numerous organisms, life scientists are working in conjunction with bioinformaticians to decipher the meanings of the genomes. Projects such as the Encyclopedia of DNA Elements (ENCODE) [2] and Pyknons [3] seek to identify and characterize the functional elements in genomes. The functional elements are often referred to as words.

The WORDIFIER Pattern for Functional and Regulatory Genomics

Given a genomic sequence (or a set of sequences), an important problem is the enumeration of all subsequences (words) contained in the sequence (or the set of sequences).
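
The enumeration problem stated above can be sketched as a brute-force pass over every substring in a length range. This is only an illustration under stated assumptions: the function name `enumerateWords`, the `WordStats` fields, and the choice to count overlapping occurrences are hypothetical, and the approach is exhaustive rather than the optimized enumeration a production tool would need for whole genomes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Per-word statistics (hypothetical fields chosen for this example).
struct WordStats {
    std::size_t occurrences = 0;  // total overlapping occurrences across all sequences
    std::size_t sequences = 0;    // number of input sequences containing the word
};

// Enumerate every word (contiguous subsequence) whose length lies in
// [minLen, maxLen] across a set of sequences.
std::map<std::string, WordStats> enumerateWords(
        const std::vector<std::string>& seqs,
        std::size_t minLen, std::size_t maxLen) {
    assert(minLen > 0 && minLen <= maxLen);
    std::map<std::string, WordStats> table;
    for (const std::string& seq : seqs) {
        std::set<std::string> seenHere;  // words already credited to this sequence
        for (std::size_t len = minLen; len <= maxLen && len <= seq.size(); ++len) {
            for (std::size_t i = 0; i + len <= seq.size(); ++i) {
                std::string w = seq.substr(i, len);
                ++table[w].occurrences;
                if (seenHere.insert(w).second) {
                    ++table[w].sequences;
                }
            }
        }
    }
    return table;
}
```

Tracking both the occurrence count and the number of sequences containing a word keeps the two cases in the problem statement (a single sequence, or a set of sequences) in one table.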