devin petersohn poster


Upload: devin-petersohn

Post on 14-Apr-2017


TRANSCRIPT

Page 1: Devin Petersohn Poster

To convert a DNA sequence into a grayscale image, we first map each base to a numeric value:

    A=0, C=1, G=2, T=3

Then, to convert those values into a 4-bit grayscale value (gray levels 0-15), we apply the following formula to each pair of adjacent bases:

    (P1*4)+(P2)

where P1 is the value of the base in the first position and P2 is the value of the base in the second. The resulting grayscale values form the pixels of an image that represents the original sequence. A 10x10 image has 100 pixels and each pixel comes from one overlapping pair of bases, so a sequence of 101 base pairs is required.

Example:

    CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT
    First window is CA: P1 = C = 1, P2 = A = 0 => (1*4) + (0) = 4

Using a sliding window, the second position becomes the first:

    CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT
    Second window is AT: P1 = A = 0, P2 = T = 3 => (0*4) + (3) = 3

Each two-character sequence receives a unique value from 0-15, which corresponds to its grayscale value in the 10x10 image.
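The encoding above can be sketched in a few lines of Python. This is a minimal illustration, not the poster's implementation: the function name `sequence_to_image` and the row-major pixel layout are assumptions, since the poster does not say how the 100 values are arranged in the grid.

```python
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def sequence_to_image(seq, side=10):
    """Encode side*side + 1 bases as a side x side grid of 4-bit gray values."""
    needed = side * side + 1
    if len(seq) < needed:
        raise ValueError(f"need at least {needed} bases, got {len(seq)}")
    values = [BASE_VALUE[base] for base in seq[:needed]]
    # Sliding window of width 2: pixel = (P1 * 4) + P2, which lies in 0..15.
    pixels = [values[i] * 4 + values[i + 1] for i in range(needed - 1)]
    # Assumption: pixels fill the image row by row.
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


image = sequence_to_image(("CATG" * 26)[:101])
print(image[0][:4])  # CA, AT, TG, GC
```

Because adjacent windows overlap by one base, every base except the first and last contributes to two pixels.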

Co-Occurrence Matrix and Texture Measurement

Locating Potential DNA Mutations

Discussion

Abstract

Identifying the Long Ultra Similar Elements (LUSEs) in genomes can yield a myriad of new information regarding the result of a genetically and evolutionarily significant mutation. However, current methods of identifying LUSEs cannot capture every possible mutation (insertion, deletion, and base pair substitution) without an exhaustive pairwise comparison using the Levenshtein Similarity measurement. Alignment algorithms attempt to solve this problem, but can only calculate the maximum consecutively similar elements in a string of base pairs.

We have developed an image-based method of identifying LUSEs in genomes that has a strong correlation to the Levenshtein Similarity measurement. Our approach first converts a sequence into a 10x10 grayscale image. Then, using existing texture feature metrics based on the co-occurrence matrix, we generate a unique feature vector for each sequence by which other sequences can be compared. These feature vectors can then be plotted and, using a clustering algorithm, we will be able to identify clusters of sequences that share a Levenshtein Similarity greater than 90% (or another threshold of our choosing).

Because of the correlation between clusters and the Levenshtein Similarity measurement, we can avoid pairwise comparisons altogether. Because there are no pairwise comparisons, these algorithms can run in parallel using a MapReduce function in a Big Data Ecosystem (Hadoop), offering a solution to this Big Data problem that scales with the amount of hardware available. The final product will be a hash function that can return all clustered LUSEs very quickly for biology researchers to access in real time.

The final product is a searchable database to which evolutionary biologists can upload organism genomes and compare them against all other genomes already in the database.

The Levenshtein Similarity measurement calculates similarity between strings based on the minimum number of deletions, insertions, and substitutions it takes to get from one string to another [7].
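For reference, the edit distance behind this measurement can be computed with the standard dynamic-programming recurrence. The sketch below is illustrative only. In particular, the normalization in `levenshtein_similarity` (one minus the distance divided by the longer length) is an assumption; the poster does not state how distance is turned into a similarity percentage.

```python
def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_similarity("ACGT", "ACGA"))  # one substitution in four bases
```

Each comparison costs time proportional to the product of the two string lengths, which is why an exhaustive pairwise scan over a genome is expensive.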

Retrieved from: http://images.flatworldknowledge.com/ballgob/ballgob-fig19_015.jpg

Purpose of this approach:
    Work in Big Data Ecosystem
    Algorithm can run in parallel
    Scalable performance to amount of hardware available
    No pairwise comparison

Correlation between Levenshtein Distance and Texture Feature Measurement Methods

Texture Feature Measurement Method(s)         Correlation with Levenshtein Similarity
                                              (1 is perfectly correlated)
Contrast                                      0.8738
Homogeneity                                   0.4313
Entropy                                       0.7540
Dissimilarity                                 0.8691
Contrast & Homogen.                           0.8270
Homogen. & Entropy                            0.7884
Entropy & Dissim.                             0.8861
Contrast & Entropy                            0.8697
Contrast & Dissim.                            0.8737
Homogen. & Dissim.                            0.8198
Contrast, Homogen., & Entropy                 0.8648
Contrast, Homogen., & Dissim.                 0.8507
Contrast, Entropy, & Dissim.                  0.8986
Homogen., Entropy, & Dissim.                  0.8750
Contrast, Homogen., Entropy, & Dissim.        0.8880

The Co-Occurrence Matrix is created by counting how often each pair of grayscale pixel values occurs next to one another in a given image [4]. From the Co-Occurrence Matrix, we can generate features with existing methods [4]:

    Contrast
    Dissimilarity
    Homogeneity
    Entropy

These feature metrics reduce the co-occurrence matrix to values that can be measured or plotted against those of other images [4]. The graph details the correlation between Levenshtein Similarity and all possible combinations of the above feature metrics. The most correlated combination of metrics is Contrast, Entropy, and Dissimilarity, with a strong 0.8986 correlation (1 is perfectly correlated).
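As a rough illustration of how the four metrics are derived, the sketch below builds a normalized co-occurrence matrix and applies the commonly used definitions of contrast, dissimilarity, homogeneity, and entropy. Two simplifications are assumptions rather than details from the poster: only horizontally adjacent pixels are paired (the poster's figure suggests each pixel is compared against all neighbors), and each pair is counted in both directions so the matrix is symmetric.

```python
import math


def cooccurrence_matrix(image, levels=16):
    """Normalized, symmetric co-occurrence matrix of horizontally adjacent pixels."""
    counts = [[0] * levels for _ in range(levels)]
    for row in image:
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1
            counts[right][left] += 1  # count each pair in both directions
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("image has no adjacent pixel pairs")
    return [[c / total for c in r] for r in counts]


def texture_features(p):
    """Reduce a normalized co-occurrence matrix p to four scalar texture metrics."""
    contrast = dissimilarity = homogeneity = entropy = 0.0
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            if pij == 0.0:
                continue  # empty cells contribute nothing (and log(0) is undefined)
            d = i - j
            contrast += pij * d * d
            dissimilarity += pij * abs(d)
            homogeneity += pij / (1 + d * d)
            entropy -= pij * math.log(pij)
    return {"contrast": contrast, "dissimilarity": dissimilarity,
            "homogeneity": homogeneity, "entropy": entropy}


features = texture_features(cooccurrence_matrix([[4, 3, 14, 9], [4, 4, 3, 3]]))
print(features)
```

A feature vector such as (contrast, entropy, dissimilarity) is then just three of these values taken in a fixed order.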

[Figure: co-occurrence window over the image. Labels: Image; Window; Window Position 1; Window Position 2; pixel that is compared against all neighbors.]

Finding LUSE Overview

[Figure: query flow. Labels: User (Start/End); Query (Sequence); Feature Metric Calculation; Cluster with Similar Sequences.]
    User submits query sequence of at least 101 base pairs
    Feature metrics are generated from the query
    Metrics are plotted and clustered

These metrics can next be plotted in 3-dimensional space and clustered using the K-Means algorithm. Because of the strong correlation, each cluster will represent a group of sequences that fall within a measurable similarity threshold.
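The clustering step can be illustrated with a small, dependency-free version of Lloyd's K-Means iteration over 3-dimensional points. This is a sketch under stated assumptions: the function name `kmeans`, the random seeding of the initial centroids, and the iteration cap are all illustrative choices, and the poster does not say which K-Means implementation or value of k was used.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points (tuples of floats) with Lloyd's algorithm.

    Returns (assignment, centroids), where assignment[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random points
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


vectors = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
           (10.0, 10.0, 10.0), (10.1, 10.0, 10.0), (10.0, 10.1, 10.0)]
labels, centers = kmeans(vectors, k=2)
print(labels)
```

Because the three metrics have different ranges, the feature vectors would normally be normalized before clustering so that no single metric dominates the distance.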

[Figure: plot of the clustered feature vectors. Recoverable axis labels: Contrast, Entropy.]

Benefits to approach:
    MapReduce works in parallel => very fast:
        Linear time vs. exponential
        Same time cost to compare 1 vs. 1 and 1 vs. all
    Scalable to amount of hardware available:
        More nodes = better performance
    Setup can handle entire genomes to be compared at once
    Only need to run a sequence once; results will continue to be added as database grows

Potential Uses:
    Identify Ultra Conserved Elements (UCEs) [1]
    Identify evolutionarily significant mutations
    Potential for medical uses
        Disease diagnosis, genetic research, etc.
    Others

What's Next:
    Testing different clustering algorithms: soft clustering
    Implement and test Spark
    June publication

Yellow area is calculated, blank pixels are not

[1] Reneker J, Lyons E, Conant GC, Pires JC, Freeling M, Shyu CR, Korkin D. Proc Natl Acad Sci U S A. 2012 May 8;109(19):E1183-91. doi: 10.1073/pnas.1121356109. Epub 2012 Apr 10.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, Jan. 2008, pp. 107-113.
[3] Hadoop, http://hadoop.apache.org/
[4] Co-Occurrence Matrix, http://www.fp.ucalgary.ca/mhallbey/texture_calculations.htm
[5] Apache Spark, http://spark.apache.org/
[6] Apache HBase, http://hbase.apache.org/
[7] Levenshtein, Vladimir I. (February 1966). "Binary codes capable of correcting deletions, insertions, and reversals". Soviet Physics Doklady 10 (8): 707-710.

MapReduce, Hadoop, Spark, & HBase – Big Data Ecosystem

Retrieved from: http://hadoop.apache.org

MapReduce Overview

Cluster Setup
    10 Intel NUC computers
    1 Master Node:
        16GB RAM
        Dual Core 2.0GHz CPU
        1TB Hard Disk Space
        480GB Solid State Drive
    9 Compute Nodes:
        8GB RAM
        Dual Core 2.0GHz CPU
        1TB Hard Disk Space
        480GB Solid State Drive

Retrieved from: http://spark.apache.org

// Map Function 1: input <k,v>
//   k is the offset of the current file block (in bytes); v is a sequence in chromosome C
//   m and n are not defined on the poster; presumably m is the length of v and n is the subsequence length
v = P(v)                              // remove invalid characters
for i = 0 to m-n do {
    FV = generateFV(v[i to i+n])      // generate feature vector
    start_pos = i + k
    emit (FV, (start_pos, C))         // one pair per subsequence
}

// Reduce Function 1: input <k,v>
//   k is the feature vector (FV); v is the starting position of the subsequence w.r.t. the chromosome sequence
pos = merge(v)
return (k, pos)

// Map Function 2: input <k,v>
//   k is the feature vector; v is the list of positions matching the feature vector
k = normalize(k)                      // normalize data
return (k, v)

// Reduce Function 2: input <k,v>
//   k is the normalized feature vector; v is the list of starting positions
cl = kmean(k)                         // cluster data using k-means
return (cl, v)
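The first map and reduce steps can be simulated in memory to show the shape of the key-value pairs. This is a single-process sketch, not Hadoop code. The names `map_subsequences`, `reduce_positions`, and `base_counts` are hypothetical, and `base_counts` is only a stand-in for the poster's `generateFV` so that the example stays short; it does not compute texture features.

```python
from collections import defaultdict


def map_subsequences(offset, sequence, chromosome, window, feature_fn):
    """Map step 1: yield (feature vector, (start position, chromosome)) per window."""
    # Remove invalid characters, mirroring v = P(v) in the pseudocode.
    cleaned = "".join(b for b in sequence.upper() if b in "ACGT")
    for i in range(len(cleaned) - window + 1):
        yield feature_fn(cleaned[i:i + window]), (offset + i, chromosome)


def reduce_positions(pairs):
    """Reduce step 1: merge the positions that share a feature vector."""
    merged = defaultdict(list)
    for feature_vector, position in pairs:
        merged[feature_vector].append(position)
    return dict(merged)


def base_counts(subsequence):
    """Hypothetical stand-in for generateFV: counts of A, C, G, T."""
    return tuple(subsequence.count(b) for b in "ACGT")


pairs = map_subsequences(100, "AACGAA", "chr1", 3, base_counts)
table = reduce_positions(pairs)
print(table[(1, 1, 1, 0)])  # windows ACG and CGA share this stand-in vector
```

As in the pseudocode, positions are computed after invalid characters are removed, so they index the cleaned sequence rather than the raw file block.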

[Figure: MapReduce data flow. Labels: Original Data (Sequence); Master Node; Mapper 1 ... Mapper n, each emitting <FV, (Ch ID, Pos)> (co-occurrence matrix FV calculated); Reducer 1 ... Reducer n, each emitting <FV, (List of Pos IDs)> (aggregate elements with matching FV); Output to HBase.]

Retrieved from: http://hbase.apache.org

Identifying Long Ultra Similar Elements (LUSEs) in Genomes Using Image Based Texture Co-Occurrence Matrix

Devin Petersohn (1) and Chi-Ren Shyu (Mentor) (1,2)

(1) Department of Computer Science, College of Engineering, (2) MU Informatics Institute, University of Missouri

References

HBase Table Schema

Feature Vector                        Table of Ordered Pairs (Ch ID, Pos)    K Mean Cluster ID (Calculated 2nd Iteration)
<Contrast, Entropy, Dissimilarity>    (Ch ID 1, Pos 1)                       K Mean Cluster ID
                                      (Ch ID 2, Pos 2)
                                      ...
                                      (Ch ID n, Pos n)
....                                  ....                                   ....
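One way to picture a row of this schema is as a small record type. The sketch below is only a data-structure illustration in plain Python; the class name `FeatureRow` and its field names are hypothetical, and it does not use or imitate the HBase client API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FeatureRow:
    """One row of the schema: a feature vector and everywhere it occurs."""
    feature_vector: Tuple[float, float, float]  # (contrast, entropy, dissimilarity)
    positions: List[Tuple[str, int]] = field(default_factory=list)  # (Ch ID, Pos)
    cluster_id: Optional[int] = None  # unknown until the second iteration runs


row = FeatureRow(feature_vector=(12.5, 3.1, 2.8))
row.positions.append(("chr1", 1024))
row.positions.append(("chr7", 88))
print(row)
```

Keeping the cluster ID optional reflects the schema note that it is calculated in the second iteration, after the feature vectors and positions have already been stored.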

Shuffling  

Acknowledgements

This project was sponsored by the MU College of Engineering Undergraduate Honors Research Program

Undergraduate Research Forum – Spring 2014

[Figure: Running Time for 1st MapReduce Function on a 6 Node Cluster. X axis: Number of Base Pairs (in Millions), 0 to 2,250. Y axis: Time (minutes), 0 to 200.]