cancer genomics big_datascience_meetup_july_14_2014

Java and Scala for Cancer Genomics, by Ayush Sarkar, Irvington High School, July 14, 2014

Page 1: Cancer genomics big_datascience_meetup_july_14_2014

Java and Scala for Cancer Genomics

By

Ayush Sarkar

Irvington High School

July 14, 2014

Page 2: Cancer genomics big_datascience_meetup_july_14_2014

Comparing Reference Genome and Subject Genome

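[Figure: reference genome (R) compared with subject genome (S)]

The subject genome contains gaps and mutations relative to the reference genome. Analysis should be done to identify these gaps and mutations.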
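To make the idea of "gaps and mutations" concrete, here is a minimal Java sketch. It assumes the two sequences have already been aligned to equal length and that '-' marks a gap; the class name, method name, and example sequences are hypothetical and do not come from the slides.

```java
import java.util.ArrayList;
import java.util.List;

class AlignedComparison {

    /** Lists every position where two pre-aligned sequences differ. */
    static List<String> findDifferences(String reference, String subject) {
        if (reference.length() != subject.length()) {
            throw new IllegalArgumentException("Aligned sequences must have equal length");
        }
        List<String> differences = new ArrayList<String>();
        for (int i = 0; i < reference.length(); i++) {
            char r = reference.charAt(i);
            char s = subject.charAt(i);
            if (r == s) {
                continue; // identical at this position
            }
            if (r == '-' || s == '-') {
                differences.add("gap at " + i);
            } else {
                differences.add("mutation at " + i + ": " + r + ">" + s);
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // Hypothetical, already-aligned sequences; '-' marks a gap.
        String reference = "ACGTTAGC";
        String subject   = "ACCT-AGC";
        for (String difference : findDifferences(reference, subject)) {
            System.out.println(difference);
        }
    }
}
```

Real genomes are not aligned in advance, which is why the alignment step shown later in the slides is needed before a comparison like this one.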

Page 3: Cancer genomics big_datascience_meetup_july_14_2014

First, we walk through creating and executing an example Java program.

Next, we look at Java code in BioJava (an open-source project for bioinformatics).

BioJava Location: http://biojava.org/wiki/Main_Page

Page 4: Cancer genomics big_datascience_meetup_july_14_2014

An Example Java Program to Check Validity and to Compare Two Strings

• In this example we examine the SerialNumber class, which is used by the Home Software Company to validate software serial numbers. A valid software serial number has the form LLLLL-DDDD-LLLL, where L indicates an alphabetic letter and D indicates a numeric digit. For example, WRXTQ-7786-PGVZ is a valid serial number. Notice that a serial number consists of three groups of characters, delimited by hyphens.

• After its validity is checked, a serial number assigned to a customer is compared with a serial number stored in a database to check for equality.

• The steps in this example are similar to those of DNA sequence alignment.

Page 5: Cancer genomics big_datascience_meetup_july_14_2014

SerialNumber Class Definition

The fields first, second, and third are used to hold the first, second, and third groups of characters in a serial number. The valid field is set to true by the constructor to indicate a valid serial number, or false to indicate an invalid serial number.

[Figure: SerialNumber class code, annotated with the labels "Class Instance Constructor", "General Methods", and "Internal Variables (Instance Variables)"]

Page 6: Cancer genomics big_datascience_meetup_july_14_2014

Method Description for the Class

• Constructor: The constructor accepts a string argument that contains a serial number. The string is tokenized and its tokens are stored in the first, second, and third fields. The validate method is called.

• isValid: This method returns the value in the valid field.

• validate: This method calls the isFirstGroupValid, isSecondGroupValid, and isThirdGroupValid methods to validate the first, second, and third fields.

• isFirstGroupValid: This method returns true if the value stored in the first field is valid. Otherwise, it returns false.

• isSecondGroupValid: This method returns true if the value stored in the second field is valid. Otherwise, it returns false.

• isThirdGroupValid: This method returns true if the value stored in the third field is valid. Otherwise, it returns false.

• EqualityTest: This method is called to check whether the serial number is equal to a serial number in the database.
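The methods described above could be sketched as follows. This is not the code shown in the talk (the slides elide the class body); it is a minimal reconstruction from the method descriptions. The equalityTest signature, which takes the stored serial number as a plain string, and the main method placed inside the class for brevity are both assumptions.

```java
import java.util.StringTokenizer;

class SerialNumber {
    private String first;   // first group: five letters
    private String second;  // second group: four digits
    private String third;   // third group: four letters
    private boolean valid;  // true only if all three groups are valid

    SerialNumber(String sn) {
        // Split on hyphens. StringTokenizer skips empty tokens, so this
        // sketch does not reject stray leading or trailing hyphens.
        StringTokenizer st = new StringTokenizer(sn, "-");
        if (st.countTokens() == 3) {
            first = st.nextToken();
            second = st.nextToken();
            third = st.nextToken();
            validate();
        } else {
            valid = false;
        }
    }

    boolean isValid() {
        return valid;
    }

    private void validate() {
        valid = isFirstGroupValid() && isSecondGroupValid() && isThirdGroupValid();
    }

    private boolean isFirstGroupValid() {
        return first.length() == 5 && allLetters(first);
    }

    private boolean isSecondGroupValid() {
        if (second.length() != 4) {
            return false;
        }
        for (int i = 0; i < second.length(); i++) {
            if (!Character.isDigit(second.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    private boolean isThirdGroupValid() {
        return third.length() == 4 && allLetters(third);
    }

    private static boolean allLetters(String group) {
        for (int i = 0; i < group.length(); i++) {
            if (!Character.isLetter(group.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Assumed signature: compares this serial number with a stored one. */
    boolean equalityTest(String stored) {
        return valid && (first + "-" + second + "-" + third).equals(stored);
    }

    public static void main(String[] args) {
        SerialNumber sn = new SerialNumber("WRXTQ-7786-PGVZ");
        System.out.println("valid: " + sn.isValid());
        System.out.println("equal: " + sn.equalityTest("WRXTQ-7786-PGVZ"));
    }
}
```

The parallel with sequence alignment is loose but useful: first check that the input is well formed, then compare it position by position against a stored reference.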

Page 7: Cancer genomics big_datascience_meetup_july_14_2014

SerialNumber Class (without main method) and SerialNumberTester Class (with main method)

Use Eclipse to create the SerialNumber.java file:

import java.util.StringTokenizer;

public class SerialNumber
{
    …
    …
}

Use Eclipse to create SerialNumberTester.java under the same project:

public class SerialNumberTester
{
    public static void main(String[] args)
    {
        …
        …
    }
}

View the details of the classes in the Eclipse Integrated Development Environment and execute them.

Page 8: Cancer genomics big_datascience_meetup_july_14_2014

Using Eclipse to Create, Execute and Debug Java and Scala programs

1. Get Eclipse from: http://www.eclipse.org/downloads/
2. Unzip and install it on your laptop;
3. Install Java version 1.6 or 1.7;
4. Create a Project (from the File tab at the top) in Eclipse;
5. Create a Java Class under the project;
6. Define methods and variables for the class;
7. Import necessary packages;
8. Compile and execute the class created;
9. Try to debug using the debug windows and commands;
10. It is easy!

Page 9: Cancer genomics big_datascience_meetup_july_14_2014

BioJava

BioJava is an open-source project dedicated to providing a Java framework for processing biological data. It includes objects for manipulating biological sequences, file parsers, access to BioSQL and Ensembl databases, tools for making sequence analysis GUIs, and powerful analysis and statistical routines, including a dynamic programming toolkit.

BioJava takes part in Google Summer of Code as part of the OBF, the Open Bioinformatics Foundation. Please visit: https://developers.google.com/open-source/soc/?csw=1

Page 10: Cancer genomics big_datascience_meetup_july_14_2014

BioJava

The core sequence classes:

• AbstractSequence
    • DNASequence
        • ChromosomeSequence
        • GeneSequence
        • IntronSequence
        • ExonSequence
        • TranscriptSequence
    • RNASequence
    • ProteinSequence

By using the Sequence interface, one can easily extend the concept of local sequence storage in a FASTA file (a sequence file format) to loading the sequence from UniProt (a protein database available over the internet) or NCBI (a genome database available over the internet) based on an accession ID.

ProteinSequence proteinSequence = new ProteinSequence("ARNDCEQGHILKMFPSTWYVBZJX");

DNASequence dnaSequence = new DNASequence("ATCG");

// Alternatively, load the protein sequence from UniProt instead of a literal string
UniprotProxySequenceReader<AminoAcidCompound> uniprotSequence =
    new UniprotProxySequenceReader<AminoAcidCompound>("YA745_GIBZE", AminoAcidCompoundSet.getAminoAcidCompoundSet());
ProteinSequence proteinSequence = new ProteinSequence(uniprotSequence);

Page 11: Cancer genomics big_datascience_meetup_july_14_2014

BioJava

DNA translation follows the normal biological flow: a portion of DNA (assumed to be a CDS, or coding sequence) is transcribed to mRNA, which is then translated to a protein sequence using codons.

ProteinSequence protein = new DNASequence("ATG").getRNASequence().getProteinSequence();
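To show what happens conceptually behind that one-line call, here is a plain-Java sketch of transcription and translation. It is not BioJava's implementation. The class and method names are hypothetical, and the codon table is deliberately partial (the full genetic code has 64 codons), so unknown codons are reported as 'X'.

```java
import java.util.HashMap;
import java.util.Map;

class TranslationSketch {
    // Partial codon table, for illustration only.
    private static final Map<String, Character> CODONS = new HashMap<String, Character>();
    static {
        CODONS.put("AUG", 'M'); // methionine (start codon)
        CODONS.put("UUU", 'F'); // phenylalanine
        CODONS.put("GGC", 'G'); // glycine
        CODONS.put("GCU", 'A'); // alanine
        CODONS.put("UAA", '*'); // stop
        CODONS.put("UAG", '*'); // stop
        CODONS.put("UGA", '*'); // stop
    }

    /** Transcribes coding-strand DNA to mRNA by replacing T with U. */
    static String transcribe(String dna) {
        return dna.toUpperCase().replace('T', 'U');
    }

    /** Translates mRNA to one-letter amino acid codes, stopping at a stop codon. */
    static String translate(String mrna) {
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= mrna.length(); i += 3) {
            Character aminoAcid = CODONS.get(mrna.substring(i, i + 3));
            if (aminoAcid == null) {
                protein.append('X'); // codon missing from the partial table
            } else if (aminoAcid == '*') {
                break;               // stop codon ends translation
            } else {
                protein.append(aminoAcid);
            }
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        String mrna = transcribe("ATGTTTGGCTAA");
        System.out.println(mrna);            // AUGUUUGGCUAA
        System.out.println(translate(mrna)); // MFG
    }
}
```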

The BioJava sequence I/O code is designed to be flexible and easy to adapt for a wide variety of purposes. All methods take a Java BufferedReader object, and return an iterator which allows you to scan through the sequences in a file. For example:

BufferedReader br = new BufferedReader(new FileReader(fileName));
SequenceIterator stream = SeqIOTools.readFastaDNA(br);
while (stream.hasNext()) {
    Sequence seq = stream.nextSequence();
    // do something with the sequence.
}
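For readers without BioJava installed, the same reading pattern can be sketched with the standard library alone. The sketch below assumes the usual FASTA layout (a header line starting with '>' followed by one or more sequence lines). Unlike the iterator-based BioJava call, it collects all records into a map, which is simpler but unsuitable for very large files. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

class FastaSketch {

    /** Reads FASTA records into a map from header (without '>') to sequence. */
    static Map<String, String> readFasta(BufferedReader br) throws IOException {
        Map<String, String> records = new LinkedHashMap<String, String>();
        String header = null;
        StringBuilder sequence = new StringBuilder();
        String line;
        while ((line = br.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            if (line.startsWith(">")) {
                if (header != null) {
                    records.put(header, sequence.toString()); // finish previous record
                }
                header = line.substring(1);
                sequence.setLength(0);
            } else if (header != null) {
                sequence.append(line); // sequence may span several lines
            }
        }
        if (header != null) {
            records.put(header, sequence.toString()); // finish last record
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical in-memory FASTA content, used instead of a file.
        String fasta = ">seq1\nACGT\nTTGA\n>seq2\nGGCC\n";
        BufferedReader br = new BufferedReader(new StringReader(fasta));
        for (Map.Entry<String, String> record : readFasta(br).entrySet()) {
            System.out.println(record.getKey() + ": " + record.getValue());
        }
    }
}
```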

Page 12: Cancer genomics big_datascience_meetup_july_14_2014

Scala vs. Java -- Scala runs on the Java Virtual Machine

Java:

List<Integer> iList = Arrays.asList(2, 7, 9, 8, 10);
List<Integer> iDoubled = new ArrayList<Integer>();
for (Integer number : iList) {
    if (number % 2 == 0) {
        iDoubled.add(number * 2);
    }
}

Scala:

val iList = List(2, 7, 9, 8, 10)
val iDoubled = iList.filter(_ % 2 == 0).map(_ * 2)

Scala:

object HelloWorld {
  def main(args: Array[String]) {
    println("Hello, world!")
  }
}

Java:

public class HelloWorldApp {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}

Page 13: Cancer genomics big_datascience_meetup_july_14_2014

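Calculating a local Alignment -- Java code using Java packages

import org.biojava3.alignment.Alignments;
import org.biojava3.alignment.Alignments.PairwiseSequenceAlignerType;
import org.biojava3.alignment.SimpleGapPenalty;
import org.biojava3.alignment.SubstitutionMatrixHelper;
import org.biojava3.alignment.template.SequencePair;
import org.biojava3.alignment.template.SubstitutionMatrix;
import org.biojava3.core.sequence.DNASequence;
import org.biojava3.core.sequence.compound.AmbiguityDNACompoundSet;
import org.biojava3.core.sequence.compound.NucleotideCompound;

public class PSA_DNA {
    public static void main(String[] args) {
        // Sequences to align
        String targetSeq = "CACGTTTCTTGTGGCAGCTTAAGTTT";
        DNASequence target = new DNASequence(targetSeq,
                AmbiguityDNACompoundSet.getDNACompoundSet());

        String querySeq = "ACGAGTGCGTGTTTTCCCGCCTGGTC";
        DNASequence query = new DNASequence(querySeq,
                AmbiguityDNACompoundSet.getDNACompoundSet());

        // Scoring: nucleotide substitution matrix and gap penalties
        SubstitutionMatrix<NucleotideCompound> matrix = SubstitutionMatrixHelper.getNuc4_4();
        SimpleGapPenalty gapP = new SimpleGapPenalty();
        gapP.setOpenPenalty((short) 5);
        gapP.setExtensionPenalty((short) 2);

        // Local pairwise alignment of query against target
        SequencePair<DNASequence, NucleotideCompound> psa =
                Alignments.getPairwiseAlignment(query, target,
                        PairwiseSequenceAlignerType.LOCAL, gapP, matrix);

        System.out.println(psa);
    }
}

[Slide callouts: Class Definition, Variable Definitions, Variable Definitions with Method Call, Method Calls]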
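To give a feel for what a local aligner computes, here is a plain-Java sketch of Smith-Waterman scoring, the classic dynamic-programming method for local alignment. It is a simplification and not BioJava's implementation: it returns only the best score, uses a single linear gap penalty rather than separate open and extension penalties, and its match and mismatch values are chosen for illustration. The class and method names are hypothetical.

```java
class LocalAlignmentSketch {

    /**
     * Returns the best local alignment score between a and b.
     * match and mismatch are added to the score; gap is subtracted per gap position.
     */
    static int localAlignmentScore(String a, String b, int match, int mismatch, int gap) {
        // h[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int substitution = (a.charAt(i - 1) == b.charAt(j - 1)) ? match : mismatch;
                int diagonal = h[i - 1][j - 1] + substitution; // align the two characters
                int up = h[i - 1][j] - gap;                    // gap in b
                int left = h[i][j - 1] - gap;                  // gap in a
                // A local alignment may restart anywhere, so scores never drop below zero.
                h[i][j] = Math.max(0, Math.max(diagonal, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Sequences taken from the Java slide; scoring values are illustrative assumptions.
        String target = "CACGTTTCTTGTGGCAGCTTAAGTTT";
        String query = "ACGAGTGCGTGTTTTCCCGCCTGGTC";
        System.out.println(localAlignmentScore(query, target, 5, -4, 5));
    }
}
```

A full aligner would also trace back through the matrix from the best-scoring cell to recover the aligned subsequences, which is what the printed SequencePair in the BioJava code represents.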

Page 14: Cancer genomics big_datascience_meetup_july_14_2014

Calculating a local Alignment -- Scala code using Java packages

import org.biojava3.alignment.{Alignments, SimpleGapPenalty, SubstitutionMatrixHelper}
import org.biojava3.alignment.Alignments.PairwiseSequenceAlignerType.LOCAL
import org.biojava3.core.sequence.DNASequence
import org.biojava3.core.sequence.compound.AmbiguityDNACompoundSet

object PSA_DNA {
  // Implicit conversion from String to DNASequence
  implicit def str2DNA(seq: String): DNASequence =
    new DNASequence(seq, AmbiguityDNACompoundSet.getDNACompoundSet)

  def main(args: Array[String]) {
    // Note the implicit conversion from strings to DNASequence
    val target: DNASequence = "CACGTTTCTTGTGGCAGCTTAAGTTTGAAT"
    val query: DNASequence = "ACGAGTGCGTGTTTTCCCGCCTGGTCCCCA"

    val matrix = SubstitutionMatrixHelper.getNuc4_4()

    val gapP = new SimpleGapPenalty()
    gapP.setOpenPenalty(5)
    gapP.setExtensionPenalty(2)

    val psa = Alignments.getPairwiseAlignment(query, target, LOCAL, gapP, matrix)

    println(psa)
  }
}

[Slide callouts: Java Packages, Implicit Method, Variable Definitions, Variable Definitions with Method Call, Method Calls]

Page 15: Cancer genomics big_datascience_meetup_july_14_2014

Thank You!

E-mail: [email protected]

Watch the Java and Scala “Hello World” programs execute in Eclipse!