BioinfRes SoSe 17

Bioinformatics Resources - Structural Resources / SQL

Lecture & Exercises: Prof. B. Rost, Dr. L. Richter, J. Reeb

Institut für Informatik I12

Preliminary Schedule

Apr 28th   Intro, General Overview (1. sh.)
May 5th    Sequence Databases (2. sh.)
May 12th   Sequence Databases (3. sh.)
May 19th   Structure Databases (4. sh.)
May 26th   No Lecture
Jun 2nd    SQL (5. sh.)
Jun 9th    SQL, NoSql (6. sh.)
Jun 16th   No Lecture
Jun 23rd   NoSql 2 (7. sh.)
Jun 30th   MongoDB, JavaScript (8. sh.)
Jul 7th    Node.js Applications (9. sh.)
Jul 14th   PredictProtein
Jul 21st   Wrap Up, Q&A
Jul 28th   Exam

* These exercises can earn you a bonus

Orga - Exam Date

●  Exam scheduled for Friday, Jul 28th
●  Time: 16:30-18:00
●  Room: MW 0350 Egbert-von-Hoyer Lecture Hall (Mechanical Engineering Building)
●  Registration is MANDATORY
●  so far 6 students registered

Secondary Databases

●  Databases which digest and structure data from primary databases
●  Not always "true" database systems
●  SCOP / CATH
●  PFAM
●  PROSITE

Classification of Structures: CATH-Gene3D / SCOP

●  came up in the middle of the 1990s
●  both are quite similar
●  aim: organize the protein structures available in PDB, based on single domains
●  hierarchical system (roughly):
   -  secondary structure content
   -  fold
   -  superfamilies
   -  families

SCOP: a Structural Classification of Proteins

●  Murzin, A., Brenner, S. E., Hubbard, T. J. P. and Chothia, C. (1995) J. Mol. Biol., 247, 536-540
●  Hubbard, T. P., Murzin, A., Brenner, S. E. and Chothia, C. (1997), Nucl. Acids Res. 25(1), 236-239 (easier to obtain)
●  fully manually curated, driven by expert analysis
●  associated with the ASTRAL compendium
●  latest news: SCOPe (UC Berkeley), SCOP2 (MRC Lab Mol Biol, Cambridge, UK)

SCOP: a Structural Classification of Proteins

●  J.-M. Chandonia, et al., SCOPe: Manual Curation and Artifact Removal in the Structural Classification of Proteins – extended Database, J. Mol. Biol. (2016), http://dx.doi.org/10.1016/j.jmb.2016.11.023
●  A. Andreeva, D. Howorth, C. Chothia, E. Kulesha, A. Murzin. SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res. 2014 Jan 1; 42(Database issue): D310–D314. Published online 2013 Nov 29. doi: 10.1093/nar/gkt1242

Hierarchical Level

1.  Classes: Consider secondary structure composition (all α, all β, α/β, α+β, multi-domain, membrane / cell surface / peptides, ...)
2.  Fold: Shape of a domain. Proteins of the same fold have the same major secondary structure elements in the same arrangement with the same topological features
3.  Superfamily: Groups of domains which have at least a distant common ancestor

Hierarchical Level

4.  Family: Groups within superfamilies with a more recent common ancestor (> 30% sequence identity, or > 15% seq. id. plus same function)
5.  Protein domain: Groups within families, essentially the same protein (isoform, the same protein but from different species)
6.  Species: Protein domains according to species
7.  Domain: the single domain

Development starting from year 2000

taken from http://scop.berkeley.edu/help/ver=2.06#scopchanges

CATH - Faces

taken from http://www.ebi.ac.uk/about/people/janet-thornton

taken from http://www.tgac.ac.uk/scientific-advisory-board/

Publications

●  Sillitoe I, Lewis TE, Cuff AL, Das S, Ashford P, Dawson NL, Furnham N, Laskowski RA, Lee D, Lees J, Lehtinen S, Studer R, Thornton JM, Orengo CA. CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res. 2015 Jan. doi: 10.1093/nar/gku947
●  Lam SD, Dawson NL, Das S, Sillitoe I, Ashford P, Lee D, Lehtinen S, Orengo CA, Lees JG. Gene3D: expanding the utility of domain assignments. Nucleic Acids Res. 2016 Jan. doi: 10.1093/nar/gkv1231

CATH

●  semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures
●  four main levels:
   -  C: protein class, mainly secondary structure composition of each domain
   -  A: architecture, summarizes shapes based on orientation of secondary structure elements
   -  T: topology, sequential connectivity is considered
   -  H: homologous superfamily, high similarity with similar functions, evolutionary relationship assumed

some nine highly populated families ('superfolds' [1]), with important implications for prediction algorithms, and it illustrated the insights to be gained from ordering the data in this way.

Several other groups have also classified the known structures, focusing on a variety of local and global topological features and employing a range of algorithms (structure comparison algorithms and classification generally are reviewed in [13–16]). The SCOP database, developed by Murzin et al. [17], groups proteins having significant sequence similarity into homologous families, whereas more distant structural similarities are largely identified manually. This database places emphasis on evolutionary relationships and information from the literature relating to well-studied fold families is also incorporated (e.g. the β trefoils [18] and the OB fold [19]). By contrast Holm and Sander, use the structure comparison algorithm DALI to recognise structural neighbours, whether motif or fold based, without formally ordering proteins in the PDB into families [20]. The ENTREZ database of Hogue et al. [21], uses a similar approach to DALI, listing neighbours by a fast vector-based comparison algorithm (VAST).

The task of defining structural relationships is further complicated by the existence of multidomain proteins; more than 30% of non-identical structures in the current PDB contain two or more domains. A number of domain recognition algorithms have appeared recently to address this problem [22–26]. The 3Dee database of Siddiqui and Barton (http://snail.biop.ox.ac.uk:8080/3Dee) separates the constituent folds of multidomain proteins using the DOMAK algorithm. Similarly, Sowdhamini et al. have constructed a database of single domain families [27], using the domain recognition algorithm DIAL [26] and the structural comparison procedure SEA [28]. Both databases contain data that is generated largely automatically, but is subsequently checked and where appropriate reordered manually.

In recognition of the need to regularly maintain and update data on structural relatives, we have further developed our automatic procedures for identifying and classifying structural families [6] to construct a database of single-domain fold families. Any multidomain proteins are first divided into their constituent domain folds by an automatic consensus procedure which is in agreement between three independent algorithms (SJ et al. unpublished data). As well as clustering proteins by sequence and structure, recognised families are also grouped according to similarity in protein class (i.e. secondary structure composition and contacts). Finally, the architecture (shape, defined by the assembly of secondary structures, regardless of their connectivity) adopted by each protein fold, is assigned manually. Although this is a somewhat subjective process, based largely on commonly used descriptions in the literature (e.g. sandwich, barrel and propellor), it is an essential first step towards ordering the known folds in a useful and practical way.

1094 Structure 1997, Vol 5 No 8

Figure 1. Annual increase in the numbers of protein domain structures in the PDB (top plot, [11,12]). The lower lines show the numbers of identical families (I-level, 100% sequence identity between structures within the family and 100% overlap), non-identical families (N-level, > 95% sequence identity, 85% overlap), sequence families (S-level, > 35% sequence identity, 60% overlap), homologous superfamilies (H-level, > 25% sequence identity, SSAP > 80 and 60% overlap), and topological or fold families (T-level, SSAP > 70), where SSAP is a structural comparison score.

[Plot: counts by deposition date, '85 to '96; series: Domain, Identical, Non-identical, Sequence family, Homologous superfamily, Topology. Inset: domain fold distribution, 1985–95.]

from Structure 15, August 1997, 5:1093–1108 http://biomednet.com/elecref/0969212600501093

Current Release

●  CATH DB version: 4.0
●  235,000 domains
●  25 mio protein predictions
●  new:
   -  improved prediction of functional families
   -  current putative domain assignments (CATH-B)
   -  CATH-40: a non-redundant set of CATH domains for homology benchmarking experiments (< 40% seq. id with 60% overlap)
●  http://www.cathdb.info/wiki/doku/?id=release_notes#cath_release_notes

Numbering Scheme

●  C: 1, 2, 3, 4 (alpha, beta, alpha/beta, none) (4)
●  A: same architecture, different topology (40)
●  T: Topology (connection of secondary structure elements) (1373)
●  H: Homology (families) (2737)
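
The four levels can be read as the parts of one hierarchical identifier. The following is only a sketch: it assumes the levels are written as a dotted string C.A.T.H, and the identifier used, the class `CathCode` and the helpers `parse_cath` and `prefix` are illustrative names, not part of any CATH tool.

```python
from typing import NamedTuple

# Class labels exactly as listed on the slide.
CLASS_LABELS = {1: "alpha", 2: "beta", 3: "alpha/beta", 4: "none"}


class CathCode(NamedTuple):
    c: int  # class
    a: int  # architecture
    t: int  # topology
    h: int  # homologous superfamily


def parse_cath(code: str) -> CathCode:
    """Split a dotted C.A.T.H string into its four integer levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {len(parts)}: {code!r}")
    return CathCode(*(int(p) for p in parts))


def prefix(code: CathCode, depth: int) -> str:
    """Return the identifier truncated to the first `depth` levels."""
    return ".".join(str(level) for level in code[:depth])


example = parse_cath("3.40.50.300")  # illustrative identifier (assumption)
print(CLASS_LABELS[example.c])       # class label of the example
print(prefix(example, 2))            # architecture-level identifier
```

Two domains share an architecture when their depth-2 prefixes agree, and a topology when their depth-3 prefixes agree.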

appear less distinct and may reflect the tolerance of helix packing modes that allows diverse combinations of two- and three-helix motifs. This gives rise to a continuum of folds within which helix packing angles range from aligned through to orthogonal. Despite this variety, certain motifs appear to recur frequently: the aligned α hairpin and the two-helix and three-helix orthogonal motifs common in the repressor and globin-like folds. Therefore, in this class, it may ultimately be more appropriate to separate fold families into architectural groups that reflect specific combinations of these common motifs.

By contrast to the mainly α class, in the mainly β class, the constraints on β strands to be hydrogen bonded within sheets and also on sheet–sheet packing gives rise to some very distinct and easily recognisable architectures. In particular, the β prisms, β propellors and β solenoids demonstrate the symmetry and regularity of structures satisfying these preferred packing constraints. In contrast to the few architectures observed within the mainly α class, at least 16 different, relatively simple, architectures can be discerned in the mainly β class.

The diversity of the mainly β class is not currently observed within the α−β class, in which only eight regular architectures are apparent to date. This may simply reflect a bias in the structures determined or could suggest that in this class the preferred motifs are more constrained in the ways in which they combine. The βαβ motif appears to be highly favoured and is observed within a large proportion of folds. In some topologies, the β strands are adjacent in space (classic motif) but in others they are separated by a third antiparallel strand, forming a three-stranded β sheet (split motif) [31]. Although both the classic and the split βαβ motifs are most commonly found in two and three-layer architectures, the classic motif is also found to recur within barrel and semi-barrel or horseshoe architectures (Figure 4).

The structures that fall outside these rather simple layer architectures tend to be quite complex. Compared to the

1098 Structure 1997, Vol 5 No 8

Table 2

The numbers of fold families (T-level), homologous superfamilies (H-level) and domain structures in different architectures are shown for the mainly α, mainly β and α−β classes.

Class     Architecture                 T-levels     % T* H-levels     % H*  Domains  % dom.*

Mainly α  Non-bundle                         86    17.03       93    14.42     1455    18.01
          Bundle                             34     6.73       39     6.05      226     2.80
          Few SS                             25     4.95       25     3.88      112     1.39

Mainly β  Ribbon                             17     3.37       17     2.64      114     1.41
          Single sheet                        5     0.99        6     0.93       56     0.69
          Roll                                6     1.19        6     0.93       55     0.68
          Barrel                             22     4.36       29     4.50      861    10.66
          Clam                                1     0.20        1     0.16        1     0.01
          Sandwich                           21     4.16       43     6.67     1236    15.30
          Distorted sandwich                 14     2.77       14     2.17       83     1.03
          Trefoil                             1     0.20        4     0.62       49     0.61
          Orthogonal prism                    1     0.20        1     0.16        4     0.05
          Aligned prism                       1     0.20        2     0.31        3     0.04
          Four-propellor                      1     0.20        1     0.16        3     0.04
          Six-propellor                       1     0.20        1     0.16       37     0.46
          Seven-propellor                     2     0.40        2     0.31       11     0.14
          Eight-propellor                     1     0.20        1     0.16        2     0.02
          Two-solenoid                        2     0.40        3     0.47        5     0.06
          Three-solenoid                      1     0.20        1     0.16        1     0.01
          Complex                             5     0.99        5     0.78      104     1.29

α–β       Roll                               24     4.75       33     5.12      469     5.81
          Barrel                              8     1.58       20     3.10      365     4.52
          Two-layer sandwich                 77    15.25      112    17.36      957    11.85
          Three-layer (αβα) sandwich         78    15.45      115    17.83     1396    17.28
          Three-layer (ββα) sandwich          3     0.59        3     0.47       11     0.14
          Four-layer sandwich                 4     0.79        4     0.62       12     0.15
          Box                                 1     0.20        1     0.16        2     0.02
          Horseshoe                           1     0.20        1     0.16        1     0.01
          Complex                            34     6.73       34     5.27      253     3.13
          Few SS                             14     2.77       14     2.17       96     1.19

Few SS    Irregular                          14     2.77       14     2.17       98     1.21

*The percentages of total fold families (% T), total homologous superfamilies (% H) and total domain structures (% dom.) adopting a particular architecture are shown.

Pfam

●  current version is 31.0, March 2017, 16712 families in 604 clans
●  hosted by the EBI
●  Citation: "The Pfam protein families database: towards a more sustainable future" Nucl. Acids Res. (04 January 2016) 44(D1): D279-D285. doi: 10.1093/nar/gkv1344

Pfam

●  Pfam-A: curated seed alignment derived from Pfamseq (UniProtKB based), profile HMMs for the seed alignment, full alignment with all HMM detected sequences
●  Pfam-B: un-annotated, automatically generated from non-redundant clusters from ADDA
●  focuses on single domains

●  Family: collection of related protein regions
●  Domain: structural unit
●  Repeat: short unit which is unstable in isolation but forms a stable structure when found in multiple copies
●  Motif: short unit found outside globular domains
●  Clans: related group of Pfam entries based on similarity in sequence, structure or profile-HMM

Pfam Numbers (rel. 31)

●  16712 Pfam-A families
●  36% of the families are classified into 604 clans
●  the Pfam-A release matches 73% of the 26.7 mio sequences in the corresponding UniProt reference proteome database
●  coverage of 90.5% of SwissProt human
●  use of jackhmmer (from HMMER3 package)
●  consider CATH and PDB

does not currently present a scalability problem, aiding human interpretation through visualization has become increasingly difficult. Most approaches for facilitating alignment visualization natively in the browser do not scale well. Applets, such as the Jalview alignment viewer (12), partly solve the problem, but require Java to be installed and coupled to the browser.

For example, the largest Pfam-A family (version 27.0) with >363 000 matches to the profile HMM is the ABC transporters family (ABC_tran, accession PF00005); its full alignment is thus too large to be useful for most purposes. The seed alignment, by contrast, contains just 55 representative sequences, which may be an insufficient number to represent the sequence diversity within the family. To provide more useable samples of the sequence diversity within a family, we now calculate model-matches for four additional sequence sets, based on 'Representative Proteomes' (RPs) (13). For the ABC_tran family, the RP alignments range in size from approximately a quarter of the size of the full alignment to less than one tenth.

In an RP set, each member proteome is selected from a grouping of similar proteomes. The selected proteome is chosen to best represent the set of grouped proteomes in terms of both sequence and annotation information. The grouping of proteomes is based on a clustering of UniProt, UniRef50, and includes all complete proteome sequences. In each cluster, sequences have ≥50% identity and have at least an 80% overlap with the longest sequence. The similarity of two proteomes is determined by considering just the clusters containing sequences from either of the two proteomes. The two proteomes are grouped when the fraction of clusters that contain sequences from both proteomes out of the subset of proteome-specific clusters exceeds a given threshold. This threshold is termed the co-membership threshold. The percentage threshold of co-membership (or common clusters) can be adjusted down to produce larger groupings, and hence less redundant sequence sets.

We use the RP sequence sets constructed using co-membership thresholds of 75, 55, 35 and 15%, giving a range of sequence redundancy for each family. Using representative proteomes has the advantage that it still allows for organism-specific copy numbers to be assessed, a feature that can be lost when using global non-redundancy thresholds on an entire sequence database. However, the major advantage for Pfam is the dramatic reduction in the size of the family full alignments, as shown in Table 1, which illustrates the reductions with increasingly redundant RPs for the 10 biggest families in Pfam. The RP sets do not currently include viruses, and so for some families such as GP120, there may not be a match to the RP sets.

The reduction in the size of the full alignments varies from family to family, reflecting in part the bias in the sequence database. Overall, across the whole of the database, using RP at 75, 55, 35 and 15% co-membership thresholds results in average alignment sizes that are, respectively, 38.8, 29.7, 20.4 and 11.6% of the full alignment size. As the number of sequences in the sequence database increases, we anticipate that the alignments based on RPs will grow at a more linear rate and provide a more convenient way of sampling the full alignment sequence diversity.

As illustrated in Table 1, the full alignment size for the top 10 families ranges from 129 000 to 363 000 sequences. With alignments of this size, it is no longer practical to calculate the neighbour-joining trees provided in previous Pfam releases. Before release 27.0, these approximate neighbour-joining phylogenetic trees (with bootstrapping values based on 100 replicas) were used to order the alignments, such that phylogenetically related sequences would be grouped together. From release 27.0 onwards, the full alignments are ordered according to the HMMER bit score of the match, with the highest scoring sequence found at the top of the alignment. The same phylogenetic trees are still provided for the seed alignments, but are merely a guide as they are calculated with the FastTree approximation algorithm (14). The seed alignment sequences remain ordered according to the calculated tree.

In the Pfam website, we use two different colouring schemes when displaying our alignments in a web browser: the Clustal scheme (15), based on the chemical properties of the amino acids found in the column, and a heat-map scheme that reflects the posterior

Table 1. The reduction in size of RP versus full alignments

Family identifier (accession)  Seed  Full     RP75           RP55           RP35          RP15

ABC_tran (PF00005)             55    363 409  26% (93 265)   21% (77 150)   16% (57 358)  8% (28 903)
COX1 (PF00115)                 94    254 351  1% (2006)      0.7% (1661)    0.4% (1218)   0.2% (538)
zf-H2C2_2 (PF13465)            163   227 898  61% (138 033)  27% (60 664)   15% (34 039)  9% (21 562)
WD40 (PF00400)                 1804  193 252  65% (125 805)  52% (100 531)  36% (69 386)  23% (21 562)
MFS_1 (PF07690)                195   181 668  30% (55 719)   25% (55 719)   17% (55 719)  8% (55 719)
RVT_1 (PF00078)                152   172 360  5% (8257)      4% (6662)      3% (5373)     2% (3604)
BPD_transp_1 (PF00528)         81    156 339  23% (36 523)   19% (29 422)   14% (22 134)  7% (10 630)
Response_reg (PF00072)         57    151 337  29% (44 329)   25% (37 848)   20% (29 453)  10% (15 208)
GP120 (PF00516)                24    146 453  N/A            N/A            N/A           N/A
HATPase_c (PF02518)            659   129 386  28% (36 085)   24% (30 935)   19% (24 121)  10% (12 473)

The seed alignment is used to construct the profile HMM and contains a representative set of sequences of the family. The full alignment contains all hits in pfamseq scoring above the gathering threshold. In Pfam 27.0, we have introduced four additional alignments based on RPs, which contain decreasing amounts of sequence redundancy from RP75 to RP15. For each RP data set, the percentage reduction in the size of the full alignment is shown, with the number of sequences given in brackets.

D224 Nucleic Acids Research, 2014, Vol. 42, Database issue

from release 27

String

●  Protein Interaction Networks: "STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases."
●  2031 organisms
●  9.6 mio proteins
●  1,380 mio interactions

String

●  Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, Simonovic M, Santos A, Doncheva NT, Roth A, Bork P, Jensen LJ, von Mering C. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 2017 Jan; 45: D362-68.

Prosite

●  PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
●  Sigrist CJA, de Castro E, Cerutti L, Cuche BA, Hulo N, Bridge A, Bougueleret L, Xenarios I. New and continuing developments at PROSITE. Nucleic Acids Res. 2012; doi: 10.1093/nar/gks1067 PubMed: 23161676

ENCODE / UCSC Genome Browser

●  The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature. 2012 Sep 6; 489(7414): 57–74. doi: 10.1038/nature11247
●  Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM, Wong MC, Maddren M, Fang R, Heitner SG, Lee BT, Barber GP, Harte RA, Diekhans M, Long JC, Wilder SP, Zweig AS, Karolchik D, Kuhn RM, Haussler D, Kent WJ. ENCODE data in the UCSC Genome Browser: year 5 update. Nucleic Acids Res. 2013 Jan; 41(Database issue): D56-63.

ENCODE / UCSC Genome Browser

●  UCSC Genome Browser: Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002 Jun; 12(6): 996-1006.

ENCODE / UCSC Genome Browser

●  ENCODE: Encyclopedia of DNA Elements
●  international collaboration of research groups
●  funded by the National Human Genome Research Institute (NHGRI)
●  build a comprehensive parts list of functional elements in the human genome
●  includes elements that act on protein and RNA level and regulatory elements
●  The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome

taken from https://www.encodeproject.org/

Genomic Annotations

taken from https://www.encodeproject.org/data/annotations

UCSC Genome Browser

●  actually a collection of integrated services
●  https://genome.ucsc.edu/index.html
●  provides a more graphical interface to access the ENCODE data and a lot of additional tools

taken from https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&...

Databases - SQL

●  Overlap with database lecture
●  "SQL crash course"
●  no design theory
●  no normalization
●  standard books like:
●  A. Kemper & A. Eickler: Datenbanksysteme – Eine Einführung, 9. Auflage, 2013, Oldenbourg Verlag, München

More Books

●  R. Elmasri, S. B. Navathe: Fundamentals of Database Systems, Benjamin Cummings, Redwood City, Ca, USA, 5. Ed., 2006
●  R. Ramakrishnan, J. Gehrke: Database Management Systems, 3. Ed., 2009.
●  G. Vossen: Datenmodelle, Datenbanksprachen und Datenbank-Management-Systeme. 5. Auflage, Oldenbourg, 2008.
●  C. J. Date: An Introduction to Database Systems. McGraw-Hill, 8. Ed., 2003.

Selected SQL Topics

●  Table modifications
   -  insert, update, create, alter
●  Data retrieval and reporting / aggregation
   -  select, average, sum
●  Combination and Performance
   -  join
●  Access control and permissions
   -  grant
●  Backup and Restore / Input-output

Reasons for DBMS

●  redundancy, consistency
●  limited access
●  difficult multi-user access
●  loss of information
●  loss of integrity
●  security issues
●  expensive application development

Abstraction layers

Physical Layer

Logical Layer

View 1 View 2

Various Data Models

●  Network model
●  Hierarchical model
●  Relational Model
●  XML schema
●  Object-oriented model
●  Deductive model

Relational Model

Students

Matric   Name
123455   Mayer
233457   Huber
...      ...

Attends

Matric   LectureNo
123455   2
233457   5
...      ...

Lectures

LectureNo   Title
2           Bioinformatics
5           Genomics
...         ...

Select Name
From Students, Attends, Lectures
Where Students.Matric = Attends.Matric
  and Attends.LectureNo = Lectures.LectureNo
  and Lectures.Title = 'Genomics';

Update Lectures
Set Title = 'Genomics of Mammalian'
Where LectureNo = 5;
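
The query and the update can be tried without a database server. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the column types and the use of SQLite are assumptions, only the table contents and the two statements come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (Matric INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Lectures (LectureNo INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Attends  (Matric INTEGER, LectureNo INTEGER);
INSERT INTO Students VALUES (123455, 'Mayer'), (233457, 'Huber');
INSERT INTO Lectures VALUES (2, 'Bioinformatics'), (5, 'Genomics');
INSERT INTO Attends  VALUES (123455, 2), (233457, 5);
""")

# Names of all students attending the lecture titled 'Genomics'.
genomics_students = [row[0] for row in conn.execute("""
    SELECT Name
    FROM Students, Attends, Lectures
    WHERE Students.Matric = Attends.Matric
      AND Attends.LectureNo = Lectures.LectureNo
      AND Lectures.Title = 'Genomics'
""")]

# Rename lecture 5, then read the title back.
conn.execute("UPDATE Lectures SET Title = 'Genomics of Mammalian' WHERE LectureNo = 5")
new_title = conn.execute(
    "SELECT Title FROM Lectures WHERE LectureNo = 5"
).fetchone()[0]

print(genomics_students, new_title)
```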

Entity Relationship Model

●  Graphical Notation
●  Models real world "entities" and "relations"
●  allows for "attributes"
●  allows for functionalities (1:1, 1:n, n:m)
●  allows to define keys
●  key: a set of attributes whose combination of values identifies an instance unambiguously

Notation

●  (strong) Entity, e.g. Student
●  Attribute; key attributes are underlined
●  Relation, e.g. Attends
●  weak Entity (depends on others)

ER Example

●  Entity Student with attributes Matric, Name, Semester
●  Relation Attends between Student and Lecture
●  Entity Lecture with attributes LectureNo, Title, Reader
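
One possible mapping of this ER example to tables is sketched below with Python's sqlite3 module. The choice of Matric and LectureNo as keys, the composite key on Attends, the column types and the sample values for Semester and Reader are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
CREATE TABLE Student (Matric INTEGER PRIMARY KEY, Name TEXT, Semester INTEGER);
CREATE TABLE Lecture (LectureNo INTEGER PRIMARY KEY, Title TEXT, Reader TEXT);
CREATE TABLE Attends (
    Matric    INTEGER REFERENCES Student(Matric),
    LectureNo INTEGER REFERENCES Lecture(LectureNo),
    PRIMARY KEY (Matric, LectureNo)  -- the n:m relation is identified by both keys
);
INSERT INTO Student VALUES (123455, 'Mayer', 2);       -- semester is hypothetical
INSERT INTO Lecture VALUES (5, 'Genomics', 'N.N.');    -- reader is hypothetical
INSERT INTO Attends VALUES (123455, 5);
""")


def try_insert(sql: str) -> str:
    """Run one INSERT and report whether a key constraint rejected it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


duplicate = try_insert("INSERT INTO Attends VALUES (123455, 5)")  # same key twice
dangling = try_insert("INSERT INTO Attends VALUES (999999, 5)")   # unknown student
attends_rows = conn.execute("SELECT COUNT(*) FROM Attends").fetchone()[0]
print(duplicate, dangling, attends_rows)
```

The key makes every (student, lecture) pair unambiguous, and the foreign keys keep Attends from referring to entities that do not exist.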

Functionality

Attends

Student

Lecture Grade

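Functionalities (Funktionalitäten)

[ER diagram of the university schema. Entity types: Studenten (students; MatrNr, Semester), Assistenten (assistants; PersNr, Fachgebiet = field), Professoren (professors; PersNr), Vorlesungen (lectures; VorlNr). Relationships: hören (attend), prüfen (examine), arbeitenFür (work for), voraussetzen (require, with roles Vorgänger = predecessor and Nachfolger = successor).]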

taken from Prof. Kempers database lecture WS 13/14

Exams as a weak entity type (Prüfungen als schwacher Entitytyp)

[ER diagram: Studenten (MatrNr) ablegen (take) Prüfungen (exams; PrüfTeil, Note = grade) with functionality 1:N; Prüfungen umfassen (cover) Vorlesungen (VorlNr); Professoren (PersNr) abhalten (hold) Prüfungen.]

●  Several examiners in one exam
●  Several lectures are examined in one exam

taken from Prof. Kempers database lecture WS 13/14

●  standardized: SQL:1999 (also known as SQL99 or SQL3) and SQL:2003
●  implemented by most available database management system manufacturers
●  but: not always all specified features implemented
●  not everything is specified!
●  especially admin / server maintenance is often vendor specific

SQL Data Types

●  char
●  varchar
●  binary and varbinary
●  blob and text
●  numeric, decimal, integer (exact)
●  approximate: float, double

SQL Data Types

●  various formats for time and date
●  enum: one out of a defined set
●  set: zero or more items out of a predefined list

For more information see the live tour through

http://dev.mysql.com/doc/refman/5.6/en/index.html

ACID-Principle for Transactions

●  A: Atomicity: All-or-nothing, i.e. a sequence of operations is executed like a single atomic operation which cannot be interrupted
●  C: Consistency: After every operation the database is consistent, i.e. all conditions and constraints about context and relationships are fulfilled

ACID-Principle II

●  I: Isolation: Concurrent operations do not affect each other
●  D: Durability: Upon successful completion of a transaction it is guaranteed that all modifications are persistent, i.e. they are stored in the database, even in case of an unexpected power loss.
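
Atomicity and consistency can be observed in a small experiment. The sketch below uses Python's sqlite3 module; the table `account`, its CHECK constraint and the helper `transfer` are hypothetical and only serve the illustration. A transfer consists of two updates, and if the second one violates the constraint the first one is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst as one transaction."""
    try:
        with conn:  # commits if both updates succeed, rolls back otherwise
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer("a", "b", 30)       # both updates are applied
failed = transfer("a", "b", 500)  # debit violates the CHECK, credit is undone
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```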

Relational Algebra

●  σ Selection
●  π Projection
●  ρ Rename
●  x Cross Product
●  ⋈ Join
●  − Difference
●  ÷ Division
●  ∪ Union
●  ∩ Intersection
●  ⋉ Semi Join (left)
●  ⟕ Left Outer Join
●  ⟗ (Full) Outer Join

Demonstration Table

gene    indiv   organism   function     status
cytox   1       mouse                   prep
gapdh   1       human      glycolysis   completed
gapdh   2       human      glycolysis   completed
ttn     2       human      muscle       ongoing
unkno   3       human      NULL         prep

Selection

●  The SELECT operation (denoted by σ (sigma)) is used to select a subset of the tuples from a relation based on a selection condition
●  It acts as a (row) filter
●  Specified in the WHERE-clause
●  σ_{status = "ongoing"}(STATUS)

Selection

●  General: the select operation is denoted by σ_{<selection condition>}(R) where:
   -  the σ (sigma) is used to denote the select operator
   -  the selection condition is a Boolean (conditional) expression specified on the attributes of relation R
   -  tuples that make the condition true are selected (appear in the result of the operation)
   -  tuples that make the condition false are filtered out (discarded from the result of the operation)

Selection

●  The Boolean expression specified in <selection condition> is made up of a number of clauses of the form: <attribute name> <comparison op> <constant value> or <attribute name> <comparison op> <attribute name>
●  <attribute name> is the name of an attribute of R, <comparison op> is normally one of the operations {=, >, >=, <, <=, !=}
●  Clauses can be arbitrarily connected by the Boolean operators and, or and not

Selection

●  NULL is tested for with special operators
●  Select σ is commutative
●  a cascade of select operations can be replaced by a single select with a conjunction of the conditions:
   σ_{<condition1>}(σ_{<condition2>}(R)) = σ_{<condition2>}(σ_{<condition1>}(R))
   σ_{<cond1>}(σ_{<cond2>}(σ_{<cond3>}(R))) = σ_{<cond1> AND <cond2> AND <cond3>}(R)
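
The selection rules can be checked against the demonstration table. This is a sketch with Python's sqlite3 module; the table name `gene_status` stands in for the relation STATUS, and storing the empty function cell of cytox as NULL is an assumption.

```python
import sqlite3

# Rows of the demonstration table; the empty function cell of cytox is
# stored as NULL here (assumption).
ROWS = [
    ("cytox", 1, "mouse", None, "prep"),
    ("gapdh", 1, "human", "glycolysis", "completed"),
    ("gapdh", 2, "human", "glycolysis", "completed"),
    ("ttn", 2, "human", "muscle", "ongoing"),
    ("unkno", 3, "human", None, "prep"),
]
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene_status "
    '(gene TEXT, indiv INTEGER, organism TEXT, "function" TEXT, status TEXT)'
)
conn.executemany("INSERT INTO gene_status VALUES (?, ?, ?, ?, ?)", ROWS)


def genes(sql: str) -> list:
    """Return the first column of every result row."""
    return [row[0] for row in conn.execute(sql)]


# sigma_{status = "ongoing"}(STATUS): the WHERE clause is the row filter.
ongoing = genes("SELECT gene FROM gene_status WHERE status = 'ongoing'")

# NULL needs the special operator IS NULL; '= NULL' never evaluates to true.
is_null = genes('SELECT gene FROM gene_status WHERE "function" IS NULL ORDER BY gene')
eq_null = genes('SELECT gene FROM gene_status WHERE "function" = NULL')

# A cascade of two selections equals one selection with a conjunction.
cascade = genes(
    "SELECT gene FROM (SELECT * FROM gene_status WHERE organism = 'human') "
    "WHERE status = 'prep'"
)
conjunction = genes(
    "SELECT gene FROM gene_status WHERE organism = 'human' AND status = 'prep'"
)
print(ongoing, is_null, eq_null, cascade, conjunction)
```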

Projection

●  PROJECT operation is denoted by π (pi)
●  use PROJECT to retrieve specific attributes of relation R
●  It acts as a (column) filter of the tuples
●  Example: π_{gene, status}(STATUS)
●  Project removes duplicates which might occur (in SQL: SELECT DISTINCT instead of simple SELECT)

Single Expression vs. Sequence of Relational Operations

●  To retrieve completed genes from our example:
●  Single expression: π_{gene, status}(σ_{status = "completed"}(STATUS))
●  Sequence of operations:
   ALL_COMP <- σ_{status = "completed"}(STATUS)
   RESULT <- π_{gene, status}(ALL_COMP)
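
Both points, duplicate removal and the equivalence of a single expression with a sequence of operations, can be reproduced on the demonstration table. The sketch uses Python's sqlite3 module; the table name `gene_status` (for STATUS), the view name `all_comp` and the NULL for the empty function cell are assumptions.

```python
import sqlite3

ROWS = [
    ("cytox", 1, "mouse", None, "prep"),  # empty cell stored as NULL (assumption)
    ("gapdh", 1, "human", "glycolysis", "completed"),
    ("gapdh", 2, "human", "glycolysis", "completed"),
    ("ttn", 2, "human", "muscle", "ongoing"),
    ("unkno", 3, "human", None, "prep"),
]
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene_status "
    '(gene TEXT, indiv INTEGER, organism TEXT, "function" TEXT, status TEXT)'
)
conn.executemany("INSERT INTO gene_status VALUES (?, ?, ?, ?, ?)", ROWS)

# A plain SELECT keeps duplicates, SELECT DISTINCT behaves like pi.
plain = conn.execute("SELECT gene, status FROM gene_status").fetchall()
projected = conn.execute("SELECT DISTINCT gene, status FROM gene_status").fetchall()

# Single expression: pi_{gene,status}(sigma_{status="completed"}(STATUS)).
single = conn.execute(
    "SELECT DISTINCT gene, status FROM gene_status WHERE status = 'completed'"
).fetchall()

# Sequence of operations: ALL_COMP as a named intermediate result.
conn.execute(
    "CREATE TEMP VIEW all_comp AS "
    "SELECT * FROM gene_status WHERE status = 'completed'"
)
stepwise = conn.execute("SELECT DISTINCT gene, status FROM all_comp").fetchall()
print(len(plain), len(projected), single, stepwise)
```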

Rename

●  RENAME is denoted by ρ (rho)
●  In some cases, we may want to rename the attributes of a relation or the relation name or both
   -  Useful when a query requires multiple operations
   -  Necessary in some cases (see JOIN operation later)

RENAME

●  RENAME operations ρ can be expressed by any of the following forms:
   -  ρ_{S}(R) changes the relation name only, to S
   -  ρ_{(B1, B2, …, Bn)}(R) changes the column (attribute) names only, to B1, B2, …, Bn
   -  ρ_{S(B1, B2, …, Bn)}(R) changes both: the relation name to S, and the column (attribute) names to B1, B2, …, Bn
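
In SQL, renaming is written with AS. The sketch below (Python's sqlite3 module) renames two columns and then uses relation aliases for a self-join, a case in which renaming is necessary. The table name `gene_status`, the aliases and the new column names are illustrative assumptions.

```python
import sqlite3

ROWS = [
    ("cytox", 1, "prep"),
    ("gapdh", 1, "completed"),
    ("gapdh", 2, "completed"),
    ("ttn", 2, "ongoing"),
    ("unkno", 3, "prep"),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_status (gene TEXT, indiv INTEGER, status TEXT)")
conn.executemany("INSERT INTO gene_status VALUES (?, ?, ?)", ROWS)

# Attribute renaming: the result columns are called symbol and state.
cur = conn.execute("SELECT gene AS symbol, status AS state FROM gene_status")
column_names = [d[0] for d in cur.description]

# Relation renaming: the same table appears twice under the names a and b,
# which is the only way to compare its tuples with each other.
repeated = [row[0] for row in conn.execute("""
    SELECT DISTINCT a.gene
    FROM gene_status AS a, gene_status AS b
    WHERE a.gene = b.gene AND a.indiv < b.indiv
""")]
print(column_names, repeated)
```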

Relational Operators from Set Theory

●  Union
●  Intersection
●  Minus
●  Cartesian Products

Union

●  It is a binary operation, denoted by ∪
●  The result of R ∪ S is a relation that includes all tuples that are either in R or in S or in both R and S
●  Duplicate tuples are eliminated
●  R and S have to be type compatible:
   -  they must have the same number of attributes
   -  corresponding attributes are type compatible

Intersection

●  INTERSECTION is denoted by ∩
●  The result of the operation R ∩ S is a relation that includes all tuples that are in both R and S
●  The attribute names in the result will be the same as the attribute names in R
●  The two operand relations R and S must be "type compatible"

Set Difference

●  SET DIFFERENCE (also called MINUS or EXCEPT) is denoted by –
●  The result of R – S is a relation that includes all tuples that are in R but not in S
●  The attribute names in the result will be the same as the attribute names in R
●  The two operand relations R and S must be "type compatible"

Properties of Union, Intersection and Difference

●  Both union and intersection are commutative; that is:
   R ∪ S = S ∪ R, and R ∩ S = S ∩ R
●  Union and intersection are associative operations; that is:
   R ∪ (S ∪ T) = (R ∪ S) ∪ T
   (R ∩ S) ∩ T = R ∩ (S ∩ T)
●  The minus operation is not commutative; that is:
   R – S ≠ S – R
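
SQL offers the three set operators as UNION, INTERSECT and EXCEPT. The sketch below runs them with Python's sqlite3 module on two small type-compatible relations R and S, whose contents are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (gene TEXT);
CREATE TABLE S (gene TEXT);
INSERT INTO R VALUES ('cytox'), ('gapdh'), ('ttn');
INSERT INTO S VALUES ('ttn'), ('unkno');
""")


def q(sql: str) -> list:
    """Run a query and return its single column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


union_rs = q("SELECT gene FROM R UNION SELECT gene FROM S")
union_sr = q("SELECT gene FROM S UNION SELECT gene FROM R")
union_all = q("SELECT gene FROM R UNION ALL SELECT gene FROM S")  # keeps duplicates
inter_rs = q("SELECT gene FROM R INTERSECT SELECT gene FROM S")
r_minus_s = q("SELECT gene FROM R EXCEPT SELECT gene FROM S")
s_minus_r = q("SELECT gene FROM S EXCEPT SELECT gene FROM R")
print(union_rs, inter_rs, r_minus_s, s_minus_r)
```

The two union results agree, the two differences do not, which mirrors the properties listed above.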

Cross Product (Cartesian Product)

●  CROSS PRODUCT operation
●  Used to combine tuples from two relations in a combinatorial fashion
●  Denoted by R(A1, A2, ..., An) x S(B1, B2, ..., Bm)
●  Result is a relation Q with degree n + m attributes: Q(A1, A2, ..., An, B1, B2, ..., Bm)

Cartesian Product (Cross Product)

●  The resulting relation contains every possible combination of the tuples from R and S, one from R and one from S
●  Hence, if R has n_R tuples (denoted as |R| = n_R), and S has n_S tuples, then R x S will have n_R * n_S tuples
●  The two operands do NOT have to be "type compatible"
●  Generally, CARTESIAN PRODUCT is not a meaningful operation, but can become meaningful when followed by other operations
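
The size and degree of a product, and the way a following selection makes it meaningful, can be seen in a short sketch (Python's sqlite3 module, reusing the Students and Attends tables from the relational model example; column types are assumptions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (Matric INTEGER, Name TEXT);
CREATE TABLE Attends  (Matric INTEGER, LectureNo INTEGER);
INSERT INTO Students VALUES (123455, 'Mayer'), (233457, 'Huber');
INSERT INTO Attends  VALUES (123455, 2), (233457, 5);
""")

n_r = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]
n_s = conn.execute("SELECT COUNT(*) FROM Attends").fetchone()[0]

# Students x Attends: every combination of one tuple from each relation.
product = conn.execute("SELECT * FROM Students CROSS JOIN Attends").fetchall()

# A selection on the product keeps only the related tuples.
related = conn.execute("""
    SELECT Name, LectureNo
    FROM Students CROSS JOIN Attends
    WHERE Students.Matric = Attends.Matric
    ORDER BY Name
""").fetchall()
print(len(product), len(product[0]), related)
```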

Join

●  JOIN operation (denoted by ⋈)
●  Sequence of CARTESIAN PRODUCT followed by SELECT is used to identify and select related tuples from two relations
●  very important for any relational database with more than a single relation, because it allows to combine related tuples from various relations

●  The general form of a join operation on two relations R(A1, A2, ..., An) and S(B1, B2, ..., Bm) is:
   R ⋈_{<join condition>} S
●  where R and S can be any relations that result from general relational algebra expressions

Join

●  Consider the following JOIN operation:
   -  Given R(A1, A2, ..., An) and S(B1, B2, ..., Bm), think about the join condition R.Ai = S.Bj
   -  Result is a relation Q with degree n + m attributes: Q(A1, A2, ..., An, B1, B2, ..., Bm)
   -  The resulting relation state has one tuple for each combination of tuples, r from R and s from S, but only if they satisfy the join condition r[Ai] = s[Bj]
   -  if R has n_R tuples, and S has n_S tuples, then the join result will generally have fewer than n_R * n_S tuples

BioinfRes SoSe 17

Join(moreprecise)

●  ThegeneralcaseofJOINopera)oniscalledaTheta-join:RthetaS

●  Thejoincondi)oniscalledtheta●  ThetacanbeanygeneralbooleanexpressionontheajributesofRandS;forexample:R.Ai<S.BjAND(R.Ak=S.BlORR.Ap<S.Bq)

BioinfRes SoSe 17

Equijoin

●  The most common use of join involves join conditions with equality comparisons only

●  Such a join, where the only comparison operator used is =, is called an EQUIJOIN

●  The JOIN seen in the previous example was an EQUIJOIN


Natural Join

●  Another variation of JOIN is called NATURAL JOIN, denoted by * (or by the join symbol without a condition)

●  It was created to get rid of the second (superfluous) attribute in an EQUIJOIN condition.

●  Q ← R(A, B, C, D) * S(C, D, E)

●  The implicit join condition includes each pair of attributes with the same name, "AND"ed together: R.C = S.C AND R.D = S.D

●  Keeps only one attribute of each such pair: Q(A, B, C, D, E)
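The R(A, B, C, D) * S(C, D, E) example can be sketched with rows as dictionaries, so that attribute names are available. The function name `natural_join` and the values are assumptions for illustration.

```python
def natural_join(R, S):
    """Join rows (dicts) on all attribute names that both relations share."""
    result = []
    for r in R:
        for s in S:
            common = r.keys() & s.keys()
            # Implicit condition: equality on every shared attribute, ANDed.
            if all(r[a] == s[a] for a in common):
                # Merging the dicts keeps only one copy of each shared attribute.
                result.append({**r, **s})
    return result

R = [{"A": 1, "B": 2, "C": 3, "D": 4}, {"A": 5, "B": 6, "C": 7, "D": 8}]
S = [{"C": 3, "D": 4, "E": 9}, {"C": 7, "D": 0, "E": 10}]

Q = natural_join(R, S)
print(Q)
```

The second row of R matches S on C but not on D, so it is dropped: both equalities must hold.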


Semi Join

●  Acts like a filter based on a specified attribute

●  R ⋉ S means: if R and S have a common attribute C, the result is all tuples from R whose C value also occurs in S; nQ ≤ nR tuples

●  Q ← R(A, B, C) ⋉ S(C, D, E)

●  Q(A, B, C) has the same attributes as R

●  πA,B,C(σR.C=S.C(R x S))
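The filtering behaviour of a semi join can be sketched as below, assuming the shared attribute C is the last field of R and the first field of S; the data is invented.

```python
R = [(1, "a", 10), (2, "b", 20), (3, "c", 30)]        # R(A, B, C)
S = [(10, "x", 0.1), (10, "y", 0.2), (30, "z", 0.3)]  # S(C, D, E)

# Collect the C values that occur in S, then filter R by them.
s_c_values = {s[0] for s in S}
Q = [r for r in R if r[2] in s_c_values]

print(Q)
```

Each tuple of R appears at most once in Q, even though C = 10 occurs twice in S, so nQ ≤ nR holds.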


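Left Outer Join

●  The right version is analogous

●  Adds information to corresponding left-side tuples

●  R ⟕ S means: if R and S have a common attribute C, the result is all combined tuples from R and S where R.C = S.C and, in addition, all remaining tuples from R; nQ ≥ nR tuples (nQ = nR if each tuple of R has at most one join partner in S)

●  Q(A, B, C, D, E) with the attributes of R ∪ S

●  If no matching tuples are found in S, attributes D and E contain no values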
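A left outer join on a shared attribute C could be sketched like this. The function name and data are illustrative assumptions, and `None` stands in for "no value". Every tuple of R appears at least once, so the result has at least nR rows, and exactly nR when each R tuple has at most one partner in S.

```python
def left_outer_join(R, S):
    """R(A, B, C) left outer join S(C, D, E) on the shared attribute C."""
    result = []
    for a, b, c in R:
        partners = [(d, e) for (sc, d, e) in S if sc == c]
        if not partners:
            # No join partner: keep the R tuple, pad D and E with None.
            partners = [(None, None)]
        result.extend((a, b, c, d, e) for d, e in partners)
    return result

R = [(1, "a", 10), (2, "b", 20)]
S = [(10, "x", 0.1), (10, "y", 0.2)]

Q = left_outer_join(R, S)
for row in Q:
    print(row)
```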


(Full) Outer Join

●  Combines corresponding tuples of R and S where possible, else attributes are left blank

●  R ⟗ S means: if R and S have a common attribute C, the result is all combined tuples from R and S where R.C = S.C and, in addition, all remaining tuples from R and S; nQ ≤ nR + nS tuples if every tuple has at most one join partner

●  Q(A, B, C, D, E) with the attributes of R ∪ S

●  If no matching tuples are found in R or S, attributes A, B or D and E contain no values
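The full outer join can be sketched in the same style. Again the function name and data are assumptions for illustration, with `None` marking attributes that have no value.

```python
def full_outer_join(R, S):
    """R(A, B, C) full outer join S(C, D, E) on the shared attribute C."""
    result = []
    matched_s = set()  # positions of S tuples that found a partner
    for a, b, c in R:
        found = False
        for i, (sc, d, e) in enumerate(S):
            if sc == c:
                result.append((a, b, c, d, e))
                matched_s.add(i)
                found = True
        if not found:
            result.append((a, b, c, None, None))
    # Remaining S tuples without a partner in R: A and B stay empty.
    for i, (sc, d, e) in enumerate(S):
        if i not in matched_s:
            result.append((None, None, sc, d, e))
    return result

R = [(1, "a", 10), (2, "b", 20)]
S = [(10, "x", 0.1), (30, "z", 0.3)]

Q = full_outer_join(R, S)
for row in Q:
    print(row)
```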


Division

●  Gives all tuples over the attributes R – S (those of R that are not in S) whose values co-occur in R with all tuples in S

●  R(A, B) and S(B)

●  R ÷ S: Q(A), where each result tuple in Q can be found in R in combination with every tuple from S
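For R(A, B) and S(B), division can be sketched with sets. The values are made up for the example.

```python
R = {("r1", "b1"), ("r1", "b2"), ("r2", "b1"), ("r3", "b1"), ("r3", "b2")}  # R(A, B)
S = {"b1", "b2"}                                                            # S(B)

# Candidate A values, then keep those paired in R with every B value of S.
candidates = {a for a, _ in R}
Q = {a for a in candidates if all((a, b) in R for b in S)}

print(sorted(Q))
```

Here "r2" is excluded because it occurs with "b1" only, not with every tuple of S.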


Complete Set of Relational Operations

●  The set of operations including SELECT σ, PROJECT π, UNION ∪, DIFFERENCE –, RENAME ρ, and CARTESIAN PRODUCT X is called a complete set because any other relational algebra expression can be expressed by a combination of these six operations.

●  Examples:

-  R ∩ S = (R ∪ S) – ((R – S) ∪ (S – R))

-  R ⋈<join condition> S = σ<join condition>(R X S)
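The intersection identity can be verified on a small example. This sketch assumes relations are Python sets; the shorter form R – (R – S) is added here as a second, equivalent way to express intersection with difference alone.

```python
R = {1, 2, 3, 4}
S = {3, 4, 5}

# Intersection expressed with union and difference only, as in the identity above.
via_union_and_minus = (R | S) - ((R - S) | (S - R))

# An equivalent, shorter form using difference only.
via_minus_only = R - (R - S)

print(via_union_and_minus, via_minus_only, R & S)
```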


Beyond Classical Algebra

●  Grouping: group by

●  Aggregation: count, sum, average, min, max
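Grouping and aggregation map directly to SQL's GROUP BY and aggregate functions. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table `domain`, its columns, and its rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE domain (family TEXT, length INTEGER)")
con.executemany(
    "INSERT INTO domain VALUES (?, ?)",
    [("fam1", 100), ("fam1", 140), ("fam2", 80), ("fam2", 90), ("fam2", 100)],
)

# One result row per family, with the five aggregates listed above.
rows = con.execute(
    "SELECT family, COUNT(*), SUM(length), AVG(length), MIN(length), MAX(length) "
    "FROM domain GROUP BY family ORDER BY family"
).fetchall()

for row in rows:
    print(row)
```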


Keys and Indexes

●  Each relation represents a subset of the cartesian product of its domains (attributes)

●  Some values might be unique for a row, others are not

●  To address and access a specific tuple in a relation we need to define a primary key

●  A primary key is a set of attributes whose combination allows us to unambiguously identify a certain row in the relation


Keys and Indexes

●  Consequences:

-  Each primary key (combination) can occur only once in a table

-  Entries that lack a value for one of these attributes are not allowed (NOT NULL)

-  Default values for these attributes make no sense

-  The system has to keep track of this with the help of an index

●  The key depends on the modeling and the domain
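These consequences can be observed with a composite primary key. The sketch uses Python's `sqlite3` module; the table `chain` and its columns are hypothetical. NOT NULL is declared explicitly because SQLite, unlike most other systems, does not imply it for such key columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """CREATE TABLE chain (
           entry  TEXT NOT NULL,
           chain  TEXT NOT NULL,
           length INTEGER,
           PRIMARY KEY (entry, chain)
       )"""
)
con.execute("INSERT INTO chain VALUES ('e1', 'A', 120)")
con.execute("INSERT INTO chain VALUES ('e1', 'B', 95)")  # same entry, other chain: allowed

def try_insert(values):
    """Return 'ok' if the row is accepted, 'rejected' on a constraint violation."""
    try:
        con.execute("INSERT INTO chain VALUES (?, ?, ?)", values)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

duplicate = try_insert(("e1", "A", 300))  # key combination already present
missing = try_insert(("e2", None, 50))    # part of the key has no value
fresh = try_insert(("e2", "A", 50))       # new key combination

print(duplicate, missing, fresh)
```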


Indexes / Constraints

●  PRIMARY KEY: UNIQUE, NOT NULL

●  UNIQUE: If there is a value it must be unique; if there is no value (NULL), it can occur multiple times

●  INDEX: A search structure which allows finding tuples (rows) with a specific attribute value efficiently

-  must be explicitly requested in the table structure

-  for character types you can specify the prefix length
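The UNIQUE behaviour and the effect of an explicitly requested index can be sketched with `sqlite3`. The table `protein` and the index name are assumptions for this example, and prefix lengths are not shown here. The query plan text differs between SQLite versions, so the sketch only checks whether the index name appears in it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE protein (id INTEGER PRIMARY KEY, accession TEXT UNIQUE, name TEXT)"
)
# Two rows without an accession (NULL) are accepted despite UNIQUE.
con.executemany(
    "INSERT INTO protein (accession, name) VALUES (?, ?)",
    [("acc1", "alpha"), (None, "beta"), (None, "gamma")],
)
null_rows = con.execute(
    "SELECT COUNT(*) FROM protein WHERE accession IS NULL"
).fetchone()[0]

# A second row with an existing non-NULL accession violates UNIQUE.
try:
    con.execute("INSERT INTO protein (accession, name) VALUES ('acc1', 'delta')")
    duplicate_accepted = True
except sqlite3.IntegrityError:
    duplicate_accepted = False

def plan(sql):
    """Return the human-readable part of SQLite's query plan for sql."""
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM protein WHERE name = 'beta'"
plan_before = plan(query)  # no index on name yet: full table scan
con.execute("CREATE INDEX idx_protein_name ON protein (name)")
plan_after = plan(query)   # the planner can now search via the index

print(null_rows, duplicate_accepted)
print(plan_before)
print(plan_after)
```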


Performance Considerations

●  There are three relations to join, A * B * C:

-  A (1.000.000 rows)

-  B (100 rows)

-  C (10.000 rows)


Performance Considerations

●  (Worst) case w/o indexes and bad sequence:

-  A * C: 10.000.000.000 comparisons, O(n*m) -> D (10.000.000.000 rows)

-  D * B: 1.000.000.000.000 comparisons, O(n*m)

-  of course, tuples might be dropped in reality because of missing join partners

●  Case with indexes and clever sequence:

-  B * A: 100 * log(10.000.000) comparisons -> D (10.000.000 rows)

-  C * D: 10.000 * log(10.000.000) comparisons


Performance Considerations

●  The sequence of evaluation can be optimized by the database engine

-  clever order with exploitation of associativity and commutativity

-  example: 100 * log(10.000.000) vs 10.000.000 * log(100)

-  may not be effective in the worst case, but is in all other cases
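The orders of magnitude can be worked out in a few lines. This sketch assumes a nested-loop join without indexes (every pair of rows is compared) and one O(log n) index lookup per outer row otherwise; the base-2 logarithm is an assumption, since the slides write log without a base.

```python
import math

rows_a, rows_b, rows_c = 1_000_000, 100, 10_000

# Without indexes: every pair of rows is compared, O(n * m).
comparisons_a_c = rows_a * rows_c           # A * C
comparisons_d_b = comparisons_a_c * rows_b  # D * B, if no tuples are dropped

# With an index: one O(log n) lookup in the indexed relation per outer row.
small_outer = 100 * math.log2(10_000_000)   # few outer rows, large indexed relation
large_outer = 10_000_000 * math.log2(100)   # many outer rows, small indexed relation

print(f"{comparisons_a_c:,} and {comparisons_d_b:,} comparisons without indexes")
print(f"{small_outer:,.0f} vs {large_outer:,.0f} comparisons with indexes")
```

Driving the join from the small relation and probing the index of the large one needs far fewer comparisons than the reverse order, which is the kind of reordering a query optimizer looks for.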
