Post on 31-Mar-2018
BioinfRes SoSe 17
Bioinformatics Resources - Structural Resources / SQL
Lecture & Exercises
Prof. B. Rost, Dr. L. Richter, J. Reeb
Institut für Informatik I12
Preliminary Schedule
Apr 28th: Intro, General Overview (1. sh.)
May 5th: Sequence Databases (2. sh.)
May 12th: Sequence Databases (3. sh.)
May 19th: Structure Databases (4. sh.)
May 26th: No Lecture
Jun 2nd: SQL (5. sh.)
Jun 9th: SQL, NoSql (6. sh.)
Jun 16th: No Lecture
Jun 23rd: NoSql 2 (7. sh.)
Jun 30th: MongoDB, JavaScript (8. sh.)
Jul 7th: Node.js Applications (9. sh.)
Jul 14th: PredictProtein
Jul 21st: Wrap Up, Q&A
Jul 28th: Exam
* These exercises can earn you a bonus
Orga - Exam Date
● Exam scheduled for Friday, Jul 28th
● Time: 16:30-18:00
● Room: MW0350 Egbert-von-Hoyer Lecture Hall (Mechanical Engineering Building)
● Registration is MANDATORY
● so far 6 students registered
Secondary Databases
● Databases which digest and structure data from primary databases
● Not always “true” database systems
● SCOP/CATH
● PFAM
● PROSITE
Classification of Structures: CATH-Gene3D / SCOP
● came up in the middle of the 1990s
● both are quite similar
● aim: organize the protein structures available in PDB, based on single domains
● hierarchical system (roughly):
- secondary structure content
- fold
- superfamilies
- families
SCOP: a Structural Classification of Proteins
● Murzin, A., Brenner, S. E., Hubbard, T. J. P. and Chothia, C. (1995) J. Mol. Biol., 247, 536-540
● Hubbard, T. P., Murzin, A., Brenner, S. E. and Chothia, C. (1997), Nucl. Acids Res. 25(1), 236-239 (easier to obtain)
● fully manually curated, driven by expert analysis
● associated with the ASTRAL compendium
● latest news: SCOPe (UC Berkeley), SCOP2 (MRC Lab Mol Biol, Cambridge, UK)
● J.-M. Chandonia, et al., SCOPe: Manual Curation and Artifact Removal in the Structural Classification of Proteins – extended Database, J. Mol. Biol. (2016), http://dx.doi.org/10.1016/j.jmb.2016.11.023
● A. Andreeva, D. Haworth, C. Chothia, E. Kulesha, A. Murzin. SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res. 2014 Jan 1; 42 (Database issue): D310–D314. Published online 2013 Nov 29. doi:10.1093/nar/gkt1242
Hierarchical Level
1. Classes: Consider secondary structure composition (all α, all β, α/β, α+β, multi-domain, membrane/cell surface/peptides, ...)
2. Fold: Shape of a domain. Proteins of the same fold have the same major secondary structure elements in the same arrangement with the same topological features
3. Superfamily: Groups of domains which have at least a distant common ancestor
4. Family: Groups within superfamilies with a more recent common ancestor (>30% sequence identity, or >15% seq. id. plus same function)
5. Protein domain: Groups within families, essentially the same protein (isoform, the same protein but from different species)
6. Species: Protein domains according to species
7. Domain: the single domain
Development starting from year 2000
taken from http://scop.berkeley.edu/help/ver=2.06#scopchanges
CATH - Faces
taken from http://www.ebi.ac.uk/about/people/janet-thornton
taken from http://www.tgac.ac.uk/scientific-advisory-board/
Publications
● Sillitoe I, Lewis TE, Cuff AL, Das S, Ashford P, Dawson NL, Furnham N, Laskowski RA, Lee D, Lees J, Lehtinen S, Studer R, Thornton JM, Orengo CA. CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res. 2015 Jan doi:10.1093/nar/gku947
● Lam SD, Dawson NL, Das S, Sillitoe I, Ashford P, Lee D, Lehtinen S, Orengo CA, Lees JG. Gene3D: expanding the utility of domain assignments. Nucleic Acids Res. 2016 Jan doi:10.1093/nar/gkv1231
CATH
● semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures
● four main levels:
- C: protein class, mainly secondary structure composition of each domain
- A: architecture, summarizes shapes based on orientation of secondary structure elements
- T: topology, sequential connectivity is considered
- H: homologous superfamily, high similarity with similar functions, evolutionary relationship assumed
some nine highly populated families (‘superfolds’ [1]), with important implications for prediction algorithms, and it illustrated the insights to be gained from ordering the data in this way.
Several other groups have also classified the known structures, focusing on a variety of local and global topological features and employing a range of algorithms (structure comparison algorithms and classification generally are reviewed in [13–16]). The SCOP database, developed by Murzin et al. [17], groups proteins having significant sequence similarity into homologous families, whereas more distant structural similarities are largely identified manually. This database places emphasis on evolutionary relationships, and information from the literature relating to well-studied fold families is also incorporated (e.g. the β trefoils [18] and the OB fold [19]). By contrast, Holm and Sander use the structure comparison algorithm DALI to recognise structural neighbours, whether motif or fold based, without formally ordering proteins in the PDB into families [20]. The ENTREZ database of Hogue et al. [21] uses a similar approach to DALI, listing neighbours by a fast vector-based comparison algorithm (VAST).
The task of defining structural relationships is further complicated by the existence of multidomain proteins; more than 30% of non-identical structures in the current PDB contain two or more domains. A number of domain recognition algorithms have appeared recently to address this problem [22–26]. The 3Dee database of Siddiqui and Barton (http://snail.biop.ox.ac.uk:8080/3Dee) separates the constituent folds of multidomain proteins using the DOMAK algorithm. Similarly, Sowdhamini et al. have constructed a database of single domain families [27], using the domain recognition algorithm DIAL [26] and the structural comparison procedure SEA [28]. Both databases contain data that is generated largely automatically, but is subsequently checked and where appropriate reordered manually.
In recognition of the need to regularly maintain and update data on structural relatives, we have further developed our automatic procedures for identifying and classifying structural families [6] to construct a database of single-domain fold families. Any multidomain proteins are first divided into their constituent domain folds by an automatic consensus procedure which is in agreement between three independent algorithms (SJ et al. unpublished data). As well as clustering proteins by sequence and structure, recognised families are also grouped according to similarity in protein class (i.e. secondary structure composition and contacts). Finally, the architecture (shape, defined by the assembly of secondary structures, regardless of their connectivity) adopted by each protein fold is assigned manually. Although this is a somewhat subjective process, based largely on commonly used descriptions in the literature (e.g. sandwich, barrel and propellor), it is an essential first step towards ordering the known folds in a useful and practical way.
Figure 1
Annual increase in the numbers of protein domain structures in the PDB (top plot, [11,12]). The lower lines show the numbers of identical families (I-level, 100% sequence identity between structures within the family and 100% overlap), non-identical families (N-level, > 95% sequence identity, 85% overlap), sequence families (S-level, > 35% sequence identity, 60% overlap), homologous superfamilies (H-level, > 25% sequence identity, SSAP > 80 and 60% overlap), and topological or fold families (T-level, SSAP > 70), where SSAP is a structural comparison score.
[Plot "Domain fold distribution": number of domains (y-axis, 0 to 7500) against deposition date (x-axis, '85 to '96), with one curve each for Domain, Identical, Non-identical, Sequence family, Homologous superfamily and Topology.]
from Structure 15, August 1997, 5:1093–1108 http://biomednet.com/elecref/0969212600501093
Current Release
● CATHDB version: 4.0
● 235,000 domains
● 25 million protein predictions
● new:
- improved prediction of functional families
- current putative domain assignments (CATH-B)
- CATH-40: a non-redundant set of CATH domains for homology benchmarking experiments (<40% seq. id. with 60% overlap)
● http://www.cathdb.info/wiki/doku/?id=release_notes#cath_release_notes
Numbering Scheme
● C: 1, 2, 3, 4 (alpha, beta, alpha/beta, none) (4)
● A: same architecture, different topology (40)
● T: Topology (connection of secondary structure elements) (1373)
● H: Homology (families) (2737)
appear less distinct and may reflect the tolerance of helix packing modes that allows diverse combinations of two- and three-helix motifs. This gives rise to a continuum of folds within which helix packing angles range from aligned through to orthogonal. Despite this variety, certain motifs appear to recur frequently: the aligned α hairpin and the two-helix and three-helix orthogonal motifs common in the repressor and globin-like folds. Therefore, in this class, it may ultimately be more appropriate to separate fold families into architectural groups that reflect specific combinations of these common motifs.
By contrast to the mainly α class, in the mainly β class, the constraints on β strands to be hydrogen bonded within sheets and also on sheet–sheet packing give rise to some very distinct and easily recognisable architectures. In particular, the β prisms, β propellors and β solenoids demonstrate the symmetry and regularity of structures satisfying these preferred packing constraints. In contrast to the few architectures observed within the mainly α class, at least 16 different, relatively simple, architectures can be discerned in the mainly β class.
The diversity of the mainly β class is not currently observed within the α−β class, in which only eight regular architectures are apparent to date. This may simply reflect a bias in the structures determined or could suggest that in this class the preferred motifs are more constrained in the ways in which they combine. The βαβ motif appears to be highly favoured and is observed within a large proportion of folds. In some topologies, the β strands are adjacent in space (classic motif) but in others they are separated by a third antiparallel strand, forming a three-stranded β sheet (split motif) [31]. Although both the classic and the split βαβ motifs are most commonly found in two and three-layer architectures, the classic motif is also found to recur within barrel and semi-barrel or horseshoe architectures (Figure 4).
The structures that fall outside these rather simple layer architectures tend to be quite complex. Compared to the
Table 2
The numbers of fold families (T-level), homologous superfamilies (H-level) and domain structures in different architectures are shown for the mainly α, mainly β and α−β classes.
Class | Architecture | Number of T-levels | Percentage of all T-levels* | Number of H-levels | Percentage of all H-levels* | Number of domains | Percentage of all domains*
Mainly α | Non-bundle | 86 | 17.03 | 93 | 14.42 | 1455 | 18.01
Mainly α | Bundle | 34 | 6.73 | 39 | 6.05 | 226 | 2.80
Mainly α | Few SS | 25 | 4.95 | 25 | 3.88 | 112 | 1.39
Mainly β | Ribbon | 17 | 3.37 | 17 | 2.64 | 114 | 1.41
Mainly β | Single sheet | 5 | 0.99 | 6 | 0.93 | 56 | 0.69
Mainly β | Roll | 6 | 1.19 | 6 | 0.93 | 55 | 0.68
Mainly β | Barrel | 22 | 4.36 | 29 | 4.50 | 861 | 10.66
Mainly β | Clam | 1 | 0.20 | 1 | 0.16 | 1 | 0.01
Mainly β | Sandwich | 21 | 4.16 | 43 | 6.67 | 1236 | 15.30
Mainly β | Distorted sandwich | 14 | 2.77 | 14 | 2.17 | 83 | 1.03
Mainly β | Trefoil | 1 | 0.20 | 4 | 0.62 | 49 | 0.61
Mainly β | Orthogonal prism | 1 | 0.20 | 1 | 0.16 | 4 | 0.05
Mainly β | Aligned prism | 1 | 0.20 | 2 | 0.31 | 3 | 0.04
Mainly β | Four-propellor | 1 | 0.20 | 1 | 0.16 | 3 | 0.04
Mainly β | Six-propellor | 1 | 0.20 | 1 | 0.16 | 37 | 0.46
Mainly β | Seven-propellor | 2 | 0.40 | 2 | 0.31 | 11 | 0.14
Mainly β | Eight-propellor | 1 | 0.20 | 1 | 0.16 | 2 | 0.02
Mainly β | Two-solenoid | 2 | 0.40 | 3 | 0.47 | 5 | 0.06
Mainly β | Three-solenoid | 1 | 0.20 | 1 | 0.16 | 1 | 0.01
Mainly β | Complex | 5 | 0.99 | 5 | 0.78 | 104 | 1.29
α–β | Roll | 24 | 4.75 | 33 | 5.12 | 469 | 5.81
α–β | Barrel | 8 | 1.58 | 20 | 3.10 | 365 | 4.52
α–β | Two-layer sandwich | 77 | 15.25 | 112 | 17.36 | 957 | 11.85
α–β | Three-layer (αβα) sandwich | 78 | 15.45 | 115 | 17.83 | 1396 | 17.28
α–β | Three-layer (ββα) sandwich | 3 | 0.59 | 3 | 0.47 | 11 | 0.14
α–β | Four-layer sandwich | 4 | 0.79 | 4 | 0.62 | 12 | 0.15
α–β | Box | 1 | 0.20 | 1 | 0.16 | 2 | 0.02
α–β | Horseshoe | 1 | 0.20 | 1 | 0.16 | 1 | 0.01
α–β | Complex | 34 | 6.73 | 34 | 5.27 | 253 | 3.13
α–β | Few SS | 14 | 2.77 | 14 | 2.17 | 96 | 1.19
Few SS | Irregular | 14 | 2.77 | 14 | 2.17 | 98 | 1.21
*The percentages of total fold families, total homologous superfamilies and total domain structures adopting a particular architecture are shown.
Pfam
● current version is 31.0, March 2017, 16712 families in 604 clans
● hosted by the EBI
● Citation: “The Pfam protein families database: towards a more sustainable future” Nucl. Acids Res. (04 January 2016) 44 (D1): D279-D285. doi:10.1093/nar/gkv1344
● Pfam-A: curated seed alignment derived from Pfamseq (UniProtKB based), profile HMMs for the seed alignment, full alignment with all HMM-detected sequences
● Pfam-B: un-annotated, automatically generated from non-redundant clusters from ADDA
● focuses on single domains
Terms
● Family: collection of related protein regions
● Domain: structural unit
● Repeat: short unit which is unstable in isolation but forms a stable structure when found in multiple copies
● Motif: short unit found outside globular domains
● Clans: related group of Pfam entries based on similarity in sequence, structure or profile-HMM
Pfam Numbers (rel. 31)
● 16712 Pfam-A families
● 36% of the families are classified into 604 clans
● the Pfam-A release matches 73% of the 26.7 million sequences in the corresponding UniProt reference proteome database
● coverage of 90.5% of SwissProt human
● use of jackhmmer (from HMMER3 package)
● consider CATH and PDB
does not currently present a scalability problem, aiding human interpretation through visualization has become increasingly difficult. Most approaches for facilitating alignment visualization natively in the browser do not scale well. Applets, such as the Jalview alignment viewer (12), partly solve the problem, but require Java to be installed and coupled to the browser.
For example, the largest Pfam-A family (version 27.0) with >363 000 matches to the profile HMM is the ABC transporters family (ABC_tran, accession PF00005); its full alignment is thus too large to be useful for most purposes. The seed alignment, by contrast, contains just 55 representative sequences, which may be an insufficient number to represent the sequence diversity within the family. To provide more useable samples of the sequence diversity within a family, we now calculate model-matches for four additional sequence sets, based on ‘Representative Proteomes’ (RPs) (13). For the ABC_tran family, the RP alignments range in size from approximately a quarter of the size of the full alignment to less than one tenth.
In an RP set, each member proteome is selected from a grouping of similar proteomes. The selected proteome is chosen to best represent the set of grouped proteomes in terms of both sequence and annotation information. The grouping of proteomes is based on a clustering of UniProt, UniRef50, and includes all complete proteome sequences. In each cluster, sequences have ≥50% identity and have at least an 80% overlap with the longest sequence. The similarity of two proteomes is determined by considering just the clusters containing sequences from either of the two proteomes. The two proteomes are grouped when the fraction of clusters that contain sequences from both proteomes out of the subset of proteome-specific clusters exceeds a given threshold. This threshold is termed the co-membership threshold. The percentage threshold of co-membership (or common clusters) can be adjusted down to produce larger groupings, and hence less redundant sequence sets.
We use the RP sequence sets constructed using co-membership thresholds of 75, 55, 35 and 15%, giving a range of sequence redundancy for each family. Using representative proteomes has the advantage that it still allows for organism-specific copy numbers to be assessed, a feature that can be lost when using global non-redundancy thresholds on an entire sequence database. However, the major advantage for Pfam is the dramatic reduction in the size of the family full alignments, as shown in Table 1, which illustrates the reductions with increasingly redundant RPs for the 10 biggest families in Pfam. The RP sets do not currently include viruses, and so for some families such as GP120, there may not be a match to the RP sets.
The reduction in the size of the full alignments varies from family to family, reflecting in part the bias in the sequence database. Overall, across the whole of the database, using RP at 75, 55, 35 and 15% co-membership thresholds results in average alignment sizes that are, respectively, 38.8, 29.7, 20.4 and 11.6% of the full alignment size. As the number of sequences in the sequence database increases, we anticipate that the alignments based on RPs will grow at a more linear rate and provide a more convenient way of sampling the full alignment sequence diversity.
As illustrated in Table 1, the full alignment size for the top 10 families ranges from 129 000 to 363 000 sequences. With alignments of this size, it is no longer practical to calculate the neighbour-joining trees provided in previous Pfam releases. Before release 27.0, these approximate neighbour-joining phylogenetic trees (with bootstrapping values based on 100 replicas) were used to order the alignments, such that phylogenetically related sequences would be grouped together. From release 27.0 onwards, the full alignments are ordered according to the HMMER bit score of the match, with the highest scoring sequence found at the top of the alignment. The same phylogenetic trees are still provided for the seed alignments, but are merely a guide as they are calculated with the FastTree approximation algorithm (14). The seed alignment sequences remain ordered according to the calculated tree.
In the Pfam website, we use two different colouring schemes when displaying our alignments in a web browser: the Clustal scheme (15), based on the chemical properties of the amino acids found in the column, and a heat-map scheme that reflects the posterior
Table 1. The reduction in size of RP versus full alignments
Family identifier (accession) | Seed | Full | RP75 | RP55 | RP35 | RP15
ABC_tran (PF00005) | 55 | 363 409 | 26% (93 265) | 21% (77 150) | 16% (57 358) | 8% (28 903)
COX1 (PF00115) | 94 | 254 351 | 1% (2006) | 0.7% (1661) | 0.4% (1218) | 0.2% (538)
zf-H2C2_2 (PF13465) | 163 | 227 898 | 61% (138 033) | 27% (60 664) | 15% (34 039) | 9% (21 562)
WD40 (PF00400) | 1804 | 193 252 | 65% (125 805) | 52% (100 531) | 36% (69 386) | 23% (21 562)
MFS_1 (PF07690) | 195 | 181 668 | 30% (55 719) | 25% (55 719) | 17% (55 719) | 8% (55 719)
RVT_1 (PF00078) | 152 | 172 360 | 5% (8257) | 4% (6662) | 3% (5373) | 2% (3604)
BPD_transp_1 (PF00528) | 81 | 156 339 | 23% (36 523) | 19% (29 422) | 14% (22 134) | 7% (10 630)
Response_reg (PF00072) | 57 | 151 337 | 29% (44 329) | 25% (37 848) | 20% (29 453) | 10% (15 208)
GP120 (PF00516) | 24 | 146 453 | N/A | N/A | N/A | N/A
HATPase_c (PF02518) | 659 | 129 386 | 28% (36 085) | 24% (30 935) | 19% (24 121) | 10% (12 473)
The seed alignment is used to construct the profile HMM and contains a representative set of sequences of the family. The full alignment contains all hits in pfamseq scoring above the gathering threshold. In Pfam 27.0, we have introduced four additional alignments based on RPs, which contain decreasing amounts of sequence redundancy from RP75 to RP15. For each RP data set, the percentage reduction in the size of the full alignment is shown, with the number of sequences given in brackets.
D224 Nucleic Acids Research, 2014, Vol. 42, Database issue
from release 27
String
● Protein Interaction Networks: “STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases.”
● 2031 organisms
● 9.6 million proteins
● 1,380 million interactions
● Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, Simonovic M, Santos A, Doncheva NT, Roth A, Bork P, Jensen LJ, von Mering C. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 2017 Jan; 45: D362-68.
Prosite
● PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
● Sigrist CJA, de Castro E, Cerutti L, Cuche BA, Hulo N, Bridge A, Bougueleret L, Xenarios I. New and continuing developments at PROSITE. Nucleic Acids Res. 2012; doi:10.1093/nar/gks1067 PubMed: 23161676
ENCODE / UCSC Genome Browser
● The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature. 2012 Sep 6; 489(7414): 57–74. doi:10.1038/nature11247
● Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM, Wong MC, Maddren M, Fang R, Heitner SG, Lee BT, Barber GP, Harte RA, Diekhans M, Long JC, Wilder SP, Zweig AS, Karolchik D, Kuhn RM, Haussler D, Kent WJ. ENCODE data in the UCSC Genome Browser: year 5 update. Nucleic Acids Res. 2013 Jan; 41 (Database issue): D56-63.
● UCSC Genome Browser: Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002 Jun; 12(6): 996-1006.
● ENCODE: Encyclopedia of DNA Elements
● international collaboration of research groups
● funded by the National Human Genome Research Institute (NHGRI)
● goal: build a comprehensive parts list of functional elements in the human genome
● includes elements that act on protein and RNA level and regulatory elements
● The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome
taken from https://www.encodeproject.org/
Genomic Annotations
taken from https://www.encodeproject.org/data/annotations
UCSC Genome Browser
● actually a collection of integrated services
● https://genome.ucsc.edu/index.html
● provides a more graphical interface to access the ENCODE data and a lot of additional tools
taken from https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&...
Databases - SQL
● Overlap with database lecture
● “SQL crash course”
● no design theory
● no normalization
● standard books like:
● A. Kemper & A. Eickler: Datenbanksysteme – Eine Einführung, 9. Auflage, 2013, Oldenbourg Verlag, München
More Books
● R. Elmasri, S. B. Navathe: Fundamentals of Database Systems, Benjamin Cummings, Redwood City, Ca, USA, 5. Ed., 2006
● R. Ramakrishnan, J. Gehrke: Database Management Systems, 3. Ed., 2009.
● G. Vossen: Datenmodelle, Datenbanksprachen und Datenbank-Management-Systeme. 5. Auflage, Oldenbourg, 2008.
● C. J. Date: An Introduction to Database Systems. McGraw-Hill, 8. Ed., 2003.
Selected SQL Topics
● Table modifications
- insert, update, create, alter
● Data retrieval and reporting/aggregation
- select, average, sum
● Combination and Performance
- join
● Access control and permissions
- grant
● Backup and Restore / Input-output
Reasons for DBMS (problems that arise without one)
● redundancy and inconsistency
● limited access
● difficult multi-user access
● loss of information
● loss of integrity
● security issues
● expensive application development
Abstraction layers
Physical Layer
Logical Layer
View 1 View 2
Various Data Models
● Network model
● Hierarchical model
● Relational Model
● XML schema
● Object-oriented model
● Deductive model
Relational Model
Students
Matric Name
123455 Mayer
233457 Huber
... ...
Attends
Matric LectureNo
123455 2
233457 5
... ...
Lectures
LectureNo Title
2 Bioinformatics
5 Genomics
... ...
SELECT Name
FROM Students, Attends, Lectures
WHERE Students.Matric = Attends.Matric
  AND Attends.LectureNo = Lectures.LectureNo
  AND Lectures.Title = 'Genomics';
UPDATE Lectures
SET Title = 'Genomics of Mammalian'
WHERE LectureNo = 5;
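The two statements above can be tried out directly. The following is a minimal sketch that assumes SQLite (through Python's built-in `sqlite3` module) as the database system; the column types and key declarations are assumptions, since the slide only shows table contents.

```python
import sqlite3

# In-memory database holding the three relations from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (Matric INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Lectures (LectureNo INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Attends  (Matric INTEGER, LectureNo INTEGER);
INSERT INTO Students VALUES (123455, 'Mayer'), (233457, 'Huber');
INSERT INTO Lectures VALUES (2, 'Bioinformatics'), (5, 'Genomics');
INSERT INTO Attends  VALUES (123455, 2), (233457, 5);
""")

# The SELECT from the slide: names of all students attending 'Genomics'.
query = """
SELECT Name
FROM Students, Attends, Lectures
WHERE Students.Matric = Attends.Matric
  AND Attends.LectureNo = Lectures.LectureNo
  AND Lectures.Title = ?
"""
genomics_students = [row[0] for row in conn.execute(query, ("Genomics",))]

# The UPDATE from the slide: rename lecture number 5.
conn.execute("UPDATE Lectures SET Title = 'Genomics of Mammalian' WHERE LectureNo = 5")
new_title = conn.execute("SELECT Title FROM Lectures WHERE LectureNo = 5").fetchone()[0]

print(genomics_students, new_title)
```

The title is passed as a `?` parameter rather than pasted into the query string, which is the usual way to avoid quoting problems.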
Entity Relationship Model
● Graphical Notation
● Models real world “entities” and “relations”
● allows for “attributes”
● allows for functionalities (1:1, 1:n, n:m)
● allows to define keys
● key: a set of attributes whose value combination allows unambiguous instance identification
Notation
● (strong) Entity, example: Student
● Attribute (key: underlined), example: Name
● Relation, example: Attends
● weak Entity (depends on others)
ER Example
[Diagram: entity Student (attributes Matric, Name, Semester) is connected through the relation Attends to entity Lecture (attributes LectureNo, Title, Reader).]
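Functionality
[Diagram: Student and Lecture are connected through the relation Attends, annotated with the functionality N:M; Grade is shown as an additional attribute.]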
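Functionalities (Funktionalitäten)
[Diagram: ER model of a university. Entities: Studenten (students; MatrNr, Name, Semester), Assistenten (assistants; PersNr, Name, Fachgebiet = field), Professoren (professors; PersNr, Name, Rang = rank, Raum = room), Vorlesungen (lectures; VorlNr, Titel, SWS). Relations: hören (attend), prüfen (examine, with attribute Note = grade), arbeitenFür (work for), lesen (teach), voraussetzen (require, with roles Vorgänger = predecessor and Nachfolger = successor). Each relation is annotated with its functionality (1, N, M).]
taken from Prof. Kemper's database lecture WS 13/14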
Exams (Prüfungen) as a weak entity type
[Diagram: Studenten (students, key MatrNr) 1 : N ablegen (take) Prüfungen (exams, weak entity with attributes Note = grade and PrüfTeil = exam part); Prüfungen N : M umfassen (cover) Vorlesungen (lectures, key VorlNr); Prüfungen N : M abhalten (conduct) Professoren (professors, key PersNr).]
• Several examiners in one exam
• Several lectures are examined in one exam
taken from Prof. Kemper's database lecture WS 13/14
SQL
● standardized, e.g. SQL:1999 (also known as SQL3) and SQL:2003
● implemented by most available database management system manufacturers
● but: not always all specified features implemented
● not everything is specified!
● especially admin/server maintenance is often vendor specific
SQL Data Types
● char
● varchar
● binary and varbinary
● blob and text
● numeric, decimal, integer (exact)
● approximate: float, double
● various formats for time and date
● enum: one out of a defined set
● set: zero or more items out of a predefined list
For more information see the live tour through http://dev.mysql.com/doc/refman/5.6/en/index.html
ACID Principle for Transactions
● A: Atomicity: All-or-nothing, i.e. a sequence of operations is executed like a single atomic operation which cannot be interrupted
● C: Consistency: After every operation the database is consistent, i.e. all conditions and constraints about context and relationships are fulfilled
● I: Isolation: Concurrent operations do not affect each other
● D: Durability: Upon successful completion of a transaction it is guaranteed that all modifications are persistent, i.e. they are stored in the database, even in case of an unexpected power loss.
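Atomicity and consistency can be observed with a small experiment. The sketch below assumes SQLite via Python's `sqlite3` module; the `account` table, the names and the `transfer` helper are hypothetical and only serve the illustration. The second transfer violates a `CHECK` constraint halfway through, so the whole transaction is rolled back, including the update that had already succeeded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the CHECK constraint is the consistency rule.
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok_small = transfer(conn, "alice", "bob", 30)   # fits: both updates are kept
ok_big = transfer(conn, "alice", "bob", 500)    # would overdraw: both undone
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok_small, ok_big, balances)
```

Isolation and durability need concurrent connections and a database file on disk, so they are not covered by this in-memory sketch.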
Relational Algebra
● σ Selection
● π Projection
● ρ Rename
● × Cross Product
● ⋈ Join
● − Difference
● ÷ Division
● ∪ Union
● ∩ Intersection
● ⋉ Semi Join (left)
● ⟕ Left Outer Join
● ⟗ (Full) Outer Join
Demonstration Table (relation STATUS)
gene | indiv | organism | function | status
cytox | 1 | mouse | | prep
gapdh | 1 | human | glycolysis | completed
gapdh | 2 | human | glycolysis | completed
ttn | 2 | human | muscle | ongoing
unkno | 3 | human | NULL | prep
Selection
● The SELECT operation (denoted by σ (sigma)) is used to select a subset of the tuples from a relation based on a selection condition
● It acts as a (row) filter
● Specified in the WHERE clause
● σ_{status = “ongoing”}(STATUS)
● General: the select operation is denoted by σ_{<selection condition>}(R) where:
- the σ (sigma) is used to denote the select operator
- the selection condition is a Boolean (conditional) expression specified on the attributes of relation R
- tuples that make the condition true are selected (appear in the result of the operation)
- tuples that make the condition false are filtered out (discarded from the result of the operation)
● The Boolean expression specified in <selection condition> is made up of a number of clauses of the form:
<attribute name> <comparison op> <constant value> or
<attribute name> <comparison op> <attribute name>
● <attribute name> is the name of an attribute of R, <comparison op> is normally one of the operations {=, >, >=, <, <=, !=}
● Clauses can be arbitrarily connected by the Boolean operators and, or and not
● NULL is tested for with special operators (IS NULL, IS NOT NULL)
● Select σ is commutative:
σ_{<condition1>}(σ_{<condition2>}(R)) = σ_{<condition2>}(σ_{<condition1>}(R))
● a cascade of select operations can be replaced by a single select with a conjunction of the conditions:
σ_{<cond1>}(σ_{<cond2>}(σ_{<cond3>}(R))) = σ_{<cond1> AND <cond2> AND <cond3>}(R)
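The selection rules can be checked on the demonstration table. This is a sketch assuming SQLite through Python's `sqlite3` module; the table is called STATUS as in the algebra expressions, the empty `function` cell of cytox is entered as an empty string (an assumption, the slide leaves it blank), and the `select` helper is hypothetical.

```python
import sqlite3

ROWS = [
    ("cytox", 1, "mouse", "", "prep"),  # blank cell stored as '' (assumption)
    ("gapdh", 1, "human", "glycolysis", "completed"),
    ("gapdh", 2, "human", "glycolysis", "completed"),
    ("ttn", 2, "human", "muscle", "ongoing"),
    ("unkno", 3, "human", None, "prep"),  # None is stored as SQL NULL
]
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STATUS (gene TEXT, indiv INTEGER, organism TEXT, "
    "function TEXT, status TEXT)"
)
conn.executemany("INSERT INTO STATUS VALUES (?, ?, ?, ?, ?)", ROWS)


def select(condition):
    """sigma_<condition>(STATUS); only for hard-coded, trusted conditions."""
    sql = f"SELECT * FROM STATUS WHERE {condition} ORDER BY gene, indiv"
    return conn.execute(sql).fetchall()


ongoing = select("status = 'ongoing'")

# NULL needs the special operator: '= NULL' is never true.
null_with_equals = select("function = NULL")
null_with_is = select("function IS NULL")

# Cascade of two selections versus one selection with a conjunction.
cascade = conn.execute(
    "SELECT * FROM (SELECT * FROM STATUS WHERE organism = 'human') "
    "WHERE status = 'completed' ORDER BY gene, indiv"
).fetchall()
conjunction = select("organism = 'human' AND status = 'completed'")
print(ongoing, null_with_is, cascade == conjunction)
```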
Projection
● PROJECT operation is denoted by π (pi)
● use PROJECT to retrieve specific attributes of relation R
● It acts as a (column) filter of the tuples
● Example: π_{gene, status}(STATUS)
● Project removes duplicates which might occur (in SQL: SELECT DISTINCT instead of simple SELECT)
Single Expression vs. Sequence of Relational Operations
● To retrieve completed genes from our example:
● Single expression: π_{gene, status}(σ_{status = “completed”}(STATUS))
● Sequence of operations:
ALL_COMP ← σ_{status = “completed”}(STATUS)
RESULT ← π_{gene, status}(ALL_COMP)
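Both variants can be written in SQL and compared. The sketch below assumes SQLite via Python's `sqlite3` module; the intermediate relation ALL_COMP is materialised as a temporary table, which is one possible choice among several (a view or a subquery would work as well).

```python
import sqlite3

ROWS = [
    ("cytox", 1, "mouse", "", "prep"),  # blank cell stored as '' (assumption)
    ("gapdh", 1, "human", "glycolysis", "completed"),
    ("gapdh", 2, "human", "glycolysis", "completed"),
    ("ttn", 2, "human", "muscle", "ongoing"),
    ("unkno", 3, "human", None, "prep"),
]
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STATUS (gene TEXT, indiv INTEGER, organism TEXT, "
    "function TEXT, status TEXT)"
)
conn.executemany("INSERT INTO STATUS VALUES (?, ?, ?, ?, ?)", ROWS)

# Single expression: projection applied directly to the selection.
single = conn.execute(
    "SELECT DISTINCT gene, status FROM STATUS WHERE status = 'completed'"
).fetchall()

# Sequence of operations: name the intermediate result first.
conn.execute(
    "CREATE TEMP TABLE ALL_COMP AS "
    "SELECT * FROM STATUS WHERE status = 'completed'"
)
result = conn.execute("SELECT DISTINCT gene, status FROM ALL_COMP").fetchall()

# Plain SELECT keeps duplicates, unlike the algebraic projection.
without_distinct = conn.execute("SELECT gene, status FROM ALL_COMP").fetchall()
print(single, result, len(without_distinct))
```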
Rename
● RENAME is denoted by ρ (rho)
● In some cases, we may want to rename the attributes of a relation or the relation name or both
- Useful when a query requires multiple operations
- Necessary in some cases (see JOIN operation later)
● RENAME operations ρ can be expressed by any of the following forms:
- ρ_S(R) changes: the relation name only to S
- ρ_(B1, B2, …, Bn)(R) changes: the column (attribute) names only to B1, B2, …, Bn
- ρ_S(B1, B2, …, Bn)(R) changes both: the relation name to S, and the column (attribute) names to B1, B2, …, Bn
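In SQL, renaming is written with `AS`. The following sketch assumes SQLite via Python's `sqlite3` module and reuses the demonstration table; the new names `symbol` and `individual` are made up for the example. The self-join shows a case where renaming is necessary: the same relation appears twice and each copy needs its own name.

```python
import sqlite3

ROWS = [
    ("cytox", 1, "mouse", "", "prep"),  # blank cell stored as '' (assumption)
    ("gapdh", 1, "human", "glycolysis", "completed"),
    ("gapdh", 2, "human", "glycolysis", "completed"),
    ("ttn", 2, "human", "muscle", "ongoing"),
    ("unkno", 3, "human", None, "prep"),
]
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STATUS (gene TEXT, indiv INTEGER, organism TEXT, "
    "function TEXT, status TEXT)"
)
conn.executemany("INSERT INTO STATUS VALUES (?, ?, ?, ?, ?)", ROWS)

# Rename attributes (and the relation) for the result of one query.
cursor = conn.execute("SELECT gene AS symbol, indiv AS individual FROM STATUS AS S")
column_names = [description[0] for description in cursor.description]

# Self-join: genes recorded for more than one individual.
repeated = conn.execute(
    "SELECT DISTINCT a.gene FROM STATUS AS a, STATUS AS b "
    "WHERE a.gene = b.gene AND a.indiv <> b.indiv"
).fetchall()
print(column_names, repeated)
```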
Relational Operators from Set Theory
● Union
● Intersection
● Minus
● Cartesian Products
Union
● It is a binary operation, denoted by ∪
● The result of R ∪ S is a relation that includes all tuples that are either in R or in S or in both R and S
● Duplicate tuples are eliminated
● R and S have to be type compatible:
- they must have the same number of attributes
- corresponding attributes are type compatible
Intersection
● INTERSECTION is denoted by ∩
● The result of the operation R ∩ S is a relation that includes all tuples that are in both R and S
● The attribute names in the result will be the same as the attribute names in R
● The two operand relations R and S must be “type compatible”
Set Difference
● SET DIFFERENCE (also called MINUS or EXCEPT) is denoted by –
● The result of R – S is a relation that includes all tuples that are in R but not in S
● The attribute names in the result will be the same as the attribute names in R
● The two operand relations R and S must be “type compatible”
Properties of Union, Intersection and Difference
● Both union and intersection are commutative; that is:
R ∪ S = S ∪ R, and R ∩ S = S ∩ R
● Union and intersection are associative operations; that is:
R ∪ (S ∪ T) = (R ∪ S) ∪ T
(R ∩ S) ∩ T = R ∩ (S ∩ T)
● The minus operation is not commutative; that is, in general:
R – S ≠ S – R
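The three set operators map directly to SQL keywords. The sketch below assumes SQLite via Python's `sqlite3` module; the two single-column relations R and S and their gene names are invented for the example, and `q` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (gene TEXT);
CREATE TABLE S (gene TEXT);
INSERT INTO R VALUES ('gapdh'), ('ttn'), ('ttn');  -- note the duplicate
INSERT INTO S VALUES ('ttn'), ('cytox');
""")


def q(sql):
    """Run a compound query and return the single column, sorted."""
    return [row[0] for row in conn.execute(sql + " ORDER BY gene")]


union = q("SELECT gene FROM R UNION SELECT gene FROM S")
intersection = q("SELECT gene FROM R INTERSECT SELECT gene FROM S")
r_minus_s = q("SELECT gene FROM R EXCEPT SELECT gene FROM S")
s_minus_r = q("SELECT gene FROM S EXCEPT SELECT gene FROM R")
print(union, intersection, r_minus_s, s_minus_r)
```

`UNION` eliminates the duplicate 'ttn'; SQL also offers `UNION ALL`, which keeps duplicates and therefore leaves pure set semantics.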
Cross Product (Cartesian Product)
● CROSS PRODUCT operation
● Used to combine tuples from two relations in a combinatorial fashion
● Denoted by R(A1, A2, ..., An) × S(B1, B2, ..., Bm)
● Result is a relation Q with degree n + m attributes: Q(A1, A2, ..., An, B1, B2, ..., Bm)
● The resulting relation contains every possible combination of the tuples from R and S, one from R and one from S
● Hence, if R has nR tuples (denoted as |R| = nR), and S has nS tuples, then R × S will have nR * nS tuples
● The two operands do NOT have to be “type compatible”
● Generally, CARTESIAN PRODUCT is not a meaningful operation, but can become meaningful when followed by other operations
Join
● JOIN operation (denoted by ⋈)
● The sequence of CARTESIAN PRODUCT followed by SELECT is used to identify and select related tuples from two relations
● very important for any relational database with more than a single relation, because it allows combining related tuples from various relations
● The general form of a join operation on two relations R(A1, A2, ..., An) and S(B1, B2, ..., Bm) is:
R ⋈_{<join condition>} S
● where R and S can be any relations that result from general relational algebra expressions
● Consider the following JOIN operation:
- If R(A1, A2, ..., An) and S(B1, B2, ..., Bm), think about the join condition R.Ai = S.Bj
- Result is a relation Q with degree n + m attributes: Q(A1, A2, ..., An, B1, B2, ..., Bm)
- The resulting relation state has one tuple for each combination of tuples, r from R and s from S, but only if they satisfy the join condition r[Ai] = s[Bj]
- if R has nR tuples, and S has nS tuples, then the join result will generally have fewer than nR * nS tuples
Join (more precise)
● The general case of the JOIN operation is called a Theta-join: R ⋈_θ S
● The join condition is called theta (θ)
● Theta can be any general Boolean expression on the attributes of R and S; for example:
R.Ai < S.Bj AND (R.Ak = S.Bl OR R.Ap < S.Bq)
Equijoin
● The most common use of join involves join conditions with equality comparisons only
● Such a join, where the only comparison operator used is =, is called an EQUIJOIN
● The JOIN seen in the previous example was an EQUIJOIN
Natural Join
● Another variation of JOIN is called NATURAL JOIN, denoted by * or by ⋈ without a condition
● It was created to get rid of the second (superfluous) attribute in an EQUIJOIN condition.
● Q ← R(A, B, C, D) * S(C, D, E)
● implicit join condition includes each pair of attributes with the same name, “AND”ed together: R.C = S.C AND R.D = S.D
● keeps only one attribute of each such pair: Q(A, B, C, D, E)
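The join variants described above (cross product, theta-join, equijoin, natural join) differ mainly in the condition and in the number of result columns. The sketch below, assuming SQLite via Python's `sqlite3` module, runs all four on the Students and Attends relations from the relational model example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (Matric INTEGER, Name TEXT);
CREATE TABLE Attends  (Matric INTEGER, LectureNo INTEGER);
INSERT INTO Students VALUES (123455, 'Mayer'), (233457, 'Huber');
INSERT INTO Attends  VALUES (123455, 2), (233457, 5);
""")

# Cross product: every Students tuple paired with every Attends tuple.
cross_count = conn.execute("SELECT COUNT(*) FROM Students, Attends").fetchone()[0]

# Equijoin: cross product followed by a selection on equality.
# Both Matric columns are kept, so each row has 2 + 2 attributes.
equijoin = conn.execute(
    "SELECT * FROM Students JOIN Attends ON Students.Matric = Attends.Matric"
).fetchall()

# Natural join: same condition on the shared name, but Matric appears once.
natural = conn.execute("SELECT * FROM Students NATURAL JOIN Attends").fetchall()

# Theta-join: any Boolean condition, here an inequality.
theta = conn.execute(
    "SELECT Students.Name, Attends.LectureNo "
    "FROM Students JOIN Attends ON Students.Matric < Attends.Matric"
).fetchall()
print(cross_count, equijoin, natural, theta)
```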
Semi Join
● acts like a filter based on a specified attribute
● R ⋉ S means: if R and S have a common attribute C, the result is all tuples from R whose C value also occurs in S, nQ ≤ nR tuples
● Q ← R(A, B, C) ⋉ S(C, D, E)
● Q(A, B, C) with the attributes of R
● π_{A, B, C}(σ_{R.C = S.C}(R × S))
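SQL has no dedicated semi-join keyword; a common way to express it is a `WHERE EXISTS` subquery. The sketch below assumes SQLite via Python's `sqlite3` module, with invented contents for R(A, B, C) and S(C, D, E), and compares that form with the algebra expression from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (A INTEGER, B TEXT, C INTEGER);
CREATE TABLE S (C INTEGER, D TEXT, E TEXT);
INSERT INTO R VALUES (1, 'x', 10), (2, 'y', 20), (3, 'z', 30);
INSERT INTO S VALUES (10, 'd1', 'e1'), (10, 'd2', 'e2'), (40, 'd3', 'e3');
""")

# Semi join as a filter: keep R tuples whose C value occurs in S.
semi = conn.execute(
    "SELECT * FROM R WHERE EXISTS (SELECT 1 FROM S WHERE S.C = R.C)"
).fetchall()

# The algebra expression: project the selected cross product back onto R.
# DISTINCT plays the role of the duplicate-removing projection.
via_product = conn.execute(
    "SELECT DISTINCT R.A, R.B, R.C FROM R, S WHERE R.C = S.C"
).fetchall()
print(semi, via_product)
```

The R tuple with C = 10 has two partners in S but appears only once in the result, which is why nQ ≤ nR holds.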
Left Outer Join
● Right version is analogous
● adds information to corresponding left side tuples
● R ⟕ S means: if R and S have a common attribute C, the result is all combined tuples from R and S where R.C = S.C and in addition all remaining tuples from R, nQ ≥ nR tuples (nQ = nR when each tuple of R has at most one join partner)
● Q ← R(A, B, C) ⟕ S(C, D, E)
● Q(A, B, C, D, E) with the attributes of R ∪ S
● if no matching tuples are found in S, attributes D and E contain no values (NULL)
(Full) Outer Join
● combines corresponding tuples of R and S where possible, else attributes are left blank
● R ⟗ S means: if R and S have a common attribute C, the result is all combined tuples from R and S where R.C = S.C and in addition all remaining tuples from R and S, nQ ≤ nR + nS tuples when each tuple has at most one join partner
● Q ← R(A, B, C) ⟗ S(C, D, E)
● Q(A, B, C, D, E) with the attributes of R ∪ S
● if no matching tuples are found in R or S, attributes A, B or D and E contain no values (NULL)
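The outer joins can be sketched as follows, assuming SQLite via Python's `sqlite3` module and invented contents for R and S. Since `FULL OUTER JOIN` is not available in every SQLite version, the full outer join is emulated here as a left outer join plus the unmatched S tuples; this emulation is a workaround chosen for the example, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (A INTEGER, B TEXT, C INTEGER);
CREATE TABLE S (C INTEGER, D TEXT, E TEXT);
INSERT INTO R VALUES (1, 'x', 10), (2, 'y', 20), (3, 'z', 30);
INSERT INTO S VALUES (10, 'd1', 'e1'), (10, 'd2', 'e2'), (40, 'd3', 'e3');
""")

LEFT_JOIN = "SELECT R.A, R.B, R.C, S.D, S.E FROM R LEFT JOIN S ON R.C = S.C"

# Left outer join: every R tuple survives; missing partners give NULL (None).
left = conn.execute(LEFT_JOIN + " ORDER BY R.A, S.D").fetchall()

# Full outer join, emulated: left join plus S tuples without a partner in R.
full = conn.execute(
    LEFT_JOIN
    + " UNION ALL "
    + "SELECT NULL, NULL, S.C, S.D, S.E FROM S "
    + "WHERE NOT EXISTS (SELECT 1 FROM R WHERE R.C = S.C)"
).fetchall()
print(left)
print(full)
```

The R tuple with C = 10 matches two S tuples, so the left outer join returns four rows for three R tuples.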
Division
● Gives all value combinations of the attributes R − S (the attributes of R that are not in S) which co-occur in R with all tuples in S
● R(A, B) and S(B)
● R ÷ S: Q(A) where each result tuple in Q can be found in R in combination with every tuple from S
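SQL has no division operator; one standard formulation uses two nested `NOT EXISTS` subqueries ("there is no tuple in S that is not combined with this A value"). The sketch below assumes SQLite via Python's `sqlite3` module; the table `Required` and the data are hypothetical. It asks which students attend all required lectures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Attends  (Matric INTEGER, LectureNo INTEGER);  -- plays R(A, B)
CREATE TABLE Required (LectureNo INTEGER);                  -- plays S(B)
INSERT INTO Attends  VALUES (123455, 2), (123455, 5), (233457, 5);
INSERT INTO Required VALUES (2), (5);
""")

# Attends ÷ Required: students for whom no required lecture is missing.
division = conn.execute("""
SELECT DISTINCT a.Matric
FROM Attends AS a
WHERE NOT EXISTS (
    SELECT 1 FROM Required AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM Attends AS a2
        WHERE a2.Matric = a.Matric AND a2.LectureNo = s.LectureNo
    )
)
""").fetchall()
print(division)
```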
Complete Set of Relational Operations
● The set of operations including SELECT σ, PROJECT π, UNION ∪, DIFFERENCE −, RENAME ρ, and CARTESIAN PRODUCT × is called a complete set because any other relational algebra expression can be expressed by a combination of these operations.
● Examples:
- R ∩ S = (R ∪ S) – ((R − S) ∪ (S − R))
- R ⋈_{<join condition>} S = σ_{<join condition>}(R × S)
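Both identities can be checked on small relations. The sketch below models relations as plain Python sets of tuples, which is a simplification (no attribute names); the contents of R, S and T are made up.

```python
# Relations as sets of tuples (a simplification without attribute names).
R = {("gapdh",), ("ttn",), ("cytox",)}
S = {("ttn",), ("unkno",)}

# Intersection built only from union and difference.
intersection_from_complete_set = (R | S) - ((R - S) | (S - R))

# Join as selection over the cross product, on two small binary relations.
T = {(1, "x"), (2, "y")}           # T(A, C)
U = {("x", "left"), ("z", "up")}   # U(C, D)
cross_product = {t + u for t in T for u in U}
join_from_complete_set = {row for row in cross_product if row[1] == row[2]}

print(intersection_from_complete_set, join_from_complete_set)
```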
Beyond Classical Algebra
● Grouping: group by
● Aggregation: count, sum, average, min, max
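Grouping and aggregation on the demonstration table might look as follows. This is a sketch assuming SQLite via Python's `sqlite3` module; the empty `function` cell of cytox is again stored as an empty string, which is an assumption.

```python
import sqlite3

ROWS = [
    ("cytox", 1, "mouse", "", "prep"),  # blank cell stored as '' (assumption)
    ("gapdh", 1, "human", "glycolysis", "completed"),
    ("gapdh", 2, "human", "glycolysis", "completed"),
    ("ttn", 2, "human", "muscle", "ongoing"),
    ("unkno", 3, "human", None, "prep"),
]
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STATUS (gene TEXT, indiv INTEGER, organism TEXT, "
    "function TEXT, status TEXT)"
)
conn.executemany("INSERT INTO STATUS VALUES (?, ?, ?, ?, ?)", ROWS)

# One result row per status value, with aggregates over each group.
per_status = conn.execute(
    "SELECT status, COUNT(*), MIN(indiv), MAX(indiv), AVG(indiv) "
    "FROM STATUS GROUP BY status ORDER BY status"
).fetchall()

# COUNT(*) counts rows, COUNT(column) skips NULL values.
all_rows, with_function = conn.execute(
    "SELECT COUNT(*), COUNT(function) FROM STATUS"
).fetchone()
print(per_status, all_rows, with_function)
```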
Keys and Indexes
● Each relation represents a subset of the Cartesian product of its domains (attributes)
● Some values might be unique for a row, others are not
● To address and access a specific tuple in a relation we need to define a primary key
● A primary key is a set of attributes whose value combination allows us to unambiguously identify a certain row in the relation
● Consequences:
- Each primary key (combination) can occur only once in a table
- Entries which miss a value for one of these attributes are not allowed (NOT NULL)
- Default values for these attributes make no sense
- The system has to keep track of the keys with the help of an index
● The key depends on the modeling and the domain
Indexes/Constraints
● PRIMARY KEY: UNIQUE, NOT NULL
● UNIQUE: If there is a value it must be unique; NULL (no value) can occur multiple times
● INDEX: A search structure which allows finding tuples (rows) with a specific attribute value efficiently
- must be explicitly requested in the table structure
- for character types you can specify the prefix length
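The constraints listed above can be tried out as follows. The sketch assumes SQLite via Python's `sqlite3` module; the table `measurement`, the column `accession` and the helper `try_insert` are hypothetical. Index prefix lengths for character columns are a MySQL feature and are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measurement (
    gene      TEXT    NOT NULL,
    indiv     INTEGER NOT NULL,
    accession TEXT    UNIQUE,        -- unique if present, NULL allowed
    PRIMARY KEY (gene, indiv)        -- composite primary key
);
-- An additional search structure has to be requested explicitly.
CREATE INDEX idx_measurement_indiv ON measurement (indiv);
""")


def try_insert(row):
    """Insert one row; report whether the constraints accepted it."""
    try:
        with conn:
            conn.execute("INSERT INTO measurement VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


accepted = [
    try_insert(("gapdh", 1, "ACC1")),  # fine
    try_insert(("gapdh", 2, None)),    # fine, no accession yet
    try_insert(("ttn", 2, None)),      # fine, NULL may repeat under UNIQUE
    try_insert(("gapdh", 1, "ACC2")),  # rejected: primary key already used
    try_insert(("ttn", 3, "ACC1")),    # rejected: accession not unique
    try_insert((None, 4, "ACC3")),     # rejected: key attribute is NULL
]
print(accepted)
```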
Performance Considerations
● There are three relations to join, A * B * C:
- A (1,000,000 rows)
- B (100 rows)
- C (10,000 rows)
● (Worst) case w/o indexes and bad sequence:
A * C: 10,000,000,000 comparisons, O(n*m) -> D (10,000,000,000 rows)
D * B: 1,000,000,000,000 comparisons, O(n*m)
- of course tuples might be dropped in reality because of missing join partners
● Case with indexes and clever sequence:
B * A: 100 * log(10,000,000) comparisons -> D (10,000,000 rows)
C * D: 10,000 * log(10,000,000) comparisons
● Sequence of evaluation can be optimized by the database engine
- clever order with exploitation of associativity and commutativity
- example: 100 * log(10,000,000) vs 10,000,000 * log(100)
- may be not effective in worst case but definitely every time else
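The orders of magnitude can be recomputed with a rough cost model. The sketch below is an assumption-laden simplification: a join without an index costs one comparison per pair of rows, an index lookup costs about log2(n) comparisons, and the row counts are the ones given for A, B and C in the list of relations, so the figures differ from the rounded numbers on the slide.

```python
import math

ROWS = {"A": 1_000_000, "B": 100, "C": 10_000}


def nested_loop_cost(outer_rows, inner_rows):
    """Comparisons without an index: every pair of rows is compared."""
    return outer_rows * inner_rows


def index_lookup_cost(outer_rows, inner_rows):
    """Comparisons with an index on the inner relation (about log2 n each)."""
    return outer_rows * math.log2(inner_rows)


# Bad start: join the two largest relations without any index.
worst_first_join = nested_loop_cost(ROWS["A"], ROWS["C"])

# Same pair of relations, two possible roles for the index lookup.
small_outer = index_lookup_cost(ROWS["B"], ROWS["A"])  # 100 lookups into A
large_outer = index_lookup_cost(ROWS["A"], ROWS["B"])  # 1,000,000 lookups into B

print(f"{worst_first_join:,} comparisons without index for A * C")
print(f"{small_outer:,.0f} vs {large_outer:,.0f} comparisons with index")
```

Iterating over the small relation and probing the index of the large one is cheaper by more than three orders of magnitude in this model, which is the kind of reordering a query optimizer looks for.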