![Page 1: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/1.jpg)
Morgan Langille
UC Davis
![Page 2: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/2.jpg)
![Page 3: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/3.jpg)
Questions
If we wanted to start studying a gene of unknown function, which one(s) should we study first?
How many un-annotated genes could be annotated?
What proportion of unknown genes (hypothetical proteins) are probably not real proteins (i.e. pseudogenes, mispredicted ORFs, etc.)?
What proportion of unknown gene families are probably phage-related?
Can some of these families (hopefully the top ranking ones) be characterized using non-similarity based bioinformatic approaches?
![Page 4: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/4.jpg)
Outline of project
Genomic Data
Pfam Search
Filter for unknown genes
Build HMMs for unknown genes
Rank Families
•Universality
•Evenness
•Pathogen / Non-pathogen
•Etc.
Create unknown families for metagenomics data
Identify unknown families that now merge with known families
Quantify families that are likely phage
Use several non-similarity based methods to predict family function
•Community Profiling**
•3D structure similarity
•Neighboring genes
![Page 5: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/5.jpg)
![Page 6: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/6.jpg)
Phylogenetic profiling
For C. hydrogenoformans, identified the presence or absence of homologs in all other completely sequenced genomes
Identified many hypothetical proteins that had the same profile as known sporulation proteins
Wu, et al., PLOS Genetics, 2005
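The profiling idea on this slide can be sketched in a few lines. This is a minimal illustration, not the method used in the cited paper: the function name `group_by_profile` and all gene and genome names are hypothetical, and real analyses usually tolerate near-identical profiles rather than requiring exact matches.

```python
from collections import defaultdict

def group_by_profile(presence):
    """Group genes whose presence/absence pattern across genomes is identical.

    presence: dict mapping gene name -> set of genome names with a homolog.
    Returns a dict mapping each profile (tuple of 0/1, one value per genome
    in sorted genome order) to the sorted list of genes sharing it.
    """
    genomes = sorted(set().union(*presence.values())) if presence else []
    groups = defaultdict(list)
    for gene, hits in presence.items():
        profile = tuple(int(g in hits) for g in genomes)
        groups[profile].append(gene)
    return {p: sorted(genes) for p, genes in groups.items()}

# Hypothetical toy data: these gene and genome names are made up.
toy = {
    "spo_known": {"genome1", "genome3"},
    "hypothetical_1": {"genome1", "genome3"},
    "housekeeping": {"genome1", "genome2", "genome3"},
}
profiles = group_by_profile(toy)
```

Here `hypothetical_1` lands in the same group as `spo_known`, which is the kind of evidence the slide describes for linking unknown proteins to sporulation.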
![Page 7: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/7.jpg)
Community Profiling
KEGG, COG
DeLong, et al., Science, 2006
![Page 8: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/8.jpg)
Community Profiling
Look across multiple metagenomic samples
Gene families that have similar profiles may have similar function
Similar to using co-expression to identify similar functioning genes
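As a rough sketch of what "similar profiles" could mean, the snippet below compares the abundance of two gene families across the same set of samples. It assumes Pearson correlation as the similarity measure; the slides do not name the exact measure, and the family names and counts are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length abundance profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-variation signal
    return cov / math.sqrt(var_x * var_y)

# Hypothetical counts of three families in four metagenomic samples.
family_a = [10, 0, 5, 20]
family_b = [20, 0, 10, 40]   # rises and falls with family_a
family_c = [0, 15, 8, 1]     # abundant where family_a is rare

r_ab = pearson(family_a, family_b)
r_ac = pearson(family_a, family_c)
```

Families like `family_a` and `family_b` would be candidates for a shared function, in the same way co-expressed genes are.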
![Page 9: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/9.jpg)
So what have I done?
"All metagenomics peptides" from CAMERA
43M sequences (mostly GOS)
Searched against 11,000 Pfams using HMMER 3
Used “cluster” to group genes and samples
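Before genes and samples can be clustered, the search hits have to be turned into a table of counts. A minimal sketch of that step, assuming the HMMER results have already been reduced to (sample, Pfam) pairs; the function name `count_matrix` and the IDs below are hypothetical.

```python
from collections import Counter

def count_matrix(hits):
    """Build a Pfam-by-sample count table.

    hits: iterable of (sample_id, pfam_id) pairs, one per significant hit.
    Returns (pfams, samples, matrix) where matrix[i][j] is the number of
    hits to pfams[i] in samples[j].
    """
    counts = Counter(hits)
    samples = sorted({s for s, _ in counts})
    pfams = sorted({p for _, p in counts})
    matrix = [[counts[(s, p)] for s in samples] for p in pfams]
    return pfams, samples, matrix

# Hypothetical hits: sample and family identifiers are placeholders.
toy_hits = [
    ("sample_A", "family_1"),
    ("sample_A", "family_1"),
    ("sample_B", "family_1"),
    ("sample_B", "family_2"),
]
pfams, samples, matrix = count_matrix(toy_hits)
```

Each row of `matrix` is then the profile of one family across samples, which is the input a clustering tool would work on.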
![Page 10: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/10.jpg)
Results
Red = above avg. number of Pfams
Green = below avg. number of Pfams
Have not normalized for
Number of sequences per sample
Number of Pfams
[Heatmap axes: Metagenomic Samples, Pfams]
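The slide notes that the counts were not normalized. One possible normalization, shown here only as a sketch and not something the original analysis did, is to divide each count by the number of sequences in its sample. The function name and numbers are hypothetical.

```python
def normalize_by_sample(matrix, sequences_per_sample):
    """Convert raw hit counts to hits per sequence in each sample.

    matrix[i][j]: hits to Pfam i in sample j.
    sequences_per_sample[j]: total number of sequences in sample j.
    Samples with zero sequences get 0.0 to avoid division by zero.
    """
    normalized = []
    for row in matrix:
        if len(row) != len(sequences_per_sample):
            raise ValueError("row length must match number of samples")
        normalized.append([
            count / total if total else 0.0
            for count, total in zip(row, sequences_per_sample)
        ])
    return normalized

# Hypothetical counts for two Pfams in two samples of different size.
raw = [[2, 1], [0, 1]]
fractions = normalize_by_sample(raw, [4, 2])
```

After this step, a large sample no longer looks "above average" for every family simply because it contains more sequences.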
![Page 11: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/11.jpg)
Example of phage Pfams clustering together
![Page 12: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/12.jpg)
Measuring functional relatedness
Need to measure community profiling performance
The hierarchical clusters were broken into 575 groups using a correlation cutoff of 0.90 or above
Pfams were mapped to GO terms using pfam2GO
1893 Pfams had no associated GO term
○ 695 of these were Domains of Unknown Function (DUFs)
3377 Pfams had one or more associated GO terms and could be used for further analysis
Only 67 (of 575) clusters contained 4 or more Pfams with at least one GO term
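The final filtering step above (keeping only clusters with enough GO-annotated Pfams) might look like the sketch below. It assumes the clusters and the Pfam-to-GO mapping are already loaded into plain Python structures; the function name, IDs, and GO labels are hypothetical.

```python
def usable_clusters(clusters, pfam_to_go, min_annotated=4):
    """Keep clusters with at least `min_annotated` GO-annotated Pfams.

    clusters: list of lists of Pfam IDs.
    pfam_to_go: dict mapping Pfam ID -> list of GO terms (missing or empty
    means no annotation, e.g. a Domain of Unknown Function).
    Returns, for each kept cluster, only its annotated Pfams.
    """
    kept = []
    for cluster in clusters:
        annotated = [p for p in cluster if pfam_to_go.get(p)]
        if len(annotated) >= min_annotated:
            kept.append(annotated)
    return kept

# Hypothetical mapping and clusters; IDs and GO labels are placeholders.
toy_go = {"P1": ["GO:a"], "P2": ["GO:b"], "P3": ["GO:c"], "P4": ["GO:d"], "P5": []}
toy_clusters = [["P1", "P2", "P3", "P4", "P5"], ["P1", "P5", "DUF_x"]]
kept = usable_clusters(toy_clusters, toy_go)
```

Only the first toy cluster survives, mirroring how 575 clusters shrank to 67 in the real data.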
![Page 13: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/13.jpg)
Measuring GO similarity
G-SESAME
Measures the semantic similarity of any two GO terms
Not downloadable, so queries had to be made to their web server (not fun)
Pair-wise similarity was measured for each pair of GO terms in each cluster
○ Had to check if terms were in the same namespace
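A sketch of the pair-wise scoring with the namespace check. Because G-SESAME is only reachable through its web server, the similarity function is passed in as a parameter here; the toy lookup table below is a stand-in and not real G-SESAME output. All names and values are hypothetical.

```python
from itertools import combinations

def cluster_score(terms, namespace_of, similarity):
    """Average similarity over all pairs of GO terms in the same namespace.

    terms: GO terms found in one cluster.
    namespace_of: dict mapping GO term -> namespace name.
    similarity: callable taking two GO terms and returning a score.
    Returns None if no pair shares a namespace.
    """
    scores = [
        similarity(a, b)
        for a, b in combinations(sorted(set(terms)), 2)
        if namespace_of[a] == namespace_of[b]
    ]
    return sum(scores) / len(scores) if scores else None

# Hypothetical terms, namespaces, and scores for illustration only.
toy_namespace = {
    "GO:A": "molecular_function",
    "GO:B": "molecular_function",
    "GO:C": "biological_process",
}
toy_scores = {("GO:A", "GO:B"): 0.8}
score = cluster_score(
    ["GO:A", "GO:B", "GO:C"],
    toy_namespace,
    lambda a, b: toy_scores[(a, b)],
)
```

Pairs that cross namespaces are skipped entirely, so the toy similarity table never needs an entry for them.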
![Page 14: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/14.jpg)
Results
Average G-SESAME scores were calculated for each cluster
The average of all cluster averages was 0.484
10 clusters had a score of 0.60 or greater
The data was then randomized by using the same GO terms but in different random clusters, giving an average score of 0.412-0.420 over 4 iterations
Each of the 4 iterations had only 1 or 0 clusters with a score of 0.60 or greater
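The randomization control can be sketched as a shuffle that keeps the same items but reassigns them to clusters. This version assumes the random clusters keep the original cluster sizes, which the slide does not state; the function name and data are hypothetical.

```python
import random

def shuffle_clusters(clusters, rng):
    """Reassign the same items to random clusters of the original sizes.

    clusters: list of lists (e.g. GO terms per cluster).
    rng: a random.Random instance, so runs can be made repeatable.
    """
    pooled = [item for cluster in clusters for item in cluster]
    rng.shuffle(pooled)
    shuffled, start = [], 0
    for cluster in clusters:
        shuffled.append(pooled[start:start + len(cluster)])
        start += len(cluster)
    return shuffled

# Hypothetical clusters of GO term labels.
real = [["a", "b", "c"], ["d", "e"], ["f"]]
randomized = shuffle_clusters(real, random.Random(0))
```

Scoring `randomized` the same way as the real clusters, over several seeds, gives a baseline like the 4 iterations reported on the slide.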
![Page 15: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/15.jpg)
Community Profiling Results
[Scatter plot: G-SESAME Score (y-axis, 0 to 1) vs. Cluster Correlation Coefficient (x-axis, 0.89 to 0.96)]
• Average of all clusters = 0.49
• 10 clusters are > 0.60
![Page 16: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/16.jpg)
Random Results
[Scatter plot: G-SESAME Score (y-axis, 0 to 1) vs. Cluster Correlation Coefficient (x-axis, 0.89 to 0.96)]
• Average of all clusters (4 iterations) = 0.41 - 0.42
• 1 or 0 clusters are > 0.60
![Page 17: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/17.jpg)
![Page 18: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/18.jpg)
BitTorrent
A peer-to-peer file sharing protocol
~ 27-55% of all Internet traffic
Mostly illegal file sharing
Files are shared in small pieces between several users
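To make the "small pieces" idea concrete: in the original BitTorrent protocol a file is cut into fixed-size pieces, and the SHA-1 hash of each piece is stored in the .torrent file so downloaders can verify every piece independently. A minimal sketch; the function name is hypothetical and the tiny piece length is only for illustration (real torrents use pieces of hundreds of kilobytes or more).

```python
import hashlib

def piece_hashes(data: bytes, piece_length: int):
    """Split data into fixed-size pieces and return each piece's SHA-1 hex digest.

    The last piece may be shorter than piece_length.
    """
    if piece_length <= 0:
        raise ValueError("piece_length must be positive")
    return [
        hashlib.sha1(data[i:i + piece_length]).hexdigest()
        for i in range(0, len(data), piece_length)
    ]

# Toy payload: 10 bytes split into pieces of 4 bytes gives 3 pieces.
hashes = piece_hashes(b"a" * 10, 4)
```

Because each piece is checked on its own, pieces can arrive from different users in any order and still be trusted.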
![Page 19: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/19.jpg)
Torrents for Biology
Why use torrent technology?
1. Download large datasets much faster
2. Searchable central listing
3. Decentralization of data
![Page 20: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/20.jpg)
What is BioTorrents?
A legal file sharing website for scientists
Users can upload their own research results, data, software
Users can browse or search through all datasets
Data is not hosted on BioTorrents
![Page 21: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/21.jpg)
www.biotorrents.net
![Page 22: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/22.jpg)
Browse & Search
![Page 23: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/23.jpg)
Details
![Page 24: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/24.jpg)
Sign Up
![Page 25: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/25.jpg)
Upload
![Page 26: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/26.jpg)
Other Features
Forum
RSS Feed
Top 10
FAQ
Links
![Page 27: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/27.jpg)
Who will upload data?
Everyone!
Realistically,
Large organizations (e.g. NCBI, CAMERA, etc.)
○ May need some convincing to host their data via torrents in addition to FTP, HTTP, etc.
Scientists that really support open science
○ Sharing data before it is formally complete and published
![Page 28: Unknown Genes, Community Profiling, & Biotorrents.net](https://reader031.vdocuments.site/reader031/viewer/2022032422/55a8e01c1a28abcb4e8b4586/html5/thumbnails/28.jpg)
Technical Challenges
Many institutions frown on BitTorrent technology
A port must be opened/forwarded
Client program and computer must be left running
Ensuring data is legal, virus free, etc.
○ Users that upload many legitimate torrents will provide more confidence to people downloading
Making downloading and uploading easy