Ten Years and Change
the MX data archive at ALS 8.3.1
Acknowledgements
ALS 8.3.1 creator: Tom Alber
8.3.1 PRT head: Jamie Cate
Center for Structure of Membrane Proteins
Membrane Protein Expression Center II
Center for HIV Accessory and Regulatory Complexes
W. M. Keck Foundation
Plexxikon, Inc.
M D Anderson CRC
University of California Berkeley
University of California San Francisco
National Science Foundation
University of California Campus-Laboratory Collaboration Grant
Henry Wheeler
The Advanced Light Source is supported by the Director, Office of Science, Office of Basic Energy Sciences, Materials Sciences Division, of the US Department of Energy under contract No. DE-AC02-05CH11231 at Lawrence Berkeley National Laboratory.
ALS 8.3.1 data collection history
Figure: terabytes (uncompressed), 0 to 70, vs. year, 2001 to 2013. Series: actual; doubling = 2.8 years.
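The "doubling = 2.8 years" label implies simple exponential growth, V(t) = V0 * 2^(t / 2.8). A minimal sketch of that arithmetic follows; the function names are made up for illustration, and any starting volume passed in is a placeholder rather than a value read off the chart.

```python
DOUBLING_YEARS = 2.8  # doubling time quoted on the slide


def projected_volume(v0_tb, years_elapsed, doubling_years=DOUBLING_YEARS):
    """Data volume after `years_elapsed`, assuming pure exponential growth."""
    return v0_tb * 2.0 ** (years_elapsed / doubling_years)


def annual_growth_rate(doubling_years=DOUBLING_YEARS):
    """Fractional growth per year implied by the doubling time."""
    return 2.0 ** (1.0 / doubling_years) - 1.0


# Hypothetical starting volume of 1 TB: one doubling period later it is 2 TB.
print(projected_volume(1.0, DOUBLING_YEARS))
print(annual_growth_rate())
```

A 2.8-year doubling time works out to roughly 28% growth per year.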
ALS 8.3.1 data collection history
Figure: terabytes (uncompressed), 0 to 70, vs. year, 2001 to 2013, by detector: Proteum 300, Q210, Q315 (907), Q315r (926).
ALS 8.3.1 data collection history
Figure: images x 10^6, 0 to 5, vs. year, 2001 to 2013, by detector: Proteum 300, Q210, Q315 (907), Q315r (926).
DVD data archive: 68 TB
DVD data archive: 50 TB
Primary failure mode of DVDs
3000 files remain unrecoverable (~0.1%)
Which data go with which PDB?
• 260,000 images are called “test”
• cell: 48 62 84 90 101 104 is within 5 Å and 5° of 16,000 PDBs
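The "within 5 Å and 5°" test can be sketched as a per-parameter tolerance check. This is only an illustration: it assumes both cells are given in the same setting as (a, b, c, alpha, beta, gamma), and the comparison cells below are hypothetical, not real PDB entries.

```python
def cells_match(cell_a, cell_b, length_tol=5.0, angle_tol=5.0):
    """True if every edge is within length_tol (Å) and every angle within angle_tol (°)."""
    lengths_ok = all(abs(a - b) <= length_tol
                     for a, b in zip(cell_a[:3], cell_b[:3]))
    angles_ok = all(abs(a - b) <= angle_tol
                    for a, b in zip(cell_a[3:], cell_b[3:]))
    return lengths_ok and angles_ok


def count_matches(query, candidate_cells, length_tol=5.0, angle_tol=5.0):
    """Number of candidate cells that fall inside the tolerance box around `query`."""
    return sum(cells_match(query, c, length_tol, angle_tol)
               for c in candidate_cells)


query = (48, 62, 84, 90, 101, 104)           # cell quoted on the slide
candidates = [                               # hypothetical cells for illustration
    (50, 60, 85, 90, 100, 105),
    (48, 62, 95, 90, 101, 104),
    (90.9, 90.9, 46.8, 90, 90, 120),
]
print(count_matches(query, candidates))
```

With a tolerance box this wide, many unrelated crystals fall inside it, which is the slide's point.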
focusing on 2001-2006
• 490 PDBs credit ALS 8.3.1 with data
• 44 of these didn’t actually collect data here
• 64 collected data, but gave no credit
1. images from 2001-2006: 1,604,031
2. collected “near” edges: 682,712
3. find “runs” of >10 images: 3602
4. unify multi-wedge sets: 3331
5. run labelit & XDS: 2524
6. >70% complete?: 1479
7. I/σ > 10: 1054
8. reduced cell vs PDB: 1 to 200+
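Two of the filtering steps above are simple enough to sketch: grouping frame numbers into "runs" of more than 10 consecutive images, and the completeness and I/σ cut-offs. The function names are hypothetical, and the slide does not say how completeness or I/σ were measured (overall or per shell), so the quality check is only a stand-in.

```python
def find_runs(frame_numbers, min_length=11):
    """Group frame numbers into consecutive runs; keep runs of more than 10 images.

    Returns a list of (first_frame, last_frame) tuples.
    """
    runs, current = [], []
    for n in sorted(set(frame_numbers)):
        if current and n == current[-1] + 1:
            current.append(n)
        else:
            if len(current) >= min_length:
                runs.append((current[0], current[-1]))
            current = [n]
    if len(current) >= min_length:
        runs.append((current[0], current[-1]))
    return runs


def passes_quality(completeness, i_over_sigma):
    """Steps 6 and 7: more than 70% complete and I/sigma above 10."""
    return completeness > 0.70 and i_over_sigma > 10.0


# Hypothetical directory listing: a 180-frame sweep, two stray test shots,
# and an 11-frame wedge.
frames = list(range(1, 181)) + [200, 201] + list(range(300, 311))
print(find_runs(frames))
```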
Which data go with which PDB?
Unit Cell: 90.9 90.9 46.8 90 90 120
Figure: best Rcryst after rigid-body refinement (0.3 to 0.6) vs. RMS unit cell length deviation (Å, 0 to 2). Labeled points: 1hh7 M. TB CSOR, 1rb5, myoglobin.
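The x-axis quantity, RMS unit cell length deviation, can be sketched as below. The exact definition used for the plot is not given on the slide; this version assumes a plain root-mean-square of the differences in a, b and c, and the second cell is hypothetical.

```python
import math


def rms_length_deviation(cell_a, cell_b):
    """Root-mean-square difference of the three cell edges (Å)."""
    diffs = [a - b for a, b in zip(cell_a[:3], cell_b[:3])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


observed = (90.9, 90.9, 46.8, 90, 90, 120)   # unit cell shown on the slide
candidate = (91.9, 91.9, 45.8, 90, 90, 120)  # hypothetical PDB cell
print(rms_length_deviation(observed, candidate))
```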
MAD/SAD datasets
Figure: Riso vs PDB deposit (0 to 1) vs. best Rcryst after rigid-body refinement (0.20 to 0.60). Regions: Published, non-isomorphous, Unsolved?
Responses to inquiries
“I have to find my old note book as I have no idea what that is.”
“I have changed jobs a few times since and am really far away from crystallography now.”
“Will see what I can find.”
“We solved it but never published it. Sorry!”
EGDA
Dec 01 19:45:12 2001 egda46_*1_E#_###.img (1112 images, Se MAD)
Dec 02 15:10:06 2001 egda27_*1_###.img (180, 1A, native?)
Dec 02 19:21:55 2001 egdau1_*1_###.img (427, 8000eV (U?) SAD)
Dec 02 20:58:26 2001 egdau1_*2_###.img (360, 8000eV (U?) SAD)
Jun 01 14:07:43 2002 egda60_*1_###.img (360, Lutetium SAD)
“I think that these EGDA data sets are very likely some of xxx’s data sets, he was working on E.coli guanine deaminase, something he brought from yyy. No structure was ever published James, xxx was unable to solve the structure from these data.”
~2.9 Å
P21212
R = 0.32
Rfree = 0.39
PDB ID: ????
E. coli guanine deaminase
Metadata: can we rely on it?
Duquerroy, et al. (1994). "Lobster enolase crystallized by serendipity", Proteins: Struct., Funct., Bioinf. 18, 390-393.
authors were after lobster arginine kinase
got enolase instead
arginine kinase structure still unknown
raw image: compresses 4.2x
just spots: compresses 337x
pixel-wise median across dataset: compresses 5x, but only one per dataset!
deviation from median in “non-spot” areas: compresses 3.5x
after h264 of non-spot areas: compressed ~50x
difference between raw and compressed: compresses 5.2x
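The decomposition above (a per-dataset median background plus a per-image deviation) can be sketched with the standard library alone. This is a toy illustration: zlib stands in for whatever lossless codec produced the ratios on the slides, the lossy h264 step is not reproduced, and the pixel values are made up.

```python
import zlib
from array import array
from statistics import median_low


def pixelwise_median(frames):
    """Median of each pixel across all frames (frames are equal-length flat lists)."""
    return [median_low(values) for values in zip(*frames)]


def residual(frame, background):
    """Per-pixel deviation of one frame from the dataset background."""
    return [p - b for p, b in zip(frame, background)]


def compression_ratio(pixels):
    """Lossless zlib ratio for a list of signed integer pixel values."""
    raw = array("i", pixels).tobytes()
    return len(raw) / len(zlib.compress(raw, 9))


# Three hypothetical 4-pixel frames; the 500 plays the role of a spot.
frames = [
    [10, 12, 11, 500],
    [11, 12, 10, 13],
    [10, 13, 11, 12],
]
background = pixelwise_median(frames)
print(background)
print(residual(frames[0], background))
```

Adding the residual back to the background recovers the frame exactly, so this part of the scheme is lossless; only the later h264 step discards information.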
Lossy compression vs R/Rfree
Figure: R factor (0.18 to 0.38) vs. compression ratio (1 to 100, log scale). Series: R_cryst, R_free.
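For reference, the quantity on the vertical axis is the conventional crystallographic R factor, R = Σ| |Fobs| - k|Fcalc| | / Σ|Fobs|. R_free is the same sum taken over reflections held out of refinement. The sketch below uses a single linear scale factor k, which is a simplification of what refinement programs do, and the amplitudes are made-up numbers.

```python
def r_factor(f_obs, f_calc):
    """Conventional R factor with one overall linear scale factor k."""
    sum_obs = sum(abs(o) for o in f_obs)
    k = sum_obs / sum(abs(c) for c in f_calc)
    numerator = sum(abs(abs(o) - k * abs(c)) for o, c in zip(f_obs, f_calc))
    return numerator / sum_obs


# Hypothetical structure-factor amplitudes.
print(r_factor([10.0, 20.0], [12.0, 18.0]))
```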
backblaze.com “pod” server
backblaze.com offers “unlimited storage” data backup for $5/month.
backblaze does not sell these “pods”, but “protocase.com” does.
Summary
• saving data could double productivity
• unit cell is not a good score
• lossy compression: rallying cry?
• backup vs archive
• metadata: what do we really know?
Brief Summary
• this is a lot of work.
• who is going to pay for it?