Index Building
Overview
• Database tables
• Building flow (logical)
• Sequential
• Drawbacks
• Parallel processing
• Recovery
• Helpful rules
Database tables
Word Index:
• Z97 - word dictionary
• Z98 - bitmap
• Z980 - cache of bitmap updates
• Z95 - words in document
Database tables
Z97
• translation from word to internal representation (sequence)
• same character set as documents
Database tables
Z98
• “bitmap” of word occurrence in documents
• each bitmap is physically made up of one or more records
• compressed
• one bitmap for every combination of word and index
Database tables
Z980
• cache of bitmap updates
• increases speed of large bitmap updates
• 1/1000
Database tables
Z95
• list of words and their location in a document
• adjacency
Database tables
Heading index:
• Z01 - phrase dictionary
• Z02 - phrase->document mapping
Database tables
Z01:
• filing phrase
• connection to authority database
• hash key (display text)
Building flow - word
Stage 1: Retrieval + Sort
• Read document
• prepare list of words and locations
• for each word find list of indices it belongs to
• sort according to words
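Stage 1 can be sketched in Python as follows. This is only an illustration: the whitespace tokenization, the field names, the index codes in INDEX_OF_FIELD and the tuple layout are all assumptions, since the slides do not describe the retrieval program or the intermediate file format.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from a document field to the indices its words belong to.
INDEX_OF_FIELD: Dict[str, List[str]] = {
    "title": ["WTI", "WRD"],
    "author": ["WAU", "WRD"],
}

# (word, index, document number, position of the word in the document)
Entry = Tuple[str, str, int, int]


def retrieve(doc_number: int, fields: Dict[str, str]) -> List[Entry]:
    """List every word of one document with its indices and its location."""
    entries: List[Entry] = []
    position = 0
    for field, text in fields.items():
        for word in text.lower().split():
            position += 1
            for index in INDEX_OF_FIELD.get(field, []):
                entries.append((word, index, doc_number, position))
    return entries


def stage1(documents: Dict[int, Dict[str, str]]) -> List[Entry]:
    """Retrieve all documents, then sort the combined list by word."""
    entries: List[Entry] = []
    for doc_number, fields in documents.items():
        entries.extend(retrieve(doc_number, fields))
    entries.sort()  # tuples sort by word first
    return entries
```

The sorted list stands in for the first intermediate file that stage 2 reads.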
Building flow - word
Stage 2: Word Dictionary
• read intermediate file from stage 1
• build up word dictionary (check + load)
• replace word with internal representation
• create 2nd intermediate file
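A minimal Python sketch of the "check + load" step, assuming the word dictionary (Z97) can be modeled as a plain dict from word to sequence number and that new words simply receive the next free number. The function name and the tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple

WordEntry = Tuple[str, str, int, int]  # (word, index, document number, position)
SeqEntry = Tuple[int, str, int, int]   # (word number, index, document number, position)


def stage2(entries: List[WordEntry], z97: Dict[str, int]) -> List[SeqEntry]:
    """Replace each word with its sequence number, extending z97 when needed."""
    next_seq = max(z97.values(), default=0) + 1
    out: List[SeqEntry] = []
    for word, index, doc, pos in entries:
        seq = z97.get(word)          # check: is the word already known?
        if seq is None:              # load: add it with a new sequence number
            seq = next_seq
            z97[word] = seq
            next_seq += 1
        out.append((seq, index, doc, pos))
    return out
```

The returned list plays the role of the 2nd intermediate file, which stages 3 and 4 both consume.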
Building flow - word
Stage 3: Sort + Build Z95
• sort intermediate file from stage 2 - by document number
• create Z95 records
• load Z95 sequential file to database
Building flow - word
Stage 4: Merge + Build Z98
• intermediate file from stage 2 already sorted by word number
• split words into a number of files according to range of word numbers
• merge into Z98 records
• load sequential files
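The merge step can be illustrated with a small Python sketch. It assumes a bitmap can be modeled as an integer whose bit n is set when document n contains the word. Compression and the split of one bitmap over several physical records are not modeled, and the function name is hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Tuple

SeqEntry = Tuple[int, str, int, int]  # (word number, index, document number, position)


def build_bitmaps(entries: List[SeqEntry]) -> Dict[Tuple[int, str], int]:
    """Merge entries into one bitmap per (word number, index) combination."""
    bitmaps: Dict[Tuple[int, str], int] = {}
    ordered = sorted(entries)  # groups each (word number, index) pair together
    for key, group in groupby(ordered, key=lambda e: (e[0], e[1])):
        bits = 0
        for _, _, doc, _ in group:
            bits |= 1 << doc   # mark the document as containing the word
        bitmaps[key] = bits
    return bitmaps
```

For example, a word found in documents 1 and 3 yields the bitmap 0b1010.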
Building flow - heading
Stage 1: Retrieval + Sort
• Read document
• prepare list of phrases
• for each phrase find list of indices it belongs to
• sort according to hash key
Building flow - heading
Stage 2: Phrase Dictionary
• read intermediate file from stage 1
• build up phrase dictionary
• generate unique key - acc sequence
• load Z01 sequential file to database
• build Z02 - non unique
Building flow - heading
Stage 3: Sort + Load Z02
• sort non unique Z02 sequential file
• load Z02 sequential file to database
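Heading stages 2 and 3 can be sketched in Python under several assumptions: the phrase dictionary (Z01) is modeled as a dict keyed by index and filing phrase, the acc sequence is a simple counter, and "non unique" is read as meaning the Z02 file may hold duplicate rows that the sort removes. None of these details are confirmed by the slides, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

PhraseEntry = Tuple[str, str, int]  # (filing phrase, index, document number)


def heading_stage2(entries: List[PhraseEntry]):
    """Build the phrase dictionary and the unsorted phrase->document list."""
    z01: Dict[Tuple[str, str], int] = {}  # (index, phrase) -> acc sequence
    z02: List[Tuple[int, int]] = []       # (acc sequence, document number)
    for phrase, index, doc in entries:
        key = (index, phrase)
        if key not in z01:
            z01[key] = len(z01) + 1       # next acc sequence (assumed counter)
        z02.append((z01[key], doc))
    return z01, z02


def heading_stage3(z02: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Sort the Z02 rows; dropping duplicates here is an assumption."""
    return sorted(set(z02))
```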
Sequential - word
• Every stage is handled by a single process
• Only after handling by a previous stage would the next stage proceed
• stage 4 would proceed after all other stages were finished
Sequential - word
Example from version 12.1:
    csh -f p_manage_01_a $1 >& $data_scratch/p_manage_01_a.log &
    csh -f p_manage_01_b $1 >& $data_scratch/p_manage_01_b.log &
    csh -f p_manage_01_c $1 >& $data_scratch/p_manage_01_c.log &
    csh -f p_manage_01_d $1 >& $data_scratch/p_manage_01_d.log
    csh -f p_manage_01_e $1 >& $data_scratch/p_manage_01_e.log
Sequential - word
• p_manage_01_a: retrieval
• p_manage_01_b: sort (by word)
• p_manage_01_c: build Z97
• p_manage_01_d: build Z95
• p_manage_01_e: merge + build Z98
Drawbacks
• Minimum parallel processing
• Single process per stage
• No recoverability - Z97 could be reused but the whole building process needed to be rerun
• Computer resources not fully utilized
• Long run time
Parallel processing
• Large databases - multiple processors
• Identify stages that are not “workflow” bottlenecks
• Coordinate parallel processes with assignment/progress table
Parallel processing (word)
Stage 1: Retrieval + Sort
• Retrieval is parallel - “io” not “workflow” bottleneck
• Split into cycles, each covering a range of document numbers
Parallel processing (word)
p_manage_01_a.cycles - initial
0001 - - - - 000000001 000010000
0002 - - - - 000010001 000020000
0003 - - - - 000020001 000030000
0004 - - - - 000030001 000040000
0005 - - - - 000040001 000050000
0006 - - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
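The layout of the initial cycles file can be reproduced with a short Python sketch. It assumes each row holds a 4-digit cycle number, one status flag per stage, and 9-digit first and last document numbers, which is how the listing above reads. The function names are hypothetical, and this is not the script that actually creates the file.

```python
from typing import List, Tuple


def initial_cycles(last_doc: int, cycle_size: int, stages: int = 4) -> str:
    """Build the text of a cycles file with every stage flag set to '-'."""
    lines: List[str] = []
    start, number = 1, 1
    while start <= last_doc:
        end = min(start + cycle_size - 1, last_doc)
        flags = " ".join("-" * stages)
        lines.append(f"{number:04d} {flags} {start:09d} {end:09d}")
        start = end + 1
        number += 1
    return "\n".join(lines)


def parse_row(line: str) -> Tuple[int, List[str], int, int]:
    """Split one row into (cycle number, flags, first doc, last doc)."""
    parts = line.split()
    return int(parts[0]), parts[1:-2], int(parts[-2]), int(parts[-1])
```

Calling initial_cycles(110511, 10000) gives twelve rows, the last one covering documents 110001 to 110511.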
Parallel processing (word)
p_manage_01_a.cycles - 3 processes, 1st retrieval cycle
0001 ? - - - 000000001 000010000
0002 ? - - - 000010001 000020000
0003 ? - - - 000020001 000030000
0004 - - - - 000030001 000040000
0005 - - - - 000040001 000050000
0006 - - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
Parallel processing (word)
p_manage_01_a.cycles - 3 processes, 2nd retrieval cycle
0001 + + ? - 000000001 000010000
0002 + ? - - 000010001 000020000
0003 + - - - 000020001 000030000
0004 ? - - - 000030001 000040000
0005 ? - - - 000040001 000050000
0006 ? - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
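How processes might pick work from this table can be sketched in Python. The sketch assumes '-' means not processed, '?' means in process, and '+' means finished (the last of these is inferred from the listings), and that a stage may take a cycle only once the previous stage has finished it. File locking between concurrent processes is not modeled, and both function names are hypothetical.

```python
from typing import List, Optional


def claim_next(table: List[List[str]], stage: int) -> Optional[int]:
    """Mark the first cycle that is ready for `stage` as in process.

    Returns the row position of the claimed cycle, or None if nothing is ready.
    """
    for row, flags in enumerate(table):
        previous_done = stage == 0 or flags[stage - 1] == "+"
        if flags[stage] == "-" and previous_done:
            flags[stage] = "?"
            return row
    return None


def finish(table: List[List[str]], row: int, stage: int) -> None:
    """Mark a claimed cycle as finished for `stage`."""
    table[row][stage] = "+"
```

With three retrieval processes, three calls to claim_next(table, 0) mark the first three cycles with '?', as in the 1st retrieval cycle listing.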
Parallel processing (word)
• Whenever possible stages were split into separate sub-stages
• Usually in cases of non-parallel stages
• stages 2 and 3 were not made into parallel processes - retrieval was by far the most costly stage
Parallel processing (word)
Stage 2 and 3 were subdivided into the 3 sub stages:
• build Z97 + load
• sort intermediate file by document number
• build Z95 + load
Parallel processing (word)
p_manage_01_a.cycles - example
0001 + + + + 000000001 000010000
0002 + + + ? 000010001 000020000
0003 + + ? - 000020001 000030000
0004 + + - - 000030001 000040000
0005 + ? - - 000040001 000050000
0006 + - - - 000050001 000060000
0007 ? - - - 000060001 000070000
0008 ? - - - 000070001 000080000
0009 ? - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
Parallel processing (word)
Stage 4 is split into sub stages:
• pre-processing of intermediate files from stage 2 - distribution of words
• build Z98 - parallel
• load Z98 sequential file
• input files are compressed and stored in separate directory
Parallel processing (word)
Pre-processing:
• generate histogram - # of lines per 5000 words
• determine range of words - no more than 1G in intermediate files
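One way to turn the histogram into word ranges is a greedy split, sketched below in Python. The greedy rule, the limit expressed in lines rather than bytes, and the open-ended last range ending at 999999999 are assumptions. The real pre-processing step may choose ranges differently.

```python
from typing import List, Tuple


def split_ranges(histogram: List[int], max_lines: int,
                 bucket_size: int = 5000) -> List[Tuple[int, int]]:
    """Group consecutive histogram buckets into word-number ranges.

    histogram[i] is the number of intermediate-file lines whose word number
    falls in bucket i, i.e. between i*bucket_size + 1 and (i+1)*bucket_size.
    """
    ranges: List[Tuple[int, int]] = []
    start, total = 1, 0
    for i, count in enumerate(histogram):
        # Close the current range before it would exceed the limit.
        if total and total + count > max_lines:
            ranges.append((start, i * bucket_size))
            start, total = i * bucket_size + 1, 0
        total += count
    ranges.append((start, 999999999))  # last range is left open-ended
    return ranges
```

To apply the 1G limit, max_lines would be derived from the average line length of the intermediate files.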
Parallel processing (word)
p_manage_01_e.cycles
0001 - - 000000001 000600000
0002 - - 000600001 000900000
0003 - - 000900001 999999999
Parallel processing (word)
Build Z98:
• intermediate files - split into discrete range of words
• parallel merging and building of Z98
Parallel processing (word)
p_manage_01_e.cycles - example
0001 + ? 000000001 000600000
0002 ? - 000600001 000900000
0003 ? - 000900001 999999999
Parallel processing (heading)
Stage 1: Retrieval + Sort
• same handling as word index stage 1
• “io” bottleneck
• Split into cycles, each covering a range of document numbers
Parallel processing (heading)
p_manage_02.cycles
0001 - - - - 000000001 000005000
0002 - - - - 000005001 000010000
0003 - - - - 000010001 000015000
0004 - - - - 000015001 000020000
0005 - - - - 000020001 000025000
0006 - - - - 000025001 000030000
0007 - - - - 000030001 000035000
0008 - - - - 000035001 000040000
0009 - - - - 000040001 000045000
0010 - - - - 000045001 000048435
Parallel processing (heading)
Stage 2 and 3 were subdivided into the 3 sub stages:
• build Z01 + load + build Z02
• sort non unique Z02 sequential file
• load Z02
Parallel processing (heading)
p_manage_02.cycles - example
0001 + + + ? 000000001 000005000
0002 + + ? - 000005001 000010000
0003 + + - - 000010001 000015000
0004 + ? - - 000015001 000020000
0005 + - - - 000020001 000025000
0006 ? - - - 000025001 000030000
0007 ? - - - 000030001 000035000
0008 ? - - - 000035001 000040000
0009 - - - - 000040001 000045000
0010 - - - - 000045001 000048435
Parallel processing (heading)
Building of headings is conceptually and practically similar to word building, except for the building of bitmaps (Z98)
Recovery
Word index:
• stages 1-3 and stage 4 are separate
• stage 4 runs only after all processing is done in stage 3
Recovery
Stage 1-3 - scenarios:
• database tables need to be enlarged
• not enough disk space - intermediate files
• not enough disk space - sort
• general disaster?
Recovery
Stage 1-3:
• identify last successful section
• change “in process” signs (?) to “not processed” sign (-)
• rerun discrete stage scripts:
– p_manage_01_a
– p_manage_01_c
– p_manage_01_d
– p_manage_01_d1
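The flag reset before a rerun can be sketched in Python, assuming the cycles file has whitespace-separated rows with a cycle number, the status flags, and two document numbers. The function name is hypothetical and the slides do not say which tool, if any, was used for this edit.

```python
def reset_in_process(text: str) -> str:
    """Turn every "in process" flag (?) back into "not processed" (-)."""
    out = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        number, flags, doc_range = parts[0], parts[1:-2], parts[-2:]
        flags = ["-" if flag == "?" else flag for flag in flags]
        out.append(" ".join([number, *flags, *doc_range]))
    return "\n".join(out)
```

Finished flags (+) are left alone, so completed cycles are not repeated when the stage scripts are rerun.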
Recovery
Stage 4:
• must be rerun in totality
• input files are saved and compressed
• $word_compress_dir
• p_manage_01_e
Helpful rules
Stage 1 outrunning stage 2-3:
• decide on number of stage 1 processes to stop (p_manage_01_a)
• kill shell and program process
• reset associated cycle in p_manage_01_a.cycles
Helpful rules
Log file names:
p_manage_01_a_{process_number}.log
p_manage_01_e_{process_number}.log
others are without process_number:
p_manage_01_c.log
p_manage_01_d.log
p_manage_01_d1.log
p_manage_01_e1.log
p_manage_01_e2.log
Helpful rules
cycle size:
# docs < 2M - 50k
# docs < 4M - 100k
otherwise - 200k
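The rule can be written as a small Python helper. The slides do not say how a document count that falls exactly on a threshold is treated, so the strict comparisons below are an assumption, and both function names are hypothetical.

```python
def cycle_size(docs: int) -> int:
    """Suggested number of documents per cycle for a database of `docs` documents."""
    if docs < 2_000_000:
        return 50_000
    if docs < 4_000_000:
        return 100_000
    return 200_000


def cycle_count(docs: int, size: int) -> int:
    """Number of cycles needed to cover all documents (last cycle may be short)."""
    return -(-docs // size)  # ceiling division
```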
Helpful rules
Disk space calculation:
d = no. documents
c = no. cycles
p = no. processors
s = size of retrieval file
Helpful rules
Sort space ($TMPDIR):
sort = p*s + 20%
stage 1 sort (parallel) +stage 2,3 sorting (single file)
Helpful rules
Scratch space:
scratch = p*1.5*s + c*s*1/3
output from stage 1 (in process and not yet processed) +
output from stage 3
Helpful rules
Example: UBU
d = 2M, cycle size = 50k
p = 4, c = 40, s = ~0.5G
sort = 4*0.5*1.2 = 2.4G
scratch = 4*1.5*0.5 + 40*0.5*1/3 = 3G + 6.67G = 9.67G
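The two space formulas can be checked with a few lines of Python. The function names are hypothetical; p, c and s are as defined on the disk space slide, with s in gigabytes.

```python
def sort_space(p: int, s: float) -> float:
    """Sort space ($TMPDIR): p*s plus 20%."""
    return p * s * 1.2


def scratch_space(p: int, c: int, s: float) -> float:
    """Scratch space: stage 1 output (p*1.5*s) plus stage 3 output (c*s/3)."""
    return p * 1.5 * s + c * s / 3


if __name__ == "__main__":
    # Values from the UBU example: p=4, c=40, s=0.5G.
    print(f"sort    = {sort_space(4, 0.5):.2f}G")
    print(f"scratch = {scratch_space(4, 40, 0.5):.2f}G")
```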