Post on 21-Dec-2015
Copyright © 2008 Mark Logic Corporation. All rights reserved. 1
Unlock Content™
MarkLogic Server: Under The Hood
Mary Holstege
Principal Engineer
MarkLogic Server
XML Server
Special-purpose DBMS for XML
Semi-structured
Hierarchical
Designed for 100s of TB of XML
How Did We Get Here?
Founder: Christopher Lindblad
MIT
Architect of Ultraseek Server, an intranet search engine product
Met people that wanted to use a search engine like a database
Rich query language
Guaranteed correctness
Transactions
Consider an Application
Documents + metadata
Documents: rich, variable structure
Want: complex full-text search
Want: combined text, metadata, structure-aware search
Want: granular ad hoc access
Want: real-time query
How do you build it?
A Different Approach
Soul of a Search Engine: Data Model And Queries
Database: On-disk Organization And Transactions
Data Model
Document
Title
Author
Abstract
Section
Section
Footer
Section
Section
Section (cont’d)
Metadata
Data Model
A database for XML . . .
. . . uses the XML Data Model
XML is a tree
[Figure: tree rooted at Document, with nodes Title, Author (First, Last), several Section nodes, and Metadata]
Example Document
<article>
<title>MarkLogic Server: The Best Place for XML</title>
<author><first-name>John</first-name><last-name>Kreisa</last-name></author>
<abstract>
Where should one put their XML? <company>Mark Logic</company> has the best answer to this question: MarkLogic Server. . . .
</abstract>
<body>
<section>
<section> This high performance engine can . . . </section>
</section>
<section> Using an inverted index technique . . . </section>
</body>
<copyright>Copyright© 2008 Mark Logic Corporation. All rights Reserved.</copyright>
</article>
What Queries Is It Good At?
1) Full-Text Search
Find all documents that contain the phrase “high performance”.
2) XML Structure
Find all articles that have an abstract.
3) XML Semantics
Find all documents that mention the company “Mark Logic”.
4) All of the above . . . at the same time
Find all articles that contain the phrase “high performance” and mention the company “Mark Logic” in the abstract.
1) Full-text Search
Find all documents that contain the phrase “high performance”
1) Full-text Search
Doc    very   high   performance   index
122    0      1      0             0
123    1      0      1             1
124    0      0      0             0
125    0      1      0             0
126    0      1      1             0
127    1      0      0             0
129    1      1      0             0
130    0      1      1             1
Find all documents that contain the phrase “high performance”
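A rough Python sketch of how the matrix on this slide relates to term lists. The matrix values are copied from the slide; the names `invert` and `docs_with_all` are made up for illustration and are not MarkLogic APIs. Note that intersecting the two word lists only finds documents containing both words somewhere, which is a candidate set for the phrase rather than a confirmed phrase match.

```python
# Term-document matrix copied from the slide: one row per document id.
TERMS = ["very", "high", "performance", "index"]
MATRIX = {
    122: [0, 1, 0, 0],
    123: [1, 0, 1, 1],
    124: [0, 0, 0, 0],
    125: [0, 1, 0, 0],
    126: [0, 1, 1, 0],
    127: [1, 0, 0, 0],
    129: [1, 1, 0, 0],
    130: [0, 1, 1, 1],
}


def invert(terms, matrix):
    """Turn matrix rows into one sorted list of document ids per term."""
    lists = {term: [] for term in terms}
    for doc_id in sorted(matrix):
        for term, bit in zip(terms, matrix[doc_id]):
            if bit:
                lists[term].append(doc_id)
    return lists


TERM_LISTS = invert(TERMS, MATRIX)


def docs_with_all(words):
    """Documents that contain every given word (not necessarily adjacent)."""
    sets = [set(TERM_LISTS[word]) for word in words]
    return sorted(set.intersection(*sets))


if __name__ == "__main__":
    print(TERM_LISTS["high"])
    print(docs_with_all(["high", "performance"]))
```

Storing only the non-zero entries per term is what keeps the structure compact when most cells of the matrix are zero.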
1) Full-text Search
UNIVERSAL INDEX
Term                   Document References
“very”                 123, 127, 129, 152, 344, 791 . . .
“high”                 122, 125, 126, 129, 130, 167 . . .
“performance”          123, 126, 130, 142, 143, 167 . . .
“index”                123, 130, 131, 135, 162, 177 . . .
“high performance”     126, 130, 167, 212, 219, 377 . . .
“very high”            . . .
“performance index”    . . .
Matching documents: 126, 130, 167, 212, 219, 377 . . .
Find all documents that contain the phrase “high performance”
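One way to picture an index that holds phrases as terms, sketched in Python. This is only an illustration under assumptions: the document texts are invented (the ids merely echo the slide), tokenization is a plain whitespace split, and `terms_of` / `build_index` are hypothetical helpers, not the server's implementation.

```python
from collections import defaultdict


def terms_of(text):
    """Yield every single word and every adjacent two-word phrase."""
    words = text.lower().split()  # naive tokenizer: no punctuation handling
    yield from words
    for first, second in zip(words, words[1:]):
        yield f"{first} {second}"


def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in terms_of(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents; the texts are made up for this example.
DOCS = {
    123: "very good performance index",
    126: "this high performance engine",
    129: "very high index",
    130: "a very high performance index",
}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    # A phrase query becomes a single term lookup.
    print(INDEX["high performance"])
```

With the phrase stored as its own term, the query on the slide needs one lookup instead of a scan of the documents.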
2) XML Structure
Find all articles that have an abstract
2) XML Structure
UNIVERSAL INDEX
Term                   Document References
“very”                 123, 127, 129, 152, 344, 791 . . .
“high”                 122, 125, 126, 129, 130, 167 . . .
“performance”          123, 126, 130, 142, 143, 167 . . .
“index”                123, 130, 131, 135, 162, 177 . . .
“high performance”     126, 130, 167, 212, 219, 377 . . .
<article>              . . .
<article>/<abstract>   . . .
Find all articles that have an abstract
3) XML Semantics
Find all documents that mention the company “Mark Logic”
3) XML Semantics
UNIVERSAL INDEX
Term                   Document References
“very”                 123, 127, 129, 152, 344, 791 . . .
“high”                 122, 125, 126, 129, 130, 167 . . .
“performance”          123, 126, 130, 142, 143, 167 . . .
“index”                123, 130, 131, 135, 162, 177 . . .
“high performance”     126, 130, 167, 212, 219, 377 . . .
<article>              . . .
<article>/<abstract>   . . .
<company>Mark Logic</  . . .
Find all documents that mention the company “Mark Logic”
4) All Of The Above
Find all articles that contain the phrase “high performance” and mention the company “Mark Logic” in the abstract
4) All Of The Above
UNIVERSAL INDEX
Term                   Document References
“very”                 123, 127, 129, 152, 344, 791 . . .
“high”                 122, 125, 126, 129, 130, 167 . . .
“performance”          123, 126, 130, 142, 143, 167 . . .
“index”                123, 130, 131, 135, 162, 177 . . .
“high performance”     126, 130, 167, 212, 219, 377 . . .
<article>              . . .
<article>/<abstract>   . . .
<abstract>/<company>   . . .
<company>Mark Logic</  . . .
Find all articles that contain the phrase “high performance” and mention the company “Mark Logic” in the abstract
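A combined query like this one can be read as an intersection of term lists. The Python sketch below assumes document-level lists; only the “high performance” list comes from the slide (where it is truncated), and the lists for the structural terms are invented because the slide elides them. The function names are hypothetical.

```python
def intersect(left, right):
    """Intersect two sorted lists of document ids with a linear merge."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


INDEX = {
    "high performance": [126, 130, 167, 212, 219, 377],  # from the slide
    "<article>": [122, 126, 130, 167, 212, 377],          # hypothetical
    "<abstract>/<company>": [126, 130, 167, 219, 377],    # hypothetical
    "<company>Mark Logic": [123, 126, 167, 377],          # hypothetical
}


def all_of(index, terms):
    """Documents whose ids appear in every requested term list."""
    lists = sorted((index[term] for term in terms), key=len)
    result = lists[0]  # start from the shortest list to keep work small
    for other in lists[1:]:
        result = intersect(result, other)
    return result


if __name__ == "__main__":
    print(all_of(INDEX, list(INDEX)))
```

Starting from the shortest list is a common design choice: the running result can never grow, so later merges touch as few ids as possible.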
Scalar Indexes
UNIVERSAL INDEX (same terms and term lists as on the previous slide)
Document References: 126, 130, 167, …
Identify a set of documents based on criteria and then characterize the set with scalar indexes (float, dateTime, string etc.)
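As a loose illustration of "characterize the set", the Python sketch below keeps a scalar index as sorted (value, document id) pairs. The year values are invented, and `facet` and `docs_in_range` are hypothetical names; the slide does not say how the server lays out its scalar indexes.

```python
from bisect import bisect_left, bisect_right
from collections import Counter

# Hypothetical scalar index on a publication year, sorted by value.
YEAR_INDEX = sorted([
    (2004, 123), (2005, 126), (2006, 130), (2006, 167),
    (2007, 212), (2008, 219), (2008, 377),
])


def facet(index, doc_ids):
    """Count how often each value occurs within a given document set."""
    wanted = set(doc_ids)
    return Counter(value for value, doc_id in index if doc_id in wanted)


def docs_in_range(index, low, high):
    """Document ids whose value lies in [low, high], found by binary search."""
    start = bisect_left(index, (low,))
    end = bisect_right(index, (high, float("inf")))
    return sorted(doc_id for _, doc_id in index[start:end])


if __name__ == "__main__":
    selected = [126, 130, 167]  # the document set shown on the slide
    print(dict(facet(YEAR_INDEX, selected)))
    print(docs_in_range(YEAR_INDEX, 2006, 2007))
```

The first call characterizes an already selected set; the second shows that the same sorted structure also supports range lookups.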
Geospatial, too
UNIVERSAL INDEX (same terms and term lists as on the previous slide)
Document References: 126, 130, 167, …
Just a special kind of scalar index, except values are points and scan operators know about Earth geometry
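To make "values are points and scan operators know about Earth geometry" concrete, here is a small Python sketch. It assumes a spherical Earth and the haversine formula, which is a stand-in chosen for this example and not necessarily what the server uses; the coordinates and names are invented.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Hypothetical point index: (point, document id) pairs.
POINT_INDEX = [
    ((37.50, -122.26), 126),
    ((40.71, -74.01), 130),
    ((37.77, -122.42), 167),
]


def docs_within(index, center, radius_km):
    """Scan the index for documents whose point lies inside a circle."""
    return sorted(doc_id for point, doc_id in index
                  if haversine_km(point, center) <= radius_km)


if __name__ == "__main__":
    print(docs_within(POINT_INDEX, (37.60, -122.30), 50))
```

The only difference from the scalar case is the comparison: instead of ordering values, the scan tests each point against a region.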
Universal Index Is Our Hammer
We turn queries into nails
Examples Of Nails
Directories
Exclusive, hierarchical, analogous to file system, map to URI
Collections
Set-based, N:N relationship
Security
Invisible to your app
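A guess at what "turning these into nails" could look like, sketched in Python: directories, collections, and read permissions all become extra index terms, and a search silently adds the caller's role. The term prefixes (`dir:`, `collection:`, `role:`), the sample data, and the function names are all invented for illustration.

```python
from collections import defaultdict


def nail_terms(uri, collections, read_roles):
    """Index terms for a document's directories, collections, and readers."""
    terms = set()
    parts = uri.strip("/").split("/")[:-1]  # directory components only
    for depth in range(1, len(parts) + 1):
        terms.add("dir:/" + "/".join(parts[:depth]) + "/")
    terms.update(f"collection:{name}" for name in collections)
    terms.update(f"role:{role}" for role in read_roles)
    return terms


def build_index(docs):
    """Map each term to the set of document ids that carry it."""
    index = defaultdict(set)
    for doc_id, (uri, collections, roles) in docs.items():
        for term in nail_terms(uri, collections, roles):
            index[term].add(doc_id)
    return index


def search(index, terms, user_role):
    """Intersect the requested terms with the caller's role term."""
    wanted = list(terms) + [f"role:{user_role}"]
    return sorted(set.intersection(*(index.get(t, set()) for t in wanted)))


# Hypothetical documents: id -> (URI, collections, roles allowed to read).
DOCS = {
    1: ("/news/2008/a.xml", ["press"], ["editor", "guest"]),
    2: ("/news/2008/b.xml", ["press", "draft"], ["editor"]),
    3: ("/books/c.xml", ["press"], ["guest"]),
}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(search(INDEX, ["dir:/news/"], "guest"))
```

Because the role term is appended inside `search`, application code never has to mention security, which is one reading of "invisible to your app".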
Many Shapes And Sizes
News Article, Book, Research Report
Slide Presentation, Product Sheet, Operations Manual
Load As Is
XML is self-describing
<article>
<title>MarkLogic Server: . . .</title>
<author>
<first-name>John</first-name>
<last-name>Kreisa</last-name>
</author>
<abstract>
. . . . <company>Mark Logic</company>
</abstract>
<body>
<section>
<section> . . .</section>
</section>
<section> . . . index . . . </section>
</body>
<copyright>Copyright© . . . </copyright>
</article>
Load As Is
XML is self-describing
[Figure: the same article shown next to its parts: the elements <article>, <title>, <author>, <first-name>, <last-name>, <abstract>, <company>, <body>, <section>, and <copyright>, and the text values they contain]
Load As Is
<article>
  <title>          "MarkLogic Server: . . ."
  <author>
    <first-name>   "John"
    <last-name>    "Kreisa"
  <abstract>       " . . . "
    <company>      "Mark Logic"
  <body>
    <section>
      <section>    " . . . "
    <section>      " . . . index . . . "
  <copyright>      " . . . "
XML is self-describing
Load As Is
XML is self-describing No Schema Needed!
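A small Python sketch of why no schema is needed to index a document: the element names, parent/child pairs, and values can all be read off the document itself. The sample article is abridged from the slides; the term formats and the `terms_of` helper are assumptions made for this example, not the server's internal representation.

```python
import xml.etree.ElementTree as ET

ARTICLE = """<article>
  <title>MarkLogic Server</title>
  <author><first-name>John</first-name><last-name>Kreisa</last-name></author>
  <abstract>Where to put XML? <company>Mark Logic</company> has an answer.</abstract>
</article>"""


def terms_of(xml_text):
    """Derive word, element, path, and element-value terms from raw XML."""
    root = ET.fromstring(xml_text)
    # Word terms: every whitespace-separated token in the document text.
    terms = {word.lower() for word in " ".join(root.itertext()).split()}

    def walk(elem, parent_tag):
        terms.add(f"<{elem.tag}>")
        if parent_tag is not None:
            terms.add(f"<{parent_tag}>/<{elem.tag}>")
        # Leaf elements also contribute an element-value term.
        if len(elem) == 0 and elem.text and elem.text.strip():
            terms.add(f"<{elem.tag}>{elem.text.strip()}")
        for child in elem:
            walk(child, elem.tag)

    walk(root, None)
    return terms


if __name__ == "__main__":
    for term in sorted(terms_of(ARTICLE)):
        print(term)
```

Nothing in `terms_of` knows about articles or authors in advance, so a document with a completely different shape would be handled by the same code.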
Degrees Of Flexibility
[Chart: Structure (Ad hoc vs. Predefined) plotted against Queries (Ad hoc vs. Predefined), positioning IMS and IDMS, Relational Databases, Search Engines, and MarkLogic Server (XML Server)]
The Query Language
XML, Universal Index, XQuery
Full-Text Search, XML Structure, XML Semantics
Application Logic, Manipulate XML, Render Results
Load As Is
The Programming Language
XML, Universal Index, XQuery
Full-Text Search, XML Structure, XML Semantics
Application Logic, Manipulate XML, Render Results
Load As Is
A Different Approach
Soul of a Search Engine: Data Model And Queries
Database: On-disk Organization And Transactions
What’s In A Database?
No tables
No rows
Forests . . .
. . . of trees
[Diagram: a Database made up of Forest 1, Forest 2, and Forest 3]
The Cluster
[Diagram: hosts e1, e2, … ek and hosts d1, d2, d3, … dl, with forests Forest 1, Forest 2, Forest 3, Forest 4, … Forest m]
What About Updates?
Typical XML document:
10KB – 1MB
Referenced by 1,000s to 10,000s of term lists
Search engines are bad at updates
Many indexes to update
Option: Index and Information out of sync
Option: Slow
We want
High throughput
Transactions (ACID)
So how do we avoid updates?
Solution: Temporal Database
No update! No delete!
Only insert and read-at-a-time
Every document has two timestamps
“created”, “expired”
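The two-timestamp idea can be sketched in a few lines of Python. This is a toy model under assumptions: a single in-memory list, an integer clock starting at 520 (borrowed from the timeline slide), and invented class and method names. Content is never rewritten; an update appends a new version and stamps the old one as expired.

```python
import math
from dataclasses import dataclass


@dataclass
class Version:
    uri: str
    content: str
    created: int
    expired: float = math.inf  # infinity means "still live"


class TemporalStore:
    def __init__(self, start=520):
        self.clock = start
        self.versions = []

    def _tick(self):
        self.clock += 1
        return self.clock

    def _expire(self, uri, now):
        # Only a timestamp changes; the stored content is left untouched.
        for version in self.versions:
            if version.uri == uri and version.expired == math.inf:
                version.expired = now

    def insert(self, uri, content):
        """Create or update: expire any live version, then append a new one."""
        now = self._tick()
        self._expire(uri, now)
        self.versions.append(Version(uri, content, now))
        return now

    def delete(self, uri):
        """Delete is just an expiry with no replacement version."""
        now = self._tick()
        self._expire(uri, now)
        return now

    def read(self, uri, at):
        """Return the content visible at timestamp `at`, or None."""
        for version in self.versions:
            if version.uri == uri and version.created <= at < version.expired:
                return version.content
        return None


if __name__ == "__main__":
    store = TemporalStore()
    t1 = store.insert("a.xml", "v1")
    t2 = store.insert("a.xml", "v2")
    print(store.read("a.xml", t1), store.read("a.xml", t2))
```

A reader that fixes its timestamp up front sees a stable snapshot even while later inserts and expiries happen, which is what makes "read-at-a-time" possible.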
Temporal Database
[Timeline, timestamps 520 to 528: Create a.xml; Create b.xml; Update a.xml; Update a.xml; Delete b.xml; ... with two Query points marked]
The Cluster
[Diagram: hosts e1, e2, … ek and hosts d1, d2, d3, … dl, with forests Forest 1, Forest 2, Forest 3, Forest 4, … Forest m]
A Single Forest
[Diagram: a Host holding Forest k: Buffers plus Stand 1, Stand 2, … Stand n]
1. Create A New Tree
[Diagram: a Host holding Forest k: Buffers plus Stand 1, Stand 2, … Stand n]
2. Expire Trees
[Diagram: a Host holding Forest k: Buffers plus Stand 1, Stand 2, … Stand n]
3. Save A Buffer To Disk
[Diagram: a Host holding Forest k: Buffers plus Stand 1, Stand 2, … Stand n]
4. Optimization: Merge Stands
[Diagram: a Host holding Forest k with a Buffer]
The Four Forest Operations
1. Create a new document
• Into a buffer
2. Mark a document as expired
• Memory-mapped document timestamps per stand
3. Write buffer out to disk
• Our buffers are 100s of megabytes
• For performance, double buffer
4. Merge
• Background process
• Optimization: reduces number of stands in forest
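The four operations can be mimicked with a toy Python class. Everything here is a simplification under stated assumptions: stands are plain dictionaries rather than files, a set of ids stands in for the per-stand expiry timestamps, there is a single buffer instead of a double buffer, and dropping expired documents during merge is an assumption of this sketch rather than something the slide states. Each document id is taken to name one version.

```python
class Forest:
    def __init__(self, buffer_limit=2):
        self.buffer_limit = buffer_limit
        self.buffer = {}      # in memory: doc id -> content
        self.stands = []      # stand-ins for on-disk stands, never edited
        self.expired = set()  # stand-in for per-stand expiry timestamps

    def create(self, doc_id, content):
        """1. New documents always go into the buffer."""
        self.buffer[doc_id] = content
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def expire(self, doc_id):
        """2. Expiring only records a mark; no stand is rewritten."""
        self.expired.add(doc_id)

    def flush(self):
        """3. Write the buffer out as a new stand and start a fresh buffer."""
        if self.buffer:
            self.stands.append(dict(self.buffer))
            self.buffer = {}

    def merge(self):
        """4. Combine all stands into one (assumed to drop expired docs)."""
        merged = {}
        for stand in self.stands:
            for doc_id, content in stand.items():
                if doc_id not in self.expired:
                    merged[doc_id] = content
        self.stands = [merged] if merged else []
        self.expired &= set(self.buffer)  # marks for dropped docs are done

    def live(self):
        """Ids of all documents that have not been expired."""
        ids = set(self.buffer)
        for stand in self.stands:
            ids.update(stand)
        return sorted(ids - self.expired)


if __name__ == "__main__":
    forest = Forest(buffer_limit=2)
    for doc_id in range(1, 6):
        forest.create(doc_id, f"doc{doc_id}")
    forest.expire(2)
    print(len(forest.stands), forest.live())
    forest.merge()
    print(len(forest.stands), forest.live())
```

Merging changes how many stands a query has to consult but not the set of live documents, which is why it can run in the background as an optimization.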
Consistency And Throughput
2-phase commit
Transactions span forests
Recovery
Forest Journals
Lock-free queries
Use the search engine at a point-in-time
Increased throughput
Time travel?
A Different Approach
Soul of a Search Engine: Data Model And Queries
Database: On-disk Organization And Transactions
Summary
XML as data model
Ad hoc schema
A search engine core
Universal Index
Temporal transaction model
High throughput while keeping . . .
Performance and scalability of a search engine
Mary Holstege
Principal Engineer
t: 650.655.2336
f: 650.655.2310
Thank You