Post on 29-Dec-2015
HKU CSIS DB Seminar: Efficient Filtering of XML Documents for Selective Dissemination of Information
Mehmet Altinel, Michael J. Franklin, VLDB 2000
Speaker: Eric Lo
Introduction
The increasing volume of data available in electronic form and the proliferation of the Internet have accelerated the development of SDI (Selective Dissemination of Information).
The goal of selective dissemination of information is to avoid sending users/subscribers unnecessary information.
SDI applications:
- receive or collect new data in a timely manner, such as stock quotes, traffic news, sports tickers and music
- filter the data against subscriber profiles
- deliver relevant data to interested subscribers
Introduction
Current SDI systems:
- are based on simple keyword matching and typical IR techniques
- e.g. a subscriber profile with the keyword "NBA" matches every news item that contains the keyword "NBA"
HOWEVER, they still suffer from typical problems:
- Subscribers also receive irrelevant information, such as news with the headline "Bill Gates loves to watch NBA"
- Although current systems pay a lot of attention to improving effectiveness, they neglect EFFICIENCY!
Introduction
One use of XML is as a standard information exchange mechanism.
XML allows structural information to be encoded within documents, so more focused and accurate profiles of user interests can be created.
"XFilter", proposed in this paper, addresses the concerns mentioned above.
XML-based SDI Architecture
Subscribers specify their profiles through a GUI.
The underlying profile language is XPath.
E.g. /sports/nba//news
XFilter Architecture
4 major components:
1. Event-based parser for XML documents
2. XPath parser for user profiles
3. Filter engine, which matches profiles against XML documents
4. Dissemination engine, which delivers the filtered data
In general, how does the system work?
New_incoming_document.xml:
<sports> <nba> <chicago>…</chicago> </nba> </sports>
3 subscribers:
Q1: /sports/nba//news    [Q1-1] [Q1-2] [Q1-3]
Q2: //nba/*/news         [Q2-1] [Q2-2]
Q3: /stocks/quotes/PCCW  [Q3-1] [Q3-2] [Q3-3]
Element   Candidate List   Wait List
sports    Q1-1
nba       Q2-1             Q1-2
news                       Q1-3, Q2-2
stocks    Q3-1
quotes                     Q3-2
PCCW                       Q3-3

Q1-1, Q1-2
Filter Engine of XFilter
XFilter converts each XPath query into a Finite State Machine (FSM).
A subscriber's XPath query (profile) MATCHES an XML document WHEN the FSM of the query reaches its final state.
A Query Index is built over the states of the FSMs of the XPath queries.
Inside Filter Engine
Path Nodes
The XPath parser decomposes each XPath query into a set of path nodes.
Element nodes (not attributes) become path nodes and act as the states of the FSM.
Wildcard (*) nodes are ignored.
E.g. /sports/nba//news gives the path nodes: sports, nba, news
Path Nodes Information
Each path node stores:
- Query ID
- Position
- Relative Position:
  = 0 for the 1st node, if the 1st node is not preceded by "//"
  = -1 for any node preceded by "//"
  otherwise = 1 + (number of "*" nodes between the node and its predecessor node)
- Level:
  if the node is the 1st node and has an absolute distance from the root, then Level = 1 + distance from the root
  if the Relative Position is -1, the Level is also -1; otherwise it is 0
Q1 = /sports/nba//news

                    Q1-1   Q1-2   Q1-3
Query ID            Q1     Q1     Q1
Position            1      2      3
Relative Position   0      1      -1
Level               1      0      -1

Q2 = //nba/*/news/Bulls

                    Q2-1   Q2-2   Q2-3
Query ID            Q2     Q2     Q2
Position            1      2      3
Relative Position   -1     2      1
Level               -1     0      0
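The rules for Relative Position and Level can be sketched in a few lines of Python. This is a minimal illustration, not the XFilter implementation: the names `PathNode` and `decompose` are hypothetical, only element steps separated by "/" and "//" are handled (no attributes or predicates), and a "//" followed by wildcards is treated simply as "//".

```python
import re
from dataclasses import dataclass


@dataclass
class PathNode:
    query_id: str
    position: int      # 1-based order of the node within its query
    relative_pos: int  # -1 means "any distance from the previous node"
    level: int         # -1 means "any level", 0 means "not known yet"
    element: str


def decompose(query_id, xpath):
    """Split a simple XPath into path nodes, skipping "*" steps."""
    # "//nba/*/news" -> [("//", "nba"), ("/", "*"), ("/", "news")]
    steps = re.findall(r"(//|/)([^/]+)", xpath.replace(" ", ""))
    nodes = []
    wildcards = 0       # "*" steps seen since the previous path node
    descendant = False  # a "//" seen since the previous path node
    for depth, (sep, name) in enumerate(steps):
        if sep == "//":
            descendant = True
        if name == "*":
            wildcards += 1
            continue
        if descendant:
            rel, level = -1, -1
        elif not nodes:
            # First node at a fixed distance (depth) from the root.
            rel, level = 0, 1 + depth
        else:
            rel, level = 1 + wildcards, 0
        nodes.append(PathNode(query_id, len(nodes) + 1, rel, level, name))
        wildcards, descendant = 0, False
    return nodes


for node in decompose("Q2", "//nba/*/news/Bulls"):
    print(node)
```

Running it on Q1 and Q2 gives the same Position, Relative Position and Level values as the two tables above.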
Query Index
All path nodes are added to the Query Index (a hash table keyed on element names).
Each unique element name is associated with two lists: a Candidate List (CL) and a Wait List (WL).
The current node of each query is placed on the CL; the other nodes are placed on the WL.
The FSM moves to its next state when a path node is promoted from the WL to the CL.
Element   Candidate List   Wait List
sports    Q1-1
nba       Q2-1             Q1-2
news                       Q1-3, Q2-2
stocks    Q3-1
quotes                     Q3-2
PCCW                       Q3-3
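As a rough sketch of the initial state of the Query Index, the following Python builds the hash table for the three example queries. The function name `build_query_index` and the dictionary layout are assumptions made for illustration; each query is given simply as its ordered list of element names, with wildcards already dropped.

```python
from collections import defaultdict


def build_query_index(queries):
    """queries maps a query id to its ordered list of element names."""
    # One entry per element name, each with a Candidate List and a Wait List.
    index = defaultdict(lambda: {"CL": [], "WL": []})
    for qid, elements in queries.items():
        for pos, name in enumerate(elements, start=1):
            # Only the first node of each query starts on the Candidate List.
            target = "CL" if pos == 1 else "WL"
            index[name][target].append(f"{qid}-{pos}")
    return index


index = build_query_index({
    "Q1": ["sports", "nba", "news"],      # /sports/nba//news
    "Q2": ["nba", "news"],                # //nba/*/news
    "Q3": ["stocks", "quotes", "PCCW"],   # /stocks/quotes/PCCW
})
for element, lists in index.items():
    print(element, lists)
```

A dictionary keyed on element name keeps the lookup in the start element handler to a single hash probe per parsed tag, which is the point of indexing the queries rather than scanning them.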
XML Parsing and Filtering
When an XML document arrives, it is run through the event-driven SAX XML parser, and the Query Index is checked whenever the parser encounters:
- a begin element tag
- an end element tag
- data internal to an element
Input XML:
<?xml version="1.0"?><sports><news><ball games><nba>Michael Jordan … </nba></ball games></news></sports>
SAX API events:
Start document
Start element: sports
Start element: news
Start element: ball games
Start element: nba
Characters: Michael Jordan
End element: nba
…
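The event stream above can be reproduced with Python's standard `xml.sax` module. This is only an illustration of event-driven parsing: the class name `EventLogger` is hypothetical, and the element is written as `ballgames` here because XML element names cannot contain spaces.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records SAX events and tracks the current element level."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.level = 0  # depth of the element currently being parsed

    def startDocument(self):
        self.events.append("Start document")

    def startElement(self, name, attrs):
        self.level += 1
        self.events.append(f"Start element: {name} (level {self.level})")

    def characters(self, content):
        # Ignore whitespace-only chunks between tags.
        if content.strip():
            self.events.append(f"Characters: {content.strip()}")

    def endElement(self, name):
        self.events.append(f"End element: {name}")
        self.level -= 1


doc = (b"<sports><news><ballgames><nba>Michael Jordan</nba>"
       b"</ballgames></news></sports>")
handler = EventLogger()
xml.sax.parseString(doc, handler)
print("\n".join(handler.events))
```

The level counter is what a filter needs for the level check: the parser itself only reports tags, so the handler has to track the depth.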
XML Parsing and Filtering (cont)
Start_Element_Handler(element_name, element_level, attribute_names, attribute_values) {
    Look up the element name in the Query Index, examine all nodes in its CL,
    and perform the LEVEL CHECK and the ATTRIBUTE FILTER CHECK on each of them
}
Path node Q1-1: Query ID = Q1, Position = 1, Relative Position = 0, Level = 1
Level Check and Attribute Check
The level check ensures that the element appears in the document at the level expected by the user query.
Recall:
- if the level of a path node is -1, then its relative position is -1, i.e. a "//" precedes this node, so its level is unrestricted
- otherwise the level of the path node must equal the level of the input element
The attribute filter check applies any simple predicates that reference the attributes of the element.
Level Check and Attribute Check
If both the level check and the attribute check succeed, the node passes.
If the node is the final path node (final state) of the query (e.g. Q1-3), the document matches the query; otherwise the query moves to its next state.
The state move is done by copying the next node of the query from the WL to the CL and updating its relative position and level accordingly.
End element handler and character handler
When the SAX parser encounters an end element tag, the path node of that element is deleted from the CL.
When the SAX parser encounters element data, the handler works like the start element handler, except that it performs a content check rather than an attribute check.
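The handlers described above can be put together in a small sketch. It rests on several simplifying assumptions that are not taken from the paper: attribute and content checks are left out, the Wait List is modelled implicitly as the not-yet-reached nodes of each query, a promoted node's level is computed as the current level plus its relative position, and the end element handler removes exactly the nodes that were promoted while that element was open. `FilterEngine` and `filter_document` are hypothetical names.

```python
import xml.sax
from collections import defaultdict


class FilterEngine:
    def __init__(self, queries):
        # queries: {query id: [(element, relative_pos, level), ...]}
        self.queries = queries
        self.cl = defaultdict(list)  # element -> [(qid, node idx, level)]
        self.matched = set()
        self.undo = []  # one list of promoted entries per open element
        for qid, nodes in queries.items():
            name, _, level = nodes[0]
            self.cl[name].append((qid, 0, level))

    def start_element(self, name, doc_level):
        promoted = []
        # Iterate over a snapshot so nodes promoted now are not re-examined.
        for qid, idx, level in list(self.cl[name]):
            if level != -1 and level != doc_level:
                continue  # level check failed
            nodes = self.queries[qid]
            if idx == len(nodes) - 1:
                self.matched.add(qid)  # final state reached
                continue
            nxt, rel, _ = nodes[idx + 1]
            nxt_level = -1 if rel == -1 else doc_level + rel
            entry = (qid, idx + 1, nxt_level)
            self.cl[nxt].append(entry)  # state move: next node joins the CL
            promoted.append((nxt, entry))
        self.undo.append(promoted)

    def end_element(self, name):
        # Withdraw the nodes promoted while this element was open.
        for element, entry in self.undo.pop():
            self.cl[element].remove(entry)


class _Driver(xml.sax.ContentHandler):
    def __init__(self, engine):
        super().__init__()
        self.engine, self.level = engine, 0

    def startElement(self, name, attrs):
        self.level += 1
        self.engine.start_element(name, self.level)

    def endElement(self, name):
        self.engine.end_element(name)
        self.level -= 1


def filter_document(xml_bytes, queries):
    engine = FilterEngine(queries)
    xml.sax.parseString(xml_bytes, _Driver(engine))
    return engine.matched


QUERIES = {
    "Q1": [("sports", 0, 1), ("nba", 1, 0), ("news", -1, -1)],  # /sports/nba//news
    "Q2": [("nba", -1, -1), ("news", 2, 0)],                    # //nba/*/news
    "Q3": [("stocks", 0, 1), ("quotes", 1, 0), ("PCCW", 1, 0)], # /stocks/quotes/PCCW
}
doc = b"<sports><nba><chicago><news/></chicago></nba></sports>"
print(sorted(filter_document(doc, QUERIES)))
```

In this sample document "news" sits two levels below "nba", so both Q1 and Q2 reach their final state; if "news" were a direct child of "nba", the level check would reject Q2 while Q1 would still match.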
List Balancing
Recall: the first path node of each XPath query is placed on the CL and the remaining path nodes are placed on the WL.
This is inefficient in many situations, because the 1st element usually has poor selectivity.
Some CLs become long while others stay short, so the lists are unbalanced (e.g. the CL of element "news" is usually much longer than the CL of element "NBA").
List Balancing
List balancing introduces a "pivot" node.
When a new query is added to the index, the element node of the query whose index entry has the shortest CL is chosen as the pivot and is placed on the CL (instead of the 1st node).
E.g. when a new subscriber adds /sports/worldcup//news, if the CL of "worldcup" is the shortest compared with those of "sports" and "news", then "worldcup" is the pivot and is added to the CL.
The prefix "sports" then becomes a precondition and is held on a stack; filtering of the query stops if the precondition for the node fails.
List Balancing
Q3 = /*/sports/news//bulls

                    Q1-1   Q1-2   Q1-3
Query ID            Q3     Q3     Q3
Position            1      2      3
Relative Position   0      1      -1
Level               1      0      -1

Assume the element "news" has the shortest CL among the 3 elements:

                    Q1-1   Q1-2
Query ID            Q3     Q3
Position            1      2
Relative Position   0      -1
Level               1      -1

Stack: "sport"
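Pivot selection can be sketched as picking the element whose Candidate List is currently the shortest. The function names `choose_pivot` and `add_query`, the tie-breaking rule (earliest element wins) and the list lengths in the example are all assumptions made for illustration.

```python
def choose_pivot(elements, cl_lengths):
    """Split a query's elements into (prefix, pivot, suffix).

    elements:   ordered element names of the query (wildcards dropped)
    cl_lengths: current Candidate List length per element name
    """
    # min() returns the first minimum, so ties go to the earliest element.
    pivot_idx = min(range(len(elements)),
                    key=lambda i: cl_lengths.get(elements[i], 0))
    return elements[:pivot_idx], elements[pivot_idx], elements[pivot_idx + 1:]


def add_query(elements, cl_lengths):
    """Choose the pivot and account for it on its Candidate List."""
    prefix, pivot, suffix = choose_pivot(elements, cl_lengths)
    cl_lengths[pivot] = cl_lengths.get(pivot, 0) + 1
    return prefix, pivot, suffix


# Hypothetical Candidate List lengths before the new query is added.
lengths = {"sports": 120, "worldcup": 3, "news": 300}
prefix, pivot, suffix = add_query(["sports", "worldcup", "news"], lengths)
print(prefix, pivot, suffix)
```

The prefix returned here corresponds to the precondition that is kept on the stack, and the suffix to the nodes that remain on the Wait List.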
Prefiltering
Prefiltering eliminates from consideration any query that contains an element name not present in the input document, to avoid unnecessary work.
It is done before the order and filter checking (thus every incoming XML document is parsed twice).
Prefiltering
A "key" element is chosen for each query when the query is initially parsed.
The key is chosen in the same way as the pivot in List Balancing.
When a document arrives, a hash table (called the occurrence table) is constructed, containing entries of the form <element name, QueryID1, …, QueryIDn>.
The queries referenced by the table are checked to see whether all of their element names exist in the document; only the successful queries go further.
Prefiltering
Assume the key of each query is shown in blue:
Q1: /sports/nba//news/scores
Q2: /sports/NHL//news
Q3: /sports/nba/Bulls//news
Q4: /sports//Bulls/ranking

Sports18012002.xml:
<sports><nba> <Lakers> <news>O' Neal…</news> </Lakers> <Bulls> <news>Bulls beat Lakers</news> </Bulls></nba></sports>

Occurrence Table:
Element   Queries
sports
nba       Q1
Lakers
news
Bulls     Q3, Q4

Do all elements in the queries exist in the document? Only for Q3.
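The prefiltering pass over this example can be sketched as follows. The function name `prefilter` and the data layout are assumptions; the key elements (nba for Q1, NHL for Q2, Bulls for Q3 and Q4) are inferred from the occurrence table above, since the colour highlighting is not available here.

```python
import xml.sax


class _ElementNames(xml.sax.ContentHandler):
    """First parsing pass: collect the element names in the document."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def startElement(self, name, attrs):
        self.names.add(name)


def prefilter(xml_bytes, queries, keys):
    """queries: {qid: element names}; keys: {qid: key element}."""
    handler = _ElementNames()
    xml.sax.parseString(xml_bytes, handler)
    # Occurrence table: document element -> queries keyed on that element.
    occurrence = {name: [] for name in handler.names}
    for qid, key in keys.items():
        if key in occurrence:
            occurrence[key].append(qid)
    # Keep only referenced queries whose elements all occur in the document.
    survivors = {qid
                 for qids in occurrence.values()
                 for qid in qids
                 if set(queries[qid]) <= handler.names}
    return occurrence, survivors


queries = {
    "Q1": ["sports", "nba", "news", "scores"],
    "Q2": ["sports", "NHL", "news"],
    "Q3": ["sports", "nba", "Bulls", "news"],
    "Q4": ["sports", "Bulls", "ranking"],
}
keys = {"Q1": "nba", "Q2": "NHL", "Q3": "Bulls", "Q4": "Bulls"}
doc = (b"<sports><nba><Lakers><news>O'Neal</news></Lakers>"
       b"<Bulls><news>Bulls beat Lakers</news></Bulls></nba></sports>")
occurrence, survivors = prefilter(doc, queries, keys)
print(occurrence, survivors)
```

Q2 never enters the occurrence table because its key does not occur in the document, so its other element names are not even inspected; Q1 and Q4 are referenced but dropped because "scores" and "ranking" are missing.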
Performance evaluation
Performance is evaluated by varying:
- the number of subscriber profiles
- the depth of subscriber queries and incoming XML documents
- the probability of wildcards
- filter placement and selectivity
List Balancing with Prefiltering has the best performance.
Related Work
- Enhance XFilter by considering not only elements but also attributes
- Enhance XFilter by reordering the input profiles (XPath queries of subscribers) when building the index, so as to obtain better-balanced Candidate Lists
Refer to "Indexing Attributes and Reordering Profiles for XML Document Filtering and Information Delivery" by Wang Lian, David Cheung and S.M. Yiu, WAIM 2001
End