streamit: dynamic visualization and interactive exploration of text streams
Post on 22-Feb-2016
STREAMIT: Dynamic Visualization and Interactive Exploration of Text Streams
Jamal Alsakran, Kent State University, Ohio
Yang Chen, University of North Carolina - Charlotte
Ye Zhao (Presenter), Kent State University, Ohio
Jing Yang, University of North Carolina - Charlotte
Dongning Luo, University of North Carolina - Charlotte
Text Stream
Textual Data Explosion
Emails, news, messages, broadcasts,
Daily, hourly, minutely
Urgent need for efficient processing and analysis
Visualization is an effective approach
Text stream
Text collections constantly evolve with continuously new incoming documents
Keywords/topics not known in advance
Challenges to Visual Exploration
Temporal evolution
Existing topics
Emerging topics
Their relations
Clusters and Outliers
No collection pre-scanning or presumed a priori knowledge
Live processing required
In contrast to traditional text database
Flexible user interaction for changing and adjusting information-seeking focus/preference
Process large volumes of texts in real time
STREAMIT System
Dynamic force-directed simulation
Naturally handle continuously inserted documents
Continual evolution
Continuous depiction and analysis of growing document collections
Automatic grouping and separating
No time window used
No abrupt change
Dynamic processing
Keyword vectors dynamically updated
No prerecorded scan
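A minimal sketch of how keyword statistics might be maintained on the fly, with no pre-scan of the collection. The class name `KeywordTable` and its fields are hypothetical and are not taken from the STREAMIT implementation.

```python
from collections import Counter

class KeywordTable:
    """Hypothetical streaming keyword table, updated one document at a time."""

    def __init__(self):
        self.occurrences = Counter()  # total occurrences of each keyword
        self.doc_count = Counter()    # number of documents containing each keyword
        self.first_seen = {}          # timestamp of first appearance
        self.last_seen = {}           # timestamp of most recent appearance

    def add_document(self, keywords, timestamp):
        # Update per-occurrence statistics for every keyword in the new document.
        for k in keywords:
            self.occurrences[k] += 1
            self.first_seen.setdefault(k, timestamp)
            self.last_seen[k] = timestamp
        # Count each distinct keyword once per document.
        for k in set(keywords):
            self.doc_count[k] += 1
```

Each incoming document touches only its own keywords, so the cost of an update does not depend on how many documents are already in the system.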
STREAMIT System (continued)
Interactive exploration
Live adjustment of visualization parameters
Dynamic keyword importance
Present the significance of a keyword at a certain time
Reflect changing user demand and interest
Scalable optimization
Fast computing
GPU acceleration
Animation and interaction
Easy user control and interaction tools
Related Work
Multidimensional scaling (MDS) and projection:
IN-SPIRE 99, InfoSky 02, Hipp 08, Exemplar-based 09
Temporal data trends
ThemeRiver 02, LensRiver 07, T-scroll 07, Meme-tracking 09, Themail 06, Topic-based 09
Text streams
TextPool 05, moving time window (Wong 03), EventRiver 10, Text pipe 05
Force-based placement
Graph drawing 91, Chalmers96, Morrison02, etc.
System Overview
Potential and Similarity
Potential energy between pairs of document particles
The potential is scaled by a control parameter
l_i and l_j are the locations of particles i and j
l_ij is the ideal distance between them
Ideal distance computed from document similarity
Cosine similarity
Larger similarity leads to a smaller ideal distance, moving similar documents closer together to form clusters
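A minimal sketch of the similarity-to-distance step. Documents are assumed to be sparse keyword vectors (dict of keyword to weight). The linear mapping `1 - similarity` and the name `max_dist` are assumptions for illustration; the slides do not state the exact mapping.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse keyword vectors (dict: keyword -> weight)."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def ideal_distance(a, b, max_dist=1.0):
    # Assumed mapping: the ideal distance shrinks linearly as similarity grows.
    return max_dist * (1.0 - cosine_similarity(a, b))

doc1 = {"obama": 2, "election": 1}
doc2 = {"obama": 1, "election": 1, "senate": 1}
doc3 = {"sensor": 3, "network": 1}
```

With these toy vectors, `doc1` and `doc2` share keywords and get a smaller ideal distance than `doc1` and `doc3`, which share none.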
Force-directed Model
Global potential function
Forces computed from minimization
Attract or repulse document particles
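A minimal sketch of a force-directed step, assuming a spring-style pair potential `alpha * (|l_i - l_j| - l_ij)^2` summed over pairs. This form, the parameter name `alpha`, and the step size `dt` are assumptions consistent with the definitions above, not the exact formula from the slides.

```python
import math

def pair_potential(li, lj, lij, alpha=1.0):
    """Assumed spring-style potential between two particles."""
    return alpha * (math.dist(li, lj) - lij) ** 2

def forces(positions, ideal, alpha=1.0):
    """Force on each particle: negative gradient of the summed pair potentials."""
    n = len(positions)
    out = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            d = math.hypot(dx, dy)
            if d == 0.0:
                continue  # coincident particles: direction undefined, skip
            # d > ideal gives attraction toward j; d < ideal gives repulsion.
            mag = 2.0 * alpha * (d - ideal[i][j])
            out[i][0] += mag * dx / d
            out[i][1] += mag * dy / d
    return out

def step(positions, ideal, alpha=1.0, dt=0.05):
    """One gradient-descent step that moves particles along their forces."""
    f = forces(positions, ideal, alpha)
    return [(p[0] + dt * fx, p[1] + dt * fy) for p, (fx, fy) in zip(positions, f)]
```

Repeating `step` moves each pair toward its ideal distance. A newly inserted document only adds rows to `positions` and `ideal`, so the layout keeps evolving without a restart.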
Dynamic Keyword Importance
Cosine similarity can be improved by introducing importance
Importance I_k can be freely modified by users at any time
According to interest/preference
According to discovered knowledge from prior period
A powerful tool for users to manipulate layout and analyze data
Importance may also be assigned by an automatic scheme
E.g., for keyword k, computed from:
O_k: number of occurrences
t_k^e: last time it appears; t_k^s: first time it appears
n_k: the number of documents that contain the keyword
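A minimal sketch of how a per-keyword importance I_k could enter the similarity. Scaling each keyword weight by I_k before taking the cosine is an assumption for illustration; the transcript does not preserve the exact formula, and the automatic importance scheme is not reproduced here.

```python
import math

def weighted_cosine(a, b, importance):
    """Cosine similarity with each keyword weight scaled by its importance I_k.

    Keywords missing from `importance` default to 1.0 (assumed neutral value).
    """
    def scaled(v):
        return {k: w * importance.get(k, 1.0) for k, w in v.items()}

    a, b = scaled(a), scaled(b)
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc_a = {"terrorism": 1, "politics": 1}
doc_b = {"terrorism": 1, "economy": 1}

before = weighted_cosine(doc_a, doc_b, {})                 # all importances 1.0
after = weighted_cosine(doc_a, doc_b, {"terrorism": 3.0})  # user raises one keyword
```

Raising the importance of a shared keyword increases the similarity of the documents that contain it, so under this assumption they are pulled closer together in the layout.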
Visualization Interface
Visualization Tools
Main window
Major layout
Animation Control Panel
Play, pause, stop
Drag by mouse
Keyword table
Dynamic update
Change importance
Document table
Text information
Labeling
Use text document titles
Reduce clutter
Recent semantic titles
User controlled clutter levels
Group title label
Use color and opacity to display clear layout
User Interaction
Adjusting Keyword Importance
Grouping and Tracking Documents
Halo for topics of interest
Browsing and Tracking Keywords
Selection
Manual, example-based, keyword-based
Integrated shoebox for details
Case Study: New York Times News
Total number of articles: 230
Time period: Jul. 19 to Sep. 18, 2010
About Barack Obama
Articles continuously injected, new keywords added to the keyword table, and their frequencies are updated on-the-fly
Keyword importance automatically assigned
Case Study: New York Times News
136 news articles
High frequency keywords:
Politics and Government, International Relations, Terrorism
Increase the importance of International Relations
Highlight the group with Afghanistan War in pink halo (2)
Terrorism in orange halo (3)
All documents are shown
The Terrorism group becomes larger, and one item (an outlier) lies between Afghanistan War and Terrorism
Case Study: US NSF Award Abstracts
1000 National Science Foundation (NSF) IIS award abstracts
Funded between Mar. 2000 and Aug. 2003
Each document characterized by a set of keywords
Size of a document circle represents funding amount
Case Study: US NSF Award Abstracts
Aug. 1, 2000: 95 projects
Sep. 1, 2000: 172 projects
Many large projects started
Highlight Management in red and Database in green
Increase their importance
Mar. 15, 2002: 672 projects
Many large projects started
Highlight Sensor with halo
(2) is an outlier far away from the other projects with halo
It is about just-in-time information retrieval on wearable computers
Case Study: Video on NSF Dataset
Performance Optimization
Initial positions of document particles affect computational steps and cost
Similarity Grid
New documents roughly inserted within the proximity of similar documents
Each grid cell has a special keyword vector consisting of the average keyword weights from the documents inside the cell
Data set of 7100 documents
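A minimal sketch of the similarity-grid idea: each occupied cell keeps the average keyword vector of its documents, and a new document is placed in the most similar cell. The class and method names, and the unit-square layout, are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse keyword vectors (dict: keyword -> weight)."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SimilarityGrid:
    """Hypothetical grid over a unit-square layout."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.sums = {}    # (row, col) -> summed keyword weights of documents inside
        self.counts = {}  # (row, col) -> number of documents inside

    def cell_of(self, x, y):
        """Map a position in [0, 1] x [0, 1] to a grid cell."""
        return (min(int(y * self.rows), self.rows - 1),
                min(int(x * self.cols), self.cols - 1))

    def add(self, doc, cell):
        sums = self.sums.setdefault(cell, {})
        for k, w in doc.items():
            sums[k] = sums.get(k, 0.0) + w
        self.counts[cell] = self.counts.get(cell, 0) + 1

    def cell_vector(self, cell):
        """Average keyword weights of the documents inside the cell."""
        n = self.counts[cell]
        return {k: w / n for k, w in self.sums[cell].items()}

    def best_cell(self, doc):
        """Occupied cell most similar to doc, or None if the grid is empty."""
        best, best_sim = None, -1.0
        for cell in self.counts:
            sim = cosine(doc, self.cell_vector(cell))
            if sim > best_sim:
                best, best_sim = cell, sim
        return best
```

Comparing a new document against cell averages rather than against every existing document gives it a starting position near similar documents, which should reduce the simulation steps needed to settle.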
Performance Optimization
GPU acceleration
CUDA implementation of the N-body problem
Good performance achieved
NVidia Quadro NVS 295 GPU with 2GB texture memory
Intel Core2 1.8GHz CPU with 2GB RAM
GPU Performance
Experiments with 50 by 50 grid
Achieve good average speed
More importantly, maximum simulation time after document insertion on the GPU was less than a second
Fast for human perception and analysis
Discussion
The system can handle live text streams with document arrival intervals of around 1 second
On a consumer PC and graphics card
E.g., New York Times news averages 3 documents per hour, with a maximum of 8 documents per hour at peak time
A very large number of documents in the system will introduce visual clutter and hinder analysts' ability to absorb the information
Natural perception limit and device limit
Clutter reduction and simplification algorithms needed
Further increase the power
Advanced hardware
Hierarchical or multiple-resolution simulation
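To put the arrival rates quoted in this discussion in perspective, a short arithmetic sketch converts the New York Times rates into average gaps between documents and compares them with the roughly 1-second interval the system is reported to handle. The variable names are illustrative only.

```python
SECONDS_PER_HOUR = 3600

def mean_arrival_interval(docs_per_hour):
    """Average gap between consecutive documents, in seconds."""
    return SECONDS_PER_HOUR / docs_per_hour

average_gap = mean_arrival_interval(3)  # average rate: 3 documents per hour
peak_gap = mean_arrival_interval(8)     # peak rate: 8 documents per hour

# Ratio of the peak-time gap to the ~1 s interval the system can handle.
headroom = peak_gap / 1.0
```

Even at peak time the gap between articles is several hundred seconds, far longer than the simulation time reported after a document insertion.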
Conclusion
STREAMIT: An efficient visual exploration system for live text streams
Dynamic physical system
Keyword manipulation with importance
Visual tools
Acknowledgment:
National Science Foundation IIS-0915528, IIS-0916131 and NSFDACS10P1309.
Thanks!
Questions?