big data & text mining


Upload: michel-bruley

Post on 14-Jul-2015



www.decideo.fr/bruley

Text mining

[email protected]

Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …

Information context

• A large amount of information is available in textual form in databases and online sources

• At this scale, manual analysis and effective extraction of useful information are not possible

• Automatic tools are therefore needed to analyze large textual collections

Text mining definition

The objective of text mining is to exploit information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc.

The results can be important both for:
• the analysis of the collection, and
• providing intelligent navigation and browsing methods

Text mining pipeline

[Diagram: information extraction turns unstructured text (implicit knowledge) into structured content (explicit knowledge) with semantic metadata, which supports information retrieval (semantic search) and knowledge discovery (data mining).]

Text mining process

• Text preprocessing
– Syntactic/semantic text analysis

• Features generation
– Bag of words

• Features selection
– Simple counting
– Statistics

• Text/data mining
– Classification (supervised learning)
– Clustering (unsupervised learning)

• Analyzing results
– Mapping/visualization
– Result interpretation

Iterative and interactive process
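The feature generation and feature selection steps can be sketched as follows. This is a minimal illustration, not the method of any tool named in the slides: the tokenizer, the `min_docs` threshold, and the sample documents are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(documents):
    """Features generation: one Counter of term frequencies per document."""
    return [Counter(tokenize(doc)) for doc in documents]

def select_features(bags, min_docs=2):
    """Features selection by simple counting: keep terms that occur
    in at least `min_docs` documents (an assumed, illustrative rule)."""
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)

def vectorize(bags, vocabulary):
    """Turn each bag into a fixed-length count vector over the vocabulary."""
    return [[bag[term] for term in vocabulary] for bag in bags]

# Hypothetical sample collection.
docs = [
    "Text mining finds patterns in text.",
    "Data mining finds patterns in text and tables.",
    "Clustering groups similar documents.",
]
bags = bag_of_words(docs)
vocab = select_features(bags, min_docs=2)
vectors = vectorize(bags, vocab)
print(vocab)
print(vectors)
```

The resulting count vectors are the kind of structured input that a classification or clustering step would consume.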

Text mining actors

• Publishers
– Enriched content
– Annotation tools
– Tools for authors
– New applications based on annotation layers
– Richer cross linking based on content
– …

• Analysts
– Empowers them
– Annotating research output
– Hypothesis generation
– Summarisation of findings
– Focused semantic search
– …

• Libraries
– Linking between institutional repositories
– Access to richer metadata
– Aggregation
– Aids to subject analysis/classification
– …

Challenges in text mining

• The data collected is “free text” and is not well organized (semi-structured or unstructured)

• No uniform access across sources: each source (e.g., email, databases, applications, the web) has its own storage and algebra

• A fivefold heterogeneity: semantic, linguistic, structural, format, and size of the information unit

• Learning techniques for processing text typically need annotated training data

• XML as the common model allows:
– Manipulating data with standards
– Mining that is closer to conventional data mining
– RDF is emerging as a complementary model

• The more structure you can explore, the better you can mine

Data source administration

[Diagram: content from the intranet, the internet, on-line databanks, and information providers (file systems, databases, EDMS) is gathered by web crawling, passed through a format filter, and normalised to XML with subject, author, text corpora, and keywords.]
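As an illustration of the XML normalisation step, the sketch below wraps one source document in a common XML structure using Python's standard `xml.etree.ElementTree` module. The element names (`document`, `subject`, `author`, `keywords`, `keyword`, `text`), the `source` attribute, and the sample record are hypothetical; the slides name the fields but do not define a schema.

```python
import xml.etree.ElementTree as ET

def normalise(record):
    """Build a <document> element from a dict describing one source document.
    The element and attribute names are assumptions made for this sketch."""
    doc = ET.Element("document", source=record["source"])
    ET.SubElement(doc, "subject").text = record["subject"]
    ET.SubElement(doc, "author").text = record["author"]
    keywords = ET.SubElement(doc, "keywords")
    for word in record["keywords"]:
        ET.SubElement(keywords, "keyword").text = word
    ET.SubElement(doc, "text").text = record["text"]
    return doc

# Hypothetical record, as it might come out of a crawler and format filter.
record = {
    "source": "intranet",
    "subject": "Quarterly report",
    "author": "J. Doe",
    "keywords": ["sales", "forecast"],
    "text": "Sales grew in the third quarter.",
}
element = normalise(record)
xml_string = ET.tostring(element, encoding="unicode")
print(xml_string)
```

Once every source is expressed in the same structure, downstream mining steps can address fields such as author or keywords uniformly, whatever the original format was.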

Text mining tasks

• Text analysis tools
– Feature extraction: name extraction, term extraction, abbreviation extraction, relationship extraction
– Categorization
– Summarization
– Clustering: hierarchical clustering, binary relational clustering

• Web searching tools
– Text search engine
– NetQuestion Solution
– Web crawler

Information extraction

Extract domain-specific information from natural language text

– Need a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”)
• Constructed by hand
• Automatically learned from hand-annotated training data

– Need a semantic lexicon (dictionary of words with semantic category labels)
• Typically constructed by hand
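A hand-built pattern dictionary and semantic lexicon might look like the sketch below. It is only an illustration under stated assumptions: the regular expressions, the relation names, the lexicon entries, and the sample sentence are all invented here, and the slot `<x>` is approximated as a run of capitalised words. The second pattern also accepts the singular “president of”, which goes slightly beyond the example in the slide.

```python
import re

# The slot <x> is approximated as one or more capitalised words (an assumption).
SLOT = r"(?P<x>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"

# Hand-built dictionary of extraction patterns; relation names are hypothetical.
PATTERNS = {
    "destination": re.compile(r"traveled to " + SLOT),
    "organisation_led": re.compile(r"presidents? of " + SLOT),
}

# Hand-built semantic lexicon: word or phrase -> semantic category label.
LEXICON = {
    "Paris": "LOCATION",
    "France": "LOCATION",
    "Acme Corporation": "ORGANISATION",
}

def extract(text):
    """Return (relation, filler, category) triples found in the text."""
    results = []
    for relation, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            filler = match.group("x")
            results.append((relation, filler, LEXICON.get(filler, "UNKNOWN")))
    return results

sentence = ("The delegation traveled to Paris, where it met "
            "the president of Acme Corporation.")
facts = extract(sentence)
for fact in facts:
    print(fact)
```

Fillers that are missing from the lexicon are labelled `UNKNOWN`, which shows why a hand-built lexicon limits coverage.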

[Figure labels: Link Analysis, Query Log Analysis, Metadata Extraction, Keyword Ranking, Intelligent Match, Duplicate Elimination]

Document collections treatment

[Figure: two panels, Categorization and Clustering]
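The difference between the two treatments can be sketched in a few lines: categorization assigns a document to one of several predefined categories using labelled examples, while clustering groups documents without any labels. The slides do not name specific algorithms, so the nearest-centroid categorizer and the greedy single-pass clustering below are stand-ins chosen for brevity. The sample texts and the similarity threshold are likewise assumptions.

```python
import math
import re
from collections import Counter

def bag(text):
    """Bag-of-words representation of a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def categorize(text, labelled):
    """Supervised: `labelled` maps category -> example texts. Return the
    category whose summed bag is most similar to the text."""
    centroids = {label: sum((bag(t) for t in examples), Counter())
                 for label, examples in labelled.items()}
    doc = bag(text)
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))

def cluster(texts, threshold=0.5):
    """Unsupervised: a text joins the first cluster whose summed bag is
    similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (summed_bag, member_indices)
    for i, text in enumerate(texts):
        doc = bag(text)
        for summed, members in clusters:
            if cosine(doc, summed) >= threshold:
                summed.update(doc)
                members.append(i)
                break
        else:
            clusters.append((doc, [i]))
    return [members for _, members in clusters]

# Hypothetical labelled examples and unlabelled collection.
labelled = {
    "politics": ["senator wins election vote", "voters elect president"],
    "sports": ["team wins football match", "striker scores goal"],
}
print(categorize("president loses election", labelled))
print(cluster(["election vote count", "football match score",
               "election vote result", "football match report"]))
```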

Text mining example: Obama vs. McCain

Aster Data position for Text Analysis

[Diagram: four-stage pipeline]

• Data Acquisition: gather text from relevant sources (web crawling, document scanning, news feeds, Twitter feeds, …)

• Pre-Processing: perform the processing required to transform and store text data and information (stemming, parsing, indexing, entity extraction, …)

• Mining: apply data mining techniques to derive insights about stored information (statistical analysis, classification, natural language processing, …)

• Analytic Applications: leverage insights from text mining to provide information that improves decisions and processes (sentiment analysis, document management, fraud analysis, e-discovery, ...)

[Diagram legend: Third-Party Tools Fit, Aster Data Fit]

Aster Data Value: massive scalability of text storage and processing, functions for text processing, and flexibility to develop diverse custom analytics and incorporate third-party libraries

Aster Data Value for Text Analytics

• Ability to store and process massive volumes of text data
– Massively parallel data stores and massively parallel analytics engine
– SQL-MapReduce framework enables in-database processing for specialized text analytics tools

• Tools and extensibility for processing diverse text data
– SQL-MapReduce framework enables loading and transforming diverse sources and types of text data
– Pre-built functions for text processing

• Flexible platform for building and processing diverse analytics
– SQL-MapReduce framework enables creation of flexible, reusable analytics
– Embedded MapReduce processing engine for high-performance analytics

Aster Data Capabilities for Text Data

Pre-built SQL-MapReduce functions for text processing

• Data transformation utilities
- Pack: compress multi-column data into a single column
- Unpack: extract nested data for further analysis

• Web log analysis
- Sessionization: identify unique browsing sessions in clickstream data

• Text analysis
- Text parser: general tool for tokenizing, stemming, and counting text data
- nGram: split text into component parts (words & phrases)
- Levenshtein distance: compute “distance” between words
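To make two of the text analysis functions concrete, the sketch below shows what n-gram splitting and an edit distance between words compute. It is plain Python written for illustration and is not the Aster Data SQL-MapReduce API: the function names, signatures, and whitespace tokenization are assumptions.

```python
def ngrams(text, n):
    """Return the word n-grams of the text, each joined into one string.
    Tokenization by whitespace is an assumption of this sketch."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def levenshtein(a, b):
    """Edit distance: the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(ngrams("text mining at scale", 2))
print(levenshtein("kitten", "sitting"))
```

A small distance between two words suggests a spelling variant or typo, which is one common use of such a function when cleaning text before mining.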

[Architecture diagram, Aster Data nCluster: applications (custom and packaged analytics), the Aster Data Analytic Foundation, the SQL and SQL-MapReduce interfaces, and the data stores.]