Apache Tika
Post on 10-May-2015
Apache Tika, 2007-11-15
Jukka Zitting, jukka@apache.org
Apache Tika
An extensible, configurable
content analysis framework/toolkit
Agenda
The Problem
The Solution
The Project
The Design
The Problem
[Diagram: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc., each feeding a Lucene index]
It’s Worse Than That
• Licensing
• Dependencies
• Metadata extraction
• Structured content
• Encryption/Compression
• Package formats
• Streaming
• Processing of digital media
The Solution: Technical
• Generic API for extracting metadata and structured text content from a document
  – Input: byte stream + optional metadata
  – Output: XHTML SAX events + metadata
• Automatic content type detection
  – Magic bytes
  – File name patterns
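The input/output contract in the first bullet can be illustrated with a small sketch that uses only the JDK's SAX classes. The names `SketchParser`, `PlainTextSketchParser`, `TextCollector`, and `SketchDemo` are hypothetical and are not Tika's API; the metadata is modelled as a plain `Map`, which is also an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical interface: byte stream + metadata in, XHTML SAX events + metadata out. */
interface SketchParser {
    void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException;
}

/** Toy parser: reads the stream as UTF-8 text and wraps it in a minimal XHTML document. */
class PlainTextSketchParser implements SketchParser {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    @Override
    public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        String text = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        // The parser may add to the metadata it was given (assumed key name).
        metadata.put("Content-Type", "text/plain");

        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        handler.startElement(XHTML, "p", "p", none);
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement(XHTML, "p", "p");
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }
}

/** Collects the character data carried by the SAX event stream. */
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

/** Convenience wrapper so callers do not have to handle checked exceptions. */
class SketchDemo {
    static String extractText(byte[] data, Map<String, String> metadata) {
        TextCollector collector = new TextCollector();
        try {
            new PlainTextSketchParser().parse(new ByteArrayInputStream(data), collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }
}
```

Because the output is a stream of SAX events rather than a string, a consumer can index text as it arrives instead of buffering the whole document.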
The Solution: Legal / Social
• Apache License
  – (L)GPL projects can implement the Tika API
• Pooling of efforts
  – Active development and maintenance
  – Already beyond the functionality of most custom solutions
  – Cool future goals: OCR, speech recognition, …
Project Status
• First planned in early 2006
• Incubating since March 2007
• Sponsoring PMC: Apache Lucene
• No releases yet
  – 0.1 release being planned
• Small development team
  – 6 committers, 3-4 currently active
Current Features
• Media type framework
  – Shared MIME-info spec (freedesktop.org)
  – Default media type registry (incl. glob and magic patterns)
• Parser components
  – PDF (PDFBox)
  – Plain text (ICU4J)
  – XML (SAX)
  – HTML (NekoHTML)
  – Word, PowerPoint, Excel (POI)
  – ODF (SAX)
  – RTF (Swing)
Project Statistics
Codebase History
[Diagram: Lius, Nutch, Lius Lite, textmining, and Jackrabbit as sources of the Tika codebase]
People named: Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett
Content Extraction
PPT → new PowerPointParser().parse(…); →
Type: application/vnd.ms-powerpoint
Title: Apache Tika
Author: Jukka Zitting
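On the consuming side, a caller would receive XHTML SAX events and pick out what it needs. The following sketch shows one way such a handler might look. Since the PowerPoint parser itself is not reproduced here, the JDK's own SAX parser stands in as the event source, and `XhtmlSummaryHandler` is a hypothetical name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical consumer: keeps the title and the body text from XHTML SAX events. */
class XhtmlSummaryHandler extends DefaultHandler {
    final StringBuilder title = new StringBuilder();
    final StringBuilder body = new StringBuilder();
    private boolean inTitle;
    private boolean inBody;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) inTitle = true;
        if ("body".equals(qName)) inBody = true;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) inTitle = false;
        if ("body".equals(qName)) inBody = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) {
            title.append(ch, start, length);
        } else if (inBody) {
            body.append(ch, start, length);
        }
    }

    /** Stand-in event source: the JDK SAX parser replays an XHTML string as events. */
    static XhtmlSummaryHandler summarize(String xhtml) {
        XhtmlSummaryHandler handler = new XhtmlSummaryHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }
}
```

The same handler would work unchanged with any producer that emits XHTML SAX events, which is the point of using a common event format across document types.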
Media Type Detection
? → application/vnd.ms-powerpoint
MimeTypes types = …;
MimeType type = types.getMimeType(…);
Sources: tika-mimetypes.xml, /etc/magic, mime.types
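Detection by magic bytes and file name patterns can be sketched in a few lines of plain Java. This is not how the `MimeTypes` class is implemented. The class name `MediaTypeSketch`, the two small tables, and the choice to try magic bytes before the file name are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical detector combining magic-byte prefixes with "*.ext" name patterns. */
class MediaTypeSketch {
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();

    static {
        // Magic bytes: a fixed prefix at the start of the stream.
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/xml", "<?xml".getBytes(StandardCharsets.US_ASCII));
        // File name patterns, restricted here to the simple "*.ext" form.
        GLOBS.put("*.ppt", "application/vnd.ms-powerpoint");
        GLOBS.put("*.txt", "text/plain");
    }

    /** Returns a media type for the given name and leading bytes; name may be null. */
    static String detect(String fileName, byte[] prefix) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(prefix, entry.getValue())) {
                return entry.getKey();
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : GLOBS.entrySet()) {
                // "*.ppt" matches any name ending in ".ppt".
                if (lower.endsWith(entry.getKey().substring(1))) {
                    return entry.getValue();
                }
            }
        }
        return "application/octet-stream"; // generic fallback for unknown content
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }
}
```

A real registry would load its patterns from a file such as those listed on the slide rather than hard-coding them.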
Combined Detection and Extraction
PPT, TXT, XML, ? → new AutoDetectParser().parse(…); →
Type: application/vnd.ms-powerpoint
Title: Apache Tika
Author: Jukka Zitting
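The combined step amounts to detecting the type first and then dispatching to a parser registered for that type. The sketch below illustrates that idea only and does not reproduce `AutoDetectParser`. The class `AutoDetectSketch`, its two toy extractors, and the crude tag-stripping used for XML are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical detect-then-dispatch wrapper around per-type text extractors. */
class AutoDetectSketch {
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    AutoDetectSketch() {
        // Toy extractors: plain text is decoded as-is, XML has its tags stripped.
        parsers.put("text/plain", data -> new String(data, StandardCharsets.UTF_8));
        parsers.put("application/xml", data ->
                new String(data, StandardCharsets.UTF_8).replaceAll("<[^>]*>", " ").trim());
    }

    /** Minimal detection: one magic prefix and one file name pattern. */
    static String detect(String fileName, byte[] data) {
        String head = new String(data, 0, Math.min(data.length, 5), StandardCharsets.US_ASCII);
        if (head.equals("<?xml")) {
            return "application/xml";
        }
        if (fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".txt")) {
            return "text/plain";
        }
        return "application/octet-stream";
    }

    /** Detects the media type, then delegates to the matching extractor. */
    String parse(String fileName, byte[] data) {
        String type = detect(fileName, data);
        Function<byte[], String> parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.apply(data);
    }
}
```

Callers then need a single entry point regardless of document type, and supporting a new format means registering one more extractor.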
Thank You!