Content analysis for ECM with Apache Tika
DESCRIPTION
Presentation at ApacheCon US 2008 (New Orleans) by Paolo Mottadelli. It covers the Apache Tika project and how it was integrated into Alfresco to support full-text search of Office Open XML formats.
TRANSCRIPT
![Page 1: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/1.jpg)
Content analysis for ECM with Apache Tika
Paolo Mottadelli -
![Page 3: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/3.jpg)
ON BOARD!
![Page 4: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/4.jpg)
Agenda
![Page 5: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/5.jpg)
Main challenge
Lucene index
![Page 6: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/6.jpg)
Other challenges
![Page 7: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/7.jpg)
A real-world challenge
Searching .docx, .xlsx and .pptx in Alfresco ECM
![Page 8: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/8.jpg)
Agenda
![Page 9: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/9.jpg)
What is Tika?
Another Indian Lucene project? No.
![Page 10: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/10.jpg)
What is Tika?
It is a Toolkit
![Page 11: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/11.jpg)
Current coverage
![Page 12: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/12.jpg)
A brief history of Tika
Sponsored by the Apache Lucene PMC
![Page 13: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/13.jpg)
Tika organization
Changing after graduation
![Page 14: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/14.jpg)
Getting Tika
… and contributing
![Page 15: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/15.jpg)
Tika Design
![Page 16: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/16.jpg)
The Parser interface
    void parse(InputStream stream, ContentHandler handler, Metadata metadata)
        throws IOException, SAXException, TikaException;
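To make the contract above concrete, here is a minimal sketch of a parser that reads plain text and reports it to a SAX ContentHandler as XHTML events. It uses only JDK classes: `PlainTextParserSketch` is a hypothetical name, a `Map` stands in for Tika's `Metadata`, and `TikaException` is left out, so this is an illustration of the idea rather than Tika code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a parser implementation; not a Tika class. */
class PlainTextParserSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Reads UTF-8 text and reports each line as an XHTML p element. */
    static void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        metadata.put("Content-Type", "text/plain");
        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        BufferedReader lines =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line;
        while ((line = lines.readLine()) != null) {
            handler.startElement(XHTML, "p", "p", none);
            handler.characters(line.toCharArray(), 0, line.length());
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    /** Runs parse() with a collecting handler; each paragraph ends with a newline. */
    static String extract(String text, Map<String, String> metadata) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p".equals(localName)) {
                    out.append('\n');
                }
            }
        };
        try {
            byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
            parse(new ByteArrayInputStream(bytes), collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        System.out.print(extract("first line\nsecond line", metadata));
        System.out.println(metadata.get("Content-Type"));
    }
}
```

The caller chooses the handler, so the same parser can feed a text collector, an indexer or an XML serializer without changes.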
![Page 17: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/17.jpg)
Tika Design
![Page 18: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/18.jpg)
Document input stream
![Page 19: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/19.jpg)
Tika Design
![Page 20: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/20.jpg)
XHTML SAX events
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>...</title>
      </head>
      <body> ... </body>
    </html>
![Page 21: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/21.jpg)
Why XHTML?
• Reflect the structured text content of the document
• Not recreating the low level details
• For low level details use low level parser libs
![Page 22: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/22.jpg)
ContentHandler (CH) and Decorators (CHD)
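The decorator idea can be sketched with the JDK's own SAX classes: a handler wraps another handler and forwards only the events it is interested in. The class name `BodyOnlyDecoratorSketch` and the depth-counting approach below are assumptions made for illustration; Tika's own handler classes are implemented differently.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical decorator: forwards character data only while inside the body element. */
class BodyOnlyDecoratorSketch extends DefaultHandler {
    private final ContentHandler delegate;
    private int bodyDepth = 0; // 0 means we are outside the body element

    BodyOnlyDecoratorSketch(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(localName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (bodyDepth > 0) {
            delegate.characters(ch, start, length);
        }
    }

    /** Parses an XHTML string and returns the trimmed text found inside body. */
    static String bodyText(String xhtml) {
        final StringBuilder text = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so localName is populated
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new BodyOnlyDecoratorSketch(collector));
            reader.parse(new InputSource(new StringReader(xhtml)));
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Report</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(bodyText(xhtml));
    }
}
```

Because the decorator is itself a ContentHandler, decorators can be stacked, each one filtering or transforming the event stream before passing it on.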
![Page 23: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/23.jpg)
Tika Design
![Page 24: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/24.jpg)
Document metadata
![Page 25: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/25.jpg)
… more metadata: HPSF
![Page 26: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/26.jpg)
Tika Design
![Page 27: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/27.jpg)
Parser implementations
![Page 28: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/28.jpg)
The AutoDetectParser
• Encapsulates all Tika functionality
• Can handle any type of document
![Page 29: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/29.jpg)
Type Detection
    MimeType type = types.getMimeType(…);
![Page 30: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/30.jpg)
tika-mimetypes.xml
An example: Gzip
    <mime-type type="application/x-gzip">
      <magic priority="40">
        <match value="\037\213" type="string" offset="0" />
      </magic>
      <glob pattern="*.tgz" />
      <glob pattern="*.gz" />
      <glob pattern="*-gz" />
    </mime-type>
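The rule above combines two signals: the magic bytes `\037\213` (octal for 0x1F 0x8B) at offset 0, and file-name globs. A minimal sketch of how such a rule could be evaluated follows. `GzipDetectionSketch`, the fallback type and the choice to test the magic bytes before the name are assumptions for illustration, not a description of Tika's detection logic.

```java
/** Hypothetical evaluator for the gzip rule; not Tika's detector. */
class GzipDetectionSketch {
    static final String GZIP = "application/x-gzip";
    static final String UNKNOWN = "application/octet-stream"; // assumed fallback

    // The globs *.tgz, *.gz and *-gz are all plain suffix matches.
    private static final String[] SUFFIXES = {".tgz", ".gz", "-gz"};

    /** True if the data starts with octal \037\213, i.e. bytes 0x1F 0x8B. */
    static boolean hasGzipMagic(byte[] head) {
        return head != null
                && head.length >= 2
                && (head[0] & 0xFF) == 0x1F
                && (head[1] & 0xFF) == 0x8B;
    }

    /** True if the file name matches one of the glob patterns. */
    static boolean matchesGlob(String name) {
        if (name == null) {
            return false;
        }
        for (String suffix : SUFFIXES) {
            if (name.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    /** Checks content first, then falls back to the file name. */
    static String detect(byte[] head, String name) {
        if (hasGzipMagic(head) || matchesGlob(name)) {
            return GZIP;
        }
        return UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] gzipHead = {0x1F, (byte) 0x8B, 0x08};
        byte[] zipHead = {'P', 'K', 0x03, 0x04};
        System.out.println(detect(gzipHead, "data.bin"));
        System.out.println(detect(zipHead, "backup.tgz"));
        System.out.println(detect(zipHead, "archive.zip"));
    }
}
```

Content-based detection still works when a file has been renamed, while the glob covers cases where only the name is available.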
![Page 31: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/31.jpg)
Supported formats
![Page 32: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/32.jpg)
A really simple example
    InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
    Metadata metadata = new Metadata();
    ContentHandler handler = new BodyContentHandler();
    // Parse the document: text goes to the handler, properties to metadata
    new OfficeParser().parse(input, handler, metadata);
    String contentType = metadata.get(Metadata.CONTENT_TYPE);
    String title = metadata.get(Metadata.TITLE);
    String content = handler.toString();
![Page 33: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/33.jpg)
Demo
![Page 34: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/34.jpg)
Future Goals
![Page 35: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/35.jpg)
Who uses Tika?
![Page 36: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/36.jpg)
Agenda
![Page 37: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/37.jpg)
ECM: what is it?
![Page 38: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/38.jpg)
ECM: Manage
• Indexing
• Categorization
![Page 39: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/39.jpg)
ECM: we love SEARCHING!
![Page 40: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/40.jpg)
ECM: we love SEARCHING!
![Page 41: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/41.jpg)
ECM: we love SEARCHING!
![Page 42: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/42.jpg)
Don’t do it on your own
Tika shields ECM from using many single components
![Page 43: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/43.jpg)
Agenda
![Page 44: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/44.jpg)
Alfresco: short presentation
![Page 45: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/45.jpg)
Alfresco: short presentation
![Page 46: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/46.jpg)
Who uses Alfresco?
![Page 47: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/47.jpg)
Alfresco Repository
JSR-170 Level 2 compatible
![Page 48: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/48.jpg)
Repository Architecture
[Diagram labels: Services, Components, Storage; Node, Content, Search, Query, Index; Hibernate, Database, Lucene, Content Index]
![Page 49: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/49.jpg)
Repository Architecture
[Diagram repeated from the previous slide]
![Page 50: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/50.jpg)
Alfresco Search
![Page 51: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/51.jpg)
Alfresco Search
![Page 52: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/52.jpg)
Use case
![Page 53: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/53.jpg)
Use case
![Page 54: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/54.jpg)
Without Tika:
![Page 55: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/55.jpg)
Step 1
![Page 56: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/56.jpg)
Step 2
ContentTransformerRegistry
Provides the most appropriate ContentTransformer
    for (ContentTransformer transformer : transformers)
    {
        long transformationTime = transformer.getTransformationTime();
        if (bestTransformer == null || transformationTime < bestTime)
        {
            bestTransformer = transformer;
            bestTime = transformationTime;
        }
    }
    return bestTransformer;
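The selection loop above can be run in isolation with a small stand-in type. In this sketch `TimedTransformer` is a hypothetical interface that models only the timing accessor; it is not Alfresco's ContentTransformer, and the names and timings are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical, self-contained version of the fastest-transformer selection. */
class TransformerSelectionSketch {
    /** Stand-in for ContentTransformer; only the timing accessor is modelled. */
    interface TimedTransformer {
        String name();
        long getTransformationTime();
    }

    /** Creates a transformer that reports a fixed time in milliseconds. */
    static TimedTransformer of(final String name, final long millis) {
        return new TimedTransformer() {
            public String name() {
                return name;
            }

            public long getTransformationTime() {
                return millis;
            }
        };
    }

    /** Returns the candidate with the lowest reported time, or null if there are none. */
    static TimedTransformer fastest(List<TimedTransformer> candidates) {
        TimedTransformer best = null;
        long bestTime = Long.MAX_VALUE;
        for (TimedTransformer candidate : candidates) {
            long time = candidate.getTransformationTime();
            if (best == null || time < bestTime) {
                best = candidate;
                bestTime = time;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<TimedTransformer> candidates =
                Arrays.asList(of("poi", 120L), of("openoffice", 900L), of("text", 45L));
        System.out.println(fastest(candidates).name());
    }
}
```

On a tie the first candidate wins, because the comparison is strict.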
![Page 57: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/57.jpg)
Step 2 (explained)
Too many different ContentTransformer implementations
![Page 58: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/58.jpg)
Step 3
Transform
Example: PoiHssfContentTransformer
    public void transformInternal(ContentReader reader, ContentWriter writer,
            TransformationOptions options) throws Exception {
        ...
        HSSFWorkbook workbook = new HSSFWorkbook(is);
        ...
        for (int i = 0; i < sheetCount; i++) {
            HSSFSheet sheet = workbook.getSheetAt(i);
            String sheetName = workbook.getSheetName(i);
            writeSheet(os, sheet, encoding);
        }
        ...
    }
![Page 59: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/59.jpg)
Step 3 (explained)
Too many different ContentTransformer implementations
... again !?!
![Page 60: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/60.jpg)
Step 4
Lucene index creation
    ContentReader reader = contentService.getReader(nodeRef, propertyName);
    ContentTransformer transformer = contentService.getTransformer(
            reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
    transformer.transform(reader, writer);
    reader = writer.getReader();
    . . . . . . . .
    doc.add(new Field(attributeName, reader, Field.TermVector.NO));
![Page 61: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/61.jpg)
Let’s do it using Tika
![Page 62: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/62.jpg)
Step 1 + Step 2 + Step 3
    String name = "resource.doc";
    InputStream input = getResourceAsStream(name);
    Metadata metadata = new Metadata();
    ContentHandler handler = new BodyContentHandler();
    new AutoDetectParser().parse(input, handler, metadata);
    String title = metadata.get(Metadata.TITLE);
    String content = handler.toString();
![Page 63: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/63.jpg)
Step 1 to 4 (compressed)
    String name = "resource.doc";
    InputStream input = getResourceAsStream(name);
    Reader reader = new ParsingReader(input, name);
    . . . . . .
    doc.add(new Field(attributeName, reader, Field.TermVector.NO));
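Handing a Reader to the indexer means extraction can proceed while the index consumes the text. One way to get that behaviour is a pipe fed by a background thread, sketched below with JDK classes only. `ParsingReaderSketch` is a hypothetical name, and the "parsing" step is a plain UTF-8 copy standing in for real extraction; no Tika or Lucene classes are involved.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: a Reader whose text is produced by a background thread. */
class ParsingReaderSketch {
    /** Starts a producer thread and returns the consuming end of the pipe. */
    static Reader open(final InputStream input) throws IOException {
        PipedReader reader = new PipedReader();
        final PipedWriter writer = new PipedWriter(reader);
        Thread producer = new Thread(new Runnable() {
            public void run() {
                // Closing the writer signals end of stream to the reading side.
                try (Writer out = writer;
                     Reader in = new InputStreamReader(input, StandardCharsets.UTF_8)) {
                    char[] buffer = new char[1024];
                    int n;
                    // Placeholder for extraction: copy characters through unchanged.
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                } catch (IOException e) {
                    // The consumer closed the pipe early; nothing more to produce.
                }
            }
        });
        producer.setDaemon(true);
        producer.start();
        return reader;
    }

    /** Consumes the whole reader, as an indexer would, and returns the text. */
    static String readAll(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        try (Reader reader = open(new ByteArrayInputStream(bytes))) {
            StringBuilder out = new StringBuilder();
            char[] buffer = new char[256];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll("text extracted in the background"));
    }
}
```

The pipe's bounded buffer keeps memory use flat for large documents, since the producer blocks until the consumer catches up.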
![Page 64: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/64.jpg)
Results: 1 & 2
![Page 65: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/65.jpg)
Extension use case
Adding support for Microsoft Office Open XML documents (Office 2007+)
![Page 66: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/66.jpg)
Apache POI
Apache POI provides text extraction support for Office Open XML formats
and advanced coverage of the SpreadsheetML specification
(WordprocessingML & PresentationML to come)
![Page 67: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/67.jpg)
Apache POI
Apache POI status
![Page 68: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/68.jpg)
Apache POI TextExtractors
    POIXMLDocument document;
    Package pkg = Package.open(stream);
    textExtractor = ExtractorFactory.createExtractor(pkg);
    if (textExtractor instanceof XSSFExcelExtractor) {
        setType(metadata, OOXML_EXCEL_MIMETYPE);
        document = new XSSFWorkbook(pkg);
    }
    else if (textExtractor instanceof XWPFWordExtractor) {…}
    else if (textExtractor instanceof XSLFPowerPointExtractor) {…}
    setPOIXMLProperties(metadata, document);
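The dispatch above relies on POI to tell the three Office Open XML flavours apart. Since these files are ZIP packages, the underlying idea can be illustrated with the JDK alone by looking for a conventional main part name. This is a simplified sketch under that assumption: `OoxmlSniffSketch` is a hypothetical name, and a real package reader would resolve the main part through the package relationships rather than trusting fixed entry names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical sniffer that guesses the OOXML flavour from ZIP entry names. */
class OoxmlSniffSketch {
    static final String PREFIX = "application/vnd.openxmlformats-officedocument.";
    static final String WORD = PREFIX + "wordprocessingml.document";
    static final String EXCEL = PREFIX + "spreadsheetml.sheet";
    static final String POWERPOINT = PREFIX + "presentationml.presentation";

    /** Returns the guessed MIME type, or null if no known main part is found. */
    static String detect(InputStream in) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("word/document.xml")) {
                    return WORD;
                }
                if (name.equals("xl/workbook.xml")) {
                    return EXCEL;
                }
                if (name.equals("ppt/presentation.xml")) {
                    return POWERPOINT;
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny in-memory ZIP with one entry, for demonstration only. */
    static byte[] zipWith(String entryName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write('x');
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(detect(new ByteArrayInputStream(zipWith("xl/workbook.xml"))));
        System.out.println(detect(new ByteArrayInputStream(zipWith("readme.txt"))));
    }
}
```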
![Page 69: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/69.jpg)
Can we find it?
![Page 70: Content analysis for ECM with Apache Tika](https://reader037.vdocuments.site/reader037/viewer/2022103113/554c7fbbb4c905834a8b4842/html5/thumbnails/70.jpg)
Results: 3 & 4