![Page 1: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/1.jpg)
Large Scale Text Analysis Using the Map/Reduce Hierarchy
David Buttler
Lawrence Livermore National Laboratory
This work is performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
![Page 2: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/2.jpg)
Large scale computing with commodity hardware
Origins
• Google: GFS, Map/Reduce, BigTable
• Microsoft Azure
• Hadoop: Yahoo! / Open Source Software
Why do we care: k-mer lexing
• 10 hours on a single fat node
• 1 hour on an old cluster
2LLNL-PRES-428291 Computation / Center for Applied Scientific Computing
![Page 3: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/3.jpg)
The M/R stack of open source software
[Stack diagram: Workflow (Cascading / Azkaban), Katta, Solr / Lucene, Pig, Hive, Zookeeper, HBase, Map / Reduce, HDFS]
![Page 4: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/4.jpg)
The M/R stack of open source software – HDFS
![Page 5: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/5.jpg)
HDFS
Replicated
Distributed
Centrally managed
• SPOF
• Limited number of files
Not POSIX compliant
Rack-aware
[Diagram: a Name Node holding file metadata (File: path, {Blocks}) and Data Nodes 1 … n]
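The name node's bookkeeping can be illustrated with a minimal Python sketch, assuming a toy in-memory model in which a file is a path mapped to a list of blocks and each block is replicated on data nodes spread across racks. The class `ToyNameNode`, the block size, the replication factor of 3, and the placement rule are illustrative assumptions, not the HDFS implementation.

```python
class ToyNameNode:
    """Toy model of name node metadata: path -> blocks -> replica locations."""

    def __init__(self, racks, block_size=4, replication=3):
        # racks: rack name -> list of data node names (hypothetical layout)
        self.racks = racks
        self.block_size = block_size      # assumed size units, not real HDFS sizes
        self.replication = replication    # assumed replication factor
        self.files = {}                   # path -> list of (block_id, [data nodes])
        self._next_block = 0

    def _place(self, block_id):
        # Simplified rack-aware placement: rotate the rack order per block,
        # then take one node from each rack in turn until enough replicas exist.
        rack_names = sorted(self.racks)
        start = block_id % len(rack_names)
        order = rack_names[start:] + rack_names[:start]
        replicas, depth = [], 0
        while len(replicas) < self.replication:
            added = False
            for rack in order:
                nodes = self.racks[rack]
                if depth < len(nodes) and len(replicas) < self.replication:
                    replicas.append(nodes[depth])
                    added = True
            if not added:
                break  # fewer data nodes than the requested replication
            depth += 1
        return replicas

    def add_file(self, path, size):
        n_blocks = -(-size // self.block_size)  # ceiling division
        blocks = []
        for _ in range(n_blocks):
            blocks.append((self._next_block, self._place(self._next_block)))
            self._next_block += 1
        self.files[path] = blocks
        return blocks


name_node = ToyNameNode({"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]})
blocks = name_node.add_file("/corpus/part-0", size=10)
```

Because every path-to-block mapping lives in this one object, the sketch also shows why the name node is a single point of failure and why the number of files is limited by what one machine can hold.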
![Page 6: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/6.jpg)
The M/R stack of open source software – Map / Reduce
![Page 7: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/7.jpg)
Map/Reduce is functional programming distributed over a cluster
Distributed computation
Two phase computation
Built-in shuffle/sort between phases
Canonical example: word frequency count for the web
[Diagram: Input Documents → Map (emits pairs such as "to, 1", "be, 1", "or, 1", "not, 1") → shuffle / sort → Reduce (sums per word, e.g. "be, 2+", "not, 1+", "or, 1", "to, 2+")]
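The word frequency example can be sketched on a single machine in plain Python. This is a minimal illustration of the two phases and the shuffle/sort between them, not Hadoop code; the function names are assumptions chosen for readability.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input document.
    return [(word, 1) for word in document.lower().split()]


def shuffle_sort(pairs):
    # Shuffle/sort: group all values by key, then order the keys.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    # Reduce: sum the counts collected for one word.
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle_sort(pairs))


counts = word_count(["to be or not", "to be"])
```

On a cluster the map calls run in parallel over separate input splits and the reduce calls run in parallel over separate key ranges; only the shuffle/sort moves data between machines.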
![Page 8: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/8.jpg)
More interesting M/R examples
Map input: document
Map output: raw text
Map input: text
Map output: Named entity annotations
![Page 9: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/9.jpg)
The M/R stack of open source software – Zookeeper
![Page 10: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/10.jpg)
Zookeeper
A highly available, scalable, distributed configuration, consensus, group membership, leader election, naming, and coordination service
Uses:
• HBase: row locking; region key ranges; region server addresses
• Katta: shard location information
• Message queues
Not: a large scale data store
![Page 11: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/11.jpg)
Zookeeper Guarantees
1. Clients will never detect old data.
2. Clients will get notified of a change to data they are watching within a bounded period of time.
3. All requests from a client will be processed in order.
4. All results received by a client will be consistent with results received by all other clients.
![Page 12: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/12.jpg)
Zookeeper Data Model
Hierarchical namespace
Each znode has data and children
Data is read and written in its entirety
Nodes store < 1MB data
Writes go to all nodes
[Diagram: znode tree rooted at "/", with nodes labeled services, servers, locks, YaView, Katta, HBase, Name 1 … Name n, and read-1]
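The data model can be sketched as a small in-memory tree in Python. This is a toy under stated assumptions: the class and method names (`ToyZooKeeper`, `create`, `get`, `set`, `list_children`) are hypothetical and are not the ZooKeeper client API, and nothing here is distributed.

```python
class ZNode:
    """One node in the namespace: it has both data and children."""

    def __init__(self, data=b""):
        self.data = data
        self.children = {}


class ToyZooKeeper:
    MAX_DATA = 1024 * 1024  # per-node limit taken from the slide (< 1 MB)

    def __init__(self):
        self.root = ZNode()

    def _find(self, path):
        node = self.root
        for part in [p for p in path.split("/") if p]:
            node = node.children[part]  # KeyError if a parent is missing
        return node

    def _check(self, data):
        if len(data) >= self.MAX_DATA:
            raise ValueError("znode data must stay under 1 MB")

    def create(self, path, data=b""):
        self._check(data)
        parent_path, _, name = path.rstrip("/").rpartition("/")
        parent = self._find(parent_path)
        if name in parent.children:
            raise KeyError(f"{path} already exists")
        parent.children[name] = ZNode(data)

    def get(self, path):
        # Data is read in its entirety: there are no partial reads.
        return self._find(path).data

    def set(self, path, data):
        # Data is written in its entirety: the old value is replaced.
        self._check(data)
        self._find(path).data = data

    def list_children(self, path):
        return sorted(self._find(path).children)


zk = ToyZooKeeper()
zk.create("/services")
zk.create("/services/Katta", b"shard locations")
zk.create("/services/HBase", b"region servers")
```

The size check makes the "not a large scale data store" point concrete: znodes are meant for small pieces of coordination state.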
![Page 13: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/13.jpg)
ZooKeeper Service
[Diagram: clients connected to a ZooKeeper Service made up of several servers, one of which is the leader]
All servers store a copy of the data (in memory)
A leader is elected at startup
Followers service clients, all updates go through leader
Update responses are sent when a majority of servers have persisted the change
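The majority rule for updates can be sketched in a few lines of Python. This is a toy model, assuming hypothetical `ToyServer` and `ToyEnsemble` classes; leader election is reduced to picking the first live server, and the real protocol (ordering, recovery, persistence to disk) is not modeled.

```python
class ToyServer:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.store = {}  # every server keeps a full copy of the data


class ToyEnsemble:
    def __init__(self, servers):
        self.servers = servers
        # Stand-in for leader election at startup: the first live server wins.
        self.leader = next(s for s in servers if s.up)

    def write(self, key, value):
        # All updates go through the leader, which needs a majority to commit.
        live = [s for s in self.servers if s.up]
        if len(live) <= len(self.servers) // 2:
            return False  # no majority, so the update is refused
        for server in live:
            server.store[key] = value
        return True

    def read(self, server_index, key):
        # Any server can answer a read from its own copy.
        return self.servers[server_index].store.get(key)


ensemble = ToyEnsemble([ToyServer(f"s{i}", up=(i < 3)) for i in range(5)])
committed = ensemble.write("/locks/read-1", b"client-42")
```

With five servers the ensemble keeps accepting updates while up to two are down, since three live servers still form a majority.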
![Page 14: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/14.jpg)
The M/R stack of open source software – HBase
![Page 15: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/15.jpg)
HBase
Distributed column oriented data store
• Only supports one data type
• Tables are broken into regions
• Regions are automatically split and redistributed
• All data is local
Scales to > 1M rows / second insert rate (20 node cluster)
Tightly integrated with Hadoop -> rows can be input/output for map/reduce tasks
![Page 16: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/16.jpg)
HBase Data model
[Diagram: nested maps within a Table, labeled Row ID, Column Family, Qualifier, Version, and Value, with a "Physical files" label]
![Page 17: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/17.jpg)
HBase Data model (simplified)
[Diagram: a Table of key/value pairs, where each Key is (Row ID, Family: Column: Version) and maps to a Value]
Regions partitioned on row key
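The simplified model can be sketched in Python as a key/value map whose keys are (row ID, family:column, version) and whose regions are chosen by row key. The `ToyTable` class and its fixed split keys are assumptions for illustration; HBase splits regions automatically and stores them on region servers.

```python
import bisect


class ToyTable:
    def __init__(self, split_keys):
        # split_keys: row keys at which one region ends and the next begins
        # (fixed here; real regions are split and redistributed automatically).
        self.split_keys = sorted(split_keys)
        self.regions = [dict() for _ in range(len(self.split_keys) + 1)]

    def region_for(self, row):
        # Regions are partitioned on row key.
        return bisect.bisect_right(self.split_keys, row)

    def put(self, row, family, qualifier, value, version):
        key = (row, f"{family}:{qualifier}", version)
        self.regions[self.region_for(row)][key] = value  # values are raw bytes

    def get(self, row, family, qualifier):
        # Return the value stored under the highest version, if any.
        region = self.regions[self.region_for(row)]
        column = f"{family}:{qualifier}"
        versions = [
            (key[2], value)
            for key, value in region.items()
            if key[0] == row and key[1] == column
        ]
        return max(versions, key=lambda pair: pair[0])[1] if versions else None


table = ToyTable(split_keys=["m"])
table.put("doc1", "text", "body", b"first draft", version=1)
table.put("doc1", "text", "body", b"second draft", version=2)
table.put("zebra", "meta", "source", b"PubMed", version=1)
```

Since every cell for a given row lands in the same region, a reader needs only the row key to know which region to ask.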
![Page 18: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/18.jpg)
HBase System Architecture
From http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
![Page 19: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/19.jpg)
HBase Master manages region servers
![Page 20: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/20.jpg)
HBase clients directly access region servers for data
![Page 21: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/21.jpg)
The M/R stack of open source software – Hive
![Page 22: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/22.jpg)
Hive provides an SQL-like interface to data
Components
• Shell: SQL-like command line; Web; JDBC
• Driver: API interface
• Compiler: parse, plan, optimize
• Execution Engine: DAG of stages (M/R, HDFS, or metadata)
• Metastore: schema, location in HDFS, SerDe
http://www.cloudera.com/videos/introduction_to_hive
![Page 23: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/23.jpg)
The M/R stack of open source software – Solr / Katta
![Page 24: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/24.jpg)
Solr
Faceted text search interface built on top of Lucene
Built as a native web app – drops into any web server
![Page 25: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/25.jpg)
Faceted search is a foundational component for ad hoc document analysis
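What faceting adds to plain text search can be shown with a minimal Python sketch. It assumes documents are dictionaries with a `text` field plus arbitrary metadata fields; the function `faceted_search` and the sample records are hypothetical and unrelated to the Solr API.

```python
from collections import Counter


def faceted_search(docs, query, filters=None):
    """Return matching docs plus, for each metadata field, counts per value."""
    filters = filters or {}
    hits = [
        doc for doc in docs
        if query.lower() in doc["text"].lower()
        and all(doc.get(field) == value for field, value in filters.items())
    ]
    facet_fields = sorted({field for doc in hits for field in doc if field != "text"})
    facets = {
        field: Counter(doc[field] for doc in hits if field in doc)
        for field in facet_fields
    }
    return hits, facets


# Hypothetical sample records, made up for this sketch.
docs = [
    {"text": "Kinase signaling in yeast", "source": "PubMed", "year": 2008},
    {"text": "Kinase inhibitors reviewed", "source": "PubMed", "year": 2009},
    {"text": "New kinase drug approved", "source": "NYT", "year": 2009},
    {"text": "Election results", "source": "NYT", "year": 2008},
]
hits, facets = faceted_search(docs, "kinase")
narrowed, _ = faceted_search(docs, "kinase", filters={"source": "PubMed"})
```

The facet counts tell the analyst how the hits break down before any filter is chosen, which is what makes drill-down exploration ad hoc.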
![Page 26: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/26.jpg)
Solr architecture
[Diagram by Yonik Seeley: HTTP Request Servlet and Update Servlet; Admin Interface; Standard, DisjunctionMax, and Custom Request Handlers; XML Response Writer; XML Update Interface; Solr Core (Config, Schema, Caching, Update Handler, Replication, Analysis, Concurrency); Lucene]
![Page 27: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/27.jpg)
Katta provides vertical and horizontal scalability
[Diagram: a text search interface (Solr) sends 1) Query to Nodes 1 … n, each holding shards from Shard A … Shard Z, and receives 5) Combined Results; Zookeeper is shown alongside the nodes]
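The query fan-out and result merge can be sketched in Python as a scatter-gather over shards. This is an assumption-laden toy: shards are plain dictionaries, the score is a raw term count, and the functions `search_shard` and `search_cluster` are hypothetical names, not Katta or Lucene calls.

```python
import heapq


def search_shard(shard, query):
    # Score each document in one shard by how often the query term occurs.
    results = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(query.lower())
        if score:
            results.append((score, doc_id))
    return results


def search_cluster(nodes, query, top_n=3):
    # Scatter: send the query to every shard on every node.
    partial = []
    for shards in nodes.values():
        for shard in shards:
            partial.extend(search_shard(shard, query))
    # Gather: merge the partial results and keep the best-scoring documents.
    best = heapq.nlargest(top_n, partial)
    return [doc_id for _, doc_id in best]


# Hypothetical layout: node name -> list of shards, shard = {doc id: text}.
nodes = {
    "node1": [{"a": "hadoop hadoop hbase", "b": "solr"}],
    "node2": [{"c": "hadoop"}, {"d": "hadoop hadoop hadoop"}],
}
top = search_cluster(nodes, "hadoop", top_n=2)
```

Adding shards to a node or nodes to the cluster changes only the `nodes` mapping, which is the kind of shard location information the deck says Katta keeps in Zookeeper.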
![Page 28: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/28.jpg)
Projects using Hadoop at LLNL
Student projects
• Bioinformatics [James Leek]
• Continuous time LDA [Kurt Miller and Tina Eliassi-Rad]
Advanced R&D projects
• Network analytics
• Keyword tagging and entity extraction
• Faceted Search
Research projects
• READ LDRD
Program deployments
• BKMS
![Page 29: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/29.jpg)
Bioinformatics (student project)
KPATH: produce DNA signatures for detection of pathogens
• k-mer lexing: produce set of unique DNA sequences of length 15-60 (sliding window)
• Discover k-mers that are unique between bacteria and viruses
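The sliding window can be sketched in a few lines of Python. This is a single-machine illustration, not the KPATH or Hadoop implementation; reading "unique between bacteria and viruses" as "present in one group but absent from the other" is an assumption, and the short sequences are made up.

```python
def kmers(sequence, k):
    # Slide a window of width k across the sequence and collect each substring.
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_kmers(group_a, group_b, k):
    # Assumed meaning of "unique": k-mers seen in one group but never in the other.
    a = set().union(*(kmers(seq, k) for seq in group_a))
    b = set().union(*(kmers(seq, k) for seq in group_b))
    return a - b, b - a


# Made-up toy sequences; real inputs are genomes and k is 15-60.
bacteria = ["ACGTAC"]
viruses = ["CGTACC"]
only_bacteria, only_viruses = unique_kmers(bacteria, viruses, k=3)
```

In a map/reduce setting the window step fits the map phase (emit each k-mer with its source group) and the set comparison fits the reduce phase (keep k-mers whose values all come from one group).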
![Page 30: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/30.jpg)
K-mer parsing performance comparisons
Lexing bacteria file, 30 k-mer length [120 GB]
• Optimized suffix tree [C implementation] on single node, 256 GB RAM, 16 processor system: 10.5 hours
• Custom Hadoop implementation on 85 nodes, 8 GB RAM, dual processor [old]: ~1 hour
![Page 31: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/31.jpg)
Unique K-mer grouping performance
Group unique k-mers of length 15 [13 GB data]
• Pig implementation using outer joins [10 LOC]: more than 9 hours
• Custom Hadoop implementation: 3 hours 26 minutes
![Page 32: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/32.jpg)
Network data
HDFS provides storage layer for large repositories of network data
Hive provides an SQL interface
Performance on single query for 6 months of data:
• Tuned Oracle DB: hours to days
• Hive: minutes
![Page 33: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/33.jpg)
Hadoop-based document management architecture
[Diagram: Ingest (PubMed, NYT, RSS, etc.); HBase Document table holding Source, Text, Metadata, and Annotations; Process; Katta/Solr; Access through Tomcat (Document Viewer, Faceted Search)]
![Page 34: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/34.jpg)
Example Data Flow
Initial Load
• Load original documents into document table
Parser
• Custom map code to extract text and metadata
NLP
• Named Entity Extraction (SNER)
• Parsing / Coreference
Topics
• Send corpus slices to LDA for topic modeling
Index
• Write specific HBase columns to faceted Solr index shards
Serve
• Manage indexes with Katta over HDFS
![Page 35: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/35.jpg)
Keyword tagging & Entity extraction
Keyword tagging
• Large dictionaries (100K terms)
• Finite state machine to store dictionary
Named Entity Recognition
• Stanford NER
• CRF model [People, Organizations, Location]
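Storing a dictionary in a finite state machine can be sketched with a token-level trie in Python. This is one possible realization, assumed for illustration: the deck does not say which automaton was used, and `build_trie` and `tag_keywords` are hypothetical names. Tokenization is a plain whitespace split, so punctuation is not handled.

```python
_END = object()  # sentinel key marking an accepting state


def build_trie(phrases):
    # Each dictionary phrase becomes a path of token transitions from the root.
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[_END] = phrase
    return root


def tag_keywords(text, trie):
    # Scan the text once, following transitions and keeping the longest match.
    tokens = text.lower().split()
    matches = []
    i = 0
    while i < len(tokens):
        node, j, last = trie, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if _END in node:
                last = (i, j, node[_END])  # (start token, end token, phrase)
        if last:
            matches.append(last)
            i = last[1]
        else:
            i += 1
    return matches


# Made-up dictionary entries for the sketch.
trie = build_trie(["breast cancer", "cancer", "protein kinase"])
tags = tag_keywords("Breast cancer and protein kinase activity", trie)
```

Lookup cost depends on the length of the text rather than the size of the dictionary, which is why this structure suits dictionaries with many thousands of terms.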
![Page 36: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/36.jpg)
Performance of Keyword tagging & Entity extraction
21M Pubmed entries + 1M news articles
• 11M Pubmed abstracts
55K dictionary key phrases
6 node cluster [16 core, 96 GB RAM, 6 TB disk]
Keyword Tagging
• 8 minutes, 34 seconds
Named Entity Annotation
• 1 hr 58 minutes
![Page 37: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/37.jpg)
![Page 38: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/38.jpg)
![Page 39: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/39.jpg)
![Page 40: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/40.jpg)
Faceted Search Indexing Performance
Creating 1 Solr index on 1M news articles:
• 8 hrs 16 min (Map: 37 min; Reduce: 8 hrs 14 min)
Creating 50 Solr indexes on 1M news articles:
• 55 min (Map: 7 min; Reduce: 54 min)
![Page 41: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/41.jpg)
Open-sourced products and others in the open source pipeline
iScore
• Content-based personalization
• [pre-Hadoop]
Reconcile
• Coreference resolution software built on open source tools [with Cornell and U. Utah]
• Additional adaptation to Hadoop
Dunk
• An elegant Java annotation system that allows you to have the fields of a Java object serialized (deserialized) to (from) an HBase table
• Simplifies queries, object construction, and map/reduce formulation
![Page 42: Large Scale Text Analysis Using the Map/Reduce Hierarchy Butler.pdf · Large Scale Text Analysis Using the Map/Reduce Hierarchy ... The M/R stack of open source software Workflow](https://reader033.vdocuments.site/reader033/viewer/2022052607/5a71a4177f8b9abb538cf5a0/html5/thumbnails/42.jpg)
Questions?
Disclaimer
This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.