BP-3: Taking Your Bulk Content Ingestions to the Next Level
DESCRIPTION
Learn about the Alfresco Bulk Filesystem Import Tool, a community-developed extension to Alfresco that provides a high-performance bulk import feature. Discover how different tuning parameters affect import performance, and learn how to determine the optimum configuration for your Alfresco environment.
TRANSCRIPT
Peter Monks
Director of Technology, Strategic Alliances
Agenda
1. Introduction to the Bulk Filesystem Import Tool
2. Demo
3. Performance analysis
   1. Methodology
   2. Results
   3. Conclusions
4. Roadmap for the Bulk Filesystem Import Tool
5. Q&A
6. Appendices
Introduction to the Bulk File System Import Tool
(BFSIT for short)
Introduction to the BFSIT
BFSIT = Bulk File System Import Tool
• Primary use case: one-off content migration / ingestion
• Provides high-performance import of content
• A community-maintained extension to Alfresco
• Hosted on Google Code [1]
• LGPL licensed
• Widely adopted
Introduction to the BFSIT
Why not use…
• Web UIs?
• ACP Files?
• CIFS, FTP, NFS, WebDAV, IMAP?
• CMIS?
All of the above suffer from one or more of:
• Content sent over network
• External (out of process) orchestration
• Content requires pre-/post-processing (e.g. ACP)
• Chatty (e.g. CIFS, NFS)
• Overly general (e.g. CMIS)
Introduction to the BFSIT
Solution
• Import content from the Alfresco server
• Load folders & content as they appear on disk
• Content is imported in batches
• The “unit of work” is the directory
• Each directory is imported in at least one batch
  • More if lots of content
• Batches within a directory are processed serially
Introduction to the BFSIT
Usage:
1. Initiated via a simple repo Web Script:
   (can also be initiated via wget, curl, et al)
2. Import runs in background
3. Detailed status is displayed while in progress
   • We'll see that during the demo
Introduction to the BFSIT
Process details:
• Place source directory on job queue & immediately return
• Worker thread pulls a single directory off job queue, and:
  1. Lists the contents of the directory
  2. Groups entries into “importable items”
  3. Filters importable items, based on admin-defined filtering rules
  4. Subdivides list of importable items into batches
  5. Imports batches, one at a time (serially)
  6. Places all subdirectories onto the job queue
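The process above can be sketched in a few lines of Python. This is a minimal illustration of the directory-at-a-time job queue, not the tool's actual code: the function name `import_tree`, the fixed `batch_size`, the "skip dot-files" filter rule, and the in-memory `imported` list standing in for the repository are all assumptions made for the example.

```python
import os
import queue
import threading


def import_tree(source_root, worker_count=4, batch_size=100):
    """Walk source_root one directory per job; return the batches 'imported'."""
    jobs = queue.Queue()      # the job queue of directories
    imported = []             # hypothetical stand-in for the repository
    lock = threading.Lock()

    def worker():
        while True:
            directory = jobs.get()
            if directory is None:            # sentinel: no more work
                jobs.task_done()
                return
            try:
                # 1. List the contents of the directory.
                with os.scandir(directory) as it:
                    entries = sorted(it, key=lambda e: e.name)
                # 2. Group entries into importable items (here: one item per entry).
                items = [e.path for e in entries]
                # 3. Filter items (assumed rule: skip names starting with a dot).
                items = [p for p in items if not os.path.basename(p).startswith(".")]
                # 4. Subdivide the items into batches.
                batches = [items[i:i + batch_size]
                           for i in range(0, len(items), batch_size)]
                # 5. Import the batches of this directory one at a time.
                for batch in batches:
                    with lock:
                        imported.append((directory, batch))
                # 6. Place all subdirectories onto the job queue.
                for entry in entries:
                    if entry.is_dir():
                        jobs.put(entry.path)
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    jobs.put(source_root)     # place source directory on the job queue
    jobs.join()               # wait until every queued directory is processed
    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()
    return imported
```

Unlike the real tool, this sketch blocks until the whole tree is processed so that the result can be inspected; the tool itself returns immediately and runs in the background.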
Introduction to the BFSIT
Process details, repeated with the steps grouped into I/O Bound Phases and CPU Bound Phases
Demo
Performance Analysis: Methodology
Goals and Test Plan
Goals:
• Benchmark total time taken for bulk imports, using combinations of:
  • Machine environments
  • Source content sets
  • Alfresco repository configurations
  • Bulk import tool configurations
Test Plan:
• Parallel testing in 2 environments
• Two runs per test per machine:
  1. Import into fresh (empty) repository
  2. Delete target folder then re-import (without restarting Alfresco)
• Record the average duration of the two runs
• Modify only one configuration parameter at a time, resetting earlier modifications in between
Environments
Environment 1
• 2009 model MacBook Pro
• 2.8GHz dual-core CPU
• 4GB RAM
• Solid State Drive (Toshiba OEM)
• 64-bit Mac OS X Lion 10.7.1
• MySQL 5.1
• Apple JDK 1.6.0_26
Environment 2
• 2006 model ThinkPad T60
• 2.33GHz dual-core CPU
• 3GB RAM
• Dual hard drives (Seagate, Hitachi)
  • First used for source directory
  • Second used for Alfresco repository
• 64-bit Ubuntu Natty Narwhal 11.04
• MySQL 5.1
• OpenJDK 1.6.0_22
NOTE 1: Neither of these environments is “production grade”!
NOTE 2: These environments are not directly comparable!
Content Sets
Name                          # Folders   # Files   Total Size   Notes
Typical                       38          4,640     1.44GB
Extreme File Size             1           9         4.41GB
Extreme File Volume           4           11,100    521.7KB
Extreme Directory Structure   1,021       0         0B           100 levels of nesting
Performance Analysis: Repository Tuning Results
Baseline
Notes:
• Repository tuned as per Day Zero Config Guide
• BFSIT has default configuration
Observations:
• Environment 2 is significantly slower at creating cm:folder nodes
  • Theory: creating cm:folder nodes is “seeky” (more on this later)
Disable User Quotas
Observations:
• Quota calculation performance is proportional to the number of cm:content nodes
• Quota calculation performance is not affected by content size
Disable In-txn Indexing
Notes:
• This configuration is not compatible with Share 3.x!
Observations:
• Transactional indexing slows Alfresco down a lot, particularly in environment 2
  • Theory: indexing is highly “seeky”
Disable Indexing Entirely
Notes:
• This configuration is not compatible with Share 3.x!
• This configuration functionally cripples Alfresco!
Observations:
• Some contention between ingestions & indexing (even async)
  • Theory: SOLR integration in 4.x should provide similar performance
Optimal Repository Configuration
Optimal repository configuration, without functionally crippling Alfresco, is:
• Disable user quotas:
    system.usages.enabled=false
• Disable in-transaction indexing:
    index.tracking.disableInTransactionIndexing=true
    alfresco.cluster.name=dummyCluster
  • Indexing still occurs, just not synchronously in-transaction
  • Incompatible with Share 3.x, but can be disabled temporarily during import, then re-enabled post-import
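Since these settings are meant to be switched on for the import and reverted afterwards, a small helper that rewrites a properties file can make the toggle repeatable. The sketch below is only an illustration: the helper name `apply_overrides` is hypothetical, and it assumes the settings live in a plain `key=value` properties file (for example `alfresco-global.properties`, which the slides do not name). Only the three property names and values come from the slides.

```python
# Property values taken from the slides; apply them only for the import window.
IMPORT_OVERRIDES = {
    "system.usages.enabled": "false",
    "index.tracking.disableInTransactionIndexing": "true",
    "alfresco.cluster.name": "dummyCluster",
}


def apply_overrides(properties_text, overrides):
    """Return properties_text with each override replaced in place or appended."""
    remaining = dict(overrides)
    output = []
    for line in properties_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_setting = "=" in line and not line.lstrip().startswith("#")
        if is_setting and key in remaining:
            # Replace the existing value for this key.
            output.append(f"{key}={remaining.pop(key)}")
        else:
            # Keep comments, blank lines, and unrelated settings untouched.
            output.append(line)
    # Append any overrides that were not already present in the file.
    output.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(output) + "\n"
```

Keeping a copy of the original text before calling the helper gives a simple way to restore the pre-import configuration afterwards.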
Optimal Repository Configuration – Results
Notes:
• This configuration is not compatible with Share 3.x!
Observations:
• Slower environment (2) benefits more than the faster environment (1)
• Configuration can't speed up import of large files
  • Requires faster storage devices (e.g. RAID 10)
Average speedup of ~40%
Performance Analysis: BFSIT Tuning Results
Worker Thread Pool Sizes
Notes:
• Baseline is optimal repository configuration
• Only the “Typical” content set was used for testing
Observations:
• Multi-threading is mostly irrelevant
  • Not surprising, given ingestion is I/O bound
• Steady improvement in environment 1
  • Theory: concurrent I/O support in SSD
Batch Weights
Observations:
• Larger batches = better performance
…HOWEVER…
• UI responsiveness got worse
  • A classic trade-off
• Ultimately, performance similar to baseline (batch weight = 100)
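To make the batch weight setting concrete, here is a minimal sketch of how a list of importable items might be packed into batches under a weight limit. The slides do not define how an item's weight is computed, so the default used here (every item weighs 1) and the function name `split_into_batches` are assumptions for illustration only.

```python
def split_into_batches(items, batch_weight, weigh=lambda item: 1):
    """Greedily pack items, in order, into batches of at most batch_weight.

    An item heavier than batch_weight still gets a batch of its own.
    """
    batches = []
    current, current_weight = [], 0
    for item in items:
        weight = weigh(item)
        # Close the current batch if adding this item would exceed the limit.
        if current and current_weight + weight > batch_weight:
            batches.append(current)
            current, current_weight = [], 0
        current.append(item)
        current_weight += weight
    if current:
        batches.append(current)
    return batches


# 250 unit-weight items: three batches at weight 100, a single batch at weight 1000.
small = split_into_batches(list(range(250)), batch_weight=100)
large = split_into_batches(list(range(250)), batch_weight=1000)
```

Under these assumptions a larger weight means fewer, bigger batches per directory, which matches the trade-off described above: less per-batch overhead, but each batch occupies the repository for longer.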
Optimal BFSIT Configuration
Optimal BFSIT configuration:
• High thread count (mostly irrelevant):
    alfresco-bulk-filesystem-import.threadpool.size.core=48
    alfresco-bulk-filesystem-import.threadpool.size.max=48
• More importantly, high batch weight:
    alfresco-bulk-filesystem-import.batch.weight=1000
  • Impacts UI responsiveness
  • Could reduce if needed, at little cost
Optimal BFSIT Configuration - Results
Observations:
• Modest improvement over baseline
  • Implies default BFSIT configuration is close to optimal
Average speedup of ~6.5%
Rethinking the Problem
What if the BFSIT didn’t have to stream content into the repository at all?
What if the source content was already in the contentstore and only had to be “linked” into the repository?
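The "linking" idea can be illustrated with a small check: if a source file already sits under the contentstore directory, the importer only needs a reference to it rather than a copy. The sketch below is an assumption-laden illustration, not the tool's implementation: the function name `in_place_content_url` is hypothetical, and the `store://` prefix followed by the path relative to the contentstore root is assumed as the form of the reference.

```python
from pathlib import Path


def in_place_content_url(source_file, contentstore_root):
    """Return an assumed 'store://' reference if source_file is inside the
    contentstore, or None if the content would have to be streamed in."""
    source = Path(source_file).resolve()
    root = Path(contentstore_root).resolve()
    try:
        relative = source.relative_to(root)
    except ValueError:
        # The file lives outside the contentstore: no in-place import possible.
        return None
    return "store://" + relative.as_posix()
```

Because no bytes are copied in the in-place case, the cost of an import no longer depends on file size, which is consistent with the largest gains showing up for the extreme file size content set.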
In-place Import
Notes:
• Baseline is optimal repository configuration
• Optimal repository & BFSIT configuration
Observations:
• Improvement across the board
• Best improvement is extreme file size case: 375X faster!
Average speedup of ~60%
Performance Analysis: Conclusions
Conclusions
Results:
• Minimum improvement of 6%
• Average improvement of 60%
• Maximum improvement of 99.7%
• In absolute terms, saw performance of up to:
  • 16GB / sec
  • 120 nodes / sec
Recall this wasn’t on production hardware!
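The "375X faster" and "99.7% improvement" figures line up if an improvement is read as the percentage reduction in elapsed time. That reading is an assumption (the slides do not state the formula), and the function names below are hypothetical; the sketch just shows the conversion in both directions.

```python
def improvement_percent(speed_factor):
    """Percentage reduction in elapsed time for a run that is speed_factor times faster."""
    return (1.0 - 1.0 / speed_factor) * 100.0


def speed_factor(improvement_pct):
    """How many times faster a run is, given its percentage reduction in elapsed time."""
    return 1.0 / (1.0 - improvement_pct / 100.0)


best_case = round(improvement_percent(375), 1)  # 99.7
```

Under the same reading, a 50% improvement corresponds to a run that is twice as fast, which is a useful sanity check when comparing percentage figures with "N times faster" claims.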
Conclusions
Developers:
• Macro-optimization will always outperform micro-optimization!
• Multi-threading is not a magic bullet! It's only helpful if a given operation is CPU bound and can be parallelised.
Administrators:
• Use the Day Zero Configuration Guide for every install you do!
• Don't assume superficially similar environments will perform similarly
• For bulk ingestions Alfresco is (mostly) I/O bound
BFSIT Roadmap
BFSIT Roadmap
Official roadmap is on the Google Code project’s wiki [2].
BFSIT v1.1 – Performance:
• Issue #91: Optimization of directory analysis phase [complete].
• Issue #8: Multi-threaded imports [complete].
• Issue #86: In-place imports [complete].
• Issue #77: Graphical display of throughput.
• Issue #17: Test various different dimensions to see how they affect performance [complete – this talk!]
BFSIT v1.2 – Alfresco 4.0, Usability & Performance:
• Issue #92: Test on Alfresco 4.0
• Issue #26: Integrate into Share's administration console
• Issue #94: Investigate use of Alfresco's BatchProcessor framework for the multi-threaded importer
• Issue #96: Measure performance of alternative batching strategies
• Issue #79: Reimplement the bulk filesystem import as a subsystem
• Issue #62: Add support for cm:content properties
BFSIT v1.3+:
• You tell me – I'm always keen to hear feedback!
• The issues list [3] and mailing list [4] are great ways to start getting involved in the project
References
[1] http://code.google.com/p/alfresco-bulk-filesystem-import/
[2] http://code.google.com/p/alfresco-bulk-filesystem-import/wiki/Roadmap
[3] http://code.google.com/p/alfresco-bulk-filesystem-import/issues/list
[4] http://groups.google.com/group/alfresco-bulk-filesystem-import
Also:
• http://blogs.alfresco.com/wp/pmonks/2009/10/22/bulk-import-from-a-filesystem/
• Sessions:
  • BP-1 – Performance Tuning
  • BP-6 – Repository Customization Best Practices
  • BP-9 – Share Customization Best Practices
Questions?
Appendix A – “Typical” Content Set Distributions