Upload: alexandro-colorado

Post on 26-Jun-2015

DESCRIPTION

Performance improvement in OpenOffice.org

TRANSCRIPT

PERFORMANCE IMPROVEMENTS IN CALC
Niklas Nebel, Sun Microsystems

Agenda

  • Introduction and context
  • Local optimizations
  • Handling sheets separately
  • DataPilot performance
  • Load & save outlook

Introduction and Context

Performance work in all of OOo

  • Performance project
    • Big improvements from 3.0 to 3.2
  • Start-up: Cold start of Writer 20% faster
  • Writer load performance
    • Comparable with MS Word 2007
  • Impress load performance
    • Comparable with MS PowerPoint 2007
  • Calc performance
    • Load and save: Up to twice as fast
    • Recalculation: Up to 20 times faster (extreme case)

Local Optimizations

API Usage When Saving Text Cells

  • Filter uses getFormula API method
  • Single quote character added if text can be parsed as a number
  • Unnecessary parsing step
  • Can take up to 17% of CPU time

Querying the Document Null Date

  • Internal representation: Days since the null date
  • File format: XML Schema dates (ISO 8601)
  • Utility method for conversion
    • Queries the null date from the document
    • Several UNO calls
  • Querying once is enough
  • 10% of CPU time if only date cells are used
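The "query once" idea can be sketched as a converter that reads the null date a single time and reuses it for every cell. This is only an illustration: `Document`, `DateConverter` and `civilFromDays` are hypothetical names, not OOo classes, and the null date 1899-12-30 used in the usage note is assumed as the common default.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the document; it counts how often it is queried.
class Document {
public:
    explicit Document(long nullDateDays) : nullDateDays_(nullDateDays), queries_(0) {}
    long queryNullDate() const { ++queries_; return nullDateDays_; }
    int queryCount() const { return queries_; }
private:
    long nullDateDays_;   // null date expressed as days since 1970-01-01
    mutable int queries_;
};

// Converts days since 1970-01-01 to a proleptic Gregorian year/month/day.
void civilFromDays(long z, int& y, unsigned& m, unsigned& d) {
    z += 719468;
    const long era = (z >= 0 ? z : z - 146096) / 146097;
    const unsigned doe = static_cast<unsigned>(z - era * 146097);
    const unsigned yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
    const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    const unsigned mp = (5 * doy + 2) / 153;
    d = doy - (153 * mp + 2) / 5 + 1;
    m = mp < 10 ? mp + 3 : mp - 9;
    y = static_cast<int>(static_cast<long>(yoe) + era * 400 + (m <= 2 ? 1 : 0));
}

// Queries the null date once, in the constructor, then converts many cells.
class DateConverter {
public:
    explicit DateConverter(const Document& doc) : nullDate_(doc.queryNullDate()) {}

    // serial: whole days since the document's null date.
    std::string toIso(long serial) const {
        int y; unsigned m, d;
        civilFromDays(nullDate_ + serial, y, m, d);
        char buf[32];
        const int n = std::snprintf(buf, sizeof buf, "%04d-%02u-%02u", y, m, d);
        assert(n > 0 && n < static_cast<int>(sizeof buf));
        (void)n;
        return buf;
    }
private:
    long nullDate_;
};
```

With a `Document` constructed as `Document doc(-25569)` (1899-12-30 relative to 1970-01-01), one `DateConverter` can serve every date cell of an export run while `doc.queryCount()` stays at 1.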

Collecting Formatted Cell Ranges

  • Collect cell ranges with equal cell formats
    • For generating automatic styles
    • Keep a list of ranges for each set of formats
    • Try to join adjacent ranges
  • Formats are kept and iterated column-wise
    • Can use this information when trying to join
  • Prevents pathological cases
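One way to read "can use this information when trying to join": if ranges arrive column by column and top to bottom, a new range can only continue the most recently added range of the same format, so the list never has to be searched. The sketch below assumes exactly that; `FormatRanges`, `Range` and the integer format id are hypothetical, and joining across neighbouring columns is left out.

```cpp
#include <cassert>
#include <map>
#include <vector>

struct Range { int col, firstRow, lastRow; };

class FormatRanges {
public:
    // Ranges must be added column by column, top to bottom.
    void add(int formatId, const Range& r) {
        assert(r.firstRow <= r.lastRow);
        std::vector<Range>& list = ranges_[formatId];
        if (!list.empty()) {
            Range& last = list.back();
            // Because of the iteration order, only the last entry can be adjacent.
            if (last.col == r.col && last.lastRow + 1 == r.firstRow) {
                last.lastRow = r.lastRow;   // extend instead of appending
                return;
            }
        }
        list.push_back(r);
    }

    const std::vector<Range>& rangesFor(int formatId) const {
        return ranges_.at(formatId);
    }

private:
    std::map<int, std::vector<Range> > ranges_;
};
```

Checking only the last entry keeps each insertion constant-time per format, which is what avoids a quadratic search when a sheet has many small formatted ranges.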

Formula Optimizations

  • String handling when formulas are parsed
    • Functions, references, names are case-insensitive
    • Operators, separators, parentheses are not
    • Reduce case conversion calls
      • 5% of CPU time saved
  • Sorting of values for MEDIAN etc.
    • Not necessary to completely sort the array
    • Use std::nth_element STL method instead
    • Faster calculation after loading
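As an illustration of the MEDIAN point, here is a minimal median built on `std::nth_element`, which only partitions the array around the requested position instead of sorting it completely. The function name and the handling of an even element count are choices made for this sketch, not taken from the Calc source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Takes the values by copy because nth_element reorders them.
double median(std::vector<double> values) {
    assert(!values.empty());
    const std::size_t mid = values.size() / 2;

    // After this call values[mid] is the element a full sort would put there,
    // and everything before it is less than or equal to it.
    std::nth_element(values.begin(), values.begin() + mid, values.end());
    const double upper = values[mid];
    if (values.size() % 2 == 1)
        return upper;

    // Even count: the lower middle is the largest element of the left part.
    const double lower = *std::max_element(values.begin(), values.begin() + mid);
    return (lower + upper) / 2.0;
}
```

`std::nth_element` runs in linear time on average, compared with O(n log n) for a full sort, which matters when MEDIAN is applied to large ranges.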

Formula Recalculation (1)

  • Detection of duplicate notifications
    • When a cell range is modified
    • Parameter range can contain several changed cells
    • Notify each range only once
  • Also useful for single-cell change
    • Parameter range can contain several changed results
    • Extreme case: Issue 95967 20x faster
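The "notify each range only once" rule can be sketched with a set that remembers which listeners have already been notified during one broadcast. `Cell`, `RangeListener` and `broadcast` are hypothetical names, and the linear scan over all listeners is a simplification made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

struct Cell { int col, row; };

// A formula's parameter range, plus a counter so the effect is observable.
struct RangeListener {
    int col1, row1, col2, row2;
    int notifications;

    bool contains(const Cell& c) const {
        return c.col >= col1 && c.col <= col2 && c.row >= row1 && c.row <= row2;
    }
};

// Broadcasts a batch of changed cells; each listener is notified at most once,
// even if its range contains several of the changed cells.
void broadcast(const std::vector<Cell>& changed, std::vector<RangeListener>& listeners) {
    std::unordered_set<std::size_t> notified;
    for (std::size_t c = 0; c < changed.size(); ++c) {
        for (std::size_t i = 0; i < listeners.size(); ++i) {
            assert(listeners[i].col1 <= listeners[i].col2 &&
                   listeners[i].row1 <= listeners[i].row2);
            // insert().second is false if listener i was already notified.
            if (listeners[i].contains(changed[c]) && notified.insert(i).second)
                ++listeners[i].notifications;
        }
    }
}
```

Without the `notified` set, a listener on A1:A10 would be notified three times when three cells of column A change together, and each notification could trigger further work downstream.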

Handling Sheets Separately

Updating Row Heights

  • Optimal row height depends on local conditions
    • Especially fonts
  • Core structures need concrete height values
    • Positioning of shapes: Whole file
      • File format: relative to cell position
      • Internally: absolute positions
    • Screen output: Only single sheet
  • Update row heights
    • After loading: Visible sheet and sheets with shapes
    • Others as needed (display, printing, ...)

Updating Row Heights: Comments

  • Cell comments (formerly: notes) are shapes
  • Often used in large sheets
    • Usually not shown
  • Create shape only when comment is shown
    • Saves time if there are many hidden comments
    • Row heights can be updated later
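Creating the shape only when the comment is shown is a lazy-initialization pattern. A minimal sketch, with hypothetical `CellComment` and `CommentShape` types standing in for the real comment and drawing objects:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Stand-in for a drawing-layer object; a real one would also carry geometry.
struct CommentShape { std::string text; };

class CellComment {
public:
    explicit CellComment(std::string text) : text_(std::move(text)) {}

    // The shape is built on first display, not when the comment is loaded.
    void show() {
        if (!shape_)
            shape_.reset(new CommentShape{text_});
    }

    bool hasShape() const { return shape_ != nullptr; }

    // Precondition: show() has been called at least once.
    const CommentShape& shape() const {
        assert(shape_ != nullptr);
        return *shape_;
    }

private:
    std::string text_;
    std::unique_ptr<CommentShape> shape_;
};
```

A comment that is never displayed never gets a shape, so it needs no position, and the row heights of its sheet do not have to be known at load time.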

Updating Row Heights: Results

  • No effect for single sheet
  • Little improvement for text and numbers
  • 30% CPU time with date cells on many sheets
  • Formula results don't have to be calculated

Partial Saving

  • Don't generate XML elements for whole file
  • Copy unchanged parts on stream level
  • Could copy from temporary storage
    • Storage layer creates copy of the unpacked file
  • Access the original file
    • Uncompress on the fly
  • Cost
    • File access: Read the compressed file
    • CPU: Uncompress

Experiment: Incremental Saving

  • Generate XML elements only for changed cells
    • Proof of concept: Only single-cell changes
  • No additional information kept after loading
  • Minimal parsing to find affected cells in stream
    • Takes extra time
    • Less if affected cells near start of file
  • Results (compared to 3.0):
    • 40 to 70% improvement in CPU time
    • 30 to 50% improvement in total time

Sheet-Wise Saving

  • Handle sheets instead of individual cells
  • Fewer sheets than cells
    • Additional information can be kept in memory
  • Easier to find modified sheets than modified cells
  • One obvious limitation:
    • Only useful with several sheets

Finding Modified Sheets

  • Few code changes for most types of changes
    • Formula notification for cell contents
    • Formula calculation for changed results
    • Cell format changes
    • Column widths or row heights
    • Handled separately: Print ranges, etc.
  • Currently no handling of drawing layer changes
    • All sheets are considered modified
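The bookkeeping described here amounts to one "modified" flag per sheet. The sketch below assumes that the existing notification paths each call a single hook; `SheetModificationTracker` and its method names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class SheetModificationTracker {
public:
    explicit SheetModificationTracker(std::size_t sheetCount)
        : modified_(sheetCount, false) {}

    // Called for content, result, format, column width and row height changes.
    void sheetChanged(std::size_t sheet) {
        assert(sheet < modified_.size());
        modified_[sheet] = true;
    }

    // No per-sheet information for drawing layer changes: flag every sheet.
    void drawingLayerChanged() {
        modified_.assign(modified_.size(), true);
    }

    bool isModified(std::size_t sheet) const {
        assert(sheet < modified_.size());
        return modified_[sheet];
    }

    // After a successful save, all sheets match the stored stream again.
    void resetAfterSave() {
        modified_.assign(modified_.size(), false);
    }

private:
    std::vector<bool> modified_;
};
```

Flagging everything on a drawing layer change is the conservative fallback: it costs a full save but can never leave a stale sheet in the file.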

Automatic Styles

  • Direct formats are collected in automatic styles
    • Referenced by name
      • Generated name (ce1 etc.)
    • One list for the whole document
    • Have to be created with the same names again
  • Implemented for cell contents (incl. comments)
    • Keep a mapping of names to cell/text positions
    • Collect styles for unchanged sheets first
    • Include in existing duplicate detection for other sheets
  • Sheets with shapes always saved normally
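The naming constraint can be sketched as a style pool that is first seeded with the names used by unchanged sheets and then serves regenerated sheets through the same lookup. `AutoStylePool` is a hypothetical class, a format is reduced to a string key, and the "ce" prefix follows the generated names mentioned on the slide.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

class AutoStylePool {
public:
    AutoStylePool() : counter_(0) {}

    // Step 1: re-register a style under the name it had in the loaded file,
    // so that copied stream portions still reference a valid style.
    void addExisting(const std::string& format, const std::string& name) {
        assert(!name.empty());
        byFormat_.insert(std::make_pair(format, name));  // first name wins
        used_.insert(name);
    }

    // Step 2: find or create a style for a cell on a regenerated sheet.
    std::string nameFor(const std::string& format) {
        std::map<std::string, std::string>::const_iterator it = byFormat_.find(format);
        if (it != byFormat_.end())
            return it->second;          // duplicate detection: reuse the style

        std::string name;
        do {
            name = "ce" + std::to_string(++counter_);
        } while (used_.count(name) != 0);  // skip names taken by copied sheets
        used_.insert(name);
        byFormat_[format] = name;
        return name;
    }

private:
    std::map<std::string, std::string> byFormat_;
    std::set<std::string> used_;
    int counter_;
};
```

Seeding before generating is the important ordering: a regenerated sheet may reuse a name from an unchanged sheet, but it must never redefine one.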

Putting the Parts Together

  • When loading a file
    • Compatibility checks: Namespaces, encoding
    • Keep stream positions and style information
  • Steps to save a spreadsheet document
    • meta.xml, styles.xml, embedded objects: as usual
    • content.xml
      • Generate common content and modified sheets
      • For each sheet: Generate or copy stream portion
    • For Save and Save As update stream positions
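The per-sheet "generate or copy" step can be sketched on plain strings. The example assumes that the byte offsets of each sheet within the old content stream were recorded at load time; `SheetInfo`, `generateSheetXml` and the `<sheet>` element are placeholders, and the common content around the sheets is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct SheetInfo {
    std::size_t begin, end;   // byte range of this sheet in the loaded stream
    bool modified;
    std::string name;
};

// Placeholder for the regular export of one sheet.
std::string generateSheetXml(const SheetInfo& s) {
    return "<sheet name=\"" + s.name + "\"/>";
}

// Builds the sheet portion of the new stream and updates the stored positions
// so that the next save can copy from the file just written.
std::string saveSheets(const std::string& oldStream, std::vector<SheetInfo>& sheets) {
    std::string out;
    for (std::size_t i = 0; i < sheets.size(); ++i) {
        SheetInfo& s = sheets[i];
        assert(s.begin <= s.end && s.end <= oldStream.size());
        const std::size_t newBegin = out.size();
        if (s.modified)
            out += generateSheetXml(s);                       // regenerate
        else
            out.append(oldStream, s.begin, s.end - s.begin);  // copy bytes
        s.begin = newBegin;
        s.end = out.size();
        s.modified = false;
    }
    return out;
}
```

Updating `begin` and `end` after every save corresponds to the last bullet: once the file has been rewritten, the old offsets no longer describe it.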

Results

  • Influencing factors
    • Unchanged sheets
    • Type of sheet content
    • CPU time / file access
  • Example
    • Text, numbers, dates
    • 16 sheets
  • Single sheet modified
    • Twice as fast
    • On top of other changes

Formula Recalculation (2)

  • Sheet area is divided into slots
    • 16 columns by 128 rows
    • Range dependency registered in all affected slots
    • Needs attention when row limit is changed
  • Change: Use hash_set instead of set
    • Faster modification of dependency structures
    • Loading time
  • Change: Separate structures per sheet
    • Faster recalculation if several sheets are used
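The slot scheme can be sketched as a grid of hash sets, one per block of 16 columns by 128 rows, with a range registered in every slot it overlaps. The sheet limits of 1024 columns and 65536 rows are an assumption made for this example, `SheetSlots` and the integer listener ids are hypothetical, and `std::unordered_set` stands in for the `hash_set` named on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

const int kSlotCols = 16, kSlotRows = 128;     // slot size from the slide
const int kMaxCols = 1024, kMaxRows = 65536;   // assumed sheet limits
const int kColSlots = kMaxCols / kSlotCols;
const int kRowSlots = kMaxRows / kSlotRows;

struct Range { int col1, row1, col2, row2; };

// Dependency slots for one sheet; keep one instance per sheet.
class SheetSlots {
public:
    SheetSlots() : slots_(static_cast<std::size_t>(kColSlots) * kRowSlots) {}

    // Registers a listener in every slot that the range overlaps.
    void registerListener(const Range& r, int listenerId) {
        assert(0 <= r.col1 && r.col1 <= r.col2 && r.col2 < kMaxCols);
        assert(0 <= r.row1 && r.row1 <= r.row2 && r.row2 < kMaxRows);
        for (int sc = r.col1 / kSlotCols; sc <= r.col2 / kSlotCols; ++sc)
            for (int sr = r.row1 / kSlotRows; sr <= r.row2 / kSlotRows; ++sr)
                slots_[index(sc, sr)].insert(listenerId);
    }

    // Candidates for a changed cell; the caller still has to check whether
    // each candidate's range really contains the cell.
    const std::unordered_set<int>& listenersNear(int col, int row) const {
        return slots_[index(col / kSlotCols, row / kSlotRows)];
    }

private:
    static std::size_t index(int slotCol, int slotRow) {
        return static_cast<std::size_t>(slotCol) * kRowSlots
             + static_cast<std::size_t>(slotRow);
    }

    std::vector<std::unordered_set<int> > slots_;
};
```

Because the slot count is derived from the row limit, raising that limit multiplies the number of slots, which is one way to read "needs attention when row limit is changed".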

DataPilot Performance

DataPilot Memory Usage

  • Issue 55266: Several fields with many items
  • Fix now under way from IBM Symphony team
    • Don't allocate results for all child items
    • New cache table
  • CWS datapilotperf
    • Planned for OOo 3.3
    • Combination of large fields no longer a limitation

Load & Save Outlook

DOM Usage

  • Prototype by Christian Lippka for Impress
    • Use fast SAX to fill a compact DOM representation
    • Import from DOM, possibly parallel to parsing
  • Results for Impress
    • Only 2% improvement for typical presentation
    • Filling DOM tree uses 2% of CPU time
    • Not worth the effort
  • Calc may be different
    • Larger number of XML elements
    • But: Memory usage twice the XML stream size

Further Separation of Sheets

  • Load only the visible sheet
    • Load other sheets as needed, or in background
    • Parse XML fragment from stream, or use DOM
    • Formulas, charts may depend on changed cells
      • Dependencies must be known before saving
  • Parse formulas only as needed
    • Per sheet or individually
    • Already a separate step (but for all formulas)
  • Handle several sheets in parallel
    • More fine-grained locking needed

Q & A

PERFORMANCE IMPROVEMENTS IN CALC
Niklas Nebel
[email_address]