naitoh
(NAITOH Jun)
August 27, 2024, 6:35pm
I propose adding a SAX-based REXML backend to XmlMini, alongside the existing SAX backends (LibXMLSAX and NokogiriSAX), to speed up XML parsing.
The purpose of this feature is to provide faster parsing than REXML(DOM) when libxml-ruby or nokogiri is not available.
Check out my PR for this:
rails:main
← naitoh:add_rexml_sax_for_xml_mini
opened 07:02AM - 03 Aug 24 UTC
### Motivation / Background
This PR adds a SAX-based parser backed by REXML, matching the SAX variants that the other XmlMini backends already provide.
Use cases: Provide faster parsing than REXML(DOM) in situations where libxml-ruby or nokogiri is not available.
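To illustrate the difference between the two REXML parsing styles, here is a minimal sketch. It is not the code from this PR: `SAMPLE_XML`, `dom_child_ids`, `sax_child_ids`, and `ChildIdListener` are hypothetical names chosen for the example. The DOM version builds the whole document tree before reading it, while the SAX version collects values from callbacks as the parser streams through the input.

```ruby
require "rexml/document"
require "rexml/parsers/sax2parser"
require "rexml/sax2listener"

# Hypothetical sample input, used only for this illustration.
SAMPLE_XML = '<root><child id="1"/><child id="2"/></root>'

# DOM style: parse the whole document into a tree, then walk the tree.
def dom_child_ids(xml)
  doc = REXML::Document.new(xml)
  ids = []
  doc.root.elements.each { |el| ids << el.attributes["id"] }
  ids
end

# SAX style: a listener receives events while the parser reads the input,
# so no document tree is built.
class ChildIdListener
  include REXML::SAX2Listener # supplies no-op defaults for unused callbacks

  attr_reader :ids

  def initialize
    @ids = []
  end

  # Called once per opening tag; attributes arrive as a Hash of name => value.
  def start_element(uri, localname, qname, attributes)
    @ids << attributes["id"] if qname == "child"
  end
end

def sax_child_ids(xml)
  listener = ChildIdListener.new
  parser = REXML::Parsers::SAX2Parser.new(xml)
  parser.listen(listener)
  parser.parse
  listener.ids
end
```

Both helpers return the same list of `id` values for the sample input; the SAX variant simply avoids allocating a node object for every element, which is where the speedup is expected to come from.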
### Detail
This Pull Request requires REXML 3.3.4 or later to pass the `XMLMiniEngineTest::EngineTests.test_exception_thrown_on_expansion_attack` test.
See: https://www.ruby-lang.org/en/news/2024/08/01/dos-rexml-cve-2024-41946/
### Additional information
- sax_bench.yaml
```
loop_count: 100
contexts:
  - name: YJIT=OFF
  - name: YJIT=ON
    prelude: |
      RubyVM::YJIT.enable
prelude: |
  $LOAD_PATH.unshift(File.expand_path("activesupport/lib"))
  require 'active_support'
  require 'active_support/core_ext'
  n_elements = 5000
  n_attributes = 10
  def build_xml(n_elements, n_attributes)
    xml = '<?xml version="1.0"?><root>'
    n_elements.times do |i|
      xml << '<child '
      n_attributes.times {|j| xml << "id#{j}=\"#{i}\" " }
      xml << '/>'
    end
    xml << '</root>'
  end
  xml = build_xml(n_elements, n_attributes)
benchmark:
  'LibXML' : |
    ActiveSupport::XmlMini.backend = 'LibXML'
    hash = Hash.from_xml(xml)
  'LibXMLSAX' : |
    ActiveSupport::XmlMini.backend = 'LibXMLSAX'
    hash = Hash.from_xml(xml)
  'Nokogiri' : |
    ActiveSupport::XmlMini.backend = 'Nokogiri'
    hash = Hash.from_xml(xml)
  'NokogiriSAX' : |
    ActiveSupport::XmlMini.backend = 'NokogiriSAX'
    hash = Hash.from_xml(xml)
  'REXML' : |
    ActiveSupport::XmlMini.backend = 'REXML'
    hash = Hash.from_xml(xml)
  'REXMLSAX' : |
    ActiveSupport::XmlMini.backend = 'REXMLSAX'
    hash = Hash.from_xml(xml)
```
```
$ ruby -v
ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [arm64-darwin22]
$ gem list rexml libxml-ruby nokogiri
*** LOCAL GEMS ***
rexml (3.3.4)
*** LOCAL GEMS ***
libxml-ruby (5.0.3)
*** LOCAL GEMS ***
nokogiri (1.16.7 arm64-darwin, 1.16.6 arm64-darwin, 1.16.5)
```
```
$ benchmark-driver sax_bench.yaml
Calculating -------------------------------------
                       YJIT=OFF     YJIT=ON
              LibXML     16.818      19.854 i/s -     100.000 times in 5.946048s 5.036691s
           LibXMLSAX     18.235      23.218 i/s -     100.000 times in 5.483908s 4.306973s
            Nokogiri     16.512      16.512 i/s -     100.000 times in 6.056280s 6.056372s
         NokogiriSAX     13.469      15.905 i/s -     100.000 times in 7.424489s 6.287452s
               REXML      3.341       5.426 i/s -     100.000 times in 29.928804s 18.428204s
            REXMLSAX      4.390       6.301 i/s -     100.000 times in 22.777792s 15.869573s
```
- YJIT=OFF : REXMLSAX is 1.31x faster than REXML(DOM)
- YJIT=ON : REXMLSAX is 1.16x faster than REXML(DOM)
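
The two ratios above come from dividing the REXMLSAX throughput by the REXML throughput in each context. A small sketch of that arithmetic, using the i/s figures from the benchmark output (the `speedup` helper and `RESULTS` constant are hypothetical names for this illustration):

```ruby
# Iterations per second taken from the benchmark-driver output above.
RESULTS = {
  "YJIT=OFF" => { rexml: 3.341, rexml_sax: 4.390 },
  "YJIT=ON"  => { rexml: 5.426, rexml_sax: 6.301 },
}.freeze

# Speedup is the candidate's throughput divided by the baseline's.
def speedup(baseline_ips, candidate_ips)
  (candidate_ips / baseline_ips).round(2)
end

RESULTS.each do |context, ips|
  puts "#{context}: REXMLSAX is #{speedup(ips[:rexml], ips[:rexml_sax])}x faster than REXML(DOM)"
end
```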
### Checklist
Before submitting the PR make sure the following are checked:
* [x] This Pull Request is related to one change. Unrelated changes should be opened in separate PRs.
* [x] Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: `[Fix #issue-number]`
* [x] Tests are added or updated if you fix a bug or add a feature.
* [x] CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.