Improve performance, especially in data with many CR-LF #137

Open
wants to merge 2 commits into
base: master

jhnstrk
Contributor

@jhnstrk jhnstrk commented Mar 31, 2024

The code changes here started as an attempt to improve performance on data containing many CRLF pairs (#67). In the current code these cause performance problems because CRLF matches the start of a boundary pattern, so every CRLF triggers a partial match. This makes advancing through the data very slow due to the number of callbacks.

By far the greatest optimization was removing the partial Boyer-Moore-Horspool (BMH) implementation and replacing it with `bytes.find`, plus a bit of logic to ensure that partial matches at the end of the buffer aren't missed. `bytes.find` appears to be significantly faster (around 3 times) on representative data. I can make a pull request with just this change if you are interested.
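As a rough illustration of the idea (not the code in this PR), the search could look something like the sketch below. The function name `find_boundary` and its return convention are hypothetical, chosen only for this example.

```python
from __future__ import annotations


def find_boundary(data: bytes, boundary: bytes, start: int = 0) -> tuple[int, int]:
    """Hypothetical helper: return (index, matched_len).

    matched_len == len(boundary) means a complete match at `index`.
    A smaller non-zero value means the chunk ends part-way through a
    possible boundary that begins at `index`.
    (-1, 0) means there is no match at all.
    """
    idx = data.find(boundary, start)
    if idx != -1:
        return idx, len(boundary)

    # No complete match: check whether a proper prefix of the boundary
    # sits at the very end of the chunk, trying the longest prefix first.
    max_len = min(len(boundary) - 1, len(data) - start)
    for n in range(max_len, 0, -1):
        if data.endswith(boundary[:n]):
            return len(data) - n, n
    return -1, 0
```

The tail check only runs once per chunk, when `bytes.find` fails, so CRLF pairs in the middle of the data cost nothing beyond the built-in search.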

In this update I also tried to reduce the number of `on_part_data` callback calls to a minimum. Whereas the old code called it every time a partial boundary match was found (i.e. on every CRLF), the new code calls it only when necessary. The conditions for calling `on_part_data` are now:

  1. A complete boundary match is found, either for an end of part or final end.
  2. The currently loaded data buffer has been exhausted.
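The two conditions above can be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: it treats everything between boundaries as part data (ignoring headers), ignores partial matches at the end of the chunk, and assumes the callback takes `(data, start, end)`. The name `emit_part_data` is hypothetical.

```python
def emit_part_data(chunk: bytes, boundary: bytes, on_part_data) -> int:
    """Hypothetical sketch: emit part data and return the number of callback calls."""
    calls = 0
    pos = 0
    while pos < len(chunk):
        idx = chunk.find(boundary, pos)
        # Emit up to the complete boundary (condition 1) or, if there is
        # none, up to the end of the loaded data (condition 2).
        end = idx if idx != -1 else len(chunk)
        if end > pos:
            on_part_data(chunk, pos, end)
            calls += 1
        if idx == -1:
            break
        pos = idx + len(boundary)
    return calls


pieces = []
chunk = b"line1\r\nline2\r\nline3\r\n--Bnext"
n_calls = emit_part_data(
    chunk, b"\r\n--B", lambda data, start, end: pieces.append(data[start:end])
)
```

Here the CRLF pairs inside the first span do not trigger extra callbacks; the span is delivered in a single call.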

The significant complication is what happens when a partial boundary match overlaps the end of the loaded data. This was previously handled with a look-behind buffer, but the buffer is mostly unnecessary: since we are always matching boundary bytes, the look-behind buffer is always just a copy of the boundary. Only the last few bytes may vary (CRLF vs. -- depending on whether it is a part boundary or the final boundary). Hitting this condition should be very rare, however, and it is handled in the code.
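To illustrate why a separate look-behind buffer is not needed, here is a minimal sketch under simplifying assumptions: only the count of matched boundary bytes is carried over from the previous chunk, the next chunk is long enough to settle the match, and the part-versus-final suffix is not considered. The name `resolve_partial` is hypothetical and is not taken from the PR.

```python
def resolve_partial(boundary: bytes, matched: int, next_chunk: bytes):
    """Hypothetical sketch: resolve a partial match carried over a chunk edge.

    `matched` is how many boundary bytes ended the previous chunk.
    Returns (data_to_emit, consumed): bytes to emit as part data if the
    match was a false alarm, and how many bytes of `next_chunk` belong
    to the boundary if it was real.
    """
    rest = boundary[matched:]
    if next_chunk.startswith(rest):
        # The boundary completes in the next chunk; nothing held back is data.
        return b"", len(rest)
    # False alarm: the held-back bytes are, by construction, exactly
    # boundary[:matched], so they can be rebuilt without a buffer.
    return boundary[:matched], 0
```

A real parser would also need to handle a next chunk that is shorter than the remaining boundary bytes, and a new match that begins inside the held-back bytes.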

Drops the look-behind buffer since the content is always the boundary.
The Boyer-Moore-Horspool algorithm was removed and replaced with Python's built-in `find` method. This appears to be faster, sometimes by an order of magnitude.