Parsing is extremely slow if there are a lot of CR LF characters in the stream #67

vgovtva opened this issue Jul 20, 2023 · 1 comment

vgovtva commented Jul 20, 2023

I noticed a slowdown when parsing bodies that contain a lot of '\r\n' sequences in a row.
It seems the parser makes far too many callbacks to on_part_data, which can trigger many downstream operations that are not at all necessary (see the example and output below).

Python version: 3.9.2
python-multipart version: 0.0.6

Here's a demonstration:

$ cat form.py
import timeit
import requests
from multipart.multipart import parse_options_header, MultipartParser


class MultiPartException(Exception):
    # Defined locally so the script runs standalone; the original snippet
    # used this name without defining or importing it.
    pass


class TestFormDataParser:
    def __init__(self, headers):
        # Parse the Content-Type header to get the multipart boundary.
        self.headers = headers
        _, params = parse_options_header(self.headers['Content-Type'])
        self.charset = params.get(b'charset', 'utf-8')
        if isinstance(self.charset, bytes):
            self.charset = self.charset.decode('latin-1')
        try:
            self.boundary = params[b'boundary']
        except KeyError:
            raise MultiPartException('Missing boundary in multipart.')

        self.callbacks = {
            'on_part_begin': lambda: None,
            'on_part_data': self.on_part_data,
            'on_part_end': lambda: None,
            'on_header_field': lambda *_: None,
            'on_header_value': lambda *_: None,
            'on_header_end': lambda: None,
            'on_headers_finished': lambda: None,
            'on_end': lambda: None,
        }
        self.f = MultipartParser(self.boundary, self.callbacks)
        self.on_data_calls = 0

    def on_part_data(self, *_):
        self.on_data_calls += 1

    def parse(self, data):
        self.f.write(data)
        self.f.finalize()
        print(f'on_part_data was called {self.on_data_calls} times')

def test():

    test_data = [
        (b'fo' * 300000 + b'\r\n' + b'foobar', "b'fo' * 300000 + b'\\r\\n' + b'foobar'"),
        (b'fo' * 290000 + b'\r\n' * 10000 + b'foobar', "b'fo' * 290000 + b'\\r\\n' * 10000 + b'foobar'"),
        (b'fo' * 150000 + b'\r\n' * 150000 + b'foobar', "b'fo' * 150000 + b'\\r\\n' * 150000 + b'foobar'"),
        ((b'fo' + b'\r\n') * 150000 + b'foobar', "(b'fo' + b'\\r\\n') * 150000 + b'foobar'"),
        ((b'fo' + b'\r\n' * 5) * 50000 + b'foobar', "(b'fo' + b'\\r\\n' * 5) * 50000 + b'foobar'"),
        ((b'fo' + b'\r\n' * 50) * 5000 + b'foobar', "(b'fo' + b'\\r\\n' * 50) * 5000 + b'foobar'"),
        (b'fo' + b'\r\n' * 300000 + b'foobar', "b'fo' + b'\\r\\n' * 300000 + b'foobar'"),
    ]

    def run_it(data):
        # Build a fresh multipart request around the sample and parse it.
        prep = requests.Request('POST', 'http://example.com', files={'sample': data, 'other': b'baz'}).prepare()
        parser = TestFormDataParser(prep.headers)
        parser.parse(prep.body)

    _min = float('inf')
    _max = 0
    iters = 1
    for sample, data_format in test_data:
        sample_size = len(sample)
        crlf_count = sample.count(b'\r\n')
        measurement_time = timeit.timeit(lambda: run_it(sample), setup="pass", number=iters)
        print(f'data format: {data_format}')
        print(f'It took {measurement_time:.3f}s to run {iters} iter with sample size: {sample_size}.')
        print(f'Sample contains {crlf_count} \\r\\n sequences.\n\n')
        if measurement_time < _min:
            _min = measurement_time
        if measurement_time > _max:
            _max = measurement_time

    print(f'Biggest difference is {_max-_min:.2f}s ({_max/_min*100:.2f}%) for samples of the same size')

if __name__ == '__main__':
    test()

And here's the output that I get:

on_part_data was called 4 times
data format: b'fo' * 300000 + b'\r\n' + b'foobar'
It took 0.005s to run 1 iter with sample size: 600008.
Sample contains 1 \r\n sequences.


on_part_data was called 9986 times
data format: b'fo' * 290000 + b'\r\n' * 10000 + b'foobar'
It took 0.057s to run 1 iter with sample size: 600006.
Sample contains 10000 \r\n sequences.


on_part_data was called 149986 times
data format: b'fo' * 150000 + b'\r\n' * 150000 + b'foobar'
It took 1.337s to run 1 iter with sample size: 600006.
Sample contains 150000 \r\n sequences.


on_part_data was called 14 times
data format: (b'fo' + b'\r\n') * 150000 + b'foobar'
It took 0.005s to run 1 iter with sample size: 600006.
Sample contains 150000 \r\n sequences.


on_part_data was called 14 times
data format: (b'fo' + b'\r\n' * 5) * 50000 + b'foobar'
It took 0.004s to run 1 iter with sample size: 600006.
Sample contains 250000 \r\n sequences.


on_part_data was called 170002 times
data format: (b'fo' + b'\r\n' * 50) * 5000 + b'foobar'
It took 1.463s to run 1 iter with sample size: 510006.
Sample contains 250000 \r\n sequences.


on_part_data was called 299986 times
data format: b'fo' + b'\r\n' * 300000 + b'foobar'
It took 2.096s to run 1 iter with sample size: 600008.
Sample contains 300000 \r\n sequences.


Biggest difference is 2.09s (56888.98%) for samples of the same size

In our use case this can lead to some really slow upload times: for example, a 9 MB file filled with CR LF can take 2000s+ to parse. In applications based on FastAPI or Starlette (web frameworks), this can block the main thread for seconds, and up to a minute at a time, depending on the size of the chunks in the request stream.
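
To illustrate the blocking part, here is a rough sketch of how an ASGI framework might feed the request stream into this parser. The wiring is illustrative, not Starlette's actual code; the receive callable and the message keys follow the ASGI spec. Each parser.write() call runs synchronously on the event loop, so one pathological chunk stalls everything else scheduled on it:

async def parse_request(receive, parser):
    # Illustrative ASGI-style consumer, not real framework code.
    while True:
        message = await receive()
        # write() is synchronous; a CRLF-heavy chunk can hold the event
        # loop for however long it takes to parse.
        parser.write(message.get('body', b''))
        if not message.get('more_body', False):
            break
    parser.finalize()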

This is an edge case, so to speak, but because of it we are forced to use a different form-data parser.

(If any Python web server uses this library to parse incoming data, one can quite easily DoS the service.)
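
One possible stop-gap on the application side is to cap the time spent parsing a single request, since body-size limits alone don't help here (a small CRLF-heavy body is enough). A minimal sketch, assuming the same write/finalize interface as the script above:

import time

def parse_with_budget(parser, chunks, budget_s=5.0):
    # Hypothetical guard: abort when parsing one request exceeds a
    # wall-clock budget, checked between chunks.
    start = time.monotonic()
    for chunk in chunks:
        parser.write(chunk)
        if time.monotonic() - start > budget_s:
            raise TimeoutError('multipart parsing exceeded time budget')
    parser.finalize()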

rdenadai commented

At first glance, after investigating a little, it looks like this line is being hit many times by your example code:

https://github.com/andrew-d/python-multipart/blob/master/multipart/multipart.py#L1327

Adding a small elif c in (CR, LF): branch that keeps advancing the index seems to work at first glance, but I wonder if there is a better approach here.

if boundary[index] == c:
    # If we found a match for our boundary, we send the
    # existing data.
    if index == 0:
        data_callback('part_data')

    # The current character matches, so continue!
    index += 1
elif c in (CR, LF):
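    # Treat stray CR/LF as potentially still part of the delimiter and
    # keep advancing, instead of resetting to 0 (which flushes the
    # pending data nearly one byte at a time).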
    index += 1
else:
    index = 0

Still, in some cases you would have lots of calls...

on_part_data was called 4 times
data format: b'fo' * 300000 + b'\r\n' + b'foobar'
It took 0.006s to run 1 iter with sample size: 600008.
Sample contains 1 \r\n sequences.

on_part_data was called 3 times
data format: b'fo' * 290000 + b'\r\n' * 10000 + b'foobar'
It took 0.044s to run 1 iter with sample size: 600006.
Sample contains 10000 \r\n sequences.

on_part_data was called 3 times
data format: b'fo' * 150000 + b'\r\n' * 150000 + b'foobar'
It took 0.591s to run 1 iter with sample size: 600006.
Sample contains 150000 \r\n sequences.

on_part_data was called 14 times
data format: (b'fo' + b'\r\n') * 150000 + b'foobar'
It took 0.005s to run 1 iter with sample size: 600006.
Sample contains 150000 \r\n sequences.

on_part_data was called 6 times
data format: (b'fo' + b'\r\n' * 5) * 50000 + b'foobar'
It took 0.005s to run 1 iter with sample size: 600006.
Sample contains 250000 \r\n sequences.

on_part_data was called 10002 times
data format: (b'fo' + b'\r\n' * 50) * 5000 + b'foobar'
It took 1.081s to run 1 iter with sample size: 510006.
Sample contains 250000 \r\n sequences.

on_part_data was called 3 times
data format: b'fo' + b'\r\n' * 300000 + b'foobar'
It took 1.157s to run 1 iter with sample size: 600008.
Sample contains 300000 \r\n sequences.

Biggest difference is 1.15s (24253.20%) for samples of the same size
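
One direction for a better approach might be to search for the whole delimiter with bytes.find() instead of stepping a byte-at-a-time state machine, so runs of CR/LF never fragment the data callbacks. A rough sketch of the idea, not the library's actual implementation (it assumes the whole body is in memory and ignores part headers, chunked input, and the closing -- terminator):

def split_part_data(body: bytes, boundary: bytes) -> list:
    # Hypothetical helper: locate each full '\r\n--boundary' delimiter
    # with bytes.find (which runs at C speed) and emit the data between
    # delimiters as single large chunks.
    delimiter = b'\r\n--' + boundary
    chunks = []
    pos = 0
    while True:
        hit = body.find(delimiter, pos)
        if hit == -1:
            chunks.append(body[pos:])  # data after the last delimiter
            return chunks
        chunks.append(body[pos:hit])   # one coalesced on_part_data chunk
        pos = hit + len(delimiter)

With something like this, each part's data would surface as one chunk no matter how many CR LF sequences it contains.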