
UnicodeEncodeError: 'ascii' codec can't encode characters when calling self._output(request.encode('ascii')) #63

Open
Lima-Codes opened this issue Mar 31, 2021 · 2 comments

Comments

@Lima-Codes

Lima-Codes commented Mar 31, 2021

Describe the bug
When crawling websites that have non-ASCII characters in the URL (for example the character é), I get this error:

UnicodeEncodeError: 'ascii' codec can't encode characters when calling self._output(request.encode('ascii'))

To Reproduce
Steps to reproduce the behavior:

  1. Run seoanalyze https://www.archi-graph.com/
  2. This website has pages with URLs containing non-ascii characters and will throw the error above
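The failure can be reproduced in isolation: Python's http.client builds the request line with request.encode('ascii'), which raises as soon as the URL contains a raw non-ASCII character. A minimal sketch (the path below is hypothetical, not taken from the crawled site):

```python
# Sketch of the underlying failure: HTTP request lines must be ASCII, so
# encoding a path containing a raw "é" raises UnicodeEncodeError, just as
# http.client's self._output(request.encode('ascii')) does.
path = "/catégorie/"
try:
    path.encode("ascii")
    raised = False
except UnicodeEncodeError as exc:
    raised = True
    print(exc)
print("raised:", raised)
```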

Expected behavior
The program should run normally.

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: N/A

Smartphone (please complete the following information):
N/A

Additional context
I propose a fix that sanitizes all URLs passed to the get method in the http module.

@Lima-Codes
Author

Proposed fix in the http module

import certifi
import urllib3
from urllib import parse


class Http:
    def __init__(self):
        user_agent = {'User-Agent': 'Mozilla/5.0'}
        self.http = urllib3.PoolManager(
            timeout=urllib3.Timeout(connect=1.0, read=2.0),
            cert_reqs='CERT_REQUIRED',
            ca_certs=certifi.where(),
            headers=user_agent
        )

    def get(self, url):
        # Percent-encode the URL before the request is built, so http.client
        # never has to ASCII-encode a raw non-ASCII character.
        sanitized_url = self.sanitize_url(url)
        return self.http.request('GET', sanitized_url)

    @staticmethod
    def sanitize_url(url):
        # Split the URL, percent-encode the path (the component where the
        # non-ASCII characters appear here), and reassemble it.
        scheme, netloc, path, query, fragment = parse.urlsplit(url)
        path = parse.quote(path)
        sanitized_url = parse.urlunsplit((scheme, netloc, path, query, fragment))
        return sanitized_url


http = Http()

Adding the sanitize_url static method fixes the issue described above.

Tested successfully by running seoanalyze https://www.archi-graph.com/ in the command line.
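For illustration, here is the sanitization step on its own (the URL below is a made-up example; the behavior follows urllib.parse):

```python
from urllib import parse

def sanitize_url(url):
    # Same logic as the proposed Http.sanitize_url: percent-encode the path
    # component and reassemble the URL.
    scheme, netloc, path, query, fragment = parse.urlsplit(url)
    path = parse.quote(path)
    return parse.urlunsplit((scheme, netloc, path, query, fragment))

# "é" becomes "%C3%A9", so the request line is pure ASCII.
print(sanitize_url("https://example.com/catégorie/"))
# → https://example.com/cat%C3%A9gorie/
```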

@sethblack
Owner

Nice. Thank you for this. I can get your fix dropped in the next release.
