
UnicodeEncodeError: 'ascii' codec can't encode characters when calling self._output(request.encode('ascii')) #63

Open
Lima-Codes opened this issue Mar 31, 2021 · 2 comments

Comments

@Lima-Codes

Lima-Codes commented Mar 31, 2021

Describe the bug
When crawling websites that have non-ASCII characters in the URL (for example the character é), I get this error:

UnicodeEncodeError: 'ascii' codec can't encode characters when calling self._output(request.encode('ascii'))

To Reproduce
Steps to reproduce the behavior:

  1. Run seoanalyze https://www.archi-graph.com/
  2. This website has pages with URLs containing non-ascii characters and will throw the error above
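The failure can be reproduced in isolation: Python's http.client builds the request line with request.encode('ascii'), which raises as soon as the URL contains a raw non-ASCII character. A minimal sketch (the path below is hypothetical, not taken from the crawled site):

```python
# Sketch of the underlying failure: HTTP request lines must be ASCII, so
# encoding a path containing a raw "é" raises UnicodeEncodeError, just as
# http.client's self._output(request.encode('ascii')) does.
path = "/catégorie/"
try:
    path.encode("ascii")
    raised = False
except UnicodeEncodeError as exc:
    raised = True
    print(exc)
print("raised:", raised)
```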

Expected behavior
The program should run normally.

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: N/A

Smartphone (please complete the following information):
N/A

Additional context
I propose a fix that sanitizes all URLs passed to the get method in the http module.

@Lima-Codes
Author

Proposed fix in the http module

import certifi
import urllib3
from urllib import parse


class Http:
    def __init__(self):
        user_agent = {'User-Agent': 'Mozilla/5.0'}
        self.http = urllib3.PoolManager(
            timeout=urllib3.Timeout(connect=1.0, read=2.0),
            cert_reqs='CERT_REQUIRED',
            ca_certs=certifi.where(),
            headers=user_agent
        )

    def get(self, url):
        # Percent-encode the URL before the request is built, so http.client
        # never has to ASCII-encode a raw non-ASCII character.
        sanitized_url = self.sanitize_url(url)
        return self.http.request('GET', sanitized_url)

    @staticmethod
    def sanitize_url(url):
        # Split the URL, percent-encode the path (the component where the
        # non-ASCII characters appear here), and reassemble it.
        scheme, netloc, path, query, fragment = parse.urlsplit(url)
        path = parse.quote(path)
        sanitized_url = parse.urlunsplit((scheme, netloc, path, query, fragment))
        return sanitized_url


http = Http()

Adding the sanitize_url static method fixes the issue described above.

Tested successfully by running seoanalyze https://www.archi-graph.com/ in the command line.
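For illustration, here is the sanitization step on its own (the URL below is a made-up example; the behavior follows urllib.parse):

```python
from urllib import parse

def sanitize_url(url):
    # Same logic as the proposed Http.sanitize_url: percent-encode the path
    # component and reassemble the URL.
    scheme, netloc, path, query, fragment = parse.urlsplit(url)
    path = parse.quote(path)
    return parse.urlunsplit((scheme, netloc, path, query, fragment))

# "é" becomes "%C3%A9", so the request line is pure ASCII.
print(sanitize_url("https://example.com/catégorie/"))
# → https://example.com/cat%C3%A9gorie/
```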

@sethblack
Owner

Nice. Thank you for this. I can get your fix dropped in the next release.
