Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #8369: BeautifulSoup iterator returns Tag #8400

Closed
wants to merge 3 commits into from
Closed

Fix #8369: BeautifulSoup iterator returns Tag #8400

wants to merge 3 commits into from

Conversation

kkirsche
Copy link
Contributor

Fix #8369

The BeautifulSoup class inherits from Tag here:
https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/__init__.py#L74

The __iter__ method is of Tag, which is the one used by BeautifulSoup is seen here:
https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L1521

This iterates over self.contents, which is a list[PageElement] so this makes both unions with Tag to ensure coverage still handles the situations where a non-Tag PageElement is found

Fix #8369 

The `BeautifulSoup` class inherits from `Tag` here:
https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/__init__.py#L74

The `__iter__` method is of `Tag`, which is the one used by `BeautifulSoup` is seen here:
https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L1521

This iterates over `self.contents`, which is a `list[PageElement]` so this makes both unions with `Tag` to ensure coverage still handles the situations where a non-`Tag` `PageElement` is found
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@AlexWaygood
Copy link
Member

AlexWaygood commented Jul 25, 2022

You probably don't need to do quite so many merge commits — they mean that those of us who are subscribed to all activity on this repo get quite a few notifications :)

@kkirsche
Copy link
Contributor Author

Oops, sorry about that! I'll keep that in mind moving forward.

@AlexWaygood
Copy link
Member

Oops, sorry about that! I'll keep that in mind moving forward.

No worries! :)

@github-actions
Copy link
Contributor

According to mypy_primer, this change has no effect on the checked open source code. 🤖🎉

Copy link
Collaborator

@hauntsaninja hauntsaninja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by this, since Tags are PageElements, so not sure what the union is accomplishing.

I think the kind of error in #8369 is often unavoidable. It might be safe to change BeautifulSoup.__iter__ (I just don't know), but Tag.__iter__ can most certainly return e.g. a NavigableString

html_doc = "some document"
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
tag = soup.contents[0].contents[0].contents[0]
print(type(tag))
list(tag)[0].get("href")  # explodes
@kkirsche
Copy link
Contributor Author

OK, so it sounds like rather than a merge request, the correct approach would be for the code in the issue to do type-narrowing via isinstance checks or a structural check that the item supports the .get method. Is that understanding what you've said correctly?

@Akuli
Copy link
Collaborator

Akuli commented Jul 26, 2022

Ideally passing parse_only=SoupStrainer("a") to BeautifulSoup(...) would return a soup of Tags. I think it is possible to express that with generics and an overloaded __init__. But making BeautifulSoup generic might not be a good idea if people are currently using it in type hints.

@AIGeneratedUsername
Copy link

AIGeneratedUsername commented Jul 28, 2022

Defining

def __iter__(self) -> Iterator[Tag | NavigableString]: ...

will produce much more meaningful output by a type checker, then the current -> Iterator[PageElement].

It will be clear that you may get any of these two types, and one of them (Tag) has .get() method.

Current (bad) output

error: "PageElement" has no attribute "get" [attr-defined]

This output makes you think that you have no chances to get something with .get() method. Very confusing.

New (good) output

Error: Item "NavigableString" of "Union[Tag, NavigableString]" has no attribute "get" [union-attr]

This new warning makes things clear: you need to check for NavigableString and then do not call .get() for it.

@kkirsche
Copy link
Contributor Author

Happy to make changes, but I may be in over my head with getting this right. If there is a consensus for the correct update I'm happy to do it, otherwise I'll probably close this as I'm not confident enough with my understanding of BeautifulSoup to handle the overload calls at this time

@kkirsche kkirsche closed this Aug 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
5 participants