-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #8369: BeautifulSoup iterator returns Tag #8400
Conversation
Fix #8369 The `BeautifulSoup` class inherits from `Tag` here: https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/__init__.py#L74 The `__iter__` method is of `Tag`, which is the one used by `BeautifulSoup` is seen here: https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L1521 This iterates over `self.contents`, which is a `list[PageElement]` so this makes both unions with `Tag` to ensure coverage still handles the situations where a non-`Tag` `PageElement` is found
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
You probably don't need to do quite so many merge commits — they mean that those of us who are subscribed to all activity on this repo get quite a few notifications :) |
Oops, sorry about that! I'll keep that in mind moving forward. |
No worries! :) |
According to mypy_primer, this change has no effect on the checked open source code. 🤖🎉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused by this, since Tag
s are PageElement
s, so not sure what the union is accomplishing.
I think the kind of error in #8369 is often unavoidable. It might be safe to change BeautifulSoup.__iter__
(I just don't know), but Tag.__iter__
can most certainly return e.g. a NavigableString
html_doc = "some document"
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
tag = soup.contents[0].contents[0].contents[0]
print(type(tag))
list(tag)[0].get("href") # explodes
OK, so it sounds like rather than a merge request, the correct approach would be for the code in the issue to do type-narrowing via isinstance checks or a structural check that the item supports the |
Ideally passing |
Defining def __iter__(self) -> Iterator[Tag | NavigableString]: ... will produce much more meaningful output by a type checker, then the current It will be clear that you may get any of these two types, and one of them ( Current (bad) output
This output makes you think that you have no chances to get something with New (good) output
This new warning makes things clear: you need to check for |
Happy to make changes, but I may be in over my head with getting this right. If there is a consensus for the correct update I'm happy to do it, otherwise I'll probably close this as I'm not confident enough with my understanding of BeautifulSoup to handle the |
Fix #8369
The
BeautifulSoup
class inherits fromTag
here:https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/__init__.py#L74
The
__iter__
method is ofTag
, which is the one used byBeautifulSoup
is seen here:https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L1521
This iterates over
self.contents
, which is alist[PageElement]
so this makes both unions withTag
to ensure coverage still handles the situations where a non-Tag
PageElement
is found