Beautiful Soup 4.13.0

After a beta period lasting nearly a year, I've released the biggest update to Beautiful Soup in many years. For version 4.13.0 I added type hints to the Python code, and in doing so uncovered a large number of very small inconsistencies in the code. I've fixed the inconsistencies, but the result is a larger-than-usual number of deprecations and changes that may break backwards compatibility.

The CHANGELOG for 4.13.0 is quite large so I'm writing this blog post to highlight just the most important changes, specifically the changes most likely to make you need (or want) to change your code.

Deprecations and backwards-incompatible changes

Every deprecated method, attribute and class from the 3.0 and 2.0 major versions of Beautiful Soup now issues a DeprecationWarning on use. These have been deprecated for at least ten years, but until now they didn't issue a DeprecationWarning when you tried to use them. They're all going away soon.

This version drops support for Python 3.6, which went EOL in December 2021. The minimum supported Python version for Beautiful Soup is now Python 3.7, which itself went EOL in June 2023.

The storage for a tag's attribute values now modifies incoming values
to be consistent with the HTML or XML spec. This means that if you set an
attribute value to a number, it will be converted to a string
immediately, rather than being converted when you output the document.

More importantly for backwards compatibility, setting an HTML
attribute value to True will set the attribute's value to the
appropriate string per the HTML spec. Setting an attribute value to
False or None will remove the attribute value from the tag
altogether, rather than (effectively, as before) setting the value
to the string "False" or the string "None".

This means that some programs that modify documents will generate
different output than they would in earlier versions of Beautiful Soup,
but the new documents are more likely to represent the intent behind the
modifications.

To give a specific example, if you have code that looks something like this:

checkbox1['checked'] = True
checkbox2['checked'] = False

Then a document that used to look like this (with most browsers
treating both boxes as checked):

<input type="checkbox" checked="True"/>
<input type="checkbox" checked="False"/>

Will now look like this (with browsers treating only the first box
as checked):

<input type="checkbox" checked="checked"/>
<input type="checkbox"/>

You can get the old behavior back by instantiating a TreeBuilder
with attribute_dict_class=dict, or you can customize how Beautiful Soup
treats attribute values by passing in a custom subclass of dict.
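
As a rough illustration of the rules above, here's a minimal sketch of a dict subclass that applies them. The class name SketchAttributeDict is made up for this example, and this is not Beautiful Soup's actual implementation; it only models the behavior described in this post (True becomes the attribute's name, False and None remove the attribute, numbers become strings).

```python
class SketchAttributeDict(dict):
    """Hypothetical stand-in that models the storage rules described above."""

    def __setitem__(self, key, value):
        if value is None or value is False:
            # A false value means "attribute absent": drop it if it's there.
            self.pop(key, None)
            return
        if value is True:
            # A boolean attribute that is present takes its own name as value.
            value = key
        elif isinstance(value, (int, float)):
            # Numbers are stored as strings immediately, not at output time.
            value = str(value)
        super().__setitem__(key, value)


attrs = SketchAttributeDict()
attrs["checked"] = True         # stored as "checked"
attrs["width"] = 100            # stored as "100"
attrs["disabled"] = "disabled"  # strings are stored unchanged
attrs["disabled"] = None        # removes the attribute again
print(attrs)
```

A subclass along these lines, with your own rules in __setitem__, is the kind of thing you could pass as attribute_dict_class if neither the new default nor plain dict suits you.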

If you pass an empty list as the attribute value when searching the
tree, you will now find all tags which have that attribute set to a value in
the empty list--that is, you will find nothing. This is consistent with other
situations where a list of acceptable values is provided. Previously, an
empty list was treated the same as None and False, and you would have
found the tags which did not have that attribute set at all.
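
To make the new rule concrete, here's a simplified model of the matching logic in plain Python. The function attribute_matches and the list-of-dicts "document" are invented for this sketch and are not part of Beautiful Soup; they only mirror the behavior described above for None, False and lists.

```python
def attribute_matches(actual, wanted):
    """Simplified model: does a tag's attribute value satisfy a search value?

    `actual` is the tag's value for the attribute, or None if it is unset.
    """
    if wanted is None or wanted is False:
        # Matches only tags that do not have the attribute at all.
        return actual is None
    if isinstance(wanted, list):
        # A list means "any of these values"; an empty list admits nothing.
        return actual in wanted
    return actual == wanted


# Three pretend tags: two with an id, one without.
tags = [{"id": "first"}, {"id": "second"}, {}]


def search(wanted):
    return [t for t in tags if attribute_matches(t.get("id"), wanted)]


one_of = search(["first", "third"])  # tags whose id is in the list
nothing = search([])                 # empty list: no acceptable values
unset = search(None)                 # tags with no id at all
```

In this model the empty list falls out of the general list rule with no special case, which is the consistency the change is after.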

When using one of the find() methods or creating a SoupStrainer,
if you specify the same attribute in both attrs and the keyword
arguments, you'll now end up with two different ways to match that
attribute. Previously the value in the keyword arguments would override
the value in attrs.

The 'html5' formatter is now much less aggressive about escaping
ampersands, escaping only the ampersands considered "ambiguous" by the HTML5
spec (which is almost none of them). This is the sort of change that
might break your unit test suite, but the resulting markup will be much more
readable and more HTML5-ish.

To quickly get the old behavior back, change code like this:

tag.encode(formatter='html5')

to this:

tag.encode(formatter='html5-4.12')

In the future, the 'html5' formatter may become the default HTML
formatter, which will change Beautiful Soup's default output. This
will break a lot of test suites so it's not going to happen for a
while.

New features

The online documentation now includes full API documentation generated from Python docstrings.

The new ElementFilter class encapsulates Beautiful Soup's rules
about matching elements and deciding which parts of a document to
parse. This gives you direct access to Beautiful Soup's low-level
matching API. See the documentation for details.

The new PageElement.filter() method provides a fully general way of
finding elements in a Beautiful Soup parse tree. You can specify a
function to iterate over the tree and an ElementFilter to determine
what matches.

The NavigableString class now has a .string property which returns the
string itself. This makes it easier to iterate over a mixed list
of Tag and NavigableString objects.

There's a new warning class, UnusualUsageWarning, which is a superclass
for all of the warnings issued when Beautiful Soup notices something
unusual but not guaranteed to be wrong, like markup that looks like
a URL (MarkupResemblesLocatorWarning) or XML being run through an HTML
parser (XMLParsedAsHTMLWarning).

The text of these warnings has been revamped to explain in more
detail what is going on, how to check if you've made a mistake,
and how to make the warning go away if you are acting deliberately.

If these warnings are interfering with your workflow, or simply
annoying you, you can filter all of them by filtering
UnusualUsageWarning, without worrying about losing the warnings
Beautiful Soup issues when there *definitely* is a problem you
need to correct, such as use of a deprecated method.
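
Here's a small sketch of how filtering on the superclass works with Python's standard warnings module. The two warning classes below are stand-ins defined locally so the example runs without Beautiful Soup installed; in real code you would import the library's own classes instead. The noisy() function is likewise made up for the demonstration.

```python
import warnings


# Stand-ins for the classes named above (assumption: local copies used only
# so this sketch is self-contained; use the library's classes in real code).
class UnusualUsageWarning(UserWarning):
    pass


class MarkupResemblesLocatorWarning(UnusualUsageWarning):
    pass


def noisy():
    # One "unusual but maybe fine" warning and one "definitely fix this" warning.
    warnings.warn("markup looks like a URL", MarkupResemblesLocatorWarning)
    warnings.warn("old method used", DeprecationWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # baseline: record every warning
    # A single filter on the superclass silences all of its subclasses.
    warnings.filterwarnings("ignore", category=UnusualUsageWarning)
    noisy()

categories = [w.category for w in caught]
print(categories)
```

Only the DeprecationWarning is recorded: the filter matches by class hierarchy, so the subclass warning is dropped while unrelated categories still come through.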

Beautiful Soup now emits an UnusualUsageWarning if you try to search for
an attribute called _class; you probably meant class_.

Published on February 02, 2025 10:34

Leonard Richardson's Blog
