More on this book
Community
Kindle Notes & Highlights
Started reading
May 16, 2024
Do not open text files in binary mode unless you need to analyze the file contents to determine the encoding—even then, you should be using Chardet instead of reinventing the wheel
Ordinary code should only use binary mode to open binary files, like raster images.
locale.getpreferredencoding() is the most important setting.
Text files use locale.getpreferredencoding() by default.
If you omit the encoding argument when opening a file, the default is given by locale.getpreferredencoding()
the encoding for standard I/O is UTF-8 for interactive I/O, or defined by locale.getpreferredencoding() if the output/input is redirected to/from a file.
sys.getdefaultencoding() is used internally by Python in implicit conversions of binary data to/from str. Changing this setting is not supported.
sys.getfilesystemencoding() is used to encode/decode filenames (not file contents). It is used when open() gets a str argument for the filename; if the filename is given as a byte...
This highlight has been truncated due to consecutive passage length restrictions.
Therefore, the best advice about encoding defaults is: do not rely on them.
String comparisons are complicated by the fact that Unicode has combining characters: diacritics and other marks that attach to the preceding character, appearing as one when printed.
Normalization Form C (NFC) composes the code points to produce the shortest equivalent string, while NFD decomposes, expanding composed characters into base characters and separate combining characters.
The other two normalization forms are NFKC and NFKD, where the letter K stands for “compatibility.” These are stronger forms of normalization, affecting the so-called “compatibility characters.”
In the NFKC and NFKD forms, each compatibility character is replaced by a “compatibility decomposition” of one or more characters that are considered a “preferred” representation, even if there is some formatting loss—ideally, the formatting should be the responsibility of external markup, not part of Unicode.
NFKC and NFKD normalization cause data loss and should be applied only in special cases like search and indexing, and not for permanent storage of text.
Case folding is essentially converting all text to lowercase, with some additional transformations. It is supported by the str.casefold() method.
NFC is the best normalized form for most applications. str.casefold() is the way to go for case-insensitive comparisons.
The Unicode standard provides an entire database—in the form of several structured text files—that includes not only the table mapping code points to character names, but also metadata about the individual characters and how they are related. For example, the Unicode database records whether a character is printable, is a letter, is a decimal digit, or is some other numeric symbol. That’s how the str methods isalpha, isprintable, isdecimal, and isnumeric work. str.casefold also uses information from a Unicode table.
If you build a regular expression with bytes, patterns such as \d and \w only match ASCII characters; in contrast, if these patterns are given as str, they match Unicode digits or letters beyond ASCII.
As the world adopts Unicode, we need to keep the concept of text strings separated from the binary sequences that represent them in files, and Python 3 enforces this separation.
Unicode provides multiple ways of representing some characters, so normalizing is a prerequisite for text matching.
In memory, Python 3 stores each str as a sequence of code points using a fixed number of bytes per code point, to allow efficient direct access to any character or slice.
Since Python 3.3, when creating a new str object, the interpreter checks the characters in it and chooses the most economic memory layout that is suitable for that particular str: if there are only characters in the latin1 range, that str will use just one byte per code point. Otherwise, two or four bytes per code point may be used, depending on the str.
Data classes are like children. They are okay as a starting point, but to participate as a grownup object, they need to take some responsibility.
Python offers a few ways to build a simple class that is just a collection of fields, with little or no extra functionality. That pattern is known as a “data class”—and dataclasses is one of the packages that supports this pattern.
By default, @dataclass produces mutable classes. But the decorator accepts a keyword argument frozen—shown in Example 5-3. When frozen=True, the class will raise an exception if you try to assign a value to a field after the instance is initialized.
Type hints—a.k.a. type annotations—are ways to declare the expected type of function arguments, return values, variables, and attributes.
Think about Python type hints as “documentation that can be verified by IDEs and type checkers.” That’s because type hints have no impact on the runtime behavior of Python programs.
This is normal Python behavior: regular instances can have their own attributes that don’t appear in the class.7
Class attributes are often used as default attribute values for instances, including in data classes.
The default_factory parameter lets you provide a function, class, or any other callable, which will be invoked with zero arguments to build a default value each time an instance of the data class is created.
It’s good that @dataclass rejects class definitions with a list default value in a field. However, be aware that it is a partial solution that only applies to list, dict, and set. Other mutable values used as defaults will not be flagged by @dataclass. It’s up to you to understand the problem and remember to use a default factory to set mutable default values.
Prior to Python 3.9, the built-in collections did not support generic type notation. As a temporary workaround, there are corresponding collection types in the typing module. If you need a parameterized list type hint in Python 3.8 or earlier, you must import the List type from typing and use it: List[str].
Creating a hierarchy of data classes is usually a bad idea,
The @dataclass decorator doesn’t care about the types in the annotations, except in two cases, and this is one of them: if the type is ClassVar, an instance field will not be generated for that attribute.
The other case where the type of the field is relevant to @dataclass is when declaring init-only variables,
These are classes that have fields, getting and setting methods for fields, and nothing else. Such classes are dumb data holders and are often being manipulated in far too much detail by other classes.
A code smell is a surface indication that usually corresponds to a deeper problem in the system.
The main idea of object-oriented programming is to place behavior and data together in the same code unit: a class. If a class is widely used but has no significant behavior of its own, it’s possible that code dealing with its instances is scattered (and even duplicated) in methods and functions throughout the system—a recipe for maintenance headaches.
What makes City or any class work with positional patterns is the presence of a special class attribute named __match_args__, which the class builders in this chapter automatically create.
You can combine keyword and positional arguments in a pattern. Some, but not all, of the instance attributes available for matching may be listed in __match_args__. Therefore, sometimes you may need to use keyword arguments in addition to positional arguments in a pattern.
basic principle of object-oriented programming: data and the functions that touch it should be together in the same class. Classes with no logic may be a sign of misplaced logic.
PEP 526 syntax for annotating instance and class attributes reverses the established convention of class statements: everything declared at the top-level of a class block was a class attribute (methods are class attributes, too). With PEP 526 and @dataclass, any attribute declared at the top level with a type hint becomes an instance attribute:
Python variables are like reference variables in Java; a better metaphor is to think of variables as labels with names attached to objects.
If you imagine variables are like boxes, you can’t make sense of assignment in Python; instead, think of variables as sticky notes,
With reference variables, it makes much more sense to say that the variable is assigned to an object, and not the other way around. After all, the object is created before the assignment.
Variables are bound to objects only after the objects are created
To understand an assignment in Python, read the righthand side first: that’s where the object is created or retrieved. After that, the variable on the left is bound to the object, like a label stuck to it. Just forget about the boxes.
An object’s identity never changes once it has been created; you may think of it as the object’s address in memory. The is operator compares the identity of two objects; the id() function returns an integer representing its identity.
The == operator compares the values of objects (the data they hold), while is compares their identities.
if you are comparing a variable to a singleton, then it makes sense to use is.