Kindle Notes & Highlights
Started reading
May 16, 2024
The automatic handling of missing keys is not triggered because pattern matching always uses the d.get(key, sentinel) method—where the default sentinel is a special marker value that cannot occur in user data.
Therefore, they all share the limitation that the keys must be hashable (the values need not be hashable, only the keys).
An object is hashable if it has a hash code which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() method). Hashable objects which compare equal must have the same hash code.
Numeric types and flat immutable types str and bytes are all hashable. Container types are hashable if they are immutable and all contained objects are also hashable. A frozenset is always hashable, because every element it contains must be hashable by definition. A tuple is hashable only if all its items are hashable.
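The tuple and frozenset rules can be checked directly. This is a minimal sketch; the helper name is_hashable is made up for this illustration and is not a standard library function.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds (hypothetical helper for illustration)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


tt = (1, 2, (30, 40))             # every item is immutable and hashable
tl = (1, 2, [30, 40])             # contains a list, which is unhashable
tf = (1, 2, frozenset([30, 40]))  # a frozenset item is hashable

results = [is_hashable(t) for t in (tt, tl, tf)]
print(results)
```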
User-defined types are hashable by default because their hash code is derived from their id(), and the __eq__() method inherited from the object class simply compares the object IDs.
The way d.update(m) handles its first argument m is a prime example of duck typing: it first checks whether m has a keys method and, if it does, assumes it is a mapping. Otherwise, update() falls back to iterating over m, assuming its items are (key, value) pairs.
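Both branches of that behavior can be seen in a small sketch. The class KeysAndGetitem is hypothetical: it is not a dict subclass and only provides keys() and __getitem__.

```python
class KeysAndGetitem:
    """Hypothetical class: not a dict, but it has keys() and __getitem__."""

    def __init__(self):
        self._data = {'a': 1, 'b': 2}

    def keys(self):
        return self._data.keys()

    def __getitem__(self, key):
        return self._data[key]


d = {}
d.update(KeysAndGetitem())       # has keys(): treated as a mapping
d.update([('c', 3), ('d', 4)])   # no keys(): iterated as (key, value) pairs
print(d)
```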
In other words, the end result of this line…
    my_dict.setdefault(key, []).append(new_value)
…is the same as running…
    if key not in my_dict:
        my_dict[key] = []
    my_dict[key].append(new_value)
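A small sketch of that equivalence, using a made-up word list grouped by first letter:

```python
words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']  # sample data

index_a = {}
for word in words:
    # Fetch the list for this letter, creating it first if it is missing.
    index_a.setdefault(word[0], []).append(word)

index_b = {}
for word in words:
    # Long form: membership test, optional assignment, then another lookup.
    if word[0] not in index_b:
        index_b[word[0]] = []
    index_b[word[0]].append(word)

print(index_a == index_b)
```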
Sometimes it is convenient to have mappings that return some made-up value when a missing key is searched. There are two main approaches to this: one is to use a defaultdict instead of a plain dict. The other is to subclass dict or any other mapping type and add a __missing__ method.
Here is how it works: when instantiating a defaultdict, you provide a callable to produce a default value whenever __getitem__ is passed a nonexistent key argument.
The callable that produces the default values is held in an instance attribute named default_factory.
The default_factory of a defaultdict is only invoked to provide default values for __getitem__ calls, and not for the other methods. For example, if dd is a defaultdict, and k is a missing key, dd[k] will call the default_factory to create a default value, but dd.get(k) still returns None, and k in dd is False.
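The difference between dd[k], dd.get(k), and k in dd can be sketched as follows, assuming list as the default_factory:

```python
from collections import defaultdict

dd = defaultdict(list)

got = dd.get('k')            # get() does not call default_factory
present_before = 'k' in dd   # neither does a membership test
value = dd['k']              # __getitem__ calls list() and stores the result
present_after = 'k' in dd    # the key now exists

print(got, present_before, value, present_after)
```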
This method is not defined in the base dict class, but dict is aware of it: if you subclass dict and provide a __missing__ method, the standard dict.__getitem__ will call it whenever a key is not found, instead of raising KeyError.
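A minimal sketch of a dict subclass with __missing__. The class name ZeroDict is hypothetical, and this version chooses not to store the default it returns.

```python
class ZeroDict(dict):
    """Hypothetical dict subclass that reports 0 for missing keys."""

    def __missing__(self, key):
        # Called by dict.__getitem__ when key is absent; nothing is stored.
        return 0


z = ZeroDict(a=1)
print(z['a'], z['nope'], z.get('nope'), 'nope' in z)
```

As with defaultdict, only the d[key] syntax triggers the hook; get() and the in operator are unaffected.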
A search like k in my_dict.keys() is efficient in Python 3 even for very large mappings because dict.keys() returns a view, which is similar to a set, as we’ll see in “Set Operations on dict Views”. However, remember that k in my_dict does the same job, and is faster because it avoids the attribute lookup to find the .keys method.
Now that the built-in dict also keeps the keys ordered since Python 3.6, the most common reason to use OrderedDict is writing code that is backward compatible with earlier Python versions.
The regular dict was designed to be very good at mapping operations. Tracking insertion order was secondary.
OrderedDict was designed to be good at reordering operations. Space efficiency, iteration speed, and the performance of update operations were secondary.
A ChainMap instance holds a list of mappings that can be searched as one. The lookup is performed on each input mapping in the order it appears in the constructor call, and succeeds as soon as the key is found in one of those mappings.
Updates or insertions to a ChainMap only affect the first input mapping.
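The lookup order and the write behavior can be sketched with two small dicts (the names d1 and d2 are arbitrary):

```python
from collections import ChainMap

d1 = {'a': 1, 'b': 3}
d2 = {'a': 2, 'b': 4, 'c': 6}
chain = ChainMap(d1, d2)

a = chain['a']    # found in d1, so d2 is never consulted
c = chain['c']    # not in d1, so d2 supplies the value
chain['c'] = -1   # the write lands in d1 only; d2 is untouched

print(a, c, d1, d2)
```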
It’s better to create a new mapping type by extending collections.UserDict rather than dict.
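One practical reason, offered here as background rather than taken from the highlight: built-in dict methods such as update() do not call an overridden __setitem__, while UserDict routes them through it. A minimal sketch with two hypothetical classes that try to upper-case every key:

```python
from collections import UserDict


class UpperDict(dict):
    """Hypothetical: subclasses dict directly."""

    def __setitem__(self, key, value):
        super().__setitem__(key.upper(), value)


class UpperUserDict(UserDict):
    """Hypothetical: same idea, built on UserDict."""

    def __setitem__(self, key, value):
        super().__setitem__(key.upper(), value)


plain = UpperDict()
plain.update(a=1)    # dict.update bypasses the overridden __setitem__

user = UpperUserDict()
user.update(a=1)     # UserDict.update goes through __setitem__

print(list(plain), list(user))
```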
The dict instance methods .keys(), .values(), and .items() return instances of classes called dict_keys, dict_values, and dict_items, respectively. These dictionary views are read-only projections of the internal data structures used in the dict implementation.
A view object is a dynamic proxy. If the source dict is updated, you can immediately see the changes through an existing view.
The dict_values class is the simplest dictionary view—it implements only the __len__, __iter__, and __reversed__ special methods.
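A short sketch of a view tracking its source dict, and of the three operations dict_values supports:

```python
d = {'a': 10, 'b': 20}
values = d.values()        # a dict_values view, not a copy
keys = d.keys()

before = list(values)
d['c'] = 30                # update the source dict
after = list(values)       # the existing view reflects the new item

backwards = list(reversed(values))
print(before, after, len(values), backwards)
```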
Keys must be hashable objects. They must implement proper __hash__ and __eq__ methods.
Item access by key is very fast. A dict may have millions of keys, but Python can locate a key directly by computing the hash code of the key and deriving an index offset into the hash table, with the possible overhead of a small number of tries to find a matching entry.
Key ordering is preserved as a side effect of a more compact memory layout for dict in CPython 3.6, which became an official language feature in 3.7.
Despite its new compact layout, dicts inevitably have a significant memory overhead. The most compact internal data structure for a container would be an array of pointers to the items. Compared to that, a hash table needs to store more data per entry, and Python needs to ...
To save memory, avoid creating instance attributes outside of ...
A set is a collection of unique objects.
Set elements must be hashable. The set type is not hashable, so you can’t build a set with nested set instances. But frozenset is hashable, so you can have frozenset elements inside a set.
Don’t forget that to create an empty set, you should use the constructor without an argument: set(). If you write {}, you’re creating an empty dict—this hasn’t changed in Python 3.
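These points about set literals and nesting can be checked in a few lines, as a minimal sketch:

```python
empty_set = set()
empty_braces = {}          # this is a dict, not a set

try:
    nested = {{1, 2}, {3}}     # plain sets are unhashable
except TypeError:
    nested = None              # building the outer set failed

ok = {frozenset({1, 2}), frozenset({3})}   # frozenset elements are fine

print(type(empty_set).__name__, type(empty_braces).__name__, nested, len(ok))
```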
Set elements must be hashable objects. They must implement proper __hash__ and __eq__ methods.
Membership testing is very efficient. A set may have millions of elements, but an element can be located directly by computing its hash code and deriving an index offset, with the possible overhead of a small number of tries to find a matching element or exhaust the search.
Sets have a significant memory overhead compared to a low-level array of pointers to their elements, which would be more compact but also much slower to search beyond a handful of elements.
The concept of “string” is simple enough: a string is a sequence of characters.
The identity of a character—its code point—is a number from 0 to 1,114,111 (base 10), shown in the Unicode standard as 4 to 6 hex digits with a “U+” prefix, from U+0000 to U+10FFFF.
The actual bytes that represent a character depend on the encoding in use. An encoding is an algorithm that converts code points to byte sequences and vice versa.
Converting from code points to bytes is encoding; converting from bytes to code points is decoding.
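A minimal round trip, using the sample string 'café' and UTF-8 as the chosen encoding:

```python
s = 'café'                  # four characters (code points)
b = s.encode('utf_8')       # encoding: str -> bytes
back = b.decode('utf_8')    # decoding: bytes -> str
code_point = ord('é')       # the code point of 'é' as an integer

print(len(s), b, len(b), back, hex(code_point))
```

In UTF-8 the 'é' takes two bytes, so the byte sequence is one item longer than the string.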
Creating a bytes or bytearray object from any buffer-like source will always copy the bytes. In contrast, memoryview objects let you share memory between binary data structures.
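The contrast between copying and sharing can be sketched with an array of short integers (the sample values are arbitrary):

```python
import array

numbers = array.array('h', [-2, -1, 0, 1, 2])   # signed short integers

copied = bytes(numbers)        # an independent copy of the underlying bytes
view = memoryview(numbers)     # shares memory with the array

view[0] = 100                  # write through the view

print(numbers[0], len(copied))
```

The array sees the write made through the view, while the bytes copy still holds the original first item.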
When converting text to bytes, if a character is not defined in the target encoding, UnicodeEncodeError will be raised, unless special handling is provided by passing an errors argument to the encoding method or function.
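A sketch of the default behavior and of three standard error handlers, using ASCII as the target encoding and a sample string with one non-ASCII character:

```python
city = 'São Paulo'

try:
    city.encode('ascii')           # 'ã' is not defined in ASCII
    raised = False
except UnicodeEncodeError:
    raised = True

ignored = city.encode('ascii', errors='ignore')              # drops 'ã'
replaced = city.encode('ascii', errors='replace')            # substitutes '?'
xml = city.encode('ascii', errors='xmlcharrefreplace')       # XML entity

print(raised, ignored, replaced, xml)
```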
ASCII is a common subset to all the encodings that I know about, therefore encoding should always work if the text is made exclusively of ASCII characters.
Not every byte holds a valid ASCII character, and not every byte sequence is valid UTF-8 or UTF-16; therefore, when you assume one of these encodings while converting a binary sequence to text, you will get a UnicodeDecodeError if unexpected bytes are found.
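A sketch of a decoding failure, assuming a byte sequence that was produced with Latin-1 and is then wrongly read as UTF-8:

```python
octets = b'Montr\xe9al'            # 'é' encoded as a single Latin-1 byte

latin = octets.decode('latin1')    # the right codec recovers the text

try:
    octets.decode('utf_8')         # b'\xe9' followed by 'a' is invalid UTF-8
    failed = False
except UnicodeDecodeError:
    failed = True

# errors='replace' substitutes U+FFFD for the undecodable byte.
replaced = octets.decode('utf_8', errors='replace')

print(latin, failed, replaced)
```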
UTF-8 is the default source encoding for Python 3, just as ASCII was the default for Python 2.
How do you find the encoding of a byte sequence? Short answer: you can’t.
The way UTF-8 was designed, it’s almost impossible for a random sequence of bytes, or even a nonrandom sequence of bytes coming from a non-UTF-8 encoding, to be decoded accidentally as garbage in UTF-8, instead of raising UnicodeDecodeError.
To avoid confusion, the UTF-16 encoding prepends the text to be encoded with the special invisible character ZERO WIDTH NO-BREAK SPACE (U+FEFF). On a little-endian system, that is encoded as b'\xff\xfe' (decimal 255, 254).
One big advantage of UTF-8 is that it produces the same byte sequence regardless of machine endianness, so no BOM is needed.
This UTF-8 encoding with BOM is called UTF-8-SIG in Python’s codec registry. The character U+FEFF encoded in UTF-8-SIG is the three-byte sequence b'\xef\xbb\xbf'.
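The BOM behavior of these codecs can be inspected with the constants in the codecs module. A minimal sketch with an arbitrary sample string:

```python
import codecs

text = 'El Niño'

utf16 = text.encode('utf_16')         # BOM first, in the platform byte order
utf16le = text.encode('utf_16le')     # explicit endianness: no BOM
plain = text.encode('utf_8')          # no BOM
with_sig = text.encode('utf_8_sig')   # UTF-8 with a leading BOM

print(utf16[:2], utf16le[:2], with_sig[:3])
```

Decoding with 'utf_8_sig' strips the BOM again, so the round trip returns the original text.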
Python scripts can be made executable in Unix systems if they start with the comment: #!/usr/bin/env python3.
The best practice for handling text I/O is the “Unicode sandwich” (Figure 4-2). This means that bytes should be decoded to str as early as possible on input (e.g., when opening a file for reading). The “filling” of the sandwich is the business logic of your program, where text handling is done exclusively on str objects. You should never be encoding or decoding in the middle of other processing. On output, the str objects are encoded to bytes as late as possible.
Code that has to run on multiple machines or on multiple occasions should never depend on encoding defaults. Always pass an explicit encoding= argument when opening text files, because the default may change from one machine to the next, or from one day to the next.
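A minimal sketch of writing and reading with an explicit encoding. The file name cafe.txt and the temporary directory are placeholders for this illustration.

```python
import os
import tempfile

# Throwaway location for the example file.
path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')

with open(path, 'w', encoding='utf_8') as fp:   # encode on output
    fp.write('café')

with open(path, encoding='utf_8') as fp:        # decode on input
    text = fp.read()

size = os.stat(path).st_size                    # bytes on disk
print(text, size)
```

With both calls naming the encoding, the result does not depend on the locale of the machine running the code.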