High Performance Python Quotes

High Performance Python: Practical Performant Programming for Humans by Micha Gorelick
241 ratings, 4.17 average rating, 22 reviews
High Performance Python Quotes Showing 1-21 of 21
“Using threads also allows us to avoid the fork-without-exec POSIX issues that Python’s multiprocessing brings.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
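A minimal sketch (not from the book) of the thread-based alternative: because threads share the parent process's state directly, no fork() is involved, so the fork-without-exec pitfalls never arise. The work function is a hypothetical placeholder.

    from concurrent.futures import ThreadPoolExecutor

    def work(x):
        return x * x  # placeholder task

    # threads share memory with the parent; nothing is forked
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(work, range(8)))
    print(results)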
“Use manual sanity checks in data pipelines. When optimizing data processing systems, it’s easy to stay in the “binary mindset” mode, using tight pipelines, efficient binary data formats, and compressed I/O. As the data passes through the system unseen, unchecked (except for perhaps its type), it remains invisible until something outright blows up. Then debugging commences. I advocate sprinkling a few simple log messages throughout the code, showing what the data looks like at various internal points of processing, as good practice — nothing fancy, just an analogy to the Unix head command, picking and visualizing a few data points. Not only does this help during the aforementioned debugging, but seeing the data in a human-readable format leads to “aha!” moments surprisingly often, even when all seems to be going well. Strange tokenization! They promised input would always be encoded in latin1! How did a document in this language get in there? Image files leaked into a pipeline that expects and parses text files! These are often insights that go way beyond those offered by automatic type checking or a fixed unit test, hinting at issues beyond component boundaries. Real-world data is messy. Catch early even things that wouldn’t necessarily lead to exceptions or glaring errors. Err on the side of too much verbosity.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
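A hedged sketch of the "Unix head" analogy described above; the head helper and the stage names are illustrative, not from the book.

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pipeline")

    def head(records, n=3, stage="input"):
        """Log the first few records flowing through a stage, like Unix head."""
        for i, record in enumerate(records):
            if i < n:
                log.info("%s sample %d: %r", stage, i, record)
            yield record

Wrapping any stage, e.g. for rec in head(parse(lines), stage="parsed"), makes a handful of real records visible in the logs without touching the rest of the pipeline.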
“One of the compelling reasons to switch to Python 3.3+ is that Unicode object storage is significantly cheaper than it is in Python 2.7. If you mainly handle lots of strings and they eat a lot of RAM, definitely consider a move to Python 3.3+. You get this RAM saving absolutely for free.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
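The saving comes from CPython 3.3's flexible string representation (PEP 393), which stores each string with the narrowest encoding its characters allow. A quick check, assuming CPython on a 64-bit build (exact byte counts vary by version):

    import sys

    for s in ("a" * 1000,            # ASCII: 1 byte per character
              "\xe9" * 1000,         # Latin-1: still 1 byte per character
              "\u20ac" * 1000,       # BMP: 2 bytes per character
              "\U0001F600" * 1000):  # astral: 4 bytes per character
        print(len(s), sys.getsizeof(s))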
“If you’re working with a large array or matrix of numbers with Cython and you don’t want an external dependency on numpy, be aware that you can store your data in an array and pass it into Cython for processing without any additional memory overhead.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
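A hedged Cython sketch of the idea: a typed memoryview accepts a plain array.array through the buffer protocol, so numpy is not needed. The module and function names are hypothetical.

    # summate.pyx -- hypothetical Cython module
    def summate(double[:] data):  # typed memoryview; accepts array('d', ...)
        cdef double total = 0.0
        cdef Py_ssize_t i
        for i in range(data.shape[0]):
            total += data[i]
        return total

From Python, after compiling: summate(array.array('d', range(1_000_000))) hands the buffer to Cython with no copy and no numpy dependency.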
“Refer back to Using memory_profiler to Diagnose Memory Usage for a reminder on how to use memory_profiler; here we load it as a new magic function in IPython using %load_ext memory_profiler.

Example 11-1. Measuring memory usage of 100,000,000 of the same integer in a list

    In [1]: %load_ext memory_profiler  # load the %memit magic function
    In [2]: %memit [0]*int(1e8)
    peak memory: 790.64 MiB, increment: 762.91 MiB”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
“If you’re working on long-running scientific problems where each job takes many seconds (or longer) to run, then you might want to review Gael Varoquaux’s joblib. This tool supports lightweight pipelining; it sits on top of multiprocessing and offers an easier parallel interface, result caching, and debugging features.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
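A minimal joblib sketch, assuming joblib is installed; the simulate function is a stand-in for a job that runs for seconds or longer.

    from joblib import Memory, Parallel, delayed

    memory = Memory("./joblib_cache", verbose=0)  # on-disk result caching

    @memory.cache          # repeat calls with the same argument hit the cache
    def simulate(x):
        return x ** 2      # placeholder for an expensive computation

    # run the jobs across four worker processes
    results = Parallel(n_jobs=4)(delayed(simulate)(i) for i in range(10))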
“Linux has a forking process to create new processes by cloning the parent process. Windows lacks fork, so the multiprocessing module imposes some Windows-specific restrictions that we urge you to review if you’re using that platform.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
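The best-known of those restrictions: under the spawn start method that Windows uses, child processes re-import the main module, so process creation must be guarded or the script recurses. A minimal sketch:

    from multiprocessing import Pool

    def work(x):
        return x * x

    if __name__ == "__main__":   # required on Windows: children re-import this file
        with Pool(4) as pool:
            print(pool.map(work, range(8)))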
“For example, if we can have multiple Python processes all solving the same problem without communicating with one another (a situation known as embarrassingly parallel), not much of a penalty will be incurred as we add more and more Python processes. On the other hand, if each process needs to communicate with every other Python process, the communication overhead will slowly overwhelm the processing and slow things down. This means that as we add more and more Python processes, we can actually slow down our overall performance.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
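A sketch of the embarrassingly parallel case, in the spirit of the book's Monte Carlo examples (the chunking here is illustrative): each worker solves its share independently, and only the final tallies are combined, so adding processes costs almost nothing in communication.

    import random
    from multiprocessing import Pool

    def count_hits(n):
        # each process works alone: no messages until the single return value
        return sum(random.random() ** 2 + random.random() ** 2 < 1
                   for _ in range(n))

    if __name__ == "__main__":
        trials, workers = 1_000_000, 4
        with Pool(workers) as pool:
            hits = sum(pool.map(count_hits, [trials // workers] * workers))
        print(4 * hits / trials)  # crude estimate of pi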
“Standard *nix organization stores shared libraries in /usr/lib and /usr/local/lib.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
“Python code that tends to run faster after compiling is probably mathematical, and it probably has lots of loops that repeat the same operations many times. Inside these loops, you’re probably making lots of temporary objects.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
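A hedged example of the kind of function that compiles well, here using Numba's @jit as one such compiler (Cython or PyPy would suit the same shape of code):

    import numpy as np
    from numba import jit  # assumes numba is installed

    @jit(nopython=True)
    def pairwise_distance_sum(xs):
        # tight nested loops, pure arithmetic, many short-lived temporaries
        total = 0.0
        for i in range(len(xs)):
            for j in range(len(xs)):
                total += (xs[i] - xs[j]) ** 2
        return total

    print(pairwise_distance_sum(np.random.rand(1000)))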
“It is worth noting that performing CPU and memory profiling on your code will probably start you thinking about higher-level algorithmic optimizations that you might apply. These algorithmic changes (e.g., additional logic to avoid computations or caching to avoid recalculation) could help you avoid doing unnecessary work in your code, and Python’s expressivity helps you to spot these algorithmic opportunities.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
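One concrete algorithmic change of the kind described: caching to avoid recalculation, using the standard library's functools.lru_cache.

    from functools import lru_cache

    @lru_cache(maxsize=None)   # remember results instead of recomputing them
    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

Without the cache this recursion is exponential in n; with it, each value is computed exactly once.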
“One important lesson to take away from this is that you should always take care of any administrative things the code must do during initialization. This may include allocating memory, or reading configuration from a file, or even precomputing some values that will be needed throughout the lifetime of the program. This is important for two reasons. First, you are reducing the total number of times these tasks must be done by doing them once up front, and you know that you will be able to use those resources without too much penalty in the future. Secondly, you are not disrupting the flow of the program; this allows it to pipeline more efficiently and keep the caches filled with more pertinent data. We also learned more about the importance of data locality and how important simply getting data to the CPU is. CPU caches can be quite complicated, and often it is best to allow the various mechanisms designed to optimize them take care of the issue. However, understanding what is happening and doing all that is possible to optimize how memory is handled can make all the difference. For example, by understanding how caches work we are able to understand that the decrease in performance that leads to a saturated speedup no matter the grid size in Figure 6-4 can probably be attributed to the L3 cache being filled up by our grid. When this happens, we stop benefiting from the tiered memory approach to solving the Von Neumann bottleneck.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
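A toy sketch of the initialization advice (the names are illustrative): pay the setup cost once, before the hot path runs.

    import math

    # done once, up front: allocate and fill the lookup table
    SIN_TABLE = [math.sin(i / 1000.0) for i in range(1000)]

    def fast_sin(i):
        # hot path: a sequential memory read, no allocation or recomputation
        return SIN_TABLE[i % 1000]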
“One downfall of numpy’s optimization of vector operations is that it only occurs on one operation at a time. That is to say, when we are doing the operation A * B + C with numpy vectors, first the entire A * B operation completes, and the data is stored in a temporary vector; then this new vector is added with C. The in-place version of the diffusion code in Example 6-14 shows this quite explicitly. However, there are many modules that can help with this. numexpr is a module that can take an entire vector expression and compile it into very efficient code that is optimized to minimize cache misses and temporary space used. In addition, the expressions can utilize multiple CPU cores (see Chapter 9 for more information) and specialized instructions for Intel chips to maximize the speedup. It is very easy to change code to use numexpr: all that’s required is to rewrite the expressions as strings with references to local variables. The expressions are compiled behind the scenes (and cached so that calls to the same expression don’t incur the same cost of compilation) and run using optimized code. Example 6-19 shows the simplicity of changing the evolve function to use numexpr. In this case, we chose to use the out parameter of the evaluate function so that numexpr doesn’t allocate a new vector to which to return the result of the calculation.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
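A minimal numexpr sketch of the A * B + C case discussed above, using the out parameter so no result vector is allocated (assumes numexpr is installed; evaluate picks up A, B, and C from the calling scope):

    import numpy as np
    import numexpr as ne

    A, B, C = (np.random.rand(1_000_000) for _ in range(3))
    out = np.empty_like(A)

    # one compiled pass over the data: no temporary for A * B,
    # result written straight into the preallocated buffer
    ne.evaluate("A * B + C", out=out)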
“In order to optimize the memory-dominated effects, let us try using the same method we used in Example 6-6 in order to reduce the number of allocations we make in our numpy code. Allocations are quite a bit worse than the cache misses we discussed previously. Instead of simply having to find the right data in RAM when it is not found in the cache, an allocation also must make a request to the operating system for an available chunk of data and then reserve it. The request to the operating system generates quite a lot more overhead than simply filling a cache — while filling a cache miss is a hardware routine that is optimized on the motherboard, allocating memory requires talking to another process, the kernel, in order to complete. In order to remove the allocations in Example 6-9, we will preallocate some scratch space at the beginning of the code and then only use in-place operations. In-place operations, such as +=, *=, etc., reuse one of the inputs as their output. This means that we don’t need to allocate space to store the result of the calculation.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
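A small numpy sketch of the preallocation pattern (the grid size and loop are illustrative): scratch space is allocated once, and every operation afterwards reuses existing buffers.

    import numpy as np

    grid = np.random.rand(512, 512)
    scratch = np.empty_like(grid)   # one allocation, reused every iteration

    for _ in range(100):
        np.multiply(grid, 0.25, out=scratch)  # writes into scratch, no new array
        grid += scratch                       # in-place add reuses grid's memory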
“Memory allocations are not cheap. Every time we request memory in order to store a variable or a list, Python must take its time to talk to the operating system in order to allocate the new space, and then we must iterate over the newly allocated space to initialize it to some value.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
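The cost is easy to see with timeit: allocating and zero-filling a list takes time roughly proportional to its size (exact numbers vary by machine).

    import timeit

    for n in (10**3, 10**5, 10**7):
        t = timeit.timeit(f"[0] * {n}", number=100)
        print(f"allocating {n:>8} zeros: {t / 100 * 1e6:10.1f} microseconds")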
“In computation there are two phases: generating data and transforming data. This function is very clearly performing a transformation on data, while the fibonacci function generates it. This clear demarcation adds extra clarity and extra functionality: we can move a transformative function to work on a new set of data, or perform multiple transformations on existing data. This paradigm has always been important when creating complex programs; however, generators facilitate this clearly by making generators responsible for creating the data, and normal functions responsible for acting on the generated data.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
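A small sketch of the split described above: one generator produces the data, and a separate function transforms whatever stream it is given (the evens filter is illustrative).

    from itertools import islice

    def fibonacci():
        # generation: an endless stream of Fibonacci numbers
        a, b = 0, 1
        while True:
            yield a
            a, b = b, a + b

    def evens(numbers):
        # transformation: filters any iterable it is handed
        return (n for n in numbers if n % 2 == 0)

    print(list(islice(evens(fibonacci()), 5)))  # [0, 2, 8, 34, 144]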
“test3 defines the sin function as a keyword argument, with its default value being a reference to the sin function within the math module. While we still do need to find a reference to this function within the module, this is only necessary when the test3 function is first defined. After this, the reference to the sin function is stored within the function definition as a local variable in the form of a default keyword argument. As mentioned previously, local variables do not need a dictionary lookup to be found; they are stored in a very slim array that has very fast lookup times. Because of this, finding the function is quite fast! While these effects are an interesting result of the way namespaces in Python are managed, test3 is definitely not “Pythonic.” Luckily, these extra dictionary lookups only start to degrade performance when they are called a lot (i.e., in the innermost block of a very fast loop, such as in the Julia set example). With this in mind, a more readable solution would be to set a local variable with the global reference before the loop is started. We’ll still have to do the global lookup once whenever the function is called, but all the calls to that function in the loop will be made faster. This speaks to the fact that even minute slowdowns in code can be amplified if that code is being run millions of times. Even though a dictionary lookup may only take several hundred nanoseconds, if we are looping millions of times over this lookup it can quickly add up. In fact, looking at Example 4-10 we see a 9.4% speedup simply by making the sin function local to the tight loop that calls it.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
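A reconstruction in the spirit of the tests discussed above, not the book's code verbatim (test4 is our name for the readable variant): test3 caches math.sin as a default keyword argument, while test4 hoists it into a local before the loop.

    import math

    def test3(iterations, sin=math.sin):  # lookup done once, at definition time
        result = 0.0
        for i in range(iterations):
            result += sin(i)              # local name: fast array lookup
        return result

    def test4(iterations):
        sin = math.sin                    # one global lookup per call, then local
        result = 0.0
        for i in range(iterations):
            result += sin(i)
        return result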
“Another benefit of the static nature of tuples is something Python does in the background: resource caching. Python is garbage collected, which means that when a variable isn’t used anymore Python frees the memory used by that variable, giving it back to the operating system for use in other applications (or for other variables). For tuples of sizes 1–20, however, when they are no longer in use the space isn’t immediately given back to the system, but rather saved for future use. This means that when a new tuple of that size is needed in the future, we don’t need to communicate with the operating system to find a region in memory to put the data, since we have a reserve of free memory already. While this may seem like a small benefit, it is one of the fantastic things about tuples: they can be created easily and quickly since they can avoid communications with the operating system, which can cost your program quite a bit of time.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
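The effect shows up in a simple timing, assuming CPython (the operands are variables so the tuple is actually constructed each time rather than constant-folded):

    import timeit

    setup = "a, b, c, d = 1, 2, 3, 4"
    t_tuple = timeit.timeit("(a, b, c, d)", setup=setup, number=10_000_000)
    t_list  = timeit.timeit("[a, b, c, d]", setup=setup, number=10_000_000)
    print(f"tuple: {t_tuple:.2f}s  list: {t_list:.2f}s")

The tuple wins in part because small tuples come from CPython's free list rather than a fresh allocation.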
“When a list of size N is first appended to, Python must create a new list that is big enough to hold the original N items in addition to the extra one that is being appended. However, instead of allocating N+1 items, M items are actually allocated, where M > N, in order to provide extra headroom for future appends. Then, the data from the old list is copied to the new list and the old list is destroyed. The philosophy is that one append is probably the beginning of many appends, and by requesting extra space we can reduce the number of times this allocation must happen and thus the total number of memory copies that are necessary.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
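The over-allocation is visible with sys.getsizeof: the reported size jumps in steps rather than by one slot per append (the exact step sizes vary across CPython versions).

    import sys

    items = []
    last = sys.getsizeof(items)
    for i in range(64):
        items.append(i)
        size = sys.getsizeof(items)
        if size != last:            # a resize just happened
            print(f"len={len(items):3d}  size={size} bytes")
            last = size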
“Rather than using xrange, we’ve used range — the diagram should make it clear that a large list of 1,000,000 elements is instantiated just for the purposes of generating an index, and that this is an inefficient approach that will not scale to larger list sizes (we will run out of RAM!). The allocation of the memory used to hold this list will itself take a small amount of time, which contributes nothing useful to this function.”
Micha Gorelick, High Performance Python: Practical Performant Programming for Humans
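This quote describes Python 2, where xrange was the lazy variant. In Python 3, range itself is lazy, which the sizes make plain:

    import sys

    n = 1_000_000
    lazy = range(n)           # Python 3: constant-size lazy sequence
    eager = list(range(n))    # materializes all one million entries

    print(sys.getsizeof(lazy))   # a few dozen bytes, regardless of n
    print(sys.getsizeof(eager))  # ~8 MB of pointers alone on a 64-bit build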