J. Bradford DeLong's Blog, page 308
September 8, 2018
Kieran Healy: Data Visualization: Weekend Reading
Kieran Healy: Data Visualization: A Practical Introduction: "You should look at your data. Graphs and charts let you explore and learn about the structure of the information you collect...
...Good data visualizations also make it easier to communicate your ideas and findings to other people. Beyond that, producing effective plots from your own data is the best way to develop a good eye for reading and understanding graphs (good and bad) made by others, whether presented in research articles, business slide decks, public policy advocacy, or media reports. This book teaches you how to do it.
My main goal is to introduce you to both the ideas and the methods of data visualization in a sensible, comprehensible, reproducible way. Some classic works on visualizing data, such as The Visual Display of Quantitative Information (Tufte, 1983), present numerous examples of good and bad work together with some general taste-based rules of thumb for constructing and assessing graphs. In what has now become a large and thriving field of research, more recent work provides excellent discussions of the cognitive underpinnings of successful and unsuccessful graphics, again providing many compelling and illuminating examples (Ware, 2008).
Other books provide good advice about how to graph data under different circumstances (Cairo, 2013; Few, 2009; Munzner, 2014), but choose not to teach the reader about the tools used to produce the graphics they show. This may be because the software used is some (proprietary, costly) point-and-click application that requires a fully visual introduction of its own, such as Tableau, Microsoft Excel, or SPSS. Or perhaps the necessary software is freely available, but showing how to use it is not what the book is about (Cleveland, 1994). Conversely, there are excellent cookbooks that provide code “recipes” for many kinds of plot (Chang, 2013). But for that reason they do not take the time to introduce the beginner to the principles behind the output they produce. Finally, we also have thorough introductions to particular software tools and libraries, including the one we will use in this book (Wickham, 2016). These can sometimes be hard for beginners to digest, as they may presuppose a background that the reader does not have.
Each of the books I have just cited is well worth your time. When teaching people how to make graphics with data, however, I have repeatedly found the need for an introduction that motivates and explains why you are doing something but that does not skip the necessary details of how to produce the images you see on the page. And so this book has two main aims. First, I want you to get to the point where you can reproduce almost every figure in the text for yourself. Second, I want you to understand why the code is written the way it is, such that when you look at data of your own you can feel confident about your ability to get from a rough picture in your head to a high-quality graphic on your screen or page.
What you will learn: This book is a hands-on introduction to the principles and practice of looking at and presenting data using R and ggplot. R is a powerful, widely used, and freely available programming language for data analysis. You may be interested in exploring ggplot after having used R before, or be entirely new to both R and ggplot and just want to graph your data. I do not assume you have any prior knowledge of R.
After installing the software we need, we begin with an overview of some basic principles of visualization. We focus not just on the aesthetic aspects of good plots, but on how their effectiveness is rooted in the way we perceive properties like length, absolute and relative size, orientation, shape, and color. We then learn how to produce and refine plots using ggplot2, a powerful, versatile, and widely used visualization library for R (Wickham, 2016). The ggplot2 library implements a “grammar of graphics” (Wilkinson, 2005). This approach gives us a coherent way to produce visualizations by expressing relationships between the attributes of data and their graphical representation.
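An aside to the excerpt, not taken from Healy's book: the "grammar of graphics" idea can be sketched without ggplot2 at all. The toy Python below is only an illustration under stated assumptions. The data rows are invented, and `linear_scale`, `render`, `geom_point`, and `geom_line` are hypothetical helpers (the last two merely borrow the names of ggplot2 layers). What it tries to show is the separation between data, a mapping from columns to visual properties, and layers that draw the mapped values.

```python
# Invented toy data; the column names are assumptions for illustration only.
data = [
    {"gdp": 1.0, "life": 55.0},
    {"gdp": 2.5, "life": 62.0},
    {"gdp": 4.0, "life": 70.0},
    {"gdp": 8.0, "life": 78.0},
]


def linear_scale(values, lo, hi):
    """Return a function mapping the range of `values` onto [lo, hi]."""
    vmin, vmax = min(values), max(values)
    span = (vmax - vmin) or 1.0  # avoid dividing by zero for constant columns
    return lambda v: lo + (v - vmin) / span * (hi - lo)


def geom_point(pts):
    """Layer: one circle per observation."""
    return "".join(f'<circle cx="{x:.1f}" cy="{y:.1f}" r="3"/>' for x, y in pts)


def geom_line(pts):
    """Layer: a line through the observations, ordered by x position."""
    path = " ".join(f"{x:.1f},{y:.1f}" for x, y in sorted(pts))
    return f'<polyline points="{path}" fill="none" stroke="black"/>'


def render(rows, mapping, layers, width=300, height=200):
    """Map columns to pixel positions, then let each layer draw them as SVG."""
    sx = linear_scale([r[mapping["x"]] for r in rows], 20, width - 20)
    # SVG's y axis grows downward, so the y scale is inverted.
    sy = linear_scale([r[mapping["y"]] for r in rows], height - 20, 20)
    pts = [(sx(r[mapping["x"]]), sy(r[mapping["y"]])) for r in rows]
    body = "".join(layer(pts) for layer in layers)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">{body}</svg>'
    )


# The same data and mapping, drawn with two layers stacked on one another.
svg = render(data, {"x": "gdp", "y": "life"}, [geom_point, geom_line])
print(svg)
```

Changing the plot means changing the mapping or the list of layers, not rewriting the drawing code, which is roughly the coherence the excerpt attributes to the grammar.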
Through a series of worked examples, you will learn how to build plots piece by piece, beginning with scatterplots and summaries of single variables, then moving on to more complex graphics. Topics covered include plotting continuous and categorical variables; layering information on graphics; faceting grouped data to produce effective “small multiple” plots; transforming data to easily produce visual summaries on the graph such as trend lines, linear fits, error ranges, and boxplots; creating maps, and also some alternatives to maps worth considering when presenting country- or state-level data. We will also cover cases where we are not working directly with a dataset, but rather with estimates from a statistical model. From there, we will explore the process of refining plots to accomplish common tasks such as highlighting key features of the data, labeling particular items of interest, annotating plots, and changing their overall appearance. Finally we will examine some strategies for presenting graphical results in different formats, and to different sorts of audiences.
If you follow the text and examples in this book, then by the end you will:
Understand the basic principles behind effective data visualization.
Have a practical sense for why some graphs and figures work well, while others may fail to inform or actively mislead.
Know how to create a wide range of plots in R using ggplot2.
Know how to refine plots for effective presentation.
Learning how to visualize data effectively is more than just knowing how to write code that produces figures from data. This book will teach you how to do that. But it will also teach you how to think about the information you want to show, and how to consider the audience you are showing it to, including the most common case, when the audience is yourself.
This book is not a comprehensive guide to R, or even a comprehensive survey of everything ggplot can do. Nor is it a cookbook containing just examples of specific things people commonly want to do with ggplot. (Both these sorts of books already exist: see the references in the Appendix.) Neither is it a rigid set of rules, or a sequence of beautifully finished examples that you can admire but not reproduce. My goal is to get you quickly up and running in R, making plots in a well-informed way, with a solid grasp of the core sequence of steps (taking your data, specifying the relationship between variables and visible elements, and building up images layer by layer) that is at the heart of what ggplot does.
Learning ggplot does mean getting used to how R works, and also understanding how ggplot connects to other tools in the R language. As you work your way through the book, you will gradually learn more about some very useful idioms, functions, and techniques for manipulating data in R. In particular you will learn about some of the tools provided by the tidyverse library that ggplot belongs to. Similarly, although this is not a cookbook, once you get past Chapter 1 you will be able to see and understand the code used to produce almost every figure in the book. In most cases you will also see these figures built up piece by piece, a step at a time.
If you use the book as it is designed, by the end you will have the makings of a version of the book itself, containing code you have written out and annotated yourself. And though we do not go into great depth on the topic of rules or principles of visualization, the discussion in Chapter 1 and its application throughout the book gives you more to think about than just a list of graph types. By the end of the book you should be able to look at a figure and be able to see it in terms of ggplot’s grammar, understanding how the various layers, shapes, and data are pieced together to make a finished plot.
The right frame of mind: It can be a little disorienting to learn a programming language like R, mostly because at the beginning there seem to be so many pieces to fit together in order for things to work properly. It can seem like you have to learn everything before you can do anything. The language has some possibly unfamiliar concepts that define how it works, like “object”, or “function”, or “class”. The syntactic rules for writing code are annoyingly picky. Error messages seem obscure; help pages are terse; other people seem to have had not quite the same issue as you.
Beyond that, you sense that doing one thing often involves learning a bit about some other part of the language. To make a plot you need a table of data, but maybe you need to filter out some rows, recalculate some columns, or just get the computer to see it is there in the first place. And there is also a wider environment of supporting applications and tools that are good to know about, but involve new concepts of their own: editors that highlight what you write; applications that help you organize your code and its output; ways of writing your code that let you keep track of what you have done. It can all seem a bit confusing.
Don’t panic.
You have to start somewhere. Starting with graphics is more rewarding than some of the other places you might begin, because you will be able to see the results of your efforts very quickly. As you build your confidence and ability in this area, you will gradually see the other tools as things that help you sort out some issue, or solve a problem that’s stopping you from making the picture you want. That makes them easier to learn. As you acquire them piecemeal (perhaps initially using them without completely understanding what is happening), you will begin to see how they fit together, and be more confident of your own ability to do what you need to do.
Even better, in the past decade or so the world of data analysis and programming generally has opened up in a way that has made help much easier to come by. Free tools for coding have been around for a long time, but in recent years what you might call the “ecology of assistance” has gotten better. There are more resources available for learning the various pieces, and more of them are oriented to the way writing code actually happens most of the time, which is to say, iteratively, in an error-prone fashion, and taking account of problems other people have run into and solved before.
How to use this book: This book can be used in any one of several ways. At a minimum, you can sit down and read it for a general overview of good practices in data visualization, together with many worked examples of graphics from their beginnings to a properly finished state. Even if you do not sit down and work through the code, you will get a good sense of how to think about visualization and a better understanding of the process through which good graphics are produced.
More usefully, if you set things up as described in Chapter 2, and then work through the examples (or bring your own data to explore instead of or alongside the examples, as described in Chapter 2), then you will end up with a data visualization book of your own. If you approach the book this way, then by the end you will be comfortable using ggplot in particular and also be ready to learn more about the R language in general.
This book can also be used to teach with, either as the main focus of a course on data visualization or as a supplement to undergraduate or graduate courses in statistics or data analysis. My aim has been to make the “hidden tasks” of coding and polishing graphs more accessible and explicit. I want to make sure you are not left with the “How to Draw an Owl in Three Steps” problem common to many tutorials. You know the one. The first two steps are shown clearly enough. Sketch a few bird-shaped ovals. Make a line for a branch. But the final step, an owl such as John James Audubon might have drawn, is presented as a simple extension for readers to figure out for themselves.
If you have never used R or ggplot, you should start at the beginning of the book and work your way through to the end. If you know about R already but only want to learn the core of ggplot, then after installing the software described below, focus on Chapters 3 through 5. Chapter 6 (on models) necessarily incorporates some material on statistical modeling that the book cannot develop fully. This is not a statistics text. So, for example, I show generally how to fit and work with various kinds of model in Chapter 6, but I do not go through the important details of fitting, selecting, and fully understanding different approaches. I provide references in the text to other books that have this material as their main focus.
Each chapter ends with a section suggesting where to go next (apart from continuing to read the book). Sometimes I suggest other books or websites to explore. I also ask questions or pose some challenges that extend the material covered in the chapter, encouraging you to use the concepts and skills you have learned....
Before you begin: The book is designed for you to follow along in an active way, writing out the examples and experimenting with the code as you go. You will be able to reproduce almost all of the plots in the text. You will need to install some software first. Here is what to do:
Get the most recent version of R. R is free and available for Windows, Mac, and Linux operating systems. Download the version of R compatible with your operating system from cloud.r-project.org. If you are running Windows or MacOS, you should choose one of the precompiled binary distributions (i.e., ready-to-run applications) linked at the top of the R Project’s webpage.
Once R is installed, download and install R Studio (rstudio.com). R Studio is an “Integrated Development Environment”, or IDE. This means it is a front-end for R that makes it much easier to work with. R Studio is also free, and available for Windows, Mac, and Linux platforms.
Install the tidyverse (tidyverse.org)....
With these packages available, you can then install one last library of material that’s useful specifically for this book. It is hosted on GitHub (github.com) rather than R’s central package repository, so we use a different function to fetch it. GitHub is a web-based service where users can host, develop, and share code. It uses git, a version control system that allows projects, or repositories, to preserve their history and incorporate changes from contributors in an organized way.
#shouldread
#book
#cognitivescience
#computerscience
#statistics
Peter Christensen and Christopher Timmins: Sorting or Steering: Experimental Evidence on the Economic Effects of Housing Discrimination: "Paired-tester audit experiments have revealed evidence of discrimination in the interactions between potential buyers and realtors, raising concern about whether certain groups are systematically excluded from the beneficial effects of healthy neighborhoods...
Nathan P. Kalmoe: Uses & Abuses of Ideology: "Ideology is a central construct in political psychology, and researchers claim large majorities of the public are ideological, but most fail to grapple with evidence of ideological innocence in most citizens...
...Using data from HUD's most recent Housing Discrimination Study and micro-level data on key attributes of neighborhoods in 28 US cities, we find strong evidence of discrimination in the characteristics of neighborhoods towards which individuals are steered. Conditional upon the characteristics of the house suggested by the audit tester, minorities are significantly more likely to be steered towards neighborhoods with less economic opportunity and greater exposures to crime and local pollutants. We find that holding locational preferences or income constant, discriminatory steering alone may contribute substantially to the disproportionate number of minority households found in high poverty neighborhoods in the United States. The steering effect is also large enough to fully explain the differential in proximity to Superfund sites among African American mothers. These results have important implications for studies of “neighborhood effects” and confirm an important mechanism underlying observed correlations between race and pollution in the environmental justice literature. Our results also suggest that the basic utility maximization assumptions underlying hedonic and residential sorting models may often be violated, resulting in an important distortion in the provision of local public goods...
#shouldread
September 7, 2018
Andrew Gelman: China Air Pollution Regression Discontinuity Update: "Avery writes: 'There is a follow up paper for the paper “Evidence on the impact of sustained exposure to air pollution on life expectancy from China’s Huai River policy” [by Yuyu Chen, Avraham Ebenstein, Michael Greenstone, and Hongbin Li].... “New evidence on the impact of sustained exposure to air pollution on life expectancy from China’s Huai River Policy”'.... The cleanest summary of my problems with that earlier paper is this article, 'Evidence on the deleterious impact of sustained use of polynomial regression on causal inference', written with Adam Zelizer...
...Here’s the key graph, which we copied from the earlier Chen et al. paper:
The most obvious problem revealed by this graph is that the estimated effect at the discontinuity is entirely the result of the weird curving polynomial regression, which in turn is being driven by points on the edge of the dataset. Looking carefully at the numbers, we see another problem which is that life expectancy is supposed to be 91 in one of these places (check out that green circle on the upper right of the plot), and, according to the fitted model, the life expectancy there would be 5 years higher, that is, 96 years!, if only they hadn’t been exposed to all that pollution. As Zelizer and I discuss in our paper, and I’ve discussed elsewhere, this is a real problem, not at all resolved by (a) regression discontinuity being an identification strategy, (b) high-degree polynomials being recommended in some of the econometrics literature, and (c) the result being statistically significant at the 5% level. Indeed, items (a), (b), (c) above represent a problem, in that they gave the authors of that original paper, and the journal reviewers and editors, a false sense of security which allowed them to ignore the evident problems in their data and fitted model.
We’ve talked a bit recently about “scientism,” defined as “excessive belief in the power of scientific knowledge and techniques.” In this case, certain conventional statistical techniques for causal inference and estimation of uncertainty have led people to turn off their critical thinking. That said, I’m not saying, nor have I ever said, that the substantive claims of Chen et al. are wrong. It could be that this policy really did reduce life expectancy by 5 years. All I’m saying is that their data don’t really support that claim. (Just look at the above scatterplot and ignore the curvy line that goes through it.)
OK, what about this new paper?... Anyway, I still don’t buy... their statistical claim that their data strongly support their scientific claim.... I feel like kind of a grinch saying this. After all, air pollution is a big problem, and these researchers have clearly done a lot of work with robustness studies etc. to back up their claims. All I can say is: (1) Yes, air pollution is a big problem so we want to get these things right, and (2) Even without the near-certainty implied by these 95% intervals excluding zero, decisions can and will be made. Scientists and policymakers can use their best judgment, and I think they should do this without overrating the strength of any particular piece of evidence...
Andrew Gelman and Adam Zelizer: Evidence on the Deleterious Impact of Sustained Use of Polynomial Regression on Causal Inference: "It is common in regression discontinuity analysis to control for third- or fifth-degree polynomials of the assignment variable...
...Such models can overfit, leading to causal inferences that are substantively implausible and that arbitrarily attribute variation to the high-degree polynomial or the discontinuity.... This study is indicative of a category of policy analyses where strong claims are based on weak data and methodologies which permit the researcher wide latitude in presenting estimated treatment effects. We then replicate a procedure from Green et al... to illustrate one practical problem with the regression discontinuity estimate... high-degree polynomials yield noisy estimates of treatment effects that do not accurately convey uncertainty. We recommend that (a) researchers consider the problems which may result from controlling for higher-order polynomials; and (b) that journals recognize that quantitative analyses of policy issues are often inconclusive and relax the implicit rule under which statistical significance is a condition for publication...
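The "noisy estimates" point can be checked with a small simulation. The sketch below is not Gelman and Zelizer's code and uses none of the Huai River data. It assumes a made-up setup: a linear trend with Gaussian noise and a true jump of exactly zero, fit by regressing the outcome on an intercept, a treatment indicator, and a single global polynomial in the assignment variable. The helper names (`ols`, `jump_estimate`, `spread`) are hypothetical.

```python
import random
import statistics


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y,
    solved by Gaussian elimination with partial pivoting."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(A[row][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for row in range(col + 1, k):
            factor = A[row][col] / A[col][col]
            for c in range(col, k + 1):
                A[row][c] -= factor * A[col][c]
    b = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(A[i][j] * b[j] for j in range(i + 1, k))
        b[i] = (A[i][k] - tail) / A[i][i]
    return b


def jump_estimate(xs, ys, degree):
    """Estimated jump at x = 0: the coefficient on the indicator (x > 0)
    when also controlling for x, x^2, ..., x^degree."""
    X = [
        [1.0, 1.0 if x > 0 else 0.0] + [x ** p for p in range(1, degree + 1)]
        for x in xs
    ]
    return ols(X, ys)[1]


def spread(degree, n_sims=300, n=50, seed=0):
    """Standard deviation of the estimated jump across simulated datasets
    in which the true jump is zero."""
    rng = random.Random(seed)
    xs = [-1 + (2 * i + 1) / n for i in range(n)]  # symmetric grid, never 0
    estimates = []
    for _ in range(n_sims):
        ys = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
        estimates.append(jump_estimate(xs, ys, degree))
    return statistics.stdev(estimates)


for degree in (1, 3, 5):
    print(f"degree {degree}: sd of estimated jump = {spread(degree):.3f}")
```

Because each degree is run with the same seed, all three fits see identical simulated datasets, so any difference in spread comes from the polynomial alone. The higher-degree terms are strongly correlated with the step indicator, which is why the jump coefficient gets less precise as the degree rises even though nothing about the data has changed.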
#shouldread
What I am going to be trying to figure out this weekend: Daniel M. Sullivan: Econtools Documentation: "0.1...
#shouldread
#python
#computers
#statistics
The New York Times has done an appallingly bad job in this age of Trump. What they say here about the bad job done by Republican politicians is all true. But I wish they would turn their scrutiny inward a bit: New York Times: The Cult of Trump: "Every now and again, someone sticks a neck out...
...Consider poor Representative Trey Gowdy. In 2015, the South Carolina Republican became a conservative darling as head of the House’s Benghazi inquiry. But last week, Mr. Gowdy, now chairman of the Oversight and Government Reform Committee, went on television and undercut the Spygate conspiracy theory that Mr. Trump has been peddling so vigorously. Mr. Gowdy not only batted down the term “spy,” but dared to defend the F.B.I. Quicker than you can say “collusion,” the congressman got dog-piled by Trump fans in the conservative media. On the heels of Mr. Gowdy, the House speaker, Paul Ryan, ventured forth this week with his own questioning of the Spygate fantasy. This may well signal growing unease among congressional Republicans with Mr. Trump’s conspiracy mongering. On the other hand, it’s probably not coincidental that Mr. Gowdy and Mr. Ryan have both announced they are retiring at the end of this term.
A week ago, John Boehner, the former House speaker, neatly captured the state of his party during a policy conference in Michigan. “There is no Republican Party,” he told the crowd. “There’s a Trump party. The Republican Party is kind of taking a nap somewhere.”
Sounds peaceful. But where will the party, not to mention the country, be when it finally wakes up?..."
#shouldread
#journamalism
Successful place-based policies require what we used to call "local boosters". One problem with so much of the so-called "Red States" is that the local rich are no longer boosters for their communities, and indeed no longer feel a part of the community in any meaningful way: Noah Smith: How to Save the Troubled American Heartland: "James Fallows and Deborah Fallows... notice a number of common approaches among towns that are on the mend. Two of these... universities and immigration...
...Creating a skilled workforce and making a town an attractive destination for companies looking to invest... is a function best served by community colleges and specialized public schools.... Universities’ function is different: they draw highly skilled individuals to a town, some of whom then start businesses and do other high-value work.... Immigrants, meanwhile, support a declining region’s tax base.... The authors describe a number of places where immigrants... provided a local labor force to lure business investment, and provided a shot of energy and cultural vitality.... Other successful approaches... local leaders who bring together government, business and nonprofits to carry out big projects.... I was reminded of Pike Powers, the consultant who helped create the public-private partnerships that made Austin, Texas, a world-class tech cluster.... The most successful cities are those where the government, the private sector and nonprofits all work in concert....
Although they don’t explicitly say it, Fallows and Fallows also provide a road map for how the American heartland needs to change. A landscape of small towns with populations in the hundreds or thousands needs to consolidate into a patchwork of small cities with populations in the tens of thousands. Small cities like the ones the authors visit offer much of the comfort, space and friendliness of the small-town atmosphere, while also taking advantage of agglomeration economies...
#shouldread
The Trumpists appear happy to lose, as long as they can convince themselves that others are losing more. So much "winning". May I have more of this "winning" please?: Edward Luce: The West Minus One: How Donald Trump Is Helping China: "Europe, Canada and Japan are united against Mr Trump’s trade belligerence...
...America First requires diplomatic skill. You need knowledge of those you want to divide and rule. Then you pick them off. Yet Mr Trump is doing the opposite.... Two explanations.... The first is that he is incompetent.... The second is that Mr Trump’s id is bigger than his ego.... The result is rolling confusion..."
#shouldread
#orangehairedbaboons
Ask Not For Whom the Global Warming Bell Tolls...: Live at Project Syndicate
Project Syndicate: Ask Not For Whom the Global Warming Bell Tolls...: Scarcely had I begun my first lecture of the fall semester here at the University of California, Berkeley, when I realized that I was too hot. I desperately wanted to take off my professorial tweed jacket. A tweed jacket is a wonderful but peculiar costume. If all you have for raw material is a sheep, it is the closest thing you can get to Gore-Tex.... Over the past 20 years, professorial garb has become increasingly uncomfortable, even here on the east side of the Bay. The climate now feels more like that of Santa Barbara.... The problems associated with global warming will be neither mere inconveniences, nor as far off as we would like to think. There are currently two billion near-subsistence farmers living in the six great river valleys of Asia, from the Yellow all the way around to the Indus. These farmers have limited means and few non-agricultural skills. It would not be easy for them to pick up and relocate.... The snow melt from the region’s high plateaus has always arrived at precisely the right moment, and in precisely the right volume.... Another billion people depend on the monsoon arriving at the right time, and in the right place.... Cyclones in the Bay of Bengal.... 250 million people living at or near sea level in the greater Ganges Delta, the world will face a long train of catastrophe. The international community is in no way prepared....
“No man,” nor nation, region, or country, “is an island entire of itself... And therefore never send to know for whom the bell tolls; it tolls for thee.” Read MOAR at Project Syndicate
#shouldread
#projectsyndicate
Thomas Jefferson had a... rather odd view of and degree of confidence in George Washington. George Washington ended his life pretty fed up with Thomas Jefferson: Thomas Jefferson: Letter to Walter Jones, 2 January 1814: "I do believe that Genl Washington had not a firm confidence in the durability of our government...
....He was naturally distrustful of men, and inclined to gloomy apprehensions; and I was ever persuaded that a belief that we must at length end in something like a British constitution had some weight in his adoption of the ceremonies of levees, birth-days, pompous meetings with Congress, and other forms of the same character, calculated to prepare us gradually for a change which he believed possible, and to let it come on with as little shock as might be to the public mind...
#shouldread
#history
#politics
I disagree with Simon here. He claims that "Alesina or Rogoff featured so much in... austerity... not because they were influential, but because they were useful to provide some intellectual credibility to the policy that politicians of the right wanted to pursue". It's not one or the other. They gave credibility. And because they gave credibility the media-political machine made them influential. And their influence was such that they neutralized the rest of us, who understood what was going on and were desperately trying to stop it: Simon Wren-Lewis: The biggest economic policy mistake of the last decade, and it had nothing to do with academic economists: "Reading the article brought back memories of my first year or two writing this blog, where I became part of a mainly US blog scene of mainstream academics opposed to austerity...
...led by Paul Krugman and Brad DeLong. We were trying to take down the academic arguments for austerity, and we succeeded. As Cooper’s article suggests it was not a very difficult task. Sometimes very senior economists who should have known better made simple mistakes of the kind I discussed here. On other occasions, like the predictions of massive inflation from Quantitative Easing that Cooper discussed, events quickly proved the Keynesians correct. Only in the case of the studies from the two pairs of Alesina and Ardagna and Reinhart and Rogoff was additional research required to challenge their conclusions. As far as us Keynesians were concerned, the intellectual battles were won by the end of 2012 if not before.... Paul De Grauwe’s... pointing to the lack of a sovereign lender of last resort, put an end to the academic credibility of “we are going to become like Greece” stories. When the ECB introduced OMT in September 2012 and the Eurozone debt crisis came to an end De Grauwe was proved right....
I want to add two important points that Cooper’s article does not cover. The first is that although by 2013 most academics had become convinced about the austerity mistake (it was always a minority view anyway), economic journalists in the non-partisan media could not recognise that because the politicians were continuing to implement the policy.... Voters were indeed being led astray by a malign or blinkered media, or at least a media that did not have the courage to call the result of the academic debate. The second point is that this academic debate had zero impact on politicians.... I wrote in 2012 that if all academics were united we might have an impact on public opinion, but that illusion did not last very long and Brexit showed it was indeed an illusion.... Economists can be influential, but only when politicians want to listen, or the media is prepared to confront them with academic knowledge....
The reason why economists like Alesina or Rogoff featured so much in the early discussion of austerity is not because they were influential, but because they were useful to provide some intellectual credibility to the policy that politicians of the right wanted to pursue. The influence of their work did not last long among academics, who now largely accept that there is no such thing as expansionary austerity or some danger point for debt. In contrast, the damage done by austerity does not seem to have done the politicians who promoted it much harm, in part because most of the media will keep insisting that maybe these politicians were right, but mainly because they are still in power....
#shouldread
