Keith McCormick's Blog, page 4
October 25, 2016
World of Watson 2016
I’m in Las Vegas again for the annual IBM Conference. Depending on how you count, it has been a few years, more than 10 years, or just one year. Many years ago, I came to the annual SPSS Inc. conferences that were held here. Attendance was in the hundreds, not thousands. Then they became IBM conferences that had a substantial SPSS presence in the wake of the 2009 purchase of SPSS Inc. by IBM. They also had a large Cognos presence, absorbing the earlier Cognos conferences that were held in Las Vegas.
Now, in only its second year, the World of Watson has been relocated here. Last year it was held for the first time in NYC. I did not attend, but it was a much smaller conference. This year it is the combination of all of these original conferences … and more. Even though 17,000 is small by Las Vegas standards, I am feeling the difference. You have to stay attentive to the agenda and move briskly between sessions or you can miss out. The sheer volume of brand names and technologies is intimidating, but there is no better way to keep informed than coming here.
For the second year, I will be blogging daily on the IBM Big Data and Analytics Hub.
October 4, 2015
The Data Scientist’s somewhat surprising role as Honest Broker and Change Agent
They say that you can’t be a prophet in your own land. As someone who is typically an outsider in the organizations where I consult, I find this to be true. I find that building a model is rarely more than 10-20% of the time I spend in front of the laptop, and fully a third of my time is not spent in front of a laptop at all. This is an explanation of what I find myself doing in all of those many hours that I am not using Data Mining software, or any software. What else is there to do?
Inspire Calm: I am often greeted with the admission that my new client’s Data Warehouse is not quite as complete, nor quite as sophisticated as they would like. No one’s is! It is interesting that it is one of the first facts that is shared with me because it implies that, if only they had the perfect Data Warehouse, the Data Mining project would be easy. Well, they are never easy. Important work is hard work, and no one really has a perfect Data Warehouse because IT has a hard job to do as well. So, the experienced Data Miner is in a good position to explain that the client really isn’t so far behind everyone else.
Advocate for the Analysis Team’s time within their department: Yes, this is a full time endeavor! It is surprising how often Data Mining is confused with ad hoc queries like “How many of X did we sell in Q1 in Region A?” I am not sure where this comes from, but new Data Miners are left wondering how they can perform all six stages of CRISP-DM in time for next Tuesday’s meeting. By the time an external consulting resource is involved this confusion is largely cleared up, but sometimes a little bit of it lingers. How can the internal members perform all of their ongoing functions, and commit to a full time multi-week effort? Of course, they can’t. A bit of realism often sinks in during the first week of a project. Much better addressed earlier than later.
Inspire loftier goals: Data Preparation is said to take 70-90% of the effort. Tom Khabaza has claimed in the 3rd of his 9 Laws of Data Mining that it always takes more than 50%. I have experienced little to convince me that this estimate is far off. The ‘let’s do something preliminary’ idea can be inefficient if you aren’t careful because on a daily basis one is making decisions about how the inputs interact. Refreshing the model on more recent data is straightforward, but if you substantively change the recipe of the variable gumbo that you are mining, you have to repeat a lot of work, and revisit a lot of decisions. It is possible, with careful planning, to minimize the impact, but you risk increasing (albeit not doubling) the data preparation time. It is ultimately best to communicate the importance of the endeavor, knock on doors, marshal resources, and do the most complete job you can right now.
Act as a liaison with IT: An almost universal truth is that IT has been warned that the Data Miner needs their data, but IT has not been warned that the Data Miner needs their time and attention. Of course, no one wants to be a burden to another team, but some additional burden is inevitable. The analyst about to embark on a Data Mining project is going to have unanswered questions or unfulfilled needs that are going to require the IT team. The external Data Mining resource will often have to explain to IT management that there is no way to completely eliminate this; that it is natural, and it is not the analysis team’s fault. Concurrent with that, the veteran Data Miner has to anticipate when the extra burden will occur, act to mitigate it, and try to schedule it as conveniently as possible.
Fight for project support (and data) from other departments: Certain players in the organization are expecting to be involved, like IT. Often the word has to get out that a successful Data Mining project is a top to bottom search for relevant data. Some will be surprised that it is a stone in their department that has been left unturned. They may not be pleased. Excited as they may be about the benefit that the entire company will derive, you are catching them at an inopportune moment as they leave for vacation, or as a critical deadline looms. Fair warning is always wise, and it should come early. Done properly, the key player in a highly visible project gets a little (not a lot of) political capital which they should spend carefully.
Help get everyone thinking about Deployment and ROI from the start: Far too often it is assumed that the analysts are in charge of the “insights”, and the management team, having received the magic PowerPoint slides, will pick it up from there, and ride the insights all the way to deployment and ROI. Has this ever happened? The Data Miner must coach, albeit gently, that a better plan must be in place, and the better planning must begin the very first week of a data mining project. Let executives play their critical role, but a little coaching is good for everyone. After all, it might be everyone’s first Data Mining project.
Fade into the background: Everyone wants credit for their hard work, but the wise Data Miner lets the project advocates and internal customers do all the talking at the valedictory meeting. The best place to be is on hand, but quiet. Frankly, if the Data Miner is still shoulder deep in the project, the project isn’t ready for a celebration. The “final” meeting, probably the first of many final meetings, should be about passing the torch, reporting initial (or estimated) ROI, and announcing deployment details.
May 29, 2015
Four training trends and the Ideal way to conduct SPSS Modeler training
I’ve been using SPSS Modeler since the late 90s when it was still called Clementine, and ISL had just been purchased by SPSS Inc. I don’t know exactly how many folks I’ve trained in Modeler, but I’m sure it is over a thousand. (The grand total I’ve trained in Modeler, SPSS Statistics, and tool neutral classes, is several times more.) There was a point 10-15 years ago when I was holding a public class about once a month and a private class about once a month. While the private classes tend to be small, some of the public classes were fairly large, and it was like that for years. It has changed a lot over that time, but with this much experience, some clear recent trends have emerged. After listing and discussing four of the trends, I will summarize with what I believe to be the very best way to teach Modeler.
1) Public Modeler classroom classes are increasingly rare.
Public classes have their place. I taught Introduction to Data Mining, which was a tool neutral SPSS Inc. class, to hundreds of folks. It was discontinued some years ago. (That material is actually part of the ‘Data Mining’ portion of the Intro Modeler class). I really enjoyed it. It worked because everyone came before they bought Modeler so that they knew enough of what it was about prior to sitting down with their sales rep. In some cases, perhaps because of budget, they decided to work with SPSS Statistics. A tool neutral Data Mining class can still work and is still useful and is still needed. The Modeling Agency, for which I sometimes train, has a good series of tool neutral classes, and they still attract a good public classroom and online audience. If you haven’t committed to a software solution yet, consider a tool neutral class first. However … if your organization has invested 10s of thousands of dollars in Modeler software, a public class doesn’t feel like the way to go. You need to protect that investment.
2) Self Paced Virtual Classroom (SPVC) is surprisingly popular.
Folks try to sign up for a live classroom public class, and it is cancelled. Then they sign up for an online class, and it is cancelled. (Seek out ‘Guaranteed to Run’ classes to avoid this problem. The best Global Training Providers (GTPs) have them.) Not sure what else to do, some folks seek out self paced “classes” as a substitute, but there is really no substitute for a trainer. The SPVCs are really just an e-book. You are largely on your own. If you are prepared for that, it may be a good option for you, but this is challenging material and the book is hundreds of pages long. Surprisingly, it is fairly expensive. For more specifics, I have a full SPSS Training page, and there is a lot about this option online.
3) Online instructor led public classes are increasing – great option with both strengths and weaknesses.
Online classes are popular, as they should be. They are less expensive to produce, and much of that savings gets passed on to the consumer. They are less likely to be cancelled as a result. You avoid travel expense. You get nearly as much interaction with your instructor as in a live public class. However … you will never get as much interaction as a private class. That is really the issue. How much did you just spend on the software? Do you really want to share the instructor’s time with others? A great idea, that few consider, is taking the Modeler Intro class, Introduction to IBM SPSS Modeler and Data Mining, with a group, but BEFORE buying the software. Sound crazy? Perhaps, but you would get an excellent introduction to the software to help you decide if you wanted to buy it. If you already own it, and nearly everyone buys it first, you should worry more about the big investment you have just made. Private instruction will probably be a better option.
4) Private classes are popular, but hiring a trainer for a completely custom class can be overwhelming and confusing.
Sounds perfect, doesn’t it? You have complete control. You made the investment in the software, and you are prepared to pull out all the stops in making your fully bespoke training experience the perfect one for you and your team. But you’ve never used Modeler before, so how do you know what you want? Do you want to use an ‘official’ training guide? Of course, right? But … wait a minute … doesn’t that mean that I am using ‘canned’ examples? Was the book written for my new and current software version? The full curriculum is about 10 days, but no one has that much time. So, should we do a week? Two days? Three days? How do we decide? How do we find a trainer? Do we go through IBM itself? Perhaps, you’ve been offered some small amount of free support – in that case do we even need a class now? Maybe we should just bring in a trainer to do the two day Introduction to IBM SPSS Modeler and Data Mining course. It is the official Intro, so isn’t that enough? Frankly, it probably isn’t. And all of these choices are difficult. The reason they are difficult is that each situation is unique. A typical Modeler project is going to involve between 1 and 6 employees for between 6 and 20 weeks. Why wouldn’t organizing the training take some effort? Why would you want to have your agenda held hostage by a book that was written for another purpose? And has your trainer used Modeler outside of the classroom – in the field? Are they familiar only with the Intro book? What if you have questions that are in one of the other books? (See the Training Page for a description of the full SPSS Modeler curriculum.)
The Solution
In the last two years, I’ve been innovating a solution which I think works for most organizations and that blends the best of all worlds: private custom instruction, online coaching, and do-it-yourself learning.
Step 1: Have a planning meeting with your instructor. Find out who they are if you have booked them through a third party. Google them. Interact with them. Ask them what you need to do to get the most value out of your training. 30 minutes may be plenty. Even a short email exchange is better than nothing. Insist on it. Whenever possible, arrange to have the training done with your data!
Step 2: If possible, do some reading or watch some videos before training. It might be just a chapter or two, but some preparation beforehand will help you get the best value out of training.
Step 3: Invite the trainer live to your location. It is worth the investment. Begin with a 60-90 minute training kickoff, and invite everyone that has an interest in the project. I’ve done this with as many as 30-40 folks in the room. Discuss possible projects. Discuss what the technology can do. Discuss the business problem. Invite the c-suite to come. Get the IT team to come. This can be a powerful experience. I’ve even had offsite folks join the classroom group remotely.
Step 4: Continue the live onsite training for 2-3 days. The Intro class used to be three days. Now it is two days. It is hard to say how much time is necessary as it depends on the nature of the first project. If it involves Modeler Premium then I suggest at least four days. If Modeler Gold, plan on even longer, and the training becomes more complex. For the majority, three days is sufficient. The first half day (after the kickoff meeting) should be generic using practice data sets while everyone is getting acclimated. As soon as possible, start doing examples with your data. By the third day, it should feel almost like a working session.
Step 5: Continue the training remotely for as few as a couple hours a week. This is really the innovation that makes it all work. The internet is stable. It is reliable. Screen shares are amazing now compared to ten years ago. It is almost like you are there. Once you establish rapport, it really works well. I suggest a ratio of about 8 hours of work to about 1 hour of consultation. For every 8 hours that a trainee has worked alone in Modeler, on their first project, they probably need about 1 hour of Q&A. So, with 20-60 hours of mentoring, a team, on their very first project can take on a serious real world project. Rather than requiring 100-300 hours of consulting, you can get away with about 10% of that cost in mentoring. There are risks. Most trainers can’t pull this off – they need to be veteran field tested consultants. On the very first project, you may want a return onsite visit, or more than just mentoring, but this gives you tremendous flexibility.
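To make the rule of thumb in Step 5 concrete, here is a minimal Python sketch of the arithmetic. It assumes only the roughly 8-to-1 ratio of solo work to consultation stated above; the function and constant names are hypothetical and chosen for illustration, and real projects will vary.

```python
# Rule of thumb from Step 5: about 1 hour of Q&A for every 8 hours
# a trainee works alone in Modeler. The ratio is approximate.
SOLO_HOURS_PER_MENTORING_HOUR = 8


def mentoring_hours(solo_hours: float) -> float:
    """Estimate the mentoring time needed for a given amount of solo work."""
    return solo_hours / SOLO_HOURS_PER_MENTORING_HOUR


def solo_hours_covered(mentoring: float) -> float:
    """Estimate how much solo work a mentoring budget can support."""
    return mentoring * SOLO_HOURS_PER_MENTORING_HOUR


if __name__ == "__main__":
    # The 20-60 hour mentoring range mentioned in Step 5.
    for budget in (20, 60):
        print(f"{budget} mentoring hours support about "
              f"{solo_hours_covered(budget):.0f} hours of solo work")
```

Under that ratio, the 20-60 hours of mentoring correspond to roughly 160-480 hours of the team's own time in Modeler on the first project.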
I’ve been through this style of training now several times. First, working within a consultancy as an executive consultant and training director, and since as an independent consultant/trainer. I still teach public classes, but I have been extremely impressed with what I have seen produced during a first project combined with the best possible training scenarios. Please contact me for more information about this approach, or for advice on any of the public Modeler classes. I’ve taught from all of the SPSS Modeler training guides many times. Whether we meet in a public training or private training setting (or in my consulting work) I’d be glad to help you navigate your choices.
May 27, 2015
Free Content in Support of SPSS Statistics For Dummies 3rd Edition
Today I discovered that the 3rd edition of SPSS Statistics For Dummies is already available on Amazon in Kindle format. Note that you don’t need a Kindle to use this format. I read Kindle titles on my laptop all the time. I often buy technical books in Kindle format so that I have access to them when I’m on the road. Jesus Salcedo and I put a ton of effort into this new edition. The last edition was five years old, and had somewhat more of a programming focus than we favored in an introductory title. Without the help of our colleague Aaron Poh, we could not have got this new edition done so soon after the release of version 23. My thanks go out to them both.
While you are perhaps waiting for the print edition, or you just want to check out some material right away, I’d like to draw your attention to the free content for the book:
The Cheat Sheet focuses on what we felt was 1) easy to forget, 2) especially important, or 3) likely to draw new users to useful new areas. If you are an established user, you would have to agree that Level of Measurement, for instance, is critical to get correct in newer versions because SPSS automatically makes assumptions about the three levels. We also point you in the right direction for charting, and list the most common Analyze menus. Finally, we review how to interpret significance using a simple T-Test as an example.
There are also ‘Web Extras’ for each of the book’s sections (except the opening introductory section). In the For Dummies world, these web extras are free content that complements the book, but is different content than the book. Get started here, and you will like what you find there. Then you can get the book, and it will have all new content. Keep in mind that the web extras should be easy to follow, but are not as basic as starting from the beginning of the book and reading the whole thing. I hope that eventually you will have a chance to check out all of the content.
II: Getting Data In and Out of SPSS
For this section, we chose a Web Extra that covers Automatic Recode. This is an easy trick if you have a variable with a bunch of string categories, but it has not been properly set up yet with pairs of values and labels. Much more about values and labels in Chapter 4, Entering and Defining Data.
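For readers who think in code, the idea behind Automatic Recode can be sketched in a few lines of Python. This is only an illustration of the concept (sorted string categories become consecutive integer codes, and the original strings become the labels); it is not SPSS syntax and not how SPSS implements the feature. The function name `auto_recode` and the sample data are hypothetical, and handling of blank or missing values is left out.

```python
def auto_recode(values):
    """Map string categories to consecutive integer codes.

    Categories are sorted alphabetically and numbered from 1; the
    original strings are kept as labels for the new codes.
    """
    categories = sorted(set(values))
    labels = {code: text for code, text in enumerate(categories, start=1)}
    lookup = {text: code for code, text in labels.items()}
    codes = [lookup[v] for v in values]
    return codes, labels


# Hypothetical string variable with repeated categories.
regions = ["West", "East", "South", "East", "West"]
codes, labels = auto_recode(regions)
print(codes)   # [3, 1, 2, 1, 3]
print(labels)  # {1: 'East', 2: 'South', 3: 'West'}
```

The result is the pairing of values and labels described above: a numeric variable for analysis, with the readable category names carried along as labels.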
III: Messing with the Data Once it is in there
In Chapter 10, Manipulating Files, we cover both kinds of Merge (merge add cases and merge add variables). In the Web Extra for Part III, we cover Aggregate.
IV: Graphing
Part IV is a fairly extensive introduction to SPSS Statistics charting and graphing. There is a lot available here. Much more than folks think. For the Web Extra, we chose to show a quite new and very cool feature – the Compare Groups Graph.
V: Analysis
The analysis section is a massive update to the 2nd edition. We placed much more emphasis in the 3rd edition on the generation and interpretation of statistics results. We do not get too advanced given the nature of an introductory text. We keep it brief, and easy to understand, but we do a thorough job on the basics. For instance, we covered Independent Samples T-Test in detail, but Paired Samples T-test is the Web Extra for this section.
VI: Settings, Templates, and Looks
SPSS has just tons of settings, and in the book we explain the default settings, and your other options. We explain how to use existing Table Looks in Part IV. In the Part VI Web Extra we show you how to make your own Table Look.
VII: Programming
We held back a bit on the programming content. We included what felt like a natural fit for this book. I have a lot more information on SPSS Statistics Programming however, including a great chapter in the forthcoming SPSS Statistics for Data Analysis and Visualization. Please contact me if you need more help on SPSS Programming. In the Web Extra we discuss Graphics Production Language (GPL). Very cool stuff.
VIII: Part of Tens
The Part of Tens in any For Dummies book is a lot of fun. It is a kind of “Top Ten” list. In the book, we offer up lists of Ten Modules, Ten Online Resources, and Ten Professional Development Projects. In the Web Extra we list our favorite Top Ten New Features in SPSS Statistics Version 23.
May 2, 2015
SPSS Advanced Statistical Procedures Companion
by Marija Norusis
I make a living as a consultant helping people understand their SPSS results, among other things. I have always been a fan of this author’s books, and I am glad I own this one, but make sure that you own the relevant modules, and will use these advanced techniques. See my review of an earlier edition.
Cluster Analysis
by Brian S. Everitt
Let’s face it. Few SPSS users need a 200+ page book on just cluster analysis. As a trainer, I am a happy owner of this book, but partly as a reference to look up rare questions. See my full review of an earlier edition.
Multivariate Statistical Analysis: A Conceptual Introduction
by Sam “Kash” Kachigan
As a statistics software trainer, most folks that I meet are not looking for a book that explains how to use a method by hand, without the computer.
Regression: A Primer
by Paul Allison
This starts with the very basics. What is correlation? What is regression? By the time it gets to multiple regression it is nearly over. ALL to its credit. It is eminently readable. It is non-technical and clear. It doesn’t have any SPSS step by step, but that is not the point of the book. Too basic for some, it is perfect for the novice. I benefited mostly from inspiration on how best to explain regression to others. Brief and relatively inexpensive, probably worth having on hand even if you are not a novice.
Discovering Statistics Using SPSS
by Andy Field
Overview
Best choice for the novice that is going to be studying Stats for a while. Includes plenty of Intermediate, and even some Advanced Material. I have come to the conclusion that if a serious user of SPSS’s statistical features is to get only one reference, this is it. Read more in my review of the 2nd Edition.
SPSS Survival Manual
by Julie Pallant
Overview
This book is often mentioned to me as a great introduction to SPSS. It is my least favorite of the three because I think it is for folks that are not going to continue in their study of statistics as I explain in my review of the 2nd Edition.