Jason Fried's Blog

September 25, 2020

We write code, not documents

Recently a student asked me:





Could you describe one instance where you had to use a diagramming tool (eg. Google Slides Drawings, Lucidcharts, Miro, Whimsical, Gliffy etc) to accomplish a task?





They also provided an example answer I could follow, which consisted of creating a chart to map a user flow, presenting it, getting feedback, adding it to another larger document, and creating Jira tickets.





I was a little surprised (though simultaneously not surprised) — is this how software development is still being taught?





Without being critical of academia, this seemed like a good opportunity to try to shatter by-the-book software development ideas for some future engineers by sharing a different way — our way.





Here was my answer (slightly edited for clarity and typos):





I hope this answer is helpful, but I actually don’t use a lot of diagramming tools, and I think it’s safe to say they’re not commonly used at Basecamp. We don’t write specs and stuff like that as they’re not “real” enough. We’ll do high-level sketches and rough drawings (usually pen and paper, or an iPad and a sketching app), but that’s typically it. More often I’ll grab a pen and paper and sketch out a rough flow of what I need to do, or write out pseudo code for the steps I need to take.
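
To give you a sense of what I mean, the pseudo code is usually nothing fancier than this (a made-up example, not a real Basecamp feature):

    # Made-up task: let someone archive a to-do list
    # 1. make sure they're allowed to
    # 2. mark the list archived and remember who did it
    # 3. tell the rest of the team, outside the request cycle
    def archive(list, by:)
      return unless by.can_edit?(list)                          # step 1
      list.update!(archived_at: Time.current, archived_by: by)  # step 2
      ListArchivedNotificationJob.perform_later(list)           # step 3
    end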









The reason we don’t get too formal about diagrams is that they’re often not close enough to reality. You can draw and draw, but until you start writing some code or implementing a design, you don’t know what you’re going to hit. Maybe you’ll hit a technical limitation, or maybe something just flat out doesn’t work. All the diagrams in the world don’t get you closer to finding out that stuff. So the sooner you get to real code, the sooner you’ll see what’s possible. Fast and iterative in real code is where you want to be, not locked into a document.





With regard to frequency, I’d say I do that about every day. Because it’s so lightweight, there’s no barrier to entry. I’ll draw out what I need to do or write out the steps and then get to it. Very often I’ll hit something unexpected and then redraw what I need to do. It’s fast and iterative, and nothing is written in stone (because in software nothing is ever written in stone!).





Whether you agree or disagree with our way of software development, it’s an honest assessment of how we’ve done things at Basecamp for a long time. And I’d say all things being equal, it’s worked out pretty well for us.


September 22, 2020

Demand Side Sales 101, a new book on sales by Bob Moesta.

Bob Moesta is a dear friend, mentor, and all around original thinker. He’s helped me see around corners, shine lights on things I didn’t know were there, and approach product development from unusual angles. Every time we talk, I come away inspired and full of optimism.





So when he asked me to help him with something, I jumped at the chance. In this case, it was writing a foreword for his new book Demand-Side Sales 101: Stop Selling and Help Your Customers Make Progress. Bob and I have talked sales for years, and I’m so pleased his ideas are finally collected in one place, in a form anyone can absorb. I highly recommend buying the book, reading the book, absorbing the book, and putting some new ideas in your head.





To get you started, here’s my foreword in its full form:









I learned sales at fifteen.





I was working at a small shoe store in Deerfield, Illinois, where I grew up. I loved sneakers. I was a sneakerhead before that phrase was coined.





I literally studied shoes. The designs, the designers, the brands, the technologies, the subtle improvements in this year’s model over last year’s.





I knew it all, but there was one thing I didn’t know: nothing I knew mattered. Sure it mattered to me, but my job was to sell shoes. I wasn’t selling shoes to sneakerfreaks like me, I was selling shoes to everyday customers. Shoes weren’t the center of their universe.





And I wasn’t alone. The companies that made the shoes didn’t have a clue how to sell shoes either.





Companies would send in reps to teach the salespeople all about the new models. They’d rattle off technical advancements. They’d talk about new breakthroughs in ethylene-vinyl acetate (EVA) which made the shoes more comfortable.





They’d talk about flex grooves and heel counters and Texon boards. Insoles, outsoles, midsoles.





And I’d be pumped. Now I knew everything I needed to know to sell the hell out of these things.





But when customers came in, and I demonstrated my mastery of the subject, they’d leave without buying anything. I could show off, but I couldn’t sell.





That changed when my manager encouraged me to shut up, watch, and listen. Give people space, observe what they’re interested in, keep an eye on their behavior, and be genuinely curious about what they wanted for themselves, not what I wanted for them. Essentially, stop selling and start listening.





I noticed that when people browsed shoes on a wall, they’d pick a few up and bounce them around in their hand to get a sense of the heft and feel. Shoes go on your feet, but people picked the shoe with their hands. If it didn’t feel good in the hand, it never made it to their foot.





I noticed that if someone liked a shoe, they put it on the ground next to their foot. They didn’t want to try it on yet, they simply wanted to see what it looked like from above. Companies spend all this time making the side of the shoe look great, but from the wearer’s perspective, it’s the top of the shoe against their pants (or socks or legs) that seems to have an outsized influence on the buying decision.





I noticed that when people finally got around to trying on a shoe, they’d lightly jump up and down on it, or move side-to-side, simulating some sort of pseudo physical activity. They were trying to see if the shoe “felt right.” They didn’t care what the cushioning technology was, only that it was comfortable. It wasn’t about if it “fit right,” it was about if it “rubbed wrong” or “hurt” or felt “too hard.”





And hardly anyone picked a shoe for what it was intended for. Runners picked running shoes, sure, but lots of people picked running shoes to wear all day. They have the most cushion, they’re generally the most comfortable. And lots of people picked shoes purely based on color. “I like green” was enough to turn someone away from a blue shoe that fit them better.





Turns out, people had different reasons for picking shoes. Different reasons than my reasons, and far different reasons than the brand’s reasons. Hardly anyone cared about this foam vs. that foam, or this kind of rubber vs. that kind. They didn’t care about the precise weight, or that this brand shaved 0.5oz off the model this year compared to last. They didn’t care what the color was called, only that they liked it (or didn’t). The technical qualities weren’t important – in fact, they were irrelevant.





I was selling all wrong.





And that’s really what this book is about. The revelation that sales isn’t about selling what you want to sell, or even what you, as a salesperson, would want to buy. Selling isn’t about you. Great sales requires a complete devotion to being curious about other people. Their reasons, not your reasons. And it’s surely not about your commission, it’s about their progress.





Fast forward twenty-five years.





Today I don’t sell shoes, I sell software. Or do I?





It’s true that I run a software company that makes project management software called Basecamp. And so, you’d think we sell software. I sure did! But once you meet Bob Moesta and Greg Engle, you realize you probably don’t sell what you think you sell. And your customers probably don’t think of you the way you think of yourself. And you almost certainly don’t know who your competition really is.





Over the years, Bob’s become a mentor to me. He’s taught us to see with new eyes and hear with new ears. To go deeper. To not just take surface answers as truth. But to dig for the how and the why—the causation. To understand what really moves someone to want to make a move. To understand the events that drive the purchase process, and to listen intently to the language customers use when they describe their struggles. To detect their energy and feel its influence on their decisions.





Everyone’s struggling with something, and that’s where the opportunity lies to help people make progress. Sure, people have projects, and software can help people manage those projects, but people don’t have a “project management problem.” That’s too broad. Bob taught us to dig until we hit a seam of true understanding. Project management is a label, it’s not a struggle.





People struggle to know where a project stands. People struggle to maintain accountability across teams. People struggle to know who’s working on what, and when those things will be done. People struggle with presenting a professional appearance with clients. People struggle to keep everything organized in one place so people know where things are. People struggle to communicate clearly so they don’t have to repeat themselves. People struggle to cover their ass and document decisions, so they aren’t held liable if a client says something wasn’t delivered as promised. That’s the deep down stuff, the real struggles.





Bob taught us how to think differently about how we talk, market, and listen. And Basecamp is significantly better off for it. We’ve not only changed how we present Basecamp, but we’ve changed how we build Basecamp. We approach design and development differently now that we know how to dig. It’s amazing how things can change once you see the world through a new lens.





Sales is everything. It’s survival. From selling a product, to selling a potential hire on the opportunity to join your company, to selling an idea internally, to selling your partner on this restaurant vs. that one, sales touches everything. If you want to be good at everything else, you better get good at this. Bob and Greg will show you how.





September 11, 2020

Options, Not Roadmaps

Since Shape Up came out, many people have asked some version of this question:





I understand you make bets six weeks at a time. But how do you plan in the longer term? Don’t you have some kind of a roadmap?





The short answer is: no. We don’t have roadmaps. We think about what to do at the timescale larger than single bets, but we do it in a different way.





Why not a roadmap?



No matter how you try to hedge it, a roadmap communicates a plan—a series of commitments—to other people.





We of course have lots of ideas about what we’re going to do next. But we don’t want to make any commitments, internally or externally. This is very important.





What’s wrong with committing to future work?





First, there’s the uncertainty. We don’t have a crystal ball. Say we have features A, B, and C we want to build. We don’t know if A is going to work out as planned, and what that would mean for B. We don’t know if we’ll have a eureka in the bathtub one day, and new idea X will suddenly feel much more important than B or C. Or we might start building out B, only to realize that it’s not what we want or it’s harder than we thought and want to bail on it.





In short, we don’t know enough to make good on any promises.





Second, there are expectations. Leadership might be ok with changing course, but what about everyone who saw the roadmap and was eagerly awaiting feature C? What about the conversations with customers where someone in the company assured them to just hold tight because C is coming? Despite our best intentions, if we say we’re going to do something, it’s going to be really hard to back out of that, both internally and externally.





Third, there’s the guilt. Yeah, guilt. Have you ever looked at a long list of things you said you were going to do but haven’t gotten around to yet? How does that list make you feel? The realities of life and uncertainty show us that 100% of the things on the roadmap are not going to happen on time the way we imagine. And meanwhile, the world is not going to stop and wait for us to finish the old list. New ideas are constantly coming up. New requests and new problems constantly arise. If we hold ourselves completely to the roadmap, we’ll have to say no to new important things we actually want to do. And if we interrupt the roadmap to do those new important things, we’ll have to push back other things we promised. And that won’t feel good.





Our solution was to stop making commitments and start collecting options.





A portfolio of options



An option is something you can do but don’t have to do. All our product ideas are exactly that: options we may exercise in some future cycle—or never.





Without a roadmap, without a stated plan, we can completely change course without paying a penalty. We don’t set any expectations internally or externally that these things are actually going to happen.





That means no explicit promises and no implicit promises either. A list on the wall or in some official strategy document is an implicit promise: “this is what we’re doing next.” There is no official list of what we’re doing next anywhere in the company.





When Jason (CEO) and David (CTO) decided the company would spend X cycles building out HEY, they didn’t have a roadmap. They had what they thought was a good set of options. There were enough good ideas for how to flesh out the app that they felt confident saying “we’ll dedicate X cycles to this.” They decided on which actual ideas to build one cycle at a time.





The overwhelming majority of our good ideas have never been built and never will be. There are things we have badly wanted to build for years that still don’t exist. Why? Something always came up. And that’s ok!





Because we aren’t committing to a roadmap, we aren’t setting expectations. And because we don’t set expectations, we don’t feel guilty when that great idea never gets any build time because we decided something else was more important.





September 4, 2020

Inside a CODE RED: Network Edition

I wanted to follow up on Jeremy’s post about our recent outages with a deeper, more personal look behind the scenes. We call our major incident response efforts “CODE REDs” to signify an all-hands-on-deck event, and this definitely qualified. I want to go beyond the summary and help you see how an event like this unfolds over time. This post is meant both for people who want a deeper, technical understanding of the outage and for those who want some insight into the human side of incident management at Basecamp.









The Prologue



The seeds of our issues this week started a few months ago. Two unrelated events started the ball rolling. The first event was a change in our networking providers. We have redundant metro links between our primary datacenter in Ashburn, VA and our other DC in Chicago, IL. Our prior vendor had been acquired and the new owner wanted us to change our service over to their standard offering. We used this opportunity to resurvey the market and decided to make a change. We ran the new provider alongside the other for several weeks. Then, we switched over entirely in late June.





The second event occurred around this same time when a security researcher notified us of a vulnerability. We quickly found a workaround for the issue by setting rules on our load balancers. These customizations felt sub-optimal and somewhat brittle. With some further digging, we discovered a new version of load balancer firmware that had specific support for eliminating the vulnerability and we decided to do a firmware upgrade. We first upgraded our Chicago site and ran the new version for a few weeks. After seeing no issues, we updated our Ashburn site one month ago. We validated the vulnerability was fixed and things looked good.





Incident #1



Our first incident began on Friday, August 28th at 11:59AM CDT. We received a flood of alerts from PagerDuty, Nagios, and Prometheus. The Ops team quickly convened on our coordination call line. Monitoring showed we lost our newer metro link for about 20-30 seconds. Slow BC3 response times continued despite the return of the network. We then noticed chats and pings were not working at all. Chat reconnections were overloading our network and slowing all of BC3. Since the problem was clearly related to chat, we restarted the Cable service. This didn’t resolve the connection issues. We then opted to turn chat off at the load balancer layer. Our goal was to make sure the rest of BC3 stabilized. The other services did settle as hoped. We restarted Cable again with no effect. Finally, as the noise died down, we noticed a stubborn alert for a single Redis DB instance.





Initially, we overlooked this warning because the DB was not down. We probed it from the command line and it still responded. We kept looking and finally discovered replication errors on a standby server and saw the replica was stuck in a resynchronization loop. The loop kept stealing resources and slowing the primary node. Redis wasn’t down, but it was so slow that it was only responding to monitoring checks. We restarted Redis on the replica and saw immediate improvement. BC3 soon returned to normal. Our issue was not a novel Redis problem, but it was new to us. You can find much more detail here.
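
As an aside for fellow operators: a check along these lines would have pointed us at the problem much sooner. This is a rough sketch, not our actual monitoring (the connection details and thresholds are made up). It watches the primary’s full-resync counter, which climbs steadily when a replica is stuck in a resync loop.

    require "redis"  # redis-rb

    # Rough sketch, not production monitoring. A steadily climbing "sync_full"
    # counter on the primary means replicas keep falling back to full
    # resynchronization -- the loop that was quietly strangling this node.
    redis = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379"))

    before = Integer(redis.info("stats").fetch("sync_full"))
    sleep 60
    delta = Integer(redis.info("stats").fetch("sync_full")) - before

    if delta > 2
      puts "CRITICAL: #{delta} full resyncs in the last minute -- possible resync loop"
      exit 2
    end
    puts "OK: #{delta} full resyncs in the last minute"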





The Postmortem



The big question lingering afterward was “how can a 30 second loss of connectivity on a single redundant networking link take down BC3?” It was clear that the replication problem caused the pain. But, it seemed out of character that dropping one of two links would trigger this kind of Redis failure. As we went through logs following the incident, we were able to see that BOTH of our metro links had dropped for short periods. We reached out to our providers in search of an explanation. Early feedback pointed to some sub-optimal BGP configuration settings. But, this didn’t fully explain the loss of both circuits. We kept digging.





This seems as good a time as any for the confessional part of the story. Public postmortems can be challenging because not all of the explanations look great for the people involved. Sometimes, human error contributes to service outages. In this case, my own errors in judgement and lack of focus came into play. You may recall we tripped across a known Redis issue with a documented workaround. I created a todo for us to make those configuration changes to our Redis servers. The incident happened on a Friday when all but 2 Ops team members were off for the day. Mondays are always a busy, kick-off-the-week kind of day, and it was also when I started my oncall rotation. I failed to make sure that config change was clearly assigned or finished with the sense of urgency it deserved. I’ve done this for long enough to know better. But, I missed it. As an Ops lead and active member of the team, every outage hurts. But this one is on me, and it hurts even more.





Incident #2



At 9:39AM on Tuesday, 9/01, the unimaginable happened. Clearly, it isn’t unimaginable, and a repeat now seems inevitable. But, this was not our mindset on Tuesday morning. Both metro links dropped for about 30 seconds and Friday began to repeat itself. We can’t know if the Redis config changes would have saved us because they had not been made (you can be sure they are done now!). We recognized the problem immediately and sprang into action. We restarted the Redis replica and the Cable service. It looked like things were returning to normal 5 minutes after the network flap. Unfortunately, our quick response during peak load on a Tuesday had unintended consequences. We saw a “thundering herd” of chat reconnects hit our Ashburn DC, and the load balancers couldn’t handle the volume. Our primary load balancer locked up under the load and the secondary tried to take over. The failover didn’t register with the downstream hosts in the DC, and we were down in our primary DC. This meant BC3, BC2, basecamp.com, Launchpad, and supporting services were all inaccessible. We attempted to turn off network connections into Ashburn, but our chat ops server was impacted and we had to manually reconfigure the routers to disable anycast. Managing problems at peak traffic on a Tuesday is much different from managing them on a Friday.





We begin moving all of our services to our secondary DC in Chicago. We move BC3 completely. While preparing to move BC2 and Launchpad, we apply the manual router changes and the network in Ashburn settles. We decide to stop all service movement and focus on stability for the rest of the day. That night, after traffic dies down, we move all of our services back to their normal operating locations.





One new piece of the puzzle drops into place. The second round of network drops allowed our providers to watch in real time as events unfolded. We learn that both of our metro links share a physical path in Pennsylvania, which was affected by a fiber cut. Despite our redundant links, a single fiber cut in the middle of Pennsylvania could still take us down unexpectedly. This was as much a surprise to us as it was to our providers. At least we could now make concrete plans to remove this new problem from our environment.





Incident #3



We rotate oncall shifts across the Ops team. As 2020 would have it, this was my week. After a late night of maintenances, I hoped for a slow Wednesday morning. At 6:55AM CDT on 9/2, PagerDuty informed me of a different plan. Things were returning to normal by the time I got set up. We could see our primary load balancer had crashed and failed over to the secondary unit. This caused about 2 minutes of downtime across most of our Basecamp services. Thankfully, the failover went smoothly. We immediately shipped the core dump file to our load balancer vendor and started combing logs for signs of unusual traffic. This felt the same as Incident #2, but the metrics were all different. While there had been a rise in CPU on the load balancers, it was nowhere near the 100% utilization of the day before. We wondered about Cable traffic – mostly because of the recent issues. There was no sign of a network flap. We looked for evidence of a bad load balancer device or other network problem. Nothing stood out.





At 10:49AM, PagerDuty reared its head again. We suffered a second load balancer failover. Now we are back at peak traffic and the ARP synchronization on downstream devices fails. We are hard down for all of our Ashburn-based services. We decide to disable anycast for BC3 in Ashburn and run only from Chicago. This is again a manual change that is hampered by high load, but it does stabilize our services. We send the new core file off to our vendor and start parallel work streams to get us to some place of comfort.





These separate threads spawn immediately. I stay in the middle of coordinating between them while updating the rest of the company on status. Ideas come from all directions and we quickly prioritize efforts across the Ops team. We escalate crash analysis with our load balancer vendor. We consider moving everything out of Ashburn. We expedite orders for upgraded load balancers. We prep our onsite remote hands team for action. We start spinning up virtual load balancers in AWS. We dig through logs and problem reports looking for any sign of a smoking gun. Nothing emerges … for hours.





Getting through the “waiting place” is hard. On the one hand, systems were pretty stable. On the other hand, we had been hit hard with outages for multiple days and our confidence was wrecked. There is a huge bias to want to “do something” in these moments. There was a strong pull to move out of Ashburn to Chicago. Yet, we have the same load balancers with the same firmware in Chicago. While Chicago has been stable, what if it is only because it hasn’t seen the same load? We could put new load balancers in the cloud! We’ve never done that before, and while we know what problem that might fix – what other problems might it create? We wanted to move the BC3 backend to Chicago – but this process guaranteed a few minutes of customer disruption when everyone was on shaky ground. We call our load balancer vendor every hour asking for answers. Our supplier tells us we won’t get new gear for a week. Everything feels like a growing list of bad options. Ultimately, we opt to prioritize customer stability. We prepare lots of contingencies and rules for when to invoke them. Mostly, we wait. It seemed like days.





By now, you know that our load balancer vendor confirms a bug in our firmware. There is a workaround that we can apply through a standard maintenance process. This unleashes a wave of conflicted feelings. I feel huge relief that we have a conclusive explanation that doesn’t require days of nursing our systems, alongside massive frustration over a firmware bug that shows up twice in one day after weeks of running smoothly. We set the emotions aside and plan out the remaining tasks. Our services remain stable during the day. That evening, we apply all our changes and move everything back to its normal operating mode. After some prodding, our supplier manages to air-ship our new load balancers to Ashburn. Movement feels good. The waiting is the hardest part.





The Aftermath



TL;DR: Multiple problems can chain into several painful, embarrassing incidents in a matter of days. I use those words to truly express how this feels. These events are now understandable and explainable. Some aspects were arguably outside of our control. I still feel pain and embarrassment. But we move forward. As I write this, the workarounds appear to be working as expected. Our new load balancers are being racked in Ashburn. We proved our primary metro can go down without issues since the vendor had a maintenance on their problematic fiber just last night. We are prepping tools and processes for handling new operations. Hopefully, we are on a path to regain your trust.





We have learned a great deal and have much work ahead of us. A couple of things stand out. While we have planned redundancy into our deployments and improved our live testing over the past year, we haven’t done enough and have had a false sense of security around that – particularly when running at peak loads. We are going to get much more confidence in our failover systems and start proving them in production at peak load. We have some known disruptive failover processes that we hope to never use and will not run during the middle of your day. But, shifting load across DCs or moving between redundant networking links should happen without issue. If that doesn’t work, I would rather know in a controlled environment with a full team at the ready. We also need to raise our sense of urgency for rapid follow up on outage issues. That doesn’t mean we just add them to our list. We need to explicitly clear room for post-incident action. I will clarify the priorities and explicitly push out other work.





I could go on about our shortcomings. However, I want to take time to highlight what went right. First off, my colleagues at Basecamp are truly amazing. The entire company felt tremendous pressure from this series of events. But, no one cracked. Calmness is my strongest recollection from all of the long calls and discussions. There were plenty of piercing questions and uncomfortable discussions, don’t get me wrong. The mood, however, remained a focused, respectful search for the best path forward. This is the benefit of working with exceptional people in an exceptional culture. Our redundancy setup did not prevent these outages. It did give us lots of room to maneuver. Multiple DCs, a cloud presence, and networking options allowed us to use and explore lots of recovery options in a scenario we had not seen before. You might have noticed that HEY was not impacted this week. If you thought that was because it runs in the cloud, you are not entirely correct. Our outbound mail servers run in our DCs, so no mail actually sends from the cloud. Our redundant infrastructure isolated HEY from any of these Basecamp problems. We will keep adapting and working to improve our infrastructure. There are more gaps than I would like. But, we have a strong base.





If you’ve stuck around to the end, you are likely a longtime Basecamp customer or perhaps a fellow traveller in the operations realm. For our customers, I just want to say again how sorry I am that we were not able to provide the level of service you expect and deserve. I remain committed to making sure we get back to the standard we uphold. For fellow ops travelers, you should know that others struggle with the challenges of keeping complex systems stable and wrestling with feelings of failure and frustration. When I said there was no blaming going on during the incident, that isn’t entirely true. There was a pretty serious self-blame storm going on in my head. I don’t write this level of personal detail as an excuse or to ask for sympathy. Instead, I want people to understand that humans run Internet services. If you happen to be in that business, know that we have all been there. I have developed a lot of tools to help manage my own mental health while working through service disruptions. I could probably write an entire post on that topic. In the meantime, I want to make it clear that I am available to listen and help anyone in the business that struggles with this. We all get better by being open and transparent about how this works.





September 2, 2020

Three Basecamp outages. One week. What happened?

Basecamp has suffered through three serious outages in the last week, on Friday, August 28th, on Tuesday, September 1, and again today. It’s embarrassing, and we’re deeply sorry.





This is more than a blip or two. Basecamp has been down during the middle of your day. We know these outages have really caused issues for you and your work. We’ve put you in the position of explaining Basecamp’s reliability to your customers and clients, too.





We’ve been leaning on your goodwill and we’re all out of it.





Here’s what has happened, what we’re doing to recover from these outages, and our plan to get Basecamp reliability back on track.





What happened







Friday, August 28





What you saw: Basecamp 3 Campfire chat rooms and Pings stopped loading. You couldn’t chat with each other or your teams for 40 minutes, from 12:15pm to 12:55pm Central Time (17:15–17:55 UTC). Incident timeline.

What we saw: We have two independent, redundant network links that connect our two redundant datacenters. The fiber optic line carrying one of the network links was cut in a construction incident. No problem, right? We have a redundant link! Not today. Due to a surprise interdependency between our network providers, we lost the redundant link as well, resulting in a brief disconnect between our datacenters. This led to a failure in our cross-datacenter Redis replication when we exceeded the maximum replication buffer size, triggering a catastrophic replication resync loop that overloaded the primary Redis server and caused very slow responses. This took Basecamp 3 Campfire chats and Pings out of commission.
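
For the technically curious, the levers involved in that failure mode are Redis’s replication backlog and the replica output-buffer limit. The snippet below sketches the general shape of the tuning that keeps a brief disconnect from snowballing into endless full resyncs; the sizes are illustrative, not our production values.

    # redis.conf (illustrative sizes only)
    # Keep enough backlog that a short network blip heals with a cheap
    # partial resync instead of forcing a full one.
    repl-backlog-size 512mb
    repl-backlog-ttl 3600

    # Give a resyncing replica more output-buffer headroom before the primary
    # gives up and disconnects it -- that disconnect is what restarts the loop.
    client-output-buffer-limit replica 1gb 512mb 120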



Tuesday, September 1





What you saw: You couldn’t load Basecamp at all for 17 minutes, from 9:51am to 10:08am Central Time (14:51–15:08 UTC). Nothing seemed to work. When Basecamp came back online, everything seemed back to normal. Incident timeline.

What we saw: Same deal, with a new twist. Our network links went offline, taking down Basecamp 3 Campfire chats and Pings again. While recovering from this, one of our load balancers (a hardware device that directs Internet traffic to Basecamp servers) crashed. A standby load balancer picked up operations immediately, but that triggered a third issue: our network routers failed to automatically synchronize with the new load balancer. That required manual intervention, extending the outage.



Wednesday, September 2





What you saw: You couldn’t load Basecamp for 15 minutes, from 10:50am to 11:05am Central Time (15:50–16:05 UTC). When Basecamp came back online, chat messages felt slow and sluggish for hours afterward. Incident timeline.

What we saw: Earlier in the morning, the primary load balancer in our Virginia datacenter crashed again. Failover to its secondary load balancer proceeded as expected. Later that morning, the secondary load balancer also crashed and failed back to the former primary. This led to the same desynchronization issue from yesterday, which again required manual intervention to fix.



All told, we’ve tickled three obscure, tricky issues in a 5-day span that led to overlapping, interrelated failure modes. These woes are what we plan for. We detect and avert these sorts of technical issues daily, so this was a stark wake-up call: why not today? We’re working to learn why.





What we’re doing to recover from these outages



We’re working multiple options in parallel to recover and manage any contingencies in case our recovery plans fall through.





- We’re getting to the bottom of the load balancer crash with our vendor. We have a preliminary assessment and bugfix.
- We’re replacing our hardware load balancers. We’ve been pushing them hard. Traffic overload is a driving factor in one outage.
- We’re rerouting our redundant cross-datacenter network paths to ensure proper circuit diversity, eliminating the surprise interdependency between our network providers.
- As a contingency, we’re evaluating moving from hardware to software load balancers to decrease provisioning time. When a hardware device has an issue, we’re days out from a replacement. New software can be deployed in minutes.
- As a contingency, we’re evaluating decentralizing our load balancer architecture to limit the impact of any one failure.



What we’re doing to get our reliability back on track



We engineer our systems with multiple levels of redundancy & resilience precisely to avoid disasters like this one, including practicing our response to catastrophic failures within our live systems.





We didn’t catch these specific incidents. We don’t expect to catch them all! But what catches us by surprise are cascading failures that expose unexpected fragility and difficult paths to recovery. These, we can prepare for.





We’ll be assessing our systems for resilience, fragility, and risk, and we’ll review our assessment process itself. We’ll share what we learn and the steps we take with you.





We’re sorry. We’re making it right.



We’re really sorry for the repeated disruption this week. One thing after another. There’s nothing like trying to get your own work done and your computer glitching out on you or just not cooperating. This one’s on us. We’ll make it right.





We really appreciate all the understanding and patience you’ve shown us. We’ll do our best to earn back the credibility and goodwill you’ve extended to us as we get Basecamp back to rock-solid reliability. Expect Basecamp to be up 24/7.





As always, you can follow along with live updates about Basecamp status here and follow the play-by-play on Twitter, and get in touch with our support team anytime.





August 20, 2020

How Basecamp Became a 100% Remote Company

Moving is never fun. It’s bad enough when it’s your stuff, but ten years of stuff at an office you only spent two years in can be daunting! I’m Navid, and part of my job at Basecamp the last two years has been taking care of our office in Chicago. As folks outside of Basecamp learned of our impending office closure, I began to get some questions. The most common being “what did you do with the stuff? What about mail and important documents?” Of course we had to work out some logistical puzzles to keep things running smoothly. Here’s how we used Basecamp and a new service to bid adieu to our office, to make my job remote, and to become a 100% remote company.





We didn’t close down our office because of COVID-19, though it certainly factored in the decision. Basecamp has always been remote. Remote is Basecamp. We wrote the book on it, literally. Our lease was due to expire, and it just didn’t seem worth it to keep it going at the new price. We’d outgrown it as a space for meetups, and it was always too big for the number of Basecampers that reside in Chicago. On a busy day there’d be six people working from the office.





On the other hand, having an office afforded us the standard ways of handling a lot of day-to-day business items. Mail, packages, meetings, storage. It was simple, easy, and the path that most of the world has taken. Losing the office and going 100% remote would take us further down a path less travelled.





Once we came to terms with leaving the office, I got to work on figuring out what to do to meet this goal. I won’t bore you with the minute details that are common to every move; you want to know what we are doing now and how we got to 100% remote. The biggest hurdles to jump were: 1. Primary business address (as most government agencies require a physical address), 2. How to handle the mail/packages, and 3. How to manage key document storage.





I looked at a few options for our business address and for mail/packages. When the pandemic started, we re-routed our mail to a PO Box near my home. This eliminated the need for me to take public transit or a ride-share to check the mail. The PO Box would’ve been a great long-term solution if it weren’t for two things. 1) We need a new business address, and 2) it still ties me to Chicago.





I also looked at a UPS Store Mailbox. UPS is a great service! You can use it as your business address and they receive your mail and packages, then forward it at your request to anywhere you want. The drag on this is that all the mail will be bundled and shipped, creating further delays in getting the items. So if there is any urgency, you’ll need to get to the mailbox yourself.





In the midst of all of this, someone from Earth Class Mail (ECM) reached out to David via Twitter. ECM, like UPS, offers a business address and they receive your mail and packages. The main, and biggest, difference is that they scan all of your mail for you to review online. If you need any originals, they ship it to you. They also deposit checks for you via overnight shipping to your bank.





Of course, I opted for ECM in the end. They tick all the boxes to make Basecamp 100% remote, and they meet needs we hadn’t considered, like the check deposits. In the first few weeks, I have only tested the mail scanning service, which is working great. I’m looking forward to seeing how mail/package forwarding and check deposits go.





Another question I’ve answered recently is how we handle document retention. I’m definitely not holding onto these items in my home. We use Basecamp! Not long after I started here I began saving digital copies of everything important to Basecamp. I save each document in Basecamp with a name, the amount, and any relevant notes. Keeping only digital copies of invoices, checks, and tax paperwork saves on office space, a luxury we no longer have, and more importantly the documents are secure, searchable, and accessible to anyone who needs them.





When I’m not sure, I check in with our accountants about anything we should keep hard copies of. If there is any chance we would need an original paper copy, we keep it. At the moment we don’t have a permanent solution for these instances (honestly, it isn’t much), so they are locked up in storage. The goal will be to eventually not need a storage space.





That covers how we are remote now! Did I miss anything? Feel free to leave a comment.





August 5, 2020

We’re hiring Rails programmers

We have two rare openings on our Core Product team for Rails programmers. We’ll be accepting applications for the next two weeks, aiming for a flexible start date in October.





We strongly encourage candidates of all different backgrounds and identities to apply. This is an opportunity for us to bring in a different perspective and we’re eager to further diversify our company. Basecamp is committed to building an inclusive, supportive place for you to do the best work of your career. We aren’t looking for ideological clones, but for people who share our beliefs about writing software well.









About the Job
As a programmer on the Core team, you’ll work on shaped projects, big & small, over six-week cycles, typically in tandem with a designer. We’re a small team (just three programmers and three designers) and we work quickly but calmly. You’ll join us, and be responsible for driving projects from concept to completion in an environment of autonomy and creativity.





As a team, our focus is firmly on the web: the core technology that powers our applications on all platforms. Ours are majestic monoliths, built using Ruby on Rails. We embrace a classic “HTML over the wire” architecture of server-rendered HTML, progressively enhanced with JavaScript and CSS. If you think this sounds a lot like the Golden Age of Web Development, well, we couldn’t agree more.
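
To make that concrete, here’s a minimal sketch of the pattern (a hypothetical controller, not code from one of our apps). The server renders the same partial it uses everywhere else and sends finished HTML down the wire; the client’s only job is to drop that fragment into the page.

    # Hypothetical example -- not from an actual Basecamp or HEY codebase.
    class CommentsController < ApplicationController
      def create
        post    = Post.find(params[:post_id])
        comment = post.comments.create!(params.require(:comment).permit(:body))

        # No JSON, no client-side templating: respond with the rendered
        # partial and let a sprinkle of JavaScript insert it into the page.
        render partial: "comments/comment", locals: { comment: comment }, status: :created
      end
    end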





This is an extremely exciting time to join the team. We’re currently all-in on HEY, our brand-new email service, and there’s a ton of fresh work to be done, bugs to fix, and features to polish. That’s where you’ll spend the bulk of your time to start. But we also have our fair share of work supporting Basecamp versions past, present, and future, as well as a back catalogue of applications that we’re committed to maintaining Until The End of the Internet.





Here are some things we’ve worked on recently to give you a sense of what you’ll be doing day-to-day:





- Building out a subscription system to allow for free trials, upgrades, credit card payments, cancellations, and everything else needed to launch a new product
- Creating an email export system using the MBOX format, ensuring that even large mailboxes can be exported efficiently
- Designing a global document browser for convenient access to every mail attachment from every sender, filterable and ordered by recency
- Revising our billing system to charge EU VAT where applicable, from payment flows and invoices to internal APIs and support tooling
- Blocking activity trackers and spy-pixels in incoming mail using a variety of techniques, including image and asset proxying for broad protection
- Collaborating with the Android and iOS teams on cross-cutting and platform-specific features
- Investigating and fixing a race condition when processing incoming email that’s resulting in random duplicate entries (the general shape of that kind of fix is sketched below)
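
On that last item, we won’t reproduce the actual fix here, but the general Rails pattern for surviving concurrent inserts is worth sketching: let a unique database index be the referee, and use create_or_find_by to recover gracefully when two workers race. The table, columns, and model below are made up for illustration.

    # Hypothetical migration and model, illustrating the general pattern only.
    class AddUniqueIndexToReceipts < ActiveRecord::Migration[6.0]
      def change
        # The database, not the application, enforces "one row per message".
        add_index :receipts, [:mailbox_id, :message_id], unique: true
      end
    end

    class Receipt < ApplicationRecord
      def self.record!(mailbox_id:, message_id:)
        # If two workers race on the same message, the loser hits the unique
        # index and gets the existing row back instead of creating a duplicate.
        create_or_find_by!(mailbox_id: mailbox_id, message_id: message_id)
      end
    end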



About You
We’re looking for candidates with strong track records of putting Rails to work and bringing products to life. You should be experienced in the framework, in Ruby, and with the staples of full-stack web development: HTTP, HTML, JavaScript, and CSS. But just because you’re experienced doesn’t mean we expect you to hit the ground running. You’ll have a gentle, intentional on-boarding experience and time to learn & grow comfortable with our systems, processes, and people.





You might have a Computer Science degree or you might not. That’s not what we’re looking for. We care about what you can do and how you do it, not about how you got here. A history of conscientious, thoughtful work speaks volumes.





You’re a Manager of One. A committed generalist, eager learner, and curator of the essential. You take ownership and you see things through. You can take a stand yet commit even when you disagree. This is a challenging position for a senior-level programmer, but imposters take heart; we are too.





You’re a good writer. The bulk of our work is written, from the code and interfaces we design, to the pull request discussions and long-form proposals we make. Effective, concise writing leads to effective, concise code.





Basecamp is a remote-first company, and this is a remote job. We’re hiring from anywhere that can maintain 3-4 hours of overlap with the US-Central time zone during a normal working day. This might mean an 11:00-19:00 schedule from Europe, but we’re not hiring from locations that require a graveyard shift to make the overlap happen.





Benefits & Compensation
Basecamp pays in the top 10% of the industry based on San Francisco rates. Same position, same pay, no matter where you live. The salary for this position is $186,850 (Senior Programmer).





Benefits at Basecamp are all about helping you lead a healthy life outside of work. We won’t treat your life as dead code to be optimized away with shallow perks. You won’t find lures to keep you coding ever longer. We believe quality time to focus on work starts with quality time to think, exercise, prepare a meal, be with family & friends, and of course, time to yourself.





We offer fully paid parental leave. We work 4-day weeks in the summer (Northern Hemisphere), and offer a month-long sabbatical every 3 years. We subsidize your home office, wellness and fitness interests, and continuing education. We offer a generous annual charitable contribution match. All on top of top-tier health insurance and a retirement plan with company match. See our full list.





How to Apply
Please submit an application that speaks directly to this position. Tell us about yourself, about what you can bring to Basecamp, and about Basecamp’s role in your future. Tell us about something you’ve done, something that’s relevant to the kind of work we do, something that excites you. Be descriptive, but don’t feel the need to write a novel — 500 words or so should be plenty.





We’re accepting applications until August 19, 2020, at 7PM US-Central time. There’s no benefit to filing early, so take your time.





We expect to take a few weeks to review all applications. You’ll hear from us by Friday, September 4th about advancement to a take-home technical exercise. The exercise is representative of the kind of day-to-day work we do. We invite fewer than 20 candidates to this stage, and those candidates should expect to spend upwards of a full day (but no more!) completing this test. Please consider whether you’d be willing to dedicate that time before applying.





After the written test, the top candidates will proceed to the interview stages. Expect 2-3 interviews, all one hour, all remote, with your future colleagues, on your schedule. We’ll talk through your background, your approach to work, and dive into your technical knowledge. No gotchas, brainteasers, or whiteboards. 





We aim to make an offer by early October with a start date later that month.
Please note that we’re unable to offer individual feedback during the screening process. We usually see 1,000+ applications for developer roles, and our hiring team simply doesn’t have the bandwidth to offer personalized feedback before the technical exercise round.





This is a demanding application process and a significant career move to consider. We appreciate you giving us that consideration, and we promise to give you our full attention in return. We look forward to hearing from you!





Apply here!





July 24, 2020

Remote work is a platform

Back in the mid-90s, just as Netscape Navigator was giving us our first look at what the visual internet could be, web design came in two flavors.





There was the ultra basic stuff. Text on a page, maybe a masthead graphic of some sort. Nothing sophisticated. It often looked like traditional letterhead, or a printed newsletter, but now on the screen. Interactions were few, if any, but perhaps a couple links tied a nascent site together.





And there was the other extreme. Highly stylized, lots of textures, 3D-style buttons, page curls, aggressive shadows, monolithic graphics cut up with image maps to allow you to click on different parts of a single graphic, etc. This style was aped from interactive CD/DVD interfaces that came before it.





Both of these styles — the masthead with text, and the heavily graphical — were ports. Not adaptations, but ports. Designs ported from one medium to another. No one knew what to make of the web at that time, so we pulled over things we were familiar with and sunk them in place. At that time, Web design wasn’t web design – it was print design, multimedia/interactive design, and graphic design. It took years for native web design to come into its own.





The web became great when designers started designing for the web, not bringing other designs to the web.





Porting things between platforms is common, especially when the new thing is truly brand new (or trying to gain traction). As the Mac gained steam in the late 80s and early 90s, and Windows 3 came out in 1990, a large number of Windows/PC developers began to port their software to the Mac. They didn’t write Mac software, they ported Windows software. And you could tell – it was pretty shit. It was nice to have at a time when the Mac wasn’t widely developed for, but it was clearly ported.





When something’s ported, it’s obvious.





Stuff that’s ported lacks the native sensibilities of the receiving platform. It doesn’t celebrate the advantages, it only meets the lowest possible bar. Everyone knows it. Sometimes we’re simply glad to have it because it’s either that or nothing, but there’s rarely a ringing endorsement of something that’s so obviously moved from A to B without consideration for what makes B, B.





What we’re seeing today is history repeat itself. This time we’re not talking about porting software or technology, we’re talking about porting a way to work.





In-person office work is a platform. It has its own advantages and disadvantages. Some things are easier in person (meetings, if you’re into those), and some things are harder (getting a few hours to yourself so you can focus, if you’re into that).





Remote work is another platform. It has its own unique flavor, advantages, and disadvantages. Its own efficiencies, its own quirks, its own interface. Upsides, downsides, insides, and outsides. It’s as different from in-office work as the Mac is from Windows. Yes, they’re both operating systems, and methods of computing, but they’re miles apart where it matters. The same is true for the difference between in-office work and remote work. Yup, it’s all still the same work, but it’s a different way to work.





In-office and remote work are different platforms of work. And right now, we’re seeing a lot of companies attempt to port local work methods to working remotely. Normally have four meetings a day in person? Then let’s have those same four meetings, with those same participants, over Zoom instead. It’s a way, but it’s the wrong way.





Simulating in-person office work remotely does both approaches a disservice.





This is often what happens when change is abrupt. We bring what we know from one to the other. We apply what we’re familiar with to the unfamiliar. But, in time, we recognize that doesn’t work.





The enlightened companies coming out of this pandemic will be the ones that figured out the right way to work remotely. They’ll have stopped trying to make remote look like local. They’ll have discovered that remote work means more autonomy, more trust, more uninterrupted stretches of time, smaller teams, more independent, concurrent work (and less dependent, sequenced work).





They won’t be the ones that just have their waste-of-time meetings online, they’ll be the ones that lay waste to the meetings. They won’t be the ones that depend on checking in on people constantly throughout the day, they’ll be the ones that give their employees time and space to do their best work. They won’t be the ones that can’t wait to pull everyone back to the office, they’ll be the ones that spot the advantages of optionality, and recognize a wonderful resilience in being able to work from anywhere.





And they’ll be the ones that finally realize that there’s nothing magical about the office. It’s just a space where work can happen, but not where it must happen. Anytime a myth is busted is a good time.





Work remotely, don’t port the office.





July 21, 2020

Take a video tour of HEY

Heard about HEY, but haven’t had a chance to check it out yet? In this 37-minute video, I walk through the big ideas, the philosophical underpinnings, and the specific features that make HEY a truly new take on email.











July 17, 2020

Spy pixels are evolving like malware, so HEY’s adapting

We knew that spy-pixel pushers might go down the rabbit hole of escalation once we gave HEY users the power to defend themselves. Just like virus and malware makers are constantly trying to defeat anti-virus and other security protections. But I guess we didn’t realize just how quickly it would happen!





Enter GMass, a plugin for Gmail that adds spy-pixel tracking, amongst a grab bag of other stuff. They hadn’t been on our original list of 50+ services we name’n’shame, but thanks to a new blog post where they brag about defeating the protections recipients might use to defend themselves, they came onto our radar.





This led to an in-depth investigation into how their latest techniques work, and we spent the whole day coming up with a new process for detecting GMass’ spy pixels. It just shipped! And now HEY will name’n’shame GMass, just like we do the other fifty-odd pushers of this kind of surveillance.
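
We won’t publish the actual detection rules (that would only speed up the next round of evasion), but the general shape of this kind of defense is easy to show. The sketch below is a simplified illustration, not HEY’s real code: scan an incoming message’s HTML for images that are invisible or point at known tracking hosts, and strip them before the message is displayed.

    require "nokogiri"

    # Illustrative patterns only; a real list is much longer and changes often.
    TRACKER_HOSTS = [/gmass/i, /open-tracker\.example/i].freeze

    def strip_spy_pixels(html)
      doc = Nokogiri::HTML(html)

      doc.css("img").each do |img|
        src       = img["src"].to_s
        invisible = %w[0 1].include?(img["width"]) && %w[0 1].include?(img["height"])
        tracker   = TRACKER_HOSTS.any? { |host| src.match?(host) }

        # Remove anything that looks like a 1x1 beacon or that calls home
        # to a known tracking service.
        img.remove if invisible || tracker
      end

      doc.to_html
    end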





Of course, like those virus and malware makers, GMass may try to defeat our protections again. And we’ll then have to adapt once more, and so we will. Internet security is a constantly moving target. But we can hope that Google will soon stop being a conduit for this kind of privacy abuse on the Gmail platform. Just like they don’t tolerate being used for spamming or phishing.





In the meantime, we’ll continue to do the work, both on a general level to protect against all forms of privacy attacks against HEY users, and specifically to identify bad actors and call out the users who employ their software for spying.





Have a surveillance-free Friday!




