Dear Aventine Readers,
In this issue we do a deep dive into what happened last month when CrowdStrike’s flawed software update crashed millions of computers, causing flights to be grounded, stores to close and surgeries to be canceled. This was just one software update among millions that take place across the globe every day, most of which go smoothly. Are there ways to prevent failures like this one? Yes! Read on to learn about the solutions and trade-offs.
Also in this issue: A blood test for Alzheimer’s, making rain in the United Arab Emirates, an AI to help make end-of-life decisions and how Fitbit came to be.
Thanks for reading!
Danielle Mattoon
Executive Director, Aventine
A Tiny Software Bug Upended the World. Can We Stop It From Happening Again?
Delta Air Lines claims to have lost an astonishing $550 million over the course of just five days in July due to a mistake in a tiny file that was part of a routine software update.
Damage from the error, which has come to be known informally as the 2024 CrowdStrike incident, spread far beyond Delta, causing millions of Windows computers to crash and knocking airports, banks, hospitals and TV broadcasts offline. It’s difficult to assess the full economic and social impact of the incident, though insurers have predicted that the losses easily run into the billions of dollars.
Unlike many cyber outages, this wasn’t the result of hackers, but of an innocent and unforced error — one that thrust the fragility of the digital systems we’ve all come to rely on into public view. But while the CrowdStrike incident caused global chaos and was headline news for days, software failures on a smaller scale happen so frequently they rarely make news, despite the trillions of dollars of damages they incur. (In 2022 alone, software errors cost U.S. businesses and government agencies an estimated $1.8 trillion, according to the Consortium for IT Software Quality, an industry group.) And while such errors are an almost unavoidable cost of doing business in the digital age, experts say that there are ways to minimize their frequency and mitigate potential damage.
According to cybersecurity and resilience experts Aventine spoke to, the CrowdStrike incident revealed some broad vulnerabilities that many systems share, like inconsistent testing and quality assurance by software vendors, the centralization of systems resulting in single points of failure and a lack of resilience preparedness inside large corporations. These are hard problems to solve and are likely impossible to fully overcome. And while the CrowdStrike incident has turned into a blame game that will play out in courts for months or years to come — with passengers suing Delta, Delta suing Microsoft and CrowdStrike and several other lawsuits pointing fingers at various parties — sources who spoke with Aventine argued that the responsibility for ensuring systems are more resilient in the future must be shared.
“Failure does happen, these systems are complex,” said Manuel Hepfer, head of knowledge and insights at the cybersecurity company ISTARI and a research affiliate at the University of Oxford's Saïd Business School. “It's often very simple to blame people. And I don't think blame is helpful.”
To prevent such breakdowns in the future, he said, entities have different parts to play. “CrowdStrike can certainly play a role; Microsoft can probably also play a role; and the customer can also do something to be better prepared.”
What exactly happened with CrowdStrike?
CrowdStrike is a cybersecurity company that builds software to detect threats, protect against cyberattacks and keep devices on an organization’s network secure; it serves over half of Fortune 500 companies. One of its core products is Falcon, a software program available for use on Windows, Apple and Linux operating systems that is one of many products available to keep individual devices safe from cyber threats.
To maintain its software CrowdStrike takes a two-pronged approach. First, its core Falcon software is periodically updated, a process that entails an extensive testing period and a rollout process that allows client IT departments to choose how and when the updated version is deployed. Additionally, CrowdStrike provides what it calls rapid response content updates, which allow the core software to identify new threats without making updates to the core software itself. These updates are done on an as-needed basis depending on observed dangers, and have historically been delivered automatically without giving IT departments the choice of accepting them or not.
Early on the morning of July 19, CrowdStrike released two rapid response updates for its Falcon software. One of those files, delivered only to Windows devices, contained an error that caused the software to behave incorrectly. Because Falcon, like other cybersecurity products, runs parts of its logic in the kernel of the Windows operating system — the core of its software, where the most important processes take place — the problem caused computers to crash.
CrowdStrike issued a fix for the bug within 90 minutes, but by that point it was too late: Over 8.5 million devices were unusable and could be fixed only with a manual update. While that number is fewer than one percent of all Windows devices, the impact on day-to-day life was disproportionate because Falcon is so widely used, with many companies using the combination of Windows and Falcon across important parts of their business. Delta, for instance, explained that approximately 60 percent of its mission-critical applications and associated data, including backups, relied on Windows and Falcon. Delta’s CEO, Ed Bastian, told CNBC that the company is the heaviest user of this software combination in its industry.
Quality assurance without much assurance
“To me, the first issue is: How did software like this get out of CrowdStrike?” said Keri Pearlson, who runs Cybersecurity at MIT Sloan, a research consortium focused on helping leaders and managers tackle cybersecurity.
That’s a question that CrowdStrike attempted to answer in the wake of the incident. According to the company, the updates provided on July 19 were delivered only after they had passed the same automated tests that had been used to check and approve similar updates in the past. CrowdStrike later found a bug in the automated tool used to assess the updates, allowing the error to slip through.
There is clearly an inherent tension between how closely an update is scrutinized, which can add to the time it takes for it to be released, and how effective it is at quickly combating risk. “Speed versus accuracy,” said Stuart Madnick, professor of information technology at the MIT Sloan School of Management. “Usually fast is done at the expense of double-checking things.”
But “the nature of these rapid configuration updates makes it really difficult to do more thorough testing, because it would defeat the purpose,” said Hepfer. “You can't have the same kind of rigor to those kinds of [in-the-moment threat] updates that you have in your normal software development process.” These automated quality assurance tools are becoming ubiquitous and tend to work well, he added, with advances in artificial intelligence helping the tools improve. “But it's not failure-proof,” he said.
There’s also an issue about whether the update process was conducted properly. “It was a worldwide rollout of the update on all computers at the same time, so everyone got hit,” said Michael Smets, a professor of management at the University of Oxford’s Saïd Business School and a co-author with Hepfer of “The CEO Report on Cyber Resilience.” In the future, he argued, CrowdStrike should phase the rollout of such updates so that any damage can be limited.
While this could obviously slow the rollout of critical security updates, the trade-off clearly seems worth it to CrowdStrike. It has now implemented a “staggered deployment strategy” for its rapid response content updates, enacted more rigorous quality assurance procedures and provided customers with more control over how and when content updates are installed.
“CrowdStrike maintains rigorous testing processes, including manual and automated testing, to ensure that any update maintains the highest standards,” a CrowdStrike spokesperson told Aventine. “Our focus is on using the lessons of this incident to better serve our customers.”
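For readers who want to see the idea in concrete terms, the phased rollout Smets describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CrowdStrike's actual process: the function name staged_rollout, the stage sizes (1 percent, 10 percent, then everyone) and the 2 percent failure threshold are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class RolloutResult:
    completed: bool
    devices_updated: int
    halted_at_fraction: Optional[float]


def staged_rollout(
    devices: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 1.0),  # hypothetical stages
    max_failure_rate: float = 0.02,  # hypothetical threshold
) -> RolloutResult:
    """Update devices in growing batches, halting if a batch looks unhealthy."""
    total = len(devices)
    updated = 0
    for fraction in stage_fractions:
        # Cumulative number of devices that should be updated after this stage.
        target = min(total, max(updated, int(total * fraction)))
        batch = devices[updated:target]
        if not batch:
            continue
        failures = 0
        for device in batch:
            apply_update(device)
            if not is_healthy(device):
                failures += 1
        updated = target
        # Stop before the next, larger stage if too many devices failed.
        if failures / len(batch) > max_failure_rate:
            return RolloutResult(False, updated, fraction)
    return RolloutResult(True, updated, None)


# Simulated fleet: a faulty update "crashes" every device it touches.
fleet = [f"device-{i}" for i in range(1000)]
crashed = set()
bad_result = staged_rollout(fleet, crashed.add, lambda d: d not in crashed)

# A harmless update reaches the whole fleet.
good_result = staged_rollout(fleet, lambda d: None, lambda d: True)
```

In this simulation the faulty update is stopped after the first stage, so only a small slice of the fleet is affected. The cost is the one the experts describe: each stage adds delay before the update reaches every device.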
Inside the heart of an operating system
While the origin of the disaster lay inside CrowdStrike’s code, a different technological issue allowed the error to crash computers using Microsoft's operating system. Most software programs run in what’s known as the user space of an operating system, where if something goes wrong with a software update, only the particular program is typically affected. But certain software is optimized by working in the kernel space of an operating system, allowing it to interact directly with a computer's hardware. If something goes wrong in the kernel space, it can cause the whole computer to crash. That’s what happened with the Falcon updates, because its cybersecurity software has kernel access in Windows, allowing it to monitor all system activities to maximum effectiveness.
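The user-space side of that distinction can be illustrated with a loose analogy in Python. In this sketch, a deliberately faulty snippet runs in its own separate process; when it fails, only that process dies and the program that launched it keeps going. The names FAULTY_PLUGIN and run_isolated are hypothetical, and the example makes no claim about how Falcon or Windows is actually built. Kernel code has no equivalent safety net, which is why a fault there can take down the whole machine.

```python
import subprocess
import sys

# A hypothetical faulty add-on: it asks for an item that does not exist.
FAULTY_PLUGIN = "fields = ['a', 'b']; print(fields[5])"
WORKING_PLUGIN = "print('ok')"


def run_isolated(code: str) -> bool:
    """Run code in a separate process; report whether it exited cleanly."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return completed.returncode == 0


# The faulty add-on fails, but the failure is contained in its own process.
faulty_ok = run_isolated(FAULTY_PLUGIN)
working_ok = run_isolated(WORKING_PLUGIN)

# The launching program is still running and can report what happened.
status = "host still running; faulty add-on contained" if not faulty_ok else "all clear"
```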
In the wake of the incident, a spokesperson from Microsoft argued in comments made to The Wall Street Journal that the reason Microsoft systems crashed was rooted in an antitrust agreement the company entered into with the European Commission more than a decade ago. The 2009 agreement, the spokesperson said, specifies that Microsoft must provide third-party security firms with the same level of access to the operating system as it has itself, implying that the agreement prevents Microsoft from protecting users against faulty third-party software. (In contrast, Apple, which controls only a small share of the operating system market and has not been a target of such antitrust action, has taken steps to lock down the kernel of its own operating system and has no obligation to provide access to third parties.)
Even if Microsoft could work around the law somehow and cordon off access to its kernel, changing its practices around access would be complicated. “Windows is a hugely complex operating system that runs millions of kinds of processes and tasks,” said Hepfer. “Just saying, ‘We’re going to revoke kernel access to these things overnight,’ that's not going to work.”
Yet Microsoft does have a role to play in “building systemic resilience,” added Hepfer. While it may be bound to allow third parties access to the core of the operating system, several sources told Aventine that the company has a responsibility as the manufacturer of the most used operating system in the world to ensure that errors resulting from another company’s software don’t result in catastrophic failure. Several sources pointed out that this is particularly important because the ubiquity of Windows makes it a huge target for malicious attacks, which ideally would be handled deftly by the operating system. Microsoft declined to comment for this article.
What can clients do?
While software vendors can build more robust systems and improve quality assurance, companies using the software need to build resilience too, or as Pearlson put it, they should develop “a business recovery plan.” The problem is that the requirements for such a plan often run counter to modern business approaches. “If you wanted to maximize your resilience, you would have different providers for different systems, not every system would upgrade at the same time, [and so on],” said Smets. “And that would be wildly inefficient.” Such practices are also contrary to the common narrative in cybersecurity: “Normally, a cyber incident highlights how your IT landscape is too fragmented, with too many different systems to maintain to keep up to date,” said Smets. “CrowdStrike is the complete opposite, actually highlighting that excessive concentration also carries its risk. The truth is likely to lie somewhere in the middle.”
Resilience should extend beyond digital parameters too. Several of the sources who spoke with Aventine praised the way some airlines reacted to the CrowdStrike incident by checking in passengers by hand using pen and paper. Such responses, they argued, must be well documented — including in physical form — and well rehearsed so that they can swing into action when digital systems fail.
Yet just as we must be prepared for these kinds of incidents to occur, we must also be prepared for attempts at building resilience to fail too. “Just as 100 percent security is impossible, 100 percent resilience is impossible as well,” wrote Dennis Galletta, a professor of information systems at the University of Pittsburgh, in an email. “Buckle your seatbelts for more of the same, and more of worse.”
Ultimately, the message from multiple experts is that software providers can and should improve their processes, that companies must think more carefully about how the systems they use can be more robust in the face of disaster, that software failures will continue to happen and that organizations should learn from each new breakdown.
“Everybody knows that it would be really nice if we had a system that was completely fault tolerant, and that that's impossible,” said Pearlson. “My suggestion is resilience thinking. Every time one of these things happens, we put that into our resiliency process and we say: ‘OK, we weren't resilient this time. What can we learn from it so that we can be resilient if this happens again?’”
Advances That Matter
The first passenger ferry powered by hydrogen fuel cells in San Francisco, July 2024, AP Photo/Terry Chea
The world’s first commercial hydrogen ferry set sail. Making waves around the northeastern coast of San Francisco is a blue and white boat with green ambitions. The vessel, called Sea Change, is a 70-foot, 75-passenger ferryboat that uses hydrogen fuel cells and electric motors to zip through the bay without producing any emissions. It started a six-month pilot service in mid-July, and will carry passengers on a short trip from the Ferry Building to Fisherman’s Wharf. As Canary Media reports, ferries account for just 2 percent of harbor craft in California yet produce 15 percent of their emissions as a result of their inefficient diesel engines. There are other attempts to clean up the emissions of harbor vessels: In April, we reported on the first all-electric tugboat in the U.S., which is operating in San Diego Port. But hydrogen has one clear advantage as a fuel source: Boats can carry tanks full of the stuff, which means they could cover longer trips than a battery-powered vessel. Sea Change currently runs on regular hydrogen produced using fossil gas, but the long-term aim is for boats like these to be powered by green hydrogen, a theoretically zero-carbon fuel source but one that is proving difficult to commercialize. Switch Maritime, the startup behind the boat, told Canary Media that it sees itself “playing an integral role in building out the supply chain” for green hydrogen, helping signal the demand that could, over time, drive down the costs of manufacturing the fuel.
Closing in on a blood test for Alzheimer’s. There are clues lurking throughout your body that indicate if your brain is slowly developing the telltale plaques and sticky tangles of proteins associated with Alzheimer’s disease. These distinctive proteins, called biomarkers, are easy to identify in the brain and the cerebrospinal fluid that surrounds it, but they also exist in lower concentrations in the blood. For decades, researchers have been trying to develop simple, affordable tests that would make it possible to identify those biomarkers in a blood sample. This article in Nature charts the progress of that research, revealing that the science has reached somewhat of a tipping point: A number of blood tests now exist that accurately measure progression of certain elements of the disease, and some of them work as well as the previous gold standard of PET scans. Researchers are now using these tests to help inform clinical trials: Biomarkers are making it easier to recruit for such trials and are helping researchers track how effective treatments are. There is still work to be done, because different biomarkers reveal clues about different facets of the disease and its progression. But we are, according to researchers involved with the work, approaching a point at which doctors could perform a simple blood test that would reveal if people are at risk from Alzheimer’s. Combined with a new wave of drugs that don’t halt the disease but can dramatically slow its progression, these tests could have a profound impact on the suffering caused by this disease.
An AI to help make your end-of-life decisions. It can be impossible to know how a person wants to see out their final days when they are, for whatever reasons, unable to communicate with doctors. It’s a situation that is torturous for loved ones, as well as extraordinarily tough for healthcare workers if there are no loved ones around. Emerging from that kind of struggle is a potentially useful but controversial new technology: an AI-powered digital twin trained on data from the dying person that could help family members and doctors get a better sense of how someone who might not be able to communicate would like to be cared for. Such AI doesn’t yet exist, but a team of researchers from around the world that spoke with MIT Technology Review have very clear plans about how to build one, and evidence from earlier rudimentary software models to suggest that it could be of use. Even simple models based on small amounts of general-population surveys have been shown to perform as well as family members in predicting how a person might feel about end-of-life medical care. The idea is that by tailoring software for different patients based on their personal data, accuracy about their preferences could be dramatically increased. Yet for obvious reasons, this technology is fraught with ethical considerations — about what data is used, how the model is used and just how accurate the results can be. Those issues will need to be ironed out before any such software makes much progress, and even then hospitals and ethicists will undoubtedly wrestle with how to implement it.
Magazine and Journal Articles Worthy of Your Time
The New Gods of Weather Can Make Rain on Demand — or So They Want You to Believe, from Wired
5,500 words, or about 22 minutes
It is dry in the United Arab Emirates. Very dry. In fact, the area receives about half as much rain as Nevada. So, just outside Abu Dhabi, scientists are attempting to employ the decades-old concept of cloud seeding to produce rain. Flying planes into turbulent conditions thousands of feet up in the air, pilots release flares full of salts into the air in an attempt to encourage water droplets to form and fall from the sky. The jury is out on whether this approach is reliably successful, but more exotic methods are also being researched, including one that involves firing high-powered laser beams into the sky. As this story explains, the concept is a tangle of science, innovation, expense, excitement and no shortage of theater. But if it does help it rain in the U.A.E., it might all be worth it.
Watching the watchers, from The Economist Technology Quarterly
9,000 words over five articles, or about 36 minutes
Being a spy in 2024 looks rather different than it did 50 years ago. Many of the intelligence operations conducted by governments around the world take place not with people on the ground but instead with analysts sitting at computer terminals, using AI to decipher the contents of huge quantities of satellite imagery, say, or attempting to break encryption to spy on secret communications. This package of stories from The Economist takes a close look at how ubiquitous technology has increased the volume of data accessible to intelligence officers, the benefits and challenges it has created, the rise of AI in spycraft, and how these developments have enabled private firms to muscle into the sector. If your favorite scenes from the James Bond movies involved the character Q, you’re in luck.
Engineering The First Fitbit: The Inside Story, from IEEE Spectrum
4,400 words, or about 18 minutes
The humble Fitbit is one of those beguilingly simple products that somehow captured the imagination of millions of people — 136 million, to be precise. This story, based on interviews with the company’s founders, is a fascinating glimpse into the product development of a technology that was designed to collapse cutting-edge technology into a tiny form to create something simple, affordable and usable. If there’s one big takeaway from the piece, it’s that it’s almost impossible to predict at the outset how a product should be built — both from a technical perspective, and also because user preferences rarely match up neatly with those of a product designer.