R . R .

Issue #2: It worked, just not equally

Introduction

In crisis contexts, decisions are generally made in good faith and are defensible at the time. Yet the nature of crisis environments means that, no matter how well intended a course of action, the results can be questionable. It is the very reasonableness of such decision-making that raises the hardest ethical questions. As such, this newsletter is less interested in villains and "gotchas" and more in the structural and systemic environments that allow a particular choice to be made.

Note: There are of course plenty of examples of bad-faith decision-making within the sector, but that is a topic for another issue! You will remember the four lenses from Issue #1:

1. Power & Agency

2. Data & Consent

3. Accountability & Governance

4. Operational Reality

In each issue, we will look at a topic through one or more of these lenses.

Today's topic: If an AI system can only act on what it can see, what happens to the damage it can't?

Primary lens: Operational Reality | Secondary lens: Power & Agency

Let's go…

By way of example, let us consider the aftermath of major hurricanes. The pressure to assess damage quickly is enormous. Emergency managers need to know which areas were hit hardest and how to allocate limited resources. Traditionally, this has meant sending inspection teams into affected areas to assess the damage. This is usually a slow, dangerous, and labour-intensive process that can take days or weeks to produce a usable picture. We also need to know where to send these teams in the first place, which is itself a difficult call to make.

Immediately we can see the appeal of AI in this space. Fly drones or reposition satellites, capture imagery, and let machine learning models classify damage across thousands of buildings in minutes rather than days. Faster information + faster decisions = better help.

Of course there is a 'but…' coming…

Before we launch ourselves head first into the doom and gloom, it's important to remember that some real achievements have been made in this domain. For example, when AI-based damage assessment tools were deployed during the 2024 hurricane season, the results were genuinely impressive. One system assessed over 400 buildings in approximately 18 minutes. The technology worked. But worked for whom? (There's that 'but' I was just talking about.)

What we can't see from the sky

Aerial damage assessment, whether done by humans or machines, depends on a basic assumption: that what is visible from above reflects what has happened below. (I can feel the collective eye roll from you seasoned disaster managers! But stick with me, and let's get into this, that's what we are here for.)

To understand why this matters, it helps to know how these systems are built. First, humans look at aerial images from a previous disaster and label each building: no damage, minor damage, major damage, destroyed. An AI model is then trained on those labels and it learns the visual patterns that correspond to each category. The model is tested against more human labels from the same imagery source, and if it performs well enough, it gets deployed. When the next hurricane hits, fresh imagery is captured and the model classifies each building based on what it previously learned.
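The four-step chain described above (label, train, test, deploy) can be sketched in a few lines of Python. This is a minimal illustration only, not any deployed system: the "model" here simply memorises the most common label per visual signature, standing in for a real computer-vision model, and the damage scale and toy data are assumptions for the sketch.

```python
# Minimal sketch of the labelling-to-deployment chain described above.
# Illustrative only: a real system would train a vision model on pixels,
# not memorise string "signatures".
from collections import Counter, defaultdict

def train(labelled_images):
    """Steps 1-2: learn patterns from human-labelled aerial imagery.
    Here we just take the majority label per image signature."""
    counts = defaultdict(Counter)
    for signature, label in labelled_images:
        counts[signature][label] += 1
    return {sig: c.most_common(1)[0][0] for sig, c in counts.items()}

def evaluate(model, test_set):
    """Step 3: test against more human labels from the SAME imagery source."""
    correct = sum(model.get(sig) == label for sig, label in test_set)
    return correct / len(test_set)

def classify(model, fresh_imagery):
    """Step 4: when the next hurricane hits, classify fresh imagery."""
    return [model.get(sig, "no damage") for sig in fresh_imagery]

# Toy data: 'signatures' stand in for what the model sees from above.
train_set = [("collapsed_roof", "destroyed"), ("intact_roof", "no damage"),
             ("debris_field", "major"), ("collapsed_roof", "destroyed")]
model = train(train_set)
print(evaluate(model, [("collapsed_roof", "destroyed")]))  # 1.0
print(classify(model, ["intact_roof", "debris_field"]))
```

Note the structural point the sketch makes visible: step 3 evaluates against labels from the same imagery source as step 1, which is exactly where the circularity discussed below creeps in.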

The quality of every step in that chain depends on the quality of the first one: the human labels. And the quality of those labels depends on what the humans could see. Research published at the ACM Conference on Fairness, Accountability, and Transparency audited damage labels for over 15,000 buildings across three major US hurricanes. The study compared assessments made from satellite imagery with those made from drone imagery for the same buildings, using the same damage scale, labelled by human assessors.

The disagreement rate was 29%: nearly one in three buildings was classified differently depending on which imagery source was used. Nor was this pattern random. Satellite-derived labels systematically under-reported damage compared with drone-derived labels, meaning that buildings that appeared undamaged from satellite altitude showed clear evidence of harm when viewed at closer range.
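To make the audit's two headline findings concrete (a disagreement rate, and a direction to that disagreement), here is a small sketch of how both could be computed from paired labels. The label pairs below are invented for illustration; only the method mirrors the study, not the data.

```python
# Sketch: comparing satellite vs drone labels for the same buildings.
# Toy data invented for illustration -- NOT the study's 15,000 buildings.

SEVERITY = {"no damage": 0, "minor": 1, "major": 2, "destroyed": 3}

# (satellite label, drone label) for each building
pairs = [
    ("no damage", "no damage"),
    ("no damage", "minor"),    # e.g. ground-floor flooding under intact roof
    ("minor", "major"),        # satellite under-reports severity
    ("destroyed", "destroyed"),
    ("major", "major"),
    ("no damage", "major"),    # e.g. building obscured by vegetation
    ("minor", "minor"),
]

disagreements = [(s, d) for s, d in pairs if s != d]
rate = len(disagreements) / len(pairs)

# Direction: does satellite report LESS severe damage than the drone?
under = sum(SEVERITY[s] < SEVERITY[d] for s, d in disagreements)
over = sum(SEVERITY[s] > SEVERITY[d] for s, d in disagreements)

print(f"disagreement rate: {rate:.0%}")
print(f"satellite under-reports in {under} of {len(disagreements)} disagreements")
```

In this toy sample every disagreement runs the same way, satellite milder than drone, which is the one-directional skew the audit reported at scale.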

Strictly speaking, this isn't an AI failure story. Many of the most consequential problems with AI aren't in the AI itself; they're in everything that happens before the AI is switched on. This matters because it tells us something important about where bias enters a system: earlier than most people might think. The problem is not that an AI model made a mistake; the problem is that the data the model learns from already carries a structural blind spot.

The study didn't test an AI model specifically, and it didn't need to in order to offer useful findings. If you train a system on satellite imagery that under-reports damage, the system will learn to under-report damage. It will do so confidently, at scale, and without flagging the gap. It doesn't matter how good your model is if what it's learning from doesn't reflect what's actually on the ground.

And here's the uncomfortable part: the model can appear to perform well at every stage. Because it is tested against labels drawn from the same imagery source it was trained on, its accuracy is measured against the same skewed standard. Without ground-level verification, i.e. someone physically going to the building, there is no way to know which aerial view was closer to reality. You just know the two sources disagreed nearly a third of the time, and that the disagreement went in one direction.
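This circularity is easy to demonstrate with a toy example: a model that perfectly reproduces skewed labels scores perfectly against them while diverging from the ground truth nobody measured. All data here is invented for illustration; this is a hypothetical sketch, not the study's methodology.

```python
# Sketch: why accuracy against same-source labels can mislead.
# All labels below are invented for illustration.

ground_truth = ["major", "major", "no damage", "destroyed", "major"]

# Satellite labels systematically under-report: some 'major' damage
# (e.g. flooded ground floors under intact roofs) looks like 'no damage'.
satellite_labels = ["no damage", "no damage", "no damage", "destroyed", "major"]

# A model trained on satellite labels learns to reproduce them exactly.
model_output = list(satellite_labels)

def accuracy(predictions, labels):
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)

print(accuracy(model_output, satellite_labels))  # 1.0 -- looks perfect
print(accuracy(model_output, ground_truth))      # 0.6 -- the unmeasured gap
```

The first number is the one that appears in evaluation reports; the second is the one that would matter on the ground, and it is never computed unless someone visits the buildings.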

These concerns are not unique to a single study. Independent reviews of AI in emergency management and humanitarian aid have identified data collection practices as a recurring source of structural bias across the field.

The buildings that don't look damaged enough

Let's do a hypothetical walkthrough. After a hurricane, some types of damage are highly visible from the air: collapsed roofs, debris fields, destroyed structures. These are the patterns that imagery-based models are best at detecting. Communities with this kind of damage should be prioritised. But other types of damage are less visible from above: flooding that devastated ground floors while leaving roofs intact; structural compromise that is not apparent from aerial angles; damage to informal or non-standard housing that does not match the patterns in training data; and damage in densely vegetated areas where buildings are partially obscured.

None of this is invisible. It is simply harder to see from the vantage point the system relies on. The result is a quiet but consequential skew. Areas with dramatic, visible destruction are identified quickly and accurately - excellent! But areas with serious but less photogenic damage risk being categorised as lower priority, not because someone decided they mattered less, but because the system's view of the world made their damage harder to detect. This is not a failure of the technology. It is a failure of the perspective.

Reasonable decisions, uneven consequences

It is worth pausing here, because the instinct at this point may be to look for the person who got it wrong. But that is precisely what makes this example ethically interesting. Every decision in this chain is understandable. Using satellite imagery makes sense because it covers vast areas quickly and is available soon after a disaster. Training models on the largest available datasets is standard practice. Deploying automated tools to speed up assessment is a rational response to the data avalanche that modern disasters produce - one operational deployment documented receiving up to 369GB of drone imagery per day, far more than any team can manually review.

Of course, no one set out to create a system that would deprioritise less visible damage. But the combination of practical, defensible decisions about imagery sources, training data, deployment timelines, and coverage priorities produced exactly that outcome. So what would this mean if these skewed labels were feeding into real-world decisions?

Through the Operational Reality lens, the system's outputs would appear sound. Buildings classified, maps generated, assessments delivered on time. The model would perform as designed. But its performance would be shaped by assumptions about resolution, vantage point, and what "damage" looks like. These are the very assumptions that, as the study shows, do not hold equally across imagery sources.

Through the Power & Agency lens, those outputs would then shape downstream decisions. Damage assessments inform where inspection teams go first, which communities receive priority aid, and how recovery funding is allocated. An AI system that detects damage unevenly would not just produce inaccurate data, it would redistribute attention, resources, and urgency, without anyone explicitly making that choice.

Speed isn't neutral

Going back to the ACM conference paper: the researchers who built and deployed these systems deserve credit for something important. They documented not just their successes but their limitations.

Their published work openly identifies the operational challenges, including connectivity failures, resolution variations, and spatial misalignment, that degraded or delayed their outputs in the field. They also conducted the audit that revealed the 29% disagreement between imagery sources. More of this kind of transparency, please!

But it also raises an uncomfortable question: if the people who build these systems are openly acknowledging the gaps, what happens when organisations adopt the tools without reading the fine print? In fast-moving disaster response, the pressure to act on available information is immense. A damage assessment that arrives quickly, looks comprehensive, and provides clear classifications is hard to second-guess, especially when the alternative is waiting days for ground-level verification. Speed creates its own authority. But speed is not neutral. When a system produces results faster than they can be validated, the window for challenge shrinks. And when those results carry the visual authority of a colour-coded damage map, they tend to be treated as more definitive than their creators intended.

The risk is not that emergency managers blindly trust AI. Most are experienced, sceptical, and aware of limitations. The risk is that under pressure, with incomplete information and urgent demands for action, a "good enough" automated assessment becomes the baseline and the communities it overlooks become harder to advocate for, because the data does not make their case visible.

So, what are we going to do about this?

This is not an argument against AI-assisted damage assessment. The technology addresses a real and growing problem, and early deployments have shown it can deliver results at a speed that manual assessment simply cannot match. But efficiency and equity are not the same thing. And a system that delivers both speed and confidence can make the gap between them harder to see, not easier.

So it's over to you with this week's question: When an AI system is faster and more scalable than any alternative, but you know its accuracy is unevenly distributed, what does responsible deployment actually look like?

Leave your thoughts in the comments and I will see you in a fortnight.

RB.


Sources & further reading:

Manzini, T., Perali, P., Tripathi, J. & Murphy, R. (2025). "Now you see it, Now you don't: Damage Label Agreement in Drone & Satellite Post-Disaster Imagery." ACM Conference on Fairness, Accountability, and Transparency (FAccT '25). https://dl.acm.org/doi/10.1145/3715275.3732135

Manzini, T., Perali, P. & Murphy, R. (2025). "Deploying Rapid Damage Assessments from sUAS Imagery for Disaster Response." arXiv preprint, Texas A&M University. https://arxiv.org/abs/2511.03132

Manzini, T., Perali, P., Murphy, R. & Merrick, D. (2025). "Challenges and Research Directions from the Operational Use of a Machine Learning Damage Assessment System." arXiv preprint, Texas A&M University / Florida State University. https://arxiv.org/abs/2506.15890

Lythreatis, S. et al. (2025). "Artificial Intelligence in Humanitarian Aid: A Review and Future Research Agenda." Technovation. https://www.sciencedirect.com/science/article/pii/S0166497225002470




Issue #1. Ethical AI Wasn’t Designed for Disasters

Why read this newsletter?

Every disaster movie starts with someone ignoring a scientist.

Artificial intelligence is everywhere. Whether organisations are already using it or still exploring the possibilities, there is no shortage of new tools and terms to keep up with - from systems like ChatGPT to concepts such as large language models, prompt engineering, hallucinations, and retrieval-augmented generation (RAG). The list grows daily.

AI is also increasingly present in disaster and emergency management - from early warning systems and damage assessment to logistics planning and decision support. But what does this actually mean in practice?

In crisis settings, decisions are made under pressure, with incomplete information, and often with life-altering consequences. The promise of AI in these environments is clear: speed, scale, and analytical reach beyond human capacity. But with that promise comes real risk. Errors, bias, and opaque systems can be introduced precisely where there is the least room for them.

So what are we going to do about this?

Ethical guidance for AI exists across many documents, organisations, and initiatives - but it is fragmented and uneven in quality. Even less clear is how these principles hold up in crisis environments, where consent is constrained, accountability is scattered, and “human-in-the-loop” may be more symbolic than real.

This newsletter exists to sit in that gap - without claiming to fill it.

It does not argue for or against the use of AI in crisis contexts. Instead, it slows things down enough to ask how AI is being used, who it serves, and who bears the risk when it fails.

Why do disasters change the ethical rules so much anyway?

Disasters create unusual environments. They compress time, concentrate power, and narrow the range of choices available to individuals and communities. Decisions that would normally involve consultation, consent, or deliberation are often made quickly, by a small number of actors, under conditions of uncertainty.

This matters for AI because many ethical assumptions baked into technology design do not hold in crisis settings. Consent may be nominal or impossible. Data may be collected from people with little ability to refuse, limited understanding of how it will be used, or no control over its future reuse. Accountability can become scattered as responsibility is distributed across agencies, vendors, models, and workflows.

In these contexts, AI systems do not simply support decisions. They can shape strategy - influencing what is seen as urgent, relevant, or even possible. A risk score, forecast, or generated summary can quietly steer attention and resources, even when a human remains formally in charge.

The challenge is not only whether an AI system is accurate, but whether its influence is visible, contestable, and appropriate for the moment in which it is used.

What may be acceptable in planning or low-stakes settings can become ethically fraught when lives, livelihoods, and trust are on the line.

How will this newsletter approach this?

Ethical risks in disaster settings rarely appear as isolated technical failures. More often, they emerge from predictable patterns across tools, organisations, and crisis contexts.

To make those patterns visible, this newsletter examines AI developments through four recurring lenses - each highlighting a different way ethical risk tends to surface when AI is introduced into high-stakes, human-centred decision-making.

These lenses are not exhaustive, but they capture the most common ways ethical risk surfaces when AI enters crisis decision-making.

The four lenses

1. Power & Agency

Who decides, who benefits, and who bears the risk?

In crisis settings, AI systems can acquire authority they have not earned. Confident forecasts, scores, or summaries may narrow the range of options under consideration, embedding value judgements about what matters most - including implicit assumptions about fairness and whose needs are prioritised.

In time-critical environments, these outputs can shortcut human deliberation rather than support it, shifting decision-making power away from people and towards systems whose assumptions are rarely explicit or contested at the moment decisions are made.

What appears to be neutral technical support can, in practice, define urgency, shape priorities, and influence who receives assistance - with consequences that are not evenly distributed.

2. Data & Consent

How is data collected, owned, reused, and protected?

Human-centred crises are environments where meaningful consent is often constrained or impossible. Ethical data practices designed for stable settings - informed consent, clear purpose limitation, restrictions on reuse - can quickly erode under emergency conditions.

Bias can be introduced not only through what data is collected, but through whose data is missing, outdated, or over-represented. The ethical risk is not limited to collection itself, but extends to what happens afterwards: when crisis data is retained, repurposed, or combined in ways that expose communities to harm long after the emergency has passed.

3. Accountability & Governance

Who is responsible when AI influences life-critical decisions?

When AI systems are embedded into complex, multi-agency crisis workflows, responsibility can become blurred. Decisions may be shaped by a combination of data pipelines, models, vendors, internal teams, and partner organisations.

Influence ≠ responsibility.

Systems can shape outcomes without being accountable for them, while organisations may retain formal accountability without the power, time, or transparency needed to intervene meaningfully.

Calls for transparency or explainability do not resolve this on their own if decision-makers cannot realistically challenge, override, or pause an AI system under operational pressure. When something goes wrong, it is often unclear who can intervene - or where accountability ultimately sits.

4. Operational Reality

What actually happens on the ground - not what the model promises?

AI systems often rely on historical or aggregated data that struggles to keep pace with rapidly changing crisis conditions. Infrastructure damage, population movement, political constraints, and access limitations can quickly invalidate model assumptions.

In these moments, AI rarely fails loudly. Instead, it risks failing quietly - producing outputs that appear reasonable, explainable, or technically sound while no longer reflecting the realities facing responders and affected communities.

Does this feel uncomfortably familiar? Then you’re probably in the right place.

Rather than treating these lenses as abstract principles, this newsletter uses them as practical tools. Some issues will explore a single lens in depth; others will examine specific cases through several lenses at once.

As this series begins, the aim is not to provide answers, but to ask better questions - together.

Week one question

If AI systems reshape human-centred crisis decision-making in subtle ways, how would we notice - and what would it take to intervene in time?

Leave your thoughts in the comments and I’ll see you in a fortnight.

RB.
