Issue #2: It worked, just not equally
Introduction
In crisis contexts, decisions are generally made in good faith and are defensible at the time. Yet the nature of crisis environments means that however well intended a course of action, the results can be questionable. It is the very reasonableness of such decision making that throws up the most ethical questions. As such, this newsletter is less interested in villains and "gotchas" and more in the structural and systemic environments that allowed a particular choice and decision to be made.
Note: There are of course plenty of examples of bad-faith decision making within the sector, but that is a topic for another issue!
You will remember the four lenses from issue 1:
1. Power and Agency
2. Data and Consent
3. Accountability and Governance
4. Operational Reality
In each issue, we will look at a topic through one or more of these lenses.
Today's topic: If an AI system can only act on what it can see, what happens to the damage it can't?
Primary lens: Operational Reality | Secondary lens: Power & Agency
Let's go…
By way of example, let us consider the aftermath of major hurricanes. The pressure to assess damage quickly is enormous. Emergency managers need to know which areas were hit hardest and how to allocate limited resources. Traditionally, this has meant sending inspection teams into affected areas to assess the damage. This is usually a slow, dangerous, and labour-intensive process that can take days or weeks to produce a usable picture. We also need to know where to send these teams in the first place, which is itself a difficult call to make.
Immediately we can see the appeal of AI in this space. Fly drones or reposition satellites, capture imagery, and let machine learning models classify damage across thousands of buildings in minutes rather than days. Faster information + faster decisions = better help.
Of course there is a 'but…' coming…
Before we launch head first into the doom and gloom, it's important to remember that some real achievements have been made in this domain. For example, when AI-based damage assessment tools were deployed during the 2024 hurricane season, the results were genuinely impressive. One system assessed over 400 buildings in approximately 18 minutes. The technology worked. But worked for whom? (There's that 'but' I was just talking about.)
What we can't see from the sky
Aerial damage assessment, whether done by humans or machines, depends on a basic assumption: that what is visible from above reflects what has happened below. (I can feel the collective eye roll from you seasoned disaster managers! But stick with me and let's get into this; that's what we are here for.)
To understand why this matters, it helps to know how these systems are built. First, humans look at aerial images from a previous disaster and label each building: no damage, minor damage, major damage, destroyed. An AI model is then trained on those labels and it learns the visual patterns that correspond to each category. The model is tested against more human labels from the same imagery source, and if it performs well enough, it gets deployed. When the next hurricane hits, fresh imagery is captured and the model classifies each building based on what it previously learned.
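The chain described above can be sketched as a toy pipeline. Everything here is invented for illustration (a nearest-centroid stand-in for a real vision model, and a single made-up "roof disturbance" feature per building), but the shape of the chain (label, train, test against the same imagery source, deploy) is the point:

```python
# Toy sketch of the label -> train -> test -> deploy chain. All features,
# numbers and labels are invented for illustration; a real system would use
# a vision model, not a one-number nearest-centroid classifier.
from collections import Counter

# Step 1: humans label buildings in imagery from a past disaster. Each
# "building" is reduced here to one made-up feature: the fraction of roof
# pixels that look disturbed (0.0 = intact, 1.0 = gone).
labelled_training_data = [
    (0.05, "no_damage"), (0.10, "no_damage"),
    (0.30, "minor"),     (0.35, "minor"),
    (0.60, "major"),     (0.65, "major"),
    (0.90, "destroyed"), (0.95, "destroyed"),
]

# Step 2: "train" a model by averaging the feature per class, the simplest
# stand-in for learning the visual pattern of each damage category.
def train(data):
    sums, counts = Counter(), Counter()
    for feature, label in data:
        sums[label] += feature
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in counts}

def classify(model, feature):
    return min(model, key=lambda label: abs(model[label] - feature))

model = train(labelled_training_data)

# Step 3: test against MORE human labels from the SAME imagery source.
# If those labels under-report damage, a high score here is misleading:
# the model is being marked against the same skewed standard.
test_set = [(0.08, "no_damage"), (0.33, "minor"), (0.62, "major")]
accuracy = sum(classify(model, f) == y for f, y in test_set) / len(test_set)
print(f"held-out accuracy: {accuracy:.0%}")
```

In this toy version the model scores perfectly on held-out labels, which illustrates the trap: the test set inherits whatever blind spot the imagery source has.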
The quality of every step in that chain depends on the quality of the first one: the human labels. And the quality of those labels depends on what the humans could see. Research published at the ACM Conference on Fairness, Accountability, and Transparency audited damage labels for over 15,000 buildings across three major US hurricanes. The study compared assessments made from satellite imagery with those made from drone imagery for the same buildings, using the same damage scale, labelled by human assessors.
The disagreement rate was 29%, meaning that nearly one in three buildings was classified differently depending on which imagery source was used. Nor was this pattern random: satellite-derived labels systematically under-reported damage compared to drone-derived labels. Buildings that appeared undamaged from satellite altitude showed clear evidence of harm when viewed at closer range.
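To make the audit's headline numbers concrete, here is a small sketch with invented labels showing how a disagreement rate, and its direction, can be computed for the same buildings labelled from two imagery sources:

```python
# Invented labels for the same seven buildings, one set from satellite
# imagery and one from drone imagery, on a shared ordinal damage scale.
SCALE = {"no_damage": 0, "minor": 1, "major": 2, "destroyed": 3}

satellite = ["no_damage", "no_damage", "minor", "major", "no_damage", "minor", "destroyed"]
drone     = ["minor",     "no_damage", "major", "major", "minor",     "minor", "destroyed"]

# Disagreement rate: how often the two sources assign different labels.
disagreements = [(s, d) for s, d in zip(satellite, drone) if s != d]
rate = len(disagreements) / len(satellite)

# Direction: in how many disagreements does satellite report LESS damage?
under_reported = sum(SCALE[s] < SCALE[d] for s, d in disagreements)

print(f"disagreement rate: {rate:.0%}")
print(f"satellite under-reports in {under_reported} of {len(disagreements)} disagreements")
```

In the real audit the rate was 29% across 15,000+ buildings; in this toy set it is 43%, and every disagreement runs in the under-reporting direction, mirroring the skew the study found.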
Strictly speaking, this isn't an AI failure story. In fact, many of the most consequential problems with AI aren't in the AI itself; they're in everything that happens before the AI is switched on. This matters because it tells us something important about where bias enters a system: earlier than most people might think. The problem is not that an AI model made a mistake; the problem is that the data the model learns from already carries a structural blind spot.
The study didn't test an AI model specifically, and it didn't need to in order to offer useful findings. If you train a system on satellite imagery that under-reports damage, the system will learn to under-report damage. It will do so confidently, at scale, and without flagging the gap. It doesn't matter how good your model is if what it's learning from doesn't reflect what's actually on the ground.
And here's the uncomfortable part: the model can appear to perform well at every stage. Because it is tested against labels drawn from the same imagery source it was trained on, its accuracy is measured against the same skewed standard. Without ground-level verification (i.e. someone physically going to the building), there is no way to know which aerial view was closer to reality. You just know they disagreed nearly a third of the time, and that the disagreement went in one direction.
These concerns are not unique to a single study. Independent reviews of AI in emergency management and humanitarian aid have identified data collection practices as a recurring source of structural bias across the field.
The buildings that don't look damaged enough
Let's do a hypothetical walkthrough. After a hurricane, some types of damage are highly visible from the air: collapsed roofs, debris fields, destroyed structures. These are the patterns that imagery-based models are best at detecting, and communities with this kind of damage should be prioritised. But other types of damage are less visible from above: flooding that devastated ground floors while leaving roofs intact; structural compromise that is not apparent from aerial angles; damage to informal or non-standard housing that does not match the patterns in training data; and damage in densely vegetated areas where buildings are partially obscured.
None of this is invisible. It is simply harder to see from the vantage point the system relies on. The result is a quiet but consequential skew. Areas with dramatic, visible destruction are identified quickly and accurately. Excellent! But areas with serious but less photogenic damage risk being categorised as lower priority, not because someone decided they mattered less, but because the system's view of the world made their damage harder to detect. This is not a failure of the technology. It is a failure of perspective.
Reasonable decisions, uneven consequences
It is worth pausing here, because the instinct at this point may be to look for the person who got it wrong. But that is precisely what makes this example ethically interesting. Every decision in this chain is understandable. Using satellite imagery makes sense because it covers vast areas quickly and is available soon after a disaster. Training models on the largest available datasets is standard practice. Deploying automated tools to speed up assessment is a rational response to the data avalanche that modern disasters produce: one operational deployment documented receiving up to 369GB of drone imagery per day, far more than any team can manually review.
Of course, no one set out to create a system that would deprioritise less visible damage. But the combination of practical, defensible decisions about imagery sources, training data, deployment timelines, and coverage priorities produced exactly that outcome. So what would this mean if these skewed labels were feeding into real-world decisions?
Through the Operational Reality lens, the system's outputs would appear sound. Buildings classified, maps generated, assessments delivered on time. The model would perform as designed. But its performance would be shaped by assumptions about resolution, vantage point, and what "damage" looks like. These are the very assumptions that, as the study shows, do not hold equally across imagery sources.
Through the Power & Agency lens, those outputs would then shape downstream decisions. Damage assessments inform where inspection teams go first, which communities receive priority aid, and how recovery funding is allocated. An AI system that detects damage unevenly would not just produce inaccurate data, it would redistribute attention, resources, and urgency, without anyone explicitly making that choice.
Speed isn't neutral
Going back to the ACM conference paper, the researchers who built and deployed these systems deserve credit for something: they documented not just their successes but also their limitations.
Their published work openly identifies the operational challenges, including connectivity failures, resolution variations and spatial misalignment, that degraded or delayed their outputs in the field. They also conducted the audit that revealed the 29% disagreement between imagery sources. More of this kind of transparency, please!
But it also raises an uncomfortable question: if the people who build these systems are openly acknowledging the gaps, what happens when organisations adopt the tools without reading the fine print? In fast-moving disaster response, the pressure to act on available information is immense. A damage assessment that arrives quickly, looks comprehensive, and provides clear classifications is hard to second-guess, especially when the alternative is waiting days for ground-level verification. Speed creates its own authority. But speed is not neutral. When a system produces results faster than they can be validated, the window for challenge shrinks. And when those results carry the visual authority of a colour-coded damage map, they tend to be treated as more definitive than their creators intended.
The risk is not that emergency managers blindly trust AI. Most are experienced, sceptical, and aware of limitations. The risk is that under pressure, with incomplete information and urgent demands for action, a "good enough" automated assessment becomes the baseline and the communities it overlooks become harder to advocate for, because the data does not make their case visible.
So, what are we going to do about this?
This is not an argument against AI-assisted damage assessment. The technology addresses a real and growing problem, and early deployments have shown it can deliver results at a speed that manual assessment simply cannot match. But efficiency and equity are not the same thing. And a system that delivers both speed and confidence can make the gap between them harder to see, not easier.
So it's over to you with this week's question: When an AI system is faster and more scalable than any alternative, but you know its accuracy is unevenly distributed, what does responsible deployment actually look like?
Leave your thoughts in the comments and I will see you in a fortnight.
RB.
Sources & further reading:
- Manzini, T., Perali, P., Tripathi, J. & Murphy, R. (2025). "Now you see it, Now you don't: Damage Label Agreement in Drone & Satellite Post-Disaster Imagery." ACM Conference on Fairness, Accountability, and Transparency (FAccT '25). https://dl.acm.org/doi/10.1145/3715275.3732135
- Manzini, T., Perali, P. & Murphy, R. (2025). "Deploying Rapid Damage Assessments from sUAS Imagery for Disaster Response." arXiv preprint, Texas A&M University. https://arxiv.org/abs/2511.03132
- Manzini, T., Perali, P., Murphy, R. & Merrick, D. (2025). "Challenges and Research Directions from the Operational Use of a Machine Learning Damage Assessment System." arXiv preprint, Texas A&M University / Florida State University. https://arxiv.org/abs/2506.15890
- Lythreatis, S. et al. (2025). "Artificial Intelligence in Humanitarian Aid: A Review and Future Research Agenda." Technovation. https://www.sciencedirect.com/science/article/pii/S0166497225002470