As governments and organizations race to deploy AI tools to detect online abuse, a troubling contradiction has emerged. These systems, designed to protect vulnerable people from digital harm, might inadvertently create new forms of discrimination and surveillance.
Consider what happened in New Zealand last year. The Department of Internal Affairs deployed an algorithm to flag potentially harmful content for human review. The system was meant to enhance protection against online abuse—but officials eventually discovered it was disproportionately flagging content from Māori and Pacific Island communities, effectively targeting cultural expressions rather than actual threats.
This case highlights the fundamental tension in AI abuse detection systems: they promise to scale protection beyond human capacity while simultaneously risking harm to the very communities they aim to safeguard.
“The appeal of these systems is obvious,” explains Dr. Maya Ramirez, digital rights researcher at the University of Toronto. “Human moderators can’t possibly review the billions of posts created daily. But AI systems learn from historical patterns of enforcement, which often reflect existing societal biases.”
The problems run deeper than just biased training data. Many abuse detection systems operate as black boxes, making decisions that even their creators can’t fully explain. When these systems are deployed by governments or powerful platforms, they can amplify existing power imbalances.
In Australia, welfare surveillance algorithms have disproportionately flagged Indigenous recipients for investigation. Similar patterns have emerged with predictive policing in the United States, where systems have intensified scrutiny of already over-policed communities.
What makes these systems particularly concerning is their scale and invisibility. Unlike human moderators, algorithmic systems can monitor millions of interactions simultaneously, with little visibility into how their decisions are made.
“There’s a dangerous assumption that these systems are neutral,” notes Jayden Wong, tech policy advisor at Digital Rights Watch. “But they’re designed by humans with specific worldviews and trained on data that reflects historical patterns of discrimination.”
The financial incentives driving AI development further complicate matters. Major tech companies have invested billions in content moderation AI, promising safer online spaces while simultaneously cutting human moderator jobs. This creates a profit-driven approach to protection that prioritizes scale over nuance.
Some communities have found their cultural expressions repeatedly flagged by these systems. Indigenous activists report having content about traditional practices removed, while abuse directed at them often goes unaddressed. LGBTQ+ advocacy content frequently triggers automated flags, while subtler forms of harassment slip through undetected.
“These systems simply don’t understand context,” explains Aisha Johnson, who researches online harassment at the Australian National University. “They can’t distinguish between a harmful slur and a community reclaiming language, or between genuine threats and cultural expressions.”
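To see why context-blindness matters, consider a deliberately naive keyword-based filter. The sketch below is an illustrative Python example only, with a placeholder term list and invented posts; production systems use learned classifiers rather than blocklists, but they can exhibit the same failure mode when their training data lacks community context.

```python
# Deliberately naive keyword flagger, illustrating the context problem
# Johnson describes. FLAGGED_TERMS and the example posts are hypothetical.

FLAGGED_TERMS = {"slur_example"}  # stand-in for a real blocklist entry

def naive_flag(post: str) -> bool:
    """Flag a post if it contains any listed term, ignoring who says it and why."""
    words = {w.strip(".,!?'\"").lower() for w in post.split()}
    return bool(words & FLAGGED_TERMS)

posts = [
    "Someone called me a slur_example on the bus today.",    # victim reporting abuse
    "Our community is reclaiming slur_example with pride.",  # reclaimed usage
    "People like you always get what's coming to you.",      # veiled threat, no listed term
]

for post in posts:
    print(naive_flag(post), "->", post)

# The first two posts are flagged even though neither is abusive, while the
# third, a genuine threat phrased without any listed term, passes untouched.
```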
The limitations become even more apparent across languages. Most abuse detection systems perform significantly worse in languages other than English and in non-standard dialects, creating global inequities in protection. A recent study by the Algorithmic Justice League found that content moderation AI was 27% less accurate at identifying harmful content in African American English than in Standard American English.
Despite these concerns, adoption of these systems continues to accelerate. In the EU, the Digital Services Act now requires platforms to demonstrate how they're addressing illegal content, creating pressure to implement automated detection. In Australia, the eSafety Commissioner has been granted expanded powers to require platforms to remove harmful content within short timeframes, deadlines that in practice all but necessitate algorithmic approaches.
The financial scale is equally significant. The content moderation AI market is projected to reach $14.8 billion by 2028, according to industry analysts at MarketsandMarkets. This represents one of the fastest-growing segments of the AI industry.
Some researchers and advocates are working on alternative approaches. The Partnership on AI has developed guidelines for more transparent and accountable content moderation systems. Community-based moderation models, like those pioneered on platforms such as Mastodon, distribute decision-making power rather than centralizing it in algorithms.
“We need to recognize that technology alone won’t solve these problems,” says Professor Carlos Mendes, who studies digital governance at McGill University. “The most effective approaches combine thoughtful human oversight with systems designed to support human judgment rather than replace it.”
The path forward requires a fundamental shift in how we approach these technologies. Rather than rushing to deploy AI systems as cost-saving measures, organizations need to carefully consider the potential for harm and build in meaningful accountability.
This means including affected communities in system design, conducting regular bias audits, creating clear appeals processes, and maintaining meaningful human oversight. It also requires recognizing that some forms of protection cannot be automated without causing new harms.
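What a bias audit involves can be made concrete with a small example. The sketch below is a hypothetical Python illustration, with invented group labels and records, showing one slice of such an audit: comparing how often a classifier wrongly flags non-abusive content from different communities.

```python
# Minimal sketch of one slice of a bias audit: comparing false-positive
# rates across (hypothetical) community groups. All data here is invented.

from collections import defaultdict

# Each record: (group label, model flagged it, human reviewer judged it abusive)
labelled_sample = [
    ("group_a", True,  True),
    ("group_a", True,  False),
    ("group_a", False, False),
    ("group_a", False, False),
    ("group_b", True,  True),
    ("group_b", True,  False),
    ("group_b", True,  False),
    ("group_b", False, False),
]

stats = defaultdict(lambda: {"false_pos": 0, "non_abusive": 0})
for group, flagged, abusive in labelled_sample:
    if not abusive:                      # only non-abusive posts can be false positives
        stats[group]["non_abusive"] += 1
        if flagged:
            stats[group]["false_pos"] += 1

for group, s in stats.items():
    rate = s["false_pos"] / s["non_abusive"]
    print(f"{group}: wrongly flagged {rate:.0%} of non-abusive posts")

# group_a: 1 of 3 non-abusive posts wrongly flagged (33%)
# group_b: 2 of 3 non-abusive posts wrongly flagged (67%)
# A persistent gap like this is what an audit escalates for human review.
```

A real audit would use far larger samples, confidence intervals, and repeated measurement over time, but the underlying question is the same: does the system's error burden fall evenly across communities?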
As AI continues to transform digital spaces, the question isn’t simply whether these technologies can detect abuse, but whether they can do so in ways that don’t perpetuate existing patterns of discrimination and surveillance. The promise of protection must be balanced against the potential for new, algorithmic forms of harm.
The real challenge may be resisting the temptation to view AI as a simple solution to complex social problems. Truly protecting vulnerable people online requires not just better algorithms, but better governance, community empowerment, and a commitment to digital spaces that reflect our highest values rather than our deepest biases.