Information pollution at scale

Information

Our information sphere is the sum total of all of humanity’s writings and conversations. We humans recall, we instruct, and we tweet. No doubt some of this chatter holds lies, some of it is mundane, and some may even convey deep truths.

As I am about to add my own contribution to the marketplace of ideas, my app’s autocorrect function suggests an improvement. I write “That’s the way…” expecting to finish with “…some people are.” For a moment, I pause to consider. How about “That’s the way the cookie crumbles”?

The generative AI that made this suggestion was trained on a large body of text, perhaps some small portion written by me. The choice it proposed was not on my mind, perhaps not even an example of my personal style. Should I absorb it into my collection of writings?

Pollution

It’s hard to see how accepting such an offer would harm me personally. Assuming I approve of the suggestion, it could very well be an improvement. But it wouldn’t be true to me. As part of a training set, it wouldn’t be a valid example to others of how I think or write.

The result would be added to the world’s store of writing examples. AIs train on these datasets to learn what a sentence should look like. Neural nets, in effect, collect votes: the more frequently “cookie crumbles” appears in the training set, the more frequently it will be suggested to future authors in search of a phrase.
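
As a rough sketch of that voting mechanism, the toy Python below counts how often each continuation of a prefix appears in a made-up corpus and suggests the most frequent one. The corpus and phrases are invented for illustration; real autocomplete models are far more sophisticated, but the frequency-votes-win dynamic is the same.

```python
from collections import Counter

# Toy corpus of prior writings (made up for this sketch).
corpus = [
    "that's the way the cookie crumbles",
    "that's the way the cookie crumbles",
    "that's the way some people are",
    "that's the way it goes",
]

prefix = "that's the way"

# Tally every continuation of the prefix seen in the corpus.
completions = Counter(
    line[len(prefix):].strip() for line in corpus if line.startswith(prefix)
)

# The most frequent continuation wins the suggestion slot.
print(completions.most_common(1))
# [('the cookie crumbles', 2)]
```

Every accepted suggestion feeds another copy of the winning phrase back into the corpus, tilting the next vote further in its favor.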

This shift away from authenticity is a form of corruption. Whatever biases or prejudices I carry (and surely I do), my unedited works reveal them as truly mine. When AI infiltrates my writing, like a stowaway virus, it hitches a ride on my information stream, passing its phrasing off as mine. The result is then used to train more AIs, which in turn amplify a weak signal that wasn’t even mine to begin with.

In a healthy ecosystem, one should avoid consumption of one’s own waste product as a food source. But that is precisely what our generative AIs are doing.1
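
The paper cited below documents this effect in large models. A minimal sketch of the same feedback loop, with a Gaussian standing in for the model and all numbers chosen arbitrarily, looks like this:

```python
import random
import statistics

# Toy "model collapse": each generation fits a Gaussian to samples
# drawn from the previous generation's fitted Gaussian, i.e. it
# trains only on its predecessor's output.
random.seed(0)

mu, sigma = 0.0, 1.0   # generation 0: the real data distribution
n = 50                 # training-set size per generation

for gen in range(1, 101):
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    mu = statistics.fmean(samples)       # refit mean on generated data
    sigma = statistics.pstdev(samples)   # refit spread on generated data
    if gen % 25 == 0:
        print(f"generation {gen:3d}: sigma = {sigma:.3f}")
```

Run it and the spread of the distribution steadily collapses toward zero: the toy analogue of the “irreversible defects” the authors report when models eat their own output.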

My role as a living information-generating host is hardly necessary for pollution to thrive. There are plenty of incubators where machine-generated text flourishes and provides training data for ravenous AIs. One need look no further than social media and online product reviews to find bots that push disinformation as hard as they can to serve the purpose of their masters.

This is information pollution, at scale. In a polluted environment, it becomes ever harder to find an unpolluted signal we can trust. And the same polluted information is served up as training data, corrupting the very signals we must rely on to construct our truth-detectors.

At scale

Our information sphere is being polluted by rapidly expanding sources of false signals. It makes no difference whether the false information is spread deliberately or by accident: it’s all pollution, making true signals harder to identify and to trust. The growth is an unsurprising consequence of the commercial incentives to generate strong false signals at scale.

We now find microplastics in the air we breathe, the food we eat, even in remote glaciers and on deserted islands. There is no way to remove them or to escape them; they are with us forever, in every corner of the planet. In just the same way, we have set about polluting all our information with phrases that are not even ours.

Why are the incentives aligned with false signals? We like to think capitalism is good at natural selection among goods and services: we get better products because we all shop for products that delight us and avoid those that don’t. But pollution is hard to stop, because it doesn’t respond to market feedback. We get the tragedy of the commons because the incentives are decoupled from the harm.

The victims of pollution aren’t always in the immediate vicinity. If the sandwich I eat today contains microplastics from a shopping bag from Buenos Aires, there’s not a whole lot I can do about it. With pollution, the harm can be done far from its victims, or long before its impact is felt. That makes accountability more difficult, and it defeats the mechanisms of capitalism that are supposed to reward good behavior.

So what

Generative AI is not just a tool but a weapon: it lets you generate strong false signals at scale. The capacity to generate false signals far outstrips any capacity to detect true ones. Worse, the false signals end up polluting the training sets used to build legitimate detectors.

We rely on true signals for civic discourse, law enforcement, healthcare, and commerce. Mimicry and camouflage corrupt these true signals. A world where these true signals are corrupted cheaply, easily, and at scale is very, very unsafe.

  1. “We find that use of model-generated content in training causes irreversible defects in the resulting models”, in The Curse of Recursion: Training on Generated Data Makes Models Forget 

August 2022