I fed it an image to generate a description. Then I fed it the resulting description—to generate a new image. Then I fed the resulting image—to generate a new description. Then...
It's funny, yes, but also fascinating, and insightful. It shows more than how AI operates.
Take a look: it's different from a game of telephone. In a game of telephone, you can, in theory, arrive at the original message, even after rounds of mistakes.* Here, the transformation has a distinct vector. Like entropy or time, it moves in one direction only: simplify, dramatize, exaggerate. There are obvious errors, too, but those can be eliminated with crosschecks. The overall shift, however, is hard to rectify. I knew something like that would happen, so I specifically instructed it to *avoid* dramatizing and exaggerating. I was clear the goal was visual fidelity, not smooth or punchy prose.
I won't elaborate on the insights much. An AI can already explain it pretty well. Here are some excerpts:
"It produces a compressed linguistic summary of what it considers salient. Language pulls perception toward concepts. The real world is messy, asymmetric, and contingent. Descriptions compress that mess into categories. Categories are cleaner than reality. So repeated image ↔ language cycling doesn’t just reveal model bias—it reveals something about cognition itself. Humans do something similar. AI is not inventing this tendency. It amplifies a property of symbolic compression. Current systems converge toward dataset archetypes."
We can never create a system of lossless conversion between the real world and a natural language. We can, however, try to eliminate the unidirectional drift.
With ChatGPT, we discussed possible solutions and caveats, and then it engineered a prompt that produced much better results.
The descriptions looked more like bullet points of metadata than prose, but they were readable.
Yet it still seemed to move in one direction only. It was less obvious though, and the nature of this movement was harder to understand. Perhaps the image-generating model is moving toward the average of all the images it was trained on? And perhaps the description-generating model is moving toward the average of all the descriptions it was trained on? I don't believe those are the major drivers behind the continuous drift.
I think we're still prisoners of a more fundamental nature. See how the characters become more and more pronounced, while first the background and then the entire scene fade to a generic blur? Despite my best attempts, both models accentuate what they consider most noteworthy, just like human writers and painters would.
---
*Come to think of it, a game of telephone would also display a shift, albeit a less distinct and uniform one. The message would ultimately move toward something that's easier to memorize, enunciate, and recognize.
**There are ways, probably. Introduce more unbiased noise and ambiguity. Counter models' tendencies more aggressively. Do something else that I don't immediately see. I leave it as is simply because the effort outmatches the joy of experimentation or success.