Introduction
I had no intention of jumping into the controversy surrounding generative AI. But as someone who frequents many support sites for writers and participates in writers’ workshops, I sense a growing tide of animosity towards it. Some of this animosity stems from legitimate concerns, but much of it is based on misinformation—so much so that it feels like we’re in the midst of a political campaign.
With a foot in both the writing and AI communities, I’ve even felt a sense of persecution from the writing side. Rather than attempting to change opinions, I decided it was important to simply explain what generative AI actually does, point out where there may be ethical ambiguities, and let readers decide for themselves.
Top Misconceptions
A common depiction of generative AI is that it contains a large database of images and text and uses some algorithm to figure out how to cobble together new images or text based on a user’s prompt. A simple back-of-the-envelope calculation shows this is far from the truth. Take the Stable Diffusion models, trained on the LAION-5B dataset, which is estimated at around 240 TB of already compressed images (mostly in JPEG format). Yet a trained Stable Diffusion model can be downloaded at a size of just a few gigabytes. Common sense tells us that you cannot store 240 TB of data in a space that’s four to five orders of magnitude smaller. Additionally, these models contain no algorithmic code specifically for assembling images, and very little, if any, algorithmic code for interpreting text.
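If you want to check the arithmetic yourself, here it is in a few lines of Python; the 240 TB and few-gigabyte figures are rough estimates, not exact measurements:

```python
# Rough back-of-the-envelope comparison of dataset size vs. model size.
# Figures are approximate: ~240 TB for LAION-5B's compressed images,
# ~4 GB for a downloadable Stable Diffusion checkpoint.
import math

dataset_bytes = 240e12   # ~240 TB of JPEG-compressed images
model_bytes = 4e9        # ~4 GB trained model download

ratio = dataset_bytes / model_bytes
print(f"Dataset is roughly {ratio:,.0f}x larger than the model")
print(f"That is about {math.log10(ratio):.1f} orders of magnitude")
# -> roughly 60,000x, i.e. between four and five orders of magnitude
```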
Another common misconception is that generative AI models think or understand. At their core, these models statistically predict a passage of text or generate an image based on the user’s prompt. While the goal is to mimic human thinking or understanding, this mimicry is only superficial. The responses are merely sophisticated predictions of what a human might expect in reply, generated from statistics learned from the vast amounts of data they were trained on.
To illustrate this point, I found a website featuring logic puzzles and selected a “very simple” one. Instead of including a transcript, which might infringe on the site owner’s content, I’ll describe the results here; similar examples can easily be found online. To my surprise, the model nearly solved the first puzzle correctly. However, when presented with a second, even simpler puzzle, it stumbled, insisting that there wasn’t enough information to solve it. This highlights an important point: generative AI doesn’t reason deductively. Any semblance of logic is derived purely from language mimicry. What should be more surprising, then, is how close it sometimes gets to a logical conclusion.
Though I haven’t heard it often, there is also a misconception that AI models are continually training themselves. In one instance I recall, this data-hungry misconception conjured impressions of a “SkyNet” scenario. In reality, all AI models operate in a very structured way. They undergo an initial training phase, during which their internal parameters (or weights) are adjusted to learn the statistics of a given image or text dataset. Once this training passes muster—often in a test phase—the model’s parameters are set, and they no longer change. This means that the model’s “knowledge” is fixed from that point onward.
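For the curious, here is a minimal PyTorch sketch of that distinction (using a toy network, not any real production model), showing that asking a trained model for a prediction changes none of its weights:

```python
# Minimal sketch: a model's weights do not change during inference.
import torch
import torch.nn as nn

model = nn.Linear(8, 2)             # stand-in for an already-trained model
model.eval()                        # inference mode

before = model.weight.detach().clone()

with torch.no_grad():               # no gradients, no learning
    _ = model(torch.randn(1, 8))    # "ask" the model something

after = model.weight.detach().clone()
print(torch.equal(before, after))   # True: nothing was learned
```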
In fact, when running inference, an LLM is stateless, meaning it remembers nothing from one request to the next. But ChatGPT remembers my conversation, you say. Well, it does, but only because the framework supporting the LLM (GPT) keeps track of the conversation; GPT itself doesn’t remember a thing.
Think of it this way—though this is, as usual, an oversimplification—it’s like every time you ask ChatGPT something, GPT gets, “Here’s what the user just asked, and by the way, here’s what we’ve been talking about…” This makes it seem like GPT has memory, but in reality, it’s just being provided with the conversation history each time.
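If you like, here is that idea in a few lines of Python. The `llm_complete` function is a made-up stand-in for the actual model call; the point is that the “memory” lives entirely outside the model:

```python
# Toy illustration: the "memory" lives in the framework, not in the model.
history = []   # the chat framework keeps this; the LLM does not

def ask(user_message, llm_complete):
    """Rebuild the full conversation and hand it to the stateless model."""
    history.append(("user", user_message))
    prompt = "\n".join(f"{role}: {text}" for role, text in history)
    reply = llm_complete(prompt)          # model sees the whole transcript
    history.append(("assistant", reply))
    return reply

# The model function receives everything discussed so far, every single time.
echo = lambda prompt: f"(a reply to {len(prompt)} characters of context)"
print(ask("What is a transformer?", echo))
print(ask("Can you say that more simply?", echo))   # longer prompt this time
```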
What many AI developers do is continually acquire more data and retrain a model on an updated dataset. Once retrained, they may release it as a new version of the model. Many AI services perform these updates without necessarily informing users, creating the impression that the model is continuously learning.
Additionally, there’s the concept of fine-tuning. In fine-tuning, a generally trained model is trained further, from that broad starting point, on specialized data to improve its performance in a specific domain. For example, a general-purpose language model could be fine-tuned on a medical text corpus, enhancing its ability to answer questions related to healthcare.
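In code terms, fine-tuning is simply more training that starts from already-learned weights rather than random ones. Here is a rough PyTorch sketch; the toy network, the commented-out checkpoint file, and the random “specialized” data are all placeholders, not a real medical corpus or model:

```python
# Sketch of fine-tuning: load a pretrained model, then keep training it on new data.
import torch
import torch.nn as nn

# Placeholder "pretrained" model; real fine-tuning would load an actual checkpoint.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
# model.load_state_dict(torch.load("general_model.pt"))  # the broad starting point

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # small learning rate
loss_fn = nn.CrossEntropyLoss()

# Stand-in for a specialized dataset (e.g., medical text turned into features/labels).
specialized_x = torch.randn(64, 16)
specialized_y = torch.randint(0, 4, (64,))

for epoch in range(3):                       # a few passes over the niche data
    optimizer.zero_grad()
    loss = loss_fn(model(specialized_x), specialized_y)
    loss.backward()
    optimizer.step()                         # weights shift toward the new domain
```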
Regardless of whether fine-tuning or retraining takes place, AI models have distinct phases: training, where learning happens, and prediction, where the model applies its fixed knowledge to new inputs.
Generative AI
The term itself is a bit misleading. “Generative AI” has evolved more in mainstream usage than in scholarly work, and it is more of a functional description than a well-defined technical term. The same architecture—or even the same model—might be used in both generative and non-generative applications.
While I am not the keeper of technical language, I would argue that “Generative AI” is much more a mainstream description of AI systems rather than a category of AI models. Transformers, for example, while state-of-the-art in text generation, also have applications in classification and recognition—non-generative applications. They even play a significant role in autonomous vehicles, where their purpose is far from generative.
Big Ass Statistical Models
I think the standing of AI models in the art community has been hurt by the marketing hype surrounding AI, with its connotations derived from science fiction. Marketing teams want to associate their products with images of machines like HAL-9000, Data from Star Trek, Sonny from I, Robot, and Baymax from Big Hero 6 in an effort to portray their products as cutting-edge and technologically advanced. This imagery, however, is also what I believe frightens the creative world. As depicted, characters like Data and HAL were capable of human-type creativity.
The truth is that both “AI” and “generative AI” are somewhat nebulous terms. During my undergraduate days, I heard AI pioneer Marvin Minsky define AI, somewhat tongue-in-cheek, as: “If it works, it’s not AI.” The implication was twofold: first, the state of the art at the time wasn’t particularly utilitarian; and second, the definition of AI itself was vague.
Though I can’t say for certain, the term “generative AI” likely gained traction with the advent of generative adversarial networks, which helped popularize the adjective “generative.” Again, there is no precision tied to the term. In fact, if you search scholarly works, you’ll rarely find “AI” or “generative AI” named directly as the field of research; they may appear in tangential areas, however.
Tongue in cheek, I propose to address this confusion with a new term describing the systems out there—big ass statistical model, or the acronym BASM. A pet peeve of mine is the misuse of the word “acronym.” AI, for example, is not an acronym; it’s an abbreviation, specifically an initialism. Laser, scuba, and radar are acronyms. Rule of thumb: if you’re spelling it out when you say it, it’s not an acronym. Enough of my complaining.
Both present-day LLMs and VDMs (image diffusion models) are statistical models, plain and simple. They’re also huge, ranging from a few gigabytes to hundreds, hence my descriptive “big ass.”
BASM doesn’t have the same marketing ring as generative AI, but it would separate reality from the hype.
Large Language Models
The core of most state-of-the-art language models is the transformer architecture. Unlike many earlier forms of neural networks, transformers retain sequential information—the order of words matters. In a nutshell, all an LLM such as GPT (Generative Pre-trained Transformer) does is predict the next word in a sequence based on statistics.
I won’t delve into an example here. My next article will give a detailed example of precisely how an LLM works by recording every step a smaller LLM takes when given a prompt.
It is amazing how well simple completion can mimic the following of instructions. This illusion arises primarily because instructions and the responses to them are embedded within the training set itself. Furthermore, the way the model encodes semantic similarity allows it to produce responses to prompts that may never have appeared explicitly in the training data. This combination of probabilistic prediction and semantic generalization enables LLMs to produce responses that mimic instruction-following, but in reality, it is just overgrown text completion.
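Until that next article, here is a deliberately tiny stand-in for “predicting the next word from statistics”: a hand-written probability table instead of billions of learned weights, but the generation loop has the same shape:

```python
# Toy "language model": a hand-written next-word probability table.
# A real LLM learns billions of such statistics; the sampling loop is the same idea.
import random

next_word_probs = {
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.4, "ran": 0.6},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def complete(prompt_words, max_words=5):
    words = list(prompt_words)
    for _ in range(max_words):
        options = next_word_probs.get(words[-1])
        if not options:                      # nothing statistically likely follows
            break
        choices, weights = zip(*options.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(complete(["the"]))   # e.g. "the dog ran away"
```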
Understanding how this prediction is learned during training should give you insight into how your text, should it end up in a training set, is used by an LLM. While I am not here to pass judgment on how an author’s data may have been used—whether properly or improperly, legally or illegally—I do want to clarify how that data is actually used.
During the training process, your text is never copied verbatim into the model. Instead, the model extracts statistics from your text. It learns the likelihood of certain words or phrases following others rather than storing the text itself. However, short-form plagiarism is possible if your work appears in the training set often enough to reinforce those statistics.
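As a crude illustration of “extracting statistics rather than storing the text,” consider counting which word follows which. A real model learns far richer statistics inside its weights, but the principle is the same: what survives training is counts, not your passage:

```python
# Crude illustration: what survives "training" is counts, not the original passage.
from collections import Counter, defaultdict

passage = "the quick brown fox jumps over the lazy dog and the quick dog barks"

stats = defaultdict(Counter)
words = passage.split()
for current, nxt in zip(words, words[1:]):
    stats[current][nxt] += 1               # how often does `nxt` follow `current`?

print(stats["the"])    # Counter({'quick': 2, 'lazy': 1})
print(stats["quick"])  # Counter({'brown': 1, 'dog': 1})
# The counts reflect the passage's statistics, but the passage itself is gone.
```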
One thing to consider, whether you view it as right or wrong, is that if any of your content can be coaxed from an LLM, it means your work appears in the training data often enough to shape the statistics; in other words, it indicates your popularity.
Image Models
Image models such as Stable Diffusion and DALL-E operate using variational diffusion models (VDMs), although some newer models use transformers, which are trained and implemented similarly to the LLMs described above. VDMs operate completely differently from transformers. To dumb down the explanation, a VDM takes a set of images corresponding to a text prompt and trains by adding noise to each image and then attempting to denoise it. Progressively more noise is added, and the VDM is trained to remove it.
This process repeats across all images related to a given prompt, and over time the model learns to denoise them effectively. Suppose the prompt is simply “dog.” At the end of training, you are left with a very efficient denoiser for dog images, one that has learned to pick out features specific to a dog.
So, what happens if you simply start with noise? In theory, the denoiser “sees” dog features in the noise even though there aren’t any. You end up with a “hallucinated” dog. In essence, you have AI pareidolia: the system sees patterns that aren’t there, just as humans might see a face in a rock formation.
This simplification ignores the interplay between the text and the image. The training data actually consists of image-text pairs, and the nuance between text and image isn’t captured in this simplified example; its purpose is only to demonstrate how images in a training set are used. As with text, the data is never copied. Instead, the model can generate a stylistic similarity based on the prompt, like “dogs playing poker in the style of Picasso.” If you are lucky, the text parses correctly, and you’ll get the style of Picasso represented statistically. However, one of the drawbacks is the model’s more limited linguistic understanding, so you could end up with an image of Picasso holding a poker game with dogs. That’s why many new models are trying to combine the language prediction capabilities of transformers with VDMs, while others are moving purely to transformers.
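To make the noising-and-denoising idea slightly more concrete, here is a toy PyTorch sketch of the training loop. Everything in it is a stand-in: a fake “image” tensor, a single linear layer as the “denoiser,” and no text conditioning at all, so it illustrates the principle rather than a real Stable Diffusion pipeline.

```python
# Toy sketch of diffusion training: corrupt an image with noise, then train a
# network to predict that noise. Real models use U-Nets or transformers plus
# text conditioning; this only shows the principle.
import torch
import torch.nn as nn

denoiser = nn.Linear(64, 64)                  # stand-in for a real denoising network
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

image = torch.rand(1, 64)                     # stand-in for a "dog" training image
for step in range(100):
    noise = torch.randn_like(image)
    noise_level = torch.rand(1)               # how corrupted this sample is
    noisy_image = image + noise_level * noise # forward process: add noise

    predicted_noise = denoiser(noisy_image)   # model guesses the noise to remove
    loss = nn.functional.mse_loss(predicted_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Generation, in spirit: start from pure noise and "remove" the noise the model
# thinks it sees, until a dog-like pattern emerges.
pure_noise = torch.randn(1, 64)
guess = pure_noise - denoiser(pure_noise)     # one crude denoising step
```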
Conclusion
If nothing else, this article explains how LLMs and VDMs are trained and how the training data is used. For those who think that usage is unethical, you can now frame your arguments correctly, without inflammatory statements and with a theory that aligns with reality. For those who don’t, you can combat ignorance and misinformation.
Either way, I believe that knowledge, not rhetoric, is essential to addressing ethical concerns. Without a grasp of the facts, there is no debate—only empty arguments.
