Introduction
It is surprising how ignorant people are about what LLMs actually do, even in the very sectors that actively leverage large language models. Take one step back from that—say, to businessmen and the media—and the picture gets even more distorted. No wonder so many writers end up totally misinformed about them.
I did, however, run across an article—though I wish I remembered who wrote it and where it was published—that I think was about forecasting NVIDIA’s stock. The reporter wrote, to paraphrase: an LLM predicts the next word in a string. Period. Finally, someone who actually understands.
In a very real sense, an LLM is the opposite of machine intelligence. It is not “thinking.” It is not “reasoning.” It is simply a massive statistical model. It has no semblance of reasoning, no capacity even for basic if–then logic. At one point, I went through a phase of calling these AI systems (both text and image) BASM: big ass statistical models. It was my way of driving home that point. While the phrase is a cute encapsulation of what they are, it doesn’t really capture what they do.
So I came up with another sound bite to explain what an LLM specifically is: “autocompletion on autopilot.” A co-worker of mine prefers the phrase “autocompletion on steroids.” While that certainly captures the scale, I feel mine is a better functional description.
Parts of a Transformer
First, if you skipped my other blogs, a transformer is the fundamental architecture behind most language models. At the heart of a transformer is a multi-layer perceptron neural network (MLP), which honestly hasn’t evolved all that much since I studied and built them back in the 1990s. The problem with MLPs is simple: they do not work well with sequence data such as language, where the order of things actually matters and adjacent data points are connected in meaningful ways.
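Since I just threw the term at you, here is what an MLP boils down to in code: a minimal sketch of a two-layer forward pass in NumPy, with random weights purely for illustration.

import numpy as np

def mlp(x, W1, b1, W2, b2):
    h = np.maximum(0.0, x @ W1 + b1)  # hidden layer with ReLU activation
    return h @ W2 + b2                # linear output layer

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # one 4-dimensional input
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)    # input-to-hidden weights
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)    # hidden-to-output weights
print(mlp(x, W1, b1, W2, b2))                    # three output scores

Notice the output is a fixed function of a single fixed-size input. Nothing in that computation knows about order or context, which is exactly the problem with sequences.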
Fast forward to 2017 and the introduction of the self-attention mechanism. Self-attention is basically a method of recording the relationships between words so that words which frequently occur together are recognized and coded as related. That’s pretty much it. Without drowning you in technical details, this mechanism was a vast improvement over earlier approaches—things like generative adversarial networks (GANs) and recurrent neural networks (RNNs). In my view, the true innovation with self-attention was that it eliminated the need for feedback loops in making predictions and allowed for faster convergence in training—something that had always been the bane of RNNs and GANs, respectively.
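If you are curious what that mechanism looks like mechanically, here is a minimal sketch of scaled dot-product attention, the core operation, again in NumPy. The q/k/v names are the conventional query/key/value matrices from the 2017 paper, and the data is random, purely for illustration.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # score every position against every other position,
    # scaled by the key dimension to keep the softmax well behaved
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    # convert scores to weights that sum to 1, then blend the values
    return softmax(scores) @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))   # three tokens, 4-dimensional embeddings
print(attention(x, x, x))     # each row is now a context-aware mix of all three

Every token’s output row is a weighted blend of all the other tokens’ value rows; that is the “recording of relationships” in action, and note there is no feedback loop anywhere.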
Now, while the MLP and self-attention layer are highly mathematical, there are two other components that are just as important if you want to actually understand how an LLM works. The first is the tokenizer. This converts words—or pieces of words—into numbers, because neural networks are mathematical systems and words must be represented numerically. The second is the sampler. When the MLP makes a prediction, it doesn’t produce a single word. Instead, it generates a distribution: a list of possible words along with associated scores, called logits, which are normalized into probabilities.
The role of the sampler is to choose from those probabilities. This is also the source of randomness in an LLM. In theory, if your sampler is set to be greedy (always picking the single most likely word), then the LLM is completely deterministic: the same input will always yield the exact same output every single time. But using traditional sampling techniques—this is where you see terms like temperature, top-k, and nucleus sampling associated with LLMs—you get tunable randomness. This is often what gives people the illusion of creativity.
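To make that concrete, here is a rough sketch of a sampler (my own illustrative version, not the one from any particular library), showing greedy selection, temperature, and nucleus (top-p) sampling over a made-up five-token vocabulary.

import numpy as np

def sample(logits, temperature=1.0, top_p=1.0, greedy=False):
    logits = np.asarray(logits, dtype=float)
    if greedy:
        return int(np.argmax(logits))    # deterministic: same input, same output
    z = logits / temperature             # temperature <1 sharpens, >1 flattens
    probs = np.exp(z - z.max())
    probs /= probs.sum()                 # softmax: scores become probabilities
    order = np.argsort(probs)[::-1]      # most likely first
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]  # smallest set covering top_p mass
    p = probs[keep] / probs[keep].sum()  # renormalize inside the nucleus
    return int(np.random.choice(keep, p=p))

logits = [2.1, 1.8, 0.9, 0.4, -1.2]                 # pretend scores for 5 tokens
print(sample(logits, greedy=True))                  # always token 0
print(sample(logits, temperature=0.3, top_p=0.9))   # usually 0, occasionally 1

Set greedy=True and the output never varies; loosen the temperature and the “creativity” appears.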
A Real Example
The following is a transcript from a program I wrote that runs an LLM but reports intermediate results after every prediction. Incidentally, ChatGPT was used to write a significant part of the code. The underlying model is Mistral 7B, an open-source model trained by the French company Mistral AI—and yes, I know some of you writers have a bone to pick with AI companies in general. However, Mistral is open, small, and has a simple prompting format, which makes it very practical for this kind of demonstration.
My initial input was:
[INST] Where is Paris? [/INST]
You’re probably thinking, what’s up with the brackets? Simply put, that is the template Mistral expects its input to be in. Other models have more complicated formats that might include system instructions, guardrail instructions, standing user instructions, context information, and then finally the user prompt itself. By contrast, Mistral’s format is basic, and that’s exactly why I chose the model for this experiment.
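Incidentally, if you use the Hugging Face transformers library, the tokenizer can apply the template for you. A quick sketch follows; I’m assuming the mistralai/Mistral-7B-Instruct-v0.2 checkpoint here and in the later snippets, though any Mistral 7B Instruct variant should behave similarly.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
msgs = [{"role": "user", "content": "Where is Paris?"}]
# renders the chat into Mistral's [INST] ... [/INST] format, roughly:
# '<s>[INST] Where is Paris? [/INST]'
print(tok.apply_chat_template(msgs, tokenize=False))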
This first snippet is the result of the tokenization phase. Here you can see how the various words and characters are broken up and how strings are translated into numbers. For example, INST becomes 16289, and Paris becomes 5465.
== Tokenizer ==
IDs : [1, 733, 16289, 28793, 6926, 349, 5465, 28804, 733, 28748, 16289, 28793]
Tokens: ['', ' [', 'INST', ']', ' Where', ' is', ' Paris', '?', ' [', '/', 'INST', ']']
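Reproducing this yourself takes two lines with the same tokenizer. One caveat: the raw SentencePiece tokens mark leading spaces with '▁', so the output will look slightly different from the cleaned-up transcript above.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
ids = tok("[INST] Where is Paris? [/INST]").input_ids
print(ids)                             # [1, 733, 16289, 28793, 6926, 349, 5465, ...]
print(tok.convert_ids_to_tokens(ids))  # ['<s>', '▁[', 'INST', ']', '▁Where', ...]

Note the ID 1 at the front: that is the special beginning-of-sequence token the model expects.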
This is what happened in the first “completion.” Based on its training, the LLM assigns a 43% probability that the next token should be “ Question” and a 32% chance that it should be “ Q.”
== Start autoregression ==
Text so far: [INST] Where is Paris? [/INST]
-- step 1 --
top-5: [(' Question', 0.4323), (' Q', 0.3232), (' #', 0.122), (' User', 0.0753), (' package', 0.0059)]
entropy: 2.296 bits
chosen (nucleus(top_p=0.9, temp=0.3)): ' Q'
With some randomness in the sampling step, it actually picks the second most likely option. After that, the string
[INST] Where is Paris? [/INST] Q
—with the “Q” appended—is fed back into the LLM. This is very likely a reflection of its training, where the model has learned to generate in a Q & A style format. In other words, it sees the presence of a question and then anticipates that the structure should continue with a label like “Q:” followed by an “A:”.
This caught me a little off guard. I really expected the probabilities to line up more with answers—something like France 60%, Texas 20%, The 10% (as in “The city of Paris is in France”). Instead, what I got was the model prioritizing the formatting of a Q & A exchange before actually providing content.
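If you want to peek at one of these distributions yourself, a single prediction step looks something like this. It is a sketch of roughly what my reporting program does, not its actual code, and it assumes a machine with enough memory for a 7B model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.2"   # assumption: same checkpoint as above
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

ids = tok("[INST] Where is Paris? [/INST]", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]         # scores for the next token only
probs = torch.softmax(logits.float(), dim=-1)

top = torch.topk(probs, 5)                    # the "top-5" line in the transcript
print([(tok.decode(int(i)), round(float(p), 4))
       for p, i in zip(top.values, top.indices)])

entropy = -(probs * torch.log2(probs.clamp_min(1e-12))).sum()  # the "entropy" line
print(f"entropy: {float(entropy):.3f} bits")

The entropy number, by the way, is just a measure of how spread out the distribution is: high when the model is torn between options, near zero when one token dominates.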
Naturally, in step 2, the probabilities shift and focus almost entirely on punctuation, with a colon being the overwhelmingly preferred choice—over 99% probability. You might wonder: what’s the deal with all the other symbols like stars and backticks showing up in the distribution? Those appear because they are often used as formatting symbols in Markdown—a lightweight markup language for indicating formatting such as bold, italics, or code snippets. Since so much training text on the internet contains Markdown or similar conventions, the model has learned to associate those symbols with how text is structured. All that said, this step itself is rather mundane.
-- step 2 --
top-5: [(':', 0.9931), (':**', 0.0014), ('.:', 0.0011), (':\\', 0.0008), (':`', 0.0005)]
entropy: 0.097 bits
chosen (nucleus(top_p=0.9, temp=0.3)): ':'
text so far: [INST] Where is Paris? [/INST] Q:
The process repeats, step after step, as the model keeps appending tokens. I’ll fast forward all the way out to step 86, where by then you essentially have the complete question and answer. Actually, I lie—the LLM doesn’t stop neatly at that point. It just keeps blabbering on, continuing to generate text beyond what you really asked for. In my case, I let it run out to 100 steps and then cut the transcript at the last complete sentence, since I didn’t want to leave a partial sentence hanging.
-- step 86 --
top-5: [('.', 0.8813), (',', 0.1165), (' among', 0.0022), (' amongst', 0.0), ('am', 0.0)]
entropy: 0.542 bits
chosen (nucleus(top_p=0.9, temp=0.3)): '.'
text so far: [INST] Where is Paris? [/INST] Q: Where is Paris?
A: Paris is the capital city of France, located in the northern part of the country in Europe. It is situated in the region of Île-de-France and on the Seine River. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe.
But that is really all an LLM does. It’s not “thinking,” it’s not “reasoning.” As I’ve been saying from the beginning, it is nothing more than autocompletion on autopilot.
Hallucinations
In generating another example, I could actually see a hallucination forming. There are clear indicators in the transcript as to how this actually happens: the little steps along the way that build into something larger. This time I’ll spare you the individual steps and instead just show you the final response so you can see it in one piece.
[INST] Answer concisely. The sun sets in the [/INST] Question: "The sun sets in the __."
Answer: The sun sets in the west. However, it's important to note that this is relative to an observer on the northern hemisphere. In the southern hemisphere, the sun sets in the east. The sun's actual position in the sky is not changing, but our perspective of it does based on our location on Earth.
Your first reaction might be to think that makes sense. Then, of course, you’ll say: wait a minute, the sun never sets in the east. So how did this happen? Well, after the correct sentence “The sun sets in the west,” the LLM simply felt compelled to keep generating.
Here are the token choices for the continuation:
top-5: [(' However', 0.3755), (' This', 0.3463), ('\n', 0.0952), (' It', 0.0656), (' While', 0.029)]
And it picked “However.” Had it instead picked “This” or “It” or even just a newline, the output might have stayed on track, maybe contributing additional facts or commentary still tied to the sun setting in the west. But the choice of “However” shows the model following a very common pattern of producing qualifying text. Remember, the LLM is always trying to generate text that sounds human, period. It is not checking facts, it is not evaluating truth—it is just continuing.
That choice of “However” leads directly into the next sentence: “However, it’s important to note that this is relative to an observer on the northern hemisphere.” And once you have that kind of qualification, it naturally sets up the question of what about the southern hemisphere? The model obligingly answers its own setup and generates: “In the southern hemisphere, the sun sets in the east.”
So now the answer reads: “The sun sets in the west. However, it’s important to note that this is relative to an observer on the northern hemisphere. In the southern hemisphere, the sun sets in the east.” At that point, the final sentence is generated as an explanation, tying the thought together as if it were a logical chain.
In a nutshell, the hallucination manifests simply because of the unfortunate choice of the word “However.” That single token creates a cascade of completions, which roll forward into a coherent-sounding but false statement.
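You can test this branch-point sensitivity yourself: force the other token and let the model carry on. A rough sketch, reusing tok and model from the earlier snippet (the continuation will vary run to run, since we are sampling):

# same prompt and answer prefix as the transcript, but with " This" forced
# in place of " However" at the branch point
prefix = ('[INST] Answer concisely. The sun sets in the [/INST] '
          'Question: "The sun sets in the __."\n'
          'Answer: The sun sets in the west. This')
ids = tok(prefix, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=40, do_sample=True,
                     temperature=0.3, top_p=0.9)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))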
Conclusion
This article illustrates how a transformer-based LLM actually works. Granted, the 7-billion parameter Mistral model is “dumb” by today’s standards, but I believe it is illustrative of the process, and for me it was quite illuminating to actually watch a hallucination form in real time.
Understanding the operation gives insights into two important things. First is the point of diminishing returns, which we are now starting to see. Making models bigger or training sets richer is only yielding marginal improvements in “intelligence.” You can throw more data and more parameters at the architecture, but the gains are already declining.
Second, this explains why, regardless of scale, an LLM will likely never be capable of consistent long-form writing. It’s a simple numbers game. If 99.9% of the time the model picks the right next token, then after 1000 tokens the chance that the entire sequence is “right” is only about 36.8%. Stretch it to 2000 tokens and it drops to about 13.5%. This exponential decay is just the math—a conspiracy of combinatorics. Even with very high probabilities on each individual step, over a long enough sequence the system inevitably runs into a wall.
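The arithmetic is trivial to check:

print(0.999 ** 1000)   # 0.3677..., about 36.8%
print(0.999 ** 2000)   # 0.1352..., about 13.5%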
At the end of the day, an LLM is autocompletion on autopilot. Its goal is not knowledge—its goal is language.
