1

Raw Input

Input Processing

Every interaction with an LLM begins here. The user types a message — a string of characters. At this stage, the system has not yet processed the content of the words.

The raw input layer captures the text exactly as entered and wraps it in a structured object with metadata: a timestamp and a session identifier.
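That wrapper can be sketched as a small dataclass; the class and field names here (`RawInput`, `session_id`) are illustrative, not any particular framework's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4

@dataclass
class RawInput:
    """The user's text, untouched, plus request metadata."""
    text: str
    # Metadata is attached at capture time, before any processing of the words.
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    session_id: str = field(default_factory=lambda: uuid4().hex)

msg = RawInput("Summarize the acquisition draft.")
```

Everything downstream operates on this structured object rather than the bare string.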

Watch how a simple sentence becomes structured data:

2

Tokenization

Encode

An LLM cannot read text. It works with numbers. The tokenizer splits the input string into tokens — subword units that may be whole words, parts of words, or single characters — and maps each one to an integer ID from a vocabulary of 100,277 entries.

This is a deterministic, reversible transformation — converting text into a sequence of integers that the model’s mathematical operations can work with.
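A minimal sketch of the idea, using a tiny hand-made vocabulary in place of a real 100,277-entry one. Production tokenizers use byte-pair encoding with learned merge rules; the greedy matcher below only illustrates the deterministic, reversible mapping:

```python
# Toy vocabulary of subword pieces mapped to integer IDs.
VOCAB = {"sum": 0, "mar": 1, "ize": 2, " the": 3, " acq": 4,
         "uis": 5, "ition": 6, " draft": 7, ".": 8}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(text: str) -> list[int]:
    """Greedy longest-match segmentation: the same text always yields the same IDs."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest piece first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return ids

def decode(ids: list[int]) -> str:
    """The reversible half: concatenating the pieces restores the exact text."""
    return "".join(ID_TO_TOKEN[i] for i in ids)

ids = encode("summarize the acquisition draft.")
```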

Watch the text get sliced into tokens:

3

Embedding Layer

Vector Representation

Now each token ID is converted into a dense vector — a list of 1,536 floating-point numbers. These vectors live in a high-dimensional space where meaning is encoded as geometry: words with similar meanings end up near each other.

The model represents meaning geometrically: the vector for “acquisition” is close to “investment” and far from “Draft”. In this space, meaning is distance.
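The lookup itself is just indexing into a table. The sketch below uses a random, untrained table, so its geometry carries no meaning; only the mechanics match a real embedding layer:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 1536           # small vocab slice; dim matches the text
# In a trained model this table is learned; random values here only
# demonstrate the lookup from token ID to dense vector.
embedding_table = rng.standard_normal((vocab_size, dim)).astype(np.float32)

token_ids = [0, 1, 2]                  # output of the tokenization layer
vectors = embedding_table[token_ids]   # one 1,536-dimensional vector per token
```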

Watch integers become geometry:

4

Semantic Search

Vector Similarity

Before generating a response, the system searches its memory. The query embedding from Layer 3 is compared against a knowledge base of stored documents using cosine similarity — a measure of how closely two vectors point in the same direction.

The result is a ranked list of the most relevant context. Relevance is expressed as a number between 0 and 1 — a high similarity score means the document’s vector is geometrically close to the query vector.
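Cosine similarity and the ranking step fit in a few lines; the document names and 3-dimensional vectors below are made up for illustration (real embeddings have 1,536 dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """How closely two vectors point in the same direction (1.0 = identical)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical knowledge base: one pre-computed embedding per document.
docs = {
    "q3-acquisition-memo": np.array([0.9, 0.1, 0.0]),
    "holiday-party-plan":  np.array([0.0, 0.2, 0.9]),
    "investment-summary":  np.array([0.8, 0.3, 0.1]),
}
query = np.array([1.0, 0.2, 0.0])      # the query embedding from Layer 3

# Rank every document by similarity to the query, highest first.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]),
                reverse=True)
```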

Watch the search unfold:

5

Context Injection

Model Context Protocol

The retrieved documents from Layer 4 are now injected into the prompt. The Model Context Protocol (MCP) assembles a structured context object that combines the user’s original input with memory, system state, and retrieved knowledge.

This is how the model gets its “memory” — not by remembering, but by having relevant information physically inserted into the prompt text. The model sees everything at once, as if it always knew.
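A minimal sketch of the assembly step. The section labels and function name are hypothetical, not a protocol wire format; the point is only that retrieval, history, and the user's message end up in one flat prompt string:

```python
def build_context(user_input: str, retrieved: list[str], history: list[str]) -> str:
    """Assemble one prompt string from retrieved knowledge, prior turns,
    and the user's original message."""
    parts = ["## Retrieved knowledge"]
    parts += [f"- {doc}" for doc in retrieved]
    parts += ["## Conversation so far"] + history
    parts += ["## User message", user_input]
    return "\n".join(parts)

prompt = build_context(
    "Summarize the acquisition draft.",
    retrieved=["Q3 acquisition memo: deal terms, valuation, timeline"],
    history=["User: hello", "Assistant: hi, how can I help?"],
)
```

The model receives `prompt` as ordinary input text; nothing marks the retrieved lines as special.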

Watch the context assembly:

6

Task Decomposition

Orchestration

The enriched prompt doesn’t go straight to the model. An orchestration framework first breaks it into a sequence of sub-tasks — a chain of smaller, focused operations that execute one after another.

A SequentialChain defines which steps run in which order. Each step receives the output of the previous one. The model operates as one component in a larger orchestration pipeline.
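A SequentialChain can be sketched as a loop over callables, each consuming the previous step's output; the three steps here are placeholders, not real pipeline stages:

```python
from typing import Callable

class SequentialChain:
    """Run steps in order; each step receives the output of the previous one."""
    def __init__(self, steps: list[Callable[[str], str]]):
        self.steps = steps

    def run(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text

chain = SequentialChain([
    lambda t: t.strip(),                     # 1. normalize the input
    lambda t: f"[RETRIEVED CONTEXT]\n{t}",   # 2. enrich (stubbed retrieval)
    lambda t: t.upper(),                     # 3. model call (stubbed)
])
result = chain.run("  summarize the draft  ")
```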

Watch the chain execute:

7

Attention Mechanism

Multi-Head Self-Attention

This is the computational heart of the transformer. Every token computes how much it should “attend to” every other token. The result is an attention matrix — a grid of weights that captures relationships and dependencies between words.

A GPT-3-scale model runs this operation across 96 attention heads in parallel in each of its 96 layers, yielding over 9,000 different “perspectives” on the same input. Each head learns to focus on different patterns: syntax, semantics, position, or something humans can’t name.
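One head of scaled dot-product self-attention, the operation described above, with random weights standing in for learned ones:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for a single head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq, seq): every token vs. every token
    weights = softmax(scores, axis=-1)        # the attention matrix; rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.standard_normal((seq_len, d_model))            # token embeddings
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

A multi-head layer runs many such heads side by side and concatenates their outputs.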

Watch one attention head at work:

8

Next-Token Prediction

Probability Sampling

The transformer outputs a probability distribution over 100,277 possible tokens. It does not “choose” a word — it assigns a probability to every token in its vocabulary, then samples from that distribution.

Temperature controls how “creative” the sampling is: lower temperature concentrates probability on the top choices, higher temperature flattens the distribution. Top-p filtering removes unlikely tokens from consideration. Generating text amounts to rolling weighted dice, one token at a time.
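Temperature scaling and top-p (nucleus) filtering can be sketched as follows; the four-token logit vector is made up for illustration:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    # Temperature: divide logits before the softmax; low T sharpens the
    # distribution toward the top choice, high T flattens it.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Top-p: keep only the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize over that set.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    kept_probs = probs[keep] / probs[keep].sum()
    # Roll the weighted dice.
    return int(rng.choice(keep, p=kept_probs))

logits = np.array([4.0, 2.0, 1.0, -1.0])
token = sample_next_token(logits, temperature=0.7, top_p=0.9,
                          rng=np.random.default_rng(0))
```

At temperature 0.7 the top token holds over 90% of the mass, so the nucleus collapses to that single choice.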

Watch the dice roll:

9

Detokenization

Decode

The transformer has produced a sequence of 523 token IDs. These are just integers — meaningless to a human. The tokenizer now runs in reverse, mapping each integer back to its corresponding text fragment.

The result is raw, unformatted text: no capitalization, no structure, no markdown — a continuous stream of characters. This is the inverse of Layer 2: a deterministic conversion from numbers back to text.
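The decode step is a straight table lookup plus concatenation, using a toy inverse vocabulary here in place of a real one:

```python
# Toy inverse vocabulary (a real decoder maps ~100k IDs back to text fragments).
ID_TO_TOKEN = {0: "sum", 1: "mar", 2: "ize", 3: " the", 4: " acq",
               5: "uis", 6: "ition", 7: " draft", 8: "."}

def decode(ids: list[int]) -> str:
    """Concatenate each token's text fragment. No spacing logic is needed:
    leading spaces are baked into the fragments themselves."""
    return "".join(ID_TO_TOKEN[i] for i in ids)

text = decode([0, 1, 2, 3, 4, 5, 6, 7, 8])
```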

Watch the decode:

10

Post-Processing

Formatting & Presentation

The raw text from Layer 9 is unformatted chaos: no capitalization, no structure, no visual hierarchy. It’s a continuous stream of lowercase characters that no user would want to read.

A formatting pipeline now applies a sequence of rules — capitalizing proper nouns, adding markdown headers and bold text, inserting bullet points, and ensuring consistent tone — transforming the raw stream into structured, readable output.
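The pipeline can be sketched as an ordered list of rules applied in sequence; the rules themselves (sentence capitalization, a tiny proper-noun dictionary) are illustrative stand-ins for a real formatting stage:

```python
import re

RULES = [
    lambda t: t.strip(),
    # Capitalize the first letter of every sentence.
    lambda t: re.sub(r"(^|[.!?]\s+)([a-z])",
                     lambda m: m.group(1) + m.group(2).upper(), t),
    # Fix known proper nouns and acronyms (hypothetical dictionary of two).
    lambda t: t.replace("acme", "Acme"),
    lambda t: t.replace("ceo", "CEO"),
]

def format_output(raw: str) -> str:
    """Apply each formatting rule to the raw decoded stream, in order."""
    for rule in RULES:
        raw = rule(raw)
    return raw

polished = format_output("  the ceo of acme signed. the deal closed.  ")
```

Because the rules run in a fixed order, later rules see the output of earlier ones, exactly like the chains in Layer 6.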

Watch the text get polished: