Decoder-only Models

Decoder-only models consist solely of the transformer decoder component of the original encoder-decoder architecture (see Fig. 1). The decoder receives input data (prompts) and generates coherent, context-aware output. Unlike encoder-only models, which leverage bidirectional context, decoder-only models are autoregressive: they predict the next token based on the previously generated tokens. This makes them particularly well-suited for text generation tasks.

Fig. 1: Decoder-only model architecture [1].

The generative pre-trained transformer (GPT) family (e.g., GPT-2, GPT-3, and ChatGPT) [2], the pathways language model (PaLM) [3], and large language model Meta AI (LLaMA) [4] are examples of decoder-only models, commonly used for text generation tasks such as creative writing and conversational agents.

As shown in Fig. 1, similar to encoder-only models, the inputs in a decoder-only model are processed through a series of sequential components, including input embedding, positional encoding, and decoder blocks/layers (denoted as Nx in the figure). In the remainder of this chapter, we review each layer and its underlying mechanisms in detail.

Input Embedding

The inputs to a decoder-only model are textual data, represented as sequences of text units such as words, subwords, or characters. However, the decoder block processes numerical representations rather than raw text. To bridge this gap, the input text first undergoes tokenization: it is divided into text units, commonly referred to as tokens, and each token is mapped, via a predefined vocabulary, to a unique identifier known as a token ID. The input embedding layer then converts each token ID into a dense vector whose length equals a predefined embedding dimension.
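To make this concrete, the sketch below (assuming PyTorch and a small, hypothetical character-level vocabulary) shows how raw text can be tokenized into token IDs and then mapped to embedding vectors:

```python
import torch
import torch.nn as nn

# A small, hypothetical character-level vocabulary for illustration.
text = "hello world"
vocab = sorted(set(text))                      # unique characters serve as tokens
stoi = {ch: i for i, ch in enumerate(vocab)}   # token -> token ID

# Tokenization: split the text into tokens and map each token to its ID.
token_ids = torch.tensor([stoi[ch] for ch in text])  # shape: (sequence_length,)

# Input embedding: map each token ID to a dense vector of a predefined dimension.
embed_dim = 16
token_embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_dim)
embeddings = token_embed(token_ids)            # shape: (sequence_length, embed_dim)
print(embeddings.shape)                        # torch.Size([11, 16])
```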

Positional Encoding

Positional encoding is a mechanism whereby information about the position of a token is injected into the input data. Hence, the model can learn the meaning and importance of the corresponding token with respect to its position in the input (subject, object, verb, adjective, etc.). The traditional (sinusoidal) positional encoding is calculated using the sine and cosine equations introduced in [1], shown below.
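For reference, the sinusoidal positional encoding from [1] is defined as follows, where pos is the token position, i indexes the embedding dimensions, and d_model is the embedding dimension:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$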

Decoder Layer

The decoder layer (shown as Nx in Fig. 1) consists of two sub-layers: (1) a masked multi-head self-attention mechanism applied to the decoder's inputs, and (2) a fully connected feed-forward network. In the original encoder-decoder transformer [1], the decoder contains a third sub-layer that performs multi-head cross-attention over the encoder's outputs; decoder-only models omit this sub-layer because there is no encoder. Each sub-layer includes a residual connection and is followed by a normalization layer [1].

Multi-Head Self-Attention Mechanism

The self-attention mechanism is a core component of transformers, enabling the model to learn and capture the relationships between tokens within a sequence. This mechanism is implemented within the attention heads of the transformer architecture [1].

Fig. 2: Architecture of an attention head [1].

To model the relationships between tokens, the input is first projected into three distinct representations: the query (Q), key (K), and value (V) vectors. Figure 2 illustrates the architecture of an attention head within the model. When an embedded and positionally encoded input passes through the attention head, it undergoes the following six processing steps [1] (a minimal code sketch follows the list):

  • Step 1: the input is passed through three separate linear layers to produce the query (Q), key (K), and value (V) matrices; the weights of these layers are learned during training.

  • Step 2: the alignment scores are calculated by multiplying the query matrix with the transpose of the key matrix.

  • Step 3: the alignment scores are scaled by 1/√dk, where dk is the dimension of the key (and query) vectors.

  • Step 4: a causal masking function is applied to prevent the model from attending to future tokens.

  • Step 5: a softmax operation is applied to the scaled, masked scores to obtain the attention weights.

  • Step 6: the attention weights are multiplied by the value matrix to produce the head's output.
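The six steps above can be summarized in the following minimal sketch (assuming PyTorch; the class name and dimensions are illustrative, not taken from a specific implementation):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalAttentionHead(nn.Module):
    """A single attention head implementing the six steps described above."""
    def __init__(self, embed_dim, head_dim, max_len=128):
        super().__init__()
        # Step 1: linear layers whose weights learn the Q, K, V projections.
        self.q_proj = nn.Linear(embed_dim, head_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, head_dim, bias=False)
        self.v_proj = nn.Linear(embed_dim, head_dim, bias=False)
        # Lower-triangular causal mask, stored as a (non-trainable) buffer.
        self.register_buffer("mask", torch.tril(torch.ones(max_len, max_len)))

    def forward(self, x):                      # x: (batch, seq_len, embed_dim)
        B, T, _ = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Step 2: alignment scores = Q @ K^T.
        scores = q @ k.transpose(-2, -1)
        # Step 3: scale by 1/sqrt(d_k).
        scores = scores / math.sqrt(k.size(-1))
        # Step 4: causal masking blocks attention to future positions.
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        # Step 5: softmax turns the scores into attention weights.
        weights = F.softmax(scores, dim=-1)
        # Step 6: weighted sum of the values.
        return weights @ v                     # (batch, seq_len, head_dim)

# Usage on a random embedded-and-encoded input.
head = CausalAttentionHead(embed_dim=16, head_dim=8)
out = head(torch.randn(2, 10, 16))             # -> shape (2, 10, 8)
```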

The multi-head self-attention mechanism consists of multiple parallel attention heads, each learning distinct representation subspaces. The outputs from all attention heads are concatenated and then projected through a linear layer to produce the final combined representation [1].
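The multi-head combination can be sketched as follows: several heads run in parallel, their outputs are concatenated, and a final linear layer projects the result. This sketch uses PyTorch's F.scaled_dot_product_attention (available in PyTorch 2.x) to fuse the score, mask, softmax, and value steps inside each head; dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Parallel causal attention heads, concatenated and projected."""
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        head_dim = embed_dim // num_heads
        # One set of Q/K/V projections per head.
        self.q = nn.ModuleList([nn.Linear(embed_dim, head_dim, bias=False) for _ in range(num_heads)])
        self.k = nn.ModuleList([nn.Linear(embed_dim, head_dim, bias=False) for _ in range(num_heads)])
        self.v = nn.ModuleList([nn.Linear(embed_dim, head_dim, bias=False) for _ in range(num_heads)])
        self.out_proj = nn.Linear(embed_dim, embed_dim)     # final combining projection

    def forward(self, x):                                   # x: (batch, seq_len, embed_dim)
        heads = [
            F.scaled_dot_product_attention(q(x), k(x), v(x), is_causal=True)
            for q, k, v in zip(self.q, self.k, self.v)
        ]
        return self.out_proj(torch.cat(heads, dim=-1))      # concatenate heads, then project

mha = MultiHeadSelfAttention(embed_dim=16, num_heads=4)
out = mha(torch.randn(2, 10, 16))                           # -> shape (2, 10, 16)
```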

Add & Norm

Residual Connections (Add): Preserving information from earlier layers helps mitigate the vanishing gradient problem. In transformer blocks, this is achieved through residual connections, where the original input of a sub-layer is added to its output. This mechanism enables the network to retain essential information across layers and facilitates more effective gradient flow during training [1].

Layer Normalization (Norm): During training, the model may experience internal covariate shift, where the distribution of activations changes across layers, potentially leading to vanishing or exploding gradients. To address this challenge, layer normalization is applied to the output of each sub-layer, which stabilizes training by reducing distributional shifts and ensures more consistent gradient flow [1].
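The add & norm pattern around a sub-layer can be sketched as follows (assuming PyTorch; the post-norm placement mirrors [1], and a plain linear layer stands in for the attention or feed-forward sub-layer):

```python
import torch
import torch.nn as nn

embed_dim = 16
norm = nn.LayerNorm(embed_dim)
sublayer = nn.Linear(embed_dim, embed_dim)    # stand-in for attention or feed-forward

x = torch.randn(2, 10, embed_dim)
# Add: the residual connection preserves the sub-layer's input.
# Norm: layer normalization stabilizes the resulting activations.
y = norm(x + sublayer(x))                     # shape unchanged: (2, 10, 16)
```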

Feed-Forward Network

The feed-forward sub-layer introduces non-linearities into the decoder, thereby enhancing the model's capacity to capture complex patterns and non-linear relationships within sequential data. The feed-forward network consists of two linear transformations separated by an activation function, typically the rectified linear unit (ReLU) or the Gaussian error linear unit (GELU) [1].
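A minimal position-wise feed-forward sub-layer might look like this (the 4x expansion of the hidden size follows [1] but is a configurable choice):

```python
import torch
import torch.nn as nn

embed_dim = 16
ffn = nn.Sequential(
    nn.Linear(embed_dim, 4 * embed_dim),   # expand the hidden size
    nn.GELU(),                             # non-linearity (ReLU is also common)
    nn.Linear(4 * embed_dim, embed_dim),   # project back to the embedding dimension
)

out = ffn(torch.randn(2, 10, embed_dim))   # applied independently at every position
```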

Implementation

In this section, we use a small input text to develop a minimal decoder-only model, called TinyGPT. The goal is to gain familiarity with the operation of decoder-only models. The overall network architecture is shown below, and the complete implementation script is available on Decoder-only.

As defined in the __init__ function, the model network includes the following layers (a minimal sketch of the constructor and forward pass follows the list):

  • self.token_embed: it creates an embedding layer that converts token indices (integers) into dense vector representations (embeddings).

  • self.pos_embed: this layer creates the positional encodings of the input tokens.

  • self.blocks: it includes the decoder blocks (layers), each with a multi-head self-attention mechanism, a feed-forward network, and the corresponding add & norm components.

  • self.ln_f: this layer defines the final layer normalization applied to the transformer's output before passing it into the language modeling head.

  • self.lm_head: it defines the language modeling head, i.e., the final layer that maps the hidden representations produced by the transformer into predicted token probabilities.
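The complete implementation is in the linked script; the sketch below is one plausible reconstruction (assuming PyTorch) that follows the attribute names listed above. The hyperparameter values and the internals of the Block class are illustrative and may differ from the actual TinyGPT code.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One decoder block: causal multi-head self-attention + feed-forward, each with add & norm."""
    def __init__(self, embed_dim, num_heads, max_len):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim), nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)
        # Lower-triangular causal mask registered as a non-trainable buffer.
        self.register_buffer("mask", torch.tril(torch.ones(max_len, max_len)))

    def forward(self, x):
        T = x.size(1)
        causal = self.mask[:T, :T] == 0             # True where attention must be blocked
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)
        x = self.ln1(x + attn_out)                  # add & norm around attention
        x = self.ln2(x + self.ffn(x))               # add & norm around feed-forward
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, num_heads=4, num_layers=2, max_len=128):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)   # token IDs -> embeddings
        self.pos_embed = nn.Embedding(max_len, embed_dim)        # positional embeddings
        self.blocks = nn.ModuleList(
            [Block(embed_dim, num_heads, max_len) for _ in range(num_layers)]
        )
        self.ln_f = nn.LayerNorm(embed_dim)                      # final layer normalization
        self.lm_head = nn.Linear(embed_dim, vocab_size)          # hidden states -> vocabulary logits

    def forward(self, idx):                                      # idx: (batch, seq_len) token IDs
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.token_embed(idx) + self.pos_embed(pos)          # token embeddings + positions
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        return self.lm_head(x)                                   # logits: (batch, seq_len, vocab_size)

model = TinyGPT(vocab_size=50)
logits = model(torch.randint(0, 50, (2, 10)))                    # -> shape (2, 10, 50)
```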

When an input passes through the forward function, it is first converted into token embeddings. Positional embeddings are then added to incorporate information about token order. The resulting representations are sequentially passed through all the transformer blocks defined in the model. Within the attention mechanism, a lower-triangular mask, stored via self.register_buffer, prevents the model from attending to future tokens during processing.

The output of the final block is normalized using the last layer normalization and then fed into the language modeling head. At this stage, each token is represented by a hidden vector of size equal to the embedding dimension, and the final layer predicts the probability distribution over the vocabulary for the next token.

During training, the model compares these predicted logits with the true token IDs using cross-entropy loss, and updates its weights through backpropagation to minimize this loss.
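Continuing the hypothetical TinyGPT sketch above, a single training step might look like this, with the next-token targets obtained by shifting the input sequence left by one position (the batch shown is random and purely illustrative):

```python
import torch
import torch.nn.functional as F

# Assumes the TinyGPT class from the sketch above.
model = TinyGPT(vocab_size=50)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

tokens = torch.randint(0, 50, (2, 11))         # dummy batch of token IDs
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)                          # (batch, seq_len, vocab_size)
# Cross-entropy expects (N, C) logits and (N,) class indices,
# so the batch and sequence dimensions are flattened.
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

loss.backward()                                 # backpropagation
optimizer.step()                                # weight update to minimize the loss
optimizer.zero_grad()
```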

Evaluation

For evaluation, we provide the model with a starting prompt and ask it to generate the remaining text. Considering the small size of the training data, the generated outputs are reasonably accurate.

Also, we compute the cross-entropy loss, perplexity, accuracy, bits per character (BPC), and distinct scores (a short computation sketch follows the definitions below).

  • Cross-entropy loss: this metric measures the difference between the predicted probability distribution of a model and the true distribution of the target data. It quantifies how well the model's predicted probabilities match the actual outcomes, with lower values indicating better predictions and less uncertainty.
  • Perplexity: it measures the model's uncertainty; lower perplexity implies that the model assigns higher probability to the actual next word in the sequence, resulting in a more confident and accurate model.
  • Accuracy: this metric is defined as the proportion of correct predictions over the ground-truth data. However, since the tokens here are individual characters, accuracy is not a reliable metric in the current evaluation.
  • Bits per character: BPC measures the average number of bits a model needs to encode or predict each character in the text. Lower BPC indicates better predictive performance and less uncertainty, and it is equivalent to cross-entropy expressed in bits rather than nats.
  • Distinct scores: these scores quantify diversity in generated text. To compute them, all n-grams of length n in the text are first extracted. The metric then calculates the proportion of unique n-grams relative to the total number of n-grams. A higher score indicates greater diversity, meaning the text is less repetitive.
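Given the average cross-entropy loss (in nats) and the generated tokens, the derived metrics can be computed as follows; the loss value and sample text are illustrative only:

```python
import math

def distinct_n(tokens, n):
    """Proportion of unique n-grams among all n-grams in a token sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

ce_loss = 0.0399                       # average cross-entropy in nats (illustrative)
perplexity = math.exp(ce_loss)         # ~1.04: exponentiated cross-entropy
bpc = ce_loss / math.log(2)            # ~0.058: cross-entropy expressed in bits

generated = list("to be or not to be") # character tokens, matching TinyGPT's tokenization
print(perplexity, bpc, distinct_n(generated, 1), distinct_n(generated, 2))
```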

Table I. Performance evaluation of TinyGPT

Cross-Entropy Loss | Perplexity | Accuracy (%) | BPC (bits) | Distinct-1 | Distinct-2
0.0399             | 1.04       | 98.48        | 0.058      | 0.113      | 0.432

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017, https://arxiv.org/abs/1706.03762.

[2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, et al., “Language models are few-shot learners,” 2020, https://arxiv.org/abs/2005.14165.

[3] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., “PaLM: Scaling language modeling with pathways,” 2022, https://arxiv.org/abs/2204.02311.

[4] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and efficient foundation language models,” 2023, https://arxiv.org/abs/2302.13971.