Section 06

The Code: causal language model and classification fine-tuning

Improving Language Understanding by Generative Pre-Training 2018


🟡 Intermediate. Runs free on Google Colab. No GPU required for this demo.


Part A: Causal language modelling (the pre-training objective)

import numpy as np

# --- Tiny vocabulary and corpus ---
vocab = ["chai", "bahut", "garam", "hai", "acha", "[START]", "[END]"]
word_to_id = {w: i for i, w in enumerate(vocab)}  # map each word to an integer
id_to_word = {i: w for w, i in word_to_id.items()}

# Three short "sentences" in our tiny corpus (token ID sequences)
corpus = [
    [5, 0, 1, 2, 3, 6],   # [START] chai bahut garam hai [END]
    [5, 0, 4, 6],          # [START] chai acha [END]
    [5, 0, 1, 4, 3, 6],   # [START] chai bahut acha hai [END]
]

# --- Toy causal language model: predict the next token from the previous token only (bigram) ---
def build_ngram_lm(corpus, vocab_size):
    """Build a smoothed bigram LM: P(word_t | word_t-1)."""
    # Count how often word j follows word i
    counts = np.ones((vocab_size, vocab_size))  # Laplace smoothing: start with 1
    for sentence in corpus:
        for i in range(len(sentence) - 1):
            prev, curr = sentence[i], sentence[i + 1]  # bigram (prev → curr)
            counts[prev, curr] += 1                     # increment count
    # Convert counts to probabilities (each row must sum to 1)
    probs = counts / counts.sum(axis=1, keepdims=True)  # divide by row totals
    return probs

lm = build_ngram_lm(corpus, len(vocab))  # shape: (vocab_size, vocab_size)

# --- Generate text from the model ---
def generate(lm, start_token, max_len=6):
    """Sample from the LM one token at a time (autoregressive generation)."""
    tokens = [start_token]
    for _ in range(max_len - 1):
        last = tokens[-1]                        # most recent token
        probs = lm[last]                         # P(next | last)
        next_token = int(np.random.choice(len(probs), p=probs))  # sample a token ID from P(next | last)
        tokens.append(next_token)
        if next_token == word_to_id["[END]"]:    # stop at sentence end
            break
    return [id_to_word[t] for t in tokens]

np.random.seed(42)
print("Generated:", generate(lm, word_to_id["[START]"]))
# Example output (the exact tokens depend on the random draw), something like:
# ['[START]', 'chai', 'bahut', 'garam', 'hai', '[END]']

This bigram LM is a toy version of GPT-1’s objective. It conditions on only the single previous token, whereas GPT-1 conditions on up to 512 previous tokens, using the Transformer’s masked self-attention to capture long-range dependencies. The core idea is the same: predict a distribution over the next token from the preceding context, then sample from it.
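
To make "conditioning on all previous tokens" concrete, here is a minimal sketch of the causal mask used in masked self-attention. It is not GPT-1's implementation: the random scores stand in for query-key dot products, and the function name causal_attention_weights is made up for this demo.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask out future positions, then softmax each row.

    scores: (seq_len, seq_len) array; scores[t, s] is the raw score
    for position t attending to position s.
    """
    seq_len = scores.shape[0]
    # Lower-triangular mask: position t may look at positions 0..t only
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    masked = np.where(mask, scores, -np.inf)              # future positions get -inf
    masked = masked - masked.max(axis=1, keepdims=True)   # stabilise the softmax
    weights = np.exp(masked)                              # exp(-inf) = 0 for future positions
    return weights / weights.sum(axis=1, keepdims=True)   # each row sums to 1

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))   # assumed stand-in for query-key dot products
weights = causal_attention_weights(scores)
print(np.round(weights, 2))        # upper triangle is all zeros
```

Row t of the result spreads its attention over positions 0..t and puts zero weight on anything later. The bigram model above is the extreme case where all of the weight sits on position t itself, the most recent token.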


Part B: Input transformation for classification

# --- Simulate the GPT-1 input transformation for sentiment classification ---

# Extend vocabulary with special tokens
special_tokens = ["[EXTRACT]", "[DELIM]"]
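for t in special_tokens:
    if t not in word_to_id:          # guard: re-running this cell must not reassign IDs
        new_id = len(word_to_id)     # next unused integer ID
        word_to_id[t] = new_id
        id_to_word[new_id] = t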

def encode_classification(text_tokens):
    """Transform a list of token IDs for classification fine-tuning.
    GPT-1 format: [START] + text + [EXTRACT]
    """
    start = word_to_id["[START]"]
    extract = word_to_id["[EXTRACT]"]
    return [start] + text_tokens + [extract]  # wrap with markers

def encode_entailment(premise_tokens, hyp_tokens):
    """Transform premise + hypothesis for entailment fine-tuning.
    GPT-1 format: [START] + premise + [DELIM] + hypothesis + [EXTRACT]
    """
    start   = word_to_id["[START]"]
    delim   = word_to_id["[DELIM]"]
    extract = word_to_id["[EXTRACT]"]
    return [start] + premise_tokens + [delim] + hyp_tokens + [extract]

# Example: "chai bahut garam hai" → positive sentiment
text = [word_to_id["chai"], word_to_id["bahut"],
        word_to_id["garam"], word_to_id["hai"]]
clf_input = encode_classification(text)
print("Classification input:", [id_to_word[t] for t in clf_input])
# → ['[START]', 'chai', 'bahut', 'garam', 'hai', '[EXTRACT]']

# Example: premise = "chai bahut garam hai", hypothesis = "chai acha hai"
premise = [word_to_id["chai"], word_to_id["bahut"],
           word_to_id["garam"], word_to_id["hai"]]
hypothesis = [word_to_id["chai"], word_to_id["acha"], word_to_id["hai"]]
nli_input = encode_entailment(premise, hypothesis)
print("Entailment input:", [id_to_word[t] for t in nli_input])
# → ['[START]', 'chai', 'bahut', 'garam', 'hai', '[DELIM]', 'chai', 'acha', 'hai', '[EXTRACT]']

Notice that the model receives a flat list of token IDs in both cases. There is no separate “premise encoder” or “hypothesis encoder”; the same transformer processes everything. The [DELIM] token marks where one segment ends and the next begins, and its embedding is learned during fine-tuning.
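
What happens to that flat list next? In GPT-1, the hidden state at the final position (the [EXTRACT] token) is passed through one added linear layer and a softmax. The sketch below illustrates only that last step. The fake_transformer lookup table is a hypothetical stand-in (a real transformer produces context-dependent states), and the sizes and random weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_classes = 9, 8, 2   # toy sizes chosen for illustration

# Hypothetical stand-in for the transformer: one fixed random vector per token ID.
# A real model would return context-dependent final-layer hidden states.
fake_hidden_table = rng.normal(size=(vocab_size, d_model))

def fake_transformer(token_ids):
    """Return one hidden vector per input position (stand-in, not a real model)."""
    return fake_hidden_table[np.array(token_ids)]

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# The added linear output layer (random here; learned during fine-tuning)
W_y = rng.normal(size=(d_model, n_classes))

clf_input = [5, 0, 1, 2, 3, 7]           # [START] chai bahut garam hai [EXTRACT]
hidden = fake_transformer(clf_input)     # shape: (6, d_model)
h_extract = hidden[-1]                   # hidden state at the [EXTRACT] position
class_probs = softmax(h_extract @ W_y)   # distribution over (positive, negative)
print("Class probabilities:", np.round(class_probs, 3))
```

Only the last position feeds the classifier. Because attention is causal, that position is the only one that can see the entire input, which is why [EXTRACT] goes at the end.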


Part C: The combined loss (pre-training + fine-tuning)

def cross_entropy_loss(probs, true_idx):
    """Compute cross-entropy loss for a single prediction.
    probs: probability distribution over classes (numpy array summing to 1)
    true_idx: index of the correct class
    """
    return -np.log(probs[true_idx] + 1e-9)  # add small constant for numerical stability

# Simulated output of classification head
P_positive = 0.72    # model assigns 72% to "positive"
P_negative = 0.28

# Task loss: true label is "positive"
L_task = cross_entropy_loss(np.array([P_positive, P_negative]), true_idx=0)
print(f"Task loss:     {L_task:.4f}")   # → 0.3285

# Language model loss: suppose average -log P(token|context) = 1.5 over this batch
L_lm = 1.5

# Combined loss (λ = 0.5)
lambda_lm = 0.5
L_total = L_task + lambda_lm * L_lm
print(f"LM loss:       {L_lm:.4f}")
print(f"Combined loss: {L_total:.4f}")  # → 0.3285 + 0.75 = 1.0785

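During backpropagation, gradients from both losses flow back through the same transformer weights. The task loss pushes the weights toward correct classification. The language modelling loss acts as an auxiliary objective: the GPT-1 paper reports that it improves generalisation and speeds up convergence, and it can be read as a regulariser that discourages the model from drifting away from its pre-trained knowledge.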
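
The LM loss above was a made-up number. As a rough sketch, the snippet below computes it from the Part A bigram model instead, as the average negative log-likelihood of one sequence, and then forms the combined loss. The 0.72 classifier probability is still simulated, and lm_loss is a helper name introduced here.

```python
import numpy as np

vocab_size = 7
corpus = [
    [5, 0, 1, 2, 3, 6],   # [START] chai bahut garam hai [END]
    [5, 0, 4, 6],         # [START] chai acha [END]
    [5, 0, 1, 4, 3, 6],   # [START] chai bahut acha hai [END]
]

# Rebuild the Laplace-smoothed bigram model from Part A
counts = np.ones((vocab_size, vocab_size))
for sentence in corpus:
    for prev, curr in zip(sentence[:-1], sentence[1:]):
        counts[prev, curr] += 1
lm = counts / counts.sum(axis=1, keepdims=True)

def lm_loss(lm, sequence):
    """Average -log P(token | previous token) over one sequence."""
    nll = [-np.log(lm[prev, curr])
           for prev, curr in zip(sequence[:-1], sequence[1:])]
    return float(np.mean(nll))

sequence = [5, 0, 1, 2, 3, 6]
L_lm = lm_loss(lm, sequence)       # LM loss measured on this sequence
L_task = float(-np.log(0.72))      # simulated: classifier gives 72% to the true label
lambda_lm = 0.5
L_total = L_task + lambda_lm * L_lm

print(f"Task loss:     {L_task:.4f}")
print(f"LM loss:       {L_lm:.4f}")
print(f"Combined loss: {L_total:.4f}")
```

In real fine-tuning, both terms are computed from the same forward pass over the labelled example, so one backward pass updates the shared weights for both objectives.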


What this code does not show

These snippets capture the conceptual structure of GPT-1. The real model differs in:

  • Scale: 12 layers, 768-dimensional hidden states, 12 attention heads, a BPE vocabulary of 40,478 tokens, trained on about 800M words
  • Attention: the transformer uses multi-head masked self-attention (Section 4) rather than a bigram
  • BPE tokenisation: words are split into subword units (e.g., “beautiful” might become “beau” + “tiful”, depending on the learned merges), allowing the model to handle rare words
  • Training infrastructure: the full model was trained on 8 GPUs for about a month
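
To give a feel for the BPE bullet above, here is a toy sketch of how merges are learned: repeatedly find the most frequent adjacent symbol pair and fuse it into one symbol. The word counts are invented for the demo, and the sketch leaves out the end-of-word markers and text normalisation that a real tokeniser needs.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols[:-1], symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(word_freqs, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])   # fuse the two symbols
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Made-up word counts; each word starts as a tuple of single characters
word_freqs = {tuple("garam"): 5, tuple("garmi"): 3, tuple("naram"): 2}

merges = []
for _ in range(3):                        # learn three merges
    pairs = get_pair_counts(word_freqs)
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    merges.append(best)
    word_freqs = merge_pair(word_freqs, best)

print("Learned merges:", merges)
print("Segmented words:", list(word_freqs))
```

Frequent character sequences end up as single tokens, while rarer words stay split into smaller pieces. GPT-1's vocabulary comes from running this kind of procedure for tens of thousands of merges.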

For a full runnable GPT-2 implementation (GPT-1’s successor, with a closely related architecture), see Andrej Karpathy’s minGPT, a compact PyTorch implementation whose core model is about 300 lines.