6. The code — a toy seq2seq in PyTorch
Building a full translation system requires large vocabularies and
massive datasets. To understand the pure mechanics, we’ll build a toy
seq2seq model that learns a much simpler task: reversing a sequence
of numbers. If we input [3, 1, 4], we want the decoder to output
[4, 1, 3].
This code shows exactly how the encoder passes its hidden state (the context vector) to the decoder.
Runs free on Google Colab.
import torch
import torch.nn as nn


class ToySeq2Seq(nn.Module):
    def __init__(self, input_size=1, hidden_size=16):
        super().__init__()
        # The encoder reads the input sequence
        self.encoder = nn.LSTM(input_size, hidden_size, batch_first=True)
        # The decoder generates the output sequence
        self.decoder = nn.LSTM(input_size, hidden_size, batch_first=True)
        # Linear layer to map hidden state back to a number prediction
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, source_seq, target_seq_for_teacher_forcing):
        # 1. ENCODER PASS
        # We don't care about encoder outputs, only the final (hidden, cell)
        _, (hidden, cell) = self.encoder(source_seq)

        # 2. CONTEXT-VECTOR HANDOFF
        # The encoder's final (hidden, cell) IS the context vector

        # 3. DECODER PASS (using teacher forcing)
        # We feed the true target sequence to speed up learning
        decoder_outputs, _ = self.decoder(
            target_seq_for_teacher_forcing, (hidden, cell)
        )

        # Map decoder hidden states down to number predictions
        predictions = self.fc(decoder_outputs)
        return predictions


# --- Forward-pass demo ---
model = ToySeq2Seq()

# Source sequence: [3.0, 1.0, 4.0] (batch=1, seq_len=3, features=1)
src = torch.tensor([[[3.0], [1.0], [4.0]]])

# Target input (shifted with a 0.0 start token): [0.0, 4.0, 1.0]
tgt = torch.tensor([[[0.0], [4.0], [1.0]]])

out = model(src, tgt)

# Untrained model → random-ish floats
print([round(p.item(), 2) for p in out.squeeze()])
# Example output: [-0.12, 0.04, 0.22]
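To make the handoff more concrete, it can help to look at what the encoder actually returns. The snippet below is a small sketch, not part of the model above: it builds a standalone encoder with the same sizes (`hidden_size=16`, one layer, both taken from the defaults used here) and prints the tensor shapes. The seed is only there to keep runs repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed, only for repeatable runs

# Standalone encoder with the same sizes as ToySeq2Seq's defaults
encoder = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)

# Same source sequence as the demo: (batch=1, seq_len=3, features=1)
src = torch.tensor([[[3.0], [1.0], [4.0]]])

outputs, (hidden, cell) = encoder(src)

# One 16-dim output vector per time step
print(outputs.shape)  # torch.Size([1, 3, 16])
# The context vector: (num_layers, batch, hidden_size) for each of hidden and cell
print(hidden.shape)   # torch.Size([1, 1, 16])
print(cell.shape)     # torch.Size([1, 1, 16])

# For a single-layer, one-directional LSTM, the final hidden state
# is the same tensor as the output at the last time step.
print(torch.allclose(outputs[:, -1, :], hidden[-1]))  # True
```

However long the source sequence is, `hidden` and `cell` keep the same shape. Everything the decoder learns about the input has to fit in those two fixed-size tensors.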
Things to notice in Colab
- self.decoder(...) takes (hidden, cell) from the encoder. That tuple is the physical manifestation of the context vector.
- We passed target_seq_for_teacher_forcing as the decoder’s input. That’s teacher forcing in action: feeding the true sequence during the forward pass instead of the decoder’s own predictions.
- To actually train this model to reverse numbers, you’d add an optimiser (torch.optim.Adam), a loss (nn.MSELoss), and a loop that runs the forward pass, computes the loss against the true reversed sequence, calls loss.backward(), and steps the optimiser. It’s exactly the same pattern as in Paper 03 (Backpropagation).
- Try changing hidden_size=16 to hidden_size=4. The model will likely struggle more, especially on longer sequences, because a smaller context vector has less room to hold the input. That’s the bottleneck we’ll discuss in Section 8, live.
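The training recipe described in the list above could be sketched roughly as follows. Treat it as one possible version under stated assumptions, not a tuned implementation: the helper names `make_batch` and `reverse_without_teacher_forcing`, the learning rate of 0.01, the 300 steps, the batch size of 64 and the digit range 1 to 9 are all choices made for this illustration. The last function shows the inference-time counterpart to teacher forcing, where the decoder is fed its own previous prediction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed, only for repeatable runs


class ToySeq2Seq(nn.Module):
    """Same architecture as the model defined earlier in this section."""

    def __init__(self, input_size=1, hidden_size=16):
        super().__init__()
        self.encoder = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, source_seq, target_seq_for_teacher_forcing):
        _, (hidden, cell) = self.encoder(source_seq)
        decoder_outputs, _ = self.decoder(
            target_seq_for_teacher_forcing, (hidden, cell)
        )
        return self.fc(decoder_outputs)


def make_batch(batch_size=64, seq_len=3):
    """Hypothetical helper: random digit sequences and their reversals."""
    src = torch.randint(1, 10, (batch_size, seq_len, 1)).float()
    tgt = torch.flip(src, dims=[1])        # reverse along the time axis
    start = torch.zeros(batch_size, 1, 1)  # 0.0 start token
    # Decoder input is the target shifted right by one step
    decoder_input = torch.cat([start, tgt[:, :-1, :]], dim=1)
    return src, decoder_input, tgt


@torch.no_grad()
def reverse_without_teacher_forcing(model, source_seq):
    """Hypothetical helper: feed each prediction back in as the next input."""
    _, state = model.encoder(source_seq)                # the context vector
    step_input = torch.zeros(source_seq.size(0), 1, 1)  # 0.0 start token
    steps = []
    for _ in range(source_seq.size(1)):
        decoder_out, state = model.decoder(step_input, state)
        step_input = model.fc(decoder_out)              # (batch, 1, 1)
        steps.append(step_input)
    return torch.cat(steps, dim=1)


model = ToySeq2Seq()
optimiser = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

losses = []
for step in range(300):
    src, decoder_input, tgt = make_batch()
    optimiser.zero_grad()
    predictions = model(src, decoder_input)  # teacher-forced forward pass
    loss = loss_fn(predictions, tgt)         # compare with the true reversal
    loss.backward()
    optimiser.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")

# Inference on the running example; how close this gets to [4, 1, 3]
# depends on how long you train.
test_src = torch.tensor([[[3.0], [1.0], [4.0]]])
decoded = reverse_without_teacher_forcing(model, test_src)
print([round(v, 2) for v in decoded.squeeze().tolist()])
```

A few hundred steps should be enough to see the loss drop, but not necessarily enough for clean reversals. Rerunning with `ToySeq2Seq(hidden_size=4)` gives a direct way to compare the two loss curves.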