Residual connection
Adding the sub-layer's input directly to its output: `x + SubLayer(x)`.
Adding the sub-layer's input directly to its output: x + SubLayer(x). Borrowed from ResNet (He et al., 2016). Allows gradients to flow backward through addition, enabling training of very deep networks. Without residuals, a 6-layer Transformer would be very hard to train.