TransformerLSTM
TransformerLSTM is a hybrid deep learning architecture that combines Long Short-Term Memory (LSTM) networks with Transformer attention mechanisms for time series forecasting. It was developed as part of the TransformerTF framework for modeling temporal dynamics in physics applications, particularly magnetic field transfer functions at CERN particle accelerators.
Architecture
TransformerLSTM follows a four-stage processing pipeline that builds upon traditional encoder-decoder architectures while incorporating modern attention mechanisms.
Core Components
The model consists of the following key components:
- Gated Residual Networks (GRNs) for input feature transformation
- LSTM Encoder-Decoder for sequential processing
- Custom Transformer Blocks for attention-based refinement
- Linear Output Projection for final predictions
Data Flow
Input Sequences
↓
┌─────────────────┐ ┌─────────────────┐
│ past_sequence │ │ future_sequence │
│ (B×Tp×Fp) │ │ (B×Tf×Ff) │
└─────────────────┘ └─────────────────┘
↓ ↓
┌─────────────────┐ ┌─────────────────┐
│ Encoder GRN │ │ Decoder GRN │
└─────────────────┘ └─────────────────┘
↓ ↓
┌─────────────────┐ ┌─────────────────┐
│ Encoder LSTM │ │ Decoder LSTM │
│ + Residual │ │ + Residual │
└─────────────────┘ └─────────────────┘
↓ ↓
┌─────────────────────────────────────────┐
│ Transformer Blocks │
│ ┌─────────────────────────────────┐ │
│ │ Encoder Self-Attention │ │
│ │ + GateAddNorm │ │
│ └─────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────┐ │
│ │ Cross-Attention │ │
│ │ Query: decoder_output │ │
│ │ K,V: encoder + decoder concat │ │
│ │ + GateAddNorm │ │
│ └─────────────────────────────────┘ │
│ (repeat for num_transformer_blocks) │
└─────────────────────────────────────────┘
↓
┌─────────────────┐
│ Linear Output │
│ (B×Tf×D_out) │
└─────────────────┘
Where:
- B = batch size
- Tp, Tf = past and future sequence lengths
- Fp, Ff = past and future feature dimensions
- D_out = output dimension
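The following PyTorch sketch illustrates this four-stage flow under simplified assumptions: plain MLPs stand in for the GRNs, standard nn.MultiheadAttention replaces the interpretable attention variant, and all class and parameter names (e.g. TransformerLSTMSketch) are hypothetical rather than taken from the transformertf code.

```python
import torch
import torch.nn as nn

class TransformerLSTMSketch(nn.Module):
    """Illustrative four-stage pipeline: GRN -> LSTM -> Transformer blocks -> linear head."""

    def __init__(self, num_past_features, num_future_features,
                 d_model=128, num_layers=1, num_transformer_blocks=2,
                 num_heads=4, output_dim=1):
        super().__init__()
        # Stage 1: input feature transformation (simple MLPs stand in for the GRNs)
        self.encoder_grn = nn.Sequential(nn.Linear(num_past_features, d_model), nn.ELU(),
                                         nn.Linear(d_model, d_model))
        self.decoder_grn = nn.Sequential(nn.Linear(num_future_features, d_model), nn.ELU(),
                                         nn.Linear(d_model, d_model))
        # Stage 2: LSTM encoder-decoder with residual connections
        self.encoder_lstm = nn.LSTM(d_model, d_model, num_layers, batch_first=True)
        self.decoder_lstm = nn.LSTM(d_model, d_model, num_layers, batch_first=True)
        # Stage 3: stacked attention blocks (standard multi-head attention as a stand-in)
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            for _ in range(num_transformer_blocks))
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            for _ in range(num_transformer_blocks))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model)
                                   for _ in range(2 * num_transformer_blocks))
        # Stage 4: linear output projection
        self.head = nn.Linear(d_model, output_dim)

    def forward(self, past_sequence, future_sequence):
        enc = self.encoder_grn(past_sequence)        # (B, Tp, d_model)
        dec = self.decoder_grn(future_sequence)      # (B, Tf, d_model)
        enc_out, state = self.encoder_lstm(enc)
        enc_out = enc_out + enc                      # residual around encoder LSTM
        dec_out, _ = self.decoder_lstm(dec, state)   # decoder initialized from encoder state
        dec_out = dec_out + dec                      # residual around decoder LSTM
        for i, (sa, ca) in enumerate(zip(self.self_attn, self.cross_attn)):
            # Encoder self-attention followed by add + norm
            enc_out = self.norms[2 * i](enc_out + sa(enc_out, enc_out, enc_out)[0])
            # Cross-attention: decoder queries the encoder + decoder concatenation
            memory = torch.cat([enc_out, dec_out], dim=1)
            dec_out = self.norms[2 * i + 1](dec_out + ca(dec_out, memory, memory)[0])
        return self.head(dec_out)                    # (B, Tf, output_dim)
```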
Technical Details
Input/Output Specification
Inputs:
- past_sequence: Historical time series data of shape (batch_size, past_seq_len, num_past_features)
- future_sequence: Known future inputs of shape (batch_size, future_seq_len, num_future_features)
- encoder_lengths, decoder_lengths: Optional sequence length tensors for variable-length sequence handling
Output:
- Predictions tensor of shape (batch_size, future_seq_len, output_dim)
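Continuing the hypothetical sketch above, a quick shape check illustrates this input/output contract (the dimensions are chosen arbitrarily for the example):

```python
import torch

# Hypothetical dimensions: batch of 8 windows, 96 past steps with 5 features,
# 24 future steps with 3 known covariates, scalar target.
model = TransformerLSTMSketch(num_past_features=5, num_future_features=3, output_dim=1)
past = torch.randn(8, 96, 5)    # (batch_size, past_seq_len, num_past_features)
future = torch.randn(8, 24, 3)  # (batch_size, future_seq_len, num_future_features)
out = model(past, future)
print(out.shape)                # torch.Size([8, 24, 1]) -> (batch_size, future_seq_len, output_dim)
```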
Transformer Block Architecture
The custom Transformer blocks differ significantly from standard Transformer layers. Each block processes encoder and decoder sequences separately:
- Encoder Self-Attention: The encoder output attends to itself using interpretable multi-head attention with padding masks
- Cross-Attention: The decoder queries the concatenation of both encoder and decoder outputs, providing full contextual access
- Gated Processing: Both attention mechanisms are followed by GateAddNorm layers that combine gating, residual connections, and layer normalization
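A compact sketch of one such block is shown below. It uses standard nn.MultiheadAttention in place of the interpretable multi-head attention and a GLU-based GateAddNorm; this approximates the described structure rather than reproducing the transformertf implementation.

```python
import torch
import torch.nn as nn

class GateAddNorm(nn.Module):
    """Gating (GLU) + residual connection + layer normalization."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, 2 * d_model)  # produces value and gate halves
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, residual):
        return self.norm(residual + nn.functional.glu(self.proj(x), dim=-1))

class TransformerBlockSketch(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.enc_gate = GateAddNorm(d_model)
        self.dec_gate = GateAddNorm(d_model)

    def forward(self, enc, dec, enc_padding_mask=None):
        # 1. Encoder self-attention with optional padding mask, then GateAddNorm
        attn, _ = self.self_attn(enc, enc, enc, key_padding_mask=enc_padding_mask)
        enc = self.enc_gate(attn, residual=enc)
        # 2. Cross-attention: decoder queries the concatenated encoder + decoder sequence
        memory = torch.cat([enc, dec], dim=1)
        attn, _ = self.cross_attn(dec, memory, memory)
        dec = self.dec_gate(attn, residual=dec)
        return enc, dec
```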
Key Hyperparameters
- d_model: Hidden dimension for LSTMs and attention layers (typically 128-512)
- num_layers: Number of LSTM layers in encoder and decoder (typically 1-3)
- num_transformer_blocks: Depth of Transformer processing (typically 1-4)
- num_heads: Number of attention heads (typically 4-16)
- causal_attention: Boolean controlling future information masking
- share_lstm_weights: Whether encoder and decoder LSTMs share parameters
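An illustrative configuration using these hyperparameters follows. The constructor signature and the feature-dimension arguments are assumptions made for the example; consult the transformertf API for the exact argument names.

```python
# Hypothetical instantiation; argument names mirror the hyperparameters above
# but may differ from the actual transformertf API.
model = TransformerLSTM(
    num_past_features=5,
    num_future_features=3,
    output_dim=1,
    d_model=256,                # hidden size shared by LSTMs and attention layers
    num_layers=2,               # LSTM depth in encoder and decoder
    num_transformer_blocks=2,   # number of stacked attention blocks
    num_heads=8,                # attention heads per block
    causal_attention=True,      # mask future positions during attention
    share_lstm_weights=False,   # keep separate encoder/decoder LSTM parameters
)
```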
Relationship to Existing Models
TransformerLSTM builds upon two established architectures in the TransformerTF framework:
Compared to AttentionLSTM
The AttentionLSTM model represents a simpler attention-augmented approach, using a single attention layer applied after LSTM processing. TransformerLSTM extends this concept by:
- Applying multiple attention layers through stacked Transformer blocks
- Incorporating both self-attention and cross-attention mechanisms
- Using more sophisticated gated processing with GateAddNorm layers
- Enabling iterative refinement of sequence representations
Compared to Temporal Fusion Transformer
The Temporal Fusion Transformer (TFT) is a more complex architecture designed specifically for multi-variate time series with static covariates. Key differences include:
- TFT uses variable selection networks for automatic feature selection
- TFT explicitly processes static (time-invariant) features
- TFT applies attention after variable selection and static enrichment
- TransformerLSTM focuses on the temporal modeling aspects without specialized feature selection
Design Rationale
Hybrid Architecture Benefits
The combination of LSTM and Transformer components leverages complementary strengths:
- LSTMs provide strong inductive bias for sequential temporal data and natural handling of variable-length sequences
- Transformers enable modeling of complex long-range dependencies and parallel processing of sequences
- Post-LSTM attention operates on context-aware hidden states rather than raw embeddings
Gating Mechanisms
Extensive use of gating throughout the architecture allows for:
- Controlled information flow through GatedResidualNetwork input processing
- Adaptive feature combination via GateAddNorm layers
- Non-linear transformations beyond simple linear projections
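For reference, a minimal GatedResidualNetwork in the spirit of the Temporal Fusion Transformer literature might look as follows; this is a simplified sketch, not the transformertf implementation.

```python
import torch
import torch.nn as nn

class GatedResidualNetwork(nn.Module):
    """Simplified GRN: two-layer transform, GLU gate, residual connection, layer norm."""
    def __init__(self, input_dim, d_model):
        super().__init__()
        # Project the residual path if input and hidden dimensions differ
        self.skip = nn.Linear(input_dim, d_model) if input_dim != d_model else nn.Identity()
        self.fc1 = nn.Linear(input_dim, d_model)
        self.fc2 = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, 2 * d_model)   # GLU gate controls information flow
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        residual = self.skip(x)
        h = self.fc2(nn.functional.elu(self.fc1(x)))
        gated = nn.functional.glu(self.gate(h), dim=-1)
        return self.norm(residual + gated)
```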
Implementation Details
TransformerLSTM is implemented as a PyTorch Lightning module within the TransformerTF framework. Key implementation features include:
- Efficient variable-length sequence handling using torch.nn.utils.rnn.pack_padded_sequence
- Support for both causal and non-causal attention patterns
- Optional LSTM weight sharing between encoder and decoder
- Integration with the TransformerTF data pipeline and training infrastructure
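The variable-length handling mentioned above typically follows the standard PyTorch packing pattern shown below (a generic sketch, not code from transformertf):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

x = torch.randn(4, 10, 16)                     # padded batch: (B, T_max, F)
encoder_lengths = torch.tensor([10, 7, 5, 3])  # true length of each sequence

# Pack so the LSTM skips padded timesteps entirely
packed = pack_padded_sequence(x, encoder_lengths, batch_first=True, enforce_sorted=False)
packed_out, (h, c) = lstm(packed)
out, _ = pad_packed_sequence(packed_out, batch_first=True, total_length=10)
print(out.shape)  # torch.Size([4, 10, 32]); positions beyond each length are zero-padded
```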
Applications
TransformerLSTM is particularly suited for time series forecasting tasks that require:
- Modeling of complex temporal dependencies
- Processing of moderate-complexity time series (more than simple patterns, less than highly multivariate data)
- Balance between model expressiveness and computational efficiency
- Sequential processing with attention-based refinement
The model was originally developed for modeling magnetic field transfer functions in particle accelerator systems, where temporal dynamics exhibit both sequential dependencies and complex long-range interactions.
See Also
- AttentionLSTM - Simpler attention-augmented LSTM
- Temporal Fusion Transformer - Complex multi-variate time series architecture
- transformertf - Parent framework for physics-informed time series modeling