AttentionLSTM
AttentionLSTM is a sequence-to-sequence deep learning architecture that adds multi-head self-attention to the standard encoder-decoder framework for time series forecasting. The model sits between simpler encoder-decoder approaches and more complex transformer-based architectures, combining the sequential processing of LSTMs with attention mechanisms that can model long-range dependencies. This architecture is part of the TransformerTF framework developed for temporal modeling in physics applications.
Architecture
The AttentionLSTM implements a design that combines recurrent and attention-based processing. The architecture maintains the encoder-decoder structure while introducing attention mechanisms that allow the decoder to focus on different parts of the input sequence.
Core Components
The model consists of five primary components:
• Encoder LSTM processes historical time series data
• Decoder LSTM generates future predictions initialized with encoder context
• Multi-head self-attention mechanism captures dependencies across time steps
• Skip connection combines attention output with decoder states
• Output projection layer produces final predictions
Data Flow
  Historical Data            Known Future Features
         ↓                             ↓
┌─────────────────┐           ┌─────────────────┐
│  past_sequence  │           │ future_sequence │
│    (B×Tp×Fp)    │           │    (B×Tf×Ff)    │
└─────────────────┘           └─────────────────┘
         ↓                             │
┌─────────────────┐                    │
│  Encoder LSTM   │                    │
│   num_layers    │                    │
│    d_model      │                    │
└─────────────────┘                    │
         ↓ (final states)              ↓
┌───────────────────────────────────────────────┐
│                 Decoder LSTM                  │
│      (initialized with encoder states)        │
└───────────────────────────────────────────────┘
                        ↓
┌───────────────────────────────────────────────┐
│              Multi-Head Attention             │
│   Query:     decoder_output                   │
│   Key/Value: concat(encoder_output,           │
│                     decoder_output)           │
└───────────────────────────────────────────────┘
                        ↓
               ┌─────────────────┐
               │ Skip Connection │
               │ (GateAddNorm or │
               │   LayerNorm)    │
               └─────────────────┘
                        ↓
               ┌─────────────────┐
               │  Linear Output  │
               │  (B×Tf×D_out)   │
               └─────────────────┘
Where B represents batch size, Tp and Tf are past and future sequence lengths, Fp and Ff are past and future feature dimensions, and D_out is the output dimension.
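The data flow above maps naturally onto a handful of standard PyTorch modules. The following skeleton is a minimal illustrative sketch, not the TransformerTF implementation; the class name, argument names, and defaults are assumptions chosen to mirror the diagram.

import torch.nn as nn

class AttentionLSTMSketch(nn.Module):
    """Illustrative skeleton of the five core components (not the TransformerTF code)."""

    def __init__(self, num_past_features, num_future_features,
                 d_model=128, num_layers=2, num_heads=4,
                 output_dim=1, dropout=0.1):
        super().__init__()
        # Encoder LSTM: processes the historical sequence (B, Tp, Fp)
        self.encoder = nn.LSTM(num_past_features, d_model, num_layers,
                               batch_first=True, dropout=dropout)
        # Decoder LSTM: processes known future features (B, Tf, Ff),
        # initialized with the encoder's final (h, c) states
        self.decoder = nn.LSTM(num_future_features, d_model, num_layers,
                               batch_first=True, dropout=dropout)
        # Multi-head attention over the concatenated encoder/decoder outputs
        self.attention = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        # Skip connection: residual add + LayerNorm (a gated variant is used
        # when use_gating is enabled)
        self.norm = nn.LayerNorm(d_model)
        # Output projection to (B, Tf, D_out)
        self.output_layer = nn.Linear(d_model, output_dim)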
Technical Details
Input and Output Specification
The model processes two input sequences. The past_sequence contains historical time series data with shape (batch_size, past_seq_len, num_past_features), while the future_sequence provides known future features with shape (batch_size, future_seq_len, num_future_features). Optional length tensors handle variable-length sequences within batches. The model produces predictions as a tensor of shape (batch_size, future_seq_len, output_dim).
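For concreteness, the shape conventions can be illustrated with dummy tensors; the forward-call signature shown in the comments is hypothetical, not the documented API.

import torch

B, Tp, Tf = 32, 120, 24        # batch size, past and future sequence lengths
Fp, Ff, D_out = 8, 3, 1        # past features, known future features, output dim

past_sequence = torch.randn(B, Tp, Fp)      # historical inputs
future_sequence = torch.randn(B, Tf, Ff)    # known future covariates
past_lengths = torch.full((B,), Tp)         # optional: true length of each sample

# Hypothetical call; the real signature may differ:
# prediction = model(past_sequence, future_sequence, past_lengths)
# prediction.shape == (B, Tf, D_out)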
Attention Mechanism Design
The multi-head self-attention uses the decoder output as queries, while the keys and values are formed by concatenating the encoder and decoder output sequences. This allows the decoder to attend to both past observations and its own generated outputs at each time step.
The attention computation follows standard multi-head attention but uses this concatenated key-value construction. The mechanism can apply causal masking to prevent the decoder from attending to future positions in its own output sequence.
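A minimal sketch of this concatenated key/value construction with nn.MultiheadAttention follows; the tensor names and the causal-mask construction are assumptions based on the description above, not the TransformerTF code.

import torch
import torch.nn as nn

B, Tp, Tf, d_model, num_heads = 4, 20, 8, 64, 4

encoder_output = torch.randn(B, Tp, d_model)   # H_enc
decoder_output = torch.randn(B, Tf, d_model)   # H_dec

# Keys/values: encoder and decoder outputs concatenated along the time axis
context = torch.cat([encoder_output, decoder_output], dim=1)   # (B, Tp+Tf, d_model)

# Optional causal mask: decoder step i may attend to every encoder position
# but only to decoder positions <= i (True entries are masked out)
future_pos = torch.arange(Tf).unsqueeze(1)              # (Tf, 1)
context_pos = torch.arange(Tp + Tf).unsqueeze(0) - Tp   # (1, Tp+Tf)
causal_mask = context_pos > future_pos                  # (Tf, Tp+Tf), bool

mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
attn_out, attn_weights = mha(decoder_output, context, context,
                             attn_mask=causal_mask)
# attn_out: (B, Tf, d_model); attn_weights: (B, Tf, Tp+Tf)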
Skip Connection and Gating
The model uses a skip connection mechanism that combines the attention output with the original decoder output. When use_gating is enabled, the model uses a GateAddNorm layer that applies a gated linear unit to control information flow, followed by residual connection and layer normalization. When gating is disabled, a simpler additive residual connection with layer normalization is used.
The gating mechanism provides control over the influence of the attention mechanism, allowing the model to balance between recurrent processing and attention-based refinement.
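One way to sketch the gated variant is a gated linear unit followed by a residual add and layer normalization; this approximates the GateAddNorm idea described above and is not the exact TransformerTF layer.

import torch
import torch.nn as nn

class GatedSkipConnection(nn.Module):
    """Gate the attention output, add the decoder residual, then normalize."""

    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # Single linear layer producing value and gate halves (GLU style)
        self.proj = nn.Linear(d_model, 2 * d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, attn_out, residual):
        value, gate = self.proj(self.dropout(attn_out)).chunk(2, dim=-1)
        gated = value * torch.sigmoid(gate)     # gated linear unit
        return self.norm(gated + residual)      # residual add + layer norm

# With use_gating disabled, the same step reduces to norm(attn_out + residual).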
Key Hyperparameters
The model’s behavior is controlled by several hyperparameters; a hypothetical configuration is sketched after this list:
• d_model controls the hidden dimension of LSTM networks and attention mechanisms
• num_layers determines the depth of both encoder and decoder LSTM networks
• num_heads specifies the number of attention heads
• use_gating enables or disables the gated skip connection mechanism
• causal_attention controls whether the decoder can attend to future positions
• output_dim specifies the final prediction dimension
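As an illustration only, a hypothetical hyperparameter set mirroring this list might look like the following (names and values are assumptions, not documented defaults).

attention_lstm_config = dict(
    d_model=128,            # LSTM hidden size and attention dimension
    num_layers=2,           # encoder and decoder LSTM depth
    num_heads=4,            # d_model must be divisible by num_heads
    use_gating=True,        # GateAddNorm skip connection instead of plain residual
    causal_attention=True,  # prevent attending to future decoder positions
    output_dim=1,           # dimension of each predicted time step
)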
Position as a Middle-Ground Architecture
Compared to EncoderDecoderLSTM
The EncoderDecoderLSTM relies solely on the encoder’s final hidden state to transfer information to the decoder. This creates a potential information bottleneck where all historical information must be compressed into fixed-size vectors.
AttentionLSTM addresses this limitation by allowing the decoder to access the complete encoder output sequence through attention mechanisms. This helps with long sequences where early information might otherwise be lost in the compression process. The attention mechanism also provides interpretability by revealing which parts of the input sequence are most relevant for each prediction.
Compared to TransformerLSTM
TransformerLSTM applies multiple transformer blocks with both self-attention and cross-attention mechanisms after LSTM processing. While this provides greater modeling capacity, it also increases computational requirements and architectural complexity.
AttentionLSTM achieves many of the benefits of attention-based processing through a single attention layer. This approach maintains efficiency while capturing important attention-based interactions. The model retains inductive bias for sequential data through its LSTM foundation while adding attention where it provides benefit.
AttentionLSTM also offers a clearer architectural interpretation: the roles of sequential processing (LSTM) and relationship modeling (attention) are cleanly separated and then combined through the skip connection mechanism.
Mathematical Formulation
Encoding Phase
For a past sequence X_past = (x₁, x₂, …, x_Tp), the encoder processes each timestep in turn:
H_enc, (h_n, c_n) = LSTM_encoder(X_past)
where H_enc contains the output at each timestep and (h_n, c_n) represents the final hidden and cell states.
Decoding Phase
The decoder is initialized with the encoder’s final states and processes the known future sequence X_future:
H_dec, _ = LSTM_decoder(X_future, initial_state=(h_n, c_n))
Attention Mechanism
The multi-head attention operates on the decoder output with concatenated encoder-decoder context:
Context = concat(H_enc, H_dec)
Attn_out = MultiHeadAttention(Query=H_dec, Key=Context, Value=Context)
Skip Connection and Output
The final output combines attention and decoder representations:
Combined = SkipConnection(Attn_out, H_dec)
Output = Linear(Combined)
where SkipConnection can be either GateAddNorm or simple residual connection with layer normalization.
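Taken together, the four stages correspond to a forward pass along the following lines. This is a minimal sketch with assumed module names, omitting masking, dropout, and variable-length handling.

import torch
import torch.nn as nn

B, Tp, Tf, Fp, Ff = 4, 20, 8, 6, 3
d_model, num_heads, output_dim = 64, 4, 1

encoder = nn.LSTM(Fp, d_model, num_layers=2, batch_first=True)
decoder = nn.LSTM(Ff, d_model, num_layers=2, batch_first=True)
attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
norm = nn.LayerNorm(d_model)
head = nn.Linear(d_model, output_dim)

X_past = torch.randn(B, Tp, Fp)
X_future = torch.randn(B, Tf, Ff)

# Encoding: H_enc, (h_n, c_n) = LSTM_encoder(X_past)
H_enc, (h_n, c_n) = encoder(X_past)

# Decoding: decoder initialized with the encoder's final states
H_dec, _ = decoder(X_future, (h_n, c_n))

# Attention over the concatenated encoder/decoder context
Context = torch.cat([H_enc, H_dec], dim=1)
Attn_out, _ = attention(H_dec, Context, Context)

# Skip connection (ungated variant) and output projection
Output = head(norm(Attn_out + H_dec))    # (B, Tf, output_dim)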
Design Rationale
Hybrid Processing Strategy
The combination of LSTM and attention mechanisms exploits their complementary strengths. LSTMs provide inductive bias for sequential temporal data and handle variable-length sequences naturally, while attention mechanisms enable modeling of long-range dependencies and non-sequential relationships.
Selective Attention Application
Rather than replacing the LSTM structure entirely, AttentionLSTM applies attention where it provides benefit. This approach maintains computational efficiency while capturing important attention-based interactions that improve prediction quality.
Interpretability
The attention mechanism provides interpretability by revealing which parts of the input sequence are most influential for each prediction. The attention weights can be visualized to understand the model’s decision-making process, which is valuable in physics applications where understanding model behavior is important.
Implementation Details
AttentionLSTM is implemented as a PyTorch Lightning module with several technical features. The model supports variable-length sequences through packed sequence processing and proper attention masking, enabling efficient batch handling of sequences with different lengths.
The implementation includes dropout at multiple levels, including within LSTM layers, attention mechanisms, and skip connections. The model provides full integration with the TransformerTF data pipeline and training infrastructure, along with configurable attention and gating mechanisms.
The architecture maintains memory efficiency through careful state management and supports both causal and non-causal attention patterns depending on the forecasting requirements.
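As an example of how variable-length batches can be handled with standard PyTorch utilities (the exact TransformerTF mechanics may differ), the encoder can consume a packed sequence while a key-padding mask hides padded positions from the attention:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

B, Tp_max, Fp, d_model = 3, 10, 6, 32
past = torch.randn(B, Tp_max, Fp)
past_lengths = torch.tensor([10, 7, 4])     # true length of each sample

# Pack so the LSTM skips padded time steps
encoder = nn.LSTM(Fp, d_model, batch_first=True)
packed = pack_padded_sequence(past, past_lengths, batch_first=True,
                              enforce_sorted=False)
packed_out, (h_n, c_n) = encoder(packed)
H_enc, _ = pad_packed_sequence(packed_out, batch_first=True, total_length=Tp_max)

# Key-padding mask for attention: True marks padded encoder positions;
# it would later be extended with False entries for the decoder part of the context.
enc_padding_mask = torch.arange(Tp_max).unsqueeze(0) >= past_lengths.unsqueeze(1)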
Applications
AttentionLSTM works well for time series forecasting scenarios that require both sequential processing and selective modeling of long-range temporal relationships. It fills the gap where simple encoder-decoder approaches are insufficient but full transformer architectures would be unnecessarily complex.
The architecture is suitable for scenarios with moderate sequence lengths where the attention mechanism can capture important temporal relationships without the computational overhead of more complex attention-based architectures. The interpretability provided by attention weights makes it useful for applications where understanding model decisions is important.
Within the context of CERN’s particle accelerator systems, AttentionLSTM handles magnetic field transfer function modeling where both sequential temporal dynamics and selective attention to specific historical periods are important for accurate predictions.
See Also
• EncoderDecoderLSTM - Simpler encoder-decoder architecture without attention mechanisms
• TransformerLSTM - More complex hybrid architecture with multiple transformer blocks
• Temporal Fusion Transformer - Advanced attention-based architecture with variable selection
• TransformerTF Framework - Parent framework for physics-informed time series modeling