The default loss function for the Temporal Fusion Transformer is quantile loss, which measures the deviation of predicted values from actual observations while accounting for uncertainty in the target distribution. Unlike traditional losses such as mean squared error (MSE), which penalize over- and under-predictions symmetrically, quantile loss allows the model to learn conditional quantiles, making it particularly useful for capturing asymmetric errors and producing probabilistic forecasts. In the context of hysteresis prediction this is beneficial, as it enables the model to express confidence intervals around its predictions, accounting for the inherent variability in magnetic field behavior. By optimizing for multiple quantiles simultaneously, the Temporal Fusion Transformer produces a more robust and informative prediction, which is especially useful when modeling complex, history-dependent dynamics such as those seen in accelerator magnets.

Pinball loss

For quantile 𝜏 ∈ (0, 1), the pinball loss is

$$
L_\tau(y, \hat{y}) =
\begin{cases}
\tau \, (y - \hat{y}) & \text{if } y \geq \hat{y} \\
(1 - \tau) \, (\hat{y} - y) & \text{if } y < \hat{y}
\end{cases}
$$

where y is the ground truth and ŷ is the model prediction. So if we are underestimating, i.e. ŷ < y, then the weight 𝜏 applies, whereas if ŷ > y, then the weight (1 - 𝜏) applies. This means that if we are using a quantile loss with a small quantile such as 𝜏 = 0.1, overestimating would mean the weight 1 - 𝜏 = 0.9 is applied to the absolute error, i.e. a large loss weight for an incorrect prediction.

In this way the quantiles should be chosen to make the model more or less sensitive to over- or under-prediction, depending on what is being modeled.

The asymmetry is best understood with an example. Here I picked 𝜏 = 0.1 and the ground truth is y = 103.

  • The pinball loss is zero if we predict f(x) = 103 perfectly

  • If the model underpredicts by 6 with f(x) = 97, the loss is 0.1 × (103 - 97) = 0.6

  • If the model overpredicts by 6 with f(x) = 109, the loss is (1 - 0.1) × (109 - 103) = 5.4

The effect of 𝜏=0.1 on model training is to bias the model toward underpredicting the target since overpredicting is 9 times more expensive than underpredicting when the absolute differences are the same.
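
As a sanity check on these numbers, here is a minimal NumPy sketch of the pinball loss (the function name and vectorized form are my own, not from any particular library):

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Pinball (quantile) loss: weight tau on underpredictions,
    weight (1 - tau) on overpredictions."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.where(error >= 0, tau * error, (tau - 1) * error)

# Worked example: tau = 0.1, ground truth y = 103
print(pinball_loss(103, 103, tau=0.1))  # 0.0  -- perfect prediction
print(pinball_loss(103, 97, tau=0.1))   # ~0.6 -- underprediction: 0.1 * 6
print(pinball_loss(103, 109, tau=0.1))  # ~5.4 -- overprediction: 0.9 * 6
```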

Comparison to other losses

Compared to other loss functions such as Mean Absolute Error (MAE), Huber loss, and Log-Cosh loss, quantile loss provides a distinct advantage in handling asymmetric errors and generating probabilistic predictions. MAE penalizes all errors linearly, making it more robust to outliers than Mean Squared Error (MSE) but less informative in cases where uncertainty estimation is needed. Huber loss, on the other hand, combines the best of MAE and MSE by applying a quadratic penalty to small errors and a linear penalty to large errors, making it well-suited for balancing robustness and sensitivity to measurement noise. Log-Cosh loss behaves similarly to Huber loss but is smooth everywhere, reducing the effect of large residuals while still being differentiable. In contrast, quantile loss does not focus on minimizing the mean error but instead learns multiple quantiles, making it ideal for capturing the full distribution of possible outcomes rather than just a single point estimate.

Comparison of Loss Functions

| Loss Function | Sensitivity to Noise | Numerical Stability | Computational Complexity | Key Advantage |
| --- | --- | --- | --- | --- |
| Quantile Loss | Moderate (depends on chosen quantiles) | Stable, but requires careful selection of quantiles | Moderate (one loss term per quantile) | Captures uncertainty and allows probabilistic forecasting |
| MAE | High (treats all deviations equally) | Stable | Low | Robust to outliers, simple interpretation |
| Huber Loss | Low (quadratic for small errors, linear for large ones) | Stable | Moderate (threshold tuning required) | Balances robustness and sensitivity to small errors |
| Log-Cosh Loss | Low (similar to Huber but smoother) | High (differentiable everywhere) | Moderate | More stable than Huber for optimization, smooth convergence |

In terms of sensitivity to measurement noise, MAE weights every deviation with the same linear penalty, whereas Huber and Log-Cosh dampen the influence of large residuals. Quantile loss, while not explicitly designed for noise robustness, can still model uncertainty by adjusting the quantile levels to account for the noise distribution. Numerically, MAE, Huber, and Log-Cosh are stable, while quantile loss requires a careful choice of quantile levels to avoid instabilities. Computationally, MAE is the simplest, while Huber and Log-Cosh introduce moderate complexity. Quantile loss is somewhat heavier since the loss must be evaluated at every quantile, but this added cost is justified when uncertainty estimation is crucial, as in the case of hysteresis prediction with the Temporal Fusion Transformer.
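
To make this comparison concrete, here is a small sketch (my own illustration; the Huber threshold delta = 1.0 is an arbitrary choice) that evaluates each loss on the same residuals:

```python
import numpy as np

def mae(e):
    return np.abs(e)

def huber(e, delta=1.0):
    # Quadratic for |e| <= delta, linear beyond it
    return np.where(np.abs(e) <= delta,
                    0.5 * e**2,
                    delta * (np.abs(e) - 0.5 * delta))

def log_cosh(e):
    return np.log(np.cosh(e))

def pinball(e, tau=0.1):
    # e = y_true - y_pred, so e > 0 is an underprediction
    return np.where(e >= 0, tau * e, (tau - 1) * e)

residuals = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
for name, fn in [("MAE", mae), ("Huber", huber),
                 ("Log-Cosh", log_cosh), ("Pinball tau=0.1", pinball)]:
    print(f"{name:>16}: {np.round(fn(residuals), 3)}")
```

The three symmetric losses return the same value for residuals of -6 and +6, while the pinball loss at 𝜏 = 0.1 penalizes the overprediction (e = -6) nine times more heavily than the underprediction.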

Experiments

  • Investigate which quantiles to use for hysteresis prediction [priority:: high] [due:: 2025-02-03] [completion:: 2025-02-26]

Best quantiles seem to be [0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98].
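
Assuming the pytorch-forecasting implementation of the Temporal Fusion Transformer (the notes above don't name a library, so this is an assumption), these quantiles would be passed through its QuantileLoss metric; `training` below is a placeholder for an existing TimeSeriesDataSet:

```python
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

quantiles = [0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98]

# `training` stands in for a previously built TimeSeriesDataSet
tft = TemporalFusionTransformer.from_dataset(
    training,
    loss=QuantileLoss(quantiles=quantiles),
    output_size=len(quantiles),  # one output head per quantile
)
```

With these seven quantiles, the 0.5 output serves as the point forecast while the 0.02/0.98 pair brackets a 96% prediction interval.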