The default loss function for the Temporal Fusion Transformer is quantile loss, which measures the deviation of predicted values from actual observations while accounting for uncertainty in the target distribution. Unlike traditional losses such as mean squared error (MSE), which penalize over- and under-predictions symmetrically, quantile loss allows the model to learn conditional quantiles, making it particularly useful for capturing asymmetric errors and producing probabilistic forecasts. In the context of hysteresis prediction this is beneficial, as it enables the model to express confidence intervals around its predictions, accounting for the inherent variability in magnetic field behavior. By optimizing for multiple quantiles simultaneously, the Temporal Fusion Transformer produces a more robust and informative prediction, which is especially useful when modeling complex, history-dependent dynamics such as those seen in accelerator magnets.

Pinball loss

For quantile 𝜏 ∈ (0, 1), the pinball loss is

$$
L_\tau(y, \hat{y}) =
\begin{cases}
\tau \, (y - \hat{y}) & \text{if } y \geq \hat{y} \\
(1 - \tau) \, (\hat{y} - y) & \text{if } y < \hat{y}
\end{cases}
$$

where y is the ground truth and ŷ is the model prediction. So if we are underestimating, i.e. ŷ < y, then the weight 𝜏 applies, whereas if ŷ > y, then the weight (1 - 𝜏) applies. This means that if we are using a quantile loss with a small quantile such as 𝜏 = 0.1, overestimating would mean the weight 1 - 𝜏 = 0.9 is applied to the absolute error, i.e. a large loss weight for an incorrect prediction.

In this way the quantiles should be chosen to make the model more or less sensitive to over- or under-prediction, depending on what is being modeled.

The asymmetry is best understood with an example. Here I picked 𝜏 = 0.1 and the ground truth is y = 103.

  • The pinball loss is zero if we predict f(x) = 103 perfectly

  • If the model underpredicts by 6 with f(x) = 97, the loss is 0.1 × (103 - 97) = 0.6

  • If the model overpredicts by 6 with f(x) = 109, the loss is (1 - 0.1) × (109 - 103) = 5.4

The effect of 𝜏=0.1 on model training is to bias the model toward underpredicting the target since overpredicting is 9 times more expensive than underpredicting when the absolute differences are the same.
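
As a sanity check on these numbers, here is a minimal NumPy sketch of the pinball loss (the function name and vectorized form are my own, not from any particular library):

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Pinball (quantile) loss: weight tau on underpredictions,
    weight (1 - tau) on overpredictions."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.where(error >= 0, tau * error, (tau - 1) * error)

# Worked example: tau = 0.1, ground truth y = 103
print(pinball_loss(103, 103, tau=0.1))  # 0.0  -- perfect prediction
print(pinball_loss(103, 97, tau=0.1))   # ~0.6 -- underprediction: 0.1 * 6
print(pinball_loss(103, 109, tau=0.1))  # ~5.4 -- overprediction: 0.9 * 6
```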

Comparison to other losses

Compared to other loss functions such as Mean Absolute Error (MAE), Huber loss, and Log-Cosh loss, quantile loss provides a distinct advantage in handling asymmetric errors and generating probabilistic predictions. MAE penalizes all errors linearly, making it more robust to outliers than Mean Squared Error (MSE) but less informative in cases where uncertainty estimation is needed. Huber loss, on the other hand, combines the best of MAE and MSE by applying a quadratic penalty to small errors and a linear penalty to large errors, making it well-suited for balancing robustness and sensitivity to measurement noise. Log-Cosh loss behaves similarly to Huber loss but is smooth everywhere, reducing the effect of large residuals while still being differentiable. In contrast, quantile loss does not focus on minimizing the mean error but instead learns multiple quantiles, making it ideal for capturing the full distribution of possible outcomes rather than just a single point estimate.

Comparison of Loss Functions

| Loss Function | Sensitivity to Noise | Numerical Stability | Computational Complexity | Key Advantage |
| --- | --- | --- | --- | --- |
| Quantile Loss | Moderate (depends on chosen quantiles) | Stable, but requires careful selection of quantiles | Moderate (one loss term per quantile) | Captures uncertainty and allows probabilistic forecasting |
| MAE | High (treats all deviations equally) | Stable | Low | Robust to outliers, simple interpretation |
| Huber Loss | Low (quadratic for small errors, linear for large ones) | Stable | Moderate (threshold tuning required) | Balances robustness and sensitivity to small errors |
| Log-Cosh Loss | Low (similar to Huber but smoother) | High (differentiable everywhere) | Moderate | More stable than Huber for optimization, smooth convergence |

In terms of sensitivity to measurement noise, MAE weights every deviation with the same linear penalty, whereas Huber and Log-Cosh dampen the influence of large residuals. Quantile loss, while not explicitly designed for noise robustness, can still model uncertainty by adjusting the quantile levels to account for the noise distribution. Numerically, MAE, Huber, and Log-Cosh are stable, while quantile loss requires a careful choice of quantile levels to avoid instabilities. Computationally, MAE is the simplest, while Huber and Log-Cosh introduce moderate complexity. Quantile loss is somewhat heavier since the loss must be evaluated at every quantile, but this added cost is justified when uncertainty estimation is crucial, as in the case of hysteresis prediction with the Temporal Fusion Transformer.
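
To make this comparison concrete, here is a small sketch (my own illustration; the Huber threshold delta = 1.0 is an arbitrary choice) that evaluates each loss on the same residuals:

```python
import numpy as np

def mae(e):
    return np.abs(e)

def huber(e, delta=1.0):
    # Quadratic for |e| <= delta, linear beyond it
    return np.where(np.abs(e) <= delta,
                    0.5 * e**2,
                    delta * (np.abs(e) - 0.5 * delta))

def log_cosh(e):
    return np.log(np.cosh(e))

def pinball(e, tau=0.1):
    # e = y_true - y_pred, so e > 0 is an underprediction
    return np.where(e >= 0, tau * e, (tau - 1) * e)

residuals = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
for name, fn in [("MAE", mae), ("Huber", huber),
                 ("Log-Cosh", log_cosh), ("Pinball tau=0.1", pinball)]:
    print(f"{name:>16}: {np.round(fn(residuals), 3)}")
```

The three symmetric losses return the same value for residuals of -6 and +6, while the pinball loss at 𝜏 = 0.1 penalizes the overprediction (e = -6) nine times more heavily than the underprediction.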

Experiments

  • Investigate which quantiles to use for hysteresis prediction [priority:: high] [due:: 2025-02-03] [completion:: 2025-02-26]

Best quantiles seem to be [0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98].
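
Assuming the pytorch-forecasting implementation of the Temporal Fusion Transformer (the notes above don't name a library, so this is an assumption), these quantiles would be passed through its QuantileLoss metric; `training` below is a placeholder for an existing TimeSeriesDataSet:

```python
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

quantiles = [0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98]

# `training` stands in for a previously built TimeSeriesDataSet
tft = TemporalFusionTransformer.from_dataset(
    training,
    loss=QuantileLoss(quantiles=quantiles),
    output_size=len(quantiles),  # one output head per quantile
)
```

With these seven quantiles, the 0.5 output serves as the point forecast while the 0.02/0.98 pair brackets a 96% prediction interval.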