- query vector: representation of the one word currently attending.
- key vector: representations of all words in the sequence, matched against the query.
- value vector: representations of all words in the sequence, aggregated by the attention weights.
Embedding layer: [bs, seqlen] → [bs, dim_emb, seqlen]
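As a rough sketch of how these pieces fit together, the following PyTorch snippet embeds a batch of token ids and runs single-head scaled dot-product attention on the result (all sizes and layer names are made up for illustration; note that `nn.Embedding` returns the feature dimension last, i.e. `[bs, seqlen, dim_emb]`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes, chosen only for the example.
bs, seqlen, vocab_size, dim_emb = 2, 5, 1000, 64

# Embedding layer: token ids [bs, seqlen] -> continuous features.
tokens = torch.randint(0, vocab_size, (bs, seqlen))
embed = nn.Embedding(vocab_size, dim_emb)
x = embed(tokens)                    # [bs, seqlen, dim_emb]

# Linear projections produce queries, keys and values from the same input.
w_q = nn.Linear(dim_emb, dim_emb)
w_k = nn.Linear(dim_emb, dim_emb)
w_v = nn.Linear(dim_emb, dim_emb)
q, k, v = w_q(x), w_k(x), w_v(x)     # each [bs, seqlen, dim_emb]

# Scaled dot-product attention: every query attends to all keys,
# and the resulting weights aggregate the values.
scores = q @ k.transpose(-2, -1) / dim_emb ** 0.5   # [bs, seqlen, seqlen]
weights = F.softmax(scores, dim=-1)
out = weights @ v                    # [bs, seqlen, dim_emb]
```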
Architecture
The figure shows the architecture of a transformer.
Normally for translation tasks, the input is the sequence to be translated, e.g. “Je suis une baguette”, which is embedded into feature space and added to a positional embedding. The target initially consists of just a <SOS> (start of sequence) token; the decoder infers the next token (“I”), which is fed back into the transformer as “<SOS> I”, producing “am”, and so on until “<SOS> I am a baguette <EOS>” is produced (<EOS> marks the end of sequence).
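A minimal sketch of this autoregressive decoding loop, assuming a hypothetical trained `model` with `encode`/`decode` methods and known `sos_id`/`eos_id` token ids (these names are illustrative, not from any particular library):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src_tokens, sos_id, eos_id, max_len=50):
    # Encode the source sequence once; the encoder output ("memory")
    # is re-used at every decoding step.
    memory = model.encode(src_tokens)                 # hypothetical method

    tgt = [sos_id]                                    # start with <SOS>
    for _ in range(max_len):
        tgt_tokens = torch.tensor(tgt).unsqueeze(0)   # [1, len(tgt)]
        logits = model.decode(tgt_tokens, memory)     # hypothetical method
        next_token = logits[0, -1].argmax().item()    # most likely next token
        tgt.append(next_token)                        # feed it back in
        if next_token == eos_id:                      # stop at <EOS>
            break
    return tgt
```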
The transformer uses self-attention in its multi-head attention layers to attend to different parts of the sequence. The encoder output is re-used for each decoding step.
The masked self-attention layer in the decoder ensures that the decoder does not attend to future positions. The keys and values for the subsequent encoder-decoder attention are provided by the encoder.
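One common way to implement this masking is an additive causal mask: 0 where attention is allowed and `-inf` on future positions, so the softmax assigns them zero weight. A minimal sketch:

```python
import torch

seqlen = 5  # arbitrary example length

# True above the diagonal marks "future" positions that must be masked out.
future = torch.triu(torch.ones(seqlen, seqlen, dtype=torch.bool), diagonal=1)

# Additive mask: 0 where attention is allowed, -inf where it is forbidden,
# so the softmax over attention scores gives zero weight to future tokens.
mask = torch.zeros(seqlen, seqlen).masked_fill(future, float("-inf"))
print(mask)
```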
Adaptation to continuous output
The transformer was originally designed for translation tasks, i.e. the outputs are discrete tokens, which can be classified using a SoftMax layer. For continuous data we can instead either output a point estimate, or a discretized probability distribution by outputting quantiles, in both cases using linear layers.
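A minimal sketch of both output heads, assuming the transformer yields a `d_model`-dimensional feature per time step; the quantile levels and the pinball (quantile) loss are one common choice for training the quantile head, not something prescribed above:

```python
import torch
import torch.nn as nn

d_model = 64                      # assumed transformer feature size
quantiles = [0.1, 0.5, 0.9]       # example quantile levels

point_head = nn.Linear(d_model, 1)                   # point estimate per step
quantile_head = nn.Linear(d_model, len(quantiles))   # one output per quantile

def pinball_loss(pred, target, qs):
    # pred: [..., num_quantiles], target: [..., 1]
    loss = 0.0
    for i, q in enumerate(qs):
        err = target[..., 0] - pred[..., i]
        loss = loss + torch.maximum(q * err, (q - 1) * err).mean()
    return loss / len(qs)

# Example: features from the transformer for 8 time steps.
feats = torch.randn(8, d_model)
target = torch.randn(8, 1)
print(point_head(feats).shape)                        # torch.Size([8, 1])
print(pinball_loss(quantile_head(feats), target, quantiles))
```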
Adaptation to continuous input
In NLP tasks, discrete tokens are mapped to continuous feature space using an embedder. Time series data is already continuous, so we can either use the raw data directly, or attempt to extract meaningful features, for instance with an LSTM, and then feed that output to the transformer.
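A minimal sketch of the LSTM-as-feature-extractor variant, assuming a univariate series and arbitrary layer sizes (both are assumptions for illustration):

```python
import torch
import torch.nn as nn

class LSTMThenTransformer(nn.Module):
    """Extract features with an LSTM, then model them with a transformer encoder."""

    def __init__(self, input_dim=1, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, d_model, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):              # x: [bs, seqlen, input_dim]
        feats, _ = self.lstm(x)        # [bs, seqlen, d_model]
        return self.encoder(feats)     # [bs, seqlen, d_model]

model = LSTMThenTransformer()
out = model(torch.randn(8, 24, 1))    # e.g. 24 time steps of a univariate series
print(out.shape)                       # torch.Size([8, 24, 64])
```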
Adaptation to time series data
While the transformer is a sequence-to-sequence model, we cannot use it directly for time series data, because we don't have the “target” embeddings.
One way to overcome this is to use a decoder-only approach.
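A minimal sketch of such a decoder-only forecaster, assuming raw values are projected into the model dimension, processed with causally masked self-attention, and each position regresses the next value (all layer choices are illustrative):

```python
import torch
import torch.nn as nn

class DecoderOnlyForecaster(nn.Module):
    """Causal transformer over past values; each position predicts the next value."""

    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.in_proj = nn.Linear(1, d_model)    # raw values -> model dimension
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.out_proj = nn.Linear(d_model, 1)   # point estimate of the next value

    def forward(self, x):                       # x: [bs, seqlen, 1]
        seqlen = x.size(1)
        # Additive causal mask: position i only attends to positions <= i.
        mask = torch.triu(torch.full((seqlen, seqlen), float("-inf")), diagonal=1)
        h = self.blocks(self.in_proj(x), mask=mask)
        return self.out_proj(h)                 # [bs, seqlen, 1]

model = DecoderOnlyForecaster()
past = torch.randn(8, 24, 1)
pred = model(past)                              # pred[:, t] forecasts x[:, t + 1]
print(pred.shape)                               # torch.Size([8, 24, 1])
```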