Question

We need to understand how the context and target sequence lengths affect transformer models, in particular how well we can predict into the future for a given target length.

Hypothesis

We hope that the transformer can reliably predict a few hundred tokens into the future.

Experiment

We fix the downsampling rate to 50, which with 1 kHz data gives samples spaced 0.02 s apart. The maximum current ramp rate dI/dt is 2200 A/s on ramp-up (MD1) and 5000 A/s on ramp-down (LHC), which corresponds to a field ramp rate dB/dt of about 1 T/s on ramp-up and 1.75 T/s on ramp-down. This means a maximum spacing between consecutive samples of roughly 44 A on ramp-up and 100 A on ramp-down in current, or about 0.02 T on ramp-up and 0.05 T on ramp-down in field. This is not optimal, but for the sake of the experiment we keep the downsampling rate high.
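As a quick sanity check of the spacing arithmetic, a minimal sketch using the rounded ramp rates quoted above (the dictionary layout is only illustrative):

```python
# Per-sample current step implied by the downsampled spacing and the quoted ramp rates.
dt = 0.02  # s, sample spacing after downsampling

ramp_rates_a_per_s = {
    "ramp-up (MD1)": 2200.0,
    "ramp-down (LHC)": 5000.0,
}

for name, di_dt in ramp_rates_a_per_s.items():
    print(f"{name}: {di_dt * dt:.0f} A between consecutive samples")
# ramp-up (MD1): 44 A between consecutive samples
# ramp-down (LHC): 100 A between consecutive samples
```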

We use sequence lengths that are multiples of the basic period, i.e. multiples of 60 samples.

We make a grid of the two sequence lengths and train each combination for 2000 epochs with a fixed architecture (TransformerV2). The grid is the following:

ctxt_seq_len: 60, 120, 240, 480, 960, 1920

tgt_seq_len: 60, 120, 240, 360, 480

This gives a grid of 6 × 5 = 30 combinations.

We run this with our tuning script.
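For reference, a minimal sketch of how this grid could be declared with Ray Tune's grid_search; the trainable name and config keys are illustrative, not the actual tuning script:

```python
from ray import tune

# Illustrative stand-in for the real training entry point; each trial receives
# one (ctxt_seq_len, tgt_seq_len) combination through its config.
def train_transformer_v2(config):
    ctxt_seq_len = config["ctxt_seq_len"]
    tgt_seq_len = config["tgt_seq_len"]
    ...  # build TransformerV2, train for 2000 epochs, report metrics to Tune

# grid_search enumerates every listed value, so the two axes expand into the
# full 6 x 5 = 30 cross product, each combination trained exactly once.
param_space = {
    "ctxt_seq_len": tune.grid_search([60, 120, 240, 480, 960, 1920]),
    "tgt_seq_len": tune.grid_search([60, 120, 240, 360, 480]),
}

tuner = tune.Tuner(train_transformer_v2, param_space=param_space)
results = tuner.fit()
```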

Results

Add results of study here, including hard data, to inform future decisions.

I made a mistake by defining the grid with ray.tune.choice instead of ray.tune.grid_search, and had to cancel the search after the same sequence-length pair reappeared. Since the grid search can only complete about 3-5 trainings per day, depending on the sequence length, wasting trials on duplicates is not feasible when short on time.
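For future reference, the difference between the two Ray Tune primitives (a sketch, not the actual script):

```python
from ray import tune

# tune.choice draws a value at random for every trial, so across many trials
# the same sequence-length pair can be sampled repeatedly.
random_space = {"ctxt_seq_len": tune.choice([60, 120, 240, 480, 960, 1920])}

# tune.grid_search enumerates every listed value exactly once per grid point,
# so each combination is trained once and nothing repeats.
exhaustive_space = {"ctxt_seq_len": tune.grid_search([60, 120, 240, 480, 960, 1920])}
```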