Question
We need to understand the importance of the context and target sequence lengths for transformer models, in particular how well we can predict into the future given a fixed target length.
Hypothesis
We expect the transformer to be able to predict a few hundred tokens into the future.
Experiment
We fix the downsampling rate to 50, which for 1 kHz data gives samples spaced 0.02 s apart. With a maximum ramp rate of 2200 A/s on ramp-up (MD1) and 5000 A/s on ramp-down (LHC), equivalent to roughly 1 T/s on ramp-up and 1.75 T/s on ramp-down, the maximum sample spacing is roughly 40 A on ramp-up and 100 A on ramp-down, or in field roughly 0.02 T on ramp-up and 0.05 T on ramp-down. This is not optimal, but for the sake of the experiment we keep the downsampling rate high.
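As a quick sanity check of these spacings, a minimal sketch using only the sampling rate, downsampling factor, and ramp rates quoted above:

```python
# Sanity check: effective sample spacing after downsampling.
F_RAW_HZ = 1000            # raw acquisition rate [Hz]
DOWNSAMPLE = 50            # downsampling factor used in this study
dt = DOWNSAMPLE / F_RAW_HZ # 0.02 s between kept samples

RAMP_UP_A_PER_S = 2200     # max ramp-up rate (MD1)
RAMP_DOWN_A_PER_S = 5000   # max ramp-down rate (LHC)

print(f"sample spacing:    {dt:.3f} s")
print(f"ramp-up spacing:   {RAMP_UP_A_PER_S * dt:.0f} A")    # ~44 A
print(f"ramp-down spacing: {RAMP_DOWN_A_PER_S * dt:.0f} A")  # 100 A
```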
We use sequence lengths that are multiples of the basic period, i.e. multiples of 60 samples.
We make a grid over the two sequence lengths and train each combination for 2000 epochs with a fixed architecture (TransformerV2). The grid is the following:
ctxt_seq_len: 60, 120, 240, 480, 960, 1920
tgt_seq_len: 60, 120, 240, 360, 480
This gives a grid of 6 × 5 = 30 combinations.
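A minimal sketch that enumerates the grid and translates the sequence lengths into physical time spans, using the 0.02 s sample spacing from above:

```python
from itertools import product

DT_S = 0.02  # sample spacing after downsampling (see above)

ctxt_seq_lens = [60, 120, 240, 480, 960, 1920]
tgt_seq_lens = [60, 120, 240, 360, 480]

pairs = list(product(ctxt_seq_lens, tgt_seq_lens))
print(f"{len(pairs)} (ctxt_seq_len, tgt_seq_len) combinations")  # 30

for ctxt, tgt in pairs:
    # Context/target window lengths in seconds at 0.02 s per sample.
    print(f"ctxt {ctxt:5d} ({ctxt * DT_S:5.1f} s) -> tgt {tgt:4d} ({tgt * DT_S:5.1f} s)")
```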
We run this with our tuning script.
Results
Add the results of the study here, including hard data, to inform future decisions.
I screwed up by defining the grid using ray.tune.choice instead of ray.tune.grid_search, and had to cancel the search after the same sequence-length pair re-appeared. Since the grid search can only complete about 3-5 trainings per day depending on the sequence length, redoing it is not feasible when short on time.
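For the re-run, a minimal sketch of how the grid could be defined with ray.tune.grid_search instead of ray.tune.choice; the trainable below is a placeholder, since the real tuning script trains TransformerV2 for 2000 epochs per pair:

```python
from ray import tune


def train_transformer_v2(config):
    # Placeholder trainable: the real script trains TransformerV2 for
    # 2000 epochs with the given sequence lengths and reports its metrics.
    return {"ctxt_seq_len": config["ctxt_seq_len"],
            "tgt_seq_len": config["tgt_seq_len"]}


# grid_search enumerates every combination exactly once, whereas
# tune.choice samples values independently at random, which is why the
# same sequence-length pair re-appeared in the cancelled search.
param_space = {
    "ctxt_seq_len": tune.grid_search([60, 120, 240, 480, 960, 1920]),
    "tgt_seq_len": tune.grid_search([60, 120, 240, 360, 480]),
}

tuner = tune.Tuner(train_transformer_v2, param_space=param_space)
results = tuner.fit()  # 6 × 5 = 30 trials
```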