An (admittedly amateur) attempt to bring the strengths of CNNs to sequence prediction through PACING - Patch-bAsed ConvolutIoNal Generators
"An Image is Worth 16x16 Words" showed how transformers can be applied to image classification by converting patches into "tokens" for the model to learn from. At the other end of the spectrum, ConvNeXt showed that classical CNNs can outperform transformers on the same task by porting over similar architectural improvements.
Currently, transformers reign over text/sequence tasks, but the idea here is to restructure the data so that CNNs, with their recent improvements, can perform the same task in a manner analogous to vision transformers (patches).
If successful, the use of CNNs could drastically reduce memory consumption, inference times, and training requirements, due to their more efficient use of parameters.
So, to convert a sequence of words (tokens) into an image, we:
- Compute embeddings from tokens
- Add positional encoding
- Reshape embeddings into patches
- Add padding where necessary
- Combine patches into a final image
This image is then passed to a CNN for training/inference. Specifically, for this experiment, 1024 tokens with an embedding size of 768 are converted to an image of size 3×512×512 (1024 × 768 = 3 × 512 × 512 = 786,432 values).
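As a rough illustration, here is a minimal PyTorch sketch of that conversion. The module, the learned positional embeddings, and the padding choice are my assumptions about the setup, and the patch/reshape steps collapse into a single flatten-and-view because 1024 × 768 equals 3 × 512 × 512 exactly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokensToImage(nn.Module):
    """Map a batch of token ids to a 3x512x512 'image' tensor for a CNN."""

    def __init__(self, vocab_size=50256, seq_len=1024, embd=768,
                 channels=3, image_size=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embd)    # token embeddings
        self.pos_emb = nn.Embedding(seq_len, embd)       # learned positional encoding
        self.channels = channels
        self.image_size = image_size

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        b, t = token_ids.shape
        pos = torch.arange(t, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)  # (batch, seq_len, embd)
        x = x.flatten(1)                                 # (batch, seq_len * embd)
        target = self.channels * self.image_size ** 2    # 3 * 512 * 512 = 786432
        if x.size(1) < target:                           # pad short sequences
            x = F.pad(x, (0, target - x.size(1)))
        # Reshape the flat patch embeddings into the final image
        return x.view(b, self.channels, self.image_size, self.image_size)


# Usage: (2, 1024) token ids -> (2, 3, 512, 512) image
imgs = TokensToImage()(torch.randint(0, 50256, (2, 1024)))
```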
Please note that this experiment was designed around compute constraints and should be scaled up for a fairer comparison.
Here, the task is to see whether transformers or CNNs are better at predicting the next token in a sequence. The dataset is a tokenised corpus of several Shakespeare texts.
The transformer is GPT-2 small (pos=1024, ctx=1024, embd=768, layers=12, heads=12). The CNN is ConvNeXtV2-base (depths=[3, 3, 20, 3], dims=[128, 256, 512, 1024]). * Note the depth at position 2 is changed from the standard 27 to 20, for a more equal parameter count.
Both have their positional and word embeddings trained from scratch, with a vocabulary size of 50256. Both have an extremely similar parameter count (124.44M vs 124.75M), though ConvNeXt is admittedly 0.2% larger. Both were trained for one epoch (12,120 samples) with a batch size of 2, AdamW(lr=2e-4, weight_decay=0, betas=(0.9, 0.999), eps=1e-8), gradient clipping at 1.0, and a targeted minimum training time of one hour.
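For reference, a sketch of how the two backbones and the shared optimizer settings might be set up. The use of HuggingFace transformers is my assumption (the write-up does not name its libraries), and the CNN additionally needs the token/positional embeddings and a next-token head on top of the backbone, which are not shown here:

```python
import torch
from transformers import (GPT2Config, GPT2LMHeadModel,
                          ConvNextV2Config, ConvNextV2Model)

VOCAB = 50256

# GPT-2 small, trained from scratch (embeddings included)
gpt2 = GPT2LMHeadModel(GPT2Config(
    vocab_size=VOCAB, n_positions=1024, n_embd=768, n_layer=12, n_head=12))

# ConvNeXtV2-base backbone with stage depths [3, 3, 20, 3] instead of the
# usual [3, 3, 27, 3], to keep the overall parameter count near GPT-2 small.
convnext = ConvNextV2Model(ConvNextV2Config(
    num_channels=3, depths=[3, 3, 20, 3], hidden_sizes=[128, 256, 512, 1024]))

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"GPT-2 small: {count(gpt2) / 1e6:.2f}M parameters")   # ~124M
print(f"ConvNeXtV2 backbone: {count(convnext) / 1e6:.2f}M")  # embeddings/head not included

# Shared training setup from the write-up (shown for the GPT-2 model)
opt = torch.optim.AdamW(gpt2.parameters(), lr=2e-4, weight_decay=0.0,
                        betas=(0.9, 0.999), eps=1e-8)
# ... after loss.backward(), clip gradients before opt.step():
torch.nn.utils.clip_grad_norm_(gpt2.parameters(), 1.0)
```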
Both models had their crossentropy loss logged on a validation set of 100 samples.
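The metric itself is just mean next-token cross-entropy; a generic sketch follows, where the validation loader and the assumption that each model returns logits of shape (batch, seq_len, vocab) are mine:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def validation_loss(model, val_loader, device="cuda"):
    """Mean next-token cross-entropy over a (hypothetical) validation loader."""
    model.eval()
    losses = []
    for input_ids, target_ids in val_loader:     # each: (batch, seq_len)
        logits = model(input_ids.to(device))     # (batch, seq_len, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               target_ids.to(device).view(-1))
        losses.append(loss.item())
    return sum(losses) / len(losses)
```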
Validation cross-entropy loss (lower is better):
- PACING - 6.916 (-13.7%)
- GPT2 - 8.015 (+15.9%)

Speed (higher is better):
- PACING - 2.275 (+51.8%)
- GPT2 - 1.499 (-34.1%)

Memory usage (lower is better):
- PACING - 0.5 GB
- GPT2 - 0.55 GB
Sample generations (top_k=50, seed=43):
PACING
what rich fair nothing thee d nothing love's
With did fair thy heart even of best with love thee form fromty byWhen I his beauty with me mine a. thy self oWhen thou: of,yTo then beautyed thee thou then behold.ilsted one be time: night's the you. ill love best have
love me is this
see and with on then I on'sTo me nothing live this mine wealth not on live by behold I form thee of mine did behold
GPT2
health corrupt now other night look' deeds my that dece yous age theySh glad eyesost
face divine dull so grace hours my tender keep'one grace lofty eyes, keep health they ares head; one now eyesred one so now me that which make her, eyes mine from hours present be express none see health I express'a dece presents those' tender keep they head none glad health look' dull sa not themselves dulls none dece now night other themselves keep they'presentSh
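Both samples above were drawn with plain top-k sampling; a generic sketch of that step is below (the helper and the models' logit interfaces are my assumptions, not the project's actual generation code):

```python
import torch


def sample_next_token(logits, top_k=50, generator=None):
    """Sample one token id from the top-k next-token logits."""
    vals, idx = torch.topk(logits, top_k)          # keep the k largest logits
    probs = torch.softmax(vals, dim=-1)            # renormalise over the top-k
    choice = torch.multinomial(probs, 1, generator=generator)
    return idx.gather(-1, choice)                  # map back to vocabulary ids


# Reproducible draw with the settings above (dummy logits stand in for a model)
gen = torch.Generator().manual_seed(43)
next_id = sample_next_token(torch.randn(1, 50256), top_k=50, generator=gen)
```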
Given its lower final validation loss while being faster and more memory efficient at a near-identical parameter count, the CNN outperformed GPT-2 on this token-prediction task.
I hope to see more experimentation along these lines, with larger models, more data, and longer training runs, as I personally did not have the computational resources for a more conclusive result. I believe the approach shows promise nonetheless.