Today we are launching a server dedicated to Tokenization research! Come join us!
discord.gg/CDJhnSvU
@mc Interesting. Looks like I'm a bit late to the party (link expired). Is this still going?
@mc My first thought is always that the problem with tokenizers is really the problem with long contexts in disguise. If we could just learn longer-context models, we could go back to good old character-level tokenization.
@pbloem In some sense, I agree that character-level would solve a lot of our problems. But character-level models have low information density and, at least with current architectures, are too costly and slow.
Subword tokenization is definitely an imperfect solution, but it improves on both of those problems for most use cases (though ofc it then has downsides like trouble with spelling, basic math, etc.).
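To make the sequence-length trade-off concrete, here's a minimal sketch. It assumes the `tiktoken` package and its `cl100k_base` BPE vocabulary (neither is mentioned above, they're just convenient stand-ins for any subword tokenizer):

```python
# Rough comparison of how many units a character-level model vs a subword
# model sees for the same sentence. Assumes `tiktoken` is installed; the
# "cl100k_base" vocabulary is only an illustrative choice.
import tiktoken

text = "Subword tokenization trades sequence length for an imperfect vocabulary."

enc = tiktoken.get_encoding("cl100k_base")
subword_ids = enc.encode(text)

print(f"characters     : {len(text)}")        # sequence length at character level
print(f"subword tokens : {len(subword_ids)}") # several times shorter

# Why spelling/counting letters is hard for subword models: the model sees
# opaque pieces, not individual characters.
pieces = [enc.decode([t]) for t in subword_ids]
print(pieces)  # e.g. ['Sub', 'word', ' token', 'ization', ...]
```

The shorter subword sequence is exactly the "information density" win, at the cost of the model never directly seeing the letters inside each piece.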
I hope to see a lot more research on tokenization-free methods like BLT (Byte Latent Transformer).