marco

Today we are launching a server dedicated to Tokenization research! Come join us!

discord.gg/CDJhnSvU

Tokenization is an often-overlooked aspect of modern NLP, but it’s experiencing a resurgence — thanks in large part to @karpathy and his classic tweet:

x.com/karpathy/sta...

Come hang out with us and let's fix these problems!

@mc Interesting. Looks like I'm a bit late to the party (link expired). Is this still going?

@mc My first thought is always that the problem with tokenizers is really the problem with long contexts in disguise. If we could just learn longer-context models, we could go back to good old character-level tokenization.

@pbloem In some sense, I agree that character-level would solve a lot of our problems. But character-level models have low information density and, at least with current architectures, are too costly and slow.
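
A rough back-of-the-envelope sketch of that cost point (the ~4 characters per subword token average is an assumed ballpark, not a measurement): character-level sequences come out several times longer, and self-attention cost grows quadratically with sequence length.

```python
# Character-level vs subword sequence lengths, and the effect on
# quadratic self-attention cost. The ~4 chars per subword token
# average for BPE-style tokenizers is an assumed ballpark figure.

text = "Tokenization is an often-overlooked aspect of modern NLP."

char_len = len(text)                  # one token per character
subword_len = max(1, len(text) // 4)  # assumed ~4 chars per subword token

ratio = char_len / subword_len
print(f"characters: {char_len}, approx subword tokens: {subword_len}")
print(f"sequence length: {ratio:.1f}x longer at character level")
print(f"attention cost (O(n^2)): roughly {ratio ** 2:.1f}x higher")
```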

Subword tokenization is definitely an imperfect solution, but it improves on both of those problems in most cases (though of course it then has downsides like trouble with spelling and basic math, etc.).
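
A quick illustration of that spelling downside, assuming tiktoken's cl100k_base vocabulary as the example (requires `pip install tiktoken`; any BPE vocabulary shows the same effect): the model only ever sees opaque chunk IDs, so the letters inside a word are invisible to it.

```python
# Show how a subword tokenizer splits words into opaque chunks.
# Requires `pip install tiktoken`; cl100k_base is one example vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["strawberry", "tokenization"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r} -> ids {ids} -> pieces {pieces}")
    # To the model each piece is a single integer ID, so letter-level
    # facts (spelling, character counts) are never directly observed.
```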

I hope to see a lot more research on tokenization-free methods like BLT (the Byte Latent Transformer).