Today we are launching a server dedicated to Tokenization research! Come join us!
discord.gg/CDJhnSvU
@mc Interesting. Looks like I'm a bit late to the party (link expired). Is this still going?
@mc My first thought is always that the problem with tokenizers is really the problem with long contexts in disguise. If we could just learn longer-context models, we could go back to good old character-level tokenization.
@pbloem In some sense, I agree that character-level would solve a lot of our problems. But character-level models have low information density and, at least with current architectures, are too costly and slow.
Subword tokenization is definitely an imperfect solution, but it improves on both of those problems for most use cases (though ofc it then has downsides like trouble with spelling, basic math, etc.).
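To make the sequence-length trade-off concrete, here's a minimal sketch. It assumes the `tiktoken` package and its `cl100k_base` BPE vocabulary (neither is mentioned above, they're just convenient stand-ins for any subword tokenizer):

```python
# Rough comparison of how many units a character-level model vs a subword
# model sees for the same sentence. Assumes `tiktoken` is installed; the
# "cl100k_base" vocabulary is only an illustrative choice.
import tiktoken

text = "Subword tokenization trades sequence length for an imperfect vocabulary."

enc = tiktoken.get_encoding("cl100k_base")
subword_ids = enc.encode(text)

print(f"characters     : {len(text)}")        # sequence length at character level
print(f"subword tokens : {len(subword_ids)}") # several times shorter

# Why spelling/counting letters is hard for subword models: the model sees
# opaque pieces, not individual characters.
pieces = [enc.decode([t]) for t in subword_ids]
print(pieces)  # e.g. ['Sub', 'word', ' token', 'ization', ...]
```

The shorter subword sequence is exactly the "information density" win, at the cost of the model never directly seeing the letters inside each piece.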
I hope to see a lot more research on tokenization-free methods like BLT (Byte Latent Transformer).