Closed
So, I found out that \n\n, when followed by another character, tokenizes as ['\n', '\n'] ([198, 198]) instead of ['\n\n'] ([271]).
(I'm using Llama3 for this example, but this extends to other models as well)
Here's an example prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You're Psy, user's assistant, and a master of concise replies.<|eot_id|><|start_header_id|>user<|end_header_id|>
Write a short poem<|eot_id|><|start_header_id|>assistant<|end_header_id|>
If I switch the template to use \n\n\n\n (1038) it tokenizes as ['\n\n\n', '\n'] ([1432, 198]):

(Note: I know there have been efforts to make special tokens render, but as I understand it they currently don't have a textual representation, so you can ignore tokens like 128000, 128006 and 128007 in the sequences above)
In C# I patch the issue like so:
var tokensCount = NativeApi.llama_tokenize(model, bytesPtr, bytes.Length, tokensPtr, tokenBuffer.Length, add_bos, special);
var list = new List<LLamaToken>();
for (int i = 0; i < tokensCount; i++) {
    // Hack: ['\n', '\n'] --> ['\n\n']
    // The bounds check keeps the lookahead from reading past the last valid token.
    if (i + 1 < tokensCount && tokenBuffer[i] == 198 && tokenBuffer[i + 1] == 198) { list.Add(271); i++; }
    else { list.Add(tokenBuffer[i]); }
}
return list.ToArray();
(This ignores all \n merges except \n\n, which is the one the template commonly uses.)
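The patch above only covers the \n\n pair. If you also want to handle the \n\n\n\n case, here is a rough sketch of a more general merge. It only uses the four token ids quoted in this issue for Llama3 (198, 271, 1432, 1038). It works on plain ints rather than LLamaToken, and the NewlineMerge class and its Merge method are hypothetical names, not part of llama.cpp or any binding. It assumes the greedy longest-first choice matches what the tokenizer would have produced for the merged run, which I haven't verified beyond the two cases described here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: re-merges runs of newline-only tokens after tokenization.
static class NewlineMerge
{
    // Newline-only token ids quoted in the issue (Llama3): id -> number of '\n' characters.
    static readonly Dictionary<int, int> NewlineCount = new Dictionary<int, int>
    {
        { 198, 1 }, { 271, 2 }, { 1432, 3 }, { 1038, 4 },
    };

    // The same ids ordered longest first, used for greedy re-emission.
    static readonly (int Id, int Count)[] LongestFirst =
    {
        (1038, 4), (1432, 3), (271, 2), (198, 1),
    };

    public static int[] Merge(IReadOnlyList<int> tokens)
    {
        var result = new List<int>(tokens.Count);
        int pending = 0; // newlines accumulated from the current run of newline tokens

        foreach (int token in tokens)
        {
            if (NewlineCount.TryGetValue(token, out int n))
            {
                pending += n; // still inside a newline run, keep counting
                continue;
            }
            Flush(result, ref pending); // run ended, emit it as merged tokens
            result.Add(token);
        }
        Flush(result, ref pending); // handle a run at the very end of the input
        return result.ToArray();
    }

    // Emits the pending newlines using the longest known tokens first, then resets the counter.
    static void Flush(List<int> result, ref int pending)
    {
        foreach (var (id, count) in LongestFirst)
        {
            while (pending >= count)
            {
                result.Add(id);
                pending -= count;
            }
        }
    }
}
```

With this, a [198, 198] run comes out as [271] and a [1432, 198] run comes out as [1038], while every other token passes through unchanged. For other models the id table would need to be replaced with that model's own newline tokens.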
