Split by tokens
Language models have a token limit, which you should not exceed. When splitting text into chunks, it is therefore best to count tokens rather than characters. There are many tokenizers; when counting tokens in your text, use the same tokenizer that the target language model uses.
This chapter shows how LangChain can split text by token count using several different tokenizers.
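As a quick sanity check before splitting, you can count the tokens in a piece of text with tiktoken. A minimal sketch (the model name is only an example; swap in whichever model you actually target):

```python
import tiktoken

# Pick the encoding that matches the target model
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "LangChain splits long documents into smaller chunks."
print(len(enc.encode(text)))  # number of tokens this text consumes
```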
tiktoken
tiktoken is a fast BPE tokenizer open-sourced by OpenAI.
We can use it to estimate the number of tokens used. It is likely to be more accurate for OpenAI models.
- How the text is split: by the characters passed in.
- How the chunk size is measured: by the tiktoken tokenizer.
%pip install --upgrade --quiet langchain-text-splitters tiktoken
# Load the raw text
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import CharacterTextSplitter
# Split the text by token count
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
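If every chunk must stay strictly under the token limit, `RecursiveCharacterTextSplitter` can also be built from a tiktoken encoder; oversized chunks are then split recursively until each one fits. A minimal sketch (the `model_name` value is just an example):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Splits recursively so each chunk stays within the token budget of the chosen encoding
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=100,
    chunk_overlap=0,
)
texts = text_splitter.split_text(state_of_the_union)
```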
We can also load a tiktoken splitter directly.
from langchain_text_splitters import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
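If you need Document objects rather than plain strings (for example, to hand straight to a vector store), every splitter also exposes `create_documents`; a short sketch:

```python
# create_documents wraps each chunk in a Document object
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
```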
spaCy
spaCy is an open-source software library for advanced natural language processing, written in Python and Cython.
Another alternative to NLTK is to use the spaCy tokenizer.
- How the text is split: by the spaCy tokenizer.
- How the chunk size is measured: by number of characters.
#!pip install spacy
# Load the raw text
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import SpacyTextSplitter
# Define the text splitter
text_splitter = SpacyTextSplitter(chunk_size=1000)
# Split the text
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
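By default, `SpacyTextSplitter` loads the `en_core_web_sm` pipeline, which must be downloaded separately (`python -m spacy download en_core_web_sm`). For long texts, a lighter option is spaCy's rule-based sentencizer; a sketch assuming the splitter's `pipeline` keyword:

```python
from langchain_text_splitters import SpacyTextSplitter

# "sentencizer" is a fast, rule-based sentence splitter that avoids loading a full statistical model
text_splitter = SpacyTextSplitter(chunk_size=1000, pipeline="sentencizer")
texts = text_splitter.split_text(state_of_the_union)
```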
SentenceTransformers
SentenceTransformersTokenTextSplitter is a text splitter designed specifically for sentence-transformer models. Its default behavior is to split text into chunks that fit the token window of the sentence-transformer model you intend to use.
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "
count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)
2
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1
# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier
print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")
tokens in text to split: 514
text_chunks = splitter.split_text(text=text_to_split)
print(text_chunks[1])
lorem
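The splitter defaults to the tokenizer of `sentence-transformers/all-mpnet-base-v2`; a different embedding model or a smaller token window can be set explicitly. A sketch with illustrative parameter values:

```python
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# The named model's tokenizer defines how tokens are counted; tokens_per_chunk caps the window
splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    tokens_per_chunk=256,
    chunk_overlap=0,
)
chunks = splitter.split_text(text_to_split)
```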
NLTK
The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) of English, written in the Python programming language.
Rather than just splitting on "\n\n", we can use NLTK to split based on NLTK tokenizers.
- How the text is split: by the NLTK tokenizer.
- How the chunk size is measured: by number of characters.
# pip install nltk
# Load the raw text
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import NLTKTextSplitter
# Define the text splitter
text_splitter = NLTKTextSplitter(chunk_size=1000)
# Split the text
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
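NLTKTextSplitter relies on NLTK's sentence tokenizer, which in turn needs the `punkt` data package. If it is not already present, it can be downloaded once beforehand:

```python
import nltk

# One-time download of the sentence-tokenizer data used by NLTKTextSplitter
nltk.download("punkt")
```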
Hugging Face tokenizer
Hugging Face provides many tokenizers.
Here we use GPT2TokenizerFast from the Hugging Face tokenizers to measure text length in tokens.
- How the text is split: by the characters passed in.
- How the chunk size is measured: by the number of tokens computed by the Hugging Face tokenizer.
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# Load the raw text
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import CharacterTextSplitter
# Define the text splitter, then split the text
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
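The same pattern works with any Hugging Face tokenizer; `AutoTokenizer` resolves the right tokenizer class for a given checkpoint. A minimal sketch (the checkpoint name is only an example):

```python
from transformers import AutoTokenizer
from langchain_text_splitters import CharacterTextSplitter

# AutoTokenizer loads the tokenizer that matches the named checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
```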