Split by tokens
Language models have a token limit, which you should not exceed. When splitting text into chunks, it is therefore best to count tokens rather than characters. There are many tokenizers; when counting tokens in your text, use the same tokenizer that the target language model uses.
This chapter shows how LangChain can split text by token count using several different tokenizers.
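As a quick sanity check before splitting, you can count the tokens in a piece of text with tiktoken. A minimal sketch (the model name is only an example; swap in whichever model you actually target):

```python
import tiktoken

# Pick the encoding that matches the target model
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "LangChain splits long documents into smaller chunks."
print(len(enc.encode(text)))  # number of tokens this text consumes
```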
tiktoken
tiktoken is a fast BPE tokenizer open-sourced by OpenAI.
We can use it to estimate the number of tokens used. It is likely to be more accurate for OpenAI models.
- How the text is split: by the characters passed in.
- How the chunk size is measured: by the tiktoken tokenizer.
%pip install --upgrade --quiet langchain-text-splitters tiktoken
# Load the raw text
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import CharacterTextSplitter
# Split the text by token count
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
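If every chunk must stay strictly under the token limit, `RecursiveCharacterTextSplitter` can also be built from a tiktoken encoder; oversized chunks are then split recursively until each one fits. A minimal sketch (the `model_name` value is just an example):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Splits recursively so each chunk stays within the token budget of the chosen encoding
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=100,
    chunk_overlap=0,
)
texts = text_splitter.split_text(state_of_the_union)
```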
We can also load a tiktoken splitter directly.
from langchain_text_splitters import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
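If you need Document objects rather than plain strings (for example, to hand straight to a vector store), every splitter also exposes `create_documents`; a short sketch:

```python
# create_documents wraps each chunk in a Document object
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
```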
spaCy
spaCy is an open-source software library for advanced natural language processing, written in Python and Cython.
Another alternative to NLTK is to use the spaCy tokenizer.
- How the text is split: by the spaCy tokenizer.
- How the chunk size is measured: by number of characters.
#!pip install spacy
# Load the raw text
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import SpacyTextSplitter
# Define the text splitter
text_splitter = SpacyTextSplitter(chunk_size=1000)
# Split the text
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
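By default, `SpacyTextSplitter` loads the `en_core_web_sm` pipeline, which must be downloaded separately (`python -m spacy download en_core_web_sm`). For long texts, a lighter option is spaCy's rule-based sentencizer; a sketch assuming the splitter's `pipeline` keyword:

```python
from langchain_text_splitters import SpacyTextSplitter

# "sentencizer" is a fast, rule-based sentence splitter that avoids loading a full statistical model
text_splitter = SpacyTextSplitter(chunk_size=1000, pipeline="sentencizer")
texts = text_splitter.split_text(state_of_the_union)
```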
SentenceTransformers
SentenceTransformersTokenTextSplitter is a text splitter designed specifically for sentence-transformer models. Its default behavior is to split text into chunks that fit the token window of the sentence-transformer model you intend to use.
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "
count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)
2
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1
# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier
print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")
tokens in text to split: 514
text_chunks = splitter.split_text(text=text_to_split)
print(text_chunks[1])
lorem
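The splitter defaults to the tokenizer of `sentence-transformers/all-mpnet-base-v2`; a different embedding model or a smaller token window can be set explicitly. A sketch with illustrative parameter values:

```python
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# The named model's tokenizer defines how tokens are counted; tokens_per_chunk caps the window
splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    tokens_per_chunk=256,
    chunk_overlap=0,
)
chunks = splitter.split_text(text_to_split)
```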
NLTK
The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) of English, written in the Python programming language.
Rather than just splitting on "\n\n", we can use NLTK to split based on NLTK tokenizers.
- How the text is split: by the NLTK tokenizer.
- How the chunk size is measured: by number of characters.
# pip install nltk
# Load the raw text
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import NLTKTextSplitter
# Define the text splitter
text_splitter = NLTKTextSplitter(chunk_size=1000)
# Split the text
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
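NLTKTextSplitter relies on NLTK's sentence tokenizer, which in turn needs the `punkt` data package. If it is not already present, it can be downloaded once beforehand:

```python
import nltk

# One-time download of the sentence-tokenizer data used by NLTKTextSplitter
nltk.download("punkt")
```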
Hugging Face tokenizer
Hugging Face provides many tokenizers.
Here we use GPT2TokenizerFast from the Hugging Face tokenizers to measure text length in tokens.
- How the text is split: by the characters passed in.
- How the chunk size is measured: by the number of tokens computed by the Hugging Face tokenizer.
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# Load the raw text
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import CharacterTextSplitter
# Define the text splitter, then split the text
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
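The same pattern works with any Hugging Face tokenizer; `AutoTokenizer` resolves the right tokenizer class for a given checkpoint. A minimal sketch (the checkpoint name is only an example):

```python
from transformers import AutoTokenizer
from langchain_text_splitters import CharacterTextSplitter

# AutoTokenizer loads the tokenizer that matches the named checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
```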