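Split by character
This is the simplest text-splitting method in LangChain. It splits the text on a given separator (by default "\n\n") and measures chunk length by the number of characters.
- How the text is split: on a single separator (a character sequence such as "\n\n")
- How the chunk size is measured: by number of characters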
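The two points above can be illustrated without the library. The following is a minimal sketch, not LangChain's actual implementation: `split_by_separator` is a hypothetical helper that splits on the separator, then greedily packs the pieces into chunks whose length, measured with `len`, stays within `chunk_size`. It ignores chunk overlap for simplicity.

```python
def split_by_separator(text: str, separator: str = "\n\n", chunk_size: int = 1000) -> list[str]:
    """Hypothetical helper: split on `separator`, then pack pieces into
    chunks of at most `chunk_size` characters (no overlap)."""
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it is a single oversized piece that
            # cannot be split any further on this separator.
            current = candidate
        else:
            # The piece does not fit: close the current chunk, start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
chunks = split_by_separator(sample, chunk_size=10)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

With a `chunk_size` of 10, the four 4-character paragraphs are packed two per chunk, because two paragraphs plus the 2-character separator come to exactly 10 characters.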
Install the package
%pip install -qU langchain-text-splitters
Example
# Read the contents of the document to be split
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()
from langchain_text_splitters import CharacterTextSplitter
# Define the text splitter
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
# Split the text and print the first chunk
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
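page_content='Madam Speaker, Madam Vice President [text omitted]....' lookup_str='' metadata={} lookup_index=0
Here is an example of passing metadata along with the documents. Note that the metadata is split along with the documents: each chunk carries the metadata of the document it came from.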
metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents([state_of_the_union, state_of_the_union], metadatas=metadatas)
print(documents[0])
# Output
page_content='..[text omitted]..' lookup_str='' metadata={'document': 1} lookup_index=0
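To make the metadata behaviour concrete, here is a small library-free sketch. It is an assumption-laden illustration rather than LangChain's code: `Chunk` and `create_chunks` are hypothetical names, and the splitting is reduced to a plain `str.split` with no size limit. The only point it shows is that every chunk receives a copy of the metadata belonging to its source text.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a Document: chunk text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def create_chunks(texts, metadatas=None, separator="\n\n"):
    """Split each text on `separator` and attach a copy of that text's
    metadata to every resulting chunk."""
    metadatas = metadatas or [{} for _ in texts]
    chunks = []
    for text, metadata in zip(texts, metadatas):
        for piece in text.split(separator):
            if piece:
                # Copy so that editing one chunk's metadata leaves the others alone.
                chunks.append(Chunk(piece, copy.deepcopy(metadata)))
    return chunks


docs = create_chunks(
    ["first\n\nsecond", "third"],
    metadatas=[{"document": 1}, {"document": 2}],
)
for doc in docs:
    print(doc)
```

Both chunks of the first text end up with `{'document': 1}`, and the single chunk of the second text with `{'document': 2}`.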
# split_text returns plain strings rather than Document objects; take the first chunk
text_splitter.split_text(state_of_the_union)[0]