Chonky is a Python library that uses a fine-tuned Transformer model to split text into meaningful semantic chunks. It can be used in RAG systems.

Installation

pip install chonky

Usage:

from chonky import TextSplitter
# On the first run, this downloads the Transformer model
splitter = TextSplitter(device="cpu")
text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""
for chunk in splitter(text):
    print(chunk)
    print("--")
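
In a RAG pipeline, chunks usually need a stable index and a position in the source text before they are embedded or stored. The sketch below is one way to do that. It assumes only what the usage above shows: the splitter is a callable that takes a string and yields string chunks. `ChunkRecord`, `collect_chunks`, and the stand-in `sentence_splitter` are hypothetical names used for illustration and are not part of chonky.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ChunkRecord:
    index: int  # position of the chunk in the output sequence
    start: int  # character offset in the source text, or -1 if not found
    text: str   # the chunk itself, with surrounding whitespace removed


def collect_chunks(
    splitter: Callable[[str], Iterable[str]], text: str
) -> List[ChunkRecord]:
    """Run any chunk-yielding splitter and record where each chunk starts."""
    records: List[ChunkRecord] = []
    cursor = 0  # search from the end of the previous chunk to keep order
    for chunk in splitter(text):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty pieces
        start = text.find(chunk, cursor)
        if start != -1:
            cursor = start + len(chunk)
        records.append(ChunkRecord(index=len(records), start=start, text=chunk))
    return records


def sentence_splitter(text: str) -> Iterable[str]:
    """Hypothetical stand-in for a real splitter: cuts on '. '."""
    for part in text.split(". "):
        yield part


if __name__ == "__main__":
    for record in collect_chunks(sentence_splitter, "One. Two. Three."):
        print(record)
```

Passing the `splitter` object from the usage above in place of `sentence_splitter` should work the same way, provided it yields plain strings.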

Transformer model

mirth/chonky_distilbert_base_uncased_1
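
The model name suggests a DistilBERT checkpoint fine-tuned for this task. One plausible way such a model could drive splitting is to predict boundary positions in the text and cut there. This is an assumption made for illustration, not a description of chonky's internals. The sketch below covers only the cutting step, using a hand-picked offset in place of model output; `split_at_offsets` is a hypothetical helper.

```python
from typing import Iterable, List


def split_at_offsets(text: str, offsets: Iterable[int]) -> List[str]:
    """Cut text at the given character offsets, dropping empty pieces."""
    chunks: List[str] = []
    previous = 0
    for offset in sorted(set(offsets)):
        if not 0 < offset < len(text):
            continue  # ignore offsets that fall outside the text
        piece = text[previous:offset].strip()
        if piece:
            chunks.append(piece)
        previous = offset
    tail = text[previous:].strip()
    if tail:
        chunks.append(tail)
    return chunks


if __name__ == "__main__":
    sample = "My stories were awful. The first programs ran on the IBM 1401."
    # Hypothetical boundary: right after the first sentence.
    boundary = sample.index(".") + 1
    for chunk in split_at_offsets(sample, [boundary]):
        print(chunk)
        print("--")
```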

About

A fully neural approach to text chunking.
