Show HN: Chonky - a neural approach to semantic text chunking
**Chonky** is a Python library that uses a fine-tuned Transformer model to split text into semantically coherent chunks, which makes it useful for RAG systems. Installation is a single `pip install chonky`. On first use, `TextSplitter` downloads the Transformer model. The code example below shows how to chunk text with the library.
Chonky is a Python library that intelligently segments text into meaningful semantic chunks using a fine-tuned Transformer model. This library can be used in RAG systems.
Installation

```
pip install chonky
```
Usage:

```python
from chonky import TextSplitter

# On the first run, this will download the Transformer model
splitter = TextSplitter(device="cpu")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

for chunk in splitter(text):
    print(chunk)
    print("--")
```
Transformer model

mirth/chonky_distilbert_base_uncased_1
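If you want to experiment with the underlying checkpoint rather than go through `TextSplitter`, one way is the plain `transformers` token-classification pipeline. This is a sketch under the assumption that the checkpoint is a standard token-classification model on the Hugging Face Hub; Chonky's own `TextSplitter` remains the supported interface.

```python
# Sketch: loading the checkpoint with plain transformers (assumption, not Chonky's documented API).
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "mirth/chonky_distilbert_base_uncased_1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

token_classifier = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge sub-word tokens into spans
    device=-1,                      # CPU
)

# Presumably the model tags positions where one semantic chunk should end
# and the next begin; inspect the raw predictions to see its label scheme.
predictions = token_classifier(text)  # `text` as defined in the usage example
print(predictions)
```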
About

A fully neural approach to text chunking.