RAG from First Principles
After importing unstructured data, we need to perform text chunking, also known as text splitting. This process involves dividing long texts into appropriately sized fragments to facilitate embedding, indexing, and storage, and to improve retrieval accuracy.

Figure 2.1: Text chunking optimization methods with flowcharts
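To make the idea of dividing a long text into appropriately sized fragments concrete, here is a minimal sketch of the simplest strategy: fixed-size chunks measured in characters, with a small overlap so that a sentence cut at a boundary still appears whole in one of its neighbours. The function name `chunk_text` and the parameters `chunk_size` and `overlap` are illustrative assumptions, not part of any particular library, and real pipelines often split on sentence or paragraph boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly.

    Hypothetical helper for illustration; sizes are counted in characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made up only of overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny example: windows of 4 characters, each sharing 1 character with the next.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

A smaller `chunk_size` tends to give more precise retrieval units but less context per unit, while a larger one does the opposite; suitable values depend on the documents and on the embedding model.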
Lewis: Why do we need this step? The reason isn’t hard to understand, right?
Alex: Exactly. Retrieval results are made up of individual units, which we call chunks, and the size of these units matters a great deal. Take Journey to the West as an example: the novel is nearly a million words long. If a user asks what the last of the Eighty-One Tribulations is and we simply pass the entire book to the large model as a reference, this isn’t necessarily a retrieval failure, but the information provided is too broad and not precise enough.