RAG from First Principles
As mentioned earlier, LangChain provides the UnstructuredLoader, which integrates the Unstructured tool. When loading documents in formats such as PDF, Word, or HTML, it automatically extracts the text and splits it into document elements. These elements are then combined or further split, according to the specified chunking strategy and maximum character limit, to produce text chunks that better meet specific requirements.
There are two main chunking strategies in the Unstructured tool: Basic and By Title. Both of these strategies are based on recognizing the document’s semantic structure, rather than simply splitting text by blank lines or newline characters.
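To make the difference between the two strategies concrete, here is a minimal pure-Python sketch. It is not the Unstructured implementation: the `Element` class and the `chunk_elements` function are hypothetical names invented for this illustration, and the joining and splitting rules are simplified assumptions. Only the parameter names `max_characters` and `new_after_n_chars` mirror the options discussed in this section.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a parsed document element."""
    category: str  # for example "Title" or "NarrativeText"
    text: str


def chunk_elements(elements, max_characters=500, new_after_n_chars=None, by_title=False):
    """Simplified illustration of element-based chunking (an assumption, not library code).

    - Elements are packed into a chunk until adding one would exceed max_characters.
    - A chunk is closed early once it reaches the soft limit new_after_n_chars.
    - With by_title=True, every Title element starts a new chunk.
    - A single element longer than max_characters is cut into fixed-size pieces.
    """
    soft_limit = max_characters if new_after_n_chars is None else min(new_after_n_chars, max_characters)
    separator = "\n\n"
    chunks = []
    current = []  # texts collected for the chunk being built

    def current_length():
        return len(separator.join(current))

    def flush():
        if current:
            chunks.append(separator.join(current))
            current.clear()

    for element in elements:
        # By Title: a heading always closes the previous chunk.
        if by_title and element.category == "Title":
            flush()

        text = element.text

        # Oversized element: emit it alone, cut at the hard limit.
        if len(text) > max_characters:
            flush()
            for start in range(0, len(text), max_characters):
                chunks.append(text[start:start + max_characters])
            continue

        # Hard limit: start a new chunk if this element would not fit.
        if current and current_length() + len(separator) + len(text) > max_characters:
            flush()
        current.append(text)

        # Soft limit: close the chunk once it is "full enough".
        if current_length() >= soft_limit:
            flush()

    flush()
    return chunks


sample = [
    Element("Title", "Intro"),
    Element("NarrativeText", "A" * 30),
    Element("Title", "Methods"),
    Element("NarrativeText", "B" * 30),
]

basic_chunks = chunk_elements(sample, max_characters=100)
title_chunks = chunk_elements(sample, max_characters=100, by_title=True)
print(len(basic_chunks), len(title_chunks))
```

With this sample, the basic call packs all four elements into one chunk because they fit under the hard limit, while the by-title call closes the chunk at the "Methods" heading so each section stays separate.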
max_characters) or a soft limit (new_after_n_chars). If a single element (such as a particularly...