# Core Concepts

# Chunking

Document chunking is the process of dividing large files or datasets into smaller, more manageable pieces for processing. This technique improves both the performance and cost-efficiency of Generative AI platforms. Large Language Models (LLMs) have context window limitations, meaning they can only process a set amount of text at once. Chunking works around this by breaking large documents into smaller, independent units that can be processed in parallel, which speeds up overall processing for large datasets and reduces costs. For LLMs, effective chunking leads to better information retrieval, faster processing, and a deeper understanding of the content.
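As a concrete illustration, a minimal fixed-size chunker with character overlap might look like the sketch below. The chunk size and overlap values are arbitrary examples; production chunkers typically split on sentence or paragraph boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly,
    so a sentence cut at one boundary still appears whole in a neighbor."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

document = "Interest rates shape home loan pricing. " * 200  # stand-in long text
chunks = chunk_text(document)
print(len(chunks), "chunks, first chunk length:", len(chunks[0]))
```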

# Embedding

A vector embedding model is a specialized tool that helps a system comprehend the semantic meaning of words and sentences. It works by converting textual data into numerical vector representations, which computers can efficiently process and analyze. The embedding model translates the generated chunks into these numerical vectors, making the content usable by LLMs for generating relevant output.
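A minimal sketch of this conversion, using the open-source sentence-transformers library as one example of an embedding model (the model name and vector size are illustrative, not a statement about which models the platform uses):

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Example model; any text-embedding model plays the same role.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Banks must link home loan interest rates to the benchmark policy rate.",
    "Apply today for attractive home loan rates starting at 5.5%.",
]

# encode() returns one fixed-length numerical vector per input text.
vectors = model.encode(chunks)
print(vectors.shape)  # (2, 384) -- this model produces 384-dimensional vectors
```

Texts with similar meanings end up with vectors that point in similar directions, which is what makes semantic search over chunks possible.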

# Metadata

Metadata refers to "data about data." It is a systematic and structured method for communicating essential information about content. Its significance lies in its ability to establish a consistent mechanism and terminology, which in turn greatly facilitates the discovery, usage, and preservation of that content.

For files, metadata can encompass a wide range of descriptive attributes, including but not limited to:

  • Title: The official name of the document.

  • Author: The creator(s) of the document.

  • Publication Date: The date when the document was published or created.

  • Abstract: A brief summary of the document's content. (e.g., the impact of interest rates on home loans)

  • Keywords: Important terms that describe the document's subject matter, aiding in search and classification. (e.g., Home Loan, Interest rates, Credit score)

  • Citations: References to other works, indicating relationships and sources.

By providing these descriptive elements, metadata makes content more searchable, understandable, and manageable throughout its lifecycle.
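In code, such attributes are often carried as a simple structured record attached to each file or chunk. The sketch below is illustrative; the field names and values mirror the list above but are not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DocumentMetadata:
    """Descriptive attributes stored alongside a document's content."""
    title: str
    author: str
    publication_date: date
    abstract: str = ""
    keywords: list[str] = field(default_factory=list)
    citations: list[str] = field(default_factory=list)

meta = DocumentMetadata(
    title="Impact of Interest Rates on Home Loans",
    author="Jane Doe",
    publication_date=date(2025, 8, 30),
    abstract="How interest rates affect home loan pricing.",
    keywords=["Home Loan", "Interest rates", "Credit score"],
)
print(meta.keywords)
```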

# Tagging

Tagging in document retrieval involves assigning descriptive keywords or labels to files based on their content or user-defined categories. This practice helps organize and retrieve information effectively, speeding up searches and improving the efficiency and performance of retrieval.
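One simple way to picture this is an inverted index from tags to documents, as in the hypothetical sketch below (the tag names and document IDs are made up for illustration):

```python
from collections import defaultdict

# Maps a normalized tag to the set of document IDs carrying that tag.
tag_index: dict[str, set[str]] = defaultdict(set)

def tag_document(doc_id: str, tags: list[str]) -> None:
    """Assign descriptive tags to a document."""
    for tag in tags:
        tag_index[tag.lower()].add(doc_id)

def find_by_tag(tag: str) -> set[str]:
    """Retrieve every document carrying the given tag."""
    return tag_index.get(tag.lower(), set())

tag_document("doc-001", ["Home Loan", "Interest rates"])
tag_document("doc-002", ["Credit score"])
print(find_by_tag("home loan"))  # {'doc-001'}
```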

# Top K

Top K refers to the selection of the K most relevant or highest-scoring chunks of data from a larger collection, based on a query or specific criteria. It is a method for identifying and presenting a limited, prioritized subset of chunks that are most likely to contain the answer or context relevant to a user's request, especially when dealing with long files or vast knowledge bases that have been broken down into smaller, manageable chunks.

When you set Top K=10, you are instructing a system to retrieve only the 10 most relevant or highest-scoring items from a larger collection in response to a query.
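Conceptually, Top K selection is just scoring and truncation, as in this small sketch (the chunk IDs and scores are made-up placeholders for real retriever output):

```python
# (chunk_id, relevance score) pairs as a retriever might produce them.
scored_chunks = [("c1", 0.91), ("c2", 0.42), ("c3", 0.88), ("c4", 0.67)]

def top_k(scored: list[tuple[str, float]], k: int) -> list[tuple[str, float]]:
    """Return the k highest-scoring chunks, best first."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

print(top_k(scored_chunks, k=2))  # [('c1', 0.91), ('c3', 0.88)]
```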

# Reranker

A reranker is a component in an advanced search system that refines the initial list of results. The retrieval process typically happens in two stages:

  1. Initial Retrieval: A fast and efficient retriever scans the entire knowledge base to find a broad set of potentially relevant chunks, known as the Top K.

  2. Reranking: The reranker then takes this Top K list and uses a more powerful, computationally intensive model to re-evaluate and re-order the chunks based on a much deeper analysis of their relevance to the query.

The final, smaller, and highly relevant list of chunks produced by the reranker is known as the Top N. Because reranking adds an extra step, systems often allow you to enable or disable it. When enabled, the Top N chunks are used as the final, high-quality context to generate the most accurate answer.
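The two stages can be sketched as two rounds of score-and-truncate. In the runnable sketch below, a crude word-overlap score stands in for both stages; a real system would use a fast vector index for stage 1 and a far more powerful reranking model for stage 2:

```python
def word_overlap(query: str, chunk: str) -> float:
    # Stand-in score: fraction of query words found in the chunk.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def top_scored(query: str, chunks: list[str], k: int, score) -> list[str]:
    """Score every chunk against the query and keep the k best."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

query = "home loan interest rate regulations"
knowledge_base = [
    "Regulations require banks to link home loan interest rates to the policy rate.",
    "Apply today for home loans at attractive interest rates.",
    "Credit scores affect loan approval decisions.",
]

# Stage 1: fast retrieval narrows the whole knowledge base to the Top K.
top_k_chunks = top_scored(query, knowledge_base, k=2, score=word_overlap)

# Stage 2: the reranker re-scores only those candidates and keeps the Top N.
top_n_chunks = top_scored(query, top_k_chunks, k=1, score=word_overlap)
print(top_n_chunks)
```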

# Reranking Models

A reranking model is the specific engine that powers the reranker. These models are designed for high accuracy and perform a sophisticated, contextual analysis of the query against each of the Top K chunks. This allows them to understand nuance and relevance far better than the initial, faster retrieval model.

Purple Fabric offers a choice of different reranking models. These are powerful, pre-trained models designed to perform well across a wide variety of topics and use cases. The ability to choose a reranking model allows users to balance factors like cost, speed, and the required level of accuracy for their specific application.
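As an illustration of what such a model looks like in code, the sketch below uses the open-source sentence-transformers CrossEncoder as a stand-in; the model name is an example only and not necessarily one of the options Purple Fabric offers:

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

# Example reranking model; cross-encoders read query and chunk together.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What are the latest regulations on home loan interest rates?"
top_k_chunks = [
    "Banks must link home loan interest rates to the benchmark policy rate.",
    "Apply today for home loans at attractive rates starting at 5.5%!",
]

# predict() returns one relevance score per (query, chunk) pair.
scores = model.predict([(query, chunk) for chunk in top_k_chunks])
for chunk, score in sorted(zip(top_k_chunks, scores), key=lambda p: p[1], reverse=True):
    print(round(float(score), 3), chunk)
```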

# Similarity Score

A similarity score in a retrieval test is a numerical value that shows how well a piece of information (called a chunk) matches a question. The system compares each chunk’s content to the question and gives it a score. Chunks are then arranged from highest to lowest score.

For Retrieval-Augmented Generation (RAG) agents, the chunks with the highest scores are sent to an LLM. The LLM uses these top chunks as reliable, up-to-date references to give clear and accurate answers, reducing the chance of hallucination.

The similarity score ranges from 0 to 100. A higher score means the chunk is a better match for the question.
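One common way to compute such a score is cosine similarity between the query's embedding and the chunk's embedding, rescaled to 0 to 100; the exact formula varies by system, so treat this as an illustrative sketch with toy vectors:

```python
import math

def similarity_score(vec_a: list[float], vec_b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, scaled to 0-100."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    cosine = dot / (norm_a * norm_b)   # ranges from -1 to 1
    return max(cosine, 0.0) * 100      # clamp negatives, scale to 0-100

# Toy vectors standing in for real embeddings of a query and two chunks.
query_vec   = [0.9, 0.1, 0.3]
chunk_1_vec = [0.8, 0.2, 0.4]   # similar direction -> high score
chunk_2_vec = [0.1, 0.9, 0.1]   # different direction -> lower score
print(round(similarity_score(query_vec, chunk_1_vec), 1))
print(round(similarity_score(query_vec, chunk_2_vec), 1))
```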

**Example**

Query: “What are the latest regulations on home loan interest rates?”

Chunk 1 (High Score – 95)
"The Central Financial Authority issued a guideline on August 30, 2025, stating that banks must link home loan interest rates to the benchmark policy rate, reviewed every quarter to ensure transparency and fair pricing for customers."

Why this scores high:

  • Directly answers the query about regulations.

  • Contains specific information: issuing body, date, and rule details.

  • Entire content is focused on compliance, not marketing.

Chunk 2 (Lower Score – 80)
"Our bank now offers home loans at attractive rates starting at 5.5% with zero processing fees and a quick approval process. Apply today to make your dream home a reality!"

Why this scores lower:

  • Talks about offers and marketing promotions, not regulations.

  • Even though it mentions “home loan interest rates,” it does not explain rules or guidelines.

  • Words like “quick approval” and “dream home” are unrelated to the intent of the query.