
📄 2: Build Your Knowledge Base

Welcome to the second step of your Testus Patronus journey! In this exercise, you'll learn how to build a Knowledge Base that your AI assistant can search through using retrieval-augmented generation (RAG).

This is where your assistant starts to understand your testing documentation, requirements, and historical issues.


📌 What You'll Learn

  • How embeddings turn text into searchable vectors
  • How chunk size and overlap affect retrieval quality
  • How ranking and top-k retrieval pick the context sent to the LLM
  • How to ingest documents into Dify, both via the UI and via the API

🧠 Embeddings

When a document is ingested, it's transformed into a vector representation using an embedding model. These vectors help the AI understand meaning and similarity between pieces of text.

Embeddings let the model know that "bug report" and "defect ticket" might refer to the same concept.

In this tutorial, we use the text-embedding-3-large model from Azure OpenAI.
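
To make this concrete, here's a minimal sketch of embedding two phrases and comparing them with cosine similarity. The endpoint, key, API version, and deployment name are placeholders, not values from this tutorial:

```python
import numpy as np
from openai import AzureOpenAI  # pip install openai numpy

# Placeholder credentials -- substitute your own Azure OpenAI values.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-API-KEY",
    api_version="2024-02-01",
)

def embed(text: str) -> np.ndarray:
    # Assumes your deployment is named after the model it serves.
    resp = client.embeddings.create(model="text-embedding-3-large", input=text)
    return np.array(resp.data[0].embedding)

a = embed("bug report")
b = embed("defect ticket")

# Cosine similarity: close to 1.0 means "same meaning", near 0 means unrelated.
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```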


🍰 Chunking: Breaking Text into Pieces

📏 Chunk Size

Your documents are too long to fit directly into the model's context window, so they are divided into chunks.

| Chunk Size | Pros | Cons |
| --- | --- | --- |
| Small chunks (200–300 tokens) | 🎯 High precision<br/>🔍 Matches narrow queries well | 🧩 May lose broader context |
| Large chunks (800–1000 tokens) | 📚 Preserve more context<br/>🔁 Fewer retrievals needed | 🧊 More noise<br/>🎯 Lower match precision |
*(Image: Chunk Size Tradeoff)*

Think of it like reading pages of a book — too many pages at once and the meaning blurs. Too few and you lose the plot.
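
Note that the sizes above are in tokens, not characters. To check how large a chunk really is, you can count tokens with the tiktoken library (an assumption here; use whatever tokenizer matches your embedding model):

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by recent OpenAI embedding models.
enc = tiktoken.get_encoding("cl100k_base")

chunk = "The GET /rest/api/2/project endpoint should support 'expand'."
print(len(enc.encode(chunk)), "tokens")
```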

🔄 Chunk Overlap

Overlap ensures that context near the boundaries of chunks isn't lost.

*(Image: Chunk Overlap Visual)*

For example, if two chunks overlap by 100 tokens, the second chunk repeats the last 100 tokens of the first. This helps the model "remember" what came before — improving answers that need continuity.
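
Putting size and overlap together, here is a minimal chunking sketch (the function is illustrative, not Dify's implementation):

```python
def chunk_tokens(tokens: list[int], size: int = 800, overlap: int = 100) -> list[list[int]]:
    """Split a token list into chunks of `size` tokens, where each chunk
    repeats the last `overlap` tokens of the previous one."""
    chunks = []
    step = size - overlap  # advance by size minus overlap each time
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last chunk reached the end
            break
    return chunks
```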


🎯 Ranking and top-k Retrieval

Once you have all your documents chunked and embedded, you'll want to find the most relevant ones to a query.

This is where ranking comes in:

  • The retriever uses cosine similarity (or a similar metric) to score each chunk against the query.
  • It then picks the top-k chunks (usually 3–5) with the highest scores.
  • These are sent to the LLM for final answer generation.

Getting the top-k right can dramatically improve performance!
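
Here's a minimal sketch of that ranking step, assuming chunks and the query are already embedded as vectors (e.g. with the embed helper from the embeddings section above):

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: list[np.ndarray],
                 chunks: list[str], k: int = 3) -> list[str]:
    """Score each chunk by cosine similarity to the query and return the k best."""
    scores = [
        np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v))
        for v in chunk_vecs
    ]
    ranked = np.argsort(scores)[::-1][:k]  # indices of the highest-scoring chunks
    return [chunks[i] for i in ranked]
```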

🛠️ Step-by-Step: Ingesting Documents

📝 Manual Upload via Dify UI

  1. Go to the Knowledge tab in your Dify instance.
    *(Screenshot: Knowledge Landing)*
  2. Click Create Knowledge.
Get your Jira Dataset sample in .txt format:

⬇️ Download Jira Dataset Files

  3. Unzip the dataset.
  4. Upload WEBHOOKS_JiraEcosystem_issues.txt and WEBHOOKS_JiraEcosystem_issues_SUMMARY.txt to the Knowledge Base.
  5. Click Next.
  6. Configure:

    • Chunk size: Start with 2000 characters.
    • Overlap: Try 200 characters (10% of chunk size).
    *(Screenshot: Knowledge Chunk Settings)*
  7. Select the High Quality index method and the text-embedding-3-large embedding model.

    *(Screenshot: Knowledge Retrieval Settings High)*
  8. Click Save & Process. You can now inspect your document chunks: their sizes, content, and so on. What do you think?

    *(Screenshot: Knowledge Documents)*

📦 Example Issue (JSON)

```json
{
  "summary": "Support 'expand' param on GET /rest/api/2/project",
  "description": "The GET /rest/api/2/project endpoint should support 'expand' to allow additional project details.",
  "issueType": "New Feature",
  "priority": "Medium"
}
```

🧑‍💻 Advanced: Ingesting via API

Why is API ingestion better?
  • ✅ Supports more complex documents and formats.
  • 🔁 Enables automation and batch processing.
  • 📦 You can attach metadata (e.g. issue_key) for filtering or analytics.
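
As a starting point, here's a minimal sketch that pushes the example issue above into a Knowledge Base via Dify's knowledge API. The base URL, dataset ID, API key, and issue_key are placeholders, and you should verify the create-by-text endpoint and payload against the API reference in your own Dify instance:

```python
import json
import requests  # pip install requests

# Placeholders -- substitute values from your own Dify instance.
DIFY_BASE_URL = "https://your-dify-instance/v1"
DATASET_ID = "your-dataset-id"
API_KEY = "your-knowledge-api-key"

issue = {
    "summary": "Support 'expand' param on GET /rest/api/2/project",
    "description": "The GET /rest/api/2/project endpoint should support "
                   "'expand' to allow additional project details.",
    "issueType": "New Feature",
    "priority": "Medium",
    "issue_key": "JRA-12345",  # hypothetical key, kept as metadata for filtering
}

resp = requests.post(
    f"{DIFY_BASE_URL}/datasets/{DATASET_ID}/document/create-by-text",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "name": issue["summary"],
        "text": json.dumps(issue, indent=2),
        "indexing_technique": "high_quality",  # matches the UI's High Quality setting
        "process_rule": {"mode": "automatic"},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

Looping this over every issue in the dataset gives you the batch ingestion that the UI upload can't easily do.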