
📄 2: Build Your Knowledge Base

Welcome to the second step of your Testus Patronus journey! In this exercise, you'll learn how to build a Knowledge Base that your AI assistant can search through using retrieval-augmented generation (RAG).

This is where your assistant starts to understand your testing documentation, requirements, and historical issues.


📌 What You'll Learn

  • How embeddings turn text into searchable vectors
  • How chunk size and overlap affect retrieval quality
  • How ranking and top-k retrieval pick the context sent to the LLM
  • How to ingest documents into Dify, both via the UI and via the API

🧠 Embeddings

When a document is ingested, it's transformed into a vector representation using an embedding model. These vectors help the AI understand meaning and similarity between pieces of text.

Embeddings let the model know that "bug report" and "defect ticket" might refer to the same concept.

In this tutorial, we use the text-embedding-3-large model from Azure OpenAI.
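
To make this concrete, here's a minimal sketch of embedding two phrases and comparing them with cosine similarity. The endpoint, key, API version, and deployment name are placeholders, not values from this tutorial:

```python
import numpy as np
from openai import AzureOpenAI  # pip install openai numpy

# Placeholder credentials -- substitute your own Azure OpenAI values.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-API-KEY",
    api_version="2024-02-01",
)

def embed(text: str) -> np.ndarray:
    # Assumes your deployment is named after the model it serves.
    resp = client.embeddings.create(model="text-embedding-3-large", input=text)
    return np.array(resp.data[0].embedding)

a = embed("bug report")
b = embed("defect ticket")

# Cosine similarity: close to 1.0 means "same meaning", near 0 means unrelated.
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```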


🍰 Chunking: Breaking Text into Pieces

📏 Chunk Size

Your documents are too long to fit directly into the model's context window, so they are divided into chunks.

| Chunk Size | Pros | Cons |
| --- | --- | --- |
| Small chunks (200–300 tokens) | 🎯 High precision<br/>🔍 Matches narrow queries well | 🧩 May lose broader context |
| Large chunks (800–1000 tokens) | 📚 Preserve more context<br/>🔁 Fewer retrievals needed | 🧊 More noise<br/>🎯 Lower match precision |
*(Image: Chunk Size Tradeoff)*

Think of it like reading pages of a book — too many pages at once and the meaning blurs. Too few and you lose the plot.
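
Note that the sizes above are in tokens, not characters. To check how large a chunk really is, you can count tokens with the tiktoken library (an assumption here; use whatever tokenizer matches your embedding model):

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by recent OpenAI embedding models.
enc = tiktoken.get_encoding("cl100k_base")

chunk = "The GET /rest/api/2/project endpoint should support 'expand'."
print(len(enc.encode(chunk)), "tokens")
```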

🔄 Chunk Overlap

Overlap ensures that context near the boundaries of chunks isn't lost.

*(Image: Chunk Overlap Visual)*

For example, if two chunks overlap by 100 tokens, the second chunk repeats the last 100 tokens of the first. This helps the model "remember" what came before — improving answers that need continuity.
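
Putting size and overlap together, here is a minimal chunking sketch (the function is illustrative, not Dify's implementation):

```python
def chunk_tokens(tokens: list[int], size: int = 800, overlap: int = 100) -> list[list[int]]:
    """Split a token list into chunks of `size` tokens, where each chunk
    repeats the last `overlap` tokens of the previous one."""
    chunks = []
    step = size - overlap  # advance by size minus overlap each time
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last chunk reached the end
            break
    return chunks
```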


🎯 Ranking and top-k Retrieval

Once you have all your documents chunked and embedded, you'll want to find the most relevant ones to a query.

This is where ranking comes in:

  • The retriever uses cosine similarity (or a similar metric) to score each chunk against the query.
  • It then picks the top-k chunks (usually 3–5) with the highest scores.
  • These are sent to the LLM for final answer generation.

Getting the top-k right can dramatically improve performance!
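
Here's a minimal sketch of that ranking step, assuming chunks and the query are already embedded as vectors (e.g. with the embed helper from the embeddings section above):

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: list[np.ndarray],
                 chunks: list[str], k: int = 3) -> list[str]:
    """Score each chunk by cosine similarity to the query and return the k best."""
    scores = [
        np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v))
        for v in chunk_vecs
    ]
    ranked = np.argsort(scores)[::-1][:k]  # indices of the highest-scoring chunks
    return [chunks[i] for i in ranked]
```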

🛠️ Step-by-Step: Ingesting Documents

📝 Manual Upload via Dify UI

  1. Go to the Knowledge tab in your Dify instance.
    *(Screenshot: Knowledge Landing)*
  2. Click Create Knowledge.
Get your Jira Dataset sample in .txt format:

⬇️ Download Jira Dataset Files

  3. Unzip the dataset.
  4. Upload WEBHOOKS_JiraEcosystem_issues.txt and WEBHOOKS_JiraEcosystem_issues_SUMMARY.txt to the Knowledge Base.
  5. Click Next.
  6. Configure:

    • Chunk size: Start with 2000 characters.
    • Overlap: Try 200 characters (10% of chunk size).
    *(Screenshot: Knowledge Chunk Settings)*
  7. Select the High Quality index method and the text-embedding-3-large embedding model.

    *(Screenshot: Knowledge Retrieval Settings High)*
  8. Click Save & Process. You can now inspect your document chunks: their sizes, content, and so on. What do you think?

    *(Screenshot: Knowledge Documents)*

📦 Example Issue (JSON)

```json
{
  "summary": "Support 'expand' param on GET /rest/api/2/project",
  "description": "The GET /rest/api/2/project endpoint should support 'expand' to allow additional project details.",
  "issueType": "New Feature",
  "priority": "Medium"
}
```

🧑‍💻 Advanced: Ingesting via API

Why is API ingestion better?
  • ✅ Supports more complex documents and formats.
  • 🔁 Enables automation and batch processing.
  • 📦 You can attach metadata (e.g. issue_key) for filtering or analytics.
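
As a starting point, here's a minimal sketch that pushes the example issue above into a Knowledge Base via Dify's knowledge API. The base URL, dataset ID, API key, and issue_key are placeholders, and you should verify the create-by-text endpoint and payload against the API reference in your own Dify instance:

```python
import json
import requests  # pip install requests

# Placeholders -- substitute values from your own Dify instance.
DIFY_BASE_URL = "https://your-dify-instance/v1"
DATASET_ID = "your-dataset-id"
API_KEY = "your-knowledge-api-key"

issue = {
    "summary": "Support 'expand' param on GET /rest/api/2/project",
    "description": "The GET /rest/api/2/project endpoint should support "
                   "'expand' to allow additional project details.",
    "issueType": "New Feature",
    "priority": "Medium",
    "issue_key": "JRA-12345",  # hypothetical key, kept as metadata for filtering
}

resp = requests.post(
    f"{DIFY_BASE_URL}/datasets/{DATASET_ID}/document/create-by-text",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "name": issue["summary"],
        "text": json.dumps(issue, indent=2),
        "indexing_technique": "high_quality",  # matches the UI's High Quality setting
        "process_rule": {"mode": "automatic"},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

Looping this over every issue in the dataset gives you the batch ingestion that the UI upload can't easily do.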