
📚 About the Dataset

In this workshop, you'll be working with a curated and preprocessed subset of the Public Jira Dataset, originally published by researchers from the University of Hamburg:

Montgomery, Lloyd • Lüders, Clara • Prof. Dr. Walid Maalej
Public Jira Dataset Project (Zenodo)

This dataset contains real-world Jira issue data from over 1,800 projects and more than 2.7 million issues. We've distilled this massive collection into a focused and workshop-friendly format.

Dataset Overview
Sample Dataset Structure

✨ Workshop Dataset

To make the dataset manageable and practical for hands-on Retrieval-Augmented Generation (RAG) tasks, we extracted relevant projects, filtered noise, and generated summaries with GPT-3.5 Turbo.
You'll be working primarily with data from the JiraEcosystem, including modules like REST APIs, Webhooks, and Voting systems.

📦 Included Files

| File | Description |
| --- | --- |
| REST_JiraEcosystem_issues.json | Raw cleaned issues related to REST API |
| REST_JiraEcosystem_SUMMARY.json | GPT-generated summaries of REST-related issues |
| TOC_JiraEcosystem_issues.json | Table of Contents module issues |
| VOTE_JiraEcosystem_issues.json | Voting component-related issues |
| WEBHOOKS_JiraEcosystem_issues.json | Event/webhook handling issues |
| *_SUMMARY.json | LLM-generated summaries per module |
| full_quidditch_jira_issues.json | Simulated internal Jira project for demo/testing |
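
The file names above follow a `<MODULE>_JiraEcosystem_issues.json` pattern, so each issues file can be matched with its summary file programmatically. The following is a minimal sketch, not part of the workshop scripts: the function name `pair_module_files` is hypothetical, and it assumes every module's summary file is named like `REST_JiraEcosystem_SUMMARY.json`.

```python
from pathlib import Path


def pair_module_files(data_dir):
    """Map each module prefix (e.g. "REST") to its (issues, summary) paths.

    Assumption: summary files are named <MODULE>_JiraEcosystem_SUMMARY.json.
    The summary entry is None when no such file exists in data_dir.
    """
    pairs = {}
    for issues_path in sorted(Path(data_dir).glob("*_JiraEcosystem_issues.json")):
        # The module prefix is everything before the first underscore.
        module = issues_path.name.split("_", 1)[0]
        summary_path = issues_path.with_name(f"{module}_JiraEcosystem_SUMMARY.json")
        pairs[module] = (
            issues_path,
            summary_path if summary_path.exists() else None,
        )
    return pairs
```

Note that `full_quidditch_jira_issues.json` does not match this pattern and would need to be loaded separately.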

🔍 Sample Entry: Raw Issue Example

Here's a peek at what a cleaned issue looks like from REST_JiraEcosystem_issues.json:

{
  "key": "REST-50",
  "summary": "Return HTTP 500 when plugin fails to load component",
  "description": "The REST API will throw an HTTP 500 error if a plugin fails to load a required component. This is hard to debug and provides little feedback.",
  "issueType": "Bug",
  "components": ["REST Module"],
  "status": "Open",
  "priority": "Major",
  "comments": [
    {
      "author": "developer42",
      "body": "I also encountered this when working with a broken plugin descriptor.",
      "created": "2023-01-05T08:23:00Z"
    }
  ]
}
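
For retrieval, an issue like this usually has to be turned into a single text passage before it is embedded. Here is a minimal sketch of one way to do that, using the sample entry above. The function name `issue_to_text` and the passage layout are illustrative assumptions, not the workshop's actual ingestion code.

```python
import json

# The sample entry shown above, inlined so the example is self-contained.
SAMPLE_ISSUE = json.loads("""
{
  "key": "REST-50",
  "summary": "Return HTTP 500 when plugin fails to load component",
  "description": "The REST API will throw an HTTP 500 error if a plugin fails to load a required component. This is hard to debug and provides little feedback.",
  "issueType": "Bug",
  "components": ["REST Module"],
  "status": "Open",
  "priority": "Major",
  "comments": [
    {
      "author": "developer42",
      "body": "I also encountered this when working with a broken plugin descriptor.",
      "created": "2023-01-05T08:23:00Z"
    }
  ]
}
""")


def issue_to_text(issue):
    """Flatten one cleaned issue into a single passage suitable for embedding."""
    lines = [
        f"[{issue.get('key', '')}] {issue.get('summary', '')}",
        "Type: {} | Status: {} | Priority: {}".format(
            issue.get("issueType", ""),
            issue.get("status", ""),
            issue.get("priority", ""),
        ),
    ]
    components = issue.get("components") or []
    if components:
        lines.append("Components: " + ", ".join(components))
    description = (issue.get("description") or "").strip()
    if description:
        lines.append(description)
    # Keep comments attributed so the passage preserves who said what.
    for comment in issue.get("comments") or []:
        author = comment.get("author", "unknown")
        lines.append(f"Comment by {author}: {comment.get('body', '')}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(issue_to_text(SAMPLE_ISSUE))
```

Keeping the issue key at the start of the passage makes it easier to trace a retrieved chunk back to its source issue.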

🧠 Sample Entry: GPT-Generated Summary

The REST_JiraEcosystem_SUMMARY.json file contains LLM-generated summaries like this:

{
  "project": "REST",
  "summary": "This set of issues revolves around the handling and resilience of Jira's REST API. Common themes include error codes not being descriptive enough, plugin dependency resolution problems, and inconsistent JSON serialization across endpoints. Several bugs concern the lack of backward compatibility when upgrading REST modules."
}
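
A summary entry can be loaded into the same kind of retrieval record as an issue, tagged so that it can be told apart from raw issue text later. The sketch below is an illustration only. The function name `load_summaries` and the record layout (`text` plus `metadata`) are assumptions, and because it is not stated whether a summary file holds one object or a list of them, the code accepts both.

```python
import json
from pathlib import Path


def load_summaries(path):
    """Load a *_SUMMARY.json file into a list of retrieval records.

    Assumption: the file contains either one {"project", "summary"} object
    or a list of such objects.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    entries = data if isinstance(data, list) else [data]
    return [
        {
            "text": entry["summary"],
            # "kind" lets a retriever distinguish summaries from raw issues.
            "metadata": {"project": entry["project"], "kind": "summary"},
        }
        for entry in entries
    ]
```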

🧾 Why We Generate Summaries

In real-world software projects, technical documentation is often sparse or outdated. While issue trackers like Jira contain valuable information, the data is typically verbose and unstructured.

We generate summaries to simulate the internal documentation that teams often write by hand, such as release notes, design specs, or project overviews. These summaries help:

  • Provide digestible context to the AI assistant
  • Allow the model to answer high-level or cross-issue questions
  • Enable queries like "Who contributed to this project the most?" or "What is the main focus of the REST module?"

This kind of enriched summary acts as a bridge between raw issue data and meaningful insights, just as internal documentation would in a real team.


🛠️ How the Data Was Built

  1. Data Extraction
    Issues were extracted from a local MongoDB dump of the Public Jira Dataset and grouped by project/component.
  2. Field Normalization
    We simplified and flattened nested structures with a custom script (extract_and_clean_jira_data.py).
  3. Sampling & Filtering
    Only issues with meaningful descriptions, summaries, or comments were retained.
  4. Summarization
    Azure OpenAI's GPT-3.5 Turbo was used to generate overviews of each project subset.
  5. Output Format
    Final JSON files are optimized for ingestion into vector databases and LLM pipelines.
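
Steps 2 and 3 above can be sketched roughly as follows. This is not the contents of extract_and_clean_jira_data.py. It assumes the raw documents follow the nested layout of the Jira REST API (`fields.issuetype.name`, `fields.comment.comments`, and so on). The function names and the `min_chars` threshold are hypothetical stand-ins for whatever "meaningful" meant in the original pipeline.

```python
def _name(value):
    """Return the "name" of a nested Jira object such as status or priority."""
    return value.get("name") if isinstance(value, dict) else None


def flatten_issue(raw):
    """Flatten a nested raw issue into the simple shape used in the workshop files.

    Assumption: `raw` looks like a Jira REST API issue with a "fields" object.
    """
    fields = raw.get("fields") or {}
    comments = (fields.get("comment") or {}).get("comments") or []
    return {
        "key": raw.get("key"),
        "summary": fields.get("summary") or "",
        "description": fields.get("description") or "",
        "issueType": _name(fields.get("issuetype")),
        "components": [
            c["name"] for c in (fields.get("components") or []) if c.get("name")
        ],
        "status": _name(fields.get("status")),
        "priority": _name(fields.get("priority")),
        "comments": [
            {
                "author": (c.get("author") or {}).get("displayName"),
                "body": c.get("body") or "",
                "created": c.get("created"),
            }
            for c in comments
        ],
    }


def is_meaningful(issue, min_chars=20):
    """Keep an issue if its summary, description, or any comment has real text.

    The 20-character threshold is an arbitrary illustrative choice.
    """
    texts = [issue["summary"], issue["description"]]
    texts += [c["body"] for c in issue["comments"]]
    return any(len(t.strip()) >= min_chars for t in texts)


def clean(raw_issues):
    """Flatten every raw issue, then drop the ones without meaningful text."""
    flattened = (flatten_issue(raw) for raw in raw_issues)
    return [issue for issue in flattened if is_meaningful(issue)]
```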

💡 Optional Enrichment Ideas

If you'd like to extend the dataset, consider:

  • Adding GitHub release notes or changelogs for Jira components
  • Scraping official Atlassian REST API docs to complement issues
  • Creating FAQ-style entries by grouping similar issues
  • Generating synthetic feature specs from multiple related issues
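
As one illustration of the FAQ idea, the sketch below groups issues by component and emits one question/answer entry per group. Grouping by component is only a crude proxy for "similar issues"; clustering on embeddings would likely group them better. The function names and the entry layout are hypothetical.

```python
from collections import defaultdict


def group_by_component(issues):
    """Group cleaned issues by component; an issue may land in several groups."""
    groups = defaultdict(list)
    for issue in issues:
        for component in issue.get("components") or ["(no component)"]:
            groups[component].append(issue)
    return dict(groups)


def faq_entry(component, issues):
    """Build one FAQ-style entry listing the issues reported for a component."""
    question = f"What problems have been reported for {component}?"
    answer = "\n".join(f"- {i['key']}: {i['summary']}" for i in issues)
    return {"question": question, "answer": answer}


def build_faq(issues):
    """Return one FAQ entry per component, in alphabetical component order."""
    groups = group_by_component(issues)
    return [faq_entry(name, groups[name]) for name in sorted(groups)]
```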

🎯 Why This Matters

This dataset is structured to simulate the kind of heterogeneous documentation you'd find in enterprise software projects. It includes bugs, change requests, and developer conversations, all well suited to retrieval-based reasoning.

You'll be using this to build and test a RAG-powered AI assistant capable of:

  • Understanding feature behavior
  • Identifying high-risk areas
  • Suggesting test cases based on natural language descriptions

✨ Bonus: You can reuse our scripts to extract, clean, and summarize your own Jira data, giving you a real-world path to bring Testus Patronus into your projects.