Backporting-LLM

Designed for Team AzureLinux, Cloud+AI Microsoft

When a new bug or vulnerability is discovered in a package, fixes are generally available only for the latest versions. If you are using an older version, maintaining it is left to developers to do manually: understanding the fix and the latest package version, then adapting the fix to the older version currently in use.

This manual work is what we try to reduce.

This LLM framework is designed for backporting: it takes patches written for the latest versions, examines the context of an older version, and updates each patch so that it applies cleanly to the older package version.

Existing Approaches & Our Goals

Patch backporting is not new. Existing LLMs, when prompted directly for simple backports, do attempt to backport patches, but most of their results share common issues (wrong line numbers, tab characters, etc.) that prevent the generated patches from applying cleanly to the repository.

There have also been attempts to handle complex backports using SWE-Agent (AllSpark) and even GitHub Copilot. These work by searching files based on the given patch and trying to map the older version's files and functions to the ones in the patch.

However, the models used for these purposes do not have enough context about the repository itself, so they need large token counts to find and map simple changes, and they fail entirely on some tricky changes.

We attempt to improve the quality of base LLM outputs for both approaches: the first via chain-of-prompts, the second via LLM finetuning.

Project Demo

View the RetroPatch AI Hackathon Submission, Microsoft Global Hackathon 2025

Simple Backports

For changes where the file paths have NOT changed, we can simply traverse the files and fetch the code blocks that contain lines matching the patch.
The data about the CVE (vulnerability), the upstream patch, and the extracted code blocks are all passed to the LLM along with the backporting prompts.
The output from the LLM is then tested against the actual package.
Base models used for this purpose include codellama/CodeLlama-13b-Instruct-hf and Qwen/Qwen2.5-Coder-32B-Instruct, with Qwen2.5-Coder-32B-Instruct significantly outperforming CodeLlama-13b-Instruct at producing patches in the standard Git diff format.
Base models typically struggle with:

Adding the correct line numbers in the hunk header (@@ -X,Y +A,B @@)
Tab and whitespace character differences
Missing lines in the hunk content (if lines were added or removed upstream)

These issues are fixable using few-shot prompts and prompt chaining.
By providing specialized prompts and chaining their outputs together, we show how these base models can be used to generate valid patches.
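The code-block extraction step described above could be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the function names `old_side_lines` and `extract_code_block`, the `margin` and `min_len` parameters, and the whitespace-insensitive matching are all assumptions made for the example.

```python
def old_side_lines(patch_text, min_len=8):
    """Collect context and removed lines of a unified diff (hypothetical helper).

    Lines shorter than min_len (for example a lone brace) are skipped because
    they would match almost anywhere in the old file.
    """
    wanted = set()
    for line in patch_text.splitlines():
        if line.rstrip() == "--":
            break  # mail-style signature separator ends the diff
        if line.startswith(("---", "+++", "@@", "+")):
            continue  # headers and added lines are absent from the old file
        body = line[1:] if line[:1] in ("-", " ") else line
        body = body.strip()
        if len(body) >= min_len:
            wanted.add(body)
    return wanted


def extract_code_block(old_lines, patch_text, margin=2):
    """Return numbered lines of the old file around every line the patch mentions."""
    wanted = old_side_lines(patch_text)
    # Compare with surrounding whitespace stripped so tab/space drift is tolerated.
    hits = [i for i, text in enumerate(old_lines) if text.strip() in wanted]
    if not hits:
        return []
    first = max(hits[0] - margin, 0)
    last = min(hits[-1] + margin, len(old_lines) - 1)
    return [f"{i + 1}: {old_lines[i]}" for i in range(first, last + 1)]
```

The numbered `"N: text"` output mirrors the shape of the extracted block shown in the prompt-chain example, so the model can later be asked for the line number of a specific line.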

Prompt-Chain Example

From Logs Dir: logs/libsoup/CVE-2025-32052/12-SEP-2025_14-36
Consider the upstream Patch:

    --- a/libsoup/content-sniffer/soup-content-sniffer.c
    +++ b/libsoup/content-sniffer/soup-content-sniffer.c
    @@ -524,7 +524,7 @@ sniff_unknown (SoupContentSniffer *sniffer, GBytes *buffer,
                guint index_pattern = 0;
                gboolean skip_row = FALSE;
    
    -			while ((index_stream < resource_length) &&
    +			while ((index_stream < resource_length - 1) &&
                    (index_pattern <= type_row->pattern_length)) {
                    /* Skip insignificant white space ("WS" in the spec) */
                    if (type_row->pattern[index_pattern] == ' ') {
    -- 
    GitLab

To be backported to the file:

{
    "libsoup/content-sniffer/soup-content-sniffer.c": [
        "527: \t\tif (type_row->has_ws) {\n",
        "528: \t\t\tguint index_stream = 0;\n",
        "529: \t\t\tguint index_pattern = 0;\n",
        "530: \t\t\tgboolean skip_row = FALSE;\n",
        "531: \n",
        "532: \t\t\twhile ((index_stream < resource_length) &&\n",
        "533: \t\t\t       (index_pattern <= type_row->pattern_length)) {\n",
        "534: \t\t\t\t/* Skip insignificant white space (\"WS\" in the spec) */\n",
        "535: \t\t\t\tif (type_row->pattern[index_pattern] == ' ') {\n",
        "536: \t\t\t\t\tif (resource[index_stream] == '\\x09' ||\n",
    ]
}

The base LLM response gives the hunk header @@ -532,7 +532,7 @@, which is incorrect:

@@ -532,7 +532,7 @@ sniff_unknown (SoupContentSniffer *sniffer, GBytes *buffer,
                guint index_stream = 0;
                guint index_pattern = 0;
                gboolean skip_row = FALSE;
 
-			while ((index_stream < resource_length) &&
+			while ((index_stream < resource_length - 1) &&
                    (index_pattern <= type_row->pattern_length)) {
                    /* Skip insignificant white space ("WS" in the spec) */
                    if (type_row->pattern[index_pattern] == ' ') {
--

The output from the base LLM is not directly applicable. We fix this using prompt chains, designed by analyzing common errors in generated patches.
With prompt chaining, we get the following outputs from the LLM:

First Mentioned Unchanged Line:                guint index_stream = 0;
First Changed Line: -			while ((index_stream < resource_length) &&

First Mentioned Unchanged Line Number: 528: guint index_stream = 0;
First Changed Line Number: 532: 			while ((index_stream < resource_length) &&

And hence the updated patch:

@@ -528,7 +528,7 @@ sniff_unknown (SoupContentSniffer *sniffer, GBytes *buffer,
                guint index_stream = 0;
                guint index_pattern = 0;
                gboolean skip_row = FALSE;
 
-			while ((index_stream < resource_length) &&
+			while ((index_stream < resource_length - 1) &&
                    (index_pattern <= type_row->pattern_length)) {
                    /* Skip insignificant white space ("WS" in the spec) */
                    if (type_row->pattern[index_pattern] == ' ') {

This fixes the line number as required. Similarly, other common LLM issues, such as mishandled tabs and whitespace, missing lines, and incorrect hunk headers, can be fixed with specialized prompts like those used in this repository.
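For comparison, the same header correction can be expressed deterministically. The sketch below is not the prompt chain used in this repository; it is an assumed, simplified equivalent that locates the hunk's old-side lines in the old file and rewrites the header. The name `fix_hunk_header` is hypothetical, and the code assumes a single-hunk patch, so the new-side start equals the old-side start.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@(.*)$")


def marker(line):
    # A blank line inside a hunk is treated as context whose leading space was lost.
    return line[:1] or " "


def fix_hunk_header(hunk_lines, old_file_lines):
    """Rewrite the @@ header so its start line and counts match the old file."""
    header, body = hunk_lines[0], hunk_lines[1:]
    match = HUNK_HEADER.match(header)
    if match is None:
        raise ValueError(f"not a hunk header: {header!r}")
    # Old side = context (' ') and removed ('-') lines; new side = context and added ('+').
    old_side = [l[1:].strip() for l in body if marker(l) in (" ", "-")]
    new_count = sum(1 for l in body if marker(l) in (" ", "+"))
    # Whitespace is stripped on both sides because models often mangle tabs.
    have = [l.strip() for l in old_file_lines]
    for start in range(len(have) - len(old_side) + 1):
        if have[start:start + len(old_side)] == old_side:
            fixed = (f"@@ -{start + 1},{len(old_side)} "
                     f"+{start + 1},{new_count} @@{match.group(1)}")
            return [fixed] + body
    raise LookupError("hunk context not found in the old file")
```

Besides the start line, this also recomputes the line counts from the hunk body, which guards against a model adding or dropping a context line without updating the header.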

Complex Backports

When the old version of a package differs substantially from the latest one (changed file paths or function signatures, or even refactored or modularized functions), it becomes very difficult for models to generate patches directly.

So we feed the entire commit history of the desired package to an LLM in a question-answer format.
This allows the model to map old versions, paths, and function signatures to the new versions.

To do this, we build the following pipeline:

Example Dataset:

{"messages": [{"role": "system", "content": "\n        You are an expert software developer with deep knowlege of c programming language.\n        You have in-depth knowlege about the commit history of libsoup package.\n        Answer questions about how the files, functions and lines of code were changed over range of commits.\n    "}, {"role": "user", "content": "Has the file path of 'libsoup/soup-cache-input-stream.c' changed? What is the new path?"}, {"role": "assistant", "content": "Yes, the file path changed. Old Path: 'libsoup/soup-cache-input-stream.c', New Path: 'libsoup/cache/soup-cache-input-stream.c'."}]}
{"messages": [{"role": "system", "content": "\n        You are an expert software developer with deep knowlege of c programming language.\n        You have in-depth knowlege about the commit history of libsoup package.\n        Answer questions about how the files, functions and lines of code were changed over range of commits.\n    "}, {"role": "user", "content": "What changes were made to the function 'read_internal' in the file 'libsoup/http-1/soup-message-io-data.c'?"}, {"role": "assistant", "content": "The function 'read_internal' was modified. New code blocks were added to handle cases for 'SOUP_ENCODING_CONTENT_LENGTH' and 'SOUP_ENCODING_EOF'. Additionally, new variables 'soup_random_size' and 'soup_buffer' were introduced. Code changes:\n\n**Old Version:**\n```c\ncase SOUP_ENCODING_CHUNKED:\n    return soup_body_input_stream_read_chunked(bistream, buffer, count, blocking, cancellable, error);\n\ncase SOUP_ENCODING_CONTENT_LENGTH:\ncase SOUP_ENCODING_EOF:\n    return soup_body_input_stream_read_raw(bistream, buffer, count, blocking, cancellable, error);\n```\n\n**New Version:**\n```c\ncase SOUP_ENCODING_CHUNKED:\n    return soup_body_input_stream_read_chunked(bistream, buffer, count, blocking, cancellable, error);\n\ncase SOUP_ENCODING_CONTENT_LENGTH:\n    return soup_random_size;\n\ncase SOUP_ENCODING_EOF:\n    return soup_body_input_stream_read_raw(bistream, buffer, count, blocking, cancellable, error);\n\nif (nread == 0) {\n    priv->eof = TRUE;\n}\nif (priv->encoding != SOUP_ENCODING_EOF) {\n    g_set_error_literal(error, G_IO_ERROR, G_IO_ERROR_PARTIAL_INPUT, _(\"Connection terminated unexpectedly\"));\n    return -1;\n}\n```"}]}
{"messages": [{"role": "system", "content": "\n        You are an expert software developer with deep knowlege of c programming language.\n        You have in-depth knowlege about the commit history of libsoup package.\n        Answer questions about how the files, functions and lines of code were changed over range of commits.\n    "}, {"role": "user", "content": "Has the file path of 'libsoup/soup-converter-wrapper.c' changed? What is the new path?"}, {"role": "assistant", "content": "Yes, the file path changed. Old Path: 'libsoup/soup-converter-wrapper.c', New Path: 'libsoup/content-decoder/soup-converter-wrapper.c'."}]}

Here, the dataset is designed to enable the model to map changes from the latest version to the chosen version (libsoup tag 3.4.4).
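A record of the file-path kind shown above could be built along these lines. This is a sketch under assumptions: `rename_record` and `to_jsonl` are hypothetical helper names, and `SYSTEM_PROMPT` is a shortened stand-in for the system message in the example dataset. The question and answer wording follow the example records.

```python
import json

# Shortened stand-in for the system message used in the example dataset.
SYSTEM_PROMPT = (
    "You are an expert software developer with in-depth knowledge of the "
    "commit history of the libsoup package."
)


def rename_record(old_path, new_path):
    """Build one chat-format training example describing a file rename."""
    question = f"Has the file path of '{old_path}' changed? What is the new path?"
    answer = (f"Yes, the file path changed. Old Path: '{old_path}', "
              f"New Path: '{new_path}'.")
    return {"messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records) + "\n"
```

Each line of the resulting file is an independent JSON object, which is the layout the example dataset above uses.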

Finetuning Dataset:

Find the generated dataset here:

Training-dataset.jsonl
Validation-dataset.jsonl
(Scope for improvement: the validation dataset is too small.)

Finetuning Details:

Chosen hyperparameters:

Batch size = 2
Epochs = 7
Learning rate = 0.9
The small batch size helps the model learn each individual example well. This enables the model to learn commit-history details such as paths and changes to specific hunks.
Time taken to finetune gpt-4o: 12 hours in the GPU waiting queue, 6.5 hours for finetuning.
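As a rough illustration, the hyperparameters above could be expressed as the request body of an OpenAI-style fine-tuning job. This sketch only builds the payload and makes no network call. It assumes that "learning rate = 0.9" refers to the `learning_rate_multiplier` field of that API, which the source does not confirm; the function name and the file IDs are placeholders.

```python
def finetune_job_request(training_file_id, validation_file_id, base_model="gpt-4o"):
    """Assemble the payload for a fine-tuning job (hypothetical helper)."""
    return {
        "model": base_model,
        "training_file": training_file_id,
        "validation_file": validation_file_id,
        "hyperparameters": {
            "batch_size": 2,
            "n_epochs": 7,
            # Assumption: the README's "learning rate" is the multiplier.
            "learning_rate_multiplier": 0.9,
        },
    }
```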

Training Loss:

Validation Loss:

Training Token Accuracy:

Validation Token Accuracy:

Testing Model Knowledge:

The model performs very well when asked about changes to files, function names, and locations:

How To Run

To Run Simple Backports:

Download the following outside this repo:

1. Download the Azure Linux repository: https://github.com/microsoft/azurelinux
2. Download libsoup version 3.4.4 (the package used for testing)

Download any LLM (model used: Qwen2.5-Coder-32B-Instruct)

git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct
hf download Qwen/Qwen2.5-Coder-32B-Instruct

Install the Python dependencies from requirements.txt (or view the VM Setup below)
Update the paths for the repositories, the dataset (if changed), and the LLM in the constants.py file
Now just run the main_refactored.py file!

To Create Finetuning Dataset

Create a .env file in Backporting-LLM/finetuning with the fields AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, and AZURE_OPENAI_DEPLOYMENT. Use this file as a reference for the env.
Update the desired paths in BOTH constants files: finetuning/constants.py and constants.py
Run finetuning/fetch_commit_history.py to fetch commits from the specified package version or commit up to the latest origin/master HEAD
Run the main() function in generate_q_a_dataset.py to generate question-answer pairs, then prepare_dataset_in_proper_format() to convert them into the training data format.
The dataset will then be ready in the finetuning/data/training_data/ directory.
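To give a feel for the commit-history step, the sketch below lists file renames between an old reference and origin/master using `git log --name-status`. It is an assumed illustration and not the contents of fetch_commit_history.py; `parse_renames` and `fetch_renames` are hypothetical names, and chains of renames of the same file are not merged.

```python
import subprocess


def parse_renames(name_status_output):
    """Extract (old_path, new_path) pairs from `git log --name-status` output.

    Rename entries have three tab-separated fields: a status such as R100
    (R plus a similarity score), the old path, and the new path.
    """
    renames = []
    for line in name_status_output.splitlines():
        fields = line.split("\t")
        if len(fields) == 3 and fields[0].startswith("R"):
            renames.append((fields[1], fields[2]))
    return renames


def fetch_renames(repo_dir, base_ref, head_ref="origin/master"):
    """Run git in repo_dir and return renames made after base_ref up to head_ref."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "log", "--name-status", "-M", "--format=",
         f"{base_ref}..{head_ref}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_renames(out)
```

Each returned pair can then be turned into a question-answer example of the file-path kind shown in the example dataset.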

To Test the Finetuned Model (in SWE-Agent)

Results:

The finetuned model is plugged into SWE-Agent and used for patch backporting.
With more context on the commit history, the finetuned LLM is better able to predict the changed locations of functions and to map new function signatures to older ones.

The finetuned model uses fewer tokens to find the locations of the desired files and functions.
The finetuned model is able to backport function-refactor changes that the base model cannot handle.

For the BASE GPT-4o Model:

view the complete logs for base-gpt-4o-model

DEBUG - swea-lm - 
    total_tokens_sent=259,795, 
    total_tokens_received=706, 
    total_cost=0.57, 
    total_api_calls=23

For the FINETUNED GPT-4o Model:

view the complete logs for finetuned-gpt-4o-model

DEBUG - swea-lm -
    total_tokens_sent=138,645,
    total_tokens_received=843,
    total_cost=0.21,
    total_api_calls=13

The results indicate that the finetuned LLM is much more efficient at backporting: in this run it needed 13 API calls instead of 23 and 138,645 input tokens instead of 259,795, reducing the cost from 0.57 to 0.21.
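The relative differences between the two logged runs can be worked out directly from the figures above. The short calculation below uses only those logged numbers; the variable and function names are illustrative.

```python
# Figures copied from the two swea-lm log excerpts above.
base = {"tokens_sent": 259_795, "tokens_received": 706, "cost": 0.57, "api_calls": 23}
tuned = {"tokens_sent": 138_645, "tokens_received": 843, "cost": 0.21, "api_calls": 13}


def percent_change(before, after):
    """Relative change from before to after, in percent, rounded to one decimal."""
    return round((after - before) / before * 100, 1)


changes = {key: percent_change(base[key], tuned[key]) for key in base}
# tokens_sent: -46.6, tokens_received: 19.4, cost: -63.2, api_calls: -43.5
```

So in this run the finetuned model sent about 46.6% fewer input tokens, made about 43.5% fewer API calls, and cost about 63.2% less, while producing about 19.4% more output tokens.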

Additional

View VM Setup used for running this framework
