Minimizing False Positives in CodeQL Taint Analysis Through LLM-Driven Contextual Reasoning

Tags: Cybersecurity · CodeQL · LLM Integration · Large Language Models · JavaScript

Overview

Static Application Security Testing (SAST) tools, such as CodeQL, are fundamental for identifying vulnerabilities at scale, yet they suffer from a significant credibility problem due to high False Positive Rates (FPR) [1]. This over-approximation leads to alert fatigue among security analysts, increasing the cost of manual triage [1]. The foundational UntrustIDE study demonstrated this challenge empirically, finding an effective FPR of approximately 98% in CodeQL's taint analysis of VS Code extensions [1]. We propose a novel, scalable, three-stage LLM Triage Pipeline that automates contextual reasoning to filter noisy CodeQL alerts [1]. By transforming the purely structural taint flow data produced by CodeQL into a contextual reasoning task for a Large Language Model (LLM), we aim to reduce the baseline FPR of >98% to an industrially acceptable threshold of <20% [1]. Our initial empirical validation on a sampled dataset demonstrates the LLM's capability for path-aware reasoning and contextual understanding, showcasing a promising path toward solving the most persistent precision problem of static analysis [1].
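The three stages described above can be sketched in code. This is an illustrative outline only: the alert schema, function names, and the `query_llm` callable are assumptions for the sketch, not the project's actual implementation.

```python
def extract_flow(alert: dict) -> str:
    """Stage 1: flatten a CodeQL taint path (source -> steps -> sink)
    into human-readable lines the LLM can reason over."""
    steps = alert.get("codeFlows", [])
    return "\n".join(
        f"{i + 1}. {s['message']} ({s['location']})" for i, s in enumerate(steps)
    )

def build_prompt(alert: dict) -> str:
    """Stage 2: turn the structural flow into a contextual reasoning task."""
    return (
        "You are a security analyst triaging a CodeQL taint-flow alert.\n"
        f"Rule: {alert['ruleId']}\n"
        f"Flow:\n{extract_flow(alert)}\n"
        "Does attacker-controlled data actually reach the sink? "
        "Answer TRUE_POSITIVE or FALSE_POSITIVE with a one-line reason."
    )

def triage(alert: dict, query_llm) -> bool:
    """Stage 3: parse the LLM verdict and keep only likely true positives."""
    verdict = query_llm(build_prompt(alert))
    return verdict.strip().upper().startswith("TRUE_POSITIVE")
```

In this sketch, alerts whose verdict is FALSE_POSITIVE are filtered out before reaching the analyst, which is where the FPR reduction comes from.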

Challenges & Solutions

The biggest challenge I encountered on this project was learning CodeQL: I had to understand how to use it in this context and how to extract the taint flow information the LLM triage pipeline needed. Another challenge was working with LLMs and learning how to prompt them effectively to get reliable triage verdicts. Learning how to validate taint flows in JavaScript was also difficult. Through research and experimentation, I overcame these challenges and successfully implemented the LLM triage pipeline.

Key Features

Final Results

The project demonstrated the feasibility of using LLMs to reduce false positives in CodeQL taint analysis. Empirical validation on the sampled dataset showed that the LLM triage pipeline can substantially improve precision over the ~98% baseline FPR while remaining scalable enough for automated triage.
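To make the precision target concrete, here is the arithmetic behind the FPR figures cited above. The counts are hypothetical round numbers chosen only to illustrate the effective-FPR definition (fraction of raised alerts that are false positives), not results from the study.

```python
def false_positive_rate(false_pos: int, true_pos: int) -> float:
    """Effective FPR: share of reported alerts that are false positives."""
    return false_pos / (false_pos + true_pos)

# Baseline (illustrative): ~98 of every 100 alerts are noise.
baseline = false_positive_rate(98, 2)   # 0.98

# Target after LLM triage (illustrative): <20 FP per 100 surviving alerts.
target = false_positive_rate(19, 81)    # 0.19
```

Under these illustrative counts, the pipeline would need to discard the vast majority of false alarms while keeping nearly all true positives to move from the baseline to the target.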