Fixing Pipelines-as-Code Delays: The GetFiles() Bottleneck
Hey there, fellow developers and ops folks! Ever felt like your CI/CD pipelines are taking forever to kick off, leaving you staring at your screen wondering what's going on? You're not alone, and today we're diving deep into a specific performance bottleneck within Pipelines-as-Code (PaC) that can cause some truly excruciating delays. We're talking minutes, sometimes even hours, before your pipelines even start! This isn't just a minor annoyance; it's a critical issue that can severely impact your deployment velocity and overall developer experience. We're going to break down the problem, show you a real-world example, dig into the technical nitty-gritty, and then, most importantly, talk about a super effective solution. So, grab a coffee, because we're about to make your OpenShift Pipelines run much, much faster.
The Hidden Performance Killer in Pipelines-as-Code: Serial GetFiles() Calls
Alright, let's get straight to the point about what's actually slowing things down in your Pipelines-as-Code (PaC) setup. Imagine this: you push some code, a webhook event fires, and PaC, being the smart cookie it is, starts evaluating which PipelineRuns should actually be triggered. This evaluation often involves checking conditions defined using CEL expressions. These expressions are super flexible, letting you define complex rules, like only triggering a pipeline if specific files have changed. This is where the core issue, a major Pipelines-as-Code performance problem, kicks in.
Here’s the deal: if your CEL expression references files.* (for example, files.all.exists() to check if any files changed, or files.added.matches('path/to/my-service/**') to be more specific), PaC needs to ask your Git provider (like GitLab, GitHub, or Gitea) for a list of changed files in that particular webhook event. Sounds reasonable, right? The problem, guys, is that PaC currently makes this GetFiles() API call separately for every single PipelineRun that uses a files.* pattern. Yes, you read that right – every single one, even though all these PipelineRuns are looking at the exact same webhook event and thus, the exact same set of changed files. This isn't just inefficient; it's a recipe for disaster when you have a decent number of Pipelines-as-Code definitions in your repository.
This serial GetFiles() call behavior leads to a few nasty side effects that seriously degrade OpenShift Pipelines performance. First, you get linear scaling of delays. If one API call takes about 3 seconds (which is a pretty common latency for these types of Git API requests), and you have N PipelineRuns using files.* expressions, then you're looking at N * 3 seconds just for file fetching. For 100 PipelineRuns, that’s 300 seconds, or a solid 5 minutes, before any pipeline even starts. Second, this hammering of the Git provider’s API can lead to API saturation and, consequently, dreaded HTTP/2 stream errors. When this happens, PaC has to retry these calls, often with exponential backoff. This means the delay doesn't just add up; it cascades, turning minutes into hours. We're talking about a significant webhook processing bottleneck that affects pipeline creation time and frankly, just makes everyone frustrated. This fundamental design flaw, where file list fetching isn't cached or shared, is the root of extended delays and frustrating waits, particularly impacting environments with a high volume of Pipelines-as-Code definitions relying on file-based triggers.
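To make the scaling concrete, here's a tiny back-of-the-envelope sketch. This is not PaC source code; the per-call latency, retry count, and backoff values are illustrative assumptions, but it shows how serial fetches turn into minutes, and how retries with exponential backoff turn minutes into much worse:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		pipelineRuns   = 100             // PipelineRuns using files.* expressions (assumption)
		callLatency    = 3 * time.Second // typical Git API latency per GetFiles() call (assumption)
		retriesPerCall = 2               // retries after HTTP/2 stream errors (assumption)
		baseBackoff    = 2 * time.Second // initial exponential-backoff wait (assumption)
	)

	// Serial calls with no failures: N identical fetches of the same file list.
	serial := pipelineRuns * callLatency
	fmt.Printf("serial, no retries: %v\n", serial) // ~5m for 100 runs at 3s each

	// Once the API saturates, each call may be retried with exponential
	// backoff, and the delay cascades.
	var withRetries time.Duration
	for i := 0; i < pipelineRuns; i++ {
		withRetries += callLatency // the initial GetFiles() call
		backoff := baseBackoff
		for r := 0; r < retriesPerCall; r++ {
			withRetries += backoff + callLatency // wait, then repeat the same call
			backoff *= 2                         // exponential backoff doubles the wait
		}
	}
	fmt.Printf("serial, with retries: %v\n", withRetries)

	// With one shared GetFiles() call per webhook event, the cost is roughly a
	// single API round trip plus cheap in-memory CEL evaluations.
	fmt.Printf("shared fetch per event: ~%v\n", callLatency)
}
```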
A Production Nightmare: Hours of Delays and Real-World Impact
Let me tell you, this isn't some theoretical edge case; this is a real-world production problem that hits where it hurts: developer productivity and deployment speeds. Imagine logging off for the day, only to find out the next morning that your critical changes, pushed hours ago, still haven't triggered a pipeline. That's exactly what happened in a production GitLab repository we observed. This repository was home to a whopping 132 PipelineRuns, all configured to use those handy files.* triggers. Each PipelineRun needed to check file changes before deciding if it should run. Sounds like a smart way to manage a monorepo, right? Well, in this scenario, it turned into a nightmare.
The timeline was stark and frankly, unacceptable. A webhook event was received at 17:52:31. You'd expect a pipeline to kick off within seconds, maybe a minute or two for complex setups. But in this case, the first PipelineRun wasn't created until 21:28:46! Guys, that's a staggering 3 hours and 36 minutes later. Can you even picture the frustration? A simple merge request taking over three and a half hours just to begin its CI/CD process. This level of production system delay directly translates to missed deadlines, stalled releases, and a severe hit to team morale. Nobody wants to be the one explaining why a deployment is hours late because the system is stuck doing the same API call over and over again.
So, what was the culprit behind this epic GitLab integration issue? It was precisely those 132 serial API calls to fetch file lists, repeated multiple times. Because the Git API was getting hammered, HTTP/2 stream errors started popping up like moles in a game of whack-a-mole, forcing PaC to retry. And with exponential backoff, those retries just kept pushing the delay further and further. What should have taken a mere ~6 seconds with proper caching (one API call for the file list, then quick CEL evaluations) instead spiraled into hours. This isn't just about a slow pipeline; it's about a complete breakdown of the intended CI/CD flow, causing massive CI/CD bottlenecks and severely impacting the developer experience. It highlights how quickly a seemingly minor inefficiency, when scaled, can bring a critical system to its knees. The actual impact of such Pipelines-as-Code delays extends beyond mere waiting, leading to costly resource usage, increased cloud spend, and a general loss of confidence in the automation system itself, making this a high-priority performance optimization target.
Diving Deep: The Technical Root Cause of the Bottleneck
Let's roll up our sleeves and get a bit technical, shall we? To truly understand this Pipelines-as-Code performance problem, we need to peek at the underlying code that orchestrates CEL expression evaluation and Git API interaction. The core of the issue lies in how PaC handles fetching changed files when evaluating files.* patterns within your on-cel-expression annotations. It’s all about where and when that GetFiles() call happens.
In the current implementation, specifically within pkg/matcher/cel.go, there’s a function called celEvaluate. This function is responsible for, well, evaluating your CEL expressions. Inside celEvaluate, you'll find a check: r.MatchString(expr) which uses a regex reChangedFilesTags to see if the CEL expression contains the string "files.". If it does, then – and here's the kicker – it immediately calls changedFiles, err = vcx.GetFiles(ctx, event). This vcx.GetFiles is the actual Git provider API call to fetch the list of changed files. The critical flaw? This entire block, including the GetFiles() call, is nested inside celEvaluate.
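Here's a simplified sketch of the shape of that logic, using stand-in types rather than the verbatim PaC source, so you can see where the API call sits. The key point is that the GetFiles() round trip lives inside celEvaluate and therefore runs on every invocation:

```go
package matcher

import (
	"context"
	"regexp"
)

// Event is a stand-in for the parsed webhook event.
type Event struct{}

// ChangedFiles is a stand-in for the changed-file list the Git provider
// returns for one webhook event.
type ChangedFiles struct {
	All, Added, Modified, Deleted []string
}

// GitProvider is a stand-in for the provider client (GitHub, GitLab, Gitea).
type GitProvider interface {
	GetFiles(ctx context.Context, event *Event) (ChangedFiles, error)
}

// reChangedFilesTags mirrors the described check: does the expression
// reference "files." at all?
var reChangedFilesTags = regexp.MustCompile(`files\.`)

// celEvaluate sketches the current shape of the logic: the Git API call to
// fetch changed files happens inside per-expression evaluation.
func celEvaluate(ctx context.Context, expr string, event *Event, vcx GitProvider) (bool, error) {
	var changedFiles ChangedFiles
	if reChangedFilesTags.MatchString(expr) {
		var err error
		// The expensive provider round trip happens here, once per call.
		if changedFiles, err = vcx.GetFiles(ctx, event); err != nil {
			return false, err
		}
	}
	// ... build the CEL environment from event and changedFiles, then compile
	// and evaluate expr; elided in this sketch ...
	_ = changedFiles
	return true, nil
}
```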
Now, let's look at how celEvaluate gets called. If you follow the execution path, you'll find it invoked from pkg/matcher/annotation_matcher.go within a function called MatchPipelinerunByAnnotation. This function, as its name suggests, iterates through all the PipelineRuns it knows about. For each PipelineRun, it checks for an on-cel-expression annotation. If one exists, it then calls celEvaluate(ctx, celExpr, event, vcx). See the problem, folks? This means the celEvaluate function, and by extension, the vcx.GetFiles() API call, is executed in a loop for every single PipelineRun that has a files.* based CEL expression. This is a classic case of repeated API calls for identical data.
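Continuing the stand-in types from the sketch above, here's a simplified view of that loop (again, not the exact PaC signature). Every iteration whose expression touches files.* triggers its own GetFiles() round trip, even though the webhook event, and therefore the changed-file list, is identical for all of them:

```go
// PipelineRun is a stand-in holding only what this sketch needs.
type PipelineRun struct {
	Name        string
	Annotations map[string]string
}

// MatchPipelinerunByAnnotation sketches the per-PipelineRun loop described
// above: each file-triggered PipelineRun causes its own Git API call.
func MatchPipelinerunByAnnotation(ctx context.Context, pruns []*PipelineRun, event *Event, vcx GitProvider) ([]*PipelineRun, error) {
	var matched []*PipelineRun
	for _, prun := range pruns {
		celExpr, ok := prun.Annotations["pipelinesascode.tekton.dev/on-cel-expression"]
		if !ok {
			continue
		}
		// celEvaluate may call vcx.GetFiles(ctx, event) internally -- so for N
		// file-triggered PipelineRuns this loop issues N identical Git API
		// calls per webhook event.
		out, err := celEvaluate(ctx, celExpr, event, vcx)
		if err != nil {
			return nil, err
		}
		if out {
			matched = append(matched, prun)
		}
	}
	return matched, nil
}
```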
While there's a per-expression optimization (it only calls GetFiles() if files.* is actually used in the CEL expression), this optimization is happening at the wrong level. It's optimizing for individual expressions, not for the entire webhook event. For a single webhook event, the list of changed files never changes. Yet, PaC fetches this same file list N times if N PipelineRuns utilize file-based triggers. This behavior is incredibly common in monorepo setups where different components or services have their own PipelineRuns, each using a `files.all.exists(x, x.matches('path/to/component/**'))`-style expression so it only triggers when its own directory changes.
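To picture what "the right level" would look like, here's an illustrative sketch (not the actual PaC fix) that continues the stand-in types above and additionally needs the "sync" import: memoize the changed-file list once per webhook event, so every CEL evaluation reuses it.

```go
// cachedFileGetter memoizes the changed-file list for a single webhook
// event, so the Git provider is queried at most once no matter how many
// PipelineRuns evaluate files.* expressions. Illustrative only; requires
// the "sync" import in addition to the sketches above.
type cachedFileGetter struct {
	provider GitProvider
	once     sync.Once
	files    ChangedFiles
	err      error
}

// GetFiles satisfies the stand-in GitProvider interface, so this wrapper
// could be handed to the matching loop in place of the raw client.
func (c *cachedFileGetter) GetFiles(ctx context.Context, event *Event) (ChangedFiles, error) {
	c.once.Do(func() {
		// Only the first caller pays for the API round trip.
		c.files, c.err = c.provider.GetFiles(ctx, event)
	})
	return c.files, c.err
}
```

Because the wrapper satisfies the same stand-in interface, the loop above would collapse from N Git API round trips per webhook event down to one, which is exactly the difference between minutes (or hours, with retries) and a few seconds.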