Kube-burner-ocp Bug: SriovNetwork Deletion in rds-core

Hey everyone,

We've encountered a puzzling issue with the kube-burner-ocp tool when running the rds-core workload: the logs report the deletion of SriovNetworks between job steps, and it is not clear whether the networks are actually being removed or the message is simply inaccurate. Let's dive into the details to understand what's happening and how to address it.

Bug Description

Output of kube-burner-ocp version

Version: 1.8.0
Git Commit: fa9dd1b5a854e69af94463f4caaf97c64b656506
Build Date: 2025-10-28T14:33:17Z
Go Version: go1.23.12
OS/Arch: linux amd64

Describe the bug

The core issue lies in the rds-core workload. kube-burner-ocp logs a message indicating that SriovNetworks are being deleted between the initial job (where they are created) and subsequent steps that depend on these networks. This is problematic because the later step expects the network to be present. If the network really is deleted, the workload itself is broken. If it is not deleted but the tool reports that it is, the misleading message still causes confusion and erodes trust in the tool's output.

To further clarify, this issue arises specifically when running the rds-core workload in kube-burner-ocp. The tool is designed to automate and benchmark Kubernetes/OCP deployments, and rds-core is one of the workload configurations it supports. The SriovNetworks are custom resources that define the network configuration for SR-IOV (Single Root I/O Virtualization) enabled network interfaces. These interfaces provide high-performance networking capabilities, often used in Telco or other high-throughput environments. So, when the tool incorrectly reports their deletion, it casts doubt on whether these critical network resources are actually available when needed.
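
For readers less familiar with the resource, a SriovNetwork manifest looks roughly like the sketch below. This is an illustrative example only, not the template rendered from the workload's sriov-network.yml; the name, resourceName, networkNamespace, and IPAM values are placeholders.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-net-example                      # placeholder name, not the one rds-core creates
  namespace: openshift-sriov-network-operator  # SriovNetworks are defined in the SR-IOV operator namespace
  labels:
    kube-burner-job: rds                       # kube-burner labels the objects it creates with the job name
spec:
  resourceName: example_resource               # placeholder; must match an SriovNetworkNodePolicy resourceName
  networkNamespace: rds-0                      # namespace where the resulting NetworkAttachmentDefinition is rendered
  ipam: |
    {"type": "host-local", "subnet": "192.168.100.0/24"}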

This issue is not just about a misleading log message. It touches upon the reliability and trustworthiness of the automation tool itself. When engineers and operators rely on kube-burner-ocp to simulate real-world workload conditions and measure performance, they need to be able to trust the information it provides. An incorrect log message like this can lead to wasted time investigating false alarms or, even worse, making incorrect assumptions about the system's behavior. This highlights the importance of addressing this bug to restore confidence in the tool's accuracy.

Ultimately, the resolution requires a deep dive into the tool's code to understand how it manages and reports the status of SriovNetworks during the rds-core workload. It may involve examining the logic that handles the creation, deletion, and verification of these network resources. The goal is to identify why the tool is generating this misleading log message and implement a fix that ensures accurate reporting.
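
One plausible hypothesis to check first, stated here as an assumption rather than a diagnosis: kube-burner deletes objects labeled kube-burner-job=<job-name> as part of its pre-job cleanup, which would explain why a deletion message for SriovNetworks appears immediately after "Triggering job: rds" even on a fresh run. Below is a simplified, hypothetical sketch of what the relevant job definition might look like; it is not copied from the real rds-core.yml and the field values are illustrative.

# Hypothetical excerpt of a kube-burner job definition, not the actual rds-core.yml.
jobs:
  - name: rds
    namespace: rds
    jobIterations: 1
    namespacedIterations: true
    cleanup: true                           # assumption: with cleanup enabled, leftovers labeled
                                            # kube-burner-job=rds are removed before the job starts
    objects:
      - objectTemplate: sriov-network.yml   # renders the SriovNetwork into openshift-sriov-network-operator
        replicas: 1
      - objectTemplate: deployment-dpdk.yml # later objects expect that SriovNetwork to be present
        replicas: 1

If the code does behave this way, the fix may be as simple as skipping the cleanup pass for this job or rewording the message; if not, the investigation needs to look elsewhere.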

To Reproduce

Here's how you can reproduce the behavior:

  1. Run the following command: kube-burner-ocp rds-core --iterations=1

Expected behavior

Ideally, one of two things should happen:

  • If the SriovNetwork is truly not being deleted, the tool should not log that it is being deleted. Accurate reporting is key.
  • If the SriovNetwork is indeed being deleted unintentionally, the tool should prevent this deletion from occurring.

Also, the current ordering of log messages is confusing: the deletion message appears immediately after "Triggering job: rds", which is misleading. The message should be emitted at a point in the sequence that reflects what is actually happening.

The expectation is that the SriovNetwork resource, once created, should persist throughout the duration of the rds-core workload execution, unless explicitly designed to be deleted as part of a specific test scenario. Premature or unintended deletion of this network resource can disrupt the workload's functionality and skew performance measurements.
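
The dependency matters because the workload pods attach to the SR-IOV network through the NetworkAttachmentDefinition that the SriovNetwork renders. A minimal, hypothetical pod manifest illustrating that dependency is shown below; the names and image are placeholders and do not come from the workload's actual templates.

apiVersion: v1
kind: Pod
metadata:
  name: sriov-client-example
  namespace: rds-0                                    # must match the SriovNetwork's networkNamespace
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net-example    # attaches a secondary SR-IOV interface; pod creation
                                                      # stalls if the referenced network no longer exists
spec:
  containers:
    - name: dpdk-app
      image: quay.io/example/dpdk-app:latest          # placeholder image
      command: ["sleep", "infinity"]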

Furthermore, it's essential to consider the broader implications of this behavior. If the tool is incorrectly managing or reporting the status of SriovNetworks, it raises concerns about its handling of other custom resources or Kubernetes objects. A thorough review of the tool's resource management logic may be necessary to ensure consistent and reliable behavior across different workload configurations.

In addition to fixing the immediate issue, it would be beneficial to enhance the tool's logging and error handling capabilities. More verbose and informative log messages can help users understand the tool's actions and diagnose potential problems more easily. Clear error messages can guide users towards resolving configuration issues or other problems that may be causing unexpected behavior. These improvements would not only address the specific bug but also improve the overall usability and maintainability of the kube-burner-ocp tool.

Screenshots or output

Here's the output you might see:

[root@bb37-h23-000-r750 rds-core]# kube-burner-ocp rds-core --churn=false --iterations=1 --dpdk-cores=28 --dpdk-hugepages=1Gi --gc=false --perf-profile=controlplane --check-health=false                                                 
time="2025-11-15 00:41:51" level=info msg="Config file rds-core.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="🔥 Starting kube-burner (1.8.0@fa9dd1b5a854e69af94463f4caaf97c64b656506) with UUID f757642c-2c00-4989-96f5-3ffec5efebde" file="job.go:91"
time="2025-11-15 00:41:51" level=info msg="Config file ipaddresspool.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Config file bgpadvertisement.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Config file bgppeer.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Config file secret.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Config file configmap.yml available in the current directory, using it" file="file_reader.go:77"                    
time="2025-11-15 00:41:51" level=info msg="Config file np-deny-all.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Config file np-allow-from-clients.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Config file np-allow-from-ingress.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Config file sriov-network.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Config file service.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Config file service-lb.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Config file route.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Config file deployment-server.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Config file deployment-client.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Config file deployment-dpdk.yml available in the current directory, using it" file="file_reader.go:77"
time="2025-11-15 00:41:51" level=info msg="Pre-load: images from job bgp-setup" file="pre_load.go:73"                   
time="2025-11-15 00:41:51" level=info msg="No images found to pre-load, continuing" file="pre_load.go:79"               
time="2025-11-15 00:41:51" level=info msg="Initializing measurements for job: bgp-setup" file="factory.go:98"           
time="2025-11-15 00:41:51" level=info msg="Registered measurement: podLatency" file="factory.go:128"                    
time="2025-11-15 00:41:51" level=info msg="Creating /v1, Resource=pods latency watcher for bgp-setup" file="base_measurement.go:69"
time="2025-11-15 00:41:51" level=info msg="Triggering job: bgp-setup" file="job.go:122"                                 
time="2025-11-15 00:41:51" level=info msg="Deleting IPAddressPools labeled with kube-burner-job=bgp-setup in metallb-system" file="namespaces.go:55"
time="2025-11-15 00:41:51" level=info msg="Deleting BGPAdvertisements labeled with kube-burner-job=bgp-setup in metallb-system" file="namespaces.go:55"
time="2025-11-15 00:41:51" level=info msg="Deleting BGPPeers labeled with kube-burner-job=bgp-setup in metallb-system" file="namespaces.go:55"
time="2025-11-15 00:41:51" level=info msg="Waiting up to 4h0m0s for actions to be completed" file="create.go:169"       
time="2025-11-15 00:41:51" level=info msg="Actions completed" file="waiters.go:76"                                      
time="2025-11-15 00:41:51" level=info msg="Verifying created objects" file="utils.go:136"                               
time="2025-11-15 00:41:52" level=info msg="Stopping measurement: podLatency" file="factory.go:158"                      
time="2025-11-15 00:41:52" level=info msg="Evaluating latency thresholds" file="metrics.go:48"                          
time="2025-11-15 00:41:52" level=info msg="Initializing measurements for job: rds" file="factory.go:98"                 
time="2025-11-15 00:41:52" level=info msg="Registered measurement: podLatency" file="factory.go:128"                    
time="2025-11-15 00:41:52" level=info msg="Creating /v1, Resource=pods latency watcher for rds" file="base_measurement.go:69"
time="2025-11-15 00:41:52" level=info msg="Triggering job: rds" file="job.go:122"                                       
time="2025-11-15 00:41:52" level=info msg="Deleting SriovNetworks labeled with kube-burner-job=rds in openshift-sriov-network-operator" file="namespaces.go:55"
time="2025-11-15 00:41:58" level=info msg="Waiting up to 4h0m0s for actions to be completed" file="create.go:169"
time="2025-11-15 00:42:11" level=info msg="Actions in namespace rds-0 completed" file="waiters.go:74"
time="2025-11-15 00:42:11" level=info msg="Verifying created objects" file="utils.go:136"
time="2025-11-15 00:42:11" level=info msg="Stopping measurement: podLatency" file="factory.go:158"
time="2025-11-15 00:42:11" level=info msg="Evaluating latency thresholds" file="metrics.go:48"
time="2025-11-15 00:42:11" level=info msg="rds: ContainersReady 99th: 14000 max: 14000 avg: 5980" file="base_measurement.go:111"
time="2025-11-15 00:42:11" level=info msg="rds: Initialized 99th: 500 max: 1000 avg: 19" file="base_measurement.go:111"
time="2025-11-15 00:42:11" level=info msg="rds: Ready 99th: 14000 max: 14000 avg: 5980" file="base_measurement.go:111"
time="2025-11-15 00:42:11" level=info msg="rds: PodReadyToStartContainers 99th: 0 max: 0 avg: 0" file="base_measurement.go:111"
time="2025-11-15 00:42:11" level=info msg="rds: PodScheduled 99th: 0 max: 0 avg: 0" file="base_measurement.go:111"
time="2025-11-15 00:42:11" level=info msg="Finished execution with UUID: f757642c-2c00-4989-96f5-3ffec5efebde" file="job.go:264"
time="2025-11-15 00:42:11" level=info msg="👋 kube-burner run completed with rc 0 for UUID f757642c-2c00-4989-96f5-3ffec5efebde" file="helpers.go:105"

Next Steps

To effectively tackle this bug, a thorough investigation is essential. We need to:

  • Examine the Code: Delve deep into the kube-burner-ocp codebase, specifically the modules responsible for managing SriovNetworks and handling the rds-core workload.
  • Analyze the Logs: Scrutinize the logs generated during the workload execution, paying close attention to the timestamps and sequence of events surrounding the creation and potential deletion of SriovNetworks.
  • Reproduce the Issue: Consistently reproduce the bug in a controlled environment to facilitate debugging and testing of potential solutions.
  • Implement a Fix: Develop a solution that either prevents the unintended deletion of SriovNetworks or corrects the misleading log message.
  • Test Thoroughly: Rigorously test the fix to ensure it resolves the issue without introducing any new problems.

By addressing this bug, we can enhance the reliability and trustworthiness of kube-burner-ocp, making it an even more valuable tool for benchmarking and automating Kubernetes deployments. Let's work together to get this sorted out!