Vshard Rebalancer Errors Hidden By Log Ratelimiter
Hey guys, let's dive into a pretty critical issue that Tarantool users, especially those running Vshard for distributed data, need to be aware of. We're talking about a scenario where serious Vshard rebalancer errors are getting completely hidden or suppressed by the log_ratelimiter in Tarantool. This isn't just a minor logging glitch; it’s a significant problem that can mask underlying data distribution issues, replication failures, and cluster instability, making it incredibly tough to diagnose and fix problems in production environments. Imagine your cluster silently struggling with sending buckets, receiving buckets, or garbage buckets – critical states that demand immediate attention – but your logs stay quiet, tricking you into believing everything is hunky-dory. This bug, observed in Tarantool 3.5.0 with Vshard 0.1.36, specifically highlights how the log_ratelimiter, intended to prevent log spam, inadvertently becomes a blindfold, preventing administrators from seeing crucial diagnostic messages. Understanding this problem is key to maintaining the health and reliability of your distributed Tarantool setup, ensuring that when something goes wrong with the Vshard rebalancer, you actually get to hear about it loud and clear, rather than having the warning signs swept under the rug. Let's unpack why this happens and what it means for your operations.
Understanding the Core Problem: Hidden Rebalancer Errors
When you're running a distributed system like Tarantool with Vshard, the Vshard rebalancer is your unsung hero, constantly working behind the scenes to ensure your data buckets are evenly distributed across your replica sets. It’s responsible for moving data around, recovering from failures, and maintaining optimal performance. But what happens when this critical component starts encountering Vshard rebalancer errors and you don't even know about it? That's precisely the core problem we're tackling today: the log_ratelimiter inadvertently hiding critical Vshard rebalancer errors, turning what should be a transparent operation into a black box of potential issues. The log_ratelimiter is a feature designed with good intentions—to prevent your logs from being flooded with repetitive error messages, which can happen in highly active systems. It groups identical errors by their err_type and err_code and only logs them periodically. However, the Achilles' heel here is that many distinct and serious Vshard rebalancer errors, even though they describe different underlying problems (like a replica sending buckets, receiving buckets, or being inactive), all end up sharing the same err_type and err_code. This means that once the ratelimiter sees one of these errors, it considers subsequent, different rebalancer errors with the same err_code as duplicates, and simply suppresses them. So, you might see one initial warning, but then a cascade of subsequent, more specific, and potentially more severe Vshard rebalancer errors related to data movement or consistency will go completely undocumented in your logs. Imagine the nightmare of trying to debug a slow or inconsistent cluster when the very tools designed to tell you what's wrong are actively obscuring the truth. This can lead to prolonged outages, unnoticed data inconsistencies, and a massive headache for anyone responsible for keeping the Tarantool cluster healthy. We're talking about errors like "Replica %s has sending / receiving / garbage buckets" – which are incredibly important for diagnosing stuck rebalancing operations or orphaned data – disappearing into the void, making it impossible to understand the true state of your Vshard cluster. This situation significantly impacts the observability and maintainability of your Tarantool Vshard deployments, turning minor glitches into prolonged investigations due to a lack of critical diagnostic information.
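To make that grouping behavior concrete, here is a tiny illustrative sketch of a ratelimiter keyed only by err_type and err_code. To be clear, this is not vshard's actual log_ratelimit.lua, and the names (log_ratelimited, seen, INTERVAL) are made up for the example; it just shows why two errors that differ only in their message text can collapse into a single, suppressed log entry.

-- Toy illustration of rate-limiting keyed by (err_type, err_code).
-- NOT vshard's real log_ratelimit implementation, only a sketch of the
-- principle: the key ignores the message, so distinct errors collide.
local log = require('log')

local seen = {}          -- key -> timestamp of the last emitted log
local INTERVAL = 60      -- seconds between repeated logs for the same key

local function log_ratelimited(err)
    -- "Rebalancer is not active..." and "Replica ... has sending buckets"
    -- share the same type and code, so they map to the same slot here.
    local key = err.type .. ':' .. tostring(err.code)
    local now = os.time()
    if seen[key] == nil or now - seen[key] >= INTERVAL then
        seen[key] = now
        log.error('Error during downloading rebalancer states: %s', err.message)
    end
    -- Otherwise the message is silently dropped.
end

-- Only the first call reaches the log; the second is considered a duplicate.
log_ratelimited({type = 'ClientError', code = 32,
                 message = 'Rebalancer is not active or is in progress'})
log_ratelimited({type = 'ClientError', code = 32,
                 message = 'Replica ... has sending buckets'})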
Diving Deeper: The Specifics of the Vshard Bug
Let’s get into the nitty-gritty of the specific Vshard rebalancer errors that are being impacted by this log_ratelimiter behavior. Understanding these details will shed light on why this isn't just a cosmetic issue, but a genuine threat to the stability and debuggability of your distributed Tarantool systems. The core of the issue lies in how different Vshard rebalancer error messages, despite indicating distinct problems, are categorized under the hood with identical error types and codes. This uniformity, while perhaps simplifying error handling in some contexts, becomes a critical flaw when combined with rate-limiting, as it effectively lumps distinct critical alerts into a single, often suppressed, log entry. We're talking about scenarios where a replica might be in a problematic state, actively sending buckets or receiving buckets, or even having garbage buckets – all of which are severe indicators that require immediate attention. Yet, because these distinct issues share the same underlying error classification, the log_ratelimiter sees them as repetitive occurrences of the same error. This means that after the first instance is logged, subsequent and potentially more critical alerts regarding these other problematic bucket states are simply dropped. For example, if your cluster initially logs "Rebalancer is not active...", and then moments later, a replica enters a state where it has sending buckets, that crucial "Replica %s has sending buckets" message might never make it to your log files. This creates a severe blind spot in monitoring and debugging, as the Tarantool operator is deprived of vital, real-time information about the cluster's health and ongoing Vshard rebalancer operations. Effectively, the log_ratelimiter is inadvertently acting as a filter for essential diagnostic data, forcing engineers to work harder, or even guess, what the actual problem might be, rather than being explicitly informed by the logs. This issue underscores the importance of granular error typing and coding in complex distributed systems, especially when log aggregation and filtering mechanisms are in play, to ensure that every unique critical event is properly surfaced and recorded for subsequent analysis and troubleshooting.
The "Rebalancer is not active..." Mystery
One of the most confusing Vshard rebalancer errors surfacing due to this log_ratelimiter bug is the "Rebalancer is not active or is in progress" message. Now, this error itself isn't inherently problematic – it tells you the rebalancer isn't doing its thing, which could be by design or a temporary state. The real mystery, and the actual bug here, is why this particular error appears in the logs at all in certain scenarios, especially when the rebalancer_request_state function, according to the original bug report, is not supposed to return this specific error type. Guys, this throws a serious wrench into debugging! If a function that's supposed to report the rebalancer's state isn't explicitly generating this message, then its appearance implies an underlying logical inconsistency or an unexpected state transition within the Vshard rebalancer itself. When you enable the rebalancer and wake it up, you expect it to start working or report specific, actionable errors. Instead, you get this vague message, and once the ratelimiter has logged it, the more specific and critical error messages that follow get suppressed. This creates a chain reaction: a less specific, potentially misleading error consumes the log slot, and then the truly insightful diagnostic messages about sending buckets or receiving buckets are never seen. For anyone trying to monitor a Tarantool Vshard cluster, seeing "Rebalancer is not active or is in progress" is unhelpful when you've just explicitly enabled it. It forces you to question the very state of your system – is it really inactive, or is it trying to do something but encountering an unlogged error? This ambiguity wastes precious time in high-pressure situations and makes it incredibly difficult to automate responses or build reliable monitoring alerts. It's like your car's check engine light coming on, but instead of telling you "Brake fluid low" or "Engine overheating", it just says "Car is not working optimally" – and then refuses to show any further, more specific warnings. This lack of clarity and the subsequent suppression of more detailed Vshard rebalancer errors can lead to serious operational challenges, making it harder for developers and system administrators to pinpoint the root cause of Tarantool Vshard performance issues or data synchronization problems.
The "Replica has sending/receiving/garbage buckets" Concealment
This is where the log_ratelimiter truly becomes a silent saboteur, concealing critical Vshard rebalancer errors that demand immediate attention. Errors like "Replica %s has sending buckets", "Replica %s has receiving buckets", or "Replica %s has garbage buckets" are not just informational; they are alerts signifying that a replica is in a problematic state regarding its data buckets. A replica with sending buckets means it's actively trying to move data, but it could be stuck or failing. Receiving buckets indicates it's supposed to be getting data, but again, this process might be stalled. And garbage buckets? That's typically a red flag, pointing to orphaned or inconsistent data that needs cleanup. These specific Vshard rebalancer errors are absolutely crucial for understanding the health and data integrity of your Tarantool Vshard cluster. However, because these distinct issues, originating from rebalancer_request_state, unfortunately share the same err_type and err_code as the less specific "Rebalancer is not active..." message, the log_ratelimiter groups them all together. What happens? If the log_ratelimiter has recently logged the generic "Rebalancer is not active..." error, it will then suppress subsequent, more specific, and infinitely more useful messages like "Replica %s has sending buckets". This creates a terrifying debugging scenario. You might have a replica that's genuinely stuck, unable to complete its data transfers, leading to potential data inconsistencies or performance degradation, but your logs will remain eerily quiet. The expected t.assert(g.replica_1_a:grep_log(log)) in the reproducer, which looks for the "Replica %s has sending buckets" log, simply hangs because that critical log line never appears. This isn't just inconvenient; it’s a direct impediment to proactive monitoring and rapid incident response. Operators are left in the dark, unable to diagnose why their cluster might be underperforming or showing signs of data distribution issues. The ability to see clear, distinct error messages for different problematic bucket states is fundamental for maintaining a healthy and reliable distributed database. The current behavior essentially mutes the most important warnings, forcing administrators to manually inspect cluster states or rely on external metrics rather than straightforward, explicit log entries, making the Tarantool Vshard debugging process far more arduous and prone to oversight.
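Because the logs stay silent, a practical stopgap is to inspect the bucket states directly on the storage instance. Here is a minimal sketch of such a manual check; it only relies on the _bucket format used in the reproducer below (field 1 is the bucket id, field 2 is the status), and the function name bucket_status_counts is made up for illustration.

-- Minimal manual check to run on a storage when the logs say nothing:
-- count buckets by status straight from box.space._bucket.
local function bucket_status_counts()
    local counts = {}
    for _, tuple in box.space._bucket:pairs() do
        local status = tuple[2]  -- field 2 is the bucket status
        counts[status] = (counts[status] or 0) + 1
    end
    return counts
end

-- Example: a result like {active = 9, sending = 1} would reveal the stuck
-- 'sending' bucket that never made it into the rate-limited log.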
Replicating the Issue: A Step-by-Step Guide
Alright, let's talk about how to actually reproduce this Vshard rebalancer error logging issue so you can see it in action (or rather, not see it in your logs!). We're using luatest for this, a fantastic tool for testing Tarantool applications. This reproducer demonstrates the bug quite clearly, showcasing how the log_ratelimiter suppresses vital error messages. We're setting up a minimal Tarantool Vshard cluster to isolate the problem. The environment consists of Tarantool version: 3.5.0-entrypoint-128-g71766c8bf9 and Vshard version: 0.1.36, running on Linux, x86-64. It’s important to note these versions, as the behavior might differ in other releases, though the underlying mechanism of log_ratelimiter often remains consistent. First off, we'll configure a basic sharded setup with two replica sets, each having a single master. We set bucket_count to 10 and a replication_timeout of 0.1 for quick transitions. Before we start messing with anything, we use vtest.cluster_rebalancer_disable(g) to make sure the rebalancer is initially inactive. This gives us a clean slate to trigger the desired states. After bootstrapping the cluster, the real magic (or rather, the real bug) begins. We then intentionally enable the rebalancer on replica_1_a and immediately call vshard.storage.rebalancer_wakeup(). The first t.helpers.retrying block then asserts that we do see the generic "Rebalancer is not active or is in progress" error in replica_1_a's logs. This confirms the initial, less specific error is logged. Now, for the critical part: we manually update a bucket's state in box.space._bucket on replica_1_a to "sending" (specifically, box.space._bucket:update(1, {{'=', 2, 'sending'}})). This forces replica_1_a into a state where it has sending buckets. We then call vshard.storage.rebalancer_wakeup() again, expecting it to process this new state and log the appropriate "Replica %s has sending buckets" error. However, the subsequent t.helpers.retrying block, which tries to grep_log for this specific "Replica %s has sending buckets" message, hangs. Why? Because the log_ratelimiter has already logged the generic "Rebalancer is not active..." error with the same err_type and err_code, and it's now suppressing this new, distinct, and highly critical error message. This exact sequence clearly demonstrates how different Vshard rebalancer errors are obscured by the rate-limiting mechanism, leaving a significant gap in your diagnostic capabilities. The code snippet below illustrates this precise flow, making it straightforward to confirm the problem in your own environment:
local t = require('luatest')
local vtest = require('test.luatest_helpers.vtest')
local vutil = require('vshard.util')
local consts = require('vshard.consts')
local test_group = t.group('rebalancer')
local cfg_template = {
sharding = {
{
replicas = {
replica_1_a = {
master = true,
},
},
},
{
replicas = {
replica_2_a = {
master = true,
},
},
},
},
bucket_count = 10,
replication_timeout = 0.1,
}
local global_cfg
test_group.before_all(function(g)
global_cfg = vtest.config_new(cfg_template)
vtest.cluster_new(g, global_cfg)
vtest.cluster_bootstrap(g, global_cfg)
vtest.cluster_rebalancer_disable(g)
end)
test_group.after_all(function(g)
g.cluster:drop()
end)
test_group.test_rebalancer_errors = function(g)
g.replica_1_a:exec(function()
vshard.storage.rebalancer_enable()
vshard.storage.rebalancer_wakeup()
end)
t.helpers.retrying({}, function()
t.assert(g.replica_1_a:grep_log('Rebalancer is not active ' ..
'or is in progress')) -- <--- Why does this error appear at all?
end)
g.replica_1_a:exec(function()
box.space._bucket:update(1, {{'=', 2, 'sending'}})
vshard.storage.rebalancer_wakeup()
end)
t.helpers.retrying({}, function()
local log = string.format('Replica %s has sending buckets',
g.replica_1_a:replicaset_uuid())
t.assert(g.replica_1_a:grep_log(log)) -- <--- Hangs: "Replica %s has sending buckets" never appears in the logs
end)
end
The Actual vs. Expected Outcome: What We See and What We Need
Now that we’ve walked through how to reproduce this Vshard rebalancer error logging issue, let’s compare what actually lands in our logs versus what we should be seeing. This comparison is critical to understanding the severity of the problem and the kind of actionable information we're currently missing due to the log_ratelimiter's behavior. In the actual result, after running the reproducer, you'll typically see something like this in your Tarantool logs:
2025-11-14 13:30:29.081 [49832] main/126/vshard.rebalancer/vshard.log_ratelimit log_ratelimit.lua:138 E> Error during downloading rebalancer states: {"replicaset_id":"00000000-0000-0000-0000-000000000004","trace":[{"file":"/home/mrforza/Desktop/vshard/vshard/error.lua","line":284}],"code":32,"base_type":"ClientError","type":"ClientError","details":"Rebalancer is not active or is in progress","name":"PROC_LUA","message":"Rebalancer is not active or is in progress"}
2025-11-14 13:30:29.081 [49832] main/131/vshard.ratelimit_flush/vshard.util util.lua:114 I> ratelimit_flush_f has been started
2025-11-14 13:30:33.658 [49832] main/125/vshard.recovery/vshard.storage init.lua:951 I> Starting sending buckets recovery step
2025-11-14 13:30:33.658 [49832] main/125/vshard.recovery/vshard.storage init.lua:952 W> Can not find for bucket 1 its peer nil
Notice that first error message: "Rebalancer is not active or is in progress". This is the generic error that the log_ratelimiter does log. But here's the catch: after we intentionally put replica_1_a into a "sending" bucket state and wake up the rebalancer again, we don't see the specific error "Replica %s has sending buckets". The log output remains silent on this critical development. The log just continues with other informational messages or warnings, completely oblivious to the fact that a replica is in a problematic state related to data movement. This silence is deafening for anyone trying to diagnose Tarantool Vshard issues in real-time. What we expect to see, and what would provide invaluable diagnostic information, is a log output that includes both the initial, generic message and the subsequent, highly specific Vshard rebalancer errors. The ideal, expected behavior in the logs would look something like this:
2025-11-14 13:30:29.081 [49832] main/126/vshard.rebalancer/vshard.log_ratelimit log_ratelimit.lua:138 E> Error during downloading rebalancer states: {"replicaset_id":"00000000-0000-0000-0000-000000000004","trace":[{"file":"/home/mrforza/Desktop/vshard/vshard/error.lua","line":284}],"code":32,"base_type":"ClientError","type":"ClientError","details":"Rebalancer is not active or is in progress","name":"PROC_LUA","message":"Rebalancer is not active or is in progress"}
2025-11-14 13:30:29.081 [49832] main/131/vshard.ratelimit_flush/vshard.util util.lua:114 I> ratelimit_flush_f has been started
2025-11-14 13:30:33.658 [49832] main/125/vshard.recovery/vshard.storage init.lua:951 I> Starting sending buckets recovery step
2025-11-14 13:30:33.658 [49832] main/125/vshard.recovery/vshard.storage init.lua:952 W> Can not find for bucket 1 its peer nil
.
.
.
2025-11-14 13:30:29.081 [49832] main/126/vshard.rebalancer/vshard.log_ratelimit log_ratelimit.lua:138 E> Error during downloading rebalancer states: {"replicaset_id":"00000000-0000-0000-0000-000000000004","trace":[{"file":"/home/mrforza/Desktop/vshard/vshard/error.lua","line":284}],"code":32,"base_type":"ClientError","type":"ClientError","details":"Replica 00000000-0000-0000-0000-000000000004 has sending buckets","name":"PROC_LUA","message":"Replica 00000000-0000-0000-0000-000000000004 has sending buckets"}
See that last line in the expected output? "Replica 00000000-0000-0000-0000-000000000004 has sending buckets". This is the crucial, actionable information that is currently missing. Having this specific error message would immediately tell an operator: "Hey, this particular replica is stuck trying to send buckets. Time to investigate!" Without it, you're flying blind, relying on guesswork or deeper, more intrusive monitoring tools to figure out a problem that should be clearly evident in the logs. This distinction between the actual and expected behavior highlights a significant deficiency in how Vshard rebalancer errors are logged, directly impacting the ability to effectively manage and troubleshoot Tarantool Vshard clusters. The goal is clear: ensure that all distinct and critical rebalancer states are properly and individually logged, even when a log_ratelimiter is in effect, to provide comprehensive visibility into Tarantool data distribution and replication health.
Why This Matters to You (and Your Data!)
Okay, so we've dug into the technical specifics, but let's be real: why should you, as a Tarantool user or operator, care about this Vshard rebalancer error logging bug? This isn't just a minor annoyance; it has profound implications for the reliability, maintainability, and ultimately, the integrity of your data in a Tarantool Vshard cluster. Imagine running a critical production system where your Vshard rebalancer is silently struggling. Maybe some buckets are stuck sending, others are receiving indefinitely, or even worse, garbage buckets are accumulating – all indicating potential data inconsistencies or even data loss risks. If these critical Vshard rebalancer errors are hidden by the log_ratelimiter, you simply won't know about them. This lack of visibility can lead to unnoticed cluster failures, prolonged data synchronization issues, and a significant increase in your mean time to recovery (MTTR) when something eventually breaks. You'll spend hours, perhaps days, manually sifting through metrics, running box.info commands, and performing deep diagnostics, trying to figure out a problem that should have been immediately obvious from your logs. In a distributed system, clear and comprehensive logging is your first line of defense. It's the voice of your infrastructure, telling you when something's amiss. When that voice is muffled or silenced, you're effectively operating in the dark. This directly impacts Tarantool data integrity and Vshard cluster health. Unresolved sending or receiving bucket states can lead to inconsistencies between replicas, causing applications to read stale or incorrect data. Garbage buckets can signify data that is no longer properly managed, consuming disk space and potentially hindering future rebalancing operations. For businesses relying on Tarantool for high-performance, critical data storage, this bug poses a serious risk. It undermines the very promise of a robust and self-healing distributed database, as the self-healing mechanisms themselves might be failing without alerting anyone. Guys, always remember: logs are your most valuable debugging tool. If they're not telling you the full story, you're at a disadvantage. This makes it absolutely essential for this Vshard rebalancer error logging issue to be addressed, ensuring that all distinct critical states are properly and individually logged, regardless of rate-limiting. For now, it’s a strong reminder to not solely rely on high-level log_ratelimiter outputs for Tarantool Vshard monitoring, but to also implement deeper, state-based checks to confirm the true health of your data distribution and replication processes. Stay vigilant, monitor closely, and advocate for clearer logging to protect your data and sanity!
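Until a fix lands, one way to compensate for the muted logs is a small watchdog fiber on each storage that surfaces problematic bucket states itself, bypassing the rate-limited rebalancer error path entirely. Below is a minimal sketch, assuming vshard.storage.info() exposes per-status bucket counters (bucket.sending, bucket.receiving, bucket.garbage) as in current vshard releases – double-check the field names against your version; the function name bucket_watchdog and the check interval are made up for this example.

-- Watchdog sketch: periodically log problematic bucket states ourselves,
-- so they are visible even when the rebalancer's own errors are rate-limited.
-- Assumes vshard.storage.info().bucket has 'sending', 'receiving' and
-- 'garbage' counters; verify against your vshard version.
local vshard = require('vshard')
local fiber = require('fiber')
local log = require('log')

local CHECK_INTERVAL = 5  -- seconds between checks, tune to taste

local function bucket_watchdog()
    while true do
        local info = vshard.storage.info()
        local b = info.bucket or {}
        if (b.sending or 0) > 0 or (b.receiving or 0) > 0
                or (b.garbage or 0) > 0 then
            log.warn('bucket watchdog: sending=%s receiving=%s garbage=%s',
                     tostring(b.sending), tostring(b.receiving),
                     tostring(b.garbage))
        end
        fiber.sleep(CHECK_INTERVAL)
    end
end

-- Start it on each storage after vshard.storage.cfg() has been called:
-- fiber.create(bucket_watchdog)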
Conclusion
To wrap things up, the issue where the log_ratelimiter in Tarantool 3.5.0 with Vshard 0.1.36 hides serious Vshard rebalancer errors is more than just a minor inconvenience; it's a critical bug impacting the observability and reliability of distributed Tarantool systems. By suppressing vital messages about sending, receiving, or garbage buckets, the log_ratelimiter inadvertently creates blind spots that can lead to unnoticed data inconsistencies, prolonged debugging efforts, and increased operational risk. For any serious Tarantool deployment, especially those leveraging Vshard, having clear, distinct, and unfiltered error logs for every unique problematic state of the Vshard rebalancer is non-negotiable. Addressing this bug will significantly enhance the ability of administrators and developers to proactively monitor and quickly resolve issues, ensuring the long-term health and data integrity of their Tarantool Vshard clusters. Let's push for this fix to ensure our distributed systems are truly transparent and resilient!