Consolidate Fabric Reliability With MGD Relaxed Mode
This article discusses consolidating the FabricReliabilityMode with the MGD (Mesh Graph Descriptor) relaxed mode in the context of Tenstorrent's tt-metal project. The goal is to simplify and unify the control of "relaxed" behavior within the fabric system, addressing potential confusion and inconsistencies arising from having two separate mechanisms.
Problem Statement: Two Paths to Relaxed Behavior
Currently, there are two distinct methods to manage "relaxed" behavior in the fabric system. This can create confusion and potential inconsistencies. Let's break it down:
- Mesh Graph Descriptor Channel Policy: Defined in tt_metal/fabric/mesh_graph_descriptors/*.textproto using channels { count: 4 policy: RELAXED }.
- Fabric Reliability Mode: Configured at runtime using FabricReliabilityMode::RELAXED_SYSTEM_HEALTH_SETUP_MODE.
Both settings influence how the system handles connection requirements, but they operate at different levels and overlap in their concerns. It's like having two different steering wheels in your car – you want them to work together, not against each other!
Current Behavior: A Deep Dive
Let's examine each mechanism in detail to understand how they work and where they differ.
1. Mesh Graph Descriptor Channel Policy
This policy is set within the mesh graph descriptor textproto files. Think of these files as blueprints for how the mesh network should be connected. A typical example can be found in dual_galaxy_mesh_graph_descriptor.textproto around line 8, which includes channels { count: 4 policy: RELAXED }. The key details are:
- Location: tt_metal/fabric/mesh_graph_descriptors/*.textproto
- Storage: Stored in MeshGraph::intra_mesh_relaxed_policy_map. You can see this in mesh_graph.cpp:279-280.
- Access: Accessed using MeshGraph::is_intra_mesh_policy_relaxed(MeshId), as shown in mesh_graph.cpp:603-607.
Impact on Topology Mapper: The topology mapper, found in topology_mapper.cpp:433-436, uses this policy to determine how strictly it needs to adhere to the connection counts specified in the mesh graph descriptor.
    bool relaxed = mesh_graph_.is_intra_mesh_policy_relaxed(mesh_id);
    for (const auto& [neighbor_chip_id, edge] : adjacent_map) {
        size_t repeat_count = relaxed ? 1 : edge.connected_chip_ids.size();
        // ...
    }
- RELAXED Mode: The topology mapper only requires one connection per neighbor, effectively ignoring the count field in the mesh graph descriptor.
- STRICT Mode (default): The topology mapper demands all connections specified by count, fully utilizing edge.connected_chip_ids.size().
Purpose: This policy gives the topology mapper the flexibility to map a logical mesh onto a physical topology, even if the physical topology has fewer connections than the logical mesh specifies. It’s like saying, "Do your best to connect everything, but if you can't connect it all, that's okay."
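The rule above can be restated as a small standalone sketch. This is not the tt-metal implementation: ChannelPolicy, required_links, and neighbor_is_mappable are hypothetical names introduced only for illustration, and descriptor_count stands in for edge.connected_chip_ids.size().

```cpp
#include <cstddef>

// Hypothetical stand-in for the MGD channel policy (not the real tt-metal type).
enum class ChannelPolicy { STRICT, RELAXED };

// Number of links the mapper insists on for one neighbor.
// descriptor_count plays the role of edge.connected_chip_ids.size().
constexpr std::size_t required_links(ChannelPolicy policy, std::size_t descriptor_count) {
    return policy == ChannelPolicy::RELAXED ? 1 : descriptor_count;
}

// A neighbor can be mapped when the physical topology offers at least
// as many links as the policy requires.
constexpr bool neighbor_is_mappable(
    ChannelPolicy policy, std::size_t descriptor_count, std::size_t physical_links) {
    return physical_links >= required_links(policy, descriptor_count);
}
```

Under these assumptions, a descriptor that asks for 4 channels still maps onto hardware with 2 physical links when the policy is RELAXED, but not when it is STRICT.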
2. Fabric Reliability Mode
This mode is configured at runtime. It dictates how strictly the system validates the health of the fabric during initialization. You can find an example of this in test_routing_tables.cpp:24, where kReliabilityMode is set to tt::tt_fabric::FabricReliabilityMode::STRICT_SYSTEM_HEALTH_SETUP_MODE. Here’s what it entails:
- Location: Runtime configuration (e.g., test_routing_tables.cpp:24).
- Control: Passed to ControlPlane::configure_routing_tables_for_fabric_ethernet_channels(). This setting governs system health validation during the fabric's initialization phase.
Impact on Control Plane: The control plane, specifically control_plane.cpp:900-919, checks this mode when it validates each expected connection:
    bool connections_exist = connected_chips_and_eth_cores.find(physical_connected_chip_id) !=
                             connected_chips_and_eth_cores.end();
    TT_FATAL(
        connections_exist ||
            reliability_mode != tt::tt_fabric::FabricReliabilityMode::STRICT_SYSTEM_HEALTH_SETUP_MODE,
        "Expected connections to exist for M{}D{} to D{}",
        mesh_id, fabric_chip_id, logical_connected_chip_id);
    if (!connections_exist) {
        continue;  // Skip missing connections in RELAXED mode
    }
    if (reliability_mode == tt::tt_fabric::FabricReliabilityMode::STRICT_SYSTEM_HEALTH_SETUP_MODE) {
        TT_FATAL(
            connected_eth_cores.size() >= edge.connected_chip_ids.size(),
            "Expected {} eth links from physical chip {} to physical chip {}",
            edge.connected_chip_ids.size(),
            physical_chip_id,
            physical_connected_chip_id);
    }
- STRICT Mode: The system initialization fails if any expected connection is missing or if fewer Ethernet links exist than specified in the mesh graph descriptor. The number of routing planes is determined based on the actual Ethernet channels found.
- RELAXED Mode: The system initialization continues even if connections are missing or fewer Ethernet links exist than expected. Missing connections are skipped during the routing table configuration.
Purpose: This mode determines whether the system can initialize with degraded connectivity, such as missing links or devices. It's the system's gatekeeper, deciding whether to proceed even if everything isn't perfect.
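To make the decision logic easier to follow, here is a minimal sketch that models the same checks as a pure function returning an outcome instead of calling TT_FATAL. ReliabilityMode, LinkCheck, and check_neighbor are hypothetical names, and found_links == 0 is used as an assumed stand-in for "no entry for this neighbor in connected_chips_and_eth_cores".

```cpp
#include <cstddef>

// Hypothetical stand-ins; the real enum is tt::tt_fabric::FabricReliabilityMode.
enum class ReliabilityMode { STRICT, RELAXED };
enum class LinkCheck { OK, SKIP, FATAL };

// Models the per-neighbor validation:
//  - no links found: fatal in STRICT, skipped otherwise
//  - fewer links than expected: fatal in STRICT, accepted otherwise
constexpr LinkCheck check_neighbor(
    ReliabilityMode mode, std::size_t found_links, std::size_t expected_links) {
    return found_links == 0
               ? (mode == ReliabilityMode::STRICT ? LinkCheck::FATAL : LinkCheck::SKIP)
               : (mode == ReliabilityMode::STRICT && found_links < expected_links
                      ? LinkCheck::FATAL
                      : LinkCheck::OK);
}
```

In this model a neighbor with 2 of 4 expected links is FATAL under STRICT and OK under RELAXED, while a neighbor with no links at all is skipped under RELAXED.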
Key Differences: A Side-by-Side Comparison
To highlight the distinctions, here's a table summarizing the key differences between the two settings:
| Aspect | Mesh Graph Descriptor Policy | Fabric Reliability Mode |
|---|---|---|
| Scope | Per-mesh, topology mapping | System-wide, runtime initialization |
| When Applied | During topology mapping (logical→physical) | During fabric initialization (routing table config) |
| What it Relaxes | Connection count requirements for mapping | Connection existence validation |
| Failure Mode | Mapping fails → no valid topology found | Initialization fails → system cannot start |
| Granularity | Mesh-level | System-level |
Overlap and Confusion: The Problem with Two Settings
Both settings share some common ground:
- Both allow the system to function with fewer connections than originally specified.
- Both operate using STRICT vs RELAXED semantics.
- Both influence whether the system mandates precise connection counts.
However, the timing of their operation differs:
- The Mesh Graph Descriptor Policy impacts the success of topology mapping.
- The Fabric Reliability Mode determines whether initialization is successful post-mapping.
This discrepancy can lead to inconsistencies. For instance, a mesh might be successfully mapped with a RELAXED policy (allowing for fewer connections), but initialization could subsequently fail if the STRICT reliability mode is enabled (requiring all connections to be present).
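The mismatch can be shown with a toy model of the two stages. Everything here is illustrative: mapping_succeeds, init_succeeds, and settings_conflict are hypothetical helpers that compress each stage into a single per-neighbor link-count check, which is a simplification of the real code paths.

```cpp
#include <cstddef>

// Topology mapping: a RELAXED policy needs one link per neighbor,
// a STRICT policy needs every link the descriptor lists.
constexpr bool mapping_succeeds(bool policy_relaxed, std::size_t expected, std::size_t found) {
    return found >= (policy_relaxed ? 1 : expected);
}

// Fabric initialization: STRICT mode needs every expected link,
// relaxed mode tolerates missing ones.
constexpr bool init_succeeds(bool mode_relaxed, std::size_t expected, std::size_t found) {
    return mode_relaxed || found >= expected;
}

// True when mapping passes but initialization then fails.
constexpr bool settings_conflict(
    bool policy_relaxed, bool mode_relaxed, std::size_t expected, std::size_t found) {
    return mapping_succeeds(policy_relaxed, expected, found) &&
           !init_succeeds(mode_relaxed, expected, found);
}
```

With 4 expected links and 2 found, settings_conflict(true, false, 4, 2) is true: the RELAXED policy lets mapping through, and the STRICT mode then rejects the same hardware. When both settings are relaxed, or both are strict, the two stages agree in this model.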
Proposed Solution: Unification Through MGD
The Solution: Consolidate the control by relying solely on the Mesh Graph Descriptor's channels.policy setting. This makes the MGD the central authority for defining relaxed/strict behavior.
Decision: MGD as the Single Source of Truth
The MGD (Mesh Graph Descriptor) should be the single source of truth for relaxed/strict behavior:
- Keep channels.policy in mesh graph descriptors (per-mesh control).
- Remove FabricReliabilityMode parameter from runtime APIs (or derive it from MGD policy).
- Use MGD policy for both:
  - Topology mapping (already implemented).
  - Runtime initialization (needs to be updated).
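For the "derive it from MGD policy" option, one possible shape is a small translation helper. This is a sketch under assumptions, not the project's API: ChannelPolicy and reliability_from_policy are hypothetical, and the FabricReliabilityMode below is a local stand-in that mirrors only the two enumerator names quoted in this article (the real enum in fabric_types.hpp may contain others).

```cpp
// Hypothetical stand-in for the per-mesh MGD channel policy.
enum class ChannelPolicy { STRICT, RELAXED };

// Local stand-in mirroring the two enumerator names quoted in the article.
// The real definition lives in tt_metal/api/tt-metalium/fabric_types.hpp.
enum class FabricReliabilityMode {
    STRICT_SYSTEM_HEALTH_SETUP_MODE,
    RELAXED_SYSTEM_HEALTH_SETUP_MODE,
};

// Derive the runtime mode from the descriptor so the two can never disagree.
constexpr FabricReliabilityMode reliability_from_policy(ChannelPolicy policy) {
    return policy == ChannelPolicy::RELAXED
               ? FabricReliabilityMode::RELAXED_SYSTEM_HEALTH_SETUP_MODE
               : FabricReliabilityMode::STRICT_SYSTEM_HEALTH_SETUP_MODE;
}
```

A helper like this would let existing control-plane code keep branching on the reliability mode internally while callers stop passing it, which may be a smaller first step than removing the enum outright.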
Code Changes Summary: Implementation Details
Here's a breakdown of the code modifications required to implement this solution:
Files to modify:
- tt_metal/fabric/control_plane.cpp - Remove reliability mode parameter, use MGD policy.
- tt_metal/fabric/control_plane.hpp - Update method signatures.
- tt_metal/fabric/fabric_host_utils.cpp - Remove reliability mode from fabric setup.
- tt_metal/api/tt-metalium/fabric.hpp - Update public API.
- tests/tt_metal/tt_fabric/fabric_router/test_routing_tables.cpp - Remove reliability mode usage.
- tests/tt_metal/tt_fabric/common/fabric_fixture.hpp - Remove reliability mode handling.
Files that already use MGD policy (no changes needed):
- tt_metal/fabric/mesh_graph.cpp - Already reads and stores policy.
- tt_metal/fabric/topology_mapper.cpp - Already uses MGD policy for mapping.
References: Further Reading
For more in-depth information, refer to these code locations:
- tt_metal/fabric/mesh_graph.cpp:279-280 - Setting relaxed policy from MGD
- tt_metal/fabric/mesh_graph.cpp:603-607 - Getting relaxed policy
- tt_metal/fabric/topology_mapper.cpp:433-436 - Using relaxed policy in topology mapper
- tt_metal/fabric/control_plane.cpp:900-919 - Using reliability mode in control plane
- tt_metal/api/tt-metalium/fabric_types.hpp:51-64 - FabricReliabilityMode enum definition
- tt_metal/fabric/protobuf/mesh_graph_descriptor.proto:315-335 - Channels and Policy definitions
By consolidating the control of relaxed behavior into the MGD channel policy, the Tenstorrent tt-metal project can achieve a more consistent, simplified, and manageable fabric system. With a single setting governing both topology mapping and initialization, the two stages can no longer disagree, which reduces the chances of unexpected behavior and makes the system easier to understand and maintain.