@jeankuo commented Oct 14, 2025

Description:
The incremental PID logic has been verified by the thermal team. This PR updates the fan service config data according to the thermal team's tuning results.

Changes:

  1. Applied OPTIC_AGGREGATION_TYPE_INCREMENTAL_PID for optics temperature management.
  2. Applied SENSOR_PWM_CALC_TYPE_INCREMENTAL_PID for CPU_UNCORE_TEMP.
  3. Applied SENSOR_PWM_CALC_TYPE_INCREMENTAL_PID for SMB_TH5_TEMP.
  4. Updated platform_manager.json: changed SMB CPLD address from 0x3e to 0x33 to enable TH5 power control.
  5. Added shutdownCondition with associated shutdownCmd for TH5.
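
For readers unfamiliar with the term, an incremental (velocity-form) PID controller computes a change in PWM each cycle instead of an absolute PWM value. The following is a minimal textbook sketch in Python, not the fan_service implementation. The class name, gain values, and PWM limits are hypothetical, and the sample period is assumed to be folded into the gains.

```python
from dataclasses import dataclass


@dataclass
class IncrementalPid:
    """Textbook velocity-form PID; all names and defaults are illustrative."""

    kp: float
    ki: float
    kd: float
    setpoint: float            # target temperature in degrees C
    pwm: float = 0.0           # last PWM output, in percent
    pwm_min: float = 0.0       # assumed lower PWM limit
    pwm_max: float = 100.0     # assumed upper PWM limit
    prev_error: float = 0.0    # e[k-1]
    prev_prev_error: float = 0.0  # e[k-2]

    def update(self, temperature: float) -> float:
        # Error is positive when the sensor is hotter than the setpoint,
        # so a positive error pushes the PWM up.
        error = temperature - self.setpoint
        # Velocity form: only the *change* in output is computed.
        delta = (
            self.kp * (error - self.prev_error)
            + self.ki * error
            + self.kd * (error - 2 * self.prev_error + self.prev_prev_error)
        )
        # Apply the increment to the previous output and clamp it.
        self.pwm = min(self.pwm_max, max(self.pwm_min, self.pwm + delta))
        self.prev_prev_error = self.prev_error
        self.prev_error = error
        return self.pwm


if __name__ == "__main__":
    pid = IncrementalPid(kp=2.0, ki=0.5, kd=0.0, setpoint=50.0, pwm=30.0)
    for temp in (52.0, 52.0, 48.0):
        print(temp, pid.update(temp))
```

Because the previous output is the integrator state, clamping the PWM also bounds the accumulated correction, which is one common reason to prefer this form for fan control.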

Test Plan:

  1. Build and deploy the latest versions of fboss components including fan_service, sensor_service, and platform_manager to ensure the updated configuration is in effect.
  2. Run platform_manager and confirm that the SMB CPLD address has been updated from 0x3e to 0x33 for TH5 power control.
  3. Start sensor_service, qsfp_service, and fan_service to ensure proper initialization and inter-service communication with the new configuration.
  4. Confirm with the thermal team that the new incremental PID logic adjusts fan speed dynamically based on temperature changes (optics, CPU, inlet, ASIC).
  5. Trigger the shutdown condition and verify that the shutdownCmd for TH5 is executed correctly.

Test Log:
1_mp3ba_platform_manager_smb_change_to_0x33.txt
2_mp3ba_thermal_team_fan_service_35C.txt
3_mp3ba_thermal_team_fan_service_35C.xlsx
4_mp3ba_thermal_team_fan_service_35C_fan3_one_rotor_failed.txt
5_mp3ba_thermal_team_fan_service_35C_fan3_one_rotor_failed.xlsx
6_mp3ba_th5_pwr_en_test_log.txt

@meta-cla bot added the CLA Signed label Oct 14, 2025
Comment on lines +125 to +134
"shutdownCondition": {
  "numOvertempSensorForShutdown": 1,
  "conditions": [
    {
      "sensorName": "SMB_TH5_TEMP",
      "overtempThreshold": 110.0,
      "slidingWindowSize": 1
    }
  ]
},

How many levels of hardware protection (without reliance on software) do we have? If we have sufficient protection in hardware itself, we don't need this here.

> How many levels of hardware protection (without reliance on software) do we have? If we have sufficient protection in hardware itself, we don't need this here.

@somasun

Currently, we have a hardware protection mechanism built into the TH5:
it automatically shuts down the device and enters reset mode when any of the
15 PVTMON sensors reaches 125°C.

Our current software shutdown threshold is set to 110°C.
This allows us to shut down the system earlier than the hardware limit,
which provides an additional safety margin.

If we want to maintain this earlier shutdown behavior,
we would still need to keep the software shutdown command in place.
Thanks.
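
To make the role of the fields in the quoted config concrete, here is a minimal Python sketch of how such a software shutdown check could be evaluated. It is an illustration only, not the fan_service code. The class and function names are hypothetical, and two behaviours are assumptions: that the sliding window is averaged, and that shutdown triggers when the average strictly exceeds the threshold.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class OvertempCondition:
    """One entry of "conditions"; semantics here are assumed, not confirmed."""

    sensor_name: str
    overtemp_threshold: float
    sliding_window_size: int
    window: deque = field(init=False)

    def __post_init__(self) -> None:
        # Keep only the most recent N readings.
        self.window = deque(maxlen=self.sliding_window_size)

    def add_reading(self, temperature: float) -> None:
        self.window.append(temperature)

    def is_overtemp(self) -> bool:
        # Assumption: wait for a full window, then compare its average.
        if len(self.window) < self.sliding_window_size:
            return False
        average = sum(self.window) / len(self.window)
        return average > self.overtemp_threshold


def should_shutdown(conditions, num_overtemp_sensors_for_shutdown: int) -> bool:
    # Shut down once enough sensors report overtemp at the same time.
    tripped = sum(1 for c in conditions if c.is_overtemp())
    return tripped >= num_overtemp_sensors_for_shutdown


if __name__ == "__main__":
    th5 = OvertempCondition("SMB_TH5_TEMP", 110.0, 1)
    for reading in (109.5, 111.0):
        th5.add_reading(reading)
        print(reading, should_shutdown([th5], 1))
```

With a window size of 1 and a count of 1, as in this PR, a single SMB_TH5_TEMP reading above 110°C would be enough to trigger the shutdown command under these assumptions, 15°C below the 125°C hardware limit.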

This is the protection provided by TH5 hardware itself. Do you have any mechanism in CPLD/FPGA which will shut down TH5 power supply when the ASIC temperature exceeds a threshold? This is without any involvement from user space software.

> This is the protection provided by TH5 hardware itself. Do you have any mechanism in CPLD/FPGA which will shut down TH5 power supply when the ASIC temperature exceeds a threshold? This is without any involvement from user space software.

Hi @somasun,
At present, the SMB CPLD does not implement any hardware protection mechanism that would automatically shut down the TH5 power supply when the ASIC temperature exceeds a certain threshold. Such control is not handled at the CPLD level.
