Monitoring and Alerting

Monitoring

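InternEvo uses internlm.monitor.monitor.initialize_monitor_manager() to initialize the monitoring context. Within it, a singleton internlm.monitor.monitor.MonitorManager instance manages the monitor thread and uses internlm.monitor.monitor.MonitorTracker to track the training lifecycle and training status.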

internlm.monitor.monitor.initialize_monitor_manager(job_name: str = None, alert_address: str = None)

Initialize the monitor manager, which monitors the training lifetime and sends exception alerts to Feishu.

Parameters:
  • job_name (str) – The training job name.

  • alert_address (str) – The Feishu webhook address for sending alert messages.
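
The pattern described above, a monitoring context that owns a background monitor thread, can be sketched as follows. This is a minimal illustration and not InternEvo's implementation: DemoMonitor and demo_monitor_manager are hypothetical stand-ins, and the sketch assumes the initializer is meant to wrap the training entry point as a context manager.

```python
import threading
from contextlib import contextmanager


class DemoMonitor:
    """Hypothetical stand-in for a monitor manager; not InternEvo's class."""

    def __init__(self):
        self._stop_event = threading.Event()
        self._thread = None
        self.checks = 0

    def _loop(self, interval_seconds):
        # Event.wait returns False on timeout and True once stop() sets the event.
        while not self._stop_event.wait(interval_seconds):
            self.checks += 1  # a real monitor would inspect job status here

    def start(self, interval_seconds):
        self._thread = threading.Thread(
            target=self._loop, args=(interval_seconds,), daemon=True
        )
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()


@contextmanager
def demo_monitor_manager(interval_seconds=0.01):
    """Start the monitor on entry and always stop it on exit."""
    monitor = DemoMonitor()
    monitor.start(interval_seconds)
    try:
        yield monitor
    finally:
        monitor.stop()  # runs even if the wrapped training code raises
```

Wrapping the training entry point in `with demo_monitor_manager() as monitor:` ties the monitor thread's lifetime to the training run, so the thread is stopped whether training finishes normally or fails.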

class internlm.monitor.monitor.MonitorManager(*args, **kwargs)

Monitor Manager for managing monitor thread and monitoring training status.

monitor_loss_spike(alert_address: str | None = None, step_count: int = 0, cur_step_loss: float = 0.0)

Check the loss value; if a loss spike occurs, send an alert message to Feishu.

monitor_exception(alert_address: str | None = None, excp_info: str | None = None)

Catch and format exception information, then send an alert message to Feishu.

handle_sigterm(alert_address: str | None = None)

Catch the SIGTERM signal and send an alert message to Feishu.

start_monitor(job_name: str, alert_address: str, monitor_interval_seconds: int = 300, loss_spike_limit: float = 1.5)

Initialize and start the monitor thread, which checks training job status, loss spikes, and so on.

Parameters:
  • job_name (str) – The training job name.

  • alert_address (str) – The Feishu webhook address for sending alert messages.

  • monitor_interval_seconds (int) – The interval between monitor checks, in seconds. Defaults to 300.

  • loss_spike_limit (float) – The maximum allowed ratio of the current loss to the previous loss; a larger ratio indicates that a loss spike may have occurred. Defaults to 1.5.

stop_monitor()

Stop the monitor and alert thread.
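
The role of loss_spike_limit can be illustrated with a small check. This is a sketch under stated assumptions rather than InternEvo's logic: is_loss_spike is a hypothetical helper, and the text does not say whether the comparison is strict or which earlier loss value serves as the baseline.

```python
def is_loss_spike(prev_loss, cur_loss, limit=1.5):
    """Return True when cur_loss is more than `limit` times prev_loss.

    Hypothetical helper for illustration only.
    """
    if prev_loss is None or prev_loss <= 0:
        return False  # no usable baseline yet, so nothing to compare against
    return cur_loss > prev_loss * limit
```

With the default limit of 1.5, a jump from 2.0 to 3.5 is flagged, while a rise from 2.0 to 2.9 is not.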

class internlm.monitor.monitor.MonitorTracker(alert_address: str, check_interval: float = 300, loss_spike_limit: float = 1.5)

Track job status and send alerts to Feishu during training.

Parameters:
  • alert_address (str) – The Feishu webhook address for sending alert messages.

  • check_interval (float) – The interval in seconds for monitoring checks. Defaults to 300.

  • loss_spike_limit (float) – The threshold for detecting loss value spikes. Defaults to 1.5.

run()

Start the monitor tracker.

stop()

Stop the monitor tracker.
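
One way a tracker could notice that training has stopped making progress is to compare the step counter between consecutive checks. The sketch below shows only that idea; StuckDetector is a hypothetical name, and nothing here is taken from MonitorTracker's actual implementation.

```python
class StuckDetector:
    """Hypothetical helper: flags a job whose step counter stops advancing."""

    def __init__(self):
        self._last_step = None

    def check(self, current_step):
        """Call once per check interval; True means no progress since the last call."""
        stuck = self._last_step is not None and current_step <= self._last_step
        self._last_step = current_step
        return stuck
```

Calling check() once per check_interval means a job is reported only after a full interval passes without a new step.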

Alerting

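The InternEvo monitor thread periodically checks for loss spikes, potential training stalls ("training stuck"), and runtime exceptions during training, and it catches the SIGTERM signal. When any of these occurs, an alert is triggered and an alert message is sent to the Feishu webhook address by calling internlm.monitor.alert.send_feishu_msg_with_webhook().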

internlm.monitor.alert.send_feishu_msg_with_webhook(webhook: str, title: str, message: str)

Use a Feishu robot to send a message through the given webhook.

Parameters:
  • webhook (str) – The webhook address used to send the message.

  • title (str) – The message title.

  • message (str) – The message body.

Returns:

The response from the request, or None if an exception was caught.

Raises:

Exception – An exception raised by the HTTP POST request.
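
To make the webhook call concrete, the sketch below builds a rich-text ("post") message body and posts it with the standard library. It is an illustration, not InternEvo's code: build_feishu_post_payload and send_alert are hypothetical names, and the payload layout is an assumption based on the Feishu custom bot message format, which may differ from what send_feishu_msg_with_webhook actually sends.

```python
import json
import urllib.request


def build_feishu_post_payload(title, message):
    """Build a rich-text ("post") message body (assumed Feishu bot format)."""
    return {
        "msg_type": "post",
        "content": {
            "post": {
                "zh_cn": {
                    "title": title,
                    "content": [[{"tag": "text", "text": message}]],
                }
            }
        },
    }


def send_alert(webhook, title, message, timeout=10):
    """POST the payload; return the decoded JSON response, or None on failure."""
    data = json.dumps(build_feishu_post_payload(title, message)).encode("utf-8")
    try:
        request = urllib.request.Request(
            webhook, data=data, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except Exception:  # invalid URL, network error, or non-JSON reply
        return None
```

Returning None on failure mirrors the documented return value, so a failed alert cannot interrupt the training job it reports on.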

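Lightweight Monitoring

The InternEvo lightweight monitoring tool uses a heartbeat mechanism to monitor training metrics in real time, such as loss, grad_norm, and the time spent in each training stage. InternEvo can also display these metrics on a Grafana dashboard, so that users can analyze training more comprehensively and in greater depth.

Lightweight monitoring is configured through the monitor field of the configuration file; users can change the monitoring settings by editing that file. The following is an example monitoring configuration: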

monitor = dict(
    alert=dict(
        enable_feishu_alert=False,
        feishu_alert_address=None,
        light_monitor_address=None,
    ),
)
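  • enable_feishu_alert (bool): Whether to enable Feishu alerts. Default: False.

  • feishu_alert_address (str): The webhook address for Feishu alerts. Default: None.

  • light_monitor_address (str): The address of the lightweight monitoring server. Default: None.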
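
For comparison with the defaults, a configuration with Feishu alerts turned on might look like the sketch below. The webhook URL is a placeholder, not a working address, and leaving light_monitor_address as None when lightweight monitoring is not used is an assumption.

```python
monitor = dict(
    alert=dict(
        enable_feishu_alert=True,
        # Placeholder: substitute the webhook URL issued for your Feishu bot.
        feishu_alert_address="https://open.feishu.cn/open-apis/bot/v2/hook/<your-token>",
        # Assumption: stays None when lightweight monitoring is not used.
        light_monitor_address=None,
    ),
)
```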

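InternEvo uses internlm.monitor.alert.initialize_light_monitor to initialize the lightweight monitoring client. Once initialized, the client establishes a connection to the monitoring server. During training, internlm.monitor.alert.send_heartbeat sends different types of heartbeat messages to the monitoring server. The server uses these heartbeats to detect training anomalies and sends alert messages when needed.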
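
The server-side idea, detecting a problem when heartbeats stop arriving, can be sketched as a timeout check per heartbeat type. Everything in this example is hypothetical: HeartbeatWatchdog, its methods, and the heartbeat type names are illustrative and do not describe the real monitoring server or the arguments of send_heartbeat.

```python
import time


class HeartbeatWatchdog:
    """Hypothetical server-side tracker for heartbeats, keyed by type."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}

    def record(self, heartbeat_type, now=None):
        """Remember when a heartbeat of this type last arrived."""
        self._last_seen[heartbeat_type] = time.monotonic() if now is None else now

    def stale_types(self, now=None):
        """Return the heartbeat types not seen within the timeout, sorted by name."""
        now = time.monotonic() if now is None else now
        return sorted(
            hb_type
            for hb_type, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

A server loop could call stale_types() periodically and raise an alert for every type it returns; the optional now argument exists only to make the check easy to test.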