Notify on failed/stuck runs

Many things can cause the training process to fail or get stuck:

hardware failure
power outage
programmatic error

At first, it may seem a relatively easy, however checking that everything is OK with the training run is proven to be a non-trivial task.

In some cases the training process may not even have a chance to notify about the failure. The process could also be stuck due to

network issues
filesystem I/O issues
etc

With this in mind, the following definition of failed process is used:

The Run is considered as failed if it hasn’t reported any progress in a predefined time-interval.

Once the progress is not reported for the given period of time, a notification will be sent to the enabled channels.

See how to set up the notification service and add notification channels here.

Run progress reporting SDK

Aim SDK provides interface for reporting Run progress. The Run class has now two methods: report_progress and report_successful_finish to report progress and successful finish respectively. Here’s a small code snippet showing how Run.report_progress() method can be integrated in the training loop.

from aim import Run

# prep dataset and model
...
aim_run = Run()

for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Report progress assuming each iteration shoud take less than 3 sec.
        aim_run.report_progress(expect_next_in=3)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if i % 30 == 0:
            # Track data with aim
            aim_run.track(loss.item(), name='loss', epoch=epoch)

...
# Training is done, report success
aim_run.report_successful_finish()

report_progress() takes an ETA (in seconds) of the next anticipated progress report call.

Detailed description of interfaces is available in aim SDK reference.

Note

Additional grace period of 100s is enabled to compensate for possible hardware (e.g. filesystem) latency.