Notify on failed/stuck runs
Many things can cause the training process to fail or get stuck:
At first, it may seem a relatively easy, however checking that everything is OK with the training run is proven to be a non-trivial task.
In some cases the training process may not even have a chance to notify about the failure. The process could also be stuck due to
filesystem I/O issues
With this in mind, the following definition of failed process is used:
The Run is considered as failed if it hasn’t reported any progress in a predefined time-interval.
Once the progress is not reported for the given period of time, a notification will be sent to the enabled channels.
See how to set up the notification service and add notification channels here.
Run progress reporting SDK
Aim SDK provides interface for reporting
Run progress. The
Run class has now two methods:
report_successful_finish to report progress and successful finish respectively.
Here’s a small code snippet showing how
Run.report_progress() method can be integrated in the training loop.
from aim import Run # prep dataset and model ... aim_run = Run() for epoch in range(num_epochs): for i, (images, labels) in enumerate(train_loader): images = images.to(device) labels = labels.to(device) # Forward pass outputs = model(images) loss = criterion(outputs, labels) # Report progress assuming each iteration shoud take less than 3 sec. aim_run.report_progress(expect_next_in=3) # Backward and optimize optimizer.zero_grad() loss.backward() optimizer.step() if i % 30 == 0: # Track data with aim aim_run.track(loss.item(), name='loss', epoch=epoch) ... # Training is done, report success aim_run.report_successful_finish()
report_progress() takes an ETA (in seconds) of the next anticipated progress report call.
Detailed description of interfaces is available in aim SDK reference.
Additional grace period of 100s is enabled to compensate for possible hardware (e.g. filesystem) latency.