Storage indexing - how Aim data is indexed
Background
When tracking experiment metadata with Aim, each run creates its own isolated space in the
Aim repository. This makes it possible to run multiple concurrent experiments without setting up
additional services to synchronize data writes. Once a run is complete, all the data it
tracked is indexed. We call this step run finalization. When the training script is terminated
with a SIGTERM signal, Aim handles it and makes sure that the run is properly finalized and its
data is indexed. However, there are cases when training terminates abnormally and the data remains
unindexed.
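
As a rough illustration of finalizing on SIGTERM, the sketch below installs a signal handler that
finalizes a run before the process exits. The Run class, its finalize method, and
install_sigterm_finalizer are hypothetical names used only for this example; they are not Aim's
actual API, and Aim's own handling may differ.

```python
import signal


class Run:
    """Hypothetical stand-in for a tracked run; not Aim's actual class."""

    def __init__(self, name):
        self.name = name
        self.finalized = False

    def finalize(self):
        # In a real system this is where the run's tracked data would be indexed.
        self.finalized = True


def install_sigterm_finalizer(run):
    """Register a SIGTERM handler that finalizes `run`, then exits.

    Returns the new handler and the previously installed one, so the caller
    can restore the old handler if needed.
    """

    def handler(signum, frame):
        run.finalize()
        # 128 + signal number is the conventional exit status for a signal.
        raise SystemExit(128 + signum)

    previous = signal.signal(signal.SIGTERM, handler)
    return handler, previous
```

A handler like this only helps when the process receives a signal it can catch. A SIGKILL or a
machine crash bypasses it, which is one way a run can end up unindexed.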
How did things work before?
Because some chunks of data were never indexed, they would remain in the runs’ separate
storage but not in the index storage. This means that queries had to open multiple files to read
the repo data. Once failed runs started to accumulate, queries would slow down. To mitigate this,
the aim reindex command was introduced. The command scans the Aim repo and indexes all stalled
runs.
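
The effect of stalled runs on queries, and of a full reindex, can be modeled with a small toy
example. The Repo class and all of its methods below are assumptions made for illustration; they do
not reflect Aim's storage layout or the internals of aim reindex.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy model: per-run storages plus one shared index storage."""

    # run hash -> that run's own isolated storage (metric name -> values)
    run_storage: dict = field(default_factory=dict)
    # shared index storage holding the data of every indexed run
    index: dict = field(default_factory=dict)

    def unindexed_runs(self):
        return [h for h in self.run_storage if h not in self.index]

    def index_run(self, run_hash):
        # Copy the run's data into the shared index.
        self.index[run_hash] = dict(self.run_storage[run_hash])

    def reindex(self):
        """Index every stalled run and return the hashes that were indexed."""
        stalled = self.unindexed_runs()
        for run_hash in stalled:
            self.index_run(run_hash)
        return stalled

    def storages_opened_by_query(self):
        # A query reads the index once, plus one storage per unindexed run.
        return 1 + len(self.unindexed_runs())
```

In this model the number of storages a query has to open grows with the number of unindexed runs
and drops back to one after reindex() completes.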
Automatic indexing
Though the aim reindex command addresses the performance issues, it is not the most convenient
solution. The answers to questions such as “When should I run aim reindex?” or “How frequently
should I run aim reindex?” depend on the actual Aim repository and use case. Thus, indexing of the
Aim repository needs to be automated. Each time the aim up command is run, Aim spawns a background
thread along with the web server. The thread checks for unindexed runs and reindexes them one at a
time. This keeps query performance high without locking the index storage for too long.
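
A minimal sketch of the one-run-at-a-time idea follows. It assumes the unindexed runs are held in a
plain dictionary and that the index is guarded by a single lock; start_background_indexer, pending,
and poll_interval are hypothetical names, and this is not the code Aim actually runs.

```python
import threading


def start_background_indexer(pending, index, index_lock, poll_interval=1.0):
    """Start a daemon thread that moves runs from `pending` into `index`.

    The lock is acquired once per run, so other users of the index are
    blocked only for the time it takes to index a single run.
    """
    stop_event = threading.Event()

    def worker():
        while not stop_event.is_set():
            indexed_one = False
            with index_lock:
                if pending:
                    run_hash, data = pending.popitem()
                    index[run_hash] = data
                    indexed_one = True
            if not indexed_one:
                # Nothing to do: sleep until the next check or until stopped.
                stop_event.wait(poll_interval)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread, stop_event
```

Releasing the lock between runs is the design choice that matters here: a query that needs the
index waits for at most one run to be indexed rather than for the whole backlog.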
Conclusion
With the new automatic indexing logic in place, users don’t have to run the aim reindex command
manually. It is still available for cases when the data of all runs should be indexed at once. The
combination of automatic (implicit) and manual (explicit) reindexing keeps the Aim repo performant
in long-term usage scenarios and provides a good overall user experience.