Efficient computation

Last updated: May 15, 2023

Efficient training

GPU utilization and profiling

From the resource utilization point of view, a common first thing to check is the GPU utilization. The GPU utilization should ideally be close to 100%. If the utilization is consistently low (for example, under 50%), it might be a sign of a bottleneck in the processing pipeline. For example:

  • There might not be enough CPU cores reserved for data loading
  • File I/O might be too slow (e.g., an overloaded shared file system on a supercomputer)

A GPU load of 100% does not guarantee that the job is actually doing something useful on the GPU. A high reported GPU load is a necessary, but not sufficient, condition for an efficient job.
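
The GPU utilization can be monitored with nvidia-smi on the node, or queried programmatically through NVML. Below is a minimal sketch using the pynvml bindings (an extra dependency, not part of PyTorch):

```python
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)        # first GPU on the node
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU utilization: {util.gpu}%, memory controller: {util.memory}%")
pynvml.nvmlShutdown()
```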

The next step would be to check with a profiler, such as the PyTorch profiler, which operations are actually being performed. CSC’s Machine learning guide has a short tutorial on how to use the PyTorch profiler.
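
As a minimal sketch, assuming a toy linear model and a single training step (replace with your own code), the PyTorch profiler can be used like this:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical toy model and input; replace with your own training step.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = model(x).sum()
    loss.backward()

# Show the operations that consumed the most GPU time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```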

Mixed precision training

A simple way to speed up training is to enable mixed precision training, and for many software packages this is already the default. In mixed precision training, some floating-point values are stored with reduced 16-bit precision in cases where the loss of precision is not critical. A simple way to do this is to enable Automatic Mixed Precision (AMP) in PyTorch, as in the sketch below.
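
A minimal sketch of AMP in a PyTorch training loop, using a toy model and random data as stand-ins for a real training setup:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy model and data; replace with your own training setup.
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(512, 1024), torch.randint(0, 10, (512,)))
loader = DataLoader(dataset, batch_size=64)

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 gradient underflow

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # forward pass runs in mixed precision
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, targets)
    scaler.scale(loss).backward()     # backward pass on the scaled loss
    scaler.step(optimizer)            # unscales gradients and takes the optimizer step
    scaler.update()
```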

Efficient inference

To use the trained large language models, the models need to be deployed and served to the users. The user experience, in terms of both the initial startup overhead and the latency of each request, is inherently in tension with the computational resources spent.

If resources were unlimited, the whole collection of available models could be kept in memory at all times. With limited server capacity, and with energy usage in mind, it is more efficient to operate on a shared resource that runs other workloads while the demand for inference is low, which reduces the idle power consumption of the system.

Dynamic loading

To avoid wasting computational resources on a shared cluster, models should only be loaded when they are required. This could be implemented, for example, by signaling a backend when a user enters a website, or by providing an API call that loads a model as the first step of inference; see the sketch below.
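
A minimal sketch of the pattern, with a hypothetical load_model() helper standing in for the actual checkpoint loading:

```python
import threading

# The load_model() helper and the checkpoint handling inside it are
# hypothetical placeholders; only the lazy-loading pattern is the point here.
_models = {}
_lock = threading.Lock()

def load_model(name: str):
    # Placeholder: in practice this would read the checkpoint from disk
    # (e.g. with torch.load) and move the model to the GPU.
    ...

def get_model(name: str):
    """Return the requested model, loading it on first use only."""
    with _lock:                               # avoid loading the same model twice
        if name not in _models:
            _models[name] = load_model(name)  # loaded only when actually needed
        return _models[name]

# An inference API would call get_model(name) as its first step, so that
# idle models never occupy GPU memory.
```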

Job scheduling

To improve utilization, and thereby make it possible to tune for efficiency, multiple users should operate through the same system, possibly on a single copy of the network. The users would interact via a job scheduling or queue system; this could be implemented, for example, with a request queue in front of a shared inference worker, as in the sketch below.
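
A minimal sketch of the idea, with a hypothetical run_inference() placeholder standing in for the actual model call and Python threads standing in for separate users:

```python
import queue
import threading

# Many users submit requests; a single worker serves them against one
# shared copy of the model.
request_queue = queue.Queue()

def run_inference(prompt):
    # Placeholder for the actual model call.
    return f"response to: {prompt}"

def worker():
    while True:
        prompt, slot = request_queue.get()
        slot["output"] = run_inference(prompt)  # one request at a time on the shared model
        slot["done"].set()
        request_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(prompt):
    """Called by each user-facing frontend; blocks until the shared worker answers."""
    slot = {"done": threading.Event()}
    request_queue.put((prompt, slot))
    slot["done"].wait()
    return slot["output"]

print(submit("Hello"))
```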

Scaling the throughput

TODO

Reduced precision inference

TODO. The software stack is important here and can make this approach unfeasible if the reduced precision reduces accuracy but does not improve throughput.
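
As an illustration, a model and its inputs can be cast to 16-bit floats in PyTorch. This is only a sketch with a toy model; the effect on both accuracy and throughput should always be measured on the target hardware:

```python
import torch

# Toy model; whether fp16 actually improves throughput depends on the GPU
# and the software stack, so this should always be benchmarked.
model = torch.nn.Linear(1024, 1024).cuda().eval()
x = torch.randn(8, 1024, device="cuda")

model_fp16 = model.half()        # cast the weights to 16-bit floats
with torch.no_grad():
    y = model_fp16(x.half())     # inputs must be cast to the same precision
```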

Model compilation

To optimize models for inference, several optimized model formats, such as ONNX, are available.
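
As an illustration, a PyTorch model can be exported to ONNX roughly as follows (toy model, placeholder file name):

```python
import torch

# Toy model; the file name, input shape and axis names are arbitrary placeholders.
model = torch.nn.Linear(1024, 10).eval()
example_input = torch.randn(1, 1024)

torch.onnx.export(
    model,
    example_input,                  # example input used to trace the model
    "model.onnx",                   # output file
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # variable batch size
)
```

The exported file can then be served with a runtime such as ONNX Runtime, which can apply further graph-level optimizations.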

TODO: evaluation