Write Multi-Threaded Code

Description

In this session, we discuss how to write multi-threaded ITK code that uses native operating system CPU threads.

Introduction

Writing parallel algorithms to take advantage of multi-processor or multi-core systems can significantly improve an algorithm's speed (although there are exceptions!). In image analysis, we often encounter embarrassingly parallel problems. For example, every output pixel of an algorithm can often be computed independently of the other output pixels.

In an itk::ImageToImageFilter

Figure: CPU-based threading in a shared-memory SMP architecture.

The output image is a single contiguous block of memory that is shared by all processing threads. Each thread is told which pixels it is responsible for producing. All the threads write to this same block of memory, but a given thread is only allowed to set its own assigned pixels.

A multi-threaded filter provides an implementation of the ThreadedGenerateData() method as opposed to the normal single-threaded GenerateData() method. A superclass of the filter will spawn several threads (usually matching the number of processors in the system) and call ThreadedGenerateData() in each thread, specifying the portion of the output that the thread is responsible for generating. For instance, on a dual-processor computer, an image processing filter will spawn two threads. Each thread will generate one half of the output image, and each thread is restricted to writing to its own portion of the output image. Note that the “entire” input and “entire” output images (i.e. what would normally be available to the GenerateData() method, whether or not streaming is being performed) are available to each call of ThreadedGenerateData(). Each thread is allowed to read from anywhere in the input image, but each thread can only write to its designated portion of the output image.
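The region-splitting idea described above can be sketched without ITK, using only the C++ standard library. This is a minimal illustration, not ITK's implementation: the names ProcessRange and RunThreaded are hypothetical, a flat std::vector stands in for the image buffer, and a simple index range stands in for the output region handed to each thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-thread work of ThreadedGenerateData():
// the call may read any element of the input, but writes only output[begin, end).
void ProcessRange(const std::vector<float> & input, std::vector<float> & output,
                  std::size_t begin, std::size_t end)
{
  for (std::size_t i = begin; i < end; ++i)
  {
    output[i] = 2.0f * input[i];
  }
}

// Split the output into contiguous, non-overlapping ranges, one per thread.
std::vector<float> RunThreaded(const std::vector<float> & input, unsigned int numThreads)
{
  assert(numThreads > 0);
  std::vector<float> output(input.size());
  const std::size_t chunk = (input.size() + numThreads - 1) / numThreads;

  std::vector<std::thread> threads;
  for (unsigned int t = 0; t < numThreads; ++t)
  {
    const std::size_t begin = std::min(input.size(), t * chunk);
    const std::size_t end = std::min(input.size(), begin + chunk);
    if (begin == end)
    {
      continue; // more threads than elements: nothing left for this thread
    }
    threads.emplace_back([&input, &output, begin, end] {
      ProcessRange(input, output, begin, end);
    });
  }
  for (auto & th : threads)
  {
    th.join();
  }
  return output;
}
```

Because the ranges never overlap, the threads need no locking even though they all write into the same output buffer.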

When writing a threaded filter, the ThreadedGenerateData() method must be thread-safe. Thread safety requires that multiple threads do not write to the same location in memory at the same time. To avoid this problem, thread-local storage is often used: intermediate results are written to thread-local data structures and then merged in a single-threaded way after parallel processing. To facilitate this common strategy, the default ImageSource implementation of GenerateData() also runs the BeforeThreadedGenerateData() and AfterThreadedGenerateData() methods, which do nothing by default. To use thread-local storage, create the thread-specific data structures in BeforeThreadedGenerateData() and merge the results in AfterThreadedGenerateData(). For an example, see the implementation of the itk::MinimumMaximumImageFilter.

BeforeThreadedGenerateData() is also commonly used to perform single-threaded processing to prepare for the ThreadedGenerateData() call.
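The thread-local storage strategy can be sketched with the standard library alone. The function ThreadedMinMax below is a hypothetical illustration in the spirit of a minimum/maximum computation; it is not the code of itk::MinimumMaximumImageFilter. The comments mark which part plays the role of each of the three methods.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <thread>
#include <utility>
#include <vector>

// Returns {minimum, maximum} of a non-empty pixel buffer.
std::pair<int, int> ThreadedMinMax(const std::vector<int> & pixels, unsigned int numThreads)
{
  assert(numThreads > 0 && !pixels.empty());

  // "Before" step (single-threaded): create one result slot per thread.
  std::vector<int> threadMin(numThreads, std::numeric_limits<int>::max());
  std::vector<int> threadMax(numThreads, std::numeric_limits<int>::lowest());

  // "Threaded" step: each thread scans its own range and writes only its own slot.
  const std::size_t chunk = (pixels.size() + numThreads - 1) / numThreads;
  std::vector<std::thread> threads;
  for (unsigned int t = 0; t < numThreads; ++t)
  {
    const std::size_t begin = std::min(pixels.size(), t * chunk);
    const std::size_t end = std::min(pixels.size(), begin + chunk);
    threads.emplace_back([&pixels, &threadMin, &threadMax, t, begin, end] {
      int localMin = std::numeric_limits<int>::max();
      int localMax = std::numeric_limits<int>::lowest();
      for (std::size_t i = begin; i < end; ++i)
      {
        localMin = std::min(localMin, pixels[i]);
        localMax = std::max(localMax, pixels[i]);
      }
      threadMin[t] = localMin;
      threadMax[t] = localMax;
    });
  }
  for (auto & th : threads)
  {
    th.join();
  }

  // "After" step (single-threaded): merge the per-thread results.
  return { *std::min_element(threadMin.begin(), threadMin.end()),
           *std::max_element(threadMax.begin(), threadMax.end()) };
}
```

No two threads ever write to the same slot, so the merge is the only place where results from different threads meet, and it runs after all threads have joined.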

With itk::DomainThreader

While the ThreadedGenerateData() method in an itk::ImageToImageFilter makes most data-parallel operations easy to code, it does not always fit our needs. We may want to:

  • Perform more than one multi-threaded operation in a filter.
  • Split our data into something other than itk::ImageRegion objects.
  • Do multi-threading outside an itk::ImageSource.

The itk::DomainThreader class overcomes these limitations.

It is possible to have more than one itk::DomainThreader member in a class, so it is possible to perform multiple different multi-threaded operations in a single filter Update(). The multi-threaded operations can be performed multiple times within a loop, be conditionally performed, etc.

The itk::DomainThreader class is templated over the data domain to be split/partitioned, so it can operate on itk::ImageRegion objects, as ThreadedGenerateData() does, but also on index ranges in an itk::Array or on a range specified by Standard Library iterators, depending on which class is selected to partition the domain.

The itk::DomainThreader class structure is similar to the itk::ImageSource structure. The itk::ImageSource methods

BeforeThreadedGenerateData()
ThreadedGenerateData( const OutputImageRegionType & outputRegionForThread, ThreadIdType threadId )
AfterThreadedGenerateData()

correspond to the itk::DomainThreader methods

BeforeThreadedExecution()
ThreadedExecution( const DomainType & subDomain, const ThreadIdType threadId )
AfterThreadedExecution()

To use itk::DomainThreader, create a Threader subclass of itk::DomainThreader. This subclass will be templated over TAssociate, the type of the class that will use the new subclass to perform the multi-threaded operation. The Associate class should declare and instantiate a member Threader. The type of the Threader will be templated with the Associate’s Self type. The Associate will also declare the Threader a friend class, so that the Threader can access the Associate’s private and protected members with this->m_Associate->m_AssociateMemberName. When the Associate wants to perform the multi-threaded operation, it will call Execute( this, completeDomain ) on the member Threader.
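The Threader/Associate relationship described above can be sketched as follows. This is not ITK code: MiniDomainThreader is a hypothetical, heavily simplified stand-in for itk::DomainThreader built on std::thread, the domain is fixed to a half-open index range, and CountThreader and ThresholdCounter are illustrative names. Only the method names BeforeThreadedExecution(), ThreadedExecution(), AfterThreadedExecution(), and Execute() mirror the real class.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

using IndexRange = std::pair<std::size_t, std::size_t>; // half-open [first, second)

// Hypothetical, simplified stand-in for itk::DomainThreader.
template <typename TAssociate>
class MiniDomainThreader
{
public:
  explicit MiniDomainThreader(unsigned int numThreads) : m_NumberOfThreads(numThreads)
  {
    assert(numThreads > 0);
  }
  virtual ~MiniDomainThreader() = default;

  void Execute(TAssociate * associate, const IndexRange & completeDomain)
  {
    m_Associate = associate;
    this->BeforeThreadedExecution();
    const std::size_t size = completeDomain.second - completeDomain.first;
    const std::size_t chunk = (size + m_NumberOfThreads - 1) / m_NumberOfThreads;
    std::vector<std::thread> threads;
    for (unsigned int t = 0; t < m_NumberOfThreads; ++t)
    {
      const std::size_t begin = std::min(completeDomain.second, completeDomain.first + t * chunk);
      const std::size_t end = std::min(completeDomain.second, begin + chunk);
      threads.emplace_back([this, begin, end, t] {
        this->ThreadedExecution(IndexRange(begin, end), t);
      });
    }
    for (auto & th : threads)
    {
      th.join();
    }
    this->AfterThreadedExecution();
  }

protected:
  virtual void BeforeThreadedExecution() {}
  virtual void ThreadedExecution(const IndexRange & subDomain, unsigned int threadId) = 0;
  virtual void AfterThreadedExecution() {}

  unsigned int m_NumberOfThreads;
  TAssociate * m_Associate = nullptr;
};

// The Threader subclass, templated over the Associate type.
template <typename TAssociate>
class CountThreader : public MiniDomainThreader<TAssociate>
{
public:
  explicit CountThreader(unsigned int numThreads) : MiniDomainThreader<TAssociate>(numThreads) {}

protected:
  void BeforeThreadedExecution() override { m_PerThreadCount.assign(this->m_NumberOfThreads, 0); }
  void ThreadedExecution(const IndexRange & subDomain, unsigned int threadId) override
  {
    std::size_t count = 0;
    for (std::size_t i = subDomain.first; i < subDomain.second; ++i)
    {
      if (this->m_Associate->m_Values[i] > this->m_Associate->m_Threshold)
      {
        ++count;
      }
    }
    m_PerThreadCount[threadId] = count; // each thread writes only its own slot
  }
  void AfterThreadedExecution() override
  {
    this->m_Associate->m_Count =
      std::accumulate(m_PerThreadCount.begin(), m_PerThreadCount.end(), std::size_t{ 0 });
  }

private:
  std::vector<std::size_t> m_PerThreadCount;
};

// The Associate: owns a Threader templated over its Self type and befriends it.
class ThresholdCounter
{
public:
  using Self = ThresholdCounter;
  ThresholdCounter(std::vector<int> values, int threshold, unsigned int numThreads)
    : m_Values(std::move(values)), m_Threshold(threshold), m_Threader(numThreads) {}

  std::size_t Count()
  {
    m_Threader.Execute(this, IndexRange(0, m_Values.size()));
    return m_Count;
  }

private:
  friend class CountThreader<Self>;
  std::vector<int> m_Values;
  int m_Threshold;
  std::size_t m_Count = 0;
  CountThreader<Self> m_Threader;
};
```

An Associate could hold several such Threader members and call Execute() on each of them, possibly inside a loop, which is the flexibility that a single ThreadedGenerateData() method does not offer.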

A cell counting example can be found in the ITKExamples project.

With itk::MultiThreader

The itk::MultiThreader can be used to manually spawn and terminate threads that run your own methods in a platform-independent way. However, there is much more code overhead and opportunity for errors compared to itk::ImageToImageFilter or itk::DomainThreader. For an example that applies itk::MultiThreader directly, see the implementation of the CannyEdgeDetectionImageFilter.

Static methods on itk::MultiThreader can also be used to control the default number of threads or a global maximum number of threads. When writing filters designed to work with other processors or future processors, however, care should be taken to avoid hard-coding the number of threads used.

Setting the default global number of threads to one can be useful when writing or debugging multi-threaded code. To do so, either call the static method on itk::MultiThreader or set the environment variable ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS to 1.
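The advice about not hard-coding the thread count can be illustrated with a small standard-library sketch. ITK reads ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS itself; the helpers ResolveThreadCount and DefaultThreadCount below are hypothetical and only show one way an environment override could be combined with the hardware thread count in your own code. They do not reproduce ITK's parsing rules.

```cpp
#include <cassert>
#include <cstdlib>
#include <thread>

// Hypothetical helper: use the override when it parses as a positive integer,
// otherwise fall back to the hardware thread count.
unsigned int ResolveThreadCount(const char * overrideValue, unsigned int hardwareThreads)
{
  assert(hardwareThreads > 0);
  if (overrideValue != nullptr)
  {
    char * end = nullptr;
    const long parsed = std::strtol(overrideValue, &end, 10);
    if (end != overrideValue && *end == '\0' && parsed > 0)
    {
      return static_cast<unsigned int>(parsed);
    }
  }
  return hardwareThreads;
}

// Hypothetical helper: query the machine instead of hard-coding a number.
unsigned int DefaultThreadCount()
{
  unsigned int hardware = std::thread::hardware_concurrency();
  if (hardware == 0)
  {
    hardware = 1; // hardware_concurrency() may return 0 when the count is unknown
  }
  return ResolveThreadCount(std::getenv("ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS"), hardware);
}
```

With the variable set to 1, DefaultThreadCount() returns 1 and the code runs effectively single-threaded, which makes stepping through it in a debugger much easier.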

Locking classes

Note that ITK has a few classes that perform thread synchronization, such as the itk::SimpleFastMutexLock, but these should be avoided where possible for performance reasons.
