Domain decomposition and Parallel programming

This section describes the Proof of Concept implementation for speeding up code execution in AeoLiS using domain decomposition and parallel programming.

Overview

When domains get larger (more gridpoints), the calculation slows down. Profiling AeoLiS reveals that solving the sparse system of equations (Ax=b) accounts for approximaely 50% of the total code execution time for running a simulation. This operation involves calling the below mentioned Scipy routine that operates on the model as one single domain.

Ct_i += scipy.sparse.linalg.spsolve(A, yCt_i.flatten())

We use domain decomposition to divide the model into several smaller subdomains that can be computed in parallel.

Implementation details

  • Code changes for implementing domain decomposition and parallel programming in AeoLiS are in the branch parallel-update.

  • Issue tracker: Issue #115

  • Domain decomposition in AeoLis is implemented by dividing a large domain into several smaller overlapping sub domains and computing them in parallel. The parallel computation of the subdomains is facilitated by a pool of Python processes created using the multiprocessing module in Python. The results of the parallel computations are collected and combined to get the final result.

  • Simulation in parallel mode is enabled via a keyword parallel in the configuration file (aeolis.txt). When the keyword is set to T (True), the parallel implementation is enabled.

#create a pool of processes
p = Pool(5)

#divide the domain into smaller subdomains

if p['parallell'] == True:
    d1 = np.int((np.floor(yCt_i.shape[0]*1/4)+1)*yCt_i.shape[1])
    d2 = np.int((np.floor(yCt_i.shape[0]*1/4))*yCt_i.shape[1])
    d3 = np.int((np.floor(yCt_i.shape[0]*2/4)+1)*yCt_i.shape[1])
    d4 = np.int((np.floor(yCt_i.shape[0]*2/4))*yCt_i.shape[1])
    d5 = np.int((np.floor(yCt_i.shape[0]*3/4)+1)*yCt_i.shape[1])
    d6 = np.int((np.floor(yCt_i.shape[0]*3/4))*yCt_i.shape[1])
# Solve the system of equations in parallel. Distribute the smaller computations among the pool of processes.

results = pools.starmap(scipy.sparse.linalg.spsolve,
                        [(A[0:d1,0:d1], yCt_i.flatten()[0:d1]),
                        (A[d2:d3,d2:d3], yCt_i.flatten()[d2:d3]),
                        (A[d4:d5,d4:d5], yCt_i.flatten()[d4:d5]),
                        (A[d6:,d6:], yCt_i.flatten()[d6:])
                        ])


# Collect the results from the pool of processes and combine them to get the final result.

Ct_i[0:d1] += results[0]
Ct_i[d2:d3] += results[1]
Ct_i[d4:d5] += results[2]
Ct_i[d6:] += results[3]

Experiments and Results

The parallel verison of AeoLiS is benchmarked on the sandengine example with 2 different grid sizes. The simulation is run on a laptop computer with 4 cores and 8GB RAM.

The results are shown in the table below.

Table 1 Benchmark results for parallel version of AeoLiS

Configuration

Sequential version

Parallel version

Speed-up achieved

Sandengine with small grids

7 mins 2 secs

4 mins 23 secs

1.6X

Sandengine with large grids

1 hr 07 mins 30 secs

17 mins 17 secs

3.9X

Key findings

  • The speed-up achieved by using domain decomposition and parallel programming is significant for big domain with large grid sizes. For smaller domains, the overhead of distributing the subdomains among the available cores on the computer and collecting the results overshadows the overall speed-up achieved.

  • The parallel implementation fails the test that verifies the content of the output netCDF file generated by the parallel version against a reference output netCDF file generated by the sequential version of the code for the same configuration.

Reflections and To do

  • Testing for the parallel version need be reconsidered. Existing tests may not be applicable for the parallel version due to the overlap of the subdomains.

  • The proof of concept implementation uses a hardcoded approach to divide the big domain into smaller subdomains. This approach is not generic and may not work for all use cases. Replacing the hardcoded approach with a generalized function to decompose the domains based on the domain properties could be a possible improvement, but this needs to be investigated further.

Try it yourself: Running AeoLiS in parallel mode

You can run AeoLiS in parallel mode by following the below steps:

  1. Clone the aeolis-python GitHub repository and checkout the parallel-update branch.

  2. Install AeoLiS in development mode by running the command pip install -e . from the root directory of the repository.

  3. Navigate to the folder examples/sandmotor_large_grids/ and run the command aeolis aeolis.txt to run the sandmotor simulation with large grids. The keyword parallel in the configuration file (aeolis.txt) is set to T (True) by default, so the simulation will run in parallel mode. This can also be confirmed by looking at the print statements in the terminal.

  4. You may not have sufficient memory available on your computer to run the sandengine large grids example. In that case, try running the sandmotor example with smaller grid size. Navigate to the folder examples/sandmotor_small_grids/ and run the command aeolis aeolis.txt to run the sandmotor simulation with small grids.

  5. If you would like to run AeoLiS in parallel mode for your own example, set the keyword parallel to T (True) in the configuration file.