HPCG: 2. Multi-coloring for Parallelization
- Han, Minhee
TL;DR
Multi-coloring is a technique for assigning colors to the nodes of a graph so that no two adjacent nodes share the same color. In the context of HPCG, we can think of each row of the matrix as a node and the dependencies between rows as edges. By assigning colors to the rows, we can parallelize the SYMGS operation in HPCG.
Disclaimer
Good practice would be to analyze the performance of the reference HPCG first and then implement the optimizations one by one, measuring how much each contributes to the overall improvement. However, because the changes to the data structures require all operations to be modified, in practice it is easier to implement all the optimizations at once. So we'll skip the ablation study and implement everything together; afterwards, we can compare the performance characteristics of the original and optimized versions of HPCG.
Introduction
The HPCG algorithm uses an iterative approach that brings the current solution closer to the actual solution in each iteration. The specification allows some relaxation of the algorithm, which can make the solution less accurate for the same number of iterations. However, if the relaxation makes each iteration faster, the overall time to reach the solution can be reduced despite the larger number of iterations. By using multi-coloring, we can parallelize the SYMGS operation in HPCG.
SYMGS Operation in Python
Let's simplify the problem and write the code in Python. First, let's define the matrix and vectors.
import numpy as np

# Generates a matrix for a 3D 27-point stencil on an N x N x N grid.
def generate_3d_grid_matrix(N):
    size = N * N * N
    X = np.zeros((size, size))  # Initialize a (size x size) dense matrix
    index = lambda x, y, z: x + y * N + z * N * N  # Map 3D to 1D
    for z in range(N):
        for y in range(N):
            for x in range(N):
                current = index(x, y, z)
                X[current, current] = 26.0  # diagonal element
                for sz in range(-1, 2):
                    if 0 <= z + sz < N:
                        for sy in range(-1, 2):
                            if 0 <= y + sy < N:
                                for sx in range(-1, 2):
                                    if 0 <= x + sx < N:
                                        neighbor = index(x + sx, y + sy, z + sz)
                                        if neighbor != current:
                                            X[current, neighbor] = -1.0
    return X

def initialize_vectors(N):
    size = N * N * N
    x = np.zeros(size)
    b = np.ones(size)
    r = b.copy()
    return x, b, r
This isn't a complete implementation, but it shows how the matrix is generated. The resulting matrix looks like this:
[[26. -1. 0. ... 0. 0. 0.]
[-1. 26. -1. ... 0. 0. 0.]
[ 0. -1. 26. ... 0. 0. 0.]
...
[ 0. 0. 0. ... 26. -1. 0.]
[ 0. 0. 0. ... -1. 26. -1.]
[ 0. 0. 0. ... 0. -1. 26.]]
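As a side note, the real HPCG benchmark stores this matrix in a sparse, row-oriented format rather than a dense array, since a dense N³ x N³ matrix becomes impractical very quickly. Below is a minimal sketch of the same construction with scipy.sparse (assuming SciPy is available; this is not the reference code, just an illustration):

from scipy.sparse import lil_matrix

# Same 27-point stencil as above, but only the nonzeros are stored.
def generate_3d_grid_matrix_sparse(N):
    size = N * N * N
    X = lil_matrix((size, size))
    index = lambda x, y, z: x + y * N + z * N * N
    for z in range(N):
        for y in range(N):
            for x in range(N):
                current = index(x, y, z)
                X[current, current] = 26.0
                for sz in range(-1, 2):
                    for sy in range(-1, 2):
                        for sx in range(-1, 2):
                            nx, ny, nz = x + sx, y + sy, z + sz
                            if 0 <= nx < N and 0 <= ny < N and 0 <= nz < N:
                                neighbor = index(nx, ny, nz)
                                if neighbor != current:
                                    X[current, neighbor] = -1.0
    return X.tocsr()  # CSR suits row-wise sweeps like SYMGS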
Next, let's implement the SYMGS operation.
def symgs(X, x, r, iterations=10):
    n_rows = X.shape[0]
    for _ in range(iterations):
        # Forward sweep
        for i in range(n_rows):
            sum_val = r[i]
            for j in range(n_rows):
                if i != j:
                    sum_val -= X[i, j] * x[j]
            x[i] = sum_val / X[i, i]
        # Backward sweep
        for i in reversed(range(n_rows)):
            sum_val = r[i]
            for j in range(n_rows):
                if i != j:
                    sum_val -= X[i, j] * x[j]
            x[i] = sum_val / X[i, i]
    return x
It's translated directly from the C++ reference HPCG code. What we can see here is that the computation of each row depends on other rows that are also updated in the same sweep. This makes the computation difficult to parallelize; strictly speaking, it is not parallelizable at all because of the dependencies between the rows.
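To see the dependency concretely, here is a small sketch (not part of the reference code) that lists, for a given row, the neighboring rows that were already updated earlier in the same forward sweep; these are exactly the freshly updated values row i reads, which is what forces the sequential order:

def forward_sweep_dependencies(X, i):
    # Neighbors with a smaller index are updated before row i in a forward
    # sweep, so row i depends on their new values.
    neighbors = np.nonzero(X[i])[0]
    return [int(j) for j in neighbors if j < i]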
Relaxation and Parallelization
However, because of the nature of the algorithm, a subtle change in the computation won't ruin the whole algorithm. We can relax the computation by allowing a small departure from the original solution.
JPL (Jones-Plassmann-Luby) Coloring
In rocHPCG from AMD, the coloring algorithm is explicitly called "JPL Coloring," but it's often referred to simply as coloring or multi-coloring in other sources. There is a good explanation of the coloring algorithm in an NVIDIA blog article, so I'll skip the explanation of the algorithm itself and jump directly to the coloring for HPCG.
In parallel computing, the need to break (a little bit of) the dependencies in iterative methods such as Gauss-Seidel (SYMGS) is critical. Strictly, the operations must be performed sequentially, since the update of one row can depend on another row updated in the same sweep. Coloring introduces parallelism, with a small relaxation of the algorithm, by grouping independent rows (vertices in the graph) that can be updated simultaneously without (much) conflict.
The algorithm was originally meant for graph-based problems, but our problem can be thought of as a kind of graph problem. We can think of each row as a vertex in a graph and the dependencies between rows as the edges connecting the vertices. By "adjacent," we mean that the index of one row appears among the column indices of the nonzero elements of another row. For example, if the i-th row has a nonzero element in the j-th column, the j-th row is adjacent to the i-th row.
def jpl_coloring(X):
    n = X.shape[0]
    # Initialize all vertices as uncolored (-1)
    colors = [-1] * n
    current_color = 0
    # Loop until all vertices are colored
    while any(c == -1 for c in colors):
        # For each vertex, check if it can be colored with the current color
        for i in range(n):
            if colors[i] == -1:  # Only consider uncolored vertices
                # Get the indices of neighbors (non-zero entries in the row)
                neighbors = np.nonzero(X[i])[0]
                neighbor_colors = []
                for neighbor in neighbors:
                    if colors[neighbor] != -1:
                        neighbor_colors.append(colors[neighbor])
                # Can color this vertex if none of the neighbors have the current color
                if current_color not in neighbor_colors:
                    colors[i] = current_color
        # Move to the next color
        current_color += 1
    return colors
Although different implementations might vary in detail, the basic idea of the JPL Coloring algorithm follows these steps:
- Initialization
Each row (vertex) is initialized as uncolored (-1).
- Neighbor Inspection
For each uncolored row, its adjacent neighbors (rows) are inspected. The algorithm checks the colors of the neighboring rows to avoid assigning the same color to adjacent rows.
- Assigning Colors
The algorithm tries to assign the current color (current_color) to a row. If none of the neighbors have the current color, the row is assigned this color. Otherwise, it remains uncolored until the next pass.
- Iteration
The algorithm loops through all rows and colors them in successive passes. In each pass, it assigns the current color to as many rows as possible without conflicts, then increments the color index (changes the current color) and continues until every row has a color. Once the coloring is complete, rows with the same color form independent sets, meaning they can be processed in parallel (a quick check of this property is sketched below).
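As a sanity check, here is a small sketch (not part of the reference HPCG code) that confirms the property we rely on: no row shares a color with any of its neighbors.

def verify_coloring(X, colors):
    # A valid multi-coloring: no row may share a color with an adjacent row.
    n = X.shape[0]
    for i in range(n):
        for j in np.nonzero(X[i])[0]:
            if j != i and colors[j] == colors[i]:
                return False
    return True

If verify_coloring(X, color) returns True, each color class is an independent set, which is exactly what the parallel sweep in the next section relies on.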
Parallelized SYMGS Operation
Once the coloring is complete, we can modify the SYMGS operation to use the coloring information and parallelize the computation. Coloring identifies groups of rows (vertices in the graph) that are independent of each other, meaning those rows can be processed simultaneously with an acceptable relaxation of the dependencies. This allows for parallel execution of SYMGS.
from concurrent.futures import ThreadPoolExecutor, as_completed

def update_row(X, x, r, i):
    n_rows = X.shape[0]
    sum_val = r[i]
    for j in range(n_rows):
        if i != j:
            sum_val -= X[i, j] * x[j]
    return sum_val / X[i, i]

def sweep(X, x, r, color, n_colors, max_workers, is_forward=True):
    color_range = range(n_colors + 1) if is_forward else reversed(range(n_colors + 1))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for c in color_range:
            futures = {}
            for i in range(len(x)):
                if color[i] == c:
                    futures[executor.submit(update_row, X, x, r, i)] = i
            for future in as_completed(futures):
                idx = futures[future]
                x[idx] = future.result()

def symgs_parallel_concurrent(X, x, r, color, iterations=10, max_workers=4):
    n_colors = max(color)  # Highest color index in use
    for _ in range(iterations):
        # Forward sweep
        sweep(X, x, r, color, n_colors, max_workers, is_forward=True)
        # Backward sweep
        sweep(X, x, r, color, n_colors, max_workers, is_forward=False)
    return x
Within one color, all rows are independent of each other, so in theory they can all be updated at the same time (here limited by max_workers threads), while the colors themselves are processed one after another within each sweep. So, basically, it does the same operation as the original SYMGS, but in parallel within each color.
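As a side note, the same per-color update can be written without threads by vectorizing over all rows of one color with NumPy (Python threads are limited by the GIL for this kind of CPU-bound loop anyway). This is only a sketch of the idea, not how the reference or an optimized C++ HPCG implements it:

def symgs_colored_vectorized(X, x, r, color, iterations=10):
    colors = np.asarray(color)
    diag = np.diag(X).copy()
    n_colors = int(colors.max()) + 1
    for _ in range(iterations):
        # Forward sweep over the colors, then backward sweep
        for order in (range(n_colors), reversed(range(n_colors))):
            for c in order:
                rows = np.where(colors == c)[0]
                # Rows of the same color are mutually independent,
                # so they can all be updated at once.
                sums = r[rows] - X[rows] @ x + diag[rows] * x[rows]
                x[rows] = sums / diag[rows]
    return x

In a real implementation (OpenMP threads or a GPU kernel), it is this inner loop over the rows of one color that gets parallelized, just like in the threaded version above.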
Comparison of the Results
Let's compare the results of the original SYMGS and the parallelized SYMGS with multi-coloring.
def main():
    N = 8  # Grid size (N x N x N).
    print(f"Generating a 3D grid matrix with N={N} (Total size: {N**3})...")
    X = generate_3d_grid_matrix(N)
    print("Matrix generated.")
    print(X)

    x, b, r = initialize_vectors(N)
    print("Vectors initialized.")
    print("\nInitial residual norm:", np.linalg.norm(r))

    n_iteration = 5

    x_original = x.copy()
    print(f"\nPerforming original SYMGS for {n_iteration} iterations...")
    x_original = symgs(X, x_original, r, iterations=n_iteration)
    print(f"Updated solution x (first 10 elements, original after {n_iteration} iterations):", x_original[:10])

    # Perform Coloring
    print("\nPerforming Coloring...")
    color = jpl_coloring(X)
    print("Coloring completed. Number of colors used:", max(color) + 1)

    # Perform SYMGS
    x_parallel = x.copy()
    print(f"\nPerforming parallel SYMGS for {n_iteration} iterations...")
    x_parallel = symgs_parallel_concurrent(X, x_parallel, r, color, iterations=n_iteration, max_workers=4)
    print(f"Updated solution x (first 10 elements, parallel after {n_iteration} iterations):", x_parallel[:10])

    # Compare results
    print(f"Difference between original and parallel solutions after {n_iteration} iterations:", np.linalg.norm(x_original - x_parallel))

if __name__ == "__main__":
    main()
It gives output like the following:
Generating a 3D grid matrix with N=8 (Total size: 512)...
Matrix generated.
[[26. -1. 0. ... 0. 0. 0.]
[-1. 26. -1. ... 0. 0. 0.]
[ 0. -1. 26. ... 0. 0. 0.]
...
[ 0. 0. 0. ... 26. -1. 0.]
[ 0. 0. 0. ... -1. 26. -1.]
[ 0. 0. 0. ... 0. -1. 26.]]
Vectors initialized.
Initial residual norm: 22.627416997969522
Performing original SYMGS for 5 iterations...
Updated solution x (first 10 elements, original after 5 iterations): [0.07511054 0.10086726 0.11240291 0.11697966 0.11692561 0.11225382
0.10066379 0.07493138 0.10074629 0.14457307]
Performing Coloring...
Coloring completed. Number of colors used: 8
Performing parallel SYMGS for 5 iterations...
Updated solution x (first 10 elements, parallel after 5 iterations): [0.07273684 0.09628176 0.10655959 0.11009813 0.11039728 0.10629401
0.09647671 0.07263084 0.09594817 0.13539174]
Difference between original and parallel solutions after 5 iterations: 0.5304713021583516
The difference between the original and parallel solutions is not zero, but it's close enough. If we increase the number of iterations to 10,
Difference between original and parallel solutions after 10 iterations: 0.15672233780717715
If we increase the number of iterations to 20,
Difference between original and parallel solutions after 20 iterations: 0.00733167473733031
We can see that the difference is getting smaller as the number of iterations increases. Thus, we can say that the results from the parallelized SYMGS operation with multi-coloring are close enough to the original results, especially when the number of iterations is large.
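Another way to look at it (a small addition, not in the original script) is to compare how well each solution satisfies the system itself, using the variables from main() above:

# Both versions should drive the residual norm ||r - X @ x|| down;
# the colored version may lag slightly behind the strictly sequential one.
print("Residual norm (original):", np.linalg.norm(r - X @ x_original))
print("Residual norm (parallel):", np.linalg.norm(r - X @ x_parallel))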
Conclusion
SYMGS takes up a significant portion of the computation time in HPCG (the exact share will be shown in the analysis after all the optimizations are explained and implemented). So, the fact that we can parallelize this part of the computation can lead to a significant speedup in the overall computation time. In fact, we can say that it's the most important part of optimizing HPCG.