Implementing seam carving (liquid rescaling) in OpenCL

Rescaling an image to a different aspect ratio is often achieved via letter boxing, cropping or stretching, but none of these approaches are perfect. For large changes in aspect ratios, letter boxing looks inelegant, stretching distorts image contents and cropping causes outright loss of information.

Avidan and Shamir’s paper titled, “Seam carving for content-aware image resizing” presents a simple but clever method of resizing that aims to overcome these weaknesses. The paper defines a seam as a 8-connected path of pixels connecting two opposing edges of the image, and the cost of a seam is the sum of the magnitude of the gradients of the pixels in the path. The insight is that seams with a low cost would pass through regions of the image with small gradients meaning they can be removed without much distortion or loss of information. Removing a single seam reduces either the width or height of the image by 1 pixel, and by repeating this action as many times as necessary, the image can be rescaled to any aspect ratio.

Finding a minimum cost seam through an image can be reduced to the shortest path graph problem, but the problem is more easily solved using dynamic programming as it possesses the following optimal substructure. Suppose without loss of generality that we are seeking a minimum cost vertical seam. The cost of including a pixel at row r and column c equals \(C(r, c) = min\{C(r-1, c-1),C(r-1, c),C(r-1, c + 1)\} + G(r, c)\) where \(G(r, c)\) is the magnitude of the gradient at \((r, c)\). Via this recursive formula, the minimum cost vertical seam is given by \(min\{C(height – 1, k)\}\) and the path of this seam through the image can be found by following the minimum cost parents until reaching the zeroth row.

Serial implementation

Let us first implement a serial version of the algorithm to establish a baseline performance and to verify the correctness of the OpenCL implementation later on. I will be using Python with the Pillow imaging library on top of scipy and numpy.

from Pillow import Image
import numpy
import scipy.signal

def calculate_gradient_magnitude(image: numpy.array) -> numpy.array:
	"""Calculate gradient magnitude of an RGB image

	Calculates gradient magnitude of each color channel using the
	Prewitt operator and simply sums magnitudes
	(does not use any luminance-preserving weighting for now)

	Args:
	image (numpy.array): RGB image as 3-dimensional array
	(row, column, color channel)

	Returns:
	gradient_magnitudes (numpy.array): 2-dimensional array
	"""

	gradient_magnitudes = numpy.zeros(image.shape[:2], dtype=numpy.float32)
	for color, index in [("red", 0), ("green", 1), ("blue", 2)]:
		grad_x = scipy.signal.sepfir2d(image[:, :, index], [1, 0, -1], [1, 1, 1])
		grad_y = scipy.signal.sepfir2d(image[:, :, index], [1, 1, 1], [1, 0, -1])
		grad = numpy.sqrt(grad_x ** 2 + grad_y ** 2)
		gradient_magnitudes += grad

	return gradient_magnitudes

Because the Prewitt operator is a separable filter, it can be more efficiently computed as a series of two 1D convolution in the horizontal and vertical directions as opposed to a 2D convolution. This trick reduces the time complexity of convolution from \(O(mnw^2)\) to \(O(mnw)\) where \((m, n)\) are the dimensions of the image and \(w\) is the width of the filter.

It is important to consider the boundary conditions of the convolution as it affects the dimensions of the output. For seam carving, it is convenient to ensure that the 2D array holding gradient magnitudes has the same dimensions as the original image. Such a scheme requires padding the convolution input; fortunately scipy.signal.sepfir2d does this by default using mirror symmetric padding.

def seam_carve(image: Image.Image, num_columns_to_remove: int=1) -> Image.Image:
	"""Removes the specified number of vertical seams from a color image

	Args:
	image (Image.Image): color image
	num_columns_to_remove (int)

	Returns:
	image (Image.Image): seam carved color image
	"""
	
	image_data = numpy.array(image.getdata(), numpy.uint8).reshape(image.size[1], image.size[0], 3)
	gradient_magnitudes = calculate_gradient_magnitudes(daimage_datata)
	rows, columns, _ = image_data.shape
	
	for i in range(num_columns_to_remove):
		# Initialize memo
		parent = {row: {} for row in range(rows)}
		previous_row_costs = [0] * columns
		cost = [float('inf')] * columns

		# Calculate minimum cost seams for each successive row
		for row in range(rows):
			for col in range(columns):
				min_parent_cost = float('inf')
				for parent_col in [col - 1, col, col + 1]:
					parent_cost = previous_row_costs[parent_col] if 0 <= parent_col < columns else float('inf')
					if parent_cost < min_parent_cost: min_parent_cost = parent_cost min_parent = (row - 1, parent_col) parent[row][col] = min_parent cost[col] = gradient_magnitudes[row, col] + min_parent_cost previous_row_costs = cost[:] # Find minimum cost seam by traversing parent dictionary path = [] current_row = len(parent) - 1 current_col = numpy.argmin(cost) while current_row > 0:
			path.append(current_col)
			current_row, current_col = parent[current_row][current_col]
		path = path[::-1]
		
		# Create new image by removing seam
		new_image_data = numpy.empty((image_data.shape[0], image_data.shape[1] - 1, image_data.shape[2]), dtype=numpy.uint8)
		for row in range(rows):
			new_image_data[row, 0:path[row]] = image_data[row, 0:path[row]]
			new_image_data[row, path[row]:] = image_data[row, path[row] + 1:]
		
		# compute gradients for new image
		gradient_magnitudes = calculate_gradient_magnitudes(new_image_data)
		image_data = new_image_data
		
	return Image.fromarray(data)

The seam_carve function above implements a slight simplification of seam carving that only removes vertical seams (can only reduce width), though it is trivial to remove horizontal seams by passing in a rotated image and then undoing the transformation. The outer for-loop repeatedly finds and removes a single minimum cost vertical seam from the image.

Each iteration starts by initializing a memo to store the costs of pixels in the previous row as well as a “parent” dictionary to store the minimum cost parent of each pixel so that the minimum cost seam can later be found. After initialization, the algorithm iterates over each pixel, top to bottom and left to right, calculating the cost of the pixel using the aforementioned recursive formula and noting the parent. After that, the seam is found by traversing the parent dictionary and then removed from the image.

Note that because the image gets modified at the end of each iteration, the gradient must be recomputed at the start of every iteration. I did implement a version which only recomputes the pixels affected by the seam removal and copies over unchanged values from the old image gradient, but this did not significantly improve performance. This is probably because my implementation was interpreted Python code which would be much slower than scipy’s convolution function written as a compiled C extension.

Example

Base image

Source

By charlesjsharp on Wikimedia Commons (Creative Commons BY-SA 4.0)

 

Removal

The above video shows the image as each vertical seam gets removed. Initially the algorithm removes seams from the blurry background on the left side of the image as the region has small gradients. Towards the end, it is forced to remove seams that cross the leaf which create some unnatural jagged edges but the butterfly is fortunately left intact.

Performance

Unfortunately, the rate at which seams get removed in the above video is much faster than how the serial implementation removes seams in reality. Even for the small base image above with dimensions of 224 by 320, the serial implementation took 25.6 seconds to remove vertical 100 seams on my laptop. This level of performance is unacceptable, especially in interactive use cases where the user is an image editor and expects a live preview as the image is rescaled.

Profiling the code revealed that the most time-consuming part of the algorithm is the inner nested for-loop that applies the recursive formula to determine the minimum cost of each pixel. The problem of finding a minimum cost seam ending at a given pixel requires having solved the subproblem of finding the minimum cost seam ending at its three possible parents in the previous row. However, computing the minimum cost seam for pixels in the same row can be parallelized.  Let’s see how much of a speedup we can achieve using this parallelization scheme.

Parallel implementation (OpenCL)

I will be using the OpenCL parallel computing framework and pyopencl to access the OpenCL API from Python.

Communication between the host and the device is one of the biggest bottlenecks for parallel programs so it is important that all of the computation for the algorithm occur on the device after transferring over the input image. This means that we must write a kernel for each stage of the algorithm:

  1. Calculating gradient magnitudes by convolving the Prewitt filter
  2. Finding minimum cost seam at each pixel using the recursive formula
  3. Extracting the path of the minimum cost vertical seam through the image
  4. Removing the seam from the image

Our Python host code will repeatedly run the se four kernels however many times as requested by the user i.e. how many vertical seams need to be removed.

OpenCL provides a built-in type for 2D images (image2d_t) which the GPU can place in special memory for textures that is faster to access than global memory. However, textures can either be read-only or write-only, and this restriction is not acceptable for seam carving where the output of one iteration of seam carving becomes the input of the next. Therefore, the image will be represented as an uchar4 array (8-bit unsigned integers for red, green, blue and alpha components of each pixel) in all kernels.

We will be representing the 2D image in row-major (C-style) format whereby the consecutive elements inside a row are stored contiguously memory. This will allow easier interfacing with numpy in the Python host code as numpy arrays which are in row-major format by default.

Notice that all of the kernels presented here have rowPitch, rows and columns arguments. rows and columns give the dimensions of the image while rowPitch specifies the interval at which each row starts. This allows the kernels to index into the array using a row and column number using the following formula: \(A[row, column] \rightarrow A[column + rowPitch * row]\).

Let me explain why we have a separate rowPitch parameter as you may be thinking we could just use columns instead. To minimize memory access, the stage 4 kernel performs seam removal in-place on the image buffer by shifting the pixels on the right side of the vertical seam to the left by one pixel while leaving the rest of the image intact. You can see that after this operation, the image buffer has a column of unused pixels on the rightmost of each row, and while the number of columns in the image has changed, we still need the same value of rowPitch to index into the array. Note that rowPitch remains constant at the number of columns of the original image throughout the algorithm.

1. Calculating gradient magnitudes

#define filter(i, j) filter[j + i * filterWidth]
#define image(row, column) image[clamp(column, 0, columns - 1) + rowPitch * clamp(row, 0, rows - 1)]
#define output(row, column) output[column + rowPitch * (row)]
#define region(row, column) region[column + 1 + (get_local_size(1) + 2) * (row + 1)]

__kernel void gradient_magnitudes(
        __global uchar4 const *image, 
        __const int rowPitch,
        __const int rows, __const int columns,
        __constant int const *filter,
        __local float4 *region, __global float4 *output)
{
    int row = get_global_id(0);
    int column = get_global_id(1);
    int local_row = get_local_id(0);
    int local_column = get_local_id(1);
    
    // Copy region to local memory
    if(row <= rows && column <= columns) {
        region(local_row, local_column) = convert_float4(image(row, column));

        // Let work units on boundary of work group copy extra boundary pixels
        if(local_row == 0) {
            region(local_row - 1, local_column) = convert_float4(image(row - 1, column));
            if(local_column == 0) {
                region(local_row - 1, local_column - 1) = convert_float4(image(row - 1, column - 1));
            } else if(local_column == (int)get_local_size(1) - 1) {
                region(local_row - 1, local_column + 1) = convert_float4(image(row - 1, column + 1));
            }
        } else if(local_row == (int)get_local_size(0) - 1) {
            region(local_row + 1, local_column) = convert_float4(image(row + 1, column));
            if(local_column == 0) {
                region(local_row + 1, local_column - 1) = convert_float4(image(row + 1, column - 1));
            } else if(local_column == (int)get_local_size(1) - 1) {
                region(local_row + 1, local_column + 1) = convert_float4(image(row + 1, column + 1));
            }
        }
        
        if(local_column == 0) {
            region(local_row, local_column - 1) = convert_float4(image(row, column - 1));
        } else if(local_column == (int)get_local_size(1) - 1) {
            region(local_row, local_column + 1) = convert_float4(image(row, column + 1));
        }  
    }
    
    // Synchronize work items and ensure visibility of local memory
    barrier(CLK_LOCAL_MEM_FENCE);
    
    // Convolve filter in both directions against region stored in local memory
    if(row < rows && column < columns) { float4 grad_x = 0.0; float4 grad_y = 0.0; int filterWidth = 3; int offset = filterWidth >> 1;
        for(int i = 0 ; i < filterWidth ; ++i) {
            for(int j = 0 ; j < filterWidth ; ++j) {
                grad_x += filter(i, j) * region(local_row + i - offset, local_column + j - offset);
                grad_y += filter(j, i) * region(local_row + i - offset, local_column + j - offset);
            }
        }

        output(row, column) = sqrt(grad_x * grad_x + grad_y * grad_y);
    }
}

Launch configuration
Global size:             image height x image width (one work unit per pixel)
Workgroup size:     16 x 16

I started with a naive implementation that took over 4ms to convolve the 3×3 gradient filter over the 224 by 320 image but brought it down to 0.2ms after using several of the techniques from Khairi Reda’s excellent article on optimizing image convolution kernels in OpenCL.

The most impactful optimization was to launch the kernel in workgroups of 16 by 16 and instruct each workgroup to copy into local memory the 18 by 18 region of the image that its work items need to access for convolution. This (approximately) reduces the number of global number accesses per by a factor of 9 i.e. the area of the filter, making the work items read from much faster local memory.

2. Finding minimum cost seams

#define imageGradient(row, column) imageGradient[column + rowPitch * (row)]
#define parent(row, column) parent[column + rowPitch * (row)]

__kernel void seam_carve(
        __global const float4 *imageGradient, __const int rowPitch,
        __const int rows, __const int columns,
        __local float *previousCost, __local float *cost,
        __global int *parent, global int *minimumCostColumn)
{
    // Initialize previousCost to zero
    int lid = get_local_id(0);
    int columns_per_thread = ceil((float)columns / get_local_size(0));
    int offset = lid * columns_per_thread;
    for(int i = 0 ; i < columns_per_thread ; ++i) {
        int column = offset + i;
        if(column < columns) {
            previousCost[column] = 0.0;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    
    // Calculate minimum cost seams for each successive row
    for(int row = 0 ; row < rows ; ++row) {
        for(int i = 0 ; i < columns_per_thread ; ++i) {
            int column = offset + i;
            if(column < columns) { int minimumParent = -1; float minimumParentCost = INFINITY; if(column - 1 >= 0 && previousCost[column - 1] < minimumParentCost) {
                    minimumParentCost = previousCost[column - 1];
                    minimumParent = column - 1;
                }
                if(previousCost[column] < minimumParentCost) {
                    minimumParentCost = previousCost[column];
                    minimumParent = column;
                }
                if(column + 1 < columns && previousCost[column + 1] < minimumParentCost) {
                    minimumParentCost = previousCost[column + 1];
                    minimumParent = column + 1;
                }
                
                cost[column] = minimumParentCost + dot(1.0, imageGradient(row, column));
                parent(row, column) = minimumParent;
            }
        }
        __local float* temp = cost;
        cost = previousCost;
        previousCost = temp;
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    
    /* Perform a parallel reduction to find the minimum cost pixel of the last row */
    
    // Prepare for reduction
    __local float* lastRowCost = previousCost;
    __local int* minimumIndices = (__local int*)cost;
    float minimum = INFINITY;
    int minimumIndex = -1;
    for(int i = 0 ; i < columns_per_thread ; ++i) {
        int column = offset + i;
        if(column < columns && lastRowCost[column] < minimum) {
            minimum = lastRowCost[column];
            minimumIndex = column;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    lastRowCost[lid] = minimum;
    minimumIndices[lid] = minimumIndex;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Perform the reduction
    for(size_t stride = 1 ; stride < get_local_size(0) ; stride <<= 1) {
        int mask = (stride << 1) - 1; if((lid & mask) == 0) { if(lastRowCost[lid] > lastRowCost[lid + stride]) {
                lastRowCost[lid] = lastRowCost[lid + stride];
                minimumIndices[lid] = minimumIndices[lid + stride];
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Let first thread write result of reduction to global memory 
    if(lid == 0) {
        *minimumCostColumn = minimumIndices[0];
    }
}

Launch configuration
Global size:         max work group size (256 in my case)
Workgroup size: max work group size

Similar to the Python implementation, this kernel iterates through each row of the image except the work within an iteration is now parallelized.

Computing the minimum cost seam ending at a given pixel requires accessing the minimum costs of its three adjacent pixels in the previous row, so it is preferable to keep previous row costs in faster local memory. However, if we are to use local memory which is shared only within a workgroup, the kernel must be launched as a single workgroup. OpenCL devices have a maximum work group size which was 256 for my GPU, so in order to process images wider than this number, each work unit must become in charge of multiple pixels within each row (columns_per_thread variable in code).

As the kernel computes the minimum cost seams ending at each pixel, it marks the parents in a buffer in global memory. This will be passed onto the next kernel which extracts the path of the minimum cost seam. Finding this path also requires knowing the minimum cost column of the last row of the pixel. I take advantage of the commutative and associative property of the \(min(a, b)\) operation and use a parallel reduction to find the minimum cost column. Reduction is easier to think about and implement when the input array is equal to the size of the work group, so we prepare for the reduction by first getting the array holding the last row costs into this format.

3. Extracting path of minimum cost seam

__kernel void extract_seam_path(__global int const *parent, __const int rows, __const int columns,
				__global int const *lastRowMinimumCostColumn, __global int *path)
{
	int currentColumn = *lastRowMinimumCostColumn;
	for(int row = rows - 1 ; row >= 0 ; --row) {
		path[row] = currentColumn;
		currentColumn = parent(row, currentColumn);
	}
}

Launch configuration
Global size:         1
Workgroup size: 1

This kernel is run as a single thread (serially). It uses the parent array and minimum cost column of the last row computed from the previous stage kernel and extracts the minimum cost path, saving it into an integer array in global memory.

4. Seam removal

__kernel void remove_seam(__global uchar4 *image, __const int rowPitch,
                          __const int rows, __const int columns,
                          __global int const *path)
{
    int lid = get_local_id(0);
    int columns_per_thread = ceil((float)columns / get_local_size(0));
    int offset = lid * columns_per_thread;
    for(int row = 0 ; row < rows ; ++row) {
        uchar4 onePixelBeyondLast = image(row, offset + columns_per_thread);
        barrier(CLK_LOCAL_MEM_FENCE);

        for(int i = 0 ; i < columns_per_thread - 1 ; ++i) { int column = offset + i; if(column >= path[row]) {
                image(row, column) = image(row, column + 1);
            }
        }

        int lastIndex = offset + columns_per_thread - 1;
        if(lastIndex >= path[row]) {
            image(row, lastIndex) = onePixelBeyondLast;
        }

        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

Global size:         max workgroup size (256 in my case)
Workgroup size:      max workgroup size (256 in my case)

This kernel performs the final stage of the algorithm, removing the seam from the image in-place. As explained before, it does this by shifting everything that is on the right side of the seam to the left by one pixel. Because we are reading and writing to the same buffer, we need to be prudent in using barriers to ensure safety. Barriers only synchronize threads within one workgroup so like the previous seam carving kernel, this kernel is launched as a single workgroup. The workgroup removes the seam from the image one row at a time and some threads become responsible for shifting multiple pixels in each iteration (row).

Host code (pyopencl)

import pyopencl as cl
import numpy
from PIL import Image

class OpenCLSeamCarver:
    def __init__(self):
        self.platform = cl.get_platforms()[0]
        
        # Select fastest device (by clock frequency) that supports 16x16 work group  
        def supports_16x16_workgroup(device):
            return device.max_work_item_sizes[0] >= 16 and device.max_work_item_sizes[1] >= 16
        devices_that_support_16x16_workgroup = filter(supports_16x16_workgroup, self.platform.get_devices())
        self.device = max(devices_that_support_16x16_workgroup, key=lambda device: device.max_clock_frequency)

        self.context = cl.Context([self.device])
        self.queue = cl.CommandQueue(self.context, properties=cl.command_queue_properties.PROFILING_ENABLE)
       
        # Read and compile kernels from source file
        with open("seam_carve.cl", "r") as source:
            self.program = cl.Program(self.context, source.read()).build()
        
        # Set up kernels for execution
        self.gradient_magnitudes_kernel = self.program.gradient_magnitudes
        self.gradient_magnitudes_kernel.set_scalar_arg_dtypes([None, numpy.int32, numpy.int32, numpy.int32, None, numpy.int32, None, None])
        
        self.seam_carve_kernel = self.program.seam_carve
        self.seam_carve_kernel.set_scalar_arg_dtypes([None, numpy.int32, numpy.int32, numpy.int32, None, None, None, None])
        
        self.extract_seam_path_kernel = self.program.extract_seam_path
        self.extract_seam_path_kernel.set_scalar_arg_dtypes([None, numpy.int32, numpy.int32, numpy.int32, None, None])
        
        self.remove_seam_kernel = self.program.remove_seam
        self.remove_seam_kernel.set_scalar_arg_dtypes([None, numpy.int32, numpy.int32, numpy.int32, None])
        
    def seam_carve(self, image: Image.Image, num_columns_to_remove: int=1) -> Image.Image:
        """Removes the specified number of vertical seams from a color image

        Args:
        image (Image.Image): color image
        num_columns_to_remove (int)

        Returns:
        image (Image.Image): seam carved color image
        """
        
        # Initialize buffers
        ## For image
        image_data = numpy.array(image.getdata()).reshape(image.size[1], image.size[0], 3)
        image_data = numpy.concatenate((image_data, numpy.ones((image_data.shape[0], image_data.shape[1], 1))), axis=2).astype(numpy.uint8)
        rows, columns, channels = image_data.shape
        row_pitch = columns
        image_cl = cl.Buffer(self.context, cl.mem_flags.READ_ONLY, image_data.nbytes)
        cl.enqueue_copy(self.queue, image_cl, image_data)
        
        ## For gradient computation
        filter = numpy.matrix('-1 0 1 ; -1 0 1 ; -1 0 1', dtype=numpy.int32)
        filter_cl = cl.Buffer(self.context, cl.mem_flags.READ_ONLY, filter.nbytes)
        cl.enqueue_copy(self.queue, filter_cl, filter)
        region_cl = cl.LocalMemory(18 * 18 * 4 * 4) # array of float32 x 4 vector
        gradient_magnitudes_cl = cl.Buffer(self.context, cl.mem_flags.READ_WRITE, numpy.product(image_data.shape) * 4) # float32
        
        ## For minimum cost seam computation
        previous_cost_cl = cl.LocalMemory(columns * 4) # float32
        cost_cl = cl.LocalMemory(columns * 4) # float32
        parent_cl = cl.Buffer(self.context, cl.mem_flags.WRITE_ONLY, numpy.product(image_data.shape[:2]) * 4) # int32
        
        ## For seam path extraction
        minimum_cost_col_cl = cl.Buffer(self.context, cl.mem_flags.WRITE_ONLY, 4)
        path_cl = cl.Buffer(self.context, cl.mem_flags.READ_WRITE, 4 * rows)
        
        for i in range(num_columns_to_remove):
            # Compute gradient magnitude for each color
            workgroup_size = [16, 16]
            global_size = [rows + (-rows % workgroup_size[0]),
                           columns + (-columns % workgroup_size[1])]
            self.gradient_magnitudes_kernel(self.queue, global_size, workgroup_size,
                                            image_cl, row_pitch, rows, columns,
                                            filter_cl, region_cl, gradient_magnitudes_cl)
            
            # Find minimum cost seam
            self.seam_carve_kernel(self.queue, [self.device.max_work_group_size], [self.device.max_work_group_size],
                                   gradient_magnitudes_cl, row_pitch, rows, columns,
                                   cost_cl, previous_cost_cl, parent_cl, minimum_cost_col_cl)
            
            # Extract path of seam
            self.extract_seam_path_kernel(self.queue, [1], [1],
                                          parent_cl, row_pitch, rows, columns,
                                          minimum_cost_col_cl, path_cl)
            
            # Remove seam from image
            self.remove_seam_kernel(self.queue, [self.device.max_work_group_size], [self.device.max_work_group_size],
                                    image_cl, row_pitch, rows, columns,
                                    path_cl)
            
            columns -= 1
        
        final_data = numpy.zeros((rows, row_pitch, channels), dtype=numpy.uint8)
        cl.enqueue_copy(self.queue, final_data, image_cl)
        
        return Image.fromarray(final_data[:, :columns, :3])

The OpenCLSeamCarver class initializes an OpenCL context and an in-order command queue on the fastest possible device that supports a 16×16 work group (necessary for the optimized gradient_magnitudes kernel). Its seam_carve method then repeatedly queues the kernels for execution  in the appropriate order as many times as specified by the caller (number of vertical seams to remove). As we wrote OpenCL kernels for all stages of the algorithm, memory transfer between the host and device is minimized. The image is only transferred between the host and device twice in the function, once to copy the input image to the device and lastly to copy the modified image (with the specified number of seams removed) back to the host.

Performance

Running on just the GPU integrated with my Intel Kaby Lake processor, the OpenCL implementation performs the task of removing 100 seams from the earlier image with dimensions of 224 by 320 in 142 milliseconds, which is almost 180x faster than the serial implementation which took 25.6 seconds.

It would have been infeasible to run the serial implementation on large images captured by a modern camera but the parallel implementation makes this possible.  Carving the equivalent chunk out of a larger version of the butterfly image (400 pixels from a 1280 by 853 image) took only 4.95 seconds:

Demo GUI application

I wrote a small GUI application using Qt and C++ to demo the OpenCL implementation of seam carving. Click here to go to the GitHub repo where you can download and run the app (macOS and Windows only).

The app shows a live preview of the image as it gets rescaled. One way to implement this would have been to copy the contents of the OpenCL image buffer after each iteration but this I/O would be a major performance killer. Instead, I used OpenCL and OpenGL interop functionality to share the OpenCL image buffer in the memory of the GPU with an OpenGL context which is used by a simple fragment shader to display the image.