2024/04/17

OpenCL

簡介

OpenCL(Open Computing Language,開放計算語言)是一個為異構平台編寫程式的框架, 此異構平台可由 CPU、GPU、DSP、FPGA 或其他類型的處理器與硬體加速器所組成。 Portable Computing Language (PoCL) 則是 OpenCL 的一個自由軟體實作, 可以在機器上沒有 GPU 的情況下使用 OpenCL API 進行運算。

OpenCL 包括一組 API 和一個程式語言。基本的原理是程式透過 OpenCL API 取得 OpenCL 裝置(例如顯示晶片)的相關資料, 並將要在裝置上執行的程式(使用 OpenCL 程式語言撰寫)編繹成適當的格式以後在裝置上執行。

An OpenCL application is split into host code and device kernel code. Execution of an OpenCL program occurs in two parts: kernelsthat execute on one or more OpenCL devices and a host program that executes on the host.

The most commonly used language for programming the kernels that are compiled and executed across the available parallel processors is called OpenCL C. OpenCL C is based on C99 and is defined as part of the OpenCL specification.

The core of the OpenCL execution model is defined by how the kernels execute. OpenCL regards a kernel program as the basic unit of executable code (similar to a C function). Kernels can execute with data or task-parallelism. An OpenCL program is a collection of kernels and functions (similar to dynamic library with run-time linking).

An OpenCL command queue is used by the host application to send kernels and data transfer functions to a device for execution. By enqueueing commands into a command queue, kernels and data transfer functions may execute asynchronously and in parallel with application host code.

The kernels and functions in a command queue can be executed in-order or out-of-order. A compute device may have multiple command queues.


A complete sequence for executing an OpenCL program is:

  1. Query for available OpenCL platforms and devices
  2. Create a context for one or more OpenCL devices in a platform
  3. Create and build programs for OpenCL devices in the context
  4. Select kernels to execute from the programs
  5. Create memory objects for kernels to operate on
  6. Create command queues to execute commands on an OpenCL device
  7. Enqueue data transfer commands into the memory objects, if needed
  8. Enqueue kernels into the command queue for execution
  9. Enqueue commands to transfer data back to the host, if needed

A host is connected to one or more OpenCL compute devices. Each compute device is collection of one or more compute units where each compute unit is composed of one or more processing elements. Processing elements execute code with SIMD (Single Instruction Multiple Data) or SPMD (Single Program Multiple Data) parallelism.


For example, a compute device could be a GPU. Compute units would then correspond to the streaming multiprocessors (SMs) inside the GPU, and processing elements correspond to individual streaming processors (SPs) inside each SM. Processors typically group processing elements into compute units for implementation efficiency through sharing instruction dispatch and memory resources, and increasing local inter-processor communication.

OpenCL's clEnqueueNDRangeKernel command enables a single kernel program to be initiated to operate in parallel across an N-dimensional data structure. Using a two-dimensional image as a example, the size of the image would be the NDRange, and each pixel is called a work-item that a copy of kernel running on a single processing element will operate on.

As we saw in the Platform Model section above, it is common for processors to group processing elements into compute units for execution efficiency. Therefore, when using the clEnqueueNDRangeKernel command, the program specifies a work-group size that represents groups of individual work-items in an NDRange that can be accommodated on a compute unit. Work-items in the same work-group are able to share local memory, synchronize more easily using work-group barriers, and cooperate more efficiently using work-group functions such as async_work_group_copy that are not available between work-items in separate work-groups.


OpenCL has a hierarchy of memory types:

  • Host memory - available to the host CPU
  • Global/Constant memory - available to all compute units in a compute device
  • Local memory - available to all the processing elements in a compute unit
  • Private memory - available to a single processing element

OpenCL memory management is explicit. None of the above memories are automatically synchronized and so the application explicitly moves data between memory types as needed.


在 openSUSE Tumbleweed 上安裝:

sudo zypper in ocl-icd-devel pocl-devel opencl-headers clinfo

OpenCL Installable Client Driver (ICD) allows multiple OpenCL implementations to co-exist; also, it allows applications to select between these implementations at runtime.

Use the clGetPlatformIDs() and clGetPlatformInfo() functions to see the list of available OpenCL implementations, and select the one that is best for your requirements.

執行 clinfo 觀察目前的 OpenCL device 資訊。

下面的程式使用 clGetPlatformIDs 函式取得目前可用的 platform 數目 (編譯指令:gcc test.c `pkg-config --libs --cflags OpenCL`):

#include <stdio.h>

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main( void ) {
    // OpenCL related declarations
    cl_int err;
    cl_uint num;

    err = clGetPlatformIDs( 0, NULL, &num );
    printf("%d\n", num);

}

下面是另外一個範例:

#include <stdlib.h>
#include <stdio.h>

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

const char *kernel_code =
    "__kernel void vector_add(__global const int *A, __global const int *B, __global int *C) {"
    "    int i = get_global_id(0);"
    "    C[i] = A[i] + B[i];"
    "}";

int main( void ) {
    // OpenCL related declarations
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx;
    cl_program program;
    cl_command_queue queue;
    cl_kernel kernel;
    int i;

    //
    const size_t N = 1024; // vector size
    size_t global_item_size = N; // Process the entire lists
    size_t local_item_size = 64; // Divide work items into groups of 64

    int *A, *B, *C;
    A = (int*) malloc(N * sizeof(*A));
    B = (int*) malloc(N * sizeof(*B));
    C = (int*) malloc(N * sizeof(*C));
    for (i=0; i<N; i++) {
        A[i] = i;
        B[i] = i + 1;
    }
    cl_mem d_A, d_B, d_C;

    /* Setup OpenCL environment. */
    err = clGetPlatformIDs( 1, &platform, NULL );
    err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL );

    props[1] = (cl_context_properties)platform;
    ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
    queue = clCreateCommandQueueWithProperties( ctx, device, 0, &err );
    program = clCreateProgramWithSource(ctx, 1, (const char **) &kernel_code, NULL, &err);
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "vector_add", &err);

    // initialize buffer with data
    d_A = clCreateBuffer( ctx, CL_MEM_READ_ONLY, N*sizeof(*A), NULL, &err );
    d_B = clCreateBuffer( ctx, CL_MEM_READ_ONLY, N*sizeof(*B), NULL, &err );
    d_C = clCreateBuffer( ctx, CL_MEM_WRITE_ONLY, N*sizeof(*C), NULL, &err );

    err = clEnqueueWriteBuffer( queue, d_A, CL_TRUE, 0, N*sizeof(*A), A, 0, NULL, NULL );
    err = clEnqueueWriteBuffer( queue, d_B, CL_TRUE, 0, N*sizeof(*B), B, 0, NULL, NULL );

    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&d_A);
    err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&d_B);
    err = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&d_C);

    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
            &global_item_size, &local_item_size, 0, NULL, NULL);

    err = clFinish(queue);

    err = clEnqueueReadBuffer( queue, d_C, CL_TRUE, 0, N*sizeof(*C), C, 0, NULL, NULL );
    err = clFinish(queue);

    for(i = 0; i < N; i++)
        printf("%d + %d = %d\n", A[i], B[i], C[i]);

    err = clFlush(queue);
    err = clFinish(queue);

    /* Release OpenCL memory objects. */
    clReleaseMemObject( d_A );
    clReleaseMemObject( d_B );
    clReleaseMemObject( d_C );
    free(A);
    free(B);
    free(C);
    clReleaseKernel( kernel );
    clReleaseProgram( program );
    clReleaseCommandQueue( queue );
    clReleaseContext( ctx );

    return 0;
}

下面的程式是使用 stb_image 讀取圖檔,測試 image object 功能的程式。

#include <stdio.h>
#include <stdlib.h>

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"

const char *kernel_code =
    "__kernel void PixelAccess(__read_only image2d_t imageIn,__write_only image2d_t imageOut)"
    "{"
    "  sampler_t srcSampler = CLK_NORMALIZED_COORDS_FALSE | "
    "    CLK_ADDRESS_CLAMP_TO_EDGE |"
    "    CLK_FILTER_NEAREST;"
    "  int2 imageCoord = (int2) (get_global_id(0), get_global_id(1));"
    "  uint4 pixel = read_imageui(imageIn, srcSampler, imageCoord);"
    "  write_imageui (imageOut, imageCoord, pixel);"
    "}";


int main( int argc, char *argv[] ) {
    // OpenCL related declarations
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx;
    cl_program program;
    cl_command_queue queue;
    cl_kernel kernel;
    int i;
    int width = 0, height = 0, channel = 0;
    unsigned char *data = NULL;
    const char *filename = NULL;

    if (argc < 2) {
        printf("Please give a filename.\n");
        return 0;
    } else if (argc == 2) {
        filename =  argv[1];
    }

    // Load image data
    data = stbi_load(filename, &width, &height, &channel, 0);
    if(!data) {
        fprintf(stderr, "Open image failed.\n");
        return 0;
    }

    cl_mem myClImageInBuffer;
    cl_mem myClImageOutBuffer;
    cl_sampler sampler;

    cl_image_format format;
    if (channel==4) {
        format.image_channel_order = CL_RGBA;
    } else {
        printf("Not supported image format.\n");
        return 0;
    }
    format.image_channel_data_type = CL_UNSIGNED_INT8;

    err = clGetPlatformIDs( 1, &platform, NULL );
    err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL );

    cl_bool imageSupport = CL_FALSE;
    clGetDeviceInfo(device, CL_DEVICE_IMAGE_SUPPORT, sizeof(cl_bool),
                    &imageSupport, NULL);

    if (imageSupport != CL_TRUE)
    {
        printf("OpenCL device does not support images.\n");
        return 1;
    }

    props[1] = (cl_context_properties)platform;
    ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
    queue = clCreateCommandQueueWithProperties( ctx, device, 0, &err );
    program = clCreateProgramWithSource(ctx, 1, (const char **) &kernel_code, NULL, &err);
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "PixelAccess", &err);

    //
    // For OpenCL 1.2
    cl_image_desc clImageDesc;
    clImageDesc.image_type = CL_MEM_OBJECT_IMAGE2D;
    clImageDesc.image_width = width;
    clImageDesc.image_height = height;
    clImageDesc.image_row_pitch = 0;
    clImageDesc.image_slice_pitch = 0;
    clImageDesc.num_mip_levels = 0;
    clImageDesc.num_samples = 0;
    clImageDesc.buffer = NULL;

    myClImageInBuffer = clCreateImage(ctx, CL_MEM_READ_ONLY,
                            &format, &clImageDesc, NULL, &err);
    if (!myClImageInBuffer) {
        printf("Create myClImageInBuffer failed.\n");
    }

    myClImageOutBuffer = clCreateImage(ctx, CL_MEM_READ_WRITE,
                            &format, &clImageDesc, NULL, &err);
    if (!myClImageOutBuffer) {
        printf("Create myClImageOutBuffer failed.\n");
    }

    size_t origin[3] = {0, 0, 0};
    size_t region[3] = {width, height, 1};

    err = clEnqueueWriteImage(
            queue, myClImageInBuffer,
            CL_TRUE, origin, region,
            0,
            0, data,
            0, NULL, NULL);

    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &myClImageInBuffer);
    err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &myClImageOutBuffer);

    size_t global_item_size[2] = {width, height};
    size_t local_item_size[2] = {1, 1};

    err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
            global_item_size, local_item_size, 0, NULL, NULL);

    err = clFinish(queue);

    unsigned char *data2 = NULL;
    data2 = (unsigned char *) malloc(width * height  * channel);
    err = clEnqueueReadImage( queue,
          myClImageOutBuffer, CL_TRUE,
          origin, region,
          width * sizeof(unsigned char) * 4,
          0, data2,
          0, NULL, NULL);

    err = clFinish(queue);

    stbi_write_png("output.png", width, height, channel, data2, 0);

    free(data2);
    stbi_image_free(data);

    clReleaseMemObject( myClImageInBuffer );
    clReleaseMemObject( myClImageOutBuffer );
    clReleaseKernel( kernel );
    clReleaseProgram( program );
    clReleaseCommandQueue( queue );
    clReleaseContext( ctx );

    return 0;
}

參考連結

沒有留言:

張貼留言

注意:只有此網誌的成員可以留言。