
Thread: Memory leak using emulator

  1. #1
    Join Date
    May 2017
    Posts
    17
    Rep Power
    1

    Default Memory leak using emulator

    Hi

    I'm working on porting a CUDA program to OpenCL to run on an FPGA; right now I'm using the emulator since I don't have the device yet.
    I wrote an OpenCL kernel that does some simple computation on the image passed from the GPU, and for some reason memory usage increases dramatically for each pixel it computes, and then it overflows at the third frame.
    The error messages are:
    Context callback: Could not allocate a buffer in host memory
    Context callback: Could not map host buffers to device
    ERROR: CL_OUT_OF_HOST_MEMORY


    I do release the buffer after each frame and free the host memory as well, but the memory still accumulates.

    Launching kernel part (runs for each frame):
    Code:
    ////////////////////////////////////////////////////////////////////////////////////////////////////////
        cl_int status;
        cufftComplex* h_afPadScnOut;
        h_afPadScnOut = (cufftComplex *)malloc(giScnMemSzCmplx);
        CUDA_SAFE_CALL(cudaMemcpy(h_afPadScnOut, gd_afPadScnOut, giScnMemSzCmplx, cudaMemcpyDeviceToHost));// copy memory to host
        cl_mem cl_d_afPadScnOut = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, giScnMemSzCmplx, h_afPadScnOut, NULL);
        cl_event* write_event = (cl_event *)malloc(sizeof(cl_event));
        status = clEnqueueWriteBuffer(queue, cl_d_afPadScnOut, CL_TRUE, 0, giScnMemSzCmplx, h_afPadScnOut, 0, NULL, write_event);// write into CL buffer
        checkError(status, "Failed to write buffer cl_gd_afPadScnOut");
    
    
        // Set the kernel arguments 
        status = clSetKernelArg(kthLaw_kernel, 0, sizeof(cl_mem), (void*)&cl_d_afPadScnOut);
        checkError(status, "Failed to set kernel arg 0");
        status = clSetKernelArg(kthLaw_kernel, 1, sizeof(cl_int), (void*)&giScnSz);
        checkError(status, "Failed to set kernel arg 1");
    
    
        printf("\nKernel initialization is complete.\n");
        printf("Launching the kernel...\n\n");
    
    
        // Configure work set over which the kernel will execute
        size_t wgSize[3] = { 256, 1, 1 };
        size_t gSize[3] = { 307200, 1, 1 };
    
    
        // Launch the kernel
        status = clEnqueueNDRangeKernel(queue, kthLaw_kernel, 1, NULL, gSize, wgSize, 1, write_event, NULL);
        checkError(status, "Failed to launch kernel");
        clReleaseEvent(*write_event);
        free(write_event); // the cl_event pointer itself was malloc'd above and must be freed too
    
        //Read back data 
        status = clEnqueueReadBuffer(queue, cl_d_afPadScnOut, CL_TRUE, 0, giScnMemSzCmplx, h_afPadScnOut, 0, NULL, NULL);
        checkError(status, "Failed to read buffer cl_gd_afPadScnOut");
    
        //Free CL buffer
        status = clReleaseMemObject(cl_d_afPadScnOut);
        checkError(status, "Failed to release buffer");
    
        // Wait for command queue to complete pending events
        status = clFinish(queue);
        checkError(status, "Failed to finish");
    
        printf("\nKernel execution is complete.\n");
    
        // Free the resources allocated
        //AOCLcleanup();
        CUDA_SAFE_CALL(cudaMemcpy(gd_afPadScnOut, h_afPadScnOut, giScnMemSzCmplx, cudaMemcpyHostToDevice));
        free(h_afPadScnOut);
        ////////////////////////////////////////////////////////////////////////////////////////////////////////
    Kernel:
    Code:
    __kernel void kthLaw(__global float2* d_afPadScn, int dataN)
    {
        int iIndx = get_global_id(0);
        if (iIndx < dataN)
        {
            //afVals(:) = (abs(afVals(:)).^k) .* (cos(angle(afVals(:))) + sin(angle(afVals(:)))*i);
            float2 cDat = d_afPadScn[iIndx];
            // OpenCL C has no sqrtf/cosf/sinf; the built-ins are overloaded for float
            float fNewAbsDat = pow(sqrt(cDat.x * cDat.x + cDat.y * cDat.y), 0.1f);
            float fAngDat = atan2(cDat.y, cDat.x);
            cDat.x = fNewAbsDat * cos(fAngDat);
            cDat.y = fNewAbsDat * sin(fAngDat);
            d_afPadScn[iIndx] = cDat;
        }
    }
    Also, I saw the memory increasing in Task Manager; is there a way to print out the memory usage from the kernel?

    Any advice will be appreciated.

    -------------------------update-------------------------
    Well, I read some material from Altera, and they say executing a large number of parallel kernels is not feasible on an FPGA; a pipelined design should be used instead.
    So I rewrote the kernel serially, and the memory problem was gone and the emulator runs faster!
    I guess I hadn't thought it through properly: the emulator emulates the behavior of an FPGA, where the kernels are actual hardware, so of course they can't be freed at runtime...
    Last edited by matthewyih; July 13th, 2017 at 03:53 PM.

  2. #2
    Join Date
    Jan 2017
    Posts
    294
    Rep Power
    1

    Default Re: Memory leak using emulator

    Why do you have "cudaMemcpy" in OpenCL code??!! How do you even compile this code?

  3. #3

    Default Re: Memory leak using emulator

    Quote Originally Posted by HRZ View Post
    Why do you have "cudaMemcpy" in OpenCL code??!! How do you even compile this code?
    Because the original code was written in CUDA and runs on the CPU/GPU, and I want to port it to OpenCL one part at a time.
    So this code runs on the CPU/GPU and the emulator (also the CPU).
    I compile it the way the Altera design examples do: Visual Studio (with the CUDA headers included) for the host program and aoc for the kernel, executed from the command prompt.
    The CUDA part shouldn't be the problem, since the output frames are fine.
    Last edited by matthewyih; July 12th, 2017 at 08:50 PM.

  4. #4

    Default Re: Memory leak using emulator

    Are you trying to perform part of your computation on a GPU using CUDA and then pass the output to an FPGA using OpenCL? If so, I wouldn't expect it to work at all, since you are mixing libraries with completely different characteristics. Since OpenCL works just fine on GPUs, I recommend porting everything to OpenCL on a GPU first, and then porting it to the FPGA.

    Also, if you are using "clEnqueueWriteBuffer" to write your host buffer to the device, you shouldn't use "CL_MEM_USE_HOST_PTR" when creating the device buffer; the latter is for when you do not want to explicitly copy the buffer from host to device and instead let the OpenCL runtime decide when and how to do the transfer. This is mostly useful when targeting CPUs, to avoid allocating two copies of the same buffer in host memory (which is the same as device memory in that case).
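    For illustration, the two patterns look roughly like this (an uncompiled sketch; the context, queue, size, host_ptr, and status variables are assumed to exist as in the code above):

    Code:
```
// Pattern 1: explicit copy -- plain device buffer, then clEnqueueWriteBuffer
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &status);
status = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_ptr, 0, NULL, NULL);

// Pattern 2: zero copy -- the runtime uses/maps the host allocation directly;
// no clEnqueueWriteBuffer afterwards
cl_mem buf2 = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, size, host_ptr, &status);
```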

  5. #5

    Default Re: Memory leak using emulator

    The goal is to port everything to OpenCL/FPGA. The whole program is not trivial, so I'm trying to port one kernel at a time.
    Thank you for the suggestion; I will try running OpenCL on a GPU as well.

    Do you mean CL_MEM_COPY_HOST_PTR is the proper way? I did try it, but the result was the same.

    The problem I'm facing is that whenever I launch a kernel thread on the emulator, it takes more memory space and doesn't release it after it's done.
    -----------------------------
    Well, I guess I found the problem.
    Last edited by matthewyih; July 13th, 2017 at 03:31 PM.

  6. #6

    Default Re: Memory leak using emulator

    Quote Originally Posted by matthewyih View Post
    Do you mean CL_MEM_COPY_HOST_PTR is the proper way? I did try it but the result was the same.
    No, if you are going to manually copy the host buffer to the device using clEnqueueWriteBuffer, you should create your device buffer like this:

    Code:
    cl_mem cl_d_afPadScnOut = clCreateBuffer(context, CL_MEM_READ_WRITE, giScnMemSzCmplx, NULL, NULL);

  7. #7

    Default Re: Memory leak using emulator

    Thanks HRZ

    What's still bothering me is why the memory keeps accumulating when I launch it in parallel for each frame.
    When I use a serial for loop, the memory doesn't increase as more frames are computed.
    The same kernel launched with different input data should use the same hardware on the FPGA, right?

  8. #8

    Default Re: Memory leak using emulator

    I am not really sure; I would guess that for parallel operations the emulator keeps all data in memory until execution has finished. Since the emulator is extremely slow anyway, when I do have to use it, I try to debug my code with very small inputs.


