
Thread: Slow memcpy speed

  1. #1

    Hi all,
    I have a design based upon the “Lab 4 - Linux FFT Application” from Rocketboard which runs on the Terasic DE0-Nano-SoC (Cyclone V SoC) evaluation board.

    First the data is transferred from the FPGA to the HPS SDRAM using DMA. This transfer is fast: 8 kBytes (1k * 64 bit) takes 21 us => 380 Mbytes/s.

    Signal processing on the HPS is a bit slow while the data sits in SDRAM, so to speed it up the 8 kBytes of data are first copied into an array using memcpy.
    The signal processing is now much faster, but the memcpy “penalty” is high: copying the 8 kBytes takes 500 us (16 Mbytes/s) with the compile flags -O0, -O2 or -O3.
    With -O1 the copy time drops to 188 us (42 Mbytes/s), but from what I have read this still seems to be at least 4 times slower than expected.
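The rates quoted above are simply bytes divided by elapsed time. A minimal sketch of that arithmetic follows. It assumes "8 kBytes" means 1024 * 8 = 8192 bytes, and the function name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: throughput in Mbytes/s (10^6 bytes per second).
 * Bytes per microsecond is numerically equal to Mbytes per second. */
static double mbytes_per_second(size_t bytes, double microseconds)
{
    assert(microseconds > 0.0);
    return (double)bytes / microseconds;
}
```

With 8192 bytes this gives roughly 390 for the 21 us DMA transfer, 16.4 for the 500 us memcpy and 43.6 for the 188 us memcpy, slightly above the rounded figures in the post.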

    Has anyone done similar tests, or does anyone know of other options that must be set to get a faster memcpy transfer?

    All timing measurements are done using an oscilloscope (start/stop trigger signals are written from the HPS to the FPGA-GPIO).

    OS: Angstrom v2015.12. Linux real time kernel version 4.1.22-ltsi-rt (PREEMPT RT)

  2. #2

    An update:

    When defining arrays like this
    int value[2048]; //source array
    int dest[2048] ; //destination array
    and running memcpy(dest, value, 2048*4), the memcpy speed is high: 446 Mbytes/s.
    Also, the compile flag -Ofast gives a faster copy than -O1, as expected.
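A copy between two ordinary arrays like this can also be timed in software, as a cross-check on the oscilloscope measurement. The following is only a sketch under assumptions: the function name is hypothetical, CLOCK_MONOTONIC is assumed to be available, and the empty asm statement is a GCC/Clang-specific compiler barrier.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copies n bytes `repeats` times and returns the
 * average time per copy in microseconds, measured with CLOCK_MONOTONIC. */
static double time_memcpy_us(void *dst, const void *src, size_t n, int repeats)
{
    struct timespec t0, t1;
    assert(repeats > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < repeats; i++) {
        memcpy(dst, src, n);
        /* Compiler barrier so the repeated copies are not optimised away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / 1000.0 / repeats;
}
```

Dividing the byte count by the returned value gives Mbytes/s directly. Averaging over many repeats mostly measures the cache-warm case, so a single-shot oscilloscope measurement can legitimately come out slower.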

    - - - - - -

    My design is based upon the fpga_fft example from Rocketboard where DMA transfers data from FPGA into HPS’s DRAM memory.
    The memory space for these data (*value) is defined using mmap:

    volatile unsigned int *value;
    volatile unsigned int dest[2048*4];
    #define RESULT_BASE (FFT_SUB_DATA_BASE + (int)mappedBase +(FFT_SUB_DATA_SPAN/2))

    - - - - - -
    In main:

    // We need a pointer to the LW bridge from the software's point of view,
    // so first open /dev/mem ...
    if ((mem = open("/dev/mem", O_RDWR | O_SYNC)) == -1) {
        fprintf(stderr, "Cannot open /dev/mem\n");
        exit(1);
    }
    // ... and then map the LW bridge address space into this process:
    mappedBase = mmap(0, 0x1f0000, PROT_READ | PROT_WRITE, MAP_SHARED, mem, ALT_LWFPGASLVS_OFST);

    if (mappedBase == MAP_FAILED) {
        // mmap reports failure through errno, not through the returned pointer
        perror("Memory map failed");
        exit(1);
    }

    Run DMA and wait for completion

    // When the DMA has finished, the data is available:
    value = (unsigned int *)((int)RESULT_BASE);
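The open/mmap steps above can be wrapped in one small function. This is a sketch with a hypothetical helper name. The path is a parameter so the same code can be tried on an ordinary file; in this thread it would be "/dev/mem" with ALT_LWFPGASLVS_OFST as the offset, which normally requires root.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: maps `span` bytes of `path` starting at `offset`
 * (which must be page aligned). Returns NULL on failure. */
static void *map_region(const char *path, off_t offset, size_t span)
{
    assert(path != NULL && span > 0);

    int fd = open(path, O_RDWR | O_SYNC);
    if (fd == -1)
        return NULL;

    void *base = mmap(NULL, span, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    close(fd);  /* the mapping stays valid after the descriptor is closed */

    return base == MAP_FAILED ? NULL : base;
}
```

Comparing against MAP_FAILED and returning NULL keeps the error check in one place; the caller releases the mapping with munmap(base, span).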

    - - - - - -

    Now, when running memcpy(dest, value, 2048*4), the speed is low: only 42 Mbytes/s, and the compiler does not respond as expected to the -O flags, i.e. -Ofast is slower than -O1.
    It seems that accessing the memory through mmap really slows things down. Is it possible to speed this up?
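One detail in the declarations above is worth noting: memcpy takes plain `void *` arguments, so passing the volatile-qualified `value` and `dest` discards the qualifier, and compilers normally warn about that. A minimal sketch of a word-by-word copy that keeps the volatile access on the source, using a hypothetical function name; it is not claimed to make reads of the mapped region any faster:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy `words` 32-bit words out of a volatile
 * (for example memory-mapped) source into an ordinary buffer.
 * Each source word is read exactly once, in order. */
static void copy_from_volatile(uint32_t *dst, const volatile uint32_t *src, size_t words)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < words; i++)
        dst[i] = src[i];
}
```

Keeping the destination non-volatile lets the compiler optimise the signal processing that later works on the copy.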

    Any help would be greatly appreciated!


  3. #3

    I think my problem is related to the high address (ALT_LWFPGASLVS_OFST = 0xFF200000) that is used, and this might have to be fixed in kernel space…
    While waiting for someone to fix this for me, I wrote an assembly version of memcpy based on the “NEON memory copy with preload” example from the ARM Infocenter.
    I had to add “SUBS r2,r2,#0x40” before the loop; otherwise the loop would go 64 bytes too far (thus overwriting memory).
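The overrun can be reproduced with a small C model of the loop counter. This sketch assumes the loop copies 64 bytes, subtracts 64 from the counter and branches back while the result is non-negative; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Models the byte counter of a copy loop that moves 64 bytes, subtracts 64
 * from the counter and repeats while the result is >= 0 (SUBS + BGE).
 * pre_subtract selects whether 64 is subtracted once before the loop.
 * Returns the total number of bytes the loop would copy. */
static size_t bytes_copied(long n, int pre_subtract)
{
    assert(n > 0 && n % 64 == 0);

    long counter = pre_subtract ? n - 64 : n;
    size_t copied = 0;
    do {
        copied += 64;       /* one load/store pair moves d0-d7 = 64 bytes */
        counter -= 64;      /* SUBS r2,r2,#0x40 */
    } while (counter >= 0); /* BGE back to the top of the loop */
    return copied;
}
```

For n = 8192 the model copies 8256 bytes without the extra subtraction and exactly 8192 bytes with it, which matches the 64-byte overrun described above.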

    Using this "neon memcpy" I got a bit more speed (62 Mbytes/s), and I could use the -Ofast flag to optimize the rest of the code.
    This function is called the same way as memcpy, but the data must be 64-byte aligned and, because the loop copies 64 bytes per pass, the length must be a nonzero multiple of 64:
    void *neon_memcpy(void *ut, const void *in, size_t n);

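    .arch armv7-a
    .fpu neon
    .text
    .global neon_memcpy
    .type neon_memcpy, %function
    @ r0 = destination, r1 = source, r2 = byte count (a nonzero multiple of 64)
    neon_memcpy:
    SUBS r2,r2,#0x40        @ pre-decrement so the loop does not run one block too far
    neon_copy_loop:
    PLD [r1, #0xC0]         @ preload source data 192 bytes ahead
    VLDM r1!,{d0-d7}        @ load 64 bytes and advance the source pointer
    VSTM r0!,{d0-d7}        @ store 64 bytes and advance the destination pointer
    SUBS r2,r2,#0x40
    BGE neon_copy_loop
    bx lr                   @ r0 has been advanced, so it no longer holds the original destination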
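Since the routine expects 64-byte aligned data, the buffers passed to it need to be allocated accordingly. A minimal sketch using C11 aligned_alloc; both helper names are hypothetical, and the size is rounded up because aligned_alloc expects a size that is a multiple of the alignment:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a buffer whose address is a multiple of 64
 * and whose size is rounded up to a multiple of 64 bytes.
 * Returns NULL on failure; release with free(). */
static void *alloc_aligned64(size_t bytes)
{
    assert(bytes > 0);
    size_t rounded = (bytes + 63u) & ~(size_t)63u;
    return aligned_alloc(64, rounded);   /* C11 */
}

/* Nonzero if p is 64-byte aligned. */
static int is_aligned64(const void *p)
{
    return ((uintptr_t)p & 63u) == 0;
}
```

A caller could check is_aligned64() on both pointers and fall back to plain memcpy when either check fails.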
    Last edited by ArthurDent; December 29th, 2016 at 04:31 AM.
