The __mem_delay built-in function specifies how many delay cycles there are for specific loads. These specific loads are delinquent loads with a long memory access latency because of cache misses.
When you specify which load is delinquent the compiler takes that information and carries out optimizations such as data prefetching. In addition, when you run -qprefetch=assistthread, the compiler uses the delinquent load information to perform analysis and generate prefetching assist threads. For more information, see -qprefetch.
void* __mem_delay (const void *address, const unsigned int cycles);
The __mem_delay built-in function is placed immediately before a statement that contains a specified memory reference.
Here is how you generate code using assist threads with __mem_delay:
int y[64], x[1089], w[1024];
void foo(void){
int i, j;
for (i = 0; i &l; 64; i++) {
for (j = 0; j < 1024; j++) {
/* what to prefetch? y[i]; inserted by the user */
__mem_delay(&y[i], 10);
y[i] = y[i] + x[i + j] * w[j];
x[i + j + 1] = y[i] * 2;
}
}
}
void foo@clone(unsigned thread_id, unsigned version)
{ if (!1) goto lab_1;
/* version control to synchronize assist and main thread */
if (version == @2version0) goto lab_5;
goto lab_1;
lab_5:
@CIV1 = 0;
do { /* id=1 guarded */ /* ~2 */
if (!1) goto lab_3;
@CIV0 = 0;
do { /* id=2 guarded */ /* ~4 */
/* region = 0 */
/* __dcbt call generated to prefetch y[i] access */
__dcbt(((char *)&y + (4)*(@CIV1)))
@CIV0 = @CIV0 + 1;
} while ((unsigned) @CIV0 < 1024u); /* ~4 */
lab_3:
@CIV1 = @CIV1 + 1;
} while ((unsigned) @CIV1 < 64u); /* ~2 */
lab_1:
return;
}