In this tutorial you will learn:

 1) how to hack makefiles

 2) how to view the assembly code generated by gcc (works on any platform)

 3) how to run assembly through the SPU timing tool (spu_timing) and how to interpret the output of its static analysis (in terms of the effectiveness of software pipelining)


Install the spu_timing tool:

1. Mount 'CellSDK-Extras-Fedora_3.,'. The 'cell-spu-timing-3.1-2.i686.rpm' package is located in its x86/ folder

2. Install it with the 'rpm -i cell-spu-timing-3.1-2.i686.rpm' command

3. The 'spu_timing' tool will then be available in the '/opt/cell/sdk/usr/bin/' folder

4. Use 'spu-gcc -S <filename>.c' to generate the '.s' (assembly) file

5. Use '/opt/cell/sdk/usr/bin/spu_timing <filename>.s' to generate the '.s.timing' file

1) Start an X Window session first.

2) ssh -Y liw34@cnode1

Alternatively, ssh in first and then run

ssh -Y


cp -r ibm ~

spu-gcc -S hello.c



 spu_timing hello.s



more hello.s.timing
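
If you repeat the compile-and-time steps often, they can be wrapped in a small shell function. This is only a sketch: the function name time_spu_source and the SPU_GCC / SPU_TIMING_BIN override variables are made up for illustration (they are not part of the SDK), and it assumes spu_timing writes its report next to the input as <file>.s.timing, as in the install steps above.

```shell
# Compile one SPU C file to assembly and run the timing tool on it.
# SPU_GCC and SPU_TIMING_BIN are hypothetical overrides (not part of the SDK);
# the defaults are the commands used in this tutorial.
time_spu_source() {
    src=$1
    case $src in
        *.c) ;;
        *) echo "usage: time_spu_source <file>.c" >&2; return 2 ;;
    esac
    asm="${src%.c}.s"

    # Step 1: generate the assembly file (foo.c -> foo.s).
    "${SPU_GCC:-spu-gcc}" -S "$src" -o "$asm" || return 1

    # Step 2: static timing analysis (foo.s -> foo.s.timing).
    "${SPU_TIMING_BIN:-/opt/cell/sdk/usr/bin/spu_timing}" "$asm" || return 1

    # Print the name of the report so the caller can open it.
    printf '%s\n' "$asm.timing"
}
```

Usage would look like: more "$(time_spu_source hello.c)"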


The explanation below is copied from the programming tutorial, section "Static Analysis of SPE Threads" (pages 115-116).

The listing below shows an spu_timing static timing analysis for the inner loop of the SPE code illustrated in Section 3.6.3.3 on page 109, the Euler Particle-System Simulation example. This listing shows significant dependency stalls (indicated by the -) and poor dual-issue rates. The inner loop has an instruction mix of eight even-pipeline (pipe 0) instructions and ten odd-pipeline (pipe 1) instructions. Therefore, any program changes that minimize data dependencies will improve dual-issue rates and lower the cycles per instruction (CPI).


0D                                                78       a
1D 012                                            789      lqx
0D                                                 89      ila
1D 0123                                            89      lqx
0  0                                                9      ai
0  ----456789                                              fma
1       -----012345                                        stqx
1             123456                                       lqx
0D             23                                          ai
1D             234567                                  lqa    $44,ctx+16
1               345678                                   lqx    $43,$6,$9
1                ---7890                                   rotqby $46,$48,$49
1                    ---1234                               shufb  $45,$46,$46,$47
0                        ---567890                         fm     $42,$12,$45
0d                           -----123456                   fma    $41,$42,$44,$43
1d                                ------789012              stqx   $41,$6,$9
0D                                       89                      ai     $6,$6,16
1D                                       8901                brnz    $7,.L19

The character columns in the above static-analysis listing have the following meanings:

Column 1: The first column shows the pipeline that issued an instruction. Pipeline 0 is represented by 0 in the first column and pipeline 1 is represented by 1.

Column 2: The second column can contain a D, d, or nothing. A D signifies that a successful dual-issue was accomplished by the two instructions listed in row-pairs. A d signifies that a dual-issue was possible, but did not occur due to dependencies; for example, operands being in flight. If there is no entry in the second column, dual-issue could not be performed because the issue rules were not satisfied (for example, an even-pipeline instruction was fetched from an odd LS address or an odd-pipeline instruction was fetched from an even LS address). See Section 3.1.1.4 Pipelines and Dual-Issue Rules on page 61.

Column 3: The third column is always blank.

Columns 4 through 53: The next 50 columns represent clock cycles and are repeated as 0123456789 five times. A digit is displayed in these columns whenever the instruction executes during that clock cycle. Therefore, an <n>-cycle instruction will display <n> digits. Dependency stalls are flagged by a dash (-).

Columns 54 and beyond: The remaining entries on the row are the instructions or assembler-line addresses (for example, .L19) of the program's assembly code.
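
Because the layout is fixed-width, a row can be split purely by column position. The sketch below does that with awk. It assumes the timing file follows the column layout described above exactly; parse_timing_row and make_row are hypothetical helper names, and the demo row is built by make_row rather than copied from a real report.

```shell
# make_row is a hypothetical helper that builds one row in the fixed-width
# layout described above, so the demo does not rely on hand-counted spaces.
make_row() {  # args: pipeline, dual-issue flag, cycle field, instruction
    printf '%-1s%-1s %-50s%s\n' "$1" "$2" "$3" "$4"
}

# Split one row into its fields by column position.
parse_timing_row() {
    printf '%s\n' "$1" | awk '{
        pipe  = substr($0, 1, 1)          # column 1: issuing pipeline
        flag  = substr($0, 2, 1)          # column 2: D, d, or blank
        cyc   = substr($0, 4, 50)         # columns 4-53: cycle field
        instr = substr($0, 54)            # columns 54+: instruction text
        if (flag == " ") flag = "none"
        stall = gsub(/-/, "", cyc)        # dashes = dependency-stall cycles
        busy  = gsub(/[0-9]/, "", cyc)    # digits = executing cycles
        sub(/^ +/, "", instr)
        printf "pipe=%s dual=%s busy=%d stall=%d instr=%s\n", pipe, flag, busy, stall, instr
    }'
}

row=$(make_row 0 d '     -----123456' 'fma $41,$42,$44,$43')
parse_timing_row "$row"
# prints: pipe=0 dual=d busy=6 stall=5 instr=fma $41,$42,$44,$43
```

Here the fma row waits five cycles on its operands before executing for six, and the d says it could have dual-issued if the dependency had been resolved in time.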


Static-analysis timing files can be quickly interpreted by:

Scanning the columns of digits. Small slopes (more horizontal) are bad. Large slopes (more vertical) are good.

Looking for instructions with dependencies (those with dashes in the listing).

Looking for instructions with poor dual-issue rates: either a d or nothing in column 2.

This information can be used to understand what areas of code are scheduled well and which are poorly scheduled.
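
The second and third checks above can be roughly automated by counting flags and dashes over a whole timing file. The following is a minimal sketch, not part of the SDK: summarize_timing and make_row are hypothetical names, the rule for which lines count as instruction rows is an assumption, and the demo rows are synthetic.

```shell
# Summarize a spu_timing listing read from stdin or from the named files.
# Only rows that start with a pipeline digit and a flag column are counted;
# anything else (labels, headers) is skipped. That filter is an assumption.
summarize_timing() {
    awk '
        /^[01][Dd ] / {
            rows++
            flag = substr($0, 2, 1)
            if (flag == "D")      dual++       # dual-issued
            else if (flag == "d") missed++     # blocked by a dependency
            else                  single++     # issue rules not satisfied
            cyc = substr($0, 4, 50)
            n = gsub(/-/, "", cyc)             # stall cycles on this row
            if (n > 0) { stalled++; stall_cycles += n }
        }
        END {
            printf "rows=%d dual=%d missed=%d single=%d stalled=%d stall_cycles=%d\n", rows, dual, missed, single, stalled, stall_cycles
        }
    ' "$@"
}

# Hypothetical helper: build a row in the fixed-width layout for the demo.
make_row() { printf '%-1s%-1s %-50s%s\n' "$1" "$2" "$3" "$4"; }

{
    make_row 0 D '12'       'ai $8,$8,4'
    make_row 1 D '123456'   'lqa $44,ctx+16'
    make_row 0 d '--345678' 'fm $42,$12,$45'
    make_row 1 '' '---7890' 'rotqby $46,$48,$49'
} | summarize_timing
# prints: rows=4 dual=2 missed=1 single=1 stalled=2 stall_cycles=5
```

On a real report the call would be summarize_timing hello.s.timing. A low dual count relative to rows, or a high stall_cycles total, points at the poorly scheduled regions.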



If you are using a Bash shell, you can set SPU_TIMING as a shell variable by using the command export SPU_TIMING=1. You can also set SPU_TIMING in the makefile and build the .s file by using the following statement:

SPU_TIMING=1 make foo.s

This creates the timing file for file foo.c. It sets the SPU_TIMING variable only in the sub-shell of the makefile. It generates foo.s and then invokes spu_timing on foo.s to produce a foo.s.timing file.
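
The difference between the two ways of setting SPU_TIMING is ordinary shell behaviour and can be checked without the SDK. In the sketch below, sh -c is only a stand-in for make, and the variable names inline, after, and exported are made up for the demo.

```shell
# Start from a clean slate so the demo is repeatable.
unset SPU_TIMING

# Prefix form: the variable exists only in the environment of that one command.
# "sh -c" stands in for make here, so the demo runs without the SDK.
inline=$(SPU_TIMING=1 sh -c 'printf %s "${SPU_TIMING:-unset}"')
after=${SPU_TIMING:-unset}

# export form: every later command started from this shell inherits it.
export SPU_TIMING=1
exported=$(sh -c 'printf %s "${SPU_TIMING:-unset}"')

echo "inline=$inline after=$after exported=$exported"
# prints: inline=1 after=unset exported=1
```

So the prefix form affects a single make run, while export keeps timing enabled for every build in that shell session until you unset the variable.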

Another way to invoke the performance tool is to run it directly on the assembly file at the command prompt:

spu_timing foo.s

Posted by forteallan on Ching's blog (PIXNET)