Performance Profiling

  • Build timing calls into your code to help find bottlenecks early on.

  • Compile and link with the -pg flag to add automatic timing code for test runs.

    • Run your code normally. When it finishes, a file called gmon.out containing timing information is written to the working directory.

    • Run gprof your_program_name > profile.out to create a timing report.

  • Don't just run one test case.

    • The figure below shows the results of a matrix multiply test on Cortex that compares three different routines. For particular matrix sizes (1024x1024 and 1536x1536 in the case of DGEMM, highlighted in the figure), the time taken can be "unusual". The times shown are for multiplying two matrices of the given size ten times.

    • The "matmul" curve is for a test that uses the Fortran intrinsic MATMUL to do the matrix multiplication, the "multiply" curve is for a naively coded triple loop, and the "dgemm" curve is for code that calls a BLAS DGEMM library routine.

    [Figure: Matrix multiply timings on Cortex]

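
As a rough illustration of the two points above (build timing calls into the code, and try more than one test case), here is a minimal sketch in C. It is not the Fortran test program used for the figure. The function names multiply, time_multiply and sweep_sizes, the fill values, and the choice of the standard clock() function as the timer are all assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Naive triple-loop multiply of two n x n row-major matrices: c = a * b. */
void multiply(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Time `reps` multiplications of n x n matrices (hypothetical helper).
   Returns CPU seconds, or -1.0 if memory allocation fails.
   If `check` is not NULL, one result element is stored there so the
   caller can print it and the compiler cannot discard the work. */
double time_multiply(int n, int reps, double *check)
{
    assert(n > 0 && reps > 0);
    size_t count = (size_t)n * (size_t)n;
    double *a = malloc(count * sizeof *a);
    double *b = malloc(count * sizeof *b);
    double *c = malloc(count * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }
    for (size_t i = 0; i < count; i++) {
        a[i] = 1.0;   /* arbitrary fill values, chosen for the example */
        b[i] = 2.0;
    }

    clock_t start = clock();          /* timing call before the region */
    for (int r = 0; r < reps; r++)
        multiply(n, a, b, c);
    clock_t stop = clock();           /* timing call after the region */

    if (check)
        *check = c[0];
    free(a); free(b); free(c);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Time a range of sizes rather than a single test case. */
void sweep_sizes(int first, int last, int step, int reps)
{
    assert(step > 0);
    for (int n = first; n <= last; n += step) {
        double check = 0.0;
        double t = time_multiply(n, reps, &check);
        printf("n=%5d  time=%8.3f  c[0]=%g\n", n, t, check);
    }
}
```

For example, calling sweep_sizes(960, 1088, 64, 10) from a main routine would time ten multiplications each at sizes 960, 1024 and 1088, which brackets one of the sizes mentioned above. Note that clock() reports CPU time rather than elapsed time, so a multi-threaded library routine would need a wall-clock timer instead.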
  • Don't just use one compiler on one system.

    • The table below summarizes some results for the matrix multiply test described above. The times shown are for ten repetitions of the multiplication of 1000x1000 matrices.


    System    Compiler  Optimization  Library     MATMUL  Matrix_Multiply   DGEMM
    Cortex    xlf95_r   -O5 -q64      (built-in)    9.32             7.55   11.22
    Dendrite  xlf95_r   -O5 -q64      (built-in)   13.65             6.32    8.96
    Glacier   ifort     -O3           (mkl81)     111.84           103.37    5.98
    Glacier   pgf90     -O3           -lblas       17.83           108.17   36.05
    Lattice   f90       -fast         -lcxml      430.85           429.18   11.27
    Matrix    pgf95     -fast -tp k8  (Goto)       14.15           137.29    4.79
    Matrix    pgf95     -fast         -lacml        6.67           140.05    5.11
    Matrix    pgf95     -fast         -lblas        6.89           144.18   48.57
    Nexus     f90       -O3           -lscs        17.00            17.23   16.23
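
To make timings like these easier to compare, a time can be converted into a floating-point rate. The sketch below is in C and rests on two assumptions that the original text does not state: that the times are in seconds, and that each n x n multiplication is counted as 2*n^3 floating-point operations (the count for the classical algorithm). The function names gflops and report are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Rate in GFLOP/s for `reps` multiplications of n x n matrices that took
   `seconds` in total, assuming 2*n^3 operations per multiplication. */
double gflops(int n, int reps, double seconds)
{
    assert(n > 0 && reps > 0 && seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)n * (double)reps;
    return flops / seconds / 1.0e9;
}

/* Print the rate for one timing of ten 1000x1000 multiplications. */
void report(const char *label, double seconds)
{
    printf("%-28s %8.2f  %6.2f GFLOP/s\n",
           label, seconds, gflops(1000, 10, seconds));
}
```

Under those assumptions, ten 1000x1000 multiplications amount to 2 x 10^10 operations, so a time of 4.79 corresponds to roughly 4.2 GFLOP/s and a time of 430.85 to roughly 0.05 GFLOP/s.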


Last modified: 2007-11-07

Direct questions to support@westgrid.ca.