Performance Analysis and Optimization of a Multithreaded Application Using Intel VTune
This project analyzes and optimizes the performance of a matrix multiplication application using Intel VTune Profiler. It follows a progressive optimization approach: it starts from a naive implementation and ends with a version that combines tiling, OpenMP multithreading, and SIMD vectorization.
The work was conducted as part of the Advanced Computer Architecture course.
- Profile a matrix multiplication application using Intel VTune.
- Identify performance bottlenecks and inefficiencies.
- Apply a series of optimizations to enhance performance.
- Evaluate and compare results at each stage of optimization.
- Intel VTune Profiler: For detailed performance analysis.
- C/C++: Core programming language.
- POSIX Threads (pthreads) and OpenMP: For multithreading.
- SIMD Vectorization: To enhance data-level parallelism.
- Linux (Ubuntu): Development and testing environment.
The project follows a structured optimization pipeline:
- Naive Matrix Multiplication: A standard triple-loop implementation with no optimization. Acts as the baseline for performance comparison.
- Tiled Matrix Multiplication: Matrix multiplication with loop tiling (blocking) to improve cache locality and reduce cache misses.
- Tiled Matrix Multiplication with Pthreads: Parallelization using POSIX threads, distributing tile-based computations across threads.
- Tiled Matrix Multiplication with OpenMP: Migrated to OpenMP for simpler thread management and parallel loop control, improving scalability.
- Tiled Matrix Multiplication with Three-Level Tiling: Introduced a three-level (L1, L2, L3 cache-aware) tiling strategy to maximize cache reuse and minimize memory traffic.
- Tiled Matrix Multiplication with OpenMP + SIMD Vectorization: Combined OpenMP with compiler-level SIMD intrinsics or vectorization pragmas to exploit both thread-level and data-level parallelism for maximum performance.
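The two ends of this pipeline can be sketched as follows. This is a minimal illustration, not the code in `src/`: it assumes square, row-major `double` matrices, and the names `matmul_naive`, `matmul_tiled`, and the `tile` parameter are hypothetical. It uses a single tiling level and vectorization pragmas rather than intrinsics.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: C = A * B for row-major n x n matrices (plain triple loop). */
void matmul_naive(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* One-level tiling + OpenMP threads + SIMD pragma (illustrative sketch).
 * `tile` is the block edge length; a good value depends on the cache sizes
 * of the target machine and is an assumption to be tuned. */
void matmul_tiled(const double *A, const double *B, double *C,
                  size_t n, size_t tile)
{
    assert(tile > 0);
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0;

    /* Each thread owns whole row-blocks of C, so no two threads
     * ever write the same element and no locking is needed. */
    #pragma omp parallel for schedule(static)
    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t kk = 0; kk < n; kk += tile)
            for (size_t jj = 0; jj < n; jj += tile) {
                size_t i_end = min_sz(ii + tile, n);
                size_t k_end = min_sz(kk + tile, n);
                size_t j_end = min_sz(jj + tile, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        /* Unit-stride accesses to B and C in the inner loop
                         * let the compiler emit vector instructions. */
                        #pragma omp simd
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

With GCC or Clang the pragmas take effect when compiling with `-fopenmp`; without that flag they are ignored and the function runs serially with the same result. A three-level variant would nest two more pairs of blocked loops around the same inner kernel.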
Each version was profiled with VTune to observe improvements in:
- CPU Utilization
- Memory Access Efficiency
- Thread Load Balance
- Execution Time
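VTune reports elapsed time itself, but a small in-program timer is a convenient cross-check when comparing versions. The sketch below is an assumption about how such a harness could look, not part of the project: `time_kernel`, `matmul_gflops`, and `count_calls` are hypothetical names, and `clock_gettime` with `CLOCK_MONOTONIC` is the POSIX call available on Linux.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <time.h>

/* Signature of a kernel under test; `arg` carries whatever state it needs. */
typedef void (*kernel_fn)(void *arg);

/* Runs fn(arg) `repeats` times and returns the fastest wall-clock time in
 * seconds. Taking the minimum reduces noise from other processes.
 * Returns -1.0 if repeats <= 0. */
double time_kernel(kernel_fn fn, void *arg, int repeats)
{
    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fn(arg);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        if (best < 0.0 || s < best)
            best = s;
    }
    return best;
}

/* An n x n matrix product performs about 2 * n^3 floating-point operations
 * (one multiply and one add per inner iteration). */
double matmul_gflops(size_t n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds * 1e-9;
}

/* Trivial example kernel: counts how many times it was invoked. */
void count_calls(void *arg)
{
    int *counter = (int *)arg;
    (*counter)++;
}
```

In practice the kernel passed to `time_kernel` would wrap one matrix multiplication call, and the resulting time could be converted with `matmul_gflops` to compare versions across matrix sizes.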
├── src/ # Source code of each version
│ ├── naive/ # Naive matrix multiplication
│ ├── tiled/ # Basic tiling
│ ├── tiled_pthreads/ # Tiling with pthreads
│ ├── tiled_openmp/ # Tiling with OpenMP
│ ├── tiled_3tile/ # Three-level tiled approach
│ └── tiled_simd/ # OpenMP + SIMD optimized
├── reports/ # VTune performance reports
├── screenshots/ # VTune visualizations
├── optimization_notes/ # Notes on strategies and changes
├── README.md # This file
└── Makefile # Build automation
- Intel VTune Profiler
- GCC or Clang with OpenMP and SIMD support
- Make utility
- Linux OS (Ubuntu recommended)
cd src/tiled_simd
make
./matrix_mul