Hi everybody!
Large matrix multiplication is extremely slow in my code, and I am looking for a strategy to speed it up.
The classic triple for loop is extremely slow:
C(m, n) = A(m, k) * B(k, n)
for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
        for (int p = 0; p < k; p++) {
            C(i, j) += A(i, p) * B(p, j);
        }
    }
}
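For anyone who wants to reproduce the baseline, here is a minimal self-contained sketch of that triple loop. It assumes row-major storage in a flat `std::vector<float>`, and the name `matmulNaive` is made up for illustration; neither is taken from the matrix class used in this post.

```cpp
#include <cstddef>
#include <vector>

// Naive triple-loop product C = A * B.
// Assumption: all matrices are stored row-major in flat buffers.
// A is m x k, B is k x n, and the returned C is m x n.
std::vector<float> matmulNaive(const std::vector<float>& A,
                               const std::vector<float>& B,
                               std::size_t m, std::size_t k, std::size_t n) {
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;  // accumulate one output element locally
            for (std::size_t p = 0; p < k; ++p) {
                // B is read with stride n here, which is cache-unfriendly
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = sum;
        }
    }
    return C;
}
```

The inner loop walks down a column of B, so each step jumps n floats ahead in memory; that access pattern is a likely reason this baseline gets so slow once the matrices no longer fit in cache. This sketch is only the baseline, not the faster version discussed next.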
The version below is 2.5 times faster, but unfortunately that is still not enough for large matrices:
float* dstPtr = output.m_data;
const float* leftPtr = m_data;
for (size_t i …