This paper introduced what are now called "Level 3 BLAS". The architecture was presented to Jack Dongarra around the same time, and he used its concepts to create his own versions of bandwidth-reducing, triply nested loops. It was a watershed paper for the LINPACK benchmark because it led to the relaxation of the rule requiring matrix-vector multiplication as the kernel, a rule that made high performance hard to achieve on systems with more parallelism than the CRAY mainframes.