Example Large Dense Matrices

Different approaches are required when writing high-performance dense matrix operations for large matrices. For the most part, EJML will automatically switch to these different approaches. A key parameter that needs to be tuned for a specific system is the block size. It can also make sense to work directly with block matrices instead of assuming EJML picks the best approach for your system.
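
The core pattern is small: convert row-major matrices into the block (tiled) layout, then call the block operations directly, as the full example below walks through. Here is a minimal sketch using only the calls that appear in the example (MatrixOps_DDRB.convert, MatrixOps_MT_DDRB.mult, EjmlParameters.BLOCK_WIDTH); it assumes A and B are square DMatrixRMaj instances of the same size:

// Minimal sketch: pick a block size, convert to block layout, multiply concurrently.
// A and B are assumed to be square row-major matrices of the same size.
EjmlParameters.BLOCK_WIDTH = 100;             // system dependent; see the tuning loop in the example
DMatrixRBlock Ab = MatrixOps_DDRB.convert(A); // copies A into block layout, A is left untouched
DMatrixRBlock Bb = MatrixOps_DDRB.convert(B);
DMatrixRBlock Cb = Ab.createLike();
MatrixOps_MT_DDRB.mult(Ab, Bb, Cb);           // concurrent block multiply, result stored in Cb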


External Resources:
* [https://github.com/lessthanoptimal/ejml/blob/v0.41/examples/src/org/ejml/example/OptimizingLargeMatrixPerformance.java OptimizingLargeMatrixPerformance.java]

Example Code

// Imports assumed from the EJML v0.41 package layout; adjust if your version differs.
import org.ejml.EjmlParameters;
import org.ejml.UtilEjml;
import org.ejml.data.DMatrixRBlock;
import org.ejml.data.DMatrixRMaj;
import org.ejml.dense.block.MatrixOps_DDRB;
import org.ejml.dense.block.MatrixOps_MT_DDRB;
import org.ejml.dense.row.CommonOps_MT_DDRM;
import org.ejml.dense.row.RandomMatrices_DDRM;

import java.util.Random;

/**
 * For many operations EJML provides block matrix support. These block or tiled matrices are designed to reduce
 * the number of cache misses which can kill performance when working on large matrices. A critical tuning parameter
 * is the block size and this is system specific. The example below shows you how this parameter can be optimized.
 *
 * @author Peter Abeles
 */
public class OptimizingLargeMatrixPerformance {
    public static void main( String[] args ) {
        // Create larger matrices to experiment with
        var rand = new Random(0xBEEF);
        DMatrixRMaj A = RandomMatrices_DDRM.rectangle(3000, 3000, -1, 1, rand);
        DMatrixRMaj B = A.copy();
        DMatrixRMaj C = A.createLike();

        // Since we are dealing with larger matrices let's use the concurrent implementation. By default
        // CommonOps_DDRM runs single threaded, while the _MT_ variants use multiple threads.
        UtilEjml.printTime("Row-Major Multiplication:", () -> CommonOps_MT_DDRM.mult(A, B, C));

        // Converts A into a block matrix and creates a new matrix while leaving A unmodified
        DMatrixRBlock Ab = MatrixOps_DDRB.convert(A);
        // Converts B into a block matrix, but modifies its internal array in place. The returned block matrix
        // will share the same data array as the input. Much more memory efficient, but you need to be careful.
        DMatrixRBlock Bb = MatrixOps_DDRB.convertInplace(B, null, null);
        DMatrixRBlock Cb = Ab.createLike();

        // Again, since we are dealing with larger matrices, use the concurrent block implementation
        UtilEjml.printTime("Block Multiplication:    ", () -> MatrixOps_MT_DDRB.mult(Ab, Bb, Cb));

        // Can we make this faster? Probably by adjusting the block size. This is system dependent so let's
        // try a range of values
        int defaultBlockWidth = EjmlParameters.BLOCK_WIDTH;
        System.out.println("Default Block Size: " + defaultBlockWidth);
        for (int block : new int[]{10, 20, 30, 50, 70, 100, 140, 200, 500}) {
            EjmlParameters.BLOCK_WIDTH = block;

            // Need to create the block matrices again since we changed the block size
            DMatrixRBlock Ac = MatrixOps_DDRB.convert(A);
            DMatrixRBlock Bc = MatrixOps_DDRB.convert(B);
            DMatrixRBlock Cc = Ac.createLike();
            UtilEjml.printTime("Block " + EjmlParameters.BLOCK_WIDTH + ": ", () -> MatrixOps_MT_DDRB.mult(Ac, Bc, Cc));
        }

        // On my system the optimal block size is around 100, giving an improvement of about 5%.
        // On some architectures the improvement can be substantial; on others the default value is very reasonable.

        // Some decompositions will switch to a block format automatically. Matrix multiplication, and other
        // operations, might do so in the future. The main reason this hasn't happened is that, to be memory
        // efficient, it would need to modify and then undo the modification of the input matrices, which would
        // be very confusing if you're writing concurrent code.
    }
}
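
One thing the tuning loop does not show: EjmlParameters.BLOCK_WIDTH is a global setting, so after the loop it is left at the last value tried (500) rather than at whatever performed best. A small follow-up sketch, e.g. at the end of main(), using the defaultBlockWidth variable saved before the loop; the value 100 is only an illustration of a tuned width, not a recommendation:

        // After experimenting, either restore the library default that was saved before the loop...
        EjmlParameters.BLOCK_WIDTH = defaultBlockWidth;
        // ...or pin the block width that won the benchmark on this system (100 is just an example value)
        // EjmlParameters.BLOCK_WIDTH = 100;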