VUCAKO Bench
============

$Rev: 147 $
$Date: 2015-12-25 19:27:17 +0100 (Fri, 25 Dec 2015) $

Bench is a task-oriented benchmark suite for 32-bit and 64-bit computers.
It consists of individual tasks designed to measure the performance of
the CPU, FPU, main memory, memory cache, file-system, multi-threading,
distributed programming, etc.

Bench is an "Open Source" program, see "copyright.txt" for details.

The current stable version can be downloaded from the Subversion
(http://subversion.tigris.org/) repository at URL:

  svn://cgg.mff.cuni.cz/bench/trunk/

C++ source configuration
------------------------

All configurable items are included in the first part of the "config.h"
file. Check the following typedefs:

 - "byte" (unsigned 8-bit integer)
 - "int32" (32-bit signed integer)
 - "unsigned32" (32-bit unsigned integer)
 - "unsigned64" (64-bit unsigned integer) - only for USE_64_BITS option
 - "double" has to be a 64-bit (IEEE 754) floating-point number.

Makefile
--------

The distribution archive contains a "Makefile" in UNIX (GNU make) format.
You should set the best possible optimization switches for your
platform/compiler.

64-bit CPUs
-----------

Define the USE_64_BITS symbol if you are using a 64-bit CPU and compiler
(AMD Athlon64, Opteron, Intel Itanium, ..). This option takes advantage
of 64-bit computation in some tests (currently only tests #1, #17 and #18
are affected by this option). Note that many other tests (floating point
computing) take advantage of 64-bit processing regardless of the
USE_64_BITS setting.

CPU description
---------------

Individual results are identified by a "CPU-description" string and by
"clock" - the effective CPU clock speed in MHz (clock speed of a single
processor on SMP systems). "clock" and "CPU-description" should be set by
the "-c" command-line parameter.
Alternative: if no "-c" command-line parameter is entered, the program
tries to read the 1st line of the "cpu.txt" file: the 1st item holds an
e-mail address (and is ignored), the 2nd item contains "clock" (integer
number = MHz), and the rest of the line is interpreted as
"CPU-description".

Result file
-----------

The current result file is "results.txt"; new results are appended to
this file. If the test suite is run more than once, the result file holds
the best times only. It is recommended to run each test at least 5 times
to reduce the effect of various fluctuations (especially on Win32 systems
with a lot of background tasks). The file "enhanced.txt" contains the
history of all best results.

Results of test runs are printed in the form:

  "elapsed time: TIME, (act RATIO %), ref: REF"

TIME is the total CPU-time in seconds, RATIO is the ratio of the actual
time compared to another result record, and REF is a reference number
which eliminates nominal processor clock-speeds (in MHz). Architectures
with different clock-speeds are comparable using this REF number (smaller
values are better).

"run" script
------------

A simple C-shell script runs each test three times. It is recommended to
run it at least two times in an environment which is as close to
"single-user mode" as possible.

Results feedback
----------------

If you get results on some interesting/new architectures, please send me
the "results.txt" file (or "enhanced.txt"):

  mailto:pepca@cgg.mff.cuni.cz
  http://cgg.mff.cuni.cz/~pepca/bench/

Java version
------------

The Java version (compatible with JDK 1.1 to 1.5) is included in the
"Java" directory. Bytecodes are packed together into the "bench.jar"
archive (some tests require JDK >= 1.4 or even >= 1.5). The archive
"bench.jar" contains a binary compiled by a recent Sun JDK (1.6 in 2007);
binaries for older versions of Java are included in "bench1.5.jar" and
"bench1.4.jar".

CPU and clock parameters must be provided via the "-c" command-line
parameter (the "cpu.txt" file is ignored).

The Java version can run only one benchmark test at a time!
Sample scripts can be found in the "Java/class" directory. Example
(test #1 on a 1GHz computer "PIII/1G (i815,128M,Qua-AS30,W2K)"):

  java -jar bench.jar -c 1000 "PIII/1G (i815,128M,Qua-AS30,W2K)" 1

======================================================================

Command-line options
--------------------

bench [-h] [-l[s]] [-m ] [-b] [-s{a|d}{t|r|c}] [-r{m|v|d}] [-i ] [-t]
      [-r ] [-c ["CPU-description"]] []

  -h              print help message (this text)
  -l[s|]          don't run any tests, print result table(s) only
                  "s" generates "spread-sheet readable" output
  -y              print synthetic benchmark table using best items
                  0 .. use static table of top results
  -m              merge the given result file into "results.txt"
  -b[b]           [super-]brief mode (don't print headers and old results)
  -s{a|d}{t|r|c}  sort results (ascending/descending, time/ref/clock)
  -r{m|v|d}       restrict tests according to physical RAM (m), virtual
                  memory (v) or free disk space (d).
                  is number of bytes, examples: 256k, 64m, 2g
  -i              run only the given test ( number - see below)
  -t              self-test (pseudo-random generator etc.)
  -r              restrict result table to systems matching
                  (valid only for "-l" listing)
  -c              system-description:
                  is CPU clock in MHz (see the 2nd column of "cpu.txt")
                  "CPU-description" is arbitrary string (abbreviations
                  are preferred, see "cpu.txt" ignoring first two columns)
                  if -c is not specified, the first line of "cpu.txt"
                  will be read

  is bit-mask (hexadecimal number 0xNNNN can be used):

   1. 0x000001 .. Eratosthen (5555555-th prime number)
   2. 0x000002 .. Flawed transposition of 16MB image (uses two buffers!)
   3. 0x000004 .. Flawed transposition of 64MB image (uses two buffers!)
   4. 0x000008 .. Arbitrage sequence lookup (double[12][12] matrix)
   5. 0x000010 .. Needle-throwing simulation (Monte-Carlo, 100M iterations)
   6. 0x000020 .. Fast memory-copy (two 16 MB arrays, 100 iterations)
   7. 0x000040 .. Monte-Carlo form-factor (50M iterations)
   8. 0x000080 .. Merge-sort on disk (double[8M] array)
   9. 0x000100 .. Wavelet (T-S) transform in memory (int32[4M-2K] arrays)
  10. 0x000200 .. Wavelet (T-S) transform on disk (double[16M-128K] arrays)
  11. 0x000400 .. Adaptive K-D tree on disk (150K operations)
  12. 0x000800 .. Quick-sort in virtual memory (double[8M] array)
  13. 0x001000 .. Quick-sort in virtual memory (double[32M] array)
  14. 0x002000 .. Quick-sort in virtual memory (double[128M] array)
  15. 0x004000 .. Garbage collection test in 16MB of VM
  16. 0x008000 .. Regular expression test (16k matches on 32-bytes strings)
  17. 0x010000 .. Transposition of 16MB image (uses two buffers!)
  18. 0x020000 .. Transposition of 64MB image (uses two buffers!)
  19. 0x040000 .. Merge-sort on disk (double[64M] array)
  20. 0x080000 .. Wavelet (T-S) transform on disk (double[128M-1M] arr)
  21. 0x100000 .. Adaptive K-D tree on disk (1500K operations)
  22. 0x200000 .. Parallel SHA-1 digest (1 to 16 threads, 1500x1MB total)
  23. 0x400000 .. Parallel SHA-1 digest (2 threads, 11Mx128 total)
  24. 0x800000 .. Parallel sort in memory (2 to 16 threads, 2x double[128M] array)

======================================================================

List of tasks (version 1.015):
------------------------------

1 (mask 0x0000001):
Eratosthen
----------
Finds the 5555555-th prime number.
Performance: CPU, memory system
Memory: 12.5 MB
Disk: no
Result: the 5555555-th prime number is: 96210113
Java: -ms32m -mx32m

2 (mask 0x0000002):
Flawed transposition of 16MB image
----------------------------------
"Transposes" a 4096x4096 byte-image in main memory (96 times).
There is a bug in this test: only one of every four consecutive
scanlines is read.
Performance: CPU, memory system
Memory: 32 MB
Disk: no
Java: -ms64m -mx64m

3 (mask 0x0000004):
Flawed transposition of 64MB image
----------------------------------
"Transposes" an 8192x8192 byte-image in main memory (24 times).
There is a bug in this test: only one of every four consecutive
scanlines is read.
Performance: CPU, memory system, virtual memory system
Memory: 128 MB
Disk: virtual memory only
Java: -ms168m -mx168m

4 (mask 0x0000008):
Arbitrage sequence lookup
-------------------------
Finds the best arbitrage-sequence in a set of 12 currencies. The actual
exchange rates are defined by a 12x12 double matrix; the maximum length
of an arbitrage sequence is 12.
Performance: CPU, FPU
Memory: <3 KB
Disk: no
Result: arbitrage profit: 1.882, seq: 10 5 8 12 11 2 7 1 4 9 6 10
        (starting point of the sequence does not matter)
Java: -ms32m -mx32m

5 (mask 0x0000010):
Needle-throwing simulation
--------------------------
Runs a Monte-Carlo simulation of the "needle-throw" experiment: a needle
of length A falls onto a regular infinite pattern of parallel lines with
distance A. The goal is to determine the probability of a needle-line
intersection.
Performance: CPU, FPU
Memory: <1 KB
Disk: no
Result: ratio = 0.636609330
Java: -ms32m -mx32m

6 (mask 0x0000020):
Fast memory-copy
----------------
The fastest memory-copy operation ("memcpy()" routine in libc) is
performed on two large arrays (16MB each). In total, 3200MB will be read
and 3200MB will be written in this test.
Performance: CPU, memory system
Memory: 32 MB
Disk: no
Java: -ms64m -mx64m

7 (mask 0x0000040):
Monte-Carlo form-factor
-----------------------
Monte-Carlo computation of the form-factor between parallel equal-sized
rectangles. 50 million rays are shot from one rectangle to the other.
Performance: CPU, FPU
Memory: <1 KB
Disk: no
Result: Ffull = 0.1998385, Fhalf = 0.0686381
Java: -ms32m -mx32m

8 (mask 0x0000080):
Merge-sort on disk
------------------
Merge-sort of a large disk file (double[8M] array of random numbers).
double[1024] segments are pre-sorted in memory, the "Merge-and-split"
routine is used - 4 disk files of 32MB each are allocated.
Performance: CPU, FPU, file-system
Memory: 8 MB
Disk: 128 MB in 4 files
Result: OK
Java: -ms32m -mx32m

9 (mask 0x0000100):
Wavelet transform in memory
---------------------------
A 1D T-S lifting transform is performed on various int32[] arrays in
memory. An equal number of arithmetic operations is used in every stage
(for every array size) - partial times should represent the efficiency
of the memory cache system.
Performance: CPU, memory system, memory cache
Memory: 16 MB max. (16MB, 8MB, 4MB, ... 8KB)
Disk: no
Java: -ms32m -mx32m

10 (mask 0x0000200):
Wavelet transform on disk
-------------------------
A 1D T-S lifting transform is performed on large disk files (double[]
type). Lifting and unlifting use only sequential data access. An equal
number of arithmetic operations is used in every stage (for every array
size) - partial times should represent the efficiency of the disk cache
system.
Performance: CPU, FPU, file-system, disk cache
Memory: <1 KB
Disk: 128 MB max. (128MB, 64MB, ... 1MB) in 3 files
Java: -ms16m -mx16m

11 (mask 0x0000400):
Adaptive K-D tree on disk
-------------------------
A 2D adaptive K-D tree is used for storing point data objects. Each
object occupies 512 bytes of disk space, bucket (leaf node) size is 4KB
(i.e. 8 objects). 150000 operations (55% insertions, 45% searches) are
performed. The disk file is cached in 8MB of main memory.
Performance: CPU, file-system, disk cache
Memory: 8 MB
Disk: 57.3 MB max.
Result: 95k, 79k, 108k, leaf nodes: 14606, hits: 83
Java: -ms32m -mx32m

12 (mask 0x0000800):
Quick-sort in virtual memory
----------------------------
Quick-sort of a small double[] array is performed. 64MB of virtual
memory is used.
Performance: CPU, FPU, memory system, [virtual memory]
Memory: 64 MB
Disk: virtual memory only
Result: OK
Java: -ms84m -mx84m

13 (mask 0x0001000):
Quick-sort in virtual memory
----------------------------
Quick-sort of a medium-sized double[] array is performed. 256MB of
virtual memory is used.
Performance: CPU, FPU, memory system, [virtual memory]
Memory: 256 MB
Disk: virtual memory only
Result: OK
Java: -ms320m -mx320m

14 (mask 0x0002000):
Quick-sort in virtual memory
----------------------------
Quick-sort of a large double[] array is performed. 1GB of virtual memory
is used.
Performance: CPU, FPU, memory system, [virtual memory]
Memory: 1 GB
Disk: virtual memory only
Result: OK
Java: -ms1056m -mx1056m

15 (mask 0x0004000):
Garbage collection test in 16MB of VM
-------------------------------------
Random binary tree updates are performed. The tree is constructed and
truncated many times. Approximately 200 GC passes are forced (in Java).
Performance: CPU, memory system, Java VM garbage collection /
             C++ memory management
Memory: 16 MB (C++: <2MB)
Disk: no
Result: Garbage collection test in 16MB of VM (3766MB, max 1095KB)
Java: -ms16m -mx16m

16 (mask 0x0008000):
Regular expressions
-------------------
Random strings "(a|b){32,32}" are generated and tested against four
regular expressions:
  "(.*a){16,16}.*" (at least 16 occurrences of "a"),
  "(.*aa){8,8}.*" (at least 8 occurrences of "aa"),
  "^(.*a)a[^a]a[^a]a^(a.*)" (three isolated "a" characters in one
  sequence),
  ".*(aaaa.*bbbb|bbbb.*aaaa).*" (string contains both "aaaa" and "bbbb").
Performance: CPU, memory system
Memory: <64 KB
Disk: no
Result: Regular expressions (16k,32): 0.5720-0.1073-0.1663-0.4233
Java: not available yet

17 (mask 0x0010000):
Transposition of 16MB image
---------------------------
Transposes a 4096x4096 byte-image in main memory (96 times).
Performance: CPU, memory system
Memory: 32 MB
Disk: no
Java: -ms64m -mx64m

18 (mask 0x0020000):
Transposition of 64MB image
---------------------------
Transposes an 8192x8192 byte-image in main memory (24 times).
Performance: CPU, memory system, virtual memory system
Memory: 128 MB
Disk: virtual memory only
Java: -ms168m -mx168m

19 (mask 0x0040000):
Merge-sort on disk (big)
------------------------
Merge-sort of a large disk file (double[64M] array of random numbers).
double[8192] segments are pre-sorted in memory, the "Merge-and-split"
routine is used - 4 disk files of 256MB each are allocated.
Performance: CPU, FPU, file-system
Memory: 8 MB
Disk: 1 GB in 4 files
Result: OK
Java: -ms40m -mx40m

20 (mask 0x0080000):
Wavelet transform on disk (big)
-------------------------------
A 1D T-S lifting transform is performed on large disk files (double[]
type). Lifting and unlifting use only sequential data access. An equal
number of arithmetic operations is used in every stage (for every array
size) - partial times should represent the efficiency of the disk cache
system.
Performance: CPU, FPU, file-system, disk cache
Memory: <1 KB
Disk: 1 GB max. (1GB, 512MB, ... 8MB) in 3 files
Java: -ms16m -mx16m

21 (mask 0x0100000):
Adaptive K-D tree on disk (big)
-------------------------------
A 2D adaptive K-D tree is used for storing point data objects. Each
object occupies 512 bytes of disk space, bucket (leaf node) size is 4KB
(i.e. 8 objects). 1.5M operations (55% insertions, 45% searches) are
performed. The disk file is cached in 8MB of main memory.
Performance: CPU, file-system, disk cache
Memory: 8 MB
Disk: 566 MB max.
Result: 1425k, 1064k, 1568k, leaf nodes: 145009, hits: 7610
Java: -ms32m -mx32m

22 (mask 0x0200000):
Parallel SHA-1 digest (1-16 threads)
------------------------------------
SHA-1 digest computation using 1 to 16 threads. 4000 data arrays in
memory (1MB each) are generated (using a pseudo-random generator) and
then check-summed using SHA-1. The first (single-thread) run is used for
comparison with the multi-threaded runs (2, 4, 8, 16 threads). The whole
test shows the effectiveness of hyper-threading and multi-threading
(especially on multi-core CPUs).
Performance: CPU (int ALU), parallelism, memory system
Memory: 64MB
Disk: no
Result: OK B38098263E749B74AF6A36B6ED1D5CDD69BD3F57
Java: -ms120m -mx120m

23 (mask 0x0400000):
Ctx-switch efficiency - SHA-1 (2 threads)
-----------------------------------------
SHA-1 digest computation using 2 threads. 32M data arrays in memory
(128 bytes each) are generated (using a pseudo-random generator) and
then check-summed using SHA-1. One master and two worker threads are
used to test the context-switch efficiency of the multi-threading API.
Performance: CPU (int ALU), parallelism, context-switch efficiency
Memory: <1MB
Disk: no
Result: OK C4F3F6510A3D765A16AC37B901E91B39D8B41E85
Java: -ms64m -mx64m

24 (mask 0x0800000):
Parallel sort in virtual memory (2-16 threads)
----------------------------------------------
Sorts a double[128M] array using many quick-sort and merge-sort stages
distributed to multiple worker threads. Primary buckets sorted by
quick-sort: double[32K]. The whole sorting job is repeated 4 times
utilizing 2, 4, 8 and 16 worker threads; partial timings are logged
(unless in -bb mode). The whole test shows the effectiveness of
hyper-threading and multi-threading (especially on multi-core CPUs).
Performance: CPU, FPU, parallelism, virtual memory system
Memory: 2GB
Disk: no
Result: OK
Java: -ms2600m -mx2600m

64 (no mask, special binary):
Parallel SHA-1 digest, pure MPI (K worker processes)
----------------------------------------------------
SHA-1 digest computation using an arbitrary number of worker processes.
Pure distributed implementation using MPI. 1008000 data arrays (32KB
each) are generated (using a pseudo-random generator) and check-summed
using SHA-1. The master process generates work-units and distributes
them to slave processes via MPI. The test shows the effectiveness of
multi-threading and MPI. To compare the efficiency of a cluster
interconnect, one can modify the size of the work-unit batch sent from
master to worker in one MPI message (-b parameter).
Typical computing time for 1 work-unit is < 1ms (250us on Intel
E5320@1.87GHz).
Performance: CPU (int ALU), distributed parallelism, MPI
Memory: 40KB (per process)
Disk: no
Result: OK F59C71D483760C955687AB191942AAB16766A2A5
Java: not available

Examples of the run command on regular Windows with Compute Cluster Pack
installed (utilizing 4 CPU cores, batch sizes: 1 and 1000):

  mpiexec -n 4 mpi64.exe -b 1
  mpiexec -n 4 mpi64.exe -b 1000

Example of the run command on Microsoft Compute Cluster Server
(utilizing 20 CPU cores, batch size: 10):

  job submit /numprocessors:20 /workdir:\\server\share\ /stdout:out.txt mpiexec mpi64.exe -b 10
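
Appendix: composing a test bit-mask (illustrative sketch)
---------------------------------------------------------

The bit-mask table under "Command-line options" assigns test number N
(1 to 24) the bit 1 << (N-1), e.g. test 1 is 0x000001 and test 24 is
0x800000. The sketch below shows how such a mask could be computed. It
is not taken from the Bench sources: the function names "test_mask" and
"tests_mask" are hypothetical, and combining several tests by OR-ing
their bits is an assumption based on the term "bit-mask".

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Bit assigned to a single test number (1..24), following the table in
// "Command-line options": test 1 -> 0x000001, test 24 -> 0x800000.
// Hypothetical helper, not part of the Bench sources.
inline std::uint32_t test_mask(int test) {
    assert(test >= 1 && test <= 24);  // test 64 (MPI) has no mask
    return std::uint32_t{1} << (test - 1);
}

// Combined mask for several tests (assumes bits may be OR-ed together).
inline std::uint32_t tests_mask(std::initializer_list<int> tests) {
    std::uint32_t mask = 0;
    for (int t : tests) {
        mask |= test_mask(t);
    }
    return mask;
}
```

For instance, tests_mask({1, 17, 18}) covers the three tests that the
"64-bit CPUs" section lists as affected by USE_64_BITS.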