How to do a Disk benchmark test

Estimated time to read: 8 minutes

We will explain how you can test and compare the performance of our instance flavors. We will use an Ubuntu VM and test the Disk. You can repeat these tests on any similar platform and compare the results.

Disk

A disk in Fuga is a block device presented in your VM. It will be available in the "/dev" folder as "/dev/vda", "/dev/vdb" and so on. The disk is usually formatted with an ext4 filesystem and mounted as '/'. Additional volumes must be manually formatted and mounted. The disk performance can be expressed in different metrics. We use input output operations per second or IOPS. Every disk operation can be either a read or write action with specific data size. Especially for multi-threaded applications, it's the IOPS metric that's important. If your disk can do just one operation at a time, only one application can use the disk and other processes will have to wait. Your operating system usually has a queuing system for this. If your system can do more simultaneous transactions your applications will spend less time waiting for io transactions.

Fuga has several different block device options.

Ephemeral

The ephemeral storage is provided by local fast, low-latency NVME drives. This is the fastest storage we offer and the best is it comes for free with your instance. If you however deploy your operating system on a volume your system will not have an ephemeral disk attached. Every flavor can have a different size ephemeral storage with different speed limits which you can find in our dashboard. Ephemeral storage is not redundant.

Volume storage

The volume is a remote attached block device, the data on this storage type is stored on our storage cluster. This storage is highly redundant, your data is stored three times on different machines. Volume storage can be fast but not as fast as ephemeral storage. The speed of this storage is dependant on the chosen tier and its size. This storage scales extremely well, especially in multi-threaded applications. Fuga offers several volume types with varying performance or capabilities. In these tests, I'm using our two default volume tiers, tier-1 and tier-2. Both are highly redundant and scalable. Performance scales along with their raw size. I'm not testing encrypted volumes or volumes designed to be attached to multiple instances.

Block size

If you run a benchmark suite the block size is important. Getting one 1 megabyte file in 1 action is faster compared to getting the same file in 4KB blocks (1024KB / 4KB = 256 actions, or IOPS).

Your filesystem also uses a block size which you can find by:

tune2fs -l /dev/vda1 | grep -i 'block size'

On Fuga Cloud, this is not important as our drives are virtual and we use a different block size, but if you use a local drive you might encounter alignment issues. You normally want your device blockdevice size (512 bytes or 4096 bytes) to match your filesystem block size and make sure they're aligned. Requesting a 4KB block precisely can be one action for your device, or two if it's misaligned hampering performance. Since it's not an issue on our platform we'll move on.

Every use case is different. Where copying large files can be fast when using large block sizes, copying a kernel tree (lots of small files) can take a long time.

cp

'cp' is the standard linux copy tool. To find out the block size used you can run the following command, the last value returned on a line is the block size used in bytes. Cancel the strace with ctrl-c

strace -s 8 -xx cp /dev/urandom /dev/null

In our example 131072 bytes or 128KB gets read vrom /dev/urandom and written to /dev/null.

read(3, "\xb5\x4e\x92\x33\x90\x55\x10\x43"..., 131072) = 131072
write(4, "\xb5\x4e\x92\x33\x90\x55\x10\x43"..., 131072) = 131072

Summary

by default available on your system

uses a single thread
gives an indication of single-threaded read and write file performance
if used on one block device (cp file file2) the same block device is reading and writing at the same time.

dd

'dd' is a fairly powerful tool to copy and convert data. We'll use it to copy raw data and measure the transfer rate.

used options here:

if: which file or device to read
of: where to write the data
bs: the block size to use
count: how many of those blocks are we copying
oflag: which options to use, sync and nocache disable caching (worse performance but more fair), noatime does not update the timestamp on disk to get a more precise result

Examples

dd if=/dev/zero of=./test1.bin bs=1G count=1 oflag=sync,noatime,nocache 
dd if=./test1.bin of=/dev/null bs=1G iflag=sync,noatime,nocache 

dd if=/dev/zero of=./test1.bin bs=1M count=1024 oflag=sync,noatime,nocache 
dd if=./test1.bin of=/dev/null bs=1M iflag=sync,noatime,nocache 

dd if=/dev/zero of=./test1.bin bs=4096 count=256 oflag=sync,noatime,nocache 
dd if=./test1.bin of=/dev/null bs=4096 iflag=sync,noatime,nocache

If you run these commands you'll see there is a big difference on the used block size. Don't forget to remove the test file (test1.bin).

Summary

by default available on your system
uses a single thread
can use different block sizes
gives an easy to read summary
can copy files and also copy data directly from or to devices

Fio

Fio is widely used as a disk benchmark tool. It's highly configurable, you can get very different results just by modifying the options. It's probably not installed so you'll have to install it yourself.

Installation

sudo apt update && sudo apt install fio

Run a test

We're using a random read write test here, using a large io-depth, small block size, and 32 threads. Sequential tends to be faster than random data. Most loads are not sequential so we'll use a random test here. The large io-depth stresses the system more, we'll make sure it has enough to do. More threads make for more stress again, making the disk the limit.

These tests should put a fair amount of stress on the disk. Make sure you use fast storage like SSD or NVME drives, any FUGA storage will work. Using an HDD might take some time.

The test will first create a test set of data, run the test and quit leaving the data behind. Remove this data after you're done running all your tests by running:

rm fio_iotest

Random Read test

fio -direct=1 -iodepth=128 -rw=randread -ioengine=libaio -bs=4k -size=1G -numjobs=32 -runtime=15 -filename=fio_iotest -name=test -group_reporting -gtod_reduce=1

Random Write test

fio -direct=1 -iodepth=128 -rw=randwrite -ioengine=libaio -bs=4k -size=1G -numjobs=32 -runtime=15 -filename=fio_iotest -name=test -group_reporting -gtod_reduce=1

Understanding the results, I've used the S3.medium flavor here which should do up to 2500 read and 1250 write IOPS for the EPHEMERAL storage

Read test:

test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.16
Starting 32 processes
test: Laying out IO file (1 file / 1024MiB)

Jobs: 32 (f=32): [r(32)][100.0%][r=9984KiB/s][r=2496 IOPS][eta 00m:00s]
test: (groupid=0, jobs=32): err= 0: pid=4305: Fri Nov 26 12:16:38 2021
  read: IOPS=2517, BW=9.83MiB/s (10.3MB/s)(153MiB/15505msec)
   bw (  KiB/s): min= 2978, max=19060, per=98.88%, avg=9957.83, stdev=121.05, samples=872
   iops        : min=  740, max= 4764, avg=2486.75, stdev=30.27, samples=872
  cpu          : usr=0.05%, sys=0.25%, ctx=35114, majf=0, minf=4409
  IO depths    : 1=0.1%, 2=0.2%, 4=0.3%, 8=0.7%, 16=1.3%, 32=2.6%, >=64=94.8%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=39040,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=9.83MiB/s (10.3MB/s), 9.83MiB/s-9.83MiB/s (10.3MB/s-10.3MB/s), io=153MiB (160MB), run=15505-15505msec

Disk stats (read/write):
  vda: ios=38935/3, merge=37/1, ticks=3900304/19, in_queue=3822044, util=99.38%

There is lots of interesting data here, for a quick comparison we'll look at the "read:" line

read: IOPS=2517, BW=9.83MiB/s (10.3MB/s)(153MiB/15505msec)

So we're indeed reading ~2500 4K blocks per second which should be around 10MB per second. This seems slow, 10MB per second, but we're using 4k blocks and are limited by the available IOPS.

Now re-run the test using 1MB blocks

fio -direct=1 -iodepth=128 -rw=randwrite -ioengine=libaio -bs=512k -size=1G -numjobs=32 -runtime=15 -filename=iotest -name=test -group_reporting -gtod_reduce=1

the read: line shows:

read: IOPS=2834, BW=1417MiB/s (1486MB/s)(21.5GiB/15537msec)

which actually does over 2800 IOPS and transfers almost 1500 MB per second.

Which parameters did we use:

direct=1: use O_DIRECT, disables caching
iodepth=128: use a large io queue
rw=randwrite: type of test
ioengine=libaio: IO engine used
bs=4k: use a small block size to see the io limit, not the max throughput. use a 512k size for throughput
size=1G: size of the test file
numjobs=32: this is the number of threads simultaneously testing the file
runtime=15: duration in seconds
filename=iotest: name of the test file, please delete when done
name=test: test name
group_reporting: combine the stats of all threads
gtod_reduce=1: reduce the time of day calls

Some VOLUME results using 4KB block size

The tier-1 volume tier is limited to ~5 read and write IOPS per gigabyte. The tier-2 volume tier is limited to ~25 read and write IOPS per gigabyte. So a tier-1 volume of 100GB should be able to do 500 read and write iops simultaneously. For simplicity we'll do one test at a time.

tier-1 volume, 100GB

read: IOPS=503, BW=2014KiB/s (2063kB/s)(34.2MiB/17412msec)
write: IOPS=502, BW=2011KiB/s (2059kB/s)(34.8MiB/17697msec); 0 zone resets

tier-2 volume, 100GB

read: IOPS=2518, BW=9.84MiB/s (10.3MB/s)(153MiB/15503msec)
write: IOPS=2517, BW=9.83MiB/s (10.3MB/s)(153MiB/15557msec); 0 zone resets

tier-2 volume, 1TB

read: IOPS=25.2k, BW=98.4MiB/s (103MB/s)(1481MiB/15053msec)
write: IOPS=25.2k, BW=98.4MiB/s (103MB/s)(1482MiB/15063msec); 0 zone resets