Determine the specific benefit of Writeback Throttling (CONFIG_WBT)
[PATCH 0/8] Throttled background buffered writeback v7
Since the dawn of time, our background buffered writeback has sucked.
When we do background buffered writeback, it should have little impact
on foreground activity. That's the definition of background
activity... But for as long as I can remember, heavy buffered writers
have not behaved like that. For instance, if I do something like this:
$ dd if=/dev/zero of=foo bs=1M count=10k
on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done.
And now we have the patches for this applied to Linux, and available in Fedora Workstation and elsewhere. Yay.
But this "writeback throttling" (WBT) has no effect by default, at least on SATA HDDs and SSDs. The default IO scheduler CFQ is not compatible with WBT. And nor is BFQ (the successor to CFQ, for the multi-queue block layer). You need to switch to an I/O scheduler which does not try to throttle background writeback itself. So you have to make some trade-off :-(.
CFQ was probably advertised in the past with descriptions that sounded similarly attractive. BFQ certainly is... but from the documentation, it seems to use a similar class of heuristic to CFQ. I don't see it measuring I/O latency and throttling writeback when the latency gets too high.
I have a cheap two-year-old laptop with a spinning hard drive. It has full support for SATA NCQ, i.e. a 32-deep I/O queue on the device. I've certainly noticed some tendency for the system to become unusable when copying large files (20G VM images; I have 8GB of RAM). Based on the cover letter for v3 of the patches, I expect a system like mine can indeed suffer from the problem, i.e. that background writeback can submit far too much I/O at a time.
- What instructions can I follow, to reproduce the problem, and get a very rough number showing whether it is a problem on a specific system?
- It sounds like this problem is quite severe, and has been understood now for at least a couple of years. You would hope that some monitoring systems know how to show the underlying problem. If you have a performance problem, and you think this is one of the possible causes, what measurement(s) can you look at in order to diagnose it?
linux linux-kernel io cache
Generally the way a lack of write throttling shows up is: 1. something does a huge amount of buffered write I/O to something slow (e.g. a USB2 attached disk) but at this stage it is buffered and not being flushed. 2. Something else is doing write I/O to a disk that is medium speed or fast. 3. A sync is done for some reason (either time or by request) that forces the system to have to flush I/O to the slow disk. Something tells me you've seen this link before but see utcc.utoronto.ca/~cks/space/blog/linux/… (and the links in the comments) for details.
– Anon
Nov 24 at 6:48
@Anon read the cover letter for WBT, the link I gave. It is designed to help on a kernel developer's laptop - which has a SATA SSD, if I read the speeds correctly - and it was also originally advertised to help on internal SATA HDDs. CKS's scenario is different; that's about interference between different devices. There are different cover letters for each WBT patch version, but they're all about a problem on a single device.
– sourcejedi
Nov 24 at 9:53
@Anon Also, I believe CKS, and there are very similar-looking reports of that USB problem on this site, but the LWN article on the "pernicious USB-stick stall problem" is broken. It completely misrepresents the original report and the series of responses. That LWN article needs to be discounted, at least as a citation for an analysis of that problem. unix.stackexchange.com/questions/480399/…
– sourcejedi
Nov 24 at 10:04
@Anon Analysis can get very confusing because there are four writeback buffers: the Linux dirty page cache, the Linux I/O scheduler queue, the device queue (NCQ), and the device writeback cache. I want to understand what WBT can solve and what it does not solve.
– sourcejedi
Nov 24 at 10:50
asked Nov 21 at 17:41 by sourcejedi (last edited Nov 24 at 10:49)
1 Answer
What instructions can I follow, to reproduce the problem, and get a very rough number showing whether it is a problem on a specific system?
So far, I haven't reproduced the most severe results. The behaviour varies between disks.
My laptop hard disk is a WDC WD5000LPLX-7, a WD Black Mobile model, 7200 RPM. My kernel version is 4.18.16-200.fc28.x86_64. It uses the CFQ I/O scheduler.
I tried reproducing the report of google-chrome basically stalling while a dd command is running, even though that result was reported for an SSD, and it is not very scientific to rely on an unspecified google-chrome version. google-chrome took about 20 seconds to start normally (after running echo 3 | sudo tee drop_caches). While dd if=/dev/zero of=foo bs=1M count=10k was running, google-chrome took maybe up to twice as long to start, but that was less than half-way through the dd command.
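Roughly, that comparison can be scripted like this (a sketch only; the full drop_caches path, the about:blank argument, and timing "until you quit the freshly drawn window" are my additions, and cold-start timing of a GUI application is inherently noisy):
# Time a cold start of the browser on an otherwise idle system
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
time google-chrome about:blank        # quit the browser once the window has drawn
# Repeat while a heavy buffered writer is running
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/dev/zero of=foo bs=1M count=10k &
sleep 5                               # give the writer time to build up dirty pages
time google-chrome about:blank        # again, quit once the window has drawn
wait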
I tried to reproduce the read starvation shown in the output of vmstat 1. I did this by writing a script (see below) to match the scenario described in the cover letter for WBT v3. Watching vmstat 1, the reader was not starved as severely: it was unfairly reduced to 10-20 MB/s, versus the writer, which achieved 50-60 MB/s. So it is fair to say this system hits some limitation of the so-called "Completely Fair Queuing" I/O scheduler :-). But it is not as bad as the reported result, where the reader was reduced to under 4 MB/s while the writer achieved over 70 MB/s.
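For reference, this is roughly how I read throughput off vmstat (a sketch; the awk field numbers assume the default vmstat layout, where bi and bo are the 9th and 10th fields and count 1 KiB blocks per second, and the first data line is an average since boot):
# Print block-device read/write throughput once per second, in MiB/s
vmstat 1 | awk '$9 ~ /^[0-9]+$/ { printf "read %6.1f MiB/s   write %6.1f MiB/s\n", $9/1024, $10/1024; fflush() }'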
With the writer running, the read latency distribution reported by fio looked like this:
lat (msec) : 2=0.11%, 4=0.67%, 10=0.01%, 20=0.01%, 50=0.03%
lat (msec) : 100=0.02%, 250=0.04%, 500=0.03%, 750=0.01%, 1000=0.01%
whereas when I ran the read test on an idle system, it looked like this:
lat (msec) : 2=0.57%, 4=0.98%, 10=0.01%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%
So I'm definitely seeing longer tail latencies. But it's not such an obvious catastrophe as the reported result, where 5% of the reads took more than 1 msec (1000 usec), and 1% of the reads took over 250 msec.
My understanding is that CFQ struggles to control the drive, because even when one of the NCQ write requests is completed, the data may still be sitting in the device writeback cache. (Don't ask me to explain why devices need both NCQ and a writeback cache; it seems a very strange interface to me.) So I tried testing after disabling the device writeback cache, with hdparm -W0. This achieved a more plausible "balance" between the reader and writer throughput.
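Concretely, the write-cache switch is (a sketch; /dev/sda is a placeholder for the disk under test, and the -W0 setting may not survive a power cycle on every drive):
sudo hdparm -W /dev/sda    # query: is the drive write cache on?
sudo hdparm -W0 /dev/sda   # turn the drive write cache off for the test
sudo hdparm -W1 /dev/sda   # turn it back on afterwards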
(Using BFQ, the reader achieves an average of 40 MB/s. Disabling the device write cache as well appears to help on some of the responsiveness tests used for BFQ, though. My initial impressions of BFQ have been very good.)
Leaving the device writeback cache enabled, scheduler = deadline with wbt_lat_usec = 0 gave worse read throughput than CFQ. wbt_lat_usec = 75000 (the default for rotational storage) improved the read throughput with deadline, but it still favoured the writer; it looked somewhat worse than CFQ.
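The knobs for that comparison are the per-device sysfs files (a sketch; sda is a placeholder, and on a multi-queue kernel the scheduler is called mq-deadline rather than deadline):
echo deadline | sudo tee /sys/block/sda/queue/scheduler     # low-overhead scheduler with no writeback heuristics of its own
echo 0       | sudo tee /sys/block/sda/queue/wbt_lat_usec   # wbt_lat_usec = 0: WBT disabled
echo 75000   | sudo tee /sys/block/sda/queue/wbt_lat_usec   # 75 ms target, the default for rotational storage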
It seems like the argument for WBT on rotating hard disks was written assuming it could be used with CFQ. The conflict between WBT and CFQ was not recognised until 6 months later. Secondly, it seems the main reason WBT was implemented was for people with fast flash storage who are using a low-overhead scheduler like deadline, or the new kyber scheduler.
(If I then disable the device write cache, it allows the reader to achieve 60 MB/s. Possibly slightly unfair in the other direction. Not bad, but... the original arguments were not phrased at all to suggest this step. The severe results were shown for CFQ with default settings; there was no comparison with CFQ + hdparm -W0.)
Differences between devices
I ran the same test on a second system, with a Seagate ST500LM012 HN-M5 SATA notebook drive (5400 RPM) and the Debian kernel 4.9.0-8. It also uses the CFQ scheduler. The drive appears to support NCQ; sysfs suggests it has a queue_depth of 31, instead of the 32 on the first drive. hdparm -W showed the device writeback cache was enabled.
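Those device details came from checks along these lines (a sketch; replace sda with the drive being examined):
cat /sys/block/sda/device/queue_depth             # NCQ depth the kernel is using (31 vs 32 above)
cat /sys/block/sda/queue/rotational               # 1 = spinning disk, 0 = SSD
sudo hdparm -W /dev/sda                           # is the drive write cache enabled?
sudo hdparm -I /dev/sda | grep -i 'queue depth'   # depth the drive itself advertises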
On this second system, vmstat 1 seems to show the writer was starved, with literally 0 writes for most (although not all) of the time the reader was running. The reader achieved 80 MB/s. This behaviour seems very strange and not very desirable. I don't know exactly what combination of Linux and hardware behaviour might cause this.
The second system also has a SATA SSD, a Crucial M4-CT128M4SSD2. Using this SSD, vmstat 1 shows the reader is reduced to 50-60 MB/s (compared to 500 MB/s on an idle system), while the writer achieves 130 MB/s (compared to 200 MB/s on an idle system). So this case unfairly favoured the writer, which was the expected problem.
Test script
#!/bin/sh
# Simultaneous buffered read+write test
# Tries to reproduce the HDD results from
# https://lwn.net/Articles/681763/
# Test size etc is suitable for normal HDD speeds.
# Not tested on anything faster than SATA SSD.
# Test in a sub-directory, of whatever the current directory was
# when the user ran this script.
TEST_DIR=./fio-test
mkdir -p "$TEST_DIR" || exit
cd "$TEST_DIR" || exit
echo "= Draining buffered writes ="
sync
echo "= Creating file for read test="
READ="--name=read --rw=read --size=1G --ioengine=psync"
fio $READ --create_only=1 --overwrite=0
echo "= Simultaneous buffered read+write test ="
[ "$(echo 50_100*)" != "50_100*" ] && rm 50_100*
fio --name=50_100 --rw=write --size=5000M --filesize=100M --nrfiles=50 \
    --create_on_open=1 --ioengine=psync \
    > log.50_100 2>&1 &
PID_50_100=$!
fio $READ &
PID_READ=$!
wait $PID_50_100
sync
[ -e /proc/$PID_READ ] && kill $PID_READ && wait $PID_READ
echo "= write test log ="
cat log.50_100
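To run it, something like the following works (my own example invocation; the paths are hypothetical, and the test needs roughly 6 GB of free space: 5000M of writes plus a 1G read file):
cd /mnt/testdisk            # any directory on the filesystem/device under test (example path)
sh /path/to/wbt-test.sh     # the script above, saved under a name of your choice
# meanwhile, watch vmstat 1 (or the awk one-liner above) in a second terminal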
answered Nov 22 at 22:12 by sourcejedi (accepted; last edited Nov 24 at 21:02)