Determine the specific benefit of Writeback Throttling (CONFIG_WBT)
























[PATCH 0/8] Throttled background buffered writeback v7



Since the dawn of time, our background buffered writeback has sucked.
When we do background buffered writeback, it should have little impact
on foreground activity. That's the definition of background
activity... But for as long as I can remember, heavy buffered writers
have not behaved like that. For instance, if I do something like this:



$ dd if=/dev/zero of=foo bs=1M count=10k


on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done.




These patches have now been applied to Linux, and are available in Fedora Workstation and elsewhere. Yay.



But this "writeback throttling" (WBT) has no effect by default, at least on SATA HDDs and SSDs. The default IO scheduler CFQ is not compatible with WBT. And nor is BFQ (the successor to CFQ, for the multi-queue block layer). You need to switch to an I/O scheduler which does not try to throttle background writeback itself. So you have to make some trade-off :-(.



CFQ was probably advertised in the past with descriptions that would have sounded similarly attractive. BFQ certainly is as well... but from the documentation, it seems to use a similar class of heuristics to CFQ. I don't see it measuring the I/O latency and throttling writeback when the latency is too high.



I have a cheap two-year-old laptop with a spinning hard drive. It has full support for SATA NCQ, i.e. a 32-deep I/O queue on the device. I've certainly noticed some tendency for the system to become unusable when copying large files (20GB VM files; I have 8GB of RAM). Based on the cover letter for v3 of the patches, I expect a system like mine can indeed suffer from the problem that background writeback can submit far too much IO at a time.




  1. What instructions can I follow, to reproduce the problem, and get a very rough number showing whether it is a problem on a specific system?

  2. It sounds like this problem is quite severe, and has been understood now for at least a couple of years. You would hope that some monitoring systems know how to show the underlying problem. If you have a performance problem, and you think this is one of the possible causes, what measurement(s) can you look at in order to diagnose it?










linux linux-kernel io cache

asked Nov 21 at 17:41, edited Nov 24 at 10:49
– sourcejedi
























  • Generally the way a lack of write throttling shows up is: 1. something does a huge amount of buffered write I/O to something slow (e.g. a USB2 attached disk) but at this stage it is buffered and not being flushed. 2. Something else is doing write I/O to a disk that is medium speed or fast. 3. A sync is done for some reason (either time or by request) that forces the system to have to flush I/O to the slow disk. Something tells me you've seen this link before but see utcc.utoronto.ca/~cks/space/blog/linux/… (and the links in the comments) for details.
    – Anon
    Nov 24 at 6:48












  • @Anon read the cover letter for WBT, the link I gave. It is designed to help on a kernel developer's laptop - which has a SATA SSD, if I read the speeds correctly - and it was also originally advertised to help on internal SATA HDDs. CKS's scenario is different; that's about interference between different devices. There are different cover letters for each WBT patch version, but they're all about a problem on a single device.
    – sourcejedi
    Nov 24 at 9:53










  • @Anon Also, I believe CKS, and there are very similar-looking reports of that USB problem on this site, but the LWN article on the "pernicious USB-stick stall problem" is broken. It completely misrepresents the original report and the series of responses. That LWN article needs to be discounted, at least as a citation for an analysis of that problem. unix.stackexchange.com/questions/480399/…
    – sourcejedi
    Nov 24 at 10:04










  • @Anon analysis can get very confusing because there are four writeback buffers: the Linux dirty page cache, the Linux I/O scheduler queue, the device queue (NCQ), and the device writeback cache. I want to understand what WBT can solve and what it does not solve.
    – sourcejedi
    Nov 24 at 10:50

















1 Answer (accepted)

What instructions can I follow, to reproduce the problem, and get a very rough number showing whether it is a problem on a specific system?




So far, I haven't reproduced the most severe results. The behaviour varies between disks.



My laptop hard disk is a WDC WD5000LPLX-7, a WD Black Mobile model, 7200 RPM. My kernel version is 4.18.16-200.fc28.x86_64. It uses the CFQ I/O scheduler.



I tried reproducing the report of google-chrome being basically stalled while a dd command is running, even though that was reported for an SSD, and it is not very scientific to rely on an unspecified google-chrome version. google-chrome took about 20 seconds to start normally (after running echo 3 | sudo tee drop_caches). When running dd if=/dev/zero of=foo bs=1M count=10k, google-chrome took maybe up to twice as long to start, but that was less than half-way through the dd command.



I tried to reproduce the read starvation shown in the output of vmstat 1, by writing a script (see below) to match the scenario described in the cover letter for WBT v3. Watching vmstat 1, the reader was not starved as severely: it was unfairly reduced to 10-20MB/s, versus the writer, which achieved 50-60MB/s. So it is fair to say this system hits some limitation of using the so-called "Completely Fair Queuing" I/O scheduler :-). But it is not as bad as the reported result, where the reader was reduced to under 4MB/s while the writer achieved over 70MB/s.
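
If you want to watch the same thing on your own system while the test below runs, the stock tools are enough (a sketch; iostat comes from the sysstat package):

# System-wide block I/O per second: watch the "bi" (blocks read) and
# "bo" (blocks written) columns while the reader and writer compete.
vmstat 1

# Per-device request rates, plus how long requests spend queued at the device
# (the await and queue-size columns).
iostat -x 1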



The read latencies look like



lat (msec)   : 2=0.11%, 4=0.67%, 10=0.01%, 20=0.01%, 50=0.03%
lat (msec) : 100=0.02%, 250=0.04%, 500=0.03%, 750=0.01%, 1000=0.01%


whereas if I run the read test on an idle system, it looked like



lat (msec)   : 2=0.57%, 4=0.98%, 10=0.01%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%


So I'm definitely seeing longer tail latencies. But it's not such an obvious catastrophe as the reported result, where 5% of the reads took more than 1 msec (1000 usec), and 1% of the reads took over 250 msec.



My understanding is that CFQ struggles to control the drive, because even when one of the NCQ write requests is completed, it may still be sitting in the device writeback cache. (Don't ask me to explain why devices need both NCQ and a writeback cache; it seems a very strange interface to me.) So I tried testing after disabling the device writeback cache, with hdparm -W0. This achieved a more plausible "balance" between the reader and writer throughput.
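
For the record, the write cache toggle is a one-liner (assuming the drive is /dev/sda; the setting may not survive a power cycle, so re-check it if in doubt):

# Show whether the device writeback cache is currently enabled.
sudo hdparm -W /dev/sda

# Disable it for the test, and re-enable it afterwards.
sudo hdparm -W0 /dev/sda
sudo hdparm -W1 /dev/sda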



(Using BFQ, the reader achieves an average of 40MB/s. Disabling the device write cache as well appears to help on some of the responsiveness tests used for BFQ, though. My initial impressions of BFQ have been very good.)



Leaving the device writeback cache enabled, scheduler = deadline with wbt_lat_usec = 0 gave worse read throughput than CFQ. wbt_lat_usec = 75000 (default for rotational storage) improved the read throughput with deadline, but it still favoured the writer - it looked somewhat worse than CFQ.
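
The settings described above map onto sysfs like this (a sketch, assuming /dev/sda; 75000 is the default the kernel picks for rotational devices once WBT is in effect):

# Use the deadline scheduler, so that the scheduler itself does not try to
# throttle background writeback.
echo deadline | sudo tee /sys/block/sda/queue/scheduler

# WBT disabled:
echo 0 | sudo tee /sys/block/sda/queue/wbt_lat_usec

# WBT with the default target latency for rotational storage (75 ms):
echo 75000 | sudo tee /sys/block/sda/queue/wbt_lat_usec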



It seems like the argument for WBT on rotating hard disks was written assuming it could be used with CFQ. The conflict between WBT and CFQ was not recognised until 6 months later. Secondly, it seems the main reason WBT was implemented was for people with fast flash storage who are using a low-overhead scheduler like deadline, or the new kyber scheduler.



(If I then disable device write cache, it allows the reader to achieve 60MB/s. Possibly slightly unfair in the other direction. Not bad, but... the original arguments were not phrased at all to suggest this step. The severe results were shown for CFQ with default settings; there was no comparison with CFQ + hdparm -W0).



Differences between devices



I ran the same test on a second system, with a Seagate ST500LM012 HN-M5 SATA notebook drive (5400 RPM) and the Debian kernel 4.9.0-8. It also uses the CFQ scheduler. The drive appears to support NCQ; sysfs suggests it has a queue_depth of 31, instead of the 32 on the first drive. hdparm -W showed the device writeback cache was enabled.
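
The queue depth can be read straight from sysfs (assuming the drive is sda), and hdparm -W shows the write cache state, as above:

cat /sys/block/sda/device/queue_depth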



On this second system, vmstat 1 seems to show the writer was starved, with literally 0 writes for most (although not all) of the time the reader was running. The reader achieved 80 MB/s. This behaviour seems very strange and not very desirable. I don't know exactly what combination of Linux and hardware behaviour might cause this.



The second system also has a SATA SSD, a Crucial M4-CT128M4SSD2. Using this SSD, vmstat 1 shows the reader is reduced to 50-60 MB/s (compared to 500 MB/s on an idle system), while the writer achieves 130 MB/s (compared to 200 MB/s on an idle system). So this case unfairly favoured the writer, which was the expected problem.



Test script



#!/bin/sh
# Simultaneous buffered read+write test
# Tries to reproduce the HDD results from
# https://lwn.net/Articles/681763/
# Test size etc is suitable for normal HDD speeds.
# Not tested on anything faster than SATA SSD.

# Test in a sub-directory, of whatever the current directory was
# when the user ran this script.
TEST_DIR=./fio-test

mkdir -p "$TEST_DIR" || exit
cd "$TEST_DIR" || exit

echo "= Draining buffered writes ="
sync

echo "= Creating file for read test="
READ="--name=read --rw=read --size=1G --ioengine=psync"
fio $READ --create_only=1 --overwrite=0

echo "= Simultaneous buffered read+write test ="

[ "$(echo 50_100*)" != "50_100*" ] && rm 50_100*

fio --name=50_100 --rw=write --size=5000M --filesize=100M --nrfiles=50 \
    --create_on_open=1 --ioengine=psync \
    > log.50_100 2>&1 &
PID_50_100=$!

fio $READ &
PID_READ=$!

wait $PID_50_100
sync

[ -e /proc/$PID_READ ] && kill $PID_READ && wait $PID_READ

echo "= write test log ="
cat log.50_100
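
To run it (assuming fio is installed, and that the current directory is on the filesystem you want to test), save the script under a name of your choice and watch the throughput from a second terminal:

sh ./wbt-test.sh   # hypothetical file name for the script above
vmstat 1           # in another terminal, while the test runs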





answered Nov 22 at 22:12, edited Nov 24 at 21:02
– sourcejedi
             
