From 1b814d7e38affa5cd073842ad1eb044e1ccb2089 Mon Sep 17 00:00:00 2001
From: Fam Zheng <famz@redhat.com>
Date: Wed, 22 Mar 2017 23:51:07 +0100
Subject: [PATCH 10/12] file-posix: Consider max_segments for
 BlockLimits.max_transfer

RH-Author: Fam Zheng <famz@redhat.com>
Message-id: <20170322235109.24122-2-famz@redhat.com>
Patchwork-id: 74433
O-Subject: [RHEV-7.3.z qemu-kvm-rhev PATCH v2 1/3] file-posix: Consider max_segments for BlockLimits.max_transfer
Bugzilla: 1431149
RH-Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
RH-Acked-by: Paolo Bonzini <pbonzini@redhat.com>
RH-Acked-by: Markus Armbruster <armbru@redhat.com>

BlockLimits.max_transfer can be too high without this fix, guest will
encounter I/O error or even get paused with werror=stop or rerror=stop. The
cause is explained below.

Linux has a separate limit, /sys/block/.../queue/max_segments, which in
the worst case can be more restrictive than the BLKSECTGET which we
already consider (note that they are two different things). So, the
failure scenario before this patch is:

1) host device has max_sectors_kb = 4096 and max_segments = 64;
2) guest learns max_sectors_kb limit from QEMU, but doesn't know
   max_segments;
3) guest issues e.g. a 512KB request thinking it's okay, but actually
   it's not, because it will be passed through to host device as an
   SG_IO req that has niov > 64;
4) host kernel doesn't like the segmenting of the request, and returns
   -EINVAL;

This patch checks the max_segments sysfs entry for the host device and
calculates a "conservative" bytes limit using the page size, which is
then merged into the existing max_transfer limit. Guest will discover
this from the usual virtual block device interfaces. (In the case of
scsi-generic, it will be done in the INQUIRY reply interception in
device model.)

The other possibility is to actually propagate it as a separate limit,
but it's not better. On the one hand, there is a big complication: the
limit is per-LUN in QEMU PoV (because we can attach LUNs from different
host HBAs to the same virtio-scsi bus), but the channel to communicate
it in a per-LUN manner is missing down the stack; on the other hand,
two limits versus one doesn't change much about the valid size of I/O
(because guest has no control over host segmenting).

Also, the idea to fall back to bounce buffering in QEMU, upon -EINVAL,
was explored. Unfortunately there is no neat way to ensure the bounce
buffer is less segmented (in terms of DMA addr) than the guest buffer.

Practically, this bug is not very common. It is only reported on a
Emulex (lpfc), so it's okay to get it fixed in the easier way.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
(cherry picked from commit 9103f1ceb46614b150bcbc3c9a4fbc72b47fedcc)

RHEL 7.3 uses sector based block limits, so divide the result by
BDRV_SECTOR_SIZE.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Miroslav Rezanina <mrezanin@redhat.com>
---
 block/raw-posix.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 3f55c7d..9d9a6a0 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -742,6 +742,48 @@ static int hdev_get_max_transfer_length(int fd)
 #endif
 }
 
+static int hdev_get_max_segments(const struct stat *st)
+{
+#ifdef CONFIG_LINUX
+    char buf[32];
+    const char *end;
+    char *sysfspath;
+    int ret;
+    int fd = -1;
+    long max_segments;
+
+    sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
+                                major(st->st_rdev), minor(st->st_rdev));
+    fd = open(sysfspath, O_RDONLY);
+    if (fd == -1) {
+        ret = -errno;
+        goto out;
+    }
+    do {
+        ret = read(fd, buf, sizeof(buf));
+    } while (ret == -1 && errno == EINTR);
+    if (ret < 0) {
+        ret = -errno;
+        goto out;
+    } else if (ret == 0) {
+        ret = -EIO;
+        goto out;
+    }
+    buf[ret] = 0;
+    /* The file is ended with '\n', pass 'end' to accept that. */
+    ret = qemu_strtol(buf, &end, 10, &max_segments);
+    if (ret == 0 && end && *end == '\n') {
+        ret = max_segments;
+    }
+
+out:
+    g_free(sysfspath);
+    return ret;
+#else
+    return -ENOTSUP;
+#endif
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
     BDRVRawState *s = bs->opaque;
@@ -753,6 +795,12 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
             if (ret >= 0) {
                 bs->bl.max_transfer_length = ret;
             }
+            ret = hdev_get_max_segments(&st);
+            if (ret > 0) {
+                bs->bl.max_transfer_length =
+                    MIN(bs->bl.max_transfer_length,
+                        ret * getpagesize() / BDRV_SECTOR_SIZE);
+            }
         }
     }
 
-- 
1.8.3.1

