asahi: Improve VM bind performance for large BOs #495

Open
TrungNguyen1909 wants to merge 1 commit into AsahiLinux:asahi from TrungNguyen1909:asahi-large-bo-binding-perf

@TrungNguyen1909

Large BOs can take quite a long time to map/bind at random offsets into the AGX address space, since the current implementation iterates through every page entry that comes before the requested BO offset.

This pull request improves on that by caching all the SGT entries (a linked list) in a KVVec and running binary_search() on it. The additional memory usage is 3*8*NENTS bytes per VmBo allocation (i.e. per Asahi VM x mapped BO). SGTable's Rust implementation currently does not export sg_nents, so the allocation here is roughly O(N log N) (this can be reduced in the next version).

The cost of the first bind remains about the same, but if we repeatedly bind/unbind from a large BO, the cost of the subsequent binds is greatly reduced, amortizing the initial cost.
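
As a rough illustration of the lookup described above, here is a minimal C++ sketch. The driver itself is Rust, and `SgEntry`, `build_cache`, and `lookup` are hypothetical names rather than the patch's actual code. It assumes each cached entry holds three 64-bit values (which matches the 3*8*NENTS estimate) and that entries are stored in BO-offset order, so a binary search applies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical flattened copy of one scatter-gather entry:
// three 64-bit fields, i.e. 3 * 8 = 24 bytes per entry.
struct SgEntry {
    uint64_t bo_offset; // byte offset of this entry within the BO
    uint64_t dma_addr;  // device address where this entry starts
    uint64_t len;       // length of this entry in bytes
};
static_assert(sizeof(SgEntry) == 3 * 8);

// One linear walk over (dma_addr, len) segments builds the cache.
// This is the O(N) cost that the first bind still pays.
std::vector<SgEntry> build_cache(
    const std::vector<std::pair<uint64_t, uint64_t>>& segs) {
    std::vector<SgEntry> cache;
    cache.reserve(segs.size());
    uint64_t off = 0;
    for (const auto& [addr, len] : segs) {
        cache.push_back({off, addr, len});
        off += len;
    }
    return cache;
}

// O(log N) lookup: translate a BO offset into a device address,
// or return nullopt if the offset lies outside the BO.
std::optional<uint64_t> lookup(const std::vector<SgEntry>& cache,
                               uint64_t offset) {
    // First entry that starts after `offset`; the entry just before it
    // is the only one that can contain `offset`.
    auto it = std::upper_bound(
        cache.begin(), cache.end(), offset,
        [](uint64_t o, const SgEntry& e) { return o < e.bo_offset; });
    if (it == cache.begin()) return std::nullopt;
    --it;
    const uint64_t delta = offset - it->bo_offset;
    if (delta >= it->len) return std::nullopt;
    return it->dma_addr + delta;
}
```

For scale: a 4 GiB BO with 16 KiB pages has 262144 pages, so the worst case of one entry per page would cost 262144 * 24 bytes = 6 MiB for the cache.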

A few questions that need to be addressed:

  • Is the additional memory justified? Should we do this only on large BOs (larger than some constant threshold), or on all BOs?
  • If we do this on all BOs, should we drop the SGTable from the VmBo structure? This cannot be done, since the SGTable owns the pages.
  • It is possible to merge SGT entries before they are added to the sg_vec if they are physically contiguous.
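
The merging idea in the last point could look roughly like the following C++ sketch. It is only an illustration under assumptions: `SgEntry` and `push_merged` are hypothetical names, and the real change would live in the Rust driver. An entry is folded into its predecessor when it starts exactly where the predecessor ends in device address space.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical cached scatter-gather entry (same three fields as the
// 3*8 bytes per entry mentioned above).
struct SgEntry {
    uint64_t bo_offset; // byte offset of this entry within the BO
    uint64_t dma_addr;  // device address where this entry starts
    uint64_t len;       // length of this entry in bytes
};

// Append a segment to the cache. If it is physically contiguous with
// the previous entry, grow that entry instead of adding a new one.
void push_merged(std::vector<SgEntry>& cache, uint64_t dma_addr,
                 uint64_t len) {
    assert(len > 0);
    if (!cache.empty()) {
        SgEntry& last = cache.back();
        if (last.dma_addr + last.len == dma_addr) {
            last.len += len; // contiguous: extend in place
            return;
        }
    }
    // Not contiguous: the new entry starts where the BO left off.
    const uint64_t off =
        cache.empty() ? 0 : cache.back().bo_offset + cache.back().len;
    cache.push_back({off, dma_addr, len});
}
```

Fewer entries means both a smaller cache and a shallower binary search; lookups still work unchanged because each entry keeps its own BO offset and length.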

Signed-off-by: Hoang Trung Nguyen <trung@kryptoslogic.com>
@WhatAmISupposedToPutHere
Member

Looks good, any benchmarks that this shows up on?

@TrungNguyen1909
Author

TrungNguyen1909 commented May 12, 2026

Hi, thanks for the response!

This issue hurts the performance of an internal project of ours. I doubt it would ever become an issue with Mesa, though: to hit it, you would need a BO with around 1e5 pages.

I don't have a standalone benchmark tool for this on hand, but if you need one, I can probably write something.

EDIT: 1e6 -> 1e5 pages

@TrungNguyen1909
Author

Hi, here are my test results:

❯ cat bind_old.result 
Testing on 6.19.13-400.asahi.fc44.aarch64+16k with BO size 4294967296 (262144 pages) with 100000 binds
 first bind: 285258906ns
 other binds: 427006731ns per bind: 4270ns 0.234192Mop/s
 unbinds: 50062728ns per unbind: 500ns 2Mop/s
❯ cat bind_new.result 
Testing on 6.19.14-1.XXX.asahi.fc44.aarch64+16k with BO size 4294967296 (262144 pages) with 100000 binds
 first bind: 297553563ns
 other binds: 74415901ns per bind: 744ns 1.344086Mop/s
 unbinds: 51879748ns per unbind: 518ns 1.930501Mop/s

This consistently speeds up the subsequent binds by about 5.7x (4270 ns to 744 ns per bind). The first-bind and unbind costs are essentially unchanged.
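
For reference, the ratio can be recomputed from the totals in the two result files above. This is just a small arithmetic sketch; the constant and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Total time for the 99999 "other binds", copied from the results above.
constexpr uint64_t kOldBindsNs = 427006731; // bind_old.result
constexpr uint64_t kNewBindsNs = 74415901;  // bind_new.result

// Speedup factor of the new code over the old code for the same work.
constexpr double speedup(uint64_t old_ns, uint64_t new_ns) {
    return static_cast<double>(old_ns) / static_cast<double>(new_ns);
}

// The ratio lands between 5.7 and 5.8.
static_assert(speedup(kOldBindsNs, kNewBindsNs) > 5.7);
static_assert(speedup(kOldBindsNs, kNewBindsNs) < 5.8);
```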

Hope that helps.

// g++ -O3 -std=c++23 -o asahi_vm_bind_test asahi_vm_bind_test.cpp
// Note: do not build with -DNDEBUG; the ioctl calls live inside assert().
#include <bits/stdc++.h>
#include <cassert>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <drm/drm.h>
#include <drm/asahi_drm.h>
#include <sys/utsname.h>
#include <unistd.h> // close()

using namespace std::chrono_literals;

int main(int argc, char** argv) {
    const uint64_t PAGE_16K = 0x4000;
    uint64_t bo_size = 4 * (1ULL << 30); // 4G
    uint64_t n_binds = 100000;
    if (argc >= 2) {
        bo_size = strtoull(argv[1], NULL, 10);
    }
    if (argc >= 3) {
        n_binds = strtoull(argv[2], NULL, 10);
    }
    auto n_pages = bo_size / PAGE_16K;
    auto handle = open("/dev/dri/renderD128", O_RDWR); 
    assert(handle >= 0);

    struct drm_asahi_params_global params;
    struct drm_asahi_get_params get_params{
        .param_group = 0,
        .pointer = (uintptr_t)&params,
        .size = sizeof(params),
    };
    assert(ioctl(handle, DRM_IOCTL_ASAHI_GET_PARAMS, &get_params) == 0);
    auto vm_base = params.vm_start;

    assert((params.vm_end - params.vm_start) > (n_binds * PAGE_16K));

    // Create a VM
    struct drm_asahi_vm_create vm_create {
        .kernel_start = params.vm_end - params.vm_kernel_min_size,
        .kernel_end   = params.vm_end,
    };
    assert(ioctl(handle, DRM_IOCTL_ASAHI_VM_CREATE, &vm_create) == 0);
    auto vm_id = vm_create.vm_id;

    struct drm_asahi_gem_create gem_create {
        .size  = bo_size,
        .flags = DRM_ASAHI_GEM_VM_PRIVATE,
        .vm_id = vm_id,
    };
    assert(ioctl(handle, DRM_IOCTL_ASAHI_GEM_CREATE, &gem_create) == 0);
    auto bo_handle = gem_create.handle;

    struct utsname n{};
    assert(uname(&n) == 0);
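    // uint64_t is not necessarily unsigned long long, so cast for %llu.
    printf("Testing on %s with BO size %llu (%llu pages) with %llu binds\n",
           n.release, (unsigned long long)bo_size,
           (unsigned long long)n_pages, (unsigned long long)n_binds);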

    std::vector<struct drm_asahi_gem_bind_op> bind_ops;
    bind_ops.resize(n_binds);
    std::vector<struct drm_asahi_gem_bind_op> unbind_ops;
    unbind_ops.resize(n_binds);

    std::mt19937 gen;
    gen.seed(42);

    // Generate page-sized binds with random offsets.
    for (uint64_t i = 0; i < n_binds; i++) {
        bind_ops[i] = {
            .flags  = DRM_ASAHI_BIND_READ | DRM_ASAHI_BIND_WRITE,
            .handle = bo_handle,
            .offset = PAGE_16K * (gen() % n_pages),
            .range  = PAGE_16K,
            .addr   = vm_base + i * PAGE_16K,
        };
        unbind_ops[i] = {
            .flags  = DRM_ASAHI_BIND_UNBIND,
            .range  = PAGE_16K,
            .addr   = vm_base + i * PAGE_16K,
        };
    }
    {
        const auto start_first_bind = std::chrono::high_resolution_clock::now();
        struct drm_asahi_vm_bind vm_bind{
            .vm_id     = vm_id,
            .num_binds = 1,
            .stride    = sizeof(drm_asahi_gem_bind_op),
            .pad       = 0,
            .userptr   = (uintptr_t)&bind_ops[0],
        };
        assert(ioctl(handle, DRM_IOCTL_ASAHI_VM_BIND, &vm_bind) == 0);
        const auto end_first_bind = std::chrono::high_resolution_clock::now();
        const auto dur_first_bind = end_first_bind - start_first_bind;
        std::println(" first bind: {}", dur_first_bind);
    }

    {
        const auto start_binds = std::chrono::high_resolution_clock::now();
        struct drm_asahi_vm_bind vm_bind{
            .vm_id     = vm_id,
            .num_binds = (uint32_t)n_binds - 1,
            .stride    = sizeof(drm_asahi_gem_bind_op),
            .pad       = 0,
            .userptr   = (uintptr_t)&bind_ops[1],
        };
        assert(ioctl(handle, DRM_IOCTL_ASAHI_VM_BIND, &vm_bind) == 0);
        const auto end_binds = std::chrono::high_resolution_clock::now();
        const auto dur_binds = end_binds - start_binds;
        std::println(" other binds: {} per bind: {} {}Mop/s", dur_binds, dur_binds / (n_binds - 1), 1s / (dur_binds / (n_binds - 1)) / 1e6);
    }

    {
        const auto start_unbinds = std::chrono::high_resolution_clock::now();
        struct drm_asahi_vm_bind vm_bind{
            .vm_id     = vm_id,
            .num_binds = (uint32_t)n_binds,
            .stride    = sizeof(drm_asahi_gem_bind_op),
            .pad       = 0,
            .userptr   = (uintptr_t)&unbind_ops[0],
        };
        assert(ioctl(handle, DRM_IOCTL_ASAHI_VM_BIND, &vm_bind) == 0);
        const auto end_unbinds = std::chrono::high_resolution_clock::now();
        const auto dur_unbinds = end_unbinds - start_unbinds;
        // All n_binds ranges are unbound here, so divide by n_binds (not n_binds - 1).
        std::println(" unbinds: {} per unbind: {} {}Mop/s", dur_unbinds, dur_unbinds / n_binds, 1s / (dur_unbinds / n_binds) / 1e6);
    }

    drm_gem_close gem_close { .handle = bo_handle };
    assert(ioctl(handle, DRM_IOCTL_GEM_CLOSE, &gem_close) == 0);

    struct drm_asahi_vm_destroy vm_destroy {
        .vm_id = vm_id,
    };
    assert(ioctl(handle, DRM_IOCTL_ASAHI_VM_DESTROY, &vm_destroy) == 0);
    close(handle);
    return 0;
}
