• IHeartBadCode@kbin.social
    arrow-up
    185
    arrow-down
    1
    ·
    1 year ago

    Well, the issue at hand is that, as on the x86 arch, this is getting to the point where you cannot just move the NR_CPUS value upward and call it done. The kernel needs to keep some information on hand about the CPUs, usually about 8KB per CPU. That is usually allocated on the stack, which is a bit of special memory that comes with some assurances, like being contiguous and having things automagically deallocated for you when they go out of scope.

    However, because of those special assurances, simply increasing the size of the stack can create all kinds of issues, namely TLB misses. One of the things that makes CPUs go faster is moving bits of RAM into some special RAM inside the CPU called cache (there are different levels of cache and each level has different properties, which is getting a bit too deep into details). The CPU attempts to guess the next bit of RAM that needs to move into cache before it’s actually needed; this is called prediction, or prefetching. Usually the CPU gets it right, but sometimes it gets it wrong and the CPU must tell the actual core to wait while it goes and gets the correct bit of RAM. Because the cores move way faster than the transfer from RAM to cache, the CPU needs to move the bits from RAM into cache before the core actually needs them.

    So keeping the stack small pretty much ensures that you can fit the stack into one of the levels of cache on the CPU, which keeps the stack fast and lets you have all that neat automagical stuff like deallocation when things go out of scope. So you cannot just increase the NR_CPUS value, because the stack will get too large to fit nicely inside the cache. It’ll get broken up into “pages”, with the current page in cache and the others still in RAM, and there will be swapping between the pages, which can introduce TLB misses.

    So the patch being submitted will set the CPUMASK_OFFSTACK flag for particular configurations. This moves the CPU information that’s being maintained off of the stack; instead it is allocated with slab allocation. Slab allocation is a kernel allocation algorithm that’s a bit different from the usual C style malloc or calloc. (For any C programmers out there: calloc should be your go-to for security reasons, and you should use malloc only if you have a reason to. But I don’t want to paper over details by just saying “use calloc and never use malloc”. There’s a difference, and that difference is important in some cases.)
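
    For a sense of scale, here is a rough sketch of how a single CPU bitmap grows with NR_CPUS. It assumes a mask stores one bit per possible CPU, rounded up to whole 64-bit words; `cpumask_bytes` is a made-up helper name for illustration, not a kernel function.

```python
BITS_PER_WORD = 64  # assumption: 64-bit machine words, as on arm64 and x86-64


def cpumask_bytes(nr_cpus: int) -> int:
    """Bytes needed for one bitmap holding one bit per possible CPU."""
    # Round up to a whole number of machine words.
    words = (nr_cpus + BITS_PER_WORD - 1) // BITS_PER_WORD
    return words * (BITS_PER_WORD // 8)


if __name__ == "__main__":
    for nr_cpus in (256, 512, 8192):
        print(f"NR_CPUS={nr_cpus:5d}: {cpumask_bytes(nr_cpus)} bytes per mask")
```

    Under that assumption, a mask kept in a local variable costs that many bytes of stack in every function that declares one, so the cost scales with the configured maximum rather than with the CPUs actually present.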

    Without deep diving into kernel slabs, slabs are a bit different in that they don’t have some of those nice automagical things that come with stack memory. So one must be a bit more careful with how they are used, but the nice thing about the slab allocator is that it’s pretty smart about ensuring it’s doing the right thing. This is for the 5.3 kernel, but I love the charts that give an overview of how the slab allocator works. It’s pretty similar in 6.x kernels, but I don’t have any nifty charts for that version; if someone does, I will love you if you post a link.

    That said, it’s a bit slower, but a fair enough tradeoff until there’s some change in ARM Cortex-X memory cache arrangement. Going from memory, I think Cortex-X4 has 32MB of shared L3 cache. If you have 8KB per CPU at the 8192 CPU max, you’ll need 64MB just to hold the CPU bitmap in L3, which is slow compared to the other levels. And there’s other stuff you’re going to need in the cache at any given time, so hogging it all is not ideal. Setting the limit for stack usage to 512 is good, as that means the bitmap is just 4MB and you can schedule well ahead of time when to move it all into cache or leave it in RAM. (The kernel has a prefetcher, and things within the kernel can do all kinds of special stuff with it to indicate when a bit of RAM needs to be moved into cache; us measly users can only make a suggestion, called a hint, to the prefetcher.) So it’s a good balance for the moment.
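
    The arithmetic above, as a quick sketch. The 8KB per CPU figure and the 32MB L3 size are the numbers used in this comment (the L3 size is recalled from memory, so treat it as an assumption); the function name is made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

PER_CPU_BYTES = 8 * KIB  # "about 8KB per CPU", the figure used in the comment
L3_BYTES = 32 * MIB      # assumption: shared L3 size, recalled from memory


def per_cpu_footprint(nr_cpus: int, per_cpu_bytes: int = PER_CPU_BYTES) -> int:
    """Total bytes of per-CPU data for a given NR_CPUS."""
    return nr_cpus * per_cpu_bytes


if __name__ == "__main__":
    for nr_cpus in (256, 512, 8192):
        total = per_cpu_footprint(nr_cpus)
        share = total / L3_BYTES
        print(f"NR_CPUS={nr_cpus:5d}: {total // MIB:3d} MiB ({share:.1%} of L3)")
```

    With these inputs, 512 CPUs come to 4 MiB and 8192 CPUs come to 64 MiB, which is twice the assumed L3 size.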

    But server style ARM is making headway, so it makes sense to handle it the same way the kernel handles server style x86 and other server style archs like POWER and whatnot, but not to mess with it too much for consumer style ARM, which hardly needs these massive bitmaps.

    • CapeWearingAeroplane
      English
      arrow-up
      14
      ·
      edit-2
      1 year ago

      Since you seem to know a lot about this: I would think that at some point the purely physical size of a device is prohibitive of using shared cache, just because the distance from a cpu to the cache can’t be too big. Do you know when this comes into play, if it does? Also, having written some multithreaded computational software, I’ve found that there’s typically (for the stuff I do) a limit to how many cores I can efficiently make use of, before the overhead of opening and closing threads eats the advantage of sharing the work between cores. What kind of “everyday” server stuff is efficiently making use of ≈300 cores? It’s clearly some set of tasks that can be done independently of one another, but do you know more specifically what kind of things people need this many cores on a server for?

      • d3Xt3r@lemmy.nz
        English
        arrow-up
        5
        ·
        edit-2
        1 year ago

        What kind of “everyday” server stuff is efficiently making use of ≈300 cores? It’s clearly some set of tasks that can be done independently of one another, but do you know more specifically what kind of things people need this many cores on a server for?

        Traditionally VMs would be the use case, but these days, at least in the Linux/cloud world, it’s mainly containers. Containers, and the whole ecosystem that is built around them (such as Kubernetes/OpenShift etc) simply eat up those cores, as they’re designed to scale horizontally and dynamically. See: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale

        Normally, you’d run a cluster of multiple servers to host such workloads, but imagine if all those resources were available on one physical host - it’d be a lot more efficient, since at the very least, you’d be avoiding all that network overhead and delay. Of course, you’d still have at least a two node cluster for HA, but the efficiency of a high-end node still rules.

        • namingthingsiseasy@programming.dev
          English
          arrow-up
          2
          ·
          1 year ago

          Normally, you’d run a cluster of multiple servers to host such workloads, but imagine if all those resources were available on one physical host - it’d be a lot more efficient, since at the very least, you’d be avoiding all that network overhead and delay.

          Exactly! Imagine you have two services in a data center. If they have to communicate a lot with each other, then you would prefer them to be as close to each other as possible. Why? Because of the difference between sending a request over a network and just sending it to another process on the same host. Staying on the same host is much more efficient in terms of latency and bandwidth. There are, of course, downsides and other costs (like the fact that the cores handling the requests are much less powerful), so you have to tailor your hardware allocation to your workloads. In general, if you’re CPU-bound, you would want more powerful CPUs (necessitating fewer cores per host for power reasons), and if you’re I/O-bound, you want to reduce network latency as much as possible.

          Now imagine you have thousands of services. The network I/O can get pretty extreme. Plus, occasionally, you have requirements like the fact that any data traveling from one host to another must be encrypted. So if you can keep as many services as possible on a single host, you reduce a lot of that overhead as well.

          tl;dr: everything comes down to trade-offs and understanding the needs of your workloads, but in general, running 300 low power cores is probably indicative of an I/O-bound application and could hypothetically be much more efficient and cost-effective.

      • Killing_Spark@feddit.de
        English
        arrow-up
        4
        ·
        edit-2
        1 year ago

        Hi! Not the guy from the above comment, but I’d like to chip in :)

        I don’t know much about the cache. I think I heard something about this, and the answer was basically that yes, more distance just makes it slower.

        About the multithreading:

        If the cost of creating threads is becoming an issue, look into the concept of thread pools. They are a neat way of reusing resources and ensuring you don’t try to have more parallelism than is actually possible.

        Edit: if your work is CPU bound, so the cores are actually computing all the time and not waiting on IO or networking, the rule of thumb is to not let the number of threads exceed the number of cores.
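
        A minimal sketch of that idea using Python’s standard `concurrent.futures` module. `busy_sum` is a hypothetical stand-in for real work, and `run_in_pool` is a made-up helper name.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def busy_sum(n: int) -> int:
    """Hypothetical stand-in for a CPU-bound task."""
    return sum(i * i for i in range(n))


def run_in_pool(jobs: list[int]) -> list[int]:
    # Cap the worker count at the number of cores, per the rule of thumb.
    workers = os.cpu_count() or 1
    # The pool starts its threads once and reuses them for every job,
    # instead of paying thread start-up cost per task.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(busy_sum, jobs))


if __name__ == "__main__":
    print(run_in_pool([10, 100, 1000]))
```

        One caveat for this particular language: CPython threads share a global interpreter lock, so truly CPU-bound work usually wants `ProcessPoolExecutor` from the same module, which has the same interface. The sizing rule stays the same.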

        As for use cases for servers with this many cores: shared computing, for example VM hosts. The number of VMs you can sensibly host on a server is limited by the number of cores you have. Depending on the kind of hypervisor you are using, you can share cores between VMs, but that’s going to make the VMs slower.

        Another example of shared computing is HPC clusters, where many people schedule some kind of work, and the cluster allocates the resources, executes the task and returns the results to you. Having more cores allows more of these tasks to run in parallel, effectively increasing the throughput of the cluster.

    • AlecSadler@sh.itjust.works
      English
      arrow-up
      13
      ·
      1 year ago

      How do I become as smart as you?

      Somewhat serious question, any recommended books that cover some of what you said?

      • namingthingsiseasy@programming.dev
        English
        arrow-up
        11
        arrow-down
        1
        ·
        1 year ago

        I’m not the guy you responded to, nor am I a kernel expert, but I have a few suggestions:

        1. Sites like Phoronix and LWN go into pretty low-level kernel details like this from time to time. You could consider subscribing to their RSS feeds or something like that.

        2. Review a few open university courses on either Operating Systems or Computer Architecture. Short of that, you can also just browse Wikipedia for articles on these kinds of topics. I find it enjoyable to read them from time to time.

        3. Subscribe to the LKML (which is probably a lot more information than any single person can process, but sites like LWN and Phoronix highlight/summarize it from time to time).

        I would also say that there are a lot of people out there who have made contributions to the Linux kernel, including this specific portion of it. The person you’re responding to may even do it as part of their day job (and it certainly reads like they do). It’s not that uncommon.

        And the last thing to keep in mind is that learning knowledge like this doesn’t happen overnight. You learn a lot more by learning small things over several years, compared to learning a lot in a short time. Don’t make it a goal to learn things like this - instead, try to make it something you enjoy doing, so you keep doing it over the years and learning more and more small bits of knowledge over time. Eventually, all the different pieces start fitting together and you too could mash out an excellent post like GP’s!

  • DreadPotato
    English
    arrow-up
    40
    ·
    1 year ago

    According to Phoronix, Ampere’s new CPUs have so many cores that Linux doesn’t support systems when two of Ampere’s 192-core chips (384 total cores) are installed in a single server. For now, the ARM64 Linux kernel only supports systems with 256 cores or less. To fix the issue, Ampere has submitted a patch proposing that the Linux kernel core limit be raised to 512

    If you’re already at 384 cores in a dual-processor setup, isn’t raising the limit to 512 too little? Why not just go for 1024 while they’re at it, especially since the method they proposed doesn’t increase the kernel image’s memory footprint?

    • IHeartBadCode@kbin.social
      arrow-up
      22
      ·
      1 year ago

      Well looking at the patch

      +config NR_CPUS_RANGE_END
      +	int
      +	default 8192 if  SMP && CPUMASK_OFFSTACK
      +	default  512 if  SMP && !CPUMASK_OFFSTACK
      +	default    1 if !SMP
      +
      
      

      It looks like it’s setting an end range of 8192, but only with the off stack flag set. And it seems that…

      +	  This is purely to save memory: each supported CPU adds about 8KB
      +	  to the kernel image.
      
      

      Which looks like they’re trying to save memory to avoid TLB stalls on the CPU bitmap. I think if the chip maker is indicating that slab allocation is fine for anything more at the moment (the patch looks to be coming from Christoph Lameter, who works at Ampere), it’s best to assume they’ve tested it on their end. Or at least I would think so. If they felt that more on the stack was a fine option, I would think that’s exactly what they would pitch to the LKML. Since they’re saying there’s a need for offstack past 512, I’m guessing there’s a reason, and the one I can think of is TLB stalls.
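
      The three `default` lines quoted from the patch can be read as a small decision table. Here is a sketch of that reading; the function name is made up, and the two booleans stand in for the SMP and CPUMASK_OFFSTACK config symbols.

```python
def nr_cpus_range_end(smp: bool, cpumask_offstack: bool) -> int:
    """Mirror the three `default` lines of the quoted Kconfig entry."""
    if smp and cpumask_offstack:
        return 8192  # default 8192 if SMP && CPUMASK_OFFSTACK
    if smp and not cpumask_offstack:
        return 512   # default 512 if SMP && !CPUMASK_OFFSTACK
    return 1         # default 1 if !SMP (uniprocessor build)


if __name__ == "__main__":
    for smp in (True, False):
        for offstack in (True, False):
            print(f"SMP={smp!s:5} OFFSTACK={offstack!s:5} -> "
                  f"{nr_cpus_range_end(smp, offstack)}")
```

      So going past 512 is only on offer when the masks live off the stack, which matches the reading above.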

    • LennethAegis@kbin.social
      arrow-up
      4
      ·
      1 year ago

      I agree, they are just going to hit the wall again way too fast. If the limit is 256 or 2^8, they should increase it to 65536 or 2^16. Now that’s a limit that feels safe to leave in place for many years to come.

    • just_another_person@lemmy.world
      English
      arrow-up
      4
      ·
      1 year ago

      Fine point. I assume because they know there is an entire waterfall of shit they don’t want to mess with regarding memory registers for SMP, and they know this is the limit where they can patch and not have to deal with all of that.

    • mindbleach@sh.itjust.works
      English
      arrow-up
      1
      ·
      1 year ago

      An immediate kludge buys time for a worthwhile general solution.

      And if that kludge only buys a few years, we’re less likely to see it Frankensteined into a shitty general solution.

      • DreadPotato
        English
        arrow-up
        1
        ·
        edit-2
        1 year ago

        Unfortunately these kludge solutions that last a few years have a tendency to ripple more kludge solutions when they run out, because the “proper” fix still wasn’t done. Shit that doesn’t work, but needs to work, gets high priority. Shit that works just well enough usually gets neglected until that shit doesn’t work (again).

        • mindbleach@sh.itjust.works
          English
          arrow-up
          1
          ·
          1 year ago

          This approach is slicing a finite resource. It can only extend so far, and it sounds like they extended it about that far in one step. The amount of information the kernel keeps about each core has to be drastically reduced, for the next order of magnitude, or else cache hardware and behavior will need to change in comically-parallel chips.

  • snekerpimp@lemmy.world
    English
    arrow-up
    39
    arrow-down
    5
    ·
    edit-2
    1 year ago

    Umm, build it yourselves…? Anyone can fork and build a custom kernel.

    Edit: looks like they kinda sorta did.

    • ISometimesAdmin@the.coolest.zone
      arrow-up
      34
      arrow-down
      1
      ·
      1 year ago

      Yeah the headline is stupid bait.

      They already built it. They’re trying to contribute the change upstream.

      Which is technically “requesting higher core support”, but is a very obnoxious way to phrase it.

    • fluxion@lemmy.world
      English
      arrow-up
      10
      ·
      edit-2
      1 year ago

      If you fork it, you’re stuck maintaining your own kernel. It quickly becomes a nightmare as you accumulate more custom changes while bringing in fixes/features from mainline kernel.

      They’ve already submitted a patch to change the mainline/upstream kernel. If the community/maintainers accept the patch then they won’t need to fork it and can rely on distros backporting the support to older/downstream kernels if needed.

    • Appoxo@lemmy.dbzer0.com
      English
      arrow-up
      4
      ·
      1 year ago

      Isn’t proposing it something different from just creating a pull request?
      I’d say waiting for the green light from the maintainers and then working on it might be more beneficial.

      • snekerpimp@lemmy.world
        English
        arrow-up
        3
        ·
        1 year ago

        From my understanding, many companies have forked the main kernel to shape it to their needs. But I could be wrong.

  • biflip@infosec.pub
    English
    arrow-up
    17
    ·
    1 year ago

    FreeBSD 14 came out with 1024 core support a couple of weeks ago. Coincidence?

  • Max_Power@feddit.de
    English
    arrow-up
    6
    ·
    1 year ago

    Just make the core count an unsigned INT instead of a signed INT then. Problem solved /s

  • TimeSquirrel@kbin.social
    arrow-up
    6
    arrow-down
    3
    ·
    edit-2
    1 year ago

    Isn’t there some kind of diminishing returns on this, where it starts to make more sense to offload things to a GPU or something instead of piling on ever more CPU cores? There have to be a lot of inefficiencies in that many interconnects.

    • AggressivelyPassive@feddit.de
      English
      arrow-up
      21
      ·
      1 year ago

      GPUs aren’t really suitable for many workloads. These CPUs are typically used in servers, and you can’t really offload a Docker container onto a GPU.

    • hamsterkill@lemmy.sdf.org
      English
      arrow-up
      16
      ·
      1 year ago

      This is the type of processor companies want in things like VM servers that host large numbers of VMs.

      GPU processing units are really good at only specific kinds of computation. These are still all-around processors.

    • _s10e@feddit.de
      English
      arrow-up
      8
      ·
      1 year ago

      The alternative to multiple cores is a single core that runs faster. We tried this and hit a limit. So, it’s many cores, now.

    • Blackmist@feddit.uk
      English
      arrow-up
      1
      ·
      1 year ago

      One of their benchmark graphs is for Stable Diffusion, showing how much faster their CPU runs it than a 96 core AMD Epyc CPU. I’m like 99% sure that a GPU would run that at least 10 times faster.

    • namingthingsiseasy@programming.dev
      English
      arrow-up
      1
      ·
      1 year ago

      GPUs are still pretty bad at handling conditional logic and are more optimized towards doing mathematical operations instead.

      But you are right in the sense that people are exploring different kinds of hardware for workloads that are getting increasingly specific. We’re not in a CPU vs GPU world anymore, but more like a “what kind of CPU do I need?” situation.