Arenas in Rust(russellw.github.io)

47 points by welovebunnies 5 hours ago | 12 comments

  • jmull 20 minutes ago
    > Doesn't that mean you are throwing out all the memory safety properties you were hoping to achieve by using Rust in the first place? ...That argument sounds logical, but it's not actually correct.

    Actually, it is correct. The more generally you use "arenas" the more they are subject to the same kinds of issues as other manual memory management systems. "arenas" can only avoid/reduce the "nondeterminism" and "security" issues of general manual memory management on top of a buffer of bytes by not becoming general manual memory management on top of a buffer of bytes.

  • mwkaufma 2 hours ago
    > Handles are deterministic. If a bug made your program crash on the last run, it will crash the same way on this run.

    > Given that the arena will still be subject to array bounds checking, handle bugs won't allow an attacker to overwrite arbitrary memory the way pointer bugs do.

    Just because you're in-bounds of the arena, doesn't mean you're not stomping on in-bounds neighboring data -- which can introduce "nondeterminism" as the author defines it. These are hand-waving arguments.

    [-]
    • mwkaufma 2 hours ago
      Anticipating pushback: yes, you can disallow "pointer arithmetic" on handles, and store fingerprints in the "slots" to ensure they still contain the handle's identity to detect user-after-free, but congrats, you've implemented sparse sets, which there are dozen's of C++ implementations of with the same safety guarantees, so it's unclear what rust is bringing in that case (e.g. https://github.com/skypjack/entt/blob/master/src/entt/entity...)
  • crq-yml 40 minutes ago
    I'm a bit agnostic about the specific solution these days. In general, early binding(so, static memory and types, formalized arenas and handles, in-line linear logic with few or no loops or branches) debugs more readily than late(dynamic memory allocs and types, raw pointers, virtual functions, runtime configuration). The appeal of late binding is in deferring the final computation to later so that your options stay open, while the converse is true with early binding - if you can write a program that always returns a precomputed answer, that's easier to grasp and verify.

    When we compare one method of early binding with another, it's probably going to be a comparison of the granularity. Arenas are "stupid simple" - they partition out memory into chunks that limit the blast radius of error, but you still make the error. Ownership logic is a "picky generalization" - it lets you be more intricate and gain some assurances if you put up with the procedures necessary, but it starts to inhibit certain uses because its view into usage is too narrow with too many corner cases.

    If we take Go's philosophy as an example of "what do you do if you want idioms for a really scalable codebase" - though you can certainly argue that it didn't work - it's that you usually want to lean on the stupid simple stuff and customize what you need for the rest. You don't opt into a picky Swiss Army Knife unless there's a specific problem to solve with it. Larger codebases have proportionately smaller sections that demand intricacy because more and more of the code is going to itself be a method of defining late binding, of configuring and deferring parts of processing.

    That said, Rust lets you fall back to "stupid simple", it just doesn't pave the way to make that the default.

  • Animats 3 hours ago
    This is the back-link problem in Rust. You can do back links with weak pointers, but it's rather verbose and involves extra run-time checks.

    I've been trying to figure out a compile-time approach to that problem. Here's an early version of a tech note on that.[1] It looks possible, but there are many issues. But it's important to look at now. The C++ people are trying to figure out safe backlinks.

    [1] https://github.com/John-Nagle/technotes/blob/main/docs/rust/...

  • wasmperson 2 hours ago
    > There are several ways to solve this problem. One way is to avoid using direct references to the particular class of objects at all. Instead, allocate a big array of objects, and refer to them with integer indexes into that array. There are several names that have been used for such arrays and indexes; let's call them arenas and handles.

    I thought the term "Arena" referred to linear allocators, but maybe it's not so narrowly defined.

    > At one level, this question is not very fair, because an answer to it as stated would be that one simply does not use doubly linked lists. They have been popular in introductory computer science lectures because they are a neat way to explain pointers in a data structure that's easy to draw on a whiteboard, but they are not a good match to modern hardware. The last time I used one was in the nineties. I know the Linux kernel uses them, but that design was also laid down in the nineties; if you were designing a kernel from scratch today, you would probably not do it that way.

    The arena data structure you described is inefficient unless it uses a linked list to track empty slots. All general-purpose heap allocators use linked lists in their implementations. Linked lists show up wherever you want pointer stability, low fragmentation, or a way to decouple the storage of objects from their participation in data structures. I struggle to imagine how it would be possible to implement something like Linux without at least one linked list in it.

    > Essentially you are bypassing the notion of pointers provided directly by the hardware

    Pointers aren't a hardware concept, they're a language concept. Array indices and pointers both rely on indirect addressing, which is the underlying hardware concept. The "handles" strategy in Rust feels like a kludgy approach not because it bypasses the hardware but because Rust's borrow checker isn't actually involved in ensuring safety in that case, just its bounds checking and mandatory initialization (and if all you need is those two things then...).

  • sfpotter 2 hours ago
    Here's an implementation of a doubly-linked list which is perfectly fine on modern hardware: an array of structs where each struct contains a pointer to its next and previous elements in addition to whatever other data is being stored.

    Here's a situation where a traditional pointer-based double-linked list is fine: when the payload is very large (e.g., an entire document in some app).

    [-]
  • pizlonator 3 hours ago
    Arenas let you use OOBAs to do data attacks, and those can give you the moral equivalent of remote code execution.

    Like if your arena contains both a buffer that OOB's and a buffer passed to a syscall, then having memory safety around the arena doesn't help you

    [-]
    • achierius 21 minutes ago
      And as a bonus, they also conveniently opt you out of any other hardening features your system malloc might have implemented (like MTE).

      I certainly appreciate the performance benefits of custom allocators but it's disheartening to see people naively adopting them just as the world's allocators are starting to get their acts together with regards to security.

    • Animats 3 hours ago
      Yes. This is a big headache in graphics, where there are big tables indexed by texture index down at the GPU level. Managing those as the content changes, with the GPU using the data while the CPU alters it, leads to huge complexity in Vulkan graphics.

      The two times I've had to deal with a really tough debugging problem in Rust, it's been related to some crate which did that.

  • echelon 3 hours ago
    I'd like to see something powerful for safety and speed in Crates.io and Cargo:

    The ability for crates to specify whether they use certain features of Rust:

    - unsafe

    - arena allocation or other future "useful" stuff

    - proc_macros, eg. serde (a huge compile speed hit)

    - deep transitive dependency (total #, or depth)

    - some proxy for the "number of compiler steps" (compiler workload complexity)

    We should be able to set a crate-wide or workspace-wide policy that can ban other crates that violate these rules.

    We could make this hand-written early on, but eventually enforced by the compiler. You could then easily guarantee that no downstream dependency uses unsafe code, has a crazy number of dependencies, or has crazy compile times.

    You could opt-in or opt-out to these things without having to think too much about them or constantly (manually) be on the lookout.

    This would be hugely beneficial to the Rust ecosystem, safety, code quality, and developer productivity / happiness.

  • IshKebab 3 hours ago
    Insightful article. As they say doubly linked lists are approximately useless and never used. But trees were nodes can refer to their parents are extremely common.

    And I agree, a custom arena is usually the best option here. Even though it seems like "giving up" and going back to C, it's still way better.

    There's this great comparison of Arenas in Rust: https://donsz.nl/blog/arenas/

    Some of them have keys that can't be reused too, so you can avoid "use after free" style bugs (at runtime) which is way better than C pointer wrangling.

    It's not perfect though. I kind of wish Rust did have some kind of support for this stuff natively. Pretty low on my list of Rust desires though - way behind better compile time and fewer async footguns.