EVAST9919 (Contributor) commented Aug 30, 2025

Split from #6613 which serves as base implementation of BBH inside a Path. On its own, this PR only improves Path.ReceivePositionalInputAt performance.

Structure

When path vertices are passed to the PathBBH, we build a binary tree from the bottom up: the bottom row contains the path segments and their bounding boxes, and each parent node is then updated with the combined bounds of its left and right children.

Of course, not all paths have 2^n segments, so the tree has to scale to any number of segments without empty nodes, which keeps the memory needed to store it down. All nodes are stored in a single array of roughly 2x the segment count, in the following order: [nodes at depth 0, left to right][nodes at depth 1, left to right]...[nodes at depth n-1, left to right][leaves (segments)]. The array is populated from end to start: leaves first, then their parents, then their parents, and so on. As a result, the whole tree is built in a single loop.
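For readers who want to see the layout concretely, here is a minimal sketch of a bottom-up build into one flat array. It is not the code from this PR: the names Box, BbhSketch and Build are hypothetical, it uses a nested loop per level rather than the single loop described above, and it assumes each level holds ceil(count / 2) nodes, with an unpaired node copied up unchanged.

```csharp
using System;

// Hypothetical axis-aligned bounding box, used for illustration only.
public readonly record struct Box(float MinX, float MinY, float MaxX, float MaxY)
{
    public static Box Union(Box a, Box b) => new Box(
        Math.Min(a.MinX, b.MinX), Math.Min(a.MinY, b.MinY),
        Math.Max(a.MaxX, b.MaxX), Math.Max(a.MaxY, b.MaxY));
}

public static class BbhSketch
{
    // Returns nodes laid out as [depth 0][depth 1]...[leaves], root at index 0.
    public static Box[] Build(ReadOnlySpan<Box> leaves)
    {
        if (leaves.IsEmpty)
            return Array.Empty<Box>();

        // Sum the size of every level: n, ceil(n / 2), ..., 1.
        int total = 0;
        for (int count = leaves.Length; ; count = (count + 1) / 2)
        {
            total += count;
            if (count == 1) break;
        }

        var nodes = new Box[total];

        // Leaves occupy the tail of the array.
        int levelStart = total - leaves.Length;
        int levelCount = leaves.Length;
        leaves.CopyTo(nodes.AsSpan(levelStart));

        // Fill each parent level directly in front of its children, end to start.
        while (levelCount > 1)
        {
            int parentCount = (levelCount + 1) / 2;
            int parentStart = levelStart - parentCount;

            for (int i = parentCount - 1; i >= 0; i--)
            {
                int left = levelStart + 2 * i;
                int right = left + 1;

                // The last node of an odd-sized level has no sibling.
                nodes[parentStart + i] = right < levelStart + levelCount
                    ? Box.Union(nodes[left], nodes[right])
                    : nodes[left];
            }

            levelStart = parentStart;
            levelCount = parentCount;
        }

        return nodes;
    }
}
```

Under these assumptions, 5 segments produce 11 nodes in levels of 1, 2, 3 and 5, so the array stays close to 2x the segment count with no empty slots.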

Path.SetVertices benchmark

master:

| Method | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|
| Compute100Segments | 592.5 ns | 1.81 ns | 1.69 ns | - |
| Compute1KSegments | 5,887.3 ns | 110.78 ns | 98.20 ns | - |
| Compute10KSegments | 59,171.8 ns | 127.50 ns | 113.03 ns | - |
| Compute100KSegments | 569,454.3 ns | 2,341.02 ns | 2,075.25 ns | - |
| Compute1MSegments | 5,895,614.8 ns | 22,614.86 ns | 20,047.48 ns | - |

pr:

| Method | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|
| Compute100Segments | 2.234 us | 0.0027 us | 0.0021 us | 56 B |
| Compute1KSegments | 22.013 us | 0.0453 us | 0.0424 us | 56 B |
| Compute10KSegments | 365.738 us | 2.3118 us | 2.0493 us | 56 B |
| Compute100KSegments | 4,013.083 us | 7.9266 us | 7.4145 us | - |
| Compute1MSegments | 41,526.195 us | 65.9597 us | 61.6988 us | - |

pr (after 06d895d):

| Method | Mean | Error | StdDev | Gen0 | Allocated |
|---|---|---|---|---|---|
| Compute100Segments | 1.508 us | 0.0053 us | 0.0049 us | 0.0019 | 56 B |
| Compute1KSegments | 15.007 us | 0.0308 us | 0.0288 us | - | 56 B |
| Compute10KSegments | 152.082 us | 0.5901 us | 0.5231 us | - | 56 B |
| Compute100KSegments | 1,535.200 us | 2.7969 us | 2.6163 us | - | 57 B |
| Compute1MSegments | 21,892.779 us | 80.7825 us | 63.0697 us | - | - |

While the pr values are about ~2.5x slower than master (more micro-optimisations could improve this further; the target would be 2x, given that the number of nodes processed is ~2x the segment count), it's worth noting that on master (with snaking sliders) these timings are paid every frame, whereas with #6613 they are paid only once, followed by the timings from the Path.SetStartProgress table (which are basically nothing).

Path.ReceivePositionalInputAt benchmark

master:

Method Mean Error StdDev Allocated
| Contains100 | 302.0 ns | 0.63 ns | 0.56 ns | - |
| Contains1K | 2,906.9 ns | 12.67 ns | 11.85 ns | - |
| Contains10K | 76,607.3 ns | 267.95 ns | 250.64 ns | - |
| Contains100K | 868,608.0 ns | 2,824.05 ns | 2,641.61 ns | - |
| Contains1M | 8,683,885.7 ns | 106,430.14 ns | 99,554.82 ns | - |

pr:

Method Mean Error StdDev Allocated
| Contains100 | 195.1 ns | 0.20 ns | 0.19 ns | - |
| Contains1K | 289.8 ns | 0.21 ns | 0.20 ns | - |
| Contains10K | 444.2 ns | 0.41 ns | 0.37 ns | - |
| Contains100K | 594.2 ns | 1.17 ns | 1.04 ns | - |
| Contains1M | 945.1 ns | 2.87 ns | 2.54 ns | - |

Input performance showcase

master: https://github.com/user-attachments/assets/32cfe1e8-d9b3-4ace-a200-8ab1f85179b3
pr: https://github.com/user-attachments/assets/f46a9a52-e76c-40df-abee-59fb23ead737

peppy requested a review from smoogipoo October 16, 2025 07:29
smoogipoo (Contributor) left a comment

Wanna resolve conflicts here and we can try getting this merged in?

Comment on lines 21 to 31
public static float BranchlessMin(float value1, float value2)
{
    int b = Convert.ToInt32(value1 < value2);
    return b * value1 + (1 - b) * value2;
}

public static float BranchlessMax(float value1, float value2)
{
    int b = Convert.ToInt32(value1 > value2);
    return b * value1 + (1 - b) * value2;
}
Contributor

I don't think this is doing much, as it's still branching internally: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Convert.cs,992

Would probably rather not do this and leave it to the JIT to hopefully do things correctly.

https://sharplab.io/#v2:C4LghgzgtgPgAgJgIwFgBQ6BmAbA9mYAAgEFCBeQgJTADsATXKAOgGUALMAJwFM6mA5bgA9gLAJY0A5tm4AKAJQBuLHgKEAQuSq0Gzdl14DhoidLlL0K/EQCyC9AG90hF4TgB2QjYJsmNibLEADQaFmgAvpZoONaE/PZoTmiubp7qnLQAxmwyEBD+NIEh6mGRGOVwSABshDFq6Vk53HkFsnVEAG5g2ACu3Egh7YRdvdwI8o7OrhJEAEZaAMK4NB3cnMBMACq4AJI0wADMCLIjfUiEADzD3X3jysmuHoTzAFTXo+cA1ISy5wC0z3khDepzG93CQA=

Program.<<Main>$>g__M|0_0(<>c__DisplayClass0_0 ByRef)
    L0000: push eax
    L0001: vmovss xmm0, [ecx+4]
    L0006: vmovss xmm1, [ecx]
    L000a: vrangess xmm2, xmm1, xmm0, 4
    L0011: vmovups xmm3, [Program.<<Main>$>g__M|0_0(<>c__DisplayClass0_0 ByRef)]
    L0019: vfixupimmss xmm1, xmm0, xmm3, 0
    L0020: vfixupimmss xmm2, xmm1, xmm3, 0
    L0027: vmovss [esp], xmm2
    L002c: fld st, dword ptr [esp]
    L002f: pop ecx
    L0030: ret

Program.<<Main>$>g__N|0_1(<>c__DisplayClass0_0 ByRef)
    L0000: push eax
    L0001: push dword ptr [ecx]
    L0003: push dword ptr [ecx+4]
    L0006: call 0x2ad60048
    L000b: fstp dword ptr [esp], st
    L000e: vmovss xmm0, [esp]
    L0013: vmovss [esp], xmm0
    L0018: fld st, dword ptr [esp]
    L001b: pop ecx
    L001c: ret

Contributor Author

I don't mind reverting since I'm not a huge fan of how it looks either. The only reason I pushed this is that it does affect performance, and quite noticeably (hence the benchmarks in the OP before and after the commit). But sure, let's revert for now and maybe think about further improvements later.

Contributor

I also made a microbenchmark for this, and the results are documented in the file: smoogipoo/Benchmarks@7004938

There's going to be more to this, and it likely has to do with CPU & branch prediction.

I'd be interested to see what the results are for you with that benchmark, but in general I would always assume the JIT is knowledgeable of tricks like this.

Contributor Author

Yeah, with your benchmark my results are the same as well

| Method | Job | Runtime | Mean | Error | StdDev |
|---|---|---|---|---|---|
| Min | .NET 10.0 | .NET 10.0 | 7.134 us | 0.0021 us | 0.0019 us |
| BranchlessMin | .NET 10.0 | .NET 10.0 | 7.161 us | 0.0480 us | 0.0449 us |
| Min | .NET 8.0 | .NET 8.0 | 7.134 us | 0.0041 us | 0.0032 us |
| BranchlessMin | .NET 8.0 | .NET 8.0 | 7.132 us | 0.0033 us | 0.0026 us |

EVAST9919 requested a review from smoogipoo January 5, 2026 13:14
smoogipoo (Contributor) commented Jan 6, 2026

I'm struggling a little bit with the algorithm here because it looks like you've made a conscious effort to build things in reverse order. Is there a reason you couldn't build the binary tree left-to-right?

That will also perform better with CPU prefetching.
