-
Notifications
You must be signed in to change notification settings - Fork 452
Implement Bounding Box Hierarchy in Path
#6630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
smoogipoo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanna resolve conflicts here and we can try getting this merged in?
osu.Framework/Utils/MathUtils.cs
Outdated
| public static float BranchlessMin(float value1, float value2) | ||
| { | ||
| int b = Convert.ToInt32(value1 < value2); | ||
| return b * value1 + (1 - b) * value2; | ||
| } | ||
|
|
||
| public static float BranchlessMax(float value1, float value2) | ||
| { | ||
| int b = Convert.ToInt32(value1 > value2); | ||
| return b * value1 + (1 - b) * value2; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is doing much, as it's still branching internally: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Convert.cs,992
Would probably rather not do this and leave it to the JIT to hopefully do things correctly.
Program.<<Main>$>g__M|0_0(<>c__DisplayClass0_0 ByRef)
L0000: push eax
L0001: vmovss xmm0, [ecx+4]
L0006: vmovss xmm1, [ecx]
L000a: vrangess xmm2, xmm1, xmm0, 4
L0011: vmovups xmm3, [Program.<<Main>$>g__M|0_0(<>c__DisplayClass0_0 ByRef)]
L0019: vfixupimmss xmm1, xmm0, xmm3, 0
L0020: vfixupimmss xmm2, xmm1, xmm3, 0
L0027: vmovss [esp], xmm2
L002c: fld st, dword ptr [esp]
L002f: pop ecx
L0030: ret
Program.<<Main>$>g__N|0_1(<>c__DisplayClass0_0 ByRef)
L0000: push eax
L0001: push dword ptr [ecx]
L0003: push dword ptr [ecx+4]
L0006: call 0x2ad60048
L000b: fstp dword ptr [esp], st
L000e: vmovss xmm0, [esp]
L0013: vmovss [esp], xmm0
L0018: fld st, dword ptr [esp]
L001b: pop ecx
L001c: ret
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind reverting since I'm not a huge fan of how it looks either. The only reason I pushed this is the fact that it does affect performance and it's quite noticeable (hence benchmarks in OP before and after the commit). But sure, let's revert for now and may be think about further improvements later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also made a microbenchmark for this, and the results are documented in the file: smoogipoo/Benchmarks@7004938
There's going to be more to this, and it likely has to do with CPU & branch prediction.
I'd be interested to see what the results are for you with that benchmark, but in general I would always assume the JIT is knowledgeable of tricks like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, with your benchmark my results are the same as well
| Method | Job | Runtime | Mean | Error | StdDev |
|---|---|---|---|---|---|
| Min | .NET 10.0 | .NET 10.0 | 7.134 us | 0.0021 us | 0.0019 us |
| BranchlessMin | .NET 10.0 | .NET 10.0 | 7.161 us | 0.0480 us | 0.0449 us |
| Min | .NET 8.0 | .NET 8.0 | 7.134 us | 0.0041 us | 0.0032 us |
| BranchlessMin | .NET 8.0 | .NET 8.0 | 7.132 us | 0.0033 us | 0.0026 us |
|
I'm struggling a little bit with the algorithm here because it looks like you've made a conscious effort to build things in reverse order. Is there a reason you couldn't build the binary tree left-to-right? That will also perform better during CPU prefetches too. |
Split from #6613 which serves as base implementation of BBH inside a
Path. By it's own this pr only improvesPath.ReceivePositionalInputAtperformance.Structure
When passing path vertices to the
PathBBHwe will be building a binary tree (from bottom to top), in which most-bottom row will contain path segments and their bounding boxes. Then we will update all the parent nodes with combined bounds of left and right children.Of course not all paths have 2^n segments, so we need to create such a tree which will be able to scale to any amount of segments without having empty nodes to reduce memory needed to store them. All the nodes are stored in an array of size roughly ~2n segment count in the following order:
[nodes from the 0 depth from left to right][nodes from the 1 depth from left to right]...[nodes from the n-1 depth from left to right][leafs(segments)]. Array will be populated from end to start starting with leafs and then their parents and their parents and so on. And in the end the whole tree is built in just 1 cycle.Path.SetVerticesbenchmarkmaster:
pr:
pr (after 06d895d):
While pr values are about ~2.5x slower (with more micro-optimisations can be improved further, target would be 2x given amount of nodes processed is ~2x segment count), it's worth noting that in case of master (with snaking sliders) we will see these timings each frame and with #6613 - only once, and then timings from
Path.SetStartProgresstable (which are basically nothing)Path.ReceivePositionalInputAtbenchmarkmaster:
pr:
Input performance showcase