Improvements to System.Math InternalCall code. #4847

tannergooding · 2016-05-08T07:07:06Z

This cleans up the native code for the System.Math internal calls. There were several instances of workarounds implemented for much older versions of the MSVC compiler that are just not relevant anymore.

The list of workarounds removed are as follows:

On Windows AMD64, Sin/Cos/Tan were implement in vm\amd64\JitHelpers_Fast.asm (using x87 floating-point instructions) because the CRT versions were previously too slow.
Log, Log10, Exp, Pow, and Round were being compiled with /fp:precise because the /fp:fast implementation was previously slower.
Exp(+/-INFINITY) was previously special handled as the CRT implementation did not return the expected values.
On Windows x86, Pow had some additional code (and inline assembly) that was unnecessary.
On Windows x86, Round was using inline assembly to call the frndint x87 floating-point instruction

In all of these cases, the replaced code was one or more of the following:

No longer needed as the bug had been fixed
Went against the Intel/AMD CPU Optimization Guidelines by using x87 floating-point instructions. Both manuals recommend using SSE/SSE2 instructions unless their are special circumstances requiring the use of the x87 floating-point instructions; the SSE/SSE2 instructions can be pipelined, have a lower latency, use hardware registers rather than the floating-point stack, and generally provided faster computations than the special x87 floating point instructions (such as fsin, fcos, and ftan).
Was no longer more performant than the 'correct' implementation

tannergooding · 2016-05-09T19:16:01Z

This should be reviewable. Everything is passing and looking good, so I don't have any other changes I am planning on making (unless otherwise requested).

jkotas · 2016-05-10T13:28:33Z

@tannergooding Could you please split this into several smaller PRs? It will be hard to make progress on it as is because of it has too much stuff in it.

File rename so that the diff for the rest looks more reasonable
Non-x86 specific changes
x86 specific changes. How did you test that these changes are good?

Thanks!

tannergooding · 2016-05-10T16:22:04Z

@jkotas, this becomes significantly easier to review if you look at each of the commits individually, rather than all at once. It also becomes simpler to review if you add ?w=1 to the end of the url, as it causes whitespace specific changes to be left out of the review.

jamesqo · 2016-05-11T21:51:42Z

src/classlibnative/float/floatdouble.cpp

+    // instead returns +1.0.
+
+    if(IS_DBL_INFINITY(y) && IS_DBL_NEGATIVEONE(x)) {
+        *((INT64 *)(&result)) = CLR_NAN_64;


Just a question, not too familiar with native code but isn't violating strict aliasing supposed to be UB?

Yes, violating strict aliasing is UB according to the spec, but clang, gcc, and msvc will handle this code as intended.

AndyAyersMS · 2016-05-19T18:09:35Z

@tannergooding any plans to implement perf tests for these math functions?

Also how about more rigorous correctness tests?

There are a few tests over in CoreFx but they're minimal and don't check boundary cases. There are pal tests for the underlying native methods but nothing that verifies these when called from managed code.

tannergooding · 2016-05-19T21:57:43Z

@AndyAyersMS. I'll see about getting some perf tests implemented as well as some more rigorous correctness tests.

AndyAyersMS · 2016-05-20T03:53:07Z

I wrote a simple perf test for Log, feel free to build on that if you like.

BTW math function performance seems significantly worse on Linux than on Windows.

tannergooding · 2016-05-22T16:20:46Z

I have added some very basic performance tests for all of the System.Math functions which are implemented via InternalCall.

All performance tests are implemented as follows:

100,000 iterations are executed
The time of all iterations are aggregated to compute the Total Time
The time of all iterations are averaged to compute the Average Time
A single iteration executes some simple operation, using the function under test, 5000 times

All of the functions which were not modified did not show any improvement or regression.

The execution time below is the Total Time for all 100,000 iterations, measured in seconds.

Hardware: Desktop w/ 3.7GHz Quad-Core A10-7850K (AMD) and 16GB RAM

Of the functions which were changed, they are showing the following improvments (x64):

Function	Improvment	Execution Time - Before	Execution Time - After
Cos	59.8373609%	17.5828667s	7.06169908s
Exp	13.0115399%	7.78959558s	6.77604924s
Log	0.691849597%	5.47524991s	5.43736941s
Log10	0.214979727%	5.79203835s	5.77961335s
Pow	0.982046086%	30.8941570s	30.5907621s
Round	31.0027713%	5.90780906s	4.07622453s
Sin	68.0349993%	17.6330190s	5.63639463s
Tan	59.4324800%	25.4978910s	10.3438620s

Of the functions which were changed, they are showing the following improvments (x86):

Function	Improvment	Perf Before	Perf After
Exp	21.8584468%	9.46195789s	7.39372086s
Log	14.4277698%	11.2499700s	9.62685020s
Log10	14.9602930%	11.5140937s	9.79155157s
Pow	15.1681239%	31.8424078s	27.0125119s
Round	0.172920825%	0.63726968s	0.63616771s

FYI. @AndyAyersMS

AndyAyersMS · 2016-05-22T17:40:14Z

tests/src/JIT/Performance/CodeQuality/Math/Functions/Abs.cs

+
+            var diff = Math.Abs(absDoubleExpectedResult - result);
+
+            if ((diff != 0) && (diff < absDoubleEpsilon))


I think this should just be if (diff > absDoubleEpsilon) -- no need to check against zero, and fail if the diff is too large.

Similarly for checks in the other tests.

AndyAyersMS · 2016-05-22T17:46:26Z

Thanks for adding tests.... left a few comments.

I'd be interested to see the Windows vs Linux numbers.

tannergooding · 2016-05-22T18:52:04Z

I've updated according to the feedback provided and am working on getting some Linux (and Mac) numbers for comparison.

AndyAyersMS · 2016-05-22T19:37:15Z

Thanks for the updates.

One last thing to think about. The code under jit/performance has to work well under 3 different use cases:

Regular test, run every CI. We do this a lot
Perf test, xunit-perf run. We do this rarely
Perf test, manual run. We do this very rarely.

I think what you have is good on the perf test aspects, but it's likely running this as a regular test will take a long time. What I've done for the other perf tests is (for now) give up on the manual perf test aspect and scale back iterations so when run as an exe the timing is reasonable, say a second or so at most in release and a short as possible in non-release.

To get back the manual mode capabilities, I have plans to go back and add optional -bench arguments to each of tests so they ramp up the work they do when run as a standalone perf test.

So, look at how long this test takes in CI and if it stands out, you should consider doing something to trim it back.

AndyAyersMS · 2016-05-22T20:26:50Z

Looks like you hit the timeout ...

D:\j\workspace\debug_windows---17180f1b\bin\tests\Windows_NT.x64.Debug\JIT\Performance\CodeQuality\Math\Functions\Functions\Functions.cmd Timed Out

tannergooding · 2016-05-22T20:40:59Z

I'm updating this to accept two arguments (similar to the SIMD perf tests). One for the math function to test and the other which will be the number of iterations to run. I'm going to default the iterations to 1 for debug and 1000 for release (I can run 2500 iterations of the PowTest in one second on my box)

tannergooding · 2016-05-22T21:41:13Z

Updated. The Math Functions performance test now accepts two arguments. One allows you to specify the name of the test to run (all is supported and is the default) and the other (-bench) allows you to specify the number of iterations to run (defaults to 1 on debug and 1000 on release).

AndyAyersMS · 2016-05-22T21:56:16Z

Nicely done -- LGTM.

bendono · 2016-05-22T22:49:25Z

src/classlibnative/float/Float.nativeproj

@@ -28,7 +28,7 @@
  </PropertyGroup>
  <!--Leaf Project Items-->
  <ItemGroup>
-    <CppCompile Include="FloatNative.cpp" />
+    <CppCompile Include="FloatDouble.cpp" />


The casing on the file name differs here from that used on the actual file. Should be fine for Windows, but is this not a problem for Linux type systems?

I believe (in the case of _._proj) casing differences are being handled already. However, I have adjusted the casing to match what exists on disk.

tannergooding · 2016-05-23T05:13:52Z

I have some perf numbers for Linux and Mac. I'm not sure I trust the Linux numbers because of how small they are (I did double check the I deployed the right bits), but Mac did show some small (and consistent) improvements (I was happy to see already good perf coming from Mac). I also updated my post above to list the hardware I ran the windows benchmarks on.

OS X showed very little improvement, but the improvement was consistent.

Hardware: MacBook Pro w/ 2.5GHz Quad-Core i7 (Intel) and 16GB RAM

Function	Improvment	Execution Time - Before	Execution Time - After
Cos	8.38015985%	5.39619623s	4.94398636s
Exp	3.44344613%	3.70642912s	3.57880023s
Log	0.155915035%	3.37851958s	3.37325196s
Log10	0.274460120%	3.40944615s	3.40008858s
Pow	1.00818607%	15.5424187s	15.3857222s
Round	0.243108725%	2.61207820s	2.60572801s
Sin	0.572526337%	4.50698218s	4.48117852s
Tan	23.0676004%	5.25166822s	4.04023438s

Linux showed even less improvement and it did not seem to be consistent.

Hardware: Asus ZenBook w/ 2.6GHz Quad-Core i3 (Intel) and 4GB RAM

Function	Improvment	Execution Time - Before	Execution Time - After
Cos	0.209634293%	54.7168588s	54.6021535s
Exp	1.89459115%	52.8744791s	51.8727239s
Log	0.217855270%	56.1605876s	56.0382388s
Log10	0.274460120%	58.8396717s	58.3054002s
Pow	0.0565113567%	51.8469237s	51.8176243s
Round	3.87328635%	4.83688612s	4.64953967s
Sin	1.90053224%	51.5278393s	50.5485361s
Tan	0.441676743%	58.7293771s	58.4699831s

FYI. @AndyAyersMS

Let me know what else, if anything, is required to get this merged.

AndyAyersMS · 2016-05-23T17:17:11Z

LGTM.

Something for future follow-up: try and run Linux on same HW as Windows, and compare relative perf (eg dual-boot a machine or similar). Data above and some experiments I've done suggest Linux math function performance may not be very good.

Also windows 10 math libs seem to light up on AVX2 capable machines. So for Windows, you might want to measure on both older and newer HW.

tannergooding · 2016-05-24T01:05:16Z

@AndyAyersMS, I did some investigations on Linux and it seems that the slowness is coming from the libm implementation being linked in (usr/lib/x86_64-linux-gnu/libm.a).

I tried swapping the implementation out with amdlibm.a and saw the numbers drop to be on par with Windows.

AndyAyersMS · 2016-05-24T01:45:06Z

Interesting... so do you think we should be linking amdlibm.a instead? Seems like we should open an issue for this.

tannergooding · 2016-05-24T02:50:25Z

I believe, at the very least, it is worth investigating further.

AMD's libm implementation doesn't ship natively, it has to be downloaded externally (http://developer.amd.com/tools-and-sdks/archive/libm/). It also comes with its own licensing agreement, which would need review by LCA before we could adopt it.

I'm still wondering if there is some additional detail I'm missing, because it is a little hard to believe that libm is 6-12x less performant than the Windows or OSX implementation. However, I did validate that the clang commands were the same for both Linux/Mac (aside from OS specific compiler defines and library references).

I have all the code available on my box (and I believe my office is just down the hall from yours) if you would like to take a closer look tomorrow.

tannergooding · 2016-05-24T07:24:37Z

Rebased code onto HEAD and finished cleaning up the code.

tannergooding · 2016-05-25T02:17:29Z

Updating the existing PAL tests. Should provide much better coverage for the for things both inside and outside of a functions input domain.

tannergooding · 2016-05-26T15:46:50Z

This is now passing (again), has new perf tests, and extended pal test coverage. Is there anything else required on my end to get this merged in?

tannergooding · 2016-05-27T15:10:09Z

Looking at the trigger phrase, it seems my previous comment caused a coverage test to run, which failed because it couldn't find ./corerun or ./runtests.sh.

AndyAyersMS · 2016-05-27T16:53:52Z

@tannergooding I would suggest getting somebody on the runtime side to sign off.

Just browsing through the new pal tests, I don't understand why the expected variance for inputs is sometimes PAL_EPS and sometimes a multiple or fraction of it. Did you come about these variances by trial and error? I would actually hope that math library functions we link to would end up fairly close to 1 ULP for most reasonable inputs.

I understood why we relaxed the variance on the managed side since we were adding up many results, but here you're just testing isolated values.

tannergooding · 2016-05-27T17:28:48Z

@AndyAyersMS, the epsilon used for these tests was already set to 1e-07, so I just maintained the what was already there.

I can certainly update the tests to be more have a more accurate base epsilon (1e-6 for single, and 1e-15 for double) so that the cross platform math library implementations are required to produce consistent/accurate results (this is a simple change as I ensured the inputs/outputs for all the tests were at the maximum precision allowed -- 17 digits).

As for why it varies by a multiple/division of 10, that is because floating point values have digits of precision (6-9 for single and 15-17 for double). So, for a single precision floating point, 0.123456789 and 123456789.0 are both as equally precise as allowed. The adjustments handle the ULP for the expected output (so that the last digit is what the delta is based on).

So, in the first example 0.1234567890 and 0.1234567891 should be considered as equally precise, because the 10th digit is outside the range of precision for a single-precision floating-point integer. The same is true for example 123456789.0 and 123456789.1 in the second example.

So PAL_EPSILON is used for numbers in the format of 0.xxxxxx, PAL_EPISLON / 10 is used for 0.0xxxxxx, PAL_EPSILON * 10 is used for x.xxxxx and so on.

tannergooding · 2016-05-28T00:36:07Z

@AndyAyersMS. I have added a new commit which updates the value of PAL_EPSILON to be more precise. Additionally, I have commented all instances of PAL_EPSILON to explain (roughly) what I explained above.

It seems that the various implementations (between Windows, Linux, and Mac) are accurate to within 2^-50 (2^-21 for float). Where the actual machine-epsilon is 2^-52 (2^-23 for float).

Do you know who on the Runtime side would be good to tag for review?

AndyAyersMS · 2016-05-28T02:04:14Z

@sergiy-k Can you suggest somebody to review the PAL changes?

tannergooding · 2016-06-01T22:25:22Z

Ping @sergiy-k Can you suggest somebody to review the PAL changes?
This was probably missed since it was sent late last Friday.

sergiy-k · 2016-06-01T22:31:27Z

/cc @janvorli

janvorli · 2016-06-01T22:42:43Z

@tannergooding You have accidentally (I guess) changed the license headers in multiple files from .NET Foundation to Microsoft. Can you please fix that?
LGTM otherwise.

tannergooding · 2016-06-01T23:14:53Z

@janvorli, I've rebased the changes against the current HEAD and fixed the license headers.

janvorli · 2016-06-01T23:18:38Z

LGTM

tannergooding · 2016-06-02T01:15:22Z

Everything is passing and I believe I have received all necessary sign-off.

Am I clear to merge?

tannergooding · 2016-06-02T01:25:28Z

Thanks!

Math.Exp() to match dotnet/coreclr#4847.

* Adding some basic System.Math performance tests. * Renaming 'floatnative' to 'floatdouble'. * Removing outdated workarounds in the floatdouble interop code. * Renaming 'finite.cpp' to 'math.cpp' * Updating the double-precision math tests. * Updating PAL_EPSILON to be more precise. Commit migrated from dotnet/coreclr@08786f2

dnfclas added the cla-already-signed label May 8, 2016

jamesqo reviewed May 11, 2016
View reviewed changes

AndyAyersMS reviewed May 22, 2016
View reviewed changes

bendono reviewed May 22, 2016
View reviewed changes

tannergooding and others added 6 commits June 1, 2016 16:00

Adding some basic System.Math performance tests.

d22a3b3

Renaming 'floatnative' to 'floatdouble'.

d5434ff

Removing outdated workarounds in the floatdouble interop code.

ecd3cfa

Renaming 'finite.cpp' to 'math.cpp'

1e5e5c6

Updating the double-precision math tests.

48c6f23

Updating PAL_EPSILON to be more precise.

c83e7ec

janvorli merged commit 08786f2 into dotnet:master Jun 2, 2016

tannergooding deleted the math branch June 2, 2016 03:21

tannergooding mentioned this pull request Sep 1, 2016

Stop treating sine/cosine/round as target intrinsics on x86. #7023

Merged

tarekgh added this to the 2.0.0 milestone Mar 22, 2017

tannergooding mentioned this pull request May 3, 2017

Removing the special handling for atan2(±∞, ±∞) and pow(-1.0, ±∞) dotnet/corert#3498

Merged

tijoytom-zz pushed a commit to tijoytom-zz/corert that referenced this pull request May 5, 2017

Remove the additional check from

e1797da

Math.Exp() to match dotnet/coreclr#4847.

tijoytom-zz mentioned this pull request May 5, 2017

Remove the additional check from Math.Exp() dotnet/corert#3533

Merged

jkotas pushed a commit to dotnet/corert that referenced this pull request May 5, 2017

Remove the additional check from (#3533)

c80854a

Math.Exp() to match dotnet/coreclr#4847.

tannergooding mentioned this pull request Jan 31, 2020

[Perf] Math functions are significantly slower on Ubuntu dotnet/runtime#7364

Closed

tarekgh mentioned this pull request Jan 31, 2020

Perf: double.ToString(CultureInfo) 7x slower on Linux dotnet/runtime#7700

Closed

stephentoub mentioned this pull request Jan 31, 2020

MathTests.Pow test failing after recent coreclr change dotnet/runtime#17484

Closed

karakasa mentioned this pull request Nov 21, 2023

Math.Sin difference between FW and 8 dotnet/runtime#95059

Closed


		var diff = Math.Abs(absDoubleExpectedResult - result);

		if ((diff != 0) && (diff < absDoubleEpsilon))

Improvements to System.Math InternalCall code. #4847

Improvements to System.Math InternalCall code. #4847

Uh oh!

Conversation

tannergooding commented May 8, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tannergooding commented May 9, 2016

Uh oh!

jkotas commented May 10, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tannergooding commented May 10, 2016

Uh oh!

jamesqo May 11, 2016

Choose a reason for hiding this comment

Uh oh!

tannergooding May 23, 2016

Choose a reason for hiding this comment

Uh oh!

AndyAyersMS commented May 19, 2016

Uh oh!

tannergooding commented May 19, 2016

Uh oh!

AndyAyersMS commented May 20, 2016

Uh oh!

tannergooding commented May 22, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AndyAyersMS May 22, 2016

Choose a reason for hiding this comment

Uh oh!

tannergooding May 22, 2016

Choose a reason for hiding this comment

Uh oh!

AndyAyersMS commented May 22, 2016

Uh oh!

tannergooding commented May 22, 2016

Uh oh!

AndyAyersMS commented May 22, 2016

Uh oh!

AndyAyersMS commented May 22, 2016

Uh oh!

tannergooding commented May 22, 2016

Uh oh!

tannergooding commented May 22, 2016

Uh oh!

AndyAyersMS commented May 22, 2016

Uh oh!

bendono May 22, 2016

Choose a reason for hiding this comment

Uh oh!

tannergooding May 23, 2016

Choose a reason for hiding this comment

Uh oh!

tannergooding commented May 23, 2016

Uh oh!

AndyAyersMS commented May 23, 2016

Uh oh!

tannergooding commented May 24, 2016

Uh oh!

AndyAyersMS commented May 24, 2016

Uh oh!

tannergooding commented May 24, 2016

Uh oh!

tannergooding commented May 24, 2016

Uh oh!

tannergooding commented May 25, 2016

Uh oh!

tannergooding commented May 26, 2016

Uh oh!

tannergooding commented May 27, 2016

Uh oh!

AndyAyersMS commented May 27, 2016

Uh oh!

tannergooding commented May 27, 2016

Uh oh!

tannergooding commented May 28, 2016

Uh oh!

AndyAyersMS commented May 28, 2016

tannergooding commented May 8, 2016 •

edited

Loading

jkotas commented May 10, 2016 •

edited

Loading

tannergooding commented May 22, 2016 •

edited

Loading