
Fix inaccuracies in rank estimation when using FFT histogram method #184

Open · wants to merge 12 commits into base: main
Conversation


@lemleb lemleb commented Jan 28, 2025

After validating that the FFT convolution operation fails (returning negative values, see the figure below) for histograms with high bin counts, I implemented a potential countermeasure. The approach introduces a compressed histogram that mitigates the problem while maintaining precision: it consists of a scaling factor, a lower-bound (LB) histogram, and an upper-bound (UB) histogram. Convolving two compressed histograms multiplies their scaling factors and convolves the LB and UB histograms separately. To avoid precision loss, I make sure that the maximum bin count stays under a fixed limit.
I implemented this approach in a way that ensures other ranking methods, such as NTL, remain usable.
I tested the implementation with a variety of input values and added a test case that uses a specific input set known to fail with the original implementation. The reference results for the test case are generated using the NTL ranking method. For tests involving standard histograms without high bin counts (those that do not represent edge cases), the implementation behaves exactly as before, yielding the same results.
I hope this implementation helps address the issue of the FFT convolution failing for histograms with high bin counts. I am open to any suggestions for improvement and look forward to your feedback! (Also, feel free to edit the implementation.)
[Figure 2: comparison between the result of the FFT convolution and the result of a direct (non-FFT) convolution implemented in NumPy]
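The compressed-histogram scheme described in the PR can be sketched roughly as follows (a simplified model with illustrative names, not the PR's actual code; a naive O(n·m) convolution stands in for the FFT, but the bookkeeping of the scaling factors is the same):

```rust
// Hypothetical sketch: a histogram is stored as scaler * [lb, ub] bin
// counts; convolution multiplies the scalers and convolves the bound
// histograms separately, so scaler * lb <= exact counts <= scaler * ub.
#[derive(Clone, Debug)]
struct ScaledHist {
    scaler: f64,  // common scaling factor for all bins
    lb: Vec<f64>, // lower-bound bin counts (rounded down)
    ub: Vec<f64>, // upper-bound bin counts (rounded up)
}

impl ScaledHist {
    /// Divide all bins by `factor` so the max bin count stays under a
    /// limit, rounding down for the lower bound and up for the upper bound.
    fn rescale(&mut self, factor: f64) {
        self.scaler *= factor;
        for x in &mut self.lb {
            *x = (*x / factor).floor();
        }
        for x in &mut self.ub {
            *x = (*x / factor).ceil();
        }
    }

    /// Convolve both bound histograms and multiply the scaling factors.
    fn convolve(&self, other: &Self) -> Self {
        let conv = |a: &[f64], b: &[f64]| {
            let mut out = vec![0.0; a.len() + b.len() - 1];
            for (i, &x) in a.iter().enumerate() {
                for (j, &y) in b.iter().enumerate() {
                    out[i + j] += x * y;
                }
            }
            out
        };
        ScaledHist {
            scaler: self.scaler * other.scaler,
            lb: conv(&self.lb, &other.lb),
            ub: conv(&self.ub, &other.ub),
        }
    }
}

fn main() {
    let mut a = ScaledHist { scaler: 1.0, lb: vec![2.0, 4.0], ub: vec![2.0, 4.0] };
    let b = ScaledHist { scaler: 1.0, lb: vec![1.0, 3.0], ub: vec![1.0, 3.0] };
    a.rescale(2.0); // scaler = 2, lb = [1, 2], ub = [1, 2]
    let c = a.convolve(&b);
    println!("scaler = {}, lb = {:?}, ub = {:?}", c.scaler, c.lb, c.ub);
}
```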


cla-bot bot commented Jan 28, 2025

Thanks for your contribution!

We require contributors to sign our Contributor License Agreement, and we don't have you on file. In order for us to review and merge your code, please follow the instructions on our website.

@rishubn rishubn changed the title Scaled ranking Fix inaccuracies in rank estimation when using FFT histogram method Jan 29, 2025
rishubn (Collaborator) commented Jan 29, 2025

Can you show a comparison of the rank estimation between NTL, the old method, and this fix? Does it match NTL?

@@ -7,11 +7,17 @@ mod bnp {
include!(concat!(env!("OUT_DIR"), "/binding_bnp.rs"));
}

const CONV_LIMIT: f64 = 34359738368.0; // define a certain limit
Collaborator

Please explain the choice of this value in more detail

Author

While testing the original histogram-based ranking implementation, I observed that it begins to fail when the sum of the states (bin counts) exceeds 2^70. Since the FFT convolution involves multiplication, I set the limit to the square root of this value to ensure that 2^70 is never exceeded.
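The arithmetic behind that choice can be checked directly (a standalone illustration of the numbers quoted above, not code from the PR):

```rust
fn main() {
    // Observed failure threshold for summed bin counts in the FFT path.
    let fail_limit: f64 = (2.0_f64).powi(70);
    // Convolution multiplies bin counts, so capping each operand at
    // sqrt(2^70) = 2^35 keeps every product below 2^70.
    let conv_limit: f64 = fail_limit.sqrt();
    assert_eq!(conv_limit, 34359738368.0); // 2^35, the constant in the diff
    assert!(conv_limit * conv_limit <= fail_limit);
    println!("CONV_LIMIT = {}", conv_limit);
}
```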

Collaborator

Can you explain in more detail why it happens at 2^70 and why your solution works? You also need to edit the code comment to explain that constant.


lemleb commented Jan 29, 2025

@rishubn thank you for your feedback.
In a general ranking test case using 16 subkeys (each 8-bit) with random costs, the difference is negligible. To provide more specific details, here are some reference values showing the outputs of the NTL implementation and my scaled histogram implementation when calling the rank_accuracy function. Following the output of the rank_accuracy function, in the second lines, I have also included the log2 of these returned values. If these values would not be precise enough I think the unit tests would also fail as it compares the results with the output of naive ranking (rank2_vs_naive function).

NTL
2.3127785636369424e+38 < 2.697140362283015e+38 < 3.0667306146795653e+38
127.44289474785647 < 127.6646982087543 < 127.84996904981583

scaled histogram
2.3127785652135127e+38 < 2.6971403638637237e+38 < 3.0667306163263063e+38
127.44289474883992 < 127.66469820959982 < 127.84996905059052

Below, I list the output values for the edge case I implemented using the requested three different ranking methods. As before, the first lines show the results from rank_accuracy, and the second lines present the log2 of these results.

NTL
5952304925252321.0 < 6187845177933544.0 < 7321622167993872.0
52.40236985817056 < 52.45835852381074 < 52.70108474865651

original histogram
-447018641455.0 < 1.4837156954018768e+16 < 3.0498092486446173e+20
nan < 53.72006419228016 < 68.04728090939314

scaled histogram
6155488040303454.0 < 6401758333984994.0 < 7590098399148162.0
52.450794669201755 < 52.50738963956567 < 52.753040012382385

As you can see, the original histogram implementation is far off, but the scaled version is nearly the same (accuracy within a 0.052 margin). This difference is caused by the rounding (LB and UB histograms).

cassiersg (Contributor) left a comment

Hi,

Thanks for the PR and the results, that looks very promising.

A few high-level remarks:

  • It would be good to keep the old F64Hist implementation around, and add a new method. It might not be useful to expose it to users of the library, but internally it is interesting to be able to keep evaluating the different approaches (in accuracy and performance).
  • Did you evaluate the performance impact of this change? At a high level, it should be around 2x (twice as many FFTs), but it might be that the divisions by scaler are a bottleneck for some sizes, making the algorithm much slower (if that happens, pre-computing 1/scaler should fix the problem).
  • I'm a bit surprised by the high value for CONV_LIMIT: given that f64 mantissas are 53 bits, I expected CONV_LIMIT to be closer to 2**26. This makes me a bit uncomfortable with the reliability of the code (how much precision do we lose due to rounding?). As a long-term future idea, using integers instead of floats would close this question (that would be quite a bit of work: we could not use complex-number based FFTs, but would need integer FFT/NTT; I don't know if there is a library with a good implementation of those, besides arbitrary-precision libraries like NTL).
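The 53-bit mantissa concern can be demonstrated in a few lines (generic f64 behavior, independent of the PR's code):

```rust
fn main() {
    // f64 has a 53-bit mantissa: integers up to 2^53 are exactly
    // representable...
    let limit = (1u64 << 53) as f64;
    assert_eq!(limit - 1.0, ((1u64 << 53) - 1) as f64);
    // ...but beyond that, adjacent integers collapse to the same value,
    // so accumulated convolution sums silently lose low-order bits.
    assert_eq!(limit + 1.0, limit);
    // With bin counts capped near 2^26, the product of two bins stays
    // around 2^52, leaving headroom below 2^53 for the summation
    // inside the convolution.
    let capped = (1u64 << 26) as f64;
    assert!(capped * capped <= limit);
    println!("2^53 = {}", limit);
}
```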

@cla-bot cla-bot bot added the cla-signed label Feb 19, 2025

lemleb commented Feb 19, 2025

Hi @rishubn @cassiersg,

Apologies for the delay in the update—I encountered a bug that took longer than expected to resolve.

Thank you very much for your valuable feedback. I have tried to incorporate all of your suggestions into the implementation. Specifically, I’ve separated the scaled histogram implementation from the original, and you can now use it via the "scaledhist" method.

You were absolutely right about the convolution limit. After triggering some more edge cases, I observed that my previous implementation still returned negative values in some cases. I've corrected that, and the new value (2^26) should be suitable for our needs.

Regarding performance, the new implementation requires approximately twice as much memory, since it uses two histograms. The execution speed varies with the number of rescalings needed. For example, I measured the execution time of the rank_accuracy function in the two new test cases, once with the "scaledhist" method and once with the original "hist" method. In typical cases, where minimal rescaling is needed, the difference is negligible: the original implementation took about 0.478 seconds, while the scaled version took about 0.504 seconds. However, in the edge case where numerous rescalings are required to avoid faulty results, the difference is more significant: the original "hist" implementation took about 0.092 seconds, while the scaled version took about 0.601 seconds.

I hope this update better aligns with the project’s requirements, and I look forward to your feedback.

PS: I think ci.yml has to be updated.


cassiersg commented Feb 22, 2025

Hi!

Thanks for the new version, it looks great. I would like to make this the default rank estimation method (i.e., change the default in src/scalib/postprocessing/rankestimation.py), but I'm a bit worried by the performance cost. Would you have some time to look into it?

I do not understand how you have cases with "numerous scalings": the code is always running the scaling, no?

Overall, the scaling should have negligible cost, since it is O(n) (with a small constant) while the FFT is O(n*log n), so it's not obvious where the issue is coming from.

One possibility is the overhead of re-allocating new Vec's in rescale. That could be easily resolved by making rescale modify the histograms in-place (i.e. change to fn rescale(&mut self) and (in trait Histogram) fn convolve(&mut self, other: &mut Self) -> Self).

If that doesn't solve the issue, then I'm not sure what the problem is; we will probably need to profile the code to get a better idea. However, before doing that, it might be worth checking whether the slowdown could come from the scaling creating many denormal numbers (on which the CPU is very slow): https://easyperf.net/blog/2018/11/08/Using-denormal-floats-is-slow-how-to-detect-it
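The suggested in-place signature could look roughly like this (a sketch under the assumption of a simple scaler-plus-bins layout; a single bins vector stands in for the PR's LB/UB pair, and a naive convolution stands in for the FFT). It also pre-computes 1/factor so the hot loop multiplies instead of divides:

```rust
// Sketch of the suggested API change: rescale mutates the existing
// buffers instead of allocating new Vecs on every convolution.
trait Histogram: Sized {
    fn rescale(&mut self);
    fn convolve(&mut self, other: &mut Self) -> Self;
}

struct Hist {
    scaler: f64,
    bins: Vec<f64>,
}

const CONV_LIMIT: f64 = 67108864.0; // 2^26, per the updated PR

impl Histogram for Hist {
    fn rescale(&mut self) {
        let max = self.bins.iter().cloned().fold(0.0_f64, f64::max);
        if max > CONV_LIMIT {
            let factor = max / CONV_LIMIT;
            let inv = 1.0 / factor; // one division, then n multiplications
            self.scaler *= factor;
            for x in &mut self.bins {
                *x *= inv; // in-place, no new allocation
            }
        }
    }

    fn convolve(&mut self, other: &mut Self) -> Self {
        self.rescale();
        other.rescale();
        // naive convolution stands in for the FFT here
        let mut out = vec![0.0; self.bins.len() + other.bins.len() - 1];
        for (i, &x) in self.bins.iter().enumerate() {
            for (j, &y) in other.bins.iter().enumerate() {
                out[i + j] += x * y;
            }
        }
        Hist { scaler: self.scaler * other.scaler, bins: out }
    }
}

fn main() {
    let mut a = Hist { scaler: 1.0, bins: vec![1.0, 2.0] };
    let mut b = Hist { scaler: 1.0, bins: vec![3.0, 4.0] };
    let c = a.convolve(&mut b);
    println!("bins = {:?}, scaler = {}", c.bins, c.scaler);
}
```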

cassiersg (Contributor)

(CI is fixed, you can rebase on main)


lemleb commented Feb 23, 2025

Hi!

Thank you for your quick response!

I applied some changes (including the ones you suggested), and in my example the "hist" method returned the result in about 0.099 seconds and the scaled version in about 0.213 seconds, i.e., roughly twice as long. This matches expectations: maintaining both an upper and a lower histogram doubles the number of histograms to convolve.
I also found what caused the extreme performance degradation: the rank_accuracy function in lib.rs calls rank_inner an excessive number of times. Because the scaled solution tracks a lower and an upper bound, it can happen (in extreme edge cases) that the accuracy defined by the default acc_bit (the gap between min and max) is not met, so rank_inner is rerun many times. For the other ranking methods, the min-max range is an artificially defined margin whose extent we determine ourselves. In the scaled version I therefore increased the acc_bit value in the tests to compensate: since rounding the histograms up and down naturally results in slightly larger gaps between the minimum and maximum values, this adjustment ensures accurate and stable behavior.
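The re-run behavior described above can be modeled with a small stand-in (the function names echo the discussion, but the bodies are illustrative, not the actual lib.rs code):

```rust
/// Stand-in for rank_inner: returns (min, max) rank bounds that tighten
/// as the bin count grows. With the scaled method, floor/ceil rounding
/// widens the gap, modeled here by `rounding_penalty`.
fn rank_inner(nb_bin: usize, rounding_penalty: f64) -> (f64, f64) {
    let center = 1.0e15;
    let gap = rounding_penalty * 1.0e18 / nb_bin as f64;
    (center, center + gap)
}

/// Stand-in for the rank_accuracy loop: doubles nb_bin until the bounds
/// are acc_bit-tight, returning the number of rank_inner calls.
fn rank_accuracy(acc_bit: f64, rounding_penalty: f64) -> usize {
    let mut nb_bin = 512;
    let mut calls = 1;
    let (mut min, mut max) = rank_inner(nb_bin, rounding_penalty);
    while (max / min).log2() > acc_bit && nb_bin < 1 << 24 {
        nb_bin *= 2;
        let r = rank_inner(nb_bin, rounding_penalty);
        min = r.0;
        max = r.1;
        calls += 1;
    }
    calls
}

fn main() {
    // The wider the rounding-induced gap, the more rank_inner re-runs are
    // needed for the same acc_bit; raising acc_bit compensates.
    let plain = rank_accuracy(1.0, 1.0);
    let scaled = rank_accuracy(1.0, 16.0);
    assert!(scaled > plain);
    println!("plain: {} calls, scaled: {} calls", plain, scaled);
}
```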

Thank you for fixing the CI so fast, it works now.

cassiersg (Contributor) left a comment

Thanks for the investigation, that explains the results.

I think we are very close to merging this, I think the only parts missing are:

  • Ensuring that the other methods are not broken (see code comments), and adding a test for the "hist" method (it's fine to keep the NTL test disabled - although I'm also ok with enabling the NTL feature by default).
  • Add a bit of documentation in rankestimation.py: explain what hist vs scaledhist do inside the top-level comment (e.g. in the note about the algorithm), and add "scaledhist" to the documentation of rank_nbin.
  • Add an item in the CHANGELOG.

@@ -70,6 +91,9 @@ impl Histogram for F64Hist {
fn coefs_f64(&self) -> Vec<f64> {
self.state.clone()
}
fn coefs_f64_upper(&self) -> Vec<f64> {
unimplemented!()
Contributor

I think that this should return self.state.

@@ -101,6 +275,9 @@ impl Histogram for BigNumHist {
fn coefs_f64(&self) -> Vec<f64> {
self.coefs().collect()
}
fn coefs_f64_upper(&self) -> Vec<f64> {
unimplemented!()
Contributor

Also here, should be same as coefs_f64.


lemleb commented Feb 24, 2025

Hi,
thank you very much for your quick review! I've applied the requested updates. I also enabled NTL by default, as I believe it makes sense, but this caused the CI build-wheels job to fail; locally it works fine. Could you take a look and help identify the cause?


lemleb commented Feb 24, 2025

I believe the libntl-dev library is missing. However, when I attempt to install it in the CI, it shows as successfully installed, yet it still doesn't work.
