Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add IterableMatrix MD5 checksum. #83

Merged
merged 5 commits into from
Apr 19, 2024
Merged

Conversation

brgew
Copy link
Contributor

@brgew brgew commented Apr 4, 2024

Hi Ben,

I got around finally to adding an MD5 checksum calculation for an iterable matrix.

I want to thank you for your generous guidance of last November -- it made this addition fairly simple.

Some aspects of this addition include

  • I use md5 code that I lifted from CERN at https://root.cern.ch/doc/v610/md5_8inl_source.html. The file concatenated the original md5.h and md5.c file. I split them apart again and removed the main() function. The files preserve the copyright information as headers.
  • I #include <stdint.h> in md5.h, which I added as an #include in your src/matrix_utils.cpp. I also included cstdio because I am familiar with 'C'. I am a bit concerned about the stdint.h doing unexpected things with your C++ included code.
  • I named the R MD5 function iterable_matrix_md5sum(). It's a bit long...
  • I added a very brief test function to tests/testthat/test-matrix_utils.R.
  • I am unfamiliar with Rcpp so I am not confident that StringVector is the correct type to return but it appears to work.
  • I tested the md5 hash by adding C code to matrix_utils.cpp that dumped the stream of bytes to a file and I used the Linux md5sum utility to calculate the checksum from that file. I got the same string as returned by iterable_matrix_md5sum().
  • I included byte swapping code in case the architecture is big endian. I could not test it on a big endian machine but I checked that the byte swapping does what I want it to do.

I imagine that you will have reservations about some of this. I am keen to hear your feedback!

Ever grateful,
Brent

@bnprks
Copy link
Owner

bnprks commented Apr 12, 2024

Hi Brent, apologies for the delayed response, I'm just getting back from some travel. Overall this looks good but I do have some notes as anticipated :)

Notes:

  • Your current code just hashes the values for each non-zero element in the matrix, but does not track the row or columns for the matrix entries. I'd suggest incorporating it.row() and it.col() into the hashed data. E.g. matrix(c(1,0)) and matrix(c(0,1)) probably hash to the same output in your current code.
  • iterable_matrix_md5sum() is a bit of a long name. Maybe checksum() would be shorter without conflicting with common functions in other packages?
  • Was there a reason to make iterable_matrix_md5sum() an S4 generic rather than a standard function? I'd generally lean towards standard function in BPCells.
  • I'm assuming it's your intent with converting matrices to double prior to hashing to have integer and double matrices hash to the same result if their contents are identical but use different data types. Seems like a reasonable choice.
  • The md5 sources would go best under the src/lib folder, as that's where I've been keeping code copied from external sources.

@brgew
Copy link
Contributor Author

brgew commented Apr 15, 2024

Hi Ben,

Thank you for the feedback! I made the changes that you requested to the best of my ability. I added some fiddly casting code to try to keep the number of copies and function calls to a near minimum -- I hope that this does not create problems in the future. I added also technical details and an example to the checksum() documentation. Please feel free to bounce it back to me if it doesn't meet your expectations.

Oh, I am not confident in my github skills: I made the changes in my account without making a new pull request or syncing. I see the changes when I look at the code on your account. I hope that you see them too.

Thank you for considering this function.

Ever grateful,
Brent

@bnprks
Copy link
Owner

bnprks commented Apr 18, 2024

Hi Brent,

Thanks for making those adjustments. I've gone through a detailed review and made a couple edits to just tidy things up on my end, which were:

  • Re-add the md5.h and .c files which seem to have been accidentally deleted by your second commit
  • Add a call to Rcpp::checkUserInterrupt() occasionally in the loop so that users can quit out of a checksum calculation early
  • Added a note to the changelog in NEWS.md and re-generated the news + function documentation html accordingly

Thanks for submitting these changes, and I hope they're helpful for your downstream work!

One last pair of details I thought of that may or may not be important to you from the checksum perspective:

  • Do you care about capturing the row/col names in the checksum value?
  • Do you care about capturing the matrix dimensions in the checksum? (as it is, extending a matrix with additional fully-zero rows/cols won't adjust the checksum)

Given that initially you'll probably be the main/only user of this function I'm happy to include as much or as little as you like in the checksum value. From my perspective the code is ready to merge as-is now, but I'll wait to hear back from you on these two questions in case it's something you want to adjust before we publish this in the main branch.

I'd be happy to make the edits to include row/col names in the checksum if desired, otherwise the relevant C++ functios are rowNames , colNames, rows, and cols.

P.S. In terms of github workflow, all new commits you make on that same github branch show up to me, and by default I also have commit permissions on your branch to make edits. So a git pull on your branch should get you a copy of my edits and a git push publishes your edits to me.

@brgew
Copy link
Contributor Author

brgew commented Apr 19, 2024

Hi Ben,

Thank you for cleaning up after me!

My gut tells me it would be wise to heed your advice so I'll look at adding to the checksum calculation as you suggest.

I appreciate your patience and guidance, and I hope all goes well with you!

Ever grateful,
Brent

@brgew
Copy link
Contributor Author

brgew commented Apr 19, 2024

Hi Ben,

Again, I appreciate your patience with me.

I added and pushed your suggested additions. Also, I looked at the checksum run time on a 32k x 2.7m sparse matrix, which was about 30 seconds and much better than I feared. I admire your code and coding judgement!

Thank you.

Ever grateful,
Brent

@bnprks
Copy link
Owner

bnprks commented Apr 19, 2024

Hi Brent, thanks for making those additional changes -- looks great! I had also been concerned about speed initially, but similarly found fast performance (about ~2x slower than a simple colsum in my large matrix test).

I'll merge this in so it'll be available in the main branch

@bnprks bnprks merged commit 35b9283 into bnprks:main Apr 19, 2024
@brgew
Copy link
Contributor Author

brgew commented Apr 19, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants