Revise HashStore config hashstore.yaml for an algorithms list #36

Closed
doulikecookiedough opened this issue Jun 16, 2023 · 4 comments


doulikecookiedough commented Jun 16, 2023

My initial comment for discussion:

The default algorithm list and the other algorithm lists (for additional hex digests to include when storing an object) are bespoke to their respective Python/Java implementations, so I have not included them in the hashstore.yaml config file.

@mbjones' Feedback:

My intuition is that the hashstore.yaml file needs to be language agnostic, and be equally readable by both the Java and Python implementations. We do plan to have both libraries configure and read data from the same hashstore (one in Metacat, doing read/write, and one in MetaDIG, doing read-only).

Here are the supported algorithms from Python (hashlib library):

Default: "sha1", "sha256", "sha384", "sha512", "md5"
Other: "sha224", "sha3_224", "sha3_256", "sha3_384", "sha3_512", "blake2b", "blake2s"

Here are the supported algorithms from Java (MessageDigest class):

Default: "MD2", "MD5", "SHA-1", "SHA-256", "SHA-384", "SHA-512"
Other: "SHA-512/224", "SHA-512/256"
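
As a quick check of the Python side, the names above can be compared against `hashlib.algorithms_guaranteed`, the set of algorithms that hashlib promises on every platform. This is only a minimal sketch; the two variable names are illustrative and are not taken from the HashStore code.

```python
import hashlib

# Names copied from the Python lists above; the variable names are illustrative.
default_algos = ["sha1", "sha256", "sha384", "sha512", "md5"]
other_algos = [
    "sha224", "sha3_224", "sha3_256", "sha3_384", "sha3_512", "blake2b", "blake2s",
]

# Collect any name that hashlib does not guarantee on every platform.
missing = [
    name for name in default_algos + other_algos
    if name not in hashlib.algorithms_guaranteed
]
print(missing)
```

An empty `missing` list means every name can be passed to `hashlib.new()` without depending on the local OpenSSL build.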

Python (f-string) hashstore.yaml

# Default configuration variables for HashStore

############### Store Path ###############
# Default path for `FileHashStore` if no path is provided
store_path: "{store_path}"

############### Directory Structure ###############
# Desired amount of directories when sharding an object to form the permanent address
store_depth: {store_depth}  # WARNING: DO NOT CHANGE UNLESS SETTING UP NEW HASHSTORE
# Width of directories created when sharding an object to form the permanent address
store_width: {store_width}  # WARNING: DO NOT CHANGE UNLESS SETTING UP NEW HASHSTORE
# Example:
# Below, objects are shown listed in directories that are 3 levels deep (DIR_DEPTH=3),
# with each directory consisting of 2 characters (DIR_WIDTH=2).
#    /var/filehashstore/objects
#    ├── 7f
#    │   └── 5c
#    │       └── c1
#    │           └── 8f0b04e812a3b4c8f686ce34e6fec558804bf61e54b176742a7f6368d6

############### Format of the Metadata ###############
store_sysmeta_namespace: "{store_sysmeta_namespace}"

############### Hash Algorithms ###############
# Hash algorithm to use when calculating object's hex digest for the permanent address
store_algorithm: "{store_algorithm}"
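
To make the store_depth and store_width comments concrete, here is a minimal sketch of how a hex digest could be split into the directory layout shown in the example tree. The `shard` function name and signature are hypothetical and not the HashStore implementation; the digest is the one from the tree above, rejoined.

```python
from pathlib import Path
from typing import List


def shard(digest: str, depth: int, width: int) -> List[str]:
    """Split a hex digest into `depth` directory names of `width` characters,
    followed by the remainder of the digest as the file name."""
    if len(digest) <= depth * width:
        raise ValueError("digest is too short for the requested depth and width")
    tokens = [digest[i * width:(i + 1) * width] for i in range(depth)]
    tokens.append(digest[depth * width:])
    return tokens


# Digest from the directory tree in the config comments (depth 3, width 2).
digest = "7f5cc18f0b04e812a3b4c8f686ce34e6fec558804bf61e54b176742a7f6368d6"
relative_path = Path(*shard(digest, depth=3, width=2))
print(relative_path.as_posix())
```

Because the path is derived from depth and width, changing either value on an existing store would make previously stored objects unreachable, which is presumably why the config warns against it.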

I am opening this issue for now, and will comment after giving it some more thought.

@doulikecookiedough doulikecookiedough changed the title Revise HashStore config hashstore.yaml for an algorithms list. Revise HashStore config hashstore.yaml for an algorithms list Jun 16, 2023

doulikecookiedough commented Jun 19, 2023

With the current hashstore.yaml, both Metacat and MetaDIG should be able to configure HashStore with the same file (that is my intention, and it is coded as such). The keys are:

- store_path
- store_depth
- store_width
- store_sysmeta_namespace
- store_algorithm

However, after typing up a storm and trying to get my thoughts out, I now see why not basing the default list on hashstore.yaml could be problematic. If we were to change this list for whatever reason, we would need to update both the Java and Python libraries, which is more involved. If the default list were initialized from hashstore.yaml, only the config file creation method would need to change (simpler).

  • On that note, I am leaning towards adding the default list back into hashstore.yaml. My concern at this point is that hashlib (Python) and MessageDigest (Java) use different naming conventions when instantiating new hash objects:
    • Ex. sha256 and SHA-256
  • My Python implementation already has a method, clean_algorithm(), for tidying up an algorithm string (it replaces - with _ or removes it, and converts the letters to lower case), so I think the default algorithm list in hashstore.yaml should contain values that are immediately compatible with Java's MessageDigest class.
  • The other algorithm list varies quite a bit between Python and Java, so I think we have a couple of options here (if letting users know which algorithms are available is a requirement):
    • A new public API method like get_supported_algorithms(), which returns a map/dictionary of both the default and the other algorithms lists.
    • New keys specifically for python_other_algo_list and java_other_algo_list.
Summary of Proposed Changes:

  • (Required) Add default algorithms list back into hashstore.yaml
# The default algorithm list includes the hash algorithms calculated when storing an
# object to disk and returned to the caller after successful storage.
filehashstore_default_algo_list:
- "SHA-1"
- "SHA-256"
- "SHA-384"
- "SHA-512"
- "MD5"
  • (Optional) Add new public API method get_supported_algorithms() to HashStore interface OR new keys to represent the other algo list in Python and Java.
    • Dou: I am leaning towards adding new keys... no matter what, when and if we have to update hashstore.yaml, we're going to have to update both libraries. It will also keep the Public API simple/clean as it is.
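
For comparison, the get_supported_algorithms() option could be as small as the following sketch. The constant names and the "default"/"other" dictionary keys are hypothetical; the values are copied from the lists earlier in this thread.

```python
# Illustrative constants; the values are copied from the lists in this thread.
DEFAULT_ALGO_LIST = ["SHA-1", "SHA-256", "SHA-384", "SHA-512", "MD5"]
OTHER_ALGO_LIST = [
    "sha224", "sha3_224", "sha3_256", "sha3_384", "sha3_512", "blake2b", "blake2s",
]


def get_supported_algorithms() -> dict:
    """Return copies of both lists so callers cannot mutate the originals."""
    return {"default": list(DEFAULT_ALGO_LIST), "other": list(OTHER_ALGO_LIST)}


supported = get_supported_algorithms()
print(supported["default"])
```

Either option exposes the same information; the difference is whether it lives in the public API or in the config file.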

What do you think @mbjones?


mbjones commented Jun 21, 2023

DataONE used the Library of Congress vocabulary to standardize algorithm types. How about a simple utility method to look up the language-specific value from the DataONE controlled list? Something like this in Python, and another in Java?

def lookup_algo(algo):
    # Map DataONE controlled-vocabulary names to Python hashlib names
    d = {'MD5': 'md5', 'SHA-1': 'sha1', 'SHA-256': 'sha256'}
    return d[algo]

Maybe load it from config and provide for the case when the algo is not found? Should be just a few lines of code to translate.
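
A minimal sketch of the "algo not found" case follows. The table name `DATAONE_TO_HASHLIB` and the `translation` parameter are hypothetical; in practice the table could be built from values loaded from hashstore.yaml rather than hard-coded.

```python
# Hypothetical translation table; it could instead be loaded from hashstore.yaml.
DATAONE_TO_HASHLIB = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-384": "sha384",
    "SHA-512": "sha512",
}


def lookup_algo(algo, translation=DATAONE_TO_HASHLIB):
    """Translate a DataONE algorithm name to its hashlib name."""
    try:
        return translation[algo]
    except KeyError:
        supported = ", ".join(sorted(translation))
        raise ValueError(
            f"Unsupported algorithm {algo!r}; expected one of: {supported}"
        ) from None


print(lookup_algo("SHA-256"))
```

Raising a ValueError that lists the supported names gives callers a clearer message than a bare KeyError.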

@doulikecookiedough
Contributor Author

This is a great idea - thank you for the feedback! I have made the changes and it has been merged into develop.

The default_algo_list is now based on hashstore.yaml. Changes to this list should be made through the config. I have also left other_algo_list as is for now because it is specific to the Python implementation with no overlap with Java.

Python:

# Default algorithms, keyed by DataONE name
dataone_algo_translation = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-384": "sha384",
    "SHA-512": "sha512",
}
other_algo_list = [
    "sha224",
    "sha3_224",
    "sha3_256",
    "sha3_384",
    "sha3_512",
    "blake2b",
    "blake2s",
]
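
Given a translation table like the Python one above, the default digests can be computed in a single pass over an object's bytes. This is a sketch under assumptions: the `hex_digests` function, its parameters, and the chunk size are hypothetical and not part of HashStore.

```python
import hashlib
import io

# DataONE-style names mapped to hashlib names, as in the table above.
DEFAULT_ALGOS = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-384": "sha384",
    "SHA-512": "sha512",
}


def hex_digests(stream, algos=DEFAULT_ALGOS, chunk_size=8192):
    """Read a binary stream once and return {DataONE name: hex digest}."""
    hashers = {name: hashlib.new(py_name) for name, py_name in algos.items()}
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


digests = hex_digests(io.BytesIO(b"hello hashstore"))
print(digests["SHA-256"])
```

Keying the result by the DataONE name keeps the returned digests consistent with the values stored in hashstore.yaml.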

Java:

public static final String[] SUPPORTED_HASH_ALGORITHMS = { "MD2", "MD5", "SHA-1", "SHA-256", "SHA-384", "SHA-512",
            "SHA-512/224", "SHA-512/256" };

Here is what hashstore.yaml looks like, for quick reference (Python f-string):

hashstore_configuration_yaml = f"""
# Default configuration variables for HashStore

############### Store Path ###############
# Default path for `FileHashStore` if no path is provided
store_path: "{store_path}"

############### Directory Structure ###############
# Desired amount of directories when sharding an object to form the permanent address
store_depth: {store_depth}  # WARNING: DO NOT CHANGE UNLESS SETTING UP NEW HASHSTORE
# Width of directories created when sharding an object to form the permanent address
store_width: {store_width}  # WARNING: DO NOT CHANGE UNLESS SETTING UP NEW HASHSTORE
# Example:
# Below, objects are shown listed in directories that are 3 levels deep (DIR_DEPTH=3),
# with each directory consisting of 2 characters (DIR_WIDTH=2).
#    /var/filehashstore/objects
#    ├── 7f
#    │   └── 5c
#    │       └── c1
#    │           └── 8f0b04e812a3b4c8f686ce34e6fec558804bf61e54b176742a7f6368d6

############### Format of the Metadata ###############
# The default metadata format
store_metadata_namespace: "{store_metadata_namespace}"

############### Hash Algorithms ###############
# Hash algorithm to use when calculating object's hex digest for the permanent address
store_algorithm: "{store_algorithm}"
# Algorithm values supported by python hashlib 3.9.0+ for File Hash Store (FHS)
# The default algorithm list includes the hash algorithms calculated when storing an
# object to disk and returned to the caller after successful storage.
store_default_algo_list:
- "MD5"
- "SHA-1"
- "SHA-256"
- "SHA-384"
- "SHA-512"
"""

I will circle back and close this issue if there are no further comments.

@doulikecookiedough
Contributor Author

Further relevant links:
