
support multiple hash types and identifier formats #1

Open · cboettig opened this issue Feb 9, 2020 · 25 comments

@cboettig (Owner) commented Feb 9, 2020

When a URL is registered with hash-archive.org, the API computes 5 hashes: md5, sha1, sha256, sha384 and sha512.

The API then recognizes queries for the content using any of these 5 hashes. Additionally, the query can give the hash either base64-encoded or not, and the API also recognizes queries that use either the recommended hash://<algo>/ pattern or the ni:///<algo>; format suggested in RFC 6920 (though that part is a little clearer how to deal with, since it's just a matter of stripping off prefixes).

For me, this raises some questions about the design of a local hash look-up table. It's a lot easier to have a database (or perhaps better, simply a key-value store like redis?) look up all URIs associated with hash://sha256/xxxx than to also account for another 4 (or more?) hashes. (Likewise, when registering a URL, it's clearly more expensive to compute and store multiple hashes -- though maybe not significantly so?) I'm also not sure if there's a clever way to handle fast retrieval that is agnostic to whether the hash is base64-encoded or not (or uses the ni:/// or hash:// prefix, etc.).

From an aesthetic/conceptual point of view, having "the identifier" always be, say, the plain sha256 hash URI seems cleaner / easier to understand. But maybe I'm just being cowardly and there are obvious ways to register and query efficiently across all 10 forms? @jhpoelen, advice?
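One way to keep the lookup table simple while still supporting the other hashes is to treat the plain sha256 hex digest as the only real key and store the other four hashes as aliases pointing at it. A minimal Python sketch (the layout and names here are my own, purely illustrative, not the contentid implementation):

```python
import hashlib

# Hypothetical key-value layout (as one might use in redis): the
# canonical key is always the plain sha256 hex digest, and every other
# hash of the same content is an alias pointing at that key, so all
# lookups funnel through a single table.
store = {}    # canonical sha256 hex -> set of known URLs
aliases = {}  # "<algo>:<hex>" -> canonical sha256 hex

def register(url, content):
    sha256 = hashlib.sha256(content).hexdigest()
    store.setdefault(sha256, set()).add(url)
    for algo in ("md5", "sha1", "sha384", "sha512"):
        aliases[algo + ":" + hashlib.new(algo, content).hexdigest()] = sha256
    return sha256

def lookup(algo, digest_hex):
    digest_hex = digest_hex.lower()
    key = digest_hex if algo == "sha256" else aliases.get(algo + ":" + digest_hex)
    return store.get(key, set())

sha = register("https://example.org/data.csv", b"some bytes")
md5 = hashlib.md5(b"some bytes").hexdigest()
assert lookup("md5", md5) == lookup("sha256", sha)
```

The extra storage cost is one alias row per non-canonical hash, and every query is still a single key lookup.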

@jhpoelen (Collaborator) commented Feb 9, 2020

I agree that defaulting to hash://sha256/... seems cleaner and easier to understand.

I'd say, expose hash://sha256/ by default first and add options to calculate multiple hashes (e.g., md5, sha1, sha256, sha384, sha512). It is up to the registry implementations to decide which hashes to support.

Also, base64 is case sensitive. I like that hex-encoded hashes are case insensitive and more resilient against transcription / translation errors. Internally, registries can store hashes in whatever way they see fit.
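For what it's worth, hex and base64 are just two encodings of the same digest bytes, so a registry can store hex internally and still answer base64 queries by decoding first. A small Python illustration:

```python
import base64
import hashlib

digest = hashlib.sha256(b"example content").digest()

hex_form = digest.hex()  # 64 chars; can be compared case-insensitively
b64_form = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()  # 43 chars; case matters

# Both encodings carry the identical 32 digest bytes, so decoding a
# base64 query recovers the same key the registry stores in hex.
decoded = base64.urlsafe_b64decode(b64_form + "=")
assert decoded == digest
assert bytes.fromhex(hex_form.upper()) == digest  # hex survives case folding
```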

As far as ni: vs hash: goes: at the cost of two extra characters, the hash: prefix gives me instant recognition that the following path is some hash. The ni: prefix caused a response in me along the lines of: huh, what's that?

@cboettig (Owner, Author) commented

Resolved, see #8

@cboettig cboettig reopened this Mar 25, 2020
@cboettig cboettig changed the title How many hash types? support multiple hash types and identifier formats Mar 25, 2020
@cboettig (Owner, Author) commented

Just revisiting this in broader scope. I still agree we should maintain the current hash://sha256/ default behaviour, as I think it can be somewhat confusing and off-putting for a new user to be confronted with the myriad of different hashes, as well as the myriad of different 'standards' for associating a hash algorithm with the hash itself in the identifier.

However, that doesn't mean under the hood we cannot be a bit more flexible. Two specific proposals here:

  1. The local registry could compute and store all 5 hashes mentioned above. I think this would be a good default, but with the option to turn it off (or possibly to set your own subset, though I'm not sure if we should let users not compute sha256).

  2. We could have a function to convert between the 6 formats recognized by hash-archive.org:

It seems it shouldn't be too hard to have a little parser which could understand each of these formats and convert it to hash-uri format.
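A minimal sketch of such a parser in Python (the patterns below are my assumptions about each format, not the contentid implementation):

```python
import base64
import re

def _b64decode(s):
    # Tolerate missing padding and both base64 alphabets.
    s = s.replace("-", "+").replace("_", "/")
    return base64.b64decode(s + "=" * (-len(s) % 4))

def as_hash_uri(identifier):
    """Best-effort conversion of several identifier formats into
    hash://<algo>/<hex>. A sketch only."""
    id_ = identifier.strip()

    m = re.match(r"^hash://(\w+)/([0-9a-fA-F]+)$", id_)
    if m:  # already a hash URI; just lowercase the hex
        return "hash://{}/{}".format(m.group(1), m.group(2).lower())

    m = re.match(r"^magnet:\?xt=urn:(\w+):([0-9a-fA-F]+)$", id_)
    if m:  # magnet-style urn carrying a hex digest
        return "hash://{}/{}".format(m.group(1), m.group(2).lower())

    m = re.match(r"^ni:///([\w-]+);([A-Za-z0-9_-]+)$", id_)
    if m:  # RFC 6920 named-information URI (base64url digest)
        algo = m.group(1).replace("-", "")  # e.g. sha-256 -> sha256
        return "hash://{}/{}".format(algo, _b64decode(m.group(2)).hex())

    m = re.match(r"^([a-z0-9]+)-([A-Za-z0-9+/]+=*)$", id_)
    if m:  # subresource-integrity style (base64 digest)
        return "hash://{}/{}".format(m.group(1), _b64decode(m.group(2)).hex())

    raise ValueError("unrecognized identifier: " + identifier)

vostok = "9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
assert as_hash_uri("magnet:?xt=urn:sha256:" + vostok) == "hash://sha256/" + vostok
```

Each branch just strips or decodes a prefix, which supports the point above that these conversions are cheap.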

@mbjones commented Mar 26, 2020

Two quick comments on hash://<algo>/ versus ni:///<algo>.

  1. hash is far more readable. ni is an accepted standard from the IETF. If you want the hash scheme to be used and have uptake, I think it should be standardized through an IETF RFC. That's a lot of work. During the review, someone will undoubtedly point out that ni is already accepted on the standards track. YMMV.

  2. More substantively, I think it's critical to standardize the hash algorithm names. ni provides an IANA registry for hash names. This has limited utility due to the RFC policy that "the Designated Expert SHOULD NOT accept additions where the underlying hash function (with no truncation) is considered weak for collisions." That excludes having an official name for algorithms in widespread use like MD5. I've looked for other sources of this. DataONE recognizes the Library of Congress Cryptographic Algorithm Vocabulary in our ChecksumAlgorithm metadata element, and specifically uses the madsrdf:authoritativeLabel field as the name of the algorithm. Finally, the IPFS world has been standardizing everything under the sun, including a list of checksum algorithm names and codes. This is a more complete list, but it uses different algorithm names for different versions of algorithms (e.g., sha2-256 versus sha3-256), which may not match our use case here. I'm not a fan of their binary encoding of these strings, but they probably have the most complete list. It would be great to not create yet another algorithm vocabulary. Can we pick one of these to follow?

@cboettig (Owner, Author) commented

Thanks @mbjones ! Yeah, I keep re-worrying about this. I definitely agree that it would be much nicer to have one standard. And ni being an RFC is certainly a plus, if unfortunately less readable.

I believe SSB, Subresource integrity, and MultiHashes aren't valid URIs (right?), which seems like an important property for an identifier to have (e.g. so we can use it in metadata fields that require URI-type objects).

It sounds like you would lean towards ni:/// then as the preferred standard? I'll note it's also being recommended over in @mfenner's issue at https://github.com/schemaorg/schemaorg/issues/1831#issuecomment-369970941 . Of course, it doesn't appear to be using the IPFS strings to refer to algorithms either. (Though it's also not clear to me whether IPFS is a recognized IETF RFC standard like ni:/// is.)

@btrask has these notes on compatibility at https://hash-archive.org/#compat,

MultiHashes are not compatible with IPFS, and Magnet URIs are not compatible with BitTorrent, due to the way each protocol computes its hashes.

I'm not entirely clear why those remarks are specific to MultiHashes and Magnet URIs respectively, since my read was that neither IPFS nor BitTorrent hashes would be compatible with any other scheme, since both salt their hashes. (I'm also confused that MultiHashes are not compatible with IPFS -- sounds like they were developed by the IPFS team in the first place?) @btrask's comments also point out

However, you can submit URLs from the ipfs.io gateway to bridge between them.

That sounds like it might be worthwhile -- can I post URLs to an IPFS gateway similarly to the way we post them to hash-archive.org? I don't quite see how to do this (i.e. is there a REST API?).

I'm a bit unclear on why we would standardize the name of the hash function independently of standardizing the whole identifier. i.e. if we agreed to adopt ni:/// as the canonical format, doesn't that do that for us?

@jhpoelen (Collaborator) commented

I definitely agree that it would be much nicer to have one standard. And ni being an RFC is certainly a plus, if unfortunately less readable.

The ni:// RFC has been around since 2013 and has not gained traction as far as I can tell. Also, in my mind, the readability or friendliness of (e.g., ni:///sha-256;UyaQV-Ev4rdLoHyJJWCi11OHfrYv9E1aGQAlMO2X_-Q) is not the only issue - I am worried that these hash encodings with case-sensitive and non-alphanumeric codes will not survive in publication texts.

I believe SSB, Subresource integrity, and MultiHashes aren't valid URIs (right?), which seems like an important property for an identifier to have (e.g. so we can use it in metadata fields that require URI-type objects).

Subresource integrity (e.g., sha384-MBO5IDfYaE6c6Aao94oZrIOiC6CGiSN2n4QUbHNPhzk5Xhm0djZLQqTpL0HzTUxk) and most other hash strings are valid URIs.

That sounds like it might be worthwhile -- can I post URLs to a IPFS gateway similarly to the way we post them the hash-archive.org? I don't quite see how to do this (i.e. is there a REST API?).

IPFS hashes have semantics encoded in them - if I understand it correctly, IPFS hashes refer to file structures in addition to the content they contain. This is why their hashes do not translate well into vanilla content hashes. So, you wouldn't just have to keep an IPFS gateway up and running, but you'd also have to settle on how to translate a single content object into an IPFS container (a directory with named files).

Adopting a common vocabulary of algorithm terms/labels would be helpful. I feel that most (e.g., dat/ssb/ipfs) have adopted such labels.

Lastly, as long as schemes are used consistently and the hashes are easy to compute, conversions can be applied to translate one format into the other.

Ideally, standards are a reflection of community adoption of a commonly used idea. Thinking about a specific RFC (or updating existing ones) will hopefully become relevant once we see widespread adoption of content hashes as primary identifiers for datasets. In my mind, the combination of URIs and a hash algorithm vocab can go a long way.

I am open to be convinced though . . .

@mbjones commented Mar 26, 2020

What vocabulary is contentid using for the hash algorithm names? Where can I see the list? Does it differ from the other lists that I cited?

@jhpoelen (Collaborator) commented Mar 26, 2020

The code in http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256.html (i.e. sha256) seems equivalent to algo name sha256 used in contentid to refer to the same algorithm.

A translation table with references to http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256.html and friends should be easy to add. In fact, that'd be a useful addition and would re-use an already existing definition out there.

@cboettig (Owner, Author) commented

👍

contentid does:

    hash <- openssl::sha256(con)
    paste0("hash://sha256/", as.character(hash))

This states that the algo name sha256 in the hash URI spec corresponds to the algorithm SHA-256 in openssl.

It would have been great if the hash uri draft specification was more explicit on this point.
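One way to confirm that "sha256" names the same algorithm across tools is to compare against a well-known digest; here in Python's hashlib (which is typically backed by openssl), using the SHA-256 of the empty string:

```python
import hashlib

# SHA-256 of the empty string is a well-known constant, so any
# conforming implementation of the algorithm must agree with it:
empty_sha256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
assert hashlib.sha256(b"").hexdigest() == empty_sha256

# The hash URI is then just the prefix plus the hex digest:
uri = "hash://sha256/" + hashlib.sha256(b"").hexdigest()
assert uri == "hash://sha256/" + empty_sha256
```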

I totally agree with the points about readability, though the fact that base64 encoded identifiers are also shorter is at least a plus to consider. (though having / as a valid possible character in the identifier is very annoying to me...)

I worry too about the uptake issues with ni:///, though it's not clear that hash-uri has better uptake either (or is being actively maintained as anything beyond a draft spec last updated 4 years ago)?

Maybe more should be said about the other options for a moment?

  • Magnet is the other one above that is hex-encoded and widely recognized (not sure if there's a published standard, and this may also create confusion since these magnet URIs won't resolve in a browser). It does have a rather cumbersome prefix.
  • Subresource integrity is a W3C standard, though it's not clear it was meant to be used outside the integrity attribute of HTML elements. I also don't think it's a valid URI, since it has no scheme: (?)
  • Multihashes --- well, base58 encoding isn't available in openssl, and I don't think it's a URI....

@jhpoelen (Collaborator) commented

just to have some examples -

  • magnet: magnet:?xt=urn:btih:c12fe1c06bba254a9dc9f519b335aa7c1367a88a (https://en.wikipedia.org/wiki/Magnet_URI_scheme)
  • subresource integrity: sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC (https://developer.mozilla.org/en-US/docs/Web/Security/Subresource_Integrity)
  • multihash: 12209cbc07c3f991725836a3aa2a581ca2029198aa420b9d99bc0e131d9f3e2cbe47 (https://github.com/multiformats/multihash)

and my take on them:

Magnet uses hashes of "[...] the hex-encoded SHA-1 hash (btih, "BitTorrent info-hash") of the torrent file info section in question [...]". This includes filenames etc. So, not a content hash.

Subresource integrity uses base64 encoding - nice for computers, not so nice for humans and traditional media. Also, it doesn't fit in the "url" reference field often used in publications.

The target audience for multihash is computers, not humans. Also, by encoding the hash algorithm as a hex code, it obfuscates the actual meaning and requires some additional resource to figure out what is going on. A multihash also does not fit well in traditional reference boxes and would be hard to recognize in traditional publications.

Curious to hear your thoughts.

cboettig added a commit that referenced this issue Mar 26, 2020
This adds support to recognize different identifier formats in functions that take a content identifier as an argument (`retrieve`, `resolve`, `query_sources`).

This also adds potential columns in the local tsv registry to compute multiple hashes, but at this time we have not implemented multiple hash calculations.  Performance issues should be mitigated by streaming, #38
@cboettig cboettig mentioned this issue Mar 26, 2020
@cboettig (Owner, Author) commented

Quickly, re Magnet: I'm going by the magnet syntax on hash-archive.org, see https://hash-archive.org/history/https://zenodo.org/record/3678928/files/vostok.icecore.co2, where

magnet:?xt=urn:sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37

is the same as

hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37

i.e. it would be using the same content hash. I haven't read the actual magnet spec, so maybe hash-archive is wrongly asserting this is a valid magnet link anyway.

I agree with you on things that aren't in URI format not looking like familiar identifiers, and that being a problem for going in a references field (or in a metadata field that takes a URI). The SRI format does make a nice internal format for computers -- I also notice it's the format the hash-archive API returns all the hashes in (it probably saves a bit of space relative to storing hash URIs directly on disk? Maybe we should consider that for contentid too).

I also note a little weirdness in the base64 hash of the ni:// object at https://hash-archive.org/history/https://zenodo.org/record/3678928/files/vostok.icecore.co2: it lacks the trailing (optional?) = relative to the other hashes. That looks like it corresponds to leaving off the 7e37 in the hex version, which also seems optional: hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37 and hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff64628 (i.e. with and without 7e37 on the end) both resolve successfully to the same content on hash-archive.org....
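A guess at how a missing = could correspond to dropping trailing hex characters: a decoder that only consumes complete 4-character base64 groups silently discards the final partial group, losing the last two bytes (four hex characters). A Python sketch (this is speculation about hash-archive's behaviour, not verified):

```python
import base64
import hashlib

digest = hashlib.sha256(b"demo").digest()  # 32 bytes -> 64 hex chars
b64 = base64.b64encode(digest).decode()    # 44 chars, ends with "="
unpadded = b64.rstrip("=")                 # 43 chars

# A strict decoder that only takes whole 4-char groups drops the tail:
whole_groups = unpadded[: len(unpadded) // 4 * 4]  # 40 chars
partial = base64.b64decode(whole_groups)           # decodes to 30 bytes

assert len(partial) == 30
assert partial.hex() == digest.hex()[:60]  # the last 4 hex chars are lost
```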

Also, Matt noted above that while hash-archive.org reports ni://-based ids with md5, md5 is not actually a valid algorithm in the Named Information standard, so these are somewhat chimeric.

@jhpoelen (Collaborator) commented

I agree with you on things that aren't in the URI format not looking like familiar identifiers, and that being a problem for going in a references field

Please note that I was referring to the familiar URL format, not URI.

The test below passes, because each of the four strings is a valid URI.

import java.net.URI;   // the java.net URI parser (RFC 2396-based)
import org.junit.Test; // JUnit 4

@Test
public void validURI() {
    URI.create("sha384-MBO5IDfYaE6c6Aao94oZrIOiC6CGiSN2n4QUbHNPhzk5Xhm0djZLQqTpL0HzTUxk");
    URI.create("magnet:?xt=urn:sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37");
    URI.create("sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC");
    URI.create("ni:///sha-256;UyaQV-Ev4rdLoHyJJWCi11OHfrYv9E1aGQAlMO2X_-Q");
}

@cboettig (Owner, Author) commented

@jhpoelen thanks for clarifying! I'm quite confused by https://tools.ietf.org/html/rfc3986#section-3.1

The generic URI syntax consists of a hierarchical sequence of
   components referred to as the scheme, authority, path, query, and
   fragment.

      URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

I read that as the : being required for something to be a URI. I'm with you re magnet and ni://, but SRI uses a -. (I'm not sure how to run the test above.)

@jhpoelen (Collaborator) commented

Looks like I based my assumption on https://tools.ietf.org/html/rfc2396 . This is the 1998 spec that is still in use by various platforms (see e.g., https://docs.oracle.com/javase/7/docs/api/java/net/URI.html). The 2005 spec you reference, https://tools.ietf.org/html/rfc3986 , obsoletes the old spec. Thanks for pointing this out.

It appears I was unaware of the newer URI spec, probably because the old URI and the IRI (https://tools.ietf.org/html/rfc3987, used a bunch in RDF land) share similar qualities (e.g., scheme optional).

@cboettig (Owner, Author) commented

@jhpoelen thanks for the clarification, I was actually unaware of the old 2396 spec! (Is that still in use even in current Java?) I believe IRIs per RFC 3987, which you link, must also conform to the newer RFC 3986 spec:

A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
sequence of characters chosen from a limited subset of the repertoire
of US-ASCII [ASCII] characters.

Re Magnet: looks like hash-archive's use of this format is valid, at least for sha1 and md5 hashes (wikipedia); however, most browsers / torrent clients probably only understand the bittorrent urn prefix, xt=urn:btih:, and not, say, the sha1 prefix xt=urn:sha1:. It's not clear this has ever been set out as a standard, though at least magnet is a registered IANA prefix, which also documents the sha1 use: https://www.iana.org/assignments/uri-schemes/prov/magnet . (hash is not a registered prefix.)

contentid now understands magnet ids for this reason, but I think it's not a good first choice.

For now, I'm glad we can translate from these other formats (particularly seems good to be able to recognize ni://). Re the base64 issues with printed paper, I agree that's an issue, though the mere length of sha256 in base32 is also a bit of a challenge on text as well.

@jhpoelen (Collaborator) commented

@cboettig thanks for checking my assumptions and statements re: magnet URIs and IRIs. I'll have to go back and see why my current usage of URIs/IRIs does not trigger any alarms in the well-used open source libraries I rely on.

I agree that the length of the hex-encoded hashes can be an issue in print publications. But . . . note that URLs in publications have similar issues. Perhaps I'll take inspiration from URL shortener services like tinyurl.com and bit.ly.

@cboettig (Owner, Author) commented

Speaking of length, do you have any insight on the rules for truncation? For instance, the ni:// RFC comments that support for truncated hashes is optional. It looks like hash-archive.org supports truncation of any of its hashes, e.g. I can just delete stuff off the trailing end of a hash and it still resolves: https://hash-archive.org/sources/hash://sha256/33f94de39 (in this case, uniquely; otherwise, if I truncate too much, it resolves to all matches for the truncation -- which is still pretty useful!)

I guess, the way hashing algorithms work, I am just increasing my chances of collision by doing this? I suppose that, since hexadecimal has a smaller character set than base64, truncating to a given length would be worse for the hexadecimal hashes? RFC 6920 mentions that while this is obviously not good for cryptographic purposes, truncation may be very useful in "naming things", which may fit the use case of publishing in paper. How would you feel about a recommendation for using truncated hashes in publishing? Would there be an 'acceptable length' or is this just a bad idea? (At the very least, it's nice to know that if the hash gets cut off at the margin of a paper, it is probably less of a problem than it is for URLs!)

Speaking of readability, did you see that the ni:// RFC has a section on "human-speakable identifiers", https://tools.ietf.org/html/rfc6920#section-7 ... I'm not familiar enough with RFCs to know if this is actually part of the standard or just a joke?

We should possibly support truncated queries in our local contenturi registry as well?
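The collision risk from truncation can be estimated with the usual birthday bound: with N items and k hexadecimal characters (16^k possible prefixes), the chance that any two items share a truncated prefix is roughly 1 - exp(-N(N-1)/(2·16^k)). A quick Python sketch:

```python
import math

def collision_probability(n_items, hex_chars):
    """Approximate birthday-bound chance that any two of n_items share
    the same truncated hash prefix of hex_chars hexadecimal characters."""
    space = 16 ** hex_chars
    # P(collision) ~ 1 - exp(-n(n-1) / (2 * space))
    return 1 - math.exp(-n_items * (n_items - 1) / (2 * space))

# Truncating to 16 hex chars (64 bits) keeps a million-item registry safe:
assert collision_probability(1_000_000, 16) < 1e-6
# Truncating to 8 hex chars (32 bits) does not:
assert collision_probability(1_000_000, 8) > 0.99
```

So an 'acceptable length' depends directly on how many items the registry is expected to hold; for publishing, somewhere in the 16-20 hex character range looks comfortable by this estimate.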

@jhpoelen (Collaborator) commented

re: relative IRIs - just for the record:

from https://www.ietf.org/rfc/rfc2396.txt

A. Collected BNF for URI

      URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
      absoluteURI   = scheme ":" ( hier_part | opaque_part )
      relativeURI   = ( net_path | abs_path | rel_path ) [ "?" query ]

      hier_part     = ( net_path | abs_path ) [ "?" query ]
      opaque_part   = uric_no_slash *uric

      uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
                      "&" | "=" | "+" | "$" | ","

      net_path      = "//" authority [ abs_path ]
      abs_path      = "/"  path_segments
      rel_path      = rel_segment [ abs_path ]

      rel_segment   = 1*( unreserved | escaped |
                          ";" | "@" | "&" | "=" | "+" | "$" | "," )

Also, commonly used RDF libraries (e.g., https://commons.apache.org/proper/commons-rdf/) and triples stores (e.g., https://jena.apache.org/) do not complain when relative URIs like <70395948-8a1b-4a7a-9e9b-e22e7645dde7> are used.

However, other RDF libraries (e.g., https://www.npmjs.com/package/n3) do seem to use the updated, more recent RFC. For instance:

var test = require('tape');
var n3 = require('n3');
var df = require('n3').DataFactory;

test('success on creating nquad with relative IRI', function(t) {
    t.plan(1);
    const myQuad = df.quad(
        df.namedNode('77f79ac6-d88e-4d51-b853-35aae8aca8af'),
        df.namedNode('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
        df.namedNode('http://www.w3.org/ns/prov#Activity'),
        df.defaultGraph());

    t.equal(myQuad.subject.value, '77f79ac6-d88e-4d51-b853-35aae8aca8af');
});

test('fail to parse nquad with relative IRI', function(t) {
    t.plan(1);
    console.log(n3);
    var parser = new n3.Parser({ format: 'nquads' });
    parser.parse("<77f79ac6-d88e-4d51-b853-35aae8aca8af> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> .", function (error, quad, prefixes) { 
        t.notOk(quad);
    });

});

passes OK. Note that validation in this library happens when parsing, but not when constructing programmatically.

Anyways . . . this was fun: I realize that replacing a URI spec with a more restrictive one is quite an undertaking.

@jhpoelen (Collaborator) commented

This RFC discussion around URIs/IRIs reminds me of the implicit decision by Hadley to let readr::read_tsv not conform to IANA's tab-separated-values spec because of backwards-compatibility concerns (tidyverse/readr#844), and the GBIF-related discussion about registration of the application/dwca mime type (tdwg/dwc#195).

@jhpoelen (Collaborator) commented Apr 2, 2020

Btw - in working towards complying with https://tools.ietf.org/html/rfc3986 , I noticed that the doi: prefix is not registered with IANA (see https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml ). Perhaps these IANA registry entries are sort of like ideas and patents - you need money to file a patent, but you don't need money to have an idea.

@cboettig (Owner, Author) commented Apr 2, 2020

@jhpoelen good point! Note that the Crossref guidelines also state that a DOI should always be written with the https://doi.org/ prefix, and never with doi: alone. (Though that seems like a fragile way to avoid having the identifier written as http://doi.org or the older http://dx.doi.org, etc.) I believe DOI is an ISO standard (and, speaking of needing money, iirc you need to pay even to read an ISO standards document...).

That said, I think there is a valid critique that the hash:// format is not precisely specified, which increases the chance that different tools will implement it differently and be incompatible. In particular, I think it would be nice if it defined the conventions for how the algorithm should be named, as @mbjones points out. Otherwise I may well think that hash://sha2-256/916255b2b73680595dcb22b30991a757dd223208473fb4fbe90405757bc07953 is a valid hash uri format, and be surprised to see it register as a bad request. Likewise, the standard doesn't say anything about truncated identifiers, which is addressed explicitly in the ni:// RFC.
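A translation table between these naming vocabularies is straightforward to sketch; the alias list below is purely illustrative, mixing spellings seen in the ni, multihash, and openssl styles (it is not any official vocabulary):

```python
# Illustrative alias table mapping hash algorithm names from several
# vocabularies onto one canonical label; the specific aliases are
# assumptions for this sketch, not a standardized list.
ALGO_ALIASES = {
    "sha256": "sha256",
    "sha-256": "sha256",   # ni / RFC 6920 style
    "sha2-256": "sha256",  # multihash style
    "sha1": "sha1",
    "sha-1": "sha1",
    "sha512": "sha512",
    "sha2-512": "sha512",
    "md5": "md5",
}

def normalize_algo(name):
    try:
        return ALGO_ALIASES[name.strip().lower()]
    except KeyError:
        raise ValueError("unknown hash algorithm name: " + name)

assert normalize_algo("SHA2-256") == "sha256"
assert normalize_algo("sha-256") == "sha256"
```

With something like this in the parsing path, hash://sha2-256/916255... could be accepted and canonicalized rather than rejected as a bad request.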

Nevertheless, I still agree the hash uri format is quite compelling in being pretty nearly self-describing in a way that is less true of most of the other formats, in that it seems plausible some future archivist could figure out what it means without access to an official standards document defining the spec.

And I can't help wondering if the details of the ni:// spec have been a barrier to its adoption (e.g. hash-archive.org doesn't implement that spec in a fully compliant way, e.g. accepting the percent-encodings, and I struggle with its alternate re-encoding of the base64).

@mbjones commented Apr 2, 2020

@cboettig @jhpoelen Fully agree with you both on all of these points.

Regarding registering DOIs as URIs, this is from the DOI FAQ:

11. DOI & URI: how does the DOI system work with web URI technologies?

DOI is a registered URI within the info-URI namespace (IETF RFC 4452, the "info" URI Scheme for Information Assets with Identifiers in Public Namespaces). See the DOI Handbook, 2 Numbering and 3 Resolution, for more information.

The original intent was that DOIs would be expressed as info: URIs, but these never really took hold. The DOI foundation seems to change their recommended format for displaying and linking to DOIs every couple of years. The URL-space variations are extensive -- I have counted over 22 http variants for their resolution URIs that are all valid for a single DOI, and of course there are also doi:, urn:doi:, and info:doi: variants as well.

There is a good discussion of how to represent over 700 identifier types in the Science on schema.org Dataset Guide. They mostly recommend following the identifiers.org guidelines and vocabulary.

@mfenner commented Apr 2, 2020

When DOIs were "invented" in the late 1990s, the idea was that they would become yet another protocol, with native support in browsers, etc. That of course never happened, and the HTTP protocol took over almost everything. In the last few years the DOI Foundation has changed its recommendations, aligning better with common practices. Almost all DOIs are resolved via the HTTP proxy at https://doi.org. dx.doi.org still exists but is deprecated, and http://doi.org or http in general should no longer be used. DOIs are case-insensitive, but are typically displayed in lowercase to align with conventions for URIs.

Expressing a persistent identifier as URI aligns with best web practices, and is something Crossref and DataCite recommend in their DOI display guidelines, e.g. https://support.datacite.org/docs/datacite-doi-display-guidelines.

@cboettig (Owner, Author) commented Apr 2, 2020

Thanks @mbjones and @mfenner for this background, super informative! (and great to hear from you Martin! as you may have seen, this discussion was partly inspired by your question about fingerprints in schema.org at https://github.com/schemaorg/schemaorg/issues/1831#issuecomment-369970941).

I suppose one take-away from the comparison to DOIs, (which certainly would seem to be our gold standard case for a widely recognized identifier) is that getting this right and consistent in standards and actual implementations is hard.

I'm not sure if that would be wise, since it places undue emphasis on https://hash-archive.org as a canonical resolver of sorts, which it was not meant to be and which goes against the spirit here (though notably it lists multiple locations that content resolves to, in this case, the two mirrors it has in DataONE).

  • A second lesson I might read into the DOI case study is that there will always be variation in practical implementations (e.g. the 22 variants and continual change Matt references), and so it may be better for a tool to simply handle as much of the variation as possible rather than hew strictly to the one true way. For contentid, that may mean following hash-archive.org example of being able to understand multiple formats.

  • The third lesson I might take from this is that perhaps hash://sha256/xx isn't a bad choice. I would hazard that simplicity/clarity to humans proved more important than standards in DOI use (e.g. the lowercase letter conventions, which is also true of hash:// but not ni://; and it not really mattering that doi:// wasn't a protocol, or that doi: wasn't registered either, while the registered prefix info:doi: seems to have had less uptake). At least humans can recognize a DOI in these different formats and still resolve them successfully, and tools can work around the variations (to a degree).

hash:// is a (lowercase) URI, albeit with an unregistered prefix that is not understood by browsers. We can contrast that with ni://, a registered prefix but still a protocol browsers don't understand, and with magnet: URIs, a registered prefix that isn't a protocol but is implemented in most browsers (though only for the bitTorrent subtype, btih, and not the sha subtypes, and thus of little use to us here).

Lastly, as important as it is to have a consistent format, I think the principle of unsalted content hashes as identifiers is really the heart of this. The identifier isn't really the URI string at all; the identifier is the content itself. The hash gives us a concise and reliable way to refer to that content, even if we have variation in precise syntax.

@jhpoelen (Collaborator) commented Apr 2, 2020

I think the principle of unsalted content hashes as identifiers is really the heart of this. At the heart of it, the identifier isn't really the URI string at all, the identifier is the content itself. The hash gives us a concise and reliable way to refer to that content, even if we have variation in precise syntax.

👍

As long as we can easily translate, compute, or understand, content-based identifiers relation to their content (and, by inference, to each other), I think we are in a good starting position to reliably index content and their (last known) locations at scale.

I am glad to see that pragmatism and realism are taking the stage along with the sacred RFCs. I feel that RFCs like https://tools.ietf.org/html/rfc4452 (the info: scheme), or more generally https://tools.ietf.org/html/rfc3986 (the 2005 URI spec), come from a good place, and describe more of an (overly complicated?) vision than a reality. But hey, I am aware I lean perhaps a bit too much towards pragmatism . . .
