support multiple hash types and identifier formats #1
I agree that defaulting to hash://sha256/... seems cleaner and easier to understand. I'd say, expose the hash://sha256/ by default first and add options to calculate multiple (e.g., md5, sha1, sha256, sha384, sha512). It is up to the registry implementations to decide which hashes to support. Also, base64 is case sensitive. I like the idea that hex encoded hashes are case insensitive and more resilient against transcription / translation errors. Internally, registries can store hashes in the way they see fit. As far as …
Resolved, see #8
Just revisiting this in broader scope. I still agree we should maintain the current `hash://sha256/` format as the default. However, that doesn't mean under the hood we cannot be a bit more flexible. Two specific proposals here:
It seems it shouldn't be too hard to have a little parser which could understand each of these formats and convert them to the hash-uri format.
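As a rough sketch (not the contentid implementation; `toHashUri` is a hypothetical name), such a parser might look like the following in node, assuming hex digests in the magnet and hash-uri forms and base64 digests in the SRI and ni forms:

```js
// Sketch: normalize several identifier formats to the hash-uri format.
function toHashUri(id) {
  let m;
  // hash uri, e.g. hash://sha256/9412...e37 (hex is case insensitive)
  if ((m = id.match(/^hash:\/\/(\w+)\/([0-9a-fA-F]+)$/)))
    return `hash://${m[1].toLowerCase()}/${m[2].toLowerCase()}`;
  // magnet, e.g. magnet:?xt=urn:sha256:9412...e37
  if ((m = id.match(/^magnet:\?xt=urn:(\w+):([0-9a-fA-F]+)$/)))
    return `hash://${m[1].toLowerCase()}/${m[2].toLowerCase()}`;
  // subresource integrity, e.g. sha256-<base64 digest>
  if ((m = id.match(/^(\w+)-([A-Za-z0-9+/]+={0,2})$/)))
    return `hash://${m[1]}/${Buffer.from(m[2], 'base64').toString('hex')}`;
  // RFC 6920 named information, e.g. ni:///sha-256;<base64url digest>
  if ((m = id.match(/^ni:\/\/\/(\w+)-(\d+);([A-Za-z0-9_-]+)$/))) {
    const b64 = m[3].replace(/-/g, '+').replace(/_/g, '/'); // base64url -> base64
    return `hash://${m[1]}${m[2]}/${Buffer.from(b64, 'base64').toString('hex')}`;
  }
  throw new Error('unrecognized identifier format: ' + id);
}
```

With something like this, `toHashUri('magnet:?xt=urn:sha256:9412...')` and the corresponding `hash://sha256/9412...` input would both normalize to the same identifier.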
Two quick comments on this: …
Thanks @mbjones ! Yeah, I keep re-worrying about this. I definitely agree that it would be much nicer to have one standard. And I believe SSB, Subresource Integrity, and MultiHashes aren't valid URIs (right?), which seems like an important property for an identifier to have (e.g. so we can use it in metadata fields that require URI-type objects). It sounds like you would lean towards … @btrask has these notes on compatibility at https://hash-archive.org/#compat: …
I'm not entirely clear why those remarks are specific to MultiHashes and Magnet URIs respectively, since my read was that neither IPFS nor BitTorrent hashes would be compatible with any other scheme, since both salt their hashes. (Also confused that MultiHashes are not compatible with IPFS; sounds like they were developed by the IPFS team in the first place?) @btrask's comments also point out …
That sounds like it might be worthwhile -- can I post URLs to an IPFS gateway similarly to the way we post them to hash-archive.org? I don't quite see how to do this (i.e. is there a REST API?). I'm a bit unclear on why we would standardize the name of the hash function independently of standardizing the whole identifier, i.e. if we agreed to adopt …
The …

Subresource integrity (e.g., …) …
IPFS hashes have semantics encoded in them - if I understand it correctly, the IPFS hashes refer to file structures in addition to the content they contain. This is why their hashes do not translate well into vanilla content hashes. So, you wouldn't just have to keep an IPFS gateway up and running, but you'd also have to settle on how to translate a single content object into an IPFS container (a directory with named files).

Adopting a common vocabulary of algorithm terms/labels would be helpful. I feel that most (e.g., dat/ssb/ipfs) have adopted these labels. Lastly, as long as schemes are used consistently and the hashes are easy to compute, conversions can be applied to translate one format into another.

Ideally, standards are a reflection of community adoption of a commonly used idea. Ideas for a specific RFC (or updates to existing ones) will hopefully come up once we see widespread adoption of content hashes as primary identifiers for datasets. In my mind, the combination of URIs and a hash algorithm vocab can go a long way. I am open to being convinced though . . .
What vocabulary is contentid using for the hash algorithm names? Where can I see the list? Does it differ from the other lists that I cited?
The … A translation table with references to http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256.html and friends should be easy to add. In fact, that'd be a useful addition and would re-use an already existing definition out there.
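For instance, a hypothetical sketch of such a table (entry names assumed to follow the pattern of the sha256 entry cited above):

```js
// Sketch: map hash-uri algorithm labels to the Library of Congress
// cryptographic hash functions vocabulary cited above.
const LOC = 'http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/';
const algoToLocUri = {
  md5:    LOC + 'md5',
  sha1:   LOC + 'sha1',
  sha256: LOC + 'sha256',
  sha384: LOC + 'sha384',
  sha512: LOC + 'sha512'
};
```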
👍
Lines 51 to 52 in b6cf92d
This states that the algo name … It would have been great if the hash uri draft specification were more explicit on this point. I totally agree with the points about readability, though the fact that base64 encoded identifiers are also shorter is at least a plus to consider (though having …). I worry too about the uptake issues in … Maybe more should be said about the other options for a moment?
just to have some examples -
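(As a sketch of what such a list might contain, here is one sha256 digest, taken from the vostok.icecore.co2 example below, rendered in several of the formats under discussion; the base64 and multihash digests are left as placeholders rather than computed values:)

```
hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37
magnet:?xt=urn:sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37
sha256-<base64 digest>
ni:///sha-256;<base64url digest>
<multihash: algorithm code + digest, base58/base32 encoded>
```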
and my take on them:

Magnet uses hashes of the …

Subresource integrity uses base64 encoding - nice for computers, not so nice for humans and traditional media. Also, it doesn't fit in the "url" references field often used on publications.

The target audience for multihash is computers, not humans. Also, by encoding the hash algorithm as some hex code, it obfuscates the actual meaning of the code and requires some additional resource to figure out what is going on. Also, the multihash does not fit well in traditional reference boxes and would be hard to recognize in traditional publications.

Curious to hear your thoughts.
This adds support for recognizing different identifier formats in functions that take a content identifier as an argument (`retrieve`, `resolve`, `query_sources`). This also adds potential columns in the local tsv registry to compute multiple hashes, but at this time we have not implemented multiple hash calculations. Performance issues should be mitigated by streaming, #38
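A minimal sketch of the streaming idea (not the contentid implementation, and the file name is just an example): every chunk is fed to every digest, so supporting several hash algorithms does not require re-reading the content:

```js
const crypto = require('crypto');
const fs = require('fs');

const algos = ['md5', 'sha1', 'sha256', 'sha384', 'sha512'];
const hashes = algos.map(a => crypto.createHash(a));

// single pass over the file, updating all digests per chunk
fs.createReadStream('vostok.icecore.co2')
  .on('data', chunk => hashes.forEach(h => h.update(chunk)))
  .on('end', () =>
    algos.forEach((a, i) => console.log(`hash://${a}/${hashes[i].digest('hex')}`)));
```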
Quickly re Magnet, I'm going by the magnet syntax on hash-archive.org (see e.g. https://hash-archive.org/history/https://zenodo.org/record/3678928/files/vostok.icecore.co2), so

magnet:?xt=urn:sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37

is the same as

hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37

i.e. would be using the same content hash. I haven't read the actual magnet spec, so maybe hash-archive is wrongly asserting this is a valid magnet link anyway. I agree with you on things that aren't in the URI format not looking like familiar identifiers, and that being a problem for going in a references field (or in a metadata field that takes a uri). The SRI format does make a nice internal format for computers -- I also notice it's the format the hash-archive API returns all the hashes in (probably saves a bit of space relative to storing hash uris directly on disk? maybe we should consider that for …). I also note a little weirdness in the base64 hash of the … Also, Matt noted above that while hash-archive.org reports …
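Since a few of these mix-ups come down to encodings, here is a minimal node sketch of how a single digest renders in the three common textual encodings (hex as in hash uris, standard base64 as in SRI, URL-safe base64 as in ni); the hashed content is just an example:

```js
const crypto = require('crypto');

// one digest, three textual encodings
const digest = crypto.createHash('sha256').update('hello world').digest();

console.log(digest.toString('hex'));       // lowercase hex, as in hash://sha256/...
console.log(digest.toString('base64'));    // standard base64 (+, /, =), as in SRI
console.log(digest.toString('base64url')); // URL-safe base64 (-, _), as in ni:/// (node >= 15.7)
```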
Please note that I was referring to the familiar URL format, not URI. The test below passes, because each of the four strings is a valid URI:

```java
import java.net.URI;
import org.junit.Test;

public class ValidUriTest {

    @Test
    public void validURI() {
        // each of these parses as a URI (the sha384-... strings as relative references)
        URI.create("sha384-MBO5IDfYaE6c6Aao94oZrIOiC6CGiSN2n4QUbHNPhzk5Xhm0djZLQqTpL0HzTUxk");
        URI.create("magnet:?xt=urn:sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37");
        URI.create("sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC");
        URI.create("ni:///sha-256;UyaQV-Ev4rdLoHyJJWCi11OHfrYv9E1aGQAlMO2X_-Q");
    }
}
```
@jhpoelen thanks for clarifying! I'm quite confused then by https://tools.ietf.org/html/rfc3986#section-3.1, which I read as saying the scheme is a required component of a valid URI.
Looks like I based my assumption on https://tools.ietf.org/html/rfc2396. This is the 1998 spec that is still in use by various platforms (see e.g., https://docs.oracle.com/javase/7/docs/api/java/net/URI.html). The 2005 spec you reference, https://tools.ietf.org/html/rfc3986, obsoletes the old spec. Thanks for pointing this out. It appears I was unaware of the newer URI spec, probably because the old URI spec and the IRI spec (https://tools.ietf.org/html/rfc3987, used a bunch in RDF land) share similar qualities (e.g., scheme optional).
@jhpoelen thanks for the clarification, I was actually unaware of the old 2396 spec! (is that still in use even on current java?) I believe IRIs per the RFC 3987 you link must also conform to the newer RFC 3986 spec: …
Re …

For now, I'm glad we can translate from these other formats (particularly, it seems good to be able to recognize …).
@cboettig thanks for checking my assumptions and statements re: magnet URIs and IRIs. I'll have to go back and see why my current usage of URIs / IRIs does not trigger any alarms in the well-used open source libraries I rely on. I agree that the length of the hex encoded hashes can be an issue in print publications. But . . . note that urls in publications have similar issues. Perhaps I'll take inspiration from url shortener services like tinyurl.com and bit.ly.
Speaking of length, do you have any insight on the rules for truncation? For instance, the … I guess the way hashing algorithms work, I am just increasing my chances of collision by doing this? I suppose, since hex carries fewer bits per character than base64, this would be worse for the hexadecimal hashes? RFC6920 mentions that while this is obviously not good for cryptographic purposes, truncation may be very useful in "naming things", which may fit the use case of publishing in paper. How would you feel about a recommendation for using truncated hashes in publishing? Would there be an 'acceptable length', or is this just a bad idea? (At the very least it's nice to know that if the hash gets cut off at the margin of a paper, it is probably less of a problem than it is for URLs!) Speaking of readability, did you see that the … We should possibly support truncated queries in our local registry.
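A minimal sketch of what truncated queries against a local registry could look like (hypothetical shape, with the registry held in memory; a real implementation would scan the tsv or use an indexed prefix search):

```js
// Sketch: resolve a truncated hash uri against a local registry,
// refusing ambiguous prefixes rather than guessing.
function findByPrefix(registry, truncated) {
  // registry: array of { id: 'hash://sha256/<full hex>', source: '<url>' }
  const matches = registry.filter(entry => entry.id.startsWith(truncated));
  if (matches.length > 1)
    throw new Error('truncated identifier is ambiguous: ' + truncated);
  return matches[0]; // undefined when nothing matches
}
```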
re: relative IRIs - just for the record, from https://www.ietf.org/rfc/rfc2396.txt: …
Also, commonly used RDF libraries (e.g., https://commons.apache.org/proper/commons-rdf/) and triple stores (e.g., https://jena.apache.org/) do not complain when relative URIs like … are used. However, other rdf libraries (e.g., https://www.npmjs.com/package/n3) do seem to use the more recent rfc: the following

```js
var test = require('tape');
var n3 = require('n3');
var df = require('n3').DataFactory;

test('success on creating nquad with relative IRI', function(t) {
  t.plan(1);
  const myQuad = df.quad(
    df.namedNode('77f79ac6-d88e-4d51-b853-35aae8aca8af'),
    df.namedNode('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
    df.namedNode('http://www.w3.org/ns/prov#Activity'),
    df.defaultGraph());
  t.equal(myQuad.subject.value, '77f79ac6-d88e-4d51-b853-35aae8aca8af');
});

test('fail to parse nquad with relative IRI', function(t) {
  t.plan(1);
  var parser = new n3.Parser({ format: 'nquads' });
  parser.parse("<77f79ac6-d88e-4d51-b853-35aae8aca8af> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> .", function (error, quad, prefixes) {
    t.notOk(quad);
  });
});
```

passes ok. Note that the validation for this library happens when parsing, but not on constructing programmatically. Anyways . . . this was fun: I realize that replacing a URI spec with a more restrictive one is quite an undertaking.
This RFC discussion around URIs/IRIs reminds me of the implicit decision by Hadley to let …
Btw - in working towards complying with https://tools.ietf.org/html/rfc3986, I noticed that …
@jhpoelen good point! Note that the Crossref guidelines also state that a DOI should always be written with the https://doi.org/ prefix. That said, I think there is a valid critique that the … Nevertheless, I still agree the hash uri format is quite compelling in being pretty nearly self-describing, in a way that is less true of most of the other formats, in that it seems plausible some future archivist could figure out what it means without access to an official standards document defining the spec. And I can't help wondering if the …
@cboettig @jhpoelen Fully agree with you both on all of these points. Regarding registering DOIs as URIs, this is from the DOI FAQ:
The original intent was that DOIs would be expressed as `doi:` URIs. There is a good discussion of how to represent over 700 identifier types in the Science on schema.org Dataset Guide. They mostly recommend following the identifiers.org guidelines and vocabulary.
When DOIs were "invented" in the late 1990s, the idea was that they would become yet another protocol, with native support in browsers, etc. That of course never happened, and the HTTP protocol took over almost everything. In the last few years the DOI Foundation has changed its recommendations, aligning better with common practices. Almost all DOIs are resolved via the HTTP proxy at https://doi.org. dx.doi.org still exists but is deprecated, and http://doi.org, or http in general, should no longer be used. DOIs are case-insensitive, but are typically displayed in lowercase to align with conventions for URIs. Expressing a persistent identifier as a URI aligns with best web practices, and is something Crossref and DataCite recommend in their DOI display guidelines, e.g. https://support.datacite.org/docs/datacite-doi-display-guidelines.
Thanks @mbjones and @mfenner for this background, super informative! (and great to hear from you Martin! as you may have seen, this discussion was partly inspired by your question about fingerprints in schema.org at https://github.com/schemaorg/schemaorg/issues/1831#issuecomment-369970941). I suppose one take-away from the comparison to DOIs (which certainly would seem to be our gold standard case for a widely recognized identifier) is that getting this right and consistent in standards and actual implementations is hard.
I'm not sure if that would be wise, since it places undue emphasis on https://hash-archive.org as a canonical resolver of sorts, which it was not meant to be and which goes against the spirit here (though notably it lists multiple locations that content resolves to, in this case, the two mirrors it has in DataONE).
Lastly, as important as it is to have a consistent format, I think the principle of unsalted content hashes as identifiers is really the heart of this. Ultimately, the identifier isn't really the URI string at all; the identifier is the content itself. The hash gives us a concise and reliable way to refer to that content, even if we have variation in precise syntax.
👍 As long as we can easily translate, compute, or understand content-based identifiers' relation to their content (and, by inference, to each other), I think we are in a good starting position to reliably index content and their (last known) locations at scale. I am glad to see that pragmatism and realism are taking the stage along with the sacred RFCs. I feel that RFCs like https://tools.ietf.org/html/rfc4452 (the info URI scheme) …
When a URL is registered with hash-archive.org, the API computes 5 hashes: `md5`, `sha1`, `sha256`, `sha384`, and `sha512`. The API then recognizes queries for the content that use any of these 5 hashes. Additionally, the query can give either the base64 encoded form or not (and the API also recognizes queries that use either the recommended `hash://<algo>/` pattern or the `ni:///<algo>;` format suggested in RFC6920, though that's a little clearer how to deal with, since it's just a matter of stripping off prefixes).

For me, this raises some questions on the design of a local hash look-up table. It's a lot easier to have a database (or perhaps better, simply a key-value store like redis?) look up all URIs associated with `hash://sha256/xxxx` than if we have to also account for another 4 (or more?) hashes. (Likewise, when registering a url, it's clearly more expensive to both compute and store multiple hashes -- though maybe not significantly so?) I'm also not sure if there's a clever way to handle fast retrieval that is agnostic to whether the hash is base64 encoded or not (or uses the `ni:///` or `hash://` prefix, etc.).

From an aesthetic/conceptual point of view, having "the identifier" be, say, always the plain sha256 hash uri seems cleaner / easier to understand. But maybe I'm just being cowardly and there are obvious ways to manage registration and queries efficiently for all 10 forms? @jhpoelen advice?