Define and use parsed address segment representation inside `Multiaddr` instead of `Vec<u8>` #25

ntninja · 2017-10-25T23:10:56Z

This PR rewrites most of src/protocol.rs to provide a companion type for each variant of protocol::Protocol that can store a fully deserialized representation of the given address segment's data.

Known caveats:

The protocol::IPFSSegment type is pretty verbose because it has to replicate some functionality from the cid crate that is not exposed in a reusable way
The Multiaddr main struct also needs to carry a (useless) field with the serialized byte representation of the struct to ensure backwards-compatiblity with the .as_slice() method
The Protocol and Segment enumerations feel somewhat redundant, but I can't think of any good way of unifying them without starting to work around the type system.
Needs tests for new features!

(4) will be fixed
(1) I will propably fix at the source myself as well
(2) can be fixed any time if you're willing to break backward-compatibility

Any other comments about the code, approach, etc.?

debris

Hey, I saw an open pr and decided to review it to better understand multiaddr ;) Overall, the pr looks good. It's definitely an improvement. I have just a couple of minor grumbles

The Multiaddr main struct also needs to carry a (useless) field with the serialized byte representation of the struct to ensure backwards-compatiblity with the .as_slice() method

I don't agree with this point. Cargo.toml versions and Cargo.lock files could be used to use the old version of the library. imo, api for backwards compatibility is almost always redundant thanks to cargo toolchain

debris · 2017-10-28T00:36:47Z

src/protocol.rs

+    }
+
+    fn to_stream(&self, stream: &mut io::Write) -> io::Result<()> {
+        Ok(stream.write_u16::<BigEndian>(self.0)?)


result types are the same. it is enough to write

stream.write_u16::<BigEndian>(self.0)

Right! (Originally this function was named to_bytes(&self) -> Result<Vec<u8>> and the try!() had been necessary.)

Actually the code had been correct: stream.write_u16(…) returned an byteorder::Result for the byteorder 0.6 crate. It is updated to version 1.1 now.

debris · 2017-10-28T00:39:54Z

src/protocol.rs

+    ( $name:ident : Display $( $rest:tt )* ) => {
+        impl fmt::Display for $name {
+            fn fmt(&self, _: &mut fmt::Formatter) -> fmt::Result {
+                Ok(())


imo, this is super misleading. If display is empty, do not implement it

I've changed the definition of AddressSegement to only require the implementation of ToString for each address segement type. Is that good enough?

Otherwise the macro implementation of Display for the Segment enumeration type will become really painful/impossible:

impl fmt::Display for Segment { fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { match self { $( &Segment::$var(ref addr) => { f.write_str(addr.protocol().as_str())?; if $addr_type::STREAM_LENGTH > 0 { f.write_char('/')?; f.write_str(addr.to_string().as_ref())?; } Ok(()) }, )* // Need extra match arm because of non-exhaustiveness _ => unreachable!() } } }

debris · 2017-10-28T00:42:23Z

src/protocol.rs

-    {$( $val:expr => $var:ident: $alph:expr, $size:expr, )*} => {
+
+/// Single multiaddress segment with its attached data
+pub trait AddressSegment : fmt::Display + ToString {


Why does it depend on fmt::Display? btw. And ToString is automatically implemented for any type which implements the Display trait docs

I just thought it would be nice to support rendering multiaddr segements as part of formatting strings. The ToString is mostly there to ensure that fact that everything implementing AddressSegement also implements ToString is highlighted in the documentation and is part of the intended API (instead of being just a fortunate accident).

debris · 2017-10-28T00:50:24Z

src/protocol.rs

@@ -82,48 +590,143 @@ macro_rules! build_protocol_enum {
            ///
            pub fn size(&self) -> isize {
                match *self {
-                    $( $var => $size, )*
+                    $( Protocol::$var => ($addr_type::STREAM_LENGTH * 8) as isize, )*
+                    _ => unreachable!()


It would be more readable if these unreachabe! macros described why they are unreachable ;)

One of them did 😉
Fixed!

debris · 2017-10-28T00:52:16Z

src/protocol.rs

-                match *self {
-                    $( $var => $alph.to_string(), )*
-                }
+                self.as_str().to_string()


it's be better to implement fmt::Display

debris · 2017-10-28T00:54:24Z

src/protocol.rs

+    }
+
+    fn to_stream(&self, stream: &mut io::Write) -> io::Result<()> {
+        Ok(stream.write_u16::<BigEndian>(self.0)?)


result types are the same. it is enough to write

stream.write_u16::<BigEndian>(self.0)

debris · 2017-10-28T00:55:44Z

src/protocol.rs

+
+        // Read CID hash from stream
+        let cid = Self::_read_cid_polyfill(stream).map_err(|err| {
+            println!("CID Parsing failed!");


use log crate instead

Actually this is left-over debugging code 🙂

Do you think it's worth adding logging for this anyhow?

I don't know, I don't have a strong opinion about that. Usually I prefer to keep the logs more 'top-level', but it's up to you, cause you know better if it's needed or not

Yes, now that I think about it again I'm going to remove it. The actual solution is fix up the cid crate so that we don't need _read_cid_polyfill anymore.

debris · 2017-10-28T00:58:23Z

src/protocol.rs

+// currently only implement them for up to 32 entries
+impl fmt::Debug for OnionSegment {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        write!(f, "OnionSegment({:?})", &*self.0 as &[u8])


use debug_struct instead docs

Thanks! Learnt something! 🙂

debris · 2017-10-28T01:00:12Z

src/protocol.rs

+    }
+
+    fn to_stream(&self, stream: &mut io::Write) -> io::Result<()> {
+        Ok(stream.write_u16::<BigEndian>(self.0)?)


result types are the same. it is enough to write

stream.write_u16::<BigEndian>(self.0)

debris · 2017-10-28T01:08:04Z

src/protocol.rs

+            type Err = Error;
+
+            fn from_str(s: &str) -> Result<Self> {
+                Ok($name(FromStr::from_str(s)?))


it's more idiomatic to write s.parse() instead of FromStr::from_str(s)

IMHO using FromStr::from_str here makes it more obvious that we are simply delegating the call to the inner type (otherwise I would prefer .parse() as well).

makes sense! ;)

debris · 2017-10-28T13:40:52Z

src/protocol.rs

+        }
+        let debug_ref = F(&*self.0 as &[u8]);
+
+        f.debug_tuple("OnionSegment").field(&debug_ref).finish()


I didn't know that &[u8] does not implement Debug 😞 In my projects, we usually display it as hex.

It does implement Debug, but one cannot cast &[u8] (an unsized type) to &Debug apparently. Hence the (not too hard to understand IMHO) workaround.

ntninja · 2017-10-28T13:59:28Z

The Multiaddr main struct also needs to carry a (useless) field with the serialized byte representation of the struct to ensure backwards-compatiblity with the .as_slice() method

I don't agree with this point. Cargo.toml versions and Cargo.lock files could be used to use the old version of the library. imo, api for backwards compatibility is almost always redundant thanks to cargo toolchain

But if one library uses the old version of multiaddr and one library uses a newer, incompatible version, won't that mean that a Multiaddr struct from the old library, won't be able to passed to the library that expects the newer version (linking errors)?
Basically any library exposing a Multiaddr in a public interface will have to release a major version as well when switching to the new version of multiaddr, no?

…iAddr segment

…sing a `Vec` Update `protocol.rs` to serialize its bytes to a Rust byte stream

…<Addr>` instead of `Vec<u8>` (multiformats#19)

…onsist of 1 or 2 parts)

Only implement `ToString` for empty address segments Write both the address segment name and value on `protocol::Segment`'s `Display`

ntninja · 2017-10-28T15:09:17Z

I just read the WikiPedia page on .onion addresses and realized that each address is 80 bits in length (not bytes!). I've updated the struct and dropped all the custom derive code that was required because of the array with more than 32 entries.
(It should have seemed weird before: 80 bytes is longer than SHA512 with 64 bytes!)

dignifiedquire · 2017-11-06T14:20:21Z

Thank you @Alexander255 for all the work here and thank you @debris for the review. I really like this direction but haven't had time to dig through it yet though. I will try and get to it in the next days.

ntninja · 2017-11-06T19:45:28Z

@dignifiedquire Thanks! I still plan on adding some tests (currently tracking down a bug found by this), before this should be committed through.

…bytes stream

ntninja · 2017-11-07T16:37:03Z

@dignifiedquire: I fixed some stuff, exposed byte stream serializing and parsing in MultiAddr and added two more tests. From my point of view, I'm done here – please review.

Also I did all changes in a backwards compatible way, but if you are willing to break compatibility to get rid of the ugly left-overs caused by this I'd be willing to prepare a separate PR for that. (i.e.: This PR would result in version 0.2.1, and, after the other PR we would bump the version to 0.3.0. Also all methods removed in 0.3.0 would cause deprecation warnings for downstream developers using 0.2.1.)

ntninja · 2017-11-07T19:54:53Z

Another note: This PR implements a design pretty much opposite to what #11 is describing. Individual address segments should be able to be used as trait objects using the AddressSegment trait and doing so is indeed extensible by crate users, however there is currently no mechanism to use these address segments inside the standard MultiAddr type.

crackcomm · 2017-11-08T10:58:39Z

@Alexander255 have you considered a Custom protocol or registering user-protocols?

Related: #15

ntninja · 2017-11-08T15:22:12Z

@crackcomm: I have considered adding a special address segment enum variant that can wrap any existing Box<AddressSegment> (even if they are provided by other libraries). Unfortunately however, such a thing is almost entirely useless without any way to store it as part of a MultiAddr object that can then be passed around. There is also no mechsim currently to register custom protocol parsers with the library (i.e. The situation on this is totally unchanged compared to how it was before.)

While all of these things are totally implementable, it basically boils down to a design decision:
Do we want rust-multiaddr to implement a closed set of types as efficiantly as possible (this PR based on the previous code) or do want it to be an open system where library users bring their own types and we only define some core defaults (needs a redesign of most of the library code)?

@dignifiedquire: Thoughts on this?

dignifiedquire · 2017-11-08T17:09:46Z

While having user extension is a nice feature, currently for multiaddr the process is to propose a new address format in https://github.com/multiformats/multiaddr and after it lands there it is adopted by implementations. So I think it is fine to have it implemented in the perf oriented way that is done now.

So given that I would like to see concrete use cases for those custom handlers before changing the architecture of the module to enable it.

crackcomm · 2017-11-08T19:22:41Z

@dignifiedquire good, it is fine, for now.

daviddias · 2018-01-25T17:42:57Z

@tomaka mind taking the lead and reviewing this PR?

tomaka · 2018-01-26T10:35:38Z

I don't know if this PR is the appropriate place to discuss this, but I've gone a different way in our local fork: https://github.com/libp2p/rust-libp2p/tree/cace5bf577c53809aad514db40e72f7290e030d4/rust-multiaddr

In libp2p, a multiaddr is generally stored somewhere in memory, is often sent/received over the network, and is often hashed and compared with other multiaddresses, but it is not often decoded. You only need to decode a multiaddr when you want to open a new connection to it, which doesn't happen that often.
Therefore I went with keeping a private Vec<u8> inside the Multiaddress struct, because it is both the most space/cache-local-efficient memory representation and the most efficient when it comes to comparing and hashing.

In addition to this, our local fork adds an AddrSegment enum which is very similar to the Segment enum of this PR. My reasoning, however, is that if you really want strong typing then you might as well just store a Vec<AddrSegment>. A Multiaddr can easily be decoded into an iterator of AddrSegment, and doesn't perform any additional verification compared to a Vec<AddrSegment>.

The only major downside, however, is that with a Vec<u8> the pop() operation needs to iterator over the whole array of bytes. With a Vec<AddrSegment> you can simply remove the last element. However in my opinion some benchmarking would needed, because reading the poped AddrSegment may require a memory fetch, which may counter-balance the small overhead.

dignifiedquire · 2022-05-01T13:58:01Z

closing this, as the impl stores the byte representation now, which is for many use cases quite good. happy to reconsider if there is a concrete demonstrable perf or correctness issue.

Make macro invocation of protocol::build_protocol_enum! more enum-like

3e00ab0

debris reviewed Oct 28, 2017

View reviewed changes

ntninja force-pushed the master branch from a1bf8ef to 0d565a6 Compare October 28, 2017 10:25

debris reviewed Oct 28, 2017

View reviewed changes

ntninja added 7 commits October 28, 2017 17:02

Refactor protocol.rs to expose first-class Rust types for each Mult…

9f5bbe0

…iAddr segment

Serialize bytes of address segment to a Rust byte stream instead of u…

6079746

…sing a `Vec` Update `protocol.rs` to serialize its bytes to a Rust byte stream

Change internal representation of the Multiaddr main struct to `Vec…

bc2b532

…<Addr>` instead of `Vec<u8>` (multiformats#19)

Have single source of truth regarding estimated address byte length

5063637

Unify names: An address is made up of “segments” (which in turn may c…

106d6fc

…onsist of 1 or 2 parts)

Fix some minor issues pointed out by @debris

2042cb7

Only implement `ToString` for empty address segments Write both the address segment name and value on `protocol::Segment`'s `Display`

Upgrade to byteorder 1.1

26a5797

ntninja force-pushed the master branch from d3df5c8 to 26a5797 Compare October 28, 2017 15:05

Validate the new MultiAddr address types in tests

a3b342f

dignifiedquire mentioned this pull request Nov 7, 2017

Make decapsulate return an error if the protocol is not found #15

Closed

ntninja added 3 commits November 7, 2017 16:58

Add methods for constructing and serializing a MultiAddr from/to a …

2b5f045

…bytes stream

Test (and fix) the parsing of byte streams by MultiAddr

cd88477

Add test for AddressSegment{Reader,Writer}Ext

c17c068

daviddias requested a review from tomaka January 25, 2018 17:43

dignifiedquire closed this May 1, 2022

Define and use parsed address segment representation inside Multiaddr instead of Vec<u8> #25

Define and use parsed address segment representation inside Multiaddr instead of Vec<u8> #25

Conversation

ntninja commented Oct 25, 2017

debris left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ntninja Oct 28, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

debris Oct 28, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

debris Oct 28, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ntninja commented Oct 28, 2017

ntninja commented Oct 28, 2017

dignifiedquire commented Nov 6, 2017

ntninja commented Nov 6, 2017

ntninja commented Nov 7, 2017

ntninja commented Nov 7, 2017

crackcomm commented Nov 8, 2017 • edited Loading

ntninja commented Nov 8, 2017

dignifiedquire commented Nov 8, 2017

crackcomm commented Nov 8, 2017

daviddias commented Jan 25, 2018

tomaka commented Jan 26, 2018 • edited Loading

dignifiedquire commented May 1, 2022

Define and use parsed address segment representation inside `Multiaddr` instead of `Vec<u8>` #25

Define and use parsed address segment representation inside `Multiaddr` instead of `Vec<u8>` #25

ntninja Oct 28, 2017 •

edited

Loading

debris Oct 28, 2017 •

edited

Loading

debris Oct 28, 2017 •

edited

Loading

crackcomm commented Nov 8, 2017 •

edited

Loading

tomaka commented Jan 26, 2018 •

edited

Loading