Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define and use parsed address segment representation inside Multiaddr instead of Vec<u8> #25

Closed
wants to merge 12 commits into from

Conversation

ntninja
Copy link

@ntninja ntninja commented Oct 25, 2017

This PR rewrites most of src/protocol.rs to provide a companion type for each variant of protocol::Protocol that can store a fully deserialized representation of the given address segment's data.

Known caveats:

  1. The protocol::IPFSSegment type is pretty verbose because it has to replicate some functionality from the cid crate that is not exposed in a reusable way
  2. The Multiaddr main struct also needs to carry a (useless) field with the serialized byte representation of the struct to ensure backwards-compatiblity with the .as_slice() method
  3. The Protocol and Segment enumerations feel somewhat redundant, but I can't think of any good way of unifying them without starting to work around the type system.
  4. Needs tests for new features!

(4) will be fixed
(1) I will propably fix at the source myself as well
(2) can be fixed any time if you're willing to break backward-compatibility

Any other comments about the code, approach, etc.?

Copy link
Member

@debris debris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I saw an open pr and decided to review it to better understand multiaddr ;) Overall, the pr looks good. It's definitely an improvement. I have just a couple of minor grumbles

The Multiaddr main struct also needs to carry a (useless) field with the serialized byte representation of the struct to ensure backwards-compatiblity with the .as_slice() method

I don't agree with this point. Cargo.toml versions and Cargo.lock files could be used to use the old version of the library. imo, api for backwards compatibility is almost always redundant thanks to cargo toolchain

src/protocol.rs Outdated
}

fn to_stream(&self, stream: &mut io::Write) -> io::Result<()> {
Ok(stream.write_u16::<BigEndian>(self.0)?)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result types are the same. it is enough to write

stream.write_u16::<BigEndian>(self.0)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right! (Originally this function was named to_bytes(&self) -> Result<Vec<u8>> and the try!() had been necessary.)

Copy link
Author

@ntninja ntninja Oct 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the code had been correct: stream.write_u16(…) returned an byteorder::Result for the byteorder 0.6 crate. It is updated to version 1.1 now.

src/protocol.rs Outdated
( $name:ident : Display $( $rest:tt )* ) => {
impl fmt::Display for $name {
fn fmt(&self, _: &mut fmt::Formatter) -> fmt::Result {
Ok(())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo, this is super misleading. If display is empty, do not implement it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the definition of AddressSegement to only require the implementation of ToString for each address segement type. Is that good enough?

Otherwise the macro implementation of Display for the Segment enumeration type will become really painful/impossible:

        impl fmt::Display for Segment {
            fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
                match self {
                    $( &Segment::$var(ref addr) => {
                        f.write_str(addr.protocol().as_str())?;
                        if $addr_type::STREAM_LENGTH > 0 {
                            f.write_char('/')?;
                            f.write_str(addr.to_string().as_ref())?;
                        }

                        Ok(())
                    }, )*

                    // Need extra match arm because of non-exhaustiveness
                    _ => unreachable!()
                }
            }
        }

src/protocol.rs Outdated
{$( $val:expr => $var:ident: $alph:expr, $size:expr, )*} => {

/// Single multiaddress segment with its attached data
pub trait AddressSegment : fmt::Display + ToString {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it depend on fmt::Display? btw. And ToString is automatically implemented for any type which implements the Display trait docs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just thought it would be nice to support rendering multiaddr segements as part of formatting strings. The ToString is mostly there to ensure that fact that everything implementing AddressSegement also implements ToString is highlighted in the documentation and is part of the intended API (instead of being just a fortunate accident).

src/protocol.rs Outdated
@@ -82,48 +590,143 @@ macro_rules! build_protocol_enum {
///
pub fn size(&self) -> isize {
match *self {
$( $var => $size, )*
$( Protocol::$var => ($addr_type::STREAM_LENGTH * 8) as isize, )*
_ => unreachable!()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more readable if these unreachabe! macros described why they are unreachable ;)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of them did 😉
Fixed!

src/protocol.rs Outdated
match *self {
$( $var => $alph.to_string(), )*
}
self.as_str().to_string()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's be better to implement fmt::Display

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

src/protocol.rs Outdated
}

fn to_stream(&self, stream: &mut io::Write) -> io::Result<()> {
Ok(stream.write_u16::<BigEndian>(self.0)?)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result types are the same. it is enough to write

stream.write_u16::<BigEndian>(self.0)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

src/protocol.rs Outdated

// Read CID hash from stream
let cid = Self::_read_cid_polyfill(stream).map_err(|err| {
println!("CID Parsing failed!");
Copy link
Member

@debris debris Oct 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use log crate instead

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is left-over debugging code 🙂

Do you think it's worth adding logging for this anyhow?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, I don't have a strong opinion about that. Usually I prefer to keep the logs more 'top-level', but it's up to you, cause you know better if it's needed or not

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, now that I think about it again I'm going to remove it. The actual solution is fix up the cid crate so that we don't need _read_cid_polyfill anymore.

src/protocol.rs Outdated
// currently only implement them for up to 32 entries
impl fmt::Debug for OnionSegment {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "OnionSegment({:?})", &*self.0 as &[u8])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use debug_struct instead docs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Learnt something! 🙂

src/protocol.rs Outdated
}

fn to_stream(&self, stream: &mut io::Write) -> io::Result<()> {
Ok(stream.write_u16::<BigEndian>(self.0)?)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result types are the same. it is enough to write

stream.write_u16::<BigEndian>(self.0)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

type Err = Error;

fn from_str(s: &str) -> Result<Self> {
Ok($name(FromStr::from_str(s)?))
Copy link
Member

@debris debris Oct 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's more idiomatic to write s.parse() instead of FromStr::from_str(s)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO using FromStr::from_str here makes it more obvious that we are simply delegating the call to the inner type (otherwise I would prefer .parse() as well).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense! ;)

src/protocol.rs Outdated
}
let debug_ref = F(&*self.0 as &[u8]);

f.debug_tuple("OnionSegment").field(&debug_ref).finish()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know that &[u8] does not implement Debug 😞 In my projects, we usually display it as hex.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does implement Debug, but one cannot cast &[u8] (an unsized type) to &Debug apparently. Hence the (not too hard to understand IMHO) workaround.

@ntninja
Copy link
Author

ntninja commented Oct 28, 2017

The Multiaddr main struct also needs to carry a (useless) field with the serialized byte representation of the struct to ensure backwards-compatiblity with the .as_slice() method

I don't agree with this point. Cargo.toml versions and Cargo.lock files could be used to use the old version of the library. imo, api for backwards compatibility is almost always redundant thanks to cargo toolchain

But if one library uses the old version of multiaddr and one library uses a newer, incompatible version, won't that mean that a Multiaddr struct from the old library, won't be able to passed to the library that expects the newer version (linking errors)?
Basically any library exposing a Multiaddr in a public interface will have to release a major version as well when switching to the new version of multiaddr, no?

@ntninja
Copy link
Author

ntninja commented Oct 28, 2017

I just read the WikiPedia page on .onion addresses and realized that each address is 80 bits in length (not bytes!). I've updated the struct and dropped all the custom derive code that was required because of the array with more than 32 entries.
(It should have seemed weird before: 80 bytes is longer than SHA512 with 64 bytes!)

@dignifiedquire
Copy link
Member

Thank you @Alexander255 for all the work here and thank you @debris for the review. I really like this direction but haven't had time to dig through it yet though. I will try and get to it in the next days.

@ntninja
Copy link
Author

ntninja commented Nov 6, 2017

@dignifiedquire Thanks! I still plan on adding some tests (currently tracking down a bug found by this), before this should be committed through.

@ntninja
Copy link
Author

ntninja commented Nov 7, 2017

@dignifiedquire: I fixed some stuff, exposed byte stream serializing and parsing in MultiAddr and added two more tests. From my point of view, I'm done here – please review.

Also I did all changes in a backwards compatible way, but if you are willing to break compatibility to get rid of the ugly left-overs caused by this I'd be willing to prepare a separate PR for that. (i.e.: This PR would result in version 0.2.1, and, after the other PR we would bump the version to 0.3.0. Also all methods removed in 0.3.0 would cause deprecation warnings for downstream developers using 0.2.1.)

@ntninja
Copy link
Author

ntninja commented Nov 7, 2017

Another note: This PR implements a design pretty much opposite to what #11 is describing. Individual address segments should be able to be used as trait objects using the AddressSegment trait and doing so is indeed extensible by crate users, however there is currently no mechanism to use these address segments inside the standard MultiAddr type.

@crackcomm
Copy link

crackcomm commented Nov 8, 2017

@Alexander255 have you considered a Custom protocol or registering user-protocols?

Related: #15

@ntninja
Copy link
Author

ntninja commented Nov 8, 2017

@crackcomm: I have considered adding a special address segment enum variant that can wrap any existing Box<AddressSegment> (even if they are provided by other libraries). Unfortunately however, such a thing is almost entirely useless without any way to store it as part of a MultiAddr object that can then be passed around. There is also no mechsim currently to register custom protocol parsers with the library (i.e. The situation on this is totally unchanged compared to how it was before.)

While all of these things are totally implementable, it basically boils down to a design decision:
Do we want rust-multiaddr to implement a closed set of types as efficiantly as possible (this PR based on the previous code) or do want it to be an open system where library users bring their own types and we only define some core defaults (needs a redesign of most of the library code)?

@dignifiedquire: Thoughts on this?

@dignifiedquire
Copy link
Member

While having user extension is a nice feature, currently for multiaddr the process is to propose a new address format in https://github.com/multiformats/multiaddr and after it lands there it is adopted by implementations. So I think it is fine to have it implemented in the perf oriented way that is done now.

So given that I would like to see concrete use cases for those custom handlers before changing the architecture of the module to enable it.

@crackcomm
Copy link

@dignifiedquire good, it is fine, for now.

@daviddias
Copy link
Member

@tomaka mind taking the lead and reviewing this PR?

@daviddias daviddias requested a review from tomaka January 25, 2018 17:43
@tomaka
Copy link
Member

tomaka commented Jan 26, 2018

I don't know if this PR is the appropriate place to discuss this, but I've gone a different way in our local fork: https://github.com/libp2p/rust-libp2p/tree/cace5bf577c53809aad514db40e72f7290e030d4/rust-multiaddr

In libp2p, a multiaddr is generally stored somewhere in memory, is often sent/received over the network, and is often hashed and compared with other multiaddresses, but it is not often decoded. You only need to decode a multiaddr when you want to open a new connection to it, which doesn't happen that often.
Therefore I went with keeping a private Vec<u8> inside the Multiaddress struct, because it is both the most space/cache-local-efficient memory representation and the most efficient when it comes to comparing and hashing.

In addition to this, our local fork adds an AddrSegment enum which is very similar to the Segment enum of this PR. My reasoning, however, is that if you really want strong typing then you might as well just store a Vec<AddrSegment>. A Multiaddr can easily be decoded into an iterator of AddrSegment, and doesn't perform any additional verification compared to a Vec<AddrSegment>.

The only major downside, however, is that with a Vec<u8> the pop() operation needs to iterator over the whole array of bytes. With a Vec<AddrSegment> you can simply remove the last element. However in my opinion some benchmarking would needed, because reading the poped AddrSegment may require a memory fetch, which may counter-balance the small overhead.

@dignifiedquire
Copy link
Member

closing this, as the impl stores the byte representation now, which is for many use cases quite good. happy to reconsider if there is a concrete demonstrable perf or correctness issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants