Separate completion queue for receiving #737
Idea No1

```c
int rpma_conn_cfg_set_recv_cq_size(struct rpma_conn_cfg *cfg, uint32_t recv_cq_size);
int rpma_conn_cfg_get_recv_cq_size(const struct rpma_conn_cfg *cfg, uint32_t *recv_cq_size);
int rpma_conn_get_recv_completion_fd(const struct rpma_conn *conn, int *fd);
int rpma_conn_recv_completion_wait(struct rpma_conn *conn);
int rpma_conn_recv_completion_get(struct rpma_conn *conn, struct rpma_completion *cmpl);
```
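A sketch of how I imagine an application using Idea No1 (rpma_conn_cfg_new() and struct rpma_completion are the already-existing librpma APIs; conn stands for an established connection; error handling omitted):

```c
struct rpma_conn_cfg *cfg = NULL;
struct rpma_completion cmpl;

rpma_conn_cfg_new(&cfg);
/* ask for a CQ dedicated to receives (proposed API) */
rpma_conn_cfg_set_recv_cq_size(cfg, 16);
/* ... establish conn using cfg ... */

/* wait for and collect a receive completion only (proposed API) */
rpma_conn_recv_completion_wait(conn);
rpma_conn_recv_completion_get(conn, &cmpl);
```

To be discussed if it is a better idea to complement the APIs introduced above with: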
If I understand correctly this is more or less what @grom72 has proposed.

Idea No2

```c
int rpma_cq_new_x(struct rpma_peer *peer, int cq_size, struct rpma_cq **cq);
int rpma_cq_new_y(struct rpma_peer *peer, int send_cq_size, int recv_cq_size, struct rpma_cq **cq);
int rpma_cq_delete(struct rpma_cq **cq);
int rpma_conn_cfg_set_cq(const struct rpma_conn_cfg *cfg, struct rpma_cq *cq);
```

if rpma_cq is set:

```c
int rpma_cq_get_fd(const struct rpma_cq *cq, int *fd);
int rpma_cq_completion_wait(struct rpma_cq *cq);
int rpma_cq_completion_get(struct rpma_cq *cq, struct rpma_completion *cmpl);
int rpma_cq_send_get_fd(const struct rpma_cq *cq, int *fd);
int rpma_cq_recv_get_fd(const struct rpma_cq *cq, int *fd);
int rpma_cq_send_completion_wait(struct rpma_cq *cq);
int rpma_cq_recv_completion_wait(struct rpma_cq *cq);
int rpma_cq_send_completion_get(struct rpma_cq *cq, struct rpma_completion *cmpl);
int rpma_cq_recv_completion_get(struct rpma_cq *cq, struct rpma_completion *cmpl);
```

Risks:
Weaknesses:
|
Idea No2B
```c
int rpma_dcq_new(struct rpma_peer *peer, int send_cq_size, int recv_cq_size, struct rpma_dcq **dcq);
```
Why is it superior compared to Idea No2?
Weaknesses?
|
Is it necessary to expose the detailed process of creating a CQ to the application? I like idea No1 for three reasons:
For idea No1, I want to keep the rpma_conn_completion* APIs because they have two roles:
If you agree with idea No1, I will try to implement it. Best Regards, |
Nice to have you join this conversation @yangx-jy! Returning to the issue after a few days I think I'm close to your position. I hope to have a strong opinion by tomorrow. |
Hi @janekmi If you don't have any objection, I will try to implement the separate RCQ next week as idea No1 describes, because I still think idea No1 makes the separate-RCQ APIs easy for an application to use. Best Regards, |
@yangx-jy I won't stop you. But I want to keep this discussion rolling till we reach a consensus. Today I had a long and fruitful discussion with @grom72 (our beloved architect). We plan another session tomorrow but I want to keep you up to date with the state of the discussion. Interestingly enough, we have decided to discuss this topic in the context of #958. So I strongly encourage @ultra-nexus to take a look at what is going on here.

Idea No3A

This is a combination of Idea No1 and Idea No2, where we do introduce a separate entity:

```c
// by default 0 - no separate recv CQ
int rpma_conn_cfg_set_rcq_size(const struct rpma_conn_cfg *cfg, uint32_t cq_size);
struct rpma_cq;
// if no separate recv CQ return NULL; with an error?
int rpma_conn_get_rcq(const struct rpma_conn *conn, struct rpma_cq **cq_ptr);
int rpma_conn_get_cq(const struct rpma_conn *conn, struct rpma_cq **cq_ptr);
int rpma_cq_get_fd(const struct rpma_cq *cq, int *fd);
int rpma_cq_wait(struct rpma_cq *cq);
int rpma_cq_get(struct rpma_cq *cq, struct rpma_completion *cmpl);
```

At the same time, we have discussed possible SRQ interfaces for librpma and we came up with the following:

```c
struct rpma_srq;
int rpma_srq_new(struct rpma_peer *peer, uint32_t rq_size, struct rpma_srq **srq_ptr);
// overwrites rpma_conn_cfg_set_rq_size()
int rpma_conn_cfg_set_srq(const struct rpma_conn_cfg *cfg, struct rpma_srq *srq);
int rpma_srq_recv(struct rpma_srq *srq, struct rpma_mr_local *dst, size_t offset, size_t len, const void *op_context);
```

Note: @grom72 and I are thinking about giving SRQ a possibility to have a separate CQ for receives which will allow collecting all completions from the SRQ with a single rpma_cq.

Idea No3B

Very similar to what you see above but with an explicit rpma_cq_new():

```c
struct rpma_cq;
int rpma_cq_new(struct rpma_peer *peer, uint32_t cq_size, struct rpma_cq **cq_ptr);
```
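To picture the intended flow: an application consuming the separate RCQ under Idea No3A could look like this (my sketch; error handling omitted):

```c
struct rpma_cq *rcq = NULL;
struct rpma_completion cmpl;

/* get the separate CQ for receives (proposed API) */
rpma_conn_get_rcq(conn, &rcq);

/* wait for and collect receive completions only */
rpma_cq_wait(rcq);
rpma_cq_get(rcq, &cmpl);
```
|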
I will also participate in this SCQ discussion, thanks. |
After today's discussion, @grom72 and I have agreed to follow Idea No3A. We have also completed the SRQ part so here it is:

```c
/* an SRQ configuration structure */
struct rpma_srq_cfg;
int rpma_srq_cfg_new(struct rpma_srq_cfg **scfg_ptr);
int rpma_srq_cfg_delete(struct rpma_srq_cfg **scfg_ptr);
int rpma_srq_cfg_set_rq_size(struct rpma_srq_cfg *scfg, uint32_t rq_size);
// rcq_size == 0 means no CQ for receives is provided along with SRQ
int rpma_srq_cfg_set_rcq_size(struct rpma_srq_cfg *scfg, uint32_t rcq_size);
/* the SRQ itself */
struct rpma_srq;
// ibv_create_srq
int rpma_srq_new(struct rpma_peer *peer, struct rpma_srq_cfg *scfg, struct rpma_srq **srq_ptr);
int rpma_srq_delete(struct rpma_srq **srq_ptr);
// overwrites rpma_conn_cfg_set_rq_size()
int rpma_conn_cfg_set_srq(const struct rpma_conn_cfg *cfg, struct rpma_srq *srq);
int rpma_srq_recv(struct rpma_srq *srq, struct rpma_mr_local *dst, size_t offset, size_t len, const void *op_context);
int rpma_srq_get_rcq(const struct rpma_srq *srq, struct rpma_cq **cq_ptr);
/*
Having the rpma_cq you can use it as a single point controlling all CQEs for Receives coming from all connections using this SRQ:
- rpma_cq_get_fd
- rpma_cq_wait
- rpma_cq_get
*/
```

Note: We are still open for discussion if you have any thoughts or comments.

Plan A (Idea No3A)

Since all these changes are interrelated and it is hard to separate them, I think we also have to think a little bit about the order of actions, e.g.:

Stage 1

```c
struct rpma_cq;
// if no separate recv CQ return NULL; with an error?
int rpma_conn_get_rcq(const struct rpma_conn *conn, struct rpma_cq **cq_ptr);
int rpma_conn_get_cq(const struct rpma_conn *conn, struct rpma_cq **cq_ptr);
int rpma_cq_get_fd(const struct rpma_cq *cq, int *fd);
int rpma_cq_wait(struct rpma_cq *cq);
int rpma_cq_get(struct rpma_cq *cq, struct rpma_completion *cmpl);
```

Stage 2A in parallel with Stage 2B

```c
// by default 0 - no separate recv CQ
int rpma_conn_cfg_set_rcq_size(const struct rpma_conn_cfg *cfg, uint32_t cq_size);
```

Note:

Stage 2B in parallel with Stage 2A

```c
/* an SRQ configuration structure */
struct rpma_srq_cfg;
int rpma_srq_cfg_new(struct rpma_srq_cfg **scfg_ptr);
int rpma_srq_cfg_delete(struct rpma_srq_cfg **scfg_ptr);
int rpma_srq_cfg_set_rq_size(struct rpma_srq_cfg *scfg, uint32_t rq_size);
// no rpma_srq_cfg_set_rcq_size() and rpma_srq_get_rcq() in this Stage
/* the SRQ itself */
struct rpma_srq;
int rpma_srq_new(struct rpma_peer *peer, struct rpma_srq_cfg *scfg, struct rpma_srq **srq_ptr);
int rpma_srq_delete(struct rpma_srq **srq_ptr);
// overwrites rpma_conn_cfg_set_rq_size()
int rpma_conn_cfg_set_srq(const struct rpma_conn_cfg *cfg, struct rpma_srq *srq);
int rpma_srq_recv(struct rpma_srq *srq, struct rpma_mr_local *dst, size_t offset, size_t len, const void *op_context);
```

Stage 3

```c
// rcq_size == 0 means no CQ for receives is provided along with SRQ
int rpma_srq_cfg_set_rcq_size(struct rpma_srq_cfg *scfg, uint32_t rcq_size);
int rpma_srq_get_rcq(const struct rpma_srq *srq, struct rpma_cq **cq_ptr);
```

Stage 4

After all of these efforts, I expect nice examples to appear, namely: :-)

Note: It might be a good idea to use RCQ in the GPSPM example:
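To make the SRQ part tangible, here is how I picture a server draining receives from many connections through the single SRQ-attached RCQ (a sketch of the API proposed above; sizes are arbitrary, error handling omitted):

```c
struct rpma_srq_cfg *scfg = NULL;
struct rpma_srq *srq = NULL;
struct rpma_cq *rcq = NULL;
struct rpma_completion cmpl;

rpma_srq_cfg_new(&scfg);
rpma_srq_cfg_set_rq_size(scfg, 64);
rpma_srq_cfg_set_rcq_size(scfg, 64); /* ask for a CQ dedicated to receives */
rpma_srq_new(peer, scfg, &srq);

/* every connection created with rpma_conn_cfg_set_srq(cfg, srq)
 * posts its receives to this single SRQ */
rpma_srq_get_rcq(srq, &rcq);

/* a single point collecting RECV completions from all those connections */
rpma_cq_wait(rcq);
rpma_cq_get(rcq, &cmpl);
```
|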
Very cool. @ultra-nexus and I will try to understand the details of Idea No3A next week. We will add some comments here if we have any questions or suggestions about the big enhancement (#737 and #958). We will also look into #977 and #976 next week, as you mentioned. Thanks to @janekmi and @grom72 for giving us such an explicit design. Best Regards, |
Hi, @janekmi The SRQ idea following No3A is structured and concise. I recently also drafted an implementation of SRQ just before the above discussion in this issue. My idea may not be well designed but I post it here as a reference.

Idea No4

```c
/*
* Basic data structure
*/
struct rpma_peer {
...
struct srq_vec srq_vec; /* a struct to manage srq */
};
struct srq_vec {
struct ibv_srq **data; /* srq pointer array */
int *alloced_map; /* allocation map */
unsigned long max_srq; /* array limit */
unsigned long count; /* used count */
};
/*
* allocate srq
 * This function returns an index into srq_vec.data[]
*/
int
rpma_alloc_srq(struct rpma_peer *peer, unsigned int max_wr,
unsigned int max_sge, unsigned int srq_limit);
/*
* free srq
*/
int
rpma_free_srq(struct rpma_peer *peer, int idx);
/*
* initialize req struct
* srq_idx is the return value of rpma_alloc_srq()
*/
int
rpma_ep_next_conn_req(struct rpma_ep *ep, const struct rpma_conn_cfg *cfg,
struct rpma_conn_req **req_ptr, int srq_idx); /* the last parameter is srq index */
int
rpma_conn_req_new(struct rpma_peer *peer, const char *addr,
const char *port, const struct rpma_conn_cfg *cfg,
struct rpma_conn_req **req_ptr, int srq_idx); /* the last parameter is srq index */
static int
rpma_conn_req_from_id(struct rpma_peer *peer, struct rdma_cm_id *id,
const struct rpma_conn_cfg *cfg, struct rpma_conn_req **req_ptr,
int srq_idx)
{
rpma_peer_create_qp(peer, id, cq, cfg, srq_idx);
/* Save srq index in req struct. We will use it when creating connection */
(*req_ptr)->srq_idx = srq_idx;
}
int
rpma_peer_create_qp(struct rpma_peer *peer, struct rdma_cm_id *id,
struct ibv_cq *cq, const struct rpma_conn_cfg *cfg, int srq_idx)
{
struct ibv_qp_init_attr qp_init_attr;
if (srq_idx != -1) {
qp_init_attr.srq = get_srq_by_idx(peer, srq_idx); /* Use srq */
} else {
qp_init_attr.srq = NULL;
}
rdma_create_qp(id, peer->pd, &qp_init_attr); /* Create qp */
}
/*
* create new connection
* The third argument of rpma_conn_new() is srq_idx saved in req struct
*/
static int
rpma_conn_req_connect_active(struct rpma_conn_req *req,
struct rdma_conn_param *conn_param, struct rpma_conn **conn_ptr)
{
/* pass req->srq_idx into rpma_conn_new() */
rpma_conn_new(req->peer, req->id, req->cq, req->srq_idx, &conn);
}
static int
rpma_conn_req_accept(struct rpma_conn_req *req,
struct rdma_conn_param *conn_param, struct rpma_conn **conn_ptr)
{
/* pass req->srq_idx into rpma_conn_new() */
rpma_conn_new(req->peer, req->id, req->cq, req->srq_idx, &conn);
}
int
rpma_conn_new(struct rpma_peer *peer, struct rdma_cm_id *id,
struct ibv_cq *cq, int srq_idx, struct rpma_conn **conn_ptr)
{
/*
 * if this connection wants to use SRQ, its srq_idx has to be >= 0
* Otherwise, the connection use normal RQ.
*/
conn->srq = (srq_idx != -1 ? get_srq_by_idx(peer, srq_idx) : NULL);
}
/*
* post a receive wr
*/
int
rpma_recv(struct rpma_conn *conn,
struct rpma_mr_local *dst, size_t offset, size_t len,
const void *op_context)
{
rpma_mr_recv(conn->id->qp, conn->srq, /* conn->srq is not null if using SRQ */
dst, offset, len,
op_context);
}
int
rpma_conn_req_recv(struct rpma_conn_req *req,
struct rpma_mr_local *dst, size_t offset, size_t len,
const void *op_context)
{
if (req->srq_idx == -1)
srq = NULL;
else
srq = get_srq_by_idx(req->peer, req->srq_idx);
rpma_mr_recv(req->id->qp, srq, /* srq is not NULL if using SRQ */
dst, offset, len,
op_context);
}
int
rpma_mr_recv(struct ibv_qp *qp, struct ibv_srq *srq,
struct rpma_mr_local *dst, size_t offset,
size_t len, const void *op_context)
{
if (srq) /* Use SRQ */
ret = ibv_post_srq_recv(srq, &wr, &bad_wr);
else /* Use normal QP */
ret = ibv_post_recv(qp, &wr, &bad_wr);
}
```
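From the caller's point of view, the intended usage is something like this (illustration only; the address, port, and size values are made up):

```c
struct rpma_conn_req *req = NULL;

/* allocate an SRQ inside the peer; the return value is an index */
int srq_idx = rpma_alloc_srq(peer, 64 /* max_wr */, 1 /* max_sge */, 0 /* srq_limit */);

/* pass the index when creating a connection request;
 * -1 means "use a normal RQ instead" */
rpma_conn_req_new(peer, "192.168.0.1", "7204", cfg, &req, srq_idx);

/* release the SRQ when no connection uses it anymore */
rpma_free_srq(peer, srq_idx);
```
|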
Hi @ultra-nexus. Thank you for joining the conversation. :-) I have called your proposal Idea No4. (I know you were the first to propose an API for SRQ but since it was done in a separate PR #685 I think it will be easier to follow the numbering this way.)

Idea No4

Comments
|
Hi, @janekmi Thanks for your reply.
Yes. From the point of view of encapsulation, a
Thanks. This makes sense. I didn't consider the multithreaded scenario.
If I understand correctly, we delegate the SRQ management stuff to the application and librpma is only responsible for allocation and freeing.
If we introduce a |
Yes. The user application is responsible for all the resource management. The librpma library knows how to allocate/free resources but the application has to tell the library to do so.
In general, in the librpma library we tend to add infixes which indicate what entity a given function is related to, e.g. _conn_ for all functions related to |
Hi @janekmi After reading Idea No3A about the separate receive CQ again, I want to confirm that I understand the design correctly.
Best Regards, |
|
Hi @janekmi Is it OK to take use of
I want to unify the internal logic of both the new APIs and the old APIs. Best Regards,
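For example, the old entry points could become thin wrappers over the new rpma_cq_* ones (a sketch, assuming struct rpma_conn keeps an internal pointer to its rpma_cq):

```c
/* old API kept as-is for applications, delegating to the new one;
 * conn->cq is an assumed internal member */
int
rpma_conn_completion_wait(struct rpma_conn *conn)
{
	return rpma_cq_wait(conn->cq);
}

int
rpma_conn_completion_get(struct rpma_conn *conn, struct rpma_completion *cmpl)
{
	return rpma_cq_get(conn->cq, cmpl);
}
```
|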
Do you expect the following design?
If |
I think it would be handy if one could use the new APIs in both cases. But it is a nice-to-have feature, depending on how hard it will be to do so. |
Hi @janekmi When I am writing the implementation of Can we define
Why do you want to expose |
You are right. The
One of many arguments may be that it plays nicely with the SRQ proposition also stated above. Ref: #737 (comment) |
I plan to implement the separate CQ in two steps. Step 1: Add struct rpma_cq and some related functions, like this:
I have created PR #1035 for step 1. Step 2: Add the separate RCQ, like this:
For Step 2, I plan to drop
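Either way, the intended usage of Step 2 might look like this (a sketch based on the Idea No3A declarations above; the size is arbitrary):

```c
struct rpma_conn_cfg *cfg = NULL;
struct rpma_cq *rcq = NULL;
struct rpma_completion cmpl;

rpma_conn_cfg_new(&cfg);
rpma_conn_cfg_set_rcq_size(cfg, 16); /* 0 would mean no separate RCQ */
/* ... establish conn using cfg ... */

rpma_conn_get_rcq(conn, &rcq);
rpma_cq_wait(rcq);
rpma_cq_get(rcq, &cmpl); /* receive completions only */
```
|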
I will rather expect the end-user to cache the rpma_cq (see the sketch at the end of this comment). With the approach you argue for, you are "closing" possible pathways of developing the librpma API in the future. E.g. there is no nice way to have a shared CQ for multiple connections which may be a real use case in the future, assuming RPMEM with messaging will get wider adoption as it tends to now. Whereas by making public the
SRQ plays nicely with RPMEM because it is the easiest way to share a single PMEM space among multiple connections without the hassle of managing interactions between them. You just register the whole PMEM space for RECV. Post the buffers to SRQ. And you are done. When a message comes it may be either processed or persisted or left where it is (assuming the initiator knows how to flush its contents to persistency single-sidedly). No matter what you do with the RECVed buffer you just have to post another one to the SRQ. And that's it. You get a multiple-connection-capable persistent log based on RPMEM and SRQ. Isn't it lovely? :-)
It does not break the current implementation. If you have a single CQ you can still use the old API as you have done till this time. But if you have two CQs (one for the SQ and one for the RQ) you can use the new rpma_cq_* API. Note: If you have an application using the new connection configuration capabilities you have to modify your application anyway. On the other hand, we can also agree to just use the new API for consistency, mark the old one as deprecated, and when everything is ready for release 1.0 just drop the old API. It is an option to consider. Have I answered your fears? :-)
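Coming back to caching the rpma_cq: I imagine the end-user doing something like this (a sketch):

```c
/* obtained once, right after the connection is established,
 * and cached by the application */
struct rpma_cq *cq = NULL;  /* send/read/write completions */
struct rpma_cq *rcq = NULL; /* receive completions */
rpma_conn_get_cq(conn, &cq);
rpma_conn_get_rcq(conn, &rcq);

/* ... later, no repeated getter calls are needed ... */
rpma_cq_wait(rcq);
```
|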
I see.
Do you want to use the shared CQ of multiple connections only for SRQ? My previous plan only considered using the shared CQ with SRQ, like this:
2. For SRQ, add
It seems better to use the unified rpma_cq* APIs.
Agreed.
OK.
Yes. |
Although it should just work, I do not want to make it possible just now. I think we should think through all possible consequences before making this decision. So, for now, I'm for having a shared CQ among multiple connections only in the case of using SRQ.
I think this one has already happened: #1035. Sorry for the delayed response.

API discussion

You have proposed to introduce new APIs to handle the CQ for RECVs:

```c
int rpma_conn_get_recv_completion_fd(const struct rpma_conn *conn, int *fd);
int rpma_conn_recv_completion_wait(struct rpma_conn *conn);
int rpma_conn_recv_completion_get(struct rpma_conn *conn, struct rpma_completion *cmpl);
```

But at the same time, you think it will be convenient to have a getter for the SRQ's CQ and a set of functions using this standalone CQ object directly:

```c
int rpma_srq_get_rcq(const struct rpma_srq *srq, struct rpma_cq **cq_ptr);
int rpma_cq_get_fd(const struct rpma_cq *cq, int *fd);
int rpma_cq_wait(struct rpma_cq *cq);
int rpma_cq_get(struct rpma_cq *cq, struct rpma_completion *cmpl);
```

So in this case you will follow both paths at the same time:
@yangx-jy why don't you like the idea to use |
@yangx-jy has let me know offline that he agrees to progress in line with the Idea 3A outlined here: #737 (comment) @yangx-jy I hope I have not misrepresented your position. :-) |
Originally, I didn't want to break the current implementation and it is enough for applications with |
Yes, I agree with
Currently, I will achieve
|
A question: when a separate CQ for receives is not present, what should rpma_conn_get_rcq() return?
I'm currently more and more for what @yangx-jy has proposed in #1080 (the 1st of the possible implementations). I'm trying to imagine how an application using the new APIs will look and I think if it makes use of the CQ for receives it will happen on a separate execution path. Returning a NULL, in this case, will prevent the application from continuing, whereas returning the same CQ as rpma_conn_get_cq() would let it continue. Of course, the application can compare both pointers. @grom72 what do you think? @yangx-jy, @ultra-nexus maybe you have more arguments for or against both behaviours?
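For illustration, the separate execution path I have in mind (a sketch; what happens when no RCQ was configured is exactly the open question):

```c
struct rpma_cq *rcq = NULL;
rpma_conn_get_rcq(conn, &rcq);
if (rcq == NULL) {
	/* no separate RCQ was configured;
	 * RECV completions show up on the main CQ instead */
} else {
	/* a dedicated path for RECV completions */
	rpma_cq_wait(rcq);
}
```
|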
There are workflows in which processing incoming messages might be blocked by unfinished processing of previously sent messages (a send queue slot is not available until its completion is read from the completion queue).
That forces the application to buffer incoming messages until send completions are collected and a slot in the RDMA send queue is freed.
The solution would be to optionally have a separate completion queue for receiving messages.
The receiving completion queue should have a separate completion channel and a separate API to get the completion channel file descriptor. A separate API to collect completions is also required.
In case a receiving completion queue is created, the main API rpma_conn_completion_get should only return the sending completions (send/read/write).
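For example, with two completion channels an application could multiplex both directions in one event loop (a sketch combining the existing rpma_conn_get_completion_fd with the proposed rpma_conn_get_recv_completion_fd; epoll(7) is the standard Linux API):

```c
#include <sys/epoll.h>

int send_fd, recv_fd;
rpma_conn_get_completion_fd(conn, &send_fd);      /* existing API */
rpma_conn_get_recv_completion_fd(conn, &recv_fd); /* proposed API */

int ep = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN };
ev.data.fd = send_fd;
epoll_ctl(ep, EPOLL_CTL_ADD, send_fd, &ev);
ev.data.fd = recv_fd;
epoll_ctl(ep, EPOLL_CTL_ADD, recv_fd, &ev);

struct epoll_event out;
while (epoll_wait(ep, &out, 1, -1) > 0) {
	if (out.data.fd == recv_fd) {
		/* collect RECV completions only */
	} else {
		/* collect send/read/write completions (the main CQ) */
	}
}
```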