Separate completion queue for receiving #737
Idea No1

```c
int rpma_conn_cfg_set_recv_cq_size(struct rpma_conn_cfg *cfg, uint32_t recv_cq_size);
int rpma_conn_cfg_get_recv_cq_size(const struct rpma_conn_cfg *cfg, uint32_t *recv_cq_size);
int rpma_conn_get_recv_completion_fd(const struct rpma_conn *conn, int *fd);
int rpma_conn_recv_completion_wait(struct rpma_conn *conn);
int rpma_conn_recv_completion_get(struct rpma_conn *conn, struct rpma_completion *cmpl);
```
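A sketch of how I imagine an application using Idea No1 (rpma_conn_cfg_new() and struct rpma_completion are the already-existing librpma APIs; conn stands for an established connection; error handling omitted):

```c
struct rpma_conn_cfg *cfg = NULL;
struct rpma_completion cmpl;

rpma_conn_cfg_new(&cfg);
/* ask for a CQ dedicated to receives (proposed API) */
rpma_conn_cfg_set_recv_cq_size(cfg, 16);
/* ... establish conn using cfg ... */

/* wait for and collect a receive completion only (proposed API) */
rpma_conn_recv_completion_wait(conn);
rpma_conn_recv_completion_get(conn, &cmpl);
```

To be discussed if it is a better idea to complement the APIs introduced above with: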
If I understand correctly this is more or less what @grom72 has proposed.

Idea No2

```c
int rpma_cq_new_x(struct rpma_peer *peer, int cq_size, struct rpma_cq **cq);
int rpma_cq_new_y(struct rpma_peer *peer, int send_cq_size, int recv_cq_size, struct rpma_cq **cq);
int rpma_cq_delete(struct rpma_cq **cq);
int rpma_conn_cfg_set_cq(const struct rpma_conn_cfg *cfg, struct rpma_cq *cq);
```

if rpma_cq is set:

```c
int rpma_cq_get_fd(const struct rpma_cq *cq, int *fd);
int rpma_cq_completion_wait(struct rpma_cq *cq);
int rpma_cq_completion_get(struct rpma_cq *cq, struct rpma_completion *cmpl);
int rpma_cq_send_get_fd(const struct rpma_cq *cq, int *fd);
int rpma_cq_recv_get_fd(const struct rpma_cq *cq, int *fd);
int rpma_cq_send_completion_wait(struct rpma_cq *cq);
int rpma_cq_recv_completion_wait(struct rpma_cq *cq);
int rpma_cq_send_completion_get(struct rpma_cq *cq, struct rpma_completion *cmpl);
int rpma_cq_recv_completion_get(struct rpma_cq *cq, struct rpma_completion *cmpl);
```

Risks:
Weaknesses:
|
Idea No2B
```c
int rpma_dcq_new(struct rpma_peer *peer, int send_cq_size, int recv_cq_size, struct rpma_dcq **dcq);
```
Why is it superior compared to Idea No2?
Weaknesses?
|
Is it necessary to expose the detailed process of creating a CQ to the application? I like idea No1 for three reasons:
For idea No1, I want to keep the rpma_conn_completion* APIs because they have two roles:
If you agree with idea No1, I will try to implement it. Best Regards, |
Nice to have you join this conversation @yangx-jy! Returning to the issue after a few days I think I'm close to your position. I hope to have a strong opinion by tomorrow. |
Hi @janekmi If you don't have any objection, I will try to implement the separate RCQ next week as idea No1 describes, because I still think idea No1 makes the separate-RCQ APIs easy for an application to use. Best Regards, |
@yangx-jy I won't stop you. But I want to keep this discussion rolling till we reach a consensus. Today I had a long and fruitful discussion with @grom72 (our beloved architect). We plan another session tomorrow but I want to keep you up to date with the state of the discussion. Interestingly enough, we have decided to discuss this topic in the context of #958. So I strongly encourage @ultra-nexus to take a look at what is going on here.

Idea No3A

This is a combination of Idea No1 and Idea No2, where we do introduce a separate entity:

```c
// by default 0 - no separate recv CQ
int rpma_conn_cfg_set_rcq_size(const struct rpma_conn_cfg *cfg, uint32_t cq_size);
struct rpma_cq;
// if no separate recv CQ return NULL; with an error?
int rpma_conn_get_rcq(const struct rpma_conn *conn, struct rpma_cq **cq_ptr);
int rpma_conn_get_cq(const struct rpma_conn *conn, struct rpma_cq **cq_ptr);
int rpma_cq_get_fd(const struct rpma_cq *cq, int *fd);
int rpma_cq_wait(struct rpma_cq *cq);
int rpma_cq_get(struct rpma_cq *cq, struct rpma_completion *cmpl);
```

At the same time, we have discussed possible SRQ interfaces for librpma and we came up with the following:

```c
struct rpma_srq;
int rpma_srq_new(struct rpma_peer *peer, uint32_t rq_size, struct rpma_srq **srq_ptr);
// overwrites rpma_conn_cfg_set_rq_size()
int rpma_conn_cfg_set_srq(const struct rpma_conn_cfg *cfg, struct rpma_srq *srq);
int rpma_srq_recv(struct rpma_srq *srq, struct rpma_mr_local *dst, size_t offset, size_t len, const void *op_context);
```

Note: @grom72 and I are thinking about giving SRQ a possibility to have a separate CQ for receives which will allow collecting all completions from the SRQ with a single rpma_cq.

Idea No3B

Very similar to what you see above but with an explicit rpma_cq_new():

```c
struct rpma_cq;
int rpma_cq_new(struct rpma_peer *peer, uint32_t cq_size, struct rpma_cq **cq_ptr);
```
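To picture the intended flow: an application consuming the separate RCQ under Idea No3A could look like this (my sketch; error handling omitted):

```c
struct rpma_cq *rcq = NULL;
struct rpma_completion cmpl;

/* get the separate CQ for receives (proposed API) */
rpma_conn_get_rcq(conn, &rcq);

/* wait for and collect receive completions only */
rpma_cq_wait(rcq);
rpma_cq_get(rcq, &cmpl);
```
|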
I will also participate in this SCQ discussion, thanks. |
After today's discussion, @grom72 and I have agreed to follow Idea No3A. We have also completed the SRQ part so here it is:

```c
/* an SRQ configuration structure */
struct rpma_srq_cfg;
int rpma_srq_cfg_new(struct rpma_srq_cfg **scfg_ptr);
int rpma_srq_cfg_delete(struct rpma_srq_cfg **scfg_ptr);
int rpma_srq_cfg_set_rq_size(struct rpma_srq_cfg *scfg, uint32_t rq_size);
// rcq_size == 0 means no CQ for receives is provided along with SRQ
int rpma_srq_cfg_set_rcq_size(struct rpma_srq_cfg *scfg, uint32_t rcq_size);
/* the SRQ itself */
struct rpma_srq;
// ibv_create_srq
int rpma_srq_new(struct rpma_peer *peer, struct rpma_srq_cfg *scfg, struct rpma_srq **srq_ptr);
int rpma_srq_delete(struct rpma_srq **srq_ptr);
// overwrites rpma_conn_cfg_set_rq_size()
int rpma_conn_cfg_set_srq(const struct rpma_conn_cfg *cfg, struct rpma_srq *srq);
int rpma_srq_recv(struct rpma_srq *srq, struct rpma_mr_local *dst, size_t offset, size_t len, const void *op_context);
int rpma_srq_get_rcq(const struct rpma_srq *srq, struct rpma_cq **cq_ptr);
/*
Having the rpma_cq you can use it as a single point controlling all CQEs for Receives coming from all connections using this SRQ:
- rpma_cq_get_fd
- rpma_cq_wait
- rpma_cq_get
*/
```

Note: We are still open for discussion if you have any thoughts or comments.

Plan A (Idea No3A)

Since all these changes are interrelated and it is hard to separate them, I think we also have to think a little bit about the order of actions, e.g.:

Stage 1

```c
struct rpma_cq;
// if no separate recv CQ return NULL; with an error?
int rpma_conn_get_rcq(const struct rpma_conn *conn, struct rpma_cq **cq_ptr);
int rpma_conn_get_cq(const struct rpma_conn *conn, struct rpma_cq **cq_ptr);
int rpma_cq_get_fd(const struct rpma_cq *cq, int *fd);
int rpma_cq_wait(struct rpma_cq *cq);
int rpma_cq_get(struct rpma_cq *cq, struct rpma_completion *cmpl);
```

Stage 2A in parallel with Stage 2B

```c
// by default 0 - no separate recv CQ
int rpma_conn_cfg_set_rcq_size(const struct rpma_conn_cfg *cfg, uint32_t cq_size);
```

Note:

Stage 2B in parallel with Stage 2A

```c
/* an SRQ configuration structure */
struct rpma_srq_cfg;
int rpma_srq_cfg_new(struct rpma_srq_cfg **scfg_ptr);
int rpma_srq_cfg_delete(struct rpma_srq_cfg **scfg_ptr);
int rpma_srq_cfg_set_rq_size(struct rpma_srq_cfg *scfg, uint32_t rq_size);
// no rpma_srq_cfg_set_rcq_size() and rpma_srq_get_rcq() in this Stage
/* the SRQ itself */
struct rpma_srq;
int rpma_srq_new(struct rpma_peer *peer, struct rpma_srq_cfg *scfg, struct rpma_srq **srq_ptr);
int rpma_srq_delete(struct rpma_srq **srq_ptr);
// overwrites rpma_conn_cfg_set_rq_size()
int rpma_conn_cfg_set_srq(const struct rpma_conn_cfg *cfg, struct rpma_srq *srq);
int rpma_srq_recv(struct rpma_srq *srq, struct rpma_mr_local *dst, size_t offset, size_t len, const void *op_context);
```

Stage 3

```c
// rcq_size == 0 means no CQ for receives is provided along with SRQ
int rpma_srq_cfg_set_rcq_size(struct rpma_srq_cfg *scfg, uint32_t rcq_size);
int rpma_srq_get_rcq(const struct rpma_srq *srq, struct rpma_cq **cq_ptr);
```

Stage 4

After all of these efforts, I expect nice examples to appear, namely: :-)

Note: It might be a good idea to use RCQ in the GPSPM example:
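To make the SRQ part tangible, here is how I picture a server draining receives from many connections through the single SRQ-attached RCQ (a sketch of the API proposed above; sizes are arbitrary, error handling omitted):

```c
struct rpma_srq_cfg *scfg = NULL;
struct rpma_srq *srq = NULL;
struct rpma_cq *rcq = NULL;
struct rpma_completion cmpl;

rpma_srq_cfg_new(&scfg);
rpma_srq_cfg_set_rq_size(scfg, 64);
rpma_srq_cfg_set_rcq_size(scfg, 64); /* ask for a CQ dedicated to receives */
rpma_srq_new(peer, scfg, &srq);

/* every connection created with rpma_conn_cfg_set_srq(cfg, srq)
 * posts its receives to this single SRQ */
rpma_srq_get_rcq(srq, &rcq);

/* a single point collecting RECV completions from all those connections */
rpma_cq_wait(rcq);
rpma_cq_get(rcq, &cmpl);
```
|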
Very cool. @ultra-nexus and I will try to understand the details of Idea No3A next week. We will add some comments here if we have any questions or suggestions about the big enhancement (#737 and #958). We will also look into #977 and #976 next week, as you mentioned. Thanks to @janekmi and @grom72 for giving us such an explicit design. Best Regards, |
Hi, @janekmi The SRQ idea following No3A is structured and concise. I recently also drafted an implementation of SRQ just before the above discussion in this issue. My idea may not be well designed but I post it here as a reference.

Idea No4

```c
/*
* Basic data structure
*/
struct rpma_peer {
...
struct srq_vec srq_vec; /* a struct to manage srq */
};
struct srq_vec {
struct ibv_srq **data; /* srq pointer array */
int *alloced_map; /* allocation map */
unsigned long max_srq; /* array limit */
unsigned long count; /* used count */
};
/*
* allocate srq
 * This function returns an index into srq_vec.data[]
*/
int
rpma_alloc_srq(struct rpma_peer *peer, unsigned int max_wr,
unsigned int max_sge, unsigned int srq_limit);
/*
* free srq
*/
int
rpma_free_srq(struct rpma_peer *peer, int idx);
/*
* initialize req struct
* srq_idx is the return value of rpma_alloc_srq()
*/
int
rpma_ep_next_conn_req(struct rpma_ep *ep, const struct rpma_conn_cfg *cfg,
struct rpma_conn_req **req_ptr, int srq_idx); /* the last parameter is srq index */
int
rpma_conn_req_new(struct rpma_peer *peer, const char *addr,
const char *port, const struct rpma_conn_cfg *cfg,
struct rpma_conn_req **req_ptr, int srq_idx); /* the last parameter is srq index */
static int
rpma_conn_req_from_id(struct rpma_peer *peer, struct rdma_cm_id *id,
const struct rpma_conn_cfg *cfg, struct rpma_conn_req **req_ptr,
int srq_idx)
{
rpma_peer_create_qp(peer, id, cq, cfg, srq_idx);
/* Save srq index in req struct. We will use it when creating connection */
(*req_ptr)->srq_idx = srq_idx;
}
int
rpma_peer_create_qp(struct rpma_peer *peer, struct rdma_cm_id *id,
struct ibv_cq *cq, const struct rpma_conn_cfg *cfg, int srq_idx)
{
struct ibv_qp_init_attr qp_init_attr;
if (srq_idx != -1) {
qp_init_attr.srq = get_srq_by_idx(peer, srq_idx); /* Use srq */
} else {
qp_init_attr.srq = NULL;
}
rdma_create_qp(id, peer->pd, &qp_init_attr); /* Create qp */
}
/*
* create new connection
* The third argument of rpma_conn_new() is srq_idx saved in req struct
*/
static int
rpma_conn_req_connect_active(struct rpma_conn_req *req,
struct rdma_conn_param *conn_param, struct rpma_conn **conn_ptr)
{
/* pass req->srq_idx into rpma_conn_new() */
rpma_conn_new(req->peer, req->id, req->cq, req->srq_idx, &conn);
}
static int
rpma_conn_req_accept(struct rpma_conn_req *req,
struct rdma_conn_param *conn_param, struct rpma_conn **conn_ptr)
{
/* pass req->srq_idx into rpma_conn_new() */
rpma_conn_new(req->peer, req->id, req->cq, req->srq_idx, &conn);
}
int
rpma_conn_new(struct rpma_peer *peer, struct rdma_cm_id *id,
struct ibv_cq *cq, int srq_idx, struct rpma_conn **conn_ptr)
{
/*
 * if this connection wants to use SRQ, its srq_idx has to be >= 0
* Otherwise, the connection use normal RQ.
*/
conn->srq = (srq_idx != -1 ? get_srq_by_idx(peer, srq_idx) : NULL);
}
/*
* post a receive wr
*/
int
rpma_recv(struct rpma_conn *conn,
struct rpma_mr_local *dst, size_t offset, size_t len,
const void *op_context)
{
rpma_mr_recv(conn->id->qp, conn->srq, /* conn->srq is not null if using SRQ */
dst, offset, len,
op_context);
}
int
rpma_conn_req_recv(struct rpma_conn_req *req,
struct rpma_mr_local *dst, size_t offset, size_t len,
const void *op_context)
{
if (req->srq_idx == -1)
srq = NULL;
else
srq = get_srq_by_idx(req->peer, req->srq_idx);
rpma_mr_recv(req->id->qp, srq, /* srq is not NULL if using SRQ */
dst, offset, len,
op_context);
}
int
rpma_mr_recv(struct ibv_qp *qp, struct ibv_srq *srq,
struct rpma_mr_local *dst, size_t offset,
size_t len, const void *op_context)
{
if (srq) /* Use SRQ */
ret = ibv_post_srq_recv(srq, &wr, &bad_wr);
else /* Use normal QP */
ret = ibv_post_recv(qp, &wr, &bad_wr);
}
```
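From the caller's point of view, the intended usage is something like this (illustration only; the address, port, and size values are made up):

```c
struct rpma_conn_req *req = NULL;

/* allocate an SRQ inside the peer; the return value is an index */
int srq_idx = rpma_alloc_srq(peer, 64 /* max_wr */, 1 /* max_sge */, 0 /* srq_limit */);

/* pass the index when creating a connection request;
 * -1 means "use a normal RQ instead" */
rpma_conn_req_new(peer, "192.168.0.1", "7204", cfg, &req, srq_idx);

/* release the SRQ when no connection uses it anymore */
rpma_free_srq(peer, srq_idx);
```
|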
Hi @ultra-nexus. Thank you for joining the conversation. :-) I have called your proposal Idea No4. (I know you were the first to propose an API for SRQ but since it was done in a separate PR #685 I think it will be easier to follow the numbering this way.)

Idea No4

Comments
|
Hi, @janekmi Thanks for your reply.
Yes. From the point of view of encapsulation, a
Thanks. This makes sense. I didn't consider the multithreaded scenario.
If I understand correctly, we delegate the SRQ management stuff to the application and librpma is only responsible for allocation and freeing.
If we introduce a |
Yes. The user application is responsible for all the resource management. The librpma library knows how to allocate/free resources but the application has to tell the library to do so.
In general, in the librpma library we tend to add infixes which indicate what entity a given function is related to, e.g. _conn_ for all functions related to |
Hi @janekmi After reading Idea No3A about the separate receive CQ again, I want to confirm that I understand the design correctly.
Best Regards, |
|
Hi @janekmi Is it OK to take use of
I want to unify the internal logic of both the new APIs and the old APIs. Best Regards,
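For example, the old entry points could become thin wrappers over the new rpma_cq_* ones (a sketch, assuming struct rpma_conn keeps an internal pointer to its rpma_cq):

```c
/* old API kept as-is for applications, delegating to the new one;
 * conn->cq is an assumed internal member */
int
rpma_conn_completion_wait(struct rpma_conn *conn)
{
	return rpma_cq_wait(conn->cq);
}

int
rpma_conn_completion_get(struct rpma_conn *conn, struct rpma_completion *cmpl)
{
	return rpma_cq_get(conn->cq, cmpl);
}
```
|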
Do you expect the following design?
If |
I think it would be handy if one could use the new APIs in both cases. But it is a nice-to-have feature, depending on how hard it will be to do so. |
Hi @janekmi When I am writing the implementation of Can we define
Why do you want to expose |
You are right. The
One of many arguments may be that it plays nicely with the SRQ proposition also stated above. Ref: #737 (comment) |
I plan to implement the separate CQ in two steps. Step 1: Add struct rpma_cq and some related functions, like this:
I have created PR #1035 for step 1. Step 2: Add the separate RCQ, like this:
For Step 2, I plan to drop
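Either way, the intended usage of Step 2 might look like this (a sketch based on the Idea No3A declarations above; the size is arbitrary):

```c
struct rpma_conn_cfg *cfg = NULL;
struct rpma_cq *rcq = NULL;
struct rpma_completion cmpl;

rpma_conn_cfg_new(&cfg);
rpma_conn_cfg_set_rcq_size(cfg, 16); /* 0 would mean no separate RCQ */
/* ... establish conn using cfg ... */

rpma_conn_get_rcq(conn, &rcq);
rpma_cq_wait(rcq);
rpma_cq_get(rcq, &cmpl); /* receive completions only */
```
|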
I will rather expect the end-user to cache the rpma_cq (see the sketch at the end of this comment). With the approach you argue for, you are "closing" possible pathways of developing the librpma API in the future. E.g. there is no nice way to have a shared CQ for multiple connections which may be a real use case in the future, assuming RPMEM with messaging will get wider adoption as it tends to now. Whereas by making public the
SRQ plays nicely with RPMEM because it is the easiest way to share a single PMEM space among multiple connections without the hassle of managing interactions between them. You just register the whole PMEM space for RECV. Post the buffers to SRQ. And you are done. When a message comes it may be either processed or persisted or left where it is (assuming the initiator knows how to flush its contents to persistency single-sidedly). No matter what you do with the RECVed buffer you just have to post another one to the SRQ. And that's it. You get a multiple-connection-capable persistent log based on RPMEM and SRQ. Isn't it lovely? :-)
It does not break the current implementation. If you have a single CQ you can still use the old API as you have done till this time. But if you have two CQs (one for the SQ and one for the RQ) you can use the new rpma_cq_* API. Note: If you have an application using the new connection configuration capabilities you have to modify your application anyway. On the other hand, we can also agree to just use the new API for consistency, mark the old one as deprecated, and when everything is ready for release 1.0 just drop the old API. It is an option to consider. Have I answered your fears? :-)
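Coming back to caching the rpma_cq: I imagine the end-user doing something like this (a sketch):

```c
/* obtained once, right after the connection is established,
 * and cached by the application */
struct rpma_cq *cq = NULL;  /* send/read/write completions */
struct rpma_cq *rcq = NULL; /* receive completions */
rpma_conn_get_cq(conn, &cq);
rpma_conn_get_rcq(conn, &rcq);

/* ... later, no repeated getter calls are needed ... */
rpma_cq_wait(rcq);
```
|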
I see.
Do you want to use the shared CQ of multiple connections only for SRQ? My previous plan only considered using the shared CQ with SRQ, like this:
2. For SRQ, add
It seems better to use the unified rpma_cq* APIs.
Agreed.
OK.
Yes. |
Although it should just work, I do not want to make it possible just now. I think we should think through all possible consequences before making this decision. So, for now, I'm for having a shared CQ among multiple connections only in the case of using SRQ.
I think this one has already happened: #1035. Sorry for the delayed response.

API discussion

You have proposed to introduce new APIs to handle the CQ for RECVs:

```c
int rpma_conn_get_recv_completion_fd(const struct rpma_conn *conn, int *fd);
int rpma_conn_recv_completion_wait(struct rpma_conn *conn);
int rpma_conn_recv_completion_get(struct rpma_conn *conn, struct rpma_completion *cmpl);
```

But at the same time, you think it will be convenient to have a getter for the SRQ's CQ and a set of functions using this standalone CQ object directly:

```c
int rpma_srq_get_rcq(const struct rpma_srq *srq, struct rpma_cq **cq_ptr);
int rpma_cq_get_fd(const struct rpma_cq *cq, int *fd);
int rpma_cq_wait(struct rpma_cq *cq);
int rpma_cq_get(struct rpma_cq *cq, struct rpma_completion *cmpl);
```

So in this case you will follow both paths at the same time:
@yangx-jy why don't you like the idea to use |
@yangx-jy has let me know offline that he agrees to progress in line with the Idea 3A outlined here: #737 (comment) @yangx-jy I hope I have not misrepresented your position. :-) |
Originally, I didn't want to break the current implementation and it is enough for applications with |
Yes, I agree with
Currently, I will achieve
|
A question: when a separate CQ for receives is not present, what should rpma_conn_get_rcq() return?
I'm currently more and more for what @yangx-jy has proposed in #1080 (the 1st of the possible implementations). I'm trying to imagine how an application using the new APIs will look and I think if it makes use of the CQ for receives it will happen on a separate execution path. Returning a NULL, in this case, will prevent the application from continuing, whereas returning the same CQ as rpma_conn_get_cq() would let it continue. Of course, the application can compare both pointers. @grom72 what do you think? @yangx-jy, @ultra-nexus maybe you have more arguments for or against both behaviours?
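For illustration, the separate execution path I have in mind (a sketch; what happens when no RCQ was configured is exactly the open question):

```c
struct rpma_cq *rcq = NULL;
rpma_conn_get_rcq(conn, &rcq);
if (rcq == NULL) {
	/* no separate RCQ was configured;
	 * RECV completions show up on the main CQ instead */
} else {
	/* a dedicated path for RECV completions */
	rpma_cq_wait(rcq);
}
```
|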
There are workflows in which processing incoming messages might be blocked by unfinished processing of previously sent messages (a send queue slot is not available until its completion is read from the completion queue).
That forces the application to buffer incoming messages until send completions are collected and a slot in the RDMA send queue is freed.
The solution would be to optionally have a separate completion queue for receiving messages.
The receiving completion queue should have a separate completion channel and a separate API to get the completion channel file descriptor. A separate API to collect completions is also required.
In case a receiving completion queue is created, the main API rpma_conn_completion_get should only return the sending completions (send/read/write).
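For example, with two completion channels an application could multiplex both directions in one event loop (a sketch combining the existing rpma_conn_get_completion_fd with the proposed rpma_conn_get_recv_completion_fd; epoll(7) is the standard Linux API):

```c
#include <sys/epoll.h>

int send_fd, recv_fd;
rpma_conn_get_completion_fd(conn, &send_fd);      /* existing API */
rpma_conn_get_recv_completion_fd(conn, &recv_fd); /* proposed API */

int ep = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN };
ev.data.fd = send_fd;
epoll_ctl(ep, EPOLL_CTL_ADD, send_fd, &ev);
ev.data.fd = recv_fd;
epoll_ctl(ep, EPOLL_CTL_ADD, recv_fd, &ev);

struct epoll_event out;
while (epoll_wait(ep, &out, 1, -1) > 0) {
	if (out.data.fd == recv_fd) {
		/* collect RECV completions only */
	} else {
		/* collect send/read/write completions (the main CQ) */
	}
}
```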