Incorporation of TensorStore C++ autosummary extension #92
I have been browsing the tensorstore docs, and I found it confusing when I click on a cross-reference and it loads a separate page with one member on it. I'm very interested in this C++ autosummary idea because I haven't found a good alternative to breathe that cuts Doxygen out of the workflow.
I agree that for members with just a short description, a separate page doesn't really make sense. I figured that over time the documentation would be expanded, almost everything would have a longer description and examples, and therefore justify a separate page. For example, the documentation for this member is quite long and would not work well without a separate page. Note: the numpy docs put every function on a separate page. But I was also thinking about adding a rule where, if the documentation is short enough, a separate page is not generated. In principle there could also be an option to do something other than separate pages for each entity --- would just have to figure out what that "something else" would be. With the C++ autosummary, there is a first pass that uses libclang to generate a JSON description of the API (this could happen either separately from the documentation build, or as the first step of it), and then the actual autosummary extension reads this JSON file.
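The "short enough" rule mentioned above could be as simple as a threshold on the extracted doc text. A minimal sketch of that idea --- the function name and thresholds here are hypothetical, not part of the extension:

```python
def needs_separate_page(doc: str, max_chars: int = 400, max_lines: int = 6) -> bool:
    """Decide whether an entity's documentation justifies its own page.

    Entities whose docs fit within both thresholds would be rendered
    inline in the group summary instead of on a separate page.
    """
    lines = [line for line in doc.strip().splitlines() if line.strip()]
    return len(doc) > max_chars or len(lines) > max_lines
```

Anything with long prose or multi-line examples trips the threshold and gets its own page; one-line blurbs stay inline.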
Your example does make sense for using a separate page. To be clear, I was confused by the structure of the TOC. It looks like each TOC entry is specific to a class or module? I didn't expect that clicking an item in the secondary TOC would lead me to a subpage. That approach kind of exemplifies #58
I've been looking into the libclang python wrapper. There is a types-clang stub lib for it also. I think the stubs are maintained by the LLVM team, but not all objects are typed properly (namely the …).
The current version that I have implemented only allows …
I have written a test script that strips out all comment-like syntax:

```python
from textwrap import dedent
from typing import List, Optional


def strip_comment(cmt: Optional[str] = None) -> Optional[List[str]]:
    """
    :param cmt: the `Cursor.raw_comment` property.
    :returns: A list of the lines without the surrounding C++ comment syntax.
    """
    if cmt is None:
        return None
    # the first line is never indented (in clang-parsed form),
    # so dedent all subsequent lines only
    first_line_end = cmt.find("\n")
    cmt = cmt[:first_line_end] + dedent(cmt[first_line_end:])
    # split into a list of lines & account for CRLF and LF line endings
    body = [line.rstrip("\r") for line in cmt.splitlines()]
    # strip all the comment syntax out
    if body[0].startswith("//"):
        body = [line.lstrip("/").lstrip("!").lstrip("<").strip() for line in body]
    elif body[0].startswith("/*"):
        body[0] = body[0].lstrip("/").lstrip("*").lstrip("!")
        multi_lined_asterisk = True  # should be ok for single-line comments
        if len(body) > 1:
            line_has_asterisk = [line.startswith("*") for line in body[1:]]
            multi_lined_asterisk = line_has_asterisk.count(True) == len(body) - 1
        body = [
            (line.lstrip("*").lstrip("<").strip() if multi_lined_asterisk else line.strip())
            for line in body
        ]
        body[-1] = body[-1].rstrip("*/").rstrip()
    return body
```

Still testing it out, though...
I found a similar implementation for autodoc-ing C code using libclang... It monkey patches the libclang python binding to expose methods that allow using clang to parse the comments and return them as XML (see monkey patch src here). So, for a function's docstring that looks like

```cpp
/*!
 * \brief A member function.
 * \param c a character.
 * \param n an integer.
 * \exception std::out_of_range parameter is out of range.
 * \return a character pointer.
 */
const char* Test6::member(char c, int n) const
```

the patched clang binding's parsed comment can be interpreted with lxml, which results in

```xml
<Function isInstanceMethod="1" file="path\\to\\tests\\doxygen\\func.hpp" line="23" column="20">
  <Name>member</Name>
  <USR>c:@S@Test6@F@member#C#I#1</USR>
  <Declaration>const char *Test6::member(char c, int n) const</Declaration>
  <Abstract>
    <Para> A member function. </Para>
  </Abstract>
  <Parameters>
    <Parameter>
      <Name>c</Name>
      <Index>0</Index>
      <Direction isExplicit="0">in</Direction>
      <Discussion>
        <Para> a character. </Para>
      </Discussion>
    </Parameter>
    <Parameter>
      <Name>n</Name>
      <Index>1</Index>
      <Direction isExplicit="0">in</Direction>
      <Discussion>
        <Para> an integer. </Para>
      </Discussion>
    </Parameter>
  </Parameters>
  <Exceptions>
    <Para> std::out_of_range parameter is out of range. </Para>
  </Exceptions>
  <ResultDiscussion>
    <Para> a character pointer.</Para>
  </ResultDiscussion>
</Function>
```

This will provide a better way to parse Doxygen-style comments rather than parsing the raw text manually. 👍🏼
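For what it's worth, XML like the output above can also be consumed with the stdlib `xml.etree.ElementTree` rather than lxml. A sketch that pulls the per-parameter docs out of an abridged copy of that comment XML (the helper name is mine):

```python
import xml.etree.ElementTree as ET

# abridged version of the clang comment XML shown above
COMMENT_XML = """\
<Function isInstanceMethod="1" line="23" column="20">
  <Name>member</Name>
  <Parameters>
    <Parameter>
      <Name>c</Name><Index>0</Index>
      <Discussion><Para> a character. </Para></Discussion>
    </Parameter>
    <Parameter>
      <Name>n</Name><Index>1</Index>
      <Discussion><Para> an integer. </Para></Discussion>
    </Parameter>
  </Parameters>
  <ResultDiscussion><Para> a character pointer.</Para></ResultDiscussion>
</Function>"""


def param_docs(comment_xml: str) -> dict:
    """Map each documented parameter name to its description text."""
    root = ET.fromstring(comment_xml)
    return {
        p.findtext("Name"): p.findtext("Discussion/Para", "").strip()
        for p in root.iter("Parameter")
    }
```

Here `param_docs(COMMENT_XML)` yields a plain dict that a Sphinx directive could turn into `:param:` fields.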
That's interesting, I didn't know about the XML comment output from clang. I've been working on putting together an initial/draft version of the cpp apigen based on what I implemented for tensorstore. At this point it is almost ready.
I'm working in a separate repo (local only --- nothing's been pushed). I'm trying to write a replacement for breathe, but there are things that doxygen does and clang doesn't; namely …
I eagerly await your implementation.
I started exploring the XML approach because the doxygen commands can have nested paragraphs (using …).
Doxygen's grouping is probably one of its worst-implemented features. I quickly tested this docstring

```cpp
/**
 * @ingroup group2
 * @brief class C3 in group 2
 */
class C3 {};
```

and got the following parsed XML

```xml
<Class file="path\to\tests\doxygen\group.cpp" line="35" column="7">
  <Name>C3</Name>
  <USR>c:@S@C3</USR>
  <Declaration>class C3 {}</Declaration>
  <Abstract>
    <Para> class C3 in group 2</Para>
  </Abstract>
  <Discussion>
    <Verbatim xml:space="preserve" kind="verbatim"> group2</Verbatim>
  </Discussion>
</Class>
```

So, we'd still have to parse

```cpp
/**
 * @defgroup group2 The Second Group
 * This is the second group
 */
```

ourselves. It also may not consolidate consecutive comment blocks like Doxygen does.
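Pulling group definitions out of comment text by hand wouldn't take much code. A hedged sketch --- the regex and function name are illustrative, not the extension's actual parser:

```python
import re

# matches "@defgroup <id> <title>" (or "\defgroup") inside a doxygen comment
_DEFGROUP = re.compile(r"[@\\]defgroup\s+(\w+)\s+(.*)")


def find_defgroups(source: str) -> dict:
    """Map group ids to their titles for every defgroup in the source."""
    return {m.group(1): m.group(2).strip() for m in _DEFGROUP.finditer(source)}


src = """
/**
 * @defgroup group2 The Second Group
 * This is the second group
 */
"""
```

For the docstring above, `find_defgroups(src)` recovers the `group2` id and its "The Second Group" title; the trailing description line would need a second pass.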
I don't intend to support …
To be clear --- is that XML output from clang or from doxygen? One of the nice things about the way I'm currently parsing doc comments is that line number information is preserved, so that any Sphinx warnings and errors are shown with a location in the original source file. In general, I think it may be sufficient to just support the most common doxygen commands, so that most doc comments don't have to be changed.
What is it that you don't like about doxygen's grouping? The grouping feature I've implemented for the C++ and Python apigen extensions is certainly inspired by doxygen, though there are some differences.
I'm not running Doxygen at all. It's all clang output (using the monkey patch I referenced), then parsed with lxml.
It divides up the summary into "modules" (not to be confused with compilation units). Anything within groups or subgroups won't show in its original namespace's document summary. This makes navigating Doxygen-generated docs that extensively use grouping (for C++ libs) very confusing.
I see you expose filtering options in the config, like the …
The way it currently works is that in the builder-inited event, which happens before Sphinx actually reads any of the rst documents (or even determines the list of rst files), we parse the code, extract the API description as JSON, and generate the separate pages for documenting each entity. Then the directives that are parsed when Sphinx processes the rst files just insert group summaries. By the time the rst directives are parsed, it is too late to generate more files, at least with the current Sphinx build process. The built-in autosummary extension does something hacky to work around this limitation --- it first does its own "parsing" of all the source files to locate any autosummary directives, but this isn't a real parsing of the rst, just some regular expressions. That means it doesn't work for another directive to generate an autosummary directive. In any case, for tensorstore the current configuration method seemed to work pretty well: exclude internal/private APIs, but include basically everything else.
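The generate-pages-early flow described above can be modeled in miniature. All names here are hypothetical (the real extension hooks Sphinx's builder-inited event and writes rst files); this is just the shape of the data flow:

```python
def generate_entity_pages(api: dict) -> dict:
    """Produce one rst page per documented entity, keyed by docname.

    In the real extension this happens during builder-inited, before
    Sphinx collects its list of source files, so the generated pages
    are picked up by the normal build.
    """
    pages = {}
    for name, info in api["entities"].items():
        pages[f"api/{name}"] = f"{name}\n{'=' * len(name)}\n\n{info['doc']}\n"
    return pages


def group_summary(api: dict, group: str) -> str:
    """Build the summary list that a group directive would insert inline."""
    return "\n".join(
        f"- `{name}` -- {info['doc'].splitlines()[0]}"
        for name, info in api["entities"].items()
        if info["group"] == group
    )


# toy API description, standing in for the JSON the first pass emits
api = {"entities": {"Array": {"group": "containers", "doc": "A dense array."}}}
```

The key point the toy model captures: page generation consumes the whole JSON up front, while the directive only ever reads it to emit summaries.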
I didn't see any undoc'd-member or access-specifier filtering. I'd like to add that as well, but sometimes the … In research, I decided to keep a list of parsed USRs and avoid serializing entities (to JSON) with duplicate USR ids (this sped up the serialization a bit). Clang seems to copy the first found docstring for decls or defs, which made having duplicate USRs redundant to me.
Currently there is no access specifier filtering, but entities without a doc comment are skipped. There can be more than one declaration but only one definition within a single translation unit. The information from all occurrences is merged (doc comment, default template arguments).
The idea is that you arrange to include all the headers you want to document in a single translation unit (i.e. unity build), that way every header is parsed only once.
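The merge-by-USR behavior described above might be sketched like this --- the record layout and field names are made up for illustration:

```python
def merge_occurrences(occurrences: list) -> dict:
    """Collapse decls/defs of the same entity (same USR) into one record.

    The first non-empty doc comment wins, mirroring how clang copies the
    first docstring it finds; later occurrences can still contribute
    fields the earlier ones lacked (e.g. default template arguments).
    """
    merged = {}
    for occ in occurrences:
        entry = merged.setdefault(
            occ["usr"], {"usr": occ["usr"], "doc": None, "defaults": {}}
        )
        if entry["doc"] is None and occ.get("doc"):
            entry["doc"] = occ["doc"]
        entry["defaults"].update(occ.get("defaults", {}))
    return merged
```

A declaration carrying the docstring and a definition carrying default template arguments thus end up as a single entity keyed by one USR.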
I often see undoc'd members' names written as if they're self-explanatory (and they could be useful when cross-referencing an undoc'd member).
Some APIs have more than one entrypoint, though. I agree with the concept; sadly, it isn't a perfect world.
I wish clang would expose a way to filter the AST by its relation to the indexed unit.
What do you mean by entrypoint? As long as they can be used together, you can just create a new header that `#include`s them all.
What do you have in mind as far as relativity?
If it isn't directly in the translation unit's …
That could be done with the …
The issue with parsing every public header of a library separately is that there is likely to be huge overlap in the transitive dependencies of those headers, which means a lot of headers will have to be parsed multiple times, making it much slower.
Some projects have a lib and an executable as separate build entrypoints.
This sounds like a good idea for passing in the …
Um, how do you want to collaborate on this? I ask because I know you like to force-push after amending git history. Should I PR any proposals into the cpp-apigen branch? I'm going to patch in support for all doc comment prefixes and a corresponding function to strip the comment syntax (resulting in just the comments' text as a list of lines).
How about I'll stop pushing (force or fast-forward) to this branch, and then we can both make changes by opening additional PRs against this branch?
Sounds fair.
I'm having trouble building on Windows: it keeps failing to find any declarations. I've narrowed it down to the … to

```python
path = path.replace("\\", "/")
```

This took me a while to track down.
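For what it's worth, `pathlib` can express roughly the same normalization without string munging; a small sketch (the helper name is mine):

```python
from pathlib import PureWindowsPath


def normalize_path(path: str) -> str:
    """Normalize a path to forward slashes so comparisons against
    libclang's reported file paths match on Windows too (roughly
    equivalent to path.replace("\\", "/"))."""
    return PureWindowsPath(path).as_posix()
```

`PureWindowsPath` accepts both separator styles, so the helper is a no-op on already-POSIX paths.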
Apparently getting the comments from tokens isn't great. We're only getting single-line comments this way (those that start with `//`). I actually think this may be related to how you're traversing the file (looks like it's going line-by-line, I think). In my research, I was able to use the …
Oh yes, thanks for catching that.
I haven't checked, but I think multiline comments will also appear as tokens. It is true that currently I am querying tokens line by line in order to find the comments that occur before a cursor. Maybe the issue is that if you query with a start location that is within a multiline comment, then it doesn't get returned --- perhaps the start location has to be at or before the start of the multiline comment. It is also possible to just iterate through all the tokens, and then associate them with entities. That would allow finding … Parsing the comments manually from the source files would also not be too hard --- it can be done with regular expressions, I think.
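A regex pass over the raw source is indeed straightforward. A hedged sketch that captures both comment styles along with their starting line numbers (so Sphinx warnings could still point at the source); note it deliberately ignores the corner case of comment markers inside string literals:

```python
import re

# // line comments and /* */ block comments; non-greedy so adjacent
# block comments stay separate
_COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def find_comments(source: str):
    """Yield (1-based line number, comment text) for every comment."""
    for m in _COMMENT.finditer(source):
        line = source.count("\n", 0, m.start()) + 1
        yield line, m.group()


src = "int x;  // counter\n/* block\n   comment */\nint y;\n"
comments = list(find_comments(src))
```

Associating each hit with the nearest following cursor would then replace the line-by-line token queries.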
I was also thinking that the apigen ext should be its own sub-pkg. The api_parse.py file is already over 2000 lines...
I'm ok with ignoring the file-level comments since it doesn't seem applicable from the autodoc perspective (it really is more specific to doxygen). I think some stray comments can also be ignored, like those that define groups in doxygen using … I still think it would be easier to try filling in the … for comments like

```cpp
int var; ///< some doc info about `var`
```

Instead of traversing line by line, we could traverse between cursor locations.
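Pairing a trailing `///<` comment with its declaration can be prototyped per line. A toy sketch (a real version would use cursor extents, as suggested above; the regex and names are illustrative only):

```python
import re

# "<decl>; ///< doc" or "<decl>; //!< doc" on one line
_TRAILING = re.compile(r"^(?P<decl>[^/\n]*?;)\s*//[/!]<\s*(?P<doc>.*)$")


def trailing_docs(source: str) -> dict:
    """Map declaration text to its trailing ///< (or //!<) doc comment."""
    docs = {}
    for line in source.splitlines():
        m = _TRAILING.match(line.strip())
        if m:
            docs[m.group("decl").strip()] = m.group("doc").strip()
    return docs


src = "int var;  ///< some doc info about `var`\n"
```

On `src`, this associates `int var;` with its trailing description, which is the case the line-by-line traversal currently misses.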
Seems reasonable |
I have a TensorStore C++ autosummary also mostly implemented, although not yet released in the TensorStore repository. I would like to potentially incorporate it into this theme, but wanted to find out if that might be useful to anyone else. (If not, I'll just maintain it separately within TensorStore.)
It works similarly to the TensorStore Python autosummary extension, in that every entity is documented on a separate page.
At the top level, entities are organized into groups (like doxygen groups) and e.g. a

```rst
.. cpp-apigen:: group
```

directive is used to insert the summary of the members of a group.