Annotation API #24
Signed-off-by: Ben Sherman <[email protected]>
What is the goal of this experimentation? I understand that static typing is a bonus and I like having a well-defined value object for input/output. A huge downside that I see is that this is even further away from a pure DSL and closer to just writing pure Groovy, and thus you are limiting the audience who can read/write workflows this way. I would say that the Nextflow DSL already exposes a lot of Groovy to pipeline authors, which increases the complexity and limits accessibility.
I agree that the DSL already exposes a lot of Groovy code, and I would love for it to become more declarative. I think most people want to either use a purely declarative syntax or use a programming language that they're already familiar with, but the DSL sits in a weird middle ground that satisfies no one. The goal of this annotation API then is to provide the pure Groovy approach, while at the same time we could work to make the DSL more declarative. This way, users can use one approach or the other based on what they are more comfortable with, and they can even be mixed within the same pipeline (i.e. you might write your pipeline with the DSL but use modules written in pure Groovy). In other words, the programmatic approach "frees" the DSL to become more declarative without the burden of having to support every possible use case. If the DSL doesn't support some particular edge case, you can always fall back to the programmatic approach. As for static types, I have a few ideas for how to also add them to the DSL, but there are a few design trade-offs to consider. The annotations here support static types by basically being more verbose, so I think in the DSL we would want something more concise.
I have to admit, it feels harder to read.
This syntax also makes me feel like it's two separate commands. I like that the process section scopes are more visible, but the input definition is now split across two places: the syntax in the annotation, and the arguments to the function itself. I'm also really against this syntax:
It's much more verbose and I don't see any benefit to writing it this way over the current way. Also, what would be the syntax for channels with …? And there's still the two object types …
I guess I missed that this wasn't intended for newcomers, but while I may be competent in Nextflow, a lot of the people I work with are not as experienced, and they need to be able to understand what I write too.
All good points Mahesh. While I don't like the added verbosity of the inputs / outputs either, it was done in order to enable arbitrary types. For inputs, you declare the inputs and their types (the function arguments), and in the inputs section you declare how to stage files / environment vars / stdin based on those parameters. For outputs, you declare which envs/files/stdout you want to stage out of the task, and then you declare the outputs, one channel for each emit. The rnaseq-nf pipeline isn't complicated enough to illustrate this point, so let me try an example here:

@ProcessFn(
    inputs={
        path('file1.txt') { sample[1] }      // if sample were a tuple [ id, left, right ]
        path('file1.txt') { sample['left'] } // if sample were a map
        path('file1.txt') { sample.left }    // if sample were a record
    },
    outputs={
        path('$file1', 'file1.txt')
        path('$file2', 'file2.txt')
        emit { [ sample.id, path('$file1'), path('$file2') ] }                  // tuple
        emit { [ id: sample.id, left: path('$file1'), right: path('$file2') ] } // map
        emit { new SamplePair( sample.id, path('$file1'), path('$file2') ) }    // record
    })
def PROC(sample) {
    // ...
}
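For comparison, roughly the same thing written in the current DSL might look like the sketch below. This is only an illustration; the channel shape and the script body are assumptions, not code from this PR:

process PROC {
    input:
    tuple val(id), path(left), path(right)

    output:
    tuple val(id), path('file1.txt'), path('file2.txt')

    script:
    """
    cp ${left} file1.txt
    cp ${right} file2.txt
    """
}

The contrast is that the DSL folds the staging and the channel shape into one block, whereas the annotation splits them between the function signature and the inputs/outputs sections.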
Any suggestions for how this might go with reusable modules (i.e. nf-core modules)? Say one workflow defines the …
This is an exploration activity. I don't think it will ever be used for real pipelines.
@mahesh-panchal this is a good question regardless of the annotation API, since we want to have static types for the DSL. That is, how can users share and re-use types without clobbering each other's definitions? I think in nf-core you would want to follow a similar approach for types as you do with modules, for example distributing a module's types alongside the module itself. Beyond that, I would say that the community will just have to find alignment around common types. You can always start out by having a custom type for every module and then combine types where it makes sense. Types can extend a common base type or interface where there is shared structure/behavior, but that is a game that should be played with caution.
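To make that last point a bit more concrete, here is a minimal Groovy sketch of a shared base interface with a module-specific type. The names HasId and FastqPair are made up for illustration and are not part of any actual proposal:

import java.nio.file.Path
import groovy.transform.Canonical

// Common interface that community types could agree on (hypothetical)
interface HasId {
    String getId()
}

// Module-specific sample type implementing the shared interface (hypothetical)
@Canonical
class FastqPair implements HasId {
    String id
    Path left
    Path right
}

A module could then declare FastqPair for its own inputs/outputs, while workflow-level code that only needs the sample id could depend on HasId alone.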
The goals of this exploration were to (1) improve error messages for syntax errors, (2) support static types, and (3) explore the trade-offs of writing pipelines in pure Groovy. The annotation API isn't a specific thing that we want to do so much as a vehicle for those goals, and it satisfied all of them. However, I think if we add a programmatic interface it should be in Python; a pure Groovy approach is probably just another weird middle ground. Thinking about the DSL, I think we can minimize the amount of Groovy code by (1) isolating as much Groovy code as possible to the …
I think the idea makes some sense. The goal of the Nextflow DSL is more to make it easy to define a workflow without needing to know Groovy. However, a quick glance at nf-core reveals that many workflows use both, meaning Nextflow developers now need to know both Nextflow and Groovy. Most of the DSLs that I have seen and used augment an existing programming language to make it easier to perform certain tasks, so it is not too surprising that Groovy leaks into Nextflow (it would be hard to prevent it). To truly prevent the need to learn two languages, I think you would have to define Nextflow as a language in its own right (not a DSL). This could certainly be done. The other path would be to do what is suggested here: write only Groovy and use annotations or some other metaprogramming to define the parts that will become distributed processes / dependencies in the DAG, etc.
    },
    script=true
)
def MULTIQC(List<Path> files, Path config) {
I appreciate the typing on input channels. It makes me wonder if it would be possible to add a decorator like this:
input:
@type:List<Path>
@description: a list of files output from various processes for multiqc to parse
path('*')
output:
@type:Path
@description: multiqc html summary
path('multiqc_report.html')
Adding that to the current syntax would give something that could be both checked and used to generate some automatic documentation. It would be especially nice if the type of channel -- value or queue -- were included. This would be helpful in development and maintenance, I think.
The syntax would have to be something like this:
input:
path('*'), type: List<Path>, description: 'a list of files output from various processes for multiqc to parse'
But you have the right idea.
As for channels, I think the process definition should be decoupled from whether the input channels are queue or value, because it shouldn't really matter.
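As a small illustration of that point in the current DSL, the same process can be fed a queue channel or a value channel without its definition changing. The process name and channel contents below are made up:

process FOO {
    input:
    val sample
    val config

    output:
    stdout

    script:
    """
    echo "processing ${sample} with ${config}"
    """
}

workflow {
    samples  = Channel.of('a', 'b', 'c')   // queue channel: FOO runs once per element
    settings = Channel.value('default')    // value channel: reused by every task
    FOO(samples, settings) | view
}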
Closing since this exploration has concluded. Follow nextflow-io/nextflow#4553 for updates on adding static types to process inputs/outputs.
Folks, I have developed a prototype for typed process inputs/outputs which reuses some ideas from this exploration but in the context of the existing DSL. Feel free to check it out and offer any feedback there: #26
This PR uses an experimental annotation API (Nextflow PR coming soon) to define processes.
Notable changes:
- @ProcessFn annotation that defines additional process configuration.
- inputs section in the annotation only defines how to stage files / environment vars / stdin based on the parameters
- emit blocks in the outputs section define the output channels, one channel per emit
- script=true denotes that the process generates script tasks, in which case the function should return the task script
- Sample class for better type checking
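For reference, the pieces listed above fit together roughly as follows. This sketch is assembled from the examples earlier in the thread rather than copied from the PR; the MAKE_REPORT name, file names, and Sample fields are assumptions:

@ProcessFn(
    inputs={
        path('input.fastq') { sample.left }     // stage the left read into the task dir
    },
    outputs={
        path('$report', 'report.html')          // capture report.html from the task dir
        emit { [ sample.id, path('$report') ] } // one output channel: (id, report)
    },
    script=true                                 // the function body returns the task script
)
def MAKE_REPORT(Sample sample) {
    """
    cat input.fastq > report.html
    """
}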