
Annotation API #24

Closed
wants to merge 3 commits into from

Conversation

bentsherman
Member

@bentsherman bentsherman commented Dec 3, 2023

This PR uses an experimental annotation API (Nextflow PR coming soon) to define processes.

Notable changes:

  • Processes are defined as regular functions with a @ProcessFn annotation that defines additional process configuration.
  • The function parameters define the process inputs, while the inputs section in the annotation only defines how to stage files / environment vars / stdin based on those parameters.
  • Similarly, the outputs section separates the staging out of files / envs / stdout from the outputs definition with emit.
  • The annotation uses the same builder DSL to define directives, inputs, and outputs.
  • script=true denotes that the process generates script tasks, in which case the function should return the task script.
  • Processes are invoked exactly as before: each input channel is mapped to a function argument.
  • Since the process inputs can be statically typed, the sample tuple is replaced with a Sample class for better type checking.
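
Putting the bullets above together, a process under this API might look like the following minimal sketch. The annotation and builder names follow the description above; the Sample fields and the fastqc invocation are illustrative only, not taken from the actual PR:

```groovy
// Hypothetical sketch of the experimental annotation API described above;
// the Sample fields and the fastqc command are illustrative only.
@ProcessFn(
    directives={
        tag { sample.id }
    },
    inputs={
        // stage the sample's read files into the task directory
        path { sample.reads }
    },
    outputs={
        // stage 'fastqc_logs' out of the task and emit it as the output
        path('$logs', 'fastqc_logs')
        emit { path('$logs') }
    },
    script=true  // the function returns the task script
)
def FASTQC(Sample sample) {
    """
    mkdir fastqc_logs
    fastqc -o fastqc_logs -q ${sample.reads}
    """
}
```

Invocation would be unchanged per the bullets above: FASTQC(samples_ch) maps the input channel onto the sample argument.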

Signed-off-by: Ben Sherman <[email protected]>
@Midnighter

What is the goal of this experimentation? I understand that static typing is a bonus, and I like having a well-defined value object for input/output. A huge downside that I see is that this is even further away from a pure DSL and closer to just writing pure Groovy, thus limiting the audience of who can read/write workflows in this way. I would say that the Nextflow DSL already exposes a lot of Groovy to pipeline authors, which increases the complexity and limits accessibility.

@bentsherman
Member Author

I agree that the DSL already exposes a lot of Groovy code, and I would love for it to become more declarative. I think most people want to either use a purely declarative syntax or use a programming language that they're already familiar with, but the DSL sits in a weird middle ground that satisfies no one.

The goal of this annotation API then is to provide the pure Groovy approach, while at the same time we could work to make the DSL more declarative. This way, users can use one approach or the other based on what they are more comfortable with, and they can even be mixed within the same pipeline (i.e. you might write your pipeline with the DSL but use modules written in pure Groovy).

In other words, the programmatic approach "frees" the DSL to become more declarative without the burden of having to support every possible use case. If the DSL doesn't support some particular edge case, you can always fall back to the programmatic approach.

As for static types, I have a few ideas for how to also add them to the DSL, but there are a few design trade-offs to consider. The annotations here support static types basically by being more verbose, so in the DSL we would want something more concise.

@bentsherman bentsherman changed the title Use annotation API Annotation API Dec 12, 2023
@mahesh-panchal

I have to admit, it feels harder to read.

@WorkflowFn(main=true)
def MAIN() {

This syntax also makes me feel like it's two separate commands.

I like that the process section scopes are more visible, but the input definition is now split across two places: the syntax in the annotation, and the arguments to def, which take the channels. This is going to be very confusing to newcomers.

I'm also really against this syntax:

    outputs={
        path '$file0', { pair.id }
        emit { path('$file0') }
    },

It's much more verbose and I don't see any benefit to writing in this way over the current way.

Also, what would be the syntax for channels with tuple input? Would that be replaced with the record type?
I would really like to see how the process constructs the record object for output. In this example it also seems like there's a one-to-one correspondence between channels and def arguments, but I wonder what this would look like if an input channel were, say, a list of record objects, e.g. [ TAXON( name:'species', id:'1234' ), READ( group:'rg1', path:'/path/to/reads' ) ], or would these necessarily have to be composed together such that TAXON and READ are fields of some collection record?

Also there's still the two object types path and Path which can add to the confusion. I think if type checking is going to be implemented, one needs to take away this source of potential confusion.

@mahesh-panchal

I guess I missed that this wasn't intended for newcomers, but while I may be competent in Nextflow, a lot of the people I work with are not as experienced, and they need to be able to understand what I write too.

@bentsherman
Member Author

All good points Mahesh. While I don't like the added verbosity of the inputs / outputs either, it was done in order to enable arbitrary types.

For inputs, you declare the inputs and their types (the function arguments), and in the inputs section you define what you want to stage into the task in terms of those arguments. So you could have a Sample class with various Path members, like reads, and then you declare which Paths you want to stage, e.g. path { sample.reads } and so on.

For outputs, you declare which envs/files/stdout you want to stage out of the task, and then you declare the outputs, one channel for each emit. The key point is that you can emit anything -- a simple value, tuple, map, custom type -- and you can reference what you staged out by matching key.

This rnaseq-nf isn't complicated enough to illustrate this point, so let me try an example here:

@ProcessFn(
    inputs={
        path('file1.txt') { sample[1] }      // if sample were a tuple [ id, left, right ]
        path('file1.txt') { sample['left'] } // if sample were a map
        path('file1.txt') { sample.left }    // if sample were a record
    },
    outputs={
        path('$file1', 'file1.txt')
        path('$file2', 'file2.txt')
        emit { [ sample.id, path('$file1'), path('$file2') ] }                  // tuple
        emit { [ id: sample.id, left: path('$file1'), right: path('$file2') ] } // map
        emit { new SamplePair( sample.id, path('$file1'), path('$file2') ) }    // record
    })
def PROC(sample) {
    // ...
}

Also there's still the two object types path and Path which can add to the confusion.

path has a different meaning here; it is really a directive rather than a type. We could just as easily have directives like pathIn and pathOut, but that felt weird. Whereas Path is the standard Java type that you're already using when you invoke file() or Channel.fromPath(). You just never have to say Path explicitly, but once we start doing static types, you will.
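
To make the distinction concrete: file() is the existing Nextflow function, while the typed declaration below is a hypothetical future syntax, not something Nextflow supports today:

```groovy
import java.nio.file.Path

// Today: file() already returns a java.nio.file.Path, but the type is implicit
def reads = file('data/sample1.fastq.gz')

// With static types you would spell the same type out explicitly
// (hypothetical future syntax)
Path reads2 = file('data/sample1.fastq.gz')
```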

@mahesh-panchal

Any suggestions for how this might work with reusable modules (i.e. nf-core modules)? Say one workflow defines the Read class with just name and path, but another workflow defines it with name, cohort, path, and strand. Is there a way to pass input field values of a class on to output field values of a class using a predefined process definition? How much would need to change when the process is included into two different workflows where the classes are defined differently? I guess one should use interfaces here for the type. That would also make it easier if the user wants to define the class with a different name.
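
For illustration, the interface idea might look something like this (a hypothetical sketch; Groovy generates getters for properties, so both classes satisfy the interface):

```groovy
import java.nio.file.Path

// Hypothetical: a shared interface capturing only what the module needs
interface ReadLike {
    String getName()
    Path getPath()
}

// Workflow A's minimal type
class Read implements ReadLike {
    String name
    Path path
}

// Workflow B's richer type still satisfies the same contract
class CohortRead implements ReadLike {
    String name
    String cohort
    Path path
    String strand
}
```

A module typed against ReadLike would then accept either class without modification.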

@pditommaso
Member

This is an exploration activity. I don't think it will ever be used for real pipelines.

@bentsherman
Member Author

@mahesh-panchal this is a good question regardless of the annotation API, since we want to have static types for the DSL. That is, how can users share and re-use types without clobbering each other's definitions.

I think in nf-core you would want to follow a similar approach for types as you do with modules. For example, the nf-core/modules repo could have a lib directory with a Groovy script for each type definition, and modules could reference those types. Currently this is the only way to re-use custom types beyond a single module, but we could make them include-able components like processes if it makes sense.

Beyond that, I would say that the community will just have to find alignment around common types. You can always start out by having a custom type for every module and then combining types where it makes sense. Types can extend a common base type or interface where there is shared structure/behavior, but that is a game that should be played with caution.
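
For example, a shared type in the nf-core/modules repo might be a single Groovy script under lib/; the file name and fields here are purely illustrative:

```groovy
// lib/Sample.groovy (hypothetical shared type definition)
import java.nio.file.Path

class Sample {
    String id
    Path left
    Path right
}
```

Modules would then declare typed inputs against it, e.g. def FASTQC(Sample sample).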

@bentsherman
Member Author

bentsherman commented Dec 14, 2023

The goals of this exploration were to (1) improve error messages for syntax errors (2) support static types and (3) explore the trade-offs of writing pipelines in pure Groovy. The annotation API isn't a specific thing that we want to do so much as a vehicle for those goals, and it satisfied all of them. However, I think if we add a programmatic interface it should be in Python. A pure Groovy approach is probably just another weird middle ground.

Thinking about the DSL, I think we can minimize the amount of Groovy code by (1) isolating as much Groovy code as possible to the lib directory while enforcing a simpler syntax for Nextflow scripts (basically processes + workflows + basic Groovy syntax), and (2) studying the nf-core pipelines for gaps in the DSL that are being filled by Groovy and plugging them (see the current work on channel topics, cmd output type, module config). A linter would be much easier to write if we don't have to support the entire Groovy language 😉

@markjschreiber

I think the idea makes some sense. I think the goal of the Nextflow DSL is to make it easy to define a workflow without needing to know Groovy. However, a quick glance at nf-core reveals that many workflows use both, meaning Nextflow developers now need to know both Nextflow and Groovy. Most of the DSLs that I have seen and used augment an existing programming language to make it easier to perform certain tasks, so it is not too surprising that Groovy can leak into Nextflow (it would be hard to prevent).

To truly remove the need to learn two languages, I think you would have to either define Nextflow as a language in its own right (not a DSL), which could certainly be done, or do what is suggested here: write only Groovy and use annotations or some other metaprogramming to define the parts that will become distributed processes, dependencies in the DAG, etc.

},
script=true
)
def MULTIQC(List<Path> files, Path config) {


I appreciate the typing on input channels. It makes me wonder if it would be possible to add a decorator like this:

    input:
    @type:List<Path>
    @description: a list of files output from various processes for multiqc to parse
    path('*') 
    
    output:
    @type:Path
    @description: multiqc html summary
    path('multiqc_report.html')

Added to the current syntax, that could be both checked and used to generate some automatic documentation. It would be especially nice if the type of channel (value or queue) was included. This would be helpful in development and maintenance, I think.

Member Author

The syntax would have to be something like this:

    input:
    path('*'), type: List<Path>, description: 'a list of files output from various processes for multiqc to parse'

But you have the right idea.

As for channels, I think the process definition should be decoupled from whether the input channels are queue or value, because it shouldn't really matter.

@bentsherman
Member Author

Closing since this exploration has concluded. Follow nextflow-io/nextflow#4553 for updates on adding static types to process inputs/outputs.

@bentsherman bentsherman deleted the programmatic-api branch March 24, 2024 15:19
@bentsherman
Member Author

Folks, I have developed a prototype for typed process inputs/outputs which reuses some ideas from this exploration but in the context of the existing DSL. Feel free to check it out and offer any feedback there: #26
