Issue loading trajectory with pyemma.coordinates.source when topology file changes #1541

AnjaConev · 2022-02-18T21:25:51Z

Hello,

Thanks for this amazing package - it has been very fun to work with and very reliable!

I have encountered a strange issue when loading the trajectory with pyemma.coordinates.source.
(PyEmma version = 2.5.7 on Debian GNU/Linux 9)

Here is the example of what is happening:

I have a trajectory and a reference PDB file (X.xtc and X.pdb). They both have 10 atoms.
I load the trajectory for the first time with:

pyemma.coordinates.source("./X.xtc", top="./X.pdb")

Everything works as expected.

Later in my workflow I have new files with the same names (X.xtc and X.pdb), but they now refer to a new trajectory with 18 atoms.
I make the same call to load this new trajectory:

pyemma.coordinates.source("./X.xtc", top="./X.pdb")

It now fails with the error:
ValueError: xyz must be shape (Any, 10, 3). You supplied (1, 18, 3)

If I reload the package and call source again with the new files it works.

It seems like the PDB file name gets cached somewhere, so pyemma.coordinates.source still thinks that we are dealing with the old X.pdb.
I was able to work around this issue without reloading the package by running:
import mdtraj
mdtraj_top = mdtraj.load("./X.pdb").topology
pyemma.coordinates.source("./X.xtc", top=mdtraj_top)

The code for reproducing this example and the corresponding files are in example.zip

I just wanted to report this as it might be an unexpected behavior and I was stuck for a while on this issue.

My python environment: pip_list.txt

Full error message:


thempel · 2022-02-21T16:49:42Z

Hi, I can reproduce this error locally. I believe it's a problem with the topology cache. (This cache exists because it speeds up loading large sets of trajectories that share the same topology.) As far as I can see, it is implemented using an LRU cache here:

PyEMMA/pyemma/coordinates/util/patches.py

Lines 40 to 52 in 8c2bc84

    
    @lru_cache(maxsize=32)
    def _load(top_file):
        return load_topology(top_file)

    def load_topology_cached(top_file):
        if isinstance(top_file, str):
            return _load(top_file)
        if isinstance(top_file, Topology):
            return top_file
        if isinstance(top_file, Trajectory):
            return top_file.topology
        raise NotImplementedError()

From this code it also becomes clear why your workaround works. Unfortunately I don't think that the cache can be turned off as there is no option in the config. Maybe @clonker knows how to fix the cache?

clonker · 2022-02-21T17:10:19Z

I'll take a look at it. 🙂 Thanks for reporting this @AnjaConev!

AnjaConev · 2022-02-21T17:20:09Z

Thank you both for a quick response!
@thempel that explanation makes sense.
Maybe you can clear the cache when a new call is made to the source api?
I'm not sure if this would mess up other things though :)
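A rough sketch of what that idea could look like, assuming a hypothetical `source` wrapper and a dummy `_load` (neither is PyEMMA code). `cache_clear()` is a real method on functions wrapped with `functools.lru_cache`. Within one call the topology is still parsed only once, but nothing is reused across calls.

```python
from functools import lru_cache

parse_log = []  # records every real (uncached) parse


@lru_cache(maxsize=32)
def _load(top_file):
    # Dummy loader (assumption): remember that a parse happened and
    # return how many parses have happened so far.
    parse_log.append(top_file)
    return len(parse_log)


def source(traj_files, top_file):
    # Hypothetical entry point: drop all cached topologies first,
    # then reuse the freshly loaded one for every trajectory.
    _load.cache_clear()
    return [_load(top_file) for _ in traj_files]


first = source(["a.xtc", "b.xtc", "c.xtc"], "X.pdb")
second = source(["a.xtc", "b.xtc", "c.xtc"], "X.pdb")
print(first, second, len(parse_log))
```

The trade-off is that repeated `source` calls with an unchanged topology would pay the parsing cost every time.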

clonker · 2022-02-21T17:21:31Z

Definitely sounds like a potential fix! Thanks 🚀

clonker · 2022-03-01T15:57:19Z

I opted for extending the LRU cache key with a hash of the first MB of the file contents plus the last-modified and creation dates. That should be unique enough 🙂
Although I am pretty sure things should be fixed now, perhaps you (@AnjaConev) can check your script against the devel branch to see if it works for you too.

AnjaConev · 2022-03-04T21:50:35Z

Nice approach 😄
I tested the new version on the examples that were previously failing for me, and everything worked well!
Thanks for your help!

clonker self-assigned this Feb 21, 2022

clonker added a commit to clonker/PyEMMA that referenced this issue Mar 1, 2022

fixes issue markovmodel#1541

f5d0531

clonker mentioned this issue Mar 1, 2022

Improved topology LRU cache #1545

Merged

AnjaConev closed this as completed Mar 4, 2022
