Skip to content
This repository has been archived by the owner on Sep 11, 2023. It is now read-only.

Issue loading trajectory with pyemma.coordinates.source when topology file changes #1541

Closed
AnjaConev opened this issue Feb 18, 2022 · 6 comments
Assignees

Comments

@AnjaConev
Copy link

Hello,

Thanks for this amazing package - it has been very fun to work with and very reliable!

I have encountered a strange issue when loading the trajectory with pyemma.coordinates.source.
(PyEmma version = 2.5.7 on Debian GNU/Linux 9)

Here is the example of what is happening:

I have a trajectory and a reference PDB file (X.xtc and X.pdb). They both have 10 atoms.
I load the trajectory for the first time with:

pyemma.coordinates.source("./X.xtc", top="./X.pdb")

Everything works as expected.

Later in my workflow I have new files with the same names: X.xtc, X.pdb but they now refer to a new trajectory with 18 atoms.
I make the same call to load this new trajectory:

pyemma.coordinates.source("./X.xtc", top="./X.pdb")

It now fails with the error:
ValueError: xyz must be shape (Any, 10, 3). You supplied (1, 18, 3)

If I reload the package and call source again with the new files it works.

It seems like the pdb file name gets cashed somewhere and pyemma.coordinates.source still thinks that we are dealing with the old X.pdb.
I was able to work around this issue without reloading the package by running:
mdtraj_top = mdtraj.load("./X.pdb").topology
pyemma.coordinates.source("./X.xtc", top= mdtraj_top)

The code for reproducing this example and the corresponding files are in example.zip
example.zip

I just wanted to report this as it might be an unexpected behavior and I was stuck for a while on this issue.

My python environment: pip_list.txt

Full error message:
error_message

@thempel
Copy link
Member

thempel commented Feb 21, 2022

Hi, I can reproduce this error locally. I believe it's a problem with the topology cache. (This cache is there because it accelerates loading large sets of trajectories with the same topology.) As far as I see, it is implemented using an LRU-cache here:

@lru_cache(maxsize=32)
def _load(top_file):
return load_topology(top_file)
def load_topology_cached(top_file):
if isinstance(top_file, str):
return _load(top_file)
if isinstance(top_file, Topology):
return top_file
if isinstance(top_file, Trajectory):
return top_file.topology
raise NotImplementedError()

From this code it also becomes clear why your workaround works. Unfortunately I don't think that the cache can be turned off as there is no option in the config. Maybe @clonker knows how to fix the cache?

@clonker
Copy link
Member

clonker commented Feb 21, 2022

I'll take a look at it. 🙂 Thanks for reporting this @AnjaConev!

@clonker clonker self-assigned this Feb 21, 2022
@AnjaConev
Copy link
Author

Thank you both for a quick response!
@thempel that explanation makes sense.
Maybe you can clear the cache when a new call is made to the source api?
I'm not sure if this would mess up other things though :)

@clonker
Copy link
Member

clonker commented Feb 21, 2022

Definitely sounds like a potential fix! Thanks 🚀

clonker added a commit to clonker/PyEMMA that referenced this issue Mar 1, 2022
@clonker
Copy link
Member

clonker commented Mar 1, 2022

Opted for extending the lru cache key by hash of the first MB of the file contents plus last modified and creation date - that should be unique enough 🙂
Although I am pretty sure things should be fixed now, perhaps you (@AnjaConev) can check your script against the devel branch to see if it works for you too.

@AnjaConev
Copy link
Author

Nice approach 😄
I tested it the new version on the examples that were previously failing for me and everything worked well!!
Thanks for your help!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants