Errors from bulk analysis #166

Open · avivajpeyi opened this issue Dec 20, 2021 · 11 comments

avivajpeyi commented Dec 20, 2021

From a total of 4.5K TOIs,

avivajpeyi commented Dec 20, 2021

Full log file for all jobs:

http://catalog.tess-atlas.cloud.edu.au/content/toi_notebooks/tess_atlas_runner.log

grepped Error logs

Attached is a file of grepped results from the job logs: errors.txt

Error summary

Here are some details on the errors (2K+ errors):

  • 36 from plotting issues
  • 1323 from TICEntry.load(toi=TOI_NUMBER) (I guess data didn't download correctly? Are these the same as the lk issue?)
  • 352 from sample_prior (dang! I thought I had a try-except block that would have caught this -- see "sampling prior error on returning array for single-transit systems" #146). This must be related
  • 93 have an error relating to if star.density_data_present
  • 5 have an error with %matplotlib inline (why is this an error?)
  • 94 have an error with inference_data = run_inference(planet_transit_model)
  • 279 have an error with planet_transit_model, params = build_planet_transit_model(tic_entry)
  • 275 failed on get_optimized_init_params(planet_transit_model, **params)
  • 13 failed on import os LOL wut
  • 10 failed on if tic_entry.inference_data is None:
  • 7 failed on tic_entry.save_data(inference_data=inference_data)
  • 5 failed on lightcurve_models = compute_variable(
  • 5 failed on tic_entry.display()
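
For reference, here is a rough sketch (not the actual script used) of how a tally like the one above can be rebuilt from the grepped errors.txt, assuming each line looks like grep -H output, i.e. "<log path>:<error text>":

from collections import Counter
import re

counts = Counter()
with open("errors.txt") as f:
    for line in f:
        # drop the "path/to/job.log:" prefix that grep adds, keep the error text
        _, _, message = line.partition(":")
        # collapse numbers so per-TOI variants of the same error group together
        key = re.sub(r"\d+", "<N>", message.strip())
        counts[key] += 1

for key, n in counts.most_common(20):
    print(f"{n:5d}  {key}")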

avivajpeyi commented:

Summary

Over the weekend I ran the following batches independently

  • normal TOIs (4878 TOIs)
  • multi-planet TOIs (130 TOIs)
  • single-transit TOIs (77 TOIs)

Focusing on errors from normal TOIs

"Normal" TOI errors:

2563/4878 failed (~50%)

  • ConnectionError (~48%)
/fred/oz200/avajpeyi/projects/tess-atlas/src/tess_atlas/data/lightcurve_data.py in from_database(cls, tic, outdir)
     40
     41         logger.info("Downloading LightCurveData from MAST")
---> 42         search = lk.search_lightcurve(
     43             target=f"TIC {tic}", mission="TESS", author="SPOC"
     44         )

...
ConnectionError: HTTPSConnectionPool(host='mast.stsci.edu', port=443): Max retries exceeded with url: /portal/Mashup/Mashup.asmx/columnsconfig (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2b47dea1d9a0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
  • ValueError on phase-plot (~100 TOIs) -- see the guard sketch after this list
src/tess_atlas/plotting/extra_plotting/ci.py in plot_xy_binned(x, y, yerr, ax, bins)
     81
     82 def plot_xy_binned(x, y, yerr, ax, bins):
---> 83     bins = np.linspace(min(x), max(x), bins)
     84     denom, _ = np.histogram(x, bins)
     85     num, _ = np.histogram(x, bins, weights=y)

ValueError: min() arg is an empty sequence
  • LinAlgError during sampling surfaces as RuntimeError "Chain failed" (~78)
RuntimeError: Chain 0 failed.

/tmp/ipykernel_124356/3349958539.py in run_inference(model)
      4         sampling_kwargs = dict(tune=2000, draws=2000, chains=2, cores=2)
      5         logger.info(f"Run sampler with kwargs: {sampling_kwargs}")
----> 6         inference_data = pmx.sample(
      7             **sampling_kwargs, start=init_params, return_inferencedata=True
      8         )
...
/fred/oz200/avajpeyi/envs/tess/lib/python3.8/site-packages/pymc3/parallel_sampling.py in recv_draw(processes, timeout)
    357             else:
    358                 error = RuntimeError("Chain %s failed." % proc.chain)
--> 359             raise error from old_error
    360         elif msg[0] == "writing_done":
    361             proc._readable = True
  • LinAlgError during optimization (~13)
LinAlgError: failed to factorize or solve matrix

src/tess_atlas/data/inference_data_tools.py in get_optimized_init_params(model, planet_params, noise_params, stellar_params, period_params, theta, verbose)
    140         theta = pmx.optimize(theta, [noise_params[0]], **kwargs)
    141         theta = pmx.optimize(theta, planet_params, **kwargs)
--> 142         theta = pmx.optimize(theta, noise_params, **kwargs)
    143         theta = pmx.optimize(theta, stellar_params, **kwargs)
    144         theta = pmx.optimize(theta, period_params, **kwargs)
   ...
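
For the plot_xy_binned ValueError above, a minimal guard sketch (not the fix that ended up in tess-atlas) would be to bail out before binning when the phase-folded x array is empty or all-NaN:

import numpy as np

def plot_xy_binned(x, y, yerr, ax, bins):
    x = np.asarray(x)
    # Guard: an empty phase array is what makes min(x) raise
    # "ValueError: min() arg is an empty sequence"; an all-NaN array
    # would instead give non-finite bin edges.
    if x.size == 0 or not np.isfinite(x).any():
        return  # nothing to bin; could also log a warning here
    bins = np.linspace(np.nanmin(x), np.nanmax(x), bins)
    denom, _ = np.histogram(x, bins)
    num, _ = np.histogram(x, bins, weights=y)
    # ... rest of the original function unchanged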
   

avivajpeyi commented:

Maybe I can try to resubmit the ones that had ConnectionErrors and see how that goes.

dfm commented Mar 8, 2022

Great, and then I'd try manually running one or two from each of the other categories to see where the problem is coming from and whether you can fix it by tweaking things.

avivajpeyi commented Mar 8, 2022

On re-running the LK download for the TOIs, I got 2423 TOIs with ConnectionErrors... fewer, but still a large chunk!

See #157

dfm commented Mar 8, 2022

yeah - you're definitely getting throttled. We probably need to put in some more friendly timeouts and back-off strategies so that we don't piss off the archives. You could try catching that exception, adding a sleep for a minute, and then trying again.
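
A minimal sketch of that retry idea, wrapping the lk.search_lightcurve call from the traceback above (the function name and the retry/sleep settings are made up, and catching requests.exceptions.ConnectionError being the right exception is an assumption):

import logging
import time

import lightkurve as lk
import requests

logger = logging.getLogger(__name__)

def search_lightcurve_with_retry(tic, max_attempts=5, base_sleep=60):
    """Retry the MAST query with an increasing sleep between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return lk.search_lightcurve(
                target=f"TIC {tic}", mission="TESS", author="SPOC"
            )
        except requests.exceptions.ConnectionError as e:
            if attempt == max_attempts:
                raise
            sleep_s = base_sleep * attempt  # back off a bit more each time
            logger.warning(
                f"MAST query failed ({e}); retrying in {sleep_s}s "
                f"(attempt {attempt}/{max_attempts})"
            )
            time.sleep(sleep_s)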

avivajpeyi commented:

LinAlgError: see #80

avivajpeyi commented Jun 12, 2022

Reran the TOIs with the various fixes:
[plot attached]

Some things to look into:
1. ~~3026~~ 2833 TOIs should have been analysed -- so there are ~1K missing? Why didn't their jobs start?
--> 1504 had a cgroup out-of-memory error! These weren't able to log their execution time before getting killed by Slurm. #192

2. Of the 1939 analysed TOIs, why did ~586 fail?

  • 354 TOIs had ValueError: supplied range of [nan, nan] is not finite errors. This might be due to an incorrect period guess during optimization (all seem to be flat-lining); see "incorrect t0 guess from exofop" #83

  • Some (e.g. TOI 106, 2431, 2262, 1692, 1121) seem to have just not reached the plotting stage
    -- but why? No errors, and sampling took 3 hours, so there was time left for plotting -- was memory an issue? (YES, this falls under "Out of memory during plot_phase" #192)

  • TOI 2423 failed in the test_model(planet_transit_model) stage -- p_1_lowerbound__ NaN

3. From the 1353 successful runs, are the fits sensible?

Most look good! Some look a bit weird (uncategorised)

  • TOI 224
  • TOI 237
  • TOI 297

Some initial fits still look off.

Looking at the generation logs (the job that downloads TOI data) -- there are 2833 generation logs (so there should be at least 2833 notebooks)?
--> Yes, 2833 notebooks are present

More questions:
4. There should be ~~3026~~ 2833 logs -- we have 2833. Why weren't the ~200 TOI "generation" jobs started?

>>> from tess_atlas.data.exofop import get_toi_list
>>> len(get_toi_list(remove_toi_without_lk=True))
2833
>>> len(get_toi_list(remove_toi_without_lk=False))
5525

Ok so 2833 makes sense!

5. If there are 2833 generation jobs -- why aren't there 2833 TOI notebooks that were run (rather than 1939)?
--> There are 2833 notebooks -- it's just that some of the TOIs (the ones that ran out of mem) were unable to log their runtime because of the rude halt they faced with the mem error
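
For the record, a quick cross-check sketch for questions 4/5 (the directory layout and notebook naming here are assumptions, not the actual tess-atlas output structure):

from pathlib import Path

from tess_atlas.data.exofop import get_toi_list

# assuming get_toi_list returns integer TOI numbers
expected = set(get_toi_list(remove_toi_without_lk=True))

notebook_dir = Path("toi_notebooks")  # hypothetical output directory
present = {
    int(p.stem.split("_")[-1])  # assumes filenames like toi_101.ipynb
    for p in notebook_dir.glob("toi_*.ipynb")
}

missing = sorted(expected - present)
print(f"{len(present)}/{len(expected)} notebooks present, {len(missing)} missing")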

avivajpeyi commented:

Summary from sleuthing:

  1. ATM 2833 TOIs have lk data we can use
  2. 2833/2833 finished downloading the data they need (the generation step succeeded)
  3. 2614/2833 TOIs have saved netcdfs -- i.e. they finished inference (219 TOIs did not save a netcdf)
  4. 1504/2833 TOIs failed in the phase-plotting stage (out of memory)!
  5. 354/2833 TOIs have 'flat-lined' models (maybe the period is wrong?)

avivajpeyi commented Jun 20, 2022

It seems the 219 TOI analyses that did not save netcdfs never completed their pmx.sample step before the job ran out of time? Lots of their logs reach 'sampling' but then nothing else...

~28 of these have a ConnectionResetError:

Example TOI 104: 4328/8000 [4:01:13<3:24:40, Sampling 2 chains, 36 divergences]

See #214
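
A quick way to flag those stalled runs from the logs (a sketch only; the log locations and the assumption that the 'Sampling' progress-bar line is the last thing logged are both guesses):

from pathlib import Path

stalled = []
for log in Path("toi_notebooks").glob("**/*.log"):  # hypothetical log layout
    lines = log.read_text().splitlines()
    # a run whose last logged line is still a sampling progress bar
    # probably hit the walltime mid pmx.sample
    if lines and "Sampling" in lines[-1]:
        stalled.append(log.name)

print(f"{len(stalled)} runs appear to have stalled during sampling")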

avivajpeyi commented Jul 14, 2022

2712/2833 analyses finished!

[screenshot: Screen Shot 2022-07-14 at 12 23 49 am]

Summary of errors

121 Errors:

Execution errors

Need to check the lengths (runtimes) of the jobs that had time/mem errors
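
A sketch for that check via Slurm's accounting database (the job-id list is hypothetical and would need to be collected from the submit logs; sacct's -j, --format, --parsable2 and --noheader options and the Elapsed/MaxRSS/State fields are standard Slurm):

import subprocess

failed_job_ids = ["123456", "123457"]  # hypothetical: collect from the submit logs

for job_id in failed_job_ids:
    out = subprocess.run(
        ["sacct", "-j", job_id,
         "--format=JobID,Elapsed,MaxRSS,State",
         "--parsable2", "--noheader"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())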
