Errors from bulk analysis #166

Open · avivajpeyi opened this issue Dec 20, 2021 · 11 comments

avivajpeyi commented Dec 20, 2021

From a total of 4.5K TOIs,

avivajpeyi commented Dec 20, 2021

Full log file for all jobs:

http://catalog.tess-atlas.cloud.edu.au/content/toi_notebooks/tess_atlas_runner.log

grepped Error logs

Attached is a file of grepped results from the job logs: errors.txt

Error summary

Here are some details on the errors (2K+ errors):

  • 36 from plotting issues
  • 1323 from TICEntry.load(toi=TOI_NUMBER) (I guess data didn't download correctly? Are these the same as the lk issue?)
  • 352 from sample_prior (dang! I thought I had a try-except block that would have caught this -- see "sampling prior error on returning array for single-transit systems" #146). This must be related
  • 93 have an error relating to if star.density_data_present
  • 5 have an error with %matplotlib inline (why is this an error?)
  • 94 have an error with inference_data = run_inference(planet_transit_model)
  • 279 have an error with planet_transit_model, params = build_planet_transit_model(tic_entry)
  • 275 failed on get_optimized_init_params(planet_transit_model, **params)
  • 13 failed on import os LOL wut
  • 10 failed on if tic_entry.inference_data is None:
  • 7 failed on tic_entry.save_data(inference_data=inference_data)
  • 5 failed on lightcurve_models = compute_variable(
  • 5 failed on tic_entry.display()
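
For reference, here is a rough sketch (not the actual script used) of how a tally like the one above can be rebuilt from the grepped errors.txt, assuming each line looks like grep -H output, i.e. "<log path>:<error text>":

from collections import Counter
import re

counts = Counter()
with open("errors.txt") as f:
    for line in f:
        # drop the "path/to/job.log:" prefix that grep adds, keep the error text
        _, _, message = line.partition(":")
        # collapse numbers so per-TOI variants of the same error group together
        key = re.sub(r"\d+", "<N>", message.strip())
        counts[key] += 1

for key, n in counts.most_common(20):
    print(f"{n:5d}  {key}")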

avivajpeyi commented:

Summary

Over the weekend I ran the following batches independently

  • normal TOIs (4878 TOIs)
  • multi-planet TOIs (130 TOIs)
  • single-transit TOIs (77 TOIs)

Focusing on errors from normal TOIs

"Normal" TOI errors:

2563/4878 failed (~50%)

  • ConnectionError (~48%)
/fred/oz200/avajpeyi/projects/tess-atlas/src/tess_atlas/data/lightcurve_data.py in from_database(cls, tic, outdir)
     40
     41         logger.info("Downloading LightCurveData from MAST")
---> 42         search = lk.search_lightcurve(
     43             target=f"TIC {tic}", mission="TESS", author="SPOC"
     44         )

...
ConnectionError: HTTPSConnectionPool(host='mast.stsci.edu', port=443): Max retries exceeded with url: /portal/Mashup/Mashup.asmx/columnsconfig (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2b47dea1d9a0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
  • ValueError on phase-plot (~100 TOIs) -- see the guard sketch after this list
src/tess_atlas/plotting/extra_plotting/ci.py in plot_xy_binned(x, y, yerr, ax, bins)
     81
     82 def plot_xy_binned(x, y, yerr, ax, bins):
---> 83     bins = np.linspace(min(x), max(x), bins)
     84     denom, _ = np.histogram(x, bins)
     85     num, _ = np.histogram(x, bins, weights=y)

ValueError: min() arg is an empty sequence
  • LinAlgError during sampling surfaces as RuntimeError "Chain failed" (~78)
RuntimeError: Chain 0 failed.

/tmp/ipykernel_124356/3349958539.py in run_inference(model)
      4         sampling_kwargs = dict(tune=2000, draws=2000, chains=2, cores=2)
      5         logger.info(f"Run sampler with kwargs: {sampling_kwargs}")
----> 6         inference_data = pmx.sample(
      7             **sampling_kwargs, start=init_params, return_inferencedata=True
      8         )
...
/fred/oz200/avajpeyi/envs/tess/lib/python3.8/site-packages/pymc3/parallel_sampling.py in recv_draw(processes, timeout)
    357             else:
    358                 error = RuntimeError("Chain %s failed." % proc.chain)
--> 359             raise error from old_error
    360         elif msg[0] == "writing_done":
    361             proc._readable = True
  • LinAlgError during optimization (~13)
LinAlgError: failed to factorize or solve matrix

src/tess_atlas/data/inference_data_tools.py in get_optimized_init_params(model, planet_params, noise_params, stellar_params, period_params, theta, verbose)
    140         theta = pmx.optimize(theta, [noise_params[0]], **kwargs)
    141         theta = pmx.optimize(theta, planet_params, **kwargs)
--> 142         theta = pmx.optimize(theta, noise_params, **kwargs)
    143         theta = pmx.optimize(theta, stellar_params, **kwargs)
    144         theta = pmx.optimize(theta, period_params, **kwargs)
   ...
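
For the plot_xy_binned ValueError above, a minimal guard sketch (not the fix that ended up in tess-atlas) would be to bail out before binning when the phase-folded x array is empty or all-NaN:

import numpy as np

def plot_xy_binned(x, y, yerr, ax, bins):
    x = np.asarray(x)
    # Guard: an empty phase array is what makes min(x) raise
    # "ValueError: min() arg is an empty sequence"; an all-NaN array
    # would instead give non-finite bin edges.
    if x.size == 0 or not np.isfinite(x).any():
        return  # nothing to bin; could also log a warning here
    bins = np.linspace(np.nanmin(x), np.nanmax(x), bins)
    denom, _ = np.histogram(x, bins)
    num, _ = np.histogram(x, bins, weights=y)
    # ... rest of the original function unchanged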
   

avivajpeyi commented:

Maybe I can try to resubmit the ones that had ConnectionErrors and see how that goes.

dfm commented Mar 8, 2022

Great, and then I'd try manually running one or two from each of the other categories to see where the problem is coming from and whether you can fix it by tweaking things.

avivajpeyi commented Mar 8, 2022

On re-running the LK download for the TOIs, I got 2423 TOIs with ConnectionErrors... fewer, but still a large chunk!

See #157

dfm commented Mar 8, 2022

yeah - you're definitely getting throttled. We probably need to put in some more friendly timeouts and back-off strategies so that we don't piss off the archives. You could try catching that exception, adding a sleep for a minute, and then trying again.
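
A minimal sketch of that retry idea, wrapping the lk.search_lightcurve call from the traceback above (the function name and the retry/sleep settings are made up, and catching requests.exceptions.ConnectionError being the right exception is an assumption):

import logging
import time

import lightkurve as lk
import requests

logger = logging.getLogger(__name__)

def search_lightcurve_with_retry(tic, max_attempts=5, base_sleep=60):
    """Retry the MAST query with an increasing sleep between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return lk.search_lightcurve(
                target=f"TIC {tic}", mission="TESS", author="SPOC"
            )
        except requests.exceptions.ConnectionError as e:
            if attempt == max_attempts:
                raise
            sleep_s = base_sleep * attempt  # back off a bit more each time
            logger.warning(
                f"MAST query failed ({e}); retrying in {sleep_s}s "
                f"(attempt {attempt}/{max_attempts})"
            )
            time.sleep(sleep_s)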

avivajpeyi commented:

LinAlgError: see #80

avivajpeyi commented Jun 12, 2022

Reran the TOIs with the various fixes:
[plot attached]

Some things to look into:
1. ~~3026~~ 2833 TOIs should have been analysed -- so there are ~1K missing? Why didn't their jobs start?
--> 1504 had a cgroup out-of-memory error! These weren't able to log their execution time before getting killed by Slurm. #192

2. Of the 1939 analysed TOIs, why did ~586 fail?

  • 354 TOIs had ValueError: supplied range of [nan, nan] is not finite errors. This might be due to an incorrect period guess during optimization (all seem to be flat-lining); see "incorrect t0 guess from exofop" #83

  • Some (e.g. TOI 106, 2431, 2262, 1692, 1121) seem to have just not reached the plotting stage
    -- but why? No errors, and sampling took 3 hours, so there was time left for plotting -- was memory an issue? (YES, this falls under "Out of memory during plot_phase" #192)

  • TOI 2423 failed in the test_model(planet_transit_model) stage -- p_1_lowerbound__ NaN

3. From the 1353 successful runs, are the fits sensible?

Most look good! Some look a bit weird (uncategorised)

  • TOI 224
  • TOI 237
  • TOI 297

Some initial fits still look off.

Looking at the generation logs (the job that downloads TOI data) -- there are 2833 generation logs (so there should be at least 2833 notebooks)?
--> Yes, 2833 notebooks are present

More questions:
4. There should be ~~3026~~ 2833 logs -- we have 2833. Why weren't the ~200 TOI "generation" jobs started?

>>> from tess_atlas.data.exofop import get_toi_list
>>> len(get_toi_list(remove_toi_without_lk=True))
2833
>>> len(get_toi_list(remove_toi_without_lk=False))
5525

Ok so 2833 makes sense!

5. If there are 2833 generation jobs -- why aren't there 2833 TOI notebooks that were run (rather than 1939)?
--> There are 2833 notebooks -- it's just that some of the TOIs (the ones that ran out of mem) were unable to log their runtime because of the rude halt they faced with the mem error
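
For the record, a quick cross-check sketch for questions 4/5 (the directory layout and notebook naming here are assumptions, not the actual tess-atlas output structure):

from pathlib import Path

from tess_atlas.data.exofop import get_toi_list

# assuming get_toi_list returns integer TOI numbers
expected = set(get_toi_list(remove_toi_without_lk=True))

notebook_dir = Path("toi_notebooks")  # hypothetical output directory
present = {
    int(p.stem.split("_")[-1])  # assumes filenames like toi_101.ipynb
    for p in notebook_dir.glob("toi_*.ipynb")
}

missing = sorted(expected - present)
print(f"{len(present)}/{len(expected)} notebooks present, {len(missing)} missing")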

avivajpeyi commented:

Summary from sleuthing:

  1. ATM 2833 TOIs have lk data we can use
  2. 2833/2833 finished downloading the data they need (the generation step succeeded)
  3. 2614/2833 TOIs have saved netcdfs -- i.e. they finished inference (219 TOIs did not save a netcdf)
  4. 1504/2833 TOIs failed in the phase-plotting stage (out of memory)!
  5. 354/2833 TOIs have 'flat-lined' models (maybe the period is wrong?)

avivajpeyi commented Jun 20, 2022

It seems the 219 TOI analyses that did not save netcdfs never completed their pmx.sample step before the job ran out of time? Lots of their logs reach 'sampling' but then nothing else...

~28 of these have a ConnectionResetError:

Example TOI 104: 4328/8000 [4:01:13<3:24:40, Sampling 2 chains, 36 divergences]

See #214
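
A quick way to flag those stalled runs from the logs (a sketch only; the log locations and the assumption that the 'Sampling' progress-bar line is the last thing logged are both guesses):

from pathlib import Path

stalled = []
for log in Path("toi_notebooks").glob("**/*.log"):  # hypothetical log layout
    lines = log.read_text().splitlines()
    # a run whose last logged line is still a sampling progress bar
    # probably hit the walltime mid pmx.sample
    if lines and "Sampling" in lines[-1]:
        stalled.append(log.name)

print(f"{len(stalled)} runs appear to have stalled during sampling")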

avivajpeyi commented Jul 14, 2022

2712/2833 analyses finished!

[screenshot: Screen Shot 2022-07-14 at 12 23 49 am]

Summary of errors

121 Errors:

Execution errors

Need to check the lengths (runtimes) of the jobs that had time/mem errors
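
A sketch for that check via Slurm's accounting database (the job-id list is hypothetical and would need to be collected from the submit logs; sacct's -j, --format, --parsable2 and --noheader options and the Elapsed/MaxRSS/State fields are standard Slurm):

import subprocess

failed_job_ids = ["123456", "123457"]  # hypothetical: collect from the submit logs

for job_id in failed_job_ids:
    out = subprocess.run(
        ["sacct", "-j", job_id,
         "--format=JobID,Elapsed,MaxRSS,State",
         "--parsable2", "--noheader"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())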
