Linux Segfaults #4504

BEFH · 2018-03-27T14:40:21Z

We are getting segfaults on some nodes of our cluster, but not on others, when running several pandoc versions.

Our failing nodes all have one of the following processors:

Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz,
Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz

However, it consistently segfaults on some of these nodes but not on others.

It does not segfault on our AMD Opteron(TM) Processor 6276s.

Kernel versions are as follows:
2.6.32-358.23.2.el6.x86_64,
2.6.32-696.18.7.el6.x86_64
2.6.32-358.23.2.el6.x86_64
2.6.32-358.6.2.el6.x86_64

All nodes are on CentOS 6.2

This occurs with the system pandoc (1.19.2.1), a modulecmd version (2.0), and the statically linked, pre-compiled version 2.1.3.

↪ pandoc --version
fish: 'pandoc --version' terminated by signal SIGSEGV (Address boundary error)

It also segfaults in bash:

[fultob01@interactive5 ~]$ pandoc --version
Segmentation fault

With my statically linked pandoc, I get a really nice strace. It looks like pandoc is trying to write to 0x4200000000, which is out of bounds, and bode allows the write but shouldn't, so pandoc segfaults when it attempts to read. I have no idea what the solution is for this, but for now, I'll use mothra or manda. Do you have any idea why bode is allowing pandoc to write to that address?

Here's the strace output from a failing node:

execve("/hpc/users/fultob01/local/bin/pandoc", ["pandoc"], [/* 54 vars */]) = 0
arch_prctl(ARCH_SET_FS, 0x49f0420)      = 0
set_tid_address(0x49f0458)              = 132178
brk(0)                                  = 0x6501000
brk(0x6502000)                          = 0x6502000
brk(0x6505000)                          = 0x6505000
brk(0x6508000)                          = 0x6508000
getrusage(RUSAGE_SELF, {ru_utime={0, 0}, ru_stime={0, 0}, ...}) = 0
sysinfo({uptime=1493582, loads=[17088, 8768, 6176] totalram=67442647040, freeram=27952783360, sharedram=0, bufferram=101777408} t
brk(0x6519000)                          = 0x6519000
mmap(0x4200000000, 1099512676352, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x4200000000
mmap(0x4200000000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---

On one of our working nodes, it looks like this:

execve("/hpc/users/fultob01/local/bin/pandoc", ["pandoc"], [/* 79 vars */]) = 0
arch_prctl(ARCH_SET_FS, 0x49f0420)      = 0
set_tid_address(0x49f0458)              = 113282
brk(0)                                  = 0x5e74000
brk(0x5e75000)                          = 0x5e75000
brk(0x5e78000)                          = 0x5e78000
brk(0x5e7b000)                          = 0x5e7b000
getrusage(RUSAGE_SELF, {ru_utime={0, 0}, ru_stime={0, 1999}, ...}) = 0
sysinfo({uptime=5121149, loads=[10304, 5024, 768] totalram=67440967680, freeram=43350913024, sharedram=0, bufferram=224522240} to
brk(0x5e8c000)                          = 0x5e8c000
mmap(0x4200000000, 1099512676352, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x4200001000, 549756862464, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x4200002000, 274878955520, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x4200003000, 137440002048, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x4200004000, 68720525312, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x4200005000, 34360786944, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x4200005000
munmap(0x4200005000, 1028096)           = 0
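
The sizes in the two traces follow a pattern that a little arithmetic makes visible. Below is a minimal sketch in Python that uses only numbers copied from the `mmap` and `munmap` calls above. Reading the `PROT_NONE` calls as "reserve address space" and the `PROT_READ|PROT_WRITE` call as "commit memory inside the reservation" is an assumption based on the flags, not something confirmed in this thread.

```python
# Sizes copied from the mmap calls in the two strace outputs above.
MIB = 1 << 20
GIB = 1 << 30

reservation_sizes = [
    1099512676352,  # attempted first on both nodes
    549756862464,   # working node only, tried after ENOMEM
    274878955520,
    137440002048,
    68720525312,
    34360786944,    # first size that succeeded on the working node
]

def describe(size):
    """Split a size in bytes into whole GiB plus the remaining bytes."""
    return divmod(size, GIB)

for size in reservation_sizes:
    gib, rest = describe(size)
    print(f"{size:>14} bytes = {gib:>4} GiB + {rest} bytes")

# The munmap on the working node drops the start of the mapping so that
# what remains begins on a 1 MiB boundary.
start = 0x4200005000
trimmed = 1028096
aligned_start = start + trimmed
print(hex(aligned_start), aligned_start % MIB == 0)
```

Every attempted size is a power-of-two number of GiB plus 1 MiB, and the size halves after each `ENOMEM`. On the failing node the first 1 TiB reservation succeeded and the following 1 MiB read-write mapping at the same address failed. On the working node the reservation itself failed until it was down to 32 GiB.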


mb21 · 2018-03-27T15:33:51Z

What OS are you on? Can you try to compile from source on the exact OS you're trying to run it on?

If you're on Fedora, this is probably a duplicate of #4461

BEFH · 2018-03-27T16:31:40Z

I'm on CentOS, which is related to Fedora. I will coordinate with my cluster people.


jgm · 2018-03-27T16:45:53Z

I agree, it would be helpful to know if the problem persists with pandoc compiled on the target system.
On the systems where you get the segfault, what causes it? Does every pandoc command cause it? You mention pandoc --version. Do you also get the segfault with, say, a simple conversion? Does it matter whether you use -s? People have had similar problems on Windows 7; see #4283.

BEFH · 2018-03-27T18:59:17Z

Could you please suggest some simple commands with the files to use? It segfaults even if I run pandoc with no arguments, and it segfaults from rmarkdown.

On a related note, I'm getting an "error 139" on the working nodes for some files. The command run is as follows:

/sc/orga/projects/LOAD/Brian/anaconda3/bin/pandoc +RTS -K512m -RTS Post_imputation.utf8.md --to html --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash --output /sc/orga/projects/LOAD/Brian/projects/ADSP/data/pre_impute_merge/impute_stats/stats/CHARGE_CHS_impStats.html --smart --email-obfuscation none --self-contained --standalone --section-divs --template /hpc/packages/minerva-common/rpackages/3.4.3/site-library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable 'theme:bootstrap' --include-in-header /tmp/101010540.tmpdir/RtmpomNOqg/rmarkdown-str799f60520417.html --mathjax --variable 'mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'

I couldn't find error 139 anywhere in the source, but I have read that it might be related to segfaults.

jgm · 2018-03-27T22:41:11Z

Brian Fulton-Howard writes:

> Could you please suggest some simple commands with the files to use? It segfaults even if I run `pandoc` with no arguments, and it segfaults from rmarkdown.

I was thinking of something like `echo "Hello" | pandoc`.

> On a related note, I'm getting an "error 139" on the working nodes for some files. The command run is as follows: [same command as in the previous comment] I couldn't find error 139 anywhere in the source, but I have read that it might be related to segfaults.

We don't use exit code 139 for anything. You might try simplifying your command line piece by piece to see if you can isolate something that's correlated with the problem. I assume these nodes have enough memory to use 512m of stack space (that's what +RTS -K512m -RTS calls for)? One thing to try is increasing or decreasing this.
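
When simplifying the command line piece by piece, it helps to see for each variant whether the process exited normally or was killed by a signal. The following is a minimal sketch in Python; `run_and_report` is a hypothetical helper name, not part of pandoc or rmarkdown, and it relies on the POSIX behaviour of `subprocess`, which reports death by signal N as a return code of -N.

```python
import signal
import subprocess

def run_and_report(argv, stdin_text=""):
    """Run argv, feed it stdin_text, and describe how the process ended."""
    proc = subprocess.run(argv, input=stdin_text, capture_output=True, text=True)
    if proc.returncode < 0:
        # On POSIX, subprocess reports death by signal N as returncode -N.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with status {proc.returncode}"
```

Assuming a `pandoc` executable is on the `PATH`, the equivalent of `echo "Hello" | pandoc` would be `run_and_report(["pandoc"], "Hello\n")`. Options from the failing command can then be added to the list one at a time.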

BEFH · 2018-03-27T23:30:42Z

So I (with the help of a dedicated sysadmin) have confirmed that when pandoc segfaults immediately, it does so regardless of the simplicity of the command.

I have also confirmed that error 139 is added by GHC and means there is a segfault. Error 139 occurs with some files on the Intel processors that don't immediately segfault. It does not occur on our ancient, slow AMD processors.

Our AMD servers have 256 GB of memory, and the Intel servers have 64 GB.

We are installing Haskell Platform 8.2.2 and compiling the latest pandoc from source, as you requested. We will let you know the results when we manage to compile.
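
For reference, the number 139 mentioned above matches the convention of POSIX shells such as bash, which report a process terminated by signal N as exit status 128 + N. This convention comes from the shell rather than from pandoc. A minimal sketch in Python, with `decode_shell_status` as a hypothetical helper name and signal numbers taken from the platform the script runs on:

```python
import signal

def decode_shell_status(status):
    """Interpret an exit status as reported by a POSIX shell such as bash."""
    if status > 128:
        # Shells report "terminated by signal N" as 128 + N.
        try:
            return f"terminated by {signal.Signals(status - 128).name}"
        except ValueError:
            pass  # not a known signal number on this platform
    return f"exited normally with status {status}"

print(decode_shell_status(139))
```

On Linux, SIGSEGV is signal 11, so 139 - 128 = 11 decodes to a segmentation fault.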

jgm · 2018-03-28T00:11:44Z


Great. My prediction is that the natively compiled version will work. If it still segfaults, then very likely this points to a bug in GHC.

BEFH · 2018-03-28T14:57:19Z

After native compilation, it no longer segfaults. Instead, this:

pandoc: internal error: Unable to commit 1048576 bytes of memory
    (GHC version 8.2.2 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

jgm · 2018-03-28T18:39:20Z

Very strange! You might try compiling with a different version of ghc, such as the latest, ghc 8.4.1. (Linux binaries are available.) If that doesn't fix things, reporting as a GHC bug would be appreciated, I'm sure.

nh2 · 2018-06-14T20:13:55Z

Try running it under gdb --args and see if there's a backtrace from some C code. If you're unlucky, there is none and the crash happens straight from Haskell, but if you're lucky, there's some C code in between.

jgm · 2018-10-12T04:43:40Z

Relevant ghc ticket:
https://ghc.haskell.org/trac/ghc/ticket/15054
Looks like a bug still not fixed in ghc 8.6.1.

billglick · 2019-12-09T18:36:42Z

Are there any known workarounds to prevent the issue?

I'm running into it on several VMs running RHEL 6 with various memory footprints (24GB, 48GB, 52GB, 64GB, etc.) with 50% or more free memory. I can't figure out why it reports "Cannot allocate memory" on some, but works fine on others.
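
The thread does not establish why the allocation fails on some machines that have plenty of free RAM. Two per-machine settings that can make `mmap` return `ENOMEM` regardless of free memory are a per-process address-space limit (`ulimit -v`, i.e. `RLIMIT_AS`) and the kernel's overcommit policy. Whether either applies here is an assumption. The sketch below, in Python with hypothetical helper names, only prints both values so that a working and a failing machine can be compared.

```python
import resource

def address_space_limit():
    """Return the soft RLIMIT_AS for this process, or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    return None if soft == resource.RLIM_INFINITY else soft

def overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the kernel overcommit mode (0, 1 or 2), or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

limit = address_space_limit()
print("address-space limit:", "unlimited" if limit is None else f"{limit} bytes")
print("vm.overcommit_memory:", overcommit_mode())
```

A difference between the two machines in either value would be a lead worth following up, not a diagnosis.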

mahermassoud · 2021-02-18T21:28:25Z

I'm having the same issue, even when I just run the pandoc CLI command with no arguments:

$ pandoc
pandoc: internal error: Unable to commit 1048576 bytes of memory
    (GHC version 8.10.1 for x86_64_unknown_linux)
    Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug
Aborted (core dumped)

jgm · 2021-02-18T22:38:16Z

@mahermassoud please give more information:
How exactly you installed pandoc, what version, what architecture and OS you're using.

I note that the ghc issue linked above is still open.

mahermassoud · 2021-02-18T23:08:05Z

@jgm I believe I'm on Red Hat because yum is installed.

$ uname -a
Linux polaris.pbtech 2.6.32-642.11.1.el6.x86_64 #1 SMP Fri Nov 18 19:25:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

I'm in a totally new conda environment where I installed R and RStudio using
conda install rstudio

No idea which version it is.

Let me know if you need more info

I got a similar issue installing with
pip install pandoc

jgm · 2021-02-19T01:33:32Z

pip install pandoc installs a python library, not the pandoc executable.
Sorry, I can't help more without knowing more details. I have no idea what rstudio is doing to install pandoc, how their pandoc is compiled, etc.

jgm · 2022-10-03T02:58:01Z

Closing; the upstream ghc issue has been fixed for a long time.

mb21 added the platform:linux label Mar 27, 2018

mb21 added the status:waiting-for-upstream label Sep 8, 2019

jgm closed this as completed Oct 3, 2022
