Broken Pipe errors #154

Closed
1 task done
IAmScRay opened this issue May 3, 2024 · 34 comments

Comments

@IAmScRay

IAmScRay commented May 3, 2024

Describe the bug

I managed to register my prover and received TTKOh tokens, but when I test proof generation via curl, I get numerous "task *** panicked" errors no matter which block height I set or which Holesky RPC I use.

I enabled Rust backtrace and here are my logs:

raiko  | Generating proof...
raiko  | WARNING: running SGX in hardware mode!
raiko  | Current directory: "/opt/raiko/bin"
raiko  | 
raiko  | 
raiko  | thread 'tokio-runtime-worker' panicked at /opt/raiko/provers/sgx/prover/src/lib.rs:280:48:
raiko  | Unable to serialize input: Io(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })
raiko  | stack backtrace:
raiko  |    0: rust_begin_unwind
raiko  |              at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:645:5
raiko  |    1: core::panicking::panic_fmt
raiko  |              at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:72:14
raiko  |    2: core::result::unwrap_failed
raiko  |              at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:1654:5
raiko  |    3: core::result::Result<T,E>::expect
raiko  |              at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:1034:23
raiko  |    4: sgx_prover::prove::{{closure}}::{{closure}}
raiko  |              at ./provers/sgx/prover/src/lib.rs:280:9
raiko  |    5: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/task.rs:42:21
raiko  |    6: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/core.rs:328:17
raiko  |    7: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/loom/std/unsafe_cell.rs:16:9
raiko  |    8: tokio::runtime::task::core::Core<T,S>::poll
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/core.rs:317:30
raiko  |    9: tokio::runtime::task::harness::poll_future::{{closure}}
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:485:19
raiko  |   10: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
raiko  |              at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panic/unwind_safe.rs:272:9
raiko  |   11: std::panicking::try::do_call
raiko  |              at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
raiko  |   12: std::panicking::try
raiko  |              at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
raiko  |   13: std::panic::catch_unwind
raiko  |              at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
raiko  |   14: tokio::runtime::task::harness::poll_future
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:473:18
raiko  |   15: tokio::runtime::task::harness::Harness<T,S>::poll_inner
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:208:27
raiko  |   16: tokio::runtime::task::harness::Harness<T,S>::poll
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:153:15
raiko  |   17: tokio::runtime::task::raw::poll
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/raw.rs:271:5
raiko  |   18: tokio::runtime::task::raw::RawTask::poll
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/raw.rs:201:18
raiko  |   19: tokio::runtime::task::UnownedTask<S>::run
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/mod.rs:464:9
raiko  |   20: tokio::runtime::blocking::pool::Task::run
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/pool.rs:159:9
raiko  |   21: tokio::runtime::blocking::pool::Inner::run
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/pool.rs:513:17
raiko  |   22: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
raiko  |              at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/pool.rs:471:13
raiko  | note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
raiko  | Error: task 45 panicked

Because of this error, I am not sure whether my prover is suitable for participation in the Hekla testnet. The tokens remain untouched.

FYI, my prover address is 0x18c2942c51d0947d4d51ddf863e2aa9bc409a241, instance ID - 206.

Steps to reproduce

No response

Spam policy

  • I verify that this issue is NOT SPAM and understand SPAM issues will be closed and reported to GitHub, resulting in ACCOUNT TERMINATION.
@smtmfft
Contributor

smtmfft commented May 7, 2024

If your SGX ID registration succeeded, you are able to run as a prover.
The error has many potential causes. Maybe the bootstrap config changed after you registered (only a guess). We are refactoring the error reporting part now. Anyway, are you using the latest alpha-7 branch and following the doc/tutorial exactly?

@isekaitaiku

isekaitaiku commented May 10, 2024

I also encountered a broken pipe error.
Raiko version: cc0e8d8 (2024/05/10 latest)
Changes made: Increased sgx.max_threads from 16 to 32 in provers/sgx/config/sgx-guest.docker.manifest.template.
Hardware :
OVH Rise3
CPU : Intel Xeon-E 2288G - 8c/16t - 3.7 GHz/5 GHz
RAM : 32 GB ECC 2666 MHz
OS : Ubuntu Server 22.04 LTS "Jammy Jellyfish"

The following command occasionally results in an error, with a roughly 50% occurrence rate.
I set up my own Holesky node. Raiko is running on a different server.

 curl --location 'http://localhost:8080' \
--header 'Content-Type: application/json' \
--data '{
    "jsonrpc": "2.0",
    "method": "proof",
    "params": [
        {
            "proof_type": "sgx",
            "block_number": 107906,
            "rpc": "https://rpc.hekla.taiko.xyz/",
            "l1_rpc": "my holesky node",
            "beacon_rpc": "my holesky node",
            "prover": "0x7b399987d24fc5951f3e94a4cb16e87414bf2229",
            "graffiti": "0x0000000000000000000000000000000000000000000000000000000000000000",
            "sgx": {
                "setup": false,
                "bootstrap": false,
                "prove": true
            }
        }
    ],
    "id": 0
}'

error

kzg check enabled!
Guest program peak memory used: 9.147931 MB
Generating proof...
WARNING: running SGX in hardware mode!
Current directory: "/opt/raiko/bin"

thread 'tokio-runtime-worker' panicked at /opt/raiko/provers/sgx/prover/src/lib.rs:280:48:
Unable to serialize input: Io(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })
Error: task 542 panicked
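
An intermittent failure like the roughly 50% rate described above is easier to compare across machines when it is measured over a fixed number of attempts. The following is a minimal sketch, not part of raiko: it rebuilds the JSON-RPC body from the curl command and counts failed attempts. The helper names (`build_proof_request`, `send`, `run_trials`, `failure_rate`) are hypothetical, and it assumes that a failed proof shows up either as a transport error or as a response carrying an `"error"` member, which this thread does not confirm.

```python
import http.client
import json
import urllib.request


def build_proof_request(block_number, l1_rpc, beacon_rpc, prover, request_id=0):
    """Return the same JSON-RPC body as the curl command in this comment."""
    return {
        "jsonrpc": "2.0",
        "method": "proof",
        "params": [
            {
                "proof_type": "sgx",
                "block_number": block_number,
                "rpc": "https://rpc.hekla.taiko.xyz/",
                "l1_rpc": l1_rpc,
                "beacon_rpc": beacon_rpc,
                "prover": prover,
                "graffiti": "0x" + "00" * 32,
                "sgx": {"setup": False, "bootstrap": False, "prove": True},
            }
        ],
        "id": request_id,
    }


def send(body, url="http://localhost:8080", timeout=600):
    """POST one request and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


def run_trials(body, attempts, url="http://localhost:8080"):
    """Send the same request several times; record transport errors as failures."""
    responses = []
    for _ in range(attempts):
        try:
            responses.append(send(body, url))
        except (OSError, ValueError, http.client.HTTPException) as exc:
            # Assumption: a dropped connection or unreadable reply counts as a failure.
            responses.append({"error": str(exc)})
    return responses


def failure_rate(responses):
    """Fraction of responses that carry an "error" member."""
    if not responses:
        return 0.0
    failed = sum(1 for response in responses if "error" in response)
    return failed / len(responses)
```

For example, `failure_rate(run_trials(build_proof_request(107906, l1, beacon, prover), 10))` would report the share of failed attempts out of ten, where `l1`, `beacon` and `prover` are your own endpoints and address.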

@mihaiciuciu3410

mihaiciuciu3410 commented May 10, 2024

Hey,

Same issue here:

kzg check enabled!
Guest program peak memory used: 1.954458 MB
Generating proof...
WARNING: running SGX in hardware mode!
Current directory: "/opt/raiko/bin"

thread 'tokio-runtime-worker' panicked at /opt/raiko/provers/sgx/prover/src/lib.rs:280:48:
Unable to serialize input: Io(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })
Error: task 217 panicked

prover address: 0xDD8897F0729D1D0E7eCF36Df004562CC4F243E11
sgx instance id: 2534

@smtmfft
Contributor

smtmfft commented May 10, 2024

> Hey,
>
> Same issue here:
>
> kzg check enabled! Guest program peak memory used: 1.954458 MB Generating proof... WARNING: running SGX in hardware mode! Current directory: "/opt/raiko/bin"
>
> thread 'tokio-runtime-worker' panicked at /opt/raiko/provers/sgx/prover/src/lib.rs:280:48: Unable to serialize input: Io(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }) Error: task 217 panicked
>
> prover address: 0xDD8897F0729D1D0E7eCF36Df004562CC4F243E11 sgx instance id: 2534

what is the block num?

@mihaiciuciu3410

> > Hey,
> > Same issue here:
> > kzg check enabled! Guest program peak memory used: 1.954458 MB Generating proof... WARNING: running SGX in hardware mode! Current directory: "/opt/raiko/bin"
> > thread 'tokio-runtime-worker' panicked at /opt/raiko/provers/sgx/prover/src/lib.rs:280:48: Unable to serialize input: Io(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }) Error: task 217 panicked
> > prover address: 0xDD8897F0729D1D0E7eCF36Df004562CC4F243E11 sgx instance id: 2534
>
> what is the block num?

Does it matter which block we interrogate? In this case it was 96419.

@smtmfft
Contributor

smtmfft commented May 10, 2024

> Does it matter which block we interrogate? In this case it was 96419.

Not really, but we can check that block on our side to see if proof generation itself is ok.

@mihaiciuciu3410

> > Does it matter which block we interrogate? In this case it was 96419.
>
> Not really, but we can check that block on our side to see if proof generation itself is ok.

Same issue for 110568, for example.

@mihaiciuciu3410

Do these 2 blocks work for you?

@smtmfft
Contributor

smtmfft commented May 10, 2024

> Do these 2 blocks work for you?

Yes, I used Infura's Holesky RPC and @isekaitaiku's beacon RPC (above) to run these 3 suspicious blocks: 96419, 107906 and 110568. All good. Have you made any other changes?

@mihaiciuciu3410

> > Do these 2 blocks work for you?
>
> Yes, I used Infura's Holesky RPC and @isekaitaiku's beacon RPC (above) to run these 3 suspicious blocks: 96419, 107906 and 110568. All good. Have you made any other changes?

What do you mean when you say changes?

@mihaiciuciu3410

Should my prover address (0xDD8897F0729D1D0E7eCF36Df004562CC4F243E11) hold TTKOh in order for the call to work properly?

@smtmfft
Contributor

smtmfft commented May 10, 2024

> What do you mean when you say changes?

Just changes, like modifications to the manifest file or the source code.

Could you use the prove_block script (in the raiko repo) to do a local test? The command is: prove_block.sh taiko_a7 sgx 110568
[image]
Edit these 2 RPCs before you call it. If there is no panic, then proof generation is OK.

> Should my prover address (0xDD8897F0729D1D0E7eCF36Df004562CC4F243E11) hold TTKOh in order for the call to work properly?

Yes, but this panic happened before that point, so it is unrelated to the current problem.

@isekaitaiku

isekaitaiku commented May 10, 2024

I tried running "prove_block.sh taiko_a7 sgx 110568"
(I changed the l1Rpc and beaconRpc in the script to my own nodes)

As a result, the logs showed progress beyond what I saw in the curl case with block_number=107906.
Out of 10 attempts, 8 were successful and 2 failed.

Error Case

Bootstrap details saved in /root/.config/raiko/config/bootstrap.json
Encrypted private key saved in /root/.config/raiko/secrets/priv.key

thread 'tokio-runtime-worker' panicked at /opt/raiko/provers/sgx/prover/src/lib.rs:280:48:
Unable to serialize input: Io(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })
Error: task 1037 panicked

block_number=110568 3 transactions
block_number=107906 36 transactions
It seems that the more transactions a block contains, the more likely it is to fail.

@mihaiciuciu3410

mihaiciuciu3410 commented May 10, 2024

> > What do you mean when you say changes?
>
> Just changes, like modifications to the manifest file or the source code.

Changes made: Increased sgx.max_threads from 16 to 32 in provers/sgx/config/sgx-guest.docker.manifest.template.

> Could you use the prove_block script (in the raiko repo) to do a local test? The command is: prove_block.sh taiko_a7 sgx 110568 [image] Edit these 2 RPCs before you call it. If there is no panic, then proof generation is OK.

taiko@taiko-testnet-validator:~/raiko$ bash prove_block.sh taiko_a7 sgx 110568

  • proving block 110568
    nothing happens in ~25 minutes

> > Should my prover address (0xDD8897F0729D1D0E7eCF36Df004562CC4F243E11) hold TTKOh in order for the call to work properly?
>
> Yes, but this panic happened before that point, so it is unrelated to the current problem.

@smtmfft
Contributor

smtmfft commented May 11, 2024

> Changes made: Increased sgx.max_threads from 16 to 32 in provers/sgx/config/sgx-guest.docker.manifest.template.

Did you change the config and then redo the SGX setup, including bootstrap?

@smtmfft
Contributor

smtmfft commented May 11, 2024

> It seems that the more transactions a block contains, the more likely it is to fail.

Weird... Can you disable setup and bootstrap in the script and try again?

@isekaitaiku

isekaitaiku commented May 11, 2024

> Did you change the config and then redo the SGX setup, including bootstrap?
> Weird... Can you disable setup and bootstrap in the script and try again?

Here's what I did when updating to the latest version (cc0e8d8).
Please let me know if there are any steps I missed.

  1. Deleted old Docker images.
  2. git pull
  3. Updated the manifest file.
  4. Ran docker compose build --no-cache.
  5. Ran docker compose up init (deleted files created by previous init).
  6. Performed Onchain RA.
  7. Changed the environment variable for the instance ID and started Raiko.

@smtmfft
Contributor

smtmfft commented May 11, 2024

Merged a refinement PR (#182) that shows a better error message.
Hopefully we can see a clear clue now.

@isekaitaiku

Thank you for your assistance.
I have updated to the latest version of Raiko and got it running using the following steps:
I haven't edited the manifest file.

  1. Deleted old Docker images.
  2. git pull
  3. Updated the manifest file.
  4. Ran docker compose build --no-cache.
  5. Ran docker compose up init (deleted files created by previous init).
  6. Performed Onchain RA.
  7. Changed the environment variable for the instance ID and started Raiko.

failed 3/5

Bootstrap details saved in /root/.config/raiko/config/bootstrap.json
Encrypted private key saved in /root/.config/raiko/secrets/priv.key

Error: Can not serialize input for SGX io error: Broken pipe (os error 32), output is Ok(Output { status: ExitStatus(unix_wait_status(256)), stdout: "Starting one shot mode\nGlobal options: GlobalOpts { secrets_dir: \"/root/.config/raiko/secrets\", config_dir: \"/root/.config/raiko/config\" }, OneShot options: OneShotArgs { sgx_instance_id: 456 }\nmemory allocation of 184 bytes failed\n", stderr: "Gramine is starting. Parsing TOML manifest file, this may take some time...\n-----------------------------------------------------------------------------------------------------------------------\nGramine detected the following insecure configurations:\n\n  - loader.insecure__use_cmdline_argv = true   (forwarding command-line args from untrusted host to the app)\n  - sys.insecure__allow_eventfd = true         (host-based eventfd is enabled)\n  - sgx.allowed_files = [ ... ]                (some files are passed through from untrusted host without verification)\n\nGramine will continue application execution, but this configuration must not be used in production!\n-----------------------------------------------------------------------------------------------------------------------\n\n[P1:T14:sgx-guest] error: Out-of-memory in library OS\n" })
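
One reading of the output above, offered as an assumption rather than a confirmed diagnosis: the broken pipe is a symptom, not the cause. If the enclave process exits (here after "Out-of-memory in library OS") before it has read its input, the parent's write to the child's stdin fails with EPIPE (os error 32). The sketch below is not raiko's code. It reproduces the pattern on a POSIX system, with a hypothetical stand-in child that exits without reading stdin, and shows why capturing the child's stderr alongside the write error is what reveals the real failure.

```python
import subprocess
import sys

# Hypothetical stand-in for the enclave process: it reports an error and exits
# without ever reading its stdin.
CRASHING_CHILD = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('Out-of-memory in library OS\\n'); sys.exit(1)",
]


def feed_child(cmd, payload):
    """Write payload to the child's stdin and report what happened.

    Returns (write_error, exit_code, stderr_text). write_error is None when
    the whole payload was accepted.
    """
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    write_error = None
    try:
        proc.stdin.write(payload)
        proc.stdin.flush()
    except BrokenPipeError as exc:
        # The reader is gone; the useful information is in the child's output.
        write_error = exc
    # communicate() closes stdin (ignoring a broken pipe) and collects output.
    _, stderr = proc.communicate()
    return write_error, proc.returncode, stderr.decode()


if __name__ == "__main__":
    # 1 MiB is larger than a typical pipe buffer, so the write cannot complete
    # unless the child actually reads it.
    error, code, stderr = feed_child(CRASHING_CHILD, b"x" * (1 << 20))
    print("write error:", error)
    print("child exit code:", code)
    print("child stderr:", stderr.strip())
```

Under this reading, a larger input (for example a block with more transactions) would not cause the broken pipe directly. It would only make it more likely that the child runs out of memory before it finishes reading.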

@smtmfft
Contributor

smtmfft commented May 11, 2024

"Out-of-memory in library OS"... I have never seen that before.

How much memory do you have?

@isekaitaiku

isekaitaiku commented May 11, 2024

When the prover is not running, it looks like about 18G of memory is left over.

free -h
               total        used        free      shared  buff/cache   available
Mem:            30Gi       2.6Gi       2.6Gi       6.0Mi        25Gi        27Gi
Swap:          1.0Gi        12Mi       1.0Gi

vmstat -s
     32326272 K total memory
      2783340 K used memory
      9474848 K active memory
     **18301852 K inactive memory**
      2612496 K free memory
      1301804 K buffer memory
     25628632 K swap cache
      1048568 K total swap
        12800 K used swap
      1035768 K free swap
      6509879 non-nice user cpu ticks
         2686 nice user cpu ticks
      1543243 system cpu ticks
   1067140957 idle cpu ticks
        49497 IO-wait cpu ticks
            0 IRQ cpu ticks
       243655 softirq cpu ticks
            0 stolen cpu ticks
      8782527 pages paged in
    235608356 pages paged out
         7704 pages swapped in
        12613 pages swapped out
   1187692914 interrupts
   2217490688 CPU context switches
   1714740053 boot time
       513653 forks

@isekaitaiku

OVH Rise3 SGX's settings
image

@smtmfft
Contributor

smtmfft commented May 11, 2024

> OVH Rise3 SGX's settings [image]

Can this be increased, for example to 4G or even more? I remember ours is 16G (half of 32), if I'm not mistaken.

And what is your OVH instance info? Another user reported a rare problem, also on OVH, that is related to an Intel certification failure. Maybe he can switch to an instance like yours.

@isekaitaiku

> Can this be increased, for example to 4G or even more? I remember ours is 16G (half of 32), if I'm not mistaken.

It seems that the memory allocated for SGX is capped at a maximum of 256MB.
Is this value important? Should I contact OVH to try and increase it?

image

> And what is your OVH instance info? Another user reported a rare problem, also on OVH, that is related to an Intel certification failure. Maybe he can switch to an instance like yours.

image

@ryssroad

ryssroad commented May 11, 2024

> Should I contact OVH to try and increase it?

It's a hardware (CPU) limit, I think.

@isekaitaiku

I rented a server capable of allocating 512MB to SGX and ran the Raiko prove_test on it.
block_number=117304, txs_num=80
As a result, the likelihood of errors occurring has drastically decreased (although it still fails due to a broken pipe about 1 in 20 times).

@mihaiciuciu3410

SGX bootstrap stderr: Gramine is starting. Parsing TOML manifest file, this may take some time...
error: AESM service returned error 12; this may indicate that infrastructure for the DCAP attestation requested by Gramine is missing on this machine
error: load_enclave() failed with error: Operation not permitted (EPERM).

What does this error mean?

@davaymne

> SGX bootstrap stderr: Gramine is starting. Parsing TOML manifest file, this may take some time... error: AESM service returned error 12; this may indicate that infrastructure for the DCAP attestation requested by Gramine is missing on this machine error: load_enclave() failed with error: Operation not permitted (EPERM). What does this error mean?

Which server and provider do you use?
I've got exactly the same problem with OVH advanced-6 (CPU: Intel Xeon Gold 6312U).

@davaymne

@isekaitaiku which server did you end up using?

@isekaitaiku

isekaitaiku commented May 13, 2024

> @isekaitaiku which server did you end up using?

OVH advanced-1.
The broken pipe error still occurs occasionally, so I do not recommend it.

@smtmfft
Contributor

smtmfft commented May 13, 2024

> SGX bootstrap stderr: Gramine is starting. Parsing TOML manifest file, this may take some time... error: AESM service returned error 12; this may indicate that infrastructure for the DCAP attestation requested by Gramine is missing on this machine error: load_enclave() failed with error: Operation not permitted (EPERM). What does this error mean?

That means the PCCS server is not working properly. A common cause is an incorrect config. Could you double-check the settings in the section at https://github.com/taikoxyz/raiko/blob/taiko/alpha-7/README_Docker_and_RA.md#raiko-docker?

BTW, which platform are you using? We saw an OVH platform that cannot support PCCS because it is not registered with Intel; see intel/SGXDataCenterAttestationPrimitives#398. Hopefully yours is just an incorrect config.

@liujiufa

> > SGX bootstrap stderr: Gramine is starting. Parsing TOML manifest file, this may take some time... error: AESM service returned error 12; this may indicate that infrastructure for the DCAP attestation requested by Gramine is missing on this machine error: load_enclave() failed with error: Operation not permitted (EPERM). What does this error mean?
>
> That means the PCCS server is not working properly. A common cause is an incorrect config. Could you double-check the settings in the section at https://github.com/taikoxyz/raiko/blob/taiko/alpha-7/README_Docker_and_RA.md#raiko-docker?
>
> BTW, which platform are you using? We saw an OVH platform that cannot support PCCS because it is not registered with Intel; see intel/SGXDataCenterAttestationPrimitives#398. Hopefully yours is just an incorrect config.

That's useful for me, thanks.

@mratsim
Contributor

mratsim commented Jul 4, 2024

Closing: since the issue was raised 2 months ago, we have launched on mainnet, the codebase has undergone numerous changes, and SGX is used in production.

Feel free to comment and we'll reopen for investigation if there is still an issue.

@mratsim mratsim closed this as completed Jul 4, 2024