PSI (intersection) run fails #484

Open
lvying0019 opened this issue Jan 2, 2025 · 6 comments

Comments

@lvying0019

Issue Type

Running

Search for existing issues similar to yours

Yes

OS Platform and Distribution

Ubuntu 22.04.1

Kuscia Version

secretpadImage version: 0.11.0b0
secretflowServingImage version: 0.7.0b0
kusciaImage version: 0.12.0b0
secretflowImage version: 1.10.0b1
dataProxyImage version: 0.1.0b0

Deployment

docker

deployment Version

Docker version 20.10.10

App Running type

secretflow

App Running version

secretpadImage version: 0.11.0b0
secretflowServingImage version: 0.7.0b0
kusciaImage version: 0.12.0b0
secretflowImage version: 1.10.0b1
dataProxyImage version: 0.1.0b0

Configuration file used to run kuscia.

Using the all-in-one front-end UI.

What happened and what you expected to happen.

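The PSI (intersection) run failed.
Data size: 1,000,000 × 1,000,000 rows (100W × 100W)
Protocol: ECDH
All other settings left at their defaults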

Kuscia log output.

2025-01-02 08:30:06.531 [error] [channel.cc:Proc:104] SendImpl error [external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '1010', http status code '503', response header '[x-b3-traceid]:[a2cf137d8da22e56];[content-length]:[152];[kuscia-error-message]:[<alice/root-kuscia-autonomy-alice-mpc-hp-prodesk-680-g4-mt100/internal> => <bob/root-kuscia-autonomy-bob-mpc-hp-prodesk-680-g4-mt/external $upstream_reset_before_response_started{remote_connection_failure,delayed_connect_error:_113}$ Service Unavailable>];[x-accel-buffering]:[no];[x-b3-spanid]:[a2cf137d8da22e56];[x-envoy-upstream-service-time]:[3514];[date]:[Thu, 02 Jan 2025 08:30:06 GMT];[server]:[envoy];', response body '', error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 113'
@lanyy9527

Inside the kuscia container, check the pod logs of both parties for this PSI task.

@lvying0019
Author

alice:
2025-01-02T16:23:41.741133012+08:00 stdout F wd: "/tmp/sf_woui-berndnhm-node-35_alice"
2025-01-02T16:23:41.741133902+08:00 stdout F }
2025-01-02T16:23:41.741134612+08:00 stdout F
2025-01-02T16:23:41.741135429+08:00 stdout F --
2025-01-02T16:23:41.741136108+08:00 stdout F
2025-01-02T16:23:41.741139456+08:00 stdout F 2025-01-02 08:23:41,741|alice|INFO|secretflow|entry.py:comp_eval:98|
2025-01-02T16:23:41.741140368+08:00 stdout F --
2025-01-02T16:23:41.741141173+08:00 stdout F cluster_config
2025-01-02T16:23:41.74114188+08:00 stdout F
2025-01-02T16:23:41.741142738+08:00 stdout F desc {
2025-01-02T16:23:41.741143559+08:00 stdout F parties: "bob"
2025-01-02T16:23:41.741144392+08:00 stdout F parties: "alice"
2025-01-02T16:23:41.741145192+08:00 stdout F devices {
2025-01-02T16:23:41.741146035+08:00 stdout F name: "spu"
2025-01-02T16:23:41.741146808+08:00 stdout F type: "spu"
2025-01-02T16:23:41.741147563+08:00 stdout F parties: "bob"
2025-01-02T16:23:41.74114833+08:00 stdout F parties: "alice"
2025-01-02T16:23:41.741149849+08:00 stdout F config: "{"runtime_config":{"protocol":"SEMI2K","field":"FM128"},"link_desc":{"connect_retry_times":60,"connect_retry_interval_ms":1000,"brpc_channel_protocol":"http","brpc_channel_connection_type":"pooled","recv_timeout_ms":1200000,"http_timeout_ms":1200000}}"
2025-01-02T16:23:41.741150692+08:00 stdout F }
2025-01-02T16:23:41.741151548+08:00 stdout F devices {
2025-01-02T16:23:41.741152339+08:00 stdout F name: "heu"
2025-01-02T16:23:41.741153131+08:00 stdout F type: "heu"
2025-01-02T16:23:41.741153918+08:00 stdout F parties: "bob"
2025-01-02T16:23:41.7411547+08:00 stdout F parties: "alice"
2025-01-02T16:23:41.741155526+08:00 stdout F config: "{"mode": "PHEU", "schema": "paillier", "key_size": 2048}"
2025-01-02T16:23:41.741156321+08:00 stdout F }
2025-01-02T16:23:41.741157124+08:00 stdout F ray_fed_config {
2025-01-02T16:23:41.74115886+08:00 stdout F cross_silo_comm_backend: "brpc_link"
2025-01-02T16:23:41.741161002+08:00 stdout F }
2025-01-02T16:23:41.741161865+08:00 stdout F }
2025-01-02T16:23:41.741162679+08:00 stdout F public_config {
2025-01-02T16:23:41.741163445+08:00 stdout F ray_fed_config {
2025-01-02T16:23:41.741164214+08:00 stdout F parties: "bob"
2025-01-02T16:23:41.74116498+08:00 stdout F parties: "alice"
2025-01-02T16:23:41.741165765+08:00 stdout F addresses: "woui-berndnhm-node-35-0-fed.bob.svc:80"
2025-01-02T16:23:41.741166616+08:00 stdout F addresses: "0.0.0.0:25035"
2025-01-02T16:23:41.741167407+08:00 stdout F }
2025-01-02T16:23:41.741168216+08:00 stdout F spu_configs {
2025-01-02T16:23:41.741169002+08:00 stdout F name: "spu"
2025-01-02T16:23:41.741169774+08:00 stdout F parties: "bob"
2025-01-02T16:23:41.741170561+08:00 stdout F parties: "alice"
2025-01-02T16:23:41.741171355+08:00 stdout F addresses: "http://woui-berndnhm-node-35-0-spu.bob.svc:80"
2025-01-02T16:23:41.741172137+08:00 stdout F addresses: "0.0.0.0:25034"
2025-01-02T16:23:41.741172975+08:00 stdout F }
2025-01-02T16:23:41.741173768+08:00 stdout F inference_config {
2025-01-02T16:23:41.741174576+08:00 stdout F parties: "bob"
2025-01-02T16:23:41.741175355+08:00 stdout F parties: "alice"
2025-01-02T16:23:41.741176184+08:00 stdout F addresses: "http://woui-berndnhm-node-35-0-inference.bob.svc"
2025-01-02T16:23:41.741176964+08:00 stdout F addresses: "0.0.0.0:25033"
2025-01-02T16:23:41.741177715+08:00 stdout F }
2025-01-02T16:23:41.741178501+08:00 stdout F }
2025-01-02T16:23:41.74117934+08:00 stdout F private_config {
2025-01-02T16:23:41.741180133+08:00 stdout F self_party: "alice"
2025-01-02T16:23:41.741181002+08:00 stdout F ray_head_addr: "woui-berndnhm-node-35-0-global.alice.svc:25036"
2025-01-02T16:23:41.741181813+08:00 stdout F }
2025-01-02T16:23:41.741182543+08:00 stdout F
2025-01-02T16:23:41.741183334+08:00 stdout F --
2025-01-02T16:23:41.741184035+08:00 stdout F
2025-01-02T16:23:41.741692991+08:00 stdout F 2025-01-02 08:23:41,741|alice|INFO|secretflow|driver.py:init:502| Try init sf in PRODUCTION mode
2025-01-02T16:23:41.741882812+08:00 stderr F 2025-01-02 08:23:41.741 INFO brpc_link.py:101 [alice] -- brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled'}
2025-01-02T16:23:41.743525882+08:00 stderr F I0102 08:23:41.743458 7 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=25035.
2025-01-02T16:23:41.743530624+08:00 stderr F W0102 08:23:41.743466 7 external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are disabled according to ServerOptions.has_builtin_services
2025-01-02T16:23:47.31137649+08:00 stderr F I0102 08:23:47.311111 287 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20250102.082346.7/id.db and ./rpc_data/rpcz/20250102.082346.7/time.db
2025-01-02T16:23:48.074178144+08:00 stderr F 2025-01-02 08:23:48.073 INFO brpc_link.py:127 [alice] -- Succeeded to listen on 0.0.0.0:25035.
100% ▕████████████████████████████████████████████████████████████▏
2025-01-02T16:27:36.549861529+08:00 stderr F [751.437] perfetto.cc:45899 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024 KB, total sessions:1, uid:0 session name: ""
2025-01-02T16:27:36.552461936+08:00 stdout F [2025-01-02 08:27:36.552] [info] [launch.cc:119] PSI config: {"protocol_config":{"protocol":"PROTOCOL_ECDH","role":"ROLE_SENDER","broadcast_result":true,"ecdh_config":{"curve":"CURVE_FOURQ"}},"input_config":{"type":"IO_TYPE_FILE_CSV","path":"/tmp/sf_woui-berndnhm-node-35_alice/7c199448-1489-4d5a-aa91-683eb81dbc1a/100w200f_y_1168552885.csv.deal_null.csv"},"output_config":{"type":"IO_TYPE_FILE_CSV","path":"/tmp/sf_woui-berndnhm-node-35_alice/7c199448-1489-4d5a-aa91-683eb81dbc1a/woui_berndnhm_node_35_output_0.csv"},"keys":["id"],"left_side":"ROLE_RECEIVER"}
2025-01-02T16:27:36.552513855+08:00 stdout F [2025-01-02 08:27:36.552] [info] [sender.cc:43] [EcdhPsiSender::Init] start
2025-01-02T16:27:36.552520468+08:00 stdout F [2025-01-02 08:27:36.552] [info] [interface.cc:78] [AbstractPsiParty::Init] start
2025-01-02T16:27:36.657610741+08:00 stdout F [2025-01-02 08:27:36.657] [info] [interface.cc:136] [AbstractPsiParty::Init][Check csv pre-process] start
2025-01-02T16:27:40.820760608+08:00 stdout F [2025-01-02 08:27:40.820] [info] [csv_checker.cc:243] Executing script to get duplicates: LC_ALL=C tail -n +2 /tmp/ef4869d5-d0b0-48ec-84a9-2c592f4f7e2f.psi_checked | LC_ALL=C sort --parallel=8 --buffer-size=1G --stable | LC_ALL=C uniq -d > /tmp/ef4869d5-d0b0-48ec-84a9-2c592f4f7e2f.psi_checked_duplicates
2025-01-02T16:27:41.032583+08:00 stdout F [2025-01-02 08:27:41.032] [info] [interface.cc:145] [AbstractPsiParty::Init][Check csv pre-process] end
2025-01-02T16:27:50.000901278+08:00 stdout F [2025-01-02 08:27:50.000] [info] [interface.cc:183] [AbstractPsiParty::Init] end
2025-01-02T16:27:50.00093986+08:00 stdout F [2025-01-02 08:27:50.000] [info] [sender.cc:51] [EcdhPsiSender::Init] end
2025-01-02T16:27:50.000949224+08:00 stdout F [2025-01-02 08:27:50.000] [info] [sender.cc:56] [EcdhPsiSender::PreProcess] start
2025-01-02T16:27:50.000957502+08:00 stdout F [2025-01-02 08:27:50.000] [info] [cryptor_selector.cc:69] Using FourQ
2025-01-02T16:27:50.035816131+08:00 stdout F [2025-01-02 08:27:50.035] [info] [sender.cc:93] [EcdhPsiSender::PreProcess] end
2025-01-02T16:27:50.035825942+08:00 stdout F [2025-01-02 08:27:50.035] [info] [sender.cc:98] [EcdhPsiSender::Online] start
2025-01-02T16:27:54.475563845+08:00 stdout F [2025-01-02 08:27:54.475] [info] [arrow_csv_batch_provider.cc:75] Reach the end of csv file /tmp/sf_woui-berndnhm-node-35_alice/7c199448-1489-4d5a-aa91-683eb81dbc1a/100w200f_y_1168552885.csv.deal_null.csv.
2025-01-02T16:27:54.482874994+08:00 stdout F [2025-01-02 08:27:54.482] [info] [thread_pool.cc:30] Create a fixed thread pool with size 7
2025-01-02T16:27:57.709337042+08:00 stdout F [2025-01-02 08:27:57.709] [info] [arrow_csv_batch_provider.cc:75] Reach the end of csv file /tmp/sf_woui-berndnhm-node-35_alice/7c199448-1489-4d5a-aa91-683eb81dbc1a/100w200f_y_1168552885.csv.deal_null.csv.
2025-01-02T16:27:57.72316652+08:00 stdout F [2025-01-02 08:27:57.723] [info] [ecdh_psi.cc:106] MaskSelf:alice --finished, batch_count=1, self_item_count=1000000
2025-01-02T16:27:57.723177388+08:00 stdout F [2025-01-02 08:27:57.723] [info] [ecdh_psi.cc:365] ID alice: MaskSelf finished.
2025-01-02T16:28:32.01609327+08:00 stdout F [2025-01-02 08:28:32.015] [info] [ecdh_psi.cc:212] RecvDualMaskedSelf:alice recv last batch finished, batch_count=1
2025-01-02T16:28:32.016104566+08:00 stdout F [2025-01-02 08:28:32.015] [info] [ecdh_psi.cc:373] ID alice: RecvDualMaskedSelf finished.
2025-01-02T16:28:34.58052961+08:00 stdout F [2025-01-02 08:28:34.580] [info] [ecdh_psi.cc:169] MaskPeer:alice --finished, batch_count=1, peer_item_count=1000000
2025-01-02T16:28:34.580540539+08:00 stdout F [2025-01-02 08:28:34.580] [info] [ecdh_psi.cc:369] ID alice: MaskPeer finished.
2025-01-02T16:28:41.884403661+08:00 stdout F [2025-01-02 08:28:41.884] [info] [sender.cc:121] [EcdhPsiSender::Online] end
2025-01-02T16:28:41.884438863+08:00 stdout F [2025-01-02 08:28:41.884] [info] [sender.cc:126] [EcdhPsiSender::PostProcess] start
2025-01-02T16:28:42.385592279+08:00 stdout F [2025-01-02 08:28:42.385] [info] [sender.cc:143] [EcdhPsiSender::PostProcess] end
2025-01-02T16:28:42.385603504+08:00 stdout F [2025-01-02 08:28:42.385] [info] [interface.cc:188] [AbstractPsiParty::Finalize] start
2025-01-02T16:28:42.385607437+08:00 stdout F [2025-01-02 08:28:42.385] [info] [interface.cc:202] [AbstractPsiParty::Finalize][Generate result] start
2025-01-02T16:28:42.390621635+08:00 stdout F [2025-01-02 08:28:42.390] [info] [key.cc:91] Executing sort scripts: tail -n +2 /tmp/psi_index_e108b52e-a36f-4428-8c1c-806c2d17bd67.csv | LC_ALL=C sort -n --parallel=8 --buffer-size=1G --stable --field-separator=, --key=1,1 >>/tmp/sorted_psi_index_6a1f5d1e-801c-4708-b6dc-708597af2fcf.csv
2025-01-02T16:28:42.562078702+08:00 stdout F [2025-01-02 08:28:42.561] [info] [key.cc:93] Finished sort scripts: tail -n +2 /tmp/psi_index_e108b52e-a36f-4428-8c1c-806c2d17bd67.csv | LC_ALL=C sort -n --parallel=8 --buffer-size=1G --stable --field-separator=, --key=1,1 >>/tmp/sorted_psi_index_6a1f5d1e-801c-4708-b6dc-708597af2fcf.csv, ret=0
2025-01-02T16:29:01.385632513+08:00 stdout F [2025-01-02 08:29:01.379] [info] [key.cc:91] Executing sort scripts: tail -n +2 /tmp/sf_woui-berndnhm-node-35_alice/7c199448-1489-4d5a-aa91-683eb81dbc1a/tmp-sort-in-5d155eb2-1284-4bd8-bb42-6b1ee207bd8f | LC_ALL=C sort --parallel=8 --buffer-size=1G --stable --field-separator=, --key=202,202 >>/tmp/sf_woui-berndnhm-node-35_alice/7c199448-1489-4d5a-aa91-683eb81dbc1a/tmp-sort-out-5d155eb2-1284-4bd8-bb42-6b1ee207bd8f
2025-01-02T16:29:53.71204713+08:00 stdout F 2025-01-02 08:29:53.711 [info] [channel.cc:SendRequestWithRetry:359] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '1010', http status code '503', response header '[x-b3-traceid]:[7a0654c4b58b56c5];[content-length]:[152];[kuscia-error-message]:[<alice/root-kuscia-autonomy-alice-mpc-hp-prodesk-680-g4-mt100/internal $upstream_reset_before_response_started{remote_connection_failure,delayed_connect_error:_111}$ Service Unavailable>];[x-accel-buffering]:[no];[x-b3-spanid]:[7a0654c4b58b56c5];[date]:[Thu, 02 Jan 2025 08:29:53 GMT];[server]:[envoy];', response body '', error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111'
2025-01-02T16:29:54.814761928+08:00 stdout F 2025-01-02 08:29:54.814 [info] [channel.cc:SendRequestWithRetry:359] send request failed and retry, retry_count=2, max_retry=3, interval_ms=3000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '1010', http status code '503', response header '[x-b3-traceid]:[a540d9a8443b802c];[content-length]:[152];[kuscia-error-message]:[<alice/root-kuscia-autonomy-alice-mpc-hp-prodesk-680-g4-mt100/internal $upstream_reset_before_response_started{remote_connection_failure,delayed_connect_error:_111}$ Service Unavailable>];[x-accel-buffering]:[no];[x-b3-spanid]:[a540d9a8443b802c];[date]:[Thu, 02 Jan 2025 08:29:54 GMT];[server]:[envoy];', response body '', error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111'
2025-01-02T16:29:57.917560164+08:00 stdout F 2025-01-02 08:29:57.917 [info] [channel.cc:SendRequestWithRetry:359] send request failed and retry, retry_count=3, max_retry=3, interval_ms=5000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '1010', http status code '503', response header '[x-b3-traceid]:[d655c377d465b3a2];[content-length]:[152];[kuscia-error-message]:[<alice/root-kuscia-autonomy-alice-mpc-hp-prodesk-680-g4-mt100/internal $upstream_reset_before_response_started{remote_connection_failure,delayed_connect_error:_111}$ Service Unavailable>];[x-accel-buffering]:[no];[x-b3-spanid]:[d655c377d465b3a2];[date]:[Thu, 02 Jan 2025 08:29:57 GMT];[server]:[envoy];', response body '', error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111'
2025-01-02T16:30:06.273926217+08:00 stdout F [2025-01-02 08:30:06.273] [info] [key.cc:93] Finished sort scripts: tail -n +2 /tmp/sf_woui-berndnhm-node-35_alice/7c199448-1489-4d5a-aa91-683eb81dbc1a/tmp-sort-in-5d155eb2-1284-4bd8-bb42-6b1ee207bd8f | LC_ALL=C sort --parallel=8 --buffer-size=1G --stable --field-separator=, --key=202,202 >>/tmp/sf_woui-berndnhm-node-35_alice/7c199448-1489-4d5a-aa91-683eb81dbc1a/tmp-sort-out-5d155eb2-1284-4bd8-bb42-6b1ee207bd8f, ret=0
2025-01-02T16:30:06.532086202+08:00 stdout F 2025-01-02 08:30:06.531 [error] [channel.cc:Proc:104] SendImpl error [external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '1010', http status code '503', response header '[x-b3-traceid]:[a2cf137d8da22e56];[content-length]:[152];[kuscia-error-message]:[<alice/root-kuscia-autonomy-alice-mpc-hp-prodesk-680-g4-mt100/internal> => <bob/root-kuscia-autonomy-bob-mpc-hp-prodesk-680-g4-mt/external $upstream_reset_before_response_started{remote_connection_failure,delayed_connect_error:_113}$ Service Unavailable>];[x-accel-buffering]:[no];[x-b3-spanid]:[a2cf137d8da22e56];[x-envoy-upstream-service-time]:[3514];[date]:[Thu, 02 Jan 2025 08:30:06 GMT];[server]:[envoy];', response body '', error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 113'

bob:
2025-01-02T16:23:46.635887125+08:00 stdout F checkpoint_uri: "ckwoui-berndnhm-node-35-output-0"
2025-01-02T16:23:46.63589016+08:00 stdout F
2025-01-02T16:23:46.635893629+08:00 stdout F --
2025-01-02T16:23:46.635896689+08:00 stdout F
2025-01-02T16:23:46.635900451+08:00 stdout F 2025-01-02 08:23:46,545|bob|INFO|secretflow|entry.py:comp_eval:97|
2025-01-02T16:23:46.635903906+08:00 stdout F --
2025-01-02T16:23:46.635907352+08:00 stdout F storage_config
2025-01-02T16:23:46.635910575+08:00 stdout F
2025-01-02T16:23:46.635913948+08:00 stdout F type: "local_fs"
2025-01-02T16:23:46.635917277+08:00 stdout F local_fs {
2025-01-02T16:23:46.635920757+08:00 stdout F wd: "/tmp/sf_woui-berndnhm-node-35_bob"
2025-01-02T16:23:46.635924138+08:00 stdout F }
2025-01-02T16:23:46.635927045+08:00 stdout F
2025-01-02T16:23:46.635930295+08:00 stdout F --
2025-01-02T16:23:46.635933413+08:00 stdout F
2025-01-02T16:23:46.635936999+08:00 stdout F 2025-01-02 08:23:46,545|bob|INFO|secretflow|entry.py:comp_eval:98|
2025-01-02T16:23:46.635940371+08:00 stdout F --
2025-01-02T16:23:46.635943746+08:00 stdout F cluster_config
2025-01-02T16:23:46.63594669+08:00 stdout F
2025-01-02T16:23:46.635950005+08:00 stdout F desc {
2025-01-02T16:23:46.635953351+08:00 stdout F parties: "bob"
2025-01-02T16:23:46.635956628+08:00 stdout F parties: "alice"
2025-01-02T16:23:46.635960059+08:00 stdout F devices {
2025-01-02T16:23:46.635963738+08:00 stdout F name: "spu"
2025-01-02T16:23:46.635967037+08:00 stdout F type: "spu"
2025-01-02T16:23:46.635970331+08:00 stdout F parties: "bob"
2025-01-02T16:23:46.635973737+08:00 stdout F parties: "alice"
2025-01-02T16:23:46.635977669+08:00 stdout F config: "{"runtime_config":{"protocol":"SEMI2K","field":"FM128"},"link_desc":{"connect_retry_times":60,"connect_retry_interval_ms":1000,"brpc_channel_protocol":"http","brpc_channel_connection_type":"pooled","recv_timeout_ms":1200000,"http_timeout_ms":1200000}}"
2025-01-02T16:23:46.635985679+08:00 stdout F }
2025-01-02T16:23:46.635989196+08:00 stdout F devices {
2025-01-02T16:23:46.635992622+08:00 stdout F name: "heu"
2025-01-02T16:23:46.635996068+08:00 stdout F type: "heu"
2025-01-02T16:23:46.635999356+08:00 stdout F parties: "bob"
2025-01-02T16:23:46.636002725+08:00 stdout F parties: "alice"
2025-01-02T16:23:46.636006154+08:00 stdout F config: "{"mode": "PHEU", "schema": "paillier", "key_size": 2048}"
2025-01-02T16:23:46.636009519+08:00 stdout F }
2025-01-02T16:23:46.636012824+08:00 stdout F ray_fed_config {
2025-01-02T16:23:46.636016164+08:00 stdout F cross_silo_comm_backend: "brpc_link"
2025-01-02T16:23:46.636019614+08:00 stdout F }
2025-01-02T16:23:46.63602307+08:00 stdout F }
2025-01-02T16:23:46.636026729+08:00 stdout F public_config {
2025-01-02T16:23:46.636030045+08:00 stdout F ray_fed_config {
2025-01-02T16:23:46.636033379+08:00 stdout F parties: "bob"
2025-01-02T16:23:46.636036685+08:00 stdout F parties: "alice"
2025-01-02T16:23:46.636040088+08:00 stdout F addresses: "0.0.0.0:24961"
2025-01-02T16:23:46.63604333+08:00 stdout F addresses: "woui-berndnhm-node-35-0-fed.alice.svc:80"
2025-01-02T16:23:46.636046861+08:00 stdout F }
2025-01-02T16:23:46.636050367+08:00 stdout F spu_configs {
2025-01-02T16:23:46.636053689+08:00 stdout F name: "spu"
2025-01-02T16:23:46.636057022+08:00 stdout F parties: "bob"
2025-01-02T16:23:46.636060399+08:00 stdout F parties: "alice"
2025-01-02T16:23:46.636063758+08:00 stdout F addresses: "0.0.0.0:24960"
2025-01-02T16:23:46.636067046+08:00 stdout F addresses: "http://woui-berndnhm-node-35-0-spu.alice.svc:80"
2025-01-02T16:23:46.636070337+08:00 stdout F }
2025-01-02T16:23:46.636073689+08:00 stdout F inference_config {
2025-01-02T16:23:46.636077118+08:00 stdout F parties: "bob"
2025-01-02T16:23:46.636080537+08:00 stdout F parties: "alice"
2025-01-02T16:23:46.636083842+08:00 stdout F addresses: "0.0.0.0:24959"
2025-01-02T16:23:46.636087284+08:00 stdout F addresses: "http://woui-berndnhm-node-35-0-inference.alice.svc"
2025-01-02T16:23:46.636090634+08:00 stdout F }
2025-01-02T16:23:46.636093949+08:00 stdout F }
2025-01-02T16:23:46.63609727+08:00 stdout F private_config {
2025-01-02T16:23:46.636100613+08:00 stdout F self_party: "bob"
2025-01-02T16:23:46.636104087+08:00 stdout F ray_head_addr: "woui-berndnhm-node-35-0-global.bob.svc:24962"
2025-01-02T16:23:46.636107446+08:00 stdout F }
2025-01-02T16:23:46.636110471+08:00 stdout F
2025-01-02T16:23:46.636113832+08:00 stdout F --
2025-01-02T16:23:46.636116879+08:00 stdout F
2025-01-02T16:23:46.63612042+08:00 stdout F 2025-01-02 08:23:46,546|bob|INFO|secretflow|driver.py:init:502| Try init sf in PRODUCTION mode
2025-01-02T16:23:48.037557675+08:00 stderr F 2025-01-02 08:23:48.037 INFO brpc_link.py:127 [bob] -- Succeeded to listen on 0.0.0.0:24961.
100% ▕████████████████████████████████████████████████████████████▏
2025-01-02T16:27:36.413248521+08:00 stderr F [959.697] perfetto.cc:45899 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024 KB, total sessions:1, uid:0 session name: ""
2025-01-02T16:27:36.415195053+08:00 stdout F [2025-01-02 08:27:36.414] [info] [launch.cc:119] PSI config: {"protocol_config":{"protocol":"PROTOCOL_ECDH","role":"ROLE_RECEIVER","broadcast_result":true,"ecdh_config":{"curve":"CURVE_FOURQ"}},"input_config":{"type":"IO_TYPE_FILE_CSV","path":"/tmp/sf_woui-berndnhm-node-35_bob/7c199448-1489-4d5a-aa91-683eb81dbc1a/100w255f_x_1866212899.csv.deal_null.csv"},"output_config":{"type":"IO_TYPE_FILE_CSV","path":"/tmp/sf_woui-berndnhm-node-35_bob/7c199448-1489-4d5a-aa91-683eb81dbc1a/woui_berndnhm_node_35_output_0.csv"},"keys":["id"],"left_side":"ROLE_RECEIVER"}
2025-01-02T16:27:36.415245069+08:00 stdout F [2025-01-02 08:27:36.415] [info] [receiver.cc:41] [EcdhPsiReceiver::Init] start
2025-01-02T16:27:36.415251243+08:00 stdout F [2025-01-02 08:27:36.415] [info] [interface.cc:78] [AbstractPsiParty::Init] start
2025-01-02T16:27:36.723968838+08:00 stdout F [2025-01-02 08:27:36.723] [info] [interface.cc:136] [AbstractPsiParty::Init][Check csv pre-process] start
2025-01-02T16:27:49.738531164+08:00 stdout F [2025-01-02 08:27:49.738] [info] [csv_checker.cc:243] Executing script to get duplicates: LC_ALL=C tail -n +2 /tmp/1362ceb9-0ed9-4e30-a81c-ce6cb90be5f4.psi_checked | LC_ALL=C sort --parallel=8 --buffer-size=1G --stable | LC_ALL=C uniq -d > /tmp/1362ceb9-0ed9-4e30-a81c-ce6cb90be5f4.psi_checked_duplicates
2025-01-02T16:27:49.963026687+08:00 stdout F [2025-01-02 08:27:49.962] [info] [interface.cc:145] [AbstractPsiParty::Init][Check csv pre-process] end
2025-01-02T16:27:49.963222176+08:00 stdout F [2025-01-02 08:27:49.963] [info] [interface.cc:183] [AbstractPsiParty::Init] end
2025-01-02T16:27:49.963229272+08:00 stdout F [2025-01-02 08:27:49.963] [info] [receiver.cc:49] [EcdhPsiReceiver::Init] end
2025-01-02T16:27:49.963231073+08:00 stdout F [2025-01-02 08:27:49.963] [info] [receiver.cc:54] [EcdhPsiReceiver::PreProcess] start
2025-01-02T16:27:49.963232677+08:00 stdout F [2025-01-02 08:27:49.963] [info] [cryptor_selector.cc:69] Using FourQ
2025-01-02T16:27:49.994832801+08:00 stdout F [2025-01-02 08:27:49.994] [info] [receiver.cc:91] [EcdhPsiReceiver::PreProcess] end
2025-01-02T16:27:50.1044471+08:00 stdout F [2025-01-02 08:27:50.104] [info] [receiver.cc:96] [EcdhPsiReceiver::Online] start
2025-01-02T16:27:57.853861982+08:00 stdout F [2025-01-02 08:27:57.853] [info] [arrow_csv_batch_provider.cc:75] Reach the end of csv file /tmp/sf_woui-berndnhm-node-35_bob/7c199448-1489-4d5a-aa91-683eb81dbc1a/100w255f_x_1866212899.csv.deal_null.csv.
2025-01-02T16:27:57.863517969+08:00 stdout F [2025-01-02 08:27:57.863] [info] [thread_pool.cc:30] Create a fixed thread pool with size 7
2025-01-02T16:28:01.268726907+08:00 stdout F [2025-01-02 08:28:01.268] [info] [arrow_csv_batch_provider.cc:75] Reach the end of csv file /tmp/sf_woui-berndnhm-node-35_bob/7c199448-1489-4d5a-aa91-683eb81dbc1a/100w255f_x_1866212899.csv.deal_null.csv.
2025-01-02T16:28:01.282917827+08:00 stdout F [2025-01-02 08:28:01.282] [info] [ecdh_psi.cc:106] MaskSelf:bob --finished, batch_count=1, self_item_count=1000000
2025-01-02T16:28:01.282927432+08:00 stdout F [2025-01-02 08:28:01.282] [info] [ecdh_psi.cc:365] ID bob: MaskSelf finished.
2025-01-02T16:28:22.470805411+08:00 stdout F [2025-01-02 08:28:22.470] [info] [ecdh_psi.cc:169] MaskPeer:bob --finished, batch_count=1, peer_item_count=1000000
2025-01-02T16:28:22.47082298+08:00 stdout F [2025-01-02 08:28:22.470] [info] [ecdh_psi.cc:369] ID bob: MaskPeer finished.
2025-01-02T16:28:41.846961313+08:00 stdout F [2025-01-02 08:28:41.846] [info] [ecdh_psi.cc:212] RecvDualMaskedSelf:bob recv last batch finished, batch_count=1
2025-01-02T16:28:41.846971999+08:00 stdout F [2025-01-02 08:28:41.846] [info] [ecdh_psi.cc:373] ID bob: RecvDualMaskedSelf finished.
2025-01-02T16:28:41.847052976+08:00 stdout F [2025-01-02 08:28:41.846] [info] [receiver.cc:120] [EcdhPsiReceiver::Online] end
2025-01-02T16:28:41.847058632+08:00 stdout F [2025-01-02 08:28:41.846] [info] [receiver.cc:125] [EcdhPsiReceiver::PostProcess] start
2025-01-02T16:28:42.45319709+08:00 stdout F [2025-01-02 08:28:42.452] [info] [receiver.cc:142] [EcdhPsiReceiver::PostProcess] end
2025-01-02T16:28:42.453229982+08:00 stdout F [2025-01-02 08:28:42.452] [info] [interface.cc:188] [AbstractPsiParty::Finalize] start
2025-01-02T16:28:42.453238041+08:00 stdout F [2025-01-02 08:28:42.453] [info] [interface.cc:202] [AbstractPsiParty::Finalize][Generate result] start
2025-01-02T16:28:42.462402602+08:00 stdout F [2025-01-02 08:28:42.462] [info] [key.cc:91] Executing sort scripts: tail -n +2 /tmp/psi_index_5245f28c-5c0f-41bd-a396-2380601ad2fa.csv | LC_ALL=C sort -n --parallel=8 --buffer-size=1G --stable --field-separator=, --key=1,1 >>/tmp/sorted_psi_index_294f7334-273a-4c27-be5d-afaeba12ddce.csv
2025-01-02T16:28:42.643349192+08:00 stdout F [2025-01-02 08:28:42.643] [info] [key.cc:93] Finished sort scripts: tail -n +2 /tmp/psi_index_5245f28c-5c0f-41bd-a396-2380601ad2fa.csv | LC_ALL=C sort -n --parallel=8 --buffer-size=1G --stable --field-separator=, --key=1,1 >>/tmp/sorted_psi_index_294f7334-273a-4c27-be5d-afaeba12ddce.csv, ret=0
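
For readers skimming the alice log above: the `channel.cc` lines carry the retry counters and a connect-error code inside the `kuscia-error-message` header (for example `delayed_connect_error:_111`). The sketch below pulls those fields out of a single log line so the sequence of failures is easier to compare. The function name `summarize` and the abbreviated sample string are illustrative only and are not part of Kuscia or yacl.

```python
import re

# Connect-error code as it appears in the kuscia-error-message header,
# e.g. "delayed_connect_error:_113".
CONNECT_ERR = re.compile(r"delayed_connect_error:_(\d+)")
# Retry counters printed by channel.cc:SendRequestWithRetry.
RETRY = re.compile(r"retry_count=(\d+), max_retry=(\d+), interval_ms=(\d+)")


def summarize(line: str) -> dict:
    """Extract the connect errno and retry counters from one log line, if present."""
    info = {}
    match = CONNECT_ERR.search(line)
    if match:
        info["connect_errno"] = int(match.group(1))
    match = RETRY.search(line)
    if match:
        info["retry_count"] = int(match.group(1))
        info["max_retry"] = int(match.group(2))
        info["interval_ms"] = int(match.group(3))
    return info


# Abbreviated copy of the first retry line from the alice log (middle part elided).
sample = (
    "[channel.cc:SendRequestWithRetry:359] send request failed and retry, "
    "retry_count=1, max_retry=3, interval_ms=1000, message=... "
    "$upstream_reset_before_response_started{remote_connection_failure,"
    "delayed_connect_error:_111}$ Service Unavailable"
)
print(summarize(sample))
```

Applied to the alice log, this would show three retries reporting code 111 and a final `[error]` line reporting code 113.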

@lvying0019
Author

A few days later I reran the job and it succeeded. What could have caused the failure?

@UniqueMarvin
Collaborator

error msg '[E1010]HTTP/1.1 503 Service Unavailable: upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 113'
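Judging from the logs, the failure was caused by the network being unreachable in the alice --> bob direction. Check whether traffic in the alice --> bob direction passes through a gateway or firewall.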
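
As background for the diagnosis above: the number after `delayed connect error:` is, as far as I understand Envoy's message, the OS errno returned by the failed connect. On Linux, 111 is `ECONNREFUSED` and 113 is `EHOSTUNREACH`. A minimal sketch of that lookup follows; the table is hard-coded because errno numbering differs between operating systems, and the `explain` helper is a hypothetical name used only for illustration.

```python
# Linux errno values for the two codes that appear in the logs.
# Hard-coded on purpose: Python's errno module reflects the local OS,
# and these numbers are different on, for example, macOS.
LINUX_CONNECT_ERRNO = {
    111: ("ECONNREFUSED", "connection refused: the target was reached but nothing accepted on that port"),
    113: ("EHOSTUNREACH", "no route to host: the target host could not be reached at the network level"),
}


def explain(code: int) -> str:
    """Return a one-line description of a Linux connect errno seen in the logs."""
    name, meaning = LINUX_CONNECT_ERRNO.get(
        code, ("UNKNOWN", "not covered by this sketch")
    )
    return f"errno {code} = {name} ({meaning})"


for code in (111, 113):
    print(explain(code))
```

Read this way, the final error (113, on the alice internal => bob external hop) fits the "network unreachable from alice to bob" reading, while the earlier retries (111) point at a refused connection rather than a missing route.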

@lvying0019
Author

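No. They are two machines on the same LAN, connected to the same router. Apart from a 20 Mbps rate limit and a 50 ms delay, no other restrictions are in place.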

@UniqueMarvin
Collaborator

If the problem is reproducible, you can capture packets with tcpdump to analyze the cause.
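
Before setting up a packet capture, a plain TCP connect test from alice's host to bob's externally exposed address may already show whether the failure is "refused" or "unreachable". The sketch below is one way to do that with the Python standard library. The function name `can_connect` is made up for this example, and the host and port you pass in are deployment-specific values that do not appear anywhere in this thread.

```python
import socket
from typing import Tuple


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> Tuple[bool, str]:
    """Try one TCP connection and report success or the OS-level error."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, "connected"
    except OSError as exc:
        # Covers refused connections, unreachable hosts, timeouts and DNS errors.
        return False, f"{type(exc).__name__}: {exc}"
```

Calling it in a loop while the PSI job runs, with bob's real address and gateway port substituted in, could help tell an intermittent network problem apart from a persistent one.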
