[CI] RelocationIT testIndexAndRelocateConcurrently fails #29161

@cbuescher

Link to build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-java-periodic/ESJAVA=java9,ESRUNTIME=openjdk10,nodes=linux/11/console

This doesn't reproduce locally so far:

REPRODUCE WITH: ./gradlew :server:integTest \
  -Dtests.seed=8489FAA2C90E2AAC \
  -Dtests.class=org.elasticsearch.recovery.RelocationIT \
  -Dtests.method="testIndexAndRelocateConcurrently" \
  -Dtests.security.manager=true \
  -Dtests.locale=fi-FI \
  -Dtests.timezone=Europe/Skopje

This looks like the cause of the failure:

04:29:14   1> [2018-03-20T05:29:03,303][WARN ][o.e.i.r.RecoverySourceHandler] [node_t0] [test][0][recover to node_t1] 0 Remote file corruption on node {node_t1}{dE8cLV8gTuCMFnxS0ICKLw}{bENA6S-IRRa1T-O0CtNuhA}{127.0.0.1}{127.0.0.1:59176}, recovering name [segments_2], length [294], checksum [4qv6i7], writtenBy [7.2.1]. local checksum OK
04:29:14   1> org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=4qv6i7 actual=1afllwc (resource=name [segments_2], length [294], checksum [4qv6i7], writtenBy [7.2.1]) (resource=VerifyingIndexOutput(segments_2))
04:29:14   1> 	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1226) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1205) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1234) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:487) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:598) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:571) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) [main/:?]
04:29:14   1> 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) [main/:?]
04:29:14   1> 	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1555) [main/:?]
04:29:14   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [main/:?]
04:29:14   1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
04:29:14   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
04:29:14   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
04:29:14   1> 	at java.lang.Thread.run(Thread.java:844) [?:?]
04:29:14   1> [2018-03-20T05:29:03,307][TRACE][o.e.i.r.PeerRecoveryTargetService] [node_t1] [test][0] Got exception on recovery
04:29:14   1> org.elasticsearch.transport.RemoteTransportException: [node_t0][127.0.0.1:31016][internal:index/shard/recovery/start_recovery]
04:29:14   1> Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
04:29:14   1> 	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:175) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1555) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [main/:?]
04:29:14   1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
04:29:14   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
04:29:14   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
04:29:14   1> 	at java.lang.Thread.run(Thread.java:844) [?:?]
04:29:14   1> Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [4] files with total size of [3.5kb]
04:29:14   1> 	at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:426) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:173) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1555) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[main/:?]
04:29:14   1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
04:29:14   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
04:29:14   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
04:29:14   1> 	at java.lang.Thread.run(Thread.java:844) ~[?:?]
04:29:14   1> Caused by: org.elasticsearch.transport.RemoteTransportException: [File corruption occurred on recovery but checksums are ok]
04:29:14   1> 	Suppressed: org.elasticsearch.transport.RemoteTransportException: [node_t1][127.0.0.1:59176][internal:index/shard/recovery/file_chunk]
04:29:14   1> 	Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=4qv6i7 actual=1afllwc (resource=name [segments_2], length [294], checksum [4qv6i7], writtenBy [7.2.1]) (resource=VerifyingIndexOutput(segments_2))
04:29:14   1> 		at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1226) ~[main/:?]
04:29:14   1> 		at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1205) ~[main/:?]
04:29:14   1> 		at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1234) ~[main/:?]
04:29:14   1> 		at org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:487) ~[main/:?]
04:29:14   1> 		at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:598) ~[main/:?]
04:29:14   1> 		at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:571) ~[main/:?]
04:29:14   1> 		at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[main/:?]
04:29:14   1> 		at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[main/:?]
04:29:14   1> 		at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1555) ~[main/:?]
04:29:14   1> 		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [main/:?]
04:29:14   1> 		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
04:29:14   1> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
04:29:14   1> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
04:29:14   1> 		at java.lang.Thread.run(Thread.java:844) [?:?]
04:29:14   1> [2018-03-20T05:29:03,308][TRACE][o.e.i.r.PeerRecoveryTargetService] [node_t1] [test][0] failing recovery from {node_t0}{WB-cLhbrSqqhCbXQ30dZaQ}{DxteO7dbRBiM2AE_dcp74Q}{127.0.0.1}{127.0.0.1:31016}, id [1803]. Send shard failure: [true]
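
To compare the `expected=4qv6i7` and `actual=1afllwc` values in the log above, it can help to decode them. The following is a minimal sketch, assuming the checksum strings are base-36 renderings of a 32-bit CRC32 value (that encoding is an assumption on my part, not something the log states). The class and method names (`ChecksumDecode`, `decode`, `encode`, `crc32Base36`) are hypothetical helpers, not Elasticsearch or Lucene APIs.

```java
import java.util.zip.CRC32;

public class ChecksumDecode {

    // Assumption: log checksums are written with radix 36 (Character.MAX_RADIX).
    static long decode(String base36) {
        return Long.parseLong(base36, Character.MAX_RADIX);
    }

    // Inverse of decode: render a numeric digest the way the log appears to.
    static String encode(long digest) {
        return Long.toString(digest, Character.MAX_RADIX);
    }

    // Compute a CRC32 over a byte array and render it in the same base-36 form,
    // e.g. to compare against a locally copied file's contents.
    static String crc32Base36(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return encode(crc.getValue());
    }

    public static void main(String[] args) {
        // Values copied from the CorruptIndexException message in the log.
        for (String s : new String[] {"4qv6i7", "1afllwc"}) {
            long value = decode(s);
            System.out.println(s + " -> 0x" + Long.toHexString(value)
                    + " (fits in 32 bits: " + (value <= 0xFFFFFFFFL) + ")");
        }
    }
}
```

Both values decode to numbers that fit in 32 bits, which is at least consistent with the CRC32 assumption. Decoding does not say which side produced the wrong bytes; the log itself reports "local checksum OK" on the source node.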

Labels: :Distributed Indexing/Recovery (anything around constructing a new shard, either from a local or a remote source), >test-failure (triaged test failures from CI)
