Agent reconnect issue

It seems there was a network problem on Friday and the agents were not accessible.
The agents are fine now, but ContinuaCI is stuck. Builds are trying to reconnect with the message:
"Agent agent_name which is executing stage is alive but not active. Waiting one second before rechecking. Retry 170"
They have been hanging like that for 55 hours :)

When I stop the stuck builds, all the others in the queue start properly.
I think there should be some fallback: either kill such builds after a certain time (just like the timeout on actions in stages), or restart the connection between the agents and the server.
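
To illustrate the kind of fallback I mean, here is a minimal sketch of a reconnect loop with an overall deadline. This is not how ContinuaCI works internally; `wait_for_agent`, `AgentWaitTimeout`, `is_active` and the default values are all hypothetical names and numbers made up for the example.

```python
import time


class AgentWaitTimeout(Exception):
    """Raised when the agent does not become active before the deadline."""


def wait_for_agent(is_active, max_wait_seconds=600.0, poll_seconds=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll is_active() until it returns True or max_wait_seconds elapse.

    is_active is a hypothetical callable standing in for whatever check the
    server performs. Returns the number of retries that were needed.
    """
    deadline = clock() + max_wait_seconds
    retries = 0
    while not is_active():
        if clock() >= deadline:
            # Give up instead of retrying forever, so the build can be
            # failed or re-queued and the rest of the queue can move on.
            raise AgentWaitTimeout(
                f"agent still not active after {retries} retries")
        sleep(poll_seconds)
        retries += 1
    return retries
```

With something like this, a build that cannot reach its agent would fail after the configured wait rather than blocking the queue for days.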

Scratch that… only one started properly; the others are still trying to reconnect.

It seems that, from the ContinuaCI server's point of view, the agents are toggling between online and offline right now.

Hi Michal,

Is the agent service running? Check the Windows event log on the agent for errors.

Yes, all agent services are running. I see timeout errors on the agents:
A call from the agent to the server to update build status failed: Exception: TimeoutException

On the server side, I only see errors saying that a workspace could not be deleted (which seems normal on our setup), but those are from Friday; there is nothing new.

I have encountered this problem once or twice in the past, and restarting the server usually helped.
Might this be a Windows networking issue?

Yes, a timeout could be due to a network issue, or it could be that the server is too busy to accept the request, although usually you would then see errors on the server.
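
One way to narrow this down is to test basic connectivity from the agent machine to the server. The following is only a rough sketch using Python's standard library; `can_connect` is a hypothetical helper, and the host name and port are placeholders that you would replace with the address your agents are actually configured to use.

```python
import socket


def can_connect(host, port, timeout_seconds=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder values: substitute your own server name and port.
    print(can_connect("continua-server.example", 9000))
```

If this fails intermittently while the server is idle, the network is the more likely culprit; if it succeeds consistently, the server being too busy to answer becomes more plausible.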

Check whether CPU, disk, or memory usage is high on either the server or the agent, then restart the server and the agent and see whether that resolves the issue.

Which version of Continua are you running?

CPU usage was fairly low, 10-20%. After the restart everything is working.
The version is 1.9.2.983, but I have encountered this problem a couple of times before on older versions. I rather suspect it is a problem with reopening Windows sockets after a network issue.