Agent marked as disconnected

We recently added two more agents to our setup, made from old workspaces.
Because of that, they are still on a separate network from the server and the other agents.

We sometimes get a disconnect during build time:
Stage error:

The agent 'wrp-buildcews02' which was executing stage 'Build IBIS project' has gone offline. Agent status is Online, Authorized, Licensed, PropertiesCollected. Agent was last active at 13:56:25. Status checked at 13:56:47. Agent communication test failed.

Server Agent controller event:

An error occurred while checking if the agent is alive: The open operation did not complete within the allotted timeout of 00:00:10. The time allotted to this operation may have been a portion of a longer timeout. The socket transfer timed out after 00:00:01.4368187. You have exceeded the timeout set on your binding. The time allotted to this operation may have been a portion of a longer timeout. The read operation failed, see inner exception. The socket transfer timed out after 00:00:01.4368187. You have exceeded the timeout set on your binding. The time allotted to this operation may have been a portion of a longer timeout. A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. Number of retries = 0. Elapsed time = 3m 42s 503ms

Agent controller event:

An exception occurred while getting agent properties from cached collectors for agent 'wrp-buildcews02'. Details: 'Exception: TimeoutException

Message: This request operation sent to net.tcp://10.80.58.146:9002/IAgentService did not receive a reply within the configured timeout (00:02:00). The time allotted to this operation may have been a portion of a longer timeout. This may be because the service is still processing the operation or because the service was unable to send a reply message. Please consider increasing the operation timeout (by casting the channel/proxy to IContextChannel and setting the OperationTimeout property) and ensure that the service is able to connect to the client.

In reality the agent is alive. I suspect the root cause is that the agent is loaded with work and the network between those agents and the server may be slower.

So the question is: are there any configuration parameters that can extend this timeout or increase the retry count?

P.S.
It would help if the logged events stated which agent the problem was reported on (this is not always visible).

Hi Michal,

The messages are actually misleading and we need to fix that.

The “Agent communication test failed” part of the first error refers to the second error message.

The second error message is raised if the agent cannot be contacted. The server makes a request to communicate with the agent with an open timeout of 10 seconds and send/receive timeouts of 5 minutes. If this fails, then the server will retry 10 times, every 200 milliseconds. Although the message says that the number of retries is 0, that is incorrect - the wrong number was logged.
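To make the numbers above concrete, here is a minimal Python sketch of a check that follows the same policy: a 10-second open timeout, then up to 10 retries spaced 200 milliseconds apart. It is an illustration only, not the server's actual code. The names `agent_is_alive` and `tcp_open` are hypothetical, and a plain TCP connect stands in for the real agent service call.

```python
import socket
import time
from typing import Callable, Tuple

OPEN_TIMEOUT_S = 10.0  # open timeout quoted above
RETRY_COUNT = 10       # number of retries quoted above
RETRY_DELAY_S = 0.2    # 200 milliseconds between retries


def tcp_open(host: str, port: int, timeout: float) -> None:
    """Open and close a TCP connection; raises OSError if it cannot connect."""
    with socket.create_connection((host, port), timeout=timeout):
        pass


def agent_is_alive(
    host: str,
    port: int,
    connect: Callable[[str, int, float], None] = tcp_open,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, int]:
    """Return (alive, retries_used) after one attempt plus up to RETRY_COUNT retries."""
    for attempt in range(RETRY_COUNT + 1):
        try:
            connect(host, port, OPEN_TIMEOUT_S)
            return True, attempt
        except OSError:  # covers timeouts and refused connections
            if attempt < RETRY_COUNT:
                sleep(RETRY_DELAY_S)
    return False, RETRY_COUNT
```

Returning the number of retries actually used alongside the result is one way to avoid logging a misleading "Number of retries = 0".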

In your case, the message states that the server had not been able to communicate with the agent for 3 minutes and 42 seconds, which is quite a long time. This is despite the agent having registered with the server 22 seconds before the status check (which is why it is recorded as being “online”). So it appears to be a one-way communication issue (from server to agent).
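The timeline behind that conclusion can be worked out from the timestamps in the stage error. The short Python sketch below does the arithmetic. It assumes the 3m 42s 503ms elapsed time ends at the moment the status was checked, which the logs do not confirm.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"
last_active = datetime.strptime("13:56:25", FMT)     # "Agent was last active at"
status_checked = datetime.strptime("13:56:47", FMT)  # "Status checked at"
elapsed = timedelta(minutes=3, seconds=42, milliseconds=503)

# Gap between the agent's last contact and the status check.
since_last_active = status_checked - last_active

# Assumption: the elapsed time is measured back from the status check.
first_failure = status_checked - elapsed

print(since_last_active.total_seconds())           # 22.0
print(first_failure.strftime("%H:%M:%S.%f")[:-3])  # 13:53:04.497
print(last_active > first_failure)                 # True
```

Under that assumption, the agent contacted the server about 3 minutes 20 seconds after the server's own attempts started failing, which is what points to a problem in the server-to-agent direction only.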

There are a number of reasons for the agent communication to fail, some of which may only be logged to the debug log on the server. It is also worth checking the Windows event log on the agent for any errors which may have caused it not to respond to server requests.
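Alongside the logs, it may help to measure the server-to-agent direction independently. The Python sketch below, run on the server machine, times a plain TCP connect to the agent port and produces one timestamped line per sample. It only tests that the port accepts connections, not that the agent service answers requests. The function names are hypothetical; the address and port in the usage note are the ones from the error message in the question.

```python
import socket
import time
from typing import Callable, List, Optional


def probe(host: str, port: int, timeout: float = 10.0) -> Optional[float]:
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # timeout, refused, unreachable, ...
        return None


def describe(host: str, port: int, latency: Optional[float]) -> str:
    """Format one sample as a timestamped log line."""
    stamp = time.strftime("%H:%M:%S")
    if latency is None:
        return f"{stamp} {host}:{port} unreachable"
    return f"{stamp} {host}:{port} connected in {latency * 1000:.0f} ms"


def monitor(
    host: str,
    port: int,
    samples: int,
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Probe repeatedly, printing and collecting one line per sample."""
    lines = []
    for _ in range(samples):
        line = describe(host, port, probe(host, port))
        print(line, flush=True)
        lines.append(line)
        sleep(interval)
    return lines
```

For example, `monitor("10.80.58.146", 9002, samples=720, interval=5.0)` would sample for roughly an hour. Lines reading "unreachable" at the same times as the build failures would support the network explanation.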

The third error occurred while getting the list of properties from the agent. The timeout for this operation is 2 minutes, which is also quite a long time. This error is distinct from the build process and should not cause the build to fail, provided the server has the properties it requires to execute the build. The Windows event log and agent debug log may provide clues as to what is happening here.

Looking through our code, there is a lot we can do to improve the agent communication tests and messages that are displayed, and we are working on that now.

There are currently no parameters that you can use to extend the timeouts or the number of retries, as we considered the current values to be more than sufficient. We will consider adding them, though.

Indeed, 3:42 is a long time; I'm not sure what the cause could be.
I will look through the Windows logs the next time it happens. If that doesn't help, I'll contact our IT department; maybe we could just move the agents onto the same network.