Agent selection deadlock

I recently changed our CI setup so that all configurations/stages can run on every agent. Previously, tests ran on thinner clients, so there was a context switch.
I also have a condition that keeps every configuration running on the same agent, to avoid copying the workspace between agents.
But this issue also happens in a configuration with only one stage.
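(For context, the same-agent condition is the agent requirement that appears in the second timeline below: the expression [‘$Agent.Hostname.ToLower()$’ equals ‘$Build.Stages.First().AgentName.ToLower()$’], i.e. each later stage may only run on the agent that ran the first stage.)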

It seems that there might be a context/agent switch, and two stages might be using the same shared lock, creating a deadlock?
Or is there something wrong with the stage queueing? It looks like two instances of the same stage are queued.

Can you advise on how to debug this?

Below are two example timelines:

Build stopped by user 1 hour, 13 minutes ago 0 milliseconds Build stopped by user ‘------------’
Build stopping 1 hour, 13 minutes ago 484 milliseconds Stopping build.
Stage initialising 1 hour, 13 minutes ago 156 milliseconds Updating unit test totals
Stage initialising 1 hour, 13 minutes ago 15 milliseconds Sending stage ‘Test GUI’ to agent ‘wrpbuildce02’
Build stopping 1 hour, 13 minutes ago 15 milliseconds Build stopping
Stage initialising 1 hour, 13 minutes ago 421 milliseconds Initialising stage…
Stage ready 1 hour, 13 minutes ago 46 milliseconds Agent ‘wrpbuildce02’ reserved and license allocated for stage ‘Test GUI’
Stage awaiting agent 1 hour, 13 minutes ago 31 milliseconds Agent shared resources acquired for agent ‘wrpbuildce02’: Stage has acquired the following shared resource locks from agent ‘wrpbuildce02’: a read lock on agent shared resource ‘Agent.GUITests.InProgress’ with single lock quota for current agent
Stage awaiting agent 1 hour, 13 minutes ago 281 milliseconds Reserving agent for stage ‘Test GUI’ which is at position 7 in the stage queue.
Stage awaiting agent 1 hour, 13 minutes ago 865 milliseconds Suitable agent conditions met, checking 7 stages which are higher in stage queue.
Stage awaiting agent Yesterday at 22:00 10 hours, 43 minutes Adding stage ‘Test GUI’ to stage queue. There are currently 5 stages on the queue.
Build stopped by user 1 hour, 22 minutes ago 0 milliseconds Build stopped by user ‘-------’
Build stopping 1 hour, 22 minutes ago 2 seconds Build stopping
Stage awaiting agent 1 hour, 22 minutes ago 8 seconds No agents are currently available to execute the stage.
wrp-buildcews02: The expression [‘$Agent.Hostname.ToLower()$’ equals ‘$Build.Stages.First().AgentName.ToLower()$’] evaluated to False. ‘wrp-buildcews02’ does not equal ‘wrp-buildcews01’. This may be because a required tool is not installed on the agent.
wrp-testauto01: The expression [‘$Agent.Hostname.ToLower()$’ equals ‘$Build.Stages.First().AgentName.ToLower()$’] evaluated to False. ‘wrp-testauto01’ does not equal ‘wrp-buildcews01’. This may be because a required tool is not installed on the agent.
wrp-testauto02: The expression [‘$Agent.Hostname.ToLower()$’ equals ‘$Build.Stages.First().AgentName.ToLower()$’] evaluated to False. ‘wrp-testauto02’ does not equal ‘wrp-buildcews01’. This may be because a required tool is not installed on the agent.
wrp-testauto03: The expression [‘$Agent.Hostname.ToLower()$’ equals ‘$Build.Stages.First().AgentName.ToLower()$’] evaluated to False. ‘wrp-testauto03’ does not equal ‘wrp-buildcews01’. This may be because a required tool is not installed on the agent.
wrpbuildce01: The expression [‘$Agent.Hostname.ToLower()$’ equals ‘$Build.Stages.First().AgentName.ToLower()$’] evaluated to False. ‘wrpbuildce01’ does not equal ‘wrp-buildcews01’. This may be because a required tool is not installed on the agent.
wrpbuildce02: The expression [‘$Agent.Hostname.ToLower()$’ equals ‘$Build.Stages.First().AgentName.ToLower()$’] evaluated to False. ‘wrpbuildce02’ does not equal ‘wrp-buildcews01’. This may be because a required tool is not installed on the agent.
wrpbuildce03: The expression [‘$Agent.Hostname.ToLower()$’ equals ‘$Build.Stages.First().AgentName.ToLower()$’] evaluated to False. ‘wrpbuildce03’ does not equal ‘wrp-buildcews01’. This may be because a required tool is not installed on the agent.
wrpbuildcesrv01: The expression [‘$Agent.Hostname.ToLower()$’ equals ‘$Build.Stages.First().AgentName.ToLower()$’] evaluated to False. ‘wrpbuildcesrv01’ does not equal ‘wrp-buildcews01’. This may be because a required tool is not installed on the agent.
wrpbuildcesrv01: The expression [‘$Agent.MSBuild.16.0.PathX86$’ exists] evaluated False. ‘$Agent.MSBuild.16.0.PathX86$’ does not exist. This may be because a required tool is not installed on the agent.
Agent: wrp-buildcews01 is currently executing its maximum number of concurrent build stages: 1
Stage awaiting agent Yesterday at 20:01 12 hours, 42 minutes Adding stage ‘Build IBIS project’ to stage queue. There are currently 1 stages on the queue.
Stage awaiting agent Yesterday at 20:01 31 milliseconds Queuing stage ‘Clone’
Stage not run Yesterday at 20:01 15 milliseconds Skip conditions for stage ‘Build IBIS project’ were not met
The expression [‘%DeliverToTrunkStartedBy%’ does not equal ‘’] evaluated to False. ‘’ equals ‘’.
The expression [‘%StartedBy%’ does not equal ‘’] evaluated to False. ‘’ equals ‘’.
Stage completed Yesterday at 20:01 46 milliseconds Stage completed.
Stage executing Yesterday at 20:01 15 milliseconds Completing stage.
Stage executing Yesterday at 20:01 46 milliseconds Updating unit test totals
Stage executing Yesterday at 20:01 31 milliseconds Registering artifacts
Stage executing Yesterday at 20:01 0 milliseconds Loading build variables from stage
Stage executing Yesterday at 20:01 15 milliseconds Stage done on agent ‘wrp-buildcews01’.
Stage executing Yesterday at 20:01 1 second Syncing workspace from agent ‘wrp-buildcews01’ to server
Stage executing Yesterday at 20:01 15 milliseconds Creating archive files in workspace of agent ‘wrp-buildcews01’ ready to copy to server
Stage executing Yesterday at 20:00 1 minute Executing stage on agent ‘wrp-buildcews01’
Stage initialising Yesterday at 20:00 1 second Initialising workspace on agent ‘wrp-buildcews01’
Stage initialising Yesterday at 20:00 62 milliseconds Sending stage ‘Clone’ to agent ‘wrp-buildcews01’
Stage initialising Yesterday at 20:00 771 milliseconds Initialising stage…
Stage ready Yesterday at 20:00 78 milliseconds Agent ‘wrp-buildcews01’ reserved and license allocated for stage ‘Clone’
Stage awaiting agent Yesterday at 20:00 15 milliseconds Agent shared resources acquired for agent ‘wrp-buildcews01’: Stage has acquired the following shared resource locks from agent ‘wrp-buildcews01’: a read lock on agent shared resource ‘Agent.AgentRepoStore.Clone’ with single lock quota for current agent
Stage awaiting agent Yesterday at 20:00 124 milliseconds Reserving agent ‘wrp-buildcews01’ for stage ‘Clone’ which is at position 1 in the stage queue.
Stage awaiting agent Yesterday at 20:00 322 milliseconds Adding stage ‘Clone’ to stage queue. There are currently 0 stages on the queue.

Hi Michal,

We’re not sure what the issue is in the two example timelines. The first one acquires a lock on the shared resource ‘Agent.GUITests.InProgress’, starts to run on agent ‘wrpbuildce02’, and is then manually stopped. The second one is waiting for an agent named ‘wrp-buildcews01’, which doesn’t appear to be in the list of available agents. Could it be that agent ‘wrp-buildcews01’ is offline?

Can you further explain what behaviour you are seeing?

OK - we can see now that you’ve highlighted the issue in bold, which is indeed quite odd.

We’ll need to check through the code to see how this could have happened. Meanwhile, are you able to reproduce this issue with debug logging enabled? If you can send us a debug log, either to me via direct message or via email to support at finalbuilder.com, it should help us to diagnose the issue quickly.

The first one does indeed acquire a lock, but look at the timeline: it sits in the queue for hours.
The timeline only proceeds once the manual stop is triggered → a deadlock until the manual stop?

The second one is indeed waiting on wrp-buildcews01, which is online, but other builds are waiting on the “Clone” stage, which also acquires a per-agent lock. I suspect that the lock check interleaves with the stage/agent switch and creates a deadlock.

So in the second example:
Build 1 on Agent1: get Clone lock → run Clone stage → finish Clone stage → release Agent1 → queue next stage → wait for Agent1 → deadlock (when is the lock freed?)
Build 2 on Agent1: get Clone lock → Clone lock probably still held by Build 1 → deadlock
Every other build that chooses this agent is queued.

I suspect that the stage’s shared lock is released too late, or acquired too early.
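
To make the circular wait I suspect more concrete, here is a minimal, purely illustrative sketch in Python. It only models my assumption (not Continua CI internals): Build 1 still holds the Clone lock after its stage completes, while Build 2’s Clone stage already occupies Agent1’s single concurrency slot.

```python
import threading
import time

# Illustrative model only: the objects below stand in for my assumptions about
# the ordering, not for how Continua CI actually implements agents or locks.
agent_slot = threading.Semaphore(1)   # Agent1 allows one concurrent build stage
clone_lock = threading.Lock()         # per-agent resource 'Agent.AgentRepoStore.Clone'

def build_1():
    agent_slot.acquire()              # Clone stage starts on Agent1
    clone_lock.acquire()              # Clone lock acquired
    time.sleep(0.1)                   # ... Clone stage work ...
    agent_slot.release()              # stage completes, agent slot is freed
    # Suspected bug: the Clone lock is NOT released here.
    time.sleep(0.3)                   # next stage is queued and must run on Agent1 again
    if not agent_slot.acquire(timeout=2):
        print("Build 1: next stage stuck waiting for Agent1 (slot now held by Build 2)")

def build_2():
    time.sleep(0.2)
    agent_slot.acquire()              # Clone stage sent to Agent1, taking its only slot
    if not clone_lock.acquire(timeout=2):
        print("Build 2: Clone stage stuck waiting for the lock still held by Build 1")

t1 = threading.Thread(target=build_1)
t2 = threading.Thread(target=build_2)
t1.start(); t2.start()
t1.join(); t2.join()
```

In the sketch both waits simply time out so the script exits; in the real system nothing times out, so every later build that picks this agent just queues up behind it.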

I removed those locks, as we do not need them anymore; I will see if this helps.

So it seems I have another deadlock, even though no shared resources are used.

There are a few builds that just hang on “nothing”.
I can sort them by the phrase “stages … on queue”:

Examples 1–4: (attachments not shown)

OK, I think I found the culprit.
I think this must have happened after the update to the newest version.
A lock acquired with an expression ended up with an empty expression:


It was definitely not empty before.
I am also not sure whether it wasn’t set to “Acquire Specific” before.

When I stop the build holding such an “empty” lock, all the other builds start :slight_smile:

Hi Michal,

We just tried reproducing this issue by installing v1.9.2.1140 on a clean server. We added a new quota list shared resource, set up a shared resource lock with an “Acquire Expression” operation on the first stage of a new configuration, and ran some builds to test that it worked correctly. We then upgraded to v1.9.2.1185 and checked the shared resource lock. It had not changed. We also ran some builds, which worked as before. We also note that similar shared resource locks on our test server have not changed during previous upgrades. So there must be another reason why the shared resource lock has been blanked out.

If we manually blank out the expression we see that there is an error logged on the Event Log page, with the message “Stage cannot acquire shared resource lock for the expression ‘’ as it expands to an empty value.”. We will update the error handling so that this is logged to the build timeline.

The blank expression doesn’t necessarily explain the timelines in your original post. Were any other messages logged to the event log during those builds? We would be really interested to see debug logs, if you are able to reproduce those timelines.

Hello,
thank you for conducting tests.
If it works after the update, then it’s hard for me to explain why it was blanked in my setup. I wouldn’t suspect human intervention :slight_smile:
In my setup it was on the last stage. What is interesting is that the expression locks used in “Conditions” in another configuration were OK, but there I use variables. In the failing case I just used plain strings.
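(Purely for illustration, with made-up names rather than my real configuration: the working locks acquire on an expression built from a variable, e.g. ‘%BranchName%’, whereas the failing lock acquired on a literal string such as ‘Trunk’.)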

I also saw the error about not being able to acquire a lock with an empty expression; that’s why I started looking through the configurations. It would help if the log stated which configuration it was, and maybe which stage.

Also, did you test what happens when you use an empty expression on a lock and receive the error: do other builds queue up like in my case, or do they start normally?
When I manually stopped the build hanging on the empty expression, the other builds started. Otherwise they queued with the message “there are currently X stages on the queue”, and every newly added build reported X+1.

Unfortunately I do not have time to reproduce this issue on my side.
I did not see any other interesting messages at the time.

Hi Michal,

v1.9.2.1199 includes improvements to the error handling so that errors with shared resource lock expressions cause the build to fail and log the error message to the build log.
