Hi guys,
as I mentioned in an earlier ticket, we regularly encounter timeouts during the Mercurial export, although the repository cache and the working copy are located on an SSD. These problems only occur when we need to check out one of our newer branches, which contain about 49,000 files. Our older branches, with about 10,000 files, can be exported flawlessly within a few minutes. For the newer branches, the Mercurial export to the working copy sometimes takes 12-15 minutes and sometimes reaches the timeout. The build task itself (MSBuild) does not show an unexpected increase in duration for the new branches, and the task’s duration varies by no more than five percent from run to run for both the older and the newer branches. The MSBuild task takes a bit longer for the newer branches than for the older branches because we added an Angular-based web application to our product, but the increase in build time is exactly the same as on our development workstations. It looks like there is a huge issue with Mercurial performance when it comes to repositories with a huge number of files. Unfortunately, we are not able to exclude the dependencies for the Angular web application from our repository.
The Continua CI server and the agents are installed on virtual machines running Windows Server 2012 R2 or later. We have already moved the virtual machine to different hardware and different hypervisors, with varying results. Is there a way to improve the Mercurial performance or to export the working copy directly from our Git repository?
Here are some stats about our repository and environment:
Git repository (on Continua CI server):
- hosted on GitHub Enterprise within our company network
- repository size: 640 MB, 3,800 files, 270 folders
- commits: 14,500
- working copy size (newer branches): 800 MB, 49,000 files, 6,100 folders
- working copy size (older branches): 500 MB, 10,000 files, 1,000 folders
Mercurial repository (on ContinuaCI Agent):
- repository size: 500 MB, 53,000 files, 6,700 folders
Host 1 (huge deviation in export duration, timeouts):
- dual Xeon E5-2620 v4, 8 cores each, 16 threads each, 2.1 GHz
- 256 GB Registered ECC RAM
- dual SATA3-SSD, RAID 1
- VMWare ESXi 6.5
Host 2 (huge deviation in export duration, timeouts)
- dual Xeon E5645, 6 cores each, 12 threads each, 2.4 GHz
- 192 GB Registered ECC RAM
- dual SAS-HDD, RAID 1
- VMWare ESXi 6.5
Host 3 (more or less stable export duration)
- single Core i7-4770, 4 cores, 8 threads, 3.4 GHz
- 16 GB RAM
- SATA3-SSD
- Windows 8.1
- VMWare Workstation 12
Software VM:
- Windows Server 2016 Standard
- Continua CI Agent 1.8.1.451
- multiple Visual Studio Versions
- some build tools
Continua CI settings:
- concurrent stage limit: 2 (global), 1 (agent)
Kind regards
Kay Zumbusch
Hi Kay,
Thank you for sending these details. The server’s specifications seem to be good enough for the IO required on the agent, provided there are no other VMs using a lot of IO. Can you test the actual IO throughput on the agent VM using a tool such as Iometer?
We created a test repository today with a working folder containing 100,000 random files, around 6,000 subfolders and a total size of approximately 1 GB. This repository takes around 5-6 minutes to export using the hg archive command on one of our agents, running on much lesser hardware than yours.
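A rough way to reproduce a comparable many-small-files write load on the agent VM is sketched below. This is only an illustration and not the test we ran: the function name `write_small_files`, the folder naming and the 10 KB file size are assumptions, chosen so that 100,000 files add up to roughly 1 GB. It is no replacement for a proper IO benchmark, but it may show whether small-file writes on the workspace drive vary a lot from run to run.

```python
import os
import sys
import tempfile
import time


def write_small_files(root, n_files, n_dirs, file_size=10 * 1024):
    """Write n_files files spread over n_dirs subfolders of root.

    Returns (elapsed_seconds, total_bytes_written). n_dirs must be >= 1.
    """
    payload = os.urandom(file_size)  # one random buffer, reused for every file
    dirs = []
    for d in range(n_dirs):
        path = os.path.join(root, f"dir{d:05d}")
        os.makedirs(path, exist_ok=True)
        dirs.append(path)

    start = time.perf_counter()
    for i in range(n_files):
        # Distribute files round-robin over the subfolders.
        name = os.path.join(dirs[i % n_dirs], f"file{i:06d}.bin")
        with open(name, "wb") as fh:
            fh.write(payload)
    elapsed = time.perf_counter() - start
    return elapsed, n_files * file_size


if __name__ == "__main__":
    # Optional argument: a folder on the drive that holds the agent workspace.
    # Without it, the system temp folder is used, which may be a different drive.
    base = sys.argv[1] if len(sys.argv) > 1 else None
    with tempfile.TemporaryDirectory(dir=base) as tmp:
        seconds, total = write_small_files(tmp, n_files=100_000, n_dirs=6_000)
        print(f"{seconds:.1f} s, {total / seconds / 1e6:.1f} MB/s")
```

Running it a few times at different times of day, with and without other builds active, should give an idea of how stable the small-file write rate is.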
Previously, we have found that anti-virus software can significantly affect IO performance. Can you check that the workspace and repository folders are excluded from real-time anti-virus protection, and use daily scheduled anti-virus scans instead?
It’s strange that running the command on your agent sometimes takes 12-15 minutes and then times out with other runs on the same agent. We’re thinking that the command may actually be stalled and waiting for input. Can you test this by running the command in your timeout error message directly in a DOS command prompt?
e.g.
"C:\Program Files\VSoft Technologies\ContinuaCI Agent\hg\hg.exe" archive -r c76ab7fdf7128d73951127fb4287ef01a41f2508 -X "re:(?i)^(._Continua_CI_empty_changeset_marker)$" -X "re:(?i)^(._Continua_CI_file_hash_dictionary)$" --subrepos -R C:\ContinuaCI\Repos\da38c763 C:\ContinuaCI\Ws\14951\Source --config ui.username=Continua --noninteractive --encoding cp1252
Change the output folder C:\ContinuaCI\Ws\14951\Source to a temporary folder to avoid overwriting anything in the build workspace.
How long does it take? Are there any error messages or prompts for input?
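If you would rather script this test than watch a command prompt, a minimal sketch along the following lines could time the command and capture its output. The helper name `run_with_timeout` and the 1800-second limit are assumptions for illustration, not Continua CI settings. The command is started with no stdin attached, so a process that tries to prompt receives end-of-file instead of waiting for a keypress.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of arguments) and wait at most timeout_s seconds.

    Returns (exit_code, elapsed_seconds, output_text); exit_code is None
    if the process was killed because it exceeded the timeout.
    """
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # no console input available to the child
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # merge error messages into the output
            timeout=timeout_s,
        )
        code, output = proc.returncode, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising; keep any partial output.
        code, output = None, exc.output or b""
    elapsed = time.perf_counter() - start
    return code, elapsed, output.decode(errors="replace")


if __name__ == "__main__":
    # Usage: python time_command.py <program> <arg> <arg> ...
    # Hypothetical limit of 30 minutes; adjust to match your own timeout.
    code, elapsed, output = run_with_timeout(sys.argv[1:], timeout_s=1800)
    status = "timed out" if code is None else f"exit code {code}"
    print(f"{status} after {elapsed:.1f} s")
    print(output)
```

Pass the program and arguments from your timeout error message after the script name, with the output folder changed to a temporary folder as described above.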
Otherwise, you can use the Git actions in FinalBuilder to check out to the agent workspace, although we can’t see how this can be any quicker, as you will still have to create a new working copy in each new workspace folder first. If you decide to do this, then you can un-tick the “Checkout files to workspace” option of the repository to stop Continua CI copying files to the workspace.
I imagine that the new Angular files consist of a large number of scripts in a node_modules folder. It may be worth excluding this folder from the repository and using the npm action to update these on the agent, but again we don’t think this is necessarily going to make things any faster. You can’t get away from having to write the files to the workspace, and this may even add an overhead when getting the files from the npm repository.
Meanwhile, we are continually investigating ways to improve this part of the process. We have some ideas, but each of them will take some time to implement and to test for a performance difference.
Hi Kay
Just to expand on the hardware side of things, our production Hyper-V host has the following specs (I gave Dave the wrong CPU specs):
2 x Intel Xeon E5-2630v3 8-Core, 16 Threads, 2.40GHz
Samsung 128GB DDR4 1.2V 2133MHz ECC Registered DIMM, Dual Rank, Low Profile
6 x Hitachi Ultrastar 4TB SAS 7K4000 HUS724040ALS640 3.5" 6Gb/s Hard Drive in RAID 10 configuration.
LSI MegaRAID SAS 9271-8i SGL Single Pack, 8-port Internal, PCI-e 3.0 x8, 6Gb/s SAS+SATA, 1GB DDR3
So a slightly higher speed CPU, but an older generation (the server is a couple of years old).
This server is currently running our main Continua CI machine (4 virtual cores, dynamic ram), plus 7 agent machines and a domain controller machine.
The tests Dave ran with the archive command were done on one of the agent machines running on this host.
We also have other older machines running Hyper-V 2012R2 with lower specs, and we don’t see this sort of thing on there either.
Looking through our build history for both the Continua CI builds and FinalBuilder’s builds, the Mercurial archive time is pretty consistent (around 2 minutes for 2 repositories, both with lots of files). I have seen it take a bit longer when we loaded up the server with lots of builds (stress testing), but we haven’t seen the timeouts.
I guess the main difference is that our host is running Hyper-V 2012R2 and yours is running ESXi. I don’t have any experience with ESXi, so it’s difficult to advise on configuration etc., but then we didn’t do much in the way of tuning our server. I would imagine that the two host OSs would be comparable on performance these days. VMware has some pretty good options (better than MS) for looking at performance:
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1008205
What else do you have running on the ESXi hosts?
Hi guys,
almost our complete infrastructure is divided between the two ESXi hosts. They host about 20 production VMs, 10 VMs for the Continua CI environment and 40 QA VMs. The QA VMs idle most of the time, especially during build time. They just use up some RAM. The production VMs are two Windows domain controllers, one Microsoft SQL Server (Continua CI database), two Microsoft Exchange servers and a bunch of Linux-based servers for various tasks. Our main build VM is the only VM on the SSD storage of the ESXi host. All build VMs for the CI environment have 2 virtual cores and 4 GB RAM.
We will take a deeper look at the monitoring features of ESXi but we already did some monitoring and were not able to find any bottlenecks.
Hi Kay
I just published a blog post on CI Server performance, it was something I had already been working on before your posts last week.
https://www.finalbuilder.com/resources/blogs/postid/754/continuous-integration-server-performance
You didn’t mention whether you have anti-virus software running on the server and/or agent machines. As Dave mentioned, real-time AV software can really kill performance, so it’s best to exclude the server’s share folder and the agents’ workspace/rc folders from real-time scanning, and use scheduled scans overnight instead.
Hi Vincent,
thanks for explicitly asking about the virus protection once more. As our build servers are usually not protected by anti-virus products and I never installed our anti-virus software, I always thought that the newest build server was unprotected as well. I just remembered that Windows Server 2016 includes Windows Defender, which is activated on the new build server. I have just excluded the repo cache and the workspace folders and will keep an eye on the performance. I’ll get back to you when I have new stats about the performance.
Kind regards
Kay Zumbusch
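For anyone wanting to script the Windows Defender exclusions mentioned above, a minimal sketch follows. It builds a call to the PowerShell cmdlet `Add-MpPreference -ExclusionPath`, which has to be run from an elevated session. The helper name `defender_exclusion_command` is made up for illustration, and the two folders are assumptions taken from the sample command earlier in this thread; the actual repository cache and workspace locations on a given agent may differ.

```python
import subprocess
import sys


def defender_exclusion_command(paths):
    """Build the PowerShell call that adds folders to Defender's exclusion list."""
    # Single-quote each path for PowerShell; embedded quotes are doubled.
    quoted = ",".join("'" + p.replace("'", "''") + "'" for p in paths)
    return [
        "powershell.exe",
        "-NoProfile",
        "-Command",
        f"Add-MpPreference -ExclusionPath {quoted}",
    ]


if __name__ == "__main__":
    # Assumed folders, based on the paths in the sample hg archive command.
    folders = [r"C:\ContinuaCI\Repos", r"C:\ContinuaCI\Ws"]
    cmd = defender_exclusion_command(folders)
    print(" ".join(cmd))
    if sys.platform == "win32":
        # Needs administrator rights; raises CalledProcessError on failure.
        subprocess.run(cmd, check=True)
```

The resulting exclusion list can be checked afterwards with `Get-MpPreference` in PowerShell.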