So, there will be no Nutanix in my homelab, not for a while. It was an utter shitshow. Let’s begin.
First, let’s not talk about the extremely LONG installation time for a single Nutanix node. Heaven forbid you do it over a 1Gb IPMI session; be prepared to wait hours. Even installing from USB takes about 53 minutes per node, and I was not going to go out and buy 2 more USB drives to install them all at once, so I had to deal with waiting 2+ hours.
Failure #1: CE with HBA passthrough to the CVM
I followed these instructions to pass the HBA through to the CVM. First problem: the CVM would not boot. With the CVMs down there is no Prism, so I couldn’t console in to see what was going on. The commands "virsh console <name>" and "virsh ttyconsole <name>" don’t work, so I had to rely on "virsh screenshot <name>", and the screenshots showed the CVMs hanging while initializing the HBA or enumerating its devices. After hours of reading, searching, and trial and error, I discovered <rom bar="off"/>. I guess had I started with my all-NVMe node I would have seen this sooner, since CE automatically does passthrough of the NVMe drives. But as soon as I added that attribute, the CVMs started.
At this point I had waited for all 4 nodes to be installed, the CVMs to be set up, and the networking to be FIXED. Let’s talk about the networking. The non-CE version of Nutanix uses the highest-speed adapters and lets you set a VLAN during install; CE just uses the first device found. For those not familiar with the Cisco M5 series: they have 2 on-board 10G COPPER ports. My only copper switch ports are 1G, and I don’t want a bunch of 10G copper SFPs, so I’ve created a few virtual NICs on my 40G MLOM card. This is a dual 40G QSFP+ adapter, and you can define any number of virtual NICs that are presented to the OS as 40G interfaces. All my nodes are configured the same:
| vNIC | Description | MTU |
| --- | --- | --- |
| eth0 | Port 0: Management Interface | 1500 |
| eth1 | Port 1: Unused | 1500 |
| eth2 | Port 0: VM Traffic | 1500 |
| eth3 | Port 0: vSAN | 9000 |
| eth4 | Port 0: vMotion | 9000 |
| eth5 | Port 0: Storage | 9000 |
Since everything is currently using a single 40G port, all vNICs are assigned to port 0. Once I add the 2nd 40G port, I will duplicate this and have everything dual-homed.
So I had to add eth0 (which Nutanix enumerated as eth2, since the 10G ports claimed eth0 and eth1) as the uplink myself. I also had to add the VLAN tag. (I could just set that port in the Cisco CIMC to be in the correct VLAN, but I’d rather keep it as a trunk port and have the OS do the tagging; changing the settings of a vNIC requires a host reboot, while changing tags does not.)
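For reference, these are roughly the commands involved on CE/AHV; double-check the flags against the docs for your AOS version, and swap in your own interface name and VLAN ID:

```
# From the CVM: make the 40G vNIC (eth2, as Nutanix saw it) the uplink for the default bridge
manage_ovs --bridge_name br0 --interfaces eth2 update_uplinks

# On the AHV host: tag the host management port (the switch port stays a trunk)
ovs-vsctl set port br0 tag=<mgmt-vlan-id>

# From the CVM: tag the CVM's own traffic to match
change_cvm_vlan <mgmt-vlan-id>
```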
Finally, all the networking is correct, the cluster is started, and I can log into Prism. First problem: the virtual switch had to be created manually. The next thing I see is a bunch of controller errors. At first I thought my 240GB drives on the MSTOR were bad, but no: I ran tests on all 10 that I have, and they all came back clean, no errors. All I see are errors about the storage controller, the Storage Controller being down on 3 out of 4 nodes, and services restarting frequently.
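If you want to watch that kind of flapping for yourself, the usual place to look is from any CVM; the exact service list varies by release, but it goes something like:

```
# From any CVM: list cluster services that are not UP
cluster status | grep -v UP

# Per-node view of what genesis is running (or restarting)
allssh genesis status
```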
I tried swapping drives, but got the same issue. The only node I did not have this issue on was the node with the UCSC-SAS-M5 HBA rather than the UCSC-MRAID-M5 controller. No clue why; the drives are in JBOD mode. After failing to copy files successfully and have them stay stable, I decided to try again, and I did get Nutanix Move to deploy and boot. But then I put a node in maintenance mode to reboot it, and all hell broke loose: the node would not go into maintenance mode, Nutanix Move would not start, and the one VM I had test-moved over would not start either.
Done. Done. Done. I speak with my associates and tell them: one more day of slowness, and if I can’t get it stable, I will revert back to ESXi.
Try 2: No HBA Passthrough
So I try again. I start all over and reinstall everything. As above: set up the networking, start the cluster, wait about 15-20 minutes. The first thing I see when I log in to Prism is 4 hosts in critical, with the Storage Controller down on 3 out of 4 nodes. It was a different node this time, which is odd, because I hadn’t changed anything on the node that worked the first time; I had only swapped drives around on the others.
But same issues. This time I could not even deploy Nutanix Move. After about an hour of trying, I knew I had to get busy reinstalling ESXi. The reason I say reinstall is that I had already destroyed the drives my vSAN ESA cluster was on, and while I do have backups of the ESXi configs, since I am changing a few things, renumbering, etc., I might as well reinstall from scratch. Total time to get all 4 nodes imaged, running, and connected back to vCenter? Less than an hour. I still had to do the configuring, but still: in about 75 minutes I was moving workloads back over.
Distributed switches, iSCSI, affinity rules, testing HA and DRS: total time 2.5 hours, and my cluster was humming along like nothing ever happened. No more host memory errors from running all my workloads on 2 nodes…
So I’m back to ESXi, and I’m actually happy about it. I got to correct some things I did not like about my previous cluster: cleaned up my distributed switches, and moved the ESXi boot and esxdata partitions to a drive that’s not on the same HBA. The M4 is gone, and life is good again.
Will I try Nutanix Again?
Of course. I’m thinking of getting 2-4 SFF PCs, all NVMe. Nothing fancy, just to learn the platform. I’m disappointed that I could not get everything going, but I couldn’t keep causing slowdowns for my workloads. I’ve also thought about getting the HBA Nutanix expects, going all NVMe, and trying again. Who knows.
Oh, for those who are wondering: I did try the Foundation way of installing, which has strict hardware requirements. I tried editing the Python scripts that run the install, but I just didn’t have enough time to figure out where the /root/.python whatever directories are initially stored, or how they are created, so that I could modify the layout.py files, etc. I had hacked on Foundation quite a bit, but just got tired. If I get the expected equipment, I may try again.