I have rebuilt the cluster several times: rewritten blank GPT tables on already-claimed disks, done a full wipe of all hosts, reused an existing dSwitch, and set up a new dSwitch altogether, complete with a host config wipe and a partition table wipe. I keep getting stuck in the same place, trying to figure out why the hosts won't identify their VMkernels for vSAN. I also tried unchecking and re-checking the vSAN box on each host, by the way.
![Screen_Shot_2020-04-18_at_13_32_40-2.png]()
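In case it helps, something like this pyVmomi sketch should list which VMkernel each host actually reports as selected for vSAN traffic, outside the UI (a rough sketch only; the vCenter address and credentials are placeholders):

```python
#!/usr/bin/env python
# Rough sketch: print the VMkernel adapters each host reports as tagged for vSAN.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only: skip certificate validation
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        # Ask the host which vmk adapters are selected for the "vsan" nic type.
        netcfg = host.configManager.virtualNicManager.QueryNetConfig("vsan")
        selected = set(getattr(netcfg, "selectedVnic", None) or [])
        vmks = [v.device for v in (getattr(netcfg, "candidateVnic", None) or [])
                if v.key in selected]
        print(host.name, "vSAN VMkernels:", vmks or "NONE")
    view.Destroy()
finally:
    Disconnect(si)
```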
It gets a little more confusing. The disk format field says the disks are on version 10. I remediated the cluster last night, and again this morning, to the current 6.x image: 6.7.0, build 15160138, a.k.a. Update 3. From what I read in the release notes, the matching vSAN version is 7, not 10, and I don't think they pulled a Microsoft and jumped straight to version 10. Either way, I have not installed or downloaded any component from vSphere 7, so it doesn't make sense to me.
![Screen Shot 2020-04-19 at 00.28.52.png]()
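To sanity-check the on-disk format outside the UI as well, something along these lines should print what each claimed disk reports. I'm not certain the vsanDiskInfo/formatVersion properties are populated on every build, hence the guards; same placeholder credentials as above:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="********", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        storage = host.configManager.vsanSystem.config.storageInfo
        for mapping in (storage.diskMapping if storage else []) or []:
            # Each disk group is one cache SSD plus its capacity disks.
            for disk in [mapping.ssd] + list(mapping.nonSsd or []):
                info = getattr(disk, "vsanDiskInfo", None)   # may be absent
                ver = getattr(info, "formatVersion", None) if info else None
                print(host.name, disk.canonicalName, "on-disk format:", ver)
    view.Destroy()
finally:
    Disconnect(si)
```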
It gets more confusing still. Right there in the cluster details, all disks appear okay-ish, if you ignore the connectivity warning on hyperserver1 (shot below), and the vSAN datastore capacity actually shrank and then grew back a little when I added the last host, hyperserver3.
I have to do this in stages because vCenter is hosted in the cluster itself. To keep it smooth and error-free, the dSwitches and VMkernel adapters are created beforehand, vCenter is manually unregistered and re-registered on the hosts, and the final host is handled ahead of time by dropping it into a disposable cluster with the same settings. With EVC and the other settings matching, moving the VMs and the hosts themselves across clusters goes without errors.
![Screen Shot 2020-04-19 at 00.32.05.png]()
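For the VMkernel prep step I mentioned above, tagging an existing adapter for vSAN traffic through the API would look roughly like this (vmk1 and the host name are just examples; the checkbox in the UI does the same thing per host):

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="********", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        if host.name == "hyperserver2.lab.local":   # example host name
            # Mark an existing VMkernel adapter (vmk1 here) for vSAN traffic.
            host.configManager.virtualNicManager.SelectVnicForNicType("vsan", "vmk1")
    view.Destroy()
finally:
    Disconnect(si)
```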
The vSAN datastore went from some terabyte figure down to around 60 GB, then a few minutes later grew back to around half a terabyte, and it has stayed at that size. The amount our VMs need is minuscule, probably less than 400 GB even with future-proofing, but the raw capacity is larger than what is showing, so I'm not sure whether it set itself to the minimum capacity all hosts can provide (like a standard RAID array would), or it's stalled, or something else. I read in the documentation that certain tasks invoke "a rolling reformat of every disk group in the cluster," so I figured that was what it was doing and why it grew earlier; if that's really what's happening, it's definitely stalled.
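For what it's worth, the capacity numbers above match the datastore summary; a quick sketch like this should print raw capacity and free space for any vSAN datastore as vCenter currently sees it:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="********", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Datastore], True)
    for ds in view.view:
        if ds.summary.type == "vsan":
            gb = 1024.0 ** 3
            print(ds.name,
                  "capacity %.1f GB" % (ds.summary.capacity / gb),
                  "free %.1f GB" % (ds.summary.freeSpace / gb))
    view.Destroy()
finally:
    Disconnect(si)
```

If those numbers keep disagreeing with what the hosts themselves show, that would at least support the stale-data theory below.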
I have changed disks, host versions and configs, and distributed switches, and turned things on and off hoping to trigger a reaction. Even as I write this I keep trying things, (1) to avoid bothering anyone and (2) for the screenshots, and the only progress I made was when I realized one of the hosts had mixed NIC speeds (1 Gb vs 10 Gb) on the dSwitch. That was hyperserver1; removing the offending NIC got the other two hosts, previously reported as having no vSAN VMkernels of their own, sort of fine (no warnings in one place, see the shot above, but VMkernels still missing in the other, see the first shot way up top). And the capacity is still missing.
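That is the kind of thing a quick check would have caught sooner; a small sketch that lists every physical NIC's negotiated speed (in Mb) per host:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="********", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        for pnic in host.config.network.pnic:
            # linkSpeed is None when the link is down / unplugged.
            speed = pnic.linkSpeed.speedMb if pnic.linkSpeed else "link down"
            print(host.name, pnic.device, speed)
    view.Destroy()
finally:
    Disconnect(si)
```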
The only thing I have kept constant throughout all this is vCenter. Is it possible that vCenter has [selectively] stale data? How could I fix that without deploying everything again? Setting up makeshift DNS servers and the other bits of the network takes forever each time. Even wiping the partition tables is a physical chore: I forgot the proper method, so I'm booting the hosts one at a time with a GParted Live flash drive to wipe them faster, since vSphere viciously fights any attempt to touch the disks; at least that's reassuring if any data were actually on them.
I'd appreciate any advice or help you can give me. All disks are empty; I don't want to, but I can wipe again if necessary. All VMs (including vCenter) are backed up on central storage.
Thanks.