Setting up my Ampere Server (Build Log 2)

The CPU in all its glory. It's actually really huge.

Today is the day: the Ampere CPU and motherboard finally arrived. I'm still waiting on the GPU, and I have not yet decided on the SSDs to use for a ZFS pool. So for today, I'm mostly focused on putting the server together and doing some basic setup: installing Ubuntu Server for ARM, joining the Docker Swarm, and so on.

Overall, the build process was not too bad. Since I replaced the original fans with quieter (but less performant) Noctuas, I wanted to make sure the cables were all nicely tucked away so the airflow wouldn't be obstructed.

Powering on the system was actually quite scary, as I was greeted by a really loud, constant beep. After a bit of digging, I traced the beep to the PSU and figured out it was the redundant-PSU alarm. I had only plugged in one of the redundant PSUs because I assumed it would "work", and sure, it did work; it just made a really loud beeping noise. After plugging in the second PSU, I was relieved that the beep disappeared almost immediately.

I was able to boot into the BIOS without a hiccup; I was actually quite concerned that something might go wrong, such as a bad memory stick, PCIe issues, etc.

Getting into the boot menu took a bit longer than expected (a solid minute after pressing F11), but it did eventually boot into my bootable USB.

I went with an HWE kernel install, as it had been recommended in case I ran into issues with my somewhat obscure hardware.

The IPMI (OpenBMC) that comes with this motherboard also made installation easy: I could use both the web KVM interface and the BMC host console to go through the setup without a physical KVM.
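By "BMC host console" I mean the serial console the BMC exposes; the usual way to attach to it over the network is IPMI serial-over-LAN, roughly like this (a sketch, assuming SOL is enabled on the BMC; host, user, and password are placeholders):

$ ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sol activate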

And we're in!

Idle power wasn't the greatest (I was expecting much better): 120 watts, measured from my Unifi PDU Pro (each redundant PSU supplies half of the power, so I add the two readings together).

CPU power, as reported by sensors, was showing just 10W, so I'm wondering whether it's the 8 RAM sticks (vs 2 in my gaming PC), or whether all of the fanciness of server hardware, such as the redundancy features, just inherently uses more power than consumer-grade gear.

For comparison, my 9900X + RTX 5000 Ada gaming PC idles at 85 watts, and the combination of my PoE Unifi switch and my Raspberry Pi fleet also idles at 85 watts, both measured through the same PDU. So this machine is officially the noisiest and most power-hungry server in my homelab now.

Joining the Docker Cluster

First, I joined my server to the Docker Swarm as a manager node. I'm not planning for the Ampere server to run any of the cluster's services; for now it will only run Portainer.

Then, I tagged my Ampere server with the portainer label, which I use to determine which node the Portainer container is deployed to.
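For reference, the join and label steps look roughly like this (a sketch; the node name ampere and the label value true are just placeholders I'm using here, not anything Swarm or Portainer mandates):

# on an existing manager, print the join command for a new manager node
$ docker swarm join-token manager
# run the printed "docker swarm join --token ... <manager-ip>:2377" on the Ampere server, then tag it
$ docker node update --label-add portainer=true ampere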

And after restarting Portainer through compose, it was up and running.

I also had to modify all of my existing stacks to add a deployment constraint so they don't get scheduled onto the Ampere server, since I did not want this server running any of the web services. After redeploying those stacks through Portainer, my swarm configuration was complete.
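The constraints themselves are just placement rules in each stack's compose file, roughly like this (a sketch; the service names and the ampere hostname are stand-ins):

services:
  portainer:
    # ... image, volumes, etc. ...
    deploy:
      placement:
        constraints:
          - node.labels.portainer == true   # pin Portainer to the labelled node

  some-web-service:
    # ... image, ports, etc. ...
    deploy:
      placement:
        constraints:
          - node.hostname != ampere         # keep web services off the Ampere box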

Issues, issues, issues

There were four main issues to figure out:

  • SATA disks connected to my backplane not showing up
  • I only see 60GB of available space
  • The CPU temp issue
  • I'm only getting 1GbE

SATA disk issue

I have an LSI 3008 controller plugged into the SATA backplane, which I'm planning to use for a ZFS pool. Although I haven't yet decided which SSDs to buy for the pool, I was testing the peripherals with a spare 2.5" SSD I had lying around, and I noticed it wasn't showing up:

$ lshw -class disk
  *-namespace:0
       description: NVMe disk
       physical id: 0
       logical name: hwmon0
  *-namespace:1
       description: NVMe disk
       physical id: 2
       logical name: /dev/ng0n1
  *-namespace:2
       description: NVMe disk
       physical id: 1
       bus info: nvme@0:1
       logical name: /dev/nvme0n1
       configuration: wwid=eui.002538521191226d

Only my NVMe OS drive was showing up.

I tried a different slot in the backplane, but nope, same issue.

When building the server, I saw that the backplane had two SATA ports for each drive. I had heard this was supposed to be for redundancy, and I had also read that either of the two ports should technically work.

This is literally the only description in the user manual. Very helpful, much wow.

But nope: after switching the SATA connection to the other port, the one I wasn't using, the drive showed up:

$ sudo lshw -class disk
  *-disk
       description: ATA Disk
       product: Samsung SSD 840
       physical id: 0.0.0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: CB6Q
       serial: S1DHNSAF636463F
       size: 465GiB (500GB)
       capacity: 465GiB (500GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=6 logicalsectorsize=512 sectorsize=512 signature=007bfc0d
  *-namespace:0
       description: NVMe disk
       physical id: 0
       logical name: hwmon0
  *-namespace:1
       description: NVMe disk
       physical id: 2
       logical name: /dev/ng0n1
  *-namespace:2
       description: NVMe disk
       physical id: 1
       bus info: nvme@0:1
       logical name: /dev/nvme0n1
       size: 1863GiB (2TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt

Since my chassis uses 2x2 SATA backplanes, I reseated all the SATA cables and tested both backplanes, verifying that both worked. I also verified that hot-swapping the drive into each bay worked.

Problem 1 solved.

Temp (and noise) issues

Next was addressing the temps. The CPU temperature kept climbing, hitting 94 degrees at idle.

Switched back to the OEM fans.

Instantly, there was an improvement, with idle temps now stable around 45 degrees. However, these fans were loud, and I wanted to fine-tune them for a bedroom setting. The typical Linux tools for PWM fan control (lm-sensors with pwmconfig/fancontrol) were not detecting these fans.

But after some digging, I found that the fans are actually controlled by OpenBMC, which can be configured with a fan curve.

As per the Ampere community:

From "Managing Temperature and Fans" on the Ampere community wiki:

You can configure the fan response curve by editing /usr/share/swampd/config.json on OpenBMC. Note it won't survive reboots so you should store a copy in /etc and copy it over somehow into /usr/share each time the BMC boots.

I went ahead and modified the config.json like so:

{
    "sensors" : [
        ... redacted ...
    ],
    "zones" : [
        {
            "id": 0,
            "minThermalOutput": 15.0,
            "failsafePercent": 15.0,
            "pids": [
                ... redacted ...
                {
                    "name": "TEMP_SOC",
                    "type": "stepwise",
                    "inputs": ["TEMP_SOC"],
                    "setpoint": 30.0,
                    "failsafePercent": 75.0,
                    "pid": {
                        "samplePeriod": 1.0,
                        "positiveHysteresis": 1.0,
                        "negativeHysteresis": 1.0,
                        "isCeiling": false,
                        "reading": {
                            "0": 40,
                            "1": 65,
                            "2": 75,
                            "3": 85,
                            "4": 90
                        },
                        "output": {
                            "0": 15,
                            "1": 30,
                            "2": 40,
                            "3": 50,
                            "4": 80
                        }
                    }
                }
            ]
        }
    ]
}

The important pieces are the reading and output values, which define the fan curve, and minThermalOutput and failsafePercent, which set the default minimum fan speed and ship set to 30, meaning 30% is the lowest fan speed allowed. I had to reduce these to 15 so that the output values lower than 30 would take effect.

Because of the note saying that changes to this file don't survive reboots, I also made sure to copy the file over to /etc/config.json and modify /usr/lib/systemd/system/phosphor-pid-control.service to point at the new config path:

[Unit]
Description=OpenBMC Fan Control Daemon

[Service]
Type=simple
ExecStart=/usr/bin/swampd --conf /etc/config.json
Restart=always
RestartSec=5
StartLimitInterval=0

[Install]
WantedBy=basic.target
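
To pick all of this up without rebooting the BMC, something like the following should do it on the BMC shell (a sketch; I'm assuming the stock systemctl behaviour here):

# keep a persistent copy of the fan curve config
cp /usr/share/swampd/config.json /etc/config.json
# reload the edited unit file and restart the fan control daemon
systemctl daemon-reload
systemctl restart phosphor-pid-control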

The CPU (seems to have) stabilized at 52 degrees, and I will have to keep monitoring to see how I can balance temperature against noise. At 15%, these fans are still audible, but the real villain was not the chassis fans: the jet engine in the room was the redundant PSU and its two tiny 40mm fans running at essentially full blast.

There do seem to be mods to swap this out, but there are equal numbers of success stories and horror stories on the internet, especially for PSUs with "smart fan issue detection" features.

I will continue to tackle the noise problem, as I still need to spend some time to research my options.

Disk Space issue

Essentially I'm not seeing the full 2TB of usable storage.

root@ampere:/home/teamcity/.BuildServer/system/artifacts/KtorSample/Build# df -h
Filesystem                               Size  Used Avail Use% Mounted on
tmpfs                                     26G  3.5M   26G   1% /run
efivarfs                                 512K  9.3K  503K   2% /sys/firmware/efi/efivars
/dev/mapper/ubuntu--vg-ubuntu--lv         98G   34G   60G  37% /
tmpfs                                    126G     0  126G   0% /dev/shm
tmpfs                                    5.0M     0  5.0M   0% /run/lock
/dev/nvme0n1p2                           2.0G  101M  1.7G   6% /boot
/dev/nvme0n1p1                           1.1G  6.4M  1.1G   1% /boot/efi

It seems like this is related to the default LVM2 setup from the Ubuntu installer: the OS lives on an LVM2 logical volume (/dev/mapper/ubuntu--vg-ubuntu--lv, the 98G root filesystem above), and that volume only got a portion of the volume group.

lsblk -f
NAME                      FSTYPE      FSVER    LABEL       UUID                                   FSAVAIL FSUSE% MOUNTPOINTS
loop0                                                                                                   0   100% /snap/core22/1752
loop1                                                                                                   0   100% /snap/snapd/23772
loop2                                                                                                   0   100% /snap/core22/1804
nvme0n1
├─nvme0n1p1               vfat        FAT32                91CB-6122                                   1G     1% /boot/efi
├─nvme0n1p2               ext4        1.0                  3b749397-8da0-49a1-be61-7d7a360a6376      1.7G     5% /boot
└─nvme0n1p3               LVM2_member LVM2 001             biGHBi-AyfB-wYQ5-lyZ0-9Z4a-GnxN-nIsdNo
  └─ubuntu--vg-ubuntu--lv ext4        1.0                  b128dca1-8bc3-46f4-96a9-145927c54fc3     59.2G    34% /

fdisk does show the full disk size though:

fdisk -l
... redacted ...

Device           Start        End    Sectors  Size Type
/dev/nvme0n1p1    2048    2203647    2201600    1G EFI System
/dev/nvme0n1p2 2203648    6397951    4194304    2G Linux filesystem
/dev/nvme0n1p3 6397952 3907026943 3900628992  1.8T Linux filesystem

And output of pvs:

pvs
  PV             VG        Fmt  Attr PSize  PFree
  /dev/nvme0n1p3 ubuntu-vg lvm2 a--  <1.82t <1.72t

Resizing the logical volume to use all of the free space in the volume group, and then growing the filesystem:

sudo lvresize -l +100%FREE /dev/mapper/ubuntu--vg-ubuntu--lv
sudo resize2fs /dev/mapper/ubuntu--vg-ubuntu--lv
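
As an aside, lvresize can also grow the filesystem in the same step via its --resizefs flag, so the two commands above could be collapsed into one:

sudo lvresize -r -l +100%FREE /dev/mapper/ubuntu--vg-ubuntu--lv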

And now I was able to verify that I was no longer stuck with 60GB of total space.

df -h
Filesystem                               Size  Used Avail Use% Mounted on
tmpfs                                     26G  3.5M   26G   1% /run
efivarfs                                 512K  9.3K  503K   2% /sys/firmware/efi/efivars
/dev/mapper/ubuntu--vg-ubuntu--lv        1.8T   34G  1.7T   2% /

Network Speed

  *-network:0
       description: Ethernet interface
       product: Ethernet Controller X550
       vendor: Intel Corporation
       physical id: 0
       bus info: pci@0003:03:00.0
       logical name: enP3p3s0f0
       logical name: /dev/fb0
       version: 01
       serial: 9c:6b:00:4b:11:08
       size: 1Gbit/s
       capacity: 10Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi msix pciexpress bus_master cap_list rom ethernet physical tp 100bt-fd 1000bt-fd 10000bt-fd autonegotiation fb
       configuration: autonegotiation=on broadcast=yes depth=32 driver=ixgbe driverversion=6.11.0-19-generic duplex=full firmware=0x8000172d, 1.3105.0 ip=192.168.1.223 latency=0 link=yes mode=1920x1200 multicast=yes port=twisted pair speed=1Gbit/s visual=truecolor xres=1920 yres=1200
       resources: iomemory:24000-23fff iomemory:24000-23fff irq:104 memory:240000000000-2400003fffff memory:240000800000-240000803fff memory:11800000-1187ffff memory:11900000-119fffff memory:11a00000-11afffff

Size is only showing as 1Gbit/s, whereas I am expecting 2.5Gbit/s.

The ethernet device does support a 2.5GbE connection, but it looks like we are only advertising the 100, 1000 (1GbE), and 10000 (10GbE) baseT speeds.
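
A quick way to confirm this is to look at the "Supported link modes" and "Advertised link modes" sections of the plain ethtool output for the interface:

$ ethtool enP3p3s0f0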

$ ethtool -s enP3p3s0f0 speed 2500 duplex full autoneg on

Running this changed the advertised link mode to 2500baseT/Full, and now I'm able to see the 2.5 speeds:

Blue means 2.5GbE, and now I am at peace

However, this change will not persist across reboots (as per this thread), so I also had to add a systemd service to apply it on boot.
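
For reference, the unit I'm describing looks something like this (a sketch; the interface name comes from the lshw output above, and I'm assuming ethtool lives at /usr/sbin/ethtool as it does on Ubuntu):

[Unit]
Description=Force 2.5GbE on enP3p3s0f0
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -s enP3p3s0f0 speed 2500 duplex full autoneg on

[Install]
WantedBy=multi-user.target

Enabling the unit with systemctl enable makes it run on every boot.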

What's next?

Obviously, making the server quieter. It's so noisy, and my rack is in my office.