Learning to Debug Gibberish

Posted at — Dec 27, 2023

Overview

Recently, I went down a rabbit hole debugging an open-source LLM spewing gibberish on my PC. The investigation was one of the most difficult (but interesting) debugging experiences I have encountered so far.

This post chronicles the problem I encountered, the debugging steps I took, and finally the resolution and workaround I adopted.

tl;dr I traced the source of the gibberish to faulty RAM sticks after getting different sha256 checksums for the same static file on different runs. I was then able to salvage ~40% of the faulty RAM sticks' capacity after discovering the Linux kernel parameter memtest=.

The Problem

Like everyone else, I have been spending a lot of time studying and playing with open-source LLMs, primarily on the workstation sitting in my garage.

Initially, I focused on deploying and setting up a local instance with llama.cpp and vLLM using quantized Mistral models.

However, as I delved deeper and started customizing models (another blog post!) using PyTorch and the transformers library, I noticed that the models I downloaded directly from Hugging Face all spewed gibberish!

This problem showed up in essentially all the models I tried, without any obvious pattern. To make the whole situation extra confusing, some models worked.

Debugging

My initial suspicion was some subtle software compatibility issue in my setup. After all, getting an LLM to run locally requires an insane amount of software.

So as a first step, I started upgrading and downgrading the Nvidia driver and CUDA versions. Specifically, I tried different combinations of CUDA 12.1 and 11.8 with driver versions ranging from 520 to 535.

Concurrently, I also tried different combinations of Python, PyTorch, and transformers versions.
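
For reference, each iteration boiled down to pinning one combination and retesting; the versions below are illustrative examples, not a known-good set:

# Pin one CUDA/PyTorch/transformers combination to test (versions illustrative)
pip3 install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip3 install transformers==4.35.0

# Confirm what the driver and runtime actually report before rerunning the model
nvidia-smi
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"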

However, all combinations led to the same result: the LLM still spewed gibberish.

Googling (and asking ChatGPT) various queries ("transformers+all+models+gibberish", "nvidia+3090+pytorch+gibberish", "qwen+gibberish", etc.) led nowhere.

Verifying in Cloud

After trying just about every possible combination of software dependencies, I decided I needed a known-working setup to figure out what was wrong with my workstation.

I hopped over to Google Colab and ran Qwen-7B’s starter template:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Download and load Qwen-7B-Chat's tokenizer and weights from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()

# Send a simple greeting ("你好" means "hello") and print the model's reply
response, history = model.chat(tokenizer, "你好", history=None)
print(response)

and sure enough, it worked on Google Colab!

I then noted down the Nvidia driver version (!nvidia-smi) and the Python dependencies (!pip3 freeze) and worked toward replicating that environment locally. Surely, if it worked in the cloud, it must work on my PC too!
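
Roughly, capturing the Colab environment and replaying it locally came down to something like the sketch below (the requirements filename and virtualenv name are just illustrative):

# In Colab: record the driver version and freeze the exact package versions
!nvidia-smi
!pip3 freeze > colab-requirements.txt

# Locally: install the same pinned versions into a fresh virtual environment
python3 -m venv llm-env && source llm-env/bin/activate
pip3 install -r colab-requirements.txt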

Except it did not…

SHA256 in a Loop

Despite meticulously replicating the software environment from Colab, the LLM still talked gibberish. Running out of places to look, I decided to verify that the model weight files (safetensors, bin files, etc.) were correct, i.e., had the same checksums as on Google Colab.

The possibility of the model files being corrupted somehow seemed quite unlikely, since I had tested dozens of different models and PyTorch's `load` calls were all succeeding.

However, it was at this point that I was reminded of the following quote:

Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.

So I grabbed the sha256sums of the chatglm-6B files from Google Colab, compared them against my local copy, and sure enough the checksums were different.

“Ok so I have some corrupted files. I will just download them again”, I thought to myself.

I duly downloaded the model files again and ran sha256sum on the new files, only to discover they were still different from Colab’s checksums AND from the previously downloaded files. That is, I had three sets of checksums at this point!

A little confused, I reran the checksums on the newly downloaded files, and this time they yielded completely new sha256 values, different from all previous attempts.
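
In hindsight, the whole test fits in a short loop: hash the same unchanged file repeatedly (the filename below is illustrative) and check whether the digests agree. On healthy hardware every run prints the identical checksum; on my machine they kept changing.

# Hash the same, unchanged file several times in a row.
# On healthy hardware every line prints the identical digest.
for i in 1 2 3 4 5; do
    sha256sum model-00001-of-00008.safetensors
done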

Faulty Hardware

At this point, I realized I had a hardware problem.

The prime suspects were the hard drive, the network, and/or the memory.

I eliminated the hard drive by downloading the files to a different drive; the checksums still differed between runs.

This was replicable irrespective of the file’s physical location, whether on SATA drives, M.2 sticks, or an external USB drive. And since the checksums kept changing even when re-hashing files already sitting on disk, the network was off the hook too. That left one culprit: memory.

What turned out to be happening was that even when the model files were downloaded correctly, the weights were silently corrupted by bad RAM as they were loaded into memory for inference, so every model spewed gibberish!!

Linux Kernel to the Rescue

After testing my RAM sticks one by one in each of the motherboard slots, I was able to identify the faulty sticks.

While I was glad to have my sanity restored, I was bummed about losing 64GB of RAM capacity.

What was interesting was that the RAM mostly worked; the data corruption was rare enough that it only manifested itself during LLM inference!

I researched the possibility of identifying the bad memory regions and whether Linux could somehow cope with bad RAM.

The answer turned out to be YES, and it is simple to enable as well.

All it takes is adding the kernel parameter memtest=N to instruct the Linux kernel to perform N memory test passes during early boot and cordon off any bad memory regions it detects.

memtest=        [KNL,X86,ARM] Enable memtest
                        Format: <integer>
                        default : 0 <disable>
                        Specifies the number of memtest passes to be
                        performed. Each pass selects another test
                        pattern from a given set of patterns. Memtest
                        fills the memory with this pattern, validates
                        memory contents and reserves bad memory
                        regions that are detected.

This parameter can be added to /etc/default/grub so Linux runs the memory test on every boot.

Here is my updated grub file:

# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=menu
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="memtest=4"
GRUB_CMDLINE_LINUX=""

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"

Notice that BadRAM is mentioned as well, in case you already have the bad regions identified via another tool such as MemTest86. I did not pursue this option since running the memory test on boot is more dynamic and easier.
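
After saving the file, the change still has to be applied and is worth sanity-checking after the next boot. Here is a minimal sketch (the exact dmesg wording varies by kernel version):

# Regenerate /boot/grub/grub.cfg with the new kernel command line, then reboot
sudo update-grub
sudo reboot

# After reboot: confirm the parameter is active and see what memtest reserved
cat /proc/cmdline
sudo dmesg | grep -i memtest
free -h    # total usable memory shrinks by however much was marked bad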

Leveraging memtest, Linux determined that ~40% of the faulty RAM sticks' capacity is still usable! I also reran my sha256sum test a good number of times to verify stability.
