Hi,
we have an imx6 board similar to the TinyRex with a problem during the boot from a NAND.
We are experiencing some reliability problems on a i.MX6ULL system with a Micron MLC MT29F32G08CBADAWP NAND and we have some questions on the best practices to follow, using NANDs.
We have few systems that sometimes can't boot due to a CRC error on the initramfs partition. We suppose that sometimes the error can also be present on the kernel and dtb MTD partitions. In this case the system have only few months of life, so it's not a very old NAND.
The error we see:
U-Boot 2016.03-nxp/imx_v2016.03_4.1.15_2.0.0_ga+ga57b13b (Aug 04 2020 - 09:37:05 +0000)
CPU: Freescale i.MX6ULL rev1.1 528 MHz (running at 396 MHz)
CPU: Industrial temperature grade (-40C to 105C) at 60C
Reset cause: WDOG
Board: MX6ULL 14x14 EVK
I2C: ready
DRAM: 512 MiB
NAND: 4096 MiB
MMC: FSL_SDHC: 0, FSL_SDHC: 1
*** Warning - bad CRC, using default environment
Display: TFT43AB (480x272)
Video: 480x272x24
In: serial
Out: serial
Err: serial
Net: FEC0
Normal Boot
Autoboot in 3 seconds
NAND read: device 0 offset 0x4000000, size 0x800000
8388608 bytes read: OK
NAND read: device 0 offset 0x5000000, size 0x100000
1048576 bytes read: OK
NAND read: device 0 offset 0x6000000, size 0x1200000
Skipping bad block 0x06600000
18874368 bytes read: OK
Kernel image @ 0x80800000 [ 0x000000 - 0x6f9708 ]
## Loading init Ramdisk from Legacy Image at 83800000 ...
Image Name: Init Ram Disk
Image Type: ARM Linux RAMDisk Image (gzip compressed)
Data Size: 8897996 Bytes = 8.5 MiB
Load Address: 00000000
Entry Point: 00000000
Verifying Checksum ... Bad Data CRC
Ramdisk image is corrupt or invalid
Our NAND has an erase size of 2MB and is split as follow: uboot (64MB), kernel (16MB), dtb (16MB), system (rest of the 4GB NAND) and there is plenty of spare blocks on each partition.
When the system fails, looking at the Bad Blocks Table from u-boot we see:
=> nand bad Device 0 bad blocks: 06600000 0b400000 0b600000
The first one being, in fact, inside the initramfs partition. (we see the others two on every system, so we support they are present by default on the NAND chip itself).
Considering that the u-boot, kernel, dtb and initramfs are only written once in production, and the system may go many months without a reboot, we wonder how those partitions may get corrupted.
We have few hypotheses:
- over time an excessive number of bitflips happens on a block, making it uncorrectable
- a single bitflip happens on the first byte on the OOB part of a block, changing it from 0xff and as such marking it as bad
Does these ideas make sense?
How can it be that, sometimes, the same system is then able to boot again (without any intervention) powering on/off this many times?
Another strange thing we notice is that re-flashing only the u-boot (and nothing else) have made the bad block on the initramfs partition disappear.
To flash partitions we erase them with flash_erase /dev/mtdX and then write them using nandwrite (or kobs-ng for u-boot).
We also have some questions about how we can make the system more reliable: is it a good idea to duplicate the kernel, dtb and initramfs partitions and detect in some way a failure in u-boot, so that a second copy can be loaded if the first fails? or there are better solutions?
Now, a very wild guess: is it possible that reading these partitions periodically from a live system (using nanddump, for example) could improve their health? the reasoning here is as follow: reading it with nanddump would detect (and correct?) bitflips, preventing them to accumulate over time. Does it make sense?
We are asking because we read on Micron documentation that bad blocks are marked only on erase and write operations, but we assume that they are also marked as bad when reading them if there are too many bitflips to correct the errors. Is it right? In our case we never erase or write them again, just read them on a reboot so we do not understand how they can be marked as bad.
Best regards.