Yesterday’s post was about wearing out the EEPROM memory on an ATmega168. It took over 6.7 million writes to address zero to make it fail. As mentioned, this is a limited test, because reading out the value was done right after writing it. But hey, at least I didn’t have to write 100,000 times to each EEPROM address and wait 20 years @ 85°C to find out whether each byte was still correct…
So we got a failure, now what?
Well, re-running the test caused it to fail after a mere 6,200 write cycles, so that EEPROM definitely isn’t up to spec anymore.
Today, I wanted to see what effect this single massive rewrite of address zero did to the rest of the EEPROM. So I wrote another test which would go through the entire EEPROM and see how well writes + read-backs would work on this same ATmega168 with its massively-abused EEPROM:

All failures are counted, per EEPROM address. Whenever the maximum number of failures at any single address increases, a map is printed with the counts for each of the 512 bytes. Here’s the startup situation – no errors yet:

The map counts are encoded as single characters:
0 = "."
1 to 9 = "1" .. "9"
10 to 35 = "a" .. "z"
36 to 61 = "A" .. "Z"
62 and up = "?"
Every 100 cycles, the cycle count is printed, just to let me know that the sketch is still running. Unfortunately, re-writing all 512 bytes in EEPROM is way slower than yesterday’s test, over 1 second per cycle. So I also added an LED to toggle on each cycle:

Now the waiting begins… it’s going to take hours (maybe even days!) to force the next failure again.
And one thing is clear: it’s not easy to get these chips to fail quickly!
Update – 100,000 cycles later, no new failure yet…
Is there any reason to believe that EEPROM locations other than 0 are broken?
Good Q. My hunch is that EEPROM is banked, and that single-byte writes might well do larger erase/writes underneath, so this was an experiment to test just that. If there is banking going on behind the scenes, I would expect a few bytes above address 0 to also fail occasionally.
So the memory is not defect, but just unreliable. If you would just would do a readback, compare and a rewrite in case it was incorrect you are able to use the memory untill all the smoke of that one byte is released ? (as in: “A chip is defect when the magical smoke is released”).
If there was some load-levelling scheme implemented behind the scenes, the entire memory might be worn out, after virtual address 0 got successively remapped to every other address. Not sure how likely that kind of complexity is in this case, though.
Did it fail yet ?
No, I disconnected it after two days. Should probably redo the test with a fresh ATmega to draw conclusions.