Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

atmel_nand.c: Empty page error correction fixed. #19

Closed

Conversation

ksdd
Copy link

@ksdd ksdd commented Mar 17, 2021

Not correcting anything in case of empty ECC data area
is not an appropriate strategy, because an uncorrected bit-flip
in an empty page may cause upper layers (namely UBI) fail to work
properly. Therefore the approach chosen in Linux kernel and other
u-boot mtd drivers has been adopted, where a heuristic implemented
by nand_check_erased_ecc_chunk() is used in order to detect and
correct empty pages.

@ambarus ambarus self-requested a review March 17, 2021 07:37
@ambarus ambarus self-assigned this Mar 17, 2021
Copy link

@ehristev ehristev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add your signed-off-by:
and edit your commit message to match driver: nand: commit shortmessage

Not correcting anything in case of empty ECC data area
is not an appropriate strategy, because an uncorrected bit-flip
in an empty sector may cause upper layers (namely UBI) fail to work
properly. Therefore the approach chosen in Linux kernel and other
u-boot mtd drivers has been adopted, where a heuristic implemented
by nand_check_erased_ecc_chunk() is used in order to detect and
correct empty sectors.

Signed-off-by: Kai Stuhlemmer (ebee Engineering) <[email protected]>
@ksdd ksdd force-pushed the fix/nand-empty-sector-bit-flip-correction branch from ba874cc to ff3c66d Compare March 17, 2021 12:53
@ksdd
Copy link
Author

ksdd commented Mar 17, 2021

Hi @ehristev, I change the commit message and added signed-off-by. I hope it is okay now.

@ehristev
Copy link

Hi @ehristev, I change the commit message and added signed-off-by. I hope it is okay now.

yes, good. I let @ambarus look over the code itself. thanks !

Copy link

@ambarus ambarus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, @ksdd,

Sorry for the long delay, it's just now that I found time to look over this.
Your general approach is fine, and as you stated, it follows the linux implementation. A similar patch was done for linux at:
ff6ee101584c ("mtd: nand: atmel: correct bitflips in erased pages for pre-sama5d4 SoCs")
What can be improved is to factor out the following chunk:

  •                                   dev_err(mtd->dev, "PMECC: Too many errors\n");
    
  •                                   mtd->ecc_stats.failed++;
    
  •                                   return -EBADMSG;
    

Don't worry about this, I can do this when applying.

I would like to understand more about your reasoning. I assume that you tested with sam9x60.
sam9x60 is capable of correcting empty pages, even though I see its discovered pmecc version is 0x102, which is smaller than PMECC_VERSION_SAMA5D4 (0x113). Thus, for the sam9x60 case, when you have less than CONFIG_PMECC_CAP bitflips, the code always goes through the else case and calls pmecc_correct_data(). When you have more that CONFIG_PMECC_CAP bitflips the code will superfluously call nand_check_erased_ecc_chunk(), as the ECC chunk was already checked internally in the hw.

9x60 works because you skipped the following code:

  •   eccbytes = nand_chip->ecc.bytes;
    
  •   for (i = 0; i < eccbytes; i++)
    
  •           if (ecc[i] != 0xff)
    
  •                   goto normal_check;
    
  •   /* Erased page, return OK */
    
  •   return 0;
    

But your code should handle well the 5d3 case, which has a 0x112 pmecc version. Unfortunately, I think there's a bug on 5d3. Even when introducing bitflips, err_nbr is always 0, which is strange. I'll have to check that.

On which SoC did you test your code?

Btw, here's what I used for testing on sam9x60:
nand erase 0 1
nand read.raw 0x21000000 0 1
md.b 0x21000000 0x10e0
mm.b 0x21000000
md.b 0x21000000 0x10e0
nand write.raw 0x21000000 0 1
nand read 0x21000000 0 0x1000
md.b 0x21000000 0x1000
nand read.raw 0x21000000 0 1
md.b 0x21000000 0x10e0

Cheers,
ta

@ksdd
Copy link
Author

ksdd commented May 19, 2021

Hi @ambarus,

our design uses a sam9x60d1g. Excerpt from dmesg:

AT91: Detected SoC family: sam9x60
AT91: Detected SoC: sam9x60d1g 128MiB DDR2 SiP, revision 1

Attached is a KOXIA TC58NVG1S3HBAI4 NAND flash. This flash requires 8-bit ECC and practical experience has shown, that it is prone to bit flips pretty much. That's why we discover all these NAND related issues ;-)

But your code should handle well the 5d3 case, which has a 0x112 pmecc version. Unfortunately, I think there's a bug on 5d3. > Even when introducing bitflips, err_nbr is always 0, which is strange. I'll have to check that.

My practical tests had shown, that the pmecc handled empty page bit flips well, although this was not described as a feature for this version of pmecc. To double check and backup my observation I asked our Microchip FAE. The answer was:

We completed the verification with the Design Team
We confirm that the hardware-assisted error correction of empty sectors also works and is reliable with the SAM9X60.

With this feed-back I was happy to apply the patch and we didn't see any problems ever since. Feel free to refactor/improve my patch.

Cheers,
Kai

@ambarus
Copy link

ambarus commented May 20, 2021

Hi, Kai,

I've made it work for 5d3 as well. I'll send your patch on your behalf on u-boot mainline, and after it gets merged in our tree, we will close this PR.

Thanks!
ta

err_nbr = nand_check_erased_ecc_chunk(
buf_pos, host->pmecc_sector_size,
ecc_pos, host->pmecc_bytes_per_sector,
NULL, 0, (host->pmecc_corr_cap * 3) / 4);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the last parameter of nand_check_erased_ecc_chunk() has value (host->pmecc_corr_cap * 3) / 4 ? Shouldn't this be just host->pmecc_corr_cap?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ambarus,

You are right! Just host->pmecc_corr_cap is correct. To be honest I don't remember, where this calculation came from.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @ksdd!

No worries, I've already corrected it. Just FYI, I sent the patch on your behalf to u-boot mainline. Once it gets applied by Eugen, and also backported to our u-boot-at91, we'll close this PR.

Cheers,
ta

@ehristev
Copy link

Patch available in u-boot-2021.04-at91.
Please let me know if there is anything left to do, or this is fine for you.
Thanks !

@ehristev ehristev closed this May 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants