Skip to content

Conversation

nleroy917
Copy link
Contributor

Hey @jackh726!

As mentioned in #56, this PR adds a new option to the bedgraphtobigwig binary that will "skip" any chromosomes in the .bedGraph file that are not represented in the given chrom.sizes file. This functionality is inspired by the kent utils -clip option in wigToBigWig.

I appreciate you letting me take a stab at this. Currently... it's unfinished. It is not working properly, but I thought I'd open it and get your immediate feedback. I think I'm getting stuck in an infinite loop and will continue working.

Anyways, after getting my bearings, I was able to quickly add --clip to BBIWriteOptions. I had some troubles deciding how to implement the skip logic, however.

I believe that I was able to find the source of the error... The do_read that gets passed as the start_processing argument in process_to_bbi will return the ProcessDataError::InvalidChromosome error if the chromosome is not found in chrom.sizes. I had three ideas to go forward:

  1. Add a fourth argument called clip to the process_to_bbi function in the BBIDataSource trait. Then, inside match the error and skip if clip is true. I didn't like this since it felt too specific for the trait.
  2. Add a new error called ProcessDataError::ClippedChromosome. If clip is true, return this instead of the current ProcessDataError::ClippedChromosome. Likewise, this can be matched inside process_to_bbi and enable a "skip"-like feature. This felt weird since its not really an error.
  3. Add a boolean field to ProcessDataError::InvalidChromosome. Again, match on this error and that field being true, skip the chromosome.

I went with option three since it seemed the least hacky and most idiomatic/rusty way. I'm curious to your thoughts on this and it seems the real trick is just getting those BBIWriteOptions into process_to_bbi.

Will continue working on the infinite loop, so stay tuned. Thank you!

@jackh726
Copy link
Owner

Hi @nleroy917, nice to see a PR :)

Of your three options, 1 definitely does seem like not the right option. 2 and 3 are similar, in that you're modifying BBIProcessError to capture the "don't process this chromosome" state. I would offer one other option: make StartProcessing be FnMut(String) -> Result<Option<P>, ProcessDataError>, where Ok(None) means "no error, but don't process this chromosome.

One important note: in the BedParserStreamingIterator impl, you will still have to read through the bed file past the skipped chromosome. The BedParserParallelStreamingIterator impl can just be continued since each chrom gets a separate File open+seek.

I have a couple other thoughts:

First, the naive thing to do would be just "pretend" everything is fine to process_to_bbi. This would really only require faking the length by setting to u32::MAX. Then, later when actually writing the data, skip the chromosomes. This has the benefit that proces_to_bbi stays agnostic to clipping. But, has a few major downsides. Particularly that a lot of extra work is done for no reason, and that the chrom will get an id assigned to it. However, it seems appealing because...

Second, it's not a big deal, but so far process_to_bbi hasn't actually cared about the chromosome length at all. With this option, it doesn't need to, since this check can (and should) just be done in process_val (btw, I'm noticing that bigbedwrite::process_val has the wrong check: current_val.start >= chrom_length instead of current_val.end >= chrom_length - can you fix that when you touch that bit of code?). But, it does split the handling of clip into two places. An alternative is to move those checks into process_val, but I don't like that option particularly much, since now implementations of BBIDataProcessor have to duplicate value validation. And, I think just having a way to tell process_to_bbi to skip a chromsome is fine if the actual reason behind that is agnostic.

Thoughts?

@nleroy917
Copy link
Contributor Author

Wow has it been a month? @jackh726 apologies for how long it took to get back to this! I promise I haven't forgotten!

I've read through, and I really like the idea of making the StartProcessing arg be a closure to return Result<Option<P>, ProcessDataError>. I think I've got that implemented. I'm working my way through actual skip logic now, which is less clear to me. I think I need to spend a tad bit more time with the codebase...

Stay tuned I am going to keep going on this.

@jackh726
Copy link
Owner

No wories, let me know if you get stuck and need some thoughts!

@jackh726 jackh726 force-pushed the master branch 4 times, most recently from 20f71b9 to 444daf1 Compare February 11, 2025 04:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants