The LTR_retriever fails to identify LTRs with tandem repeats, such as Dasheng. #165

Open
CSU-KangHu opened this issue Mar 18, 2024 · 3 comments

Comments

@CSU-KangHu

Hi @oushujun,

The Dasheng LTR contains a tandem repeat, and LTR_retriever's current filtering strategy appears to discard the element because of it. Below is the scn record for a Dasheng LTR in rice:

1524180 1531026 6847 1524180 1524621 442 1530585 1531026 442 98.87 0 Chr1

I conducted debugging based on this record.
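For readers following along, here is a minimal sketch of how the record above can be parsed and sanity-checked. It assumes the usual LTRharvest-style scn column order (element start, end, length; left LTR start, end, length; right LTR start, end, length; LTR similarity; sequence number; chromosome). The names `ScnRecord`, `parse_scn_line`, `check_coordinates`, and `internal_region` are made up for this illustration and are not part of LTR_retriever.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ScnRecord:
    """One scn line, assuming LTRharvest-style column order (an assumption)."""
    start: int
    end: int
    length: int
    lltr_start: int
    lltr_end: int
    lltr_len: int
    rltr_start: int
    rltr_end: int
    rltr_len: int
    similarity: float
    seq_nr: int
    chrom: str


def parse_scn_line(line: str) -> ScnRecord:
    """Split a whitespace-delimited scn line into typed fields."""
    fields = line.split()
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")
    coords = [int(x) for x in fields[:9]]
    return ScnRecord(*coords, float(fields[9]), int(fields[10]), fields[11])


def check_coordinates(rec: ScnRecord) -> bool:
    """Check that the lengths match the 1-based inclusive coordinates and
    that the two LTRs sit at the element termini."""
    return (
        rec.end - rec.start + 1 == rec.length
        and rec.lltr_end - rec.lltr_start + 1 == rec.lltr_len
        and rec.rltr_end - rec.rltr_start + 1 == rec.rltr_len
        and rec.lltr_start == rec.start
        and rec.rltr_end == rec.end
    )


def internal_region(rec: ScnRecord) -> Tuple[int, int]:
    """Return the 1-based inclusive interval between the two LTRs."""
    return rec.lltr_end + 1, rec.rltr_start - 1


record = parse_scn_line(
    "1524180 1531026 6847 1524180 1524621 442 1530585 1531026 442 98.87 0 Chr1"
)
```

Under that column assumption, the record describes a 6847 bp element with two 442 bp LTRs at its ends, and the coordinates are internally consistent.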

You can also view this instance on UCSC: Link

I tried disabling trf, but the element is then filtered out by the Identifier function in LTR.identifier.pl because its blastn alignment produces more than 8 alignment records. Could you confirm whether LTR_retriever can currently identify this LTR?

@oushujun
Owner

Hello,

Yes, I realized that LTR elements like Dasheng may contain tandem repeats. They are currently filtered. We are still working on a new way to process the tandem repeat information.

LTR_retriever does not count the copy number of flanking sequences. Instead, it compares the flanking sequences with the internal sequences and discards candidates whose flanking sequences are similar to their internal sequences. This usually works well but may break down in centromeric repeats, where Dasheng is located. It would help to see exactly how the element is filtered, based on the information in the .defalse file.

Thanks,
Shujun

@CSU-KangHu
Author

CSU-KangHu commented Mar 19, 2024

The contents of the .defalse file are presented below:
image

Even with trf disabled to retain the Dasheng LTR, it is still filtered out by the Identifier function in LTR.identifier.pl:
image

During the self-alignment of Dasheng using blastn, the number of alignment records it generates far exceeds 8:
image

@oushujun
Owner

Hey, sorry for the delay. From the .defalse output, the program cannot identify the LTR region because blastn reports multiple self-alignment entries, as you showed. You can pick out the LTR region by eye (it is the line with the 1...442 alignment), but I have not implemented a programmatic scheme to find such regions among multiple self-alignments. Can you come up with a way?

Shujun
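
One possible direction for the question above, offered only as a rough sketch and not as LTR_retriever's method: among the self-alignment hits, keep those whose query interval starts near position 1 and whose subject interval ends near the last base of the element, then take the longest. Internal tandem-repeat hits would normally fail at least one of the two terminal conditions. The sketch assumes blastn tabular output (`-outfmt 6`, default 12 columns) for the element aligned against itself. The function names and the `tol` tolerance of 10 bp are hypothetical choices made for illustration.

```python
from typing import Iterable, List, NamedTuple, Optional


class Hit(NamedTuple):
    qstart: int
    qend: int
    sstart: int
    send: int
    pident: float


def parse_outfmt6(lines: Iterable[str]) -> List[Hit]:
    """Parse default blastn -outfmt 6 lines (12 tab-separated columns)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        hits.append(Hit(int(f[6]), int(f[7]), int(f[8]), int(f[9]), float(f[2])))
    return hits


def pick_terminal_pair(
    hits: Iterable[Hit], element_len: int, tol: int = 10
) -> Optional[Hit]:
    """Return the longest hit pairing the 5' end with the 3' end, or None.

    Heuristic (an assumption, not the tool's logic): the true LTR pair
    aligns the first bases of the element to its last bases on the same
    strand, while tandem-repeat hits do not touch both termini.
    """
    candidates = [
        h for h in hits
        if h.sstart < h.send                           # same-strand hits only
        and (h.qstart, h.qend) != (h.sstart, h.send)   # skip the trivial self hit
        and h.qstart <= 1 + tol                        # query starts at the 5' end
        and h.send >= element_len - tol                # subject ends at the 3' end
        and h.qend < h.sstart                          # the two copies do not overlap
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.qend - h.qstart + 1)
```

For the Dasheng record in this thread, `element_len` would be 6847, and the expected winner is the hit whose query interval is 1...442. A tandem repeat that sits right at the LTR boundary could still confuse this heuristic, so it would need testing on real .defalse cases before being relied on.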
