update allele encoding #9

Open
bguo068 opened this issue Nov 15, 2023 · 1 comment

bguo068 commented Nov 15, 2023

The current allele encoding method records, for every site, the change that defines each rare allele.
Rare alleles can then be represented as integers from 1 to n in GenotypeRecords, irrespective of the ALT/REF allele order at that site.

Consider a site with REF = "T" and ALT = ["C", "A"]:

  1. When 'C' and 'A' are rare, and 'T' is common, the record "REF T->C T->A" is stored
    in the Sites struct.
  2. If 'T' and 'A' are rare, with 'C' being common, the record "REF T->A" is
    stored in Sites ('C' is not stored).
  3. When 'T' is rare, 'C' is common, and 'A' has an allele count of 0,
    only "REF" is recorded, disregarding the actual allele string/byte values of 'T' and 'C'.
  4. In cases where two alleles are common (e.g., 'T' and 'C'), and 'A' is rare,
    the record "REF T->A" is stored. For a genotype without a rare allele record,
    the genotype (of common alleles) is ambiguous.
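
The four cases above can be reproduced with a small Python sketch. It is not the project's code: the function name `encode_site_current`, the frequency cutoff `max_rare_freq`, and the allele counts are hypothetical, and "rare" is assumed to mean a non-zero allele frequency below the cutoff. The record is modelled as a list of strings using the "T->A" notation from the examples.

```python
def encode_site_current(ref, alts, allele_counts, max_rare_freq=0.01):
    """Build the per-site record described in cases 1-4 (illustrative only).

    allele_counts[0] is the count of REF; allele_counts[1:] follow `alts`.
    The rarity rule (0 < frequency < max_rare_freq) is an assumption.
    """
    total = sum(allele_counts)
    if total == 0:
        raise ValueError("site has no observed alleles")
    record = ["REF"]  # the REF token is always present
    for alt, ac in zip(alts, allele_counts[1:]):
        if 0 < ac / total < max_rare_freq:
            # Only rare ALT alleles get a "REF->ALT" change entry.
            record.append(f"{ref}->{alt}")
    return record


# Hypothetical allele counts, ordered as [T, C, A]:
print(encode_site_current("T", ["C", "A"], [980, 5, 5]))    # case 1
print(encode_site_current("T", ["C", "A"], [5, 990, 5]))    # case 2
print(encode_site_current("T", ["C", "A"], [5, 995, 0]))    # case 3
print(encode_site_current("T", ["C", "A"], [500, 495, 5]))  # case 4
```

Cases 2 and 4 produce the same record even though the set of common alleles differs, and case 3 keeps no allele strings at all.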

This method works well for current functions but presents challenges when converting data back to BCF format (see also #5):

  • Caveat 1: When the REF allele is the sole rare allele, the string/byte values of both the REF and common alleles are lost in tabular encoding. Although retrievable from the original VCF/BCF file, this is not convenient or ideal.
  • Caveat 2: With multiple common alleles, the common allele index cannot be inferred from the tabular encoding of rare genotypes.
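
Caveat 2 can be illustrated with a short Python sketch of the decoding step. The sparse mapping `rare_records` (sample name to rare allele) and the function `candidate_alleles` are hypothetical stand-ins, not the actual GenotypeRecords layout.

```python
def candidate_alleles(sample, rare_records, common_alleles):
    """Return the alleles a sample could carry at one site (illustrative only).

    A sample with a rare-allele record is unambiguous. A sample without one
    must carry some common allele, but the encoding does not say which.
    """
    if sample in rare_records:
        return [rare_records[sample]]
    return list(common_alleles)


# Case 4 from the list above: 'T' and 'C' are common, 'A' is rare.
rare_records = {"s3": "A"}  # only rare genotypes are stored
print(candidate_alleles("s3", rare_records, ["T", "C"]))  # ['A']
print(candidate_alleles("s1", rare_records, ["T", "C"]))  # ['T', 'C'], ambiguous
```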

To address these issues:

  • Store the REF allele and all ALT alleles with an allele count (AC) > 0. Rare alleles in GenotypeRecords would then correspond to an integer index based on the order of stored REF/ALT alleles. This approach resolves Caveat 1.
  • For sites with multiple common alleles, we wouldn’t explicitly indicate which alleles are common. Instead, a stored allele would be inferred to be common when it never appears as a rare allele in GenotypeRecords. This approach doesn’t directly resolve Caveat 2, but it does flag the sites affected by the multi-common-allele issue.
    - One solution is to represent all common alleles at a multi-common-allele site by the first common allele and to alert users to potential issues in the exported BCF.
    - Additionally, we could offer options to filter out these sites for further clarity.
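
A minimal Python sketch of the proposal, under stated assumptions: the names `encode_site_proposed` and `export_fill_allele`, the cutoff `max_rare_freq`, and the 0-based indexing with REF at index 0 are all hypothetical and are not taken from the referenced commit.

```python
import warnings


def encode_site_proposed(ref, alts, allele_counts, max_rare_freq=0.01):
    """Store REF plus every ALT with AC > 0 and index alleles by stored order.

    Returns (stored, rare_idx, common_idx). The rarity rule and the 0-based
    indices are assumptions made for this sketch.
    """
    total = sum(allele_counts)
    if total == 0:
        raise ValueError("site has no observed alleles")
    # REF is always kept; ALT alleles are kept only when observed.
    kept = [(ref, allele_counts[0])]
    kept += [(alt, ac) for alt, ac in zip(alts, allele_counts[1:]) if ac > 0]
    stored = [allele for allele, _ in kept]
    rare_idx, common_idx = [], []
    for idx, (_, ac) in enumerate(kept):
        freq = ac / total
        if 0 < freq < max_rare_freq:
            rare_idx.append(idx)
        elif freq >= max_rare_freq:
            common_idx.append(idx)
    return stored, rare_idx, common_idx


def export_fill_allele(stored, common_idx):
    """Pick the allele used for samples without a rare record on export."""
    if len(common_idx) > 1:
        warnings.warn("multiple common alleles; using the first one")
    return stored[common_idx[0]]


# Case 3 (Caveat 1): both allele strings now survive.
print(encode_site_proposed("T", ["C", "A"], [5, 995, 0]))
# (['T', 'C'], [0], [1])

# Case 4 (Caveat 2): the site is detectable as multi-common.
stored, rare_idx, common_idx = encode_site_proposed("T", ["C", "A"], [500, 495, 5])
print(stored, rare_idx, common_idx)  # ['T', 'C', 'A'] [2] [0, 1]
print(export_fill_allele(stored, common_idx))  # T, with a warning
```

A filtering option would amount to dropping sites where `len(common_idx) > 1` before export.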
bguo068 added a commit that referenced this issue Nov 17, 2023

bguo068 commented Nov 17, 2023

Updated the allele encoding in 4a6b17f.
