simulate_genotype_call_dataset creates duplicate alleles #1221

Open
hyanwong opened this issue Jun 7, 2024 · 0 comments
Labels
bug Something isn't working

hyanwong commented Jun 7, 2024

E.g. the same allele can appear twice for a site in ds['variant_allele'] (two "T" values at site 6 in the run below):

import sgkit as sg
import numpy as np

ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=4, missing_pct=0, phased=True, seed=1)
for i, alleles in enumerate(ds['variant_allele'].values):
    print(f"Site {i}: {alleles}")
    assert len(np.unique(alleles)) == len(alleles)

Fails on site 6:

Site 6: [b'T' b'T']
---------------------------------------------------------------------------
AssertionError
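
To list every affected site instead of stopping at the first failed assertion, a small checker along these lines may help. It is only a minimal sketch: `find_duplicate_allele_sites` is a hypothetical helper, not part of sgkit, and it assumes each site's alleles arrive as a sequence of hashable values (bytes or str), as in `ds['variant_allele'].values`.

```python
from collections import Counter


def find_duplicate_allele_sites(variant_alleles):
    """Map site index -> sorted list of alleles that occur more than once.

    Hypothetical helper for illustration; not part of sgkit.
    """
    duplicates = {}
    for i, alleles in enumerate(variant_alleles):
        counts = Counter(alleles)
        repeated = sorted(a for a, n in counts.items() if n > 1)
        if repeated:
            duplicates[i] = repeated
    return duplicates


# Toy input mimicking the shape of ds['variant_allele'].values
example = [[b"A", b"C"], [b"T", b"T"], [b"G", b"A"]]
print(find_duplicate_allele_sites(example))
```

With the dataset from the snippet above, it could be called as `find_duplicate_allele_sites(ds['variant_allele'].values)`; an empty dict would mean no site has repeated alleles.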

This can cause much confusion in downstream analysis. See tskit-dev/tsinfer#927
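
One way a simulator could avoid this is to draw each site's alleles without replacement. The sketch below illustrates the idea using only the standard library; it is not sgkit's implementation, and `draw_distinct_alleles` together with its parameters (`n_variant`, `n_allele`, `seed`, `bases`) are hypothetical names chosen for this example.

```python
import random


def draw_distinct_alleles(n_variant, n_allele=2, seed=None, bases=("A", "C", "G", "T")):
    """Return n_variant lists of n_allele distinct bases each.

    Hypothetical sketch for illustration; not sgkit's implementation.
    """
    if n_allele > len(bases):
        raise ValueError("n_allele cannot exceed the number of available bases")
    rng = random.Random(seed)
    # random.sample draws without replacement, so a base cannot repeat within a site
    return [rng.sample(bases, n_allele) for _ in range(n_variant)]


sites = draw_distinct_alleles(n_variant=10, seed=1)
for i, alleles in enumerate(sites):
    # The uniqueness check from the report now holds for every site
    assert len(set(alleles)) == len(alleles)
```

Seeding a dedicated `random.Random` instance keeps the draw reproducible without touching global random state.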

hyanwong changed the title simulate_genotype_call_dataset created duplicate alleles Jun 7, 2024
jeromekelleher added the bug label Jun 7, 2024