Frictionless fails to describe the table with the correct field type when the data file is big #1689

Open
mingjiecn opened this issue Sep 23, 2024 · 0 comments

mingjiecn commented Sep 23, 2024

Overview

When a field contains both integers and floats, frictionless describes it as a number type. This works well when the data file is small, but we run into a problem when the data file is big. For example, we have a data file of about 2GB in which one field holds either 0 or a float. Most rows have a value of 0 in this field; only a few have a float value. When frictionless describes the table, it reports this field as an integer type instead of a number type, so it fails to see the float values in this column. Can this bug be fixed? Thanks!

This is the output of describing a small data file (test.tsv) and a big data file (TSTFI46007602.tsv) with the same fields. You can see that ref_score (along with alt_score and relative_binding_affinity) is identified as a number type in the small file but as an integer type in the big file:

(checkfiles_venv) (base) mingjie@Mingjies-MacBook-Pro checkfiles % frictionless describe  src/tests/data/test.tsv
# --------
# metadata: src/tests/data/test.tsv
# --------

name: test
type: table
path: src/tests/data/test.tsv
scheme: file
format: tsv
encoding: utf-8
mediatype: text/tsv
dialect:
  csv:
    delimiter: "\t"
schema:
  fields:
    - name: '#chrom'
      type: string
    - name: start
      type: integer
    - name: end
      type: integer
    - name: spdi
      type: string
    - name: ref
      type: string
    - name: alt
      type: string
    - name: kmer_coord
      type: string
    - name: ref_score
      type: number
    - name: alt_score
      type: number
    - name: relative_binding_affinity
      type: number
    - name: effect_on_binding
      type: string

(checkfiles_venv) (base) mingjie@Mingjies-MacBook-Pro checkfiles % frictionless describe  src/tests/data/TSTFI46007602.tsv
# --------
# metadata: src/tests/data/TSTFI46007602.tsv
# --------

name: tstfi46007602
type: table
path: src/tests/data/TSTFI46007602.tsv
scheme: file
format: tsv
encoding: utf-8
mediatype: text/tsv
dialect:
  csv:
    delimiter: "\t"
schema:
  fields:
    - name: '#chrom'
      type: string
    - name: start
      type: integer
    - name: end
      type: integer
    - name: spdi
      type: string
    - name: ref
      type: string
    - name: alt
      type: string
    - name: kmer_coord
      type: string
    - name: ref_score
      type: integer
    - name: alt_score
      type: integer
    - name: relative_binding_affinity
      type: integer
    - name: effect_on_binding
      type: string
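
One possible explanation, stated here as an assumption and not a confirmed diagnosis, is that the field types are inferred from a limited sample of leading rows instead of the whole file. The sketch below uses only the Python standard library and synthetic data (not the real TSTFI46007602.tsv) to show how sample-based inference would produce this result. The helpers `infer_type` and `infer_column_type` are hypothetical names written for illustration; they are not frictionless functions.

```python
import csv
import io
from itertools import islice


def infer_type(values):
    """Return the narrowest of 'integer', 'number', 'string' that fits all values."""
    kind = "integer"
    for value in values:
        try:
            int(value)
            continue  # still compatible with integer (or already widened to number)
        except ValueError:
            pass
        try:
            float(value)
            kind = "number"  # widen: at least one value is a float
        except ValueError:
            return "string"  # neither int nor float
    return kind


def infer_column_type(tsv_text, column, sample_size=None):
    """Infer a column type from the first `sample_size` rows (None scans every row)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = islice(reader, sample_size)
    return infer_type(row[column] for row in rows)


# Synthetic stand-in for the big file: 500 rows of 0, then a single float.
lines = ["start\tref_score"]
lines += [f"{i}\t0" for i in range(500)]
lines.append("500\t0.37")
data = "\n".join(lines) + "\n"

sampled = infer_column_type(data, "ref_score", sample_size=100)
full = infer_column_type(data, "ref_score")
print(sampled)  # integer: the float lies outside the 100-row sample
print(full)     # number: a full scan sees the float
```

If this is the cause, enlarging the sample (frictionless has a `Detector` class with a `sample_size` setting) or declaring the field types explicitly in a schema may work around it. Whether either resolves the behavior for this particular file is untested here.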