Frictionless fails to describe the table with the correct field type when the data file is big #1689

mingjiecn · 2024-09-23T19:31:23Z

Overview

When a field contains both integers and floats, frictionless describes the field as a number type. This works well when the data file is small, but we run into a problem when the data file is big. For example, we have a data file of about 2 GB in which one field can hold either 0 or a float. Most rows have a value of 0 for this field, and only a few have a float value. When frictionless describes the table, it describes this field as an integer type instead of a number type; it fails to see the float values in this column. Can this bug be fixed? Thanks!

This is the output of describing a small data file (test.tsv) and a big data file (TSTFI46007602.tsv) with the same fields. You can see that ref_score (along with alt_score and relative_binding_affinity) is identified as a number type in the small file but as an integer type in the big file:

(checkfiles_venv) (base) mingjie@Mingjies-MacBook-Pro checkfiles % frictionless describe  src/tests/data/test.tsv
# --------
# metadata: src/tests/data/test.tsv
# --------

name: test
type: table
path: src/tests/data/test.tsv
scheme: file
format: tsv
encoding: utf-8
mediatype: text/tsv
dialect:
  csv:
    delimiter: "\t"
schema:
  fields:
    - name: '#chrom'
      type: string
    - name: start
      type: integer
    - name: end
      type: integer
    - name: spdi
      type: string
    - name: ref
      type: string
    - name: alt
      type: string
    - name: kmer_coord
      type: string
    - name: ref_score
      type: number
    - name: alt_score
      type: number
    - name: relative_binding_affinity
      type: number
    - name: effect_on_binding
      type: string

(checkfiles_venv) (base) mingjie@Mingjies-MacBook-Pro checkfiles % frictionless describe  src/tests/data/TSTFI46007602.tsv
# --------
# metadata: src/tests/data/TSTFI46007602.tsv
# --------

name: tstfi46007602
type: table
path: src/tests/data/TSTFI46007602.tsv
scheme: file
format: tsv
encoding: utf-8
mediatype: text/tsv
dialect:
  csv:
    delimiter: "\t"
schema:
  fields:
    - name: '#chrom'
      type: string
    - name: start
      type: integer
    - name: end
      type: integer
    - name: spdi
      type: string
    - name: ref
      type: string
    - name: alt
      type: string
    - name: kmer_coord
      type: string
    - name: ref_score
      type: integer
    - name: alt_score
      type: integer
    - name: relative_binding_affinity
      type: integer
    - name: effect_on_binding
      type: string
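
One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that field types are inferred from a limited sample of leading rows instead of the whole file. The stdlib-only Python sketch below is not frictionless code; `infer_type`, `describe_column`, and the generated data are hypothetical names and values, used only to illustrate how a 100-row sample would miss a float that first appears later in the file.

```python
import csv
import io
from itertools import islice


def infer_type(values):
    """Pick the narrowest type that every value in the list parses as."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
        except ValueError:
            return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    return "string"


def describe_column(text, column, sample_size=None):
    """Infer a column's type from the first `sample_size` rows (None = all rows)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = islice(reader, sample_size)
    return infer_type([row[column] for row in rows])


# Hypothetical data: 1000 rows holding 0, then a single float value.
lines = ["ref_score"] + ["0"] * 1000 + ["0.37"]
data = "\n".join(lines) + "\n"

sampled = describe_column(data, "ref_score", sample_size=100)  # sees only zeros
full = describe_column(data, "ref_score")                      # sees the float too
print(sampled, full)
```

If sampling is indeed the cause, raising the sample size used for detection may be worth trying (as far as I know, frictionless exposes a `sample_size` setting on its `Detector` class), at the cost of a slower describe on large files.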
