Updates to read CSV method #1512
base: master
Conversation
…ary when files argument is supplied with None
@@ -474,7 +475,7 @@ def convert_excel_files_to_csv(folder: Path, files: Optional[list[str]] = None,
         Path(folder/excel_file_path).unlink()

-def read_csv_files(folder: Path, files: Optional[list[str]] = None) -> DataFrame | dict[str, DataFrame]:
+def read_csv_files(folder: Path, dtype: DtypeArg | None = None, files: Optional[list[str]] | None | int = 0) -> DataFrame | dict[str, DataFrame]:
Suggested change:
-def read_csv_files(folder: Path, dtype: DtypeArg | None = None, files: Optional[list[str]] | None | int = 0) -> DataFrame | dict[str, DataFrame]:
+def read_csv_files(folder: Path, dtype: DtypeArg | None = None, files: list[str] | None | int = 0) -> DataFrame | dict[str, DataFrame]:
Optional[T] is equivalent to T | None, so is redundant here.
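For illustration, the two annotation styles below are interchangeable to a type checker, which is why the suggestion drops the extra Optional[...] wrapper (variable names here are just for the example):

```python
from __future__ import annotations
from typing import Optional

# These two annotations mean the same thing, so wrapping a union that
# already contains None in Optional[...] again is redundant:
files_a: Optional[list[str]] = None
files_b: list[str] | None = None
```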
@@ -498,15 +500,15 @@ def clean_dataframe(dataframes_dict: dict[str, DataFrame]) -> None:
     for _key, dataframe in dataframes_dict.items():
         all_data[_key] = dataframe.drop(dataframe.filter(like='Unnamed'), axis=1)  # filter and drop Unnamed columns

-    if files is None:
+    if files == 0 or files is None:
Is the reasoning for allowing both files = 0 and files = None to be used to select all files in the directory so that different output behaviour can be selected? At the moment it looks like files = 0 with a single file in the directory will return just a single dataframe (but a dictionary otherwise), while files = None will always return a dict. Having this flexibility possibly makes sense, but if this is the intention I think it needs to be made more obvious by explaining it in the docstring. Alternatively it might make more sense to instead have an explicit boolean flag return_single_file_as_dataframe to control the output behaviour, which if True and len(all_data) == 1 will force a single dataframe to be returned, with a dict returned otherwise, as the name would make it explicit to the user what the behaviour is.
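A minimal sketch of the flag-based alternative, assuming the function builds an all_data dict keyed by file stem as the current implementation appears to; the import paths, defaults, and file-discovery details below are illustrative, not the actual TLO code:

```python
from __future__ import annotations

from pathlib import Path

import pandas as pd
from pandas import DataFrame
from pandas._typing import DtypeArg  # assumed import path for the type alias


def read_csv_files(folder: Path,
                   dtype: DtypeArg | None = None,
                   files: list[str] | None = None,
                   return_single_file_as_dataframe: bool = False) -> DataFrame | dict[str, DataFrame]:
    """Sketch only: read the named CSV files (or all CSVs) in folder into DataFrames."""
    all_data: dict[str, DataFrame] = {}
    file_names = files if files is not None else sorted(p.stem for p in folder.glob('*.csv'))
    for name in file_names:
        all_data[name] = pd.read_csv(folder / f'{name}.csv', dtype=dtype)
    # An explicit flag makes the "collapse to a single DataFrame" behaviour opt-in and
    # self-documenting, instead of depending on files == 0 versus files is None.
    if return_single_file_as_dataframe and len(all_data) == 1:
        return next(iter(all_data.values()))
    return all_data
```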
@matt-graham, declaring an explicit flag return_single_file_as_dataframe to control the output is good, but I'm afraid it will change the behaviour that modellers are used to in pd.read_excel. If we are to implement this now it will mean introducing this flag in several other files as well. Do we want to do this in this PR?
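For context, the pd.read_excel behaviour being referred to: with the default sheet_name=0 a single DataFrame (the first sheet) is returned, whereas sheet_name=None returns a dict of DataFrames keyed by sheet name. The file name below is hypothetical:

```python
import pandas as pd

# Default sheet_name=0: a single DataFrame containing the first sheet.
first_sheet = pd.read_excel('workbook.xlsx', sheet_name=0)

# sheet_name=None: a dict mapping each sheet name to its DataFrame.
all_sheets = pd.read_excel('workbook.xlsx', sheet_name=None)
```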
@@ -484,6 +485,7 @@ def read_csv_files(folder: Path, files: Optional[list[str]] = None) -> DataFrame
     :py:func:`pandas.drop`.

     :param folder: Path to folder containing CSV files to read.
+    :param dtype: preferred datatype
The pandas.read_csv method allows passing in a dictionary of datatypes (mapping from column names to datatypes) in cases where you want not just a single datatype for the whole sheet but different datatypes per column. Are there any cases where we might need this functionality? As we pass the dtype argument straight through to pandas.read_csv we could already pass in a dict here, but it might be worth documenting this as an additional option if we do think it will be used.
Actually it seems like you do explicitly use this ability to pass a dict in the test below, so this should definitely be documented and the type hint for dtype updated to DtypeArg | dict[str, DtypeArg] | None.
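A short example of the per-column dtype dict being discussed; since dtype is forwarded to pandas.read_csv, either form works there. The file and column names below are hypothetical:

```python
import pandas as pd

# Single dtype applied to every column in the file.
df_uniform = pd.read_csv('sample_data.csv', dtype='float64')

# Per-column dtypes: a mapping from column name to dtype.
df_per_column = pd.read_csv('sample_data.csv', dtype={'numbers1': 'int64', 'district': 'str'})
```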
def test_pass_datatypes_to_read_csv_method(tmpdir):
    """ test passing column datatypes to read csv method. Final column datatype should change to what has been passed """
    # copy and get resource files path in the temporal directory
    tmpdir_resource_filepath = copy_files_to_temporal_directory_and_return_path(tmpdir)
As we don't seem to use the files copied to the temporary directory in copy_files_to_temporal_directory_and_return_path in this test, could we not just use tmpdir directly here?
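A sketch of what that suggestion could look like: write the sample CSV straight into tmpdir and read it back, with no copy helper. The import path, column names, and dtypes below are assumed from the assertions later in the test, not taken from the actual resource files:

```python
from pathlib import Path

import pandas as pd

from tlo.util import read_csv_files  # assumed import path for the function under test


def test_pass_datatypes_to_read_csv_method(tmpdir):
    """Sketch: write a small CSV into tmpdir and check the dtype passed to read_csv_files is respected."""
    sample = pd.DataFrame({'numbers1': [1, 2, 3], 'numbers2': [4, 5, 6]})
    sample.to_csv(Path(tmpdir) / 'sample_data.csv', index=False)

    read_sample_data = read_csv_files(Path(tmpdir), files=['sample_data'],
                                      dtype={'numbers1': 'int64', 'numbers2': 'int64'})
    assert read_sample_data.numbers1.dtype == 'int64'
    assert read_sample_data.numbers2.dtype == 'int64'
```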
    # read from the sample data file
    read_sample_data = read_csv_files(tmpdir_resource_filepath, files=['sample_data'])
    # confirm column datatype is what was assigned
    assert read_sample_data.numbers1.dtype and read_sample_data.numbers2.dtype == 'int'
Suggested change:
-    assert read_sample_data.numbers1.dtype and read_sample_data.numbers2.dtype == 'int'
+    assert read_sample_data.numbers1.dtype == 'int' and read_sample_data.numbers2.dtype == 'int'
The current condition is equivalent to (read_sample_data.numbers1.dtype) and (read_sample_data.numbers2.dtype == 'int'), that is, it checks whether read_sample_data.numbers1.dtype evaluates to True (which I think will always be the case) and whether read_sample_data.numbers2.dtype == 'int'.
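A tiny standalone illustration of the operator precedence issue, using plain strings in place of the actual dtype objects (values are hypothetical):

```python
a = 'float64'  # pretend this is the first column's dtype, and it is NOT int
b = 'int64'

# == binds tighter than and, so this parses as: a and (b == 'int64').
# A non-empty string is truthy, so only b is actually checked.
print(a and b == 'int64')             # True  -> the broken assert would still pass
print(a == 'int64' and b == 'int64')  # False -> both columns are checked, as intended
```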
Thanks for the review @matt-graham. Applying changes now.
This PR aims to address issue #1509