
Updates to read CSV method #1512

Open

wants to merge 4 commits into base: master
Conversation

@mnjowe (Collaborator) commented Nov 12, 2024

This PR aims to address issue #1509

…ary when files argument is supplied with None
@mnjowe mnjowe self-assigned this Nov 12, 2024
@mnjowe mnjowe linked an issue Nov 12, 2024 that may be closed by this pull request
@@ -474,7 +475,7 @@ def convert_excel_files_to_csv(folder: Path, files: Optional[list[str]] = None,
Path(folder/excel_file_path).unlink()


-def read_csv_files(folder: Path, files: Optional[list[str]] = None) -> DataFrame | dict[str, DataFrame]:
+def read_csv_files(folder: Path, dtype: DtypeArg | None = None, files: Optional[list[str]] | None | int = 0) -> DataFrame | dict[str, DataFrame]:
Collaborator
Suggested change:

-def read_csv_files(folder: Path, dtype: DtypeArg | None = None, files: Optional[list[str]] | None | int = 0) -> DataFrame | dict[str, DataFrame]:
+def read_csv_files(folder: Path, dtype: DtypeArg | None = None, files: list[str] | None | int = 0) -> DataFrame | dict[str, DataFrame]:

Optional[T] is equivalent to T | None, so it is redundant here.
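
For reference, a minimal standalone illustration of that equivalence (not part of the PR):

from typing import Optional

# Optional[T] is shorthand for Union[T, None], so it already admits None and
# combining it with "| None" adds nothing; the two spellings compare equal on Python 3.10+.
assert Optional[list[str]] == (list[str] | None)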

@@ -498,15 +500,15 @@ def clean_dataframe(dataframes_dict: dict[str, DataFrame]) -> None:
for _key, dataframe in dataframes_dict.items():
all_data[_key] = dataframe.drop(dataframe.filter(like='Unnamed'), axis=1) # filter and drop Unnamed columns

-if files is None:
+if files == 0 or files is None:
Collaborator

Is the reasoning for allowing both files = 0 and files = None to select all files in the directory that it lets different output behaviour be chosen? At the moment it looks like files = 0 will return just a single dataframe if there is a single file in the directory (but a dictionary otherwise), while files = None will always return a dict. Having this flexibility possibly makes sense, but if this is the intention I think it needs to be made more obvious by explaining it in the docstring. Alternatively, it might make more sense to instead have an explicit boolean flag return_single_file_as_dataframe to control the output behaviour, which if True and len(all_data) == 1 would force a single dataframe to be returned, with a dict returned otherwise; the name would make the behaviour explicit to the user.
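
For illustration, a rough standalone sketch of what that flag-based alternative could look like (the function body below is a simplified stand-in, not the project's actual implementation; only the flag name comes from the suggestion above):

from pathlib import Path
from typing import Optional

import pandas as pd
from pandas import DataFrame


def read_csv_files(
    folder: Path,
    files: Optional[list[str]] = None,
    return_single_file_as_dataframe: bool = False,
) -> DataFrame | dict[str, DataFrame]:
    """Read CSV files from ``folder``; ``files=None`` selects every CSV file."""
    csv_paths = (
        sorted(folder.glob("*.csv"))
        if files is None
        else [folder / f"{name}.csv" for name in files]
    )
    all_data = {path.stem: pd.read_csv(path) for path in csv_paths}

    # The flag makes the single-dataframe shortcut explicit instead of
    # inferring it from the value passed to `files`.
    if return_single_file_as_dataframe and len(all_data) == 1:
        return next(iter(all_data.values()))
    return all_data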

Collaborator Author

@matt-graham, declaring an explicit flag return_single_file_as_dataframe to control output is good, but I'm afraid it will change the behaviour that modellers are used to from pd.read_excel. If we implement this now it will mean introducing this flag in several other files as well. Do we want to do this in this PR?

@@ -484,6 +485,7 @@ def read_csv_files(folder: Path, files: Optional[list[str]] = None) -> DataFrame
:py:func:`pandas.drop`.

:param folder: Path to folder containing CSV files to read.
+:param dtype: preferred datatype
Collaborator

The pandas.read_csv method allows passing in a dictionary of datatypes (mapping from column names to datatypes) in cases where you want not just a single datatype for the whole sheet but different datatypes per column. Are there any cases where we might need this functionality? As we pass the dtype argument straight through to pandas.read_csv we could already pass in a dict here, but it might be worth documenting this as an additional option if we do think it will be used.
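
For reference, the per-column usage forwarded to pandas.read_csv looks like this (the file and column names here are invented for illustration):

import pandas as pd

# dtype can be a single type for every column or a mapping of column name -> dtype;
# read_csv_files would pass this mapping straight through to pandas.read_csv.
df = pd.read_csv(
    "sample_data.csv",
    dtype={"district": "category", "population": "int64", "coverage": "float64"},
)
print(df.dtypes)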

Collaborator

Actually it seems like you do explicitly use this ability to pass a dict in the test below, so this should definitely be documented and the typehint for dtype updated to DtypeArg | dict[str, DtypeArg] | None.

def test_pass_datatypes_to_read_csv_method(tmpdir):
    """ test passing column datatypes to read csv method. Final column datatype should change to what has been passed """
    # copy and get resource files path in the temporal directory
    tmpdir_resource_filepath = copy_files_to_temporal_directory_and_return_path(tmpdir)
Collaborator

As we don't seem to use the files copied to the temporary directory by copy_files_to_temporal_directory_and_return_path in this test, could we not just use tmpdir directly here?

    # read from the sample data file
    read_sample_data = read_csv_files(tmpdir_resource_filepath, files=['sample_data'])
    # confirm column datatype is what was assigned
    assert read_sample_data.numbers1.dtype and read_sample_data.numbers2.dtype == 'int'
Collaborator

Suggested change:

-    assert read_sample_data.numbers1.dtype and read_sample_data.numbers2.dtype == 'int'
+    assert read_sample_data.numbers1.dtype == 'int' and read_sample_data.numbers2.dtype == 'int'

The current condition is equivalent to

(read_sample_data.numbers1.dtype) and (read_sample_data.numbers2.dtype == 'int')

that is, it checks whether read_sample_data.numbers1.dtype evaluates to True (which I think will always be the case) and whether read_sample_data.numbers2.dtype == 'int'.
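
A standalone demonstration of the difference (using a made-up dataframe rather than the test's resource files):

import pandas as pd

# numbers1 is deliberately a float column, numbers2 an integer column.
df = pd.DataFrame({"numbers1": [1.5, 2.5], "numbers2": [1, 2]}).astype(
    {"numbers1": "float64", "numbers2": "int64"}
)

# A dtype object is always truthy, so the original form passes even though
# numbers1 is not an integer column.
assert df.numbers1.dtype and df.numbers2.dtype == "int64"

# The corrected form checks both columns and rightly fails here, so we assert its negation.
assert not (df.numbers1.dtype == "int64" and df.numbers2.dtype == "int64")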

@mnjowe (Collaborator Author) commented Nov 18, 2024

Thanks for the review @matt-graham. Applying the changes now.

Projects
Status: In progress

Successfully merging this pull request may close these issues.

Update read csv files method in util