
Updates to read CSV method #1512

Open

wants to merge 4 commits into base: master
Conversation

@mnjowe (Collaborator) commented Nov 12, 2024

This PR aims to address issue #1509

…ary when files argument is supplied with None
@mnjowe mnjowe self-assigned this Nov 12, 2024
@mnjowe mnjowe linked an issue Nov 12, 2024 that may be closed by this pull request
@@ -474,7 +475,7 @@ def convert_excel_files_to_csv(folder: Path, files: Optional[list[str]] = None,
Path(folder/excel_file_path).unlink()


-def read_csv_files(folder: Path, files: Optional[list[str]] = None) -> DataFrame | dict[str, DataFrame]:
+def read_csv_files(folder: Path, dtype: DtypeArg | None = None, files: Optional[list[str]] | None | int = 0) -> DataFrame | dict[str, DataFrame]:
Collaborator
Suggested change:

-def read_csv_files(folder: Path, dtype: DtypeArg | None = None, files: Optional[list[str]] | None | int = 0) -> DataFrame | dict[str, DataFrame]:
+def read_csv_files(folder: Path, dtype: DtypeArg | None = None, files: list[str] | None | int = 0) -> DataFrame | dict[str, DataFrame]:

Optional[T] is equivalent to T | None, so it is redundant here.
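
For reference, a minimal standalone illustration of that equivalence (not part of the PR):

from typing import Optional

# Optional[T] is shorthand for Union[T, None], so it already admits None and
# combining it with "| None" adds nothing; the two spellings compare equal on Python 3.10+.
assert Optional[list[str]] == (list[str] | None)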

@@ -498,15 +500,15 @@ def clean_dataframe(dataframes_dict: dict[str, DataFrame]) -> None:
for _key, dataframe in dataframes_dict.items():
all_data[_key] = dataframe.drop(dataframe.filter(like='Unnamed'), axis=1) # filter and drop Unnamed columns

-if files is None:
+if files == 0 or files is None:
Collaborator

Is the reasoning for allowing both files = 0 and files = None to select all files in the directory that it lets different output behaviour be chosen? At the moment it looks like files = 0 will return just a single dataframe if there is a single file in the directory (but a dictionary otherwise), while files = None will always return a dict. Having this flexibility possibly makes sense, but if this is the intention I think it needs to be made more obvious by explaining it in the docstring. Alternatively, it might make more sense to instead have an explicit boolean flag return_single_file_as_dataframe to control the output behaviour, which if True and len(all_data) == 1 would force a single dataframe to be returned, with a dict returned otherwise; the name would make the behaviour explicit to the user.
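
For illustration, a rough standalone sketch of what that flag-based alternative could look like (the function body below is a simplified stand-in, not the project's actual implementation; only the flag name comes from the suggestion above):

from pathlib import Path
from typing import Optional

import pandas as pd
from pandas import DataFrame


def read_csv_files(
    folder: Path,
    files: Optional[list[str]] = None,
    return_single_file_as_dataframe: bool = False,
) -> DataFrame | dict[str, DataFrame]:
    """Read CSV files from ``folder``; ``files=None`` selects every CSV file."""
    csv_paths = (
        sorted(folder.glob("*.csv"))
        if files is None
        else [folder / f"{name}.csv" for name in files]
    )
    all_data = {path.stem: pd.read_csv(path) for path in csv_paths}

    # The flag makes the single-dataframe shortcut explicit instead of
    # inferring it from the value passed to `files`.
    if return_single_file_as_dataframe and len(all_data) == 1:
        return next(iter(all_data.values()))
    return all_data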

Collaborator Author

@matt-graham, declaring an explicit flag return_single_file_as_dataframe to control output is good, but I'm afraid it will change the behaviour that modellers are used to from pd.read_excel. If we implement this now it will mean introducing this flag in several other files as well. Do we want to do this in this PR?

@@ -484,6 +485,7 @@ def read_csv_files(folder: Path, files: Optional[list[str]] = None) -> DataFrame
:py:func:`pandas.drop`.

:param folder: Path to folder containing CSV files to read.
+:param dtype: preferred datatype
Collaborator

The pandas.read_csv method allows passing in a dictionary of datatypes (mapping from column names to datatypes) in cases where you want not just a single datatype for the whole sheet but different datatypes per column. Are there any cases where we might need this functionality? As we pass the dtype argument straight through to pandas.read_csv we could already pass in a dict here, but it might be worth documenting this as an additional option if we do think it will be used.
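
For reference, the per-column usage forwarded to pandas.read_csv looks like this (the file and column names here are invented for illustration):

import pandas as pd

# dtype can be a single type for every column or a mapping of column name -> dtype;
# read_csv_files would pass this mapping straight through to pandas.read_csv.
df = pd.read_csv(
    "sample_data.csv",
    dtype={"district": "category", "population": "int64", "coverage": "float64"},
)
print(df.dtypes)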

Collaborator

Actually it seems like you do explicitly use this ability to pass a dict in the test below, so this should definitely be documented and the typehint for dtype updated to DtypeArg | dict[str, DtypeArg] | None.

def test_pass_datatypes_to_read_csv_method(tmpdir):
    """ test passing column datatypes to read csv method. Final column datatype should change to what has been passed """
    # copy and get resource files path in the temporal directory
    tmpdir_resource_filepath = copy_files_to_temporal_directory_and_return_path(tmpdir)
Collaborator

As we don't seem to use the files copied to the temporary directory by copy_files_to_temporal_directory_and_return_path in this test, could we not just use tmpdir directly here?

    # read from the sample data file
    read_sample_data = read_csv_files(tmpdir_resource_filepath, files=['sample_data'])
    # confirm column datatype is what was assigned
    assert read_sample_data.numbers1.dtype and read_sample_data.numbers2.dtype == 'int'
Collaborator

Suggested change:

-    assert read_sample_data.numbers1.dtype and read_sample_data.numbers2.dtype == 'int'
+    assert read_sample_data.numbers1.dtype == 'int' and read_sample_data.numbers2.dtype == 'int'

The current condition is equivalent to

(read_sample_data.numbers1.dtype) and (read_sample_data.numbers2.dtype == 'int')

that is, it checks whether read_sample_data.numbers1.dtype evaluates to True (which I think will always be the case) and whether read_sample_data.numbers2.dtype == 'int'.
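
A standalone demonstration of the difference (using a made-up dataframe rather than the test's resource files):

import pandas as pd

# numbers1 is deliberately a float column, numbers2 an integer column.
df = pd.DataFrame({"numbers1": [1.5, 2.5], "numbers2": [1, 2]}).astype(
    {"numbers1": "float64", "numbers2": "int64"}
)

# A dtype object is always truthy, so the original form passes even though
# numbers1 is not an integer column.
assert df.numbers1.dtype and df.numbers2.dtype == "int64"

# The corrected form checks both columns and rightly fails here, so we assert its negation.
assert not (df.numbers1.dtype == "int64" and df.numbers2.dtype == "int64")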

@mnjowe (Collaborator Author) commented Nov 18, 2024

Thanks for the review @matt-graham. Applying the changes now.

Projects
Status: In progress

Successfully merging this pull request may close these issues.

Update read csv files method in util