Updates to read CSV method #1512

Open · wants to merge 4 commits into master
12 changes: 7 additions & 5 deletions src/tlo/util.py
@@ -7,6 +7,7 @@
import numpy as np
import pandas as pd
from pandas import DataFrame, DateOffset
from pandas._typing import DtypeArg

from tlo import Population, Property, Types

@@ -474,7 +475,7 @@ def convert_excel_files_to_csv(folder: Path, files: Optional[list[str]] = None,
Path(folder/excel_file_path).unlink()


def read_csv_files(folder: Path, files: Optional[list[str]] = None) -> DataFrame | dict[str, DataFrame]:
def read_csv_files(folder: Path, dtype: DtypeArg | None = None, files: Optional[list[str]] | None | int = 0) -> DataFrame | dict[str, DataFrame]:
Collaborator suggested change:
def read_csv_files(folder: Path, dtype: DtypeArg | None = None, files: Optional[list[str]] | None | int = 0) -> DataFrame | dict[str, DataFrame]:
def read_csv_files(folder: Path, dtype: DtypeArg | None = None, files: list[str] | None | int = 0) -> DataFrame | dict[str, DataFrame]:

Optional[T] is equivalent to T | None so is redundant here.

"""
A function to read CSV files in a similar way pandas reads Excel files (:py:func:`pandas.read_excel`).

Expand All @@ -484,6 +485,7 @@ def read_csv_files(folder: Path, files: Optional[list[str]] = None) -> DataFrame
:py:func:`pandas.drop`.

:param folder: Path to folder containing CSV files to read.
    :param dtype: preferred data type(s), passed through to :py:func:`pandas.read_csv`
Collaborator:

The pandas.read_csv method allows passing in a dictionary of datatypes (mapping from column names to datatypes) in cases where you want not just a single datatype for whole sheet but different datatypes per column. Are there any cases where we might need this functionality? As we passthrough the dtype argument straight to pandas.read_csv we could already pass in a dict here, but it might be worth documenting this is an additional option if we do think it will be used.

Collaborator:

Actually it seems like you do explicitly use this ability to pass a dict in a test below, so this should definitely be documented and the type hint for dtype updated to DtypeArg | dict[str, DtypeArg] | None.
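As a sketch of what the reviewer describes (the CSV content and column names here are made up), pandas.read_csv accepts either a single dtype for the whole file or a per-column mapping, and read_csv_files passes the argument straight through:

```python
import io
import pandas as pd

csv_text = "numbers1,numbers2\n5,19\n6,27\n"

# A single dtype applied to every column
df_all = pd.read_csv(io.StringIO(csv_text), dtype="float64")
assert df_all["numbers1"].dtype == "float64"

# A per-column mapping from column names to dtypes
df_mixed = pd.read_csv(io.StringIO(csv_text),
                       dtype={"numbers1": "int64", "numbers2": "float64"})
assert df_mixed["numbers1"].dtype == "int64"
assert df_mixed["numbers2"].dtype == "float64"
```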

    :param files: preferred csv file name(s). This is the same as sheet names in an Excel file. Note that if None (no files
    selected) then all files in the containing folder will be loaded

@@ -498,15 +500,15 @@ def clean_dataframe(dataframes_dict: dict[str, DataFrame]) -> None:
for _key, dataframe in dataframes_dict.items():
all_data[_key] = dataframe.drop(dataframe.filter(like='Unnamed'), axis=1) # filter and drop Unnamed columns

if files is None:
if files == 0 or files is None:
Collaborator:

Is the reasoning for allowing both files = 0 and files = None to be used to select all files in directory so that different output behaviour can be selected? At the moment it looks like files = 0 if there is a single file in directory will return just a single dataframe (but a dictionary otherwise), while files = None will always return a dict. Having this flexibility possibly makes sense, but if this is the intention I think it needs to be made more obvious by explaining in docstring. Alternatively might make more sense to instead have an explicit boolean flag return_single_file_as_dataframe to control output behaviour which if True and len(all_data) == 1 will force a single dataframe to be returned, with a dict returned otherwise, as the name would make it explicit to the user what the behaviour is.

Collaborator (author):

@matt-graham, declaring an explicit flag return_single_file_as_dataframe to control the output is good, but I'm afraid it will change the behaviour that modellers are used to from pd.read_excel. If we are to implement this now it will mean introducing this flag in several other files as well. Do we want to do this in this PR?
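For reference, the alternative the reviewer proposes might look something like the sketch below. The signature and behaviour are hypothetical (a simplified stand-in, not the tlo implementation under review):

```python
import tempfile
from pathlib import Path
import pandas as pd

def read_csv_files_alt(folder: Path, dtype=None, files=None,
                       return_single_file_as_dataframe=True):
    """Hypothetical variant with an explicit flag controlling the output type."""
    names = files if files is not None else [p.stem for p in folder.rglob("*.csv")]
    all_data = {name: pd.read_csv((folder / name).with_suffix(".csv"), dtype=dtype)
                for name in names}
    # The flag makes the single-file special case explicit to the caller
    if return_single_file_as_dataframe and len(all_data) == 1:
        return next(iter(all_data.values()))
    return all_data

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    pd.DataFrame({"a": [1, 2]}).to_csv(folder / "only_file.csv", index=False)
    # One file: the flag decides between a DataFrame and a dict
    assert isinstance(read_csv_files_alt(folder), pd.DataFrame)
    assert isinstance(read_csv_files_alt(folder, return_single_file_as_dataframe=False), dict)
```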

for f_name in folder.rglob("*.csv"):
all_data[f_name.stem] = pd.read_csv(f_name)
all_data[f_name.stem] = pd.read_csv(f_name, dtype=dtype)

else:
for f_name in files:
all_data[f_name] = pd.read_csv((folder / f_name).with_suffix(".csv"))
all_data[f_name] = pd.read_csv((folder / f_name).with_suffix(".csv"), dtype=dtype)
# clean and return the dataframe dictionary
clean_dataframe(all_data)
# If only one file loaded return dataframe directly rather than dict
return next(iter(all_data.values())) if len(all_data) == 1 else all_data
return next(iter(all_data.values())) if len(all_data) == 1 and files is not None else all_data

42 changes: 39 additions & 3 deletions tests/test_utils.py
@@ -332,8 +332,44 @@ def copy_files_to_temporal_directory_and_return_path(tmpdir):
return tmpdir_resource_filepath


def test_read_csv_method_with_no_file(tmpdir):
""" read csv method when no file name is supplied
def test_pass_datatypes_to_read_csv_method(tmpdir):
""" test passing column datatypes to read csv method. Final column datatype should change to what has been passed """
# copy and get resource files path in the temporal directory
tmpdir_resource_filepath = copy_files_to_temporal_directory_and_return_path(tmpdir)
Collaborator:

As we don't seem to use the files copied to the temporary directory by copy_files_to_temporal_directory_and_return_path in this test, could we not just use tmpdir directly here?

sample_data = pd.DataFrame(data={'numbers1': [5,6,8,4,9,6], 'numbers2': [19,27,53,49,75,56]}, dtype=int)
sample_data.to_csv(tmpdir_resource_filepath/'sample_data.csv', index=False)
# read from the sample data file
read_sample_data = read_csv_files(tmpdir_resource_filepath, files=['sample_data'])
# confirm column datatype is what was assigned
assert read_sample_data.numbers1.dtype and read_sample_data.numbers2.dtype == 'int'
Collaborator suggested change:
assert read_sample_data.numbers1.dtype and read_sample_data.numbers2.dtype == 'int'
assert read_sample_data.numbers1.dtype == 'int' and read_sample_data.numbers2.dtype == 'int'

The current condition is equivalent to

(read_sample_data.numbers1.dtype) and (read_sample_data.numbers2.dtype == 'int')

that is, it checks whether read_sample_data.numbers1.dtype is truthy (which I think will always be the case) and whether read_sample_data.numbers2.dtype == 'int'.
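The precedence pitfall the reviewer points out can be shown in isolation (the dtype strings here are illustrative stand-ins for the test's values):

```python
x_dtype = "int64"  # stands in for read_sample_data.numbers1.dtype
y_dtype = "int64"  # stands in for read_sample_data.numbers2.dtype

# Parsed as: x_dtype and (y_dtype == "int64") -- x_dtype is only truth-tested
assert (x_dtype and y_dtype == "int64") is True

# The bug: a wrong first dtype still passes, because any non-empty string is truthy
x_dtype = "float64"
assert (x_dtype and y_dtype == "int64") is True

# The suggested explicit comparison catches it
assert (x_dtype == "int64" and y_dtype == "int64") is False
```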

# define new datatypes
datatype = {'numbers1': int, 'numbers2': float}
# pass the new datatypes to read csv method and confirm datatype has changed to what has been declared now
assign_dtype = read_csv_files(tmpdir_resource_filepath, files=['sample_data'], dtype=datatype)
assert assign_dtype.numbers1.dtype == 'int' and assign_dtype.numbers2.dtype == 'float'


def test_read_csv_file_method_passing_none_to_files_argument(tmpdir):
    """ test reading csv files when the target resource folder contains one file and the files argument is set to None

Expectations
1. should return a dictionary
2. the dictionary key name should match file name
"""
# copy and get resource files path in the temporal directory
tmpdir_resource_filepath = copy_files_to_temporal_directory_and_return_path(tmpdir)
# choose an Excel file with one sheet in it and convert it to csv file
convert_excel_files_to_csv(tmpdir_resource_filepath, files=['ResourceFile_load-parameters.xlsx'])
# get the folder containing the newly converted csv file and check the expected behavior
this_csv_resource_folder = tmpdir_resource_filepath/"ResourceFile_load-parameters"
file_names = [csv_file_path.stem for csv_file_path in this_csv_resource_folder.rglob("*.csv")]
one_csv_file_in_folder_dict = read_csv_files(this_csv_resource_folder, files=None)
assert isinstance(one_csv_file_in_folder_dict, dict)
assert set(one_csv_file_in_folder_dict.keys()) == set(file_names)


def test_read_csv_method_with_default_value_for_files_argument(tmpdir):
""" read csv method when no file name(s) is supplied to the files argument
i) should return dictionary.
ii) dictionary keys should match csv file names in resource folder
iii) all dictionary values should be dataframes
Expand All @@ -350,7 +386,7 @@ def test_read_csv_method_with_no_file(tmpdir):


def test_read_csv_method_with_one_file(tmpdir):
""" test read csv method when one file name is supplied. should return a dataframe
""" test read csv method when one file name is supplied to files argument. should return a dataframe
:param tmpdir: path to a temporal directory

"""