Initial schema creation is very slow #1810

Closed
2 of 3 tasks
bluenote10 opened this issue Sep 16, 2024 · 5 comments · Fixed by #1818
Labels
bug Something isn't working

Comments

@bluenote10

Describe the issue

This is more of a usability issue than a bug: the initial creation of a schema is very slow. I measure roughly 800 ms, which is a significant slowdown in, e.g., quick/small CLI tools that otherwise have a sub-second runtime.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

I'm observing a runtime of >800 ms even for the simplest usage, like this:

import time

import pandera as pa

t1 = time.monotonic()
schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(int),
        "column2": pa.Column(float),
        "column3": pa.Column(str),
    }
)
t2 = time.monotonic()
print(t2 - t1)

Expected behavior

Faster execution of simple usages.

Desktop (please complete the following information):

  • OS: Ubuntu
  • Browser: none / headless system (what's the significance of this -- I'm not doing anything in the browser)
  • Version: 20.04
@bluenote10 bluenote10 added the bug Something isn't working label Sep 16, 2024
@cosmicBboy
Collaborator

cosmicBboy commented Sep 21, 2024

@bluenote10 thanks! If you have time, would you mind providing a runtime profile, either with cProfile or your profiling library of choice?

This will provide more actionable data on which parts of the execution path are slowing things down.
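For anyone following along, a profile like the one requested can be collected with the standard library alone. The sketch below is only an illustration, not a procedure prescribed by the maintainers: `profile_call` is a hypothetical helper name, and the schema in the usage section is simply the snippet from the issue description (it is skipped if pandera is not installed).

```python
import cProfile
import io
import pstats


def profile_call(func, top=15):
    """Run func() under cProfile.

    Returns (result, report), where report lists the `top` entries
    sorted by cumulative time. `profile_call` is a hypothetical helper.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func()
    finally:
        profiler.disable()

    # Render the statistics into a string instead of printing to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    try:
        import pandera as pa
    except ImportError:
        pa = None  # pandera not available; nothing to profile

    if pa is not None:
        def build_schema():
            # Same schema as in the issue description.
            return pa.DataFrameSchema(
                {
                    "column1": pa.Column(int),
                    "column2": pa.Column(float),
                    "column3": pa.Column(str),
                }
            )

        _, report = profile_call(build_schema)
        print(report)
```

Sorting by cumulative time tends to surface the call chain that dominates the runtime, which is usually the most useful view for a "where does the time go" question.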

Expected behavior

Faster execution of simple usages.

How fast are you expecting?

@cosmicBboy
Collaborator

cosmicBboy commented Sep 22, 2024

Also, can you provide your Python environment so I can reproduce this? When I run the script above, I get:

0.3683174999896437

@bluenote10
Author

How fast are you expecting?

From a user perspective the pa.DataFrameSchema(...) expression only constructs a Python class instance, and there is no obvious work to do in the constructor (no data is involved yet), so it would be sensible to expect <1 ms.

A guess: could it be an effect of the lazy import system? I've seen that #1644 mentions these ~800 ms as the import time as well. Unfortunately, the Python ecosystem seems to suffer more and more from slow import times. Lazy imports largely "postpone" the issue, i.e., the cost may now simply be paid on the first use of that constructor.
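To make the "postponed cost" guess concrete, here is a minimal sketch of how a lazy import shifts the import time from module load to first use. This is a generic illustration, not pandera's actual mechanism: `LazyModule` and `timed` are hypothetical names, and `colorsys` is just a small standard-library stand-in for a heavy dependency.

```python
import importlib
import sys
import time


class LazyModule:
    """Proxy that imports the wrapped module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only reached for attributes not found on the proxy itself,
        # i.e. for members of the wrapped module.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


def timed(func):
    """Return (result, elapsed_seconds) for a single call of func()."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Creating the proxy does no import work at all.
    lazy, t_create = timed(lambda: LazyModule("colorsys"))
    # The first real use pays for the import ...
    _, t_first = timed(lambda: lazy.rgb_to_hsv(1.0, 0.0, 0.0))
    # ... and later uses do not.
    _, t_second = timed(lambda: lazy.rgb_to_hsv(1.0, 0.0, 0.0))
    print(f"create: {t_create:.6f}s  first use: {t_first:.6f}s  second use: {t_second:.6f}s")
```

With a genuinely heavy dependency in place of `colorsys`, the first use would be expected to dominate, which matches the pattern of a fast `import` followed by a slow first constructor call.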

A module initialization time of 800 ms feels like a lot. I'm wondering what all these packages/modules are doing at import time to make it so slow. I've attached some information on the Python environment and a cProfile run. Can you spot an obvious reason why it is taking so much time?
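One way to see which modules account for the import time is CPython's `-X importtime` option (available since Python 3.7), which prints a per-module timing line to stderr. The sketch below runs it in a fresh interpreter and sorts the result; `parse_importtime` and `slowest_imports` are hypothetical helper names, and the parser assumes the usual `import time: self | cumulative | name` line layout.

```python
import subprocess
import sys


def parse_importtime(stderr_text):
    """Parse `-X importtime` output into (cumulative_us, self_us, module) rows.

    Rows are returned sorted by cumulative time, slowest first.
    """
    prefix = "import time:"
    rows = []
    for line in stderr_text.splitlines():
        if not line.startswith(prefix):
            continue
        parts = line[len(prefix):].split("|")
        if len(parts) != 3:
            continue
        self_us, cumulative_us, name = (part.strip() for part in parts)
        if not (self_us.isdigit() and cumulative_us.isdigit()):
            continue  # skips the header line
        rows.append((int(cumulative_us), int(self_us), name))
    return sorted(rows, reverse=True)


def slowest_imports(module_name, top=10):
    """Import module_name in a fresh interpreter and return its slowest imports."""
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_importtime(proc.stderr)[:top]
```

Calling `slowest_imports("pandera")` in the affected environment would list the modules with the largest cumulative import time in microseconds. Running in a subprocess matters here: modules already imported in the current process would otherwise be served from `sys.modules` and appear to cost nothing.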

Python environment (pip freeze output)
actionlib==1.14.0
adal==1.2.7
aiofiles==22.1.0
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
aiosqlite==0.20.0
alabaster==0.7.16
altair==5.4.1
angles==1.9.13
annotated-types==0.7.0
ansi2html==1.9.2
ansible==9.10.0
ansible-core==2.16.11
antlr4-python3-runtime==4.13.2
anyio==4.4.0
anys==0.3.0
appdirs==1.4.4
argcomplete==3.5.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
argparse-addons==0.12.0
arrow==1.3.0
asammdf==8.0.0
asttokens==2.4.1
async-timeout==4.0.3
attrs==24.2.0
avro-python3==1.10.2
azure-batch==14.2.0
azure-common==1.1.28
azure-containerregistry==1.2.0
azure-core==1.31.0
azure-cosmos==4.7.0
azure-data-tables==12.5.0
azure-devops==7.1.0b4
azure-eventgrid==4.20.0
azure-eventhub==5.12.1
azure-eventhub-checkpointstoreblob==1.1.4
azure-functions==1.20.0
azure-functions-durable==1.2.9
azure-identity==1.17.1
azure-keyvault==4.2.0
azure-keyvault-certificates==4.8.0
azure-keyvault-keys==4.9.0
azure-keyvault-secrets==4.8.0
azure-kusto-data==4.5.1
azure-kusto-ingest==4.5.1
azure-mgmt-batch==17.3.0
azure-mgmt-compute==33.0.0
azure-mgmt-consumption==10.0.0
azure-mgmt-containerinstance==10.1.0
azure-mgmt-core==1.4.0
azure-mgmt-datafactory==9.0.0
azure-mgmt-keyvault==10.3.1
azure-mgmt-network==26.0.0
azure-mgmt-resource==23.1.1
azure-mgmt-storage==21.2.1
azure-mgmt-web==7.3.1
azure-monitor-ingestion==1.0.4
azure-servicebus==7.12.2
azure-storage-blob==12.23.0
azure-storage-queue==12.12.0
babel==2.16.0
beautifulsoup4==4.12.3
bidict==0.23.1
bitstruct==8.19.0
black==22.12.0
bleach==6.1.0
blinker==1.8.2
blosc2==2.7.1
bokeh==3.5.2
bondpy==1.8.6
boolean.py==3.4
branca==0.7.2
build==1.2.2
cachetools==5.5.0
cachier==3.0.1
camera-calibration-parsers==1.12.0
canmatrix==1.0
cantools==39.4.5
catkin==0.8.10
catkin-pkg==1.0.0
certifi==2024.8.30
cffi==1.17.1
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
click-log==0.4.0
cloudpickle==3.0.0
codeowners==0.7.0
colorama==0.4.6
coloredlogs==15.0.1
colorlog==6.8.2
comm==0.2.2
conan==2.7.1
contourpy==1.3.0
coverage==7.6.1
crccheck==1.3.0
cryptography==43.0.1
cssselect==1.2.0
cv-bridge==1.16.2
cycler==0.12.1
Cython==3.0.11
dacite==1.8.1
dash==2.18.1
dash-core-components==2.0.0
dash-html-components==2.0.0
dash-table==5.0.0
dask==2024.9.0
dask-expr==1.1.14
debugpy==1.8.5
decopatch==1.4.10
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
diagnostic-updater==1.11.0
dill==0.3.8
dirhash==0.5.0
diskcache==5.6.3
distributed==2024.9.0
distro==1.8.0
docker==7.1.0
docopt==0.6.2
docutils==0.20.1
dohq-artifactory==0.10.0
doxysphinx==3.3.10
dynamic-reconfigure==1.7.3
empy==3.3.4
entrypoints==0.4
exceptiongroup==1.2.2
execnet==2.1.1
executing==2.1.0
fastapi==0.115.0
fastapi-azure-auth==5.0.1
fasteners==0.19
fastjsonschema==2.20.0
filelock==3.16.1
flake8==7.1.1
flake8-bugbear==24.8.19
flake8-tidy-imports==4.10.0
Flask==3.0.3
Flask-Cors==5.0.0
Flask-PyMongo==2.3.0
folium==0.17.0
fonttools==4.53.1
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.9.0
furl==2.1.3
future==1.0.0
gcovr==7.2
gencpp==0.7.0
geneus==3.0.0
genlisp==0.4.18
genmsg==0.6.0
gennodejs==2.0.2
genpy==0.6.15
geographiclib==1.52
geojson==3.1.0
geojson-pydantic==1.1.1
geopandas==1.0.1
geopy==2.4.1
gitdb==4.0.11
GitPython==3.1.43
gnupg==2.3.1
google-auth==2.34.0
gpstime==0.8.2
graphviz==0.20.3
gunicorn==23.0.0
h11==0.14.0
h2==4.1.0
h5py==3.11.0
hpack==4.0.0
httpcore==1.0.5
httpx==0.27.2
humanfriendly==10.0
hyperframe==6.0.1
icontract==2.7.0
idna==3.10
ijson==3.3.0
image-geometry==1.16.2
imageio==2.35.1
imagesize==1.4.1
importlib_metadata==8.4.0
importlib_resources==6.4.5
iniconfig==2.0.0
interactive-markers==1.12.0
ipykernel==6.29.5
ipympl==0.9.4
ipython==8.21.0
ipython-genutils==0.2.0
ipywidgets==7.8.4
isal==1.7.0
isodate==0.6.1
isoduration==20.11.0
isort==5.13.2
itsdangerous==2.2.0
jedi==0.19.1
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
joint-state-publisher==1.15.1
jsk-recognition-utils==1.2.15
jsk_data==2.2.12
jsk_network_tools==2.2.12
jsk_rviz_plugins==2.1.8
jsk_tools==2.2.12
jsk_topic_tools==2.2.12
json5==0.9.25
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
jupyter-events==0.10.0
jupyter-ydoc==0.2.5
jupyter_client==7.4.9
jupyter_core==5.7.2
jupyter_server==2.14.2
jupyter_server_fileid==0.9.3
jupyter_server_terminals==0.5.3
jupyter_server_ydoc==0.8.0
jupyterlab==3.6.8
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets==1.1.10
keplergl==0.3.2
kiwisolver==1.4.7
kubernetes==30.1.0
laser_geometry==1.6.7
lazy_loader==0.4
libsass==0.22.0
llvmlite==0.43.0
locket==1.0.0
lxml==4.9.4
lz4==4.3.3
lzstring==1.0.4
maison==2.0.0
makefun==1.15.4
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.8.4
matplotlib-inline==0.1.7
mccabe==0.7.0
mdit-py-plugins==0.4.2
mdurl==0.1.2
memory-profiler==0.61.0
mercantile==1.2.1
message-filters==1.16.0
microsoft-kiota-abstractions==1.3.3
microsoft-kiota-authentication-azure==1.1.0
microsoft-kiota-http==1.3.3
microsoft-kiota-serialization-form==0.1.1
microsoft-kiota-serialization-json==1.3.2
microsoft-kiota-serialization-multipart==0.1.0
microsoft-kiota-serialization-text==1.0.0
mistune==3.0.2
mock==5.1.0
mongomock==4.2.0.post1
mpire==2.10.2
mpld3==0.5.10
mpmath==1.3.0
msal==1.28.1
msal-extensions==1.1.0
msgpack==1.0.8
msgraph-core==1.1.3
msgraph-sdk==1.7.0
msgspec==0.18.6
msrest==0.7.1
msrestazure==0.6.4.post1
multidict==6.1.0
multimethod==1.10
multiprocess==0.70.16
mypy==1.11.2
mypy-extensions==1.0.0
myst-parser==4.0.0
narwhals==1.8.1
nbclassic==1.1.0
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
ndindex==1.8
nest-asyncio==1.6.0
netifaces==0.11.0
networkx==3.3
nose==1.3.7
notebook==6.5.7
notebook_shim==0.2.4
numba==0.60.0
numexpr==2.10.1
numpy==1.26.4
numpy-quaternion==2023.0.4
oauthlib==3.2.2
opencv-python-headless==4.10.0.84
openni2_launch==1.6.0
openrouteservice==2.3.3
opentelemetry-api==1.27.0
opentelemetry-sdk==1.27.0
opentelemetry-semantic-conventions==0.48b0
orderedmultidict==1.0.1
osm2geojson==0.2.5
osmium==3.7.0
overpass==0.7.2
overrides==7.7.0
packaging==24.1
pandas==2.2.2
pandas-stubs==2.2.2.240909
pandera==0.20.4
pandocfilters==1.5.1
parso==0.8.4
partd==1.4.2
patch-ng==1.18.0
pathos==0.3.2
pathspec==0.12.1
pendulum==3.0.0
pexpect==4.9.0
pick==2.4.0
pillow==10.4.0
pip-tools==7.4.1
pkg_resources==0.0.0
platformdirs==4.3.6
plotly==5.24.1
pluggy==1.5.0
portalocker==2.10.1
pox==0.3.4
ppft==1.7.6.8
progressbar2==4.5.0
prometheus_client==0.20.0
prompt_toolkit==3.0.47
proto-schema-parser==1.3.6
protobuf==4.25.3
psutil==6.0.0
psycopg==3.2.2
psycopg-binary==3.2.2
ptyprocess==0.7.0
pure_eval==0.2.3
py==1.11.0
py-cpuinfo==9.0.0
pyarrow==17.0.0
pyarrow-hotfix==0.6
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycodestyle==2.12.1
pycparser==2.22
pycryptodomex==3.20.0
pydantic==2.9.2
pydantic_core==2.23.4
pydeck==0.9.1
pyflakes==3.2.0
Pygments==2.18.0
pyjson5==1.6.6
PyJWT==2.9.0
pylddwrap==1.2.2
pymap3d==1.6.3
pymongo==3.13.0
Pympler==1.1
pyogrio==0.9.0
pyOpenSSL==24.2.1
pyparsing==3.1.4
pypcd==0.1.1
pyproj==3.6.1
pyproject_hooks==1.1.0
pyros==0.4.3
pyros-common==0.5.4
pyros-config==0.2.1
pyros-setup==0.3.0
pyrosbag==0.1.3
pyserial==3.5
PySocks==1.7.1
pysolr==3.10.0
pytest==8.3.3
pytest-asyncio==0.24.0
pytest-cases==3.8.5
pytest-cov==5.0.0
pytest-mock==3.14.0
pytest-timeout==2.3.1
pytest-watch==4.2.0
pytest-xdist==3.6.1
python-can==4.4.2
python-dateutil==2.9.0.post0
python-debian==0.1.49
python-geohash==0.8.5
python-intervals==1.10.0.post1
python-json-logger==2.0.7
python-lzf==0.2.6
python-qt-binding==0.4.4
python-utils==3.8.2
pytz==2024.2
PyYAML==6.0.2
pyzmp==0.0.17
pyzmq==26.2.0
qt-dotgraph==0.4.2
qt-gui==0.4.2
qt-gui-cpp==0.4.2
qt-gui-py-common==0.4.2
redis==5.0.8
referencing==0.35.1
requests==2.32.3
requests-file==2.1.0
requests-oauthlib==2.0.0
requests-toolbelt==1.0.0
resolvelib==1.0.1
resource_retriever==1.12.7
retry==0.9.2
retrying==1.3.4
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.8.1
rosbag==1.16.0
rosclean==1.15.8
rosdep==0.25.1
rosdistro==0.9.1
rosgraph==1.16.0
roslaunch==1.16.0
roslib==1.15.8
roslint==0.12.0
roslz4==1.16.0
rosmake==1.15.8
rosmaster==1.16.0
rosmsg==1.16.0
rosnode==1.16.0
rosparam==1.16.0
rospkg==1.5.1
rospy==1.16.0
rosservice==1.16.0
rostest==1.16.0
rostopic==1.16.0
rosunit==1.15.8
roswtf==1.16.0
rpds-py==0.20.0
rqt-image-view==0.4.17
rqt-reconfigure==0.5.5
rqt_action==0.4.9
rqt_bag==0.5.1
rqt_bag_plugins==0.5.1
rqt_console==0.4.11
rqt_dep==0.4.12
rqt_graph==0.4.14
rqt_gui==0.5.3
rqt_gui_py==0.5.3
rqt_launch==0.4.9
rqt_logger_level==0.4.11
rqt_msg==0.4.10
rqt_plot==0.4.13
rqt_publisher==0.4.10
rqt_py_common==0.5.3
rqt_py_console==0.4.10
rqt_service_caller==0.4.10
rqt_shell==0.4.11
rqt_srv==0.4.9
rqt_top==0.4.10
rqt_topic==0.4.13
rqt_web==0.4.10
rsa==4.9
Rtree==1.3.0
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
ruyaml==0.91.0
rviz==1.14.20
s2cell==1.8.0
scantree==0.0.4
scikit-image==0.24.0
scikit-learn==1.5.2
scipy==1.14.1
seaborn==0.13.2
Send2Trash==1.8.3
sensor-msgs==1.13.1
sentinels==1.0.0
shapely==2.0.6
shellcheck-py==0.10.0.1
simplejson==3.19.3
six==1.16.0
smclib==1.8.6
smmap==5.0.1
smmap2==2.0.5
sniffio==1.3.1
snowballstemmer==2.2.0
sortedcontainers==2.4.0
sound-play==0.3.17
soupsieve==2.6
Sphinx==7.4.7
sphinx-autodoc-typehints==2.3.0
sphinx-charts==0.2.1
sphinx-click==6.0.0
sphinx-collections==0.0.1
sphinx-copybutton==0.5.2
sphinx-data-viewer==0.1.5
sphinx-needs==3.0.0
sphinx-rtd-theme==2.0.0
sphinx-tags==0.4
sphinx_design==0.6.1
sphinxcontrib-applehelp==2.0.0
sphinxcontrib-devhelp==2.0.0
sphinxcontrib-doxylink==1.12.3
sphinxcontrib-htmlhelp==2.1.0
sphinxcontrib-jquery==4.1
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-plantuml==0.30
sphinxcontrib-qthelp==2.0.0
sphinxcontrib-serializinghtml==2.0.0
sphinxcontrib-svg2pdfconverter==1.2.3
splunk-handler==3.0.0
sqlitedict==2.1.0
stack-data==0.6.3
starlette==0.38.5
std-uritemplate==1.0.6
streamlit==1.38.0
sympy==1.13.3
tables==3.10.1
tabulate==0.9.0
tblib==3.0.0
tenacity==8.5.0
termcolor==2.4.0
terminado==0.18.1
textparser==0.24.0
tf==1.13.2
tf-conversions==1.13.2
tf2-geometry-msgs==0.7.6
tf2-kdl==0.7.6
tf2-py==0.7.6
tf2-ros==0.7.6
threadpoolctl==3.5.0
tifffile==2024.8.30
time-machine==2.15.0
tinycss2==1.3.0
tokenize-rt==6.0.0
toml==0.10.2
tomli==2.0.1
toolz==0.12.1
topic-tools==1.16.0
toposort==1.10
torch @ https://download.pytorch.org/whl/cpu/torch-2.3.1%2Bcpu-cp310-cp310-linux_x86_64.whl#sha256=d679e21d871982b9234444331a26350902cfd2d5ca44ce6f49896af8b3a3087d
torcheval==0.0.7
torchinfo==1.8.0
torchvision @ https://download.pytorch.org/whl/cpu/torchvision-0.18.1%2Bcpu-cp310-cp310-linux_x86_64.whl#sha256=2ae9d4e4e11bc43c7ee6c7c7e87b1e6adf5503ad0710e59cd86bc7b1a342d75a
tornado==6.4.1
tqdm==4.66.5
traitlets==5.9.0
traittypes==0.2.1
typed-argparse==0.3.1
typeguard==4.3.0
types-beautifulsoup4==4.12.0.20240907
types-cffi==1.16.0.20240331
types-click==7.1.8
types-docutils==0.21.0.20240907
types-filelock==3.2.7
types-html5lib==1.1.11.20240806
types-Jinja2==2.11.9
types-jsonschema==4.23.0.20240813
types-lxml==2024.9.16
types-MarkupSafe==1.1.10
types-mock==5.1.0.20240425
types-protobuf==4.25.0.20240417
types-psutil==6.0.0.20240901
types-pyOpenSSL==24.1.0.20240722
types-python-dateutil==2.9.0.20240906
types-pytz==2024.2.0.20240913
types-PyYAML==6.0.12.20240917
types-redis==4.6.0.20240903
types-requests==2.31.0.6
types-retry==0.9.9.4
types-setuptools==75.1.0.20240917
types-simplejson==3.19.0.20240801
types-six==1.16.21.20240513
types-tabulate==0.9.0.20240106
types-termcolor==1.1.6.2
types-urllib3==1.26.25.14
typing-inspect==0.9.0
typing_extensions==4.12.2
tzdata==2024.1
tzlocal==5.2
urdfdom-py==0.4.6
uri-template==1.3.0
urllib3==1.26.20
uvicorn==0.30.6
uwsgidecorators==1.1.0
validators==0.33.0
watchdog==4.0.2
wcwidth==0.2.13
webcolors==24.8.0
webencodings==0.5.1
websocket-client==1.8.0
Werkzeug==3.0.4
widgetsnbextension==3.6.9
wrapt==1.16.0
xacro==1.14.16
xyzservices==2024.9.0
y-py==0.6.2
yachalk==0.1.6
yamlfix==1.17.0
yarl==1.11.1
ypy-websocket==0.8.4
zict==3.0.0
zipp==3.20.2

And here is the output of a cProfile of that snippet: pandera_cprofile.txt

@cosmicBboy
Collaborator

Thanks for the details! #1818 should bring schema initialization time close to 0: running the code snippet in the description of this issue yields

0.0005101249553263187

@bluenote10
Author

#1818 should bring schema initialization time close to 0

Awesome! I had a quick look into the approach taken there, and the idea looks very sensible to me. Thanks for the fix!
