Just going to add a note here for future, currently seeing a small difference in pandas vs spark report sample rows when there are rows only in one dataframe. #288

fdosani · 2024-03-25T18:55:49Z

Adding a note here for future reference: we're currently seeing a small difference between the pandas and Spark report sample rows when there are rows only in one dataframe.

There is an additional _merge_right column which is not in the original dataframes, which could cause a bit of confusion for users.
We're displaying the column names as their aliases, which could also be a bit confusing. It would be best to translate them back to their original names.

Not a blocker for this, but we should open a follow-up issue to keep track of this.

import numpy as np  # needed for np.nan below
import pandas as pd
import pyspark.pandas as ps

# df2 has one extra row (id=6) that does not exist in df1
pdf1 = pd.DataFrame.from_dict({"id": [1, 2, 3, 4, 5], "a": [2, 3, 2, 3, 2], "b": ["a", "b", "c", "d", ""]})
pdf2 = pd.DataFrame.from_dict({"id": [1, 2, 3, 4, 5, 6], "a": [2, 3, 2, 3, 2, np.nan], "b": ["a", "b", "c", "d", "", pd.NA]})
df1 = ps.DataFrame(pdf1)
df2 = ps.DataFrame(pdf2)

Spark

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0       df1        3     5
1       df2        3     6

Column Summary
--------------

Number of columns in common: 3
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0

Row Summary
-----------

Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 1

Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 5

Column Comparison
-----------------

Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 3
Total number of values which compare unequal: 0

Columns with Unequal Values or Types
------------------------------------

  Column df1 dtype df2 dtype  # Unequal  Max Diff  # Null Diff
0      a     int64   float64          0       0.0            0

Sample Rows with Unequal Values
-------------------------------

Sample Rows Only in df2 (First 10 Columns)
------------------------------------------

   id_df2  a_df2 b_df2  _merge_right
5       6    NaN  None          True

Pandas

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0       df1        3     5
1       df2        3     6

Column Summary
--------------

Number of columns in common: 3
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0

Row Summary
-----------

Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 1

Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 5

Column Comparison
-----------------

Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 3
Total number of values which compare unequal: 0

Columns with Unequal Values or Types
------------------------------------

  Column df1 dtype df2 dtype  # Unequal  Max Diff  # Null Diff
0      a     int64   float64          0       0.0            0

Sample Rows with Unequal Values
-------------------------------

Sample Rows Only in df2 (First 10 Columns)
------------------------------------------

   id   a     b
0   6 NaN  <NA>
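
One possible way to bring the Spark sample in line with the pandas one is to post-process the sample before it is printed: drop the merge-indicator column and strip the alias suffix from the remaining names. The sketch below is only an illustration in plain pandas, not what DataComPy does internally. The helper name `restore_original_names` is hypothetical, and the `_df2` suffix and the `_merge` prefix are assumptions taken from the Spark output above.

```python
import numpy as np
import pandas as pd


def restore_original_names(sample: pd.DataFrame, suffix: str) -> pd.DataFrame:
    """Hypothetical helper: drop merge-indicator columns and strip the alias suffix."""
    # Assumption: internal indicator columns start with "_merge" (e.g. "_merge_right").
    helper_cols = [c for c in sample.columns if c.startswith("_merge")]
    cleaned = sample.drop(columns=helper_cols)
    # Strip the suffix only where it appears at the end of a column name.
    renamed = {c: c[: -len(suffix)] for c in cleaned.columns if c.endswith(suffix)}
    # Reset the index so the sample starts at 0, as in the pandas report.
    return cleaned.rename(columns=renamed).reset_index(drop=True)


# Sample row shaped like the Spark "Only in df2" output above.
spark_sample = pd.DataFrame(
    {"id_df2": [6], "a_df2": [np.nan], "b_df2": [None], "_merge_right": [True]},
    index=[5],
)
cleaned_sample = restore_original_names(spark_sample, suffix="_df2")
print(cleaned_sample)
```

Note that a user column that genuinely ends in `_df2` or starts with `_merge` would be mangled by this approach, so a real fix would probably need to map aliases back from the known original column list instead of pattern matching.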

Originally posted by @jdawang in #275 (review)


fdosani self-assigned this Mar 25, 2024

fdosani added the enhancement (New feature or request) and spark labels Mar 25, 2024
