Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read an empty cell array blob fails #1056

Open
Alvalunasan opened this issue Sep 29, 2022 · 12 comments
Open

Read an empty cell array blob fails #1056

Alvalunasan opened this issue Sep 29, 2022 · 12 comments
Assignees
Labels

Comments

@Alvalunasan
Copy link

Bug Report

Description

Reading an empty cell array inserted with mym MATLAB fails to be read in Datajoint python

Reproducibility

  • OS (WIN (MATLAB) & MACOS (Python)
  • Python Version (3.7) & MATLAB Version (2019b)
  • MySQL Version (10.2.33-MariaDB)
  • DataJoint Version (0.13.7)

I have a corner case for reading some special. blobs in Datajoint Python when these are stored with mym Matlab:
Here is the type of blob stored in the DB and read on Matlab:

la = bdata('select protocol_data from bdata.sessions where sessid=889527');
la{1}.crash_comments
ans =
  3×1 cell array
    {0×0 double}
    {0×0 double}
    {0×0 double} 

As you can see, what is stored in a part of the blob is a 3x1 cell array composed of empty items:


When trying to read this data in Python, I got this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/sg/5bw1t8p11nx09k7kmytnmfb40000gp/T/ipykernel_31219/4072205635.py in <module>
      3 session_key = {'sessid': 889527}
      4 # session_key = {'sessid': 889664}
----> 5 session_data = (bdata.Sessions & session_key).fetch('protocol_data', as_dict=True)
      6 parsed_events = (bdata.ParsedEvents & session_key).fetch(as_dict=True)

~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/fetch.py in __call__(self, offset, limit, order_by, format, as_dict, squeeze, download_path, *attrs)
    234                 squeeze=squeeze,
    235                 download_path=download_path,
--> 236                 format="array",
    237             )
    238             if attrs_as_dict:

~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/fetch.py in __call__(self, offset, limit, order_by, format, as_dict, squeeze, download_path, *attrs)
    287                 for name in heading:
    288                     # unpack blobs and externals
--> 289                     ret[name] = list(map(partial(get, heading[name]), ret[name]))
    290                 if format == "frame":
    291                     ret = pandas.DataFrame(ret).set_index(heading.primary_key)

~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/fetch.py in _get(connection, attr, data, squeeze, download_path)
    112                 squeeze=squeeze,
    113             )
--> 114             if attr.is_blob
    115             else data
    116         )

~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in unpack(blob, squeeze)
    619         return blob
    620     if blob is not None:
--> 621         return Blob(squeeze=squeeze).unpack(blob)

~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in unpack(self, blob)
    127         blob_format = self.read_zero_terminated_string()
    128         if blob_format in ("mYm", "dj0"):
--> 129             return self.read_blob(n_bytes=len(self._blob) - self._pos)
    130 
    131     def read_blob(self, n_bytes=None):

~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in read_blob(self, n_bytes)
    161                 % data_structure_code
    162             )
--> 163         v = call()
    164         if n_bytes is not None and self._pos - start != n_bytes:
    165             raise DataJointError("Blob length check failed! Invalid blob")

~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in read_struct(self)
    463                 self.read_blob(n_bytes=int(self.read_value())) for _ in range(n_fields)
    464             )
--> 465             for __ in range(n_elem)
    466         ]
    467 

~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in <listcomp>(.0)
    463                 self.read_blob(n_bytes=int(self.read_value())) for _ in range(n_fields)
    464             )
--> 465             for __ in range(n_elem)
    466         ]
    467 

~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in <genexpr>(.0)
    461         raw_data = [
    462             tuple(
--> 463                 self.read_blob(n_bytes=int(self.read_value())) for _ in range(n_fields)
    464             )
    465             for __ in range(n_elem)

~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in read_blob(self, n_bytes)
    161                 % data_structure_code
    162             )
--> 163         v = call()
    164         if n_bytes is not None and self._pos - start != n_bytes:
    165             raise DataJointError("Blob length check failed! Invalid blob")

~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in read_cell_array(self)
    508         return (
    509             self.squeeze(
--> 510                 np.array(result).reshape(shape, order="F"), convert_to_scalar=False
    511             )
    512         ).view(MatCell)

ValueError: cannot reshape array of size 0 into shape (3,1)

I have “patched” the blob.py code read_cell_array function with:

        if result.size == 0:
            return (
            self.squeeze(
                np.array(np.empty(shape, dtype=type(result[0]))), convert_to_scalar=False
            )
        ).view(MatCell)
        else:
            return (
            self.squeeze(
                np.array(result).reshape(shape, order="F"), convert_to_scalar=False
            )
        ).view(MatCell)

Just to add the case that the size of the array is zero (numpy array size is 0 if it’s filled with empty arrays)
Probably not the cleanest way to do it.

Expected Behavior

To get something similar to this when reading this kind of blobs:

session_data['crash_comments']
MatCell([[None],
         [None],
         [None]], dtype=object)
@dimitri-yatsenko
Copy link
Member

Thank you for submitting this @Alvalunasan. I think I understand the problem.

@Alvalunasan
Copy link
Author

Hi @dimitri-yatsenko
In brodylab this error has resurfaced with the new datajoint integration.
Is it possible that this gets merged ? (or the bug fixed somehow) ?

Thank you very much for your help

@carlosbrody
Copy link

Hi @dimitri-yatsenko , a note that this is not restricted to old Matlabs-- it was happening with a recent Matlab , and on data from August 2023. Maybe newer data too, I haven't yet checked on newer data.

Would it be appropriate to merge Alvaro's patch?

@carlosbrody
Copy link

@Alvalunasan updates his patch and suggests replacing lines 495-499 with

    sizes_array = [x.size for x in result]
    sum_sizes = sum(sizes_array)

    if n_elem ==0:
        return np.array(np.empty(0)).view(MatCell)
    elif sum_sizes == 0:
        return (self.squeeze(np.array(np.empty(shape, dtype=type(result[0]))), convert_to_scalar=False)).view(MatCell)
    else:
        return (self.squeeze(np.array(result).reshape(shape, order="F"), convert_to_scalar=False)).view(MatCell)

@dimitri-yatsenko
Copy link
Member

ok, will incorporate asap. We are starting to work on a new release. Thanks.

@Alvalunasan
Copy link
Author

Alvalunasan commented Jun 8, 2024

See next coment

@Alvalunasan
Copy link
Author

Alvalunasan commented Jun 10, 2024

@dimitri-yatsenko , @carlosbrody

New corner case for the function (Cell Matrix reading, instead of only nx1 vectors):

  def read_cell_array(self):
       """deserialize MATLAB cell array"""
       load_as_object = False
       n_dims = self.read_value()
       shape = self.read_value(count=n_dims)
       n_elem = int(np.prod(shape))
       result = [self.read_blob(n_bytes=self.read_value()) for _ in range(n_elem)]

       # If it is a matrix (and not a nx1 vector) load as object
       if np.sum(shape > 1) > 1:
           load_as_object = True

       # Check size for each element (could have Empty elements in vector)
       if n_elem > 0:
           # If there are arrays, tuple or list inside elements of result, load as object (except if all emptys)
           if isinstance(result[0], np.ndarray):
               sizes_array = [x.size for x in result]
               sum_sizes = sum(sizes_array)
               load_as_object = True
           elif isinstance(result[0], tuple) or isinstance(result[0], list):
               sizes_array = [len(x) for x in result]
               sum_sizes = sum(sizes_array)
               load_as_object = True
           else:
               sum_sizes = n_elem

       # If no trials in array
       if n_elem ==0:
           return np.array(np.empty(0)).view(MatCell)
       # If all trials contains "empty" data
       elif sum_sizes == 0:
           return (self.squeeze(np.array(np.empty(shape, dtype=type(result[0]))), convert_to_scalar=False)).view(MatCell)
       # If some trials contains data and others contains "empty" data
       elif sum_sizes != n_elem or load_as_object:
           return (self.squeeze(np.array(result, dtype='object').reshape(shape, order="F"), convert_to_scalar=False)).view(MatCell)
       # Regular case, all trials contains data
       else:
           return (self.squeeze(np.array(result).reshape(shape, order="F"), convert_to_scalar=False)).view(MatCell)

@dimitri-yatsenko
Copy link
Member

Excellent. I am traveling until next week and will work on this when I return. Thank you so much for this solution.

@dimitri-yatsenko
Copy link
Member

I understand this was resolved and traced to a 32-bit compilation of mym in Matlab. Can we close this @Alvalunasan ?

@dimitri-yatsenko dimitri-yatsenko self-assigned this Sep 12, 2024
@dimitri-yatsenko
Copy link
Member

Hi @Alvalunasan @carlosbrody

I just reviewed the blob code and found that we have a setting to switch to 32-bit length econding. You can turn it on by doing

dj.blob.use_32bit_dims = True

This should fix your issue. Please try and let me know.
I will close this issue then.

@Alvalunasan
Copy link
Author

Hi Dimitri, sorry for the late response,
This is a different issue than the 32bit vs 64bit length encoding.
Regardless of this, MATLAB has a way to handle empty and nested cells that are not being captured properly by read_cell_array function.

@dimitri-yatsenko
Copy link
Member

dimitri-yatsenko commented Sep 19, 2024

@Alvalunasan We are in a good place to fix it if you can provide a good way to reproduce this. We are working toward releasing DataJoint Python 1.0 by the end of the year and would like to close most issues. We are not focusing on MATLAB compatibility as much, going deeper on Python instead. You are a very important users, so we can fix it if you can help us reproduce it without using MATLAB.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants