MapMyCells compressed csv failure

Hello,

I have successfully used the MapMyCells platform multiple times in the past, and I recently noticed that mapping fails when I upload a compressed csv. I have not changed my code, so I am wondering why this may be. To test file compatibility, I compressed a small csv file and tested both the csv.gz and the plain csv. The csv file worked perfectly, but the compressed version came back as a failure at the “Input” step. Does MapMyCells still accept the csv.gz format?

Thanks!

Gzipped CSV files should still be accepted.

To give us a little more insight into what is happening, do you mind re-running your data and recording the RunID (the long alphanumeric string that appears in the failure message as detailed in this post)? Once we have that, we can look up a more detailed error message associated with your specific data.

Thank you! I re-ran the pipeline with the uncompressed file and it worked, but the compressed file still threw an error. Here is the Run ID: 1751463056143-3ce43182-f0e5-40c0-9348-60eb58587bf6

That’s odd. The error message for your run is not recorded in our cloud database.

May I download your gzipped data so that I can run it locally and see what happens?

That is odd indeed. I can send the gzipped data, but the forum will not let me attach a compressed csv. Does email work OK?

Don’t worry about emailing the data. I can download what you uploaded from the cloud. I just wanted to ask your permission before I did that. Sounds like I have your permission. I’ll let you know what I find.

So: I know what is happening. It is going to require a bugfix.

When you input a CSV file, the code has to do some analysis to figure out if the first column in the CSV file is a list of row names or a list of values (in which case the rows have no names). Both shapes are things we expect users to submit to us.
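Concretely, with made-up gene and cell labels, the two shapes look something like

,cell_name_A,cell_name_B
gene_1,0,3
gene_2,5,1

versus

cell_name_A,cell_name_B
0,3
5,1

In the second shape, every column (including the first) holds values.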

Part of that process is reading the CSV in with pandas. For better or worse, your CSV is big enough that when pandas reads it in (with a very default configuration, because we assume we know nothing about the shape of the CSV file), it requires too much memory and the system crashes (probably because pandas does not know by default what is a string and what is an int, and so reads everything in as strings).

I need to put some work into making this process more intelligent/efficient, apparently.
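For what it is worth, one memory-frugal way to do that first-column check might be to read only a sample of the first column rather than the whole table. A rough sketch of the idea in pandas, with a made-up file name (this is not the actual MapMyCells code):

import pandas as pd

# Sample only the first column of the first 1000 rows instead of loading
# the whole CSV with default dtype inference.
sample = pd.read_csv("my_data.csv", usecols=[0], nrows=1000)
first_col = sample.iloc[:, 0]

# If any sampled value fails to parse as a number, the first column is
# probably a list of row names; otherwise it is probably data.
has_row_names = pd.to_numeric(first_col, errors="coerce").isna().any()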

There is one quick way around this: the “read everything into pandas and intuit the meaning of the first column” step would not be necessary if the first entry of the header row were blank. Unfortunately, the first row in your csv file looks like

"",cell_name_A,cell_name_B,cell_name_C,...

which is not the same as

,cell_name_A,cell_name_B,cell_name_C,...

(the first column in the first example is not blank; it is a string with two characters, both of which are "). This is also a bug. Our code should probably treat "" and '' as blank.
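The eventual fix will probably amount to something like this hypothetical check on that first header field (a sketch, not the shipped code):

def first_field_is_blank(field: str) -> bool:
    # Treat a truly empty field, a quoted empty string (""), and a
    # single-quoted empty string ('') all as blank.
    return field.strip() in ("", '""', "''")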

I will work on fixing these bugs. In the meantime, should you find yourself blocked, editing your CSV file to look like the second example above (with a truly blank first entry in the header row) ought to get your data through.
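If it helps, that edit can be made without decompressing the file by hand. Here is a small Python sketch with made-up file names (adjust them to match yours):

import gzip
import shutil

src = "my_data.csv.gz"        # the file you uploaded (made-up name)
dst = "my_data_fixed.csv.gz"  # the corrected copy

with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
    header = fin.readline()
    # If the header starts with a quoted empty field, drop the two quote
    # characters so the first entry is actually blank.
    if header.startswith('"",'):
        header = header[2:]
    fout.write(header)
    # Copy the rest of the file through unchanged.
    shutil.copyfileobj(fin, fout)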

Sorry about this, and thanks for the bug catch.

Ok! Thank you so much!