Single session taking (too much?) RAM

Hi everybody,

I am a new user of the AllenSDK. I would like to start using this dataset with my students in a course. However, I’m facing problems when loading the data.

I followed the basic tutorials and downloaded the data corresponding to a session. However, I'm unable to work with it: when I call some of its functions, my laptop runs out of RAM.

# Cache set up as in the basic tutorial (the manifest path is just my local choice)
from allensdk.brain_observatory.ecephys.ecephys_project_cache import EcephysProjectCache
cache = EcephysProjectCache.from_warehouse(manifest="ecephys_cache_dir/manifest.json")

# Downloading and loading the session goes well, no problem. Takes a few seconds.
session_id = 798911424
oursession = cache.get_session_data(session_id)

# When I call this in a notebook cell, RAM usage starts to increase gradually
oursession.metadata

Calling get_session_data(id) allocates around 1 GB, which I believe is reasonable given that all the data went into the variable. Totally fine.
The weird thing is that asking for the metadata just keeps allocating more and more memory, and the call does not seem to finish. After roughly 1-2 minutes it has allocated enough to exhaust all my RAM (my laptop has 8 GB); it was sitting at about 5 GB. And the code is still executing (i.e., it's not that the code finishes and there's a memory leak somewhere else).
Is this normal behaviour? To my understanding the metadata is just a dictionary with a few fields, so there's no way it weighs that much. I see the same problem if I call, e.g., oursession.structurewise_unit_counts. Also, when I manually interrupt the kernel, the memory is not freed.
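
In case it's useful, this is roughly how I watched the memory grow (just a sketch; it assumes psutil is installed, which is not part of the AllenSDK):

import os
import psutil

proc = psutil.Process(os.getpid())

def rss_gb():
    # Resident memory of this Python process, in GB
    return proc.memory_info().rss / 1e9

print(f"after get_session_data: {rss_gb():.2f} GB")
meta = oursession.metadata  # this is the call during which the allocation keeps growing
print(f"after .metadata:       {rss_gb():.2f} GB")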

I tried to find system requirements for the AllenSDK to figure out whether this is just how it operates and I simply need more RAM, or whether something else is going on. To me it looks like some kind of problem, but I'm not able to diagnose it…

My OS is Ubuntu 20.04. I am using AllenSDK in a clean environment with Python 3.11 and AllenSDK 2.16.2.

Thank you all in advance for the support!

Hi @victorb, this problem is likely due to limited memory. Creating the session object does not load the units table, which contains the spike times that make up the bulk of the data. For some sessions, this can use up 5-6 GB of RAM, and takes a few minutes to load.

Some of the metadata depends on the contents of the units table, so this is loaded into memory when you request this info. You can request other types of metadata (e.g. age_in_days) without loading the units table, but things like structure_acronyms (and also structurewise_unit_counts) depend on the info in this table.
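
In code terms the difference looks roughly like this (a sketch; the attribute names are the ones discussed above, so double-check them against your AllenSDK version):

age = oursession.age_in_days              # plain metadata, read cheaply from the file
acronyms = oursession.structure_acronyms  # derived from the units table, so the full
                                          # units table gets loaded first (several GB)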

So I think the only solution is to do your analysis on a machine with more memory available.

Hi @joshs, thank you so much for your answer. I was indeed able to load the session object on a more capable computer without any problem. I can confirm that allocating the dataset takes around 4-5 GB of RAM, which is what prevented my 8 GB laptop from doing it. Once I moved to a larger computer everything worked beautifully.

However, the issue is that students might run into this limitation too. So I had the idea of extracting only the data my students need, saving it to CSV files and ignoring the rest. While doing so I found some hints that there might be a way to save some memory in the session object. Let me explain carefully, and then you can decide whether it makes sense or whether I'm saying stupid stuff (high chances, but just in case…) :slight_smile:

First, I tried to estimate the memory footprint of the session object (using code from Stack Overflow), and from what I see it takes about 2 GB of RAM. In fact, many of the relevant tables (like the spike times) are relatively light (~500 MB). The largest property I see is spike_amplitudes, which takes ~1 GB by itself. One detail: this table seems to include the spike amplitudes for all cells, not only the ones passing the quality metrics:

session = cache.get_session_data(798911424)
len(session.units)             # 825
len(session.spike_amplitudes)  # 2525

If that's by design, perfect. Still, the problem is that I estimated the session object at about 2 GB, while I physically saw my computer allocate 4-5 GB. I wonder if there are intermediate computations at some point that are never freed from memory. I was thinking about this because many attributes have an "internal" counterpart on the class (like session.units and session._units). Most of these pairs are identical and refer to the same location in memory, so they are fine. However, session.stimulus_presentations and session._stimulus_presentations, for example, are different, and the latter is a superset of the former with 4 extra columns (both weigh around 50 MB, so not a big deal, but it makes me wonder whether there are more cases like this eating up memory). I understand that the underscore-prefixed attributes hold intermediate results and exist by design…

…so what I mean is that there's still a 2-3 GB gap between what I physically see allocated in RAM and my estimate of the session object's size, and looking at these small things makes me suspect that some potentially unneeded memory is floating around.
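
For reference, these are roughly the checks I ran (a sketch: the underscore attributes are internal and may change between AllenSDK versions, and the estimates only cover the pandas tables and the spike_amplitudes dict):

import numpy as np

# Per-table memory of the public DataFrames
print(session.units.memory_usage(deep=True).sum() / 1e6, "MB")
print(session.stimulus_presentations.memory_usage(deep=True).sum() / 1e6, "MB")

# spike_amplitudes looked like a dict of unit_id -> array on my machine
print(sum(np.asarray(a).nbytes for a in session.spike_amplitudes.values()) / 1e6, "MB")

# Public vs. internal attributes: same object in memory or not?
print(session.units is session._units)
print(session.stimulus_presentations is session._stimulus_presentations)

# Columns present only in the internal stimulus table
print(set(session._stimulus_presentations.columns) - set(session.stimulus_presentations.columns))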

Or (most probably) there's a well-thought-out design I did not understand, and that memory might indeed be needed. If that's the case, let me know and I'll mark the thread as solved. Thanks again for your answer!

Hi Victor – thanks for looking at this in detail. I think the discrepancies you’re seeing are due to the fact that the data in the underlying NWB file is loaded lazily (i.e., only upon request), and that a subset of the data (e.g. units that do not pass QC) is filtered out before exposing it to the user. It’s probably the case that some of this data could be freed from memory, but it wouldn’t get around the fact that it needs to be loaded in order to apply the appropriate filters. So 16 GB is likely the minimum amount of RAM you’d need to work with these files.
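
For context, the QC filters are applied when the session object is constructed, and the thresholds can be tuned through keyword arguments to get_session_data (a sketch; the parameter names below follow the ecephys tutorials, and the defaults shown are approximate):

session = cache.get_session_data(
    798911424,
    amplitude_cutoff_maximum=0.1,   # drop units with unstable spike amplitudes
    presence_ratio_minimum=0.95,    # drop units not present throughout the session
    isi_violations_maximum=0.5,     # drop units with many refractory-period violations
)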

For teaching purposes, I would recommend extracting a subset of the data from each session and re-saving it as an NWB file. That way, if the students want to dig into other sessions, they'll already be familiar with the methods for accessing that type of file. You can check out this tutorial for instructions on how to do it: Extracellular Electrophysiology Data — PyNWB documentation
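
A minimal sketch of what that could look like (the unit selection, file name, and session variable are illustrative; see the linked PyNWB tutorial for the full pattern):

from datetime import datetime, timezone

import numpy as np
from pynwb import NWBFile, NWBHDF5IO

# session is assumed to be an EcephysSession loaded on a machine with enough RAM
keep_ids = session.units.index.values[:50]   # e.g. the first 50 QC-passing units
spike_times = session.spike_times            # dict: unit_id -> array of spike times

subset = NWBFile(
    session_description="Subset of ecephys session 798911424 for teaching",
    identifier="ecephys-798911424-subset",
    session_start_time=datetime.now(timezone.utc),
)

for unit_id in keep_ids:
    subset.add_unit(id=int(unit_id), spike_times=np.asarray(spike_times[unit_id]))

with NWBHDF5IO("session_798911424_subset.nwb", mode="w") as io:
    io.write(subset)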

I see. I understand that even if some of the intermediate calculations were cleaned up, there would still be a RAM spike due to loading the file. However, it seems that some of those calculations are never cleared from memory, so they might accumulate and make the situation worse.

Probably, as you say, even with optimized data loading one needs at least 16 GB, but it could be interesting to profile this a bit and see whether anything can be done. Unfortunately, I don't have the capacity to do that myself right now.

In any case, I believe it would be nice to add a note to the AllenSDK installation page warning that 16 GB of RAM may be required, since I was not able to find this information myself. That said, I think we can safely close this thread.