Make polars frames lazy and stream into csv #294
base: master
Conversation
**Codecov Report:** All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #294      +/-   ##
==========================================
- Coverage   89.69%   89.65%   -0.05%
==========================================
  Files          16       16
  Lines        4019     4021       +2
  Branches      939      941       +2
==========================================
  Hits         3605     3605
- Misses        281      284       +3
+ Partials      133      132       -1
```

☔ View full report in Codecov by Sentry.
Hey @coroa, thanks for your PR. According to the profiler the lazy operation is taking very long.

[Memory profiles: Original Pandas Based · Polars Based (Non-lazy) · Polars Based (lazy)]

Code for running the benchmark:

```python
import pypsa
import psutil
import time
import threading
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# Flag to control the monitoring loop
stop_monitoring = False

# List to store memory usage values
memory_values = []

# Function to monitor memory usage of this process
def monitor_memory_usage(interval=0.1):
    global stop_monitoring
    process = psutil.Process()
    while not stop_monitoring:
        mem_info = process.memory_info()
        memory_values.append(mem_info.rss / 1024 ** 2)  # store memory in MB
        time.sleep(interval)

# Start monitoring memory usage in a separate thread
monitor_thread = threading.Thread(target=monitor_memory_usage)
monitor_thread.daemon = True  # daemonize thread
monitor_thread.start()

# The benchmarked code
n = pypsa.Network(".../pypsa-eur/results/solver-io/prenetworks/elec_s_128_lv1.5__Co2L0-25H-T-H-B-I-A-solar+p3-dist1_2050.nc")
m = n.optimize.create_model()
m.to_file("test.lp", io_api="lp-polars")

# Stop monitoring
stop_monitoring = True
monitor_thread.join()

# Plot the memory usage over time
plt.plot(memory_values)
plt.xlabel('Time (in 0.1s intervals)')
plt.ylabel('Memory Usage (MB)')
plt.title('Memory Usage Over Time')
plt.savefig("mem-polars-non-lazy.png")
print(max(memory_values))
```
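(The script samples the process RSS every 100 ms from a daemon thread and reports the peak, so allocation spikes shorter than the sampling interval can be missed.)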
Interesting that there are no memory savings in either case compared to the others.
Thanks for the profiling. Very disappointing.
It's possible that …
The lazy version has to do everything at least twice, since the …
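A minimal sketch of the double execution, with a hypothetical frame and output path: every `collect` or sink re-runs the full query plan, so a null check that collects before the streaming write amounts to two passes over the data.

```python
import polars as pl

# Hypothetical stand-in for one of the constraint frames.
lf = pl.LazyFrame({"labels": [1, 2, 3], "coeffs": [1.0, None, 3.0]})
plan = lf.with_columns((pl.col("coeffs") * 2).alias("scaled"))

# Pass 1: the null check executes the whole plan.
has_nulls = plan.select(pl.col("*").is_null().any()).collect()

# Pass 2: the streaming CSV sink executes the same plan again.
plan.sink_csv("out.csv")
```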
I'll try to debug around a bit to find out where we are scooping up this memory use. Any particular xarray version to focus on? @FabianHofmann
Cool, but no rush; it seems to be stable for the moment. I think it should be independent of the xarray version.
```diff
@@ -316,7 +318,7 @@ def check_has_nulls_polars(df: pl.DataFrame, name: str = "") -> None:
         ValueError: If the DataFrame contains null values,
         a ValueError is raised with a message indicating the name of the constraint and the fields containing null values.
     """
-    has_nulls = df.select(pl.col("*").is_null().any())
+    has_nulls = df.select(pl.col("*").is_null().any()).collect()
```
Perhaps we can also avoid this `.collect`?
Yes, we should be able to, but I think we need to change the formulation a bit further.
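For context, a minimal sketch (hypothetical data and error message) of what the patched check boils down to: the `.collect` still executes the whole plan once, but it only materialises a single row of per-column booleans.

```python
import polars as pl

# Hypothetical lazy frame with a null coefficient.
lf = pl.LazyFrame({"coeffs": [1.0, None], "vars": [3, 4]})

has_nulls = lf.select(pl.col("*").is_null().any()).collect()  # one-row frame
null_columns = [c for c, flag in zip(has_nulls.columns, has_nulls.row(0)) if flag]
if null_columns:
    raise ValueError(f"Constraint contains nulls in fields {null_columns}")
```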
Tests run fine. The extra pyarrow dependency should not hurt, since Arrow is already a requirement for polars (and soon also for pandas), while pyarrow only adds the Python frontend on top.
We should check for each of the invocations of `write_lazyframe` that `explain(streaming=True)` shows it can actually run the streaming pipeline.

If you decide to merge, please squash (the history is ugly :))
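For instance, a hedged sketch of such a check, assuming a polars version where `LazyFrame.explain` accepts `streaming=True` and marks streamable sections of the printed plan with a `STREAMING` header (worth verifying against the installed version):

```python
import polars as pl

# Hypothetical lazy frame standing in for a write_lazyframe input.
lf = pl.LazyFrame({"labels": [1, 2], "coeffs": [0.5, 1.5]})
query = lf.with_columns((pl.col("coeffs") * 2).alias("scaled"))

plan = query.explain(streaming=True)
# If no STREAMING section shows up, the plan would silently fall back
# to the in-memory engine instead of streaming.
assert "STREAMING" in plan, "query plan cannot run in the streaming engine"
```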