Dask
Dask is a parallel computing and data analytics library for Python.
Pandas
Pandas is a Python library for data manipulation and analysis.
Faster in some cases, slower in others
Example |
---|
"On my laptop the dask version is faster than the pandas one" from question Dask groupby date performance |
"This tells us that dask is about 50% slower than pandas for this task; this is to be expected, because the chunking and recombining of data partitions leads to some extra overhead" from question Read a large csv into a sparse pandas dataframe in a memory efficient way |
"When data fits in memory, pandas is faster than dask" from question Dask groupby date performance |
"I guess dask will be slower than pandas for smaller datasets" from question Dask in-place replacement of pandas? |
"My question is: is there a way to do this in either pandas or dask that is faster than the following sequence: group by index; outer join each group to itself to produce pairs; dataframe.apply a comparison function on each row of pairs? For reference, assume I have access to a good number of cores (hundreds) and about 200 G of memory" from question Efficient pairwise comparison of rows in pandas DataFrame |
"I am implementing dask, but it's slower than normal pandas sequential streaming" from question How to have parallelization in Key-Value Databases? |
To_csv method slow
Example |
---|
"But if I invoke the to_csv method, then dask is as slow as pandas" from question How should I write multiple CSV files efficiently using dask.dataframe? |
"It will be quicker than the pandas method; dask's write function will break your file into multiple chunks and store those chunks" from question Saving large Pandas df of text data to disk crashes Colab due to using up all RAM. Is there a workaround? |
Others
Example |
---|
Currently I have this working piece of code: essentially I create a dask dataframe from a pandas dataframe weather, then I apply the function dffunc to each row of the dataframe from question Pandas-Dask DataFrame Apply Function with List Return |
The dask documentation states that dask's set_index is much more expensive than pandas'. With that in mind, which of the following should be a best practice? The time column is filled with datetime objects from question Is it better to set_index in Pandas then convert to Dask or vice versa? |
Dask dataframes are mostly API-compatible with pandas and support parallel execution for apply from question Speed up Pandas on Multi-core machine |
Pandas is far more flexible for working with data, so I often bring parts of dask dataframes into memory, manipulate columns, and create new ones from question Add pandas series to dask dataframe |
If you just want performance gains while sticking with pandas, check out the docs here and this article I found particularly helpful. Edit: with dask you would do from question Parallel mapping x.f() instead of f(x) for many functions |
Maybe dask is more efficient than pandas; however, I have never used dask before from question Large SQL query (64. Mio rows) into pandas dataframe takes way to long |
This may help those confused by dask and hdf5 but more familiar with pandas, like myself from question "Large data" workflows using pandas |
When reading, the performance of dask is significantly poorer than that of pandas from question * (no title is found for this review) |
Looking through the dask documentation, it says that generally speaking, dask.dataframe groupby-aggregations are roughly the same performance as pandas groupby-aggregations. So unless you're using a dask distributed client to manage workers, threads, etc., the benefit of using it over vanilla pandas isn't always there from question Why does my code take so long to write CSV file in Dask Python |
When hdf5 storage can be accessed faster than .csv, and when dask creates dataframes faster than pandas, why is dask from hdf5 slower than dask from csv? from question Why do pandas and dask perform better when importing from CSV compared to HDF5? |
In fact, this naive dask implementation seems to be slower than plain pandas for larger problem instances from question Dask broadcast not available during compute graph |
It lets you use most of the standard pandas commands in parallel, out-of-memory; the only problem is dask doesn't have an excel reader, from what I can tell from question Using Python to analyze large set of sensor-data |
Dask dataframe developers recommend using pandas when feasible from question How can I order entries in a Dask dataframe for display in seaborn? |
Here is a simplified dataset I created, which produces this input dataframe:

```
    customerid  t  vals
0            0  0     0
1            0  1     1
2            0  2     2
3            0  3     3
4            0  4     4
5            0  5     5
6            0  6     6
7            0  7     7
8            0  8     8
9            0  9     9
10           1  0    10
11           1  1    11
12           1  2    12
13           1  3    13
14           1  4    14
15           1  5    15
16           1  6    16
17           1  7    17
18           1  8    18
19           1  9    19
```

My goal output is the 8 weekly lagged `vals` columns (including `vals_0` as the current week's value), with NaNs where data is unavailable:

```
    customerid  t  vals_0  vals_1  vals_2  vals_3  vals_4  vals_5  vals_6  vals_7
0            0  0       0     NaN     NaN     NaN     NaN     NaN     NaN     NaN
1            0  1       1     0.0     NaN     NaN     NaN     NaN     NaN     NaN
2            0  2       2     1.0     0.0     NaN     NaN     NaN     NaN     NaN
3            0  3       3     2.0     1.0     0.0     NaN     NaN     NaN     NaN
4            0  4       4     3.0     2.0     1.0     0.0     NaN     NaN     NaN
5            0  5       5     4.0     3.0     2.0     1.0     0.0     NaN     NaN
6            0  6       6     5.0     4.0     3.0     2.0     1.0     0.0     NaN
7            0  7       7     6.0     5.0     4.0     3.0     2.0     1.0     0.0
8            0  8       8     7.0     6.0     5.0     4.0     3.0     2.0     1.0
9            0  9       9     8.0     7.0     6.0     5.0     4.0     3.0     2.0
10           1  0      10     NaN     NaN     NaN     NaN     NaN     NaN     NaN
11           1  1      11    10.0     NaN     NaN     NaN     NaN     NaN     NaN
12           1  2      12    11.0    10.0     NaN     NaN     NaN     NaN     NaN
13           1  3      13    12.0    11.0    10.0     NaN     NaN     NaN     NaN
14           1  4      14    13.0    12.0    11.0    10.0     NaN     NaN     NaN
15           1  5      15    14.0    13.0    12.0    11.0    10.0     NaN     NaN
16           1  6      16    15.0    14.0    13.0    12.0    11.0    10.0     NaN
17           1  7      17    16.0    15.0    14.0    13.0    12.0    11.0    10.0
18           1  8      18    17.0    16.0    15.0    14.0    13.0    12.0    11.0
19           1  9      19    18.0    17.0    16.0    15.0    14.0    13.0    12.0
```

The following pandas function creates the goal output dataframe and runs in roughly 500 ms. I can also accomplish this in dask using map_partitions and get the same results in 900 ms, presumably worse than pandas due to the overhead of spinning up a thread. I can also accomplish this in pyspark (note: for both dask and spark I have only one partition, to make a fairer comparison with pandas). With the following code I get the correct results back, although shuffled:

```
+----------+---+------+------+------+------+------+------+------+------+
|customerid|  t|vals_0|vals_1|vals_2|vals_3|vals_4|vals_5|vals_6|vals_7|
+----------+---+------+------+------+------+------+------+------+------+
|         1|  3|    13|    12|    11|    10|  null|  null|  null|  null|
|         1|  0|    10|  null|  null|  null|  null|  null|  null|  null|
|         1|  1|    11|    10|  null|  null|  null|  null|  null|  null|
|         0|  9|     9|     8|     7|     6|     5|     4|     3|     2|
|         0|  1|     1|     0|  null|  null|  null|  null|  null|  null|
|         1|  4|    14|    13|    12|    11|    10|  null|  null|  null|
|         0|  4|     4|     3|     2|     1|     0|  null|  null|  null|
|         0|  3|     3|     2|     1|     0|  null|  null|  null|  null|
|         0|  7|     7|     6|     5|     4|     3|     2|     1|     0|
|         1|  5|    15|    14|    13|    12|    11|    10|  null|  null|
|         1|  6|    16|    15|    14|    13|    12|    11|    10|  null|
|         0|  6|     6|     5|     4|     3|     2|     1|     0|  null|
|         1|  7|    17|    16|    15|    14|    13|    12|    11|    10|
|         0|  8|     8|     7|     6|     5|     4|     3|     2|     1|
|         0|  0|     0|  null|  null|  null|  null|  null|  null|  null|
|         0|  2|     2|     1|     0|  null|  null|  null|  null|  null|
|         1|  2|    12|    11|    10|  null|  null|  null|  null|  null|
|         1|  9|    19|    18|    17|    16|    15|    14|    13|    12|
|         0|  5|     5|     4|     3|     2|     1|     0|  null|  null|
|         1|  8|    18|    17|    16|    15|    14|    13|    12|    11|
+----------+---+------+------+------+------+------+------+------+------+
```

But the pyspark version takes significantly longer to run: 34 seconds. I kept this example small and simple (only 20 data rows, only 1 partition for both dask and spark), so I would not expect memory and CPU usage to drive significant performance differences from question Optimizing Pyspark Performance to Match Pandas / Dask?
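For reference, the lagged-columns goal output described in the last quote can be produced in plain pandas with `groupby` and `shift`; the original poster's function is not shown, so this is an assumed equivalent:

```python
import pandas as pd

# Reconstruct the 20-row input: two customers, ten weekly observations each.
df = pd.DataFrame({
    "customerid": [0] * 10 + [1] * 10,
    "t": list(range(10)) * 2,
    "vals": range(20),
})

# vals_k holds each customer's value from k weeks earlier; shift leaves
# NaN where that history does not exist, matching the goal output.
for k in range(8):
    df[f"vals_{k}"] = df.groupby("customerid")["vals"].shift(k)

out = df.drop(columns="vals")
```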