Dask

Dask is a parallel computing and data analytics library for Python.

Pandas

Pandas is a Python library for data manipulation and analysis.



To_csv method slow

Example

"But if i invoke the to_csv method then dask is as slow as pandas"

from question  

How should I write multiple CSV files efficiently using dask.dataframe?

"It will be quicker than pandas method dask write function will break your file into mulitple chuncks and store mulitple chuncks"

from question  

Saving large Pandas df of text data to disk crashes Colab due to using up all RAM. Is there a workaround?
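A minimal sketch of the contrast the quotes describe, using a toy frame (the column name, sizes, and partition count are made up): funnelling the write through a single pandas-style call serializes the work, while Dask's to_csv with a glob pattern writes one file per partition in parallel.

    import pandas as pd
    import dask.dataframe as dd

    # Toy frame split into 4 Dask partitions.
    pdf = pd.DataFrame({"x": range(1_000_000)})
    ddf = dd.from_pandas(pdf, npartitions=4)

    # Collecting everything into one file serializes the write, so it
    # is roughly as slow as calling pandas.to_csv directly:
    ddf.compute().to_csv("all-in-one.csv", index=False)

    # Letting Dask write one file per partition keeps the write
    # parallel; '*' becomes the partition number (out-0.csv, out-1.csv, ...).
    ddf.to_csv("out-*.csv", index=False)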

Faster vs. slower

Example

"This tells us that dask is about 50 slower than pandas for this task this is to be expected because the chunking and recombining of data partitions leads to some extra overhead"

from question  

Read a large csv into a sparse pandas dataframe in a memory efficient way
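A rough way to see that overhead, as a sketch with made-up data sizes: on a small, in-memory task the Dask run also pays for building a task graph and recombining per-partition results before any useful work happens.

    import time
    import pandas as pd
    import dask.dataframe as dd

    # Hypothetical micro-benchmark: the same column sum in both libraries.
    pdf = pd.DataFrame({"x": range(1_000_000)})
    ddf = dd.from_pandas(pdf, npartitions=8)

    t0 = time.perf_counter()
    pdf["x"].sum()
    print("pandas:", time.perf_counter() - t0)

    t0 = time.perf_counter()
    ddf["x"].sum().compute()  # builds a task graph, then runs it
    print("dask:  ", time.perf_counter() - t0)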

"For and on my laptop the dask version is faster than the pandas one"

from question  

Dask groupby date performance

"When data fit in memory pandas is faster than dask"

from question  

Dask groupby date performance
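A sketch of the comparison those quotes describe, on hypothetical data: both lines compute the same date-keyed aggregate, and on data that fits in memory the Pandas line typically wins, since Dask must also combine per-partition results.

    import pandas as pd
    import dask.dataframe as dd

    # Made-up minute-level data with a derived day key.
    pdf = pd.DataFrame({
        "date": pd.date_range("2020-01-01", periods=100_000, freq="min"),
        "value": range(100_000),
    })
    pdf["day"] = pdf["date"].dt.date

    ddf = dd.from_pandas(pdf, npartitions=4)

    pandas_result = pdf.groupby("day")["value"].mean()
    dask_result = ddf.groupby("day")["value"].mean().compute()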

"1 i guess dask will be slower than pandas for smaller datasets"

from question  

Dask in-place replacement of pandas?

"I am implementing dask but its slower than normal pandas sequential streaming"

from question  

How to have parallelization in Key-Value Databases?

"My question is is there a a way to do this in either pandas or dask that is faster than the following sequence group by index outer join each group to itself to produce pairs dataframe.apply comparison function on each row of pairs for reference assume i have access to a good number of cores hundreds and about 200g of memory"

from question  

Efficient pairwise comparison of rows in pandas DataFrame
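A minimal sketch of the sequence the quote lists, using a hypothetical key column and a placeholder comparison function (neither comes from the original question):

    import pandas as pd

    df = pd.DataFrame(
        {"key": ["a", "a", "b", "b"], "text": ["x1", "x2", "y1", "y2"]}
    )

    # Outer-join each group to itself via a self-merge on the key.
    pairs = df.merge(df, on="key", suffixes=("_left", "_right"))
    # Keep each unordered pair once and drop self-pairs.
    pairs = pairs[pairs["text_left"] < pairs["text_right"]]

    def compare(row):
        # Placeholder; the real comparison function is not in the quote.
        return row["text_left"] == row["text_right"]

    pairs["match"] = pairs.apply(compare, axis=1)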

Others

Example

If you just want performance gains while sticking with Pandas, check out the docs here and this article I found particularly helpful. Edit: with Dask you would do something like the sketch below.

from question  

Parallel mapping x.f() instead of f(x) for many functions
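The quote's trailing "with Dask you would do" evidently introduced code that was lost; a plausible sketch, with a made-up per-element function f, is to map it over a Series one partition at a time:

    import pandas as pd
    import dask.dataframe as dd

    s = pd.Series(range(1_000_000))

    def f(x):
        return x * 2 + 1  # hypothetical per-element function

    ds = dd.from_pandas(s, npartitions=8)
    # Each partition is a plain pandas Series, so .map(f) runs on it.
    result = ds.map_partitions(lambda part: part.map(f)).compute()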

When reading, the performance of Dask is significantly poorer than Pandas.

from question  

Dask to parquet - performance and size issues

Currently I have this working piece of code: essentially, I create a Dask dataframe from a Pandas dataframe weather, then I apply the function dffunc to each row of the dataframe.

from question  

Pandas-Dask DataFrame Apply Function with List Return
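A sketch of that pattern; weather and dffunc here are stand-ins reconstructed from the description, not the asker's actual code:

    import pandas as pd
    import dask.dataframe as dd

    weather = pd.DataFrame({"temp": [20.0, 21.5], "humidity": [0.4, 0.5]})

    def dffunc(row):
        return row["temp"] * row["humidity"]  # hypothetical row function

    ddf = dd.from_pandas(weather, npartitions=2)
    # Row-wise apply needs axis=1 plus a meta hint so Dask knows the
    # output name/dtype without eagerly running the function.
    result = ddf.apply(dffunc, axis=1, meta=("out", "f8")).compute()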

It lets you use most of the standard Pandas commands in parallel and out-of-memory; the only problem is that Dask doesn't have an Excel reader, from what I can tell.

from question  

Using Python to analyze large set of sensor-data
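A common workaround for the missing Excel reader, assuming hypothetical file names, is to wrap pandas.read_excel in dask.delayed so each file still loads lazily:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    files = ["sensors_2021.xlsx", "sensors_2022.xlsx"]  # hypothetical
    parts = [dask.delayed(pd.read_excel)(f) for f in files]
    # Stitch the lazy parts into one Dask dataframe (Dask infers the
    # schema from the first part unless a meta frame is supplied).
    ddf = dd.from_delayed(parts)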

Dask dataframes are mostly API-compatible with Pandas and support parallel execution for apply.

from question  

Speed up Pandas on Multi-core machine
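One sketch of what that parallelism buys on a multi-core machine, with a made-up transform: partition the frame, run a per-chunk pandas function on each partition, and compute with the process scheduler.

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": range(1_000_000)})
    ddf = dd.from_pandas(pdf, npartitions=8)

    def per_chunk(part):
        part = part.copy()
        part["y"] = part["x"] ** 2  # hypothetical CPU-bound transform
        return part

    result = ddf.map_partitions(per_chunk).compute(scheduler="processes")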

Here is a simplified dataset I created, which produces this input dataframe:

        customerid  t  vals
    0            0  0     0
    1            0  1     1
    2            0  2     2
    3            0  3     3
    4            0  4     4
    5            0  5     5
    6            0  6     6
    7            0  7     7
    8            0  8     8
    9            0  9     9
    10           1  0    10
    11           1  1    11
    12           1  2    12
    13           1  3    13
    14           1  4    14
    15           1  5    15
    16           1  6    16
    17           1  7    17
    18           1  8    18
    19           1  9    19

My goal output is the 8 weekly lagged vals columns, including vals_0 as the current week's value, with NaNs where data is unavailable:

        customerid  t  vals_0  vals_1  vals_2  vals_3  vals_4  vals_5  vals_6  vals_7
    0            0  0       0     NaN     NaN     NaN     NaN     NaN     NaN     NaN
    1            0  1       1     0.0     NaN     NaN     NaN     NaN     NaN     NaN
    2            0  2       2     1.0     0.0     NaN     NaN     NaN     NaN     NaN
    3            0  3       3     2.0     1.0     0.0     NaN     NaN     NaN     NaN
    4            0  4       4     3.0     2.0     1.0     0.0     NaN     NaN     NaN
    5            0  5       5     4.0     3.0     2.0     1.0     0.0     NaN     NaN
    6            0  6       6     5.0     4.0     3.0     2.0     1.0     0.0     NaN
    7            0  7       7     6.0     5.0     4.0     3.0     2.0     1.0     0.0
    8            0  8       8     7.0     6.0     5.0     4.0     3.0     2.0     1.0
    9            0  9       9     8.0     7.0     6.0     5.0     4.0     3.0     2.0
    10           1  0      10     NaN     NaN     NaN     NaN     NaN     NaN     NaN
    11           1  1      11    10.0     NaN     NaN     NaN     NaN     NaN     NaN
    12           1  2      12    11.0    10.0     NaN     NaN     NaN     NaN     NaN
    13           1  3      13    12.0    11.0    10.0     NaN     NaN     NaN     NaN
    14           1  4      14    13.0    12.0    11.0    10.0     NaN     NaN     NaN
    15           1  5      15    14.0    13.0    12.0    11.0    10.0     NaN     NaN
    16           1  6      16    15.0    14.0    13.0    12.0    11.0    10.0     NaN
    17           1  7      17    16.0    15.0    14.0    13.0    12.0    11.0    10.0
    18           1  8      18    17.0    16.0    15.0    14.0    13.0    12.0    11.0
    19           1  9      19    18.0    17.0    16.0    15.0    14.0    13.0    12.0

The following Pandas function creates the goal output dataframe and runs in roughly 500 ms. I can also accomplish this in Dask using map_partitions and get the same results in 900 ms, presumably worse than Pandas due to the overhead from spinning up a thread. I can also accomplish this in PySpark (note: for both Dask and Spark I have only one partition, to make a fairer comparison with Pandas). With the following code I get the correct results back, although shuffled:

    +----------+---+------+------+------+------+------+------+------+------+
    |customerid|  t|vals_0|vals_1|vals_2|vals_3|vals_4|vals_5|vals_6|vals_7|
    +----------+---+------+------+------+------+------+------+------+------+
    |         1|  3|    13|    12|    11|    10|  null|  null|  null|  null|
    |         1|  0|    10|  null|  null|  null|  null|  null|  null|  null|
    |         1|  1|    11|    10|  null|  null|  null|  null|  null|  null|
    |         0|  9|     9|     8|     7|     6|     5|     4|     3|     2|
    |         0|  1|     1|     0|  null|  null|  null|  null|  null|  null|
    |         1|  4|    14|    13|    12|    11|    10|  null|  null|  null|
    |         0|  4|     4|     3|     2|     1|     0|  null|  null|  null|
    |         0|  3|     3|     2|     1|     0|  null|  null|  null|  null|
    |         0|  7|     7|     6|     5|     4|     3|     2|     1|     0|
    |         1|  5|    15|    14|    13|    12|    11|    10|  null|  null|
    |         1|  6|    16|    15|    14|    13|    12|    11|    10|  null|
    |         0|  6|     6|     5|     4|     3|     2|     1|     0|  null|
    |         1|  7|    17|    16|    15|    14|    13|    12|    11|    10|
    |         0|  8|     8|     7|     6|     5|     4|     3|     2|     1|
    |         0|  0|     0|  null|  null|  null|  null|  null|  null|  null|
    |         0|  2|     2|     1|     0|  null|  null|  null|  null|  null|
    |         1|  2|    12|    11|    10|  null|  null|  null|  null|  null|
    |         1|  9|    19|    18|    17|    16|    15|    14|    13|    12|
    |         0|  5|     5|     4|     3|     2|     1|     0|  null|  null|
    |         1|  8|    18|    17|    16|    15|    14|    13|    12|    11|
    +----------+---+------+------+------+------+------+------+------+------+

But the PySpark version takes significantly longer to run (34 seconds). I kept this example small and simple (only 20 data rows, and only 1 partition for both Dask and Spark), so I would not expect memory and CPU usage to drive significant performance differences.

from question  

Optimizing Pyspark Performance to Match Pandas / Dask?
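The asker's ~500 ms Pandas function is not reproduced in the snippet; one plausible reconstruction of the goal output uses a grouped shift per lag:

    import pandas as pd

    # Rebuild the input dataframe from the question.
    df = pd.DataFrame({
        "customerid": [i // 10 for i in range(20)],
        "t": [i % 10 for i in range(20)],
        "vals": range(20),
    })

    # Shift vals within each customer; lag 0 is the current week's value.
    g = df.groupby("customerid")["vals"]
    for lag in range(8):
        df[f"vals_{lag}"] = g.shift(lag)
    df = df.drop(columns="vals")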

Pandas is far more flexible for working with data, so I often bring parts of Dask dataframes into memory, manipulate columns, and create new ones.

from question  

Add pandas series to dask dataframe

When HDF5 storage can be accessed faster than .csv, and when Dask creates dataframes faster than Pandas, why is Dask from HDF5 slower than Dask from CSV?

from question  

Why do pandas and dask perform better when importing from CSV compared to HDF5?

Maybe Dask is more efficient than Pandas; however, I have never used Dask before.

from question  

Large SQL query (64 million rows) into pandas dataframe takes way too long

This may help those confused by Dask and HDF5 but more familiar with Pandas, like myself.

from question  

"Large data" work flows using pandas

Dask dataframe developers recommend using pandas when feasible

from question  

How can I order entries in a Dask dataframe for display in seaborn?

Looking through the Dask documentation, it says there that, generally speaking, dask.dataframe groupby-aggregations are roughly the same performance as Pandas groupby-aggregations. So unless you're using a Dask distributed client to manage workers, threads, etc., the benefit of using it over vanilla Pandas isn't always there.

from question  

Why does my code take so long to write CSV file in Dask Python
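For the distributed-client case the quote mentions, a minimal sketch (the CSV path and column names are hypothetical):

    from dask.distributed import Client
    import dask.dataframe as dd

    # A local distributed client replaces the default threaded
    # scheduler with a managed worker pool (plus a dashboard).
    client = Client(n_workers=4, threads_per_worker=2)

    ddf = dd.read_csv("data-*.csv")
    print(ddf.groupby("key")["value"].sum().compute())

    client.close()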

In fact, this naive Dask implementation seems to be slower than plain Pandas for larger problem instances.

from question  

Dask broadcast not available during compute graph

The Dask documentation states that Dask's set_index is much more expensive than Pandas'. With that in mind, which of the following should be the best practice? The time column is filled with datetime objects.

from question  

Is it better to set_index in Pandas then convert to Dask or vice versa?
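A sketch of the two orderings the question weighs, on toy data; the sorted=True flag asserts the time column is already ordered, which is what lets Dask skip its expensive shuffle:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({
        "time": pd.date_range("2021-01-01", periods=1000, freq="h"),
        "value": range(1000),
    })

    # Option A: set_index in pandas while the data is one in-memory
    # frame, then convert; Dask inherits a sorted index with no shuffle.
    ddf_a = dd.from_pandas(pdf.set_index("time"), npartitions=4)

    # Option B: convert first, then set_index in Dask; sorted=True tells
    # Dask the column is already ordered, avoiding the shuffle it would
    # otherwise perform.
    ddf_b = dd.from_pandas(pdf, npartitions=4).set_index("time", sorted=True)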

Data comes from Stack Exchange, licensed under CC BY-SA 4.0.