Changelog

2.9.1 / 2019-12-27

Array

Core

DataFrame

Documentation

2.9.0 / 2019-12-06

Array

Core

DataFrame

Documentation

2.8.1 / 2019-11-22

Array

Core

DataFrame

Documentation

2.8.0 / 2019-11-14

Array

Bag

Core

DataFrame

Documentation

2.7.0 / 2019-11-08

This release drops support for Python 3.5

Array

Core

DataFrame

Documentation

2.6.0 / 2019-10-15

Core

DataFrame

Documentation

2.5.2 / 2019-10-04

Array

DataFrame

Documentation

2.5.0 / 2019-09-27

Core

DataFrame

Documentation

2.4.0 / 2019-09-13

Array

Core

DataFrame

Documentation

2.3.0 / 2019-08-16

Array

Bag

Core

DataFrame

Documentation

2.2.0 / 2019-08-01

Array

Bag

Core

DataFrame

Documentation

2.1.0 / 2019-07-08

Array

Core

DataFrame

Documentation

2.0.0 / 2019-06-25

Array

Core

DataFrame

Documentation

1.2.2 / 2019-05-08

Array

Bag

Core

DataFrame

Documentation

1.2.1 / 2019-04-29

Array

Core

DataFrame

Documentation

1.2.0 / 2019-04-12

Array

Core

DataFrame

Documentation

1.1.5 / 2019-03-29

Array

Core

DataFrame

Documentation

1.1.4 / 2019-03-08

Array

Core

DataFrame

Documentation

1.1.3 / 2019-03-01

Array

DataFrame

Documentation

1.1.2 / 2019-02-25

Array

Bag

DataFrame

Documentation

Core

1.1.1 / 2019-01-31

Array

DataFrame

Delayed

Documentation

Core

  • Work around psutil 5.5.0 not allowing Process objects to be pickled Janne Vuorela

1.1.0 / 2019-01-18

Array

DataFrame

Documentation

Core

1.0.0 / 2018-11-28

Array

DataFrame

Documentation

Core

0.20.2 / 2018-11-15

Array

Dataframe

Documentation

0.20.1 / 2018-11-09

Array

Core

Dataframe

Documentation

0.20.0 / 2018-10-26

Array

Bag

Core

Dataframe

Documentation

0.19.4 / 2018-10-09

Array

Bag

Dataframe

Core

Documentation

0.19.3 / 2018-10-05

Array

Bag

Dataframe

Core

Documentation

0.19.2 / 2018-09-17

Array

Core

Documentation

0.19.1 / 2018-09-06

Array

Dataframe

Documentation

0.19.0 / 2018-08-29

Array

DataFrame

Core

Docs

0.18.2 / 2018-07-23

Array

Bag

Dataframe

Delayed

Core

0.18.1 / 2018-06-22

Array

DataFrame

Core

0.18.0 / 2018-06-14

Array

Dataframe

Bag

Core

0.17.5 / 2018-05-16

Array

DataFrame

0.17.4 / 2018-05-03

Dataframe

0.17.3 / 2018-05-02

Array

DataFrame

Core

  • Support traversing collections in persist, visualize, and optimize (GH#3410) Jim Crist
  • Add scheduler= keyword to compute and persist, as sketched below. This replaces common use of the get= keyword (GH#3448) Matthew Rocklin
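
A minimal sketch of the scheduler= keyword named above; the array and the scheduler choice are purely illustrative:

    import dask.array as da

    x = da.ones((1000, 1000), chunks=(100, 100))
    # Choose the scheduler per call rather than passing a get= function
    total = x.sum().compute(scheduler="threads")
    print(total)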

0.17.2 / 2018-03-21

Array

DataFrame

Bag

Core

0.17.1 / 2018-02-22

Array

DataFrame

Core

0.17.0 / 2018-02-09

Array

DataFrame

Bag

  • Document that the bag.map_partitions function may receive either a list or a generator. (GH#3150) Nir

Core

0.16.1 / 2018-01-09

Array

DataFrame

Core

0.16.0 / 2017-11-17

This is a major release. It includes breaking changes, new protocols, and a large number of bug fixes.

Array

DataFrame

Core

0.15.4 / 2017-10-06

Array

  • da.random.choice now works with array arguments (GH#2781)
  • Support indexing in arrays with np.int (fixes regression) (GH#2719)
  • Handle zero dimension with rechunking (GH#2747)
  • Support -1 as an alias for “size of the dimension” in chunks (GH#2749)
  • Call mkdir in array.to_npy_stack (GH#2709)
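
A rough illustration of the -1 chunk alias mentioned above; the shapes are hypothetical:

    import dask.array as da

    # -1 means "use the full extent of this dimension" for that axis
    x = da.ones((10000, 100), chunks=(1000, -1))
    print(x.chunks)  # the second axis is a single chunk of width 100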

DataFrame

  • Added the .str accessor to Categoricals with string categories (GH#2743)
  • Support int96 (spark) datetimes in parquet writer (GH#2711)
  • Pass on file scheme to fastparquet (GH#2714)
  • Support Pandas 0.21 (GH#2737)

Bag

  • Add tree reduction support for foldby (GH#2710)

Core

  • Drop s3fs from pip install dask[complete] (GH#2750)

0.15.3 / 2017-09-24

Array

  • Add masked arrays (GH#2301)
  • Add *_like array creation functions (GH#2640)
  • Indexing with unsigned integer array (GH#2647)
  • Improved slicing with boolean arrays of different dimensions (GH#2658)
  • Support literals in top and atop (GH#2661)
  • Optional axis argument in cumulative functions (GH#2664)
  • Improve tests on scalars with assert_eq (GH#2681)
  • Fix norm keepdims (GH#2683)
  • Add ptp (GH#2691)
  • Add apply_along_axis (GH#2690) and apply_over_axes (GH#2702)

DataFrame

  • Added Series.str[index] (GH#2634)
  • Allow the groupby by param to handle columns and index levels (GH#2636)
  • DataFrame.to_csv and Bag.to_textfiles now return the filenames to which they have written (GH#2655)
  • Fix combination of partition_on and append in to_parquet (GH#2645)
  • Fix for parquet file schemes (GH#2667)
  • Repartition works with mixed categoricals (GH#2676)
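
A small sketch of the Series.str[index] accessor and of grouping by a mix of index levels and columns, both mentioned above; the column and index names are made up:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"name": ["alice", "bob", "alice"], "x": [1, 2, 3]},
                       index=pd.Index(["a", "b", "a"], name="key"))
    ddf = dd.from_pandas(pdf, npartitions=1)

    first_letters = ddf.name.str[0]                # element-wise string indexing
    totals = ddf.groupby(["key", "name"]).x.sum()  # index level plus column
    print(first_letters.compute())
    print(totals.compute())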

Core

  • python setup.py test now runs tests (GH#2641)
  • Added new cheatsheet (GH#2649)
  • Remove resize tool in Bokeh plots (GH#2688)

0.15.2 / 2017-08-25

Array

Bag

  • Remove deprecated Bag behaviors (GH#2525)

DataFrame

Core

  • Remove bare except: blocks everywhere (GH#2590)

0.15.1 / 2017-07-08

  • Add storage_options to to_textfiles and to_csv (GH#2466)
  • Rechunk and simplify rfftfreq (GH#2473), (GH#2475)
  • Better support ndarray subclasses (GH#2486)
  • Import star in dask.distributed (GH#2503)
  • Threadsafe cache handling with tokenization (GH#2511)
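
A brief sketch of passing storage_options through to_csv; the bucket name and credentials are placeholders:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
    # storage_options is forwarded to the underlying filesystem layer (s3fs here)
    ddf.to_csv("s3://my-bucket/output-*.csv",
               storage_options={"key": "ACCESS_KEY", "secret": "SECRET_KEY"})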

0.15.0 / 2017-06-09

Array

Bag

  • Fix bug where reductions on bags with no partitions would fail (GH#2324)
  • Add broadcasting and variadic db.map top-level function (see the sketch after this list). Also remove auto-expansion of tuples as map arguments (GH#2339)
  • Rename Bag.concat to Bag.flatten (GH#2402)
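
A minimal sketch of the variadic, broadcasting db.map; the bag contents are illustrative:

    import dask.bag as db

    a = db.from_sequence([1, 2, 3], npartitions=1)
    b = db.from_sequence([10, 20, 30], npartitions=1)
    # Map over several bags at once; plain scalars broadcast to every element
    c = db.map(lambda x, y, z: x + y + z, a, b, 100)
    print(c.compute())  # [111, 122, 133]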

DataFrame

Core

  • Move dask.async module to dask.local (GH#2318)
  • Support callbacks with nested scheduler calls (GH#2397)
  • Support pathlib.Path objects as uris (GH#2310)

0.14.3 / 2017-05-05

DataFrame

  • Pandas 0.20.0 support

0.14.2 / 2017-05-03

Array

  • Add da.indices (GH#2268), da.tile (GH#2153), da.roll (GH#2135)
  • Simultaneously support drop_axis and new_axis in da.map_blocks (GH#2264)
  • Rechunk and concatenate work with unknown chunksizes (GH#2235) and (GH#2251)
  • Support non-numpy container arrays, notably sparse arrays (GH#2234)
  • Tensordot contracts over multiple axes (GH#2186)
  • Allow delayed targets in da.store (GH#2181)
  • Support interactions against lists and tuples (GH#2148)
  • Constructor plugins for debugging (GH#2142)
  • Multi-dimensional FFTs (single chunk) (GH#2116)
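
A quick sketch exercising a few of the creation and manipulation functions listed above:

    import dask.array as da

    grid = da.indices((3, 4), chunks=(3, 4))            # like np.indices
    tiled = da.tile(da.arange(3, chunks=3), 2)          # like np.tile
    rolled = da.roll(da.arange(10, chunks=5), shift=2)  # circular shift, like np.roll
    print(grid.shape, tiled.compute(), rolled.compute())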

Bag

  • to_dataframe enforces consistent types (GH#2199)

DataFrame

  • Set_index always fully sorts the index (GH#2290)
  • Support compatibility with pandas 0.20.0 (GH#2249), (GH#2248), and (GH#2246)
  • Support Arrow Parquet reader (GH#2223)
  • Time-based rolling windows (GH#2198)
  • Repartition can now create more partitions, not just fewer (GH#2168)

Core

0.14.1 / 2017-03-22

Array

  • Micro-optimize optimizations (GH#2058)
  • Change slicing optimizations to avoid fusing raw numpy arrays (GH#2075) (GH#2080)
  • Dask.array operations now work on numpy arrays (GH#2079)
  • Reshape now works in a much broader set of cases (GH#2089)
  • Support deepcopy python protocol (GH#2090)
  • Allow user-provided FFT implementations in da.fft (GH#2093)

DataFrame

  • Fix to_parquet with empty partitions (GH#2020)
  • Optional npartitions='auto' mode in set_index (GH#2025)
  • Optimize shuffle performance (GH#2032)
  • Support efficient repartitioning along time windows like repartition(freq='12h') (GH#2059)
  • Improve speed of categorize (GH#2010)
  • Support single-row dataframe arithmetic (GH#2085)
  • Automatically avoid shuffle when setting index with a sorted column (GH#2091)
  • Improve handling of integer NA values in read_csv (GH#2098)
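
A rough sketch of npartitions='auto' in set_index and freq-based repartitioning, both mentioned above; the data is illustrative:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"t": pd.date_range("2017-01-01", periods=48, freq="h"),
                        "x": range(48)})
    ddf = dd.from_pandas(pdf, npartitions=4)

    by_time = ddf.set_index("t", npartitions="auto")  # let dask choose the partition count
    twice_daily = by_time.repartition(freq="12h")     # partitions aligned to 12-hour windows
    print(twice_daily.npartitions)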

Delayed

  • Repeated attribute access on delayed objects uses the same key (GH#2084)

Core

  • Improve naming of nodes in dot visuals to avoid generic apply (GH#2070)
  • Ensure that worker processes have different random seeds (GH#2094)

0.14.0 / 2017-02-24

Array

Bag

  • Repartition can now increase number of partitions (GH#1934)
  • Fix bugs in some reductions with empty partitions (GH#1939), (GH#1950), (GH#1953)

DataFrame

  • Support non-uniform categoricals (GH#1877), (GH#1930)
  • Groupby cumulative reductions (GH#1909)
  • DataFrame.loc indexing now supports lists (GH#1913)
  • Improve multi-level groupbys (GH#1914)
  • Improved HTML and string repr for DataFrames (GH#1637)
  • Parquet append (GH#1940)
  • Add dd.demo.daily_stock function for teaching (GH#1992)

Delayed

  • Add traverse= keyword to delayed to optionally avoid traversing nested data structures (GH#1899)
  • Support Futures in from_delayed functions (GH#1961)
  • Improve serialization of decorated delayed functions (GH#1969)
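
A minimal sketch of the traverse= keyword added above; the nested structure is illustrative:

    from dask import delayed

    data = {"a": [1, 2, 3], "b": [4, 5]}
    # By default delayed traverses containers and wraps each leaf; traverse=False
    # treats the whole object as one opaque value instead.
    whole = delayed(data, traverse=False)
    print(whole.compute())  # {'a': [1, 2, 3], 'b': [4, 5]}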

Core

  • Improve Windows path parsing in corner cases (GH#1910)
  • Rename tasks when fusing (GH#1919)
  • Add top level persist function (GH#1927)
  • Propagate errors= keyword in byte handling (GH#1954)
  • Dask.compute traverses Python collections (GH#1975)
  • Structural sharing between graphs in dask.array and dask.delayed (GH#1985)
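
A small sketch of the top-level persist function and of dask.compute traversing Python collections, both noted above:

    import dask
    import dask.array as da

    x = da.ones((100, 100), chunks=(50, 50))
    y = x + 1

    (x,) = dask.persist(x)                                  # top-level persist on any collection
    (results,) = dask.compute({"total": x.sum(), "mean": y.mean()})
    print(results)                                          # nested structures are traversed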

0.13.0 / 2017-01-02

Array

  • Mandatory dtypes on dask.array. All operations maintain dtype information and UDF functions like map_blocks now require a dtype= keyword if it cannot be inferred (see the sketch after this list). (GH#1755)
  • Support arrays without known shapes, such as arise when slicing arrays with arrays or converting dataframes to arrays (GH#1838)
  • Support mutation by setting one array with another (GH#1840)
  • Tree reductions for covariance and correlations. (GH#1758)
  • Add SerializableLock for better use with distributed scheduling (GH#1766)
  • Improved atop support (GH#1800)
  • Rechunk optimization (GH#1737), (GH#1827)
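
A brief sketch of the dtype= requirement for user-defined functions such as map_blocks; the arrays and function are illustrative:

    import numpy as np
    import dask.array as da

    x = da.ones((8, 8), chunks=(4, 4))
    # For arbitrary user functions, tell dask the output dtype explicitly
    y = x.map_blocks(lambda block: (block > 0).astype(np.int8), dtype=np.int8)
    print(y.dtype, y.sum().compute())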

Bag

  • Avoid wrong results when recomputing the same groupby twice (GH#1867)

DataFrame

Delayed

  • Changed behaviour for delayed(nout=0) and delayed(nout=1): delayed(nout=1) no longer defaults to out=None, and delayed(nout=0) is also enabled. That is, functions returning tuples of length 1 or 0 can now be handled correctly. This is especially handy if functions with a variable number of outputs are wrapped by delayed, e.g. the trivial example delayed(lambda args: args, nout=len(vals))(vals)
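
Expanding the trivial example above into a runnable sketch:

    from dask import delayed

    vals = (1, 2, 3)
    # nout=len(vals) unpacks the single delayed call into one Delayed per output
    a, b, c = delayed(lambda args: args, nout=len(vals))(vals)
    print(a.compute(), b.compute(), c.compute())  # 1 2 3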

Core

0.12.0 / 2016-11-03

DataFrame

  • Return a series when functions given to dataframe.map_partitions return scalars (GH#1515)
  • Fix type size inference for series (GH#1513)
  • dataframe.DataFrame.categorize no longer includes missing values in the categories. This is for compatibility with a pandas change (GH#1565)
  • Fix head parser error in dataframe.read_csv when some lines have quotes (GH#1495)
  • Add dataframe.reduction and series.reduction methods to apply generic row-wise reduction to dataframes and series (GH#1483)
  • Add dataframe.select_dtypes, which mirrors the pandas method (GH#1556)
  • dataframe.read_hdf now supports reading Series (GH#1564)
  • Support Pandas 0.19.0 (GH#1540)
  • Implement select_dtypes (GH#1556)
  • String accessor works with indexes (GH#1561)
  • Add pipe method to dask.dataframe (GH#1567)
  • Add indicator keyword to merge (GH#1575)
  • Support Series in read_hdf (GH#1575)
  • Support Categories with missing values (GH#1578)
  • Support inplace operators like df.x += 1 (GH#1585)
  • Str accessor passes through args and kwargs (GH#1621)
  • Improved groupby support for single-machine multiprocessing scheduler (GH#1625)
  • Tree reductions (GH#1663)
  • Pivot tables (GH#1665)
  • Add clip (GH#1667), align (GH#1668), combine_first (GH#1725), and any/all (GH#1724)
  • Improved handling of divisions on dask-pandas merges (GH#1666)
  • Add groupby.aggregate method (GH#1678)
  • Add dd.read_table function (GH#1682)
  • Improve support for multi-level columns (GH#1697) (GH#1712)
  • Support 2d indexing in loc (GH#1726)
  • Extend resample to include DataFrames (GH#1741)
  • Support dask.array ufuncs on dask.dataframe objects (GH#1669)

Array

  • Add information about how the dask.array chunks argument works (GH#1504)
  • Fix field access with non-scalar fields in dask.array (GH#1484)
  • Add concatenate= keyword to atop to concatenate chunks of contracted dimensions
  • Optimized slicing performance (GH#1539) (GH#1731)
  • Extend atop with concatenate= (GH#1609), new_axes= (GH#1612), and adjust_chunks= (GH#1716) keywords
  • Add clip (GH#1610), swapaxes (GH#1611), round (GH#1708), and repeat
  • Automatically align chunks in atop-backed operations (GH#1644)
  • Cull dask.arrays on slicing (GH#1709)

Bag

  • Fix issue with callables in bag.from_sequence being interpreted as tasks (GH#1491)
  • Avoid non-lazy memory use in reductions (GH#1747)

Administration

  • Added changelog (GH#1526)
  • Create new threadpool when operating from thread (GH#1487)
  • Unify example documentation pages into one (GH#1520)
  • Add versioneer for git-commit based versions (GH#1569)
  • Pass through node_attr and edge_attr keywords in dot visualization (GH#1614)
  • Add continuous testing for Windows with Appveyor (GH#1648)
  • Remove use of multiprocessing.Manager (GH#1653)
  • Add global optimizations keyword to compute (GH#1675)
  • Micro-optimize get_dependencies (GH#1722)

0.11.0 / 2016-08-24

Major Points

DataFrames now enforce knowing full metadata (columns, dtypes) everywhere. Previously we would operate in an ambiguous state when functions lost dtype information (such as apply). Now all dataframes always know their dtypes and raise errors asking for information if they are unable to infer (which they usually can). Some internal attributes like _pd and _pd_nonempty have been moved.
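
Loosely illustrating the enforced metadata with an explicit meta= argument; this assumes the meta= keyword available around this release for functions whose output schema cannot be inferred:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"x": [1.0, 2.0, 3.0]}), npartitions=2)
    # When dask cannot infer the output schema, describe it explicitly
    result = ddf.x.map_partitions(lambda s: s * 2, meta=("x", "f8"))
    print(result.compute())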

The internals of the distributed scheduler have been refactored to transition tasks between explicit states. This improves resilience, reasoning about scheduling, plugin operation, and logging. It also makes the scheduler code easier to understand for newcomers.

Breaking Changes

  • The distributed.s3 and distributed.hdfs namespaces are gone. Use protocols in normal methods like read_text('s3://…') instead.
  • Dask.array.reshape now errs in some cases where previously it would have created a very large number of tasks

0.10.2 / 2016-07-27

  • More DataFrame shuffles now work in distributed settings, ranging from setting an index, to hash joins, to sorted joins and groupbys.
  • Dask passes the full test suite when run under Python's optimized -OO mode.
  • On-disk shuffles were found to produce wrong results in some highly concurrent situations, especially on Windows. This has been resolved by a fix to the partd library.
  • Fixed a growth of open file descriptors that occurred under large data communications
  • Support ports in the --bokeh-whitelist option of dask-scheduler to better route web interface messages behind non-trivial network settings
  • Some improvements to resilience to worker failure (though other known failures persist)
  • You can now start an IPython kernel on any worker for improved debugging and analysis
  • Improvements to dask.dataframe.read_hdf, especially when reading from multiple files, along with improved docs

0.10.0 / 2016-06-13

Major Changes

  • This version drops support for Python 2.6
  • Conda packages are built and served from conda-forge
  • The dask.distributed executables have been renamed from dfoo to dask-foo. For example, dscheduler is renamed to dask-scheduler
  • Both Bag and DataFrame include a preliminary distributed shuffle.

Bag

  • Add task-based shuffle for distributed groupbys
  • Add accumulate for cumulative reductions

DataFrame

  • Add a task-based shuffle suitable for distributed joins, groupby-applys, and set_index operations. The single-machine shuffle remains untouched (and much more efficient).
  • Add support for new Pandas rolling API with improved communication performance on distributed systems.
  • Add groupby.std/var
  • Pass through S3/HDFS storage options in read_csv
  • Improve categorical partitioning
  • Add eval, info, isnull, notnull for dataframes

Distributed

  • Rename executables like dscheduler to dask-scheduler
  • Improve scheduler performance in the many-fast-tasks case (important for shuffling)
  • Improve work stealing to be aware of expected function run-times and data sizes. This drastically increases the breadth of algorithms that can be efficiently run on the distributed scheduler without significant user expertise.
  • Support maximum buffer sizes in streaming queues
  • Improve Windows support when using the Bokeh diagnostic web interface
  • Support compression of very-large-bytestrings in protocol
  • Support clean cancellation of submitted futures in Joblib interface

Other

  • All dask-related projects (dask, distributed, s3fs, hdfs, partd) are now building conda packages on conda-forge.
  • Change credential handling in s3fs to only pass around delegated credentials if explicitly given secret/key. The default now is to rely on managed environments. This can be changed back by explicitly providing a keyword argument. Anonymous mode must be explicitly declared if desired.

0.9.0 / 2016-05-11

API Changes

  • dask.do and dask.value have been renamed to dask.delayed
  • dask.bag.from_filenames has been renamed to dask.bag.read_text
  • All S3/HDFS data ingest functions like db.from_s3 or distributed.s3.read_csv have been moved into the plain read_text, read_csv functions, which now support protocols, like dd.read_csv('s3://bucket/keys*.csv')
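
A short sketch of the protocol-style paths described above; the bucket name and key pattern are placeholders:

    import dask.bag as db
    import dask.dataframe as dd

    # The protocol prefix selects the storage backend directly in the path
    lines = db.read_text("s3://my-bucket/logs/*.txt")
    df = dd.read_csv("s3://my-bucket/keys-*.csv")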

Array

  • Add support for scipy.LinearOperator
  • Improve optional locking to on-disk data structures
  • Change rechunk to expose the intermediate chunks

Bag

  • Rename from_filenames to read_text
  • Remove from_s3 in favor of read_text('s3://…')

DataFrame

  • Fixed numerical stability issue for correlation and covariance
  • Allow no-hash from_pandas for speedy round-trips to and from pandas objects
  • Generally reengineered read_csv to be more in line with Pandas behavior
  • Support fast set_index operations for sorted columns

Delayed

  • Rename do/value to delayed
  • Rename to/from_imperative to to/from_delayed

Distributed

  • Move s3 and hdfs functionality into the dask repository
  • Adaptively oversubscribe workers for very fast tasks
  • Improve PyPy support
  • Improve work stealing for unbalanced workers
  • Scatter data efficiently with tree-scatters

Other

  • Add lzma/xz compression support
  • Raise a warning when trying to split unsplittable compression types, like gzip or bz2
  • Improve hashing for single-machine shuffle operations
  • Add new callback method for start state
  • General performance tuning

0.8.1 / 2016-03-11

Array

  • Bugfix for range slicing that could periodically lead to incorrect results.
  • Improved support and resiliency of arg reductions (argmin, argmax, etc.)

Bag

  • Add zip function

DataFrame

  • Add corr and cov functions
  • Add melt function
  • Bugfixes for io to bcolz and hdf5

0.8.0 / 2016-02-20

Array

  • Changed default array reduction split from 32 to 4
  • Linear algebra: tril, triu, LU, inv, cholesky, solve, solve_triangular, eye, lstsq, diag, corrcoef.

Bag

  • Add tree reductions
  • Add range function
  • Drop from_hdfs function (better functionality now exists in hdfs3 and distributed projects)

DataFrame

  • Refactor dask.dataframe to include a full empty pandas dataframe as metadata. Drop the .columns attribute on Series
  • Add Series categorical accessor, series.nunique, drop the .columns attribute for series.
  • read_csv fixes (multi-column parse_dates, integer column names, etc. )
  • Internal changes to improve graph serialization

Other

  • Documentation updates
  • Add from_imperative and to_imperative functions for all collections
  • Aesthetic changes to profiler plots
  • Moved the dask project to a new dask organization

0.7.6 / 2016-01-05

Array

  • Improve thread safety
  • Tree reductions
  • Add view, compress, hstack, dstack, vstack methods
  • map_blocks can now remove and add dimensions

DataFrame

  • Improve thread safety
  • Extend sampling to include replacement options

Imperative

  • Removed optimization passes that fused results.

Core

  • Removed dask.distributed
  • Improved performance of blocked file reading
  • Serialization improvements
  • Test Python 3.5

0.7.4 / 2015-10-23

This was mostly a bugfix release. Some notable changes:

  • Fix minor bugs associated with the release of numpy 1.10 and pandas 0.17
  • Fixed a bug with random number generation that would cause repeated blocks due to the birthday paradox
  • Use locks in dask.dataframe.read_hdf by default to avoid concurrency issues
  • Change dask.get to point to dask.async.get_sync by default
  • Allow visualization functions to accept general graphviz graph options like rankdir='LR'
  • Add reshape and ravel to dask.array
  • Support the creation of dask.arrays from dask.imperative objects

Deprecation

This release also includes a deprecation warning for dask.distributed, which will be removed in the next version.

Future development in distributed computing for dask is happening here: https://distributed.dask.org. General feedback on that project is most welcome from this community.

0.7.3 / 2015-09-25

Diagnostics

  • A utility for profiling memory and CPU usage has been added to the dask.diagnostics module.

DataFrame

This release improves coverage of the pandas API. Among other things it includes nunique, nlargest, and quantile. Fixes encoding issues with reading non-ASCII CSV files. Performance improvements and bug fixes with resample. More flexible read_hdf with globbing. And many more. Various bug fixes in dask.imperative and dask.bag.

0.7.0 / 2015-08-15

DataFrame

This release includes significant bugfixes and alignment with the Pandas API. This has resulted both from use and from recent involvement by Pandas core developers.

  • New operations: query, rolling operations, drop
  • Improved operations: quantiles, arithmetic on full dataframes, dropna, constructor logic, merge/join, elemwise operations, groupby aggregations

Bag

  • Fixed a bug in fold when used with a null default argument

Array

  • New operations: da.fft module, da.image.imread

Infrastructure

  • The array and dataframe collections create graphs with deterministic keys. These tend to be longer (hash strings) but should be consistent between computations. This will be useful for caching in the future.
  • All collections (Array, Bag, DataFrame) inherit from a common subclass

0.6.1 / 2015-07-23

Distributed

  • Improved (though not yet sufficient) resiliency for dask.distributed when workers die

DataFrame

  • Improved writing to various formats, including to_hdf, to_castra, and to_csv
  • Improved creation of dask DataFrames from dask Arrays and Bags
  • Improved support for categoricals and various other methods

Array

  • Various bug fixes
  • Histogram function

Scheduling

  • Added tie-breaking ordering of tasks within parallel workloads to better handle and clear intermediate results

Other

  • Added the dask.do function for explicit construction of graphs with normal Python code
  • Traded pydot for the graphviz library for graph printing to support Python 3
  • There is also a Gitter chat room and a Stack Overflow tag