Lazy execution
Here we discuss some of the concepts behind Dask and the lazy execution of code. You do not need to go through this material if you are eager to get on with the tutorial, but it may help you understand the concepts underlying Dask, how they fit in with techniques you might already be using, and how to diagnose things that can go wrong.
Prelude
As Python programmers, you probably already perform certain tricks to enable computation of larger-than-memory datasets, parallel execution or delayed/background execution. Perhaps with this phrasing, it is not clear what we mean, but a few examples should make things clearer. The point of Dask is to make simple things easy and complex things possible!
Aside from the detailed introduction, we can summarize the basics of Dask as follows:
process data that doesn’t fit into memory by breaking it into blocks and specifying task chains
parallelize execution of tasks across cores and even nodes of a cluster
move computation to the data rather than the other way around, to minimize communication overhead
All of this allows you to get the most out of your computation resources, but program in a way that is very familiar: for-loops to build basic tasks, Python iterators, and the NumPy (array) and Pandas (dataframe) functions for multi-dimensional or tabular data, respectively.
The remainder of this notebook will take you through the first of these programming paradigms. This is more detail than some users will want; they can skip ahead to the iterator, array and dataframe sections. However, some data processing tasks don't easily fit into those abstractions and need to fall back to the methods here.
We include a few examples at the end of the notebook showing that the ideas behind how Dask is built are not actually that novel, and experienced programmers will have met parts of the design in other situations before. Those examples are left for the interested.
Dask is a graph execution engine
Dask allows you to construct a prescription for the calculation you want to carry out. That may sound strange, but a simple example will demonstrate that you can achieve this while programming with perfectly ordinary Python functions and for-loops. We saw this in Chapter 02.
- [1]:
- from dask import delayed
- @delayed
- def inc(x):
-     return x + 1
- @delayed
- def add(x, y):
-     return x + y
Here we have used the delayed decorator to show that we want these functions to operate lazily: to save the set of inputs and execute only on demand. dask.delayed is also a function which can do this without the decorator syntax, leaving the original function unchanged, e.g.,
- delayed_inc = delayed(inc)
- [2]:
- # this looks like ordinary code
- x = inc(15)
- y = inc(30)
- total = add(x, y)
- # x, y and total are all delayed objects.
- # They contain a prescription of how to execute
Calling a delayed function created a delayed object (x, y, total); examine these interactively. Making these objects is somewhat equivalent to constructs like the lambda or function wrappers (see below). Each holds a simple dictionary describing the task graph, a full specification of how to carry out the computation.
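To make the comparison with function wrappers concrete, here is a minimal plain-Python sketch of a lazy object. The `Lazy` class and `lazy` decorator are hypothetical names invented for this illustration; this is not how Dask is implemented, only a toy showing that a call can be recorded first and carried out later.

```python
class Lazy:
    """Toy stand-in for a delayed object: records a function and its arguments."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Evaluate any lazy arguments first, then call the recorded function.
        args = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*args)


def lazy(func):
    """Hypothetical decorator: calling the wrapped function only records the call."""
    def wrapper(*args):
        return Lazy(func, *args)
    return wrapper


@lazy
def inc(x):
    return x + 1


@lazy
def add(x, y):
    return x + y


x = inc(15)          # nothing is executed yet
y = inc(30)
total = add(x, y)    # still only a recorded prescription
print(total.compute())  # 47
```

Unlike this toy, Dask keeps the whole prescription in a graph that a scheduler can inspect, optimize and run in parallel.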
We can visualize the chain of calculations that the object total corresponds to as follows; the circles are functions, rectangles are data/results.
- [3]:
- total.visualize()
- [3]:
But so far, no functions have actually been executed. This demonstrates the division between the graph-creation part of Dask (delayed(), in this example) and the graph-execution part of Dask.
To run the “graph” in the visualization, and actually get a result, do:
- [4]:
- # execute all tasks
- total.compute()
- [4]:
- 47
Why should you care about this?
By building a specification of the calculation we want to carry out before executing anything, we can pass the specification to an execution engine for evaluation. In the case of Dask, this execution engine could be running on many nodes of a cluster, so you have access to the full number of CPU cores and memory across all the machines. Dask will intelligently execute your calculation with care for minimizing the amount of data held in memory, while parallelizing over the tasks that make up a graph. Notice that in the animated diagram below, where four workers are processing the (simple) graph, execution progresses vertically up the branches first, so that intermediate results can be expunged before moving on to a new branch.
With delayed and normal Pythonic looped code, very complex graphs can be built up and passed on to Dask for execution. See a nice example of a simulated complex ETL workflow.
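The idea of describing independent tasks in a loop and handing them to an engine can be sketched with the standard library alone. The example below uses `concurrent.futures` rather than Dask, and the input values are made up for illustration; it only shows the general pattern of submitting tasks and then combining their results.

```python
from concurrent.futures import ThreadPoolExecutor


def inc(x):
    return x + 1


inputs = [1, 2, 3, 4]  # illustrative values

# Submit each independent task in an ordinary for-loop style,
# then gather the results and combine them.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(inc, value) for value in inputs]
    results = [future.result() for future in futures]

total = sum(results)
print(total)  # 14
```

A thread pool runs tasks as soon as they are submitted, whereas Dask sees the whole graph before running anything, which is what lets it choose an execution order that limits memory use.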
Exercise
We will apply delayed to a real data processing task, albeit a simple one.
Consider reading three CSV files with pd.read_csv and then measuring their total length. We will consider how you would do this with ordinary Python code, then build a graph for this process using delayed, and finally execute this graph using Dask, for a potential speed-up factor of more than two (there are only three inputs to parallelize over). With files as small as these, the gain may not be visible in the timings.
- [5]:
- %run prep.py -d accounts
- [6]:
- import pandas as pd
- import os
- filenames = [os.path.join('data', 'accounts.%d.csv' % i) for i in [0, 1, 2]]
- filenames
- [6]:
- ['data/accounts.0.csv', 'data/accounts.1.csv', 'data/accounts.2.csv']
- [7]:
- %%time
- # normal, sequential code
- a = pd.read_csv(filenames[0])
- b = pd.read_csv(filenames[1])
- c = pd.read_csv(filenames[2])
- na = len(a)
- nb = len(b)
- nc = len(c)
- total = sum([na, nb, nc])
- print(total)
- 30000
- CPU times: user 9.04 ms, sys: 4.17 ms, total: 13.2 ms
- Wall time: 12.9 ms
Your task is to recreate this computation as a graph, using the delayed function on the original Python code. The three functions you want to delay are pd.read_csv, len and sum.
- delayed_read_csv = delayed(pd.read_csv)
- a = delayed_read_csv(filenames[0])
- ...
- total = ...
- # execute
- %time total.compute()
- [8]:
- # your verbose code here
Next, repeat this using loops, rather than writing out all the variables.
- [9]:
- # your concise code here
- [10]:
- ## verbose version
- delayed_read_csv = delayed(pd.read_csv)
- a = delayed_read_csv(filenames[0])
- b = delayed_read_csv(filenames[1])
- c = delayed_read_csv(filenames[2])
- delayed_len = delayed(len)
- na = delayed_len(a)
- nb = delayed_len(b)
- nc = delayed_len(c)
- delayed_sum = delayed(sum)
- total = delayed_sum([na, nb, nc])
- %time print(total.compute())
- ## concise version
- csvs = [delayed(pd.read_csv)(fn) for fn in filenames]
- lens = [delayed(len)(csv) for csv in csvs]
- total = delayed(sum)(lens)
- %time print(total.compute())
- 30000
- CPU times: user 14.6 ms, sys: 4.42 ms, total: 19 ms
- Wall time: 13.6 ms
- 30000
- CPU times: user 11.7 ms, sys: 3.96 ms, total: 15.6 ms
- Wall time: 11.4 ms
Notes
Delayed objects support various operations:
- x2 = x + 1
If x was a delayed result (like total, above), then so is x2. Supported operations include arithmetic operators, item or slice selection, attribute access and method calls: essentially anything that could be phrased as a lambda expression.
Operations which are not supported include mutation, setter methods, iteration (for) and bool (predicate).
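A short sketch of these rules, assuming dask is installed. The wrapped values (10, a small list and a string) are illustrative only and do not come from the tutorial.

```python
from dask import delayed

x = delayed(10)              # wrap a plain value as a delayed object
x2 = x + 1                   # arithmetic stays lazy
items = delayed([1, 2, 3])
first = items[0]             # item selection stays lazy
text = delayed("dask")
upper = text.upper()         # method calls stay lazy

print(x2.compute(), first.compute(), upper.compute())

# Truth-testing a delayed object is not supported.
bool_failed = False
try:
    bool(x2)
except TypeError:
    bool_failed = True

# Neither is iterating over a delayed object of unknown length.
iter_failed = False
try:
    for _ in x2:
        pass
except TypeError:
    iter_failed = True

print(bool_failed, iter_failed)
```

If you need a condition or a loop over the result, call compute() first and work with the concrete value.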
Appendix: Further detail and examples
The following examples show that the kinds of things Dask does are not so far removed from normal Python programming when dealing with big data. These examples are meant only for experts; typical users can continue with the next notebook in the tutorial.
Example 1: simple word count
This directory contains a file called README.md. How would you count the number of words in that file?
The simplest approach would be to load all the data into memory, split on whitespace and count the number of results. Here we use a regular expression to split words.
- [11]:
- import re
- splitter = re.compile(r'\w+')
- with open('README.md', 'r') as f:
-     data = f.read()
- result = len(splitter.findall(data))
- result
- [11]:
- 675
The trouble with this approach is that it does not scale: if the file is very large, it, and the generated list of words, might fill up memory. We can easily avoid that, because we only need a simple sum, and each line is totally independent of the others. Now we evaluate each piece of data and immediately free up the space again, so we could perform this on arbitrarily large files. Note that there is often a trade-off between time-efficiency and memory footprint: the following uses very little memory, but may be slower for files that do not fill a large fraction of memory. In general, one would like chunks small enough not to stress memory, but big enough for efficient use of the CPU.
- [12]:
- result = 0
- with open('README.md', 'r') as f:
-     for line in f:
-         result += len(splitter.findall(line))
- result
- [12]:
- 675
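The chunk-size trade-off mentioned above can be sketched by processing lines in batches rather than one at a time. The `count_words` helper and its `batch_size` parameter are hypothetical names for this illustration, and an in-memory `io.StringIO` stands in for a real file.

```python
import io
import re
from itertools import islice

splitter = re.compile(r"\w+")


def count_words(lines, batch_size=1000):
    """Count words by pulling up to batch_size lines into memory at a time."""
    total = 0
    lines = iter(lines)
    while True:
        batch = list(islice(lines, batch_size))
        if not batch:
            break
        # Only the current batch is held in memory while it is counted.
        total += sum(len(splitter.findall(line)) for line in batch)
    return total


text = io.StringIO("one two three\nfour five\n\nsix\n")  # stand-in for a file
print(count_words(text, batch_size=2))  # 6
```

A larger `batch_size` means fewer passes through the loop at the cost of holding more lines in memory at once.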
Example 2: background execution
There are many tasks that take a while to complete, but don’t actually require much of the CPU, for example anything that requires communication over a network, or input from a user. In typical sequential programming, execution would need to halt while the process completes, and then continue. That would be dreadful for the user experience (imagine the slow progress bar that locks up the application and cannot be canceled), and wasteful of time (the CPU could have been doing useful work in the meantime).
For example, we can launch processes and get their output as follows:
- import subprocess
- p = subprocess.Popen(command, stdout=subprocess.PIPE)
- p.returncode
The task is run in a separate process. The return code remains None while the process is running and is updated when you check on the process (for example with p.poll()); it becomes 0 after a successful exit. To get the output back, we need out = p.communicate()[0] (which would block if the process was not complete).
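A runnable version of this pattern might look as follows. The command is an assumption made for the example: it simply asks the current Python interpreter to print a line, so that the sketch does not depend on any external program.

```python
import subprocess
import sys

# Hypothetical command for illustration: run a tiny Python program in a child process.
command = [sys.executable, "-c", "print('hello from the child process')"]

p = subprocess.Popen(command, stdout=subprocess.PIPE)

# communicate() waits for the process to finish and returns (stdout, stderr).
out = p.communicate()[0]

print(p.returncode)          # 0 after a successful exit
print(out.decode().strip())  # hello from the child process
```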
Similarly, we can launch Python processes and threads in the background. Some methods allow mapping over multiple inputs and gathering the results; more on that later. In the example below, the thread starts and the cell completes immediately, but the data associated with the download only appears in the queue object some time later.
- [13]:
- import threading
- import queue
- import urllib.request
- def get_webdata(url, q):
-     u = urllib.request.urlopen(url)
-     # raise ValueError
-     q.put(u.read())
- q = queue.Queue()
- t = threading.Thread(target=get_webdata, args=('http://www.google.com', q))
- t.start()
- [14]:
- # fetch result back into this thread. If the worker thread is not done, this would wait.
- q.get()
- [14]:
- b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head>... [output truncated] ...</body></html>'
Consider: what would you see if there had been an exception within the get_webdata function? You could uncomment the raise line, above, and re-execute the two cells. What happens? Is there any way to debug the execution?
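One possible way to make such failures visible, sketched here with the standard library, is to run the work through `concurrent.futures`, where an exception raised in the worker is stored and re-raised in the calling thread when the result is requested. The `failing_task` function is a hypothetical stand-in for get_webdata with the raise line enabled.

```python
from concurrent.futures import ThreadPoolExecutor


def failing_task():
    # Stand-in for a worker function that hits an error.
    raise ValueError("something went wrong in the worker")


caught = None
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(failing_task)
    try:
        future.result()  # re-raises the worker's exception in this thread
    except ValueError as err:
        caught = err

print("caught in the main thread:", caught)
```

With a bare thread and queue, by contrast, nothing is ever put on the queue if the worker fails, so a blocking q.get() would wait indefinitely.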
Example 3: delayed execution
There are many ways in Python to specify the computation you want to execute, but only run it later.
- [15]:
- def add(x, y):
-     return x + y
- # Sometimes we defer computations with strings
- x = 15
- y = 30
- z = "add(x, y)"
- eval(z)
- [15]:
- 45
- [16]:
- # we can use lambda or other "closure"
- x = 15
- y = 30
- z = lambda: add(x, y)
- z()
- [16]:
- 45
- [17]:
- # A very similar thing happens in functools.partial
- import functools
- z = functools.partial(add, x, y)
- z()
- [17]:
- 45
- [18]:
- # Python generators are delayed execution by default
- # Many Python functions expect such iterable objects
- def gen():
-     res = x
-     yield res
-     res += y
-     yield res
- g = gen()
- [19]:
- # run once: we get one value and execution halts within the generator
- # run again and the execution completes
- next(g)
- [19]:
- 15
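To see the halting behaviour more fully, the sketch below steps a similar generator to completion. The generator name `running_total` is made up for this illustration; the values 15 and 30 match those used above.

```python
x = 15
y = 30


def running_total():
    res = x
    yield res      # execution pauses here after the first next()
    res += y
    yield res      # and here after the second


g = running_total()
first = next(g)    # runs up to the first yield
second = next(g)   # resumes and runs up to the second yield

finished = False
try:
    next(g)        # nothing left to yield
except StopIteration:
    finished = True

print(first, second, finished)  # 15 45 True
```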
Dask graphs
Any Dask object, such as total, above, has an attribute which describes the calculations necessary to produce that result. Indeed, this is exactly the graph that we have been talking about, which can be visualized. Converted with dict(), it is a simple dictionary: the keys are unique task identifiers, and the values are the functions and inputs for calculation.
delayed is a handy mechanism for creating the Dask graph, but the adventurous may wish to play with the full flexibility afforded by building the graph dictionaries directly. Detailed information can be found here.
- [20]:
- total.dask
- [20]:
- <dask.highlevelgraph.HighLevelGraph at 0x7f51ec40c3d0>
- [21]:
- dict(total.dask)
- [21]:
- {'sum-2cba994e-3d48-4b6c-9e19-ec8736bd2061': (<function sum(iterable, start=0, /)>,
- ['len-fe832380-ef28-4371-a714-38e1be9f9090',
- 'len-4aade36e-f075-48b0-b3c8-46043abd743c',
- 'len-d711ce73-f051-4417-878d-b17c26f7a4c0']),
- 'len-fe832380-ef28-4371-a714-38e1be9f9090': (<function len(obj, /)>,
- 'read_csv-600224a0-6ab0-448e-977c-169453850a60'),
- 'read_csv-600224a0-6ab0-448e-977c-169453850a60': (<function pandas.io.parsers._make_parser_function.<locals>.parser_f(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)>,
- 'data/accounts.0.csv'),
- 'len-4aade36e-f075-48b0-b3c8-46043abd743c': (<function len(obj, /)>,
- 'read_csv-5ea84818-d499-4ae1-adb6-69acf9aec2f8'),
- 'read_csv-5ea84818-d499-4ae1-adb6-69acf9aec2f8': (<function pandas.io.parsers._make_parser_function.<locals>.parser_f(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)>,
- 'data/accounts.1.csv'),
- 'len-d711ce73-f051-4417-878d-b17c26f7a4c0': (<function len(obj, /)>,
- 'read_csv-3003f3a5-05bc-4e4a-9a2b-ccb5ecf8c8cd'),
- 'read_csv-3003f3a5-05bc-4e4a-9a2b-ccb5ecf8c8cd': (<function pandas.io.parsers._make_parser_function.<locals>.parser_f(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)>,
- 'data/accounts.2.csv')}
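To illustrate how a dictionary of this shape can drive a computation, here is a toy evaluator written in plain Python. The `toy_get` function and the key names are hypothetical and invented for this sketch; it is not Dask's scheduler and handles only the simple patterns seen in the output above (a function followed by keys, or by a list of keys).

```python
def inc(x):
    return x + 1


# Keys are task identifiers; values are literals or (function, *arguments) tuples.
graph = {
    "x": 15,
    "y": 30,
    "inc-x": (inc, "x"),
    "inc-y": (inc, "y"),
    "total": (sum, ["inc-x", "inc-y"]),
}


def toy_get(graph, key):
    """Recursively evaluate one key of a dictionary-style task graph."""

    def resolve(arg):
        if isinstance(arg, list):
            return [resolve(a) for a in arg]
        if isinstance(arg, str) and arg in graph:
            return toy_get(graph, arg)  # the argument names another task
        return arg                      # the argument is a plain value

    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        return func(*(resolve(a) for a in args))
    return task


print(toy_get(graph, "total"))  # 47
```

A real scheduler additionally avoids recomputing shared tasks, runs independent tasks in parallel and releases intermediate results when they are no longer needed.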