Chapter 1: Introduction
Data are descriptions of the world around us, collected through observation and stored on computers. Computers enable us to infer properties of the world from these descriptions. Data science is the discipline of drawing conclusions from data using computation. There are three core aspects of effective data analysis: exploration, prediction, and inference. This text develops a consistent approach to all three, introducing statistical ideas and fundamental ideas in computer science concurrently. We focus on a minimal set of core techniques that apply to a vast range of real-world applications. A foundation in data science requires not only understanding statistical and computational techniques, but also recognizing how they apply to real scenarios.
For whatever aspect of the world we wish to study, whether it's the Earth's weather, the world's markets, political polls, or the human mind, the data we collect typically offer an incomplete description of the subject at hand. A central challenge of data science is to draw reliable conclusions from this partial information.
In this endeavor, we will combine two essential tools: computation and randomization. For example, we may want to understand climate change trends using temperature observations. Computers will allow us to use all available information to draw conclusions. Rather than focusing only on the average temperature of a region, we will consider the whole range of temperatures together to construct a more nuanced analysis. Randomness will allow us to consider the many different ways in which incomplete information might be completed. Rather than assuming that temperatures vary in a particular way, we will learn to use randomness as a way to imagine many possible scenarios that are all consistent with the data we observe.
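As a rough illustration of combining computation with randomization, the sketch below resamples a small set of temperatures many times and looks at how much the average varies across the resamples. The temperature values are made up for illustration, the function name `resampled_means` is hypothetical, and resampling with replacement is only one possible way to imagine alternative scenarios; it is not necessarily the method this text develops later.

```python
import random
import statistics

# Hypothetical temperature observations in degrees Celsius.
# These values are invented for illustration, not real measurements.
temperatures = [14.2, 15.1, 13.8, 16.4, 15.9, 14.7, 17.2, 16.0, 15.3, 14.9]


def resampled_means(data, repetitions=1000, seed=0):
    """Return the means of `repetitions` random resamples of `data`.

    Each resample has the same size as `data` and is drawn with replacement,
    so every resample contains only values that were actually observed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(repetitions):
        resample = rng.choices(data, k=len(data))
        means.append(statistics.mean(resample))
    return means


means = resampled_means(temperatures)
print("Observed mean:", round(statistics.mean(temperatures), 2))
print("Resampled means range from", round(min(means), 2), "to", round(max(means), 2))
```

The spread of the resampled means gives a sense of how different the average might have looked under other plausible samples, which is more informative than the single observed average alone.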
Applying this approach requires learning to program a computer, and so this text interleaves a complete introduction to programming that assumes no prior knowledge. Readers with programming experience will find that we cover several topics in computation that do not appear in a typical introductory computer science curriculum. Data science also requires careful reasoning about quantities, but this text does not assume any background in mathematics or statistics beyond basic algebra. You will find very few equations in this text. Instead, techniques are described to readers in the same language in which they are described to the computers that execute them: a programming language.