This book introduces the reader to data science using R and the tidyverse. No prior knowledge of college-level programming or mathematics (e.g., calculus or statistics) is required. The book is self-contained, so readers can begin building data science workflows immediately without having to consult extensive external resources. The contents are aimed at undergraduate students but are equally applicable to students at the graduate level and beyond. The book develops concepts through many real-world examples to motivate the reader.
Upon completion of the text, the reader will be able to:
Program proficiently in R
Load and manipulate data frames, and "tidy" them using tidyverse tools
Conduct statistical analyses and draw meaningful inferences from them
Build models from numerical and textual data
Generate data visualizations (numerical and spatial) using ggplot2 and interpret what they represent
The accompanying R package "edsdata" contains the synthetic and real datasets used in the textbook and is intended for further practice. An exercise set is also available; it is designed to be compatible with automated grading tools for instructor use.
As you become familiar with processing data, you learn to build intuition about a dataset by glancing at its values. Unfortunately, glancing only goes so far: when a dataset is large, there is a substantial limit to what you can learn by reading individual values. Visualization is a powerful tool in such cases. In this chapter we introduce another key member of the tidyverse, the ggplot2 package, for visualization. R provides many facilities for creating visualizations; the most sophisticated of them, and perhaps the most elegant, is ggplot2. In this section we begin generating visualizations with ggplot2.
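To make the contrast between glancing at values and plotting them concrete, here is a minimal sketch of a ggplot2 scatter plot. It uses the mpg data frame that ships with ggplot2; the choice of that dataset and of its displ and hwy columns is an assumption made purely for illustration, not an example taken from this text.

```r
library(ggplot2)

# mpg is a data frame bundled with ggplot2 (fuel economy data for 234 cars).
# Glancing at the first few rows shows individual values only.
head(mpg)

# A plot shows the relationship across all rows at once.
# aes() maps columns to visual properties; geom_point() draws one point per row.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  labs(
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon"
  )

# Printing the plot object renders it on the active graphics device.
print(p)
```

The plot is built in layers joined with +: the ggplot() call names the data and the mapping, and each additional layer (here geom_point() and labs()) adds marks or annotations on top.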