# Exploratory data analysis

STAT 167 Introduction to Data Science
Data transformation with dplyr
Wenxiu Ma
wenxiu.ma@ucr.edu
4/26/2022
© W Ma 2022 1/ 44
Exploratory data analysis – data transformation
A typical data science project workflow
Visualization is an important tool for insight generation, but it is rare that
you get the data in exactly the right form you need.
Often you’ll need to create some new variables or summaries, or maybe you
just want to rename the variables or reorder the observations in order to
make the data a little easier to work with.
© W Ma 2022 2/ 44
Data transformation with dplyr
Wickham and Grolemund. “R for Data Science: Import, Tidy,
Transform, Visualize, and Model Data.
” 1st Edition (2017)
Chapter 3 “Data Transformation with dplyr” in print.
Chapter 5 “Data transformation” @
CRAN R Vignette: https://cran.rproject.org/web/packages/dplyr/vignettes/dplyr.html
dplyr cheatsheet: https://raw.githubusercontent.com/rstudio/
cheatsheets/main/data-transformation.pdf
© W Ma 2022 3/ 44
What can you do with dplyr?
© W Ma 2022 4/ 44
What can you do with dplyr?
Pick observations by their values – filter()
Reorder the rows – arrange()
Pick variables by their names – select()
Create new variables with functions of existing variables –
mutate()
Collapse many values down to a single summary –
summarise()
The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 5/ 44
All dplyr functions share similar grammar
The first argument is a data frame.
The subsequent arguments describe what to do with the data
frame, using the variable names (without quotes).
The result is a new data frame.
Together these properties make it easy to chain together multiple
simple steps to achieve a complex result –
piping (%>%)!
© W Ma 2022 6/ 44
Example: nycflights13::flights
This data frame contains all 336,776 flights that departed from New York City in
2013. The data comes from the US Bureau of Transportation Statistics.
install.packages(“nycflights13”)
library(nycflights13)
help(
package=“nycflights13”)
?flights
# full documentation of flights
# View(flights) # see the data in RStudio Viewer
flights
## # A tibble: 336,776 x 19

##
##

year month
day dep_time sched_dep_time dep_delay arr_time sched_arr_tim

<int> <int> <int>
<int>
517
533
542
544
554
554
555
557
557
558

<int>
515
529
540
545
600
558
600
600
600
600

<dbl>
2
4
2
-1
-6
-4
-5
-3
-3
-2

<int>
830
850
923
1004
812
740
913
709
838
753

<int
81
83
85
102
83
72
85
72
84
74

## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013

1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1

## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,

## #
## #

carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,

air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm

© W Ma 2022 7/ 44
Type of variables
int stands for integers.
dbl stands for doubles, or real numbers.
chr stands for character vectors, or strings.
dttm stands for date-times (a date + a time).
Some other common types of variables:
lgl stands for logical, vectors that contain only TRUE or
FALSE.
fctr stands for factors, which R uses to represent categorical
variables with fixed possible values.
ord stands for ordered factors
date stands for dates.
© W Ma 2022 8/ 44
What can you do with dplyr?
Pick observations by their values – ‘filter()’
Reorder the rows – arrange()
Pick variables by their names – select()
Create new variables with functions of existing variables –
mutate()
Collapse many values down to a single summary –
summarise()
The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 9/ 44
Pick observations by their values – filter()
filter() allows you to subset observations based on their values.
The first argument is the name of the data frame.
The second and subsequent arguments are the expressions that
filter the data frame.
filter(flights, month == 1, day == 1)
## # A tibble: 842 x 19

##
##

year month
day dep_time sched_dep_time dep_delay arr_time sched_arr_tim

<int> <int> <int>
<int>
517
533
542
544
554
554
555
557
557
558

<int>
515
529
540
545
600
558
600
600
600
600

<dbl>
2
4
2
-1
-6
-4
-5
-3
-3
-2

<int>
830
850
923
1004
812
740
913
709
838
753

<int
81
83
85
102
83
72
85
72
84
74

## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013

1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1

## # … with 832 more rows, and 11 more variables: arr_delay <dbl>,

## #
## #

carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,

air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm

© W Ma 2022 10/ 44
Pick observations by their values – filter()
filter() allows you to subset observations based on their values.
The output of filter() is a new data frame
# print out the new data frame only
filter(flights, month == 1, day == 1)
# save the new data frame to a variable, no print out
jan1 <- filter(flights, month == 1, day == 1)
# save and print out the new data frame
(jan1 <- filter(flights, month == 1, day == 1))
© W Ma 2022 11/ 44
Selection criteria
Select observations using comparisons
>>=<<=!= (not equal), and == (equal).
Multiple arguments to
filter() are combined with logical
operations
& is “and”, | is “or”, and ! is “not”
Complete set of boolean operations
© W Ma 2022 12/ 44
Combination of selection criteria
filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(
1112))
filter(flights, !(arr_delay >
120 | dep_delay > 120))
filter(flights, arr_delay <=
120, dep_delay <= 120)
© W Ma 2022 13/ 44
Peculiarities of floating-point numbers
The R floating-point data type is a double, a.k.a. numeric
The more bits in the fraction part, the more precision
Rounding errors tend to accumulate in long calculations
When results should be ≈ 0, errors can flip signs
0.45 == 3*0.15
## [1] FALSE
0.45 – 3*0.15
## [1] 5.551115e-17
sqrt(2== 2
## [1] FALSE
sqrt(2– 2
## [1] 4.440892e-16
1/49*49 == 1
## [1] FALSE
1/49*49 – 1
## [1] -1.110223e-16
© W Ma 2022 14/ 44
Usually better to use all.equal() or near() instead of exact
comparison
0.45 == 3*0.15
## [1] FALSE
all.equal(0.453*0.15)
## [1] TRUE
near(0.453*0.15)
## [1] TRUE
© W Ma 2022 15/ 44
Missing values – NA (“not available”)
NA represents an unknown value so missing values are “contagious”
Almost any operation involving an unknown value will also be unknown.
NA > 5
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA
NA / 2
## [1] NA
NA == NA
## [1] NA
If you want to determine if a value is missing, use is.na()
<- NA
is.na(x)
## [1] TRUE
© W Ma 2022 16/ 44
filter() with NA values
filter() only includes rows where the condition is TRUE
It excludes both FALSE and NA values.
If you want to preserve missing values, ask for them explicitly:
(df <- tibble(x = c(1, NA, 3)))
## # A tibble: 3 x 1

##
##
## 1
## 2
## 3

x
<dbl>
1
NA
3

filter(df, x > 1)
## # A tibble: 1 x 1

##
##
## 1

x
<dbl>
3

filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1

##
##
## 1
## 2

x
<dbl>
NA
3

© W Ma 2022 17/ 44
Count the number of NAs
(df <- tibble(x = c(1, NA, 3, NA)))
## # A tibble: 4 x 1

##
##
## 1
## 2
## 3
## 4

x
<dbl>
1
NA
3
NA

filter(df, is.na(x)) %>% count()
## # A tibble: 1 x 1

##
##
## 1

n
<int>
2

© W Ma 2022 18/ 44
Combine multiple operations with the pipe %>%
Piping focuses on the transformations, not what’s being
transformed, which makes the code easier to read.
x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z)
turns into g(f(x, y), z) and so on
df <- tibble(x = c(1, NA, 3, NA))
# step-by-step transformation
df_na <- filter(df, is.na(x))
count(df_na)
## # A tibble: 1 x 1
## n
## <int>
## 1 2
# PIPING
# filter(df, is.na(x)) %>% count()
df %>% filter(is.na(x)) %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 2
© W Ma 2022 19/ 44
What can you do with dplyr?
Pick observations by their values – filter()
Reorder the rows – ‘arrange()’
Pick variables by their names – select()
Create new variables with functions of existing variables –
mutate()
Collapse many values down to a single summary –
summarise()
The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 20/ 44
Reorder the rows – arrange()
arrange() takes a data frame and a set of column names (or
more complicated expressions) to order by.
When more than one column name are given, each additional
column will be used to break ties in the values of preceding
columns.
arrange(flights, year, month, day)
## # A tibble: 336,776 x 19

##
##

year month
day dep_time sched_dep_time dep_delay arr_time sched_arr_tim

<int> <int> <int>
<int>
517
533
542
544
554
554
555
557
557
558

<int>
515
529
540
545
600
558
600
600
600
600

<dbl>
2
4
2
-1
-6
-4
-5
-3
-3
-2

<int>
830
850
923
1004
812
740
913
709
838
753

<int
81
83
85
102
83
72
85
72
84
74

## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013

1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1

## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,

## #
## #

carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,

air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm

© W Ma 2022 21/ 44
Descending order – desc()
By default, arrange() sorts the columns in an ascending order
To re-order by a column in descending order, use desc()
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19

##
##

year month
day dep_time sched_dep_time dep_delay arr_time sched

<int> <int> <int>
<int>
641
1432
1121
1139
845
1100
2321
959
2257
756

<int>
900
1935
1635
1845
1600
1900
810
1900
759
1700

<dbl>
1301
1137
1126
1014
1005
960
911
899
898
896

<int>
1242
1607
1239
1457
1044
1342
135
1236
121
1058

## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013

1
6
1
9
7
4
3
6
7
12

9
15
10
20
22
10
17
27
22
5

## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,

## #
## #

carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <c

air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_ho

© W Ma 2022 22/ 44
Descending order – desc()
desc() transforms one single vector only
To sort multiple columns, use desc() on each column
# arrange(flights, desc(year, month, day)) # doesn’t work!
arrange(flights, desc(year), desc(month), desc(day))
## # A tibble: 336,776 x 19

##
##

year month
day dep_time sched_dep_time dep_delay arr_time sched

<int> <int> <int>
<int>
13
18
26
459
514
549
550
552
553
554

<int>
2359
2359
2245
500
515
551
600
600
600
550

<dbl>
14
19
101
-1
-1
-2
-10
-8
-7
4

<int>
439
449
129
655
814
925
725
811
741
1024

## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013

12
12
12
12
12
12
12
12
12
12

31
31
31
31
31
31
31
31
31
31

## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,

## #
## #

carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <c

air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_ho

© W Ma 2022 23/ 44
Reorder the rows with NAs
Missing values are always sorted at the end
df <- tibble(x = c(52, NA))
arrange(df, x)
## # A tibble: 3 x 1

##
##
## 1
## 2
## 3

x
<dbl>
2
5
NA

arrange(df, desc(x))
## # A tibble: 3 x 1

##
##
## 1
## 2
## 3

x
<dbl>
5
2
NA

© W Ma 2022 24/ 44
What can you do with dplyr?
Pick observations by their values – filter()
Reorder the rows – arrange()
Pick variables by their names – ‘select()’
Create new variables with functions of existing variables –
mutate()
Collapse many values down to a single summary –
summarise()
The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 25/ 44
Pick variables by their names – select()
When you have large dataset (hundreds or more variables), the
first challenge is often narrowing in on the variables you’re
actually interested in.
select() allows you to rapidly zoom in on a useful subset
using operations based on the names of the variables.
© W Ma 2022 26/ 44
Pick variables by their names – select()
# Select columns by name
# select(flights, year, month, day)
# Select all columns between year and day (inclusive)
select(flights, year:day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
© W Ma 2022 27/ 44
Pick variables by their names – select()
# Remove columns by names
# select(data, -year, -month, -day)
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 x 16

##
##
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10

dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_dela

<int>
517
533
542
544
554
554
555
557
557
558

<int>
515
529
540
545
600
558
600
600
600
600

<dbl>
2
4
2
-1
-6
-4
-5
-3
-3
-2

<int>
830
850
923
1004
812
740
913
709
838
753

<int>
819
830
850
1022
837
728
854
723
846
745

<dbl
1
2
3
-1
-2
1
1
-1

## # … with 336,766 more rows, and 9 more variables: flight <int>,

## #
## #

tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance

hour <dbl>, minute <dbl>, time_hour <dttm>

© W Ma 2022 28/ 44
Pick variables by their names – select()
There are a number of functions you can use within select():
starts_with(“abc”): matches names that begin with “abc”.
ends_with(“xyz”): matches names that end with “xyz”.
contains(“ijk”): matches names that contain “ijk”.
matches(“(.)1”): selects variables that match a regular
expression. This one matches any variables that contain
stringr @
num_range(“x”, 1:3): matches x1, x2 and x3. It is useful
when numbers were included in column names.
one_of(…): selects columns names that are from a group of
names. It is useful when columns are named as a vector or
character string.
everything(): selects all columns.
See ’?select’ for more details.
© W Ma 2022 29/ 44
Rename variables – rename()
rename(), which is a variant of select() that keeps all the
variables that aren’t explicitly mentioned.
# select(flights, tail_num = tailnum) # only one variable left
rename(flights, tail_num = tailnum)
## # A tibble: 336,776 x 19

##
##

year month
day dep_time sched_dep_time dep_delay arr_time sched

<int> <int> <int>
<int>
517
533
542
544
554
554
555
557
557
558

<int>
515
529
540
545
600
558
600
600
600
600

<dbl>
2
4
2
-1
-6
-4
-5
-3
-3
-2

<int>
830
850
923
1004
812
740
913
709
838
753

## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013

1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1

## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,

## #
## #

carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest <
air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_ho

© W Ma 2022 30/ 44
What can you do with dplyr?
Pick observations by their values – filter()
Reorder the rows – arrange()
Pick variables by their names – select()
Create new variables with functions of existing variables –
‘mutate()’
Collapse many values down to a single summary –
summarise()
The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 31/ 44
Create new variables – mutate()
mutate() adds new columns with functions of existing columns
flights_sml <- select(flights, month:day, ends_with(“delay”),
distance, air_time)
mutate(flights_sml,
gain = arr_delay – dep_delay,
speed = distance / air_time * 60)
## # A tibble: 336,776 x 8

##
##
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10

month
day dep_delay arr_delay distance air_time gain speed

<int> <int>
<dbl>
2
4
2
-1
-6
-4
-5
-3
-3
-2

<dbl>
11
20
33
-18
-25
12
19
-14
-8
8

<dbl>
1400
1416
1089
1576
762
719
1065
229
944
733

<dbl> <dbl> <dbl>

1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1

227
227
160
183
116
150
158
53
140
138

9 370.
16 374.
31 408.
-17 517.
-19 394.
16 288.
24 404.
-11 259.
-5 405.
10 319.

## # … with 336,766 more rows
© W Ma 2022 32/ 44
Create new variables – transmute()
If you only want to keep the new variables, use transmute()
transmute(flights,
gain = arr_delay – dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 3

##
##
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10

gain hours gain_per_hour

<dbl> <dbl>
9 3.78
16 3.78
31 2.67
-17 3.05
-19 1.93
16 2.5
24 2.63
-11 0.883
-5 2.33
10 2.3

<dbl>
2.38
4.23
11.6
-5.57
-9.83
6.4
9.11
-12.5
-2.14
4.35

## # … with 336,766 more rows
© W Ma 2022 33/ 44
Useful creation functions
Arithmetic operators: +*/ˆ
Modular arithmetic: %/% (integer division) and %% (remainder),
where
x == y * (x %/% y) + (x %% y).
It is a handy tool to break integers up into pieces.
transmute(flights, dep_time, hour = dep_time %/% 100,
minute = dep_time %% 100)
## # A tibble: 336,776 x 3
## dep_time hour minute
## <int> <dbl> <dbl>
## 1 517 5 17
## 2 533 5 33
## 3 542 5 42
## 4 544 5 44
## 5 554 5 54
## 6 554 5 54
## 7 555 5 55
## 8 557 5 57
## 9 557 5 57
## 10 558 5 58
## # … with 336,766 more rows
© W Ma 2022 34/ 44
Useful creation functions
Logs: log()log2()log10()
lagging values
This allows you to compute running differences (e.g. x –
lag(x)
) or find when values change (x != lag(x)).
They are most useful in conjunction with group_by().
(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 2 3 4 5 6 7 8 9 10 NA
## [1] 3 4 5 6 7 8 9 10 NA NA
lag(x, 1)
## [1] NA 1 2 3 4 5 6 7 8 9
lag(x, 2)
## [1] NA NA 1 2 3 4 5 6 7 8
x – lag(x)
## [1] NA 1 1 1 1 1 1 1 1 1
© W Ma 2022 35/ 44
Useful creation functions
Cumulative and rolling aggregates
R provides functions for running sums, products, mins and
maxes:
cumsum()cumprod()cummin()cummax()
dplyr provides cummean() for cumulative means.
x
## [1] 1 2 3 4 5 6 7 8 9 10
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
cummean(x)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
© W Ma 2022 36/ 44
Useful creation functions
Logical comparisons, <<=>>===!=.
Ranking: there are a number of ranking functions, but you
min_rank().
Look at the variants row_number()dense_rank(),
percent_rank()cume_dist()ntile()
<- c(122, NA, 34)
min_rank(y)
## [1] 1 2 2 NA 4 5
desc(y) # transform a vector into a format that will be sorted in descen
## [1] -1 -2 -2 NA -3 -4
min_rank(desc(y))
## [1] 5 3 3 NA 2 1
© W Ma 2022 37/ 44
What can you do with dplyr?
Pick observations by their values – filter()
Reorder the rows – arrange()
Pick variables by their names – select()
Create new variables with functions of existing variables –
mutate()
Collapse many values down to a single summary – ‘summarise()’
The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.
© W Ma 2022 38/ 44
Grouped summaries with summarise()
summarise() collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.6
© W Ma 2022 39/ 44
Grouped summaries with summarise()
summarize() is useful when pair it with group_by()
by_day <- group_by(flights, year, month, day)
summarise(by_day,
delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by ‘year’, ‘month’. You can override
## `.groups` argument.
## # A tibble: 365 x 4

## # Groups:
year, month [12]

##
##

year month
day delay

<int> <int> <int> <dbl>

## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013

1
1
1
1
1
1
1
1
1
1

1 11.5
2 13.9
3 11.0
4 8.95
5 5.73
6 7.15
7 5.42
8 2.55
9 2.28
10 2.84

## # … with 355 more rows
© W Ma 2022 40/ 44
Useful summary functions
Measures of location: mean(x)median(x)
sometimes useful to combine aggregation with logical
subsetting.
flights %>% group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay, na.rm = T),
# the average positive delay
avg_delay2 = mean(arr_delay[arr_delay > 0], na.rm = T))
## `summarise()` has grouped output by ‘year’, ‘month’. You can override
## `.groups` argument.
## # A tibble: 365 x 5

## # Groups:
year, month [12]

##
##

year month
day avg_delay1 avg_delay2
41/ 44

<int> <int> <int>
<dbl>
12.7
12.7
5.73
-1.93
-1.53
4.24
-4.95
-3.23
-0.264

<dbl>
32.5
32.0
27.7
28.3
22.6
24.4
27.8
20.8
25.6

## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
1
1
1
1
1
1
1
1
1

1
2
3
4
5
6
7
8
9

Useful summary functions
Measures of spread: sd(x)IQR(x) (interquartile range),
Measures of rank: min(x)quantile(x, 0.25)max(x)
Measures of position: first(x)nth(x, 2)last(x)
These work similarly to x[1]x[2], and x[length(x)] but let
you set a default value if that position does not exist
Counts: n()sum(!is.na(x))n_distinct(x)
Combine summary functions with logical values: sum(x > 10),
mean(y == 0).
© W Ma 2022 42/ 44
Summarise multiple columns
# count # of NAs per column
colSums(is.na(flights))

##
##
##
##
##
##
##
##

year
month
0

day
0

dep_time sched_dep_time

0
8255
arr_delay
9430
dest

0
carrier
0
air_time
9430

dep_delay
8255
flight

arr_time sched_arr_time

8713
tailnum
2512
hour
0

0
origin
0
minute
0

0
0

distance
time_hour

0
0

flights %>% summarise_all(funs(sum(is.na(.)))) %>% print(width=Inf)

## # A tibble: 1 x 19

##
##
## 1
##
##
## 1
##
##
## 1

year month
day dep_time sched_dep_time dep_delay arr_time sched_arr_time

<int> <int> <int>
<int>
8255

<int>
0

<int>
8255

<int>
8713

<int>

0
0
0
0

arr_delay carrier flight tailnum origin dest air_time distance hour minut

<int>
9430
time_hour
<int>
0

<int> <int>
<int> <int> <int>
<int>
9430

<int> <int> <int

0
0
2512
0
0
0
0

© W Ma 2022 43/ 44
Summary – dplyr basic functions
Pick observations by their values – filter()
Reorder the rows – arrange()
Pick variables by their names – ‘select()’
Create new variables with functions of existing variables –
mutate()
Collapse many values down to a single summary –
summarise()
The above five can be used in conjunction with group_by()
which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of
data
manipulation/wrangling
.

GET THE WORK DONE AT ESSAYLINK.NET

Exploratory data analysis
Scroll to top
WhatsApp
Hello! Need help with your assignments? We are here