-
Notifications
You must be signed in to change notification settings - Fork 79
/
14_TimeSeriesAnalysis.Rmd
1873 lines (1350 loc) · 63.4 KB
/
14_TimeSeriesAnalysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Time Series Analysis {#TimeSeriesAnalysis}
Introduction {-}
------------
Time series analysis has become a hot topic with the rise of
quantitative finance and automated trading of securities. Many of the
facilities described in this chapter were invented by practitioners and
researchers in finance, securities trading, and portfolio management.
Before you start any time series analysis in R, a key decision is your
choice of data representation (object class). This is especially
critical in an object-oriented language such as R, because the choice
affects more than how the data is stored; it also dictates which
functions (methods) will be available for loading, processing,
analyzing, printing, and plotting your data. When many people start using
R they simply store time series data in vectors. That
seems natural. However, they quickly discover that none of the coolest
analytics for time series analysis work with simple vectors. We've found when users
switch to using an object class intended for time series data,
the analysis gets easier, opening a gateway to valuable functions and analytics.
This chapter’s first recipe recommends using the `zoo` or `xts` packages
for representing time series data. They are quite general and should
meet the needs of most users. Nearly every subsequent recipe assumes you
are using one of those two representations.
> **Note**
>
> The `xts` implementation is a superset of `zoo`, so `xts` can do
> everything that `zoo` can do. In this chapter, whenever a recipe works
> for a `zoo` object, you can safely assume (unless stated otherwise)
> that it also works for an `xts` object.
### Other Representations {-}
Other representations of time series data are available in the R universe, including:
* `fts` package
* `irts` from the `tseries` package
* `timeSeries` package
* `ts` (base distribution)
* `tsibble` package, a tidyverse style package for time series
In fact, there is a whole toolkit, called `tsbox`, just for converting between representations.
Two representations deserve special mention.
#### ts (base distribution) {-}
The base distribution of R includes a time series class called `ts`. We
don’t recommend this representation for general use because the
implementation itself is too limited and restrictive.
However, the base distribution includes some important time series analytics
that depend upon `ts`, such as the autocorrelation function (`acf`)
and the cross-correlation function (`ccf`).
To use those base functions on `xts` data, use the `to.ts` function
to "downshift" your data into the `ts` representation before calling the function.
For example, if `x` is an `xts` object, you can compute its autocorrelation
like this:
```
acf(as.ts(x))
```
#### tsibble package {-}
The `tsibble` package is a recent extension to the tidyverse,
specifically designed for working with time series data
within the tidyverse.
We find it useful for *cross-sectional* data—that is, data for which the observations are grouped by date,
and you want to perform analytics *within* dates more
than *across* dates.
### Date Versus Datetime {-}
Every observation in a time series has an associated date or time. The
object classes used in this chapter, `zoo` and `xts`, give you the
choice of using either dates or datetimes for representing the data’s
time component. You would use dates to represent daily data, of course,
and also for weekly, monthly, or even annual data; in these cases, the
date gives the day on which the observation occurred. You would use
datetimes for intraday data, where both the date and time of observation
are needed.
In describing this chapter’s recipes, we found it pretty cumbersome to keep
saying “date or datetime.” So we simplified the prose by assuming that
your data are daily and thus use whole dates. Please bear in mind, of
course, that you are free and able to use timestamps below the
resolution of a calendar date.
### See Also {-}
R has many useful functions and packages for time series analysis.
You’ll find pointers to them in the
[task view for Time Series Analysis](http://cran.r-project.org/web/views/TimeSeries.html).
Representing Time Series Data {#recipe-id076}
-----------------------------
### Problem {-#problem-id076}
You want an R data structure that can represent time series data.
### Solution {-#solution-id076}
We recommend the `zoo` and `xts` packages. They define a data structure
for time series, and they contain many useful functions for working with
time series data. Create a `zoo` object this way where `x` is a vector, matrix,
or data frame, and `dt` is a vector of corresponding dates or datetimes:
``` {r, eval=FALSE}
library(zoo)
ts <- zoo(x, dt)
```
Create an `xts` object in this way:
``` {r, eval=FALSE}
library(xts)
ts <- xts(x, dt)
```
Convert between representations of the time series data by using
`as.zoo` and `as.xts`:
`as.zoo(ts)`
: Converts `ts` to a `zoo` object
`as.xts(ts)`
: Converts `ts` to an `xts` object
### Discussion {-#discussion-id076}
R has at least eight different implementations of data structures for
representing time series. We haven’t tried them all, but we can say that `zoo` and `xts` are excellent packages for working with
time series data and better than the others that we have tried.
These representations assume you have two vectors: a vector of
observations (data) and a vector of dates or times of those
observations. The `zoo` function combines them into a `zoo` object:
``` {r}
library(zoo)
x <- c(3, 4, 1, 4, 8)
dt <- seq(as.Date("2018-01-01"), as.Date("2018-01-05"), by = "days")
ts <- zoo(x, dt)
print(ts)
```
The `xts` function is similar, returning an `xts` object:
``` {r}
library(xts)
ts <- xts(x, dt)
print(ts)
```
The data, `x`, should be numeric. The vector of dates or datetimes,
`dt`, is called the *index*. Legal indices vary between the packages:
`zoo`
: The index can be any ordered values, such as `Date` objects,
`POSIXct` objects, integers, or even floating-point values.
`xts`
: The index must be a supported date or time class. This includes
`Date`, `POSIXct`, and `chron` objects. Those should be sufficient
for most applications, but you can also use `yearmon`, `yearqtr`,
and `dateTime` objects. The `xts` package is more restrictive than
`zoo` because it implements powerful operations that require a
time-based index.
The following example creates a `zoo` object that contains the price of
IBM stock for the first five days of 2010; it uses `Date` objects for
the index:
``` {r}
prices <- c(132.45, 130.85, 130.00, 129.55, 130.85)
dates <- as.Date(c(
"2010-01-04", "2010-01-05", "2010-01-06",
"2010-01-07", "2010-01-08"
))
ibm.daily <- zoo(prices, dates)
print(ibm.daily)
```
In contrast, the next example captures the price of IBM stock at
one-second intervals. It represents time by the number of hours past
midnight starting at 9:30 a.m. (1 second = 0.00027778 hours, more or less):
``` {r}
prices <- c(131.18, 131.20, 131.17, 131.15, 131.17)
seconds <- c(9.5, 9.500278, 9.500556, 9.500833, 9.501111)
ibm.sec <- zoo(prices, seconds)
print(ibm.sec)
```
Those two examples used a single time series, where the data came from a
vector. Both `zoo` and `xts` can also handle multiple, parallel time
series. For this, capture the several time series in a matrix or data frame and
then create a multivariate time series by calling the `zoo` (or `xts`)
function:
``` {r, eval=FALSE}
ts <- zoo(df, dt) # OR: ts <- xts(dfrm, dt)
```
The second argument is a vector of dates (or datetimes) for each
observation. There is only one vector of dates for all the time series;
in other words, all observations in each row of the matrix or data frame must have
the same date. See Recipe \@ref(recipe-id079), ["Merging Several Time Series"](#recipe-id079) if your
data has mismatched dates.
Once the data is captured inside a `zoo` or `xts` object, you can
extract the pure data via `coredata`, which returns a simple vector (or
matrix):
``` {r}
coredata(ibm.daily)
```
You can extract the date or time portion via `index`:
``` {r}
index(ibm.daily)
```
The `xts` package is very similar to `zoo`. It is optimized for
speed, so is especially well suited for processing large volumes of
data. It is also clever about converting to and from other time series
representations.
One big advantage of capturing data inside a `zoo` or `xts` object is
that special-purpose functions become available for printing, plotting,
differencing, merging, periodic sampling, applying rolling functions,
and other useful operations. There is even a function, `read.zoo`,
dedicated to reading time series data from ASCII files.
Remember that the `xts` package can do everything that the `zoo` package
can do, so everywhere that this chapter talks about `zoo` objects you
can also use `xts` objects.
If you are a serious user of time series data, we strongly recommend
studying the documentation of these packages in order to learn about the
ways they can improve your life. They are rich packages with many useful
features.
### See Also {-#see_also-id076}
See CRAN for documentation on
[`zoo`](http://cran.r-project.org/web/packages/zoo/) and
[`xts`](http://cran.r-project.org/web/packages/xts/), including reference manuals, vignettes, and quick reference cards. If the
packages are already installed on your computer, view their
documentation using the `vignette` function:
``` {r, eval=FALSE}
vignette("zoo")
vignette("xts")
```
The `timeSeries` package is another good implementation of a time series
object. It is part of the Rmetrics project for quantitative finance.
Plotting Time Series Data {#recipe-id089}
-------------------------
### Problem {-#problem-id089}
You want to plot one or more time series.
### Solution {-#solution-id089}
Use `plot(x)`, which works for `zoo` objects and `xts` objects
containing either single or multiple time series.
For a simple vector `v` of time series observations, you can use either
`plot(v,type = "l")` or `plot.ts(v)`.
### Discussion {-#discussion-id089}
The generic `plot` function has a version for `zoo` objects and `xts`
objects. It can plot objects that contain a single time series or
multiple time series. In the latter case, it can plot each series in a
separate plot or together in one plot.
Suppose that `ibm.infl` is a `zoo` object that contains two time series.
One shows the quoted price of IBM stock from January 2000 through December 2017, and the other is that same price adjusted for inflation. If you plot the object, R will plot the two time series together in one plot, as shown in Figure \@ref(fig:ibmplot):
``` {r ibmplot, fig.cap='Example xts plot'}
load(file = "./data/ibm.rdata")
library(xts)
main <- "IBM: Historical vs. Inflation-Adjusted"
lty <- c("dotted", "solid")
# Plot the xts object
plot(ibm.infl,
lty = lty, main = main,
legend.loc = "left"
)
```
The `plot` function for `xts` provides a default title as simply the name of the `xts` object. As we show here, it's common to set the `main` parameter to a more meaningful title.
The code specifies two line types (`lty`) so that the two lines are
drawn in two different styles, making them easier to distinguish.
### See Also {-#see_also-id089}
For working with financial data, the `quantmod` package contains special
plotting functions that produce beautiful, stylized plots.
Extracting the Oldest or Newest Observations {#recipe-id077}
--------------------------------------------
### Problem {-#problem-id077}
You want to see only the oldest or newest observations of your time
series.
### Solution {-#solution-id077}
Use `head` to view the oldest observations:
``` {r, eval=FALSE}
head(ts)
```
Use `tail` to view the newest observations:
``` {r, eval=FALSE}
tail(ts)
```
### Discussion {-#discussion-id077}
The `head` and `tail` functions are generic, so they will work whether
your data is stored in a simple vector, a `zoo` object, or an `xts`
object.
Suppose you have a `xts` object with a multiyear history of the price of
IBM stock like the one used in the prior recipe. You can’t display the whole dataset because it would scroll
off your screen. But you can view the initial observations:
``` {r}
ibm <- ibm.infl$ibm # grab one column for illustration
head(ibm)
```
And you can view the final observations:
``` {r}
tail(ibm)
```
By default, `head` and `tail` show (respectively) the six oldest and six
newest observations. You can see more observations by providing a second
argument—for example, `tail(ibm, 20)`.
The `xts` package also includes `first` and `last` functions, which use
calendar periods instead of number of observations. We can use `first`
and `last` to select data by number of days, weeks, months, or even
years:
``` {r}
first(ibm, "2 week")
```
At first glance this output might be confusing. We asked for `"2 week"` and `xts` returned six days. That might seem off until we look at a calendar of January 2000.
```{r, echo=FALSE, fig.width=1, fig.cap='January 2000 calendar'}
knitr::include_graphics("images_v2/cal.png")
```
We can see from the calendar that the first week of January 2000 has only one day, Saturday the 1st. Then the second week runs from the 2nd to the 8th. Our data has no value for the 8th, so when we ask `first` for the first `"2 week"` it returns all the values from the first two calendar weeks. In our example dataset the first two calendar weeks contain only six values.
Similarly, we can ask `last` to give us the last month's worth of data:
```{r}
last(ibm, "month")
```
If we had been using `zoo` objects here, we would need to have converted them to `xts` objects before passing the objects to `first` or `last`, as those are `xts` functions.
### See Also {-}
See `help(first.xts)` and `help(last.xts)` for details on the `first`
and `last` functions, respectively.
> **Warning**
>
> The `tidyverse` package `dplyr` also has functions called `first` and `last`. If your workflow involves loading both `xts` and `dplyr` packages, make sure to be explicit about which function you are calling by using the `*package*::*function*` notation (for example, `xts::first`).
Subsetting a Time Series {#recipe-id281}
------------------------
### Problem {-#problem-id281}
You want to select one or more elements from a time series.
### Solution {-#solution-id281}
You can index a `zoo` or `xts` object by position. Use one or two
subscripts, depending upon whether the object contains one time series
or multiple time series:
`ts[i]`
: Selects the *i*th observation from a single time series
`ts[j,i]`
: Selects the *i*th observation of the *j*th time series of multiple
time series
You can index the time series by date. Use the same type of
object as the index of your time series. This example assumes that the
index contains `Date` objects:
``` {r, eval=FALSE}
ts[as.Date("yyyy-mm-dd")]
```
You can index it by a sequence of dates:
``` {r, eval=FALSE}
dates <- seq(startdate, enddate, increment)
ts[dates]
```
The `window` function can select a range by start and end date:
``` {r, eval=FALSE}
window(ts, start = startdate, end = enddate)
```
### Discussion {-#discussion-id281}
Recall our `xts` object that is a sample of inflation-adjusted IBM stock prices from the previous recipe:
``` {r}
head(ibm)
```
We can select an observation by position, just like selecting elements
from a vector (see Recipe \@ref(recipe-id039), ["Selecting Vector Elements"](#recipe-id039)):
``` {r}
ibm[2]
```
We can also select multiple observations by position:
``` {r}
ibm[2:4]
```
Sometimes it’s more useful to select by date.
Simpy use the date itself as the index.
``` {r}
ibm[as.Date("2010-01-05")]
```
Our `ibm` data is an `xts` object, so we can use date-like subsetting, too. (The `zoo` object does not offer this flexibility.)
```
ibm['2010-01-05']
ibm['20100105']
```
We can also select by a vector of `Date` objects:
``` {r}
dates <- seq(as.Date("2010-01-04"), as.Date("2010-01-08"), by = 2)
ibm[dates]
```
The `window` function is easier for selecting a range of consecutive
dates:
``` {r}
window(ibm, start = as.Date("2010-01-05"), end = as.Date("2010-01-07"))
```
We can select a year/month combination using `yyyymm` subsetting:
```{r, eval=FALSE}
ibm['201001'] # Jan 2010
```
Select year ranges using `/` like so:
```{r, eval=FALSE}
ibm['2009/2011'] # all of 2009 - 2011
```
Or use `/` to select ranges including months:
```{r, eval=FALSE}
ibm['2009/201001'] # all of 2009 plus Jan 2010
ibm['200906/201005'] # June 2009 through May 2010
```
### See Also {-#see_also-id281}
The `xts` package provides many other clever ways to index a time
series. See the package documentation.
Merging Several Time Series {#recipe-id079}
---------------------------
### Problem {-#problem-id079}
You have two or more time series. You want to merge them into a single
time series object.
### Solution {-#solution-id079}
Use a `zoo` or `xts` object to represent the time series, then use the `merge`
function to combine them:
``` {r, eval=FALSE}
merge(ts1, ts2)
```
### Discussion {-#discussion-id079}
Merging two time series is an incredible headache when the two series
have differing timestamps. Consider these two time series: the daily
price of IBM stock from 1999 through 2017, and the monthly Consumer
Price Index (CPI) for the same period:
``` {r}
load(file = "./data/ibm.rdata")
head(ibm)
head(cpi)
```
Obviously, the two time series have different timestamps because one is
daily data and the other is monthly data. Even worse, the downloaded CPI
data is timestamped for the first day of every month, even when that day
is a holiday or weekend (e.g., New Year’s Day).
Thank goodness for the `merge` function, which handles the messy details
of reconciling the different dates:
``` {r}
head(merge(ibm, cpi))
```
By default, `merge` finds the *union* of all dates: the output contains
all dates from both inputs, and missing observations are filled with `NA`
values. You can replace those `NA` values with the most recent observation
by using the `na.locf` function from the `zoo` package:
``` {r}
head(na.locf(merge(ibm, cpi)))
```
(Here `locf` stands for “last observation carried forward.”) Observe
that the `NA`s were replaced. However, `na.locf` left an NA in the first
observation (1999-01-01) because there was no IBM stock price on that
day.
You can get the *intersection* of all dates by setting `all = FALSE`:
``` {r}
head(merge(ibm, cpi, all = FALSE))
```
Now the output is limited to observations that are *common* to both
files.
Notice, however, that the intersection begins on February 1, not January 1.
The reason is that January 1 is a holiday, so there is no IBM stock price for that date and hence no
intersection with the CPI data. To fix this, see Recipe \@ref(recipe-id239), ["Filling or Padding a Time Series"](#recipe-id239).
Filling or Padding a Time Series {#recipe-id239}
--------------------------------
### Problem {-#problem-id239}
Your time series data is missing observations. You want to fill or pad
the data with the missing dates/times.
### Solution {-#solution-id239}
Create a zero-width (dataless) `zoo` or `xts` object with the missing
dates/times. Then merge your data with the zero-width object, taking the
union of all dates:
``` {r, eval=FALSE}
empty <- zoo(, dates) # 'dates' is vector of the missing dates
merge(ts, empty, all = TRUE)
```
### Discussion {-#discussion-id239}
The `zoo` package includes a handy feature in the constructor for `zoo`
objects: you can omit the data and build a zero-width object. The object
contains no data, just dates. We can use these “Frankenstein” objects to
perform such operations as filling and padding on other time series
objects.
Suppose you download monthly CPI data used in the last recipe. The data is
timestamped with the first day of each month:
``` {r}
head(cpi)
```
As far as R knows, we have no observations for the other days of the
months. However, we know that each CPI value applies to the subsequent
days through month-end. So first we build a zero-width object with every
day of the decade, but no data:
``` {r}
dates <- seq(from = min(index(cpi)), to = max(index(cpi)), by = 1)
empty <- zoo(, dates)
```
We use the `min(index(cpi))` and `max(index(cpi))` to get the minimum and maximum index value from our `cpi`data. So our resulting `empty` object is just an index of daily dates with the same range as our `cpi` data.
Then we take the union the CPI data and the zero-width object, yielding
a dataset filled with `NA` values:
``` {r}
filled.cpi <- merge(cpi, empty, all = TRUE)
head(filled.cpi)
```
The resulting time series contains every calendar day, with `NA`s where
there was no observation. That might be what you need. However, a more
common requirement is to replace each `NA` with the most recent
observation as of that date. The `na.locf` function from the `zoo`
package does exactly that:
``` {r}
filled.cpi <- na.locf(merge(cpi, empty, all = TRUE))
head(filled.cpi)
```
January’s value of `1` is carried forward until February 1, at which
time it is replaced by the February value. Now every day has the
latest CPI value as of that date. Note that in this dataset, the CPI is based on January 1, 1999 = 100% and all CPI values are relative to the value on that date.
```{r}
tail(filled.cpi)
```
We can use this recipe to fix the problem mentioned in Recipe \@ref(recipe-id079), ["Merging Several Time Series"](#recipe-id079). There, the daily price of
IBM stock and the monthly CPI data had no intersection on certain days.
We can fix that using several different methods. One way is to pad the
IBM data to include the CPI dates and then take the intersection (recall
that `index(cpi)` returns all the dates in the CPI time series):
``` {r}
filled.ibm <- na.locf(merge(ibm, zoo(, index(cpi))))
head(merge(filled.ibm, cpi, all = FALSE))
```
That gives monthly observations. Another way is to fill out the CPI data
(as described previously) and then take the intersection with the IBM
data. That gives daily observations, as follows:
``` {r}
filled_data <- merge(ibm, filled.cpi, all = FALSE)
head(filled_data)
```
Another common method for filling missing values uses the _cubic spline_ technique,
which interpolates smooth intermediate values from the known data.
We can use the `zoo` function `na.spline` to fill our missing values using a cubic spline.
```{r}
combined_data <- merge(ibm, cpi, all = TRUE)
head(combined_data)
combined_spline <- na.spline(combined_data)
head(combined_spline)
```
Notice that both the missing values for `cpi` and `ibm` were filled. However, the value filled in for January 1, 1999 for the `ibm` column seems out of line with the January 4th observation. This illustrates one of the challenges with cubic splines: they can become quite unstable if the value that is being interpolated is at the very beginning or the very end of a series. To get around this instability, we could get some data points from before January 1st, 1999, then interpolate using `na.spline`, or we could simply choose a different interpolation method.
Lagging a Time Series {#recipe-id280}
---------------------
### Problem {-#problem-id280}
You want to shift a time series in time, either forward or backward.
### Solution {-#solution-id280}
Use the `lag` function. The second argument, `k`, is the number of periods
to shift the data:
``` {r, eval=FALSE}
lag(ts, k)
```
Use positive `k` to shift the data forward in time (tomorrow’s data
becomes today’s data). Use a negative `k` to shift the data backward in
time (yesterday’s data becomes today’s data).
### Discussion {-#discussion-id280}
Recall the `zoo` object containing five days of IBM stock prices from Recipe \@ref(recipe-id076), ["Representing Time Series Data"](#recipe-id076):
``` {r}
ibm.daily
```
To shift the data forward one day, we use `k = +1`:
``` {r}
lag(ibm.daily, k = +1, na.pad = TRUE)
```
We also set `na.pad = TRUE` to fill the trailing dates with `NA`. Otherwise,
they would simply be dropped, resulting in a shortened time series.
To shift the data backward one day, we use `k = -1`. Again we use
`na.pad = TRUE` to pad the beginning with `NA`s:
``` {r}
lag(ibm.daily, k = -1, na.pad = TRUE)
```
If the sign convention for `k` seems odd to you, you are not alone.
> **Warning**
>
> The function is called `lag`, but a positive `k` actually generates
> *leading* data, not lagging data. Use a negative `k` to get *lagging*
> data. Yes, this is bizarre. Perhaps the function should have been
> called `lead`.
The other thing to be careful with when using `lag` is that the `dplyr` package contains a function named `lag` as well. The arguments for `dplyr::lag` are not exactly the same as the Base R `lag` function. In particular, `dplyr` uses `n` instead of `k`:
```{r}
dplyr::lag(ibm.daily, n = 1)
```
> **Warning**
>
> If you want to load `dplyr`, you should use the namespace to be explicit about which `lag` function you are using. The Base R function is `stats::lag`, while the dplyr function is, naturally, `dplyr::lag`.
Computing Successive Differences {#recipe-id141}
--------------------------------
### Problem {-#problem-id141}
Given a time series, *x*, you want to compute the difference between
successive observations: (*x*~2~ – *x*~1~), (*x*~3~ – *x*~2~), (*x*~4~ –
*x*~3~), ....
### Solution {-#solution-id141}
Use the `diff` function:
``` {r, eval=FALSE}
diff(x)
```
### Discussion {-#discussion-id141}
The `diff` function is generic, so it works on simple vectors, `xts` objects, and
`zoo` objects. The beauty of differencing a `zoo` or `xts` object is that the
result is the same type object you started with *and* the differences have the correct
dates. Here we compute the differences for successive prices of IBM
stock:
``` {r}
ibm.daily
diff(ibm.daily)
```
The difference labeled 2010-01-05 is the change from the previous day
(2010-01-04), which is usually what you want. The differenced series is
shorter than the original series by one element because R can’t compute
the change as of 2010-01-04, of course.
By default, `diff` computes successive differences. You can compute
differences that are more widely spaced by using its `lag` parameter.
Suppose you have monthly CPI data and want to compute the change from
the previous 12 months, giving the year-over-year change. Specify a `lag`
of `12`:
``` {r}
head(cpi, 24)
head(diff(cpi, lag = 12), 24) # Compute year-over-year change
```
Performing Calculations on Time Series {#recipe-id080}
--------------------------------------
### Problem {-#problem-id080}
You want to use arithmetic and common functions on time series data.
### Solution {-#solution-id080}
No problem. R is pretty clever about operations on `zoo` and `xts` objects. You
can use arithmetic operators (`+`, `-`, `*`, `/`, etc.) as well as
common functions (`sqrt`, `log`, etc.) and usually get what you
expect.
### Discussion {-#discussion-id080}
When you perform arithmetic on `zoo` or `xts` objects, R aligns the objects
according to date so that the results make sense. Suppose we want to
compute the percentage change in IBM stock. We need to divide the daily
change by the price, but those two time series are not naturally
aligned—they have different start times and different lengths. Here's an illustration with a `zoo` object:
``` {r}
ibm.daily
diff(ibm.daily)
```
Fortunately, when we divide one series by the other, R aligns the series
for us and returns a `zoo` object:
``` {r}
diff(ibm.daily) / ibm.daily
```
We can scale the result by 100 to compute the percentage change, and the
result is another `zoo` object:
``` {r}
100 * (diff(ibm.daily) / ibm.daily)
```
Functions work just as well. If we compute the logarithm or square root
of a `zoo` object, the result is a `zoo` object with the timestamps
preserved:
``` {r}
log(ibm.daily)
```
In investment management, computing the difference of logarithms of
prices is quite common. That’s a piece of cake in R:
``` {r}
diff(log(ibm.daily))
```
### See Also {-#see_also-id080}
See Recipe \@ref(recipe-id141), ["Computing Successive Differences"](#recipe-id141), for the special
case of computing the difference between successive values.
Computing a Moving Average {#recipe-id279}
--------------------------
### Problem {-#problem-id279}
You want to compute the moving average of a time series.
### Solution {-#solution-id279}
Use the `rollmean` function of the `zoo` package to calculate the
*k*-period moving average:
``` {r, eval=FALSE}
library(zoo)
ma <- rollmean(ts, k)
```
Here `ts` is the time series data, captured in a `zoo` object, and `k` is
the number of periods.
For most financial applications, you want `rollmean` to calculate the
mean using only historical data; that is, for each day, use only the
data available that day. To do that, specify `align = right`. Otherwise,
`rollmean` will “cheat” and use future data that was actually
unavailable at the time:
``` {r, eval=FALSE}
ma <- rollmean(ts, k, align = "right")
```
### Discussion {-#discussion-id279}
Traders are fond of moving averages for smoothing out fluctuations in
prices. The formal name is the *rolling mean*. You could calculate the
rolling mean as described in Recipe \@ref(recipe-id081), ["Applying a Rolling Function"](#recipe-id081), by combining the
`rollapply` function and the `mean` function, but `rollmean` is much
faster.
Beside speed, the beauty of `rollmean` is that it returns the same type of time series object it's called on (i.e., `xts` or `zoo`). For each element in the object, its date is the “as of” date for
a calculated mean. Because the result is a time series object, you can easily merge the
original data and the moving average and then plot them together as in Figure \@ref(fig:ibmroll):
```{r ibmroll, fig.cap='Rolling average plot'}
ibm_year <- ibm["2016"]
ma_ibm <- rollmean(ibm_year, 7, align = "right")
ma_ibm <- merge(ma_ibm, ibm_year)
plot(ma_ibm)
```
The output is normally missing a few initial data points, since
`rollmean` needs a full `k` observations to compute the mean.
Consequently, the output is shorter than the input. If that’s a problem,
specify `na.pad = TRUE`; then `rollmean` will pad the initial output with
`NA` values.
### See Also {-#see_also-id279}
See Recipe \@ref(recipe-id081), ["Applying a Rolling Function"](#recipe-id081), for more about the
`align` parameter.
The moving average described here is a simple moving average. The
`quantmod`, `TTR`, and `fTrading` packages contain functions for
computing and plotting many kinds of moving averages, including simple
ones.
Applying a Function by Calendar Period {#recipe-id282}
--------------------------------------
### Problem {-#problem-id282}
Given a time series, you want to group the contents by a calendar period
(e.g., week, month, or year) and then apply a function to each group.
### Solution {-#solution-id282}
The `xts` package includes functions for processing a time series by
day, week, month, quarter, or year:
``` {r, eval=FALSE}
apply.daily(ts, f)
apply.weekly(ts, f)
apply.monthly(ts, f)
apply.quarterly(ts, f)
apply.yearly(ts, f)
```
Here `ts` is an `xts` time series, and `f` is the function to apply to each
day, week, month, quarter, or year.
If your time series is a `zoo` object, convert to an `xts` object first
so you can access these functions; for example:
``` {r, eval=FALSE}
apply.monthly(as.xts(ts), f)
```
### Discussion {-#discussion-id282}
It is common to process time series data according to calendar period.
But figuring calendar periods is tedious at best and bizarre at worst.
Let these functions do the heavy lifting.
Suppose we have a five-year history of IBM stock prices stored in an
`xts` object:
``` {r}
ibm_5 <- ibm["2012/2017"]
head(ibm_5)
```
We can calculate the average price by month if we use `apply.monthly`
and `mean` together:
``` {r}
ibm_mm <- apply.monthly(ibm_5, mean)
head(ibm_mm)
```
Notice that the IBM data is in an `xts` object from the start. Had the data been in a `zoo` object, we would have needed to convert it to `xts` using `as.xts`.
A more interesting application is calculating volatility by calendar
month, where volatility is measured as the standard deviation of daily
log-returns. Daily log-returns are calculated this way:
``` {r, eval=FALSE}
diff(log(ibm_5))
```
We calculate their standard deviation, month by month, like this:
``` {r, eval=FALSE}
apply.monthly(as.xts(diff(log(ibm_5))), sd)
```
We can scale the daily number to estimate annualized volatility, as shown in Figure \@ref(fig:volplot):