-
Notifications
You must be signed in to change notification settings - Fork 15
/
README.Rmd
188 lines (126 loc) · 6.13 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
---
title: 'tidyjson'
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/tidyjson)](https://cran.r-project.org/package=tidyjson)
[![Build Status](https://travis-ci.org/colearendt/tidyjson.svg?branch=master)](https://travis-ci.org/colearendt/tidyjson)
[![Coverage Status](https://codecov.io/github/colearendt/tidyjson/coverage.svg?branch=master)](https://codecov.io/github/colearendt/tidyjson?branch=master)
[![CRAN Activity](http://cranlogs.r-pkg.org/badges/tidyjson)](https://cran.r-project.org/package=tidyjson/index.html)
[![CRAN History](http://cranlogs.r-pkg.org/badges/grand-total/tidyjson)](https://cran.r-project.org/package=tidyjson/index.html)
![tidyjson graphs](https://cloud.githubusercontent.com/assets/2284427/18217882/1b3b2db4-7114-11e6-8ba3-07938f1db9af.png)
tidyjson provides tools for turning complex [json](http://www.json.org/) into [tidy](https://cran.r-project.org/package=tidyr/vignettes/tidy-data.html)
data.
## Installation
Get the released version from CRAN:
```R
install.packages("tidyjson")
```
or the development version from github:
```R
devtools::install_github("colearendt/tidyjson")
```
## Examples
The following example takes a character vector of
`r library(tidyjson);length(worldbank)`
documents in the `worldbank` dataset and spreads out all objects.
Every JSON object key gets its own column with types inferred, so long
as the key does not represent an array. When `recursive=TRUE` (the default behavior),
`spread_all` does this recursively for nested objects and creates column names
using the `sep` parameter (i.e. `{"a":{"b":1}}` with `sep='.'` would
generate a single column: `a.b`).
```{r, message=FALSE}
library(dplyr)
library(tidyjson)
worldbank %>% spread_all
```
Some objects in `worldbank` are arrays, which are not handled by `spread_all`. This example shows how
to quickly summarize the top level structure of a JSON collection
```{r}
worldbank %>% gather_object %>% json_types %>% count(name, type)
```
In order to capture the data in the `majorsector_percent` array, we can use `enter_object`
to enter into that object, `gather_array` to stack the array and `spread_all`
to capture the object items under the array.
```{r}
worldbank %>%
enter_object(majorsector_percent) %>%
gather_array %>%
spread_all %>%
select(-document.id, -array.index)
```
## API
### Spreading objects into columns
* `spread_all()` for spreading all object values into new columns, with nested
objects having concatenated names
* `spread_values()` for specifying a subset of object values to spread into new
columns using the `jstring()`, `jinteger()`, `jdouble()` and `jlogical()`
functions. It is possible to specify multiple parameters to extract data from
nested objects (i.e. `jstring('a','b')`).
### Object navigation
* `enter_object()` for entering into an object by name, discarding all other
JSON (and rows without the corresponding object name) and allowing further
operations on the object value
* `gather_object()` for stacking all object name-value pairs by name, expanding
the rows of the `tbl_json` object accordingly
### Array navigation
* `gather_array()` for stacking all array values by index, expanding the
rows of the `tbl_json` object accordingly
### JSON inspection
* `json_types()` for identifying JSON data types
* `json_length()` for computing the length of JSON data (can be larger than
`1` for objects and arrays)
* `json_complexity()` for computing the length of the unnested JSON, i.e.,
how many terminal leaves there are in a complex JSON structure
* `is_json` family of functions for testing the type of JSON data
### JSON summarization
* `json_structure()` for creating a single fixed column data.frame that
recursively structures arbitrary JSON data
* `json_schema()` for representing the schema of complex JSON, unioned across
disparate JSON documents, and collapsing arrays to their most complex type
representation
### Creating tbl_json objects
* `as.tbl_json()` for converting a string or character vector into a `tbl_json`
object, or for converting a `data.frame` with a JSON column using the
`json.column` argument
* `tbl_json()` for combining a `data.frame` and associated `list` derived
from JSON data into a `tbl_json` object
* `read_json()` for reading JSON data from a file
### Converting tbl_json objects
* `as.character.tbl_json` for converting the JSON attribute of a `tbl_json`
object back into a JSON character string
### Included JSON data
* `commits`: commit data for the dplyr repo from github API
* `issues`: issue data for the dplyr repo from github API
* `worldbank`: world bank funded projects from jsonstudio
* `companies`: startup company data from jsonstudio
## Philosophy
The goal is to turn complex JSON data, which is often represented as nested
lists, into tidy data frames that can be more easily manipulated.
* Work on a single JSON document, or on a collection of related documents
* Create pipelines with `%>%`, producing code that can be read from left to
right
* Guarantee the structure of the data produced, even if the input JSON
structure changes (with the exception of `spread_all`)
* Work with arbitrarily nested arrays or objects
* Handle 'ragged' arrays and / or objects (varying lengths by document)
* Allow for extraction of data in values or object names
* Ensure edge cases are handled correctly (especially empty data)
* Integrate seamlessly with `dplyr`, allowing `tbl_json` objects to pipe in and
out of `dplyr` verbs where reasonable
## Related Work
Tidyjson depends upon
* [magrritr](https://github.com/tidyverse/magrittr) for the `%>%` pipe operator
* [jsonlite](https://github.com/jeroen/jsonlite) for converting JSON strings into nested lists
* [purrr](https://github.com/tidyverse/purrr) for list operators
* [tidyr](https://github.com/tidyverse/tidyr) for unnesting and spreading
Further, there are other R packages that can be used to better understand
JSON data
* [listviewer](https://github.com/timelyportfolio/listviewer) for viewing JSON data interactively