RFC: Version 2 of API #1
Comments
Looks great! I've had good results with canonicalized date/times using the UTC ISO conventions.
@dakworks - Agreed, we have so many users across time zones that agreeing on a single format is necessary.
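For illustration, a minimal Python sketch of a canonical UTC ISO 8601 timestamp of the kind being suggested (the API's exact format is not specified here):

```python
from datetime import datetime, timezone

# Render the current time as an ISO 8601 string in UTC,
# e.g. "2020-08-14T16:30:00+00:00".
timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
print(timestamp)
```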
Brilliant like a diamond, juicy like a peach. Yes, please!
Many of the concerns addressed by the proposal mirror the benefits of a GraphQL API, as described in GitHub's overview of the GraphQL data query language and in GraphQL Best Practices. GraphQL also allows for global node IDs, which make lookups and updates easier.
One persistent issue is that the 7-day averages are backward-looking, which effectively lags them by 3.5 days. This shows up in the plots, where steep rises are de-emphasised because the moving average is shifted forwards in time.
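For illustration, a minimal pandas sketch with invented numbers: a trailing 7-day window lags the signal, while re-centring the window removes the 3.5-day shift, at the cost of incomplete values for the most recent days:

```python
import pandas as pd

# Hypothetical daily case counts, with a steep rise at the end.
cases = pd.Series([100, 110, 120, 130, 300, 600, 900, 1200])

trailing = cases.rolling(window=7).mean()               # backward-looking: lags ~3.5 days
centered = cases.rolling(window=7, center=True).mean()  # aligned with the middle of the window

print(pd.DataFrame({"cases": cases, "trailing": trailing, "centered": centered}))
```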
@Nosferican all good points, and we have had internal discussions in the past about GraphQL and versionless APIs. The challenge is that a company or organization that owns its data can create and maintain a consistent schema, because it is data they were able to plan and model out. However, we are trying to reflect a shifting baseline of 56 different states and territories. I would love to add a GraphQL endpoint to our data, but doing so would add an additional layer of technical and human resources that we do not have.
Good stuff here, although regarding CSVs, "no effort will be made to keep column names or the layout consistent from one day to the next" seems a shame. I would suggest keeping them the same when reasonably possible, because that is the level at which some people are sure to pull data feeds if they aren't comfortable with JSON:API.
@cfhp1 - I'll change that wording a bit. We definitely want to provide consistent CSV files, but we also don't want folks counting on column 11 being positive tests, which is what is happening now.
I haven't used BigQuery before, but I see it supports SQL. Will that be an alternative to JSON:API for pulling your data? That would be lovely.
Yes, you would have to set up a BigQuery project, but that is pretty straightforward.
I might be wrong here, but if I'm not mistaken, by implementing the OpenAPI 3 spec one gets GraphQL for free (https://loopback.io/openapi-to-graphql.html).
It may not be apparent to API users that some counts are incomplete/estimated or even wrong when states first publish them and CTP captures them, and that CTP may then backfill/update the counts, and the calculations too, at almost any time. I suggest warning that caches should be kept for no more than 1 day, and then providing insight into how often different fields have seen backfills/updates, and why. I would also expect the HTTP implementation of this API to participate in correct caching by clients, i.e. respecting request cache headers and returning correct responses. Also: the "most recent record" endpoints may be too much temptation to build something that doesn't take backfills/updates into account. The "historic" vs "current" vs "daily" terminology in V1 confused me, and I see you did correct this in V2, but wouldn't it be clearer still to not slice the API by this at all, offering only full history where there's history? (A queryable API of course might allow for this slicing, but I'm not going to advocate for that here.)
@Nosferican we can definitely get the GraphQL spec, but we would still need to build and maintain an endpoint that implements that spec.
@waded totally agree. We could either say "follow our cache-control headers" (which are now 10 minutes) or put something more explicit in there; I'll edit the document to reflect that. I do think we should just support slicing history by either ALL state data, or state data for an explicit date.
Also, our API does and will continue to update ETags appropriately whenever any content changes.
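For client authors, a minimal sketch of ETag-based caching against the API (the endpoint URL is illustrative, and the conditional-request pattern is standard HTTP rather than anything specific to this proposal):

```python
import requests

URL = "https://api.covidtracking.com/v2/us"  # illustrative endpoint

# First request: remember the ETag the server sends back.
first = requests.get(URL)
etag = first.headers.get("ETag")

# Later request: send If-None-Match; a 304 means the cached copy is still valid.
later = requests.get(URL, headers={"If-None-Match": etag} if etag else {})
if later.status_code == 304:
    print("Content unchanged; reuse the cached response")
else:
    print("Content changed; use the new response")
```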
I love that you guys did this whole project. Thank you so much! As for the update spec: me likey! One thing that would be very useful would be to include some summary info for a "column" per state, e.g. the max of AL deathIncrease. I like to normalize this stuff, but that means running through the data multiple times. If you went this route, you could include some other statistical bits about the data, e.g. min, average, standard deviation. These can be helpful in filtering noise from the data. Sometimes a state metric will run 40, 100, 1800, -1750, 95, etc. Those anomalous spikes are very hard to process sensibly, but if you have some sense of what the data should look like, it is easier to detect and substitute for the anomalies in a way that still represents the overall picture correctly. But I'd be happy with a max.
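A minimal sketch of the kind of filtering this would enable, using the made-up numbers above; the median/MAD approach and the factor of 5 are illustrative choices, not anything the project has committed to:

```python
import statistics

# Hypothetical daily increases for one state metric, with two anomalous spikes.
values = [40, 100, 1800, -1750, 95, 60, 80]

med = statistics.median(values)
mad = statistics.median(abs(v - med) for v in values)  # median absolute deviation

# Flag points far from the median relative to the MAD (the factor of 5 is arbitrary).
anomalies = [v for v in values if abs(v - med) > 5 * mad]
print(f"min={min(values)} max={max(values)} median={med} MAD={mad}")
print("anomalies:", anomalies)  # -> [1800, -1750]
```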
For the record, I dropped the portion of the proposal about dropping CORS support. New configuration options with our API endpoint will allow us to support this without much effort, and it supports a nice use case: letting folks build interactive UIs without a heavy lift.
@phreditor - The nice thing is we can keep adding these computed values without just adding tons of new fields to a single object!
Thank you! We were worried that we'd have to rearchitect the whole application.
Background
The COVID Tracking Project was founded in the early days of the COVID-19 pandemic's arrival in the US, and has provided an API from day one. This API receives millions of requests per day and is used by organizations large and small to inform their users. Our API expands our reach and mission by providing consistently high-quality data to others.
Since March, the data we collect has undergone several big changes. We have twice as many data fields. Definitions of data that seemed solid in March have changed considerably. Some data that was a single field now needs more context, or differs from state to state.
Unfortunately, so many apps use our API that changing field names breaks things for our clients. We also support two formats of data, CSV and JSON, which means we can't have nested or structured data if we want to keep the two formats in parity. We serve data endpoints like states/daily.json that are over 6MB in size, but we cannot add pagination because CSV users would miss out on that data.
We get many feature requests, like providing data as a percentage of population or adding calculations like a 7-day rolling average. While we have built internal tools to do this within our website, we are wary of simply adding more fields that may need to change as the means of analyzing the pandemic change, or as our understanding of our own data improves.
API Proposal
Our proposal is to create a new, versioned API for our COVID data that improves time-to-release of data, prevents changes from breaking well-built applications, and gives space for things like computed fields.
The new API will be served from api.covidtracking.com/, while the old API at covidtracking.com/api/v1 will continue to be maintained and updated daily. On October 1, V1 of the API will stop receiving updates; it will remain online until January 1, 2021.
Delivery
Remove CSV files from the API, provide CSV downloads
CSV files are necessary tools for researchers and the public, but they are the biggest source of issues filed about formatting problems. No modern API service delivers data in CSV format, because it is a format for bulk migration of data, not real-time application messaging.
Instead, the covidtracking.com website will build CSV files for users to download from the various sections of our site. Researchers and other users will be able to use these generated CSV files to download the latest data, but the files will not include fields like computed values. We will make a best-effort attempt to keep these files in line with the latest changes to the API.
BigQuery
We have been using BigQuery as a generalized datastore for non-core data, and we already have a public datastore of our own COVID data. Let's add all of our API data to a public BigQuery dataset that anyone can query against.
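As a sketch of what querying such a dataset from Python could look like (the project, dataset, and table names are invented placeholders; the real dataset would be announced when published):

```python
from google.cloud import bigquery

# Requires a Google Cloud project with the BigQuery API enabled.
client = bigquery.Client()

# Hypothetical table name; the real public dataset would be named separately.
query = """
    SELECT date, state, positive
    FROM `hypothetical-project.covid_tracking.states_history`
    ORDER BY date DESC
    LIMIT 7
"""
for row in client.query(query):
    print(row.date, row.state, row.positive)
```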
Schema
Our JSON data is currently one long JSON array with no structure or context. We propose standardizing all API responses on the JSON:API specification.
We would follow consistent standards for naming and formats. For example, every endpoint would provide the last time the API data was updated, a link to the license and data definitions, and the API version.
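As a rough illustration only, shown as a Python literal (every key and value here is invented; the final shape would follow the JSON:API specification):

```python
# Hypothetical shape of a V2 response; names and values are illustrative, not final.
response = {
    "links": {"self": "https://api.covidtracking.com/v2/states/history"},
    "meta": {
        "updated": "2020-08-14T16:30:00+00:00",  # last time the API data was updated
        "license": "https://covidtracking.com/license",  # illustrative license link
        "version": "2.0.0",
    },
    "data": [
        # one object per record; see the Data fields section below
    ],
}
```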
Field definitions
All endpoints will include field definitions in the meta object. This will allow us to rename fields and flag them for deprecation. Fields will include a formerly array that indicates what the field used to be named, which applications can use as a fallback in case a field changes its name. Fields also have an optional "unit" designation that indicates whether the field represents people or samples.
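A hypothetical sketch of a single field definition (the field name and its history are invented):

```python
# Hypothetical entry among the meta object's field definitions.
field_definition = {
    "name": "testsViralPositive",
    "formerly": ["positiveTestsViral"],  # fallback names for older applications
    "deprecated": False,
    "unit": "samples",                   # whether the field counts people or samples
    "definition": "Total positive viral (PCR) tests",
}
```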
Row metadata
Each data element will have its own meta object that defines things like edit notes and last-update times:
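As a rough illustration (the keys here are invented; the proposal does not pin down the exact shape):

```python
# Hypothetical per-row meta object.
row_meta = {
    "updated": "2020-08-14T00:00:00+00:00",  # last time this record changed
    "notes": "State revised its reporting of antigen tests on this date.",
}
```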
Data fields
All endpoints will have a data array of objects. Each object can be nested to group like data elements together. Each data element will have a computed object that includes 7-day averages and values computed as a percentage of the population. Data elements will be nested as [category].[field].values.
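A hypothetical sketch of one nested data element with its computed object, shown as a Python literal (the category, field, and numbers are invented):

```python
# Hypothetical data element nested as [category].[field].values.
data_element = {
    "testing": {                        # category
        "positive": {                   # field
            "values": 1543,             # raw value
            "computed": {
                "sevenDayAverage": 1201.4,
                "percentOfPopulation": 0.03,
            },
        },
    },
}
```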
Field disambiguation
Some fields, such as a simple total of test results, are impossible to treat uniformly across all states. In these cases, we will not provide a single value; instead, we will return an object representing the most complete time series (running since March 2020) and the most accurate time series (where we have data for over 120 days):
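As a hypothetical sketch of such a disambiguated field (the series names and numbers are invented for illustration):

```python
# Hypothetical disambiguated field: two candidate time series instead of one value.
total_test_results = {
    "mostComplete": {"field": "totalTestsViral", "since": "2020-03-04", "value": 1800000},
    "mostAccurate": {"field": "totalTestEncountersViral", "days": 150, "value": 1650000},
}
```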
Option to disable computed values
Users who just want raw values can request endpoints that return simpler values instead by appending /simple to the URL:
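For example (illustrative URLs, assuming the endpoint naming described in the Endpoints section below):

```python
# Full endpoint: nested data elements with computed values.
full_url = "https://api.covidtracking.com/v2/us"

# Simple endpoint: the same data without the computed objects.
simple_url = "https://api.covidtracking.com/v2/us/simple"
```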
Add state metadata to all endpoints
Users are making multiple API calls to combine state metadata with daily or current information. Instead, we can provide a single state endpoint that includes all state information, and prepend the state metadata for each state to all state API calls.
In addition, we will add unique slug metadata fields to all states and state endpoints.
Data cleanup
Fields currently marked as Deprecated in the V1 API will not be brought over to V2.
Endpoints
The new API will have the following endpoints (all prefixed by /v2/):
/changes - A running changelog of additions and changes to the API
/status - Information about the last build time and API health
/fields - A list of all fields, their definitions, and long-names
/states - A list of all states and their state metadata, same as our current State Metadata
/states/history - A list of all historic records for all states
/state/[state-code] - All of the state's metadata and its most recent data record
/state/[state-code]/history - All of the state's metadata and a list of all historic records for that state
/us - The most recent record for the US
/us/history - All of the US history
We will no longer use .json at the end of endpoint URLs.
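A minimal sketch of fetching one of these endpoints from Python (the state code and response handling are illustrative; the response shape would be the JSON:API-style document sketched above):

```python
import requests

# Fetch all of one state's metadata and its most recent record (illustrative).
resp = requests.get("https://api.covidtracking.com/v2/state/ca")
resp.raise_for_status()

document = resp.json()
print(document["meta"])  # last-updated time, license link, API version
```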
Change control & community outreach
Changes to endpoints and the API will be communicated through a dedicated Headway page and Twitter account. We will handle changes to fields or field definitions in a consistent manner.