This repository has been archived by the owner on Dec 10, 2020. It is now read-only.

RFC: Version 2 of API #1

Open
kevee opened this issue Jul 17, 2020 · 19 comments

Comments

@kevee
Member

kevee commented Jul 17, 2020

Background

The COVID Tracking Project was founded in the early days of the COVID pandemic arriving in the US, and provided an API from day one. This API receives millions of requests per day, and is used by large and small organizations to inform their users. Our API expands our reach and mission by providing consistently high-quality data to others.

Since March, the data we collect has undergone several big changes. We have twice as many data fields. Definitions of data that seemed solid in March have changed considerably. Some data that was a single field now needs more context, or differs from state to state.

Unfortunately, so many apps use our API that changing field names breaks things for our clients. We also support two formats of data: CSV and JSON, which means we can’t have nested or structured data if we want to keep the two formats in parity. We serve data endpoints like states/daily.json that are over 6MB in size, but cannot add pagination because CSV users would miss out on that data.

We get many feature requests, like providing data as a percentage of population or adding calculations like a 7-day rolling average. While we have built internal tools to do this within our website, we are wary of adding ever more fields that may need to change as the means of analyzing the pandemic change, or as our understanding of our own data improves.

API Proposal

Our proposal is to create a new, versioned API for our COVID data that improves time-to-release of data, prevents changes from breaking well-built applications, and gives space for things like computed fields.

The new API will be served from api.covidtracking.com/, while the old API at covidtracking.com/api/v1 will still be maintained and updated daily. On October 1, the V1 of the API will no longer receive updates, and will remain online until January 1, 2021.

Delivery

Remove CSV files from the API, provide CSV downloads

CSV files are necessary tools for researchers and the public, but they are the biggest source of issues filed about formatting problems. No modern API service delivers data in CSV format because it is a format for bulk migration of data, not real-time application messaging.

Instead, the covidtracking.com website will build CSV files for users to download from the various sections of our site. Researchers and other users will be able to use these generated CSV files to download the latest data, but these files will not have fields like computed values. We will make a best effort attempt to keep these files in line with the latest changes in the API.

BigQuery

We have been using BigQuery as a generalized datastore for non-core data, and have a public datastore of our own COVID data. Let’s add all our API data into a public BigQuery dataset that anyone can query against.

Schema

Our JSON data is currently a long JSON array of data with no structure or context. We propose standardizing all API responses based on JSON:API:

{
   "links":{
      "self":"https://api.covidtracking.com/state/ca"
   },
   "meta":{
      "build_time":"2020-07-05T14:00:00Z",
      "data_definitions":"https://covidtracking.com/definitions/state",
      "license":"https://covidtracking.com/license",
      "version":2.1
   },
   "data":[

   ]
}

We would follow these standards for naming and formats:

  • All names are in snake_case
  • All fields with dates or times are in full ISO 8601 format, in the UTC time zone

Every endpoint would provide the last time the API data was updated, a link to license and data definitions, and the API version.
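To illustrate the timestamp convention, a client might normalize these values like so (a minimal Python sketch; the sample timestamp is taken from the envelope example above, and the helper name is ours):

```python
from datetime import datetime, timedelta

def parse_api_timestamp(value: str) -> datetime:
    """Parse a full ISO 8601 UTC timestamp such as "2020-07-05T14:00:00Z"."""
    # datetime.fromisoformat on Python < 3.11 rejects a trailing "Z",
    # so normalize it to an explicit offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

build_time = parse_api_timestamp("2020-07-05T14:00:00Z")
assert build_time.utcoffset() == timedelta(0)  # always UTC
```

Because every timestamp carries an explicit zone, clients can convert to local time without guessing.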

Field definitions

All endpoints will include field definitions in the meta object. This will allow us to rename and flag fields for deprecation. Fields will include a formerly array that indicates what the field used to be named, and can be used as a fallback for applications in case a field changes its name.

Fields have an optional “unit” designation that indicates whether the field represents people or samples.

{
   "meta":{
      "field_definitions":[
         {
            "field":"cases.cases.current",
            "deprecated":false,
            "unit":"people",
            "formerly":[
               "positive",
               "positiveCurrent"
            ]
         }
      ]
   },
   "data":[

   ]
}
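A client could use the formerly array as a rename-tolerant lookup. A sketch of the idea in Python, assuming (for illustration only) a flat record keyed by field name:

```python
def resolve_field(record: dict, definitions: list, wanted: str):
    """Look up `wanted` in a record, falling back to its former names.

    `definitions` is the meta.field_definitions list from the response;
    the flat, dotted-key record shape here is a simplifying assumption.
    """
    if wanted in record:
        return record[wanted]
    for definition in definitions:
        if definition["field"] == wanted:
            for old_name in definition.get("formerly", []):
                if old_name in record:
                    return record[old_name]
    raise KeyError(wanted)

defs = [{"field": "cases.cases.current",
         "formerly": ["positive", "positiveCurrent"]}]
# An application written against the old field name keeps working:
assert resolve_field({"positive": 400}, defs, "cases.cases.current") == 400
```

This is exactly the fallback behavior the formerly array is meant to enable when a field changes its name.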

Row metadata

Each data element will have its own meta object that defines things like edit notes and last-update times:

{
   "data":[
      {
         "state":"CA",
         "date":"2020-04-05T00:00:00Z",
         "meta":{
            "last_update":"2020-04-06T05:00:00Z"
         }
      }
   ]
}

Data fields

All endpoints will have a data array of objects. Each object can be nested to group like data elements together. Each data element will have a computed object that includes 7-day averages and computed values as a percentage of the population.

Data elements will be nested as [category].[field].values

{
   "data":[
      {
         "state":"CA",
         "date":"2020-04-05T00:00:00Z",
         "cases":{
            "cases":{
               "current":{
                  "value":400,
                  "computed":{
                     "average_7_day":380,
                     "population_percent":0.06
                  }
               },
               "cumulative":{
                  "value":5000,
                  "computed":{
                     "population_percent":0.1
                  }
               }
            }
         },
         "tests":{
            "negative":{
               "current":{
                  "value":4500,
                  "computed":{
                     "average_7_day":4000,
                     "population_percent":0.06
                  },
                  "cumulative":{
                     "value":50000,
                     "computed":{
                        "population_percent":2.4
                     }
                  }
               },
               "pending":{
                  "current":{
                     "value":4500,
                     "computed":{
                        "average_7_day":4000,
                        "population_percent":0.06
                     },
                     "cumulative":{
                        "value":50000,
                        "computed":{
                           "population_percent":2.4
                        }
                     }
                  }
               }
            }
         }
      }
   ]
}
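The [category].[field] nesting can be walked generically. A Python sketch (assuming the shape above, where every leaf object carries a "value" key) that flattens a record into dotted keys:

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested category/field objects into dotted keys,
    keeping only the raw leaf "value" (computed values are skipped)."""
    out = {}
    for key, val in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(val, dict):
            if "value" in val:
                out[path] = val["value"]
            else:
                out.update(flatten(val, path))
        else:
            out[path] = val  # scalars such as "state" and "date"
    return out

record = {
    "state": "CA",
    "cases": {"cases": {
        "current": {"value": 400, "computed": {"average_7_day": 380}},
        "cumulative": {"value": 5000},
    }},
}
assert flatten(record) == {
    "state": "CA",
    "cases.cases.current": 400,
    "cases.cases.cumulative": 5000,
}
```

A consumer that wants V1-style flat rows (for a spreadsheet, say) can recover them this way without hard-coding field names.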

Field Disambiguation

Some fields, such as a simple total of test results, are impossible to treat uniformly across all states. In these cases, we will not provide a value for the field. Instead, we will give an object pointing to both the most complete time series (since March 2020) and the most accurate time series (where we have data covering more than 120 days):

[
  {
    "state": "CA",
    "date": "2020-09-01",
    "tests": {
      ...
      "total_test_results": {
        "complete_field": "tests.positive_negative",
        "preferred_field": "tests.viral.total"
      }
      ...
    }
  }
]

Option to disable computed values

Users who just want raw values can request endpoints that return simpler values instead by appending /simple to the URL:

[
   {
      "state":"CA",
      "date":"2020-04-05T00:00:00Z",
      "cases":{
         "cases": {"current":400,
         "cumulative":5000
}
      },
      "tests":{
         "negative":{
            "current":4500,
            "cumulative":50000
         },
         "pending":{
            "current":4500,
            "cumulative":50000
         }
      }
   }
]

Add state metadata to all endpoints

Users are making multiple API calls for state metadata and daily or current information. Instead, we can provide a single state endpoint that includes all state information, and then append the state metadata for each state to the beginning of all state API calls.

In addition, we will add unique slug metadata fields to all states and state endpoints.

Data cleanup

Fields currently marked as Deprecated in the V1 API will not be brought over to V2.

Endpoints

The new API will have the following endpoints (all prefixed by /v2/):

  • /changes - A running changelog of additions and changes to the API
  • /status - Information about the last build time and API health
  • /fields - A list of all fields, their definitions, and long-names
  • /states - A list of all states and their state metadata, same as our current State Metadata.
  • /states/history - A list of all historic records for all states
  • /state/[state-code] - All the state’s metadata, and their most recent data record
  • /state/[state-code]/history - All the state’s metadata, and a list of all historic records for that state
  • /us - The most recent record for the US
  • /us/history - All the US history

We will no longer use .json at the end of endpoint URLs.
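Putting the new host and version prefix together, client URLs would look like this (a small sketch; the base URL comes from the proposal above, the helper itself is ours):

```python
BASE = "https://api.covidtracking.com/v2"

def endpoint(path: str) -> str:
    """Build a V2 endpoint URL. Note there is no .json suffix."""
    return f"{BASE}/{path.lstrip('/')}"

assert endpoint("/state/ca/history") == \
    "https://api.covidtracking.com/v2/state/ca/history"
assert not endpoint("/us").endswith(".json")
```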

Change control & community outreach

Changes to endpoints and API will be communicated through a dedicated Headway page and Twitter account. We will handle changes in fields or field definitions in a consistent manner:

  • New fields - Released and announced as soon as possible
  • Changes to field names - The field definitions will be updated immediately, and a new name of the field will be added. The old name of the field will remain and both will exist in parallel. Two weeks after launching the new name, the old name will no longer get updates, and three weeks after launching, the old field will be removed.
  • Removal of fields - If a field is no longer needed, it will be announced and not receive any further updates, zeroed out after two weeks, and removed after three weeks.
@dakworks

Looks great! I’ve had good results with canonicalized date/times using the UTC ISO conventions.

@kevee
Member Author

kevee commented Jul 17, 2020

@dakworks - Agreed, we have so many different users across timezones that agreeing to a format is necessary.

@amandafrench

Brilliant like a diamond, juicy like a peach. Yes, please!

@Nosferican

Nosferican commented Jul 18, 2020

Many of the concerns addressed by the proposal mirror the benefits of a GraphQL API.

The GraphQL data query language is:

  • A specification. The spec determines the validity of the schema on the API server. The schema determines the validity of client calls.
  • Strongly typed. The schema defines an API's type system and all object relationships. (fields can be typed as timestamp)
  • Introspective. A client can query the schema for details about the schema.
  • Hierarchical. The shape of a GraphQL call mirrors the shape of the JSON data it returns. Nested fields let you query for and receive only the data you specify in a single round trip.
  • An application layer. GraphQL is not a storage model or a database query language. The graph refers to graph structures defined in the schema, where nodes define objects and edges define relationships between objects. The API traverses and returns application data based on the schema definitions, independent of how the data is stored.

Source: GitHub

It also allows for (see GraphQL Best Practices):

  • Versionless schemas (easy to support changes without making them breaking)
  • Pagination
  • gzip compression for improved network performance

It also allows for global node IDs, which make lookup and updating easier.

@kevinmarks

One persistent issue is that the 7 day averages are backward looking, which makes them 3.5 day lagged - this shows up in the plots made, where steep rises are de-emphasised by the moving average being shifted forwards in time.
I know that you can't issue forward looking averages, but can you add guidance (or a time offset field?) to encourage these to be plotted in the past.
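The lag described here is easy to demonstrate: on a steady rise, a trailing 7-day mean labelled with the window's last day reports the value from roughly three days earlier. A sketch with illustrative data (not CTP numbers):

```python
def trailing_avg(values, window=7):
    """Trailing mean over `window` days, labelled with the window's last day."""
    return [sum(values[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(values))]

# A steady linear rise: day i reports value i.
values = list(range(20))
avgs = trailing_avg(values)

# On day 19 the raw value is 19, but the trailing average is 16.0,
# the true value from (window - 1) / 2 = 3 days earlier. Plotting the
# average shifted back by that offset re-centres it on its window.
assert avgs[-1] == 16.0
```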

@kevee
Member Author

kevee commented Jul 20, 2020

@Nosferican all good points, and we have had internal discussions in the past about GraphQL and versionless APIs. The challenge is that a company or organization that owns its data can create and maintain a consistent schema, because it's data they were able to plan and model out. We, however, are trying to reflect a shifting baseline of 56 different states and territories.

I would love to add a GraphQL endpoint to our data, but to do so would add an additional layer of technical and human resources that we do not have.

@cfhp1

cfhp1 commented Jul 21, 2020

Good stuff here, although re csv's, "no effort will be made to keep column names or the layout consistent from one day to the next," seems a shame. I would suggest keeping them the same when reasonably possible, because that is the level at which some people are sure to pull data feeds if they aren't comfortable with JSON:API.

@kevee
Member Author

kevee commented Jul 21, 2020

@cfhp1 - I'll change that wording a bit - we want to definitely give consistent CSV files, but we also don't want folks to be counting on Column 11 being positive tests, which is what is happening now.

@cfhp1

cfhp1 commented Jul 21, 2020

I haven't used BigQuery before, but I see it supports SQL. Will that be an alternative to JSON:API for pulling your data? That would be lovely.

@kevee
Member Author

kevee commented Jul 21, 2020

Yes, you would have to set up a BigQuery project, but that is pretty straightforward.

@Nosferican

I would love to add a GraphQL endpoint to our data, but to do so would add an additional layer of technical and human resources that we do not have.

I might be wrong here, but if I'm not mistaken, by implementing the OpenAPI 3 spec, one gets GraphQL for free (https://loopback.io/openapi-to-graphql.html).

@waded

waded commented Jul 26, 2020

Developers are advised to download and cache API data within their application.

It may not be apparent to API users that some counts are incomplete, estimated, or even wrong when states first publish them and CTP captures them, and that CTP may then backfill/update the counts, and the calculations too, at almost any time.

I suggest warning that cache should be for no more than 1 day. Then, provide insight into how often different fields have seen backfills/updates, and why, in /fields. I understand every state is different, but I'm guessing there are some patterns (e.g. daily hospitalizations can be estimates because not all hospitals report regularly, and a new report can change an old estimate). The documentation's an opportunity to help consumers of the API understand why historical data cached yesterday might change today.

I would also expect HTTP implementation of this API to participate in correct caching by clients, i.e. respecting request cache headers and returning correct responses.

Also: the "most recent record" endpoints may be too much temptation to build something that doesn't take backfill/update into account. The "historic" vs "current" vs "daily" terminology in V1 confused me, and I see you did correct this in V2, but wouldn't it be clearer still to not slice the API by this at all, offering only full history where there's history? (A queryable API of course might allow for this slicing, but I'm not going to advocate for that here.)

@kevee
Member Author

kevee commented Jul 27, 2020

@Nosferican definitely we can get the GraphQL spec, but we would still need to build and maintain an endpoint that implements that spec.

@kevee
Member Author

kevee commented Jul 27, 2020

@waded totally agree, we could either say to follow our Cache-Control headers (which are currently set to 10 minutes) or put something more explicit in there. I'll edit the document to reflect that.

I do think we should just support slicing history by either ALL state data, or state data for an explicit date.

@kevee
Member Author

kevee commented Jul 27, 2020

Also our API does and will continue to do appropriate ETag updates whenever any content changes.

@phreditor

I love that you guys did this whole project. Thank you so much!

As for the update spec: me likey!

One thing that would be very useful would be to include some summary info for a "column" per state, e.g. the max of AL's deathIncrease. I like to normalize this stuff, but that means running through the data multiple times.

If you went this route, you could include some other statistical bits about the data, e.g. min, average, standard deviation. These things can be helpful in filtering noise from the data. Sometimes a state metric will have 40, 100, 1800, -1750, 95, etc. These anomalous spikes are very hard to process sensibly, but if you have some sense of what the data should look like, it makes it easier to detect and substitute for these anomalies in a way that still represents the overall picture correctly.

But, I'd be happy with a max.

@kevee
Member Author

kevee commented Jul 30, 2020

For the record, I dropped the portion of the proposal about dropping CORS support. New configuration options with our API endpoint will allow us to support this without much effort, and it is a nice use case for folks to build interactive UIs without a heavy lift.

@kevee
Member Author

kevee commented Jul 30, 2020

@phreditor - The nice thing is we can keep adding these computed values without just adding tons of new fields to a single object!

@niravabhavsar

For the record, I dropped the portion of the proposal about dropping CORS support. New configuration options with our API endpoint will allow us to support this without much effort, and it is a nice use case for folks to build interactive UIs without a heavy lift.

Thank you! We were worried that we'd have to rearchitect the whole application.

@kevee kevee transferred this issue from COVID19Tracking/website Oct 21, 2020