Working with the Birdwatch data

Data snapshots

The Birdwatch data is released as three separate files: one containing a table representing all Birdwatch notes, one containing a table representing all Birdwatch note ratings, and one containing a table with metadata about notes including what statuses they received and when. These tables can be joined together on the noteId field to create a combined dataset with information about notes and their ratings. The data is released in three separate tables/files to reduce the dataset size by avoiding data duplication (this is known as a normalized data model).
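As a rough illustration of the join described above, the sketch below combines three small in-memory tables on noteId using only the Python standard library. Only the noteId field comes from this documentation. The other column names, the sample values, and the helper names `read_tsv` and `join_on_note_id` are hypothetical. It also assumes one status-history row per note.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-ins for the three TSV files. Only noteId is taken from
# the documentation; every other column name and value is made up.
NOTES_TSV = "noteId\texampleNoteColumn\n1\tfirst note\n2\tsecond note\n"
RATINGS_TSV = "noteId\texampleRatingColumn\n1\tA\n1\tB\n2\tC\n"
STATUS_TSV = "noteId\texampleStatusColumn\n1\tX\n2\tY\n"


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_on_note_id(notes, ratings, statuses):
    """Attach status metadata and all ratings to each note, keyed by noteId."""
    ratings_by_note = defaultdict(list)
    for rating in ratings:
        ratings_by_note[rating["noteId"]].append(rating)
    # Assumption: at most one status-history row per noteId.
    status_by_note = {row["noteId"]: row for row in statuses}

    combined = []
    for note in notes:
        note_id = note["noteId"]
        combined.append({
            "note": note,
            "status": status_by_note.get(note_id),  # None if no status row
            "ratings": ratings_by_note.get(note_id, []),
        })
    return combined


combined = join_on_note_id(
    read_tsv(NOTES_TSV), read_tsv(RATINGS_TSV), read_tsv(STATUS_TSV)
)
```

Keeping ratings as a list per note, rather than repeating the note fields on every rating row, mirrors the normalized layout of the released files.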

Currently, we release one cumulative file each for notes, note status history, and note ratings. However, if the data ever grows too large, we will split it into multiple files as needed.

A new snapshot of the Birdwatch public data is released daily on a best-effort basis. Technical difficulties may occasionally delay a release until the next day, and we are not able to provide guarantees about when this may happen. The snapshots are cumulative files, but they only contain notes and ratings that were created 48 hours or more before the dataset release time. When notes and ratings are deleted, they are no longer included in any future versions of the data downloads. The note status history dataset, however, continues to contain metadata about all scored notes even after they’ve been deleted: the noteId, the creation time, the hashed participant ID of the note’s author, and a history of which statuses each note received and when. The content of the note itself (e.g. the note’s text) is no longer available.
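The two rules above (the 48-hour lag, and status history outliving deleted notes) can be sketched as follows. The helper names, the sample rows, and the release time are hypothetical. A noteId that appears in the status history but not in the notes file is consistent with a deleted note, but this sketch cannot confirm that deletion is the only cause.

```python
from datetime import datetime, timedelta, timezone

SNAPSHOT_LAG = timedelta(hours=48)  # lag stated in the documentation


def latest_included_time(release_time):
    """Latest creation time a snapshot released at release_time can contain."""
    return release_time - SNAPSHOT_LAG


def deleted_note_ids(note_rows, status_rows):
    """noteIds that still have status history but no longer have note content."""
    live_ids = {row["noteId"] for row in note_rows}
    history_ids = {row["noteId"] for row in status_rows}
    return sorted(history_ids - live_ids)


# Hypothetical sample rows; real rows carry many more columns.
note_rows = [{"noteId": "1"}, {"noteId": "2"}]
status_rows = [{"noteId": "1"}, {"noteId": "2"}, {"noteId": "3"}]

release = datetime(2022, 1, 3, tzinfo=timezone.utc)  # made-up release time
cutoff = latest_included_time(release)
missing = deleted_note_ids(note_rows, status_rows)
```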

The data download page in Birdwatch displays a date stamp indicating the most recent date of data included in the downloadable files.

File structure

Each data snapshot table is stored in TSV (tab-separated values) file format with a header row. This means that each row is separated by a newline, each column is separated by a tab, and the first row contains the column names instead of data. The note and note rating data is taken directly from the user-submitted note creation and note rating forms, with only minimal added metadata (such as IDs and timestamps). The note status history file contains metadata derived from the raw notes and ratings, including the outputs of the note scoring algorithm. Below, we describe each column’s data, including the question or source that generated the data, the data type, and other relevant information.
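A minimal sketch of reading a file in this format with Python's standard `csv` module is shown below. The function name `load_snapshot`, the file name `sample.tsv`, and the `exampleColumn` column are hypothetical. The sketch also assumes that fields are not wrapped in CSV-style quotes, which is why it passes `csv.QUOTE_NONE`. Remove that argument if the files turn out to use quoting.

```python
import csv
import os
import tempfile


def load_snapshot(path):
    """Return (column_names, rows) for a TSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        # Assumption: tab-delimited fields with no quoting.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)  # first row holds the column names
        rows = [dict(zip(header, fields)) for fields in reader]
    return header, rows


# Write a tiny made-up file so the example runs without a real download.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.tsv")
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("noteId\texampleColumn\n1\thello\n2\tworld\n")
    header, rows = load_snapshot(sample_path)
```

All values are read as strings. Converting them to numbers or timestamps is left to the caller, since the column types are described separately.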


Updates to the Data

As we iterate and improve Birdwatch, we will occasionally make changes to the questions we ask contributors in the note writing and note rating forms, or to the additional metadata shared about notes and ratings. When we do this, some question fields / columns in our public data will be deprecated (no longer populated), and others will be added. Below we keep a change log of the changes we have made to the contribution form questions and other updates we have made to the data, as well as when those changes were made.

Updated Columns

  • notHelpfulArgumentativeOrInflammatory - Changed name to notHelpfulArgumentativeOrBiased

Added Columns

  • helpfulUnbiasedLanguage

  • notHelpfulOpinionSpeculation

  • notHelpfulNoteNotNeeded

  • Note Helpfulness question now has 3 response categories (Yes, Somewhat, No), rather than 2 (originally: Yes, No)

  • We have removed the ‘Agree’ note rating question

  • We have updated the set of categories contributors can use to describe why a note is helpful or unhelpful. (Note: both helpful and unhelpful descriptors can be selected for notes that are rated as ‘Somewhat’ Helpful)


Deprecated Columns

  • helpful - Replaced with helpfulnessLevel
  • notHelpful - Replaced with helpfulnessLevel
  • helpfulInformative
  • helpfulEmpathetic
  • helpfulUniqueContext
  • notHelpfulOpinionSpeculationOrBias
  • notHelpfulOutdated
  • notHelpfulOffTopic

Added Columns

  • helpfulnessLevel
  • helpfulAddressesClaim
  • helpfulImportantContext
  • notHelpfulIrrelevantSources