INDIVIDUAL DATA
Documentation & Methodology

The sections below contain detailed information about our data collection (i.e., web scraping), cleaning, and aggregation processes and architecture. If you are aware of any complete, publicly available, web-based jail rosters that could serve as data sources from which we are not already collecting, or if you observe any inconsistencies in our data that you suspect may originate in faulty logic or incorrect classification, please reach out to us at questions@jaildatainitiative.org.

We believe in transparency and humanizing language. We are currently scrubbing our GitHub code repository of sensitive information, with the intention of making it publicly available. If you need access to our code in the interim, please reach out. We try to avoid the use of words like "inmate". However, our code was a collaborative effort, and in some cases, variables in our code were labeled using this term, which may appear in the documentation below. We apologize, and we will attempt to remove these as part of our code cleaning process.

Sample Selection

The JDI team used the Bureau of Justice Statistics (BJS) 2013 Census of Jails to manually search for daily jail rosters posted online on the websites of county and municipal governments, local sheriffs, and detention facilities. Of these, we further identified locales where rosters could feasibly be scraped (i.e., by excluding those with unsolvable CAPTCHAs, full-name search requirements, etc.). This process has resulted in the drafting of scrapers for approximately one-third of the 3,163 jails identified by the 2013 Census of Jails.

Representativeness

There may be concern about the representativeness of the JDI sample along geographic and demographic dimensions. In a published report for the Council on Criminal Justice (see page 23), using population data from the 2019 American Community Survey, we explored the representativeness of the JDI sample, finding that it was reasonably representative of the national population.

Scraping Architecture

JDI uses an object-oriented scraping architecture in Python, with three levels of scraper classes and associated scraping methods (sketched after the list below):

  1. SuperScraper is a top-level framework for attaching metadata, formatting data in certain fields, and expanding nested data for consumption as multiple CSVs. This class includes specialized methods for interacting with sites via raw HTTP requests, using the Python Requests library.
  2. SeleniumScraper and PDFScraper are intermediate classes with specialized methods for interacting with sites requiring automated browser navigation and with PDF rosters, respectively. For the former, we use the Selenium library. For the latter, we use a variety of packages, given the particularities of PDF parsing, including PDFMiner, AWS Textract, Tabula, and others.
  3. Generic platform scrapers and site-specific scrapers form the lowest level of our class framework, either scraping Jail Management Systems (JMS) used across multiple locales (e.g., the BluHorse JMS platform) or handling individual sites.
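
A minimal sketch of this hierarchy (class and method names here are simplified stand-ins, not our exact implementation):

  import requests

  class SuperScraper:
      """Top-level framework: metadata, field formatting, nested-data expansion."""
      def fetch(self, url):
          # Raw HTTP request via the Requests library
          return requests.get(url, timeout=30)

  class SeleniumScraper(SuperScraper):
      """Intermediate class for sites requiring automated browser navigation."""

  class BluHorseScraper(SuperScraper):
      """Platform-level scraper reused across locales running the BluHorse JMS."""
      def scrape(self, roster_url):
          html = self.fetch(roster_url).text
          ...  # parse html into a list of per-person objects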

SuperScraper conducts some basic extractions and standardizations on the raw data (a sketch follows the list). These include:

  1. Converting names to a uniform First Middle Last Suffix style, if presented differently on the roster.
  2. Extracting a float-type amount field corresponding to any numerical field. For example, if a roster reports Total Bail: "$1,000 unsecured bond", the scraper will capture both Total_Bail_Str: "$1,000 unsecured bond" and Total_Bail_Amount: 1000.00.
  3. Converting date and time fields to the standard formats YYYY-MM-DD and HH:MM, respectively.
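
A minimal sketch of the amount and date standardizations (the regex and input format are illustrative; our production logic handles many more cases):

  import re
  from datetime import datetime

  def extract_amount(raw):
      # Capture the first numeric value in a string like "$1,000 unsecured bond"
      match = re.search(r"[\d,]+(?:\.\d+)?", raw)
      return float(match.group().replace(",", "")) if match else None

  def standardize_datetime(raw, in_fmt="%m/%d/%Y %I:%M %p"):
      # e.g., "01/01/2020 11:59 PM" -> ("2020-01-01", "23:59")
      dt = datetime.strptime(raw, in_fmt)
      return dt.strftime("%Y-%m-%d"), dt.strftime("%H:%M")

  extract_amount("$1,000 unsecured bond")  # -> 1000.0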

Structurally, JDI captures relational nested data in separate CSVs that can be merged back into individual-level data. Scrapers output a list of objects that map one-to-one to individuals on the jail roster. If an individual's record contains a nested (one-to-many) data field (e.g., multiple charges, bails, or holds), it is captured as a nested list of objects. During formatting, the field is broken out into its own CSV with one row per field value per individual. CSVs can be merged on an Inmates_Row_ID field, as in the sketch below.
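
For example, a data consumer could rejoin the tables roughly as follows (a pandas sketch; the file names match the directory listing further below):

  import pandas as pd

  inmates = pd.read_csv("Inmates.csv")  # one row per individual
  charges = pd.read_csv("Charges.csv")  # one row per charge per individual

  # Re-attach nested charge rows to their individuals
  merged = charges.merge(inmates, on="Inmates_Row_ID", how="left")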

Data provider error

Data presented on rosters may be inaccurate (e.g., an individual reported as 120 years old), and fields may be mislabeled (e.g., a roster may incorrectly report education level as Ethnicity). For the former, we include some exclusion criteria in field standardizations. For the latter, we try to correct as appropriate. Additionally, data may exhibit temporal incongruity (e.g., an individual reported as male one day and female the next). We cannot eliminate these sources of error.

Formatting error

Certain fields may be presented in unique formats that are not well-handled by the formatting methods we have specified, resulting in errors such as the misplacement of a suffix between an individual’s middle name and surname.

Schematic error

We began drafting scrapers in 2019. First scrape dates thus vary across the JDI sample, with a majority beginning in early-to-mid 2020. In some cases, code review has resulted in script revisions, requiring removal of older data to an archive bucket (our storage structure is elaborated below). Additionally, there are many hundreds of field name variations across jail rosters (e.g., booking date might be expressed as "Booking Date", "Admission Date", "Intake Date", etc.). Schematic reconciliation of our data is an ongoing process (illustrated below), useful for standardizing certain fields but not required by our NoSQL database architecture (also discussed further below).
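
In practice, reconciliation amounts to maintaining alias maps along these lines (the aliases shown are illustrative, not our full mapping):

  FIELD_ALIASES = {
      "Booking_Date": {"Booking Date", "Admission Date", "Intake Date"},
  }

  def canonical_field(raw_name):
      # Map a roster-specific field name to its canonical equivalent, if known
      for canonical, aliases in FIELD_ALIASES.items():
          if raw_name in aliases:
              return canonical
      return raw_name  # unmapped names are permitted by our NoSQL schema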

Incomplete scraping

In general, we have tried to ensure that we only scrape rosters when we are certain we can successfully document all jailed individuals. In some search-based cases, there may be sources of error (e.g., in a letter-based search, one letter may return an errant response page). Additionally, certain individual details may be continued on a separate page, requiring a separate request and a separate scrape. We have created a thresholding process for the successful collection of these additional details, but earlier data may contain fields that were inconsistently captured (e.g., an individual successfully documented as male on one day, unsuccessfully scraped and therefore documented with null gender the next day, and scraped again as male the third day).

Note: PDF parsing tends to be the least robust process for data collection. As such, data scraped from PDF jail rosters are likely the largest source of formatting and schematic error across our data sample. PDFs account for approximately 120 of the rosters in our sample, and tend to be used by smaller jurisdictions.

Scheduling & Storage

JDI collects information from a given jail once per day, when successful. Each scrape job is triggered by a Bash script early in the day (Eastern Standard Time) and runs on a dedicated virtual private server hosted by Ionos. At the completion of a successful scraping process, CSVs are sent to permanent storage in an access-restricted AWS S3 bucket (a sketch of the upload step follows the directory listing). Directory structure is as follows:

  • s3://{bucket-name}/AL/Autauga/2020-01-01/Bonds.csv
  • s3://{bucket-name}/AL/Autauga/2020-01-01/Charges.csv
  • s3://{bucket-name}/AL/Autauga/2020-01-01/Inmates.csv
  • s3://{bucket-name}/AL/Autauga/2020-01-02/Bonds.csv
  • ...
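
The upload step can be pictured as follows (a boto3 sketch; the bucket name and table list are placeholders):

  import boto3

  s3 = boto3.client("s3")

  def upload_csvs(state, county, scrape_date, local_dir, bucket="BUCKET_NAME"):
      for table in ("Inmates", "Charges", "Bonds"):
          key = f"{state}/{county}/{scrape_date}/{table}.csv"
          s3.upload_file(f"{local_dir}/{table}.csv", bucket, key)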

The statuses of individual scrapes and associated metadata are tracked in an Airtable base. Ahead of the scraping process each day, each scrape's STATUS field on Airtable is reset. If the scraping script runs without error, its STATUS is updated to Good, and no further scraping is attempted that day. If it fails, its STATUS is updated to Bad, and a failure log is sent to an S3 bucket. A retry script runs hourly for eight hours each day, rerunning any scraping script that has failed that day (failures may be exogenous, like downed servers, or endogenous, like validation exceptions while consuming data from the page).
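
A sketch of the retry pass (the Airtable base, table, and field names are illustrative; the REST calls follow Airtable's documented GET/PATCH record pattern):

  import requests

  AIRTABLE_URL = "https://api.airtable.com/v0/BASE_ID/Scrapes"
  HEADERS = {"Authorization": "Bearer API_KEY"}

  def get_status(record_id):
      resp = requests.get(f"{AIRTABLE_URL}/{record_id}", headers=HEADERS)
      return resp.json()["fields"].get("STATUS")

  def set_status(record_id, status):
      requests.patch(f"{AIRTABLE_URL}/{record_id}", headers=HEADERS,
                     json={"fields": {"STATUS": status}})

  def retry_failed_scrapes(scrapers):
      # scrapers: {airtable_record_id: scraper_object}
      for record_id, scraper in scrapers.items():
          if get_status(record_id) != "Bad":
              continue  # only rerun scripts that failed earlier today
          try:
              scraper.run()
              set_status(record_id, "Good")
          except Exception:
              set_status(record_id, "Bad")  # failure log also sent to S3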

Two scrape status safeguards are implemented automatically (sketched after this list):

  1. At the end of each day, any script that has failed for 14 consecutive days will have its STATUS set to Rewrite and be removed from the daily job list for manual review.
  2. If a scrape continues to succeed but we observe a static population with no new admissions or releases for 21 days, its STATUS is automatically set to Rewrite and redundant CSVs are moved from the active S3 bucket into an archive bucket. We may then manually set its STATUS to Hanging Roster, to indicate that the roster is no longer being updated regularly.
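
A minimal sketch of these safeguards (dictionary keys are illustrative):

  def apply_safeguards(scrape):
      # scrape: dict of tracked metadata for one scraping script
      if scrape["consecutive_failures"] >= 14:
          scrape["STATUS"] = "Rewrite"  # removed from the daily job list
      elif scrape["days_without_turnover"] >= 21:
          scrape["STATUS"] = "Rewrite"  # redundant CSVs archived; may later be
                                        # manually set to "Hanging Roster"
      return scrape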

Additional scrape status adjustments occur manually. For example, if we observe a scrape failing repeatedly, we may see that the roster page is no longer responding at all, and set that scraper’s STATUS to Website Down.

Data provider error

Although we attempt to correct for outdated information as above, we cannot perfectly ensure that data collected from rosters are current as of the scrape date.

Data continuity

The stopping, starting, or hanging of rosters may result in bookings with inaccurate start and end dates in our database. We address these using a variety of methods. During a weekly Python continuity check, we create flags for incomplete bookings to account for prolonged scrape failures: if a booking intersects a data gap of 7 or more days in either direction, it is flagged as such; bookings that fully span such gaps are also flagged; and any booking that intersects the first scrape date for a jail is flagged (see the sketch below). If we were reporting, e.g., a mean or median length of stay, we would typically exclude any of the above bookings (except those that fully span gaps).
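
A sketch of the flagging criteria (date handling simplified; names are illustrative):

  def continuity_flags(first_seen, last_seen, gaps, first_scrape_date):
      """gaps: list of (gap_start, gap_end) date pairs of 7+ missed days."""
      flags = []
      for gap_start, gap_end in gaps:
          if first_seen <= gap_end and last_seen >= gap_start:
              flags.append("intersects_gap")
              if first_seen < gap_start and last_seen > gap_end:
                  flags.append("spans_gap")
      if first_seen <= first_scrape_date:
          flags.append("intersects_first")
      return flags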

Intra-day bookings

For uniformity, and because some scraping scripts require considerable runtime, we only require one successful scrape per day. As such, we necessarily omit any reported intra-day bookings that do not intersect our scraping windows. For example, if a scrape runs and succeeds at 3:00AM EST, and someone is booked at 5:00PM EST and released the same evening at 7:30PM EST, that person will not be recorded in our database.

Data Processing

Our MongoDB booking-level database collection stores documents corresponding to multi-day bookings (more information below). As such, as part of the data migration process from CSVs to the database, we algorithmically augment records so that "bookings" and "people in jail" can be compared.

First, a jdi_booking_id field is created with the following hierarchy:

  1. If available, booking number;
  2. Else if available, name and booking date;
  3. Else, jdi_inmate_id (see below).

A jdi_inmate_id field is created with the following hierarchy (both hierarchies are sketched after this list):

  1. If available, inmate number;
  2. Else if available, name and date of birth;
  3. Else, just name (with an identifier quality flag; more information below).
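
Both hierarchies reduce to fallback logic along these lines (field names simplified; records are dicts):

  def make_jdi_inmate_id(rec):
      if rec.get("Inmate_ID"):
          return rec["Inmate_ID"]
      if rec.get("Name") and rec.get("DOB"):
          return f'{rec["Name"]}_{rec["DOB"]}'
      return rec.get("Name")  # last resort; identifier quality flag is set

  def make_jdi_booking_id(rec):
      if rec.get("Booking_Number"):
          return rec["Booking_Number"]
      if rec.get("Name") and rec.get("Booking_Date"):
          return f'{rec["Name"]}_{rec["Booking_Date"]}'
      return make_jdi_inmate_id(rec)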

Next, demographic standardizations are conducted. In a separate process, any new values for a specified set of fields (Race, Ethnicity, Sex, Gender, Classification, Charges.Classification) are manually encoded on Airtable. Other fields, such as those related to age, are standardized according to specified algorithms. Indicator meta-fields are also created here; for example, a text search across all fields looks for indicators that a person in jail is being held on behalf of Immigration & Customs Enforcement (ICE), and if located, an ICE_Standardized field is created (a sketch follows). Finally, charges are categorized at three levels of granularity, L1 through L3, according to a mapping produced periodically from the Criminal Justice Administrative Records System (CJARS) classification model. This model is maintained by the Institute for Social Research at the University of Michigan and was created in partnership with Measures for Justice. Any new charge strings that have not yet been mapped by CJARS are classified as "TBD" at all levels.
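
The ICE indicator, for instance, is essentially a keyword scan across all string fields, including nested ones (the pattern here is abbreviated):

  import re

  ICE_PATTERN = re.compile(r"\b(ICE|IMMIGRATION)\b")  # abbreviated pattern

  def ice_indicator(record):
      # Recursively search every string-valued field for ICE-related keywords
      def walk(value):
          if isinstance(value, str):
              return bool(ICE_PATTERN.search(value.upper()))
          if isinstance(value, dict):
              return any(walk(v) for v in value.values())
          if isinstance(value, list):
              return any(walk(v) for v in value)
          return False
      return walk(record)  # if True, an ICE_Standardized field is created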

Next, a data collection containing the change histories of bookings, stored as patches, is updated. After records are matched to existing bookings in the database, if any of their field values have changed, been added, or gone missing, timestamped patch records are created indicating the change type and the old/new values. Patches can be used to easily revert bookings to their states on previous dates. However, not all fields are included in the patching process, due to their frequency of change (e.g., days in custody, which increments daily). An example patch document, indicating an 18-year-old person turning 19 while in custody, might look like the following (a sketch of applying such a patch appears after the example):

  1. _id:ObjectId("12345")
  2. primary_doc_id:ObjectId("98765")
  3. change_date:2021-06-01T00:00:00.000+00:00
  4. state:"NY"
  5. county:"New_York_City"
  6. patch:Array
    1. 0:Object
      1. op:"replace"
      2. path:"/Age"
      3. value:"19"
  7. patch_str:"[{\"op\": \"replace\", \"path\": \"/Age\", \"value\": \"19\"}]"
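
The patch arrays follow the JSON Patch (RFC 6902) format, so a booking state can be rolled forward (or, using the stored old values, rolled back) with a standard library such as jsonpatch:

  import jsonpatch

  booking = {"Age": "18"}  # minimal stand-in for a booking document
  patch = [{"op": "replace", "path": "/Age", "value": "19"}]

  updated = jsonpatch.apply_patch(booking, patch)  # {"Age": "19"}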

Finally, people in jail are de-duplicated on jdi_booking_id and minor field cleanup is conducted.

Notes: (i) booking matching is given a 10-day "look-back" leeway period, which accounts for the possibility of "weekenders," individuals who are rebooked on weekends and released on weekdays to permit continued employment; (ii) as a data quality check, we compare each day's record against a completeness threshold of 75% of the previous day's fields. If a record falls below this threshold for fewer than three days, we ignore the changes and keep the more complete data; otherwise, we let the patches through.

Data continuity

The fields involved in the definition of bookings and individuals may not always be available or correct, or they may be scraped inconsistently if they are reported via detail pages. Further, the last-resort use of name as jdi_inmate_id is susceptible to duplication, particularly for common names. We account for this in part by creating a non_distinct_jdi_inmate_id flag for bookings that share a jdi_inmate_id but overlap in time. Additionally, when applied to rosters that only report name as jdi_inmate_id, the look-back period described above may slightly inflate booking durations.

Versioning error

As described in other sections, reporting and scraping may be inconsistent, which can produce faulty patch documents. There may also be errant patches resulting from retroactive data cleaning processes, which we may need to review or rectify.

Misclassification error

The CJARS supervised classification model may misclassify charges (e.g., charges that merely indicate holds for other counties may be misclassified as violent crime). Manual classification of demographic fields may result in misclassification of demographic characteristics.

Database Architecture

The processes described in the previous section are wrapped in a Python script that runs every few hours on a second dedicated Ionos virtual private server. This script scans our main S3 bucket and creates a job queue for any new CSV files; bookings and patches are then bulk-written to the database. The bookings data collection always represents the most recent versions of jail bookings; patches must be used to recreate older versions.
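
Schematically (the bucket, database, and collection names are placeholders, and parse_csv is a hypothetical helper):

  import boto3
  from pymongo import MongoClient, UpdateOne

  s3 = boto3.client("s3")
  bookings = MongoClient("MONGODB_URI").jdi.bookings

  def process_new_csvs(bucket, seen_keys):
      # Scan the bucket and queue any CSVs we have not yet consumed
      queue = []
      for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
          for obj in page.get("Contents", []):
              if obj["Key"].endswith(".csv") and obj["Key"] not in seen_keys:
                  queue.append(obj["Key"])
      # parse_csv (hypothetical) yields booking dicts from one CSV
      ops = [UpdateOne({"jdi_booking_id": b["jdi_booking_id"]},
                       {"$set": b}, upsert=True)
             for key in queue for b in parse_csv(bucket, key)]
      if ops:
          bookings.bulk_write(ops)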

All of our databases are hosted on MongoDB clusters managed by MongoDB Atlas. MongoDB is a NoSQL document database, which permits flexible schematic specification. It automatically creates a unique ObjectId key for each new document, which can be used for querying in addition to configurable indices on other fields. An example synthetic booking document (fields chosen randomly) might look like this:

  1. _id:ObjectId("12345")
  2. Age:"18"
  3. Booking_Date:"2020-01-01"
  4. Booking_Time:"23:59"
  5. Charges:Array
    1. 0:Object
      1. Arresting_Agency:"NYPD"
      2. Bond_Amount:100000.0
      3. Bond_Str:"$100,000.00 cash bond"
      4. Docket_Number:"F-13579"
      5. Charge:"DIST- CANNABIS (ATT)"
      6. Charge_Standardized:Object
        1. l1:"Drug"
        2. l2:"Distribution of cannabis"
        3. l2:"Attempted distribution of cannabis"
    2. 1:Object
  6. Inmate_ID:"98765"
  7. Name:"JANE DOE"
  8. Race:"White"
  9. Gender:"Unknown"
  10. jdi_booking_id:"JANE DOE_2020-01-01"
  11. meta:Object
    1. first_seen:2020-01-01T00:00:00.000+00:00
    2. last_seen:2020-12-31T00:00:00.000+00:00
    3. Scrape_Date:2020-12-31T00:00:00.000+00:00
    4. State:"NY"
    5. County:"New_York_City"
    6. Facility_Name:"Rikers Correctional Complex"
    7. State_Code:"1"
    8. Jail_ID:"123"
    9. flags:Array
      1. 0:"intersects_first"
      2. 1:"spans_gap"
    10. jdi_inmate_id:"98765"
  12. Race_Ethnicity_Standardized:"wu__"
  13. Sex_Gender_Standardized:"_f"
  14. Age_Standardized:18
  15. Bond_Standardized:Object
    1. field:"Charges_Bond_Amount_Standardized"
    2. value:250000
    3. flags:Array

Although these data are not relational, they can be queried using a variety of mechanisms, and the equivalents of normal SQL-style queries can be expressed as necessary.
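
For example, with pymongo (the URI, database, and collection names are placeholders; field names match the example document above):

  from datetime import datetime
  from pymongo import MongoClient

  bookings = MongoClient("MONGODB_URI").jdi.bookings

  # All bookings first seen in New York City jails during June 2021
  cursor = bookings.find({
      "meta.State": "NY",
      "meta.County": "New_York_City",
      "meta.first_seen": {"$gte": datetime(2021, 6, 1),
                          "$lt": datetime(2021, 7, 1)},
  })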

Data Access

We offer three levels of data access:

  1. Publicly available data aggregations are displayed on this website, with download options for each type of aggregation. For example, daily populations, admissions, and releases, granular to the facility and day, are downloadable as CSVs.
  2. Restricted individual-level data are available on our website for people who have been granted authentication credentials. Individuals can complete our Data Use Agreement (DUA), upon submission of which an email is automatically sent to our team for review. If we approve, we set up a profile for the individual (we manage authentication with Auth0) and reply with credentials. Once signed in, users can search historical jail rosters (downloadable as zipped CSV files) and search bookings by name and charge, among other data views.
  3. Restricted API data access is available following the completion of the DUA process described above. Individuals who may need access to higher-volume or -frequency data will be given requisite API keys and information to read data from MongoDB directly. For instance, local organizations that want to review charges for the prior month over an entire state can process data this way.

At the aggregate level, we try to be cognizant of identifying information accidentally surfacing. This might occur, e.g., if a charge string includes a warrant or docket number. As we identify such instances we scrub them.

Aggregation

Data can easily be aggregated across the set of standardized fields and metadata. We pre-aggregate some data to facilitate fast data transfer and surfacing, e.g., in a script that aggregates and interpolates daily jail traffic by demographic group from one database collection to another. Otherwise, we tend to aggregate data within APIs written in JavaScript.

As an example of a data quality consideration at this level: in order to smooth population time trends, we interpolate population, admissions, and releases, by demographic group, roster, and date, over days on which a scraping script failed (within the confines of the gap-related criteria specified above). For population, we interpolate linearly from the last scrape date prior to the gap to the first scrape date after it. For admissions and releases, we redistribute the counts from the first/last dates uniformly over the gap. Because we also provide the original data in these cases, data consumers are welcome to apply their own interpolation methods.
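
A sketch of the population interpolation using pandas (values are illustrative):

  import pandas as pd

  # Daily population with a three-day scrape gap (NaN on failed days)
  pop = pd.Series([100, None, None, None, 120],
                  index=pd.date_range("2021-06-01", periods=5))

  pop = pop.interpolate(method="linear")
  # The gap days become 105.0, 110.0, 115.0

  # Admissions/releases at the gap edges are instead spread uniformly over the gap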