OCR-Derived PURSUE Release 1 UAP Corpus from FBI and USAF Records (≈3,000 Pages, May 2026 Release, DOI 10.5281/zenodo.20108542)

To the point

Miguel Pavón released a machine-readable OCR corpus of the PURSUE Release 1 declassified UAP records, about 3,000 pages from FBI and USAF files, intended for computational analysis and data integration. The source documents are in the public domain, the OCR output is released under CC0, the work is licensed under CC BY 4.0, and the DOI was published in May 2026.

zenodo.org

OCR Full-Text Corpus of PURSUE Release 1 Declassified UAP Records (FBI Case File 62-HQ-83894 and USAF Project Blue Book Box 7)

Machine-readable plain-text corpus extracted via Optical Character Recognition (Mistral AI mistral-ocr-latest) from the 18 PDF source documents released under Tranche 1 of PURSUE (Presidential Unsealing and Reporting System for UAP Encounters) by the U.S. Department of War on May 8, 2026 (war.gov/UFO). The corpus covers two major archival collections:

FBI Case File 62-HQ-83894 (Flying Saucers): 10 sections, 7 individual serials (130, 153, 164, 220, 403, 438, 449), and Sub-file A. Approximately 2,300 pages spanning July 1947 to 1967.

USAF Project Blue Book Box 7, Incident Summaries 1–233: three files covering the first 233 USAF-documented UAP cases (June 1947 – January 1949), approximately 531 pages.

Each source document is provided as a single plain-text file with pages delimited by ---PAGE--- separators. Total corpus: approximately 3,000 pages of machine-readable text. Intended for computational analysis, database ingestion, and cross-referencing with other UAP research corpora.

Source documents are in the public domain (U.S. government works, 17 U.S.C. § 105). OCR output released under CC0.
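Since each document is described as one plain-text file with ---PAGE--- separators, a per-page split is a natural first step before database ingestion. The following is a minimal Python sketch, not part of the release. The function names, the .txt extension, the UTF-8 encoding, and the assumption that a separator may appear at the very start or end of a file are all illustrative guesses and should be checked against the actual files.

```python
from pathlib import Path

PAGE_SEPARATOR = "---PAGE---"  # separator string named in the corpus description


def split_pages(text: str) -> list[str]:
    """Split one document's OCR text into a list of page strings."""
    pages = [chunk.strip() for chunk in text.split(PAGE_SEPARATOR)]
    # Assumption: a file may begin or end with a separator, which yields an
    # empty chunk at either end. Only those are trimmed; interior blank pages
    # are kept so list positions stay aligned with the source page order.
    while pages and not pages[0]:
        pages.pop(0)
    while pages and not pages[-1]:
        pages.pop()
    return pages


def load_corpus(directory: str) -> dict[str, list[str]]:
    """Map each .txt file name in `directory` to its list of pages.

    The .txt extension and UTF-8 encoding are assumptions about the release.
    """
    return {
        path.name: split_pages(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("*.txt"))
    }
```

Calling load_corpus on the folder holding the downloaded text files would return a dictionary keyed by file name, with each value a list of page strings in document order. If a source PDF genuinely begins or ends with a blank page, the trimming step would drop it, so page offsets are worth spot-checking against the PDFs.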