
By April Simpson, Pratheek Rebala and Alexia Fernández Campbell


Every investigative journalist has been there. 

It’s early in an investigation, and the problem is the size of 27 football fields. That’s how much space the documents could cover if we laid them out. Where do we begin? 

That’s how we felt at the beginning of what would become “40 Acres and a Lie,” a historical investigation by the Center for Public Integrity in collaboration with Reveal and Mother Jones. We wanted to know how many formerly enslaved people received land through Sherman’s Special Field Orders 15, commonly known as “40 Acres and a Mule.” No one had nailed down exactly how many newly freed men and women received land titles for 4- to 40-acre plots through the program, a tangible opportunity to restart their lives after slavery. But about a year and a half after they got the land, the federal government took nearly all of it back.

We had found dozens of names and land titles among these records. But the collection of documents we wanted to analyze was daunting — about 1.8 million images — and old, dating to the mid-to-late 19th century, during and after the Civil War. Back then, the language was different. The keywords were different. Take this line: “he has permission to hold and occupy the said Tract, subject to such regulations as may be established by proper authority.” All business was done on paper, mostly written by hand.

The first thing we needed to do was make these 19th-century papers searchable and friendly to users in today’s digital world.

Not long ago, that process would have required hundreds of hours of scanning documents and making PDFs. But we turned to artificial intelligence. The team, led by Pratheek Rebala, found a way to use machine learning to analyze the documents at scale.

The documents were from the Freedmen’s Bureau, a federal agency that helped formerly enslaved people transition to freedom during early Reconstruction. Its archives included documents like marriage certificates and bank records. But we were interested in finding more land titles related to Special Field Orders 15. The titles would tell us who was given property through the 40-acre program, only to have then-President Andrew Johnson strip the land from them and return it to their former enslavers.

But of the nearly 2 million documents digitized by FamilySearch (which is operated by the Mormon Church), only about 500,000 had been transcribed by Smithsonian volunteers. Our initial approach, keyword searching the transcribed documents, would have excluded three-quarters of the collection. And the poor-quality images, chicken-scratch handwriting and unusual vocabulary rendered traditional optical character recognition (OCR) and data extraction tools ineffective.

Thankfully, several recent developments in machine learning allowed us to search both the transcribed and non-transcribed documents.

We developed a process to enhance machine-generated transcriptions by aligning them with those created by Smithsonian transcribers. This allowed us to represent both transcribed and non-transcribed documents in the same format, capturing their visual appearance and content as data points. This approach made it possible to search and analyze all documents, even those that had not been manually transcribed.
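To make that concrete, here is a simplified sketch of one way to encode each page’s appearance, plus its transcription when one exists, into a single vector. It is not our production code: the specific model (an off-the-shelf CLIP checkpoint from the Hugging Face transformers library) and the simple sum used to fuse the image and text vectors are illustrative assumptions.

```python
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_document(image_path: str, transcription: str | None = None) -> np.ndarray:
    """Return one L2-normalized vector per document page.

    Untranscribed pages get an image-only vector; transcribed pages blend
    in a text vector, so both kinds share one searchable vector space.
    """
    image = Image.open(image_path).convert("RGB")
    pixels = processor(images=image, return_tensors="pt")
    vec = model.get_image_features(**pixels).detach().numpy()[0]
    if transcription:
        # CLIP's text encoder tops out at 77 tokens, so long pages are truncated.
        tokens = processor(text=[transcription], return_tensors="pt",
                           padding=True, truncation=True)
        vec = vec + model.get_text_features(**tokens).detach().numpy()[0]
    return vec / np.linalg.norm(vec)
```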

Leveraging this standardized representation of the documents, we could search the entire collection in four key ways. First, readers could use keywords, such as “work contracts” or “school records,” to find those types of documents.
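As a rough sketch of that keyword layer, the example below queries a full-text index built with SQLite’s FTS5 extension; the database and table names are hypothetical, and our actual search stack may have looked quite different.

```python
import sqlite3

# Hypothetical database and schema for illustration only.
conn = sqlite3.connect("freedmens_bureau.db")
conn.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS docs "
    "USING fts5(doc_id UNINDEXED, transcription)"
)

def keyword_search(query: str, limit: int = 20) -> list[tuple[str, str]]:
    """Return (doc_id, snippet) pairs, best matches first (FTS5's BM25 rank)."""
    rows = conn.execute(
        "SELECT doc_id, snippet(docs, 1, '[', ']', '...', 12) "
        "FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return list(rows)

# e.g. keyword_search('"work contract"') for an exact phrase,
# or keyword_search('school AND records') for both terms anywhere.
```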

Second, they could search based on visual similarity, since our tool captured the appearance and layout of each document. Fortunately, the land titles had some distinct features: each was the size of a 3-by-5 index card, with bold text at the top and a signature at the bottom. So we developed an image recognition system to identify them. It’s like facial recognition but for brittle, handwritten, 19th-century documents.
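The matching step can be as simple as cosine similarity between embeddings. This sketch builds on the embed_document() helper above, ranking every page against the vector of one known land title; the top-k ranking shown is an illustrative choice rather than our exact scoring method.

```python
import numpy as np

def find_lookalikes(query_vec: np.ndarray, doc_vecs: np.ndarray,
                    doc_ids: list[str], top_k: int = 50) -> list[tuple[str, float]]:
    """Rank pages by cosine similarity to one known land-title embedding.

    doc_vecs is an (n_docs, dim) array of L2-normalized vectors, so a dot
    product with the (also normalized) query vector is the cosine similarity.
    """
    scores = doc_vecs @ query_vec
    top = np.argsort(scores)[::-1][:top_k]
    return [(doc_ids[i], float(scores[i])) for i in top]

# query_vec = embed_document("known_land_title.jpg")  # from the sketch above
```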

Third, we used layout classification. This allowed us to find, for example, all documents that contained a table or all copies of a particular form.
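One lightweight way to approximate layout classification is to hand-label a small sample of pages and train a simple classifier on their embeddings; the scikit-learn logistic regression below is an assumed stand-in, not necessarily the method we used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def flag_layout(labeled_vecs: np.ndarray, labels: np.ndarray,
                all_vecs: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Train on a small hand-labeled sample (1 = page contains a table,
    0 = it does not), then flag likely matches across the whole archive."""
    clf = LogisticRegression(max_iter=1000).fit(labeled_vecs, labels)
    probs = clf.predict_proba(all_vecs)[:, 1]
    return np.flatnonzero(probs >= threshold)  # indices of likely table pages
```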

Fourth, we clustered the documents according to abstract topics to improve the quality of search results.
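As a sketch of the idea, k-means over the document vectors assigns each page an abstract topic; the algorithm and the number of clusters are assumptions here, since all that matters is that similar documents end up grouped together.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_topics(doc_vecs: np.ndarray, n_topics: int = 40) -> np.ndarray:
    """Assign each document an abstract topic id in [0, n_topics)."""
    km = KMeans(n_clusters=n_topics, n_init=10, random_state=0)
    return km.fit_predict(doc_vecs)

# Topic ids can then be stored alongside each document and used to group,
# filter or re-rank search results.
```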

Before this approach, we were reviewing records the old-fashioned way, click by click. With the tool, we could quickly analyze a large trove of records and identify the names of hundreds more people who received land titles.

We ultimately retrieved 1,250 names, making it the largest collection of 40-acre land title holders to date. 

Public Integrity journalists created at least 100 family trees and identified 41 living descendants, several of whom we interviewed for this project. Some had no idea their ancestors had received land as part of the program. 

Next, we used georeferencing, the process of fitting a picture of a map to actual geography, to verify locations. We acquired maps from the 1860s and 1870s for coastal Georgia and South Carolina. We spent several months using GIS tools, such as ArcMap and QGIS, to georeference those maps against current maps and approximate the locations of plantations that were divided up and given to freedmen and women. This was challenging because 1800s surveying was not as accurate as it is today, and the coastal areas have changed shape with rising water levels due to climate change.
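Under the hood, georeferencing amounts to fitting a transform from ground control points: pixel positions on the scanned map matched by hand to modern coordinates. The toy example below fits a simple affine transform by least squares; real workflows in ArcMap and QGIS use higher-order transforms and proper map projections, and the control points shown are made up.

```python
import numpy as np

def fit_affine(pixel_pts: np.ndarray, geo_pts: np.ndarray) -> np.ndarray:
    """Solve for A in geo ~= [px, py, 1] @ A, where A has shape (3, 2)."""
    design = np.hstack([pixel_pts, np.ones((len(pixel_pts), 1))])  # (n, 3)
    A, *_ = np.linalg.lstsq(design, geo_pts, rcond=None)
    return A

def pixel_to_geo(A: np.ndarray, px: float, py: float) -> tuple[float, float]:
    """Map a pixel on the scanned map to (longitude, latitude)."""
    lon, lat = np.array([px, py, 1.0]) @ A
    return float(lon), float(lat)

# Three or more hand-matched control points are enough to solve the six
# affine parameters; these coordinates are purely illustrative.
pixel_pts = np.array([[120, 80], [940, 110], [530, 760], [180, 700]], float)
geo_pts = np.array([[-81.10, 32.08], [-80.85, 32.07],
                    [-80.97, 31.88], [-81.08, 31.90]], float)
A = fit_affine(pixel_pts, geo_pts)
print(pixel_to_geo(A, 500, 400))
```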

Finally, and most importantly, we made all this information accessible to the public through a searchable database. The tool allows people to search nearly 2 million Freedmen’s Bureau documents, including records about education, labor contracts, marriages and more. This way, genealogists and others can more quickly do research, and descendants of formerly enslaved people can check whether their ancestors received government-issued land titles. 

We are hoping that by making this tool available, more people can learn about their ancestors, and also understand what life was like in the immediate aftermath of the Civil War. At a time when the U.S. government is erasing history by removing and altering key information related to healthcare, crime data, LGBTQ history, environmental justice and more, preserving that information has never been more important.

Hollis Gentry, who oversees the Smithsonian Institution’s Freedmen’s Bureau Digital Records Project, said “40 Acres and a Lie” accomplished what she long dreamed was possible with these records. “This is a Godsend for genealogists, historians, and other researchers,” she wrote in an email, adding that the series left her speechless and in tears. “Your project has laid a foundation which I hope will expand immeasurably in other fields of research, to mine the data of the Reconstruction Era, and begin to tell new stories that have been buried in the archives for more than 150 years.”

There are so many other powerful stories hidden in historical archives. Now they can be mined using machine learning and other new technologies. There are endless ways to apply the tools and methodologies developed in our reporting. And we hope others will take it even further.

To learn more about the reporting that went into the project, listen to the February installment of the IRE Radio Podcast, and read about their work with Reveal in that month’s I-Team Toolkit newsletter. More on the 2024 Philip Meyer Award winners is available here.

