INSIDE DEALSTAT'S PDF EXTRACTION TECHNOLOGY

Analyze Unstructured Deal Sources in Minutes

DealStat designed its proprietary Source Document Reader to remove the tedious work involved in the commercial real estate acquisition process. Our machine learning and natural language pipelines transform your unstructured deal marketing materials (i.e. offering memos and online property listings) into fully dynamic models, presentations and reports, created in your language.

In this post, we want to address the common question, “how does it all work?”

Challenges of PDFs

Getting a computer to extract useful information from PDF documents is no easy task. Here are just a few of the hurdles that DealStat faces with offering memos:

  • No two documents are the same. Some brokers will include the acquisition price in a table with other property information, while others will splash it front and center on the document’s first page with no real "key words" to tell a computer what the number refers to.
  • PDFs are designed for humans, not computers. The challenge is compounded by the non-linearity of PDF documents. Offering memos are visually intensive. They rarely read left-to-right and top-to-bottom. They include a range of text, tables, charts and photos. Spatial relationships, color patterns, headings, and relative text size can often change the meaning of a key data point. Traditional PDF parsing tools are rendered largely ineffective.
  • Documents are full of red herrings. Tackling such an expansive problem uncovers a wide array of unexpected challenges that DealStat’s data scientist has had to overcome. How do you teach a computer to differentiate between the target property’s address and the broker’s office location? How do you cordon off information related to comparable sales, which may, at first glance, look similar to the asset being marketed?
  • Language is inconsistent. Even the most seasoned real estate professionals often use inconsistent terminology. For example, knowing whether a revenue figure accounts for vacancy, and whether it includes other income, is important to understanding the potential opportunity. This type of analysis requires a more sophisticated process than simply matching a number to a label.

How it Works

DealStat’s proprietary Source Document Reader leverages a combination of machine learning technology, rule-based pattern-detection, and commercial real estate domain-knowledge to overcome these challenges and many more. Here is a sneak peek into some of the process and technologies that DealStat uses to effectively extract key deal information from offering memos and online property listings:

  • Property meta-analysis. DealStat’s machine learning algorithms quickly scan the entire document to predict the probability of dozens of characteristics. This meta-analysis plays a large role in determining the next steps. It helps answer questions like, “Should we be looking for a unit mix summary or NNN lease terms here?” These algorithms also provide a page-by-page breakdown of where the relevant information may be located within the document.
  • Extraction and classification. Much of the important information contained within offering memos is hidden in tables. Without extensive pre-processing, tables locked inside a PDF document are incomprehensible to a machine. A central pillar of DealStat’s Source Document Reader is its object detection model. This model utilizes a computer vision deep-learning algorithm to detect, locate and extract individual tables. Combined with the raw text of the PDF document, these tables form a powerful data source. Each piece of text and table cell can then be classified to draw out values for attributes such as acquisition price, address and rent-roll line items.
  • Intelligent real estate validation. Each data point is analyzed to assess its validity. For example, DealStat ensures that operating summary items tie to rent rolls and expense tables, often utilizing square footage and unit counts to triangulate calculations. This element provides a consistent structure for presenting financial data, and offers a layer of intelligent protection against capturing erroneous values commonly found within deal documents. Some data points (e.g. acquisition price and cap rate) may allow for the calculation of others (e.g. net operating income).
  • Image extraction. Images are a powerful way to enhance output files, and to quickly assess prospective opportunities. In addition to textual data, DealStat is able to extract and return images from source documents. We are actively developing enhancements that include a classification algorithm to automatically tag photos with labels such as “exterior”, “interior”, “floor plan”, and “aerial map”. The algorithm is also learning to set aside images that may add style to an offering document, but are less relevant to your acquisition (such as logos and broker headshots).
  • Output structuring. Finally, DealStat’s Source Document Reader delivers the extracted data in a standardized JSON (JavaScript Object Notation) structure. This feeds directly into your deal portal and acquisition pipeline interface, so that you can easily access and utilize information for analysis, outputs, and reports. It also provides a structure that can be utilized by DealStat’s integration partners via a third-party API. Beyond the raw data points, the output helps maintain the integrity of information. Each item is mapped to a specific source (i.e. the offering memorandum and its page number) and flagged if it was calculated by DealStat using other data points, rather than extracted directly.
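
To make the validation and output steps above more concrete, here is a minimal Python sketch. It is an illustration only, not DealStat's actual schema or code: the DataPoint fields, the function names (derive_noi, ties_out, to_json) and the 1% tolerance are all assumptions made for this example. It shows a value derived from two extracted data points (net operating income = acquisition price x cap rate), a simple tie-out check, and a JSON output in which each item carries its source, its page number, and a flag marking calculated values.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class DataPoint:
    """One extracted or derived value with provenance (hypothetical schema)."""
    name: str
    value: float
    source: Optional[str]     # e.g. "offering_memorandum"; None if derived
    page: Optional[int]       # page the value was read from; None if derived
    calculated: bool = False  # True when computed from other data points


def derive_noi(price: DataPoint, cap_rate: DataPoint) -> DataPoint:
    """Derive NOI as price x cap rate (cap rate given as a decimal, e.g. 0.055)."""
    return DataPoint(
        name="net_operating_income",
        value=round(price.value * cap_rate.value, 2),
        source=None,
        page=None,
        calculated=True,
    )


def ties_out(stated: float, computed: float, tolerance: float = 0.01) -> bool:
    """Return True when two figures agree within a relative tolerance."""
    if stated == 0:
        return computed == 0
    return abs(stated - computed) / abs(stated) <= tolerance


def to_json(points: List[DataPoint]) -> str:
    """Serialize data points to a JSON object keyed by attribute name."""
    return json.dumps({p.name: asdict(p) for p in points}, indent=2)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from a real offering memo.
    price = DataPoint("acquisition_price", 10_000_000.0, "offering_memorandum", 1)
    cap_rate = DataPoint("cap_rate", 0.055, "offering_memorandum", 3)
    noi = derive_noi(price, cap_rate)
    print(to_json([price, cap_rate, noi]))
```

In a sketch like this, a check such as ties_out could compare a derived figure against the one stated in the document's operating summary, so that a mismatch is surfaced rather than silently captured.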

What Does This All Mean?

Most importantly, it means that you spend less time tediously extracting and managing countless data points for each deal that comes across your desk. But if you’re interested in understanding more about neural networks and machine learning, this video provides a great conceptual introduction. It's a bit long, but why not learn something with all that new time you have from using DealStat?