OCR Services for Extracting Text Data

Optical character recognition, or OCR, is a key tool for people who want to build or collect text data. OCR uses machine learning to extract words and lines of text from scans and images, which can then be used to perform quantitative text analysis or natural language processing. Here at the Urban Institute, we’ve used OCR for tasks such as automated text extraction of hundreds of state zoning codes and maps and the collection of text from nonprofit Form 990 annual reports.

A plethora of OCR services exist, but we didn’t know which were the most accurate for Urban projects. OCR services can vary by cost, ease of use, confidentiality, and ability to handle other types of data, such as text appearing in tables or forms, so accuracy is just one dimension to consider. Although we haven’t tested every OCR service, we chose four representative examples that vary across these dimensions. Below, we provide a thorough comparison, as well as the code to replicate our accuracy competition yourself:

1. Amazon Web Services (AWS) Textract, which is fully integrated with other AWS cloud-computing offerings

2. ExtractTable, a cloud-based option that specializes in tabular data

3. Tesseract, a long-standing, open-source option sponsored by Google

4. Adobe Acrobat DC, a popular desktop app for viewing, managing, and editing PDFs

Accuracy

The best way to improve OCR accuracy is through data preprocessing. Enhancing scan resolution, rotating pages and images, and properly cropping scans are all methods to create high-quality document scans that most OCR offerings can handle. But practically speaking, many scans and images are askew, rotated, blurry, handwritten, or obscurely formatted, and data cleaning can be too time-consuming to be feasible. We wanted to test the four OCR candidates against the messiness of real-world OCR tasks, so we compared how each tool handled three poor-quality documents.

We converted all 12 pieces of output (4 OCR offerings x 3 documents) into text files for nontabular text and CSV files for tabular text, and we compared them against “ground truth” text, which was typed by a human.

For each document and OCR service, we computed a text similarity score using the Levenshtein distance, which calculates how many edits are necessary to change one sequence of text into another. Because common errors made by OCR software occur at the character level (such as mistaking an “r” for an “n”), this framework made sense for evaluating accuracy.

Extracted text is not always outputted in the same order across OCR offerings (especially in cases of multicolumn formatting, where some services may first read down one column and others may start by looking across the columns). This variability motivated us to use the token sort ratio developed by SeatGeek, which is agnostic of text order. Token sort splits a sequence of text into individual tokens, sorts them alphabetically, and then rejoins them together and calculates the Levenshtein distance as described above, meaning “cat in the hat” and “hat in the cat” would be considered a perfect match.
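SeatGeek’s open-source fuzzywuzzy library implements this scoring; below is a minimal sketch (the sample strings are illustrative):

# pip install fuzzywuzzy python-Levenshtein
from fuzzywuzzy import fuzz

# Tokens are sorted alphabetically before comparison, so word order is ignored
print(fuzz.token_sort_ratio("cat in the hat", "hat in the cat"))  # 100

# Character-level OCR errors still lower the score
print(fuzz.token_sort_ratio("the neighbor", "the nejghbor"))  # high, but below 100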

From our comparison, we found that Textract and ExtractTable lead the way, with Tesseract close behind and Adobe performing poorly. All four struggled with scan 3, which contained handwritten text, but the high performers handled askew and blurry documents without major issue.

Cloud-Based OCR Offerings Outperformed Competitors across All Three Document Types

The scores from this “fuzzy matching” procedure generally indicate which OCR offering processed the most text correctly, but a single number can’t reliably tell the whole story. First, the scores are rounded to the nearest whole number, so there is some granularity lost in the comparison. Second, not all errors are created equal. If OCR software interprets the word “neighbor” as “nejghbor,” then token sort scoring will count one incorrect character, but the lexical understanding of that word is not greatly affected. But if the software mistakes “quality” for “duality,” that would totally change the meaning of the word yet yield a similar score.

These scores can serve as useful rules of thumb for OCR accuracy, but they are no substitute for a deeper dive into the output text itself. To allow for this deeper comparison, we published these results, including the original scans, all code, and output documents, to this public GitHub repository.

We also include an Excel file with the tabular output from Textract and ExtractTable alongside benchmark tables for comparison. The table extraction performance looks comparable between the two services, except for a pair of rows that ExtractTable mistakenly merges. (ExtractTable’s Python library does include a function for making corrections to improperly merged cells to remedy this issue.)

Cost

Open-source options like Tesseract are the most cost-effective choice, but how much you save depends on the size of the input and the desired output (information from text, tables, and/or forms). AWS Textract charges $1.50 for every 1,000 pages, although it costs more to additionally extract text from tables ($15 per 1,000 pages), forms ($50 per 1,000 pages), or both ($65 per 1,000 pages). The user specifies up front which kinds of text to extract. ExtractTable users purchase credits up front (1 credit = 1 page) and pay on a sliding scale. The price per 1,000 pages to extract tabular data ranges from $26 to $40, and it costs slightly more to extract nontabular text (ranging from about $30 to $45 per 1,000 pages). For jobs that don’t require pulling key-value pairs from forms, Textract is the cheaper of the two cloud-based options, though ExtractTable uniquely offers refunds on bad and failed extractions. Finally, Adobe Acrobat DC requires an annual subscription that charges $14.99 per month (or a month-by-month plan costing $24.99 per month), which includes unlimited use of OCR and other PDF services.
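To make these rates concrete, here is a quick back-of-the-envelope comparison for a hypothetical 10,000-page job, using the per-1,000-page rates quoted above (the helper function and job size are ours):

def cost(pages, rate_per_1000_pages):
    # Convert a per-1,000-page rate into a total dollar cost
    return pages / 1000 * rate_per_1000_pages

PAGES = 10_000  # hypothetical job size

print(f"Textract, text only:           ${cost(PAGES, 1.50):.2f}")   # $15.00
print(f"Textract, text + tables:       ${cost(PAGES, 15.00):.2f}")  # $150.00
print(f"ExtractTable, tables, $26/1k:  ${cost(PAGES, 26.00):.2f}")  # $260.00
print(f"ExtractTable, tables, $40/1k:  ${cost(PAGES, 40.00):.2f}")  # $400.00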

Confidentiality

Although the documents in this competition all consist of nonsensitive text, natural language processing and quantitative text analysis can involve confidential data with personally identifiable information or trade secrets. ExtractTable explicitly guarantees that none of the data generated through purchased credits are saved on its servers, which is the gold standard here. AWS stores no personal information generated from Textract, though it does store the input and log files. Users can also opt out of having AWS use data stored on its servers to improve its AI services. Tesseract has no built-in confidentiality mechanism and depends entirely on the systems you use to integrate the open-source software.

Ease of use and output

Each OCR user will have a different use case in terms of the output required and a different level of comfort with code-based implementation, so ease of use is an important dimension for each of these offerings. For users looking for a no-code option, Adobe can perform OCR by simply right-clicking on the document in the desktop app. Although Adobe can process documents in batches, the output will be a searchable PDF, which is great for finding text within scanned documents but not for collecting data for text analysis. Converting the searchable PDF to text files is possible, but we find that some of the resulting text can be unintelligible.

We used the tesseract package in R, which provides R bindings for Tesseract. (A Python library is also available here.) Using tesseract is quite simple, and the output can be either a string of text (easily exported to a .txt file) or an R dataframe with one word per row. The creators of the tesseract package also recommend using the magick package in R first to preprocess images and enhance their quality. To keep the playing field level, we did not do that above, but it could lead to improved results for Tesseract users.
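The Python route looks much the same; here is a minimal sketch using the pytesseract bindings (the file names are illustrative):

# pip install pytesseract pillow; also requires the Tesseract engine itself
import pytesseract
from PIL import Image

image = Image.open("scan_page1.png")       # hypothetical input scan
text = pytesseract.image_to_string(image)  # extracted text as a single string

with open("scan_page1.txt", "w") as f:
    f.write(text)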

ExtractTable’s API and Python library similarly make it possible to process image files and PDFs in just a few lines of code, outputting tabular text in CSV files and nontabular text in text files. ExtractTable also has a Google Sheets plug-in.
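A minimal sketch of that workflow with the ExtractTable Python client follows; the API key and file name are placeholders, and the method names follow the library’s documented interface, so check your installed version:

# pip install ExtractTable
from ExtractTable import ExtractTable

et_sess = ExtractTable(api_key="YOUR_API_KEY")  # placeholder credential
tables = et_sess.process_file(filepath="scan_page1.png", output_format="df")

# Each extracted table is returned as a pandas DataFrame; save the first to CSV
tables[0].to_csv("scan_page1_table1.csv", index=False)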

The Textract API is less user-friendly, as it entails uploading documents to an Amazon S3 bucket before running a document analysis to extract text in nested JSON format. We use the boto3 package in Python to run the analysis and various pandas functions to wrangle the data into a workable format. Outputting tabular data in CSV format also requires a separate Python script.
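Here is a minimal sketch of that flow with boto3 (bucket and document names are placeholders; multipage PDFs go through the asynchronous API, so we start a job and poll for the result):

import time
import boto3

textract = boto3.client("textract")

# Kick off an asynchronous analysis of a document already uploaded to S3
job = textract.start_document_analysis(
    DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "scan.pdf"}},
    FeatureTypes=["TABLES"],  # request table extraction in addition to raw text
)

# Poll until the job finishes (production code should back off and paginate)
while True:
    result = textract.get_document_analysis(JobId=job["JobId"])
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# The response is nested JSON; LINE blocks carry the extracted text
lines = [b["Text"] for b in result["Blocks"] if b["BlockType"] == "LINE"]
print("\n".join(lines))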

All offerings support PDF, JPEG, and PNG input, and Tesseract and Textract can handle TIFF files as well. Adobe will convert other image files to PDF before parsing text, but Tesseract will do the opposite, creating image files whenever the input document is a PDF before running OCR on the new file.

Lastly, if the use case involves extracting text from tables, both Textract and ExtractTable can parse the text and preserve the layout of tabular data. And Textract is the only one of the four options that supports extracting key-value pairs from documents such as forms or invoices.

Conclusions

Ultimately, the right OCR offering will depend on the use case. Adobe is an excellent tool for converting scans to easily searchable PDFs, but it probably doesn’t fit very well into a pipeline for batch text analysis. Tesseract is free and easy to use, and if high accuracy isn’t as important or your documents are high quality, then the open-source, low-hassle model may suit some users perfectly well.

Perhaps unsurprisingly, the paid, cloud-based offerings win the competition, and each offers certain advantages at the margins. Many downstream natural language processing tasks require cloud-computing infrastructure, so if your organization already uses a cloud service provider, offerings such as Textract can plug into existing pipelines and be quite cost-effective, especially at scale. On the other hand, ExtractTable may appeal to individual researchers for its impressive performance, low barrier to entry, and other unique benefits, such as confidentiality guarantees and refunds for bad output.

In part because Urban already uses AWS for our cloud computing, we found Textract best suited large batches of text extraction because of its low cost and integration with other AWS services. But for smaller operations, we found ExtractTable to be a sleeker, more user-friendly alternative that we also recommend to our researchers.



Comparing Cost Optimization: ECS vs. EKS

 When focusing on cost optimization for running containerized applications on AWS, choosing between ECS and EKS requires a detailed comparison based on pricing and usage. Below is a comprehensive breakdown of the cost considerations for each service:

Amazon ECS (Elastic Container Service) Costs

ECS on EC2:

  • EC2 Instance Costs: You pay for the EC2 instances you run. This includes the cost of the instance type, storage, and data transfer.
  • Load Balancers: If you use Elastic Load Balancing (ELB), you incur additional costs.
  • Networking: Data transfer between instances and out of AWS will have associated costs.

ECS on Fargate:

  • Fargate Pricing: You pay for vCPU and memory resources consumed by your containerized applications.
    • vCPU: $0.04048 per vCPU per hour.
    • Memory: $0.004445 per GB per hour.
  • Per-Second Billing: Charges are based on the resources your task uses per second, with a 1-minute minimum.

Amazon EKS (Elastic Kubernetes Service) Costs

EKS Control Plane:

  • Control Plane: $0.10 per hour per cluster. This adds up to about $72 per month per cluster, regardless of the number of nodes.

EKS on EC2:

  • EC2 Instance Costs: Similar to ECS, you pay for the EC2 instances you use.
  • Load Balancers: Additional costs for ELB usage.
  • Networking: Data transfer costs apply.

EKS on Fargate:

  • Fargate Pricing: Same as ECS on Fargate, you pay for vCPU and memory resources consumed by your containerized applications.
    • vCPU: $0.04048 per vCPU per hour.
    • Memory: $0.004445 per GB per hour.

Cost Comparison and Recommendations

ECS Cost Example

Suppose you have an application requiring 4 vCPUs and 8 GB of memory, running continuously.

ECS on Fargate:

  • vCPU: 4 vCPUs × 24 hours/day × 30 days × $0.04048/vCPU-hour = $116.58
  • Memory: 8 GB × 24 hours/day × 30 days × $0.004445/GB-hour = $25.60
  • Total Monthly Cost: $116.58 + $25.60 = $142.18

ECS on EC2:

  • Assume an m5.large instance (2 vCPUs, 8 GB RAM) costs approximately $0.096 per hour.
  • You would need 2 m5.large instances to meet the vCPU requirement (providing 4 vCPUs and 16 GB RAM in total).
  • Instance Cost: 2 instances × 24 hours/day × 30 days × $0.096/hour = $138.24
  • Total Monthly Cost: $138.24 (plus any additional costs for storage, load balancing, and data transfer).

EKS Cost Example

Using the same resource requirements:

EKS on Fargate:

  • vCPU: 4 vCPUs × 24 hours/day × 30 days × $0.04048/vCPU-hour = $116.58
  • Memory: 8 GB × 24 hours/day × 30 days × $0.004445/GB-hour = $25.60
  • Control Plane Cost: 24 hours/day × 30 days × $0.10/hour = $72.00
  • Total Monthly Cost: $116.58 + $25.60 + $72.00 = $214.18

EKS on EC2:

  • Instance Cost: 2 m5.large instances × 24 hours/day × 30 days × $0.096/hour = $138.24
  • Control Plane Cost: 24 hours/day × 30 days × $0.10/hour = $72.00
  • Total Monthly Cost: $138.24 + $72.00 = $210.24 (plus additional costs for storage, load balancing, and data transfer).
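The arithmetic above is easy to script; here is a small sketch that reproduces these estimates (30-day month, rates as quoted above):

HOURS = 24 * 30  # 720 hours in a 30-day month

FARGATE_VCPU_HR = 0.04048    # $ per vCPU-hour
FARGATE_GB_HR = 0.004445     # $ per GB-hour
M5_LARGE_HR = 0.096          # $ per m5.large instance-hour
EKS_CONTROL_PLANE_HR = 0.10  # $ per cluster-hour

vcpu_cost = round(4 * HOURS * FARGATE_VCPU_HR, 2)       # $116.58
mem_cost = round(8 * HOURS * FARGATE_GB_HR, 2)          # $25.60
fargate = vcpu_cost + mem_cost                          # $142.18
ec2 = round(2 * HOURS * M5_LARGE_HR, 2)                 # $138.24
control_plane = round(HOURS * EKS_CONTROL_PLANE_HR, 2)  # $72.00

print(f"ECS on Fargate: ${fargate:.2f}")
print(f"ECS on EC2:     ${ec2:.2f}")
print(f"EKS on Fargate: ${fargate + control_plane:.2f}")  # $214.18
print(f"EKS on EC2:     ${ec2 + control_plane:.2f}")      # $210.24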

Summary

  • ECS: Tends to be more cost-effective and simpler to manage, especially with smaller, less complex workloads or if you prefer AWS-native solutions.

    • ECS on Fargate: Simplifies management by eliminating the need for instance management but can be more expensive for continuous high-load applications.
    • ECS on EC2: Offers flexibility and potential cost savings if you can manage the instances effectively.
  • EKS: Offers more features and flexibility, better suited for complex, multi-cloud, or hybrid cloud environments but comes with additional control plane costs.

    • EKS on Fargate: Convenient for running Kubernetes workloads without managing instances, but adds control plane costs.
    • EKS on EC2: Provides full Kubernetes functionality with potentially lower costs if instance management is optimized.

For purely AWS-focused environments where cost optimization is the primary concern, ECS on EC2 is likely the most cost-effective option, followed by ECS on Fargate for ease of use without managing instances. If you require Kubernetes features or anticipate needing multi-cloud flexibility, EKS is the better choice, with a careful balance between Fargate and EC2 based on your workload requirements and management capabilities.

Deploying a SPA Using AWS S3 and CloudFront

Deploying a React App Using AWS S3 and CloudFront

Amazon Web Services (AWS) offers a set of powerful tools that enable the seamless deployment of applications. In this article, we will walk through the entire process of deploying your React app on AWS, from setting up your development environment to hosting the application in an AWS S3 bucket and fully configuring HTTPS and a custom domain.

Introduction to AWS S3

Amazon Simple Storage Service (S3) is a scalable object storage service that allows you to store and retrieve data. However, S3 is not just for data; it’s also an excellent choice for hosting static websites and web applications. S3 makes web hosting and content distribution simple with its powerful features and seamless connectivity with other AWS services.

What is AWS CloudFront?

Before we dive into deploying our React app, let’s introduce another essential AWS service: Amazon CloudFront. AWS CloudFront is a content delivery network (CDN) that accelerates the delivery of your web content to users worldwide. It distributes your content over a network of international data centers to offer low-latency access and quick data transfer rates.

Domain Name System

A Domain Name System (DNS) is like the internet’s phonebook. It translates human-friendly domain names (like www.example.com) into IP addresses that computers understand. You must correctly establish DNS settings in order to connect your custom domain to your React app running on S3. This is where AWS Route 53 comes into play.

AWS Route 53

Amazon Route 53 is AWS’s scalable and highly available DNS web service. You can use it to manage DNS routing and domain registration for your applications. You can effortlessly set up and manage DNS records with Route 53, ensuring a seamless connection between your custom domain and your S3-hosted React app.

Let’s Get Started
Now that we have a basic understanding of the key AWS services involved, let’s get started with deploying your React app.

Prerequisites:

Before we continue with the deployment process, it’s important to ensure that you have the following prerequisites in place. These foundational elements will set the stage for a smooth and successful deployment:

  • Install VS Code

  • Set up Git & GitHub

  • Install Node.js and npm

  • Create an AWS account

  • A React App

For this tutorial, I am using an existing React project; however, you can follow along by cloning the project.

Step 1: Create a React App

Since I already have an existing React project, I will go ahead and clone my project locally. You can, however, create a React app using the following steps:

npx create-react-app my-app
cd my-app
npm start

# The next step will be to build the application.
npm run build

Once you have confirmed that the application runs as it should in your browser, you can run the build command. This will bundle the React app in production mode into the build folder, which we will use in the next steps.

Step 2: Deploy React App to AWS S3 Bucket

Open your AWS console and type “Bucket” in the search bar.

  • Select the S3 service.

  • Once opened, you should see a page similar to the image below.

  • Proceed to click the “Create a bucket” button.

  • Give your bucket a name. I will be calling this bucket “mypetstore.cloud”, but you can use any name of your choice.

  • Specify your preferred region

  • Enable public access to your S3 bucket and then save. I will leave all other options at their defaults.

  • Your bucket should have been created and should appear in S3 like the image below.

  • Navigate into the bucket we just created; this is where we will import the contents of the React app’s build folder into our S3 bucket.

  • Click on the “Add files” button, highlight all the files in our project build folder, and add them.

  • Next, click on the “Add folder” button and import the “static” folder into the S3 bucket.

  • The uploaded data should look like the image below on our S3 bucket.

  • Next, we are going to update the properties of our bucket

  • Scroll down to the “Static website hosting” option and edit it.

  • Enable static website hosting for our S3 bucket and specify “index.html” as the default page of your project.

  • Next, navigate to the S3 bucket’s Permissions tab so we can update the policies.

  • Scroll down to the bucket policy and edit it. Input the following policy configurations.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::www.mypetstore.cloud/*"
        }
    ]
}

  • Ensure you replace the bucket name in the “Resource” ARN with your own bucket name.

  • Once that is completed, navigate to the bucket properties and scroll to the bottom. You will see a URL available for our bucket.

  • Copy the link and paste it into a browser.

  • Your React app should be able to load in the browser like this.

  • Cool, now we have successfully deployed our React App to an AWS S3 Bucket. (If you would rather script these console steps, see the sketch below.)
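The following is a minimal boto3 sketch of the console steps above; the bucket name, region, and paths are placeholders, error handling is omitted, and your account’s Block Public Access settings must permit public bucket policies:

import json
import mimetypes
import os
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket = "mypetstore.cloud"  # placeholder bucket name

s3.create_bucket(Bucket=bucket)  # outside us-east-1, pass CreateBucketConfiguration

# Mirror the console's "enable public access" step
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": False,
        "IgnorePublicAcls": False,
        "BlockPublicPolicy": False,
        "RestrictPublicBuckets": False,
    },
)

# Public-read policy for objects, matching the JSON policy above
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "Statement1",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))

# Enable static website hosting with index.html as the default page
s3.put_bucket_website(
    Bucket=bucket,
    WebsiteConfiguration={"IndexDocument": {"Suffix": "index.html"}},
)

# Upload the React build folder, preserving relative paths and content types
for root, _, files in os.walk("build"):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, "build").replace(os.sep, "/")
        ctype = mimetypes.guess_type(path)[0] or "binary/octet-stream"
        s3.upload_file(path, bucket, key, ExtraArgs={"ContentType": ctype})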

Setup a Domain for S3 Static Website

For the next step, we are going to set up a custom domain name for our React App using any domain name registrar. For the purpose of this tutorial, I will be using GoDaddy.

  • You will need to take a moment to purchase a domain name for this part of the demo.

  • I purchased a domain name called “mypetstore.cloud”, which matches the S3 bucket we created.

  • Next, navigate to the “DNS” configuration setting.

  • Select the CNAME record available and edit it.

  • Replace the “value” with the URL of our S3 bucket website endpoint and the name with the name of our S3 bucket.

  • Save the updates.

  • Great! Now let's head back to AWS.

Deploy React App to CloudFront

We need to create a CDN that will speed up the distribution of our website content by routing user requests through the AWS network.

  • Search for Cloudfront in the AWS console.

  • Select the “Create distribution” button.

  • Input the URL of our S3 bucket under “Origin domain.”

  • You can leave the other options as the default.

  • Scroll to the bottom of the page, under “Settings,” and input the CNAMEs.

  • I will be using the root domain name and www subdomain name: “www.mypetstore.cloud” and “mypetstore.cloud”

  • Since we want to use HTTPS for this project, we will need to add a certificate

  • So, proceed to the “Request certificate” link just below that.

  • On the certificate page, you will need to fill in the root domain name and www subdomain names we just specified above.

  • Proceed to request the certificate. It should indicate that the certificate has been successfully created, but the status will read “Pending validation.”

  • Now, let's go back to our domain registrar (GoDaddy) and create two new DNS records (one for the root domain name and one for the www subdomain name), specifying the type as CNAME.

  • Now back to Certificate Manager and into the certificate we just created: Copy the “CNAME names” for the individual domains, just like I did below.

  • Navigate to our domain registrar and replace the names of the new CNAME DNS records we just created with the CNAME names we just copied.

  • Copy the “CNAME value” as well, and use it to replace the value of the corresponding CNAME DNS record created in the domain registrar.

  • Repeat this for both the root domain name and www subdomain name

  • Now refresh the certificate manager page. Our certificate should indicate “Issued” as the status.

  • Navigate back to CloudFront, and now you should have the option to choose the certificate we just created.

  • Once that is in place, you can go ahead and create the distribution.

  • Next, navigate to our just-created CloudFront distribution and copy the distribution domain name to your clipboard.

  • Go back to our domain name registrar and replace that initial CNAME value we set up with the CloudFront domain name we just copied.

  • Next, let's confirm that our domain name has been connected to our AWS CloudFront distribution.

  • You can also run the domain name in your browser to verify that it still works.

  • You can also try using HTTPS in the browser instead of HTTP, and it should still load.

Awesome! As seen in the above image, we can now load the website with our specified domain name.
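For reference, the certificate request can also be scripted. Here is a minimal boto3 sketch using the domain names from this tutorial (note that CloudFront only accepts ACM certificates issued in us-east-1):

import boto3

acm = boto3.client("acm", region_name="us-east-1")

resp = acm.request_certificate(
    DomainName="mypetstore.cloud",
    ValidationMethod="DNS",
    SubjectAlternativeNames=["www.mypetstore.cloud"],
)
cert_arn = resp["CertificateArn"]

# After a short delay, ACM exposes the CNAME validation records to add
# at your registrar (the same names and values shown in the console)
cert = acm.describe_certificate(CertificateArn=cert_arn)["Certificate"]
for option in cert["DomainValidationOptions"]:
    record = option.get("ResourceRecord", {})
    print(option["DomainName"], record.get("Name"), record.get("Value"))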

Transfer Domain to Route 53

Next, we need to transfer our domain to Route 53 to simplify the management of our domain and DNS settings and to enhance the reliability and security of the application.

  • Back in your AWS console, search for Route 53.

  • Select “Hosted zones,” and then select “Create hosted zone.”

  • Input the domain name that we want to route traffic to, along with a short description.

  • Once you have completed the steps, create the zone, and you should see something similar to the image below.

  • Navigate to our domain registrar and set custom nameservers by selecting “Change nameservers.”

  • Back in Route 53, copy the nameservers listed for routing traffic; these will replace the domain registrar’s nameservers.

  • Input the nameservers from Route 53 into the custom nameserver fields, add all four nameservers, and save.

  • Now let's validate the nameservers.

  • Next, navigate back to Route 53, into our hosted zone, and create a new record.

  • One record for “www.mypetstore.cloud” and another for “mypetstore.cloud”

  • Under each record, enable “Alias,” route traffic to the CloudFront distribution we created earlier, and create the records.

  • We should have a total of four records now within our application hosted zone, with two indicating “Alias” as the type.

  • Now, let's verify that the domain names are working just as well as the HTTPS for the React app.

  • Verify with just the root domain name: “mypetstore.cloud”

  • And verify with the www subdomain name as well: “www.mypetstore.cloud”

Awesome! Our site is now fully functional and can be accessed from anywhere with the custom domain we created for it. Great job!!
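For reference, the alias records above can also be created programmatically. Here is a minimal boto3 sketch (the hosted zone ID and CloudFront domain are placeholders; Z2FDTNDATAQYW2 is the fixed hosted zone ID AWS uses for all CloudFront distributions):

import boto3

route53 = boto3.client("route53")
ZONE_ID = "YOUR_HOSTED_ZONE_ID"              # from the hosted zone we created
CF_DOMAIN = "dxxxxxxxxxxxxx.cloudfront.net"  # your distribution's domain name

# One alias A record for the root domain and one for the www subdomain
changes = [
    {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "AliasTarget": {
                "HostedZoneId": "Z2FDTNDATAQYW2",  # fixed value for CloudFront
                "DNSName": CF_DOMAIN,
                "EvaluateTargetHealth": False,
            },
        },
    }
    for name in ("mypetstore.cloud", "www.mypetstore.cloud")
]

route53.change_resource_record_sets(
    HostedZoneId=ZONE_ID, ChangeBatch={"Changes": changes}
)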

What Next?

Congratulations! 🎉 You have successfully hosted a React application using AWS S3, CloudFront, and Route 53. You can show off your efforts to your peers.

Please note that this tutorial is designed to provide introductory exposure to AWS and is targeted at beginners. Follow for more updates on similar topics as I continue my AWS journey.

