Sensitive Data Discovery with Amazon Textract
Sensitive data discovery is one of the core steps in data protection. As data volumes grow, businesses increasingly rely on cloud storage such as Amazon S3. To protect your data, you first need to know where it resides in your buckets. Then you need to determine which pieces of information require protection and how to protect them. DataSunrise already offers a Data Discovery solution for Amazon S3 storage with OCR functionality.
In this article we introduce Sensitive Data Discovery for Amazon S3 with Amazon Textract support, which expands the options for recognizing sensitive data in images and documents.
Possibilities of DataSunrise Sensitive Data Discovery
DataSunrise can already discover sensitive data in S3. The wide range of supported file formats increases the volume of information that can be discovered. Here are some of the formats we work with:
- Apache Parquet file format
- Semi-structured files like XML, JSON, CSV
- Unstructured text formats like Microsoft Word documents
- Images (PNG, JPEG, TIFF, JPEG 2000, GIF, WebP, BMP, PNM)
One of the most important features for discovering sensitive data in S3 is data discovery in images. To discover sensitive information in images, we use the Tesseract engine, which relies on neural network technology for character recognition. Our OCR Sensitive Data Discovery lets you detect sensitive information even when it appears in diagrams and tables. DataSunrise also extracts sensitive information from documents that mix text and numbers.
To extend our sensitive data discovery capabilities, we implemented Amazon Textract support for S3 in version 8.4.
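Once OCR has turned an image into text, discovery comes down to matching that text against known patterns. The sketch below is only an illustration of that idea and not the DataSunrise implementation: the `PATTERNS` table, the `scan_text` helper, and the two regular expressions are assumptions chosen for the example.

```python
import re

# Illustrative patterns only (assumptions for this example);
# production discovery rules are far more thorough.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_text(text):
    """Return a sorted list of (label, matched_text) pairs found in OCR output."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return sorted(hits)


# Text as it might come out of an OCR pass over a scanned invoice.
sample = "Invoice 2041 for jane.doe@example.com, SSN 123-45-6789, total 310.50"
print(scan_text(sample))
# [('email', 'jane.doe@example.com'), ('us_ssn', '123-45-6789')]
```

Note that ordinary numbers such as the invoice number and the total are ignored, which is the point of mixing text and numbers in the sample.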
What Is Amazon Textract?
Amazon Textract is a machine learning service that detects and extracts printed text, handwritten text, and tables from images and scanned documents. Amazon Textract supports the following file formats: PNG, JPEG, and PDF. Files in any other format must be converted to one of these formats before Amazon Textract can process them.
The main benefit of the Textract service for businesses is the ability to detect and extract handwritten text from documents such as invoices, medical reports, and financial records. With Amazon Textract, you can extract data without manual effort. This reduces the risk of mistakes that can harm your business during data usage, audits, or in the event of a data leak.
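A quick way to decide whether an S3 object needs conversion first is to check its extension against the formats listed above. This is a minimal sketch: the `TEXTRACT_EXTENSIONS` set and the `needs_conversion` helper are hypothetical names, and the set reflects only the formats named in this article.

```python
from pathlib import PurePosixPath

# Formats this article lists as accepted by Amazon Textract (assumption:
# the extension reliably reflects the real file type).
TEXTRACT_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf"}


def needs_conversion(s3_key):
    """Return True if the object must be converted before being sent to Textract."""
    suffix = PurePosixPath(s3_key).suffix.lower()
    return suffix not in TEXTRACT_EXTENSIONS


for key in ["scans/invoice.pdf", "scans/photo.JPG", "scans/report.bmp"]:
    print(key, needs_conversion(key))
```

Here the PDF and the JPEG can go to Textract directly, while the BMP file has to be converted first.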
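To give a sense of what text extraction with Textract looks like, here is a minimal sketch using the boto3 `detect_document_text` call on an object stored in S3. It is not how DataSunrise calls the service internally. The helper names, the `min_confidence` threshold, and the sample response are assumptions made for the example; running `detect_lines` against AWS requires valid credentials and incurs Textract charges.

```python
def extract_lines(response, min_confidence=80.0):
    """Collect the text of LINE blocks whose confidence meets the threshold."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


def detect_lines(bucket, key, min_confidence=80.0, client=None):
    """Run DetectDocumentText on one S3 object and return its text lines."""
    if client is None:
        import boto3  # imported lazily so extract_lines works without AWS access
        client = boto3.client("textract")
    response = client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return extract_lines(response, min_confidence)


# Hand-written response in the shape Textract returns (values are made up).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Patient: Jane Doe", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "Patient:", "Confidence": 99.3},
        {"BlockType": "LINE", "Text": "smudged note", "Confidence": 42.0},
        {"BlockType": "LINE", "Text": "Policy 12-3456", "Confidence": 97.5},
    ]
}
print(extract_lines(sample_response))
# ['Patient: Jane Doe', 'Policy 12-3456']
```

Filtering on confidence is a design choice for the example: low-confidence lines, which are common with handwriting, can be routed to manual review instead of being trusted automatically.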
DataSunrise and Amazon Textract
OCR data discovery brings many benefits for sensitive data protection and data storage. We aim to provide a convenient and efficient solution for customers who work with printed and handwritten documents and store them in S3. That is why DataSunrise now supports Amazon Textract. This functionality extends the capabilities of OCR Data Discovery in S3.
To start using Data Discovery for S3, follow these steps:
- Navigate to Data Discovery → Periodic Data Discovery.
- Create a Data Discovery task for your S3 bucket.
- Select the two dedicated parameters, “DataDiscoveryUseAmazonTextractOCR” and “DataDiscoveryUseAmazonTextractS3Integranion”.
- Run the task and DataSunrise will perform OCR discovery automatically.
We implemented these two dedicated parameters to configure Textract-based Data Discovery. Please note that, to work properly, Textract OCR must be available on the database instance against which you are going to run Data Discovery.
Please also note that Amazon charges a fee for calls to the Textract Detect Document Text API.
With this integration, data discovery in S3 becomes easier and less time-consuming. Try out our new Sensitive Data Discovery capability for S3. Make sure you know where all your sensitive data resides, and protect it with the help of DataSunrise.