Data Classification

Introduction

In today’s digital landscape, data is the lifeblood of organizations. From customer records to financial transactions, businesses rely on vast amounts of information to make informed decisions and drive growth. However, not all data is equal. Some data is more sensitive than others and requires special handling and protection. This is where data classification comes into play.

Data classification is the process of categorizing data based on its sensitivity, criticality, and value to the organization. By classifying data, businesses can ensure that appropriate security measures are in place to safeguard sensitive information from unauthorized access, misuse, or breaches. In this article, we will explore the fundamentals of data classification and delve into examples of how it can be implemented using Python and regular expressions.

Understanding Data Classification

Data classification involves organizing data into predefined categories or classes based on its characteristics and sensitivity level. The primary goal of data classification is to identify and prioritize data that requires enhanced security controls and protection.

There are two main approaches to data classification:

Classification by Schema

This approach involves analyzing database metadata: the names of columns, tables, views, and functions. For instance, a column named ‘last_name’ is classified as sensitive because its name suggests that it holds personal data.
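As a rough illustration of schema-based classification, the sketch below checks column names against a few keyword patterns. The pattern set, the labels, and the column list are hypothetical and would need to reflect your own naming conventions.

```python
import re

# Hypothetical keyword patterns for column names (assumptions, not a standard list).
SENSITIVE_NAME_PATTERNS = {
    "personal name": re.compile(r"(first|last|full)_?name", re.IGNORECASE),
    "email": re.compile(r"e_?mail", re.IGNORECASE),
    "ssn": re.compile(r"ssn|social_?security", re.IGNORECASE),
    "payment card": re.compile(r"card_?(number|num|no)|credit_?card", re.IGNORECASE),
}

def classify_columns(column_names):
    """Return {column_name: label} for columns whose names look sensitive."""
    findings = {}
    for name in column_names:
        for label, pattern in SENSITIVE_NAME_PATTERNS.items():
            if pattern.search(name):
                findings[name] = label
                break  # the first matching label wins
    return findings

# Hypothetical column names, as they might be read from database metadata.
columns = ["id", "last_name", "Email", "created_at", "credit_card_number"]
print(classify_columns(columns))
# {'last_name': 'personal name', 'Email': 'email', 'credit_card_number': 'payment card'}
```

Name-based checks are cheap because they never touch the rows themselves, but they miss sensitive values stored under uninformative column names.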

Classification by Data

In this approach, the actual content of the data is analyzed to determine its sensitivity and classification. This method requires a more granular examination of the data itself, often using techniques like pattern matching or regular expressions to identify sensitive information.
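One possible way to apply content-based classification to a whole column is to sample its values and measure how many match a pattern. The sketch below assumes an SSN-style pattern, a hypothetical sample of values, and an arbitrary 0.5 threshold; none of these come from a specific product.

```python
import re

# SSN-style pattern; the sample values and threshold below are assumptions.
SSN_PATTERN = re.compile(r"(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}")

def match_ratio(values, pattern):
    """Return the fraction of non-empty values that fully match the pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if pattern.fullmatch(v))
    return hits / len(non_empty)

# Hypothetical sample taken from a column with an uninformative name.
sample = ["123-45-6789", "456-78-9012", "n/a", "078-05-1120"]
THRESHOLD = 0.5  # assumed cut-off for flagging the column

ratio = match_ratio(sample, SSN_PATTERN)
print(f"{ratio:.2f}", "sensitive" if ratio >= THRESHOLD else "not sensitive")
# 0.75 sensitive
```

Working with a ratio rather than a single match makes the decision less sensitive to a few stray values in the column.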

These two approaches can be combined as needed. DataSunrise, for example, combines them when the user creates attributes for an Information Type used in the Sensitive Data Discovery feature. As the following sections show, content-based classification means checking every value against many regular expressions, so centralized control of all data classification mechanisms is extremely important. DataSunrise provides this functionality out of the box, along with other powerful features such as OCR-based data discovery.

Classifying Data with Python and Regular Expressions

One powerful tool for classifying data is the regular expression. A regular expression, or regex, is a sequence of characters that defines a search pattern. Regular expressions allow you to match and extract specific patterns within text data.

Let’s consider an example where we have a virtual database table containing various types of information, including emails, credit card numbers, and social security numbers (SSNs). Our goal is to classify this data and identify the sensitive information.

import re

# Sample data: each inner list is a row (name, email, card number, SSN)
data = [
    ['John Doe', 'john@example.com', '5555-5555-5555-4444', '123-45-6789'],
    ['Jane Smith', 'jane.smith@example.com', '4111-1111-1111-1111', '987-65-4321'],
    ['Bob Johnson', 'bob.johnson@example.com', '1234-5678-9012-3456', '456-78-9012']
]

# Regular expressions for sensitive data
email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
# Mastercard: 16 digits starting with 51-55 or 2221-2720,
# with an optional dash or space between the groups of four
mastercard_regex = r'\b(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)(?:[- ]?[0-9]{4}){3}\b'
# SSN: area numbers 000, 666, and 900-999 are excluded
ssn_regex = r'\b(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}\b'

# Classify the data
for row in data:
    for cell in row:
        if re.match(email_regex, cell):
            print(f"Email found: {cell}")
        elif re.match(mastercard_regex, cell):
            print(f"Mastercard number found: {cell}")
        elif re.match(ssn_regex, cell):
            print(f"SSN found: {cell}")

In this example, we have a list of lists representing a database table. Each inner list represents a row, and each element within the row represents a column value.

We define regular expressions for identifying emails, Mastercard numbers, and SSNs. These regular expressions capture the specific patterns associated with each type of sensitive data. The Mastercard pattern allows an optional dash or space between the groups of four digits, because the sample data stores card numbers with dashes.

A raw string literal r'…' in Python treats backslashes (\) as literal characters. This is particularly useful in regular expressions because backslashes are commonly used as escape characters. Using raw string literals, you don’t need to escape backslashes twice (once for Python and once for the regular expression engine).

Using a nested loop, we iterate over each row and cell in the data. For each cell, we use the re.match() function to check whether the start of the cell value matches any of the defined regular expressions. If a match is found, we print the corresponding sensitive data type and the cell value.

Running this code will output:

Email found: john@example.com
Mastercard number found: 5555-5555-5555-4444
SSN found: 123-45-6789
Email found: jane.smith@example.com
Email found: bob.johnson@example.com
SSN found: 456-78-9012

Three values are not flagged. The number 4111-1111-1111-1111 starts with 4, which is a Visa prefix, and 1234-5678-9012-3456 falls outside the Mastercard ranges as well. The value 987-65-4321 is rejected because the SSN pattern excludes area numbers that start with 9.

It’s important to note that creating comprehensive regular expressions for all possible variations of sensitive data can be challenging. Different data formats, edge cases, and evolving patterns can make it difficult to capture every instance accurately. That’s why it’s a good idea to use simple regular expressions as a starting point and continuously refine them based on the specific requirements and data in real-world scenarios.
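One common refinement for card numbers is to follow the regex match with a checksum test. Payment card numbers carry a Luhn check digit, so a candidate that fails the Luhn test is very unlikely to be a real card number. The helper below is a minimal sketch; the function name luhn_valid is our own and is not part of the earlier example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]  # ignore dashes and spaces
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0

for candidate in ["5555-5555-5555-4444", "4111-1111-1111-1111", "1234-5678-9012-3456"]:
    print(candidate, luhn_valid(candidate))
# 5555-5555-5555-4444 True
# 4111-1111-1111-1111 True
# 1234-5678-9012-3456 False
```

The Luhn test is not specific to any card brand, so it complements the prefix-based regex rather than replacing it.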

Additional Sensitive Data Patterns

Here are a few more regular expressions to classify sensitive data:

Phone Number (US format):

^\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})$

^(\([0-9]{3}\) |[0-9]{3}-)[0-9]{3}-[0-9]{4}$

These regular expressions match ten-digit US phone numbers in several formats. The first accepts optional parentheses around the area code and optional dashes, dots, or whitespace between the groups. The second accepts only the forms (XXX) XXX-XXXX and XXX-XXX-XXXX. Many more regular expressions exist for phone numbers in other formats and for other countries. This variety complicates the classification process, because you need to include all of the relevant expressions to classify the data accurately.
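Neither of the two patterns above accepts a country code. The sketch below extends the first one with an optional +1 or 1 prefix and returns the ten national digits. The pattern and the helper name normalize_us_phone are illustrative assumptions, not a complete phone number validator.

```python
import re

# Assumed pattern: optional "+1" or "1" prefix, then 3-3-4 digits with optional separators.
US_PHONE = re.compile(
    r"(?:\+?1[-.\s]?)?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})"
)

def normalize_us_phone(text):
    """Return the ten national digits if `text` looks like a US phone number, else None."""
    match = US_PHONE.fullmatch(text.strip())
    return "".join(match.groups()) if match else None

print(normalize_us_phone("+1 (555) 123-4567"))  # 5551234567
print(normalize_us_phone("555.123.4567"))       # 5551234567
print(normalize_us_phone("12345"))              # None
```

Normalizing matches to a single canonical form also makes it easier to compare or deduplicate the values later.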

IP Address (IPv4):

^((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)(\.(?!$)|$)){4}$

This regular expression matches IPv4 addresses, ensuring that each octet is within the valid range (0-255).
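When the classifier runs in Python, the standard library ipaddress module offers a way to cross-check the regex. The sketch below compares the two on a few hypothetical inputs; the helper name is_ipv4 is our own.

```python
import ipaddress
import re

IPV4_REGEX = re.compile(r"^((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)(\.(?!$)|$)){4}$")

def is_ipv4(text):
    """Return True if the standard library parses `text` as an IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False

for candidate in ["192.168.1.1", "256.1.1.1", "10.0.0"]:
    print(candidate, bool(IPV4_REGEX.match(candidate)), is_ipv4(candidate))
# 192.168.1.1 True True
# 256.1.1.1 False False
# 10.0.0 False False
```

The regex remains useful when the same pattern has to be shared with tools that only accept regular expressions.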

Passport Number (US format):

^(?!^0+$)[a-zA-Z0-9]{3,20}$

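This regular expression matches US passport numbers, but it is deliberately permissive: it accepts any string of 3 to 20 letters and digits that is not made up entirely of zeros. Expect false positives, and combine it with other signals such as the column name.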

Bank Account Number (IBAN format):

^[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}$

This regular expression matches International Bank Account Numbers (IBAN) in the standard format: a two-letter country code, two check digits, and up to 30 alphanumeric characters. You can find a list of country-specific formats (regular expressions) in Apache Commons Validator.
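Because the generic IBAN pattern is loose, a checksum test can filter out many false positives. IBANs carry two check digits that are verified with a mod-97 calculation. The sketch below applies that check after the regex; the helper name iban_checksum_ok is our own, and the sample value is the widely published example IBAN rather than a real account.

```python
import re

IBAN_REGEX = re.compile(r"^[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}$")

def iban_checksum_ok(iban: str) -> bool:
    """Return True if `iban` matches the generic format and passes the mod-97 check."""
    compact = iban.replace(" ", "").upper()
    if not IBAN_REGEX.match(compact):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Replace letters with numbers (A=10, B=11, ..., Z=35); digits stay unchanged.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1

print(iban_checksum_ok("GB82 WEST 1234 5698 7654 32"))  # True
print(iban_checksum_ok("GB83 WEST 1234 5698 7654 32"))  # False
```

This check does not confirm country-specific lengths, which is where per-country patterns would still be needed.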

Credit Card Number (American Express):

^3[47][0-9]{13}$

This regular expression matches American Express credit card numbers, which start with either 34 or 37 and have a total of 15 digits.

Social Security Number (SSN) with dashes:

^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$

This regular expression matches SSNs in the format XXX-XX-XXXX, excluding invalid patterns such as 000, 666, or 9XX in the area number, 00 in the group number, and 0000 in the serial number.

Email Address:

^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$

This regular expression matches email addresses, allowing a combination of alphanumeric characters, dots, underscores, and hyphens in the local part and the domain name. It is a short variant. You can easily find discussions on Stack Overflow that suggest more advanced variants.
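To keep a growing set of patterns manageable, one option is to hold them in a single registry and test each value against the entries in order. The sketch below collects several of the patterns from this section; the registry layout and the classify function are illustrative assumptions. The passport pattern is left out on purpose, because it is permissive enough to match almost any short alphanumeric value.

```python
import re

# Patterns from this section, keyed by label. Order matters: the first match wins.
PATTERNS = {
    "American Express card": r"3[47][0-9]{13}",
    "SSN": r"(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}",
    "IPv4 address": r"((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)(\.(?!$)|$)){4}",
    "Email": r"[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}",
    "IBAN": r"[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}",
}
# Compile once so that repeated checks do not re-parse the patterns.
COMPILED = {label: re.compile(pattern) for label, pattern in PATTERNS.items()}

def classify(value):
    """Return the label of the first pattern that matches the whole value, else None."""
    for label, pattern in COMPILED.items():
        if pattern.fullmatch(value):
            return label
    return None

for value in ["371449635398431", "123-45-6789", "10.0.0.1", "jane.smith@example.com", "hello"]:
    print(value, "->", classify(value))
# 371449635398431 -> American Express card
# 123-45-6789 -> SSN
# 10.0.0.1 -> IPv4 address
# jane.smith@example.com -> Email
# hello -> None
```

Every value is tested against each pattern until one matches, so the cost grows with the number of patterns. That is one reason to manage them in one place.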

Remember, these regular expressions are examples and may need to be adapted based on your specific requirements and the data formats you encounter. Additionally, regular expressions alone are not sufficient for comprehensive data protection. You should use them in conjunction with other security measures, such as data encryption, access controls, and secure storage practices.

When working with sensitive data, it’s crucial to consider the specific requirements and regulations applicable to your domain. Always consult the relevant security and compliance frameworks and guidelines to ensure the appropriate handling and protection of sensitive information.

Conclusion

Data classification is a crucial aspect of data security and compliance. By categorizing data based on its sensitivity and applying appropriate security controls, organizations can protect sensitive information from unauthorized access and breaches.

Python and regular expressions provide powerful tools for classifying data based on its content. By defining regular expressions that match specific patterns, we can identify and flag sensitive data within structured or unstructured datasets.

However, it’s important to recognize the challenges associated with creating comprehensive regular expressions for all possible data variations. Regular expressions should be used in conjunction with other security measures, such as encryption, access controls, and monitoring, to ensure robust data protection.

At DataSunrise, we offer exceptional and flexible tools for sensitive data discovery, security, audit rules, masking, and compliance. Our solutions empower organizations to safeguard their sensitive data and meet regulatory requirements effectively. We encourage you to schedule an online demo to explore how DataSunrise can help you classify and protect your critical data assets.