Data Classification

Introduction

In today’s digital landscape, data is the lifeblood of organizations. From customer records to financial transactions, businesses rely on vast amounts of information to make informed decisions and drive growth. However, not all data is equal. Some data is more sensitive than others and requires special handling and protection. This is where data classification comes into play.

Data classification is the process of categorizing data based on its sensitivity, criticality, and value to the organization. By classifying data, businesses can ensure that appropriate security measures are in place to safeguard sensitive information from unauthorized access, misuse, or breaches. In this article, we will explore the fundamentals of data classification and delve into examples of how it can be implemented using Python and regular expressions.

Understanding Data Classification

Data classification involves organizing data into predefined categories or classes based on its characteristics and sensitivity level. The primary goal of data classification is to identify and prioritize data that requires enhanced security controls and protection.

There are two main approaches to data classification:

Classification by Schema

This approach analyzes database metadata: the names of columns, tables, views, and functions. For instance, a column named 'last_name' is classified as sensitive data based on its name alone.
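As a minimal sketch of name-based classification, assume the column names have already been read from the database metadata. The keyword list and the `classify_columns` helper below are hypothetical and serve only as an illustration:

```python
import re

# Hypothetical keywords that suggest a column holds sensitive data
SENSITIVE_NAME_PATTERN = re.compile(
    r'(last_?name|first_?name|email|ssn|phone|card|passport|iban)',
    re.IGNORECASE,
)

def classify_columns(column_names):
    """Return the column names whose name hints at sensitive content."""
    return [name for name in column_names if SENSITIVE_NAME_PATTERN.search(name)]

# Column names as they might appear in database metadata (made-up example)
columns = ['id', 'last_name', 'Email_Address', 'created_at', 'card_number']
print(classify_columns(columns))
```

A name-based check like this is cheap because it never reads the stored values, but it misses sensitive data kept in columns with uninformative names.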

Classification by Data

In this approach, the actual content of the data is analyzed to determine its sensitivity and classification. This method requires a more granular examination of the data itself, often using techniques like pattern matching or regular expressions to identify sensitive information.
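One way to apply content-based classification to a whole column is to sample its values and measure how many match a pattern. The sketch below assumes a simplified SSN pattern and a made-up threshold of 50 percent. The `match_ratio` helper is hypothetical:

```python
import re

# Simplified SSN pattern used only for this illustration
SSN_PATTERN = re.compile(r'\d{3}-\d{2}-\d{4}')

def match_ratio(values, pattern):
    """Return the fraction of values that fully match the pattern."""
    if not values:
        return 0.0
    hits = sum(1 for value in values if pattern.fullmatch(value))
    return hits / len(values)

# Values sampled from a single column (made-up data)
column_values = ['123-45-6789', '456-78-9012', 'n/a', '321-54-9876']
ratio = match_ratio(column_values, SSN_PATTERN)

# Hypothetical threshold: flag the column if at least half of the samples match
is_sensitive = ratio >= 0.5
print(ratio, is_sensitive)
```

A ratio instead of a single hit helps tolerate placeholder values such as 'n/a' without missing the column.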

These two approaches can be combined. DataSunrise, for example, combines them when the user creates attributes for an Information Type used by the Sensitive Data Discovery feature. As the later sections show, content-based classification relies on many regular expressions, and each value may have to be checked against every one of them. Centralized control of all data classification mechanisms is therefore extremely important. DataSunrise provides this functionality out of the box, along with other features such as OCR-based data discovery.

Classifying Data with Python and Regular Expressions

One powerful tool for classifying data is the regular expression. A regular expression, or regex, is a sequence of characters that defines a search pattern. Regular expressions let you match and extract specific patterns within text data.
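To illustrate matching and extracting, the short sketch below uses Python's standard `re` module with a simplified email pattern. The sample text is made up:

```python
import re

text = 'Contact john@example.com or jane.smith@example.com for details.'

# Simplified email pattern for illustration only
email_pattern = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

# findall() returns every non-overlapping match in the text
emails = email_pattern.findall(text)
print(emails)

# search() returns the first match, or None when nothing matches
first = email_pattern.search(text)
print(first.group(0) if first else None)
```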

Let’s consider an example where we have a virtual database table containing various types of information, including emails, credit card numbers, and social security numbers (SSNs). Our goal is to classify this data and identify the sensitive information.

import re

# Sample data: each inner list is one row (name, email, card number, SSN)
data = [
    ['John Doe', 'john@example.com', '5555-5555-5555-4444', '123-45-6789'],
    ['Jane Smith', 'jane.smith@example.com', '4111-1111-1111-1111', '987-65-4321'],
    ['Bob Johnson', 'bob.johnson@example.com', '1234-5678-9012-3456', '456-78-9012']
]

# Regular expressions for sensitive data
email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
mastercard_regex = r'\b(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}\b'
ssn_regex = r'\b(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}\b'

# Classify the data
for row in data:
    for cell in row:
        if re.match(email_regex, cell):
            print(f"Email found: {cell}")
        # The card pattern expects 16 contiguous digits, so remove the dashes first
        elif re.match(mastercard_regex, cell.replace('-', '')):
            print(f"Mastercard number found: {cell}")
        elif re.match(ssn_regex, cell):
            print(f"SSN found: {cell}")

In this example, we have a list of lists representing a database table. Each inner list represents a row, and each element within the row represents a column value.

We define regular expressions for identifying emails, Mastercard numbers, and SSNs. These regular expressions capture the specific patterns associated with each type of sensitive data.

A raw string literal r'...' in Python treats backslashes (\) as literal characters. This is particularly useful in regular expressions because backslashes are commonly used as escape characters. Using raw string literals, you don't need to escape backslashes twice (once for Python and once for the regular expression engine).

Using a nested loop, we iterate over each row and cell in the data. For each cell, we use the re.match() function, which looks for a match at the beginning of the string, to check whether the cell value matches any of the defined regular expressions. Because the Mastercard pattern expects 16 digits without separators, the dashes are removed from the cell before that check. If a match is found, we print the corresponding sensitive data type and the original value.

Running this code will output:

Email found: john@example.com
Mastercard number found: 5555-5555-5555-4444
SSN found: 123-45-6789
Email found: jane.smith@example.com
Email found: bob.johnson@example.com
SSN found: 456-78-9012

Three values are not reported. The number 4111-1111-1111-1111 starts with 4, as Visa numbers do, and 1234-5678-9012-3456 starts with 1, so neither falls within the Mastercard prefix ranges (51-55 and 2221-2720). The value 987-65-4321 is skipped because the SSN pattern rejects area numbers that start with 9.

It's important to note that creating comprehensive regular expressions for all possible variations of sensitive data can be challenging. Different data formats, edge cases, and evolving patterns can make it difficult to capture every instance accurately. That's why it's a good idea to use simple regular expressions as a starting point and continuously refine them based on the specific requirements and data in real-world scenarios.
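One way to support this kind of refinement is to keep a small set of values with known expected results and rerun it whenever a pattern changes. The following sketch assumes hand-picked samples and uses the SSN pattern from the example above. The `find_mismatches` helper is hypothetical:

```python
import re

# SSN pattern from the example above
ssn_regex = re.compile(r'\b(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}\b')

# Hand-picked samples (made up for this sketch): value -> should it match?
samples = {
    '123-45-6789': True,
    '000-12-3456': False,   # area number 000 is excluded
    '666-12-3456': False,   # area number 666 is excluded
    '123-00-6789': False,   # group number 00 is excluded
    '123456789': False,     # no dashes, so this pattern does not cover it
}

def find_mismatches(pattern, expected):
    """Return the samples where the pattern disagrees with the expectation."""
    return [value for value, should_match in expected.items()
            if bool(pattern.fullmatch(value)) != should_match]

# An empty list means the pattern behaves as expected on every sample
print(find_mismatches(ssn_regex, samples))
```

When a new edge case turns up in real data, adding it to the sample set documents the decision and guards against regressions in later edits to the pattern.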

Additional Sensitive Data Patterns

Here are a few more regular expressions to classify sensitive data:

Phone Number (US format, without the +1 country code):

^\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})$

or

^(\([0-9]{3}\) |[0-9]{3}-)[0-9]{3}-[0-9]{4}$

These regular expressions match US phone numbers in several common formats. Many more regular expressions exist for phone numbers in other formats and other countries. This variety complicates the classification process, because you need to include all of the relevant regular expressions to classify the data accurately.
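As a rough illustration, the two phone patterns can be written as Python raw strings and tried against a few made-up values. The `is_us_phone` helper is hypothetical. Note that neither pattern accepts a leading +1 country code:

```python
import re

# The two phone patterns listed above, written as Python raw strings
phone_patterns = [
    re.compile(r'^\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})$'),
    re.compile(r'^(\([0-9]{3}\) |[0-9]{3}-)[0-9]{3}-[0-9]{4}$'),
]

def is_us_phone(value):
    """Return True if any of the patterns matches the whole value."""
    return any(pattern.match(value) for pattern in phone_patterns)

candidates = ['(555) 123-4567', '555.123.4567', '5551234567',
              '+1 555 123 4567', '555-1234']
for candidate in candidates:
    print(candidate, is_us_phone(candidate))
```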

IP Address (IPv4):

^((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)(\.(?!$)|$)){4}$

This regular expression matches IPv4 addresses, ensuring that each octet is within the valid range (0-255).
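A quick way to see the range check in action is to run the pattern against a few candidate strings. The values below are made up for this sketch:

```python
import re

# IPv4 pattern from above, written as a Python raw string
ipv4_regex = re.compile(r'^((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)(\.(?!$)|$)){4}$')

candidates = ['192.168.1.1', '255.255.255.255', '256.1.1.1', '1.2.3', '01.2.3.4']
results = {candidate: bool(ipv4_regex.match(candidate)) for candidate in candidates}

for candidate, matched in results.items():
    print(candidate, matched)
```

Besides out-of-range octets, the pattern also rejects addresses with too few octets and octets with a leading zero.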

Passport Number (US format):

^(?!^0+$)[a-zA-Z0-9]{3,20}$

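This regular expression matches any string of 3 to 20 letters and digits that does not consist entirely of zeros. It is deliberately broad: it accepts US passport numbers but also many unrelated values, so expect false positives unless you combine it with other signals such as the column name.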

Bank Account Number (IBAN format):

^[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}$

This regular expression matches International Bank Account Numbers (IBAN) in the general format: a two-letter country code, two check digits, and up to 30 alphanumeric characters. You can find a list of country-specific formats (regular expressions) in Apache Commons Validator.
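Because the general pattern accepts many strings that are not real IBANs, a format match can be followed by the standard mod-97 check digit test. The sketch below is one possible way to combine the two. The `is_valid_iban` helper is hypothetical, and the sample value is the commonly published example IBAN:

```python
import re

iban_regex = re.compile(r'^[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}$')

def is_valid_iban(value):
    """Check the IBAN format with the regex, then the mod-97 checksum."""
    iban = value.replace(' ', '').upper()
    if not iban_regex.match(iban):
        return False
    # Move the first four characters (country code and check digits) to the end
    rearranged = iban[4:] + iban[:4]
    # Replace letters with numbers: A=10, B=11, ..., Z=35
    digits = ''.join(str(int(ch, 36)) for ch in rearranged)
    # A valid IBAN leaves a remainder of 1 when divided by 97
    return int(digits) % 97 == 1

print(is_valid_iban('GB82 WEST 1234 5698 7654 32'))
print(is_valid_iban('GB82 WEST 1234 5698 7654 33'))
```

The checksum does not replace the country-specific length and structure rules, but it filters out most random alphanumeric strings that happen to fit the general pattern.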

Credit Card Number (American Express):

^3[47][0-9]{13}$

This regular expression matches American Express credit card numbers, which start with either 34 or 37 and have a total of 15 digits.
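A pattern match alone cannot tell a real card number from any 15-digit string with the right prefix. One common refinement, sketched here under the assumption that separators should be stripped first, is to add the Luhn checksum. The `luhn_valid` and `is_amex` helpers are hypothetical:

```python
import re

amex_regex = re.compile(r'^3[47][0-9]{13}$')

def luhn_valid(number):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk the digits from right to left, doubling every second one
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def is_amex(value):
    """Strip common separators, then apply the pattern and the checksum."""
    digits = re.sub(r'[ -]', '', value)
    return bool(amex_regex.match(digits)) and luhn_valid(digits)

print(is_amex('3782 822463 10005'))  # widely published Amex test number
print(is_amex('378282246310006'))    # same number with a wrong check digit
```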

Social Security Number (SSN) with dashes:

^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$

This regular expression matches SSNs in the format XXX-XX-XXXX, excluding invalid patterns: 000, 666, or 900-999 in the area number, 00 in the group number, and 0000 in the serial number.

Email Address:

^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$

This regular expression matches email addresses, allowing a combination of alphanumeric characters, dots, underscores, and hyphens in the local part and the domain name, followed by a top-level domain of 2 to 6 letters. This is a short variant. Discussions on Stack Overflow suggest more advanced variants.
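Several of the patterns in this section can be collected into a single registry so that each value is checked against all of them in one place. This is only a sketch of the idea. The `PATTERNS` dictionary and the `classify` helper are hypothetical, and the sample values are made up:

```python
import re

# A small registry built from patterns listed in this section
PATTERNS = {
    'ipv4': re.compile(r'^((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)(\.(?!$)|$)){4}$'),
    'amex': re.compile(r'^3[47][0-9]{13}$'),
    'ssn': re.compile(r'^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$'),
    'email': re.compile(r'^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$'),
}

def classify(value):
    """Return the names of all patterns that match the value."""
    return [name for name, pattern in PATTERNS.items() if pattern.match(value)]

for value in ['10.0.0.1', '123-45-6789', 'jane.smith@example.com', 'hello world']:
    print(value, classify(value))
```

Returning a list of labels instead of stopping at the first hit makes overlaps between patterns visible, which is useful when deciding which expressions need tightening.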

Remember, these regular expressions are examples and may need to be adapted based on your specific requirements and the data formats you encounter. Additionally, regular expressions alone are not sufficient for comprehensive data protection. You should use them in conjunction with other security measures, such as data encryption, access controls, and secure storage practices.

When working with sensitive data, it’s crucial to consider the specific requirements and regulations applicable to your domain. Always consult the relevant security and compliance frameworks and guidelines to ensure the appropriate handling and protection of sensitive information.

Conclusion

Data classification is a crucial aspect of data security and compliance. By categorizing data based on its sensitivity and applying appropriate security controls, organizations can protect sensitive information from unauthorized access and breaches.

Python and regular expressions provide powerful tools for classifying data based on its content. By defining regular expressions that match specific patterns, we can identify and flag sensitive data within structured or unstructured datasets.

However, it’s important to recognize the challenges associated with creating comprehensive regular expressions for all possible data variations. Regular expressions should be used in conjunction with other security measures, such as encryption, access controls, and monitoring, to ensure robust data protection.

At DataSunrise, we offer exceptional and flexible tools for sensitive data discovery, security, audit rules, masking, and compliance. Our solutions empower organizations to safeguard their sensitive data and meet regulatory requirements effectively. We encourage you to schedule an online demo to explore how DataSunrise can help you classify and protect your critical data assets.
