AWS Glue Crawler Schema Inference

https://repost.aws/knowledge-center/glue-crawler-detect-schema

important:

https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-grouping-policy

(copied from the transient link above)

When you run your AWS Glue crawler, the crawler does the following:

  1. Classifies the data
  2. Groups the data into tables or partitions
  3. Writes metadata to the AWS Glue Data Catalog

Review the following to learn what happens when you run the crawler and how the crawler detects the schema.
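
As a concrete illustration of that run cycle, the following minimal boto3 sketch starts a crawler and polls until it returns to the READY state. The crawler name my-crawler is a placeholder, not a value from this article.

    import time

    import boto3

    glue = boto3.client("glue")

    CRAWLER_NAME = "my-crawler"  # placeholder: an existing crawler in your account

    # Kick off a crawl: the crawler classifies the data, groups it into
    # tables or partitions, and writes metadata to the Data Catalog.
    glue.start_crawler(Name=CRAWLER_NAME)

    # Poll until the crawler finishes (its state returns to READY).
    while True:
        state = glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"]
        if state == "READY":
            break
        time.sleep(30)

    # The status of the last crawl (for example, SUCCEEDED) is reported
    # on the crawler itself.
    last_crawl = glue.get_crawler(Name=CRAWLER_NAME)["Crawler"].get("LastCrawl", {})
    print(last_crawl.get("Status"))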

Defining a crawler

When you define an AWS Glue crawler, you can choose one or more custom classifiers that evaluate the format of your data to infer a schema. You define the custom classifiers before you define the crawler. When the crawler runs, the first classifier in your list that successfully recognizes your data store is used to create the schema for your table. The match with each classifier generates a certainty value. If a classifier returns certainty=1.0 during processing, then the crawler is 100 percent certain that the classifier can create the correct schema. In this case, the crawler stops invoking other classifiers and creates the table using the schema from the matching classifier.
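
To make this concrete, here is a minimal boto3 sketch that registers a custom grok classifier and attaches it to a crawler. The names (apache-log-classifier, my-crawler, my-database, the S3 path, and the IAM role ARN) are placeholder assumptions, not values from this article.

    import boto3

    glue = boto3.client("glue")

    # Define the custom classifier first, before the crawler that uses it.
    glue.create_classifier(
        GrokClassifier={
            "Name": "apache-log-classifier",        # placeholder name
            "Classification": "apache-logs",
            "GrokPattern": "%{COMBINEDAPACHELOG}",  # built-in grok pattern
        }
    )

    # Attach the classifier to the crawler. When the crawler runs, the
    # classifiers in this list are tried in order; the first match with
    # certainty=1.0 is used to create the schema.
    glue.create_crawler(
        Name="my-crawler",                                       # placeholder
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder
        DatabaseName="my-database",                              # placeholder
        Targets={"S3Targets": [{"Path": "s3://DOC-EXAMPLE-BUCKET/logs/"}]},
        Classifiers=["apache-log-classifier"],
    )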

If AWS Glue doesn’t find a custom classifier that fits the input data format with 100 percent certainty, then AWS Glue invokes the built-in classifiers. Each built-in classifier returns either certainty=1.0 if the format matches, or certainty=0.0 if the format doesn’t match. If no classifier returns certainty=1.0, then AWS Glue uses the output of the classifier that has the highest certainty. If no classifier returns a certainty higher than 0.0, then AWS Glue returns the default classification string of UNKNOWN. For the current list of built-in classifiers in AWS Glue and the order that they are invoked in, see Built-in classifiers in AWS Glue.
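
The selection rules above can be summarized in a few lines of illustrative Python. This is only a sketch of the documented behavior, not AWS Glue’s actual implementation; the function name and data shapes are assumptions for the example.

    def pick_classification(custom_results, builtin_results):
        """Pick a classification per the documented selection rules.

        Each argument is a list of (classification, certainty) tuples in
        invocation order. Custom classifiers run first; any certainty=1.0
        match short-circuits the rest.
        """
        for classification, certainty in custom_results:
            if certainty == 1.0:
                return classification  # crawler stops invoking other classifiers

        # No custom classifier matched with full certainty: fall back to
        # the built-in classifiers and take the highest-certainty result.
        candidates = [r for r in builtin_results if r[1] > 0.0]
        if candidates:
            return max(candidates, key=lambda r: r[1])[0]

        return "UNKNOWN"  # default classification string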

Schema detection in crawler

During the first crawler run, the crawler reads either the first 1,000 records or the first megabyte of each file to infer the schema. The amount of data read depends on the file format and the availability of a valid record. For example, if the input file is a JSON file, then the crawler reads the first 1 MB of the file to infer the schema. If a valid record is read within the first 1 MB of the file, then the crawler infers the schema. If the crawler can’t infer the schema after reading the first 1 MB, then the crawler reads up to a maximum of 10 MB of the file, in 1 MB increments. For CSV files, the crawler reads either the first 1,000 records or the first 1 MB of data, whichever comes first. For Parquet files, the crawler infers the schema directly from the file. The crawler compares the schemas inferred from all the subfolders and files, and then creates one or more tables. When a crawler creates a table, it considers the following factors:

  • Data compatibility, to check whether the data has the same format, compression type, and include path
  • Schema similarity, to check how closely similar the schemas are in terms of the partition threshold and the number of different schemas

For schemas to be considered similar, the following conditions must be true:

  • The partition threshold is higher than 0.7 (70%).
  • The maximum number of different schemas (also referred to as “clusters” in this context) doesn’t exceed 5.

The crawler infers the schema at folder level and compares the schemas across all folders. If the schemas that are compared match, that is, if the partition threshold is higher than 70%, then the schemas are denoted as partitions of a table. If they don’t match, then the crawler creates a table for each folder, resulting in a higher number of tables.
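
Reading the preceding examples as comparing the majority schema’s share of files against the threshold, the decision can be expressed as a small illustrative function. The 0.7 threshold and the cluster limit of 5 come from the conditions above; the function itself is a sketch of the documented decision, not Glue’s internal code.

    MAX_CLUSTERS = 5           # maximum number of different schemas
    PARTITION_THRESHOLD = 0.7  # majority schema share must exceed 70%

    def creates_single_table(schema_counts):
        """Return True if the crawler would group the files into one table.

        schema_counts maps each distinct schema to its file count,
        e.g. {"SCH_A": 8, "SCH_B": 2}.
        """
        total = sum(schema_counts.values())
        majority_share = max(schema_counts.values()) / total
        return (majority_share > PARTITION_THRESHOLD
                and len(schema_counts) <= MAX_CLUSTERS)

    print(creates_single_table({"SCH_A": 8, "SCH_B": 2}))  # True:  0.8 > 0.7
    print(creates_single_table({"SCH_A": 7, "SCH_B": 3}))  # False: 0.7 is not > 0.7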

Example scenarios

Example 1: Suppose that the folder DOC-EXAMPLE-FOLDER1 has 10 files, 8 files with the schema SCH_A and 2 files with the schema SCH_B.

Suppose that the files with the schema SCH_A are similar to the following:

{ "id": 1, "first_name": "John", "last_name": "Doe"}
{ "id": 2, "first_name": "Li", "last_name": "Juan"}

Suppose that the files with the schema SCH_B are similar to the following:

{"city":"Dublin","country":"Ireland"}
{"city":"Paris","country":"France"}

When the crawler crawls the Amazon Simple Storage Service (Amazon S3) path s3://DOC-EXAMPLE-FOLDER1, the crawler creates one table that comprises the columns of both schema SCH_A and schema SCH_B. This is because 80% of the files in the path belong to the schema SCH_A and 20% belong to the schema SCH_B, so the partition threshold is met. Also, there are only 2 different schemas, which doesn’t exceed the maximum of 5 clusters.

Example 2: Suppose that the folder DOC-EXAMPLE-FOLDER2 has 10 files, 7 files with the schema SCH_A and 3 files with the schema SCH_B.

When the crawler crawls the Amazon S3 path s3://DOC-EXAMPLE-FOLDER2, the crawler creates one table for each file. This is because 70% of the files belong to the schema SCH_A and 30% belong to the schema SCH_B; a share of exactly 70% is not higher than the 70% partition threshold, so the threshold isn’t met. You can check the crawler logs in Amazon CloudWatch to get information on the created tables.
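
To check those crawler logs programmatically, a minimal boto3 sketch like the following can be used. It assumes the standard /aws-glue/crawlers CloudWatch log group, where each crawler writes to a log stream named after it; my-crawler is a placeholder.

    import boto3

    logs = boto3.client("logs")

    # Glue crawler logs live in the /aws-glue/crawlers log group, one
    # stream per crawler. "my-crawler" is a placeholder name.
    response = logs.filter_log_events(
        logGroupName="/aws-glue/crawlers",
        logStreamNames=["my-crawler"],
        filterPattern="table",  # narrow to table-creation messages
    )

    for event in response["events"]:
        print(event["message"])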

Crawler options

  • Create a single schema: To configure the crawler to ignore the schema similarity and create only one schema, use the option Create a single schema for each S3 path. For more information, see How to create a single schema for each Amazon S3 include path. However, if the crawler detects data incompatibility, then the crawler still creates multiple tables.
  • Specify table location: The Table level crawler option lets you tell the crawler where the tables are located and how the partitions are to be created. When you specify a Table level value, the table is created at that absolute level from the Amazon S3 bucket. The value must be a positive integer that indicates the absolute folder depth of the table location; the level for the top-level folder is 1. For example, for the path mydataset/a/b, if the level is set to 3, then the table is created at the location mydataset/a/b. For more information, see How to specify the table location. Both options can also be set through the crawler’s JSON configuration, as sketched after this list.
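
The sketch below sets both options through the crawler’s Configuration JSON with update_crawler, using the TableGroupingPolicy and TableLevelConfiguration keys described in the grouping-policy documentation linked at the top of this page. The crawler name my-crawler and the level value 3 are placeholder assumptions.

    import json

    import boto3

    glue = boto3.client("glue")

    # "Create a single schema for each S3 path" corresponds to the
    # CombineCompatibleSchemas grouping policy; TableLevelConfiguration
    # pins the table location to an absolute folder level (top level = 1).
    configuration = {
        "Version": 1.0,
        "Grouping": {
            "TableGroupingPolicy": "CombineCompatibleSchemas",
            "TableLevelConfiguration": 3,  # e.g. tables at mydataset/a/b
        },
    }

    glue.update_crawler(
        Name="my-crawler",  # placeholder
        Configuration=json.dumps(configuration),
    )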

Related information

How crawlers work

Setting crawler configuration options
