Databricks Certified Data Engineer Professional Sample Questions:
1. The view named updates represents an incremental batch of all newly ingested data to be inserted into or updated in the customers table.
The following logic is used to process these records.
Which statement describes this implementation?
A) The customers table is implemented as a Type 3 table; old values are maintained as a new column alongside the current value.
B) The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.
C) The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.
D) The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.
E) The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.
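As a point of reference, the Type 2 behavior described in option C (old values kept but marked as no longer current, new values inserted) can be sketched in plain Python. This is not the logic the question refers to, which is not reproduced here. It is only a minimal illustration, and the function name apply_type2 and the fields customer_id, address, current, effective_date, and end_date are hypothetical.

```python
from datetime import date


def apply_type2(customers, updates, effective_date):
    """Return a new list of customer rows with Type 2 history applied.

    customers: list of dicts with the (hypothetical) keys
               customer_id, address, current, effective_date, end_date.
    updates:   list of dicts with the keys customer_id and address.
    """
    incoming = {u["customer_id"]: u for u in updates}
    result = []
    unchanged = set()  # customers whose current row already matches the update

    for row in customers:
        new = incoming.get(row["customer_id"])
        if row["current"] and new is not None:
            if new["address"] != row["address"]:
                # Keep the old value, but mark it as no longer current.
                result.append({**row, "current": False, "end_date": effective_date})
            else:
                result.append(dict(row))
                unchanged.add(row["customer_id"])
        else:
            # Historical rows and customers without updates are left as-is.
            result.append(dict(row))

    # Insert a new current row for every changed or brand-new customer.
    for customer_id, new in incoming.items():
        if customer_id in unchanged:
            continue
        result.append({
            "customer_id": customer_id,
            "address": new["address"],
            "current": True,
            "effective_date": effective_date,
            "end_date": None,
        })
    return result


customers = [
    {"customer_id": 1, "address": "1 Old Rd", "current": True,
     "effective_date": date(2023, 1, 1), "end_date": None},
]
updates = [
    {"customer_id": 1, "address": "9 New St"},   # changed address
    {"customer_id": 2, "address": "5 Elm Ave"},  # new customer
]
result = apply_type2(customers, updates, date(2024, 1, 1))
```

After the call, customer 1 has two rows (the closed-out old address and the new current one) and customer 2 has a single current row. A Type 1 table would instead overwrite the address in place and keep no history.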
2. Why are Pandas UDFs often preferred over traditional PySpark UDFs in performance-critical applications involving large datasets?
A) They leverage Apache Arrow to enable vectorized operations between the JVM and Python runtimes, reducing serialization costs and improving computational efficiency.
B) They eliminate the JVM-Python boundary by bypassing serialization entirely, thereby avoiding data conversion overhead.
C) They allow row-level execution of functions in Python with native Spark optimization, removing the need for columnar execution.
D) They minimize memory usage by streaming each row individually through a lightweight Python wrapper, avoiding batch processing overhead.
3. A data engineer is using Structured Streaming to read in transaction data from a bronze Delta table. It was discovered that the data has quality issues where sometimes the transaction value is negative, and when that occurs, the rows need to be routed to a separate quarantine table. They have low latency requirements for the good data since it is used by downstream systems, but the bad data will only be analyzed periodically and has no production dependencies. The quarantine job needs to be implemented so that it cannot affect the production processes that depend on the good data, and the cost of the job needs to be minimized. How should the quarantine process be implemented in order to satisfy these requirements?
A) The streaming job for the good data needs to be modified to filter out records with a transaction value less than 0 before writing, and should not share compute with other processes. The streaming job for the quarantine data needs to filter out records with a transaction value greater than or equal to 0 before writing, and should be implemented on a separate small cluster and only run once a day to minimize cost.
B) The streaming job for the good data needs to be modified to filter out records with a transaction value less than 0 before writing. The streaming job for the quarantine data needs to filter out records with a transaction value greater than or equal to 0 before writing. Both should run as separate streams on the same cluster to minimize cost.
C) The existing streaming job for the good data should be updated to incorporate the quarantining of the bad data. A new boolean column called "quarantine" should be added to the dataframe, and its value should be set to true if the transaction value is less than 0 and false if the transaction value is greater than or equal to 0. Processing and storing all the data together will save costs.
D) The existing streaming job for the good data should be updated to incorporate the quarantining of the bad data. Inside a foreachBatch function, the dataframe should be filtered so that records with a transaction value greater than or equal to 0 are written to the good data table and records with a transaction value less than 0 are written to a quarantine table. Try/Catch can be added around the writes in the foreachBatch function so that the stream can't fail.
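The complementary filters described in options A and B can be sketched in plain Python as two independent functions that read the same source. This is only an illustration of the routing predicates, not Structured Streaming code. The field name transaction_value and the function names good_job and quarantine_job are hypothetical, and the sketch assumes every record has a numeric transaction value.

```python
def good_job(records):
    """Keep only records whose transaction value is zero or positive."""
    return [r for r in records if r["transaction_value"] >= 0]


def quarantine_job(records):
    """Keep only records whose transaction value is negative."""
    return [r for r in records if r["transaction_value"] < 0]


bronze = [
    {"id": 1, "transaction_value": 25.0},
    {"id": 2, "transaction_value": -4.5},
    {"id": 3, "transaction_value": 0.0},
]

# Each job reads the same source independently and applies its own filter,
# so a failure in one does not stop the other.
good = good_job(bronze)
quarantined = quarantine_job(bronze)
```

Because the two predicates are exact complements, every record with a numeric value lands in exactly one of the two outputs, regardless of when or where each job runs.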
4. A data governance team at a large enterprise is improving data discoverability across its organization. The team has hundreds of tables in their Databricks Lakehouse with thousands of columns that lack proper documentation. Many of these tables were created by different teams over several years, with missing context about column meanings and business logic. The data governance team needs to quickly generate comprehensive column descriptions for all existing tables to meet compliance requirements and improve data literacy across the organization. They want to leverage modern capabilities to automatically generate meaningful descriptions rather than manually documenting each column, which would take months to complete. Which approach should the team use in Databricks to automatically generate column comments and descriptions for existing tables?
A) Write custom PySpark code using df.describe() and df.schema to programmatically generate basic statistical descriptions for each column.
B) Navigate to the table in Databricks Catalog Explorer, select the table schema view, and use the AI Generate option which leverages artificial intelligence to automatically create meaningful column descriptions based on column names, data types, sample values, and data patterns.
C) Use the DESCRIBE TABLE command to extract existing schema information and manually write descriptions based on column names and data types.
D) Use Delta Lake's DESCRIBE HISTORY command to analyze table evolution and infer column purposes from historical changes.
5. A data engineer is designing a system to process batch patient encounter data stored in an S3 bucket, creating a Delta table (patient_encounters) with columns encounter_id, patient_id, encounter_date, diagnosis_code, and treatment_cost. The table is queried frequently by patient_id and encounter_date, requiring fast performance. Fine-grained access controls must be enforced. The engineer wants to minimize maintenance and boost performance. How should the data engineer create the patient_encounters table?
A) Create an external table in Unity Catalog, specifying an S3 location for the data files. Enable predictive optimization through table properties, and configure Unity Catalog permissions for access controls.
B) Create a managed table in Unity Catalog. Configure Unity Catalog permissions for access controls, schedule jobs to run OPTIMIZE and VACUUM commands daily to achieve best performance.
C) Create a managed table in Hive Metastore. Configure Hive Metastore permissions for access controls, and rely on predictive optimization to enhance query performance and simplify maintenance.
D) Create a managed table in Unity Catalog. Configure Unity Catalog permissions for access controls, and rely on predictive optimization to enhance query performance and simplify maintenance.
Solutions:
Question 1: C
Question 2: A
Question 3: A
Question 4: B
Question 5: D



By Edmund


