Get Prepared for Your Professional-Data-Engineer Exam With Actual Google Study Guide!
Build & Operationalize Data Processing Systems
- Build & Operationalize Pipeline: This module requires that the learners demonstrate competence in data cleansing, transformation, batch & streaming, data import & acquisition, as well as integration with the new data sources;
- Build & Operationalize Processing Infrastructure: The considerations for this subject area include provisioning resources, adjusting pipeline, monitoring pipeline, and testing & quality control.
- Build & Operationalize Storage Systems: This part requires skill and competence in the effective use of managed services, including Cloud Spanner, Cloud Bigtable, BigQuery, Cloud SQL, Cloud Memorystore, Cloud Datastore, and Cloud Storage. It also covers managing the data lifecycle and storage performance and costs.
The Professional Data Engineer exam is an industry-standard exam that tests a candidate's ability to enable data-driven decision-making by collecting, transforming, and publishing data. If you are aiming for a career in data engineering, you should take this test. Passing it earns the Professional Data Engineer certification issued by Google.
NEW QUESTION 148
You want to archive data in Cloud Storage. Because some data is very sensitive, you want to use the "Trust No One" (TNO) approach to encrypt your data to prevent the cloud provider staff from decrypting your data. What should you do?
- A. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in Cloud Memorystore as permanent storage of the secret.
- B. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in a different project that only the security team can access.
- C. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key and unique additional authenticated data (AAD). Use gsutil cp to upload each encrypted file to the Cloud Storage bucket, and keep the AAD outside of Google Cloud.
- D. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key. Use gsutil cp to upload each encrypted file to the Cloud Storage bucket. Manually destroy the key previously used for encryption, and rotate the key once.
Answer: D
NEW QUESTION 149
MJTelco's Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations.
You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline configuration setting should you update?
- A. The disk size per worker
- B. The zone
- C. The maximum number of workers
- D. The number of workers
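Answer: C
Explanation:
The maximum number of workers sets the upper bound to which Cloud Dataflow autoscaling can grow the worker pool. The zone only determines where the workers run.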
NEW QUESTION 150
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world.
The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data.
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day.
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day's events. They also want to use streaming ingestion. What should you do?
- A. Create a partitioned table called tracking_table and include a TIMESTAMP column.
- B. Create a table called tracking_table and include a DATE column.
- C. Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.
- D. Create a table called tracking_table with a TIMESTAMP column to represent the day.
Answer: A
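Why a partitioned table keeps daily queries cheap can be illustrated with a small sketch. This is not BigQuery code: it is a hypothetical in-memory stand-in in which each day's rows live in their own bucket, so a query for one day reads only that bucket. The function names and sample data are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for a day-partitioned tracking_table:
# one bucket of rows per calendar day (UTC), keyed by the row's timestamp.
partitions = defaultdict(list)

def insert_row(device_id, event_ts, payload):
    """Route a streamed row to the partition for its day."""
    day = event_ts.astimezone(timezone.utc).date()
    partitions[day].append({"device_id": device_id, "ts": event_ts, "payload": payload})

def query_day(day):
    """Return (rows, rows_scanned) for one day; only that partition is read."""
    rows = partitions.get(day, [])
    return rows, len(rows)

def total_rows():
    """Number of rows a full, unpartitioned scan would have to read."""
    return sum(len(rows) for rows in partitions.values())

# Three days of sample events, four rows per day.
for d in (1, 2, 3):
    for hour in (0, 6, 12, 18):
        insert_row("dev-1", datetime(2024, 1, d, hour, tzinfo=timezone.utc), "ok")

rows, scanned = query_day(datetime(2024, 1, 2, tzinfo=timezone.utc).date())
print(f"scanned {scanned} of {total_rows()} rows")
```

The single table name is preserved, and the per-day query touches a third of the sample data instead of all of it.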
NEW QUESTION 151
Case Study 2 - MJTelco
MJTelco needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day. Which schema should you use?
- A. Rowkey: date#device_id; Column data: data_point
- B. Rowkey: data_point; Column data: device_id,date
- C. Rowkey: date#data_point; Column data: device_id
- D. Rowkey: date; Column data: device_id,data_point
- E. Rowkey: device_id; Column data: date, data_point
Answer: A
Explanation:
Because "the most common query is for all the data for a given device for a given day", the row key should contain both the device ID and the date.
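The effect of a composite row key can be sketched without Bigtable itself. Bigtable stores rows sorted lexicographically by row key, so the sketch below uses a sorted Python list as a stand-in and serves the query with a prefix scan. The helper names and sample device IDs are hypothetical.

```python
from bisect import bisect_left

def row_key(date, device_id):
    # Mirrors the "date#device_id" layout from option A.
    return f"{date}#{device_id}"

# A sorted list of (key, value) pairs stands in for the table, since
# rows are kept in lexicographic order of their keys.
table = sorted(
    (row_key(date, dev), f"data_point for {dev} on {date}")
    for date in ("2024-01-01", "2024-01-02")
    for dev in ("dev-a", "dev-b", "dev-c")
)
keys = [k for k, _ in table]

def prefix_scan(prefix):
    """Return all rows whose key starts with prefix, using the sort order."""
    start = bisect_left(keys, prefix)
    end = bisect_left(keys, prefix + "\uffff")
    return table[start:end]

# All data for one device on one day is a single contiguous lookup.
one_device_one_day = prefix_scan(row_key("2024-01-02", "dev-b"))
# All devices for one day are also contiguous with this layout.
whole_day = prefix_scan("2024-01-02#")
print(one_device_one_day, len(whole_day))
```

Using ISO-formatted dates is an assumption here; it makes lexicographic order match chronological order.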
NEW QUESTION 152
You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to process that log file. You need to make sure the log file is processed once per day as inexpensively as possible. What should you do?
- A. Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.
- B. Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.
- C. Manually start the Cloud Dataflow job each morning when you get into the office.
- D. Change the processing job to use Google Cloud Dataproc instead.
Answer: A
NEW QUESTION 153
Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams have freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse. You need to discover what everyone is doing. What should you do first?
- A. Get the Identity and Access Management (IAM) policy of each table.
- B. Use Stackdriver Monitoring to see the usage of BigQuery query slots.
- C. Use Google Stackdriver Audit Logs to review data access.
- D. Use the Google Cloud Billing API to see what account the warehouse is being billed to.
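Answer: C
Explanation:
Audit logs record who accessed which data and which queries were run, which is what you need in order to discover how the warehouse is currently used. Slot-usage monitoring shows only aggregate capacity consumption.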
NEW QUESTION 154
An organization maintains a Google BigQuery dataset that contains tables with user-level data. They want to expose aggregates of this data to other Google Cloud projects, while still controlling access to the user-level data. Additionally, they need to minimize their overall storage cost and ensure the analysis cost for other projects is assigned to those projects. What should they do?
- A. Create and share a new dataset and table that contains the aggregate results.
- B. Create dataViewer Identity and Access Management (IAM) roles on the dataset to enable sharing.
- C. Create and share a new dataset and view that provides the aggregate results.
- D. Create and share an authorized view that provides the aggregate results.
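Answer: D
Explanation:
An authorized view exposes the aggregate query results without granting access to the underlying user-level tables. A view stores no data, so it adds no storage cost, and queries against it are billed to the project that runs them.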
NEW QUESTION 155
Which SQL keyword can be used to reduce the number of columns processed by BigQuery?
- A. LIMIT
- B. SELECT
- C. WHERE
- D. BETWEEN
Answer: B
Explanation:
SELECT allows you to query specific columns rather than the whole table. LIMIT, BETWEEN, and WHERE clauses will not reduce the number of columns processed by BigQuery.
Reference: https://cloud.google.com/bigquery/launch-checklist#architecture_design_and_development_checklist
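The point about SELECT can be made concrete with a rough estimate. In a columnar store, the bytes read depend on which columns a query references, not on how many rows it returns. The column names and per-row sizes below are made up for illustration and do not come from any real table.

```python
# Hypothetical table: column name -> bytes stored per row (made-up sizes).
COLUMN_BYTES = {"device_id": 8, "event_ts": 8, "payload": 200, "region": 16}
ROW_COUNT = 1_000_000

def bytes_processed(selected_columns):
    """Bytes a columnar engine reads: every row of each referenced column."""
    return ROW_COUNT * sum(COLUMN_BYTES[c] for c in selected_columns)

# Equivalent of SELECT *: every column is read.
select_star = bytes_processed(COLUMN_BYTES)
# Equivalent of SELECT device_id, event_ts: only two columns are read.
select_two = bytes_processed(["device_id", "event_ts"])
print(select_star, select_two)
```

A LIMIT clause would not change either figure in this model, because the referenced columns are still read in full.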
NEW QUESTION 156
To run a TensorFlow training job on your own computer using Cloud Machine Learning Engine, what would your command start with?
- A. gcloud ml-engine local train
- B. You can't run a TensorFlow program on your own computer using Cloud ML Engine.
- C. gcloud ml-engine jobs submit training local
- D. gcloud ml-engine jobs submit training
Answer: A
Explanation:
gcloud ml-engine local train - run a Cloud ML Engine training job locally
This command runs the specified module in an environment similar to that of a live Cloud ML Engine Training Job.
This is especially useful in the case of testing distributed models, as it allows you to validate that you are properly interacting with the Cloud ML Engine cluster configuration.
NEW QUESTION 157
You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?
- A. Add a SideInput that returns a Boolean if the element is corrupt.
- B. Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.
- C. Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
- D. Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
Answer: C
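The filtering idea behind a ParDo can be sketched in plain Python: a per-element function emits zero or one output, so corrupt elements simply produce nothing. This is not Apache Beam code. The `par_do` helper is a hypothetical stand-in for the transform, and the validity rule (JSON with a `device_id` and a numeric `reading`) is an assumption chosen for the example.

```python
import json

def discard_corrupt(element):
    """DoFn-style function: yields the parsed record, or nothing if corrupt."""
    try:
        record = json.loads(element)
    except (ValueError, TypeError):
        return  # not valid JSON: emit nothing
    if not isinstance(record, dict):
        return
    # Hypothetical validity rule: a device_id and a numeric reading are required.
    if "device_id" in record and isinstance(record.get("reading"), (int, float)):
        yield record

def par_do(fn, elements):
    """Tiny stand-in for a ParDo: apply fn to each element, flatten outputs."""
    for element in elements:
        yield from fn(element)

messages = [
    '{"device_id": "a", "reading": 1.5}',
    'not-json',
    '{"device_id": "b"}',
    '{"device_id": "c", "reading": 7}',
]
clean = list(par_do(discard_corrupt, messages))
print(clean)
```

In a real pipeline the same per-element function would be wrapped in the SDK's own transform; only the element-wise logic is shown here.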
NEW QUESTION 158
Hive is the primary tool in use, and the data format is Optimized Row Columnar (ORC). All ORC files have been successfully copied to a Cloud Storage bucket. You need to replicate some data to the cluster's local Hadoop Distributed File System (HDFS) to maximize performance. What are two ways to start using Hive in Cloud Dataproc? (Choose two.)
- A. Load the ORC files into BigQuery. Leverage BigQuery connector for Hadoop to mount the BigQuery tables as external Hive tables. Replicate external Hive tables to the native ones.
- B. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to any node of the Dataproc cluster. Mount the Hive tables locally.
- C. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to HDFS. Mount the Hive tables locally.
- D. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them to HDFS. Mount the Hive tables from HDFS.
- E. Leverage Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Replicate external Hive tables to the native ones.
Answer: B,D
NEW QUESTION 159
What are the minimum permissions needed for a service account used with Google Dataproc?
- A. Read and write to Google Cloud Storage; write to Google Cloud Logging
- B. Execute to Google Cloud Storage; write to Google Cloud Logging
- C. Write to Google Cloud Storage; read to Google Cloud Logging
- D. Execute to Google Cloud Storage; execute to Google Cloud Logging
Answer: A
Explanation:
Service accounts authenticate applications running on your virtual machine instances to other Google Cloud Platform services. For example, if you write an application that reads and writes files on Google Cloud Storage, it must first authenticate to the Google Cloud Storage API. At a minimum, service accounts used with Cloud Dataproc need permissions to read and write to Google Cloud Storage, and to write to Google Cloud Logging.
Reference: https://cloud.google.com/dataproc/docs/concepts/service-accounts#important_notes
NEW QUESTION 160
You are building a data pipeline on Google Cloud. You need to prepare data using a casual method for a machine-learning process. You want to support a logistic regression model. You also need to monitor and adjust for null values, which must remain real-valued and cannot be removed. What should you do?
- A. Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 0 using a Cloud Dataprep job.
- B. Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 'none' using a Cloud Dataproc job.
- C. Use Cloud Dataflow to find null values in sample source data. Convert all nulls to 'none' using a Cloud Dataprep job.
- D. Use Cloud Dataflow to find null values in sample source data. Convert all nulls to using a custom script.
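Answer: A
Explanation:
Logistic regression needs numeric inputs, so nulls must be replaced with a real value such as 0. The string 'none' is not real-valued. Cloud Dataprep can both find the nulls and apply the conversion.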
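The requirement that null values stay real-valued can be illustrated with a minimal sketch. It replaces each null with a numeric placeholder and counts the replacements per column so they can be monitored. The function name, the sample rows, and the choice of 0.0 as the fill value are assumptions for illustration, not part of any Google Cloud tool.

```python
def impute_nulls(rows, fill_value=0.0):
    """Replace None with a real-valued placeholder and count replacements per column."""
    null_counts = {}
    cleaned = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if value is None:
                null_counts[column] = null_counts.get(column, 0) + 1
                value = fill_value
            new_row[column] = float(value)
        cleaned.append(new_row)
    return cleaned, null_counts

# Hypothetical sample of source data with missing values.
sample = [
    {"age": 34, "income": None},
    {"age": None, "income": 52000.0},
    {"age": 51, "income": 61000.0},
]
cleaned, null_counts = impute_nulls(sample)
print(cleaned, null_counts)
```

Every value in `cleaned` is a float, so the rows can be fed to a numeric model, and `null_counts` gives a simple signal to watch over time.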
NEW QUESTION 161
Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?
- A. Cloud Composer
- B. Cloud Dataprep
- C. Cloud Dataflow
- D. Cloud Dataproc
Answer: A
Explanation:
Cloud Composer is built on Apache Airflow, an open-source workflow orchestrator, so it can orchestrate jobs that span multiple cloud providers.
NEW QUESTION 162
You're using Bigtable for a real-time application, and you have a heavy load that is a mix of reads and writes. You've recently identified an additional use case and need to run an hourly analytical job to calculate certain statistics across the whole database. You need to ensure the reliability of both your production application and the analytical workload. What should you do?
- A. Add a second cluster to an existing instance with multi-cluster routing. Use a live-traffic app profile for your regular workload and a batch-analytics profile for the analytics workload.
- B. Export a Bigtable dump to GCS and run your analytical job on top of the exported files.
- C. Add a second cluster to an existing instance with single-cluster routing. Use a live-traffic app profile for your regular workload and a batch-analytics profile for the analytics workload.
- D. Increase the size of your existing cluster twice and execute your analytics workload on your new resized cluster.
Answer: A
NEW QUESTION 163
You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:
* The user profile: What the user likes and doesn't like to eat
* The user account information: Name, address, preferred meal times
* The order information: When orders are made, from where, to whom
The database will be used to store all the transactional data of the product. You want to optimize the data schema. Which Google Cloud Platform product should you use?
- A. Cloud SQL
- B. Cloud Datastore
- C. BigQuery
- D. Cloud Bigtable
Answer: C
NEW QUESTION 164
Which of the following is NOT one of the three main types of triggers that Dataflow supports?
- A. Trigger that is a combination of other triggers
- B. Trigger based on element size in bytes
- C. Trigger based on element count
- D. Trigger based on time
Answer: B
Explanation:
There are three major kinds of triggers that Dataflow supports:
1. Time-based triggers.
2. Data-driven triggers. You can set a trigger to emit results from a window when that window has received a certain number of data elements.
3. Composite triggers. These triggers combine multiple time-based or data-driven triggers in some logical way.
Reference: https://cloud.google.com/dataflow/model/triggers
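A data-driven trigger based on element count can be sketched in a few lines. This is a plain-Python illustration of the idea rather than the Dataflow SDK: results are emitted each time a fixed number of elements has arrived, and whatever remains is emitted when the input ends. The function name is hypothetical.

```python
def fire_after_count(elements, count):
    """Data-driven trigger sketch: emit a pane each time `count` elements arrive."""
    pane = []
    for element in elements:
        pane.append(element)
        if len(pane) == count:
            yield pane
            pane = []
    if pane:
        yield pane  # final, partial pane when the input ends

panes = list(fire_after_count(range(7), count=3))
print(panes)
```

Note that the trigger condition depends on how many elements were seen, not on their size in bytes, which is why option B is not one of the supported kinds.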
NEW QUESTION 165
You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity 'Movie' the property 'actors' and the property 'tags' have multiple values but the property 'date released' does not. A typical query would ask for all movies with actor=<actorname> ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

- A. Option B.
- B. Option A
- C. Option C
- D. Option D
Answer: B
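The combinatorial explosion the question refers to can be shown with simple arithmetic. When one composite index covers several multi-valued properties, an entity needs one index entry per combination of their values. Indexing each multi-valued property separately alongside the sort property needs only one entry per value. The movie data below is invented for illustration.

```python
from itertools import product

# Hypothetical entity with two multi-valued properties.
movie = {
    "actors": ["Actor A", "Actor B", "Actor C"],
    "tags": ["Comedy", "Family", "Classic", "Holiday"],
    "date_released": "1990-11-16",
}

# One composite index over (actors, tags, date_released): one entry per
# combination of values of the multi-valued properties.
combined = list(product(movie["actors"], movie["tags"], [movie["date_released"]]))

# Two separate composite indexes: (actors, date_released) and (tags, date_released).
separate = (
    list(product(movie["actors"], [movie["date_released"]]))
    + list(product(movie["tags"], [movie["date_released"]]))
)
print(len(combined), len(separate))
```

With 3 actors and 4 tags the combined index needs 12 entries for this one entity, while the two separate indexes need 7, and the gap widens as the lists grow.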
