Big Data Certification

Image of AWS Big Data Speciality Certification

I passed the AWS Certified Big Data Specialty Exam on Saturday. That makes it my 9th AWS certification in the last 10 months. For a moment I’ll have 9/9 certifications. Machine Learning opens this month, so come tomorrow I’ll have 9/10 certifications. The recommended training for Machine Learning is Big Data on AWS and Deep Learning on AWS. Given I just completed Big Data, I’ll probably schedule that exam for sometime in May.

The Big Data Certification Exam is similar to the other specialty exams. While not necessarily as hard as the Professional level exams, it does require a detailed level of knowledge. Unlike the other specialty exams, though, Big Data requires a breadth and depth of knowledge consistent with the Professional level exams. I prepared using acloud.guru’s AWS Certified Big Data - Specialty course, which covers somewhere between 50% and 60% of the required topics around Kinesis, IoT, S3, DynamoDB, EMR, Redshift, and QuickSight. I also reviewed some topics in Linux Academy to reinforce the concepts. The rest came from hands-on experience and labs. AWS doesn’t offer a practice exam, so I tried the Whizlabs practice exams. These typically have issues and provide a false sense of confidence, as they are always easier than the actual certification exam.

Acloud.guru covers a lot of information, and it also provides a set of links to critical whitepapers and blog articles. As always, without violating the NDA, they do an excellent job of pointing you to the topics to study. Aside from that material, I read a whole bunch of AWS links, which are posted at the end of this blog article. Also, there was a great YouTube playlist John Creecy put together at https://www.youtube.com/playlist?list=PLlp-qT09uTBcoMpiQkpO-G8GsHOVWyfV0.

I had relatively little experience with Kinesis, EMR, Redshift, and QuickSight before studying for the exam. I found Kinesis, Redshift, and Elasticsearch fascinating, and will be looking for projects in this space to continue my learning.

Kinesis
https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html https://docs.aws.amazon.com/streams/latest/dev/introduction-to-enhanced-consumers.html https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-ddb.html https://docs.aws.amazon.com/streams/latest/dev/kinesis-using-sdk-java-resharding-split.html https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-kpl.html https://docs.aws.amazon.com/streams/latest/dev/building-consumers.html https://docs.aws.amazon.com/streams/latest/dev/creating-using-sse-master-keys.html https://docs.aws.amazon.com/streams/latest/dev/kinesis-kpl-concepts.html https://docs.aws.amazon.com/streams/latest/dev/kinesis-producer-adv-retries-rate-limiting.html https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-kcl.html https://docs.aws.amazon.com/streams/latest/dev/agent-health.html https://docs.aws.amazon.com/streams/latest/dev/kinesis-using-sdk-java-resharding-merge.html

Kinesis Firehose
https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html#data-flow-diagrams https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html https://docs.aws.amazon.com/firehose/latest/dev/create-configure.html https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html#lambda-blueprints https://docs.aws.amazon.com/firehose/latest/dev/encryption.html

Kinesis Data Analytics
https://docs.aws.amazon.com/kinesisanalytics/latest/dev/what-is.html https://docs.aws.amazon.com/kinesisanalytics/latest/dev/streams-pumps.html https://docs.aws.amazon.com/kinesisanalytics/latest/dev/authentication-and-access-control.html https://docs.aws.amazon.com/kinesisanalytics/latest/dev/stagger-window-concepts.html https://docs.aws.amazon.com/kinesisanalytics/latest/dev/tumbling-window-concepts.html https://docs.aws.amazon.com/kinesisanalytics/latest/dev/sliding-window-concepts.html https://docs.aws.amazon.com/kinesisanalytics/latest/dev/continuous-queries-concepts.html

IoT
https://docs.aws.amazon.com/iot/latest/developerguide/what-is-aws-iot.html https://docs.aws.amazon.com/iot/latest/developerguide/policy-actions.html https://docs.aws.amazon.com/iot/latest/developerguide/iam-policies.html https://docs.aws.amazon.com/iot/latest/developerguide/iot-provision.html https://docs.aws.amazon.com/iot/latest/developerguide/iot-device-shadows.html https://docs.aws.amazon.com/iot/latest/developerguide/iot-rule-actions.html

ElasticSearch
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/what-is-amazon-elasticsearch-service.html https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-bp.html https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-aws-integrations.html

CloudSearch
https://docs.aws.amazon.com/cloudsearch/latest/developerguide/what-is-cloudsearch.html

EMR
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html#emr-overview-clusters https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-encryption-enable.html#emr-awskms-keys https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html https://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-configure-sqs-cw.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-tez.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hbase.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hcatalog.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-zookeeper.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-phoenix.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-sqoop.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyter-emr-managed-notebooks.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub.html

QuickSight
https://docs.aws.amazon.com/quicksight/latest/user/welcome.html https://docs.aws.amazon.com/quicksight/latest/user/refreshing-imported-data.html https://docs.aws.amazon.com/quicksight/latest/user/joining-tables.html https://docs.aws.amazon.com/quicksight/latest/user/bar-charts.html https://docs.aws.amazon.com/quicksight/latest/user/combo-charts.html https://docs.aws.amazon.com/quicksight/latest/user/heat-map.html https://docs.aws.amazon.com/quicksight/latest/user/line-charts.html https://docs.aws.amazon.com/quicksight/latest/user/kpi.html https://docs.aws.amazon.com/quicksight/latest/user/restrict-access-to-a-data-set-using-row-level-security.html#create-row-level-security https://docs.aws.amazon.com/quicksight/latest/user/tabular.html https://docs.aws.amazon.com/quicksight/latest/user/supported-data-sources.html https://docs.aws.amazon.com/quicksight/latest/user/scatter-plot.html https://docs.aws.amazon.com/quicksight/latest/user/geospatial-data-prep.html

Redshift
https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#rs-about-clusters-and-nodes https://docs.aws.amazon.com/redshift/latest/mgmt/enhanced-vpc-working-with-endpoints.html https://docs.aws.amazon.com/redshift/latest/dg/c_designing-queries-best-practices.html https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-use-copy.html https://docs.aws.amazon.com/redshift/latest/dg/c_intro_STL_tables.html https://docs.aws.amazon.com/redshift/latest/dg/c_intro_STV_tables.html https://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html https://docs.aws.amazon.com/redshift/latest/dg/wlm-short-query-acceleration.html

DynamoDB
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html#bp-partition-key-partitions-adaptive https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/globaltables_monitoring.html https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-data-upload.html https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/globaltables_reqs_bestpractices.html https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-aggregation.html https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-gsi-sharding.html

Machine Learning
https://docs.aws.amazon.com/machine-learning/latest/dg/types-of-ml-models.html https://docs.aws.amazon.com/machine-learning/latest/dg/binary-model-insights.html https://docs.aws.amazon.com/machine-learning/latest/dg/regression-model-insights.html https://docs.aws.amazon.com/machine-learning/latest/dg/multiclass-model-insights.html https://docs.aws.amazon.com/machine-learning/latest/dg/ml-model-insights.html https://docs.aws.amazon.com/machine-learning/latest/dg/cross-validation.html https://docs.aws.amazon.com/machine-learning/latest/dg/creating-and-using-datasources.html https://docs.aws.amazon.com/machine-learning/latest/dg/creating-a-data-schema-for-amazon-ml.html https://docs.aws.amazon.com/machine-learning/latest/dg/amazon-machine-learning-key-concepts.html

Pipeline
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-how-tasks-scheduled.html https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-datanodes.html https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-databases.html https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-part1.html https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/datapipeline-related-services.html

Data Movement
https://docs.aws.amazon.com/SchemaConversionTool/latest/userguide/CHAP_Welcome.html

Athena
https://docs.aws.amazon.com/athena/latest/ug/access.html https://docs.aws.amazon.com/athena/latest/ug/encryption.html#encryption-options-S3-and-Athena https://docs.aws.amazon.com/athena/latest/ug/athena-aws-service-integrations.html

Glue
https://docs.aws.amazon.com/glue/latest/dg/components-overview.html

Advanced Architecting on AWS

I took Advanced Architecting on AWS over the last three days. The course is part of the learning process for the AWS Certified Solutions Architect – Professional. I already have the certification based on the older version of the exam. The new version of the certification exam went live on February 4th. The course seems to follow the newer certification guide. Overall the course is good, as it covers all the services required, but the labs were a little disappointing as they lacked complexity. To become proficient and attempt the certification, one would need to do a lot more learning and deep diving on the topics covered in this course. It reviews probably 35% of the material required to sit the exam.

Here is my summary by day of the course.

Day One

The morning was spent covering Account Management and multiple accounts, leading to AWS Organizations with service control policies. It finished on billing. The next two discussions were about Advanced Networking Architectures, then VPN and Direct Connect. The afternoon finished with a discussion on Deployments on AWS, which was an abbreviated version of material covered in the DevOps course.

Day Two

The morning started with data, specifically S3 and ElastiCache. Next, it was all about data import into AWS with Snowball, Snowmobile, S3 Transfer Acceleration, Storage Gateways (Tape Gateway, Volume Gateway, and File Gateway), and finished with DataSync and Database Migration.

The afternoon was spent on Big Data Architecture and Designing Large Scale Applications and finished with a lab on Blue-Green Deployments on Elastic Beanstalk.

Day Three

The last day was spent on Building Resilient Architectures and on Encryption and Data Security. The day ended early with a lab on KMS. The lab provided some basic KMS and OpenSSL encryption steps.

I thought the course missed an opportunity to talk about DR architectures.

It’s an interesting course and worth taking if you’re interested in learning more or planning to take the certifications.

Violating Security Policies

Dark Reading published a blog article entitled 6 Reasons Why Employees Violate Security Policies. The six reasons, according to the article, are:

  1. Ignorance
  2. Convenience
  3. Frustration
  4. Ambition
  5. Curiosity
  6. Helpfulness

I think they’re neglecting to get to the root of the issue, which is draconian security policies that don’t make things more secure. Over the years, I’ve seen similar policies coming from InfoSec groups. It’s common for developers to want to use the tools they’re comfortable with. In an extreme case, I’ve seen developers wanting to use Eclipse for development when Eclipse is forbidden because the only safe editor according to some InfoSec policy is VI (probably slightly exaggerated). Other extreme cases include banning Evernote or OneNote because they use cloud storage. I’m assuming here that no one is putting all their confidential customer data in a OneNote notebook.

Given what I’ve seen, employees violate security policies to get work done the way they want to do it. Maybe that is ignorance, convenience, frustration, ambition, or any other reason. Or maybe, if you’ve used something for 10 years, you don’t want to have to learn something new for development or keeping notes, given there are many other things to learn and do which add value to your job and employer.

Maybe to keep employees from violating InfoSec policies, InfoSec groups, instead of writing draconian security policies, could focus on identifying the security vulnerabilities that are more likely targets of hackers, and on putting policies, procedures, and operational security around them. Lastly, InfoSec could spend time educating employees on what confidential data is and where it is allowed to be stored.

Disclaimer: This blog article is not meant to condone, encourage, or motivate people to violate security policies.

What is the difference between a CDO, CTO and a CIO?

I got into an interesting discussion on the difference between a CDO, CTO, and CIO. The discussion started with whether all those positions are required in an organization. The group eventually agreed the answer was yes. The logic was that, given everything we do is digital, digital needs multiple seats at the executive table. The reason for this blog article is the question of where these roles fit within an organization. Let’s take a step back and explain how we defined the roles.

CDO should own e-commerce, mobile environments, and technology customer outreach. In a digital product company, they own the product roadmap. The CDO is responsible for all digital customer touch points. The technology partner for the CDO is the CMO or SVP of Sales. This role should be driving the business, and be a business enabler.

CIO should own the back office technology like email, ERP, messaging, desktops, laptops, printers, networking, service desks, and traditional data centers. Typically, this is the technology organization that operates as a cost center.

CTO should own the architecture and technology of the platforms. CTO is the technology partner for both the CDO and CIO. Their job should be to drive uniformity, coalesce ideas across technology, and work with the various stakeholders to ensure proper architecture governance (think TOGAF architecture review boards).

The group was pretty emphatic that the CDO should report to the CEO. This is where agreement on the remaining roles breaks down. The role defined for the CIO is an operational role, making sure essential infrastructure services and users can function. The group was split 50/50: half thought the CIO should report to the CDO, and the other half said some other C-level executive, like the CFO or COO.

The more complicated issue is where does the CTO report. The CTO is responsible for the architecture and technology of the platform which makes them a partner of the CDO, but also owns architecture review which makes them a partner of the CIO. So where does the CTO report?

The CDO has an entirely different objective than the CIO. If the CIO reports to the CDO, it would make sense to have the CTO report there. However, what happens when the CIO doesn’t report to the CDO? What happens if the CIO reports to the COO?

After several rounds of mental gymnastics, the group agreed to coalesce around two outcomes. In the first, both the CIO and the CTO report to the CDO. Basically, the CTO and CIO become peers in the same organization. In the other, the CIO reports to the CTO, and both the CTO and CDO report to the CEO.

Using Athena to Query ALB Logs

One of the more interesting AWS Big Data Services is Amazon Athena. Athena can process S3 data in a few seconds. One of the ways I like using it is to look for patterns in ALB access logs.

AWS provides detailed instructions on how to set up Athena to query ALB access logs. I’m not going to recap the configuration in this blog article, but I will share 3 of my favorite queries.

What are the most visited pages by client, and what is the total traffic on my website?

SELECT sum(received_bytes) as total_received, sum(sent_bytes) as total_sent, client_ip, 
count(client_ip) as client_requests, request_url  
FROM alb_logs 
GROUP BY client_ip, request_url  
ORDER BY total_sent  desc;

How long does it take to process requests on average?

-- Average processing time per URL and target, for pairs with more than 4 requests
SELECT sum(request_processing_time) as request_pt, sum(target_processing_time) as target_pt,
sum(response_processing_time) as response_pt,
sum(request_processing_time + target_processing_time + response_processing_time) as total_pt,
count(request_processing_time) as total_requests,
sum(request_processing_time + target_processing_time + response_processing_time) / count(request_processing_time) as avg_pt,
request_url, target_ip
FROM alb_logs WHERE target_ip <> ''
GROUP BY request_url, target_ip
HAVING count(request_processing_time) > 4
ORDER BY avg_pt desc;
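
As a rough illustration of what the query above computes, here is a minimal Python sketch that applies the same filter, grouping, and average to rows held in memory. The function name and the dictionary-per-row layout are hypothetical; only the field names come from the query.

```python
from collections import defaultdict


def average_processing_time(rows, min_requests=5):
    """Mirror the query: skip rows without a target, group by
    (request_url, target_ip), keep groups with more than 4 requests,
    and sort by average processing time, slowest first."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [total seconds, request count]
    for row in rows:
        if row["target_ip"] == "":  # WHERE target_ip <> ''
            continue
        key = (row["request_url"], row["target_ip"])
        totals[key][0] += (
            row["request_processing_time"]
            + row["target_processing_time"]
            + row["response_processing_time"]
        )
        totals[key][1] += 1

    results = [
        (url, ip, total / count, count)
        for (url, ip), (total, count) in totals.items()
        if count >= min_requests  # HAVING count(...) > 4
    ]
    results.sort(key=lambda item: item[2], reverse=True)  # ORDER BY avg_pt desc
    return results
```

Each result is a tuple of (request_url, target_ip, avg_pt, total_requests), so the slowest URL and target pair comes first.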

This last one is looking for requests the site doesn’t process. It’s usually some person trying to find some vulnerable PHP code.

SELECT count(client_ip) as client_requests, client_ip, target_ip, request_url, 
target_status_code 
FROM alb_logs 
WHERE target_status_code not in ('200','301','302','304') 
GROUP BY client_ip, target_ip, request_url, target_status_code
ORDER BY client_requests desc; 

Athena is a serverless tool that sets up in seconds and charges based on TB scanned, with a 10 MB minimum per query.
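
To get a feel for how the per-query minimum plays out, here is a small Python sketch of the arithmetic. The price per TB is an assumed placeholder (check the current price for your region), and treating 1 MB as 10**6 bytes and 1 TB as 10**12 bytes is also an assumption; the function name is hypothetical.

```python
MB = 10**6   # assumption: decimal megabyte
TB = 10**12  # assumption: decimal terabyte
MIN_BILLED_BYTES = 10 * MB  # 10 MB minimum per query


def estimate_query_cost(bytes_scanned, usd_per_tb=5.0):
    """Estimate the charge for one query from the bytes it scanned.

    usd_per_tb is an assumed example price, not an authoritative figure.
    """
    billed_bytes = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed_bytes / TB * usd_per_tb
```

Under these assumptions, a query that scans a few kilobytes is billed the same as one that scans 10 MB, which is why many tiny queries can cost more than expected.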

DevOps Engineering on AWS

I took DevOps Engineering on AWS over the last three days. The course is part of the learning process for the AWS Certified DevOps Engineer – Professional. Overall the course is excellent: it covers substantial material, and the labs are ok. To become proficient, one should do the labs from scratch and build the CloudFormation templates. It reviews 45-50% of the material on the DevOps exam, so each topic requires a deeper dive before sitting the exam.

Here is my summary by day of the course.

Day One

The class started with an introduction to DevOps and the AWS tools which support DevOps.

It’s interesting, as CodeBuild, CodeDeploy, and CodePipeline are required to replace Jenkins. Their advantage is that they integrate directly with AWS. One question I have is why there isn’t a service like JFrog Artifactory.

One of my favorite topics was DevSecOps which talks about adding security into the DevOps process. There should be a separate certification and course for DevSecOps or SecDevOps.

There was minimal discussion on Elastic Beanstalk, which was a big part of the old acloud.guru course and had several questions on the old exam.

Lastly, the day focused on various methods for updating applications: in-place updates, rolling updates, Blue/Green deployments, and Red/Black deployments.
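
To make the rolling approach a little more concrete, here is a minimal Python sketch, not taken from the course, of how a fleet might be split into batches so that only one batch is out of service at a time. The function name and instance IDs are hypothetical.

```python
def rolling_batches(instances, batch_size):
    """Split instances into consecutive batches for a rolling update.

    Each batch would be updated in turn while the remaining
    instances keep serving traffic.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        instances[start:start + batch_size]
        for start in range(0, len(instances), batch_size)
    ]
```

With a batch size equal to the fleet size this degenerates into an in-place update of everything at once, whereas a Blue/Green deployment would instead stand up a second fleet and switch traffic over.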

Day Two

The class started with a lab on CloudFormation. The lab was flawed, as it deployed code via cfn-init and cfn-hup. The rest of the morning was a deeper dive on the tools discussed throughout Day 1.

The afternoon lab focused on a pipeline, CodeBuild, and CodeDeploy. After the lab, we spent time discussing various types of testing, CloudWatch Logs, and OpsWorks. Most of the discussion was theoretical.

Day Three

The first part of the morning was a 2-hour lab on AWS OpsWorks, setting up a Chef recipe and scaling out the environment. The rest of the class was devoted to containers, primarily ECS, with a lab that deployed an application on containers.

It’s an interesting course and worth taking if you’re doing AWS DevOps or planning to take the certifications.
