After going through multiple job descriptions, and based on my experience in the field, I have come up with the following detailed skill set needed to become a competent Data Engineer.
* Data Wrangling operations - pandas, NumPy, re
* Data Scraping - requests, BeautifulSoup, lxml, Scrapy
* Interacting with External APIs and other Data Sources, Logging
* Parallel processing libraries - Dask, multiprocessing
* Serverless Computing, Virtual Instances, Managed Docker and Kubernetes Services
* Security Standards, User Authentication and Authorization, Virtual Private Cloud, Subnets
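A typical wrangling task combines pandas with the `re` module. A minimal sketch, assuming pandas is installed; the column names and data below are made up for demonstration:

```python
# Illustrative sketch: cleaning a small, messy dataset with pandas and re.
# The "name"/"phone" columns and values are hypothetical examples.
import re

import pandas as pd

raw = pd.DataFrame({
    "name": ["  Alice ", "BOB", "carol"],
    "phone": ["(555) 123-4567", "555.987.6543", "5551112222"],
})

# Normalize whitespace and casing with pandas string methods.
raw["name"] = raw["name"].str.strip().str.title()

# Keep only digits in the phone numbers using a regular expression.
raw["phone"] = raw["phone"].map(lambda s: re.sub(r"\D", "", s))

print(raw)
```

The same `re.sub` pattern generalizes to any "strip the noise, keep the signal" column cleanup.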
"90% of public cloud workloads run on a Linux-based OS"
* Bash scripting concepts in Linux such as control flow, looping, and passing input parameters
* File System Commands
* Running daemon processes
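These Bash basics can be sketched in one short script; the directory argument and its default are made-up examples:

```shell
#!/usr/bin/env bash
# Illustrative sketch: input parameters, control flow, and a loop in Bash.
# The target directory is a hypothetical example.
target_dir="${1:-/tmp}"          # first positional parameter, with a default

if [ ! -d "$target_dir" ]; then   # control flow: guard against a bad input
    echo "No such directory: $target_dir" >&2
    exit 1
fi

count=0
for entry in "$target_dir"/*; do  # loop over the directory's entries
    count=$((count + 1))
done
echo "$target_dir contains $count entries"
```

For a long-running job, the usual pattern is to detach it from the terminal, e.g. `nohup ./my_job.sh &`, so it keeps running as a background process.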
* Creating tables; read, write, update and delete operations; joins, procedures, materialized views, aggregated views, window functions
* Database vs. data warehouse; star and snowflake schemas; fact and dimension tables.
* Knowledge of how a distributed data store works.
* Understanding concepts like partitioned data storage, sorting keys, SerDes, data replication, caching and persistence.
* Common techniques and patterns for data processing, such as partitioning, predicate pushdown, sorting by partition, maintaining the size of shuffle blocks, and window functions
* Leveraging all cores and memory available in the cluster to improve concurrency.
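Window functions in particular come up constantly in interviews and ETL code. A minimal sketch using the stdlib `sqlite3` module (requires SQLite 3.25+, which ships with recent Python builds); the `sales` table and its data are made up for demonstration:

```python
# Illustrative sketch: a window function over a tiny fact table.
# The "sales" table, regions and amounts are hypothetical examples.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("east", 300), ("west", 200)],
)

# Rank each sale within its region: the window PARTITION BY splits rows
# by a key, much like partitioned data storage splits files by a key.
rows = conn.execute(
    """
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
    ORDER BY region, rnk
    """
).fetchall()
print(rows)
```

The same `PARTITION BY` / `ORDER BY` shape carries over directly to warehouse engines like BigQuery, Redshift and Snowflake.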
Different companies pick ETL frameworks in different ways. One with an in-house data engineering team would prefer to set up ETL jobs with a properly managed workflow management tool for batch processing.
* ETL - ETL vs ELT, Data connectivity, Mapping, Metadata, Types of Data Loading
* When to use a Workflow Management System - Directed Acyclic Graphs (DAGs), CRON scheduling, Operators
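The core idea behind workflow managers such as Airflow is that tasks form a Directed Acyclic Graph and run in dependency order. A minimal sketch with the stdlib `graphlib` module (Python 3.9+); the task names and the graph itself are made-up examples:

```python
# Illustrative sketch: a DAG of ETL tasks resolved into an execution order.
# Task names and dependencies are hypothetical examples.
from graphlib import TopologicalSorter

# task -> set of upstream tasks it depends on
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load", "transform"},
}

# A topological sort yields an order where every task runs after its
# dependencies - exactly what a scheduler does on each CRON trigger.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

A real workflow manager adds scheduling, retries and operators on top, but the dependency-resolution step is the same.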
Knowledge of a JVM-based language such as Java or Scala will be extremely useful.
- Understand both functional and object-oriented programming concepts
- Many of the high-performance data science frameworks built on top of Hadoop are written in Scala or Java.
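The functional-vs-OOP distinction can be shown in a few lines; the sketch below uses Python (in Scala the functional half would use `map`/`foldLeft`), and the running-total example is made up for demonstration:

```python
# Illustrative sketch: the same computation in a functional style (pure
# functions, no mutation) and an object-oriented style (state + methods).
from functools import reduce

values = [1, 2, 3, 4]

# Functional style: compose pure functions; the input list is never mutated.
total_fp = reduce(lambda acc, x: acc + x, map(lambda x: x * 2, values), 0)

# Object-oriented style: an object encapsulates mutable state behind methods.
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x * 2

acc = Accumulator()
for v in values:
    acc.add(v)

print(total_fp, acc.total)  # both compute the same doubled sum
```

Frameworks like Spark lean heavily on the functional style, which is why comfort with both paradigms pays off.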
* Understanding how data ingestion happens in Message Queues
* What Producers and Consumers are, and how they are implemented
* Sharding, Data Retention, Replay, De-duplication
* Differentiating between Real-time, Stream and Batch Processing
* Sharding, Repartitioning, Poll Wait time, topics/groups, brokers
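The producer/consumer pattern behind all of these systems can be sketched with the stdlib `queue` module; a real deployment would use a broker such as Kafka, where the queue is sharded into partitions across brokers. The message values and sentinel below are made-up examples:

```python
# Illustrative sketch: one producer thread publishing messages, one
# consumer thread polling them off a shared queue.
import queue
import threading

q = queue.Queue()
consumed = []

def producer():
    for i in range(5):
        q.put(i)          # publish a message
    q.put(None)           # sentinel: no more messages

def consumer():
    while True:
        msg = q.get()     # blocks (polls) until a message is available
        if msg is None:
            break
        consumed.append(msg)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(consumed)
```

Concepts like retention and replay come from the broker persisting messages instead of discarding them on delivery, which an in-memory queue like this one does not do.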
Also published on: https://dev.to/srinidhi/data-engineering-series-1-10-key-tech-skills-you-need-to-become-a-competent-data-engineer-2n46