What Does a Data Engineer Do?

Christa Terry April 20, 2020

Data scientists make headlines, but data engineers make data science possible. All the information that data scientists analyze passes through the hands of oft-overlooked data engineers first.

Data engineers are the frontline warriors of the analytics world. They go by many titles, including technical architects, BI developers, data science software engineers, and ETL developers. No matter what they’re called, these professionals take the first pass at the massive amounts of structured data, unstructured data, and semi-structured data that businesses, researchers, and government agencies access in our increasingly connected world.

Data engineers develop the data infrastructure and interfaces necessary to collect data from different sources. They also design the systems used to transform that information into clean data sets that data analysts and data scientists can sort in ways that lead to useful conclusions. That data transformation makes everything that happens in the data science world possible.

Given that, you may be wondering why data engineers don’t make headlines the way data scientists do. The reason may be that while even laypeople can understand what data scientists do, it’s not always clear what role data engineers play in data analysis. Data engineering isn’t any more difficult to understand than data science, however, as you’ll discover below.

In this article, we discuss what a data engineer does and answer the following questions:

What are a data engineer’s key responsibilities?
What does a data engineer’s typical day look like?
Do data engineers and data scientists do the same things?
What skills does a data engineer use?
Which degrees do data engineers usually have?
What tools do data engineers use?
How much do data engineers get paid?
What kinds of people are most likely to succeed in this career?

What are a data engineer’s key responsibilities?

Simply put, data engineers manage data. They construct and oversee database architecture. They determine how data is collected and stored. They prepare that data for analysis by creating the pipelines that transform raw data into useful formats. And sometimes, they even create the systems through which people who aren’t part of the data science or data engineering team can access essential data (via, for instance, a custom real-time analytics dashboard).
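
To make the idea of a pipeline that transforms raw data into a useful format more concrete, here is a minimal sketch in Python. The record fields (`user`, `amount`), the sample rows, and the stage names are all assumptions invented for illustration; a real pipeline would read from actual sources and usually run on a dedicated framework.

```python
from typing import Iterable, Iterator, List

# Hypothetical raw records, as they might arrive from an upstream source.
RAW_ROWS = [
    {"user": " Alice ", "amount": "19.99"},
    {"user": "bob", "amount": "not-a-number"},
    {"user": "", "amount": "5.00"},
    {"user": "Carol", "amount": "7.5"},
]


def normalize(rows: Iterable[dict]) -> Iterator[dict]:
    """Trim whitespace and lowercase the user name."""
    for row in rows:
        yield {**row, "user": row["user"].strip().lower()}


def parse_amounts(rows: Iterable[dict]) -> Iterator[dict]:
    """Convert amount strings to floats, skipping rows that do not parse."""
    for row in rows:
        try:
            yield {**row, "amount": float(row["amount"])}
        except ValueError:
            continue


def drop_anonymous(rows: Iterable[dict]) -> Iterator[dict]:
    """Discard rows that have no user name."""
    return (row for row in rows if row["user"])


def run_pipeline(rows: Iterable[dict]) -> List[dict]:
    """Chain the stages: each one consumes the previous stage's output."""
    return list(drop_anonymous(parse_amounts(normalize(rows))))


clean_rows = run_pipeline(RAW_ROWS)
```

Each stage does one small job, so a stage can be tested, replaced, or reordered without touching the others.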

According to Dataquest, there are three main types of data engineers, and their responsibilities differ somewhat. First, there is the generalist. These data engineers tend to work at small companies where there aren’t many data-focused employees, so they have to manage data and analyze it, too. Then there are the pipeline-centric data engineers, who work closely with data scientists, building whatever custom tools they might need to accomplish certain big data analytics goals. Finally, there are database-centric data engineers. They may do some pipeline building, but they spend the majority of their time creating large-scale data warehouses that make it easier for analysts and data scientists to do their jobs.

What does a data engineer’s typical day look like?

Data engineers spend their days:

Building custom data pipelines based on business logic
Collaborating with a database administrator to create data stores
Collecting data from various sources
Creating new data validation methods
Developing frameworks to serve data
Evaluating, parsing, and cleaning data sets
Gathering requirements for data models
Identifying and analyzing new data sources
Maintaining computer clusters
Making sure data is secure
Preparing data as part of ETL (extract, transform, and load) processes
Managing real-time data processing
Sorting through raw data
Stitching data from various sources together
Writing ETL logic
Writing queries that deliver accurate results
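
Several items on this list (collecting data, stitching sources together, writing ETL logic, writing queries) can be seen in one small example. The sketch below uses only Python's standard library, with an in-memory SQLite database standing in for a data warehouse. The two sources, the table layout, and the function names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export of orders.
ORDERS_CSV = """order_id,customer_id,total
1,10,25.00
2,11,40.50
3,10,9.50
"""

# Hypothetical source 2: customer records from another system.
CUSTOMERS = [
    {"customer_id": 10, "name": "Acme Corp"},
    {"customer_id": 11, "name": "Globex"},
]


def extract_orders(csv_text):
    """Extract: read raw rows from the CSV text."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(orders, customers):
    """Transform: cast types and stitch in the customer name."""
    names = {c["customer_id"]: c["name"] for c in customers}
    return [
        (int(o["order_id"]), names[int(o["customer_id"])], float(o["total"]))
        for o in orders
    ]


def load(rows, conn):
    """Load: write the cleaned rows into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract_orders(ORDERS_CSV), CUSTOMERS), conn)

# A query an analyst might run against the loaded table.
totals = conn.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
```

Keeping extract, transform, and load as separate functions mirrors how the steps are usually discussed, and makes it easier to see which step failed when something goes wrong.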

Most data engineers won’t do all of these things every single day, however. A data engineer’s day might begin with maintenance tasks like checking logs and looking into pipeline failures. Fixing issues and adding in any missing features required to keep a data warehouse well-stocked might keep our hypothetical data engineer busy until lunchtime. Next, they’ll turn their attention to whatever big project (e.g., creating a new ETL pipeline) is currently in their inbox. If they’ve already mapped out an implementation plan, they’ll start coding.

During all this, however, data scientists and other data users may be submitting tickets related to missing data, duplicate data, and errors. Sometimes the data engineer will prioritize the big project. At other times, tickets will take priority. Every day is different when you become a data engineer.
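
Tickets about missing or duplicate data often start with a quick check like the one sketched below. The rows, the `id` key, and the `quality_report` function are assumptions for illustration only; real checks would run against the actual tables involved.

```python
from collections import Counter

# Hypothetical rows from a warehouse table; None marks a missing value.
ROWS = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 3, "email": "c@example.com"},
]


def quality_report(rows, key="id"):
    """Count rows with missing values and list keys that appear more than once."""
    missing = sum(1 for row in rows if any(v is None for v in row.values()))
    key_counts = Counter(row[key] for row in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    return {
        "rows": len(rows),
        "rows_with_missing": missing,
        "duplicate_keys": duplicates,
    }


report = quality_report(ROWS)
```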

Do data engineers and data scientists do the same things?

Data science is a team sport. Data engineers work with data architects, data warehouse engineers, data platform engineers, analytics engineers, DevOps engineers, and yes, data scientists. In the smallest organizations, a single data-focused professional may be responsible for everything related to data acquisition, data pipeline creation, and data analysis. At most companies, however, all of these professionals do different things.

Data scientists use math and statistics to interpret data and deliver insights related to it. A data scientist might be given a business problem to solve, and they will need to decide which data can help solve it. In other cases, they’ll be given data and then asked to extract as much meaning from it as possible. Data engineers, on the other hand, use programming to build the infrastructure that serves up data in ready-to-interpret formats. Both have important roles to play in the data science world.

What skills does a data engineer use?

Data engineers utilize a broad range of skills every day. It’s not unreasonable to say that technical skills are the most critical. Indeed, the more technical skills a data engineer has, the better. They need solid coding chops, and they need to understand how to use ETL tools. Automation is an increasingly important part of data engineering, and data engineers must have both the knowledge and the problem-solving skills necessary to automate tedious data collection and sorting processes.

However, data engineers also need soft skills—especially communication skills and people skills. “A data engineer serves internal teams, so he or she has to understand the business goal that the data analyst wants to achieve to best support them,” Paul Lappas, co-founder and CEO of Intermix, told the Stitch blog. “If a data scientist has a specific tool they want to use, the data engineer has to set up the environment in a way that lets them use it. So you have to be really good at interacting with the rest of the data team.”

Data engineering is a collaborative enterprise. Almost everything data engineers do involves communicating and interacting with others.

Which degrees do data engineers usually have?

Some data engineers launch their careers without degrees, but most have at least a bachelor’s degree. That’s because even though it’s all but impossible to learn everything you’ll need to know to become a data engineer in a degree program, most employers prefer to hire candidates with degrees.

Deciding which degree to pursue can be a challenge because there are very few dedicated data engineering degrees at the undergraduate or graduate levels in the US. Most data engineers study computer science, data analytics, information systems, or data science, and if a program offers a data engineering specialization, all the better. Kent State University at Kent, for instance, offers a Bachelor of Science in Computer Science program with a data engineering concentration, and Wellesley College offers a Bachelor of Data Science with a dual concentration in computer science and data engineering.

Meanwhile, Northeastern University, Stevens Institute of Technology, and the University of Wisconsin – Madison offer some of the only master’s programs focused specifically on data engineering.

The best degree programs for data engineers cover:

Architectural Projections: Study of designing and planning data architecture to support current and future data needs, ensuring scalability, efficiency, and integration.
Data Mining Techniques: Methods and algorithms used to discover patterns, correlations, and insights from large datasets, often used for predictive analytics and decision-making.
Data Modeling Techniques: Techniques for creating data models that represent the structure, relationships, and constraints of data within a database or information system.
Database Clustering Techniques: Strategies for grouping multiple database servers to work together as a single system, enhancing performance, availability, and scalability.
Data Platforms: Examination of the software and hardware infrastructure that supports the storage, management, and processing of large-scale data, including cloud-based and on-premises solutions.
ETL Design: Study of Extract, Transform, Load (ETL) processes that integrate data from multiple sources into a single, unified data repository, such as a data warehouse.
Programming Languages: Instruction in programming languages commonly used in data engineering, such as Python, SQL, Java, and Scala, focusing on their application in data manipulation and analysis.
Relational and Non-Relational Database Theory: Exploration of the principles and concepts behind relational databases (structured with tables and schemas) and non-relational databases (such as NoSQL databases), including their design, implementation, and use cases.
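
The difference between relational and non-relational modeling is easier to see side by side. The following sketch stores the same made-up data two ways: as two SQLite tables linked by a foreign key, and as a single nested JSON document of the kind a document-oriented NoSQL store might hold. The table names, fields, and sample values are hypothetical.

```python
import json
import sqlite3

# Relational model: two tables linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        title TEXT NOT NULL
    );
    INSERT INTO authors VALUES (1, 'Ada');
    INSERT INTO posts VALUES (1, 1, 'Pipelines 101'), (2, 1, 'Schemas');
""")
titles = [
    row[0]
    for row in conn.execute(
        "SELECT p.title FROM posts p JOIN authors a ON a.id = p.author_id "
        "WHERE a.name = ? ORDER BY p.id",
        ("Ada",),
    )
]

# Document model: the same data nested in one self-contained record.
document = json.loads("""
{"name": "Ada", "posts": [{"title": "Pipelines 101"}, {"title": "Schemas"}]}
""")
doc_titles = [post["title"] for post in document["posts"]]
```

The relational version needs a join to reassemble an author's posts but avoids repeating author details; the document version reads in one step but duplicates data if the same post must appear under several records.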

It’s also possible to learn much of what you’ll need to know to succeed in this role outside of school. There are many online bootcamp-style courses designed for aspiring data engineers and courses that will teach you to use the most common data engineering tools on sites like Udemy. It’s not unusual for data engineers to pick up the technical skills they need to succeed in this role on their own.

What tools do data engineers use?

Data engineers work with many different tools and technologies, and most employers expect experience with nearly all of the following:

Algorithms and Data Structures: Fundamental concepts in computer science used to efficiently store, retrieve, and manipulate data, essential for optimizing performance and solving complex problems.
Apache Spark: An open-source unified analytics engine for large-scale data processing, known for its speed, ease of use, support for various programming languages, and in-memory computation for batch processing, stream processing, and machine learning tasks.
Data Warehousing Tools: Software and technologies designed for building and managing data warehouses, which centralize and store large volumes of data for querying and analysis.
Distributed Systems: Systems that distribute data and processing across multiple machines, enabling scalability, fault tolerance, and high availability for large-scale data applications.
ElasticSearch: A distributed, RESTful search and analytics engine capable of storing, searching, and analyzing large volumes of data in near real-time.
ETL Tools: Software that facilitates the Extract, Transform, Load process, which consolidates data from different sources into a single, consistent data store.
Google Cloud Platform (GCP), Amazon Web Services (AWS), and/or Microsoft Azure: Leading cloud service providers offering a wide range of data engineering tools and services for storage, computation, and machine learning.
Hadoop/Hive: Hadoop is an open-source framework for distributed storage and processing of large datasets, while Hive is a data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage.
Java/Scala: Programming languages commonly used in data engineering for building scalable and high-performance data processing applications, with strong support for concurrency and data handling.
Kafka: An open-source distributed event streaming platform used for building real-time data pipelines and streaming applications, capable of handling high-throughput data feeds.
Machine Learning: A field of artificial intelligence focused on developing algorithms that enable computers to learn from and make predictions based on data.
Python: A versatile, high-level programming language widely used in data engineering for scripting, data analysis, and machine learning, known for its readability and extensive libraries.
SQL and NoSQL: SQL refers to Structured Query Language used for managing and querying relational databases, while NoSQL encompasses a variety of database technologies designed to handle unstructured or semi-structured data.
Tableau: A powerful data visualization tool that helps in converting raw data into easily understandable visual insights, facilitating data analysis and decision-making.
UNIX, Linux, and Solaris: Operating systems known for their stability, security, and efficiency, commonly used in data engineering for managing servers, running applications, and processing data.
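
Several of the tools above (Hadoop, Spark, and distributed systems in general) rest on the same idea: split the data into partitions, process each partition independently, then merge the partial results. The sketch below imitates that pattern on one machine with Python's standard library. It does not use the Hadoop or Spark APIs, and the sample log lines and function names are invented for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log lines, pre-split into partitions as if stored on separate machines.
PARTITIONS = [
    ["error timeout", "ok", "error disk"],
    ["ok", "ok", "error timeout"],
]


def count_words(lines):
    """Map step: count words within a single partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge(partials):
    """Reduce step: combine the per-partition counts into one total."""
    total = Counter()
    for partial in partials:
        total += partial
    return total


# Threads stand in for worker machines; each handles one partition.
with ThreadPoolExecutor(max_workers=2) as pool:
    word_counts = merge(pool.map(count_words, PARTITIONS))
```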

Continuing education for data engineers often involves learning to use the tech tools and programming languages that weren’t covered in their degree program (whether because they didn’t enroll in a data engineering program or because the tools weren’t standard when they were in school). Many certifications for data engineers are tool-specific, like:

Cloudera’s Certified Professional Data Engineer credential
Google’s Cloud Data Engineer Certification
IBM’s Certified Data Engineer credential
Microsoft’s Certified Solutions Associate in Data Engineering with Azure credential
Microsoft’s MCSE: Data Management and Analytics credential

How much do data engineers get paid?

Data engineers are paid quite well. The average entry-level data engineer salary is about $77,000, and the highest-paid data engineers earn close to $160,000. Average salaries for mid-career data engineers can be anywhere from $89,000 to $124,000, which is a big range. It may be that higher earners have certifications or have completed a higher level of education, or that data engineers working in major metro areas are earning a lot more than their peers in suburban locales.

The good news is that $89,000 can go pretty far in many parts of the US, and there are more than 100,000 job openings for data engineers on Indeed.com—many of which were posted by employers who are looking to fill those positions quickly. That means this is a role that comes with a reassuring level of job security. The Dice 2020 Tech Job Report called data engineer the fastest-growing job in technology.

What kinds of people are most likely to succeed in this career?

To do what they do, data engineers need a firm understanding of programming, database design and configuration, and systems architecture, but that’s not all. They also need patience, determination, and a certain amount of grit. As Paul Lappas put it in the Stitch interview quoted above, data engineering “is very difficult. It’s an unsexy job, but it’s super-critical. Data engineers are kind of like the unsung heroes of the data world. Their job is incredibly complex, involving new skills and new tech. It’s really hard to build new ETL pipelines.”

It’s not a job for wannabe rock stars. Data engineers do work that’s every bit as challenging and technically rigorous as that of data scientists, but they’re seldom acknowledged for their work in the same way. That means data engineers have to love cleaning up raw data just because it’s there. They have to get a kick out of developing beautiful data architectures. And they have to be cool with the fact that if they want a pat on the back, they’re probably going to have to do the patting themselves. Successful data engineers don’t seem to mind, however. Possibly because they’re too busy munging data and bringing home the big bucks.

(Updated on June 27, 2024)

About the Editor

Tom Meltzer spent over 20 years writing and teaching for The Princeton Review, where he was lead author of the company's popular guide to colleges, before joining Noodle.
