Building Self-Service Data Platforms: Insights from ZipRecruiter and Yotpo
Table of Contents:
- Introduction
- About the Guests
- The Challenges in Data Engineering
- Data Infrastructure and Stack
- Providing Core Data Infrastructure
- Interface with Engineering Teams
- Goals for the Next Year
- Creating a Structured Data Process
- Rebuilding Data Layers
- The Importance of Data Design Patterns
- Tips and Tricks in Data Engineering
- Conclusion
Introduction
Welcome to the Data Engineering Show, brought to you by Firebolt, the cloud data warehouse that provides insanely fast analytics over terabytes of data with fewer resources. In this episode, two experienced data engineering leaders join us: Doron, Director of Infrastructure at Yotpo, and Liran, Director of Engineering at ZipRecruiter. They share insights about their roles, challenges, and goals, and about the importance of data infrastructure in their respective organizations.
About the Guests
Doron has spent several years at Yotpo, where she started as team leader of the data engineering team and later became group leader for data infrastructure, playing a significant role in building Yotpo's data platform. Liran, on the other hand, has experience working at multiple companies, including Yotpo and ZipRecruiter. At ZipRecruiter, he focuses on building data infrastructure and enabling data-driven decision-making.
The Challenges in Data Engineering
Both guests discuss the challenges they face in their day-to-day work as data engineers. Working with data, they note, is complex and constantly evolving. They stress data manageability and the importance of optimizing the data platform for scalability, coherence, and resilience, along with the challenges of democratizing data sets and tools and the need for data quality and observability.
Data Infrastructure and Stack
The guests provide insights into the data infrastructure and stack used in their organizations. Doron explains that at Yotpo, data is ingested from various sources into a data lake, and the platform is built primarily around it. They run their workloads on AWS, store data in various formats, and use Spark for data transformation, with workflow orchestration handled by Airflow. Liran, in turn, describes how ZipRecruiter is building its data layer from scratch, migrating away from previous architectures and technologies, with a focus on data quality, automation tooling, and a semantic layer for aggregating and joining data.
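As a rough illustration of the orchestration pattern described here, the sketch below defines a minimal Airflow DAG that submits a Spark transformation job. The DAG id, schedule, and application path are hypothetical examples, not details of Yotpo's actual setup.

```python
# Minimal Airflow DAG sketch: orchestrate a daily Spark transformation.
# All identifiers (dag_id, application path, connection id) are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="daily_events_transform",      # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    transform = SparkSubmitOperator(
        task_id="transform_events",
        application="s3://my-bucket/jobs/transform_events.py",  # hypothetical path
        conn_id="spark_default",
    )
```

In practice a DAG like this would chain several tasks (ingest, transform, publish), with Airflow handling retries, scheduling, and dependency ordering between them.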
Providing Core Data Infrastructure
Both guests highlight the importance of providing core data infrastructure for their respective organizations. They discuss the need for self-service data experiences and enabling developers to be more self-sufficient. They emphasize the importance of creating tools, interfaces, and best practices to make it easier for developers to produce and consume high-quality data. They also mention the need for clear data contracts, data owners, and data catalogs to facilitate collaboration and enable data-driven decision-making.
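The "data contract" idea mentioned above can be made concrete with a small sketch: a schema the producer commits to, checked before records are published downstream. The field names and the event shape below are hypothetical examples, not an actual contract from either company.

```python
# A toy "data contract": the producer declares a schema, and records are
# validated against it before being published downstream.
# Field names and types here are hypothetical examples.

# The contract: required fields and their expected types.
USER_EVENT_CONTRACT = {
    "user_id": str,
    "event_type": str,
    "timestamp_ms": int,
}

def validate(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations (empty list means the record passes)."""
    errors = []
    for field, expected_type in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

good = {"user_id": "u1", "event_type": "click", "timestamp_ms": 1700000000000}
bad = {"user_id": "u1", "timestamp_ms": "not-a-number"}
print(validate(good, USER_EVENT_CONTRACT))  # []
print(validate(bad, USER_EVENT_CONTRACT))
```

The point of the pattern is that producers and consumers agree on this schema explicitly, so a breaking change is caught at the boundary rather than discovered downstream.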
Interface with Engineering Teams
The guests discuss how their data teams interface with engineering teams within their organizations. They emphasize the importance of understanding the culture and needs of the organization and working towards aligning data infrastructure with these requirements. They talk about the challenge of bridging the gap between different personas, such as developers, analysts, and data scientists, and providing them with a unified platform to work with. They also highlight the need for continuous education, documentation, and communication to ensure a smooth collaboration between teams.
Goals for the Next Year
The guests share their goals for the next year. Doron mentions a focus on analytics, data manageability, and optimizing the data platform's cost and performance. Liran discusses the challenge of rebuilding ZipRecruiter's data layer from scratch, with an emphasis on data quality, automation, and a new semantic layer. Both stress the importance of re-architecting their platforms for scalability, efficiency, and a better user experience.
Creating a Structured Data Process
Doron and Liran discuss the importance of a structured data process within their organizations. They talk about building tools and frameworks that enforce best practices in data production and consumption, and they point to schema documentation, data pipelines, and data contracts as keys to high-quality data. They also stress the need for data owners and data catalogs to support data management and decision-making.
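One lightweight way to picture the "data owner plus data catalog" idea is a registry that maps each dataset to an accountable owner and a documented schema. The dataset names, owner, and fields below are invented for illustration.

```python
# Toy data catalog: each dataset records an owner, a description, and a
# documented schema, so consumers know what they are reading and who to ask.
# All dataset names, owners, and fields are hypothetical.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    owner: str                                  # team accountable for the dataset
    description: str
    schema: dict = field(default_factory=dict)  # column name -> type name

CATALOG: dict[str, CatalogEntry] = {}

def register(entry: CatalogEntry) -> None:
    if entry.name in CATALOG:
        raise ValueError(f"dataset already registered: {entry.name}")
    CATALOG[entry.name] = entry

register(CatalogEntry(
    name="events.page_views",
    owner="web-analytics-team",
    description="One row per page view, deduplicated daily.",
    schema={"user_id": "string", "url": "string", "viewed_at": "timestamp"},
))

print(CATALOG["events.page_views"].owner)  # web-analytics-team
```

Real catalogs (and tools built on them) add search, lineage, and access control on top, but the core contract is the same: every dataset has an owner and a documented shape.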
Rebuilding Data Layers
The guests discuss the ongoing process of rebuilding data layers in their organizations. They highlight the challenges of migrating from legacy architectures and technologies to more modern and efficient systems. They emphasize the need for data transformation, data consolidation, and the decoupling of production and analytics systems. They talk about the importance of understanding the boundaries between various data layers and creating a more efficient and robust data architecture.
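The layering and decoupling described above can be sketched with a generic raw/cleaned/aggregated split, where each layer consumes only the layer below it. This is an illustration of the general pattern, not the specific architecture discussed in the episode.

```python
# Sketch of layered data transformation: a raw layer is cleaned, then
# aggregated, with each layer consuming only the one below it.
# The records and fields are hypothetical examples.
from collections import Counter

raw_events = [  # raw layer: as ingested, possibly malformed
    {"user": "u1", "action": "click"},
    {"user": None, "action": "click"},  # bad record: no user
    {"user": "u2", "action": "view"},
    {"user": "u1", "action": "view"},
]

# Cleaned layer: drop records that fail basic validity checks.
cleaned = [e for e in raw_events if e["user"] is not None]

# Aggregated layer: an analytics-friendly rollup, decoupled from the raw shape.
actions_per_user = Counter(e["user"] for e in cleaned)

print(dict(actions_per_user))  # {'u1': 2, 'u2': 1}
```

The decoupling matters because analytics queries hit the aggregated layer, so changes in raw ingestion formats, or load on production systems, never leak directly into consumer-facing data.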
The Importance of Data Design Patterns
The guests emphasize the importance of data design patterns in data engineering. They discuss the lack of standardized patterns in the industry and the need for more documentation and education in this area. They talk about the challenges of working with different personas, technologies, and use cases and how design patterns can help streamline data workflows and ensure best practices. They highlight the importance of continuous learning and adaptation in the evolving field of data engineering.
Tips and Tricks in Data Engineering
To close out the episode, the guests share some tips and tricks for data engineering. They emphasize understanding the culture and needs of the organization, along with continuous communication and collaboration between teams. They highlight the role of analytics, self-measurement, and product measurement in driving data-driven decision-making, and they stress observability, automation, and the adoption of data design patterns to improve efficiency and productivity.
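As a small illustration of the observability tip, a pipeline step can emit simple counters (rows in, rows out, rows dropped) so that regressions surface immediately. The step name and the 50% alert threshold below are hypothetical choices.

```python
# Minimal pipeline observability: wrap a transformation step so it reports
# rows in/out and flags an unusually high drop rate.
# The step name and the 50% threshold are hypothetical examples.
def observed_step(name: str, rows: list, transform) -> list:
    out = transform(rows)
    dropped = len(rows) - len(out)
    drop_rate = dropped / len(rows) if rows else 0.0
    print(f"[{name}] in={len(rows)} out={len(out)} dropped={dropped}")
    if drop_rate > 0.5:  # crude alert: more than half the rows vanished
        print(f"[{name}] WARNING: drop rate {drop_rate:.0%}")
    return out

rows = [{"id": 1, "ok": True}, {"id": 2, "ok": False}, {"id": 3, "ok": True}]
kept = observed_step("filter_ok", rows, lambda rs: [r for r in rs if r["ok"]])
print(len(kept))  # 2
```

In a real system these counters would go to a metrics backend rather than stdout, but the principle is the same: every step reports what it did, so silent data loss becomes visible.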
Conclusion
Data engineering is a complex and evolving field that plays a crucial role in enabling data-driven decision-making and building robust data infrastructure. The guests provided valuable insights into their roles, challenges, and goals in data engineering. They highlighted the importance of data quality, manageability, and self-service experiences. They emphasized the need for collaboration between different teams and the importance of data design patterns in streamlining data workflows. Overall, data engineering requires continuous learning, adaptation, and a deep understanding of organizational needs to drive success and innovation.
Highlights:
- Data engineers face challenges related to data manageability, infrastructure optimization, and data quality.
- Providing core data infrastructure is crucial for enabling self-service data experiences.
- Collaboration between engineering teams and data teams is essential for successful data engineering.
- Goals for the next year include improving data quality, automating processes, and building new semantic layers.
- Rebuilding data layers involves migrating to modern architectures and decoupling production and analytics systems.
- Data design patterns help streamline data workflows and ensure best practices.
- Tips for data engineering include understanding organizational culture, emphasizing analytics, and adopting automation and observability.
FAQ
Q: What are the main challenges faced by data engineers?
A: Data engineers face challenges related to data manageability, infrastructure optimization, and data quality. They need to ensure efficient data processing, provide self-service data experiences, and meet the requirements of diverse use cases and personas within the organization.
Q: How do data engineers provide core data infrastructure?
A: Data engineers provide core data infrastructure by building tools, interfaces, and best practices to enable developers to produce and consume high-quality data. They aim to create self-service data experiences and ensure that data infrastructure aligns with the needs of the organization.
Q: How do data teams interface with engineering teams?
A: Data teams interface with engineering teams by understanding the culture and needs of the organization. They work towards alignment and collaboration, providing unified platforms for different personas, such as developers, analysts, and data scientists.
Q: What are the goals for data engineering in the next year?
A: The goals for data engineering in the next year include improving data quality, optimizing data infrastructure, automating processes, and creating new semantic layers. The focus is on scalability, efficiency, and better user experiences.
Q: Why are data design patterns important in data engineering?
A: Data design patterns help streamline data workflows, ensure best practices, and enable efficient data processing. They provide guidance and standardization for data engineering tasks, allowing for more scalable and robust data solutions.
Q: What are some tips for data engineering success?
A: Some tips for data engineering success include understanding the organizational culture, emphasizing analytics and measurement, adopting automation and observability practices, and continuously learning and adapting to the evolving field of data engineering.