Practical Enterprise Data Lake Insights - Venkata Giri, Saurabh Gupta

Practical Enterprise Data Lake Insights (eBook)

Handle Data-Driven Challenges in an Enterprise Big Data Lake

Venkata Giri, Saurabh Gupta (Authors)

eBook Download: PDF

2018 | 1st ed.
XVIII, 327 pages
Apress (Publisher)
978-1-4842-3522-5 (ISBN)

Use this practical guide to successfully handle the challenges encountered when designing an enterprise data lake and learn industry best practices to resolve issues.

When designing an enterprise data lake, you often hit a roadblock when you must leave the comfort of the relational world and learn the nuances of handling non-relational data. Starting from sourcing data into the Hadoop ecosystem, you will go through stages that raise tough questions about data processing, data querying, and security. Concepts such as change data capture and data streaming are covered. The book takes an end-to-end solution approach in a data lake environment that includes data security, high availability, data processing, data streaming, and more.

Each chapter includes application of a concept, code snippets, and use case demonstrations to provide you with a practical approach. You will learn the concept, scope, application, and starting point.

What You'll Learn

Get to know data lake architecture and design principles
Implement data capture and streaming strategies
Implement data processing strategies in Hadoop
Understand the data lake security framework and availability model

Who This Book Is For

Big data architects and solution architects

Saurabh K. Gupta is a technology leader, published author, and database enthusiast with more than 11 years of industry experience in data architecture, engineering, development, and administration. As Manager, Data & Analytics at GE Transportation, he focuses on data lake analytics programs that build digital solutions for business stakeholders. In the past, he has worked extensively with Oracle database design and development, PaaS and IaaS cloud service models, consolidation, and in-memory technologies. He has authored two books on advanced PL/SQL for Oracle versions 11g and 12c. He is a frequent speaker at conferences organized by the user community and technical institutions. He tweets at @saurabhkg and blogs at sbhoracle.wordpress.com.

Venkata Giri currently works with GE Digital and has been involved with building resilient distributed services at a massive scale. He has worked on the big data tech stack, relational databases, high availability, and performance tuning. With over 20 years of experience in data technologies, he has in-depth knowledge of big data ecosystems, complex data ingestion pipelines, data engineering, data processing, and operations. Prior to working at GE, he worked with the data teams at LinkedIn and Yahoo.

Chapter 1: Data Lake Concepts Overview
Chapter Goal: This chapter highlights the key concepts of the data lake and its tech stack. It briefs readers on the background of data management and the need for a data lake, and focuses on the latest trends.
No. of pages: 20
Sub-Topics:
1. Familiarization with the enterprise data lake ecosystem
2. Understand key components of a data lake
3. Data understanding: structured vs. unstructured

Chapter 2: Data Replication Strategies
Chapter Goal: The chapter will focus on how to replicate data into Hadoop from source systems. Depending on the nature of the source systems, strategies may change. The chapter will start with a talk on trivial approaches to ETL data into Hadoop and then dive into the latest trends in change data capture.
No. of pages: 25
Sub-Topics:
1. Conventional ETL strategies
2. Change data capture for relational data
3. Change data capture for time-series data

Chapter 3: Bring Data into Hadoop
Chapter Goal: The chapter will focus on how to get data into a Hadoop cluster. It will discuss several approaches and utilities that can be used to bring data into Hadoop for processing.
No. of pages: 30
Sub-Topics:
1. RDBMS to Hadoop
2. MPP database systems to Hadoop
3. Unstructured data into Hadoop

Chapter 4: Data Streaming Strategies
Chapter Goal: The chapter will take a deep dive into the data streaming principles of Kafka. It will explain how Kafka works and how it resolves the challenge of getting data into a data lake.
No. of pages: 50
Sub-Topics:
1. How to stream the data? Kafka
2. How to persist the changes
3. How to batch the data
4. How to massage the data
5. Tools and technologies: HVR, Oracle GoldenGate for Big Data

Chapter 5: Data Processing in Hadoop
Chapter Goal: This chapter will provide insight into various data querying platforms. It all started with MapReduce, but Hive is quickly acquiring de facto status in the industry. The chapter will take a deep dive into Hive and its SQL-like semantics, and showcase its most recent capabilities. A dedicated section on Spark will give a detailed walk-through of the Spark approach to processing data in Hadoop.
No. of pages: 30
Sub-Topics:
1. MapReduce
2. Query engines: intro, Big Data SQL, Big SQL
3. Hive (focus)
4. Spark (focus)
5. Presto

Chapter 6: Data Security and Compliance
Chapter Goal: This chapter will discuss the security aspects of a data lake in Hadoop. The fact that organizations have deliberately compromised on security in the past does carry weight. The chapter discusses how to build a safety net around a data lake and mitigate the risks of unauthorized access or injection attacks.
No. of pages: 20
Sub-Topics:
1. Encryption in transit and at rest
2. Data masking
3. Kerberos security and LDAP authentication
4. Ranger

Chapter 7: Ensure Availability of a Data Lake
Chapter Goal: This chapter sheds light on yet another key aspect of the data landscape, namely availability. It will discuss topics such as disaster recovery strategies, how to set up replication between two data centers, and how to handle the consistency and integrity of data.
No. of pages: 20
Sub-Topics:
1. Disaster recovery strategies
2. Set up data center replication
3. Active-passive mode
4. Active-active mode

Publication date (per publisher)	27.6.2018
Additional info	XVIII, 327 p. 90 illus.
Place of publication	Berkeley
Language	English
Subject areas	Mathematics / Computer Science ► Computer Science ► Databases
	Mathematics / Computer Science ► Computer Science ► Networks
	Mathematics / Computer Science ► Mathematics ► Financial / Business Mathematics
	Business
Keywords	BigData • datalake • Data Lake • Data Management • Enterprise • Replication • Streaming
ISBN-10	1-4842-3522-3 / 1484235223
ISBN-13	978-1-4842-3522-5 / 9781484235225

PDF (watermark)
Size: 5.3 MB

DRM: Digital watermark
This eBook contains a digital watermark and is therefore personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost all devices, but is only of limited suitability for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Buying eBooks from abroad
For tax law reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.

Print edition

Book | Softcover

48.14 €