Designing Data-Intensive Applications: The Big Ideas Behind Reliable,…

by Martin Kleppmann

Members: 622 | Reviews: 2 | Popularity: 77,543 | Average rating: 4.55 | Conversations: None



Showing 2 of 2
Great non-technical review of the fundamentals. This book effectively highlights some of the major design challenges in modern distributed systems, along with a catalog of modern solutions to these challenges.
  albertgoldfain | Oct 19, 2017 |
I consider this book a mini-encyclopedia of modern data engineering. Like a specialized encyclopedia, it covers a broad field in considerable detail. But it is not a practical guide or a cookbook for a particular Big Data, NoSQL or NewSQL product. What the author does is lay down the principles of current distributed big data systems, and he does a very fine job of it.

If you are after the obscure details of a particular product, or some tutorials and "how-to"s, go elsewhere. But if you want to understand the main principles, issues, and challenges of data-intensive and distributed systems, you've come to the right place.

Martin Kleppmann starts out by giving the reader a solid conceptual framework in the first chapter: what does reliability mean? How is it defined? What is the difference between a "fault" and a "failure"? How do you describe load on a data-intensive system? How do you talk about performance and scalability in a meaningful way? What does it mean to have a "maintainable" system?

The second chapter gives a brief overview of different data models and shows their suitability for different use cases, using modern challenges that companies such as Twitter have faced. This chapter is a solid foundation for understanding the differences between the relational, document, and graph data models, as well as the languages used for processing data stored in each of them.

The third chapter goes into a lot of detail regarding the building blocks of different types of database systems. It describes the data structures and algorithms behind the systems shown in the previous chapter: you get to know hash indexes, SSTables (Sorted String Tables), Log-Structured Merge trees (LSM-trees), B-trees and other data structures. After that, you are introduced to column-oriented databases and the principles and structures behind them.

Following these, the book describes methods of data encoding, starting from the venerable XML and JSON and going into the details of formats such as Avro, Thrift and Protocol Buffers, showing the trade-offs between these choices.

After the building blocks and foundations above comes Part II of the book, and this is where things start to get really interesting, because now the reader starts to learn about the challenging topic of distributed systems: how to use the basic building blocks in a setting where anything can go wrong in the most unexpected ways. Part II is the most complex part of the book: you learn how to replicate your data, what happens when replication lags behind, how you provide a consistent picture to the end user or the end programmer, what algorithms are used for leader election in consensus systems, and how leaderless replication works.

One of the primary purposes of using a distributed system is to gain an advantage over a single, central system, and that advantage is better service, meaning a more resilient service with an acceptable level of responsiveness. This means you need to distribute the load and your data, and there are a lot of schemes for partitioning your data. Chapter 6 of Part II provides a lot of detail on partitioning, keys, indexes, secondary indexes and how to handle queries when your data is partitioned using various methods.

No data systems book can be complete without touching the topic of transactions, and this book is not an exception to the rule. You learn about the fuzziness surrounding the definition of ACID, isolation levels, and serializability.

The remaining two chapters of Part II, Chapters 8 and 9, are probably the most interesting part of the book. You are now ready to learn the gory details of how to deal with network faults and other kinds of faults to keep your data system in a usable and consistent state: the problems with the CAP theorem, version vectors and why they are not vector clocks, Byzantine faults, how to have a sense of causality and ordering in a distributed system, why algorithms such as Paxos, Raft, and ZAB (used in ZooKeeper) exist, distributed transactions, and many more topics.

The rest of the book, that is, Part III, is dedicated to batch and stream processing. The author describes the famous MapReduce batch processing model in detail, and briefly touches upon modern frameworks for distributed data processing such as Apache Spark. The final chapter discusses event streams and messaging systems, and the challenges that arise when trying to process this "data in motion". You might not be in the business of building the next generation streaming system, but you'll definitely need to have a handle on these topics because you'll encounter the described issues in the practical stream processing systems that you deal with daily as a data engineer.

As I said in the opening of this review, consider this a mini-encyclopedia for the modern data engineer, and also don't be surprised if you see more than 100 references at the end of some chapters; if the author tried to include most of them in the text itself, the book would well go beyond 2000 pages!

At the time of my writing, the book is 90% complete; according to its official site, there is only one more chapter to be added (Chapter 12: Materialized Views and Caching). So it is safe to say that I recommend this book to anyone working with distributed big data systems, or dealing with NoSQL and NewSQL databases, document stores, column-oriented data stores, and streaming and messaging systems. As for me, it'll definitely be my go-to reference on these topics for the upcoming years.
  EmreSevinc | Nov 10, 2016 |
Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords? In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications.…


Average: (4.55)
4 3
4.5 3
5 4
