Book Notes: Designing Data-Intensive Applications (Part 1)
Book Notes on Part 1 of Designing Data-Intensive Applications - the topic is Foundations of Data Systems
Part 1
Chapter 1. Reliable, Maintainable and Scalable Applications
- Applications are data-intensive if their limiting factor is not compute resources but rather the amount of data, complexity of data or the speed at which it is changing
- Data systems can take the form of databases, caches, queues, search indexes, etc.
Reliability
- System should continue to work correctly even in the face of adversity
- Can tolerate user mistakes
- Performance is good enough under expected load and data volume
- System prevents unauthorized use and abuse
- Fault is defined as one component deviating from its spec, failure is when the system as a whole stops working
- Many critical bugs are due to poor error handling
- Faults include
- hardware (disk failure, power outage)
- software (bugs, incorrect assumptions)
- human (configuration error, incorrect commands)
- Aim to minimize opportunities for human error
- elegant API
- smart defaults
- rolling deploys
- multiple hardware instances
Scalability
- A system's ability to cope with increased load
- Load is described depending on the system type - requests per minute, cache hits / minute, simultaneous users
- Performance also depends on the system type - examples are throughput, response time
- Increase a load parameter and system resources remain unchanged - how is your performance affected?
- Increase a load parameter - how much do you need to increase resources to keep performance unchanged?
- Often performance is not a single number, but rather a distribution. For example, response times should be reported as percentiles, i.e. x% of requests are served within y ms (see the percentile sketch after this list)
- Coping with load
- Scaling up - moving to a more powerful machine
- Scaling out - adding instances and distributing load
- Distributed systems can add more complexity but may be more flexible
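A minimal sketch of reporting latency by percentile rather than by average; the sample data and the nearest-rank method are purely illustrative:

```python
import math

# Sketch: summarize response times (ms) by percentile instead of by average.
def percentile(sorted_samples, p):
    """Nearest-rank percentile: the smallest sample >= p% of all samples."""
    rank = max(1, math.ceil(p / 100 * len(sorted_samples)))
    return sorted_samples[rank - 1]

response_times_ms = sorted([12, 15, 17, 18, 21, 24, 30, 45, 80, 350])  # illustrative samples

for p in (50, 95, 99):
    print(f"p{p}: {percentile(response_times_ms, p)} ms")
# A few slow outliers dominate p95/p99 even when the median looks healthy.
```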
Maintainability
- Operability, Simplicity, Evolvability
- The ease with which a system can be kept running and adapted over time
Operability
- Operations teams are vital to keep a system running smoothly
- Monitoring health of system
- Keeping software up to date
- Tracking cause of issues such as degraded performance or system failures
- Deployment and configuration management
- Preserving organizational knowledge
- Important to make these tasks easy
- Avoid dependency on individual machines
- Provide good documentation
- Visibility into runtime behavior
- Good default behavior
Simplicity
- Simple systems are easy to reason about
- Projects mired in complexity (big balls of mud) are hard to improve upon or fix
- Good abstractions hide complexity and keep growing systems manageable
- Key goal for systems we build
Evolvability
- System must be extensible for future improvements
- Good testing
- Simple systems are easier to iterate on
Chapter 2. Data Models and Query Languages
- Data models not only affect how the software is written, but how we think about the problems we are solving
- Each layer of data provides abstractions for interactions with layers above it
Relational Model vs Document Model
- Relational model (SQL in modern days) proposed by Edgar Codd in 1970
- Prior databases at the time forced application developers to think about the internal representation of the data in the database
- NoSQL ("Not Only SQL") spawned from a need for
- Greater scalability than relational databases, i.e. large datasets or high write throughput
- FOSS preference
- Frustration with schema restrictiveness and desire for more dynamic and expressive model
- Object-relational mismatch often requires ORMs to translate between the relational model and object-oriented design
- Network model (lost to relational model in 1970s)
- CODASYL (Conference on Data System Languages) is an example
- Generalized form of the hierarchical model (in which each record has exactly one parent)
- Must know the access path to a record
- Need to keep track of the path visited, since cycles are possible
- Like traversal of linked list, but connections made at write time
- Document model
- Data locality (everything is stored on the object)
- Denormalization (duplication of data rather than reference)
- Good for needing all information in document, since entire document is loaded
- Hard to model many-to-one or many-to-many relationships
- Most don’t support native joins
- Updating document may not be performant due to flexibility of document size, and needing to allocate variable amounts of space
- Schema on read
- Allows schema to be flexible, don’t need to update all documents to add field
- Relational model
- Sets of relations (tables) with collection of tuples (rows)
- Query optimizer is abstraction that doesn’t require application developer to think about data representation
- Schema on write
- “Statically typed”
Query Languages for Data
- Imperative vs. Declarative
- Imperative requires executions to be applied in a certain order - think of a for loop with variables and mutable data structures
- Declarative defines the pattern of the result, but doesn’t specify the implementation, allowing optimizations to be performed behind the scenes
- Examples of declarative languages: CSS, SQL, MongoDB's Aggregation Pipeline (MapReduce sits somewhere between declarative and imperative)
- Graph-like Data Models
- Vertices and edges
- Property Graphs
- Vertices & edges have unique ids and properties
- Data schema / types flexible
- Cypher is declarative query language for Neo4j graph database
- Triple-store
- Equivalent to property graph model using different terminology
- All information is stored as (subject, predicate, object) triples, such as (Jim, likes, bananas) - see the toy triple-store sketch at the end of this chapter
- Examples of implementation: Semantic web/RDF (Resource Description Framework)
- SPARQL (SPARQL Protocol and RDF Query Language) is a declarative query language for triple-stores using the RDF data model
- Datalog
- Similar to triple-store model, much older than Cypher or SPARQL
- Define small rules that define virtual predicates, which can be composed to build up complex queries
- How graph databases differ from the network model (CODASYL):
- Data types are flexible
- Queries can be declarative, while CODASYL is always imperative
- Vertices and edges are not ordered
- Vertices and edges can be referred to by their unique id
- Alternatives to the relational model
- Graph-like for targeting use cases where everything is potentially related to everything
- Document model for targeting use cases where references are self-contained, and cross-document relationships are rare
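A toy sketch of the triple-store model referenced above: facts as (subject, predicate, object) tuples plus a small pattern-matching helper. The data and the `match` helper are made up for illustration; real systems would use SPARQL, Cypher, or Datalog.

```python
# Toy triple store: facts stored as (subject, predicate, object) tuples.
triples = [
    ("Jim", "likes", "bananas"),
    ("Jim", "lives_in", "Idaho"),
    ("Idaho", "within", "USA"),
]

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

print(match(subject="Jim"))        # everything known about Jim
print(match(predicate="within"))   # containment edges, the building block for graph traversal
```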
Chapter 3. Storage and Retrieval
Data Structures that power Databases
- A log is an append-only data file used by many databases in various forms
- An index is a data structure that can take various forms to improve lookup performance derived from primary data, i.e. it does not affect the data content of a database
- Well-chosen indexes speed up read queries, but every index slows down writes
- Compaction is the general concept of removing old, duplicated values within segments of log data files
Indexes
Hash Indexes
- Example: keep an in-memory hash map where each key is mapped to a byte offset in the data file on disk (see the sketch after this list)
- Well-suited for situations where the value for each key is updated frequently
- Well-suited for situations where the number of keys is low
- Bitcask, the default storage engine in Riak, uses this model as long as all the keys fit in the available RAM
- Not good for range queries since keys are unsorted
- The hash map must fit in memory; otherwise key lookups incur random disk reads and performance suffers
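A minimal sketch of the hash-index idea, in the spirit of Bitcask: an append-only file plus an in-memory map from key to byte offset. The class and file format here are illustrative, not Bitcask's actual format.

```python
class HashIndexedLog:
    """Append-only log file with an in-memory hash map of key -> byte offset."""

    def __init__(self, path="data.log"):
        self.path = path
        self.index = {}               # key -> offset of the most recent value
        open(self.path, "ab").close() # create the log file if it doesn't exist

    def put(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset      # later writes simply shadow earlier ones

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)            # one disk seek per lookup
            _, value = f.readline().decode().rstrip("\n").split(",", 1)
            return value

db = HashIndexedLog()
db.put("user:1", "alice")
db.put("user:1", "alicia")   # update = append; compaction would later drop the old record
print(db.get("user:1"))      # -> "alicia"
```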
Advantages of logs
- Crash recovery - since all old data is kept until it’s no longer needed, no need to worry about crashing in the middle of a value being overwritten
- Concurrency - since values are not updated in place, concurrent reads and writes are simpler to handle
- Compacting old data avoids the problem of data files becoming fragmented over time
- Appending and segment merging are sequential write operations which tend to be much faster than random writes
LSM-Trees & SSTables
- Rather than keeping an unsorted log of values, keep sequences of key-values sorted by key
- Format is called Sorted String Table, or SSTable. Each key should only appear once per merged segment file
- Merging segments becomes simpler and more efficient, and can use a merge-sort-like algorithm
- Can keep a sparse in-memory index rather than all keys, since keys are sorted and neighboring index entries can be used to narrow down a key's on-disk location
- Blocks of records can be compressed before writing, since reads scan a range of key-value pairs anyway, saving disk space and I/O bandwidth
- LSM-trees are Log-Structured Merge-Trees
- Writes are added to in-memory balanced tree called a memtable
- When memtable gets too big, write to an SSTable file, which can be done efficiently because the keys are already sorted. New writes occur to a new memtable instance
- Fetching a key requires looking at the memtable first, and then iterating through all past SSTable files
- Compaction merges and resorts SSTable data files
- Bloom filters can approximate whether or not a key exists in a set, so that not all SSTables need to be iterated through until a key can finally be determined to not exist
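A compressed sketch of the LSM-tree write path described above: writes go to an in-memory memtable, which is flushed to a sorted segment when it grows too big; reads check the memtable, then segments from newest to oldest. The class name, tiny flush threshold, and in-memory "segments" are illustrative simplifications.

```python
MEMTABLE_LIMIT = 3  # illustrative flush threshold

class TinyLSM:
    """Sketch of an LSM-tree: in-memory memtable plus sorted segment files."""

    def __init__(self):
        self.memtable = {}   # real engines use a balanced tree or skip list, kept sorted
        self.segments = []   # SSTable-like segments, newest last; each is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        # Writing a segment is a sequential write, because the keys come out already sorted.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest segment first
            for k, v in segment:                   # real SSTables binary-search a sparse index instead
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all segments, keeping only the most recent value for each key."""
        merged = {}
        for segment in self.segments:              # oldest to newest, so newer values win
            merged.update(segment)
        self.segments = [sorted(merged.items())]

db = TinyLSM()
for i in range(10):
    db.put(f"key{i}", i)
db.put("key4", "updated")   # a newer write shadows the value stored in an older segment
print(db.get("key4"))        # -> "updated" (found in the memtable before any segment is read)
db.compact()
print(db.get("key7"))        # -> 7 (found in the single compacted segment)
```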
B-Trees
- Ubiquitous in databases, mostly in relational databases
- A tree of pages of a specific size, that reference other pages until a leaf page with values is reached
- Each layer of pages has a more specific range of references
- Inserting and deleting work like in a balanced tree - if a page becomes too full, it is split into two pages and the parent page is updated
- Updating a single value rewrites the entire page in most storage engines
- A B-tree with n keys always has a depth of O(log n)
- The number of references to child pages in one page is called the branching factor, which depends on the space required to store page references and range boundaries - often it is several hundred
- A four-level tree of 4KB pages with a branching factor of 500 can store up to 256TB
- Often uses a write-ahead log (WAL) to record each change before it is applied to the pages themselves - this ensures that in the event of a crash the B-tree can be restored to a consistent state
- Reads may see pages in an inconsistent state based on other writes to the same page - latches are used to prevent concurrency issues
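A minimal sketch of how a B-tree lookup descends through pages of sorted boundary keys; the page layout and tiny branching factor here are simplifications for illustration.

```python
import bisect

class Page:
    """A B-tree page: sorted boundary keys plus either child pages or leaf values."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted boundary keys
        self.children = children    # child pages (internal page) ...
        self.values = values        # ... or values aligned with keys (leaf page)

def lookup(page, key):
    if page.children is None:                      # leaf page: find the key directly
        i = bisect.bisect_left(page.keys, key)
        if i < len(page.keys) and page.keys[i] == key:
            return page.values[i]
        return None
    # Internal page: boundary keys decide which child covers the key's range.
    child = page.children[bisect.bisect_right(page.keys, key)]
    return lookup(child, key)

# Two-level tree: a root whose boundary key 200 splits two leaf pages.
leaf_a = Page(keys=[100, 150], values=["alice", "bob"])
leaf_b = Page(keys=[200, 300], values=["carol", "dave"])
root = Page(keys=[200], children=[leaf_a, leaf_b])
print(lookup(root, 300))   # -> "dave", one page visited per tree level
```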
B-Trees vs LSM-Trees
- B-Trees can become fragmented over time, as data is not written sequentially and splitting or updating pages leaves unused space scattered across the data files
- B-trees can have higher write amplification since data is always written at least twice, both to the WAL and the page
- LSM-Trees can be compressed better
- LSM-Trees need to deal with the large performance drains of compaction, especially if the rate of incoming writes is faster than the rate of compaction
- B-Trees make it easier to offer strong transactional semantics, since each key exists in exactly one place and locks can be attached directly to ranges of keys
Other Indexing Structures
- Clustered indexes store rows within the index itself, trading extra storage and write overhead for faster reads
- Multi-column indexes
- A concatenated index combines several fields into one key by appending columns in a fixed order, so queries on a prefix of the columns are fast but the remaining columns cannot be searched on their own
- Multi-dimensional queries (e.g. geospatial) cannot be served efficiently by standard B-trees or LSM-trees, and can use other data structures such as R-trees
- Full-text search / fuzzy index
- Allows searching for similar keys such as misspelled words
- Lucene uses trie-like structures over its term dictionary and supports Levenshtein automata for fuzzy matching
- In-memory databases are growing in popularity
- Cost / GB is going down for RAM storage, allowing cheaper in-memory databases
- Solutions to durability such as periodically writing to disk, special hardware (battery-powered RAM) or replicating state in other machines
- The performance advantage comes less from avoiding disk access than from not having to encode in-memory data structures into a form that can be written to and read from disk efficiently
Data Warehouses
- OLTP - online transaction processing
- Small number of records / query, end-user, latest state of data
- OLAP - online analytics processing
- Aggregate over large number of records, history of events, used internally for business intelligence
- Column-oriented storage
- More efficient when queries only need a few columns - rather than loading entire wide rows, just load the columns needed
- Each column is its own "file" of length n, where n is the number of rows
- Can be compressed very efficiently using bitmap encoding (see the sketch after this list)
- Writing is more complicated and often uses LSM-tree-style approaches
- Materialized Views
- Denormalized tables of precomputed results written to disk - quicker for queries since values do not have to be recomputed
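A small sketch of the bitmap encoding mentioned above: each distinct value in a column gets a bitmap with one bit per row, and predicates become bitwise operations. The column data is made up for illustration.

```python
# One column of a fact table, stored on its own; n = number of rows.
product_column = ["apple", "banana", "apple", "cherry", "banana", "apple"]

# Bitmap encoding: one bitmap per distinct value, one bit per row.
bitmaps = {
    value: [1 if v == value else 0 for v in product_column]
    for value in sorted(set(product_column))
}
print(bitmaps["apple"])   # [1, 0, 1, 0, 0, 1]

# "WHERE product IN ('apple', 'cherry')" becomes a bitwise OR of two bitmaps.
matches = [a | c for a, c in zip(bitmaps["apple"], bitmaps["cherry"])]
row_ids = [i for i, bit in enumerate(matches) if bit]
print(row_ids)            # rows 0, 2, 3, 5 satisfy the predicate
```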
Chapter 4. Encoding and Evolution
- Evolvability is a major focus when designing systems - change is constant
- We often cannot make major changes in an atomic transaction, so we need to be backward / forward compatible
- Backward compatibility: Newer code can read data written by older code
- Forward compatibility: Older code can read data written by newer code
Formats for Encoding Data
- Programs usually work with data in two ways
- In memory, optimized for efficient access using pointers as objects, structs, lists, arrays, trees, etc.
- Self-contained sequence of bytes to be transmitted over a network or written to a file
- Encoding / Marshalling / Serializing and Decoding / Unmarshalling / Deserialization translate between the two forms
- Many languages have built-in language-specific encoding / decoding, but it is usually only interoperable within that language and raises security concerns, since decoding may need to instantiate arbitrary classes and can lead to arbitrary code execution
JSON, XML, Binary
- XML - criticized as being too verbose & complicated
- CSV - no schema, quite vague and escaping is hard (what to do with commas, newlines, etc?)
- JSON - simple in comparison with XML, built-in browser support due to being a subset of JavaScript
- Issues with JSON / XML / CSV
- Ambiguity with number encoding - XML / CSV can't distinguish between digit strings and numbers, and JSON does not differentiate between floats and integers, so inaccuracies occur with large numbers or high-precision values
- Do not support binary strings, which often need to be encoded as Base64
- Optional schema support, which can cause subtle problems
- However, XML & JSON are good for many purposes as they are quick to set-up, flexible, and common
Binary Encoding
- Binary encodings exist for JSON (MessagePack, BSON, BISON, etc.) and XML (WBXML, etc.)
- Usually not as efficient as purpose-designed binary encodings, since they still need to include field names in the data
Thrift and Protocol Buffers
- Binary encoding libraries made open-source in 2007-8
- Require a schema for any encoded data
- Include code generation tool to produce classes that conform to schema
- Smaller data size since field names are not needed (defined in schema)
- Schema evolution
- Allows for adding fields, which will be ignored by older readers (new fields cannot be required, however, as new readers would fail on data from old writers that lacks the field)
- Allows for changing field names
- Removing fields requires the field to be optional and the field tag to be reserved (never used again)
- In some cases can change data type to be bigger, or to be a list rather than single element
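A toy sketch of why tag-based encodings like Thrift and Protocol Buffers evolve gracefully: records are (tag, value) pairs, and readers simply skip tags they don't recognize. The dict-based "encoding" here is a simplification, not the real wire format.

```python
# Old schema: tag 1 = user_name, tag 2 = favorite_number.
# New schema adds tag 3 = interests (optional, so old data still decodes).
OLD_SCHEMA = {1: "user_name", 2: "favorite_number"}
NEW_SCHEMA = {1: "user_name", 2: "favorite_number", 3: "interests"}

def decode(tagged_pairs, schema):
    """Decode (tag, value) pairs; unknown tags are skipped, preserving compatibility."""
    record = {}
    for tag, value in tagged_pairs:
        if tag in schema:
            record[schema[tag]] = value
        # else: field written by newer code; old readers ignore it (forward compatible)
    return record

new_data = [(1, "Martin"), (2, 1337), (3, ["hacking", "trains"])]
print(decode(new_data, OLD_SCHEMA))   # old reader ignores the unknown tag 3
old_data = [(1, "Martin"), (2, 1337)]
print(decode(old_data, NEW_SCHEMA))   # new reader treats the missing tag 3 as absent/optional
```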
Avro
- Started in 2009 to specifically fit with Hadoop
- Can be more compact than Thrift / Protobufs
- No type annotations in encoded format
- Writer and reader can use different schema versions; they only need to be compatible
- Field order is important in schema definition
- The reader must be able to determine the writer's schema; how it is communicated depends on the context:
- Writer schema stored in Avro object container files (large file with lots of records)
- Each record has a specific version number (records in a DB)
- Negotiate schema versions on connection (Sending records over a network connection)
- Schema evolution
- Can only add or remove fields that have a default value
- Allows aliases, but changing names is a little trickier - backward compatible but not forward compatible
- Advantage is dynamically generated schemas
- Fields are not tagged by numbers, but by names
- Can generate Avro records and object container file when reading a CSV or Relational DB
- Thrift or Protobufs would require field tags to be assigned by hand and the mapping from columns to field tags to be maintained manually
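A toy sketch of Avro-style schema resolution: the reader matches fields by name against the writer's schema and fills in defaults for fields the writer didn't have. The schema representation is simplified; real Avro resolution rules are richer.

```python
# Writer's schema: the order of fields determines the order of encoded values.
writer_schema = ["user_name", "favorite_number"]
# Reader's schema: adds "favorite_color" with a default; fields are matched by name.
reader_schema = {"user_name": None, "favorite_number": None, "favorite_color": "red"}

def resolve(encoded_values, writer_fields, reader_fields):
    """Pair encoded values with writer field names, then map them into the reader's view."""
    written = dict(zip(writer_fields, encoded_values))
    record = {}
    for name, default in reader_fields.items():
        if name in written:
            record[name] = written[name]   # field known to both schemas
        else:
            record[name] = default         # writer didn't have it: use the reader's default
        # fields only the writer had are simply ignored by the reader
    return record

print(resolve(["Martin", 1337], writer_schema, reader_schema))
# -> {'user_name': 'Martin', 'favorite_number': 1337, 'favorite_color': 'red'}
```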
The Merits of Schemas
- Schemas support detailed validation
- More compact than JSON, XML, and other schemaless encoding formats
- Schema is a valuable form of documentation
- Keeping a database of schemas allows forward and backward compatibility to be checked
- Code generation is useful and provides compile-time type checking and run-time schema enforcement
Modes of Dataflow
- Whenever you send data to another process with which you don't share memory, the data needs to be encoded
- Via databases, service calls, and asynchronous message passing
Databases
- Storing data for a process in the future to access
- Data lasts longer than code
- Schema evolution allows the database to appear as it was encoded with the single schema, even though the underlying storage may contain records encoded with historical versions
REST and RPC
- To communicate over the network, you typically have clients and servers
- HTTP is often the transportation protocol for information over the public internet
- If a service is communicated with by HTTP, it is referred to as a web service
- Many APIs are built upon HTTP transport
- Services are APIs exposed by a server that allows for query of data or actions to be performed
- Organization-internal services may communicate through Remote Procedure Calls, or RPC
- Many services may exist that serve small-scoped purposes - this is microservice architecture or service oriented architecture
- REST and SOAP are two approaches to web services
- REST - Representational State Transfer
- Uses URLs to identify resources and HTTP headers for authentication, cache control, content type, etc.
- A simpler approach than SOAP
- SOAP - Simple Object Access Protocol
- Avoids most HTTP features
- XML-based
- API is described using WSDL (Web Services Description Language) that enables code generation
- Interoperability between different vendors often causes problems
- RPCs aim to mock local function calls, but are inherently more complex
- Unpredictable, because networks are unpredictable
- Need to deal with timeouts if a call takes too long; retrying may result in duplicate requests (see the sketch after this list)
- Network responses can get lost, and include much more latency than local function calls
- Local function calls can efficiently pass references to local memory around - RPC calls must have easily-serializable data
- New generation of RPC frameworks are explicit that they occur over a network, and aim to simplify service discovery and abstract over network issues
- Can achieve better performance than REST over HTTP, but are best suited to requests between services within the same organization
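One way to cope with the timeout/duplication problem mentioned above is to retry the call but attach a client-generated idempotency key so the server can deduplicate repeated requests. The sketch below is illustrative and not tied to any particular RPC framework; the "lost response" is simulated with a counter.

```python
import uuid

processed = {}       # server-side: idempotency key -> result of the first execution
send_attempts = 0    # simulated network: the first response gets "lost"

def server_handle(request_id, payload):
    """Execute at most once per request_id; retries return the cached result."""
    if request_id not in processed:
        processed[request_id] = f"charged {payload['amount']}"   # pretend side effect
    return processed[request_id]

def unreliable_send(request_id, payload):
    """Stand-in for the network: the server acts, but the first response is lost."""
    global send_attempts
    result = server_handle(request_id, payload)
    send_attempts += 1
    if send_attempts == 1:
        raise TimeoutError("response lost in transit")
    return result

def call_with_retries(payload, attempts=3):
    """Client: reuse one idempotency key across retries so duplicates are harmless."""
    request_id = str(uuid.uuid4())
    for _ in range(attempts):
        try:
            return unreliable_send(request_id, payload)
        except TimeoutError:
            continue   # the server may or may not have acted; retrying is safe
    raise TimeoutError("all retries failed")

print(call_with_retries({"amount": 42}))   # "charged 42" - retried, but charged only once
print(len(processed))                      # 1: the side effect happened exactly once
```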
Message-Passing
- Queue-like; the sender usually doesn't wait for a response, unlike RPC calls
- Advantages over RPC
- Act as a buffer if recipient is unavailable or overloaded
- Can automatically redeliver messages if the receiving process has crashed
- Avoids the sender needing to know the specific location of the recipient (useful in virtual deployments)
- One Message -> Many Receivers
- Decouples sender from receiver
- Message brokers include RabbitMQ, ActiveMQ, Kafka
- Producers publish messages to a "topic" or "channel" that consumers subscribe to (see the queue sketch at the end of these notes)
- Actor model is a programming model for concurrency in a single process
- Rather than deal with threads directly, each actor represents one client or entity and communicates with other actors by sending and receiving asynchronous messages
- Message delivery is not guaranteed
- Can abstract over whether the receiver is on the same node or another node
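A minimal in-process sketch of asynchronous message passing: the producer fires messages at a buffered "topic" and never waits for a response. This uses Python's standard-library queue as a stand-in for a real broker like RabbitMQ or Kafka.

```python
import queue
import threading

topic = queue.Queue()            # stands in for a broker "topic"; buffers if the consumer is slow

def consumer():
    while True:
        message = topic.get()    # blocks until a message arrives
        if message is None:      # sentinel: shut down
            break
        print(f"processed: {message}")
        topic.task_done()

worker = threading.Thread(target=consumer, daemon=True)
worker.start()

# Producer: fire-and-forget sends, no waiting on a response (unlike an RPC call).
for event in ("user_signed_up", "email_verified", "order_placed"):
    topic.put(event)

topic.put(None)                  # tell the consumer to stop
worker.join()
```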