Morten Siebuhr · NoSQL

Open Source Days 2012

NoSQL

  • Computer Scientist
  • Web developer
  • Interest in distributed systems

!

?

The Plan

  • Databases
  • ACID vs. BASE
  • NoSQL anatomy
  • Wet Toes
  • Fin!

Database-related image here

/ˈdatəˌbās/

A structured set of data held in a computer, esp. one that is accessible in various ways.

/ˈɛs kjuː ˈɛl/ · /ˌsiːkwəl/

  • »Industry Standard«
  • ACID (Atomicity, Consistency, Isolation and Durability)
  • Been around since ≈1970

Why Not SQL?

  • Language within a language
  • Constantly on ACID
  • Difficult to model some data-structures
  • Scaling

/dɪˈstrɪbjʊtɪd ˈsɪstəms/

  • Lower per-CPU cost
  • Higher total bandwidth
  • Cheaper to achieve redundancy in software

/noʊ ˈɛs kjuː ˈɛl/ · /noʊ ˌsiːkwəl/

  • Forfeit Consistency and Isolation…
  • …for availability,
  • graceful degradation
  • and performance

BASE

  • Basically Available,
  • Soft-state
  • and Eventual consistency

Eric A. Brewer:
The CAP theorem

Consistency,
Availability and
Partition Tolerance

Consistency

All clients see the same data at the same time

Availability

Every request receives a response about whether the operation succeeded or failed

Partition Tolerance

Works despite network partitions
·
Continues to operate despite arbitrary message loss

Pick Any Two

(Then work very very hard to fake the last!)

Consistency + Availability

Consistency + Availability

  • All clients see the same data and all clients receive responses.
  • Traits: 2-phase commit, cache validation protocols
  • But what to do with network partitions?
  • Typical fix: add hot/cold-failover, replication
  • (Also: screwing the System Administrators)

Examples

  • Single-site/Cluster databases,
  • LDAP
  • RDBMSs

Consistency + Partition Tolerance

Consistency + Partition Tolerance

  • All clients see the same data, despite partitioned network.
  • Traits: Drop minority partitions when in trouble
  • What to do about »unknown« data?
  • Deals with it by dropping availability.
  • Fix it by replicating data aggressively
  • (Downtime ≈ screwing users)

Examples

  • Distributed databases
  • Locking and majority protocols
  • Google BigTable / HBase
  • Redis

Partition Tolerance + Availability

Partition Tolerance + Availability

  • All clients can read and write despite network partitions.
  • Traits: optimistic updates, conflict resolution, and leases/expiration
  • Fix it by replication and verification
  • »Eventual Consistency«
  • Screw the Developers

Examples

  • Caches
  • DNS
  • Riak
  • Cassandra
  • CouchDB

NoSQL anatomy

NoSQL categories

  • Key/Value
  • Document
  • Column / Tabular
  • (Graph)

SQL … again

SELECT
[ALL | DISTINCT | DISTINCTROW ]
  [HIGH_PRIORITY] [STRAIGHT_JOIN] [SQL_SMALL_RESULT] [SQL_BIG_RESULT]
  [SQL_BUFFER_RESULT] [SQL_CACHE | SQL_NO_CACHE] [SQL_CALC_FOUND_ROWS]
select_expr [, select_expr …]
[FROM table_references
[WHERE where_condition]
[GROUP BY {col_name | expr | position}
  [ASC | DESC], … [WITH ROLLUP]]
[HAVING where_condition]
[ORDER BY {col_name | expr | position}
  [ASC | DESC], …]
[LIMIT {[offset,] row_count | row_count OFFSET offset}]
[PROCEDURE procedure_name(argument_list)]
[INTO OUTFILE 'file_name' export_options
  | INTO DUMPFILE 'file_name'
  | INTO var_name [, var_name]]
    [FOR UPDATE | LOCK IN SHARE MODE]]

Document databases

Document

  • put(document, [key]) → key
  • get(key) → document
  • delete(key, …)
  • Support variations of:
    create_index(document_part) · query_index(…)
  • Sometimes:
    query(some_query_language)
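
A minimal in-memory sketch of that API (a toy dict-backed store; the names follow the bullets above, not any particular product):

import uuid

class DocumentStore:
    def __init__(self):
        self.docs = {}
        self.indices = {}                    # field → {value → set of keys}

    def put(self, document, key=None):
        key = key or uuid.uuid4().hex        # generate a key if none given
        self.docs[key] = document
        for field, index in self.indices.items():
            if field in document:
                index.setdefault(document[field], set()).add(key)
        return key

    def get(self, key):
        return self.docs[key]

    def delete(self, key):
        del self.docs[key]                   # (index cleanup omitted for brevity)

    def create_index(self, field):
        self.indices[field] = {}             # only covers documents put() later

    def query_index(self, field, value):
        return [self.docs[k] for k in self.indices[field].get(value, ())]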

CouchDB

  • Available + Partition-tolerant → »Eventual Consistency«
  • Replicates entire databases … everywhere
  • Map/Reduce for indices
  • Vector clocks

WTF:
Replicates entire DBs everywhere‽

  • Network partitioning as a feature
  • (Think mobile clients…)

Map/Reduce

  • map(key1, value1) → [(key2, value2), …]
  • … Shuffle output …
  • reduce(key2, [value2]) → [value3, …]

Map/Reduce wordcount

def map(document_name, text):
    # Emit (word, 1) for every word in the document
    for word in text.split():
        yield (word, 1)

… Shuffle output …

def reduce(word, counts):
    # Sum up the per-document counts for one word
    yield (word, sum(counts))
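
The shuffle step just groups the mapped pairs by key before they reach reduce; a tiny driver for the two functions above (illustrative only, not CouchDB's actual machinery):

from collections import defaultdict
from itertools import chain

def shuffle(mapped_pairs):
    # Group all (key, value) pairs by key
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups.items()

documents = {"doc1": "to be or not to be", "doc2": "be happy"}
mapped = chain.from_iterable(map(name, text) for name, text in documents.items())
for word, counts in shuffle(mapped):
    print(next(reduce(word, counts)))        # e.g. ('be', 3)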

»View« with Map/Reduce

PUT /dbname/_design/viewname/
Then query with:
GET /dbname/_design/viewname/_view/all \
	?and_some_modifiers
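
Roughly what that looks like over plain HTTP, sketched with the requests library (the database and view names are placeholders, and CouchDB view functions are normally written in JavaScript):

import requests

design_doc = {
    "views": {
        "all": {"map": "function(doc) { emit(doc._id, doc); }"}
    }
}

# Create the design document (updating an existing one also needs its _rev)
requests.put("http://localhost:5984/dbname/_design/viewname", json=design_doc)

# Query the view it defines
rows = requests.get(
    "http://localhost:5984/dbname/_design/viewname/_view/all",
    params={"limit": 10},
).json()["rows"]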

Vector Clocks

(sequence of vector-clock diagrams)
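
The diagrams aside, the idea in code (a minimal sketch, not any particular database's implementation): each replica keeps a counter per node, and one version descends from another only if its clock dominates — otherwise the two writes were concurrent and must be resolved.

def increment(clock, node):
    # A vector clock is just a mapping of node → counter
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    # True if clock a has seen everything clock b has
    return all(a.get(node, 0) >= count for node, count in b.items())

def merge(a, b):
    # Pairwise maximum, used after a conflict has been resolved
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}

c1 = increment({}, "node-a")                 # {'node-a': 1}
c2 = increment(c1, "node-b")                 # written on node-b
c3 = increment(c1, "node-c")                 # written concurrently on node-c
print(descends(c2, c1), descends(c2, c3))    # True False → c2 and c3 conflict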

Key-Value databases

API

  • Key → Value
  • put(key, value)
  • get(key)
  • delete(key)
  • Sometimes:
  • get_all()
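
In Python terms the core of that API is little more than a dict (toy sketch, ignoring persistence and distribution):

store = {}

def put(key, value):
    store[key] = value

def get(key):
    return store[key]

def delete(key):
    del store[key]

def get_all():
    return dict(store)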

Riak

  • Eventual consistency
  • Links
  • Secondary Indices
  • Map/Reduce for bulk operations

One giant hash-table

  • Split it into parts, typically ~64
  • Then save each part on multiple machines
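
A minimal sketch of that split (illustrative, not Riak's actual ring code): hash the key onto one of the ~64 partitions, then map the partition onto a few distinct servers for replication.

import hashlib

PARTITIONS = 64
REPLICAS = 3
servers = ["s0", "s1", "s2", "s3", "s4"]

def partition(key):
    # Hash the key onto one of the 64 slices of the ring
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % PARTITIONS

def preference_list(key):
    # Walk onwards from the key's partition and pick REPLICAS distinct servers
    start = partition(key)
    return [servers[(start + i) % len(servers)] for i in range(REPLICAS)]

print(preference_list("sbhr"))               # three servers holding this key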

Lazily propagate global state

Then take care of data

Three servers

Then four

Then five

Crash!

How many copies?

  • Three parameters: w, dw and r.
  • One given: servers.
  • w + r > servers

How many copies?

  • Given 5 servers…
  • w=3 and r=3 → ~ consistent system
  • w=1 and r=5 → fast write, slow reads
  • w=5 and r=1 → slow write, fast reads
  • w=1 and r=1 → cache
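
The arithmetic behind those bullets: w + r > servers means every read quorum overlaps every write quorum, so a read is guaranteed to see the latest write.

def quorums_overlap(servers, w, r):
    return w + r > servers

for w, r in [(3, 3), (1, 5), (5, 1), (1, 1)]:
    print(w, r, quorums_overlap(5, w, r))    # True, True, True, False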

Column/Tabular databases

API

  • (Table, Row, Prefix:Column) → Value
  • create_table(table, column_prefixes)
  • put(table, row, prefix:column, value)
  • get(table, [row], [prefix:[column]])
  • scan(table, row_start, row_end, [prefix:[column]])
  • delete(table, …)
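
A toy in-memory version of that API (illustrative only; the point of real systems like HBase is that rows are kept sorted, which makes scan a cheap contiguous read):

from collections import defaultdict

tables = {}

def create_table(table, column_prefixes):
    # column_prefixes are only declarative in this sketch
    tables[table] = defaultdict(dict)        # row → {"prefix:column" → value}

def put(table, row, column, value):          # column is "prefix:qualifier"
    tables[table][row][column] = value

def get(table, row):
    return dict(tables[table][row])

def scan(table, row_start, row_end):
    # Rows sorted by key → a scan is a slice of the key range
    rows = sorted(tables[table].items())
    return {row: dict(cols) for row, cols in rows if row_start <= row < row_end}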

Big, sparse matrices

row  cf1:col1  cf1:col2  cf1:col3  cf2:col1  cf2:col2  cf2:col3
a    value1    value2              value3              value4
b              value5    value6              value7
c    value8              value9    value10   value11
d    value12   value13             value14   value15

The Canonical Example

URL              data:content             meta:headers     meta:c-t          meta:speed
com.web.service  {"foo": "bar", …}        Accept: …        application/json  0.1s
org.o-s-d.www    <html xmlns="http://…    Charset: utf-8…  text/html char…   1.18s
dk.sbhr          <doctype html>\n<html…   Age: 0…                            0.44s
dk.sbhr/about    <doctype html>\n<html…   Host: sbhr.dk…                     0.71s

HBase

  • Column-oriented
  • Built on Hadoop
  • Copy of Google BigTable
  • Key-scans
  • Map/Reduce

Organization & keyscans

  • Column-family → distinct set of files
  • Keys held in-memory, if possible or explicitly configured
  • Many machines hold parts of the index → scan in parallel
  • Optional bloom filters
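
The bloom filter mentioned above, as a small sketch (illustrative only): a per-file bit array that can answer »definitely not in this file« without touching the file, at the price of occasional false positives.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha1(b"%d:%s" % (i, key)).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add(b"dk.sbhr")
print(bf.might_contain(b"dk.sbhr"), bf.might_contain(b"com.example"))  # True, (almost certainly) False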

Heads up!

  • Only get »raw« data in and out.
  • Have to massage data in the application.
  • Inflexible query-system.
  • Fix by inserting the same data under multiple keys/documents and pretending those are indices (see the sketch below).
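
The »pretend index« trick from the last bullet, sketched with the toy put/get from the key/value section (the key layout is made up):

# Primary copy, keyed by user id…
put("user:42", {"id": 42, "email": "morten@example.org"})
# …and a second copy keyed by the field we want to look things up by.
put("email:morten@example.org", "user:42")

# An "index lookup" is then just two plain gets:
user = get(get("email:morten@example.org"))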

Dimension Reduction / Geo

  • Latitude - longitude
  • 55.681722, 12.530602
  • 5152..6583…
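
The last line is the two coordinates with their digits interleaved into a single key, so nearby points share a key prefix and a plain key-scan becomes a rough geo query (a sketch of the idea, much like a geohash):

def interleave(lat, lng):
    # Zip the two coordinate strings together digit by digit
    return "".join(a + b for a, b in zip(str(lat), str(lng)))

print(interleave("55.681722", "12.530602"))  # 5152..658310762022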

Free writes!

  • Problem:
    Multiple front-end machines will write to the same key
  • lock → read → update → write → unlock + read
  • Exploit that read < write < lock
  • Each machine writes separate keys → merge after reading
  • read → update → write + multiple read
  • Then reader can write back one key, if needed
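
A sketch of that pattern for a hit counter, reusing the toy put/get from the key/value section (the key names are made up): every front-end writes only its own key, and the reader merges at read time.

machine = "frontend-7"                       # unique per front-end machine

def get_or_default(key, default):
    try:
        return get(key)
    except KeyError:
        return default

def record_hit(page):
    # Only this machine ever writes this key → no locking needed
    key = "hits:%s:%s" % (page, machine)
    put(key, get_or_default(key, 0) + 1)

def total_hits(page, machines):
    # The reader merges, and could write back one rolled-up key if needed
    return sum(get_or_default("hits:%s:%s" % (page, m), 0) for m in machines)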

How To Choose?

  • Look at your data
  • Look at your use-patterns
  • What programming language are you using?
  • Examine their »DNA«
  • They're evolving
  • Try them out! … with real data!
  • »The Cloud«

Thanks!

Morten Siebuhr

Extras

ZooKeeper

  • Specialized distributed configuration & coordination system
  • Filesystem-like interface
  • Persistent client connections
  • Read-oriented

ZooKeeper Features

  • Get/Put
  • ACLs
  • Ephemeral files
  • Watches
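
A hedged sketch with the kazoo client (kazoo itself is not mentioned in the slides; the paths and data are made up):

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Ephemeral node: disappears automatically when this client's session ends
zk.create("/workers/worker-1", b"idle", ephemeral=True, makepath=True)

# Watch: the callback fires once when the node changes
def on_change(event):
    print("config changed:", event)

zk.ensure_path("/config")
data, stat = zk.get("/config", watch=on_change)

zk.stop()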