Morten Siebuhr · NoSQL

Open Source Days 2012

NoSQL

  • Computer Scientist
  • Web developer
  • Interest in distributed systems

!

?

The Plan

  • Databases
  • ACID vs. BASE
  • NoSQL anatomy
  • Wet Toes
  • Fin!

Database-related image here

/ˈdatəˌbās/

A structured set of data held in a computer, esp. one that is accessible in various ways.

/ˈɛs kjuː ˈɛl/ · /ˌsiːkwəl/

  • »Industry Standard«
  • ACID (Atomicity, Consistency, Isolation and Durability)
  • Been around since ≈1970

Why Not SQL?

  • Language within a language
  • Constantly on ACID
  • Difficult to model some data-structures
  • Scaling

/dɪˈstrɪbjʊtɪd ˈsɪstəms/

  • Lower per-CPU cost
  • Higher total bandwidth
  • Cheaper to achieve redundancy in software

/noʊ ˈɛs kjuː ˈɛl/ · /noʊ ˌsiːkwəl/

  • Forfeit Consistency and Isolation…
  • …for availability,
  • graceful degradation
  • and performance

BASE

  • Basically Available,
  • Soft-state
  • and Eventual consistency

Eric A. Brewer:
The CAP theorem

Consistency,
Availability and
Partition Tolerance

Consistency

All clients see the same data at the same time

Availability

Every request receives a response about whether the operation succeeded or failed

Partition Tolerance

Works despite network partitions
·
Continues to operate despite arbitrary message loss

Pick Any Two

(Then work very very hard to fake the last!)

Consistency + Availability

Consistency + Availability

  • All clients see the same data and all clients receive responses.
  • Traits: 2-phase commit, cache validation protocols
  • But what to do with network partitions?
  • Typical fix: add hot/cold-failover, replication
  • (Also: screwing the System Administrators)

Examples

  • Single-site/Cluster databases,
  • LDAP
  • RDBMSs

Consistency + Partition Tolerance

Consistency + Partition Tolerance

  • All clients see the same data, despite partitioned network.
  • Traits: Drop minority partitions when in trouble
  • What to do about »unknown« data?
  • Deals with it by dropping availability.
  • Fix it by replicating data aggressively
  • (Downtime ≈ screwing users)

Examples

  • Distributed databases
  • Locking and majority protocols
  • Google BigTable / HBase
  • Redis

Partition Tolerance + Availability

Partition Tolerance + Availability

  • All clients can read and write despite network partitions.
  • Traits: optimistic updates, conflict resolution, and leases/expiration
  • Fix it by replication and verification
  • »Eventual Consistency«
  • Screw the Developers

Examples

  • Caches
  • DNS
  • Riak
  • Cassandra
  • CouchDB

NoSQL anatomy

NoSQL categories

  • Key/Value
  • Document
  • Column / Tabular
  • (Graph)

SQL … again

SELECT
[ALL | DISTINCT | DISTINCTROW ]
  [HIGH_PRIORITY] [STRAIGHT_JOIN] [SQL_SMALL_RESULT] [SQL_BIG_RESULT]
  [SQL_BUFFER_RESULT] [SQL_CACHE | SQL_NO_CACHE] [SQL_CALC_FOUND_ROWS]
select_expr [, select_expr …]
[FROM table_references
[WHERE where_condition]
[GROUP BY {col_name | expr | position}
  [ASC | DESC], … [WITH ROLLUP]]
[HAVING where_condition]
[ORDER BY {col_name | expr | position}
  [ASC | DESC], …]
[LIMIT {[offset,] row_count | row_count OFFSET offset}]
[PROCEDURE procedure_name(argument_list)]
[INTO OUTFILE 'file_name' export_options
  | INTO DUMPFILE 'file_name'
  | INTO var_name [, var_name]]
    [FOR UPDATE | LOCK IN SHARE MODE]]

Document databases

Document

  • put(document, [key]) → key
  • get(key) → document
  • delete(key, …)
  • Support variations of:
    create_index(document_part) · query_index(…)
  • Sometimes:
    query(some_query_language)
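
A minimal in-memory sketch of that API (a toy dict-backed store; the names follow the bullets above, not any particular product):

import uuid

class DocumentStore:
    def __init__(self):
        self.docs = {}
        self.indices = {}                    # field → {value → set of keys}

    def put(self, document, key=None):
        key = key or uuid.uuid4().hex        # generate a key if none given
        self.docs[key] = document
        for field, index in self.indices.items():
            if field in document:
                index.setdefault(document[field], set()).add(key)
        return key

    def get(self, key):
        return self.docs[key]

    def delete(self, key):
        del self.docs[key]                   # (index cleanup omitted for brevity)

    def create_index(self, field):
        self.indices[field] = {}             # only covers documents put() later

    def query_index(self, field, value):
        return [self.docs[k] for k in self.indices[field].get(value, ())]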

CouchDB

  • Available + Partition-tolerant → »Eventual Consistency«
  • Replicates entire databases … everywhere
  • Map/Reduce for indices
  • Vector clocks

WTF:
Replicates entire DBs everywhere‽

  • Network partitioning as a feature
  • (Think mobile clients…)

Map/Reduce

  • map(key1, value1) → [(key2, value2), …]
  • … Shuffle output …
  • reduce(key2, [value2]) → [value3, …]

Map/Reduce wordcount

def map(document_name, text):
    # Emit (word, 1) for every word in the document
    for word in text.split():
        yield (word, 1)

… Shuffle output …

def reduce(word, counts):
    # Sum up the per-document counts for one word
    yield (word, sum(counts))
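
The shuffle step just groups the mapped pairs by key before they reach reduce; a tiny driver for the two functions above (illustrative only, not CouchDB's actual machinery):

from collections import defaultdict
from itertools import chain

def shuffle(mapped_pairs):
    # Group all (key, value) pairs by key
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups.items()

documents = {"doc1": "to be or not to be", "doc2": "be happy"}
mapped = chain.from_iterable(map(name, text) for name, text in documents.items())
for word, counts in shuffle(mapped):
    print(next(reduce(word, counts)))        # e.g. ('be', 3)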

»View« with Map/Reduce

PUT /dbname/_design/viewname/
Then query with:
GET /dbname/_design/viewname/_view/all \
	?and_some_modifiers
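
Roughly what that looks like over plain HTTP, sketched with the requests library (the database and view names are placeholders, and CouchDB view functions are normally written in JavaScript):

import requests

design_doc = {
    "views": {
        "all": {"map": "function(doc) { emit(doc._id, doc); }"}
    }
}

# Create the design document (updating an existing one also needs its _rev)
requests.put("http://localhost:5984/dbname/_design/viewname", json=design_doc)

# Query the view it defines
rows = requests.get(
    "http://localhost:5984/dbname/_design/viewname/_view/all",
    params={"limit": 10},
).json()["rows"]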

Vector Clocks

(sequence of vector-clock diagrams)
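
The diagrams aside, the idea in code (a minimal sketch, not any particular database's implementation): each replica keeps a counter per node, and one version descends from another only if its clock dominates — otherwise the two writes were concurrent and must be resolved.

def increment(clock, node):
    # A vector clock is just a mapping of node → counter
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    # True if clock a has seen everything clock b has
    return all(a.get(node, 0) >= count for node, count in b.items())

def merge(a, b):
    # Pairwise maximum, used after a conflict has been resolved
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}

c1 = increment({}, "node-a")                 # {'node-a': 1}
c2 = increment(c1, "node-b")                 # written on node-b
c3 = increment(c1, "node-c")                 # written concurrently on node-c
print(descends(c2, c1), descends(c2, c3))    # True False → c2 and c3 conflict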

Key-Value databases

API

  • Key → Value
  • put(key, value)
  • get(key)
  • delete(key)
  • Sometimes:
  • get_all()
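
In Python terms the core of that API is little more than a dict (toy sketch, ignoring persistence and distribution):

store = {}

def put(key, value):
    store[key] = value

def get(key):
    return store[key]

def delete(key):
    del store[key]

def get_all():
    return dict(store)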

Riak

  • Eventual consistency
  • Links
  • Secondary Indices
  • Map/Reduce for bulk operations

One giant hash-table

  • Split it into parts, typically ~64
  • Then save each part on multiple machines
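
A minimal sketch of that split (illustrative, not Riak's actual ring code): hash the key onto one of the ~64 partitions, then map the partition onto a few distinct servers for replication.

import hashlib

PARTITIONS = 64
REPLICAS = 3
servers = ["s0", "s1", "s2", "s3", "s4"]

def partition(key):
    # Hash the key onto one of the 64 slices of the ring
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % PARTITIONS

def preference_list(key):
    # Walk onwards from the key's partition and pick REPLICAS distinct servers
    start = partition(key)
    return [servers[(start + i) % len(servers)] for i in range(REPLICAS)]

print(preference_list("sbhr"))               # three servers holding this key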

Lazily propagate global state

Then take care of data

Three servers

Then four

Then five

Crash!

How many copies?

  • Three parameters: w, dw and r.
  • One given: servers.
  • w + r > servers

How many copies?

  • Given 5 servers…
  • w=3 and r=3 → ~ consistent system
  • w=1 and r=5 → fast write, slow reads
  • w=5 and r=1 → slow write, fast reads
  • w=1 and r=1 → cache
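
The arithmetic behind those bullets: w + r > servers means every read quorum overlaps every write quorum, so a read is guaranteed to see the latest write.

def quorums_overlap(servers, w, r):
    return w + r > servers

for w, r in [(3, 3), (1, 5), (5, 1), (1, 1)]:
    print(w, r, quorums_overlap(5, w, r))    # True, True, True, False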

Column/Tabular databases

API

  • (Table, Row, Prefix:Column) → Value
  • create_table(table, column_prefixes)
  • put(table, row, prefix:column, value)
  • get(table, [row], [prefix:[column]])
  • scan(table, row_start, row_end, [prefix:[column]])
  • delete(table, …)
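
A toy in-memory version of that API (illustrative only; the point of real systems like HBase is that rows are kept sorted, which makes scan a cheap contiguous read):

from collections import defaultdict

tables = {}

def create_table(table, column_prefixes):
    # column_prefixes are only declarative in this sketch
    tables[table] = defaultdict(dict)        # row → {"prefix:column" → value}

def put(table, row, column, value):          # column is "prefix:qualifier"
    tables[table][row][column] = value

def get(table, row):
    return dict(tables[table][row])

def scan(table, row_start, row_end):
    # Rows sorted by key → a scan is a slice of the key range
    rows = sorted(tables[table].items())
    return {row: dict(cols) for row, cols in rows if row_start <= row < row_end}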

Big, sparse matrices

row  cf1:col1  cf1:col2  cf1:col3  cf2:col1  cf2:col2  cf2:col3
a    value1    value2              value3              value4
b              value5    value6              value7
c    value8              value9    value10   value11
d    value12   value13             value14   value15

The Canonical Example

URL              data:content             meta:headers     meta:c-t          meta:speed
com.web.service  {"foo": "bar", …}        Accept: …        application/json  0.1s
org.o-s-d.www    <html xmlns="http://…    Charset: utf-8…  text/html char…   1.18s
dk.sbhr          <doctype html>\n<html…   Age: 0…                            0.44s
dk.sbhr/about    <doctype html>\n<html…   Host: sbhr.dk…                     0.71s

HBase

  • Column-oriented
  • Built on Hadoop
  • Copy of Google BigTable
  • Key-scans
  • Map/Reduce

Organization & keyscans

  • Column-family → distinct set of files
  • Keys held in-memory, if possible or explicitly configured
  • Many machines hold parts of the index → scan in parallel
  • Optional bloom filters
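
The bloom filter mentioned above, as a small sketch (illustrative only): a per-file bit array that can answer »definitely not in this file« without touching the file, at the price of occasional false positives.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha1(b"%d:%s" % (i, key)).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add(b"dk.sbhr")
print(bf.might_contain(b"dk.sbhr"), bf.might_contain(b"com.example"))  # True, (almost certainly) False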

Heads up!

  • Only get »raw« data in and out.
  • Have to massage data in the application.
  • Inflexible query-system.
  • Fix by inserting the same data under multiple keys/documents and pretending those are indices (see the sketch below).
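
The »pretend index« trick from the last bullet, sketched with the toy put/get from the key/value section (the key layout is made up):

# Primary copy, keyed by user id…
put("user:42", {"id": 42, "email": "morten@example.org"})
# …and a second copy keyed by the field we want to look things up by.
put("email:morten@example.org", "user:42")

# An "index lookup" is then just two plain gets:
user = get(get("email:morten@example.org"))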

Dimension Reduction / Geo

  • Latitude - longitude
  • 55.681722, 12.530602
  • 5152..6583…
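
The last line is the two coordinates with their digits interleaved into a single key, so nearby points share a key prefix and a plain key-scan becomes a rough geo query (a sketch of the idea, much like a geohash):

def interleave(lat, lng):
    # Zip the two coordinate strings together digit by digit
    return "".join(a + b for a, b in zip(str(lat), str(lng)))

print(interleave("55.681722", "12.530602"))  # 5152..658310762022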

Free writes!

  • Problem:
    Multiple front-end machines will write to the same key
  • lock → read → update → write → unlock + read
  • Exploit that read < write < lock
  • Each machine writes separate keys → merge after reading
  • read → update → write + multiple read
  • Then reader can write back one key, if needed
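
A sketch of that pattern for a hit counter, reusing the toy put/get from the key/value section (the key names are made up): every front-end writes only its own key, and the reader merges at read time.

machine = "frontend-7"                       # unique per front-end machine

def get_or_default(key, default):
    try:
        return get(key)
    except KeyError:
        return default

def record_hit(page):
    # Only this machine ever writes this key → no locking needed
    key = "hits:%s:%s" % (page, machine)
    put(key, get_or_default(key, 0) + 1)

def total_hits(page, machines):
    # The reader merges, and could write back one rolled-up key if needed
    return sum(get_or_default("hits:%s:%s" % (page, m), 0) for m in machines)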

How To Choose?

  • Look at your data
  • Look at your use-patterns
  • What programming language are you using?
  • Examine their »DNA«
  • They're evolving
  • Try them out! … with real data!
  • »The Cloud«

Thanks!

Morten Siebuhr

Extras

ZooKeeper

  • Specialized distributed configuration & coordination system
  • Filesystem-like interface
  • Persistent client connections
  • Read-oriented

ZooKeeper Features

  • Get/Put
  • ACLs
  • Ephemeral files
  • Watches
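
A hedged sketch with the kazoo client (kazoo itself is not mentioned in the slides; the paths and data are made up):

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Ephemeral node: disappears automatically when this client's session ends
zk.create("/workers/worker-1", b"idle", ephemeral=True, makepath=True)

# Watch: the callback fires once when the node changes
def on_change(event):
    print("config changed:", event)

zk.ensure_path("/config")
data, stat = zk.get("/config", watch=on_change)

zk.stop()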