An upward-trend ID generation service

07/04/2020 posted in  System Design

1. First question - why do we need a sequential ID instead of a random UUID?

  • UUIDs take space! We have to store one as a hex string: 32 chars plus 4 '-' separators.

    • A UUID is 128 bits in binary - 39 chars as a decimal string, or 32 chars as a hex string.
    • Consider the data type needed to store it:
      • MySQL BIGINT: 64 bits, not enough for 128 bits.
      • MySQL VARCHAR(36): 36 chars to store the hex string with its 4 '-'s.
      • DynamoDB Number: supports Java BigInteger, so it can hold any size. However, it is actually stored as a decimal string, which takes 39 chars.
      • DynamoDB String: 36 chars to store the hex string with its 4 '-'s.
  • Some business requirements need the IDs to be in an upward trend.

    • StatementId, OrderId, TweetId.
      • Twitter Example
        • Although there may be a creationTime column, Twitter said they prefer to sort tweets by ID.
        • The timeline of a user is paginated. If the tweetId has order and is stored as sort key (e.g. MySQL indexed column, DDB Sort Key, HBase Column Key, etc.), the query to get next page will be very efficient.
  • UUIDs might collide once a lot of IDs already exist. But this is not a practical concern: the probability of finding a duplicate within 103 trillion version-4 UUIDs is about one in a billion.
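
The one-in-a-billion figure can be sanity-checked with the standard birthday-bound approximation p ≈ n² / (2 · 2^122), since a version-4 UUID has 122 random bits. A quick back-of-envelope check (the class and method names here are just for illustration):

```java
// Birthday-bound estimate of the collision probability among n random
// version-4 UUIDs (122 random bits): p ≈ n^2 / (2 * 2^122).
public class UuidCollision {
    // Approximate probability of at least one duplicate among n v4 UUIDs.
    static double collisionProbability(double n) {
        return n * n / (2.0 * Math.pow(2, 122));
    }

    public static void main(String[] args) {
        double n = 103e12; // 103 trillion UUIDs
        System.out.printf("p ~ %.2e%n", collisionProbability(n)); // ≈ 1e-9
    }
}
```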

1.1 Second question - why am I writing this article?

In my work projects, we are lazy and just use a random UUID most of the time. But I have also seen many systems, like e-commerce and social-media systems, use a number as the ID.
I'm trying to figure out the best way to generate universally unique IDs for distributed systems.

When I had almost completed my doc, I came across the original blog post by Twitter on why they developed Snowflake.
The reasons they listed are very similar to my thoughts. And then I realized I might not need to write this article...

2. Requirement:

  • ID generation service
  • ID is in an upward trend
  • ID is globally unique
  • TPS is large enough for application requirement
    • Normal system write TPS is small
      • FB/Twitter should be < 100TPS
    • But in some events write TPS can be large
      • Double 11 peak - ~100k TPS of order placement.
        • Assumption: 1 billion RMB / 21 s at 500 RMB/order ≈ 95k orders/s.
        • Orders might be written to the order table asynchronously.
        • I believe they prepared orderIds in advance.

3. Solution 1: DB atomic update

3.1 IDGenerator service logic

  • The service is a cluster.
  • Each host uses one background thread to fetch an ID range periodically.
  • It may prefetch many ranges and store them in a queue.
  • Clients request one ID at a time.
  • The head of the queue is consumed until it's empty.
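
The per-host logic above can be sketched as follows. This is a minimal illustration, not a production library: the central store is abstracted as a `LongSupplier` that returns the start of each newly reserved range (e.g. the DB atomic update from section 3.3), and all names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.LongSupplier;

// Sketch of the per-host generator: a background thread fetches
// [start, start + step - 1] ranges from a central store and feeds individual
// IDs into a bounded queue; clients take one ID at a time from the head.
public class RangeIdGenerator implements AutoCloseable {
    private final BlockingQueue<Long> ids;
    private final Thread prefetcher;
    private volatile boolean running = true;

    public RangeIdGenerator(LongSupplier fetchRangeStart, long step, int queueCapacity) {
        this.ids = new ArrayBlockingQueue<>(queueCapacity);
        this.prefetcher = new Thread(() -> {
            try {
                while (running) {
                    long start = fetchRangeStart.getAsLong(); // e.g. one DB round trip
                    for (long id = start; id < start + step; id++) {
                        ids.put(id); // blocks when the prefetch queue is full
                    }
                }
            } catch (InterruptedException ignored) { /* shutting down */ }
        });
        prefetcher.setDaemon(true);
        prefetcher.start();
    }

    // Clients request one ID at a time; blocks until a prefetched ID is ready.
    public long nextId() throws InterruptedException {
        return ids.take();
    }

    @Override public void close() {
        running = false;
        prefetcher.interrupt();
    }
}
```

The bounded queue doubles as backpressure: each fetch of `step` IDs costs only one round trip to the central store, which is why the DB write rate stays low.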

3.2 DB schema

id_name : the name of this ID sequence
max_id  : the current max allocated id in this sequence
step    : the number of ids to allocate in one fetch

3.3 How to query (MySQL example)

3.3.1 Atomic Read and Update / Transaction

  1. We query the current max_id and take [max_id+1, max_id+step] as the fetched range.
  2. To avoid other concurrent MySQL threads getting the same range, we need a write lock (FOR UPDATE, a pessimistic lock). Only after COMMIT can another transaction (also using FOR UPDATE) read max_id again.
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
SET autocommit = 0;
BEGIN;
    SELECT max_id 
    FROM id_generator 
    WHERE id_name='OrderId' FOR UPDATE;
    
    UPDATE id_generator 
    SET max_id = max_id+step
    WHERE id_name ='OrderId';
COMMIT;

3.3.2 Optimistic Lock / Lockless

Adding a version column is the common way to apply an optimistic lock. An optimistic lock only outperforms an actual lock when contention is low.

In this case, max_id is definitely updated concurrently by all hosts in the service cluster, so I don't think an optimistic lock is a good idea here.

But for reference purposes, let me still put the logic here.

  1. Read max_id and version. Use [max_id + 1, max_id + step] as the fetched range.
  2. Update max_id only if version is unchanged.
  3. If version changed, retry steps 1 and 2.
SELECT max_id, version, step
FROM id_generator 
WHERE id_name='OrderId';

UPDATE id_generator
SET max_id = max_id + step, version = version + 1
WHERE version = {version} AND id_name = 'OrderId';
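
The retry loop around these two statements can be sketched in memory. This is only a simulation of the pattern: the row is modeled as an atomically swappable (max_id, version) pair, where a real implementation would run the SELECT and the conditional UPDATE over JDBC and treat an affected-row count of 0 as "version changed".

```java
import java.util.concurrent.atomic.AtomicReference;

// In-memory simulation of the optimistic-lock retry loop (steps 1-3 above).
public class OptimisticRangeFetcher {
    // Models one row of id_generator: (max_id, version).
    record Row(long maxId, long version) { }

    private final AtomicReference<Row> row;
    private final long step;

    public OptimisticRangeFetcher(long initialMaxId, long step) {
        this.row = new AtomicReference<>(new Row(initialMaxId, 0));
        this.step = step;
    }

    // Returns the start of a freshly reserved range [start, start + step - 1].
    public long fetchRangeStart() {
        while (true) {
            Row current = row.get();                       // step 1: read max_id + version
            Row next = new Row(current.maxId() + step, current.version() + 1);
            if (row.compareAndSet(current, next)) {        // step 2: update iff version unchanged
                return current.maxId() + 1;
            }
            // step 3: someone else won the race -> retry
        }
    }
}
```

Under heavy contention every host spins in this loop, which is exactly why the article argues a pessimistic lock fits this workload better.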

3.4 Assessment

Because of the write lock in the transaction, the transactions are executed sequentially.
Throughput will be limited, and P99 latency will be high or even time out.

  • Scalability:

    • Service cluster is scalable.
    • DB writes are not scalable because one ID name uses only one row. (This usually causes a HotKey problem.)
    • However, we may not need to scale the DB, because the background thread can fetch a large range each time, at the cost of the sequence being only roughly (k-sorted) increasing.

    • Example:

      • Clients request 100k IDs per second.
      • The cluster has 200 hosts, i.e. 500 TPS/host.
      • We want the IDs to be k-sorted with k ≈ 1.
      • Every host fetches a range from the DB every 1 s.
        • The range step will be 500 (100k / 200 hosts × 1 s).
        • DB writes are 200 hosts / 1 s = 200 TPS.
  • Availability:

    • The master node must be replicated synchronously. Options include:
      • MySQL master-slave replication.
      • A DDB partition has three nodes: one leader and two followers. We need strongly consistent writes.
      • A leaderless architecture, such as Cassandra.
  • General drawback:
    Compared with the Snowflake algorithm, the disadvantage of this strategy is obvious:
    it depends on an extra system, bringing high cost, limited scalability, and low availability.
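
The sizing arithmetic from the example above can be written down as a tiny helper (the names are illustrative):

```java
// The sizing arithmetic from the example in 3.4:
// step = (target IDs/sec ÷ number of hosts) × fetch interval,
// and the DB sees only one write per host per interval.
public class RangeSizing {
    static long stepPerFetch(long targetIdsPerSec, int hosts, long fetchIntervalSec) {
        return targetIdsPerSec / hosts * fetchIntervalSec;
    }

    static long dbWriteTps(int hosts, long fetchIntervalSec) {
        return hosts / fetchIntervalSec;
    }

    public static void main(String[] args) {
        System.out.println(stepPerFetch(100_000, 200, 1)); // 500
        System.out.println(dbWriteTps(200, 1));            // 200
    }
}
```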

4. Solution 2: Redis INCRBY

4.1 The logic is the same.

This is the same idea as above:
the service fetches a range from Redis by increasing the max id.

INCRBY is atomic, and it returns the value after the increment.

redis> SET OrderId "1000000"
"OK"
redis> INCRBY OrderId 500
(integer) 1000500
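
From the service side, the range fetch on top of INCRBY can be sketched like this. The Redis call is abstracted behind a functional interface so the logic is testable without a server; with a real client such as Jedis, `incrBy(key, step)` would go in its place. The class and interface names are hypothetical.

```java
// Sketch of the range-fetch idea on top of Redis INCRBY.
public class RedisRangeFetcher {
    // incrBy must be atomic and return the value AFTER the increment,
    // exactly like Redis INCRBY.
    public interface IncrBy { long apply(String key, long delta); }

    private final IncrBy incrBy;
    private final String key;
    private final long step;

    public RedisRangeFetcher(IncrBy incrBy, String key, long step) {
        this.incrBy = incrBy;
        this.key = key;
        this.step = step;
    }

    // Reserves [end - step + 1, end], where end is the post-increment value.
    public long[] fetchRange() {
        long end = incrBy.apply(key, step);
        return new long[] { end - step + 1, end };
    }
}
```

With the counter starting at 1,000,000 as in the redis-cli session above, the first fetch of step 500 reserves [1,000,001, 1,000,500].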

4.2 Assessment

We can still say it has the same problems as above:
- extra dependency
- limited scalability
- low availability
Redis does offer higher throughput than MySQL.
But Redis replication seems to only support the asynchronous way.
(MySQL supports full-sync, semi-sync and async replication; we need full sync to keep the replicas consistent, so that a slave can be switched to master without losing data.)

5. Solution 3: Time-based on-host generation

I think Snowflake and TimeBasedUUID are the same idea.
Snowflake wins because it uses 64 bits while a UUID uses 128 bits.

  • TimeBasedUUID
  • Java UUID

5.1 128 bit UUID

| Time (100 ns, i.e. 1e-7 s precision) | 60 bit |
| Version (time-based, DCE Security, name-based, random) | 4 bit |
| MAC address | 48 bit |
| Sequence number | 14 bit |
| Variant (2, Leach-Salz) | 2 bit |

The generated UUIDs stay unique for thousands of years, as long as fewer than 10,000 IDs are generated per millisecond on the same device (identified by its MAC address): the 60-bit timestamp at 100 ns precision only overflows roughly 3,650 years after its Gregorian epoch (1582), and 100 ns ticks give exactly 10,000 distinct timestamps per millisecond.

5.2 64 bit SnowFlake Id

| Sign bit (always 0) | 1 bit |
| Timestamp in milliseconds | 41 bit |
| Machine id | 10 bit |
| Sequence no | 12 bit |

The machine id can be derived from the private IP, which is always unique within a LAN.
(In contrast, TimeBasedUUID uses the MAC address as the machine ID, which carries some privacy-leakage risk.)

The bits can be redefined. Under the default config, there can be at most 2^10 = 1024 machines, and at most 2^12 = 4096 IDs generated per millisecond per machine.

I saw that Baidu has a variation that changes the timestamp to seconds, so more bits are available for machine ids.
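
How the 64 bits fit together can be shown with plain bit shifts. This is a minimal sketch of the default layout, not Twitter's actual implementation:

```java
// Default Snowflake layout: 41-bit millisecond timestamp (relative to a
// custom epoch), 10-bit machine id, 12-bit per-millisecond sequence.
public class SnowflakeLayout {
    static final long MACHINE_BITS = 10;
    static final long SEQUENCE_BITS = 12;

    static long compose(long timestampMs, long machineId, long sequence) {
        return (timestampMs << (MACHINE_BITS + SEQUENCE_BITS)) // bits 22..62
             | (machineId << SEQUENCE_BITS)                    // bits 12..21
             | sequence;                                       // bits  0..11
    }

    public static void main(String[] args) {
        // IDs from the same millisecond and machine differ only in sequence,
        // so sorting by ID sorts by time first - the k-sorted property.
        long a = compose(1_000, 1, 0);
        long b = compose(1_000, 1, 1);
        long c = compose(1_001, 0, 0);
        System.out.println(a < b && b < c); // true
    }
}
```

Because the timestamp occupies the most significant bits, numeric order of the IDs follows time order, which is exactly the upward-trend property required in section 2.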

5.3 Problem

We can get the epoch timestamp using System.currentTimeMillis() in Java. But the absolute millisecond value is not fully reliable, because of how NTP sync works:
after an NTP sync, the clock may jump back to a prior time to correct for clock drift.
I found a good article on it - "How System Clocks Can Cause Mysterious Faults".
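
One common mitigation is to remember the last timestamp handed out and refuse to move backwards: if the wall clock has jumped back, spin until it catches up. A minimal sketch of that guard (real generators also manage the per-millisecond sequence, and typically fail fast instead of spinning when the jump is large):

```java
// Guards against the wall clock going backwards after an NTP correction.
public class ClockGuard {
    private long lastTimestamp = Long.MIN_VALUE;

    // Returns a non-decreasing millisecond timestamp, waiting out small
    // backward jumps of the system clock.
    public synchronized long nextTimestamp() {
        long now = System.currentTimeMillis();
        while (now < lastTimestamp) {   // clock went backwards -> busy-wait
            now = System.currentTimeMillis();
        }
        lastTimestamp = now;
        return now;
    }
}
```

A large backward jump would make this spin for a long time, so production generators usually reject requests beyond some threshold rather than wait.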

6. Conclusion

  1. I still think that in most cases, when the data size is not big (data generation is slow or retention time is limited), using a random UUID is the best way, because of its simplicity.

  2. Otherwise, using a customized Snowflake library seems much better than managing a database or Redis cluster.

Reference:

  1. Nine ways to generate unique ID
  2. Distributed universal ID generation
  3. Announcing Snowflake