Stopping Time In PostgreSQL

Recently, I’ve been working with a database that makes liberal uses of triggers to audit database changes. Each table has a trigger that records the action taken (INSERT, UPDATE, DELETE), the change in value, and the time at which the action occurred. The history recorded effectively paints a picture of database history, thereby requiring much less effort on the part of the application designer to persist data in any cohesive way.

Scanning through some of this historical data with a coworker, we discovered to our surprise that the timestamps were not monotonically increasing. In fact, updates to the same records looked to be confusingly out of order. Abstracting away the actual data, we were looking at a situation like so:

testdb=# SELECT id, old_val, new_val, created_at FROM history;
id | old_val | new_val | created_at
---+---------+---------+----------------------------
1  | a       | b       | 2015-05-11 12:00:00.000000
2  | b       | c       | 2015-05-11 12:00:30.000000
3  | c       | d       | 2015-05-11 12:00:15.000000

Logically, the progression of values still makes sense: a -> b -> c -> d. Verifying the original table, the value is persisted as d, suggesting the most recent update (by id value) is in fact the most accurate representation of the current state of the database. How then, do we generate out of order timestamps?

The Investigation

Transactions in Postgres assure data integrity by locking modified rows until that transaction has been completed. Let’s say I have a users table, and a user named ‘a’. I also have a trigger, which populates my history table, which in turn has a default created_at value of NOW():

testdb=# SELECT * FROM users;
id | name 
---+------
1  | a
(1 row)
testdb=# \d history
             Column             |            Type             |                      Modifiers
--------------------------------+-----------------------------+------------------------------------------------------
 id                             | integer                     | not null default nextval('history_id_seq'::regclass)
 old_val                        | character varying(255)      |
 new_val                        | character varying(255)      |
 created_at                     | timestamp with time zone    | not null default now()

Now, in two different windows, I can open simultaneous transactions, a few seconds apart:

testdb=# # SESSION 1                           | testdb=# # SESSION 2
testdb=# START TRANSACTION;                    | testdb=#
testdb=# SELECT now();                         | testdb=#
              now                              | testdb=#
-------------------------------                | testdb=#
 2015-05-11 20:57:50.255651-04                 | testdb=#
(1 row)                                        | testdb=#
testdb=#                                       | testdb=# START TRANSACTION;
testdb=#                                       | testdb=# SELECT now();
testdb=#                                       |               now  
testdb=#                                       | -------------------------------
testdb=#                                       |  2015-05-11 20:58:00.466234-04
testdb=#                                       |  (1 row) 
testdb=#                                       | testdb=# START TRANSACTION;

Importantly, my 2nd session was started roughly 10 seconds after my 1st. If I wait a few seconds, and check now() again, I see the following:

testdb=# # SESSION 1                           | testdb=# # SESSION 2
testdb=# SELECT now();                         | testdb=# SELECT now();
              now                              |               now  
-------------------------------                | -------------------------------
 2015-05-11 20:57:50.255651-04                 |  2015-05-11 20:58:00.466234-04
(1 row)                                        |  (1 row) 

Once a transaction is open, Postgres fully stops time. Now, in my 2nd session, I’ll update the name of my user ‘a’. Since I’m still in a separate transactions, the name of my user will still appear as ‘a’ in my 1st session.

testdb=# # SESSION 1                           | testdb=# # SESSION 2
testdb=#                                       | testdb=# UPDATE users SET name='b' WHERE id=1;
testdb=#                                       | UPDATE 1
testdb=#                                       | testdb=# SELECT * FROM users;
testdb=#                                       | id | name 
testdb=#                                       | ---+------
testdb=#                                       | 1  | b
testdb=#                                       | (1 row)
testdb=# SELECT * FROM users;                  | testdb=#
id | name                                      | testdb=#
---+------                                     | testdb=#
1  | a                                         | testdb=#
(1 row)                                        | testdb=#

If I then attempt to update my user’s name in my 1st session, the update will hang, as the transaction will not be able to secure a lock on the row in question. While the 2nd transaction remains open, the 2nd session will continue to hold a lock on the updated row to prevent any loss or overwrite or changed yet uncommitted data.

testdb=# # SESSION 1                           | testdb=# # SESSION 2
testdb=# UPDATE users SET name='c' WHERE id=1; | testdb=# 
                                               | testdb=# SELECT * FROM users;
                                               | id | name 
                                               | ---+------
                                               | 1  | b
                                               | (1 row)

Committing my 2nd session, will release the lock to my 1st, allowing the update to complete. However, until my 1st session is committed, reads outside of the transaction will not show the update from b -> c.

testdb=# # SESSION 1                           | testdb=# # SESSION 2
testdb=# UPDATE users SET name='c' WHERE id=1; | testdb=# 
                                               | testdb=# COMMIT TRANSACTION;
UPDATE 1                                       | testdb=#
testdb=#                                       | testdb=# SELECT * FROM users;
testdb=#                                       | id | name 
testdb=#                                       | ---+------
testdb=#                                       | 1  | b
testdb=#                                       | (1 row)
testdb=# SELECT * FROM users;                  | testdb=#
id | name                                      | testdb=#
---+------                                     | testdb=#
1  | c                                         | testdb=#
(1 row)                                        | testdb=#
testdb=# COMMIT TRANSACTION;                   | testdb=# 
testdb=#                                       | testdb=# SELECT * FROM users;
testdb=#                                       | id | name 
testdb=#                                       | ---+------
testdb=#                                       | 1  | c
testdb=#                                       | (1 row)

As soon as each update is performed, the write to the history table is also prepped by the trigger. However, this write also isn’t executed until the transaction closes. Since the 2nd transaction closes first, it posts first, and carries with it the timestamp from the start of the transaction.

The Conclusion

The timeline of our investigation is as follows:

SESSION 1
start: 2015-05-11 20:57:50.255651-04                      close
<--------------------------------------------------------------->

SESSION 2
        start: 2015-05-11 20:58:00.466234-04       close
        <------------------------------------------------>

The end result is a history table in which the created_at timestamps are ‘out of order’:

testdb=# SELECT id, old_val, new_val, created_at FROM history;
id | old_val | new_val | created_at
---+---------+---------+----------------------------
1  | a       | b       | 2015-05-11 20:58:00.466234-04
2  | b       | c       | 2015-05-11 20:57:50.255651-04

Despite the default on the history table’s created_at column, the value of now() at the time the row is prepped is coming from the timestamp frozen at the start of the transaction. The timestamp() is not a reflection of time at the write time, but at the row prep time. The id, in contrast, is acquired at the time the row posts to the history table, and is taken from the next value available in the id sequence.1

The ordering of the table by id, rather than the created_at value, therefore represents the most accurate history of how the data has changed.

* All above code examples are in psql syntax.
1 Because sequence values must be immediately available to all transactions, incrementing a sequence cannot be undone. This prevents locking of a sequence value itself, as well as preventing a collision between a value updated within an open transaction and the state of the sequence after a closed transaction. See sequence documentation and transaction isolation documentation for further information.

13 Comments

  1. Very good article! We are linking to this particularly great article on our website.
    Keep up the good writing.

  2. Michael Glaesemann June 2, 2015 at 4:18 pm

    You should likely look at clock_timestamp() to determine if that suits your needs better than CURRENT_TIMESTAMP.

  3. This behavior of now is well known and documented. If you want real life clock, then use clock_timestamp() function.

  4. Very good article! I’m making fun of the stupid above comment that needs to be removed about this particularly great article. Keep up the good writing!

    But really, great post.

  5. The use of now() is not safe in a transaction as it grabs the timestamp of when the transaction started. In these cases, you might be better off using the postgres clock_timestamp() function to capture the date/time. Of course, this really depends on what you’re trying to do (if it’s more relevant to capture the time “now” or the time when the transaction started).

    Ref: http://www.postgresql.org/docs/9.4/static/functions-datetime.html

  6. I should have added, this is a well written article and clearly describes some issues with capturing time / timestamps in transactions.

  7. This isn’t really surprising behavior, it’s well documented: http://www.postgresql.org/docs/devel/static/functions-datetime.html. Try using the default created_at value to be clock_timestamp() instead of now().

  8. Well, this is the way it is supposed to work.
    From the documentation, section 9.9.4: “now() is a traditional PostgreSQL equivalent to transaction_timestamp().”

  9. Or more succinctly, from the Postgres docs:

    It is important to know that CURRENT_TIMESTAMP and related functions return the start time of the current transaction; their values do not change during the transaction. This is considered a feature: the intent is to allow a single transaction to have a consistent notion of the “current” time, so that multiple modifications within the same transaction bear the same time stamp. timeofday() returns the wall-clock time and does advance during transactions.

    Postgres has great docs.

  10. If accurate time matters use clock_timestamp().

    If you don’t want to roll your own audit in postgres you can use postgresql.org’s script (url below). This is for version 9.1 and above. You may need to change to use clock_timestamp().

    Tracks changed fields and original values among a bunch of other cool features. Uses HStore data type.

    PGSql 9.1+
    http://wiki.postgresql.org/wiki/Audit_trigger_91plus

    Older than 9.1
    http://wiki.postgresql.org/wiki/Audit_trigger

  11. It’s worth mentioning that for situations like you described you may consider using clock_timestamp() which returns actual date time value and therefore is changing during transactions instead of now(). See http://www.postgresql.org/docs/current/interactive/functions-datetime.html

  12. Update: Just read the postgresql.org script I mentioned above. Looks like it audits all 3 timestamps.
    current_timestamp, — action_tstamp_tx
    statement_timestamp(), — action_tstamp_stm
    clock_timestamp(), — action_tstamp_clk

    http://wiki.postgresql.org/wiki/Audit_trigger_91plus

    Setting up a trigger on a table is as simple as:
    SELECT audit.audit_table(‘target_table_name’);
    More advanced params also available.

  13. Everyone is recommending using clock_timestamp instead of now() or CURRENT_TIMESTAMP. This will get things *closer* but it still won’t always make the rows *visible* in timestamp order. (This can also be a problem for sequences)

    If you have some reader selecting rows in batches using a “checkpoint” time, where each batch gets the rows after the last timestamp in the previous batch, you can miss rows. The solution is that every transaction has to use the *same* lock *before* getting the timestamp.

Leave a Reply

Your email address will not be published. Required fields are marked *