What is the most suitable data type for a column in a table?

Question

I'm currently developing a PostgreSQL database schema. It has only 2 tables, one of which contains usernames. The difficulty is that, for certain reasons, I cannot store the username directly, so I have to store its hash (SHA256) instead.

Postgres has a bytea data type, which is an array of bytes, and that is technically what a SHA256 digest is.

My question is whether there is a better type to store the hash in terms of speed of searching whether the username exists in the database or not.

Perhaps I should look towards NoSQL solutions, where such a search is faster?

-----Add-----

The answer and comments suggest that it is optimal to use bytea.

CREATE TABLE users (
    username bytea PRIMARY KEY,
    somedata text NOT NULL
);
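As a rough sketch of how the exact-match check against this table might look: the username 'alice' and the row content are made up for illustration, and the hash is computed server-side with the built-in sha256() function (PostgreSQL 11 or later) only to keep the example self-contained. In the setup described here, the client would send the 32-byte digest itself.

```sql
-- Hypothetical sample row; sha256() stands in for the client-side hashing.
INSERT INTO users (username, somedata)
VALUES (sha256(convert_to('alice', 'UTF8')), 'example row');

-- Exact-match existence check; this can use the primary key's B-tree index.
SELECT EXISTS (
    SELECT 1
    FROM users
    WHERE username = sha256(convert_to('alice', 'UTF8'))
) AS username_exists;
```

If the client sends the digest as a hex string instead of raw bytes, decode(hex_string, 'hex') converts it to bytea before the comparison.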

If you're looking for exact matches, then would you just be hashing the input username to search and then comparing against the column? Or, would there be other search scenarios here? — Tim Biegeleisen, Commented Jul 4 at 9:48
Yes I'm looking for exact matches and yes username will be hashed on the client side. But what if I choose CHAR(64) instead of bytea? Or maybe something else? — bylazy, Commented Jul 4 at 10:04
bytea is probably the most compact way of storing the hash. — Tim Biegeleisen, Commented Jul 4 at 10:07

Zegarek · Accepted Answer · 2024-07-04 12:38:17Z

Try it. As you suspected and as already confirmed in comments and the answer from @Laurenz Albe, bytea wins by being the most compact and fastest to look up.

Here's how storing 400k hashes compares between these:

variant	pg_total_relation_size	pg_size_pretty
table1_bytea	48709632	46 MB
table2_text	75530240	72 MB
table3_text_collate_c	75522048	72 MB
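A size comparison like the one above could be set up roughly as follows. This is a sketch, not the exact script behind the numbers: the table names come from the result table, while the column name hash_value, the generated input values, and the primary keys are assumptions. Measured sizes will vary with PostgreSQL version and settings.

```sql
-- One table per storage variant (column name is an assumption).
CREATE TABLE table1_bytea          (hash_value bytea PRIMARY KEY);
CREATE TABLE table2_text           (hash_value text PRIMARY KEY);
CREATE TABLE table3_text_collate_c (hash_value text COLLATE "C" PRIMARY KEY);

-- 400k raw SHA256 digests, 32 bytes each.
INSERT INTO table1_bytea
SELECT sha256(convert_to(n::text, 'UTF8'))
FROM generate_series(1, 400000) AS n;

-- The same digests as 64-character hex strings.
INSERT INTO table2_text
SELECT encode(sha256(convert_to(n::text, 'UTF8')), 'hex')
FROM generate_series(1, 400000) AS n;

INSERT INTO table3_text_collate_c
SELECT hash_value FROM table2_text;

-- Total size of each table including its indexes.
SELECT v.variant,
       pg_total_relation_size(v.variant::regclass) AS pg_total_relation_size,
       pg_size_pretty(pg_total_relation_size(v.variant::regclass)) AS pg_size_pretty
FROM (VALUES ('table1_bytea'),
             ('table2_text'),
             ('table3_text_collate_c')) AS v(variant);
```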

And here's how much time it takes to look one up:

variant	avg
hashes_in_bytea	00:00:00.00003
hashes_in_text_collate_c	00:00:00.000035
hashes_in_text	00:00:00.000042
hashes_in_text_otf	00:00:00.00011

"hash index because I'm only interested in equality checks"

You can't use a hash index to enforce uniqueness, so you'd need to maintain both the hash index and the separate unique index that gets created to handle the PRIMARY KEY (unique) constraint on your username column. I'd just stick with the default unique index you already have.
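To illustrate the point, here is a minimal sketch against the users table from the question. The index name users_username_hash_idx is made up for the example.

```sql
-- The PRIMARY KEY on users.username already created a unique B-tree index.
-- A hash index can only be added as an extra, non-unique index:
CREATE INDEX users_username_hash_idx ON users USING hash (username);

-- PostgreSQL rejects the following statement, because the hash access
-- method does not support unique indexes:
-- CREATE UNIQUE INDEX ON users USING hash (username);

-- Dropping the extra index leaves only the default unique index.
DROP INDEX users_username_hash_idx;
```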

Also, hash indexes are not faster. – Laurenz Albe, Commented Jul 4 at 13:15

Laurenz Albe · Accepted Answer · 2024-07-04 12:14:47Z


It is probably a micro-optimization, but bytea would use significantly less space and compare faster.

If you end up using text (avoid character), make sure that you are using the C collation:

hash_value text COLLATE "C"

Other collations will result in much more expensive comparisons.

edited Jul 4 at 12:14

answered Jul 4 at 10:49


Why is bytea a micro-optimization? 32 bytes vs 64 bytes seems a significant difference if used as an index key. – Charlieface, Commented Jul 4 at 12:00
@Charlieface Perhaps my wording was not great. I don't think the performance will be much different (when using the C collation). The difference in storage can be significant. – Laurenz Albe, Commented Jul 4 at 12:16
Twice as long = twice as much to compare for equality in the B-tree, and loads and stores will also take longer. I don't see how performance won't be a good bit slower. – Charlieface, Commented Jul 4 at 12:30
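The 32 vs 64 byte figures discussed in these comments can be checked directly. This is a small sketch in which the input 'alice' is an arbitrary example value; it measures the payload only, not the per-value storage header.

```sql
-- Length of a SHA256 digest as raw bytes vs. as a hex-encoded string.
SELECT octet_length(sha256(convert_to('alice', 'UTF8')))                AS bytea_bytes,
       octet_length(encode(sha256(convert_to('alice', 'UTF8')), 'hex')) AS hex_text_bytes;
-- bytea_bytes = 32, hex_text_bytes = 64
```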
