03 September 2024

Quick Benchmark: ANALYZE vs. maintenance_io_concurrency

Introduction

Running ANALYZE (either explicitly or via auto-analyze) is very important in order to have up-to-date data statistics for the Postgres query planner. In particular, after in-place upgrades via pg_upgrade, ANALYZE needs to be run in order to have any query statistics at all. As ANALYZE samples only a subset of a table's blocks, its I/O pattern looks more like random access than sequential reads. Version 14 of Postgres gained the possibility to use prefetching (if available, which is the case on Linux) to tell the operating system kernel which blocks it will look at next. This is controlled via the maintenance_io_concurrency configuration parameter, which is set to 10 by default (contrary to effective_io_concurrency, which is set to 1 by default).
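
To check how these two parameters are set on a given instance, they can simply be inspected with SHOW; the psql invocations below (and the pgbench database name) are only for illustration:

    $ psql -d pgbench -c "SHOW maintenance_io_concurrency;"
    $ psql -d pgbench -c "SHOW effective_io_concurrency;"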

Benchmark

In order to test and demonstrate the changes between versions 13 and 14, we ran some quick benchmarks using the current maintenance releases (13.16 and 14.13) on Debian 12 with packages from https://apt.postgresql.org. Hardware-wise, a ThinkPad T14s Gen 3 with an Intel i7-1280P CPU with 20 cores and 32 GB of RAM was used. The basis is a pgbench database, initialized with a scale factor of 1000:

    $ pgbench -i -I dtg -s 1000 -d pgbench

This creates 100 million rows and leads to a database size of around 15 GB. In order to have ANALYZE do a bit more work, we increase default_statistics_target from its default of 100 to the same value as the pgbench scale factor (i.e., 1000); one way to do this is sketched after the ANALYZE output below. This results in ANALYZE scanning around 20% of all blocks. We then analyze the main pgbench table, pgbench_accounts:

    $ vacuumdb -Z -v -d pgbench -t pgbench_accounts
    INFO:  analyzing "public.pgbench_accounts"
    INFO:  "pgbench_accounts": scanned 300000 of 1639345 pages,
           containing 18300000 live rows and 0 dead rows;
           300000 rows in sample, 100000045 estimated total rows
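
One way to raise default_statistics_target (not necessarily the exact method used for this benchmark) is to set it cluster-wide via ALTER SYSTEM and reload the configuration; a per-column SET STATISTICS on pgbench_accounts would work just as well:

    $ psql -d pgbench -c "ALTER SYSTEM SET default_statistics_target = 1000;"
    $ psql -d pgbench -c "SELECT pg_reload_conf();"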

Between runs, the file system page cache is dropped via echo 3 | sudo tee /proc/sys/vm/drop_caches, and all runs are repeated three times. The following table lists the run-times (in seconds) of the above vacuumdb command for various settings of maintenance_io_concurrency; a sketch of one such timed run follows the table:

 

maintenance_io_concurrency |      0 |      1 |      5 |     10 |     20 |     50 |    100 |    500
---------------------------+--------+--------+--------+--------+--------+--------+--------+-------
Version 13                 | 19.557 | 21.610 | 19.623 | 21.060 | 21.463 | 20.533 | 20.230 | 20.537
Version 14                 | 24.707 | 29.840 |  8.740 |  5.777 |  4.067 |  3.353 |  3.007 |  2.763
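
For illustration, a single timed run could look like the following sketch; the maintenance_io_concurrency value of 20 and the use of time are just an example, not the exact script used for the benchmark:

    $ psql -d pgbench -c "ALTER SYSTEM SET maintenance_io_concurrency = 20;"
    $ psql -d pgbench -c "SELECT pg_reload_conf();"
    $ echo 3 | sudo tee /proc/sys/vm/drop_caches
    $ time vacuumdb -Z -v -d pgbench -t pgbench_accounts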

 

Analysis

Two things are very clear from these numbers: First, the run-times do not change for version 13; the value of maintenance_io_concurrency has no effect for this version. Second, once prefetching kicks in for version 14 (maintenance_io_concurrency of 5 or more), ANALYZE gets several times faster, up to a factor of 6-7x. The default maintenance_io_concurrency of 10 is already 3-4x faster, and values larger than 50 show only minor further improvements, at least for this benchmark on this hardware. Also notable is that the run-times with prefetching turned off (maintenance_io_concurrency=0) or set to only 1 are worse than for version 13, but as the default for maintenance_io_concurrency is 10, this should not affect anybody in practice.

Conclusion

Enabling prefetching for ANALYZE in version 14 of PostgreSQL has made statistics sampling much faster. The default value of 10 for maintenance_io_concurrency is already quite good, but we advise increasing it to 20-50 (or higher) in case high-performing local NVMe storage is used. In a future quick benchmark, we plan to compare the ANALYZE performance of the major versions since 14. In particular, the upcoming 17 release promises further improvements to ANALYZE due to the new streaming I/O interface.

Categories: PostgreSQL®
Tags: Benchmarks planetpostgresql PostgreSQL®

About the author

Michael Banck


Michael Banck has been an employee of credativ GmbH since 2009, a member of the Debian project since 2001, and is also active in other open source projects. As a member of credativ's database team, he has in recent years supported and advised various customers on solving problems with and the day-to-day operation of PostgreSQL®, as well as on the introduction of high-availability solutions for databases.


