Out of Memory errors are frustrating as heck!


Gunther

For weeks now, I have been banging my head against an "out of memory" situation. There is only one query I am running on an 8 GB system; whatever I try, I get knocked out by this out-of-memory error. It is extremely hard to understand and fix this error. I guess I could add a swap file, and then I would have to take the penalty of swapping. But how can I actually address an out-of-memory condition if the system doesn't tell me where it is happening?

You might want to see the query, but it is a huge plan and I can't really break it down. It shouldn't matter, though. But just so you can get a glimpse, here is the plan:

Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197)
  ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
        ->  Sort  (cost=5358849.28..5358935.83 rows=34619 width=1197)
              Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid
              ->  Nested Loop Left Join  (cost=2998335.54..5338133.63 rows=34619 width=1197)
                    Join Filter: (((documentinformationsubject.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject.actinternalid)::text = (r.targetinternalid)::text))
                    ->  Merge Left Join  (cost=2998334.98..3011313.54 rows=34619 width=930)
                          Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
                          ->  Sort  (cost=1408783.87..1408870.41 rows=34619 width=882)
                                Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid
                                ->  Seq Scan on documentinformationsubject  (cost=0.00..1392681.22 rows=34619 width=882)
                                      Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                          ->  Materialize  (cost=1589551.12..1594604.04 rows=1010585 width=159)
                                ->  Sort  (cost=1589551.12..1592077.58 rows=1010585 width=159)
                                      Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid
                                      ->  Seq Scan on documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=159)
                                            Filter: ((participationtypecode)::text = 'PRD'::text)
                    ->  Materialize  (cost=0.56..2318944.31 rows=13 width=341)
                          ->  Nested Loop Left Join  (cost=0.56..2318944.24 rows=13 width=341)
                                ->  Nested Loop Left Join  (cost=0.00..2318893.27 rows=1 width=281)
                                      Join Filter: ((agencyname.entityinternalid)::text = (documentinformationsubject_2.otherentityinternalid)::text)
                                      ->  Nested Loop Left Join  (cost=0.00..2286828.33 rows=1 width=291)
                                            Join Filter: ((agencyid.entityinternalid)::text = (documentinformationsubject_2.otherentityinternalid)::text)
                                            ->  Nested Loop Left Join  (cost=0.00..2284826.24 rows=1 width=239)
                                                  Join Filter: (((q.documentinternalid)::text = (documentinformationsubject_2.documentinternalid)::text) AND ((q.actinternalid)::text = (documentinformationsubject_2.actinternalid)::text))
                                                  ->  Nested Loop  (cost=0.00..954957.59 rows=1 width=136)
                                                        Join Filter: ((q.actinternalid)::text = (r.sourceinternalid)::text)
                                                        ->  Seq Scan on actrelationship r  (cost=0.00..456015.26 rows=1 width=74)
                                                              Filter: ((typecode)::text = 'SUBJ'::text)
                                                        ->  Seq Scan on documentinformation q  (cost=0.00..497440.84 rows=120119 width=99)
                                                              Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                                  ->  Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1329868.64 rows=1 width=177)
                                                        Filter: ((participationtypecode)::text = 'AUT'::text)
                                            ->  Seq Scan on entity_id agencyid  (cost=0.00..1574.82 rows=34182 width=89)
                                      ->  Seq Scan on bestname agencyname  (cost=0.00..27066.08 rows=399908 width=64)
                                ->  Index Scan using act_id_fkidx on act_id an  (cost=0.56..50.85 rows=13 width=134)
                                      Index Cond: ((q.actinternalid)::text = (actinternalid)::text)

I have monitored the activity with vmstat and iostat, and it looks like the memory grabbing happens rapidly after a sort-merge step. In iostat I see heavy read and write activity, which I attribute to the sort-merge step; that is followed by a sudden spike in write activity, and then the out-of-memory crash.
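
For reference, the capture came from something like the following (the exact command lines are reconstructed, not copied from the session; the -t switch is what produces the UTC timestamp column below):

  vmstat -t 2     # sample every 2 seconds, appending a timestamp column
  iostat -x 2     # extended per-device statistics at the same interval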

procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu-------- -----timestamp-----
 r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st                 UTC
 0  2            0       119344            0      7616672    0    0 11681  3107    9    0   6   1  72  21   0 2019-04-14 16:19:52
 0  2            0       128884            0      7607288    0    0  2712 55386  500  509   3   2  15  80   0 2019-04-14 16:19:54
 0  2            0       116984            0      7619916    0    0   880 59241  548  525   2   2   9  87   0 2019-04-14 16:19:56
 0  2            0       131492            0      7604816    0    0   128 56512  518  401   1   1  12  86   0 2019-04-14 16:19:58
 ...
 0  2            0       134508            0      7601480    0    0     0 58562  428  353   0   1   4  95   0 2019-04-14 16:21:46
 0  2            0       125360            0      7611320    0    0     0 59392  481  369   0   1  11  89   0 2019-04-14 16:21:48
 0  2            0       122896            0      7612872    0    0     0 58564  431  342   0   1  17  82   0 2019-04-14 16:21:50
 1  1            0       121456            0      7614248    0    0    54 57347  487  399   0   1  13  85   0 2019-04-14 16:21:52
 0  2            0       122820            0      7613324    0    0    12 59964  460  346   0   1  20  79   0 2019-04-14 16:21:54
 0  2            0       120344            0      7616528    0    0  1844 55691  645  676   5   3   6  85   0 2019-04-14 16:21:56
 0  2            0       124900            0      7611404    0    0   936 58261  795 1215   2   3  13  83   0 2019-04-14 16:21:58
 0  2            0       124572            0      7612192    0    0  1096 55340  518  487   1   2   0  97   0 2019-04-14 16:22:00
 0  2            0       123040            0      7612740    0    0   888 57574  573  620   1   2   5  92   0 2019-04-14 16:22:02
 0  2            0       125112            0      7610592    0    0   124 59164  498  480   1   1  13  85   0 2019-04-14 16:22:04
 1  1            0       129440            0      7607592    0    0   568 60196  563  612   2   2   8  88   0 2019-04-14 16:22:06
 0  2            0       124020            0      7612364    0    0     0 58260  629  725   0   1   8  91   0 2019-04-14 16:22:08
 2  1            0       124480            0      7611848    0    0     0 58852  447  331   0   1   1  98   0 2019-04-14 16:22:10
 0  3            0       137636            0      7598484    0    0 11908 44995  619  714   1   1  11  87   0 2019-04-14 16:22:12
 0  2            0       123128            0      7613392    0    0 29888 28901  532  972   1   1  29  68   0 2019-04-14 16:22:14
 0  2            0       126260            0      7609984    0    0 39872 18836  706 1435   1   2  28  70   0 2019-04-14 16:22:16
 0  2            0       130748            0      7605536    0    0 36096 22488  658 1272   2   1   8  89   0 2019-04-14 16:22:18
...
 0  2            0       127216            0      7609192    0    0 29192 29696  472  949   1   1  23  75   0 2019-04-14 16:22:40
 0  2            0       147428            0      7588556    0    0 29120 29696  523  974   1   1  19  79   0 2019-04-14 16:22:42
 0  1            0       120644            0      7615388    0    0 32320 25276  566  998   1   2  49  47   0 2019-04-14 16:22:44
 0  1            0       128456            0      7607904    0    0 58624     0  621 1103   3   2  49  46   0 2019-04-14 16:22:46
 0  1            0       127836            0      7608260    0    0 58624     0  631 1119   3   2  50  46   0 2019-04-14 16:22:48
 0  1            0       126712            0      7609616    0    0 58624     0  616 1110   2   2  50  47   0 2019-04-14 16:22:50
...
 0  1            0       157408            0      7578060    0    0 58628     0  736 1206   3   3  50  44   0 2019-04-14 16:27:22
 0  1            0       142420            0      7593400    0    0 58688     0  623 1099   1   4  50  45   0 2019-04-14 16:27:24
 0  1            0       247016            0      7488184    0    0 58568     0  649 1113   1   4  50  45   0 2019-04-14 16:27:26
 0  1            0       123232            0      7612088    0    0 58412   215  675 1141   2   3  50  46   0 2019-04-14 16:27:28
 0  2            0       144920            0      7586576    0    0 48376 11046  788 1455   1   5  34  60   0 2019-04-14 16:27:30
 1  1            0       125636            0      7595704    0    0 36736 21381  702 1386   1   4  21  74   0 2019-04-14 16:27:32
 0  3            0       156700            0      7559328    0    0 35556 23364  709 1367   1   3  22  74   0 2019-04-14 16:27:34
 0  2            0       315580            0      7382748    0    0 33608 24731  787 1407   1   5  18  76   0 2019-04-14 16:27:36
...
 0  2            0       684412            0      6152040    0    0 29832 28356  528  994   1   2  32  66   0 2019-04-14 16:38:04
 0  2            0       563512            0      6272264    0    0 29696 29506  546  987   1   2  32  65   0 2019-04-14 16:38:06
 0  2            0       595488            0      6241068    0    0 27292 30858  549  971   1   2  26  71   0 2019-04-14 16:38:08
 0  2            0       550120            0      6285352    0    0 28844 29696  567  995   1   2  29  68   0 2019-04-14 16:38:10
 1  1            0       432380            0      6402964    0    0 28992 29696  557  979   1   2  37  61   0 2019-04-14 16:38:12
 0  2            0       445796            0      6384412    0    0 26768 32134  628 1029   1   4  27  69   0 2019-04-14 16:38:14
 0  2            0       374972            0      6453592    0    0 28172 30839  529  962   1   2  43  54   0 2019-04-14 16:38:16
 0  2            0       317824            0      6507992    0    0 29172 29386  560 1001   1   3  27  68   0 2019-04-14 16:38:18
 0  3            0       215092            0      6609132    0    0 33116 25210  621 1148   1   3  19  77   0 2019-04-14 16:38:20
 0  2            0       194836            0      6621524    0    0 27786 30959  704 1152   0   5  18  77   0 2019-04-14 16:38:22
 0  3            0       315648            0      6500196    0    0 31434 27226  581 1073   0   3  31  65   0 2019-04-14 16:38:24
 0  2            0       256180            0      6554676    0    0 29828 29017  668 1174   0   4  20  76   0 2019-04-14 16:38:26 <<< CRASH
 0  1            0       378220            0      6552496    0    0  4348 53686 2210 3816   1   5  46  49   0 2019-04-14 16:38:28
 0  1            0       389888            0      6536296    0    0  2704 56529 2454 4178   0   5  42  52   0 2019-04-14 16:38:30
 0  2            0       923572            0      5998992    0    0  1612 56863 2384 3928   0   6  16  78   0 2019-04-14 16:38:32
 0  0            0       908336            0      6006696    0    0  3584 49280 8961 17334   0  19  39  42   0 2019-04-14 16:38:34
 0  1            0      1306480            0      5607088    0    0   264 63632 18605 37933   3  58  35   4   0 2019-04-14 16:38:36
 2  1            0      1355448            0      5558576    0    0     8 59222 14817 30296   2  46  24  27   0 2019-04-14 16:38:38
 2  2            0      1358224            0      5555884    0    0     0 58544 14226 28331   2  44   3  50   0 2019-04-14 16:38:40
 2  1            0      1446348            0      5468748    0    0     0 58846 14376 29185   2  44  11  42   0 2019-04-14 16:38:42
 0  0            0      2639648            0      4357608    0    0     0 28486 12909 26770   2  44  49   5   0 2019-04-14 16:38:44
 0  0            0      2639524            0      4357800    0    0     0     0  158  154   0   0 100   0   0 2019-04-14 16:38:46
 0  0            0      2687316            0      4309976    0    0     0     0  181  188   0   2  98   0   0 2019-04-14 16:38:48
 0  0            0      2706920            0      4300116    0    0     0   105  137  263   0   0 100   0   0 2019-04-14 16:38:50
 0  0            0      2706672            0      4300232    0    0     0     0  142  204   0   0 100   0   0 2019-04-14 16:38:52
 0  0            0      2815116            0      4191928    0    0     0     0  116  242   0   0 100   0   0 2019-04-14 16:38:54
 0  0            0      2815364            0      4192008    0    0     0     0  116  239   0   0 100   0   0 2019-04-14 16:38:56
 0  0            0      2815116            0      4192164    0    0     0     0  159  236   0   0 100   0   0 2019-04-14 16:38:58

ending after the out-of-memory crash, which occurred exactly at the marked point, 16:38:26.355 UTC.

We can't really see anything too worrisome. There is always lots of memory used by cache, which could have been reclaimed. The only possible explanation I can think of is that at the moment of the crash the memory utilization suddenly skyrocketed in less than a second, so that the 2-second vmstat interval wouldn't show it??? Nah.

I have already reduced work_mem considerably, which has helped in some other cases before. Now I am going to reduce shared_buffers too, but that seems counter-intuitive because we are sitting on all that cache memory unused!
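
For concreteness, a sketch of those two knobs (the values are just the ones that come up later in this thread):

  SET work_mem = '4MB';    -- can be changed per session
  -- shared_buffers can only be set in postgresql.conf and requires a restart, e.g.:
  -- shared_buffers = 1GB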

Might this be a bug? It feels like a bug. It feels like these out-of-memory situations should be handled more gracefully (a garbage-collection attempt?), and that there should be more information so that one can actually do something about it.

Any ideas?

-Gunther


Re: Out of Memory errors are frustrating as heck!

Tom Lane
Gunther <[hidden email]> writes:
> For weeks now, I am banging my head at an "out of memory" situation.
> There is only one query I am running on an 8 GB system, whatever I try,
> I get knocked out on this out of memory. It is extremely impenetrable to
> understand and fix this error. I guess I could add a swap file, and then
> I would have to take the penalty of swapping. But how can I actually
> address an out of memory condition if the system doesn't tell me where
> it is happening?

> You might want to see the query, but it is a huge plan, and I can't
> really break this down. It shouldn't matter though. But just so you can
> get a glimpse here is the plan:

Is that the whole plan?  With just three sorts and two materializes,
it really shouldn't use more than more-or-less 5X work_mem.  What do
you have work_mem set to, anyway?  Is this a 64-bit build of PG?
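
Those are quick to check from psql; a minimal sketch using only stock commands:

  SHOW work_mem;
  SHOW shared_buffers;
  SELECT version();   -- the version string shows whether this is a 64-bit build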

Also, are the estimated rowcounts shown here anywhere close to what
you expect in reality?  If there are any AFTER INSERT triggers on the
insertion target table, you'd also be accumulating per-row trigger
queue entries ... but if there's only circa 35K rows to be inserted,
it's hard to credit that eating more than a couple hundred KB, either.

> Might this be a bug?

It's conceivable that you've hit some memory-leakage bug, but if so you
haven't provided nearly enough info for anyone else to reproduce it.
You haven't even shown us the actual error message :-(

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

                        regards, tom lane



Re: Out of Memory errors are frustrating as heck!

Jeff Janes
On Sun, Apr 14, 2019 at 4:51 PM Gunther <[hidden email]> wrote:

For weeks now, I am banging my head at an "out of memory" situation. There is only one query I am running on an 8 GB system, whatever I try, I get knocked out on this out of memory.

Is PostgreSQL throwing an error with OOM, or is it getting killed -9 by the OOM killer?  Do you get a core file you can inspect with gdb?

You might want to see the query, but it is a huge plan, and I can't really break this down. It shouldn't matter though. But just so you can get a glimpse here is the plan:

Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197)
  ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
        

Maybe it is memory for trigger or constraint checking, although I don't know why that would appear instantly.  What triggers or constraints do you have on businessoperation? 

What if you just run the SELECT without the INSERT?  Or insert into a temp table rather than into businessoperation?  And if that doesn't crash, what if you then insert to businessoperation from the temp table?
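
A sketch of that bisection (the temp-table name is invented; the other names are from this thread):

  -- triggers and constraints are listed at the bottom of:
  \d reports.businessoperation

  -- step 1: does the bare SELECT survive on its own?
  CREATE TEMP TABLE businessoperation_tmp AS
    SELECT * FROM reports.v_BusinessOperation;

  -- step 2: if so, does the INSERT path survive when fed from the temp table?
  INSERT INTO reports.BusinessOperation
    SELECT * FROM businessoperation_tmp;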
 
Also, what version?

Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

Gunther

Thanks for looking at my problem Tom Lane and Jeff Janes. Sorry for not having given enough detail.

The version is 10.2, the latest. The database was originally built with 10.1 and then just started with 10.2; no dump and reload or pg_upgrade. The underlying system is 64-bit Amazon Linux (CentOS-like) running on an AMD64 VM (m5a) right now.

I said "crash" and that is wrong. Not a signal nor core dump. It is the ERROR:  out of memory. Only the query crashes. Although I don't know if may be the backend server might have left a core dump? Where would that be? Would it help anyone if I started the server with the -c option to get a core dump? I guess I could re-compile with gcc -g debugging symbols all on and then run with that -c option, and then use gdb to find out which line it was failing at and then inspect the query plan data structure? Would that be helpful? Does anyone want the coredump to inspect?

The short version is:

Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 1425134928 used
2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory context "ExecutorState".

Here is the out of memory error dump in its full glory.

TopMemoryContext: 2197400 total in 7 blocks; 42952 free (15 chunks); 2154448 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 7720 free (2 chunks); 472 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 2097152 total in 9 blocks; 396480 free (10 chunks); 1700672 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used:
      ExecutorState: 1416621920 total in 168098 blocks; 8494152 free (3102 chunks); 1408127768 used
        HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
          HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
        HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used
          HashBatchContext: 106640 total in 3 blocks; 7936 free (0 chunks); 98704 used
        TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used
          Caller tuples: 4194304 total in 10 blocks; 1434888 free (20 chunks); 2759416 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 386840 free (1 chunks); 714488 used
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: businessop_docid_ndx
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: businessop_sbjentityidroot_ndx
    index info: 2048 total in 2 blocks; 704 free (1 chunks); 1344 used: businessop_sbjroleiid_ndx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: act_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: act_id_idx
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_constraint_conrelid_contypid_conname_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: actrelationship_pkey
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_target_idx
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_source_idx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: documentinformation_pk
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_relid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_seii
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_sbjentcodeonly
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2618_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indrelid_index
    relation rules: 229376 total in 31 blocks; 5136 free (0 chunks); 224240 used: v_businessoperation
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_db_role_setting_databaseid_rol_index
    index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_opclass_am_name_nsp_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_enum_oid_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_class_relname_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_pubname_index
    index info: 2048 total in 2 blocks; 592 free (3 chunks); 1456 used: pg_statistic_relid_att_inh_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_cast_source_target_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_transform_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_collation_oid_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amop_fam_strat_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indexrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_template_tmplname_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_ts_config_map_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_opclass_oid_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_evtname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_statistic_ext_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_dict_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_oid_index
    index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_conversion_default_index
    index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_operator_oprname_l_r_n_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_trigger_tgrelid_tgname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_enum_typid_label_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_config_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_user_mapping_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_opfamily_am_name_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_table_relid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_type_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_aggregate_fnoid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_constraint_oid_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_rewrite_rel_rulename_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_parser_prsname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_config_cfgname_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_parser_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_publication_rel_prrelid_prpubid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_operator_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_namespace_nspname_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_template_oid_index
    index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_amop_opr_fam_index
    index info: 2048 total in 2 blocks; 672 free (3 chunks); 1376 used: pg_default_acl_role_nsp_obj_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_collation_name_enc_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_rel_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_range_rngtypid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_dict_dictname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_type_typname_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_opfamily_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_proc_proname_args_nsp_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_partitioned_table_partrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_transform_type_lang_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_proc_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_namespace_oid_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amproc_fam_proc_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_name_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_attribute_relid_attnam_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_conversion_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_user_mapping_user_server_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_subscription_rel_srrelid_srsubid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_sequence_seqrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_conversion_name_nsp_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_member_role_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_subscription_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_tablespace_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_shseclabel_object_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roname_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_datname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_subscription_subname_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roiident_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_role_member_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_rolname_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 7256 free (1 chunks); 936 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 1425134928 used
2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory context "ExecutorState".
I am delighted that Tom Lane is hinting that my query plan doesn't look so crazy, and it isn't. And delighted there may be a known bug involved.

I wonder if this is a data issue. Maybe a few rows have excessively long text fields? But even checking for that is rather difficult, because there are many tables and columns involved. It would really be nice if the error said exactly which plan step that ExecutorState referred to, so one could narrow it down.
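
A spot check along those lines could at least rule out one column at a time; subjectentityname here is just an example picked from the sort key above:

  SELECT max(length(subjectentityname))
    FROM documentinformationsubject;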

regards,
-Gunther


Re: Out of Memory errors are frustrating as heck!

Peter Much
On Sun, Apr 14, 2019 at 05:19:11PM -0400, Tom Lane wrote:
! Gunther <[hidden email]> writes:
! > For weeks now, I am banging my head at an "out of memory" situation.
! > There is only one query I am running on an 8 GB system, whatever I try,
! > I get knocked out on this out of memory. It is extremely impenetrable to
! > understand and fix this error. I guess I could add a swap file, and then
! > I would have to take the penalty of swapping. But how can I actually
! > address an out of memory condition if the system doesn't tell me where
! > it is happening?

Well, exactly: with a swap space. No offense intended, but if you
don't have swap space, you should not complain about unintelligible
out-of-memory situations.
Swap space is not usually used to run applications from (that would
indeed give horrible performance); it is used to avoid out-of-memory
errors. With swap space, the out-of-memory situation will persist,
and so one has time to take measurements and analyze system
behaviour, and from that one can better understand what is causing
the problem and decide what actions should be taken on an informed
basis (e.g. correct flaws in the system tuning, fix a bad query, buy
more memory, or whatever may be applicable).
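
For the record, a minimal swap-file sketch; size and path are arbitrary:

  dd if=/dev/zero of=/swapfile bs=1M count=4096   # 4 GB
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile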

If I remember correctly, I did even see a DTRACE flag in my build,
so what more is there to wish for? :))

P.



Re: Out of Memory errors are frustrating as heck!

Justin Pryzby
On Sun, Apr 14, 2019 at 09:05:48PM -0400, Gunther wrote:
> Thanks for looking at my problem Tom Lane and Jeff Janes. Sorry for not
> having given enough detail.
>
> The version is 10.2 latest.

v10.7 is available; could you upgrade?

What are these set to?  shared_buffers? work_mem?

Was postgres locally compiled, packaged by distribution, or PGDG RPM/DEB?

Can you show \d businessoperation?

> The short version is:
>
> Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 1425134928 used
> 2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
> 2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory context "ExecutorState".

Could you rerun the query with \set VERBOSITY verbose to show the file/line
that's failing?

If you wanted to show a stack trace, you could attach gdb to PID from SELECT
pg_backend_pid(), "b"reak on errdetail, run the query, and then "bt" when it
fails.
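
Spelled out, that is roughly the following (the PID is whatever pg_backend_pid() returns in the failing session):

  -- in psql:
  SELECT pg_backend_pid();

  # in a shell on the server, substituting that PID:
  gdb -p <pid>
  (gdb) break errdetail
  (gdb) continue
  -- now rerun the query in the psql session; when the breakpoint fires:
  (gdb) bt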

Justin



Re: Out of Memory errors are frustrating as heck!

Jeff Janes
On Sun, Apr 14, 2019 at 9:06 PM Gunther <[hidden email]> wrote:

Thanks for looking at my problem Tom Lane and Jeff Janes. Sorry for not having given enough detail.

The version is 10.2 latest. The database was originally built with 10.1 and then just started with 10.2.

Do you mean 11.2?  The latest in the 10 series is 10.7.  If you do mean 10.2, there is a fix for a memory leak bug since then that might plausibly be relevant (bdc7f686d1b8f423cb).


I said "crash" and that is wrong. Not a signal nor core dump. It is the ERROR:  out of memory. Only the query crashes. Although I don't know if may be the backend server might have left a core dump?

I don't think there would be a core dump on only an ERROR, and it's probably not worthwhile to trick it into generating one.
 

The short version is:

Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 1425134928 used
2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 in memory context "ExecutorState".
I don't know why an 8GB system with a lot of cache that could be evicted would get an OOM when something using 1.5GB asks for 8272 bytes more.  But that is a question of how the kernel works, rather than how PostgreSQL works.  But I also think the log you quote above belongs to a different event than the vmstat trace in your first email.
 
      ExecutorState: 1416621920 total in 168098 blocks; 8494152 free (3102 chunks); 1408127768 used
        HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
          HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used

This does not seem to match your query plan.  Why would a plan with no Hash Joins have a HashBatchContext?  I think this log must be from a different query than the one that generated the plan you posted.  Perhaps one was before you lowered work_mem and one was after?

Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

Gunther
Thanks, Justin Pryzby, and Jeff Janes too; responding to both of you for efficiency. Answers, more logs, and the gdb backtrace below.
The version is 10.2 latest.
v10.7 is available; could you upgrade?
Sorry, I meant 11.2, actually the latest.
What are these set to ?  shared_buffers? work_mem?

shared_buffers=2G (of 8 total), then 1G, didn't help.

work_mem=4M by now (I had once succeeded in avoiding out of memory by reducing work_mem from 64M to 8M). But as Tom Lane says, it shouldn't be using more than 5 x work_mem in this query plan.

Jeff Janes said:

 I don't know why an 8GB system with a lot of cache that could be evicted would get an OOM when something using 1.5GB asks for 8272 bytes more.  But that is a question of how the kernel works, rather than how PostgreSQL works.  But I also think the log you quote above belongs to a different event than the vmstat trace in your first email.
and I agree, except that the vmstat log and the error really belong together: same timestamp. Nothing else was running on that machine this Sunday. Yes, I ran this several times with different parameters, so some mix-up is possible, but it always ended in the same crash anyway. So here it is again, without the vmstat log, which really wouldn't be any different from what I showed you. (See below for the ENABLE_NESTLOOP=off setting; not having those settings the same between explain and actual execution might account for the discrepancy you saw.)

integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# \set VERBOSITY verbose
integrator=# explain INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
integrator=# \pset pager off
Pager usage is off.
integrator=# \pset format unaligned
Output format is unaligned.
integrator=# explain INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
QUERY PLAN
Insert on businessoperation  (cost=5850091.58..5853120.74 rows=34619 width=1197)
  ->  Unique  (cost=5850091.58..5852774.55 rows=34619 width=1197)
        ->  Sort  (cost=5850091.58..5850178.13 rows=34619 width=1197)
              Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid
              ->  Hash Right Join  (cost=4489522.06..5829375.93 rows=34619 width=1197)
                    Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))
                    ->  Hash Right Join  (cost=1473632.24..2808301.92 rows=13 width=341)
                          Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))
                          ->  Hash Left Join  (cost=38864.03..1373533.69 rows=1 width=219)
                                Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyname.entityinternalid)::text)
                                ->  Hash Left Join  (cost=2503.10..1332874.75 rows=1 width=229)
                                      Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyid.entityinternalid)::text)
                                      ->  Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1329868.64 rows=1 width=177)
                                            Filter: ((participationtypecode)::text = 'AUT'::text)
                                      ->  Hash  (cost=1574.82..1574.82 rows=34182 width=89)
                                            ->  Seq Scan on entity_id agencyid  (cost=0.00..1574.82 rows=34182 width=89)
                                ->  Hash  (cost=27066.08..27066.08 rows=399908 width=64)
                                      ->  Seq Scan on bestname agencyname  (cost=0.00..27066.08 rows=399908 width=64)
                          ->  Hash  (cost=1434768.02..1434768.02 rows=13 width=233)
                                ->  Hash Right Join  (cost=953906.58..1434768.02 rows=13 width=233)
                                      Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text)
                                      ->  Seq Scan on act_id an  (cost=0.00..425941.04 rows=14645404 width=134)
                                      ->  Hash  (cost=953906.57..953906.57 rows=1 width=136)
                                            ->  Hash Join  (cost=456015.28..953906.57 rows=1 width=136)
                                                  Hash Cond: ((q.actinternalid)::text = (r.sourceinternalid)::text)
                                                  ->  Seq Scan on documentinformation q  (cost=0.00..497440.84 rows=120119 width=99)
                                                        Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                                  ->  Hash  (cost=456015.26..456015.26 rows=1 width=74)
                                                        ->  Seq Scan on actrelationship r  (cost=0.00..456015.26 rows=1 width=74)
                                                              Filter: ((typecode)::text = 'SUBJ'::text)
                    ->  Hash  (cost=3011313.54..3011313.54 rows=34619 width=930)
                          ->  Merge Left Join  (cost=2998334.98..3011313.54 rows=34619 width=930)
                                Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
                                ->  Sort  (cost=1408783.87..1408870.41 rows=34619 width=882)
                                      Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid
                                      ->  Seq Scan on documentinformationsubject  (cost=0.00..1392681.22 rows=34619 width=882)
                                            Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                ->  Materialize  (cost=1589551.12..1594604.04 rows=1010585 width=159)
                                      ->  Sort  (cost=1589551.12..1592077.58 rows=1010585 width=159)
                                            Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid
                                            ->  Seq Scan on documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=159)
                                                  Filter: ((participationtypecode)::text = 'PRD'::text)

and the error memory status dump (I hope my grey boxes help a bit to lighten this massive amount of data):

TopMemoryContext: 4294552 total in 7 blocks; 42952 free (15 chunks); 4251600 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 7720 free (2 chunks); 472 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 2097152 total in 9 blocks; 396480 free (10 chunks); 1700672 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used:
      ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used
        HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
          HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
        HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
          HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used
        HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used
          HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 chunks); 100703776 used
        TupleSort main: 452880 total in 8 blocks; 126248 free (27 chunks); 326632 used
          Caller tuples: 1048576 total in 8 blocks; 21608 free (14 chunks); 1026968 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1101328 total in 14 blocks; 386840 free (1 chunks); 714488 used
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: businessop_docid_ndx
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: businessop_sbjentityidroot_ndx
    index info: 2048 total in 2 blocks; 704 free (1 chunks); 1344 used: businessop_sbjroleiid_ndx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: act_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: act_id_idx
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_constraint_conrelid_contypid_conname_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: actrelationship_pkey
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_target_idx
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: actrelationship_source_idx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: documentinformation_pk
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_relid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_seii
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: docinfsubj_ndx_sbjentcodeonly
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2618_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indrelid_index
    relation rules: 229376 total in 31 blocks; 5136 free (0 chunks); 224240 used: v_businessoperation
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_db_role_setting_databaseid_rol_index
    index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_opclass_am_name_nsp_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_enum_oid_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_class_relname_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_pubname_index
    index info: 2048 total in 2 blocks; 592 free (3 chunks); 1456 used: pg_statistic_relid_att_inh_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_cast_source_target_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_transform_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_collation_oid_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amop_fam_strat_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indexrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_template_tmplname_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_ts_config_map_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_opclass_oid_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_evtname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_statistic_ext_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_dict_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_oid_index
    index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_conversion_default_index
    index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_operator_oprname_l_r_n_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_trigger_tgrelid_tgname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_enum_typid_label_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_config_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_user_mapping_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_opfamily_am_name_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_table_relid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_type_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_aggregate_fnoid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_constraint_oid_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_rewrite_rel_rulename_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_parser_prsname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_config_cfgname_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_parser_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_publication_rel_prrelid_prpubid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_operator_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_namespace_nspname_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_template_oid_index
    index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_amop_opr_fam_index
    index info: 2048 total in 2 blocks; 672 free (3 chunks); 1376 used: pg_default_acl_role_nsp_obj_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_collation_name_enc_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_rel_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_range_rngtypid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_dict_dictname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_type_typname_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_opfamily_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_proc_proname_args_nsp_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_partitioned_table_partrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_transform_type_lang_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_proc_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_namespace_oid_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amproc_fam_proc_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_name_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_attribute_relid_attnam_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_conversion_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_user_mapping_user_server_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_subscription_rel_srrelid_srsubid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_sequence_seqrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_conversion_name_nsp_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_member_role_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_subscription_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_tablespace_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_shseclabel_object_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roname_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_datname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_subscription_subname_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_replication_origin_roiident_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_role_member_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_rolname_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 7256 free (1 chunks); 936 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 2354072632 bytes in 269647 blocks; 5754640 free (17559 chunks); 2348317992 used

> Was postgres locally compiled, packaged by distribution, or PGDG RPM/DEB ?
Locally compiled. I just recompiled with --enable-debug, ready to deploy that to create a core dump and check it out.
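
By the way, if the backtraces come out full of <optimized out>, a build
along these lines makes gdb a lot more readable (a sketch, not the exact
invocation used here; --enable-debug already adds -g, and -O0 keeps the
compiler from optimizing variables away):

    ./configure --enable-debug CFLAGS="-O0 -g"
    make && make install
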
> Can you show \d businessoperation ?
               Table "reports.businessoperation"
          Column           |          Type          | Modifiers
---------------------------+------------------------+-----------
 documentinternalid        | character varying(255) |
 is_current                | character(1)           |
 documentid                | character varying(555) |
 documenttypecode          | character varying(512) |
 subjectroleinternalid     | character varying(255) |
 subjectentityinternalid   | character varying(255) |
 subjectentityid           | character varying(555) |
 subjectentityidroot       | character varying(555) |
 subjectentityname         | character varying      |
 subjectentitytel          | text                   |
 subjectentityemail        | text                   |
 otherentityinternalid     | character varying(255) |
 confidentialitycode       | character varying(512) |
 actinternalid             | character varying(255) |
 operationcode             | character varying(512) |
 operationname             | text                   |
 operationqualifiercode    | character varying(512) |
 operationqualifiername    | character varying(512) |
 approvalnumber            | character varying(555) |
 approvalnumbersystem      | character varying(555) |
 approvalstatecode         | character varying(512) |
 approvalstatecodesystem   | character varying(512) |
 approvaleffectivetimelow  | character varying(512) |
 approvaleffectivetimehigh | character varying(512) |
 approvalstatuscode        | character varying(32)  |
 licensecode               | character varying(512) |
 agencyid                  | character varying(555) |
 agencyname                | text                   |
 productitemcode           | character varying(512) |
 productinternalid         | character varying(255) |

> Could you rerun the query with \set VERBOSITY verbose to show the file/line
> that's failing ?

Here goes:

integrator=# \set VERBOSITY verbose
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
ERROR:  53200: out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION:  MemoryContextAlloc, mcxt.c:798

You'll notice that I set ENABLE_NESTLOOP to off; that is because the planner thinks the nested-loop plan is marginally more efficient, but in fact that plan takes 5 hours to get to the same out-of-memory crash, while the no-NL plan gets there in half an hour. That verbose setting didn't help much, I guess.

> If you wanted to show a stack trace, you could attach gdb to PID from SELECT
> pg_backend_pid(), "b"reak on errdetail, run the query, and then "bt" when it
> fails.
gdb -p 27930
GNU gdb (GDB) Red Hat Enterprise Linux 8.0.1-30.amzn2.0.3
...
Attaching to process 27930
Reading symbols from /usr/local/pgsql/bin/postgres...done.
...
(gdb) b errdetail
Breakpoint 1 at 0x82b210: file elog.c, line 872.
(gdb) cont
Continuing.
Breakpoint 1, errdetail (fmt=fmt@entry=0x9d9958 "Failed on request of size %zu in memory context \"%s\".") at elog.c:872
872     {
(gdb) bt
#0  errdetail (fmt=fmt@entry=0x9d9958 "Failed on request of size %zu in memory context \"%s\".") at elog.c:872
#1  0x000000000084e320 in MemoryContextAlloc (context=0x1111600, size=size@entry=32800) at mcxt.c:794
#2  0x000000000060ce7a in dense_alloc (size=384, size@entry=381, hashtable=<optimized out>, hashtable=<optimized out>)
    at nodeHash.c:2696
#3  0x000000000060d788 in ExecHashTableInsert (hashtable=hashtable@entry=0x10ead08, slot=<optimized out>, hashvalue=194758122)
    at nodeHash.c:1614
#4  0x0000000000610c6f in ExecHashJoinNewBatch (hjstate=0x10806b0) at nodeHashjoin.c:1051
#5  ExecHashJoinImpl (parallel=false, pstate=0x10806b0) at nodeHashjoin.c:539
#6  ExecHashJoin (pstate=0x10806b0) at nodeHashjoin.c:565
#7  0x000000000061ce4e in ExecProcNode (node=0x10806b0) at ../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x1080490) at nodeSort.c:107
#9  0x000000000061d2c4 in ExecProcNode (node=0x1080490) at ../../../src/include/executor/executor.h:247
#10 ExecUnique (pstate=0x107ff60) at nodeUnique.c:73
#11 0x0000000000619732 in ExecProcNode (node=0x107ff60) at ../../../src/include/executor/executor.h:247
#12 ExecModifyTable (pstate=0x107fd20) at nodeModifyTable.c:2025
#13 0x00000000005f75ba in ExecProcNode (node=0x107fd20) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0x7f0442721998, direction=<optimized out>, numberTuples=0,
    sendTuples=<optimized out>, operation=CMD_INSERT, use_parallel_mode=<optimized out>, planstate=0x107fd20, estate=0x107f830)
    at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x1086880, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000072a972 in ProcessQuery (plan=<optimized out>,
    sourceText=0xf4a710 "INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;", params=0x0,
    queryEnv=0x0, dest=0x7f0442721998, completionTag=0x7fff2e4cad30 "") at pquery.c:161
#17 0x000000000072abb0 in PortalRunMulti (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true,
    setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x7f0442721998, altdest=altdest@entry=0x7f0442721998,
    completionTag=completionTag@entry=0x7fff2e4cad30 "") at pquery.c:1286
#18 0x000000000072b661 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807,
    isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x7f0442721998,
    altdest=altdest@entry=0x7f0442721998, completionTag=0x7fff2e4cad30 "") at pquery.c:799
#19 0x00000000007276e8 in exec_simple_query (
    query_string=0xf4a710 "INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;") at postgres.c:1145
#20 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xf76ce8, dbname=<optimized out>,
    username=<optimized out>) at postgres.c:4182
#21 0x00000000006be215 in BackendRun (port=0xf6dfe0) at postmaster.c:4361
#22 BackendStartup (port=0xf6dfe0) at postmaster.c:4033
#23 ServerLoop () at postmaster.c:1706
#24 0x00000000006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#25 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228

That's it.

Thank you all very much for your interest in this case.

-Gunther



Re: Out of Memory errors are frustrating as heck!

Tom Lane-2
Gunther <[hidden email]> writes:
>        ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used

Oooh, that looks like a memory leak right enough.  The ExecutorState
should not get that big for any reasonable query.

Your error and stack trace show a failure in HashBatchContext,
which is probably the last of these four:

>            HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
>            HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
>            HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
>            HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 chunks); 100703776 used

Perhaps that's more than it should be, but it's silly to obsess over 100M
when there's a 2.2G problem elsewhere.  I think it's likely that it was
just coincidence that the failure happened right there.  Unfortunately,
that leaves us with no info about where the actual leak is coming from.

The memory map shows that there were three sorts and four hashes going
on, so I'm not sure I believe that this corresponds to the query plan
you showed us before.

Any chance of extracting a self-contained test case that reproduces this?

                        regards, tom lane



Re: Out of Memory errors are frustrating as heck!

Gunter
On 4/14/2019 23:24, Tom Lane wrote:
>>         ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used
> Oooh, that looks like a memory leak right enough.  The ExecutorState
> should not get that big for any reasonable query.
2.2 GB is massive yes.
> Your error and stack trace show a failure in HashBatchContext,
> which is probably the last of these four:
>
>>             HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
>>             HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
>>             HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
>>             HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 chunks); 100703776 used
> Perhaps that's more than it should be, but it's silly to obsess over 100M
> when there's a 2.2G problem elsewhere.
Yes.
>    I think it's likely that it was
> just coincidence that the failure happened right there.  Unfortunately,
> that leaves us with no info about where the actual leak is coming from.

Strange, though, that my vmstat tracking never showed the cache-allocated
memory dropping much below 6 GB. Even with this 2.2 GB memory leak, and
even if I had 2 GB of shared_buffers, the OS should still have enough
memory to give me.

Is there any chance this might be a problem with Linux? Because if you
want, I can whip out a FreeBSD machine, compile pgsql, attach the same
disk, and try it there. I have been longing for a reason to move back to
FreeBSD anyway. But I have tons of stuff to do, so unless you have reason
to suspect Linux of doing something wrong here, I would prefer to skip
that futile attempt.

> The memory map shows that there were three sorts and four hashes going
> on, so I'm not sure I believe that this corresponds to the query plan
> you showed us before.
Like I said, the first explain was not run with the same settings (nested
loops were still enabled). What I sent last should all be consistent:
memory dump, explain plan, and gdb backtrace.
> Any chance of extracting a self-contained test case that reproduces this?

With 18 million rows involved in the base tables, hardly.

But I am ready to try whatever else you want me to do with the debugger.
If we have a memory leak issue, we might as well try to plug it!

I could even give one of you access to the system that runs this.

thanks,
-Gunther



Re: Out of Memory errors are frustrating as heck!

Justin Pryzby
On Sun, Apr 14, 2019 at 11:59:45PM -0400, Gunther wrote:
> On 4/14/2019 23:24, Tom Lane wrote:
> >Any chance of extracting a self-contained test case that reproduces this?
> With 18 million rows involved in the base tables, hardly.

Were you able to reproduce the problem with SELECT (without INSERT) ?
How many rows does it output ?  Show explain analyze if possible.  If that
still errors, can you make it work with a small enough LIMIT ?
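
For example, something like this (the row count is arbitrary):

    explain analyze SELECT * FROM reports.v_BusinessOperation LIMIT 1000;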

We haven't seen the view - maybe it's very complicated, but can you reproduce
with a simpler one ?  Fewer joins ?  Or fewer join conditions ?

Justin



Re: Out of Memory errors are frustrating as heck!

Jeff Janes
In reply to this post by Gunter
On Sun, Apr 14, 2019 at 11:04 PM Gunther <[hidden email]> wrote:
> Could you rerun the query with \set VERBOSITY verbose to show the file/line
> that's failing ?
>
> Here goes:
>
> integrator=# \set VERBOSITY verbose
> integrator=# SET ENABLE_NESTLOOP TO OFF;
> SET
> integrator=# INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
> ERROR:  53200: out of memory
> DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
> LOCATION:  MemoryContextAlloc, mcxt.c:798
>
> You'll notice that I set ENABLE_NESTLOOP to off; that is because the planner thinks the nested-loop plan is marginally more efficient, but in fact that plan takes 5 hours to get to the same out-of-memory crash, while the no-NL plan gets there in half an hour. That verbose setting didn't help much, I guess.

I think the backtrace from the enable_nestloop=on plan would be more useful.  Here something has filled up memory, and then we see HashBatchContext trip over it.  But that allocation isn't the one that caused the problem, so the backtrace doesn't help.  With the previous plan, it was an allocation into ExecutorState which tripped over the problem, and that one likely comes from the same series of allocations that caused it.

To get it to happen faster, maybe you could run the server with a small setting of "ulimit -v"?  Or, you could try to capture it live in gdb. Unfortunately I don't know how to set a breakpoint for allocations into a specific context, and setting a breakpoint for any memory allocation is probably going to fire too often to be useful.
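
For instance, something along these lines (a sketch; the 4 GB cap and the
data directory path are placeholders):

    ulimit -v 4194304              # cap this shell and its children at ~4 GB of address space (kB units)
    pg_ctl -D /path/to/data start  # a postmaster started from this shell inherits the cap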

Yes, the verbose option didn't help (but the gdb backtrace made up for it--kind of--we really need the backtrace of the allocations into ExecutorState).  It isn't helpful to know that a memory allocation failed in the mcxt.c code.  Too bad it doesn't report the location of the caller of that code.  I know in Perl you can use Carp::croak to do that, but I don't know how to do it in C.

But really the first thing I want to know now is what if you just do the select, without the insert? 

explain analyze SELECT * FROM reports.v_BusinessOperation

If that works, what about "create temporary table foo as SELECT * FROM reports.v_BusinessOperation" ?

And if that works, what about "INSERT INTO reports.BusinessOperation SELECT * FROM foo"?

If the ERROR happens in the first or last of these, it might be much easier to analyze in that simplified context. If it happens in the middle one, then we probably haven't achieved much. (And if there is no ERROR at all, then you have a workaround, but we still haven't found the fundamental bug.)

Are you not showing the view definition for proprietary reasons, or just because you don't think it will be useful? If the latter, please send it as an attachment; I can't guarantee it will be useful, but there is only one way to find out.

Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

Jeff Janes
In reply to this post by Gunter
On Sun, Apr 14, 2019 at 11:59 PM Gunther <[hidden email]> wrote:
 
> Is there any chance this might be a problem with Linux? Because if you
> want, I can whip out a FreeBSD machine, compile pgsql, attach the same
> disk, and try it there. I have been longing for a reason to move back to
> FreeBSD anyway. But I have tons of stuff to do, so unless you have reason
> to suspect Linux of doing something wrong here, I would prefer to skip
> that futile attempt.

I think the PostgreSQL leak would exist regardless of how ungracefully Linux handles it.  So repeating it on BSD probably wouldn't help us here.  If you want to take up the 2nd issue with the kernel folks, having some evidence from BSD might be helpful for that (though not very likely), but that would be for a different mailing list.

Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

Tomas Vondra-4
In reply to this post by Gunter
On Sun, Apr 14, 2019 at 11:59:45PM -0400, Gunther wrote:

>On 4/14/2019 23:24, Tom Lane wrote:
>>>        ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used
>>Oooh, that looks like a memory leak right enough.  The ExecutorState
>>should not get that big for any reasonable query.
>2.2 GB is massive yes.
>>Your error and stack trace show a failure in HashBatchContext,
>>which is probably the last of these four:
>>
>>>            HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
>>>            HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
>>>            HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
>>>            HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 chunks); 100703776 used
>>Perhaps that's more than it should be, but it's silly to obsess over 100M
>>when there's a 2.2G problem elsewhere.
>Yes.
>>   I think it's likely that it was
>>just coincidence that the failure happened right there.  Unfortunately,
>>that leaves us with no info about where the actual leak is coming from.
>
>Strange, though, that my vmstat tracking never showed the cache-allocated
>memory dropping much below 6 GB. Even with this 2.2 GB memory leak, and
>even if I had 2 GB of shared_buffers, the OS should still have enough
>memory to give me.
>

Depends on how the kernel is configured. What are vm.overcommit_memory
and vm.overcommit_ratio set to, for example?

It may easily be the case that the kernel is only allowing 50% of RAM to
be committed to user space, and then refusing to allocate more despite
having free memory. That's a fairly common issue on swapless systems.
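
For reference, with vm.overcommit_memory=2 the kernel computes the limit
roughly as (ignoring huge pages):

    CommitLimit = SwapTotal + MemTotal * vm.overcommit_ratio / 100

so an 8 GB box with no swap and the default ratio of 50 ends up with a
CommitLimit of only about 4 GB, no matter how much RAM is sitting idle in
the page cache.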

Try running the query again, watch

    cat /proc/meminfo | grep Commit

and see if it crashes when Committed_AS hits the CommitLimit.

That doesn't explain where the memory leak is, though :-(


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Out of Memory errors are frustrating as heck!

Tom Lane-2
In reply to this post by Jeff Janes
Jeff Janes <[hidden email]> writes:
> To get it to happen faster, maybe you could run the server with a small
> setting of "ulimit -v"?  Or, you could try to capture it live in gdb.
> Unfortunately I don't know how to set a breakpoint for allocations into a
> specific context, and setting a breakpoint for any memory allocation is
> probably going to fire too often to be useful.

If you can use gdb at all, it's not that hard to break on allocations
into a specific context; I've done it many times.  The strategy is
basically

1. Let query run long enough for memory usage to start increasing,
then attach to backend with gdb.

2. Set breakpoint at, probably, AllocSetAlloc.  (In some cases,
reallocs could be the problem, but I doubt it here.)  Then "c".

3. When it stops, "p *context" and see if this is the context
you're looking for.  In this case, since we want to know about
allocations into ExecutorState and we know there's only one
active one, you just have to look at the context name.  In general
you might have to look at the backtrace.  Anyway, if it isn't the
one you want, just "c" until you get to an allocation into the
one you do want.

4. Once you have found out the address of the context you care
about, make the breakpoint conditional on the context argument
being that one.  It might look like this:

Breakpoint 1, AllocSetAlloc (context=0x1483be0, size=480) at aset.c:715
715     {
(gdb) p *context
$1 = {type = T_AllocSetContext, isReset = false, allowInCritSection = false,
  methods = 0xa33f40, parent = 0x0, firstchild = 0x1537f30, prevchild = 0x0,
  nextchild = 0x0, name = 0xa3483f "TopMemoryContext", ident = 0x0,
  reset_cbs = 0x0}
(gdb) cond 1  context == 0x1483be0

5. Now repeatedly "c", and check the stack trace each time, for a
dozen or two times to get a feeling for where the allocations are
being requested.

In some cases you might be able to find the context address in a
more efficient way than what I suggest in #3 --- for instance,
you could instead set a breakpoint where the context is created
and snag its address immediately, or you could dig around in
backend data structures to find it.  But these ways generally
require more familiarity with the code than just watching the
requests go by.
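
A related trick while you're attached anyway: you can dump the entire
memory context tree of a live backend at any time with

    (gdb) call MemoryContextStats(TopMemoryContext)

The per-context stats go to the backend's stderr (i.e. the server log),
in the same format as the memory map shown upthread.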

> Are you not showing the view definition for proprietary reasons, or just
> because you don't think it will be useful?

As far as that goes, I think the most likely theory right now is that
some particular function being used in the view is leaking memory.
So yes, we need to see the view ... or you can try removing bits of
it to see if the leak goes away.

                        regards, tom lane



Re: Out of Memory errors are frustrating as heck!

Tomas Vondra-4
In reply to this post by Jeff Janes
On Sun, Apr 14, 2019 at 05:19:50PM -0400, Jeff Janes wrote:

>   On Sun, Apr 14, 2019 at 4:51 PM Gunther <[hidden email]> wrote:
>
>     For weeks now, I am banging my head at an "out of memory" situation.
>     There is only one query I am running on an 8 GB system, whatever I try,
>     I get knocked out on this out of memory.
>
>   Is PostgreSQL throwing an error with OOM, or is getting killed -9 by the
>   OOM killer?  Do you get a core file you can inspect with gdb?
>
>     You might want to see the query, but it is a huge plan, and I can't
>     really break this down. It shouldn't matter though. But just so you can
>     get a glimpse here is the plan:
>
> Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197)
>   ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
>        
>
>   Maybe it is memory for trigger or constraint checking, although I don't
>   know why that would appear instantly.  What triggers or constraints do you
>   have on businessoperation? 

Yeah, that would be my guess too. If I had to guess, something likely gets
confused and allocates memory in es_query_ctx instead of the per-tuple
context (es_per_tuple_exprcontext).

Triggers, constraints and expr evaluation all seem like plausible
candidates. It's going to be hard to nail the exact place, though :-(

>   What if you just run the SELECT without the INSERT?  Or insert into a temp
>   table rather than into businessoperation?  And if that doesn't crash, what
>   if you then insert to businessoperation from the temp table?
>    

Yeah. What's the schema of "businessoperation"? Anything special about
it? Triggers, expression indexes, check constraints, ...

Gunther, you mentioned you built postgres from sources. Would it be
possible to add some sort of extra debugging to see where the memory is
allocated from? It's a bit heavy-handed, though.
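
Something like this, perhaps (an untested sketch, not a real patch):

    /* near the top of AllocSetAlloc() in src/backend/utils/mmgr/aset.c:
     * crudely flag every allocation made in the suspect context.  fprintf
     * rather than elog, so we don't recurse into the memory machinery. */
    if (context->name && strcmp(context->name, "ExecutorState") == 0)
        fprintf(stderr, "ExecutorState alloc: %zu bytes\n", size);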

Or maybe splitting es_query_ctx into smaller contexts. That might be
easier to evaluate than sifting through god-knows-how-many GBs of log.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Out of Memory errors are frustrating as heck!

Tom Lane-2
In reply to this post by Tom Lane-2
I wrote:
> If you can use gdb at all, it's not that hard to break on allocations
> into a specific context; I've done it many times.  The strategy is
> basically
> 1. Let query run long enough for memory usage to start increasing,
> then attach to backend with gdb.

BTW, just to clarify that strategy a bit: the usage pattern we expect
for ExecutorState is that there are a bunch of individual allocations
during executor startup, but then none while the query is running
(or at least, if there are any in that phase, they get freed before
moving on to the next row).  The form of the leak, almost certainly,
is that some allocation is happening per-row and *not* getting freed.
So there's no point in groveling through the startup behavior.  What
we want to know about is what happens after we reach the ought-to-be
steady state behavior.
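
In code terms, the difference looks roughly like this (illustrative only;
do_per_row_work() is a made-up placeholder):

    MemoryContext old;
    void         *result;

    /* correct: per-row scratch memory comes from the per-tuple context,
     * which the executor resets between rows */
    old = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
    result = do_per_row_work();   /* reclaimed at the next ResetExprContext() */
    MemoryContextSwitchTo(old);

    /* leaky: the same allocation made in a query-lifetime context
     * survives until executor shutdown -- times millions of rows */
    old = MemoryContextSwitchTo(estate->es_query_cxt);
    result = do_per_row_work();   /* never freed while the query runs */
    MemoryContextSwitchTo(old);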

                        regards, tom lane



Re: Out of Memory errors are frustrating as heck!

Gunter
In reply to this post by Jeff Janes

OK Guys, you are very kind to continue taking an interest in this matter.

I will try what I can to help squish the bug.

Tomas Vondra just added a good idea that explains why I get the out-of-memory error while still having so much cache available:

# sysctl vm.overcommit_memory
vm.overcommit_memory = 2
# sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50

as he predicted.

# cat /proc/meminfo |grep Commit
CommitLimit:     3955192 kB
Committed_AS:    2937352 kB

So I think that explains why it turns into an out-of-memory error; we don't need to worry or wonder about that any more. I will change that parameter in the future to allow for some spikes, but that's not going to resolve the underlying memory leak.
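
For example (the ratio is illustrative, not a recommendation):

    sysctl -w vm.overcommit_ratio=90                                     # take effect now
    echo 'vm.overcommit_ratio = 90' >> /etc/sysctl.d/99-overcommit.conf  # persist across reboots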

Now I run explain analyze SELECT ... without the INSERT. 

integrator=#  \set VERBOSITY verbose
integrator=#
integrator=# \pset pager off
Pager usage is off.
integrator=# \pset format unaligned
Output format is unaligned.
integrator=# \set VERBOSITY verbose
integrator=#
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# explain analyze SELECT * FROM reports.v_BusinessOperation;
ERROR:  53200: out of memory
DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION:  MemoryContextAlloc, mcxt.c:798

And since that failed already, I guess we don't need to worry about the temporary table insert.

About adding LIMIT, I don't think it makes sense in the outer query, since the error is probably happening earlier. I did put a LIMIT 100 on one of the tables we join to, and it helped. But I don't think that really tells us anything.

Then yes, I can try the backtrace with the NLs enabled. It will just take a long, long time, and unfortunately it is extremely likely that I would lose the console and then be unable to get back to it. OK, screen(1) solves that problem too. Will do, after I have reported the above.

But now you have already produced more ideas ...

> Maybe it is memory for trigger or constraint checking, although I don't
> know why that would appear instantly.  What triggers or constraints do you
> have on businessoperation?
>
> Yeah, that would be my guess too. If I had to guess, something likely gets
> confused and allocates memory in es_query_ctx instead of the per-tuple
> context (es_per_tuple_exprcontext).
>
> Triggers, constraints and expr evaluation all seem like plausible
> candidates. It's going to be hard to nail the exact place, though

I think triggers and constraints are ruled out, because the problem happens without the INSERT.

That leaves us with expression evaluation. And OK, now you really wanna see the query, although it should be in the plan too. But for what it is worth:

SELECT DISTINCT
        documentInternalId, is_current,
	documentId,
	documentTypeCode,
	subjectRoleInternalId,
	subjectEntityInternalId,
	subjectEntityId,
	subjectEntityIdRoot,
	subjectEntityName,
	subjectEntityTel,
	subjectEntityEmail,
	otherEntityInternalId,
	confidentialityCode,
	actInternalId, 
	code_code as operationCode, 
	code_displayName AS operationName,
	operationQualifierCode,
	operationQualifierName,
	approvalNumber,
	approvalNumberSystem,
	approvalStateCode,
	approvalStateCodeSystem,
	approvalEffectiveTimeLow,
	approvalEffectiveTimeHigh,
	approvalStatusCode,
	licenseCode,
	agencyId,
	agencyName,
	productItemCode,
	productInternalId
  FROM reports.DocumentInformationSubject 
  LEFT OUTER JOIN (SELECT documentInternalId, documentId, actInternalId,
	                  subjectEntityCode as productItemCode,
			  subjectEntityInternalId as productInternalId
		     FROM reports.DocumentInformationSubject
		    WHERE participationTypeCode = 'PRD') prd
    USING(documentInternalId, documentId, actInternalId)
  LEFT OUTER JOIN (
    SELECT documentInternalId,
           q.code_code AS operationQualifierCode,
  	   q.code_displayName AS operationQualifierName,
  	   r.targetInternalId AS actInternalId,
	   actInternalId AS approvalInternalId,
  	   an.extension AS approvalNumber,
  	   an.root AS approvalNumberSystem,
	   qs.subjectEntityCode AS approvalStateCode,
	   qs.subjectEntityCodeSystem AS approvalStateCodeSystem,
  	   qs.effectivetime_low AS approvalEffectiveTimeLow,
  	   qs.effectivetime_high AS approvalEffectiveTimeHigh,
  	   qs.statusCode AS approvalStatusCode,
	   qs.code_code AS licenseCode,
	   agencyId.extension AS agencyId,
	   agencyName.trivialName AS agencyName
      FROM reports.DocumentInformation q
      LEFT OUTER JOIN (SELECT * FROM reports.DocumentInformationSubject WHERE participationTypeCode = 'AUT') qs
        USING(documentInternalId, actInternalId)
      INNER JOIN integrator.ActRelationship r 
        ON(    r.sourceInternalId = actInternalId 
  	   AND r.typeCode = 'SUBJ')
      LEFT OUTER JOIN integrator.Act_id an USING(actInternalId)
      LEFT OUTER JOIN integrator.Entity_id agencyId ON(agencyId.entityInternalId = otherEntityInternalId)
      LEFT OUTER JOIN reports.BestName agencyName ON(agencyName.entityInternalId = otherEntityInternalId)
     WHERE q.classCode = 'CNTRCT'
       AND q.moodCode = 'EVN'
       AND q.code_codeSystem = '2.16.840.1.113883.3.26.1.1'      
    ) q
    USING(documentInternalId, actInternalId)
 WHERE classCode = 'ACT' 
   AND moodCode = 'DEF' 
   AND code_codeSystem = '2.16.840.1.113883.3.26.1.1' 
   AND participationTypeCode IN ('PPRF','PRF');

You see that the expressions are all just equality comparisons, some IN, nothing outlandish.

Now I will try what Tom Lane suggested. Here you go. And I have it stopped at this state, so if you want me to inspect anything else, I can do it.

With screen(1) I can be sure I won't lose my stuff when my internet goes down. Nice.
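
(For example: screen -S pgdebug starts a named session, CTRL-a d detaches
it, and screen -r pgdebug reattaches after the connection drops.)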

I have one screen session with 3 windows:

  1. psql
  2. gdb
  3. misc (vmstat, etc.)

Now I have let this run for a good long time while setting up my screen stuff. And then:

ps -x 

look for the postgres job with the EXPLAIN ... that's $PID, then:

gdb -p $PID

Then first I do

cont

but then it stops at SIGUSR1, because of the parallel workers signalling each other. 

handle SIGUSR1 nostop

suppresses that stopping. Then I break with CTRL-C, and set the breakpoint where Tom Lane said:

b AllocSetAlloc

once it stops there I do

Breakpoint 1, AllocSetAlloc (context=0x1168230, size=8) at aset.c:715
715     {
(gdb) p context->name
$4 = 0x96ce5b "ExecutorState"

So I should even be able to set a conditional breakpoint.

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) b AllocSetAlloc if  strcmp(context->name, "ExecutorState") == 0
Breakpoint 2 at 0x848ed0: file aset.c, line 715.
(gdb) cont
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=10) at aset.c:715
715     {
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=152) at aset.c:715
715     {
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=201) at aset.c:715
715     {
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
715     {
(gdb) p context->name
$8 = 0x96ce5b "ExecutorState"

Nice. Now the question is, am I at the place where memory gets squeezed? And I think yes. With top:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
31752 postgres  20   0 2772964   1.2g 329640 t   0.0 16.5   8:46.59 postgres: postgres integrator [local] EXPLAIN

I guess I should run this for a little longer. So I disable my breakpoints

(gdb) info breakpoints
Num     Type           Disp Enb Address            What
2       breakpoint     keep y   0x0000000000848ed0 in AllocSetAlloc at aset.c:715
        stop only if  strcmp(context->name, "ExecutorState") == 0
        breakpoint already hit 6 times
(gdb) disable 2
(gdb) cont
Continuing.

while watching top:

31752 postgres  20   0 2777060   1.3g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN

31752 postgres  20   0 2777060   1.4g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN

31752 postgres  20   0 2777060   1.5g 329920 D  33.2 17.9   8:52.07 postgres: postgres integrator [local] EXPLAIN

it went up pretty quickly from 1.2 GB to 1.5 GB, but then it stopped growing as fast, so now back to gdb and break:

^C
Program received signal SIGINT, Interrupt.
0x00007f048f336d71 in read () from /lib64/libpthread.so.0
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
715     {

Now I give you a bt so we have something to look at:

#0  AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
#1  0x000000000084e6cd in palloc (size=385) at mcxt.c:938
#2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x8bbc528, hashvalue=hashvalue@entry=0x7fff2e4ca76c,
    tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4  ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5  ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
#6  0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
#7  0x000000000061ce4e in ExecProcNode (node=0x11688e0) at ../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x11687d0) at nodeSort.c:107
#9  0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
#10 0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x00000000005fde68 in ExecProcNodeInstr (node=0x11685e0) at execProcnode.c:461
#13 0x00000000005f75ba in ExecProcNode (node=0x11685e0) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,
    sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x11685e0, estate=0x1168340)
    at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000059c6f8 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1199a68, into=into@entry=0x0, es=es@entry=0x1141d48,
    queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7fff2e4ca990) at explain.c:535
#17 0x000000000059c9ef in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x1141d48,
    queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=0x0, queryEnv=0x0)
    at explain.c:371
#18 0x000000000059ce37 in ExplainQuery (pstate=pstate@entry=0xf74608, stmt=stmt@entry=0x11ef240,
    queryString=queryString@entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,
    params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0xf74578) at explain.c:254
#19 0x000000000072ca5d in standard_ProcessUtility (pstmt=0x11ef390,
    queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x000000000072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, isTopLevel=<optimized out>,
    setHoldSnapshot=<optimized out>, dest=<optimized out>, completionTag=0x7fff2e4cab20 "") at pquery.c:1178
#21 0x000000000072add2 in FillPortalStore (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true) at pquery.c:1038
#22 0x000000000072b855 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807,
    isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0xf4c570, altdest=altdest@entry=0xf4c570,
    completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x00000000007276e8 in exec_simple_query (
    query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...) at postgres.c:1145
#24 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xf76ce8, dbname=<optimized out>,
    username=<optimized out>) at postgres.c:4182
#25 0x00000000006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x00000000006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#29 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228

But who knows if that's it. I continue and watch top again...

31752 postgres  20   0 3112352   1.8g 329920 D  32.2 23.7   9:43.75 postgres: postgres integrator [local] EXPLAIN

it went quickly to 1.6, then after some time to 1.7, then 1.8, and I stop again:

^C
Program received signal SIGINT, Interrupt.
0x00007f048f336d71 in read () from /lib64/libpthread.so.0
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
715     {
bt
#0  AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
#1  0x000000000084e6cd in palloc (size=375) at mcxt.c:938
#2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x21df688, hashvalue=hashvalue@entry=0x7fff2e4ca76c,
    tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3  0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4  ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5  ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
#6  0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
#7  0x000000000061ce4e in ExecProcNode (node=0x11688e0) at ../../../src/include/executor/executor.h:247
#8  ExecSort (pstate=0x11687d0) at nodeSort.c:107
#9  0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
#10 0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x00000000005fde68 in ExecProcNodeInstr (node=0x11685e0) at execProcnode.c:461
#13 0x00000000005f75ba in ExecProcNode (node=0x11685e0) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,
    sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x11685e0, estate=0x1168340)
    at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000059c6f8 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1199a68, into=into@entry=0x0, es=es@entry=0x1141d48,
    queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7fff2e4ca990) at explain.c:535
#17 0x000000000059c9ef in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x1141d48,
    queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=0x0, queryEnv=0x0)
    at explain.c:371
#18 0x000000000059ce37 in ExplainQuery (pstate=pstate@entry=0xf74608, stmt=stmt@entry=0x11ef240,
    queryString=queryString@entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,
    params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0xf74578) at explain.c:254
#19 0x000000000072ca5d in standard_ProcessUtility (pstmt=0x11ef390,
    queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x000000000072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, isTopLevel=<optimized out>,
    setHoldSnapshot=<optimized out>, dest=<optimized out>, completionTag=0x7fff2e4cab20 "") at pquery.c:1178
#21 0x000000000072add2 in FillPortalStore (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true) at pquery.c:1038
#22 0x000000000072b855 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807,
    isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0xf4c570, altdest=altdest@entry=0xf4c570,
    completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x00000000007276e8 in exec_simple_query (
    query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n        documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...) at postgres.c:1145
#24 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xf76ce8, dbname=<optimized out>,
    username=<optimized out>) at postgres.c:4182
#25 0x00000000006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x00000000006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#29 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228

Good, now I leave this all sitting like that for you to ask me what else you might want to see.

We are now close to the edge of the cliff.

-Gunther



Re: Out of Memory errors are frustrating as heck!

Jeff Janes
In reply to this post by Tom Lane-2


On Mon, Apr 15, 2019 at 11:28 AM Tom Lane <[hidden email]> wrote:
> [... Tom's gdb recipe for breaking on allocations into a specific memory
> context, quoted in full in his message above, snipped ...]


Thanks for the recipe.  I can use gdb at all, just not very skillfully :)
 
With that as a starting point, experimentally, this seems to work to short circuit the loop described in your step 3 (which I fear could be thousands of iterations in some situations):

cond 1 strcmp(context.name,"ExecutorState")==0

Also, I've found that in the last few versions of PostgreSQL, processes might get unreasonable numbers of SIGUSR1 (maybe due to parallelization?) and so to avoid having to stand on the 'c' button, you might need this: 

handle SIGUSR1 noprint nostop
 
Cheers,

Jeff

Re: Out of Memory errors are frustrating as heck!

Tom Lane-2
In reply to this post by Gunter
Gunther <[hidden email]> writes:
> Now I give you a bt so we have something to look at:

> #0  AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
> #1  0x000000000084e6cd in palloc (size=385) at mcxt.c:938
> #2  0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x8bbc528, hashvalue=hashvalue@entry=0x7fff2e4ca76c,
>      tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277

I'm pretty sure that's not the droid we're looking for.
ExecHashJoinGetSavedTuple does palloc a new tuple, but it immediately
sticks it into a TupleTableSlot that will be responsible for freeing
it (when the next tuple is stuck into the same slot).  I'd suggest
continuing a few times and looking for other code paths leading
to AllocSetAlloc in this context.
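
If that gets tedious, gdb can do the sampling for you (assuming the
conditional breakpoint from before is still breakpoint 2):

    (gdb) commands 2
    > bt 8
    > c
    > end

That prints an 8-frame backtrace at every qualifying allocation and then
continues on its own; interrupt after a while and see which call sites
dominate.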

My first thought on noticing the SELECT DISTINCT was that you might be
hitting the grouping-function-related leak that Andres fixed in 9cf37a527;
but that fix did make it into 11.2 (by just a couple of days...).  Still,
maybe there's another issue in the same area.

                        regards, tom lane

