Loading 500m json files to database


Loading 500m json files to database

pinker
Hi, do you maybe have an idea how to make the loading process faster?

I have 500 million JSON files (one JSON document per file) that I need to load
into the db.
My test set is "only" 1 million files.

What I have come up with so far is:

time for i in datafiles/*; do
  psql -c "\copy json_parts(json_data) FROM $i"&
done

which is the fastest approach I have found, but it's still not what I expect:
loading 1m files takes me ~3h, so loading 500 times more is just unacceptable.

Some facts:
* the target db is in the cloud, so there is no option to do tricks like
turning fsync off
* Postgres version 11
* I can spin up a huge Postgres instance if necessary in terms of CPU/RAM
* I have already tried hash partitioning (writing to 10 different tables
instead of 1)


Any ideas?






Re: Loading 500m json files to database

Ertan Küçükoğlu

> On 23 Mar 2020, at 13:20, pinker <[hidden email]> wrote:
> [...]
Hello,

I may not be knowledgeable enough to answer your question.

However, if possible, you may consider using a local physical computer to do all the loading and afterwards do a backup/restore onto the cloud system.

A compressed backup will mean far less internet traffic compared to direct data inserts.

Moreover, you can then do the additional tricks you mentioned.

Thanks & regards,
Ertan





Re: Loading 500m json files to database

Andrei Zhidenkov
Try writing a stored procedure (probably PL/Python) that accepts an array of JSON objects, so that data can be loaded in chunks (100-1000 files at a time), which should be faster.

> On 23. Mar 2020, at 12:49, Ertan Küçükoğlu <[hidden email]> wrote:
> [...]
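A rough client-side sketch of this chunking idea (useful if PL/Python cannot be installed on a managed instance): each chunk of documents is sent as a single array parameter and unnested server-side. It assumes psycopg2 and locally readable files; the DSN and chunk size are placeholders to tune.

import glob

import psycopg2

CHUNK = 500                           # documents per INSERT; tune empirically
DSN = "host=... dbname=... user=..."  # placeholder connection string

def load_chunked(files):
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        for i in range(0, len(files), CHUNK):
            # read one chunk of JSON documents into memory
            docs = [open(p, encoding="utf-8").read() for p in files[i:i + CHUNK]]
            # one INSERT per chunk; the text[] parameter is unnested server-side
            cur.execute(
                "INSERT INTO json_parts (json_data) "
                "SELECT unnest(%s::text[])::jsonb",
                (docs,),
            )
            conn.commit()

load_chunked(sorted(glob.glob("datafiles/*")))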




Re: Loading 500m json files to database

lup


> On Mar 23, 2020, at 5:59 AM, Andrei Zhidenkov <[hidden email]> wrote:
> [...]

Drop any and all indices




Re: Loading 500m json files to database

Christopher Browne-3
In reply to this post by pinker


On Mon, 23 Mar 2020 at 06:24, pinker <[hidden email]> wrote:
[...]

time for i in datafiles/*; do
  psql -c "\copy json_parts(json_data) FROM $i"&
done

[...]


Well, you're paying for a lot of overhead in that, as you're
establishing a psql command, connecting to a database, spawning a backend
process, starting a transaction, committing a transaction, closing the backend
process, disconnecting from the database, and cleaning up after the launching
of the psql command.  And you're doing that 500 million times.

The one thing I left off that list was the loading of a single tuple into json_parts.

What you could do to improve things quite a lot would be to group some number
of those files together, so that each time you pay for the overhead, you at least
get the benefit of loading several entries into json_parts.

So, loosely, I'd recommend using /bin/cat (or similar) to assemble several files together
into one, and then \copy that one file in.

Having 2 tuples loaded at once drops overhead by 50%
Having 10 tuples loaded at once drops overhead by 90%
Having 100 tuples loaded at once drops overhead by 99%
Having 1000 tuples loaded at once drops overhead by 99.9%

There probably isn't too much real value to going past 1000 tuples per batch; the
overhead, by that point, is getting pretty immaterial.

Reducing that overhead is the single most important thing you can do.

It is also quite likely that you could run such streams in parallel, although
it would require quite a bit more information about the I/O capabilities of your
hardware to know if that would do any good.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
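A minimal sketch of this batching approach, in Python rather than /bin/cat, since each row of COPY's text format has to end up escaped onto a single line anyway. It assumes psycopg2 on the client, one JSON document per file as described, and placeholder values for the DSN and batch size.

import glob
import io

import psycopg2

BATCH_SIZE = 1000            # ~1000 rows per COPY makes the per-call overhead negligible
DSN = "host=... dbname=..."  # placeholder connection string

def esc(doc):
    # COPY text format: double backslashes, keep each row on one line
    return (doc.replace("\\", "\\\\")
               .replace("\n", "\\n")
               .replace("\r", "\\r")
               .replace("\t", "\\t"))

def load(files):
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        for i in range(0, len(files), BATCH_SIZE):
            buf = io.StringIO()
            for path in files[i:i + BATCH_SIZE]:
                with open(path, encoding="utf-8") as f:
                    buf.write(esc(f.read()) + "\n")
            buf.seek(0)
            # one COPY and one commit per batch instead of one per file
            cur.copy_expert("COPY json_parts (json_data) FROM STDIN", buf)
            conn.commit()

load(sorted(glob.glob("datafiles/*")))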

Re: Loading 500m json files to database

lup
In reply to this post by pinker


On 3/23/20 4:24 AM, pinker wrote:

> [...]
Most advanced languages have a bulk copy implementation.  I've found this to be blindingly fast when the receiving table has no indices or constraints.  It's not clear how large your files are, but you might take this time to "normalize" them: extract any id, datatype, etc. into table attributes.





Re: Loading 500m json files to database

Adrian Klaver-4
In reply to this post by pinker
On 3/23/20 3:24 AM, pinker wrote:

> [...]

Aggregating the JSON files as others have suggested would help greatly.

Knowing what is happening in json_parts() might help folks provide
further tips.




--
Adrian Klaver
[hidden email]



Re: Loading 500m json files to database

pinker
In reply to this post by Christopher Browne-3
Christopher Browne-3 wrote

> What you could do to improve things quite a lot would be to group some
> number of those files together, so that each time you pay for the overhead,
> you at least get the benefit of loading several entries into json_parts.
> [...]
> Reducing that overhead is the single most important thing you can do.

Yes, I was thinking about that, but I have no idea how to do it right now. Like
some kind of outer loop to concatenate those files? And adding a delimiter
between them?


Christopher Browne-3 wrote
> It is also quite likely that you could run such streams in parallel,
> although it would require quite a bit more information about the I/O
> capabilities of your hardware to know if that would do any good.

I can spin up any size of instance.
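One possible shape for that outer loop, as a hedged sketch: COPY's text format separates rows with a newline, so the only "delimiter" work is getting each document onto one line with backslashes doubled. Re-serialising with json.dumps flattens any pretty-printed files onto a single line; the paths and batch size below are placeholders.

import glob
import json

BATCH_SIZE = 1000
files = sorted(glob.glob("datafiles/*"))

for n, i in enumerate(range(0, len(files), BATCH_SIZE)):
    with open("batch_%06d.copy" % n, "w", encoding="utf-8") as out:
        for path in files[i:i + BATCH_SIZE]:
            with open(path, encoding="utf-8") as f:
                doc = json.dumps(json.load(f))           # one line, no raw tabs/newlines
            out.write(doc.replace("\\", "\\\\") + "\n")  # double backslashes for COPY

Each batch file is then a valid input for the existing command, e.g.
psql -c "\copy json_parts(json_data) FROM batch_000000.copy"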








Re: Loading 500m json files to database

pinker
In reply to this post by Ertan Küçükoğlu
Ertan Küçükoğlu wrote
> However, if possible, you may consider using a local physical computer to
> do all the loading and afterwards do a backup/restore onto the cloud system.
>
> A compressed backup will mean far less internet traffic compared to direct
> data inserts.

I was thinking about that, but the data source is blob storage, so downloading
it first and then loading it locally would be a couple of extra days of
processing :/ It's not that fast even when I'm doing it locally... so that
would be like 2 extra steps of overhead :/






Re: Loading 500m json files to database

pinker
In reply to this post by Andrei Zhidenkov
it's a cloud service and no plpythonu extension is available, unfortunately






Re: Loading 500m json files to database

pinker
In reply to this post by lup
there are no indexes, foreign keys, or any other constraints






Re: Loading 500m json files to database

pinker
In reply to this post by Adrian Klaver-4
Hi,
json_parts is just a single table with 2 columns:

                                                Table "public.json_parts"
  Column    |  Type   | Collation | Nullable |                 Default                 | Storage | Stats target | Description
------------+---------+-----------+----------+-----------------------------------------+---------+--------------+-------------
 id         | integer |           | not null | nextval('json_parts_id_seq'::regclass)  | plain   |              |
 json_data  | jsonb   |           |          |                                         |         |              |

no indexes, constraints or anything else






Re: Loading 500m json files to database

Adrian Klaver-4
On 3/23/20 5:26 PM, pinker wrote:
> Hi,
> json_parts is just a single table with 2 columns:

Well I misread that.

> [...]


--
Adrian Klaver
[hidden email]



Re: Loading 500m json files to database

Adrian Klaver-4
In reply to this post by pinker
On 3/23/20 5:23 PM, pinker wrote:
> it's a cloud service and no plpythonu extension is available, unfortunately

I presume Python itself is available, so would it not be possible to
create a program that concatenates the files into batches and COPY(s)
that data into Postgres using the Psycopg2 COPY functions:

https://www.psycopg.org/docs/usage.html#using-copy-to-and-copy-from
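For illustration, a bare-bones sketch of the copy_from() flavour from that page; the two inline documents stand in for pre-escaped file contents (backslashes doubled, no literal newlines or tabs), and the DSN is a placeholder.

import io

import psycopg2

conn = psycopg2.connect("host=... dbname=...")   # placeholder DSN
cur = conn.cursor()

buf = io.StringIO()
for doc in ('{"a": 1}', '{"b": 2}'):             # stand-ins for escaped file contents
    buf.write(doc + "\n")
buf.seek(0)

# default column separator is a tab, default NULL is \N; one row per line
cur.copy_from(buf, "json_parts", columns=("json_data",))
conn.commit()
conn.close()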



--
Adrian Klaver
[hidden email]



Re: Loading 500m json files to database

David G Johnston
In reply to this post by pinker
On Mon, Mar 23, 2020 at 3:24 AM pinker <[hidden email]> wrote:
time for i in datafiles/*; do
  psql -c "\copy json_parts(json_data) FROM $i"&
done

Don't know whether this is faster but it does avoid spinning up a connection multiple times.

#bash, linux
    function append_each_split_file_to_etl_load_script() {
        for filetoload in ./*; do
            ronumber="$(basename $filetoload)"
            # only process files since subdirs can be present
            if [[ -f "$filetoload" ]]; then
                echo ""
                echo "\set invoice"' `cat '"'""$filetoload""'"'`'
                echo ", ('$ronumber',:'invoice')"
            fi >> "$PSQLSCRIPT"
        done

        echo ""  >> "$PSQLSCRIPT"
        echo ";" >> "$PSQLSCRIPT"
        echo ""  >> "$PSQLSCRIPT"
    }

There is a bit of other related code that is needed (for my specific usage) but this is the core of it.  Use psql variables to capture the contents of each file into a variable and then just perform a normal insert (specifically, a VALUES (...), (...) variant).  Since you can intermix psql and SQL, you basically output a bloody long script, which has memory issues at scale - but you can divide and conquer - and then "psql --file bloody_long_script_part_1_of_100000.psql".

David J.


Re: Loading 500m json files to database

lup


On Mar 23, 2020, at 7:11 PM, David G. Johnston <[hidden email]> wrote:
[...]

Can one put 550M files in a single directory?  I thought it topped out at 16M or so.




Re: Loading 500m json files to database

pinker
it's in blob storage in Azure. I'm testing with the 1m files that I have locally.






Re: Loading 500m json files to database

pinker
In reply to this post by Adrian Klaver-4
Hmm, now I'm thinking maybe setting up PgBouncer in front of Postgres with
statement pooling mode would help?






Re: Loading 500m json files to database

Reid Thompson-4
In reply to this post by pinker
On Mon, 2020-03-23 at 03:24 -0700, pinker wrote:

> [...]


https://www.gnu.org/software/parallel/ 




Re: Loading 500m json files to database

Peter J. Holzer
In reply to this post by pinker
On 2020-03-23 17:18:45 -0700, pinker wrote:

> Christopher Browne-3 wrote
> > [...]
> > Reducing that overhead is the single most important thing you can do.
>
> Yes, I was thinking about that, but I have no idea how to do it right now.

Do you know any programming language (Python, Perl, ...)? You'll
probably get a huge gain from writing a script that just opens the
connection once and then inserts each file.

Copy usually is even faster by a fair amount, but since you have to read
the data for each row from a different file (and - if I understood you
correctly - a remote one at that), the additional speedup is probably not
that great in this case.

Splitting the work into batches and executing several batches in parallel
probably helps.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [hidden email]         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
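To put that last point into code: a hypothetical multiprocessing variant of the batched COPY sketched earlier in the thread, with one long-lived connection per worker. psycopg2, the DSN, the worker count and the batch size are all assumptions to be tuned against the instance's I/O.

import glob
import io
from multiprocessing import Pool

import psycopg2

DSN = "host=... dbname=..."   # placeholder connection string
BATCH_SIZE = 1000
WORKERS = 8                   # tune against CPU and I/O capacity

conn = None                   # one connection per worker process

def init_worker():
    global conn
    conn = psycopg2.connect(DSN)

def esc(doc):
    # COPY text format: double backslashes, keep each row on one line
    return (doc.replace("\\", "\\\\").replace("\n", "\\n")
               .replace("\r", "\\r").replace("\t", "\\t"))

def load_batch(batch):
    buf = io.StringIO()
    for path in batch:
        with open(path, encoding="utf-8") as f:
            buf.write(esc(f.read()) + "\n")
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert("COPY json_parts (json_data) FROM STDIN", buf)
    conn.commit()

if __name__ == "__main__":
    files = sorted(glob.glob("datafiles/*"))
    batches = [files[i:i + BATCH_SIZE] for i in range(0, len(files), BATCH_SIZE)]
    with Pool(WORKERS, initializer=init_worker) as pool:
        pool.map(load_batch, batches)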
