Hi All,
I am uploading data into PostgreSQL from a CSV file and checking whether any value is already a duplicate in the DB; if so, the upload should return a duplicate error. I am using the query below:

    if Card_Bank.objects.filter(Q(ACCOUNT_NUMBER=card_number)).exists():
        flag = 2
    else:
        flag = 1

It is taking too much time; the CSV has 600k cards. Kindly help me make this check faster.

I am using Python, Django & PostgreSQL.

--
Best Regards,
Sachin Kumar
On 30.12.20 at 08:36, Sachin Kumar wrote:
I think it would be easier not to check for duplicates beforehand, but to let the DB complain about them. That would roughly halve the round trips to the DB: instead of check + insert there would be only an insert, which might fail every now and then.

Regards,
Holger

--
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012
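A minimal sketch of this "just insert and let the constraint complain" approach in Django, assuming the Card_Bank model and ACCOUNT_NUMBER field from the original post and a UNIQUE constraint on that column:

    from django.db import IntegrityError

    def insert_card(card_number):
        try:
            # a single INSERT; the UNIQUE constraint makes PostgreSQL reject duplicates
            Card_Bank.objects.create(ACCOUNT_NUMBER=card_number)
            return 1  # inserted
        except IntegrityError:
            return 2  # duplicate

If this runs inside transaction.atomic(), each attempt needs its own inner atomic() block, otherwise a failed insert poisons the surrounding transaction.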
Hi Mr. Holger,

This will not suffice for our requirement. We have to validate that there is no duplicate value in the DB. We tried earlier to leave it to the DB to check for duplicates, and still found duplicate values in the DB. We have a table "card" with a single column "number", which we update from a CSV file with 600k numbers, and require that no duplicate number be in the table. A faster query, if there is one, would help us achieve this requirement.

Best Regards,
Sachin Kumar

On Wed, Dec 30, 2020 at 1:54 PM Holger Jakobs <[hidden email]> wrote:
On 30.12.20 at 09:59, Sachin Kumar wrote:
Quote: "leaving DB to check if there is any duplicate and found duplicate value in DB"

No, that's not possible with PostgreSQL. If you have a Primary Key or a Unique Constraint, no duplicates will be allowed by the DB. There must be a different reason.
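For illustration, the kind of constraint Holger describes, expressed against the model from the original post (the field type and max_length are assumptions):

    from django.db import models

    class Card_Bank(models.Model):
        # unique=True makes Django create a PostgreSQL UNIQUE constraint on this
        # column at migration time, so the database itself rejects duplicates
        ACCOUNT_NUMBER = models.CharField(max_length=32, unique=True)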
--
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012
On 30 Dec 2020, at 19:59, Sachin Kumar wrote:
If the constraint was in place before any data was inserted into the database, then there will be no duplicates. Adding a constraint later will not remove existing data that violates it (in fact, PostgreSQL will refuse to add a UNIQUE constraint while duplicate values are already present).
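For concreteness, trying to add the constraint to a table that already contains duplicates simply errors out; a sketch via Django's raw connection, with the table, column, and constraint names assumed:

    from django.db import connection, IntegrityError

    with connection.cursor() as cur:
        try:
            # PostgreSQL refuses with "could not create unique index" if duplicates exist
            cur.execute("ALTER TABLE card ADD CONSTRAINT card_number_uniq UNIQUE (number)")
        except IntegrityError:
            print("table already contains duplicate numbers; clean them up first")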
One trap is that the data may not be the same even if it's a "duplicate". The "number" may have spaces or other human-readable characters. It is a duplicate from the human perspective but not the machine's. These "duplicates" will also be very hard to find with a simple query. If the OP is not already doing this, I suggest the "number" be stored in the database as BIGINT, e.g. along the lines sketched below.
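A sketch of what that could look like as a Django model, using the "card"/"number" names mentioned earlier in the thread (the names and field choice are assumptions):

    from django.db import models

    class Card(models.Model):
        # stored as BIGINT in PostgreSQL; unique=True keeps out exact duplicates
        number = models.BigIntegerField(unique=True)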
Obviously the CSV will have to be checked to ensure the number is a legal integer value. There is nothing in the original post to say what is contained in the card field, and it has not been stated whether this field is also unique. Assuming it holds text, it will be more interesting to work with, as a human may see the duplication but the machine will not; but if it's only the number that has to be unique, then it is straightforward.
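A rough sketch of that check in Python (it assumes the number is in the first CSV column):

    import csv

    def read_card_numbers(path):
        # returns the numbers as ints, or raises if any value is not a plain integer
        numbers, bad_rows = [], []
        with open(path, newline="") as f:
            for lineno, row in enumerate(csv.reader(f), start=1):
                if not row:
                    continue
                value = row[0].strip()
                if value.isdigit():
                    numbers.append(int(value))
                else:
                    bad_rows.append((lineno, value))
        if bad_rows:
            raise ValueError("non-numeric card numbers at rows: %r" % bad_rows[:10])
        return numbers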
As noted by Holger Jakobs upthread, it is not possible for a query to be faster than checks done at the time the data is inserted. The data has to be inserted and indexes updated, etc. If the duplicate key constraint is violated, then the duplicate is found. Otherwise the data has to be inserted and a query run afterwards. Assuming there is a reasonably recent PostgreSQL hidden behind the Django and Python, i.e. >9.5, then the INSERT can be attempted with an ON CONFLICT clause, along the lines sketched below.
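A sketch of that insert through Django's raw connection; the card/number names follow the thread, and DO NOTHING is only one possible conflict action (drop the clause entirely if a duplicate should abort the whole load instead):

    from django.db import connection

    def insert_numbers(numbers):
        with connection.cursor() as cur:
            cur.executemany(
                # requires PostgreSQL >= 9.5 and a unique constraint/index on number
                "INSERT INTO card (number) VALUES (%s) ON CONFLICT (number) DO NOTHING",
                [(n,) for n in numbers],
            )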
Once this is in place, the data will be added to the table as fast as is possible, and with no duplicates. It will be very much slower if every insertion has to be checked in the Django+Python stack. Basically, if speed is needed, go to the database and talk in SQL.

Regards
> On Dec 30, 2020, at 12:36 AM, Sachin Kumar <[hidden email]> wrote:
> Hi All,
>
> I am uploading data into PostgreSQL from a CSV file and checking whether any value is already a duplicate in the DB; if so, the upload should return a duplicate error. I am using the query below:
>
>     if Card_Bank.objects.filter(Q(ACCOUNT_NUMBER=card_number)).exists():
>         flag = 2
>     else:
>         flag = 1
>
> It is taking too much time; the CSV has 600k cards. Kindly help me make this check faster.
>
> I am using Python, Django & PostgreSQL.
>
> Best Regards,
> Sachin Kumar

Are you checking one-by-one because your goal is not to fail the whole upload that contains the duplicates, but rather to skip only the duplicates? If that's the case, I think you'd be better off copying the CSV straight into a temp table, using a join to delete duplicates from it, then inserting the remainder into the target table, and finally dropping the temp table.
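A sketch of that sequence with psycopg2 (the connection string, file path, and card/number names are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=carddb")
    with conn, conn.cursor() as cur:
        # 1. load the CSV into a temporary staging table
        cur.execute("CREATE TEMP TABLE card_load (number BIGINT)")
        with open("cards.csv") as f:
            cur.copy_expert("COPY card_load (number) FROM STDIN WITH (FORMAT csv)", f)
        # 2. drop staging rows that already exist in the target table
        cur.execute("DELETE FROM card_load USING card WHERE card_load.number = card.number")
        # 3. insert the remainder (DISTINCT also collapses duplicates inside the file)
        cur.execute("INSERT INTO card (number) SELECT DISTINCT number FROM card_load")
        cur.execute("DROP TABLE card_load")
    conn.close()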
Hi Scott,

Yes, I am checking one by one because my goal is to fail the whole upload if there is any duplicate entry, and to inform the user that they have a duplicate entry in the file.

Best Regards,
Sachin Kumar

On Wed, Dec 30, 2020 at 6:43 PM Scott Ribe <[hidden email]> wrote:
> On Dec 30, 2020, at 12:36 AM, Sachin Kumar <[hidden email]> wrote:
> On Dec 30, 2020, at 6:24 AM, Sachin Kumar <[hidden email]> wrote:
>
> Yes, I am checking one by one because my goal is to fail the whole upload if there is any duplicate entry, and to inform the user that they have a duplicate entry in the file.

That's not what I said, though. If you want to fail the whole copy, then you don't have to check one by one; just try the copy, assuming you have the correct constraints in place. Unless you want to tell the user *which* rows are duplicates, in which case you can try a variant on my prior suggestion: copy into a temp table, use a join to find duplicates...
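A sketch of that reporting variant, reusing the card_load staging table from the earlier sketch (names are still placeholders); instead of deleting the duplicates, list them so the user can be told which values were rejected:

    def find_duplicates(cur):
        # staging rows that already exist in the target table, plus numbers
        # that appear more than once within the CSV itself
        cur.execute("""
            SELECT l.number FROM card_load l JOIN card c ON c.number = l.number
            UNION
            SELECT number FROM card_load GROUP BY number HAVING count(*) > 1
        """)
        return [row[0] for row in cur.fetchall()]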