Replacing Apache Solr with PostgreSQL Full Text Search?



J2eeInside J2eeInside
Hi all,

I hope someone can help or offer suggestions:
I'm currently maintaining a project that uses Apache Solr/Lucene. To be honest, I would like to replace Solr with PostgreSQL full-text search. However, there is a huge amount of documents involved - around 200 GB. Can PostgreSQL handle this efficiently?
Does anyone have specific experience, and what should the infrastructure look like?

P.S. Don't be confused - Solr works just fine; I just wanted to eliminate one component from the whole system (if full-text search can replace Solr at all).

Re: Replacing Apache Solr with PostgreSQL Full Text Search?

Mike Rylander
On Wed, Mar 25, 2020 at 8:37 AM J2eeInside J2eeInside
<[hidden email]> wrote:

I'm one of the core developers (and the primary developer of the
search subsystem) for the Evergreen ILS [1] (integrated library system
-- think book library, not software library).  We've been using PG's
full-text indexing infrastructure since day one, and I can say it is
definitely capable of handling pretty much anything you can throw at
it.

Our indexing requirements are very complex and need to be very
configurable, and need to include a lot more than just "search and
rank a text column," so we've had to build a ton of infrastructure
around record (document) ingest, searching/filtering, linking, and
display.  If your indexing and search requirements are stable,
specific, and well understood, it should be straightforward,
especially if you don't have to take into account non-document
attributes like physical location, availability, and arbitrary
real-time visibility rules like Evergreen does.

As for scale, it's more about document count than total size.  There
are Evergreen libraries with several million records to search, and
with proper hardware and tuning everything works well.  Our main
performance issue has to do with all of the stuff outside the records
(documents) themselves that has to be taken into account during
search.  The core full-text search part of our queries is extremely
performant, and has only gotten better over the years.
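As a minimal sketch (the table and column names here are hypothetical,
not Evergreen's actual schema), the core PG full-text pieces look like
this:

```sql
-- Hypothetical documents table; names are illustrative only.
CREATE TABLE documents (
    id     bigserial PRIMARY KEY,
    title  text,
    body   text,
    -- Precomputed tsvector, weighting title matches above body matches.
    -- (PostgreSQL 12+ can maintain this as a generated column; older
    -- versions typically use a trigger instead.)
    search tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(body,  '')), 'B')
    ) STORED
);

-- A GIN index is what makes @@ lookups fast at scale.
CREATE INDEX documents_search_idx ON documents USING gin (search);

-- Search with ranking.
SELECT id, title, ts_rank(search, q) AS rank
FROM documents, websearch_to_tsquery('english', 'library systems') AS q
WHERE search @@ q
ORDER BY rank DESC
LIMIT 10;
```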

[1] http://evergreen-ils.org

HTH,
--
Mike Rylander
 | Executive Director
 | Equinox Open Library Initiative
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  [hidden email]
 | web:  http://equinoxinitiative.org



Re: Replacing Apache Solr with PostgreSQL Full Text Search?

J2eeInside J2eeInside
Hi Mike, and thanks for the valuable answer!
In short, do you think PG full-text search can do the same as Apache Solr?

P.S. I need to index .pdf, .html, and MS Word .doc/.docx files; are there any constraints in full-text search regarding those file types?



Re: Replacing Apache Solr with PostgreSQL Full Text Search?

Mike Rylander
On Thu, Mar 26, 2020 at 4:03 AM J2eeInside J2eeInside
<[hidden email]> wrote:
>
> Hi Mike, and thanks for the valuable answer!
> In short, do you think PG full-text search can do the same as Apache Solr?
>

Can it?  I mean, it does today.  Whether it would for you depends on
your needs and how much effort you can afford to put into the stuff
that is /not/ the full text engine itself, like document normalizers
and search UIs.

There are trade-offs to be made when choosing any tool.  Solr is
great, and so is Lucene (Solr's heart), and so is Elasticsearch.  For
that matter, Zebra is awesome for full-text indexing, too.  Those all
make indexing a pile of documents easy.  But none of them is great as
an authoritative data store, so, for instance, there will necessarily
be drift between your data and the Solr index, requiring a full
refresh.  It's also hard to integrate non-document filtering
requirements like the ones I described.  Both of those issues matter
in my use case, so PG's full-text search is my preference.

Solr also didn't exist (publicly) in 2004 when we started building Evergreen. :)

> P.S. I need to index .pdf, .html, and MS Word .doc/.docx files; are there any constraints in full-text search regarding those file types?
>

It can't handle those without some help -- it supports only plain
text -- but you can extract the text using other tools.





Re: Replacing Apache Solr with PostgreSQL Full Text Search?

Andreas Joseph Krogh
In reply to this post by J2eeInside J2eeInside
I see you've gotten some answers but wanted to chime in...
We search ~15 million emails and ~10 million documents (text extracted from Word/PDF etc. using Java tools), and we use PG FTS (GIN, not RUM) for, it seems, the exact same reasons as Evergreen. We have to mix FTS with domain-specific logic/filtering, and that is based on relational data in the database. I don't see how we could have done that using an external search engine. Maybe it's easy; I don't have any experience with it.
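As an illustration of that mixing (the schema and names here are
invented, not our real one), the full-text predicate is just one more
clause alongside ordinary relational filters:

```sql
-- Invented schema for illustration; the real tables differ.
SELECT m.id, m.subject
FROM message m
JOIN folder f     ON f.id = m.folder_id
JOIN folder_acl a ON a.folder_id = f.id
WHERE a.user_id = 42                            -- domain-specific visibility rule
  AND m.received_at > now() - interval '1 year' -- ordinary relational filter
  AND m.fts @@ websearch_to_tsquery('english', 'invoice overdue')
ORDER BY ts_rank(m.fts, websearch_to_tsquery('english', 'invoice overdue')) DESC
LIMIT 20;
```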
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 

Re: Replacing Apache Solr with PostgreSQL Full Text Search?

J2eeInside J2eeInside
In reply to this post by Mike Rylander
Thanks again.
One final question:

On Thu, Mar 26, 2020 at 4:18 PM Mike Rylander <[hidden email]> wrote:
> > P.S. I need to index .pdf, .html, and MS Word .doc/.docx files; are
> > there any constraints in full-text search regarding those file types?
>
> It can't handle those without some help -- it supports only plain
> text -- but you can extract the text using other tools.


Can you recommend the tools you mention above, or point to any useful resource on how to do that?


Re: Replacing Apache Solr with PostgreSQL Full Text Search?

J2eeInside J2eeInside
In reply to this post by Andreas Joseph Krogh-2
You are welcome, Andreas, and thanks for the useful answer ;-)


Re: Replacing Apache Solr with PostgreSQL Full Text Search?

Artjom Simon
In reply to this post by J2eeInside J2eeInside
On 26.03.20 17:05, J2eeInside J2eeInside wrote:
>> P.S. I need to index .pdf, .html, and MS Word .doc/.docx files; are
>> there any constraints in full-text search regarding those file types?
>
> Can you recommend the tools you mention above, or point to any useful
> resource on how to do that?


For PDFs, I know of at least two tools that can extract text. Try
Ghostscript:

     gs -sDEVICE=txtwrite -o output.txt input.pdf


or a tool called 'pdftotext':

     pdftotext [options] [PDF-file [text-file]]

Both give slightly different results, mainly in terms of indentation and
layout of the generated plain text, and how they deal with tabular layouts.

Note that PDF is a container format that can embed virtually anything:
text, images, Flash video, and so on.
You'll get good results if the PDF input is plain text. If you're
dealing with embedded images, like scanned documents, you'll probably
need an OCR pass with a tool like 'tesseract' to extract the
recognized text.

You'll need similar tools to extract the text from DOC and HTML files,
since you're only interested in their plain-text representation, not
the metadata and markup.
Finding converters from HTML or DOC to plain text shouldn't be too hard.
You could also try to find a commercial document conversion vendor, or
try to convert HTML and DOC both to PDF so you'll only have to deal with
PDF-to-text extraction in the end.

Good luck!

Artjom


--
Artjom Simon