Zend_Search_Lucene and PDF files

Bill Chmura-2

Zend_Search_Lucene and PDF files

Hello,

I am implementing Lucene and need to index my PDF files. 

I have found several solutions, but they all require some non-PHP component such as XPDF. I need this to be cross-platform, so those are generally out.

I also started looking for ways to get inside Zend_PDF to get at the elements of each page, with no success yet. I was hoping that I could iterate the pages in a PDF (done), get a list of the elements on that page (?), and then grab the text from perhaps the Zend_Pdf_Element_String I was able to find in there. Since I am not going to be displaying the context in my search results, the location of the text does not matter to me so much.

I am getting totally bogged down in the source code for the pages and the parsers, at least partly because I am not familiar with the nomenclature of PDF internals :(

Does anyone have any pointers on how to approach this?  Ideally I'd like to keep it Zend, but I can use other PDF libraries if I need to.

Thanks

Bill

Matthias W.

Re: Zend_Search_Lucene and PDF files

Hi,
Some time ago I had the same problem, but I also needed support for other document types (Excel, PowerPoint, ...).
Because of this, I built my index with Apache's Java projects: Lucene, PDFBox (PDF parser/writer), and POI (Office document parser/writer).

I think it wouldn't be much work to parse your PDF docs Java-side...

Shaun Farrell

Re: Zend_Search_Lucene and PDF files

About a year and a half ago I wrote a two-part post on how to index PDFs with Zend (http://www.kapustabrothers.com/2008/01/20/indexing-pdf-documents-with-zend_search_lucene/). The framework has come a long way since then, so it's probably out of date. I have been thinking about updating the topic. The current implementation uses XPDF, which at the time was the best way to convert PDFs to text. I have been looking for other libraries but have had no luck. I'm still looking, so I'll let you know if I find anything.

On Wed, Sep 9, 2009 at 8:00 AM, Matthias W. <[hidden email]> wrote:

Hi,
some time ago I had the same problem. But I needed the support for other
documents, too (Excel, Powerpoint, ...).
Because of this I created my index with java Apache projects: Lucene, PDFBox
(PDF parser/writer) and POI (Office document parser/writer).

I think it wouldn't be much work to parse your PDF docs Java-side...



--
View this message in context: http://www.nabble.com/Zend_Search_Lucene-and-PDF-files-tp25352084p25363483.html
Sent from the Zend MVC mailing list archive at Nabble.com.




--
Shaun J. Farrell
Washington, DC
(202) 713-5241
www.farrelley.com
Bill Chmura-2

Re: Zend_Search_Lucene and PDF files


Thanks Shaun and Matthias,

Shaun: I had actually already found your post, and so far it is the most likely route if I cannot get a pure-PHP solution working. The server runs OpenBSD, but development is done on OS X, Linux, and Windows, so depending on XPDF presents a problem. But if push comes to shove, that's where I will be heading.

Matthias: It needs to be able to update on the fly, and running Java up there may be a bit dicey... There is also a DB component, so some of the metadata comes from my model, and, well, it's looking painful whichever way I go. Thanks for the suggestion though!

I was really hoping someone with Zend_PDF knowledge would see this and yell, "Hey, just grab this array from the PDF object, it's got your strings" :)

Thanks guys!







Matthias W.

Re: Zend_Search_Lucene and PDF files

What about writing a Java web service?
With Apache XML-RPC it's really easy to set up a web service.

The web service could expose PDFBox functionality to your PHP application...

Shaun Farrell

Re: Zend_Search_Lucene and PDF files

In reply to this post by Bill Chmura-2
Bill, I have looked at Zend_PDF and I am not sure you can read the text. I will look again in 1.9.2 and see. I think it's write-only, but I could be totally wrong. It may be a good question to ask in the #PHPC chat room.


On Wed, Sep 9, 2009 at 8:39 AM, Bill Chmura <[hidden email]> wrote:
> I was really hoping someone with Zend_PDF knowledge would see this and yell, hey - just grab this array from the PDF object, it's got your strings :)
Bill Chmura-2

Re: Zend_Search_Lucene and PDF files


Hi Shaun,

It does not support it at the API level, but I was trolling through the code to see if I could use the parser to grab the strings out of the PDF. It does look like it goes through the items in the PDF on load and separates them into different elements; it's just that accessing those elements is the tough part for me. I will probably check some more in a bit... I just have to track down where it's putting them. If I can extend one of the PDF classes to do it, I will. I definitely do not want to start changing the actual Zend code (upgrading would be hell then).




Markus Wolff

Re: Zend_Search_Lucene and PDF files

In reply to this post by Bill Chmura-2
Bill Chmura wrote:
> Shaun: I actually already found your post, and so far it is the most
> likely scenario if I cannot get a pure PHP solution working - The server
> is OpenBSD, but development is done on OSX, Linux, and Windows so it
> presents a problem with the XPDF.  But if push comes to shove it's where
> I will be heading.

A little off-topic, but if it's clear what the deployment platform is
(OpenBSD in this case), I can highly recommend using a virtualization
tool such as VirtualBox or VMware to run a setup very similar to the
deployment system right on your dev box.

Not only will this eliminate the problem that the tools you use in
production are not available in your dev environment, it also helps
avoid portability problems: code that works perfectly on a Windows
box does not necessarily work on a Unix box without modifications.
Case sensitivity in filenames, different path separators, and the like
are only the most common and obvious issues, but some functions and/or
extensions may also behave differently across operating systems.

CU
 Markus


Bill Chmura-2

Re: Zend_Search_Lucene and PDF files


I hear ya... I'm doing dev on a Linux box, and that catches most of the incompatibilities during dev (I am the lead and do most of the coding). Those that don't get caught there get caught on the test server (which is a duplicate of the live boxes). I agree that running dev in the same environment would be ideal, but OpenBSD, while a great stable platform, has its challenges when you try to get new desktop software running on it.

Although giving devs a VM they could shove the code onto to do some testing on that platform themselves is a neat idea... hmmmm

thanks!





  

Bill Chmura-2

Re: Zend_Search_Lucene and PDF files

In reply to this post by Bill Chmura-2

Just to bring closure to this: basically, what we ended up doing was writing the PDF code ourselves to grab only the text out of the PDF. The specs for the PDF format are available from Adobe, so it was not that bad in the end. At least it is still all PHP.

Thanks to everyone for the suggestions on this
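
The code Bill describes is not shown anywhere in this thread, so the following is not his implementation. It is only a minimal sketch of the general technique the PDF specification allows: scan the file for `stream ... endstream` sections, inflate the ones that are Flate-compressed, and collect the literal strings passed to the text-showing operators `Tj` and `TJ`. It is written in Python for brevity; the same steps should port to PHP with a zlib decompression call and regular expressions. All function and variable names here are hypothetical. It assumes simple PDFs: no encryption, no object streams, no hex strings, no nested unescaped parentheses, and fonts whose character codes map directly to Latin-1.

```python
import re
import zlib

# A PDF literal string: "( ... )" with backslash escapes, no nested parentheses.
LITERAL = rb"\((?:\\.|[^\\()])*\)"
LITERAL_RE = re.compile(LITERAL, re.DOTALL)

# Body of every "stream ... endstream" section in the file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# Text-showing operators: "(string) Tj" and "[(str) -120 (ing)] TJ".
TEXT_OP_RE = re.compile(
    rb"(" + LITERAL + rb")\s*Tj" + rb"|\[((?:" + LITERAL + rb"|[^\]()])*)\]\s*TJ",
    re.DOTALL,
)

# Single-character escapes; backslash-newline is a line continuation.
ESCAPES = {b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f",
           b"\n": b""}


def unescape(raw: bytes) -> str:
    """Resolve backslash escapes inside a literal string (parentheses stripped)."""
    def repl(match):
        if match.group(1) is not None:  # octal code such as \101
            return bytes([int(match.group(1), 8) & 0xFF])
        # Unknown escapes such as \( or \\ yield the character itself.
        return ESCAPES.get(match.group(2), match.group(2))

    data = re.sub(rb"\\(?:([0-7]{1,3})|(.))", repl, raw, flags=re.DOTALL)
    # Assumption: character codes are Latin-1; real fonts may use other encodings.
    return data.decode("latin-1")


def decode_stream(body: bytes) -> bytes:
    """Inflate a Flate-compressed stream; fall back to the raw bytes otherwise."""
    try:
        # decompressobj tolerates the trailing newline before "endstream".
        return zlib.decompressobj().decompress(body)
    except zlib.error:
        return body


def extract_text(pdf: bytes) -> str:
    """Collect the strings shown by Tj/TJ operators, joined by single spaces."""
    chunks = []
    for stream in STREAM_RE.finditer(pdf):
        content = decode_stream(stream.group(1))
        for op in TEXT_OP_RE.finditer(content):
            if op.group(1) is not None:  # (string) Tj
                chunks.append(unescape(op.group(1)[1:-1]))
            else:  # [ ... ] TJ: concatenate the string pieces, drop the kerning numbers
                parts = LITERAL_RE.findall(op.group(2))
                chunks.append("".join(unescape(p[1:-1]) for p in parts))
    return " ".join(chunks)
```

Reading a file as bytes and passing it to `extract_text` would give a single string that could be handed to the indexer as the document body. Since position on the page is ignored, this only suits the case described earlier in the thread, where the text is indexed but never displayed in context.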





Shaun Farrell

Re: Zend_Search_Lucene and PDF files

Bill,

Are you going to open source that code?


On Fri, Sep 11, 2009 at 1:58 PM, Bill Chmura <[hidden email]> wrote:

Just to bring closure to this... basically what we ended up doing was writing the PDF code ourselves to grab only the text out of the PDF.   The spec's are available from Adobe for the PDF format, so it was not that bad in the end.  At least it is still all PHP.

Thanks to everyone for the suggestions on this




Bill Chmura-2

Re: Zend_Search_Lucene and PDF files



I don't see why we wouldn't.   Let me clean it up a bit, and I will post it. 

Nothing terribly complicated, but it could save some time for other people.



Bill Chmura-2

Re: Zend_Search_Lucene and PDF files


Hey, I spoke with the guy who wrote it and he is cool with putting it out. He wanted a day or two to include some brief docs.

I'll post it then.

Following that, we are going to read keywords and titles as well, which it doesn't do now, wrap it as a lucene_PDF class, and give that one out also.



sebdev

Re: Zend_Search_Lucene and PDF files

Hi there,

We are also looking for a PHP-only solution.

Have you put the classes up for download yet?

Thanks,
Seb.

Bill Chmura-2

Re: Zend_Search_Lucene and PDF files

Not yet, I got reprioritized onto something else. A few more days maybe... hopefully Monday.


