|
Bill Chmura-2
|
Hello,

I am implementing Lucene and need to index my PDF files.

I have found several solutions, but they all require some non-PHP component such as XPDF. I need this to be cross-platform, so those are generally out.

I also started looking for ways to get inside Zend_Pdf to get at the elements of each page, with no success yet. I was hoping that I could iterate the pages in a PDF (done), get a list of the elements on that page (?) and then grab the text from perhaps the Zend_Pdf_Element_String I was able to find in there. Since I am not going to be displaying the context in my search results, the location of the text does not matter much to me.

I am getting totally bogged down in the source code for the pages and the parsers, at least partly because I am not familiar with the nomenclature of PDF internals :(

Does anyone have any pointers on how to approach this? Ideally I'd like to keep it Zend, but I can use other PDF libraries if I need to.

Thanks

Bill
|
Matthias W.
|
Hi,
some time ago I had the same problem, but I needed support for other document types too (Excel, PowerPoint, ...). Because of this I built my index with Java Apache projects: Lucene, PDFBox (PDF parser/writer) and POI (Office document parser/writer). I think it wouldn't be much work to parse your PDF docs on the Java side...
|
Shaun Farrell
|
About a year and a half ago I wrote a two-part post on how to index PDFs with Zend (http://www.kapustabrothers.com/2008/01/20/indexing-pdf-documents-with-zend_search_lucene/). The framework has come a long way since then, so it's probably out of date. I have been thinking about updating the topic. The current implementation uses XPDF, which at the time was the best way to convert PDFs to text. I have been looking for other libraries but have had no luck. I'm still looking, so I'll let you know if I find anything.
On Wed, Sep 9, 2009 at 8:00 AM, Matthias W. <[hidden email]> wrote:
--
Shaun J. Farrell
Washington, DC
(202) 713-5241
www.farrelley.com
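For anyone who wants to try the XPDF route mentioned above, here is a minimal sketch of shelling out to `pdftotext` from PHP. It assumes the XPDF `pdftotext` binary is installed and on the PATH; the function names are made up for illustration and are not taken from the linked post.

```php
<?php
// Hypothetical helpers for illustration only.
// Assumption: the XPDF `pdftotext` binary is installed and reachable on the PATH.

/**
 * Build the shell command that asks pdftotext to print a PDF's text to stdout.
 */
function buildPdftotextCommand($pdfPath, $binary = 'pdftotext')
{
    // -q suppresses messages, -enc UTF-8 selects the output encoding,
    // and "-" as the output file means "write the text to stdout".
    return escapeshellcmd($binary) . ' -q -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

/**
 * Run pdftotext and return the extracted text, or an empty string on failure.
 */
function pdfToText($pdfPath)
{
    $output = shell_exec(buildPdftotextCommand($pdfPath));
    return is_string($output) ? $output : '';
}
```

The string returned by `pdfToText()` is what would then be handed to the indexer. The obvious cost, as discussed in this thread, is that every machine running the code needs the external binary.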
|
Bill Chmura-2
|
Thanks Shaun and Matthias,

Shaun: I actually already found your post, and so far it is the most likely scenario if I cannot get a pure PHP solution working. The server is OpenBSD, but development is done on OS X, Linux, and Windows, so XPDF presents a problem. But if push comes to shove, that's where I will be heading.

Matthias: It needs to be able to update on the fly, and running Java up there may be a bit dicey... There is also a DB component, so some of the metadata comes from my model, and, well, it's looking painful either way as I move ahead. Thanks for the suggestion though!

I was really hoping someone with Zend_Pdf knowledge would see this and yell, "Hey, just grab this array from the PDF object, it's got your strings" :)

Thanks guys!
|
Matthias W.
|
What about writing a Java web service?
With Apache XML-RPC it's really easy to set up a web service. The web service could expose PDFBox functionality to your PHP application...
|
Shaun Farrell
|
In reply to this post
by Bill Chmura-2
Bill, I have looked at Zend_Pdf and I am not sure you can read the text. I will look again in 1.9.2 and see. I think it's write-only, but I could be totally wrong. It may be a good question to ask in the #PHPC chat room.
On Wed, Sep 9, 2009 at 8:39 AM, Bill Chmura <[hidden email]> wrote:
|
Bill Chmura-2
|
Hi Shaun,

It does not support it at the API level, but I was trawling through the code to see if I could use the parser to grab the strings out of the PDF. It does look like it is able to go through the items in the PDF on load and separate them into different elements; it's just that accessing those elements is the tough part for me. I will probably check some more in a bit... I just have to track down where it's putting them.

If I can extend one of the PDF classes to do it, I will. I definitely do not want to start changing the actual Zend code (upgrading would be hell then).
|
Markus Wolff
|
In reply to this post
by Bill Chmura-2
Bill Chmura wrote:
> Shaun: I actually already found your post, and so far it is the most
> likely scenario if I cannot get a pure PHP solution working - The server
> is OpenBSD, but development is done on OSX, Linux, and Windows so it
> presents a problem with the XPDF. But if push comes to shove it's where
> I will be heading.

A little off-topic, but if it's clear what the deployment platform is (OpenBSD in this case), I can highly recommend using a virtualization tool such as VirtualBox or VMware to run a setup very similar to the deployment system right on your dev box.

Not only will this eliminate the problem that the tools you use in production are not available in your dev environment, it also helps avoid portability problems: code that works perfectly on a Windows box does not necessarily work on a Unix box without modifications. Case sensitivity in filenames, different path separators and the like are only the most common and obvious issues; some functions and/or extensions may also behave differently across operating systems.

CU
Markus
|
Bill Chmura-2
|
I hear ya... I'm doing dev on a Linux box, and that catches most of the incompatibilities during dev (I am lead and doing most of the coding). Those that don't get caught there get caught on the test server (which is a duplicate of the live boxes).

I agree running dev in the same environment would be ideal, but OpenBSD, while a great stable box, has its challenges when trying to get other new desktop software running on it. Although giving devs a VM they could shove the code over into to do some testing on that platform themselves is a neat idea... hmmmm

Thanks!
|
Bill Chmura-2
|
In reply to this post
by Bill Chmura-2
Just to bring closure to this... basically what we ended up doing was writing the PDF code ourselves to grab only the text out of the PDF. The specs for the PDF format are available from Adobe, so it was not that bad in the end. At least it is still all PHP.

Thanks to everyone for the suggestions on this.
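The code described here was not posted in this thread, so the following is not that implementation. It is only a rough sketch of the general idea, based on how the PDF format stores page content: find each `stream ... endstream` section, inflate it if it is Flate-compressed, and collect the literal strings that appear between the `BT` and `ET` text-object operators. All function names are hypothetical. It assumes uncompressed or Flate-only streams and simple font encodings, and it ignores hex strings, nested unescaped parentheses, encryption and object streams, so real-world PDFs will often need more work.

```php
<?php
// Naive pure-PHP text extraction sketch. Hypothetical helper names;
// requires PHP 5.3+ (closures) and the zlib extension.

/** Decode the escape sequences inside a PDF literal string (the part between parentheses). */
function decodePdfLiteral($raw)
{
    $map = array('n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\f");
    return preg_replace_callback(
        '/\\\\(?:([0-7]{1,3})|(.))/s',
        function ($m) use ($map) {
            if ($m[1] !== '') {
                return chr(octdec($m[1]) & 0xFF);   // octal escape such as \041
            }
            $c = $m[2];
            if ($c === "\n" || $c === "\r") {
                return '';                          // backslash + line break: continuation
            }
            return isset($map[$c]) ? $map[$c] : $c; // \n, \t, ... or an escaped ( ) \
        },
        $raw
    );
}

/** Collect the literal strings found inside BT ... ET text objects of one content stream. */
function extractTextFromContentStream($content)
{
    $text = '';
    if (!preg_match_all('/\bBT\b(.*?)\bET\b/s', $content, $blocks)) {
        return $text;
    }
    foreach ($blocks[1] as $block) {
        // Operands of Tj, TJ, ' and " are strings; here every "(...)" in the block is taken.
        preg_match_all('/\(((?:[^()\\\\]|\\\\.)*)\)/s', $block, $strings);
        foreach ($strings[1] as $s) {
            $text .= decodePdfLiteral($s);
        }
        $text .= "\n"; // one output line per text object
    }
    return $text;
}

/** Walk every stream in the raw PDF bytes and concatenate whatever text is found. */
function extractTextFromPdf($pdfBytes)
{
    $text = '';
    preg_match_all('/stream\r?\n(.*?)\r?\n?endstream/s', $pdfBytes, $streams);
    foreach ($streams[1] as $data) {
        $inflated = @gzuncompress($data);                // FlateDecode is zlib data
        $content  = ($inflated === false) ? $data : $inflated;
        $text    .= extractTextFromContentStream($content);
    }
    return $text;
}
```

Typical use would be `extractTextFromPdf(file_get_contents($path))`, with the result passed on to the indexer. Since only the words matter for indexing and not their position on the page, ignoring all layout operators is acceptable for this purpose.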
|
Shaun Farrell
|
Bill,
Are you going to open source that code?

On Fri, Sep 11, 2009 at 1:58 PM, Bill Chmura <[hidden email]> wrote:
|
Bill Chmura-2
|
I don't see why we wouldn't. Let me clean it up a bit, and I will post it.

Nothing terribly complicated, but it could save some time for other people.
|
Bill Chmura-2
|
Hey, I spoke with the guy who wrote it and he is cool with putting it out. He wanted a day or two to include some brief docs.

I'll post it then.

Following that, we are going to read keywords and titles also, which it doesn't do now, wrap it as a lucence_PDF class, and give that one out also.
|
sebdev
|
Hi there,
we are also looking for a PHP-only solution. Have you put the classes up for download yet?

Thanks,
Seb.
|
Bill Chmura-2
|
Not yet; I got pulled onto something else. A few more days maybe... hopefully Monday.