Searching within files
Is it possible to expand the search so that it's possible to search within PDF files?
thx for your help!
Does anyone have an idea? Unfortunately, Google search is not an option because the site has to be protected.
I don't think it is off the shelf. Some coding will be needed.
Jordanlev wrote a howto on something like your requirement:
http://www.concrete5.org/documentation/how-tos/developers/how-to-in...
You may also find this one useful, as it does it for products (so maybe use a similar technique for files)
http://www.concrete5.org/documentation/how-tos/developers/modify-si...
These addons may also do some of what you are looking for (they search files, but not file content)
http://www.concrete5.org/marketplace/addons/image-file-search/...
http://www.concrete5.org/marketplace/addons/document_library/...
thanks!
Meanwhile, I've programmed my own solution.
After the file is uploaded, the contents of the PDF are read and written to the database. The search then looks for matches in these fields and lists them separately (with a direct download link to the file). In addition, the search can be limited by filesets.
If you want to know the details, please let me know.
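For readers who want something concrete, here is a minimal sketch of the approach described above: store the extracted text per file, then search it with an optional fileset filter. It is not the poster's actual code. The table `file_text`, its columns, and the function names are hypothetical, and it uses a plain PDO SQLite connection rather than concrete5's own database layer.

```php
<?php
// Hypothetical schema: one row per uploaded file, holding its extracted text.
function createIndex(PDO $db): void {
    $db->exec('CREATE TABLE IF NOT EXISTS file_text (
        file_id  INTEGER PRIMARY KEY,
        fileset  TEXT NOT NULL,
        filename TEXT NOT NULL,
        content  TEXT NOT NULL
    )');
}

// Call this after an upload, once the PDF text has been extracted.
function indexFile(PDO $db, int $fileId, string $fileset, string $filename, string $text): void {
    $stmt = $db->prepare(
        'INSERT OR REPLACE INTO file_text (file_id, fileset, filename, content)
         VALUES (?, ?, ?, ?)'
    );
    $stmt->execute([$fileId, $fileset, $filename, $text]);
}

// Return files whose stored text contains $query, optionally limited to one fileset.
function searchFiles(PDO $db, string $query, ?string $fileset = null): array {
    // Escape LIKE wildcards so the user's input is matched literally.
    $like = '%' . strtr($query, ['\\' => '\\\\', '%' => '\\%', '_' => '\\_']) . '%';
    $sql = "SELECT file_id, fileset, filename FROM file_text WHERE content LIKE ? ESCAPE '\\'";
    $params = [$like];
    if ($fileset !== null) {
        $sql .= ' AND fileset = ?';
        $params[] = $fileset;
    }
    $sql .= ' ORDER BY filename';
    $stmt = $db->prepare($sql);
    $stmt->execute($params);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

Each returned row carries the file ID and name, which is enough to build a direct download link in the result list. A LIKE scan is fine for a small library; for many large documents a full-text index would probably be a better fit.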
I've achieved this independently using the same method, utilising server-side processing of the files (though not in PHP, so not packageable). Did you use a PHP library to scrape text from the files? I couldn't get any of the open source ones to work reliably with all the various PDF versions.
I use these two functions. Unfortunately, they do not work with every PDF. I don't know why.
private function pdf2string($sourcefile) {
    $fp = fopen($sourcefile, 'rb');
    $content = fread($fp, filesize($sourcefile));
    fclose($fp);
    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfText = '';
    $pos = 0;
    $pos2 = 0;
    $startpos = 0;
    while ($pos !== false && $pos2 !== false) {
        $pos = strpos($content, $searchstart, $startpos);
        $pos2 = strpos($content, $searchend, $startpos + 1);
        if ($pos !== false && $pos2 !== false){
            if ($content[$pos] == 0x0d && $content[$pos + 1] == 0x0a) {
(Excerpt: only the first 15 of the 114 lines of this code block are shown.)
Ah yes, I tried the same script and had limited success. The reason is that it doesn't support all versions of the PDF file format, so it's rendered fairly useless.
I managed to get round it using two server-side approaches combined with the PHP shell_exec function. For PDFs I'm using the pdftotext utility from the xpdf package, and for Word files I'm using a headless install of OpenOffice combined with the unoconv command-line utility. They can both output to stdout, so it's easy to get the parsed text back into PHP. This is on a Linux (CentOS) server, so I'm not sure how cross-platform this approach is, but it works well for me.
I'll probably write up the steps into a howto to share with the community.
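Until that howto exists, here is a rough sketch of what the shell_exec approach could look like. It is a guess at the general shape, not the poster's actual setup. The function names `buildExtractCommand` and `extractText` are hypothetical, and it assumes `pdftotext` and `unoconv` are installed and on the PATH of a Linux server.

```php
<?php
// Build the shell command that prints a document's text on stdout.
// Returns null for file types this sketch does not handle.
function buildExtractCommand(string $path): ?string {
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    $arg = escapeshellarg($path); // never pass an unescaped filename to the shell
    switch ($ext) {
        case 'pdf':
            // pdftotext sends the text to stdout when the output file is "-".
            return "pdftotext $arg -";
        case 'doc':
        case 'docx':
            // unoconv converts via a headless OpenOffice; --stdout prints the result.
            return "unoconv -f txt --stdout $arg";
        default:
            return null;
    }
}

// Run the command and return the extracted text, or null if nothing came back.
function extractText(string $path): ?string {
    $cmd = buildExtractCommand($path);
    if ($cmd === null || !is_readable($path)) {
        return null;
    }
    // Discard stderr so tool warnings do not end up in the indexed text.
    $out = shell_exec($cmd . ' 2>/dev/null');
    return is_string($out) ? trim($out) : null;
}
```

The returned string can then be written to the database for the search to use. Note that shell_exec is disabled on many shared hosts, so this only works where you control the server.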
It would be really interesting to see how you achieved this!