Go Back   DreamTeamDownloads1, FTP Help, Movies, Bollywood, Applications, etc. & Mature Sex Forum, Rapidshare, Filefactory, Freakshare, Rapidgator, Turbobit, & More MULTI Filehosts > Computer/MAC Help/Info. & New Technology > How To - (Tips and Tricks) & Computer Fun

How To - (Tips and Tricks) & Computer Fun All the Help You Need and Tips & Tricks for your PC & some Humourous Stuff. If You Need Computer Help, Start a New Thread in the Computer Help Section Above

IMPORTANT ANNOUNCEMENT
Hallo to All Members. As you can see we regularly Upgrade our Servers, (Sorry for any Downtime during this). We also have added more Forums to help you with many things and for you to enjoy. We now need you to help us to keep this site up and running. This site works at a loss every month and we appeal to you to donate what you can. If you would like to help us, then please just send a message to any Member of Staff for info on how to do this,,,, & Thank You for Being Members of this site.
Post New ThreadReply
 
LinkBack Thread Tools Display Modes
Old 08-02-13, 19:47   #1
 
Ladybbird's Avatar
 
Join Date: Feb 2011
Posts: 34,775
Thanks: 23,502
Thanked 12,728 Times in 8,566 Posts
Ladybbird has a reputation beyond reputeLadybbird has a reputation beyond reputeLadybbird has a reputation beyond reputeLadybbird has a reputation beyond reputeLadybbird has a reputation beyond reputeLadybbird has a reputation beyond reputeLadybbird has a reputation beyond reputeLadybbird has a reputation beyond reputeLadybbird has a reputation beyond reputeLadybbird has a reputation beyond reputeLadybbird has a reputation beyond repute

Awards Showcase
Best Admin Best Admin Gold Medal Gold Medal 
Total Awards: 6

Computers Copy Text from a PDF while Preserving the Formatting




PDF, the ubiquitous document format, is great for sharing documents while preserving fonts, images, and the general layout across platforms. Is there an easy way, however, to preserve that very formatting when copying and pasting text out of the document?

Today’s Question & Answer session comes to us courtesy of SuperUser—a subdivision of Stack Exchange, a community-driven grouping of Q&A web sites.

The Question

SuperUser reader Colen is searching for a way to extract text from PDFs while preserving the formatting:
When I copy text out of a PDF file and into a text editor, it ends up mangled in a variety of ways. Formatting like bold and italics are lost; soft line breaks within a paragraph of text are converted to hard line breaks; dashes to break a word over two lines are preserved even when they shouldn’t be; and single and double quotes are replaced with ? signs.
Ideally, I’d like to be able to copy text from a PDF and have formatting converted to HTML codes, “smart quotes” converted to ” and ‘, and line breaks done properly. Is there any way to do this?
Is there a quick and easy way for Colen (and the rest of us) to get grab text without sacrificing the formatting?


The Answer

SuperUser contributor Frabjous offers a solution combined with a heavy dose of caution:
Firstly, you have to understand what a PDF is. PDFs are designed to mimic a printed page, and they are designed only as an output format, not an input format. a PDF is basically a map containing the exact location of characters (individual letters or punctuation, etc.) or images. In most cases, a PDF does not even store information about where one word ends and another begins, much less things like soft breaks vs. hard breaks for paragraph endings.
(A few recent PDFs do store some information about this stuff, but that’s a new technology, and you’d be lucky to find PDFs like that. Even if you did, your PDF viewer might not know about it.)
Anyway, it’s up to your software to implement some kind of “artificial intelligence” to extract merely from the locations of individual characters what is a word, what is a paragraph, and so on. Different software is going to do this better than others, and it’s also going to depend on how the PDF was made. In any case, you should never expect perfect results. Having the output PDF is not the same as having the source document. Far better to try to obtain that if you can.
The standard solution to your kind of problem is to use Adobe Acrobat Professional (the expensive one, not the free reader) to convert the PDF to HTML. Even that is not going to get perfect results.
There is free software that can be used to extract text from PDFs with some of formatting intact, but again, don’t expect perfect results.

See, e.g., calibre (which can convert to RTF format), pdftohtml/pdfreflow, or the AbiWord word processor (with all import/export plugins enabled).
There’s also a PDF import plugin for OpenOffice.

But please don’t expect perfection with any of these results. You’re going against the grain here. PDF just is not meant as an editable input format.
If you are having trouble deciding which tool to start with, Calibre is a veritable document Swiss Army knife. You can also use it to convert PDF files for use on your ebook reader and organize your ebook/document library.
__________________
Nil Carborundum Illegitemi My Advice is Free My Friendship is Priceless

FREEBIES Continue to be a BURDEN on Our Increasing Server/Privacy Costs. Please DONATE Something to HELP...PM an Admin for Further Info.



& Thanks to Those That Have Taken The Time to Register & Become a Member of ... 1...
Ladybbird is online now  
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiTweet this Post!
Reply With Quote
Post New ThreadReply


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On



Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2019, vBulletin Solutions Inc.
SEO by vBSEO 3.5.2
Designed by: vBSkinworks