#1
removing MS Word HTML from a file
Hi,
I have a problem converting Word docs to HTML. As you probably know, when Word generates HTML it adds a lot of needless tags. Is there a way to cut these out and leave only basic formatting tags such as <b>, <br>, <ol>, <li>, etc.? I found this code to remove Word HTML:
PHP Code:
I found this on another site as a solution for removing Word HTML, but the problem is that it removes all of the HTML, leaving the file as a blob of text with no line breaks or anything. I'm afraid I don't fully understand the code above, so I was hoping someone here could help me out. Basically, I need help modifying the code above so that it removes all the crap but still leaves certain tags.
Thanks in advance,
Martin
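A minimal sketch of the kind of filter the question describes, not the code from the original post (which is not shown in the thread). The function name `clean_word_html` and the allow-list of tags are assumptions chosen for illustration. It relies on PHP's built-in `strip_tags()` with an allow-list, plus regular expressions to drop Word's comments and the attributes on the tags that remain.

```php
<?php
// Hypothetical helper: the name and the allow-list are illustrative assumptions.
function clean_word_html(string $html): string
{
    // Drop HTML comments, including Word's conditional comments
    // such as <!--[if gte mso 9]> ... <![endif]-->.
    $html = preg_replace('/<!--.*?-->/s', '', $html);

    // Drop <style> and <xml> blocks together with their contents,
    // because strip_tags() would otherwise keep the text inside them.
    $html = preg_replace('#<(style|xml)\b[^>]*>.*?</\1>#is', '', $html);

    // Keep only a short allow-list of basic formatting tags.
    $html = strip_tags($html, '<b><i><u><br><p><ol><ul><li>');

    // Remove attributes (class="MsoNormal", style="...") from the opening tags that remain.
    $html = preg_replace('/<([a-z][a-z0-9]*)\b[^>]*?(\/?)>/i', '<$1$2>', $html);

    return trim($html);
}
```

The attribute pattern is deliberately simple and would misbehave on an attribute value that contains a literal `>`. A real HTML parser or HTML Tidy is the safer choice for untrusted or unusual markup.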
#2
bah,...it won't let me post a .rar file,....grrr
__________________
-- Jason
#3
PHP Code:
__________________
Wil Moore III, MCP | Integrations Specialist | Senior Consultant
Are You Listed...? | DigitallySmooth Inc.
#4
One solution is to use the Office HTML Filter tool provided by MS. I've had really good luck with it when I've needed to convert Word HTML in the past.
URL
Otherwise, if you have a copy of Dreamweaver, it has a built-in feature for cleaning up Word HTML, found under Commands.
#5
Dreamweaver has proven to be quite useless at stripping Word HTML:
Here is an example of what you could do with the open source command-line HTML Tidy program: http://www.laidbak.net/tidyword/index.php?s=1
#6
Hi,
Thanks for the replies. The problem is, I don't need to strip the HTML from Word HTML documents. I need to do it dynamically on the content of a text box, which will then be added to a database. The text box has a WYSIWYG editor that keeps the Word formatting (which is what I need) when you paste the content, but it also keeps the Word HTML junk. Laidbak, I will try the code you posted. Thanks.
#7
OK, well, since everything will be in the HTML textarea and not in a file on the server, you will have to change this line:
PHP Code:
You will need to pass the contents of the text box in to this piece of code.
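The line in question is not shown in the thread, so the following is only a sketch of the general idea: take the markup from the submitted form field rather than from a file. The function name `editor_html_from_request` and the field name `content` are assumptions; substitute the name of your own textarea.

```php
<?php
// Hypothetical helper: reads the WYSIWYG editor's markup from a request array.
// The default field name "content" is an assumption for illustration.
function editor_html_from_request(array $request, string $field = 'content'): string
{
    $value = $request[$field] ?? '';

    // A field posted as name[] arrives as an array; treat anything
    // that is not a string as empty input.
    return is_string($value) ? trim($value) : '';
}

// In a form handler, the submitted markup takes the place of the file contents.
$html = editor_html_from_request($_POST);
// $html can now be passed to whatever cleaning routine you settle on,
// and the cleaned result stored with a parameterised query.
```

Passing the request array in as an argument, instead of reading `$_POST` inside the function, keeps the helper easy to try out with sample data.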
Viewing: Dev Articles Community Forums > Programming > General Programming Help > removing MS Word HTML from a file