General Programming Help
 
#1
June 18th, 2003, 06:23 AM
martincrumlish
removing MS Word HTML from a file

Hi,

I have a problem with converting Word docs to HTML. As you probably know, when Word generates its HTML it adds a lot of needless tags. Is there a way I can cut these out and just leave basic formatting tags such as <b>, <br>, <ol>, <li> etc.?

I found this code to remove Word HTML:

PHP Code:
$search = array ("'<script[^>]*?>.*?</script>'si",  // Strip out javascript
                 "'<[\/\!]*?[^<>]*?>'si",           // Strip out html tags
                 "'([\r\n])[\s]+'",                 // Strip out white space
                 "'&(quot|#34);'i",                 // Replace html entities
                 "'&(amp|#38);'i",
                 "'&(lt|#60);'i",
                 "'&(gt|#62);'i",
                 "'&(nbsp|#160);'i",
                 "'&(iexcl|#161);'i",
                 "'&(cent|#162);'i",
                 "'&(pound|#163);'i",
                 "'&(copy|#169);'i",
                 "'&#(\d+);'e");                    // evaluate as php (the /e modifier was removed in PHP 7.0)

$replace = array ("",
                  "",
                  "\\1",
                  "\"",
                  "&",
                  "<",
                  ">",
                  " ",
                  chr(161),
                  chr(162),
                  chr(163),
                  chr(169),
                  "chr(\\1)");

$content = preg_replace ($search, $replace, $content);


I found this on another site as a solution for removing Word HTML, but the problem is that it removes all of the HTML, leaving the file as a blob of plain text with no line breaks or anything. I'm afraid I don't fully understand the code above, so I was hoping someone on here could help me out.
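
For readers on PHP 7 or later, where the /e modifier no longer exists, a rough equivalent of the snippet above might look like the sketch below. The function name htmlToPlainText is made up for illustration, and html_entity_decode is swapped in for the hand-written entity list, so the result is UTF-8 text rather than the single-byte characters chr() produces. It also makes the "blob of text" behaviour visible: every tag is removed, with no list of tags to keep.

```php
<?php
// Hypothetical helper, not from the original snippet: a PHP 7+ sketch of the same idea.
function htmlToPlainText(string $content): string
{
    // Remove script blocks together with their contents.
    $content = preg_replace('#<script[^>]*?>.*?</script>#si', '', $content);

    // Remove every remaining tag; nothing is kept, which is why formatting is lost.
    $content = preg_replace('#<[/!]*?[^<>]*?>#si', '', $content);

    // Collapse a line break followed by whitespace into the line break alone.
    $content = preg_replace('/([\r\n])\s+/', '$1', $content);

    // Decode named and numeric entities in one step (replaces the /e pattern).
    return html_entity_decode($content, ENT_QUOTES | ENT_HTML401, 'UTF-8');
}
```

Calling htmlToPlainText('<p>Fish &amp; chips</p>') should return the bare text "Fish & chips", with the paragraph tags gone.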

Basically, I need help modifying the code above to remove all the junk but still leave certain tags.

Thanks in advance,
Martin

#2
June 18th, 2003, 10:10 AM
Taelo
bah,...it won't let me post a .rar file,....grrr
__________________
-- Jason

#3
June 18th, 2003, 12:42 PM
digitallysmooth
PHP Code:
<?php

function cleanUpHTML($text) {

    // remove escape slashes
    $text = stripslashes($text);

    // trim everything before the body tag right away, leaving possibility for body attributes;
    // pasted fragments without a body tag are left as they are
    $body = stristr($text, "<body");
    if ($body !== false) {
        $text = $body;
    }

    // strip tags, still leaving attributes, second variable is allowable tags
    $text = strip_tags($text, '<p><b><i><u><a><h1><h2><h3><h4><h5><h6>');

    // removes the attributes for allowed tags, use separate replace for heading tags since a
    // heading tag is two characters (preg_replace is used because ereg_replace was removed in PHP 7.0)
    $text = preg_replace('/<([pbiu])\b[^>]*>/i', '<\\1>', $text);
    $text = preg_replace('/<(h[1-6])\b[^>]*>/i', '<\\1>', $text);

    return $text;
}

?>
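
A variant of the same allowlist idea, sketched for content pasted into an editor rather than a whole Word file. The function name stripWordMarkup and the default tag list (taken from the tags Martin mentioned) are assumptions for illustration, not part of the function above. It removes style blocks and comments before calling strip_tags, since strip_tags would otherwise leave the CSS text behind, and it assumes attribute values never contain a ">" character.

```php
<?php
// Hypothetical helper: allowlist-based cleanup for pasted Word markup.
function stripWordMarkup(string $html, string $allowed = '<p><b><i><u><br><ol><ul><li>'): string
{
    // Drop <style> and <script> blocks together with their contents.
    $html = preg_replace('#<(style|script)\b[^>]*>.*?</\1>#is', '', $html);

    // Drop comments, including Word's conditional comments.
    $html = preg_replace('/<!--.*?-->/s', '', $html);

    // Remove every tag that is not on the allowlist (<span>, <font>, <o:p>, ...).
    $html = strip_tags($html, $allowed);

    // Remove attributes such as class="MsoNormal" from the tags that remain.
    return preg_replace('#<([a-z][a-z0-9]*)\b[^>]*?(/?)>#i', '<$1$2>', $html);
}

$pasted = '<p class="MsoNormal" style="margin:0"><b style="mso-bidi-font-weight:normal">'
        . 'Hello</b> <span lang="EN-GB">world<o:p></o:p></span></p>';

echo stripWordMarkup($pasted), "\n";
// prints: <p><b>Hello</b> world</p>
```

Passing a different second argument changes which tags survive. Note that this sketch strips attributes from every remaining tag, so links would lose their href if <a> were added to the list.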
__________________
Wil Moore III, MCP | Integrations Specialist | Senior Consultant
Are You Listed...? | DigitallySmooth Inc.

#4
June 18th, 2003, 12:44 PM
iris777888
The solution is to use the Office HTML Filter tool provided by MS. I've had really good luck with it when I've needed to convert Word HTML in the past.

URL

Otherwise, if you have a version of Dreamweaver, it has a built-in feature to clean up Word HTML, found under Commands.

#5
June 18th, 2003, 01:56 PM
digitallysmooth
Dreamweaver has proven to be quite useless at stripping Word HTML.

Here is an example of what you could do with the open source command-line HTML Tidy program:

http://www.laidbak.net/tidyword/index.php?s=1

#6
June 19th, 2003, 03:35 AM
martincrumlish
Hi,

Thanks for the replies. The problem is, I don't need to strip HTML from Word HTML documents. I need to do it dynamically on the content of a text box, which will then be added to a database. The text box has a WYSIWYG editor which keeps the Word formatting (which is what I need) when you paste the content, but it also keeps the Word HTML junk.

Laidbak, I will try the code you gave below.

Thanks.

#7
June 19th, 2003, 03:48 AM
digitallysmooth
OK, since everything will be in the HTML textarea and not in a file on the server, you will have to change this line:

PHP Code:
 $cleanhtml = `./tidy -clean --word-2000 yes wordhtml.htm`; 


You will need to pass in the info from the textbox to this piece of code.
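
One way to do that, as a rough sketch: tidy reads from standard input when no file name is given, so the textarea content can be piped to it with proc_open. The helper name pipeThrough and the form field name "content" are made up for illustration, and the /dev/null redirect assumes a Unix-like server.

```php
<?php
// Hypothetical helper: feed a string to a command's stdin and return its stdout.
function pipeThrough(string $command, string $input): string
{
    $spec = [
        0 => ['pipe', 'r'],               // child's stdin, written by this script
        1 => ['pipe', 'w'],               // child's stdout, read by this script
        2 => ['file', '/dev/null', 'a'],  // discard warnings (Unix-like systems only)
    ];

    $proc = proc_open($command, $spec, $pipes);
    if (!is_resource($proc)) {
        throw new RuntimeException("could not start: $command");
    }

    // Writing everything before reading is fine for a form field's worth of text;
    // very large inputs would need interleaved reads and writes.
    fwrite($pipes[0], $input);
    fclose($pipes[0]);                    // signals end of input to the child

    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($proc);

    return $output;
}

// Assumed usage, with the tidy binary in the current directory as in the line above:
// $cleanhtml = pipeThrough('./tidy -clean --word-2000 yes', $_POST['content']);
```

Compared with writing the text to a temporary file first, this avoids leaving files on the server, at the cost of a little more code.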

© 2003-2008 by Developer Shed. All rights reserved.