JavaScript Development
 
#1
February 15th, 2008, 10:02 AM
andrewwan1980
 
QUERY: comparing website contents

I've got two websites: the original, and a second one based on it.

I'd like to compare the two sites with an automatic diff tool to see what text or information has changed. The problem is that the HTML code and layout have changed drastically, so I can't do a straight text-file compare. What I'm interested in is purely the raw content (paragraphs, sentences, etc.). The original site has no JavaScript, onmouseover hovers, etc. The revamped site has JavaScript, onmouseover hovers, popups, etc.

How can I create a script (Perl? C++?) that extracts the main text body from each page on both sites? I guess I'd also have to specify start and end delimiters. Once extracted, the text would need its <p></p> paragraph tags converted and its <a onmouseover...> anchor tags stripped (while keeping the words in between the anchor tags, of course). The new website uses two spaces after each full stop while the old website uses one. Will this matter?
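For what it's worth, here is a minimal sketch of the extraction step in Python, using only the standard library's html.parser. The class name, function name, and delimiter strings are made up for illustration; the real delimiters depend on each site's markup. Collapsing whitespace also takes care of the two-spaces-versus-one difference, which would otherwise show up as a change on nearly every line of a straight diff.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, turning block tags into paragraph breaks."""

    # Tags treated as paragraph boundaries (an assumption; extend as needed).
    BLOCK_TAGS = {"p", "br", "div", "h1", "h2", "h3", "li"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # Attributes such as onmouseover are ignored; only the text is kept.
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)


def extract_text(page, start, end):
    """Return plain text found between the start and end delimiter strings."""
    begin = page.index(start) + len(start)
    stop = page.index(end, begin)
    parser = TextExtractor()
    parser.feed(page[begin:stop])
    parser.close()
    text = "".join(parser.parts)
    # Collapse every run of whitespace inside a paragraph to one space.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)
```

Anchor text survives because the parser passes the words between <a ...> and </a> to handle_data like any other text. Note that str.index raises ValueError when a delimiter is missing, which is probably what you want in a batch run so that a bad page is noticed.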

Once we've got the plain text, how do I wrap the paragraphs at 80 characters per line so that the files are easy to compare?
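The wrapping part could be sketched with Python's textwrap module. This assumes paragraphs are already separated by blank lines; the function name is made up for illustration.

```python
import textwrap


def wrap_paragraphs(text, width=80):
    """Wrap each blank-line-separated paragraph to at most `width` columns."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    # Normalise whitespace first so both sites wrap at the same points.
    wrapped = [textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs]
    return "\n\n".join(wrapped) + "\n"
```

One caveat with fixed-width wrapping: inserting a single word near the start of a paragraph shifts every later line break in that paragraph, so the diff tool has to cope with reflowed lines.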


And please do not suggest copying and pasting the text into Notepad or Word. I said 'website', which means dozens of HTML files (probably hundreds). Plus, I'd like a script to automate the comparison so I can repeat it in future and remind myself of the diffs.

#2
February 18th, 2008, 05:17 AM
andrewwan1980
 
I need a tool to get me the substring between delimiters, then line wrap the result at 79 characters, and then diff, for both oldsite/old1.htm and newsite/new1.htm.

As for web crawling, the old site is local and the new site is online. But I'd rather hard-code the URLs in a big list (mapping).

I think I'll use Perl (maybe Python) to:

1. for each item in the mapping list
1.1 download the newsite HTML file
1.2 substring using newsite delimiters on the newsite file
1.3 substring using oldsite delimiters on the oldsite file
1.4 html2txt/hindent both oldsite & newsite files, line wrap at 79 characters, and put them into 2 separate new folders (diff1, diff2)
1.5 repeat through the mapping list
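The steps above might look roughly like this in Python. Everything named here is an assumption for illustration: the mapping entries, the example.com URL, the delimiter strings, and the function names. The regex tag stripping is a crude stand-in for html2txt, not a real HTML parser, and the UTF-8 decoding may need changing for the actual sites.

```python
import html
import re
import textwrap
import urllib.request
from pathlib import Path

# Hypothetical mapping: (local old file, new-site URL, output file name).
MAPPING = [
    ("oldsite/old1.htm", "http://example.com/newsite/new1.htm", "page1.txt"),
]

# Hypothetical delimiters; the real ones depend on each site's markup.
OLD_DELIMS = ("<!-- content -->", "<!-- /content -->")
NEW_DELIMS = ('<div id="main">', "<!-- end main -->")


def fetch(url):
    """Download a page and decode it (assumes UTF-8)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def to_text(page, delims, width=79):
    """Cut between delimiters, drop tags, collapse whitespace, wrap."""
    start, end = delims
    begin = page.index(start) + len(start)
    body = page[begin:page.index(end, begin)]
    body = re.sub(r"(?i)</?(p|br|div)\b[^>]*>", "\n\n", body)  # paragraph breaks
    body = re.sub(r"<[^>]+>", "", body)  # remaining tags; anchor text is kept
    body = html.unescape(body)  # &amp; -> & and so on
    paras = [" ".join(p.split()) for p in re.split(r"\n\s*\n", body)]
    return "\n\n".join(textwrap.fill(p, width) for p in paras if p) + "\n"


def run(mapping, fetch_new=fetch, out_root="."):
    """Write old-site text to diff1/ and new-site text to diff2/."""
    for old_path, new_url, name in mapping:
        old_page = Path(old_path).read_text(encoding="utf-8", errors="replace")
        new_page = fetch_new(new_url)
        for folder, page, delims in (("diff1", old_page, OLD_DELIMS),
                                     ("diff2", new_page, NEW_DELIMS)):
            out_dir = Path(out_root) / folder
            out_dir.mkdir(parents=True, exist_ok=True)
            (out_dir / name).write_text(to_text(page, delims), encoding="utf-8")
```

Calling run(MAPPING) would process the whole list. Giving both output files the same name in diff1 and diff2 lets a folder comparison pair them up even though the original file names differ (old1.htm versus new1.htm).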

After that I can use Beyond Compare to compare the diff1 & diff2 folders. Hopefully both corresponding text files will be line wrapped at 79 characters with whitespace collapsed down to 1 character (eliminating runs of 2 or more spaces, and tabs). Should I also maintain carriage returns?
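If a scriptable report is wanted alongside Beyond Compare, Python's difflib can produce a unified diff for each pair of same-named files. This is only a sketch; the function name is made up, and it assumes the diff1 and diff2 folders hold .txt files with matching names.

```python
import difflib
from pathlib import Path


def diff_folders(dir_a, dir_b):
    """Yield unified-diff lines for same-named .txt files in two folders."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    for file_a in sorted(dir_a.glob("*.txt")):
        file_b = dir_b / file_a.name
        if not file_b.exists():
            yield f"only in {dir_a}: {file_a.name}\n"
            continue
        # Text mode reads \r\n as \n, so line-ending differences are ignored.
        a = file_a.read_text(encoding="utf-8").splitlines(keepends=True)
        b = file_b.read_text(encoding="utf-8").splitlines(keepends=True)
        yield from difflib.unified_diff(a, b, str(file_a), str(file_b))
```

Something like print("".join(diff_folders("diff1", "diff2"))) would dump the report. Identical files produce no output at all, so an empty report means the extracted text matches.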

© 2003-2008 by Developer Shed. All rights reserved.