Java Development
 
Dev Articles Community Forums > Programming > Java Development

#1 maralbjo (July 27th, 2005, 11:32 AM)
Extracting contents from html sourcecode

Hi, I need to write a Java class that can do two things:

1. Detect all hyperlinks in an HTML page (for further crawling).

2. Extract all visible text (everything that is not tags, styles, scripts, etc.), in other words the text that a web browser normally displays.

All help appreciated!

Marius

#2 MadCowDzz (July 27th, 2005, 11:43 AM)
Is the webpage well-formed?

If the webpage is well-formed (and hopefully valid HTML), you could use a DOM parser to walk through the page. Otherwise you'll need to use string functions to trawl your way through the markup.
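
For the well-formed case, a minimal sketch of the DOM approach might look like the following. It assumes the page is well-formed XHTML without a DOCTYPE and uses the XML parser that ships with the JDK. The class name `DomLinkSketch` and the sample page are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name is not from the thread.
class DomLinkSketch {

    // Collects the href attribute of every <a> element in a well-formed page.
    // Element names are case-sensitive here, so <A HREF=...> would be missed.
    static List<String> hrefs(String wellFormedPage) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedPage)));
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element anchor = (Element) anchors.item(i);
                if (anchor.hasAttribute("href")) {
                    result.add(anchor.getAttribute("href"));
                }
            }
            return result;
        } catch (Exception e) {
            // Thrown, among other cases, when the page is not well-formed.
            throw new IllegalArgumentException("page could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"http://a.example/\">A</a>"
                + "<p>x <a href=\"/b\">B</a></p></body></html>";
        System.out.println(hrefs(page)); // prints [http://a.example/, /b]
    }
}
```

A page with unclosed or mismatched tags makes the parser throw, which is why this route only fits pages known to be well-formed.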

#3 Kyle (July 27th, 2005, 12:32 PM)
You'd most likely want to use regular expressions. In its simplest form, a pattern can match and grab anything that starts with "<a" and ends with "</a>", and you can then filter the actual link out of that. Another expression can strip all HTML tags, leaving only the content. Look into regular expressions, because they make many finding and filtering tasks easy.
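
As a rough sketch of the two expressions described above, using `java.util.regex`: the class name `RegexHtmlSketch`, the exact patterns, and the sample markup are assumptions for illustration, and regular expressions will not cope with every kind of broken or unusual markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the patterns are one possible choice, not the only one.
class RegexHtmlSketch {

    // A quoted href value inside an <a ...> tag; group 2 holds the link itself.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Whole <script>...</script> and <style>...</style> blocks, contents included.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static List<String> links(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    static String visibleText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Collapse the whitespace left behind by the removed tags.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<style>p{color:red}</style><script>var x = 1;</script>"
                + "<p>Hello <a href=\"http://example.com/a\">first</a>"
                + " and <A HREF='/b'>second</A></p>";
        System.out.println(links(html));       // prints [http://example.com/a, /b]
        System.out.println(visibleText(html)); // prints Hello first and second
    }
}
```

Script and style blocks are removed before the generic tag pattern runs, because otherwise their contents would survive as "text".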

#4 maralbjo (July 28th, 2005, 07:35 AM)
I cannot guarantee that the pages I crawl are well-formed.

#5 MadCowDzz (July 28th, 2005, 08:43 AM)
In that case, a string manipulation function and/or regular expressions would be your best bet.
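
To illustrate the plain string-function route, here is a small sketch that scans for `href` with `String.regionMatches` and `String.indexOf` and never parses the page. The class name `StringScanSketch` and its helper are invented for this example. It picks up quoted `href` values on any tag, not only on anchors, and it skips unquoted values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class showing a scan with plain String methods.
class StringScanSketch {

    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        int pos = 0;
        while (pos + 4 <= html.length()) {
            // Case-insensitive check for the keyword at the current position.
            if (!html.regionMatches(true, pos, "href", 0, 4)) {
                pos++;
                continue;
            }
            pos += 4;
            int i = skipSpaces(html, pos);
            if (i >= html.length() || html.charAt(i) != '=') {
                continue; // the word "href" was not followed by '='
            }
            i = skipSpaces(html, i + 1);
            if (i >= html.length()) {
                break;
            }
            char quote = html.charAt(i);
            if (quote != '"' && quote != '\'') {
                continue; // unquoted values are ignored in this sketch
            }
            int end = html.indexOf(quote, i + 1); // first closing quote
            if (end < 0) {
                break; // unterminated value, nothing more to extract
            }
            result.add(html.substring(i + 1, end));
            pos = end + 1;
        }
        return result;
    }

    private static int skipSpaces(String s, int from) {
        int i = from;
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String line = "<a href = \"http://www.alink\">x</a> <A HREF='/b'>y</A>";
        System.out.println(hrefs(line)); // prints [http://www.alink, /b]
    }
}
```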

#6 MadCowDzz (July 28th, 2005, 09:15 AM)
How convenient, Devarticles just posted a chapter from a book about Regular Expressions in Java =)

#7 maralbjo (July 28th, 2005, 09:55 AM)
Great!

That's a good tip, I'll dig into it! Thanks!

#8 maralbjo (July 29th, 2005, 06:51 AM)
More help needed on regex

Consider this regular expression, meant to extract all hyperlinks from a line of text:

String regex = "http:.*\"";

This expression grabs everything between 'http:' and a '"', but not in the way I intended, because for the following line it matches all the way to the last '"':

<a href = "http://www.alink" /a> <ahref="http:anotherlink" /a>

In other words, I want the match to stop at the FIRST occurrence of the quotation mark, not the last. Any ideas?
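
One possible way to get that behaviour, sketched here under the assumption that the links are always enclosed in double quotes: `.*` is greedy and runs to the last quote on the line, while the reluctant form `.*?` or a negated character class `[^"]*` stops at the first one. The class name `FirstQuoteSketch` and the constant names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class comparing three variants of the pattern.
class FirstQuoteSketch {

    static final String GREEDY = "http:.*\"";     // the expression from the post
    static final String RELUCTANT = "http:.*?\""; // stops at the first quote, keeps it
    static final String NEGATED = "http:[^\"]*";  // stops before the first quote

    // Returns every non-overlapping match of regex in line.
    static List<String> findAll(String regex, String line) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(line);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String line = "<a href = \"http://www.alink\" /a> <ahref=\"http:anotherlink\" /a>";
        System.out.println(findAll(GREEDY, line).size());    // prints 1
        System.out.println(findAll(RELUCTANT, line).size()); // prints 2
        System.out.println(findAll(NEGATED, line));          // prints [http://www.alink, http:anotherlink]
    }
}
```

The negated class has the side benefit that the closing quote is not part of the match, so the result can be used as a URL without trimming.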
