#1
Extracting contents from html sourcecode
Hi, I need to write a Java class that can do two things:
1. Detect all hyperlinks in an HTML page (for further crawling).
2. Extract all the text that is not tags, styles, scripts, etc., in other words the text that a web browser normally displays.
All help appreciated!
Marius
#2
Is the webpage well-formed?
If the webpage is well-formed (and hopefully valid HTML), you could use the DOM to parse the page. Otherwise you'll need to use string functions to trawl your way through the code.
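For the well-formed case, something along these lines might work. It is only a rough sketch using the XML parser that ships with the JDK, so it assumes the page is well-formed XHTML with lowercase tag names and no DOCTYPE that points to an external DTD. The class name `DomLinkExtractor` is made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects href values from a well-formed XHTML page via DOM.
final class DomLinkExtractor {

    static List<String> extractLinks(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // getElementsByTagName is case-sensitive here, so "<A>" would be missed.
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> links = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element anchor = (Element) anchors.item(i);
                // Skip named anchors such as <a name="top"> that carry no href.
                if (anchor.hasAttribute("href")) {
                    links.add(anchor.getAttribute("href"));
                }
            }
            return links;
        } catch (Exception e) {
            // The XML parser rejects anything that is not well-formed.
            throw new IllegalArgumentException("Page is not well-formed", e);
        }
    }
}
```

Calling `DomLinkExtractor.extractLinks(page)` returns the `href` values in document order. A page with an unclosed tag ends up in the `IllegalArgumentException` branch, which is exactly the limitation mentioned above.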
#3
You'd most likely want to use regular expressions. In its simplest form, a pattern can match anything that starts with "<a" and ends with "</a>", and you can then filter the actual link out of that match. Another expression can strip all HTML tags, leaving only the content. Look into regular expressions, because they make many finding and filtering tasks easy.
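To make that a bit more concrete, here is a minimal sketch of both ideas with `java.util.regex`. The class name `RegexHtmlExtractor` and the patterns are my own assumptions, not a complete HTML parser: unquoted `href` values, a ">" inside an attribute value, comments, and entities such as `&amp;` are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex-based link and text extraction for messy HTML.
final class RegexHtmlExtractor {

    // An opening <a ...> tag with a double- or single-quoted href value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
            Pattern.CASE_INSENSITIVE);

    // Whole <script>...</script> and <style>...</style> blocks, contents included.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Group 1 is the double-quoted value, group 2 the single-quoted one.
            links.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return links;
    }

    static String extractText(String html) {
        // Drop script/style blocks first so their contents never count as text.
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        // Replace tags with a space so words on either side do not run together.
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Because every tag is replaced by a space, inline markup leaves extra spaces behind: `<p>Hello <b>world</b>!</p>` comes out as `Hello world !`. That is usually fine for indexing crawled text, but it is a trade-off to be aware of.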
#4
I cannot guarantee that the pages I crawl are well-formed.
#5
In that case, a string manipulation function and/or regular expressions would be your best bet.
#6
How convenient, Dev Articles just posted a chapter from a book about regular expressions in Java =)
#7
Great!
That's a good tip, I'll dig into it! Thanks!
#8
More help needed on regex
Consider this regular expression, meant to extract all hyperlinks from a line of text:
String regex = "http:.*\"";
This grabs everything between 'http:' and a '"', but not in the way I intended. For the line below, the whole span from the first 'http:' to the last '"' is grabbed as a single match:
<a href = "http://www.alink" /a> <ahref="http:anotherlink" /a>
In other words, I want the match to stop at the FIRST occurrence of the double quote, not the last. Any ideas?
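One possible way out, offered as a sketch rather than a definitive answer: `.*` is a greedy quantifier, so it consumes as much as it can before backing off to the last quote. Java also supports the reluctant form `.*?`, and a negated character class `[^"]*` stops at the first quote as well without including it in the match. The class `GreedyVsReluctant` and its `findAll` helper are made up for this comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing greedy, reluctant, and negated-class matching.
final class GreedyVsReluctant {

    static final String GREEDY = "http:.*\"";      // the original expression
    static final String RELUCTANT = "http:.*?\"";  // stops at the first quote
    static final String NEGATED = "http:[^\"]*";   // stops before the first quote

    // Returns every non-overlapping match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }
}
```

With the sample line from this post, `findAll(GREEDY, line)` yields one long match running from the first `http:` to the last quote. `findAll(RELUCTANT, line)` yields `http://www.alink"` and `http:anotherlink"`, and `findAll(NEGATED, line)` yields `http://www.alink` and `http:anotherlink`, which is probably the form you want to feed to a crawler.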
Dev Articles Community Forums > Programming > Java Development > Extracting contents from html sourcecode