
Java Web Scraping using Jsoup

I'm trying to write a Java application that scrapes information from websites. After some googling I managed to build a very simple scraper, but it isn't enough: it fails to pick up some of the information on this website, especially the part I actually want to scrape.


        Elements links = htmlDocument.select("a");
        for (Element link : links) {
            this.links.add(link.attr("href"));
        }

        Elements linksOnPage = htmlDocument.select("a[href]");
        System.out.println("Found (" + linksOnPage.size() + ") links");
        for (Element link : linksOnPage)
        {
            this.links.add(link.absUrl("href"));
        }

I've tried both snippets, but I can't find that link anywhere in the Elements object. I believe the information I want is the result of a search, so by the time my program connects to that URL, the information is gone. How can I solve this? I want a program that scrapes the results of that search whenever it is started.
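
For reference, the two snippets differ mainly in how they read each link: `attr("href")` returns the attribute exactly as written in the page, which may be a relative path, while `absUrl("href")` resolves it against the document's base URI. Below is a minimal sketch of that resolution step using only `java.net.URI`. It is an illustration, not Jsoup's own implementation, and the relative path `/empInfo/detail.do?id=1` and the class name `HrefResolution` are made up for the example.

```java
import java.net.URI;

class HrefResolution {
    // Resolves an href, as written in the page, against the URL of the page itself.
    public static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.work.go.kr/empInfo/empInfoSrch/list/dtlEmpSrchList.do";
        // Hypothetical relative link; it is not taken from the real page.
        String raw = "/empInfo/detail.do?id=1";

        System.out.println("raw href:      " + raw);
        System.out.println("resolved href: " + resolve(page, raw));
    }
}
```

If the crawler is meant to follow the collected links later, the resolved form is the one that can be passed back to `Jsoup.connect`.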

Here is the link to the web site

So my questions are:

1. How do I get that link into my code's Elements object? What am I doing wrong?

2. Is there any way to pick out that link and follow only that link (not all hyperlinks)?

    final Document doc = Jsoup.connect("http://www.work.go.kr/empInfo/empInfoSrch/list/dtlEmpSrchList.do?pageIndex=2&pageUnit=10&len=0&tot=0&relYn=N&totalEmpCount=0&jobsCount=0&mainSubYn=N&region=41000&lastIndex=1&siteClcd=all&firstIndex=1&pageSize=10&recordCountPerPage=10&rowNo=0&softMatchingPossibleYn=N&benefitSrchAndOr=O&keyword=CAD&charSet=EUC-KR&startPos=0&collectionName=tb_workinfo&softMatchingMinRate=+66&softMatchingMaxRate=100&empTpGbcd=1&onlyTitleSrchYn=N&onlyContentSrchYn=N&serialversionuid=3990642507954558837&resultCnt=10&sortOrderBy=DESC&sortField=DATE")
            .userAgent(USER_AGENT)
            .get();
    try
    {
        Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
        Document htmlDocument = connection.get();
        this.htmlDocument = htmlDocument;
        String qqq = htmlDocument.toString();
        System.out.println(qqq);
        if (connection.response().statusCode() == 200) // 200 is the HTTP OK status code
                                                       // indicating that everything is great.
        {
            System.out.println("\n**Visiting** Received web page at " + url);
        }
        if (!connection.response().contentType().contains("text/html"))
        {
            System.out.println("**Failure** Retrieved something other than HTML");
            return false;
        }
        Elements linksOnPage = htmlDocument.select("a[href]");
        System.out.println("Found (" + linksOnPage.size() + ") links");
        for (Element link : linksOnPage)
        {
            this.links.add(link.absUrl("href"));
            System.out.println(link.absUrl("href"));
        }
        return true;
    }
    catch (IOException ioe)
    {
        // We were not successful in our HTTP request
        return false;
    }

This is the entire code I use for scraping. I took it from this site.

