
Parsing HTML with JSoup

The transit demo site I'm building uses data from several websites, such as the Port Authority's press page. Unfortunately, most sites aren't designed for easy parsing; usually the data is only available as HTML. So I need to parse HTML to uncover useful information, and as anyone who has looked at the source of an average web page knows, writing a parser from scratch is a challenging task.
Enter the jsoup library. Its API exposes the structure of an HTML document so that Java programs can query, traverse, and even alter it. I had previously used it to strip HTML tags from user-submitted content, but recently I have been taking advantage of its powerful query engine, which works much like jQuery selectors.
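As a rough illustration of the tag-stripping use mentioned above, here is a minimal sketch. It assumes a recent jsoup release where the class is named Safelist (older releases call it Whitelist); the class name CommentCleaner and the sample input are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CommentCleaner {
    // Removes every tag from untrusted input and keeps only the text.
    // Safelist.none() permits no elements at all; other presets such as
    // Safelist.basic() keep a small set of formatting tags instead.
    public static String stripTags(String userHtml) {
        return Jsoup.clean(userHtml, Safelist.none());
    }

    public static void main(String[] args) {
        // Hypothetical user-submitted content.
        System.out.println(stripTags("<p>Great <b>post</b>!</p>"));
    }
}
```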
Let's say I'm parsing markup that repeats for each story, with every story wrapped in a <div class="story"> element. I would parse this by first finding all of the stories using the <div class="story"> element, and then querying each story for the interesting bits of data. For instance, I might want to know the headline, a short blurb of text, and the URL of the story.
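The original code samples are not reproduced here, so what follows is only a sketch of the approach just described. The markup in SAMPLE is invented for illustration: apart from div.story, the element and class names (h2 a, p.blurb), the StoryParser and Story names, and the example.com base URI are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StoryParser {
    // Hypothetical markup: two stories, each in its own div.story.
    static final String SAMPLE =
        "<div class=\"story\">"
      + "<h2><a href=\"/news/1\">Bridge reopens</a></h2>"
      + "<p class=\"blurb\">The east span is open again.</p>"
      + "</div>"
      + "<div class=\"story\">"
      + "<h2><a href=\"/news/2\">New bus route</a></h2>"
      + "<p class=\"blurb\">Service begins next month.</p>"
      + "</div>";

    public static final class Story {
        public final String headline;
        public final String blurb;
        public final String url;

        Story(String headline, String blurb, String url) {
            this.headline = headline;
            this.blurb = blurb;
            this.url = url;
        }
    }

    // Finds every div.story, then queries inside each one for the details.
    public static List<Story> parse(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<Story> stories = new ArrayList<>();
        for (Element story : doc.select("div.story")) {
            Element link = story.selectFirst("h2 a");
            Element blurb = story.selectFirst("p.blurb");
            if (link == null) {
                continue; // not a story we know how to read
            }
            stories.add(new Story(
                link.text(),
                blurb == null ? "" : blurb.text(),
                link.absUrl("href"))); // resolved against baseUri
        }
        return stories;
    }

    public static void main(String[] args) {
        for (Story s : parse(SAMPLE, "https://example.com/")) {
            System.out.println(s.headline + " | " + s.blurb + " | " + s.url);
        }
    }
}
```

Scoping the second round of queries to each story element, rather than to the whole document, is what keeps a headline paired with its own blurb and link.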
In the real world, sites don't usually have markup this nice. Often they don't even have a containing element around the markup I'm interested in. For example, the MTA site puts all of its headlines in the same containing element, and stories are delimited only by the archaic <hr/> tag. In this case, I query for the containing element and loop through all of its children, noting interesting tags along the way. When I hit a boundary, I evaluate the tags I found to extract the news story information.
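A sketch of that loop might look like the following. The markup is made up and is not the MTA's actual HTML; the #news id, the h3 and a tags, and the DelimitedStoryParser name are assumptions chosen only to show the boundary-detection idea.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DelimitedStoryParser {
    // Hypothetical markup: stories share one container, separated by <hr/>.
    static final String SAMPLE =
        "<div id=\"news\">"
      + "<h3>Service change</h3><a href=\"/a\">Read more</a><hr/>"
      + "<h3>Fare update</h3><a href=\"/b\">Read more</a><hr/>"
      + "<h3>Weekend work</h3><a href=\"/c\">Read more</a>"
      + "</div>";

    // Returns one {headline, url} pair per story found in the container.
    public static List<String[]> parse(String html, String containerSelector) {
        Document doc = Jsoup.parse(html);
        List<String[]> stories = new ArrayList<>();
        Element container = doc.selectFirst(containerSelector);
        if (container == null) {
            return stories;
        }
        String headline = null;
        String url = null;
        for (Element child : container.children()) {
            String tag = child.tagName();
            if (tag.equals("hr")) {
                // Boundary: evaluate what was collected, then reset.
                if (headline != null) {
                    stories.add(new String[] {headline, url});
                }
                headline = null;
                url = null;
            } else if (tag.equals("h3")) {
                headline = child.text();
            } else if (tag.equals("a")) {
                url = child.attr("href");
            }
        }
        // The last story has no trailing <hr/>, so flush it here.
        if (headline != null) {
            stories.add(new String[] {headline, url});
        }
        return stories;
    }

    public static void main(String[] args) {
        for (String[] s : parse(SAMPLE, "#news")) {
            System.out.println(s[0] + " -> " + s[1]);
        }
    }
}
```

The final flush after the loop matters because the delimiter sits between stories, so the last one is never followed by a boundary.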
Using the jsoup library, it is possible to extract data from even the messiest websites with much less effort than would be required to write a parser using the standard Java SDK.
