Skip to main content

PhantomJS and JavascriptExecutor

My main project at TheStreet is a web scraper, and it relies on PhantomJS browser to run user interactions, make DOM adjustments and take screenshots. Taking screenshots is an area where PhantomJS excels over the other Selenium WebDriver implementations.

The thing that works so well for PhantomJS is that the viewport is not constrained to a screen, so it takes screen captures that are as long as the page content is. I just set the viewport to 1366x1 and let the content extend the height of the screen.

One thing that doesn't work well is rendering fonts. Running on AWS the rendering of fonts is extremely unstable. The same site can render as a serif or sans-serif font, and it's not clear why. The problem with this is that it makes the screenshots very different even if the site isn't changed. I'm using pixel-level analysis of the screenshot to detect changes, so this sets off lots of false positives.

PhantomJS uses the JavascriptExecutor interface to allow arbitrary JavaScript to execute in the browser. The interface allows me to pass in the script as a string and optionally some arguments. Arguments are exposed as an arguments array variable. I find that it's easier to understand the scripts when I set arguments by string replacement rather than use the arguments capability.

I use the capability of the JavascriptExecutor interface in several different ways, but the relevant example here is that I can change the style of elements on the page by manipulating properties via the DOM tree. I have a script that recursively sets the font to Arial on all DOM nodes -- PhantomJS renders standard fonts like Arial without any issues. This script generally takes under 20 ms for the pages I work with. Now all of my screen captures are consistent and this source of false positives is no longer occurring.

I also use the JavascriptExecutor capabilities in PhantomJS to adjust the CSS style attributes and check document status on page load. It has proven to be a useful capability to build on.

Comments

Post a Comment

Popular posts from this blog

ReactJS, NPM and Maven

I'm just starting to get into working with ReactJS, Facebook's open source rendering framework. My project uses SpringBoot for annotation-driven dependency injection and MVC. I thought it would be great if I could use a bit of ReactJS to enhance the application. If you're looking for a basic conceptual intro, I recommend ReactJS for Stupid People and of course the official documentation  is quite good. In full disclosure, I still have no idea how to do "flux" yet. As an experienced Java backend developer, I'm pretty decent at hacking Maven builds - which is precisely what this blog post is going to be about. First, a word about how React likes to be built. Like many front-end tools, there is a toolkit for the node package manager (NPM). From the command prompt, one might run npm install -g react-tools  which installs the jsx command. The  jsx  command provides the ability to transform JSX syntax into ordinary JavaScript, which is precisely what I want...

Solved: Unable to Locate Spring Namespace Handler

I attempted to run a Spring WebMVC application, and when starting up the application complained that it didn't know how to handle the MVC namespace in my XML configuration. The project runs JDK 7 and Spring 4.0.6 using Maven as the build system. The following is my XML configuration file: <?xml version="1.0" encoding="UTF-8"?> <beans xmlns="http://www.springframework.org/schema/beans"        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"        xmlns:mvc="http://www.springframework.org/schema/mvc"        xsi:schemaLocation="         http://www.springframework.org/schema/beans         http://www.springframework.org/schema/beans/spring-beans.xsd         http://www.springframework.org/schema/mvc         http://www.springframework.org/schema/mvc/spring-mvc.xsd">          <mvc:annotation-driven/> ...

AWS S3 versus CloudFront Performance

Yesterday I took Amazon CloudFront for a spin. Creating the CloudFront distribution was pretty simple - the wizard process flowed nicely. I found myself relying on the help text in places, but the most surprising thing was how long it took for the distribution to become enabled. I didn't time it exactly, but I probably spent 45 minutes waiting for my new CloudFront distribution to change from "In Progress" to "Enabled" status. The performance is a bit confusing. Compared to the S3 bucket, I didn't see any improvement in performance in a few tries - in fact, the CloudFront CDN performance was worse than the S3 bucket on its own for my 217 KB image file. I decided to take a larger sample, loading the same image 30 times in Chrome and noting the timing data from the "network" tab in the developer tools. I'm located in Brooklyn, have CloudFront configured for the US/Europe with download mode configured. My S3 bucket is in the US Standard zone, w...