Archive for June, 2009

[ANN] celerity_parser 0.1.1

Monday, June 22nd, 2009

HTML parsing in JRuby seems to be going through a slightly odd patch. Nokogiri and Hpricot both seem to have problems. One project I’m working on at the moment needs XPath support, and by chance I’m already using Celerity, which wraps HtmlUnit. If I need an HTML parser, I thought, there must be one hidden somewhere in there that I can use. For extra bonus points, I wouldn’t even need to package any native code, since Celerity already has that covered…

And so it came to pass. celerity_parser is an almost trivially thin wrapper around HtmlUnit’s HTMLParser class that’s got just enough functionality to do what I need, which is to search for elements by XPath and to extract text and XHTML structure. When I say “trivially thin”, I really mean it – there’s a grand total of two Ruby classes, and five methods you might want to use.

Here’s how it works, taken from the README:


root_node = CelerityParser.parse(html_content)
found_elements = root_node.search("//html/head/title")
found_elements.first.text # => "Html page title"

That’s pretty much it. The only dependencies are jarib-celerity and JRuby itself. Enjoy, and I’m open to pull requests and suggestions if you need more than this. I’ve not done any speed tests, but the parsing happens in Java, so it might be quite nippy.

Custom Rails Environments

Wednesday, June 3rd, 2009

It’s slightly more involved than you might think to make a custom Rails environment that is based on another. In my case, I wanted a staging environment that was as close as possible to production. So, I thought, require 'config/environments/production' should do the trick.

Not so.

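Because of the config.foo magic and the binding tomfoolery it requires, environments aren’t loaded, or loadable, with require. An environment file refers to config, which only exists as a local variable inside the initializer, so the file is read and eval’d against that binding instead. Here’s what I’ve got at the top of config/environments/staging.rb at the moment: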


production_environment_path = File.join(File.dirname(configuration.environment_path), 'production.rb')
eval(IO.read(production_environment_path), binding, production_environment_path)

So far so good. I’ll update here if that turns out not to be the whole story.
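
For the curious, here’s a rough sketch of why require falls over and eval with a binding doesn’t. It doesn’t use Rails at all: FakeConfig, load_environment_file and with_sample_environments are names I’ve made up for illustration, and the loader only approximates what the Rails initializer does, so treat it as a demonstration of the idea rather than the real thing.

```ruby
require 'tmpdir'

# Hypothetical stand-in for the Rails configuration object.
FakeConfig = Struct.new(:cache_classes, :log_level)

# Approximates the read-and-eval approach: `config` and `path` are local
# variables here, so an environment file can only see them via this binding.
def load_environment_file(path, config)
  eval(IO.read(path), binding, path)
  config
end

# Writes two throwaway environment files and yields their paths.
def with_sample_environments
  Dir.mktmpdir do |dir|
    production = File.join(dir, 'production.rb')
    staging = File.join(dir, 'staging.rb')
    File.write(production, "config.cache_classes = true\nconfig.log_level = :info\n")
    # staging pulls in production first, then overrides one setting.
    File.write(staging, <<~'RUBY')
      production_path = File.join(File.dirname(path), 'production.rb')
      eval(IO.read(production_path), binding, production_path)
      config.log_level = :debug
    RUBY
    yield production, staging
  end
end

with_sample_environments do |production, staging|
  # A plain load (or require) runs the file at the top level, where there
  # is no `config` in scope, so it raises.
  begin
    load production
  rescue NameError => e
    puts "load failed: #{e.class}"
  end

  # Eval'd against the loader's binding, both files can see `config`.
  config = load_environment_file(staging, FakeConfig.new)
  puts config.cache_classes # prints "true", inherited from production
  puts config.log_level     # prints "debug", overridden by staging
end
```

The nested eval in the fake staging file is the same trick as the two lines above: it passes along whatever binding it was itself eval’d in, so production’s settings land on the same config object before staging overrides them.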
