Tags : web scraping

Getting around PIOs with Web Inspector

By Mayra Cruz
@MayraC27

One way to get around bureaucratic hassles is to get to the data directly by scraping it off the Web.

The fight for public records can sometimes be avoided by taking the data directly from websites, Dan Nguyen of ProPublica said.

On Saturday, Nguyen led a hands-on class on “Web Inspector,” which refers to a Google Chrome add-on that allows non-programmers to obtain information posted online. Using it can help journalists become familiar with HTML by teaching them to recognize patterns in Web code.

Getting familiar with HTML may be daunting, but Web Inspector can help reporters ...

Read more ...

Importing RSS and ATOM feeds

Here’s how to use Google Spreadsheets to import RSS and ATOM data.

ImportFeed for RSS and ATOM

All sorts of data gets pushed out as RSS/ATOM feeds. You can put those in spreadsheets too. The command takes the following form:

=ImportFeed(URL, [feedQuery | itemQuery], [headers], [numItems])

  • URL of the feed.
  • We'll almost always use itemQuery options ("items", "items author", "items title", "items summary", "items url", or "items created"), as they return individual items in the feed while feedQuery just returns metadata about the feed.
  • "Items" will be the best default option, as it returns everything you'll ...
Read more ...
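
The difference between feedQuery and itemQuery can also be sketched outside a spreadsheet. The Python below is only an illustration, not how ImportFeed itself works: the sample feed, its URLs, and the function names feed_info and feed_items are all made up for this example, and it assumes a plain RSS 2.0 layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the sketch runs without a network call.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Dispatch Feed</title>
    <link>http://example.com/</link>
    <description>Made-up feed for illustration</description>
    <item>
      <title>First item</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second item</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def feed_info(rss_text):
    """Roughly what a feedQuery gives you: metadata about the feed itself."""
    channel = ET.fromstring(rss_text).find("channel")
    return {"title": channel.findtext("title"), "url": channel.findtext("link")}


def feed_items(rss_text, num_items=None):
    """Roughly what an itemQuery gives you: one row per item in the feed."""
    channel = ET.fromstring(rss_text).find("channel")
    rows = [
        (item.findtext("title"), item.findtext("link"))
        for item in channel.findall("item")
    ]
    # num_items plays the same role as the optional numItems argument.
    return rows if num_items is None else rows[:num_items]


if __name__ == "__main__":
    print(feed_info(SAMPLE_RSS))
    for row in feed_items(SAMPLE_RSS, num_items=1):
        print(row)
```

The item rows are usually what you want in a spreadsheet, which is why the itemQuery options get used far more often than the feed-level ones.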

Tech Tip: Google Spreadsheet data scraping

In this guide, we're going to walk through the process of scraping and cleaning data from the web in real time, using only Google Spreadsheets. As an example, I'll be using Columbia 911, a site I put together for this purpose.

Google Spreadsheets are the ultimate weapon when it comes to real-time data and Web-based mashups. There's a slight learning curve, but once you get over it you'll find Google Spreadsheets can do just about anything Microsoft Excel spreadsheets can. Even better, they're hosted in the cloud, which means they're almost always available online ...

Read more ...

Senate Votes in XML

One of my personal annoyances came to a quiet end last week, when the U.S. Senate decided to begin publishing vote information in XML rather than the HTML that had been its format for years. The House, usually the institutionally more nimble of the chambers, began publishing vote information in XML back in 2003 (view the source on this page to see an example). Here's a Senate vote - it has information on the date and time of the vote, plus all of the individual positions. This makes it easier to parse the information into a spreadsheet or database ...

Read more ...
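
To see why XML is easier to load into a spreadsheet or database than HTML, here is a minimal Python sketch. The element names (vote, date, members, member, name, position) and the data are invented for illustration and are not the Senate's actual schema, so check the real file before adapting it.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented vote record; the tag names and values are assumptions for this sketch.
SAMPLE_VOTE = """<vote>
  <date>2000-01-01</date>
  <members>
    <member><name>Senator A</name><position>Yea</position></member>
    <member><name>Senator B</name><position>Nay</position></member>
  </members>
</vote>"""


def vote_rows(xml_text):
    """Flatten the vote into one (date, name, position) row per member."""
    root = ET.fromstring(xml_text)
    date = root.findtext("date")
    return [
        (date, member.findtext("name"), member.findtext("position"))
        for member in root.iter("member")
    ]


def rows_to_csv(rows):
    """Write the rows as CSV text that a spreadsheet or database can import."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["date", "name", "position"])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(vote_rows(SAMPLE_VOTE)))
```

Because every position sits inside its own labeled element, there is no need to guess at table layouts or strip out presentation markup, which is the usual chore when scraping the same data from HTML.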

Let OpenKapow robots do your scraping

If you’re like me, learning enough Python or Perl to be dangerous with Web scraping is on your to-do list, just not anywhere near the top.

Enter OpenKapow Robomaker, billed as an “easy-to-use point and click visual development environment” that makes Web scraping easy and intuitive for just about anyone.

Plus, it’s free.

You can download the program from OpenKapow’s Web site. A few hours later, once the download is complete and the program is installed, you can start building your first web-scraping “robot” — OpenKapow’s endearing term.

When you first fire up the program, a pop-up ...

Read more ...

Scraping barriers and how to avoid them

In early January, a short thread on NICAR-L, a computer-assisted reporting Listserv operated by IRE and NICAR, highlighted a bizarre step taken by the Seattle Fire Department to shield its response time data from mashup artists. Intending to obstruct automated Web scrapers, Seattle fire officials reformatted the public response data available on their Web site from HTML into a JPEG image. The theory, according to the officials behind the decision, was that malevolent users could harvest the data and exploit it for nefarious ends. "Our intent is to enhance the safety of personnel and the public but still provide information ...

Read more ...