Web Scraping in GIS

Data in the Digital Age

Data is the foundation of insightful information for researchers and organizations, and thus drives their success. The critical role that data plays in our society has led to a race to collect and mine it. Over the course of the 20th century, new methods such as Nielsen ratings (named after Arthur Nielsen, whose company also bears his name) changed how research was conducted and how businesses made decisions, and as a result shaped the structures governing our societies. The greatest development in data collection (what some may consider the greatest social experiment) took the form of the Internet. The advent of the Internet has not only changed people’s lives, it has changed how we understand them as well. The most far-reaching methods of collecting data are now championed by organizations such as Google, Facebook, and many other household names. While these organizations employ sophisticated technologies to gather and refine data, one doesn’t necessarily need to create a unicorn startup to get data from the web.

Numerous methods exist; however, some lead to more success than others (as illustrated in the previous paragraph). To gather data, some may resort to older methods. For surveying populations, such practices include placing ads on community bulletin boards, posting fliers on local poles and walls, going door to door, or cold calling. Additionally, individuals aggregating data from texts and documents may resort to entering data manually or copying and pasting it into an Excel sheet. These methods are often time consuming, resource intensive, and error prone. What technology can repetitively scan for, aggregate, and refine data? Your personal computer, of course! One powerful method that achieves this effect, and the focus of this article, is web scraping.

With web scraping, a company or individual can collect real-time information about users, with or without their consent, such as Google searches, tweets, and messages from other apps. Companies such as Google and Facebook, as well as individuals, collect this information for their own goals, such as building their own AIs or selling it to third parties. There is a reason why Facebook and Google offer their messaging and search services for “free” (hint: you are the product). Some companies collect this information to match supply with demand more accurately, without wasting ecological and monetary resources. For instance, someone who scrapes the CIA World Factbook can collect data on a country’s geography and population, historical dates associated with the country (for making a timeline graph), and hyperlinks to sources or other data. A scraping bot can collect live user information directly from pages (which can be illegal and unethical depending on the site and situation), or it can use an API token to request data through a service’s API (e.g. Twitter, Google, Facebook). I’ll apply both of these approaches in a GIS context, with an emphasis on web scraping.

Apply it to GIS / Webmap

With respect to GIS, there are two types of data that are of interest. The first consists of geographical and tabular information together (e.g. the CIA World Factbook). The second is tabulated data that can be joined to a shapefile (e.g. household income at the neighborhood level). A classic example of the first type would be showing a day’s worth of tweets on a particular theme, such as beer vs. pizza, across the United States. This example can show “micro-cultures” at the city level. The second type of data comes in handy when you have a shapefile that doesn’t come with the information you need. For instance, you may have a shapefile of neighborhoods that does not contain information such as average household income. Most people would consider searching for the data and downloading it from a credible site or database, for free or by purchase. Alternatively, others may seek out the data by requesting special permission from a company, government, or other organization. If neither of these options is available, web scraping is a method worth strongly considering.
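To make the second type of data concrete, here is a minimal sketch of a tabular join in plain Python. It assumes the shapefile’s attribute records have already been read into dictionaries and that the scraped table shares a key field with them. The function name join_attributes and the field names (name, area_km2, avg_income) are hypothetical, and in practice a GIS library or desktop GIS tool would normally perform this join for you.

```python
def join_attributes(features, table_rows, key):
    """Left-join scraped table rows onto feature attribute records.

    features   -- list of dicts, one per shapefile record (assumed already loaded)
    table_rows -- list of dicts scraped from the web
    key        -- field name shared by both, e.g. the neighborhood name
    """
    # Index the scraped rows by the shared key for fast lookup.
    lookup = {row[key]: row for row in table_rows}

    # Collect every non-key field that the scraped table contributes.
    extra_fields = []
    for row in table_rows:
        for name in row:
            if name != key and name not in extra_fields:
                extra_fields.append(name)

    joined = []
    for feature in features:
        match = lookup.get(feature[key], {})
        merged = dict(feature)  # copy so the input records stay untouched
        for name in extra_fields:
            # Features with no scraped match get None for the new fields.
            merged[name] = match.get(name)
        joined.append(merged)
    return joined


# Hypothetical records standing in for a neighborhoods shapefile.
neighborhoods = [
    {"name": "Riverside", "area_km2": 3.2},
    {"name": "Harbor", "area_km2": 1.1},
]
# Hypothetical scraped table with the missing attribute.
scraped = [{"name": "Riverside", "avg_income": 54000}]

joined_records = join_attributes(neighborhoods, scraped, key="name")
```

Every neighborhood is kept, and those without a scraped match simply carry an empty value, which makes gaps in the scraped data easy to spot before mapping.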

How to do it

To web scrape, you’ll need to do some research beforehand and have some familiarity with programming languages. The research component involves finding websites that contain the data you need and checking their credibility, how to scrape them, and their Terms & Conditions (see next section). The languages you’ll need to know are Python or R for scraping, plus HTML and CSS to understand a page’s structure and where exactly in the page source to extract data from. Most programming languages have a web scraping package, but the ones that are well documented and relatively easy to program with are in Python and R. Python has the BeautifulSoup and requests packages and the standard-library urllib module (urllib2 in Python 2), whereas R has rvest. There are great examples and step-by-step guides for getting started with web scraping in both of these languages. Conceptually, this is how web scraping works:

Import the libraries
Connect to the URL and parse the page as text or with a parser such as lxml (some environment setup may be required for this beforehand)
Extract the data at a particular spot (e.g. by id, class, tag/element, or XPath)
Save it and write it to a CSV file
Perform a tabular join with the shapefile
Write the new shapefile
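
The first four steps can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: it uses only the standard library (html.parser and csv) instead of BeautifulSoup so it runs without any setup, and it parses a small inline page (SAMPLE_HTML, with made-up neighborhood figures) instead of connecting to a real URL. The names TableParser, scrape_table, and rows_to_csv are hypothetical. The join and shapefile steps are left out because they depend on the GIS library you use.

```python
import csv
import io
from html.parser import HTMLParser

# Stand-in for a downloaded page; a real script would fetch this from a URL.
SAMPLE_HTML = """
<table id="income">
  <tr><th>Neighborhood</th><th>Average household income</th></tr>
  <tr><td>Riverside</td><td>54000</td></tr>
  <tr><td>Old Town</td><td>61500</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Extract table rows from an HTML string as lists of cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize rows as CSV text; write the result to a file to save it."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = scrape_table(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
```

With BeautifulSoup or rvest, the extraction step is usually shorter because you can select elements by id, class, or CSS selector directly, but the overall flow stays the same.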

Warning: Careful Considerations & Ethics

Web scraping can be fun and an important skill to have under your belt; however, web scrapers can also run into legal issues. The reasons why web scraping can be viewed as unethical or as a legal issue vary by organization. Some of the most common ones are: 1) overloading their server with too many requests, 2) abusing their API, especially when it’s free with limits (e.g. a geocoding service), 3) scraping the data and selling it to third parties, and 4) violating their terms and conditions. To prevent this from happening, do your homework beforehand. Read the site’s terms and conditions. If scraping or data collection is not mentioned, then either call the organization and ask, or scrape at a throttled rate (e.g. in Python, call time.sleep(5) inside your for/while loop) to avoid overloading their server with requests. Another consideration is the source of the data. The Internet makes a lot of information available to the public, but much of it may be inaccurate, altered, or simply outdated. I highly suggest you read this article first before you start web scraping.
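
As a rough illustration of the throttled rate mentioned above, the sketch below pauses between requests using time.sleep. The function names fetch_text and fetch_politely are hypothetical, the 5-second default simply mirrors the figure used in this article, and the right delay for a given site is an assumption you should check against its terms. The fetch and sleep functions are passed in as parameters so the loop can be tried out without touching a real server.

```python
import time
import urllib.request


def fetch_text(url):
    """Download one page and return its body as text (standard library only)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_politely(urls, fetch=fetch_text, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to limit server load."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Calling fetch_politely with a list of page URLs and the defaults would download them one at a time with a 5-second gap between requests. If the site publishes rate limits, set delay_seconds to respect them.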