Web Scraping: How we learnt to stop worrying and love web scraping

There has been a lot of bad blood between website owners and those who scrape their sites. I remember when there were automated tools for the Firefox browser that would scrape a webpage and compare it against later versions to spot any changes. Those were good times, and they predate RSS feeds. However, as CSS came into prominence, web scraping was increasingly frowned upon, and those who did it lost respect.

Hence I was surprised to see Nature publish an article on web scraping. What’s stopping researchers from scraping AI-related articles and starting meaningful conversations about them?

Web scrapers are computer programs that extract information from — that is, ‘scrape’ — web sites. The structure and content of a web page are encoded in Hypertext Markup Language (HTML), which you can see using your browser’s ‘view source’ or ‘inspect element’ function. A scraper understands HTML, and is able to parse and extract information from it. For example, you can program your scraper to extract specific fields of information from an online table or download documents linked on the page.
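To make that concrete, here is a minimal sketch of a scraper that pulls the cells out of an HTML table using only Python's standard library. The `TableScraper` class, the `scrape_table` helper, and the sample table are all made up for illustration; a real page would be fetched over the network and would likely need more careful handling.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample data standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Trial</th><th>Status</th></tr>
  <tr><td>Trial A</td><td>Reported</td></tr>
  <tr><td>Trial B</td><td>Overdue</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in scrape_table(SAMPLE_HTML):
        print(row)
```

The parser only reacts to `tr`, `td`, and `th` tags and ignores everything else, which is roughly what "extract specific fields of information from an online table" amounts to in practice.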

Just some JavaScript to show the scraping tools. It’s not really difficult.

Here’s something for context and for research:

Can you get the data an easier way? Scraping all 300,000+ records off of ClinicalTrials.gov every day would be a massive job for our FDAAA TrialsTracker project. Luckily, ClinicalTrials.gov makes their full dataset available for download; our software simply grabs that file once per day. We weren’t so lucky with the data for our EU TrialsTracker, so we scrape the EU registry monthly.
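The "grab the file once per day" idea can be sketched as a simple staleness check on a local copy. This is only an illustration, not the TrialsTracker code: `needs_refresh`, `refresh_if_stale`, and the `download` callable are hypothetical names, and the actual download step is left to the caller.

```python
import os
import time

ONE_DAY_SECONDS = 24 * 60 * 60


def needs_refresh(path, max_age_seconds=ONE_DAY_SECONDS, now=None):
    """Return True if the local copy is missing or at least max_age_seconds old."""
    if not os.path.exists(path):
        return True
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path) >= max_age_seconds


def refresh_if_stale(path, download, max_age_seconds=ONE_DAY_SECONDS):
    """Call download(path) only when the cached file is stale.

    `download` is any caller-supplied function that writes the dataset to
    `path`. Returns True if a download ran, False if the cache was reused.
    """
    if needs_refresh(path, max_age_seconds):
        download(path)
        return True
    return False
```

Run from a scheduler, something like this keeps the load on the source site to one request per day, which is the whole point of preferring a bulk download over scraping every record.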

I do hope some of these ideas spark conversations around web scraping, and that it becomes mainstream again!