How to extract Open House London building locations from the website

It's an annual OHL challenge because they won't post a Google Map, so here's mine!

First we have to establish the index numbers of the current buildings; these are apparent in the link given for each listing, e.g.

https://openhouselondon.open-city.org.uk/listings/3416

From the website we note there are 841 buildings, although the index numbers can reach four figures.


Stage One

Let's use PHP to visit indexes 0 to 10,000, checking for a 200 OK return code in the HTTP headers. This does not unduly load the OHL server because get_headers($url) fetches only the headers, not the whole page.

[Screenshot: the PHP header-checking script open in nano]

vcount.txt holds all the attempted reads while valid.txt lists the successful ones.

The results are held in valid841.txt
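The probe script itself is only shown as a screenshot above, so the code below is my own reconstruction of its likely shape, using the get_headers() call and the vcount.txt / valid.txt files the post mentions; treat it as a sketch, not the original.

```php
<?php
// Reconstruction of the Stage One probe (assumed shape; the original
// script is only shown as a screenshot). It walks indexes 0..10000,
// logs every attempt to vcount.txt, and records 200 OK hits in valid.txt.

// Ask for HEAD requests so only the headers cross the wire.
stream_context_set_default(['http' => ['method' => 'HEAD']]);

for ($i = 0; $i <= 10000; $i++) {
    $url = "https://openhouselondon.open-city.org.uk/listings/$i";
    file_put_contents('vcount.txt', "$url\n", FILE_APPEND);

    $headers = @get_headers($url);   // returns false on failure
    if ($headers !== false && strpos($headers[0], '200') !== false) {
        file_put_contents('valid.txt', "$url\n", FILE_APPEND);
    }
}
```

Note that get_headers() normally issues a GET request; setting the default stream context to HEAD, as above, is what keeps the transfer down to headers only.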


Stage Two

Next we get the OHL building pages listed in valid841.txt into a directory, ready for the next stage where we find the locations and descriptions. That's 841 files, listed by their index numbers, and a job for wget -i filename. If you want to know what the index numbers are, click the file link.

I've already done this; the pages are here on my server so we don't have to repeatedly load the OHL server.
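The fetch step can be sketched as below. The directory name pages and the one-second pause are my own choices; here I also write a toy one-line valid841.txt so the sketch stands alone, with the actual download left commented out since it needs network access.

```shell
# Stage Two sketch: fetch every listing URL named in valid841.txt.
mkdir -p pages

# Toy stand-in for the real valid841.txt (one full URL per line):
cat > valid841.txt <<'EOF'
https://openhouselondon.open-city.org.uk/listings/3416
EOF

# The actual download, pausing 1s between requests so the OHL
# server isn't hammered (uncomment to run for real):
#   wget --wait=1 --directory-prefix=pages --input-file=valid841.txt

wc -l < valid841.txt    # how many pages would be fetched
```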


Stage Three

Use grep and sed to find the latitude, longitude, and location name and write them to gmap.txt; this is a shell script.

[Screenshot: shell script in nano using grep and sed to find lat, long, and name]
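The extraction script is only shown as a screenshot, and the grep/sed patterns depend entirely on the OHL page markup, so the following is a hedged sketch. The meta tags and the toy page it creates are hypothetical stand-ins for whatever the real pages contain.

```shell
# Stage Three sketch: pull lat, long, and name out of each downloaded
# page and write one "lat,long,name" line per building to gmap.txt.
# The markup below is a hypothetical stand-in for the real OHL pages.
mkdir -p pages
cat > pages/3416 <<'EOF'
<title>Example Building</title>
<meta property="place:location:latitude" content="51.5007">
<meta property="place:location:longitude" content="-0.1246">
EOF

for f in pages/*; do
  lat=$(grep -o 'latitude" content="[^"]*' "$f" | sed 's/.*content="//')
  lon=$(grep -o 'longitude" content="[^"]*' "$f" | sed 's/.*content="//')
  name=$(grep -o '<title>[^<]*' "$f" | sed 's/<title>//')
  echo "$lat,$lon,$name"
done > gmap.txt

cat gmap.txt
```

The grep -o / sed pairing is the useful pattern here: grep -o isolates just the matching fragment, and sed then strips the leading boilerplate, leaving the bare value.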



Stage Four

Running the above script produces gmap.txt; importing that into Excel and saving it as a comma-separated value (.csv) file gives something that can be imported into Google as a layer on a Google Map.

This results in the linked Google Map.


Stage Five

I found some good things along the way. One is using exec to open a file for reading and writing; it could not be done more easily than that (see above).
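In shell, the exec trick looks like this. The descriptor number 3 and the file name scratch.txt are my own choices for illustration.

```shell
# Open a file for both reading and writing on descriptor 3.
# <> creates the file if it doesn't exist yet.
exec 3<> scratch.txt
printf 'hello\n' >&3    # write through the descriptor
exec 3>&-               # close it again
cat scratch.txt
```

Once opened this way, every later command in the script can share the one open handle instead of reopening the file each time.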

The other came from an occasion when I inadvertently prefixed the links file with row numbers. I had built the file valid841.txt, which, if you click the link, you will see is just a list of URLs.

Having read the file with cat, I then read it with cat -b valid841.txt > newfile.txt, which numbers the lines, and then deleted the original file. Whoops. See newfile.txt

This left me with a problem. I needed the unnumbered links for the Google map, but had to get rid of the 841 index numbers in the left-hand column without doing it by hand.

It was at this point I learned about awk, and more specifically awk '{print $2}' valid841.txt > newfile.txt (see first example), which ignores the left-hand column and prints only the right-hand one. Problem solved. Unix is wonderful.
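The mishap and the awk fix can be reproduced end to end on a toy file (links.txt and numbered.txt are illustrative names):

```shell
# Two sample URLs standing in for the 841 real ones.
printf '%s\n' \
  'https://openhouselondon.open-city.org.uk/listings/3416' \
  'https://openhouselondon.open-city.org.uk/listings/3417' > links.txt

# The mishap: cat -b prefixes each non-blank line with its number.
cat -b links.txt > numbered.txt

# The fix: awk splits on whitespace, so $1 is the number and $2 the URL.
awk '{print $2}' numbered.txt
```

awk's default field splitting on runs of whitespace is what makes this work: the line number and the URL become fields $1 and $2, and printing $2 drops the numbering without touching the links.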



Brought to you on a Low End Box courtesy of @otaneimi