Tag Archive: Visualization


Today I was going to step back from working out what data to gather (and how to gather it) for the scientific locations, because I realized that I had skipped straight to the end, which probably won't work very well. The plan was to approach it the same way I did the ISWC demo: write out the framework for everything first (the skeleton blocks/loops, variable initialization, comments for later), which really helps in getting a good idea of how everything will fit together, and then slowly fill it in to build the functionality piece by piece, saving the main data gathering/visualization for last. However, I hit a snag immediately: the CS lab servers were down once again, and I don't really have anywhere else to test my PHP code.

I started writing up the framework anyhow, since I shouldn't really need to debug that, but stalled again while working on the initial query used to find the starting location. The problem was that my sample queries for, say, New York were not returning any answers! That seemed pretty bad, considering it should be a simple query, and being able to search the data by name is rather important. After much longer than it should have taken, I realized that SPARQL does not treat two literals as matching if their language tags differ…and that includes the case where one literal has no tag at all and the other does! Since I had been searching for plain "New York", I was getting no results, but as soon as I remembered to add the @en, it worked.


^Almost an hour of work lost because of a tag that I should have remembered from previous run-ins with them….
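To make the difference concrete, here is a simplified illustration (not the exact query I was running, and rdfs is the usual prefix, which the dbpedia endpoint already provides):

# Returns nothing, since the English labels in dbpedia carry a language tag:
SELECT ?place WHERE { ?place rdfs:label "New York". }

# Works once the tag is included:
SELECT ?place WHERE { ?place rdfs:label "New York"@en. }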

The first part of the project that I will be working on is the initial search, which will look up the location you enter, either by name or directly by latitude/longitude. If there are no results (when searching by name), it'll return to the search form and let the user know. If there is exactly one result, it'll proceed to a page with some useful information on the location, along with a form of filters that can be submitted to generate the corresponding visualization. At some point I want to make this information page function as an actual mash-up, if I can find other sources to draw data from, so it could be a neat demo by itself if enough time is spent on it. Until the whole thing is functional, though, it'll just show some info straight from the dbpedia results. Finally, if there is more than one result, it'll display a small bit of info on each one and prompt the user to select one, which will redirect to the info page/form for the visualization.

Another thing I found is that using a regex filter in a dbpedia query slows things down to the point where the queries never finish. I might look for better ways to do this later, but it makes sense that it happens, since the endpoint now has to run string processing against every candidate result. As a consequence, the search might be much less flexible than I was hoping, since it may need an exact match to find the location. I'm hoping to at least find a way to handle capitalization issues, but any of these optimizations will wait until I get the thing functional, which may or may not happen this semester.
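For reference, this is roughly the kind of filtered query that never came back for me (a sketch rather than the exact one I tried):

SELECT ?place ?lat ?long
WHERE {
?place rdfs:label ?label.
?place geo:lat ?lat.
?place geo:long ?long.
FILTER regex(str(?label), "new york", "i")
}

In principle the "i" flag would take care of the capitalization problem as well, but only if the query ever finishes.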

Next time I hope to get the initial search working, along with the handling for 0 and 2+ results. Then I'll work on the info page, and finally move on to the form for the visualization and the actual map itself.

An idea/challenge that John had was to use semantic technologies to automate the creation of a Google Map visualization of scientific locations around a specific place; the examples showed things like notable natural places, labs, museums, etc. around a given city. I haven't had much time to work on the idea, between my computer being in the repair shop most of last week and catching up on the many projects/assignments that built up, so today was really my first in-depth look at it. I decided to spend today looking at queries, to get an idea of how to reliably get actual location data for scientific locations. As a rough guideline, my idea is for the end product to take the name of a place, figure out what kind of place it is (probably with user input to clarify), and then select a set of queries that will work for that kind of place to get the results for the map.

So, with that in mind, the first thing I did was to look at what sorts of types are associated with locations, which is key both for finding what to center the map around and for finding and distinguishing between the different kinds of places around it.

select distinct ?name, ?lat, ?long, ?label, ?type
{
?name a ?type.
?name geo:lat ?lat.
?name geo:long ?long.
?name rdfs:label ?label.
} ORDER BY ?name

This query basically grabs every URI that has a latitude/longitude, along with its label and type. Using this, I can see the range of types I might be able to use later to find and differentiate locations.

Next I tried to narrow down the results somewhat, using Place and Feature specifications, as well as limiting the results to ones with an English label.

select distinct ?name, ?lat, ?long, ?label
{
{?name a <http://dbpedia.org/ontology/Place>} UNION {?name a <http://dbpedia.org/ontology/Feature>}.
?name geo:lat ?lat.
?name geo:long ?long.
?name rdfs:label ?label.
FILTER langMatches( lang(?label), "EN" )
} ORDER BY ?name

However, although I see the natural places/features I would expect, I don't see labs or colleges, which I will definitely need. I suspect the first query returned so many results that I'm only seeing a subset that happens not to include what I'm looking for, so I narrowed the query down to return just the distinct types.

select distinct ?type
{
?name a ?type.
?name geo:lat [].
?name geo:long [].
} ORDER BY ?type

Looking at the sheer number of results, I realized that a ridiculous number of things apparently have a latitude and longitude, so I tried to narrow it down to just the overarching themes by using a regex on what looks like the top level of the dbpedia ontology.

select distinct ?type
{
?name a ?type.
?name geo:lat [].
?name geo:long [].
FILTER regex(str(?type),"^http://dbpedia.org/ontology/")
} ORDER BY ?type

This looks promising…I see themes for sites of special scientific interest, educational institutions, protected areas, historical sites, and a lot of other categories which I am hoping are included under the Places/Features from earlier.

I ran a brief query just to check that these things do, in fact, all fit under Places/Features:

select distinct ?name, ?lat, ?long, ?label, ?type
{
?name a ?type.
{?name a <http://dbpedia.org/ontology/Place>} UNION {?name a <http://dbpedia.org/ontology/Feature>}.
?name geo:lat ?lat.
?name geo:long ?long.
?name rdfs:label ?label.
FILTER langMatches( lang(?label), "EN" )
} ORDER BY ?name

Next, I need to look at the properties of the different kinds of places to get an idea of which attributes I can use to narrow down the full queries later. I'm thinking of using queries like this:

select ?subject, ?property, ?object
{
?subject a <http://dbpedia.org/ontology/EducationalOrganization>.
?subject ?property ?object
} ORDER BY ?subject LIMIT 10

A few key issues will be figuring out how to use these attributes to make sure the results really are scientific locations, and keeping the queries from taking too long. I am thinking of having the final code run several queries, each tailored to a specific kind of 'scientific location', and aggregate all of that data into the map one by one; that way each query can be as small as possible for its case. Based on the original process of making the maps manually, the automated process needs to be able to plot museums, learned societies (would that be something like the USGS headquarters?), universities/colleges, libraries, and historic sites relevant to science.
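As a rough sketch of what one of those tailored queries might look like (reusing the EducationalOrganization type from above purely as an example, with a made-up bounding box that would really come from the initial location search):

SELECT ?place ?label ?lat ?long
WHERE {
?place a <http://dbpedia.org/ontology/EducationalOrganization>.
?place rdfs:label ?label.
?place geo:lat ?lat.
?place geo:long ?long.
FILTER langMatches( lang(?label), "EN" )
FILTER ( ?lat > 42.5 && ?lat < 42.9 && ?long > -73.9 && ?long < -73.5 )
}

The type on the first line would then be swapped out for each kind of scientific location, keeping each individual query as small as possible.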

I mostly did general queries this time, getting an idea of the kinds of categories I'll be able to use when searching for places, but not much on specific location types and attributes, which I'll probably look at next time.

On a side note, you can now go here to see the W3C Semantic Web Wiki page for the ISWC 2010 Data/Demos and if you scroll down to Browsers Developed for ISWC, my Filtered Browser is listed!

This post will detail the important segments of the Python script that I am using to generate my visualization demo, found here, with accompanying snippets. Since part of the reason for making the script was the hope that future visualizations of similar queries and graphs could be created quickly by adapting it, hopefully this will help anyone trying to do so.

The first segment of the script deals with the query itself, loading the query results from the endpoint into an output file for use by the parsing section. Basically, this code downloads the contents of the URL, which is the direct link to the JSON-format output from the wineagent SPARQL endpoint.

queryResults = urllib.urlopen('http://wineagent.tw.rpi.edu:2020/books?query=PREFIX+rdf%3A+%0D%0APREFIX+tw%3A+%0D%0APREFIX+twi%3A+%0D%0APREFIX+foaf%3A+%0D%0ASELECT+%3FLocation+(count(%3FLocation)+AS+%3FNumResearchers)%0D%0AWHERE+{+%0D%0A+++++%3FResearcher+a+foaf%3APerson.+%0D%0A+++++%3FResearcher+tw%3AhasAffiliation+twi%3ATetherlessWorldConstellation.%0D%0A+++++%3FResearcher+tw%3AhasLocation+%3FLocation.%0D%0A}%0D%0AGROUP+BY+%3FLocation&output=json')
queryFile = open("queryResults","w")
for lines in queryResults.readlines():
    queryFile.writelines(lines)
queryFile.close()

The code to do so is fairly straightforward. It opens the URL, which is hardcoded into the script, and opens the queryResults file for writing. It then reads each line from the URL and writes it to the output file. After the loop finishes, the file is closed.

After this, it reopens the results file, this time as read-only, along with the visualization.html output file, in preparation for the next section of code, which parses the data needed for the visualization out of the query results.

data = {}
valCheck = False
for lines in queryFile.readlines():
    if valCheck:
        loc = lines.find('"value":') # Location of the URI's associated value
        if loc >= 0:
            uriValue = lines[loc+10:-4]
            data[uriName] = uriValue
            valCheck = False
    else:
        loc = lines.find("http://tw.rpi.edu/instances/") # Location of the URI
        if loc >= 0:
            uriName = lines[loc-1:-5]
            valCheck = True

To do this, a dictionary is created, which is Python's associative container, similar to the map structure in the C++ STL. The script then loops over the file line by line. The else block is the first one that should trigger, since it finds the first line of data for each record (the location URI, in this case). Once that is found, valCheck signals that an upcoming line will contain the value associated with that location. This loop is specifically tailored to the output of the endpoint and would have to be changed any time that output changes significantly; thanks to the consistent formatting of the endpoint output, though, those changes would not take long. Another note about the code is that the saved data depends on the slice indices, which just cut out a specific substring…again very specifically tailored to the output, but also very easy to alter. After the dictionary is complete, the next step is to take all of that data and write it into the formatted JSONObject string for the Google visualization API.

jsonObjectList = []

# Column names
jsonObjectList.append("var JSONObject={cols:[{id:'s',label:'Locations',type:'string'},{id:'o',label:'Number of researchers',type:'number'}],rows:[")

# The rest of the JSONobject
for k, v in sorted(data.items()):
	jsonObjectList.append("{c:[{v:")
	jsonObjectList.append(k)
	jsonObjectList.append("},{v:")
	jsonObjectList.append(v)
	jsonObjectList.append("}]},")
jsonObjectList.append("]};")

# Generate full string
jsonObject = ''.join(jsonObjectList)

A list is created to hold the different segments of the string, which are joined together at the end. The first addition is the column definition, which is hardcoded. The rest is generic, simply pulling all of the dictionary key/value pairs and writing them out in the correct format.

The largest section is the HTML generation, simply because so many lines of the HTML are hardcoded in. You basically just need to find a Google Visualization example for your desired chart (I used a horizontal bar chart), edit the caption/label/options information to match your visualization, and write it into a string. The output is built in three parts: the first half of the HTML, up to the point where the JSONObject string is needed; then the newly generated string; then the rest of the HTML. Finally, the whole thing is written to the visualization.html file and all the files are closed. Done!
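In sketch form, that final step looks something like the following (htmlFirstHalf and htmlSecondHalf stand in for the long hardcoded chunks copied from the Google Visualization example, which I won't reproduce here, and jsonObject is the string built in the previous snippet):

# Placeholders for the hardcoded halves of the Google Visualization example page
htmlFirstHalf = "<html><head><!-- chart setup from the example goes here --></head><body><script>"
htmlSecondHalf = "</script></body></html>"

# Stitch the generated JSONObject string into the page and write it out
htmlFile = open("visualization.html","w")
htmlFile.write(htmlFirstHalf + jsonObject + htmlSecondHalf)
htmlFile.close()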

The result is a script that you just run to get a ready-to-go HTML file to upload. As noted in my previous blog post, there are a number of advantages to doing it this way. In short, it is much better for future maintenance than translating the endpoint output to the JSONObject manually each time, and it is more robust than a dynamic webpage that tries to reload the results every time it is loaded. This approach strikes a compromise: a dynamic script that generates a static page whenever an update is needed.

Screenshot of the Visualization

Today I did the actual coding of the Python script I had planned last time. What started as an idea for a script to convert the SPARQL endpoint output into the input needed for the JSON object grew to include generating the entire HTML of the visualization, and finally ended up with the script grabbing the query data from the endpoint by itself as well.

The script is divided into four main elements.  First, the script accesses the URL for the actual JSON-format results from the SPARQL endpoint, copying the output into a queryResults file.  It then reads these results, parsing out the needed data into a dictionary, using the rooms as a key and the number of researchers as the value.  Using this dictionary, the JSON object needed for the Google visualization API is built.  Finally, the HTML for the page is output, inserting the JSON object line into the correct place.  The final script output is visualization.html, which can be uploaded and viewed online.

This method seems roundabout, but there were a few reasons why I wanted to do it this way. I considered doing the same thing entirely within a dynamic webpage, which would have the benefit of always being up to date. However, when I was looking at other visualization demos, I realized that several no longer functioned because their source endpoint was gone. By having my actual page be completely static, this won't be a problem. On the other hand, if I had simply typed out the JSON object manually, I would have to do it manually again any time I wanted to update the information. This method means that I can generate a new page anytime I want by running the script, but the current webpage won't ever have endpoint connection issues. Even if the endpoint moves and I need to update the page (unlikely, since this is a demo, but I was trying to treat it as a maintainable task), I will just have to change the URL in the script, and perhaps the parsing if the output format is different. The final reason is that when I was brainstorming how to do this, I didn't know whether a general-purpose language like C++ would give me easy access to a webpage's contents, and I didn't like the idea of possible broken links if I used a web language like ASP/PHP. I did know from past brainstorming that Python had all the needed methods and capabilities, so I went with that. Ironically, this was my first real use of Python, other than one or two small utilities I made for work…I got a lot of use out of Google and a reference book from the library while doing this! It was a crash course in both SPARQL and Python.

My SPARQL visualization demo can be found at http://www.rpi.edu/~ngp2/TWC/visualization.html

So today I basically began working out how I am going to create the actual visualization that I want. To reiterate, I am going to make a visualization of the various locations that TWC researchers are in, showing each location and the number of people at it.

When I first started, I was thinking of just creating a script to translate the output from the SPARQL endpoint into the JSON code needed for the JSONObject in the Google visualization code, but while working out how I wanted to do that, I realized that I could perhaps make the script do everything: load the query data, generate the JSON code, and integrate it into a full HTML file that I could simply upload to get the visualization.

I think this is possible since the query to the endpoint is encoded entirely within the URL, so if I can get the script to load a webpage, I can just drop the URL in to retrieve the data. From there, it is just a matter of having the script convert the results into the format needed by the visualization code, generate the rest of the visualization code around that, and output the whole thing.

As far as the actual conversion goes, my initial plan was to do some string parsing to find instances of the location URIs, keeping track of the counts in a hash table/dictionary structure. From that, creating the data needed for the visualization should be easy. The only drawback is that the script won't be directly runnable online; I'd run it myself and then upload the output.
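For what it's worth, the counting I had in mind would have looked something like this (just a sketch of the idea, assuming the query is trimmed down so that the only instance URIs in the results are the locations; the file name and the way the URI is pulled out of each line are guesses at this point):

queryFile = open("queryResults") # raw JSON results saved from the endpoint
locationCounts = {}
for line in queryFile.readlines():
    loc = line.find("http://tw.rpi.edu/instances/") # a location URI in this line
    if loc >= 0:
        uri = line[loc:].split('"')[0] # take the URI up to its closing quote
        locationCounts[uri] = locationCounts.get(uri, 0) + 1
queryFile.close()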

Later, while doing the planning and some of the framework for the code, I realized that I could rework the query once again to return just the data that I need for the visualization. While doing this, I tested whether the endpoint would let me use grouping/aggregation operators to do the counting work for me, and it did, simplifying things even more. The new query that I will be using is as follows:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX tw: <http://tw.rpi.edu/schema/>
PREFIX twi: <http://tw.rpi.edu/instances/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?Location (count(?Location) AS ?NumResearchers)
WHERE {
?Researcher a foaf:Person.
?Researcher tw:hasAffiliation twi:TetherlessWorldConstellation.
?Researcher tw:hasLocation ?Location.
}
GROUP BY ?Location

This post will have some of the more interesting queries that came out of our meeting; to keep them short, I am leaving the prefixes out of the queries below. The prefixes used were:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX tw: <http://tw.rpi.edu/schema/>
PREFIX twi: <http://tw.rpi.edu/instances/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

Today I worked with Cameron on a SPARQL visualization exercise, which will end with a visualization of data retrieved by a SPARQL query, using the Google visualization API. As it turns out, we spent a while constructing various queries. We started by looking at a simple SPARQL query that retrieved all the data for one of us in RDF form, then each tried to figure out how to pull both of our information at once, finally ending up with the following query:

DESCRIBE ?s
WHERE {
?s a foaf:Person.
?s foaf:name ?o.
FILTER (?o = "Philip Ng"@en || ?o = "Cameron Helm"@en)
}

I basically arrived at this after figuring out what the actual triples were, starting with a SELECT on ?s ?p ?o and then building up the filter to match what I wanted, which is why the query still has its first and last variables named that way. After this query, we wanted to build one that would produce a table of all of the undergraduates. We first tried a query using tw:hasRole, but found that for some reason it doesn't return the triples at all; all of those predicates have blanks as their objects. We decided to cast a wider net first, which resulted in this query:

SELECT ?s ?p ?o
WHERE {
?s a foaf:Person.
?s tw:hasAffiliation twi:TetherlessWorldConstellation.
?s ?p ?o.
}

I had left the ?p and ?o in so I could see the various kinds of results returned for each person, looking for interesting data that we could try to build queries around next. We tried a whole bunch of queries at this point, but none as interesting; it was mostly playing around with various combinations of triples and filters to see what would come back. After a bit, I noticed the location triples for everyone, and quickly changed the query to the following:

SELECT ?Researcher ?ResearcherName ?Location
WHERE {
?Researcher a foaf:Person.
?Researcher foaf:name ?ResearcherName.
?Researcher tw:hasAffiliation twi:TetherlessWorldConstellation.
?Researcher tw:hasLocation ?Location.
}

This pulls all the affiliated people once again, but now with their locations. At this point, we realized that since all of the undergraduates (and only the undergraduates) should be in the same room, we had found an indirect way to reach our original goal:

SELECT ?UndergradStudent ?UndergradStudentName ?Location
WHERE {
?UndergradStudent a foaf:Person.
?UndergradStudent foaf:name ?UndergradStudentName.
?UndergradStudent tw:hasAffiliation twi:TetherlessWorldConstellation.
?UndergradStudent tw:hasLocation ?Location.
FILTER( ?Location = twi:RPI_Winslow_1148A )
}

However, for the visualization part, I am thinking of using the previous query, with all the TWC people and their locations, to graph the rooms and the number of people in each one, since the undergrad query is interesting but doesn't really produce anything graphable. We did not actually finish the visualization part, as we ran out of time to figure out how to produce a JSON translation compatible with the Google API, but we were pretty satisfied with our query work and left feeling much more comfortable with SPARQL.
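One thing I want to test is whether the endpoint supports aggregation, since then the counting could happen in the query itself; something along these lines is what I have in mind (a sketch only, as I have not actually tried COUNT/GROUP BY against this endpoint yet):

SELECT ?Location (count(?Researcher) AS ?NumResearchers)
WHERE {
?Researcher a foaf:Person.
?Researcher tw:hasAffiliation twi:TetherlessWorldConstellation.
?Researcher tw:hasLocation ?Location.
}
GROUP BY ?Location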

The SPARQL endpoint that we used while doing this was at http://wineagent.tw.rpi.edu:2020/query.html