Hail the data providers.
Detailed analysis produced in the public sphere by hobbyists, and by some exceptional talent headed towards team employment, couldn’t be accomplished without the fine work of another group of dedicated and articulate hobbyists.
Data providers are often thanked and almost always credited, and those who don’t properly credit data sources should be called out for such egregious behavior. But there is so much more to providing data than just getting a website up and running. That’s merely step one.
A few years back, I tried my hand at providing CHL data. Partnered with a talented developer, I embarked on trying to build a self-sustaining website that would update extended CHL player data nightly. We quickly found out that the logistics of maintaining the endeavor were immense. The work effort and maintenance costs were above and beyond anything a pair of hobbyists could provide. Support was fairly non-existent while data requests and display recommendations were plentiful, which really shows how pervasive data has become. Suggestions for further enhancements were overwhelming.
The site never made it out of beta.
I’d like to shine some light on some of the data providers that have made recent contributions in the public sphere, in the hopes of encouraging some public support and patronage. The work being done is exemplary, and since there’s a distinct need, having a variety of sources can be beneficial, especially if an NHL team decides they’d like to internalize one of the sites (see @ExtraSkater).
I’ve often cited some of my most commonly used sites, including Hockeyviz and Corsica Hockey.*** My most widely used data is tracked by Corey Sznajder, who provides different sets of data and visualizations, each used for a different purpose. Many analyses incorporate multiple data providers, used in coordination with each other.
For example, reading player line formations from Hockeyviz, then combining the available microstats from Sznajder’s tracking data with data from Corsica Hockey, can provide a robust enough data sample to perform advanced analysis and get a very good indication of a player’s performance with accompanying context.
*** Note: I invited Corsica to participate, but didn’t receive a response. The site is a fantastic resource and one of the longest standing sites currently in operation, if not the longest, following a reboot to Corsica 2.0. The site is administered by Emmanuel Perry.
I was working on the McKeen’s Hockey Yearbook at the time HockeyAnalysis.com went dark. I tried to access the site only to receive an error message. Considering the amount of incomplete work, sheer panic engulfed me entirely. The online world had lost the pre-eminent site for With or Without You analysis (WOWY for short) because founder David Johnson was snapped up by the Calgary Flames.
The immediate void was gargantuan, but fortunately for the online world’s WOWY users, Natural Stat Trick (NST) had already developed and included WOWY results on the site. Further site enhancements include a line tool to isolate different combinations of players (up to five) on the ice. The stampede to find the next available site for this kind of analysis could have been overwhelming, but NST saved the day.
The site offers more than just simple WOWYs, and a description here won’t do its functionality justice.
Brad Timmins, founder and proprietor of the site, explained a little about how that process played out in real time. I asked Brad a few questions about the site, its upkeep and maintenance challenges.
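For readers new to the idea, a WOWY boils down to splitting a player's ice time by whether a given teammate was also on the ice, then comparing results in each bucket. Below is a minimal Python sketch of that split using shot-attempt share. The `Segment` format, the player labels and the numbers are hypothetical, made up for illustration, and this is not how Natural Stat Trick computes its tables.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One stretch of play with a fixed set of skaters (hypothetical format)."""
    on_ice: frozenset
    shots_for: int
    shots_against: int


def shot_share(segments):
    """Shot-attempt share (in percent) over the segments, or None if no attempts."""
    shots_for = sum(s.shots_for for s in segments)
    shots_against = sum(s.shots_against for s in segments)
    total = shots_for + shots_against
    return None if total == 0 else 100.0 * shots_for / total


def wowy(segments, player, teammate):
    """Split segments into together / player alone / teammate alone."""
    together = [s for s in segments
                if player in s.on_ice and teammate in s.on_ice]
    player_only = [s for s in segments
                   if player in s.on_ice and teammate not in s.on_ice]
    teammate_only = [s for s in segments
                     if teammate in s.on_ice and player not in s.on_ice]
    return {
        "together": shot_share(together),
        "player_without": shot_share(player_only),
        "teammate_without": shot_share(teammate_only),
    }


# Hypothetical sample data: skaters "A" and "B" and invented shot counts.
sample = [
    Segment(frozenset({"A", "B"}), 6, 4),
    Segment(frozenset({"A"}), 3, 7),
    Segment(frozenset({"B"}), 5, 5),
]
result = wowy(sample, "A", "B")
```

In this toy data, the pair does better together than either skater does apart, which is the kind of pattern a WOWY table is meant to surface.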
· When Hockey Analysis went dark, this was the site that picked up the slack for WOWYs. How did this flock of new users affect your ability to keep the site up and running?
The first sign I had of Hockey Analysis going dark was when I started getting server resource alarms. I was using most of the server's resources to add older seasons because there's pretty much no traffic at the start of August, right? By that point the site had already ground to a halt.
In the end I did have to move up to a more powerful server, which is fortunately quick and easy to do.
· You’ve added new components to the site, including a variety of methods to download data. How have data accessibility requirements from users changed the way you’ve designed the site?
It's mostly about the little things, and recognizing the differences between what works for someone reading the site and what works best for someone doing work with the data. The eye and brain comprehend ice time as 6:37 much better than 6.617, but anybody who has tried to work with times in Excel knows the misery that can bring. So the site displays it as 6:37, but the copy and download buttons give it to you as 6.617. Things like hiding less-used columns by default to limit horizontal scrolling, but a button to download the entire table without having to un-hide all of the columns first.
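The ice time example Brad gives is easy to sketch in Python. The two helper functions below are hypothetical names chosen for illustration and are not taken from the site's code; they only show the conversion between the display format (6:37) and the decimal minutes that work better in a spreadsheet (6.617).

```python
def mmss_to_minutes(text: str) -> float:
    """Convert a 'minutes:seconds' display string to decimal minutes."""
    minutes, seconds = text.split(":")
    return int(minutes) + int(seconds) / 60


def minutes_to_mmss(value: float) -> str:
    """Convert decimal minutes back to a 'minutes:seconds' display string."""
    total_seconds = round(value * 60)
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


decimal_toi = round(mmss_to_minutes("6:37"), 3)  # value for copy/download
display_toi = minutes_to_mmss(6.617)             # value for the web page
```

Keeping one stored value and formatting it differently for reading and for export avoids the Excel time-parsing misery mentioned above.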
· Aside from WOWYs, are there any other metrics that you plan on adding to the site (e.g. WAR/GAR, expected goals)?
The current plans are more incremental than adding entirely new metrics to the site. Adding metrics that can already be calculated from what is there, but not being done for you yet. Adding more options to the filters - I've had several requests recently for 6v5 and 5v6 options. I'd like to expand the line tool to let you do line vs line or line vs pairing opposition WOWYs. I'd like to add expected goals at some point too, but developing and testing a model or even implementing someone else's is a big job.
· How do you manage costs associated with the site? Is there a Patreon page?
The first step is to keep the costs down, so I've put a lot of time into making everything run as efficiently as possible so it runs on the least expensive hosting possible. I mentioned having to move up to a more powerful server earlier, but it's still pretty low-end overall.
There is a Patreon for the site as well.
It has allowed me to add resources when they are needed, and even run a secondary server over the summer so I could make significant background changes without taking the live site down for maintenance.
The last point Brad made is more important than the statement makes it seem. Having a dedicated space to back up a production environment while performing upgrades or integrating new code is crucial, especially with the method of adding incremental functionality.
Relying on a production environment alone, without any testing/acceptance environment, makes developing and promoting enhancements and code changes tricky and fraught with potential bugs, or with downstream actions that cause problems with existing functionality.
Regression testing, which checks that newly added functionality didn’t break or alter current functionality, is important. Adding new functionality to the detriment of existing functionality just doubles the work effort.
Barlowe Analytics is run by Matt Barlowe through the Twitter feed (@BarloweAnalytic). There’s a unique quality to this ‘data provider’ because there’s no website. Data retrieval is conducted through a Twitter query, and the bot returns values or a chart in a reply tweet to the user who posted the query.
The two distinct elements of Barlowe Analytics are the Twitter bot and, maybe even more important, the tutorials on wide-ranging technical subjects, encompassing programming languages and visualizations.
My favorite may be the SQL modules, due to the language’s ease of use and structure (select these records, from this data source, where these conditions exist). The rest is just syntax that’s easily learned.
The Barlowe Analytics ‘query bot’ has recently been promoting game probabilities prior to matches, and recap data as well. Matt has also developed his own expected goals methodology, including code to allow users to create their own.
The bot on its own offers several individual features, including:
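That select / from / where structure can be tried out with nothing more than Python's built-in sqlite3 module. In the sketch below, the table name `ixg_leaders` and its column names are hypothetical, and the five rows are copied from the ixG leaders tweet quoted further down; this is not the schema used in the tutorials or by the bot.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ixg_leaders (player TEXT, team TEXT, ixg REAL)")
conn.executemany(
    "INSERT INTO ixg_leaders VALUES (?, ?, ?)",
    [
        ("MAXIME.LAJOIE", "OTT", 1.45),
        ("JEFF.SKINNER", "BUF", 0.78),
        ("CAM.ATKINSON", "CBJ", 0.72),
        ("CONOR.SHEARY", "BUF", 0.63),
        ("PIERRE-LUC.DUBOIS", "CBJ", 0.58),
    ],
)

# Select these records, from this data source, where these conditions exist.
rows = conn.execute(
    "SELECT player, ixg FROM ixg_leaders WHERE team = ? ORDER BY ixg DESC",
    ("BUF",),
).fetchall()
conn.close()
```

Here `rows` holds only the two Buffalo skaters, highest ixG first.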
2018-11-04 ixG Leaders:
MAXIME.LAJOIE OTT 1.45
JEFF.SKINNER BUF 0.78
CAM.ATKINSON CBJ 0.72
CONOR.SHEARY BUF 0.63
PIERRE-LUC.DUBOIS CBJ 0.58
— Barlowe Analytics (@barloweanalytic) November 5, 2018
2018-11-05 playoff probabilities (Probabilitè de Sèries Éliminatoires). pic.twitter.com/keBUs43VlD
— Barlowe Analytics (@barloweanalytic) November 5, 2018
Game Situation Probabilities:
Philadelphia Flyers @ Arizona Coyotes 2018-11-05:
Arizona Coyotes: 53.8%
Philadelphia Flyers: 46.2% pic.twitter.com/3vHDfrsHbY
— Barlowe Analytics (@barloweanalytic) November 5, 2018
Data is becoming more widely accessible across more sites, including NHL.com, but this is a unique option for data retrieval when internet usage is restricted, or for quick, easy answers. A quick query with a few variables returns immediate results. This functionality can benefit a wide variety of users, from a casual Twitter user to an analyst, or someone at the game with too little data to load heavy sites. Using any application, a user can get a quick chart or stat.
A primer on how to use the Twitter bot is below.
Ok the query bot is back in action and now with 2019 stats. Still working on updating to new database so may have future down time. You can read about how to use it here: https://t.co/jZzjGfm7n0 https://t.co/koojIXP5Nw
— Barlowe Analytics (@barloweanalytic) October 13, 2018
- What was the inspiration for the tutorials?
For me the inspiration was that I remember how difficult it was when I was starting to learn analytics. There’s a lot of material out there but almost none of it pertains to hockey or if it does it’s at such a high level that it would just be unintelligible for a beginner. So I wanted to give people tools to help them get started and make it easier for them than it was for me.
- In creating these tutorials, did they open any new avenues or introduce any ideas that you hadn’t considered in the past?
Maybe not the tutorials themselves but you get a lot of people asking questions or showing you stuff they’ve worked on that you can learn from.
The Barlowe query bot is a unique feature and perhaps the first – or at least one of the first – to make an appearance on Twitter.
- Was the intent here for quick data retrievals?
Yeah I believe I came up with it when Corsica was down or in the midst of its new implementation where it didn’t have all its old features up yet. One of my favorite things built on that old site was the rolling average graphs of certain stats over a time frame. I basically wanted to provide people a way to get those but I didn’t want to go to all the trouble of building a website mainly because it’s not something I’m very good at. But yeah, I wanted people to be able to get quick data on players and teams while just on their phone.
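The rolling average Matt mentions is a simple calculation: average the most recent N games, then slide the window forward one game at a time. A minimal Python sketch follows; the function name and the sample per-game values are hypothetical, and it is not the bot's actual code.

```python
def rolling_average(values, window):
    """Return the mean of each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    running_total = 0.0
    for index, value in enumerate(values):
        running_total += value
        if index >= window:
            # Drop the value that just slid out of the window.
            running_total -= values[index - window]
        if index >= window - 1:
            averages.append(running_total / window)
    return averages


# Hypothetical per-game values for one stat, smoothed over three games.
smoothed = rolling_average([1, 2, 3, 4, 5], window=3)
```

Plotting the smoothed values instead of the raw ones removes a lot of game-to-game noise, which is why those graphs are useful for spotting trends.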
- Have you been able to measure public usage of the bot to date?
I can see the notifications it gets when people tweet at it but that’s about all I use to measure its usage. There are a few people that really seem to enjoy the feature.
- Is there an intent to expand the bot functionality?
Yeah I want to add some more stats, mainly relative teammate statistics, to it that I’m building on my new database. I also want to change the syntax of the queries to make it a little easier to use and include the extra seasons from my new database. It’s currently still running on the old database which only goes back to 2015 I believe.
- Do you offer personal tutorial services for anyone interested in furthering their knowledge and/or functionality?
Yeah if anyone wants to set up private tutoring sessions I’d be more than happy to help accommodate that for a modest fee. My main focuses are Tableau, Python, R, and SQL. I also know a fair bit about some AWS services as well.
- How do you manage costs associated with the site? Is there a Patreon page?
No I currently do not have a Patreon and as of right now everything I do is free and the plan is to keep it that way. I won’t rule out asking for money in the future if things get too expensive but there are no plans to do so at the moment.
Hey everyone! We moved our website and now have our very own domain! https://t.co/9dDuSYvRyQ. Additionally, we've created a Patreon page to help with the cost of the server/website: https://t.co/prlbchHWjG. Any contributions are greatly appreciated. — EvolvingWild (@EvolvingWild) August 19, 2018
It’s one thing to create, write about and provide data on new(er) metrics, but there’s more involved than that. The Evolving Wild page is built by a set of twins (fitting that they hail from Minnesota, but this isn’t a baseball blog) who knocked the online world for a loop when they announced how their Twitter feed was run.
Josh and Luke have built this site and have been public proponents of the GAR/WAR debates. The 2018 Twitter WAR debate is a prime example. There’s a level-headed intent in their debate and a feedback loop they administer on their own.
Over this past weekend, they attempted a cool feature, showing just how their expected goals model interprets plays and the numerical values it assigns to them.
This is absolutely great.
Constantly testing and refining models strengthens the final results, adding versatility as new knowledge and technology overwrite past learnings. Developers and analysts can do this privately behind the scenes, tweaking and releasing the new data without any public transparency.
Here, the twins did it in a public setting, offering the ultimate form of transparency. The effort spawned some requests from other Twitter users.
The first thread is here:
Every now and then, we like to evaluate how our xG model is performing. Here are the top-10 expected goal values/events through 11/03/18 (from https://t.co/RHiWvqfo2N).
— EvolvingWild (@EvolvingWild) November 4, 2018
And an additional thread:
Another expected goal thread: the lowest expected goal events/shots that resulted in a goal. I bet you all can guess what number 1 is.
#1: ARI vs. OTT, 10/30/18
ARI - Stepan (G)
xG value: .0067 pic.twitter.com/pLQeOQBaWc
— EvolvingWild (@EvolvingWild) November 5, 2018
Here’s a Q and A with Evolving Wild.
Q: The site looks great and the functionality for users is slick and easy. Was there an inspiration for the design?
A: Thank you! But to be honest, not really. The site is made using the R programming language's Shiny package, which comes with built in themes. We've made some modifications to the CSS styles and have been very aware of the overall user experience (table layouts, charts, etc.), but the actual design of the website is essentially a stock theme.
Q: How long did the planning and coding take for the effort before the site went live?
A: In terms of actually writing the code to make the website, I'd say it's probably taken us ~3 months of coding 1-6 hours per day (give or take). The code that makes the numbers that go into the site has been a work in progress for over a year (at around the same rate). It's basically the culmination of all of the work we have done in hockey since learning R. On the other hand, we've had to re-do things multiple times... For instance, our GAR/WAR model is actually in its third iteration.
Q: The references page contains an interesting adaptation of the basketball RAPM (Regularized Adjusted Plus-Minus). Can you give us a brief overview of what RAPM is and how it applies to hockey?
A: RAPM (Regularized Adjusted Plus-Minus) is a regression technique that attempts to determine each skater's offensive and defensive contributions to the league scoring rate (Goals For, Corsi For, xG For per 60 - you could also use Fenwick shots or shots on goal). Using a regularized linear regression, you can account for each skater's teammates, opponents, the score state, zone starts, and additional variables to arrive at a rating that is *independent* of these factors. This is particularly useful for analysis of basketball and hockey in that we can remove the impact of each skater's teammates, opponents, etc. on their overall play. Teammates are by and large the most impactful aspect of a skater's performance (think Thornton and Cheechoo), so RAPM allows us to isolate a skater's contributions from their teammates (as well as the other factors that were mentioned) - to the best of our ability. It is similar to a relative to teammate metric (originally created on puckalytics.com, also available on corsicahockey.com) but with the added benefit of controlling for opponents and zones.
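To make the idea concrete, here is a heavily simplified Python sketch of a regularized (ridge) regression. It is not Evolving Wild's model: the skaters, stints, shot differentials and penalty value are all invented for illustration, each skater gets a single rating rather than separate offensive and defensive ones, and score state and zone starts are left out.

```python
def ridge_fit(rows, targets, lam):
    """Solve (X'X + lam*I) beta = X'y with Gaussian elimination."""
    n = len(rows[0])
    # Normal equations, with the ridge penalty added on the diagonal.
    a = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for k in range(col + 1, n):
            factor = a[k][col] / a[col][col]
            for c in range(col, n):
                a[k][c] -= factor * a[col][c]
            b[k] -= factor * b[col]
    # Back substitution.
    beta = [0.0] * n
    for i in reversed(range(n)):
        known = sum(a[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (b[i] - known) / a[i][i]
    return beta


skaters = ["A", "B", "C", "D", "E", "F"]  # hypothetical skaters


def encode(team_one, team_two):
    """+1 for team-one skaters on the ice, -1 for team-two skaters, else 0."""
    return [1.0 if s in team_one else -1.0 if s in team_two else 0.0
            for s in skaters]


# Hypothetical stints: (team-one skaters, team-two skaters,
# team-one shot-attempt differential per 60 minutes).
stints = [
    ({"A", "B"}, {"D", "E"}, 12.0),
    ({"A", "C"}, {"D", "F"}, 4.0),
    ({"B", "C"}, {"E", "F"}, -2.0),
]
X = [encode(one, two) for one, two, _ in stints]
y = [diff for _, _, diff in stints]
ratings = dict(zip(skaters, ridge_fit(X, y, lam=1.0)))
```

With six skaters and only three stints, an ordinary regression would have no unique answer. The penalty term `lam` is what makes the problem solvable and pulls ratings toward zero when the data cannot separate teammates, which is the "regularized" part of RAPM.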
Q: Goals/wins above replacement is still fairly new concept in hockey and the site captures raw data and visualizations. GAR/WAR are difficult concepts to introduce to a team sport like hockey. What has been the most difficult part of introducing your WAR/GAR models to the public?
A: Unfortunately, we haven't been able to finish the writeup for our GAR/WAR model yet (we presented it at RITSAC this past September -- slides here -- video here), so that's probably the first hurdle. But overall, I would say WAR models in hockey are inherently very complicated. You need to manage multiple strength states and multiple positions at the same time - and goalies need to be modeled in a separate way. In addition, the methods required to isolate a player's contributions are rather in-depth and (in some cases) can create problems that are difficult to identify (multicollinearity, specifically, which is the problem that arises when players spend a significant amount of time on the ice together). The biggest hurdle we anticipate is properly explaining our complete methodology in a way that is relatively easy to understand. Additionally, we've made our model available on a daily basis beginning at the start of this season. For us, we're very used to looking at the prior models (and current models in Manny's case) en masse or with larger time frames at play, so to speak, but looking at the model as it works in season is a new thing. We hope to present more research on when the model will "stabilize", but it's important to take the early season WAR numbers with a grain of salt.
Q: How do you manage costs associated with the site? Is there a Patreon page?
A: We have a Patreon page that can be found here. We are also working on additional features that will be available to patrons. Hopefully we should have something up soon!
There are a great number of other data providers that I haven’t touched upon, mainly because I’ve become a staunch user of the sites I listed above. I’d like to try to include more over the course of this season.