I wanted to sample a broad cross-section of the internet, including sites with large development teams and long life spans as well as newer properties produced by passionate amateurs.
My list of domains to scrape started with the Alexa top 1,000 sites, representing the 'popular' and 'large' end of the spectrum. Alexa publishes a CSV report of its top 1 million sites, ranked by trailing-month traffic, so I took the first 1,000 entries from it.
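For context, that report was a zip containing a `top-1m.csv` of `rank,domain` rows (its historical S3 location has since been retired). Extracting the top 1,000 domains is a one-liner; the sample file below is a stand-in for illustration:

```shell
# top-1m.csv rows look like "rank,domain"; three stand-in rows for illustration
printf '1,google.com\n2,youtube.com\n3,facebook.com\n' > top-1m.csv

# keep the domain column of the first 1,000 ranks
head -n 1000 top-1m.csv | cut -d, -f2 > domains.txt
```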
To fill out the rest of the list, I randomly sampled Quick Left's mailing list, a mix of past and potential clients, employees, fans, and other folks from around the world. I considered sampling the Alexa top-million list at random instead, but even the top million skews toward larger sites when set against the estimated 271 million domains registered worldwide. Our mailing list seemed like a better source for the 'scrappy' end of the web: MVPs, personal sites, and the like.
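The sampling step itself is simple; assuming a domain-per-line export of the mailing list (the file names here are hypothetical), `shuf` draws a uniform random sample:

```shell
# stand-in mailing-list export, one domain per line
printf 'example.org\nexample.net\nexample.com\n' > mailing_list.txt

# draw a random sample (2 here; the survey needed roughly 9,400)
shuf -n 2 mailing_list.txt > sampled.txt

# append the sample to the domains gathered so far
cat sampled.txt >> domains.txt
```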
The final list contained 10,400 domains, a rough cross-section of the internet. From each domain's main page, I downloaded every linked CSS file with:
```shell
# -r: recurse; -H: span to other hosts (stylesheets often live on CDNs);
# -A '*.css*': keep only CSS files; --follow-tags=link: only follow <link> tags
cat domains.txt | xargs -I % wget http://% -r -A '*.css*' -H --follow-tags=link
```
This process collected about 28,000 CSS files from those 10,400 domains. I ran them through a CSS parser node module, which produced around 8.7 million records of selector, property, and value (e.g. span.important, font-weight, bold). These were saved in Postgres and heavily indexed for exploration.
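The actual survey used a Node CSS parser, but a rough awk sketch shows the shape of the extracted records (this toy version ignores @media blocks, comments, and values that themselves contain colons, such as url(http://...)):

```shell
# stand-in stylesheet for illustration
cat > sample.css <<'EOF'
span.important { font-weight: bold; color: red; }
EOF

# split the file on "}" so each record is "selector { declarations",
# then emit one "selector,property,value" row per declaration
awk 'BEGIN { RS = "}" }
{
  if (split($0, parts, "{") < 2) next   # skip fragments without a rule body
  sel = parts[1]
  gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "", sel)
  m = split(parts[2], decls, ";")
  for (i = 1; i <= m; i++) {
    if (split(decls[i], kv, ":") != 2) continue
    gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "", kv[1])
    gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "", kv[2])
    if (kv[1] != "") print sel "," kv[1] "," kv[2]
  }
}' sample.css > records.csv

cat records.csv
```

Rows in this shape map directly onto a three-column Postgres table, which is what makes the "heavily indexed" exploration cheap: an index per column answers questions like "which selectors set font-weight" with a single scan.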