Abstract
This post covers the oil/gas well and chemical data extracted from the FracFocus.org website by the staff of SkyTruth.org. The FracFocus.org website is gradually becoming the default repository for general hydro-fracturing well information as well as the chemicals used on each well during the fracturing process. The “findings” I present will be sample Structured Query Language (SQL) queries of a substantial subset of the extracted FracFocus.org data (800,000+ records).
When I first saw FracFocus.org’s website, I thought I had found an answer to my oil and gas well location information difficulties, as well of chemicals used in their operations. Then I tried to use it and encountered another “fracking wall.” As objective as I can be now and in my professional opinion based upon my experience cited in my biography below, FracFocus.org’s website at that time was intentionally constructed to make information extraction difficult.
I found a Frequently Asked Questions webpage upon which I found the following question:
“Are the records from FracFocus available in a digital format such as Excel?”
It was answered as follows:
“No. FracFocus was originally designed to serve records one at a time in Adobe pdf format in order to ensure accurate, unaltered and uncompromised data. Consequently the chemical information gathered does not currently reside in a database or spreadsheet format.”
Another question on the same page was as follows:
“Why can’t the system show me the information on more than one well at a time?”
It was was answered as follows:
“The purpose of the FracFocus records presentation system was to provide those who may live near a well that has been fractured with information concerning the materials used to fracture the well. All information other than the information used in the search form is available only in an Adobe pdf format. As such, information such as Ingredients, Trade Names and CAS numbers is not available for search or data aggregation purposes.”
It was this answer that surprised me the most. Once again, I was stymied. At the time this was only a “night job” self-education project, so I moved on to looking for well information from more easily-obtained sources.
SkyTruth.org Extracts Data from FracFocus.org
Several months later, while surfing the web, I noted that SkyTruth.org had extracted data from the FracFocus.org website. I went to the article and attempted to obtain the data from the links they had posted on the http://frack.skytruth.org/fracking-chemical-database/frack-chemical-data-download web page. I found it to be a little bit difficult to use for analytical reasons, so I applied some of my programming skills and pulled all of the extracted data into two data sets: one of “reports” that contained information about each well, and another of “chemicals” citing which chemicals were used in each reported well. I then joined these two data sets into a “blended” data set and pushed all three of the resulting data sets to the “Download” section of an open-source repository at http://code.google.com/p/fracking-analysis/.
Database Preparation
As I wished to do some Structured Query Language (SQL) analysis of the extracted data and take a swipe at creating a high-speed, easily-searchable website using Apache Solr 4 as the back-end search engine, I imported the data into two tables within a MySQL database on an Amazon Web Services (AWS) Relational Database Service (RDS) instance. I used an AWS RDS instance so that I could expand or contract its capacity and capability at will and as needed. Having been pointed to a source of Chemical Abstract Number information coupled with recognized and suspected toxicities, I consolidated this data into a table suitable for import into the MySQL database as well.
Once these three tables, as well as a “view” that joined the chemicals to their respective well reports, had been created in the MySQL database, I backed it up and posted it to the open-source repository at http://code.google.com/p/fracking-analysis/downloads/detail?name=frackanalysis_db_backup.zip&can=2&q=.
The tables presently contained within the database are as follows:
- “chemicals” – a table of approximately 800,000 records citing information about each chemical documented as having been used on the reported well.
- “reports” – a table of approximately 27,000 records citing information about each well reported as of September 21, 2012.
- “uvw-chemicals-reports” – a view joining the each record in the “chemicals” table with its respective “reports” table.
- “cas_chemicals” – a table of approximately 8,000 rows citing information about each chemical with its noted toxicity (e.g. cancer, respiratory, blood) and its category (e.g. recognized or suspected).
The above database and the tables within it are the basis of the queries cited below.
Simple Analytical Queries
Let’s start off with some simple analytical queries of the database. These will be prefaced with an explanation of their purpose in the format of a natural language query.
“What is the maximum number of chemicals entered for a reported well?”
SELECT MAX(`Count`) Max_Count FROM (SELECT COUNT(*) `Count` FROM frackanalysis.`chemicals` GROUP BY pdf_seqid) subqry1;
Which gives one an answer of 310.
“What is the minimum number of chemicals entered for a reported well?”
SELECT MIN(`Count`) Min_Count FROM (SELECT COUNT(*) `Count` FROM frackanalysis.`chemicals` GROUP BY pdf_seqid) subqry1;
Which gives one an answer of 1.
“What is the average number of chemicals entered for a reported well?”
SELECT AVG(`Count`) Avg_Count FROM (SELECT COUNT(*) `Count` FROM frackanalysis.`chemicals` GROUP BY pdf_seqid) subqry1;
Which gives one an answer of 29.0704.
More Complex Analytical Queries
Let’s continue with some more complex analytical queries of the database. These will be prefaced with an explanation of their purpose in the format of a natural language query.
“What is the maximum number of chemicals entered for a reported well that are recognized carcinogens?”
SELECT MAX(`Count`) Max_Count FROM (SELECT COUNT(*) `Count` FROM frackanalysis.`chemicals` c INNER JOIN frackanalysis.`cas_chemicals` cc ON c.`cas_number` = cc.`CAS_EDF_ID` WHERE c.`cas_type` = ‘valid’ AND cc.`Toxicity` = ‘cancer’ AND cc.`Category` = ‘recognized’ GROUP BY pdf_seqid) subqry1;
Which gives one an answer of 6.
“What is the average number of chemicals entered for a reported well that are recognized carcinogens?”
SELECT AVG(`Count`) Avg_Count FROM (SELECT COUNT(*) `Count` FROM frackanalysis.`chemicals` c INNER JOIN frackanalysis.`cas_chemicals` cc ON c.`cas_number` = cc.`CAS_EDF_ID` WHERE c.`cas_type` = ‘valid’ AND cc.`Toxicity` = ‘cancer’ AND cc.`Category` = ‘recognized’ GROUP BY pdf_seqid) subqry1;
Which gives one an answer of 1.2445.
A More Interesting Analytical Query
Let’s continue with some more interesting analytical queries of the database. These will be prefaced with an explanation of their purpose in the format of a natural language query.
“What are the most reported chemicals that are recognized carcinogens in descending order?”
SELECT * FROM (SELECT cc.`Chemical_Name`, COUNT(*) `Count` FROM frackanalysis.`chemicals` c INNER JOIN frackanalysis.`cas_chemicals` cc ON c.`cas_number` = cc.`CAS_EDF_ID` WHERE c.`cas_type` = ‘valid’ AND cc.`Toxicity` = ‘cancer’ AND cc.`Category` = ‘recognized’ GROUP BY cc.`Chemical_Name`) subqry1 ORDER BY `Count` DESC;
Which gives one the resultset below:
NAPHTHALENE 6691
BENZYL CHLORIDE 1636
FORMALDEHYDE 1563
EPICHLOROHYDRIN 378
ATTAPULGITE 334
NITRILOTRIACETIC ACID, TRISODIUM SALT MONOHYDRATE 259
ACRYLAMIDE 233
BIS(2-CHLOROETHYL) ETHER 204
ETHYLBENZENE 134
THIOUREA 96
ACETALDEHYDE 29
1,4-DIOXANE 11
ETHYLENE OXIDE 11
ANTIMONY TRIOXIDE 5
NITRILOTRIACETIC ACID 1
propylene glycol butyl ether 1
Summary
The queries above are just a few of the many that could be made against the cited database and its tables. A more thorough set of queries seem to be in order by professionals more versed in health studies. Now that a database suitable for analysis is available, such may yet be possible. As the records in the reports table contain latitude and longitude values, queries relevant to location seem to be in order as well. A list of recommendations concerning further development of this database appears below.
Recommendations
- A table of CAS chemicals along with their toxicity on a relative scale be found and imported into the database. Perhaps information from the Material Safety Data Sheets (MSDS) can be used to obtain the relative toxicity as the MSDS ranks chemicals on a scale from 0 to 4.
- A table of cities is obtained from a source like the National Atlas Data Download and the reported wells plotted in relation to said cities.
- A table of wells drilled be imported into the database so that a query of wells drilled versus wells reported to the FracFocus.org site can be made to assess the level of reporting compliance in jurisdictions in which reporting is mandatory.
- A table of schools is imported into the database so that proximity-oriented queries can be made against both wells and chemicals.
- A table of earthquakes is imported into the database so that proximity-oriented and time-sequenced queries can be made against reported wells.
- An open-source repository is created for storing and retrieval of both the data and analytical queries.
- An open-source, high-speed search engine web site is created as an aid to the data analysis community as well as a demonstration site for presentations to policy makers as to what is possible with a relative minimum of effort.
Biography
David Darling is a U.S. Army veteran (tactical nuclear weapons) with a B.S. in Zoology and over thirty years of professional experience in Information Technology as a software developer, corporate-level database developer/administrator and search engine developer/administrator. He presently authors software in Java, Java Server Faces, PrimeFaces, coupled with the Apache open-source technologies of Velocity, Solr, and Hadoop. Additional authorship in databases is done in SQL Server, MySQL, PostgreSQL/PostGIS, and Oracle. He especially enjoys any applications dealing with mapping, spatial data, and Geographical Information Services (GIS). He has extensive experience in petrochemical, law enforcement, local/county/regional governments, voter registration, taxation and property tax appraisal, telecommunications, real-time tracking (spatial/status), and health-care information.