Cloud Hosted Archive
Index Based Searches
This page describes how you can search of the SCEDC dataset through index files. There are index files in comma delimited (csv) format and also in Apache Parquet format. While you can do some simple filtering based on file names, these columnar indexes also allow you to do searches based on latitude and longitude of the seismic station, or magnitude of an earthquake. Download of the index files is free of charge under the AWS Open Data Set Program. You can also use an AWS service like AWS Athena to run sql queries enable you to do more precise queries. AWS Athena does not have a free tier, but has an attractive cost model at $5/TB scanned. The tutorial below shows an example of how to run sql queries using Athena.
Index File LocationsThe index files are in the following locations:
Continuous waveforms: continuous_waveforms/index/
Earthquake catalogs: earthquake_catalogs/index/
index formats have these columns for continous waveforms and earthquake catalogs:
|ms_filename (string)||eventdate (date)|
|net (string)||eventtime (string)|
|sta (string)||eventtype (string)|
|seedchan (string)||geographical type (string)|
|location (string)||magnitude (double)|
|lat (double)||magnitudetype (string)|
|lon (double)||latitude (double)|
|sample_rate (double)||longitude (double)|
Tutorial: Using AWS Athena to Filter Continuous Waveforms by Lat/Lon
- Open the Athena query editor. You will need an AWS account. Refer to AWS Athena documentation on setup. Make sure you are in the us-west-2 region.
- Create a database (in this demo called "scedcindex") enter command:
CREATE DATABASE scedcindex
and press "Run query"
- Make sure you are using the scedcindex database and create a new query.
- Create a table (in this demo called "scedc_parquet") using the parquet index with this query:
CREATE EXTERNAL TABLE IF NOT EXISTS scedc_parquet ( `ms_filename` string, `net` string, `sta` string, `seedchan` string, `location` string, `lat` double, `lon` double, `sample_rate` double ) PARTITIONED BY ( year int, year_doy string ) STORED AS PARQUET LOCATION 's3://scedc-pds/continuous_waveforms/index/parquet/' TBLPROPERTIES ("parquet.compress"="SNAPPY");
- Then run this query:
MSCK REPAIR TABLE scedc_parquet #NOTE: This can take 40-50 min.
- You are then able to run a sql query on this table. For example, this query below returns 56 rows and scans 139.46 KB:
select net,sta,ms_filename, sample_rate from scedc_parquet where seedchan = 'BHE' and year_doy > '2016_150' and year_doy<'2016_157' and lat between 34.00 and 34.50 and lon between -117 and -116 order by ms_filename
If you had done the same query using the csv index - 916.21 KB would have been scanned. The parquet format is compressed and therefore allows you to scan more data at a lower cost.