Southern California Earthquake Data Center

Index Based Searches

This page describes how you can search of the SCEDC dataset through index files. There are index files in comma delimited (csv) format and also in Apache Parquet format. While you can do some simple filtering based on file names, these columnar indexes also allow you to do searches based on latitude and longitude of the seismic station. Download of the index files is free of charge under the AWS Open Data Set Program. You can also use an AWS service like AWS Athena to run sql queries enable you to do more precise queries. AWS Athena does not have a free tier, but has an attractive cost model at $5/TB scanned. The tutorial below shows an example of how to run sql queries using Athena.

 

Index Columns

Both index formats have these columns:

ms_filename (string)
net (string)
sta (string)
seedchan (string)
location (string)
lat (double)
lon (double)
sample_rate (double)

 

Tutorial: Using AWS Athena to Search

  1. Open the Athena query editor. You will need an AWS account. Refer to AWS Athena documentation on setup. Make sure you are in the us-west-2 region.
  2. Create a database (in this demo called "scedcindex") enter command:
    CREATE DATABASE scedcindex
    and press "Run query"
  3. Make sure you are using the scedcindex database and create a new query.
  4. Create a table (in this demo called "scedc_parquet") using the parquet index with this query:
    CREATE EXTERNAL TABLE IF NOT EXISTS scedc_parquet ( `ms_filename` string, `net` string, `sta` string, `seedchan` string, `location` string, `lat` double, `lon` double, `sample_rate` double ) PARTITIONED BY ( year int, year_doy string ) STORED AS PARQUET LOCATION 's3://scedc-pds/index/parquet/' TBLPROPERTIES ("parquet.compress"="SNAPPY");
  5. Then run this query:
    MSCK REPAIR TABLE scedc_parquet
  6. You are then able to run a sql query on this table. For example, this query below returns 56 rows and scans 139.46 KB:
    select net,sta,ms_filename, sample_rate from scedc_parquet where seedchan = 'BHE' and year_doy > '2016_150' and year_doy<'2016_157' and lat between 34.00 and 34.50 and lon between -117 and -116 order by ms_filename
    If you had done the same query using the csv index - 916.21 KB would have been scanned. The parquet format is compressed and therefore allows you to scan more data at a lower cost.

 

References

  • AWS Athena Documentation
  • AWS Getting Started Resource Center Starter page for learning about what types of AWS resources are available.
  •  

    top