Amazon Athena Query Federation — A transformational technology flying under the radar

  • Query language: Being able to create queries on the fly lets analysts and data scientists quickly perform exploratory data analysis and pivot as the structure of the data or the business requirements change. However, without the structure of an API, it is easy to introduce bugs and errors into production environments unless a structured process is followed.
  • Application Programming Interface (API): A well-defined API leaves less confusion about what is being sent and received. However, building APIs and writing code takes time, and both often need to change as schemas change and business needs evolve.
  • The data needs to be copied and converted. Depending on the data size, the computing resources and time required can be significant. Furthermore, all of the data must be converted before we can use it: no computation on the new dataset can start until the conversion completes.
  • The complete dataset being processed needs to fit into memory. If it doesn’t, the system may crash, or performance will degrade sharply as data is read from disk.
  • CSV, TSV, PSV, or any other delimited data
  • Apache Parquet
  • JSON
  • Apache ORC
  • Apache Avro
  • Hadoop Sequence Files
  • Apache and Nginx server logs
  • Log files
  • HBase
  • Hive
  • Kafka (streaming data)
  • MapR-DB
  • MongoDB
  • Open Time Series Database
  • Nearly all relational databases with a JDBC driver
  • Hadoop Distributed File System
  • MapR-FS
  • Amazon S3
  • It can take full advantage of modern hardware, including the latest GPUs and CPUs
  • It supports zero-copy reads enabling fast data access without incurring serialization overhead
  • It supports data adjacency for sequential access (scans)
  • It has O(1) (constant-time) random access
  • It is SIMD and vectorization-friendly
  • It is relocatable without “pointer swizzling”, allowing for true zero-copy access in shared memory
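To make the zero-copy and constant-time access properties above more concrete, here is a minimal sketch using only the Python standard library. It does not use Apache Arrow itself; the `array` and `memoryview` types merely stand in for a contiguous columnar buffer, and the column name `employee_ids` is an assumption made for illustration.

```python
from array import array

# A "column" of 64-bit integers stored contiguously in memory,
# loosely analogous to how a columnar format lays out one field.
employee_ids = array("q", range(1, 1001))

# memoryview exposes the same underlying buffer without copying it.
view = memoryview(employee_ids)

# Slicing a memoryview is zero-copy: the slice shares the original buffer.
window = view[10:20]

# Random access is O(1): the element address is offset + index * itemsize.
first_before = window[0]

# Show that the slice is a view, not a copy: a write through the
# original array is visible through the slice.
employee_ids[10] = -1
first_after = window[0]

print(first_before, first_after, window.nbytes)
```

Because the slice and the original share one buffer, no bytes are serialized or duplicated when the window is handed to another consumer, which is the same idea the list above describes for shared-memory access.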
SELECT * FROM e.`employee.json` LIMIT 3;
+---------------+------------------+-------------+
| employee_id   | full_name        | first_name  |
+---------------+------------------+-------------+
| 1             | Joe Robert       | Joe         |
| 2             | Steve Lloyd      | Steve       |
| 3             | Bill McCaffery   | Bill        |
+---------------+------------------+-------------+
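The query above reads a JSON file in place, with no prior copy or conversion step. As a rough illustration of that idea (and not of how any particular engine implements it), the sketch below streams newline-delimited JSON and stops after three records. The first three rows mirror the output above; the fourth row, the `RAW` buffer, and the helper `select_all` are hypothetical and exist only to show that the limit stops the scan early.

```python
import io
import json
from itertools import islice

# Stand-in for employee.json as newline-delimited JSON.
# The fourth record is hypothetical, added so the limit has something to cut off.
RAW = io.StringIO(
    '{"employee_id": 1, "full_name": "Joe Robert", "first_name": "Joe"}\n'
    '{"employee_id": 2, "full_name": "Steve Lloyd", "first_name": "Steve"}\n'
    '{"employee_id": 3, "full_name": "Bill McCaffery", "first_name": "Bill"}\n'
    '{"employee_id": 4, "full_name": "Hypothetical Person", "first_name": "Hypothetical"}\n'
)

def select_all(lines, limit):
    """Parse records lazily and return at most `limit` of them."""
    records = (json.loads(line) for line in lines if line.strip())
    return list(islice(records, limit))

rows = select_all(RAW, limit=3)
for row in rows:
    print(row["employee_id"], row["full_name"], row["first_name"])
```

Because the records are parsed lazily, the fourth line is never decoded: the schema is discovered while reading, and work stops as soon as the limit is met.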
  • Amazon DynamoDB
  • Amazon DocumentDB
  • All Amazon RDS flavors
  • Many JDBC-compliant databases
  • Amazon CloudWatch
  • Enabling query parallelism
  • Filter predicate pushdown
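The two optimizations listed above can be sketched in a few lines. This is an illustrative toy, not the Athena Query Federation SDK: the partition data, `scan_partition`, and `federated_scan` are all hypothetical names. The sketch assumes the source can be split into independent partitions and that each partition can evaluate the filter itself.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical source, pre-split into partitions that can be scanned independently.
PARTITIONS = [
    [{"id": 1, "region": "us"}, {"id": 2, "region": "eu"}],
    [{"id": 3, "region": "us"}, {"id": 4, "region": "ap"}],
    [{"id": 5, "region": "eu"}, {"id": 6, "region": "us"}],
]

def scan_partition(partition, predicate):
    """Predicate pushdown: filter where the data lives, return only matches."""
    return [row for row in partition if predicate(row)]

def federated_scan(partitions, predicate, max_workers=4):
    """Query parallelism: scan every partition concurrently, then merge."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves partition order in its results.
        results = pool.map(lambda p: scan_partition(p, predicate), partitions)
        return [row for rows in results for row in rows]

us_rows = federated_scan(PARTITIONS, lambda row: row["region"] == "us")
print([row["id"] for row in us_rows])
```

Filtering inside `scan_partition` means only matching rows cross the boundary between the source and the query engine, while the thread pool lets slow partitions overlap instead of running back to back.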

Data Scientist, Artificial Intelligence, Machine Learning, Author of “Artificial Intelligence with Python”

Alberto Artasanchez
