Data engineering thoughts on the fraud detection tutorial
https://docs.hopsworks.ai/4.2/tutorials/
I need to pick up advanced SQL; I haven't touched much beyond SELECT, JOIN, and GROUP BY.
CTE: named subquery, more readable and composable
Window: aggregates without collapsing rows. OVER (PARTITION BY ...)
WITH temp_table AS (
    SELECT user_id, COUNT(*) AS event_count
    FROM events
    GROUP BY user_id
)
SELECT *
FROM temp_table
WHERE event_count > 10;
SELECT
    user_id,
    event_time,
    COUNT(*) OVER (
        PARTITION BY user_id
        ORDER BY event_time
        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
    ) AS rolling_event_count
FROM events;
-- rule: location check
WITH CustomerLocations AS (
    SELECT
        c.customer_id,
        c.customer_latitude,
        c.customer_longitude,
        ca.account_number
    FROM Customers c
    JOIN Customer_Accounts ca ON c.customer_id = ca.customer_id
),
PotentialFraud2 AS (
    SELECT
        t.txn_id
    FROM Transactions t
    JOIN CustomerLocations cl ON t.account_number = cl.account_number
    -- calculate_distance is assumed to be a UDF (e.g., haversine) returning distance in km
    WHERE calculate_distance(cl.customer_latitude, cl.customer_longitude, t.txn_latitude, t.txn_longitude) > 100
)
SELECT txn_id
FROM PotentialFraud2;
Note: the comparison below was generated by Gemini.
Analytical design (OLAP). Data consistency: ensuring data consistency between the real-time and batch processing systems. Dataflow is a nice model, but it isn't popular outside the Google ecosystem. We have two options:
- Spark path: large-scale ETL workloads with Structured Streaming (second-level latency). Exactly-once delivery is possible, but it requires a carefully configured sink (idempotent writes to S3) plus watermarks/checkpointing. See the sketch after this list.
- Flink path: true streaming, record by record, with advanced event-time support (watermarks, out-of-order data).
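A minimal sketch of the Spark path, assuming transactions arrive on a Kafka topic named transactions and checkpoints live at a hypothetical S3 path; the windowed count is just a placeholder aggregation:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fraud-stream").getOrCreate()

# Read the (hypothetical) transactions topic as a micro-batch stream.
txns = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "transactions")
        .load()
        .selectExpr("CAST(value AS STRING) AS raw", "timestamp"))

# The watermark bounds state for late/out-of-order events; the 5-minute window is a placeholder aggregation.
counts = (txns
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"))
          .count())

# Checkpointing plus an idempotent sink is what makes exactly-once workable;
# console output here is only for the sketch.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "s3://my-bucket/checkpoints/fraud")
         .trigger(processingTime="30 seconds")
         .start())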
Separate services for data ingestion, fraud detection rules, notification, and reporting. Message queues (e.g., Kafka, RabbitMQ) for reliable and scalable data flow between services.
Consider:
- In-memory databases/data stores for low-latency lookups and calculations (e.g., Redis, Hazelcast).
- Key-value stores to hold pre-calculated averages or user profiles.
- Stream processing platforms (e.g., Kafka Streams, Apache Flink, Spark Streaming) for real-time data ingestion and processing.
- A data lake/warehouse for storing historical data and running analytical queries for model training and reporting (e.g., S3, Hadoop, Snowflake).
- Rule 1 (Amount): maintain a real-time average transaction amount per user and category, updated with each new transaction using a sliding window or an exponential moving average. Store it in a fast key-value store keyed by (customer_id, transaction_category); see the sketch below.
- Rule 2 (Location): needs access to the customer's current address or a recent known location, also kept in a fast data store. Use geohashing or a similar technique to quickly flag transactions outside the radius.
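A minimal sketch of Rule 1, assuming a Redis key-value store; the key layout avg:{customer_id}:{category}, the smoothing factor, and the 3x threshold are all illustrative assumptions:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
ALPHA = 0.2  # smoothing factor for the exponential moving average

def update_and_check(customer_id: str, category: str, amount: float, threshold: float = 3.0) -> bool:
    """Update the per-(customer, category) EMA and flag amounts far above it."""
    key = f"avg:{customer_id}:{category}"   # hypothetical key layout
    prev = r.get(key)
    avg = float(prev) if prev is not None else amount
    suspicious = amount > threshold * avg   # Rule 1: amount far above the running average
    r.set(key, ALPHA * amount + (1 - ALPHA) * avg)  # fold the new transaction into the EMA
    return suspicious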
ETL vs. ELT:
- ETL: can enforce data quality and consistency before analysis; fits well-defined reporting requirements and strict data quality needs.
- ELT: relies on the data warehouse/lake for transformation and quality checks; fits exploratory analysis, diverse data sources, and evolving requirements.
It depends on the maturity of the data requirements. For operational reporting with well-defined rules, ETL can offer better performance and data quality upfront.
Batch vs. Real-time: For the weekly report, batch processing is likely sufficient. However, for the analytical requirement of ASAP fraud detection, real-time processing is crucial.
Data storage: lake vs. warehouse.
- Lake: schema-on-read (the schema is applied when data is queried); can contain raw, potentially inconsistent data; requires robust governance to manage diverse data; suits exploration, discovery, and advanced analytics (e.g., ML).
- Warehouse: schema-on-write (the schema is defined before data is loaded); suits business intelligence, reporting, and structured analysis.
A common pattern is to use a data lake as a central repository for all data and a data warehouse for structured data optimized for reporting and analysis.
The US core population is around 100M. Given the weekly report plus near-real-time requirement, let's focus on batch. Frequency: instead of real-time ingestion, a batch-oriented ETL pipeline leveraging a data lake for staging and a data warehouse for analysis is a common and effective approach. Data extraction should use efficient and scalable techniques like CDC, potentially with message queues for buffering and reliability.
- Data Lake (Staging Area): Ingest the extracted batch data into a data lake (e.g., S3, Azure Data Lake Storage, Google Cloud Storage). This provides a cost-effective and scalable storage layer for the raw data.
- Then ETL: flatten, clean, join, and aggregate (Spark) -> data warehouse (optimized for analytical queries).
- Use an orchestration tool like Apache Airflow to schedule it (see the DAG sketch after this list).
- Reporting and visualization: connect BI tools (like Tableau) to the data warehouse.
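A minimal scheduling sketch, assuming Airflow 2.4+; the DAG id and the three task callables (extract_to_lake, transform_with_spark, load_to_warehouse) are hypothetical placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_to_lake():       # pull the weekly batch (CDC output) into the lake's staging area
    ...

def transform_with_spark():  # flatten, clean, join, aggregate
    ...

def load_to_warehouse():     # copy curated tables into the warehouse
    ...

with DAG(
    dag_id="weekly_fraud_report",
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_to_lake)
    transform = PythonOperator(task_id="transform", python_callable=transform_with_spark)
    load = PythonOperator(task_id="load", python_callable=load_to_warehouse)
    extract >> transform >> load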
How to extract from the OLTP side?
- Database Change Data Capture (CDC): If the transaction data resides in an operational database, CDC tools (like Debezium, AWS DMS, or similar) would be the most efficient way to capture changes incrementally without putting a heavy load on the source database. This is ideal for continuous data streams.
- Message Queues (as a buffer): The source system could push transactions to a message queue (like Kafka or RabbitMQ). Your data pipeline can then consume these messages in batches for processing. This adds resilience and decoupling.
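A minimal sketch of the message-queue buffer, assuming kafka-python and a hypothetical transactions topic; write_to_data_lake is a placeholder for whatever appends the batch to the staging area:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "transactions",                  # hypothetical topic name
    bootstrap_servers="broker:9092",
    group_id="fraud-batch-loader",
    enable_auto_commit=False,        # commit offsets only after the batch is safely stored
    value_deserializer=lambda v: v.decode("utf-8"),
)

while True:
    # Pull up to 1000 records at a time and land them in the lake as one batch.
    batch = consumer.poll(timeout_ms=5000, max_records=1000)
    records = [rec.value for recs in batch.values() for rec in recs]
    if records:
        write_to_data_lake(records)  # placeholder: append to the S3 staging area
        consumer.commit()            # offsets advance only after a successful write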
with MLOps:
- The pipeline needs to be flexible and adaptable, so include a rule engine within the analytical processing component.
- Integrate machine learning models into the analytical pipeline. These models can learn complex fraud patterns from historical data and identify anomalies that rule-based systems might miss. Feature engineering: the transformation step should extract relevant features from the transaction data for the ML models (see the sketch after this list), and there should be a feedback loop.
- Data Enrichment and A/B Testing
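A minimal feature-engineering sketch in pandas; the staging path and the derived features are illustrative assumptions, and the customer_id, amount, and timestamp columns are assumed to exist on the transaction data:

import pandas as pd

txns = pd.read_parquet("s3://my-bucket/staging/transactions/")  # hypothetical staging path
txns = txns.sort_values(["customer_id", "timestamp"])

# Per-customer rolling behaviour the fraud model can consume.
grouped = txns.groupby("customer_id")["amount"]
txns["avg_amount_last7"] = grouped.transform(lambda s: s.rolling(7, min_periods=1).mean())
txns["amount_ratio"] = txns["amount"] / txns["avg_amount_last7"]  # spike vs. recent behaviour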
Optimizing performance: batch vs. streaming
Batch
- Data Partitioning: Partition data effectively in the data lake and data warehouse to enable parallel processing by Spark or other engines.
- Optimized Joins: Choose the appropriate join strategies (e.g., broadcast join, shuffle hash join) based on data size and distribution (see the sketch after this list).
- Data Filtering and Projection: Filter out unnecessary data and project only the required columns early in the pipeline to reduce processing overhead.
- Caching: Utilize caching mechanisms provided by the processing engine for frequently accessed data.
- Resource Management: Properly configure the resources (CPU, memory) allocated to the processing engine based on the data volume and complexity of transformations.
- Code Optimization: Write efficient code, avoiding unnecessary loops or computations.
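A minimal PySpark sketch tying several of these points together (partition pruning, early projection/filtering, a broadcast join, caching); the paths, the partition column txn_date, the amount column, and the date literal are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("weekly-fraud-etl").getOrCreate()

# Read only the partitions and columns the report needs (assumes the lake table
# is partitioned by txn_date).
txns = (spark.read.parquet("s3://my-bucket/lake/transactions/")
        .filter(col("txn_date") >= "2024-01-01")
        .select("txn_id", "account_number", "amount", "txn_date"))

accounts = spark.read.parquet("s3://my-bucket/lake/customer_accounts/")

# The accounts dimension is small, so broadcast it instead of shuffling the large fact table.
joined = txns.join(broadcast(accounts), "account_number")

# Cache the joined frame because several downstream aggregations reuse it.
joined.cache()
weekly = joined.groupBy("account_number").sum("amount")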
Streaming
- Stateless Operations: Prioritize stateless operations whenever possible to minimize overhead.
- Efficient Windowing: Optimize windowing configurations (e.g., size, slide) for aggregations and pattern detection.
- Parallel Processing: Leverage the parallel processing capabilities of the stream processing engine by partitioning data appropriately.
- Lightweight Serializers: Use efficient data serialization formats (e.g., Avro, Protocol Buffers) to minimize network overhead.
- Optimized Lookups: If rules require looking up external data, use efficient data stores (e.g., in-memory caches, key-value stores) with optimized indexing.
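A minimal sketch of the optimized-lookups point: keep reference data (e.g., a customer's known home location for Rule 2) in Redis and memoize it in-process, so per-record checks avoid repeated network hops; the hash layout customer:{id} with lat/lon fields is a hypothetical assumption:

from functools import lru_cache
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

@lru_cache(maxsize=100_000)
def home_location(customer_id: str) -> tuple:
    """Fetch a customer's known location once, then serve repeats from memory."""
    # Fine for slowly changing reference data; add a TTL/invalidation scheme in practice.
    lat, lon = r.hmget(f"customer:{customer_id}", "lat", "lon")  # hypothetical hash layout
    return float(lat), float(lon)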
Dive deep into the analytical aspects.
2.1: Start by analyzing the most frequent and slowest queries. Identify the columns used for filtering and joining and consider creating indexes on those. For the Transactions table, partitioning by timestamp is a strong candidate for further optimization.
Indexing:
- Pros: significantly speeds up queries that filter or sort by the indexed columns.
- Cons: increases write overhead (indexes need to be updated), consumes additional storage space, and too many indexes can actually slow down queries. Choosing the right columns to index is crucial.
Partitioning:
- Pros: improves query performance by letting the database scan only the relevant partitions, makes data management (archiving, backup) easier, and can improve query parallelism.
- Cons: requires careful selection of the partitioning key, can complicate queries that span multiple partitions, and might not benefit all query patterns.
Remember the overhead of maintaining indexes on write operations. Regularly review and remove unused indexes. Consider composite indexes for columns frequently used together in queries.
2.2 For non-time-series metrics, focus on efficient querying of the current data with appropriate indexing. For time-series metrics, consider using SQL window functions, dedicated time-series databases, and pre-aggregation techniques (materialized views) to efficiently analyze historical trends. The choice depends on the complexity of the metrics, the volume of historical data, and the required query performance.
Metrics that don't require time-series history: these can typically be calculated directly with SQL aggregation queries, so index the relevant filtering columns.
Metrics that require time-series history:
- Window functions in SQL: use window functions (e.g., AVG() OVER (PARTITION BY customer_id ORDER BY timestamp ROWS BETWEEN ...), LAG(), LEAD()) to calculate across a set of rows related to the current row within a defined window (e.g., the previous month's transactions for a user).
- Time-series databases: for more complex time-series analysis, or when the volume of historical data is very large and needs specialized querying, consider a dedicated time-series database (e.g., InfluxDB, TimescaleDB), optimized for storing and querying time-stamped data.
- Pre-aggregation/materialized views: for frequently used time-series metrics, create pre-aggregated tables (materialized views) that store the results at regular intervals (e.g., daily, weekly). This speeds up queries because the calculations are done ahead of time; see the pre-aggregation sketch below.
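A minimal pre-aggregation sketch in pandas: materialize daily per-customer aggregates so trend queries hit a small table instead of raw transactions; the paths and the customer_id, txn_id, amount, and timestamp column names are illustrative assumptions:

import pandas as pd

txns = pd.read_parquet("s3://my-bucket/lake/transactions/")  # hypothetical path

# Roll raw transactions up to one row per customer per day.
daily = (txns
         .assign(day=pd.to_datetime(txns["timestamp"]).dt.floor("D"))
         .groupby(["customer_id", "day"], as_index=False)
         .agg(txn_count=("txn_id", "count"), total_amount=("amount", "sum")))

# Persist the "materialized view"; the weekly report and trend queries read this instead.
daily.to_parquet("s3://my-bucket/warehouse/daily_customer_metrics/")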