Sunday, March 16, 2025

Identify Google Cloud Platform's BigQuery table columns that only contain null values

 

To identify columns in a Google Cloud BigQuery table that contain only null values, you need to examine each column individually using SQL queries.

For a single column, you can use a query like this:

SELECT COUNT(*)
FROM `project.dataset.table`
WHERE column_name IS NOT NULL

This will return a count of non-null values in the column. If the count is 0, then all values in the column are null.

However, if you have numerous columns and want to check all of them, you can leverage the INFORMATION_SCHEMA view to dynamically generate and execute the queries. Here's an example:


DECLARE columns ARRAY<STRING>;
DECLARE queries STRING;

-- Collect every column name for the target table.
SET columns = (
  SELECT ARRAY_AGG(column_name)
  FROM `project.dataset.INFORMATION_SCHEMA.COLUMNS`
  WHERE table_name = 'tablename'
);

-- Build one UNION ALL query that counts non-null values per column.
SET queries = (
  SELECT STRING_AGG(
    "SELECT '" || column_name || "' AS column_name, COUNT(*) AS non_nulls FROM `project.dataset.tablename` WHERE " || column_name || " IS NOT NULL",
    ' UNION ALL ')
  FROM UNNEST(columns) AS column_name
);

EXECUTE IMMEDIATE """
CREATE TEMP TABLE results AS
SELECT column_name, non_nulls
FROM (
""" || queries || """
)
""";

-- Columns with zero non-null values contain only nulls.
EXECUTE IMMEDIATE """
SELECT column_name
FROM results
WHERE non_nulls = 0
""";

Wednesday, July 31, 2024

QuadTree data structure for image processing and geographic information

A QuadTree is a data structure used for organizing and searching points in a two-dimensional space. It is commonly used in applications such as image processing, geographic information systems, and more.

A QuadTree is a tree in which each internal node has exactly four children: north-west, north-east, south-west, and south-east. It is mainly used to partition a two-dimensional space by recursively subdividing it into four quadrants, or regions.


Below is a simple implementation of a QuadTree in Java, along with methods to insert and search points.


```java
import java.util.ArrayList;
import java.util.List;

public class QuadTree {

    private static final int MAX_CAPACITY = 4;

    private final int level; // depth of this node, kept for illustration/debugging
    private final List<Point> points;
    private final Rectangle boundary;

    private QuadTree northWest = null;
    private QuadTree northEast = null;
    private QuadTree southWest = null;
    private QuadTree southEast = null;

    QuadTree(int level, Rectangle boundary) {
        this.level = level;
        this.points = new ArrayList<>();
        this.boundary = boundary;
    }

    void insert(Point point) {
        // Ignore points outside this node's region.
        if (!boundary.contains(point)) {
            return;
        }

        // Store locally while there is room.
        if (points.size() < MAX_CAPACITY) {
            points.add(point);
            return;
        }

        // Capacity reached: subdivide once, then delegate to the children.
        if (northWest == null) {
            split();
        }

        // Each child's own boundary check ensures only the containing
        // quadrant actually keeps the point.
        northWest.insert(point);
        northEast.insert(point);
        southWest.insert(point);
        southEast.insert(point);
    }

    void split() {
        int halfWidth = boundary.width / 2;
        int halfHeight = boundary.height / 2;
        int xOffset = boundary.x + halfWidth;
        int yOffset = boundary.y + halfHeight;

        // The east/south children absorb the remainder when the width or
        // height is odd, so the four quadrants tile the parent exactly.
        northWest = new QuadTree(level + 1, new Rectangle(boundary.x, boundary.y, halfWidth, halfHeight));
        northEast = new QuadTree(level + 1, new Rectangle(xOffset, boundary.y, boundary.width - halfWidth, halfHeight));
        southWest = new QuadTree(level + 1, new Rectangle(boundary.x, yOffset, halfWidth, boundary.height - halfHeight));
        southEast = new QuadTree(level + 1, new Rectangle(xOffset, yOffset, boundary.width - halfWidth, boundary.height - halfHeight));
    }

    List<Point> queryRange(Rectangle range) {
        List<Point> pointsInRange = new ArrayList<>();

        // Prune subtrees whose region does not overlap the query range.
        if (!boundary.intersects(range)) {
            return pointsInRange;
        }

        for (Point point : points) {
            if (range.contains(point)) {
                pointsInRange.add(point);
            }
        }

        // Leaf node: nothing more to search.
        if (northWest == null) {
            return pointsInRange;
        }

        pointsInRange.addAll(northWest.queryRange(range));
        pointsInRange.addAll(northEast.queryRange(range));
        pointsInRange.addAll(southWest.queryRange(range));
        pointsInRange.addAll(southEast.queryRange(range));

        return pointsInRange;
    }
}
```


You can use the `queryRange` method to find all points that fall within a given range. The range is a rectangle specified by its origin (x, y) and its width and height. The `insert` method adds a new point to the QuadTree. Note: this example assumes you have simple `Point` and `Rectangle` classes; a minimal sketch of both is given below, but you can create them as per your needs.


This code creates a QuadTree and inserts points into a node until it reaches its maximum capacity, at which point the node splits into four smaller quadrants. The `queryRange` method returns all points that fall within a given rectangle.
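For completeness, here is one minimal way to define the helper classes assumed above, using integer coordinates and half-open bounds so that a point on a shared edge falls in exactly one quadrant:

```java
class Point {
    public final int x;
    public final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

class Rectangle {
    public final int x, y, width, height;

    Rectangle(int x, int y, int width, int height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // Half-open containment: [x, x + width) and [y, y + height).
    boolean contains(Point p) {
        return p.x >= x && p.x < x + width && p.y >= y && p.y < y + height;
    }

    boolean intersects(Rectangle other) {
        return x < other.x + other.width && other.x < x + width
            && y < other.y + other.height && other.y < y + height;
    }
}
```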

Tuesday, July 30, 2024

Building a Trie data structure to support typo-tolerant suggestions in a search box

Building a Trie data structure to support typo-tolerant suggestions is a comprehensive task. We'll start by creating the Trie itself and then implement a search algorithm that tolerates typos.

First, we will create a TrieNode class that represents each node in the Trie:

```java
public class TrieNode {
    // One slot per lowercase letter 'a'..'z'; input is assumed lowercase.
    public TrieNode[] children;
    public boolean isEndOfWord;

    public TrieNode() {
        this.isEndOfWord = false;
        this.children = new TrieNode[26];
    }
}
```

Then we create the Trie class with two methods, insert and search:

```java
public class Trie {
    private TrieNode root;

    public Trie() {
        root = new TrieNode();
    }

    public void insert(String word) {
        TrieNode node = root;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (node.children[c - 'a'] == null) {
                node.children[c - 'a'] = new TrieNode();
            }
            node = node.children[c - 'a'];
        }
        node.isEndOfWord = true;
    }

    public boolean search(String word) {
        return match(word, root, 0);
    }

    // Walks the Trie, following the exact child where one exists and
    // trying every child on a mismatch (treating it as a substituted character).
    private boolean match(String word, TrieNode node, int start) {
        if (start == word.length()) {
            return node.isEndOfWord;
        }
        char c = word.charAt(start);
        if (node.children[c - 'a'] != null) {
            if (start == word.length() - 1 && node.children[c - 'a'].isEndOfWord) {
                return true;
            }
            return match(word, node.children[c - 'a'], start + 1);
        } else {
            for (int i = 0; i < 26; i++) {
                if (node.children[i] != null) {
                    if (match(word, node.children[i], start + 1)) {
                        return true;
                    }
                }
            }
        }
        return false;
    }
}
```

In this implementation, the search method uses a helper method called match. This match method checks for potential matches of the input word in the Trie. If a character does not match, it recursively checks all children nodes for a potential match. This is a basic implementation and may not work perfectly for all types of typos (it handles substitutions, but not insertions or deletions), but it gives an idea of how to implement a typo-tolerant search in a Trie data structure. For a more advanced typo-tolerant mechanism, you might want to use more sophisticated algorithms like the Levenshtein distance or BK-trees.
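A short usage sketch (the words are illustrative):

```java
public class TrieDemo {
    public static void main(String[] args) {
        Trie trie = new Trie();
        trie.insert("hello");
        trie.insert("help");

        System.out.println(trie.search("help"));   // true: exact match
        System.out.println(trie.search("hellp"));  // true: one substituted character of "hello"
        System.out.println(trie.search("xyz"));    // false: no close match
    }
}
```

Note that "helo" would not match "hello" here, since the matcher only tolerates substitutions, not missing or extra characters.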

Tuesday, July 23, 2024

Spark Performance: Interview Questions and Answers

 


Introduction

Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. It has gained popularity for its ability to handle large datasets in a distributed computing environment. As a result, understanding Spark performance tuning is crucial for data engineers and developers. This article presents some tough interview questions and answers related to Spark performance.

Main Section

1. What is Spark Performance Tuning?

Answer: Spark Performance Tuning is the process of adjusting settings to optimize the performance of Spark applications. It involves managing and tuning Spark configurations, garbage collection, serialization and memory management, among other things.

2. What are some common methods for improving the performance of Spark applications?

Answer: Some common methods include minimizing the read/write operations to disk, avoiding shuffling of data, broadcasting common data needed by tasks within each stage, and properly partitioning your data.
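As one concrete illustration of broadcasting, here is a minimal sketch of a broadcast join using Spark's Java API; the input paths and join column are hypothetical:

```java
import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BroadcastJoinExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("BroadcastJoinExample")
                .getOrCreate();

        // A large fact table and a small dimension table (hypothetical paths).
        Dataset<Row> orders = spark.read().parquet("/data/orders");
        Dataset<Row> countries = spark.read().parquet("/data/countries");

        // broadcast() hints Spark to ship the small table to every executor,
        // avoiding a shuffle of the large table.
        Dataset<Row> joined = orders.join(broadcast(countries), "country_code");

        joined.show();
        spark.stop();
    }
}
```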

3. What is the role of the Spark scheduler in Spark performance?

Answer: The Spark scheduler plays a crucial role in distributing the data and scheduling tasks. It optimizes the sequence of operations by grouping them into stages in the DAG (Directed Acyclic Graph) scheduler.

4. How does Spark's in-memory processing improve performance?

Answer: Spark's in-memory processing capability allows it to store intermediate data in memory rather than writing it to disk. This significantly speeds up iterative algorithms and interactive data mining tasks.

5. What is the impact of data serialization on Spark performance?

Answer: Data serialization plays a vital role in the performance of any distributed application. Formats that are slow to serialize objects or consume a large number of bytes can greatly slow down computation. Spark provides two serialization libraries – Java Serialization and Kryo. Kryo is faster and more compact but does not support all serializable types.
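Switching to Kryo is a configuration change; a minimal sketch (the registered class is a hypothetical application type):

```java
import org.apache.spark.SparkConf;

public class KryoConfigExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("KryoConfigExample")
                // Switch from default Java serialization to Kryo.
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                // Registering classes up front lets Kryo write compact class identifiers.
                .registerKryoClasses(new Class<?>[]{MyEvent.class});
    }

    // Hypothetical application class that flows through shuffles.
    public static class MyEvent implements java.io.Serializable {
        public long timestamp;
        public String payload;
    }
}
```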

Conclusion

Understanding Spark performance tuning is crucial for optimizing your Spark applications. These interview questions and answers should provide a solid foundation for demonstrating your knowledge and skills in a job interview. However, remember that practical, hands-on experience will always be the best way to learn and understand Spark performance tuning.

Cassandra: Key Features Explored

 

Unleashing the Power of Cassandra: Key Features Explored

Introduction

In the realm of database management systems, Apache Cassandra stands out for its ability to handle large amounts of data across many commodity servers. As a NoSQL database, it provides high availability and scalability without compromising performance. This blog post will delve into some of the key features of Cassandra, including Cassandra keys, High Availability (HA), Tables, NoSQL, and the Coordinator node.

Main Section

Cassandra Keys

In Cassandra, data distribution across various nodes is managed by a partition key, also known as a Cassandra key. This key is a part of the primary key and is responsible for data distribution across the system. The partitioner calculates the token of the partition key and decides which node the data belongs to. This feature ensures efficient data distribution and quick data retrieval.

High Availability (HA)

Cassandra is designed to have no single point of failure, making it highly available and fault-tolerant. It achieves this by replicating data across multiple nodes. Even if one node fails, the data is still available from replicated nodes. This feature ensures that your application is always up and running, providing a seamless user experience.

Tables

In Cassandra, data is stored in tables containing rows of information. Each row is uniquely identifiable by a primary key, which can be simple (single column) or composite (multiple columns). Tables in Cassandra are where your data lives and breathes, and they're designed to be efficient for your specific queries.
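As an illustration of a table with a composite primary key, here is a minimal sketch using the DataStax Java driver; the keyspace, table, and schema are hypothetical, and a local contact point is assumed:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class CassandraTableExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("shop").build()) {
            // user_id is the partition key: it is hashed to a token that decides
            // which node stores the row. order_id is a clustering column that
            // orders rows within the partition.
            session.execute(
                "CREATE TABLE IF NOT EXISTS orders ("
                + " user_id uuid,"
                + " order_id timeuuid,"
                + " total decimal,"
                + " PRIMARY KEY ((user_id), order_id))");
        }
    }
}
```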

NoSQL

As a NoSQL database, Cassandra provides flexibility with your data. Unlike relational databases, it doesn't require a fixed schema and is able to handle unstructured, semi-structured, and structured data. It's built for scalability and can handle large amounts of data across many servers.

Coordinator Node

In Cassandra, any node chosen by the client or load balancer serves as the coordinator. This node is responsible for processing the client's read or write requests. It communicates with other nodes, collects responses, and returns the result to the client. This feature ensures efficient processing and management of requests.

Conclusion

Apache Cassandra, with its robust set of features, is a powerful tool in the world of database management systems. Its unique approach to keys, high availability, flexible table structure, NoSQL nature, and efficient coordinator node handling make it a go-to choice for managing large amounts of data across distributed systems. Whether you're dealing with unstructured data or looking for a system that offers high availability, Cassandra has got you covered.


Tuesday, July 16, 2024

Essential system design concepts for professionals with 10+ years of experience


1. Domain Name System (DNS)

The Domain Name System (DNS) serves as a fundamental component of the internet infrastructure, translating user-friendly domain names into their corresponding IP addresses.

2. Load Balancer

A load balancer is a networking device or software designed to distribute incoming network traffic across multiple servers, ensuring optimal resource utilization, reduced latency, and maintained high availability.

Some common algorithms include:

Round Robin: Requests are sequentially and evenly distributed across all available servers in a cyclical manner.

Least Connections: The load balancer assigns requests to the server with the fewest active connections, giving priority to less-busy servers.

IP Hash: The client's IP address is hashed, and the resulting value is used to determine which server the request should be directed to. This method ensures that a specific client's requests are consistently routed to the same server, helping maintain session persistence. A minimal sketch of two of these strategies follows this list.
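As referenced above, here is a minimal sketch of the Round Robin and IP Hash strategies (server addresses are placeholders):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class LoadBalancer {
    private final List<String> servers;
    private final AtomicInteger counter = new AtomicInteger();

    public LoadBalancer(List<String> servers) {
        this.servers = servers;
    }

    // Round Robin: cycle through servers in order.
    public String roundRobin() {
        int i = Math.floorMod(counter.getAndIncrement(), servers.size());
        return servers.get(i);
    }

    // IP Hash: the same client IP always maps to the same server.
    public String ipHash(String clientIp) {
        int i = Math.floorMod(clientIp.hashCode(), servers.size());
        return servers.get(i);
    }
}
```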

3. API Gateway

An API Gateway is a server or service that acts as an intermediary between external clients and the internal microservices or API-based backend services of an application.

The primary functions of an API Gateway encompass:


Request Routing: The API Gateway directs incoming API requests from clients to the appropriate backend service or microservice, based on predefined rules and configurations.

Authentication and Authorization: The API Gateway manages user authentication and authorization, ensuring that only authorized clients can access the services. It verifies API keys, tokens, or other credentials before routing requests to the backend services.

Rate Limiting and Throttling: To safeguard backend services from excessive load or abuse, the API Gateway enforces rate limits or throttles requests from clients according to predefined policies (see the token-bucket sketch after this list).

Caching: In order to minimize latency and backend load, the API Gateway caches frequently-used responses, serving them directly to clients without the need to query the backend services.

Request and Response Transformation: The API Gateway can modify requests and responses, such as converting data formats, adding or removing headers, or altering query parameters, to ensure compatibility between clients and services.
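One common rate-limiting policy is the token bucket; here is a minimal sketch (the capacity and refill rate would come from the gateway's policy configuration):

```java
public class TokenBucket {
    private final long capacity;        // maximum burst size
    private final double refillPerNano; // tokens added per nanosecond
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    // Returns true if the request is allowed, false if it should be rejected.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```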

4. CDN

A Content Delivery Network (CDN) is a distributed network of servers that store and deliver content, such as images, videos, stylesheets, and scripts, to users from locations that are geographically closer to them.


5. Forward Proxy vs. Reverse Proxy

A forward proxy, also referred to as a "proxy server" or simply "proxy," is a server positioned in front of one or more client machines, acting as an intermediary between the clients and the internet. When a client machine requests a resource on the internet, the request is initially sent to the forward proxy. The forward proxy then forwards the request to the internet on behalf of the client machine and returns the response to the client machine.

  On the other hand, a reverse proxy is a server that sits in front of one or more web servers, serving as an intermediary between the web servers and the internet. When a client requests a resource on the internet, the request is first sent to the reverse proxy. The reverse proxy then forwards the request to one of the web servers, which returns the response to the reverse proxy. Finally, the reverse proxy returns the response to the client.


6. Caching:

Cache is a high-speed storage layer positioned between the application and the original data source, such as a database, file system, or remote web service. When an application requests data, the cache is checked first. If the data is present in the cache, it is returned to the application. If the data is not found in the cache, it is retrieved from its original source, stored in the cache for future use, and then returned to the application. In a distributed system, caching can occur in multiple locations, including the client, DNS, CDN, load balancer, API gateway, server, database, and more. 
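A minimal cache-aside sketch in Java, modeling the original data source as a lookup function:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class CacheAside<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> source; // e.g., a database or remote service lookup

    public CacheAside(Function<K, V> source) {
        this.source = source;
    }

    public V get(K key) {
        // Check the cache first; on a miss, load from the source and store it.
        return cache.computeIfAbsent(key, source);
    }
}
```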


7. Data Partitioning:

In a database, horizontal partitioning, often referred to as sharding, entails dividing the rows of a table into smaller tables and storing them on distinct servers or database instances. This method is employed to distribute the database load across multiple servers, thereby enhancing performance.


Conversely, vertical partitioning involves splitting the columns of a table into separate tables. This technique aims to reduce the column count in a table and boost the performance of queries that only access a limited number of columns.
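A minimal sketch of hash-based horizontal partitioning (the shard count and key are illustrative; production systems often use consistent hashing to make resharding cheaper):

```java
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    // Sharding: rows are assigned to a shard by hashing the partition key.
    public int shardFor(String userId) {
        return Math.floorMod(userId.hashCode(), shardCount);
    }
}
```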

8. Database Replication

Uses of DB replication:

Improved Performance: By distributing read queries among multiple replicas, the load on the primary database can be reduced, leading to faster query response times.

High Availability: If the primary database experiences failure or downtime, replicas can continue to provide data, ensuring uninterrupted access to the application.

Enhanced Data Protection: Maintaining multiple copies of the database across different locations helps safeguard against data loss due to hardware failures or other disasters.

Load Balancing: Replicas can handle read queries, allowing for better load distribution and reducing overall stress on the primary database.

9. Distributed Messaging Systems: 

Distributed messaging systems provide a reliable, scalable, and fault-tolerant means for exchanging messages between numerous, possibly geographically-dispersed applications, services, or components. These systems facilitate communication by decoupling sender and receiver components, enabling them to develop and function independently. Distributed messaging systems are especially valuable in large-scale or intricate systems, like those seen in microservices architectures or distributed computing environments. Examples of these systems include Apache Kafka and RabbitMQ.

10. Microservices:

The primary characteristics of microservices include:

Single Responsibility: Adhering to the Single Responsibility Principle, each microservice focuses on a specific function or domain, making the services more straightforward to comprehend, develop, and maintain.

Independence: Microservices can be independently developed, deployed, and scaled, offering increased flexibility and agility in the development process. Teams can work on various services simultaneously without impacting the entire system.

Decentralization: Typically, microservices are decentralized, with each service possessing its data and business logic. This approach fosters separation of concerns and empowers teams to make decisions and select technologies tailored to their unique requirements.

Communication: Microservices interact with each other using lightweight protocols, such as HTTP/REST, gRPC, or message queues. This fosters interoperability and facilitates the integration of new services or the replacement of existing ones.

Fault Tolerance: As microservices are independent, the failure of one service does not necessarily result in the collapse of the entire system, enhancing the application's overall resiliency.


11. NoSQL Databases

NoSQL databases, or “Not Only SQL” databases, are non-relational databases designed to store, manage, and retrieve unstructured or semi-structured data. They offer an alternative to traditional relational databases, which rely on structured data and predefined schemas. NoSQL databases have become popular due to their flexibility, scalability, and ability to handle large volumes of data, making them well-suited for modern applications, big data processing, and real-time analytics.

NoSQL databases can be categorized into four main types:

Document-Based: These databases store data in document-like structures, such as JSON or BSON. Each document is self-contained and can have its own unique structure, making them suitable for handling heterogeneous data. Examples of document-based NoSQL databases include MongoDB and Couchbase.

Key-Value: These databases store data as key-value pairs, where the key acts as a unique identifier, and the value holds the associated data. Key-value databases are highly efficient for simple read and write operations, and they can be easily partitioned and scaled horizontally. Examples of key-value NoSQL databases include Redis and Amazon DynamoDB.

Column-Family: These databases store data in column families, which are groups of related columns. They are designed to handle write-heavy workloads and are highly efficient for querying data with a known row and column keys. Examples of column-family NoSQL databases include Apache Cassandra and HBase.

Graph-Based: These databases are designed for storing and querying data that has complex relationships and interconnected structures, such as social networks or recommendation systems. Graph databases use nodes, edges, and properties to represent and store data, making it easier to perform complex traversals and relationship-based queries. Examples of graph-based NoSQL databases include Neo4j and Amazon Neptune.

12. Database Index:

Although indexes can significantly enhance query performance, they also involve certain trade-offs:

Storage Space: Indexes require additional storage space since they generate and maintain separate data structures alongside the original table data.

Write Performance: When data is inserted, updated, or deleted in a table, the corresponding indexes must also be updated, which may slow down write operations.

13. Distributed File Systems

Distributed file systems are storage systems designed to manage and grant access to files and directories across multiple servers, nodes, or machines, frequently distributed across a network. They allow users and applications to access and modify files as though they were situated on a local file system, despite the fact that the actual files may be physically located on various remote servers. Distributed file systems are commonly employed in large-scale or distributed computing environments to offer fault tolerance, high availability, and enhanced performance.

14. Notification System

These are used to send notifications or alerts to users, such as emails, push notifications, or text messages.

15. Full-text Search

Full-text search allows users to search for particular words or phrases within an application or website. Upon receiving a user query, the application or website delivers the most relevant results. To accomplish this rapidly and effectively, full-text search uses an inverted index, a data structure that maps words or phrases to the documents in which they appear. Elasticsearch is an example of such a system.
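To make the idea concrete, here is a minimal in-memory inverted index sketch (real systems like Elasticsearch add tokenization, ranking, and persistence on top of this):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class InvertedIndex {
    // Maps each word to the set of document IDs that contain it.
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, w -> new HashSet<>()).add(docId);
            }
        }
    }

    public Set<Integer> search(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }
}
```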

16. Distributed Coordination Services

Distributed coordination services are systems engineered to regulate and synchronize the actions of distributed applications, services, or nodes in a dependable, efficient, and fault-tolerant way. They assist in maintaining consistency, addressing distributed synchronization, and overseeing the configuration and state of diverse components in a distributed setting. Distributed coordination services are especially valuable in large-scale or intricate systems, like those encountered in microservices architectures, distributed computing environments, or clustered databases. Apache ZooKeeper, etcd, and Consul are examples of such services.

17. Heartbeat

In a distributed environment, work and data are distributed among servers. To route requests efficiently in such a setup, servers need to know which other servers are part of the system, and whether those servers are alive and working. A heartbeat is a small message that each server sends periodically to a central monitor or to its peers; if no heartbeat arrives within a configured interval, the server is assumed to have failed and requests are routed away from it.

18. Checksum

In a distributed system, while moving data between components, it is possible that the data fetched from a node may arrive corrupted. This corruption can occur because of faults in a storage device, network, software, etc. How can a distributed system ensure data integrity, so that the client receives an error instead of corrupt data?


To solve this, we can calculate a checksum and store it with data.


To calculate a checksum, a cryptographic hash-function like MD5, SHA-1, SHA-256, or SHA-512 is used. The hash function takes the input data and produces a string (containing letters and numbers) of fixed length; this string is called the checksum.
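As a minimal sketch, a checksum can be computed in Java with the standard MessageDigest API (SHA-256 here):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChecksumExample {
    // Computes a SHA-256 checksum and returns it as a fixed-length hex string.
    public static String sha256(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        // Store the checksum alongside the data; the reader recomputes
        // it and compares to detect corruption.
        System.out.println(sha256(payload));
    }
}
```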


Spring AOP in 5 mins - Interview refresh

Spring AOP supports AspectJ-style annotations but applies only to Spring beans; full AspectJ weaving works on any Java object.

Cross-cutting concerns: logging, security.

What is a join point? A point during program execution, such as a method execution, where an aspect can be applied.

What is advice? The code we want to execute at a join point.

What is a pointcut? An expression that selects the join points where the advice should be applied.

What is an aspect? Advice + pointcut.

Weaving: the process of linking aspects into the target code so that advice runs at the right join points.

Advice types: before, after returning, after (finally), around, after throwing. A minimal example follows.
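A minimal sketch of an aspect using these pieces (the pointcut package is hypothetical):

```java
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class LoggingAspect {

    // Pointcut: every method in the (hypothetical) service package.
    @Around("execution(* com.example.service..*(..))")
    public Object logTiming(ProceedingJoinPoint joinPoint) throws Throwable {
        long start = System.currentTimeMillis();
        try {
            return joinPoint.proceed(); // run the intercepted method
        } finally {
            System.out.println(joinPoint.getSignature() + " took "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }
}
```

With spring-boot-starter-aop on the classpath (or @EnableAspectJAutoProxy configured), this @Around advice wraps every matching bean method.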


Spring Boot basics in 5 mins - Interview refresh


Spring Boot: far less configuration compared to Spring MVC; things like the DispatcherServlet, view resolver, and component scan are set up automatically.

Auto-configuration: Spring Boot auto-configuration attempts to automatically configure your Spring application based on the jar dependencies that you have added.

Spring Boot Actuator: used to monitor the application; exposes endpoints for metrics and health.

Embedded server: Tomcat by default; comes with the spring-boot-starter-web dependency.

Spring Starters:

spring-boot-starter-parent: no need to worry about versions; compatible versions of all Spring-related jars are managed for you.

Spring Initializr: start.spring.io, a UI tool to create a sample Spring Boot project with selected features.

application.properties: used to configure log levels, enable features, enable profiles, server.port, etc.

CommandLineRunner: used to run something once, after the application context has started; a sketch follows.
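A minimal sketch (the one-time logic is a placeholder):

```java
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }

    // Runs once, after the application context has started.
    @Bean
    CommandLineRunner seedData() {
        return args -> System.out.println("Application started, running one-time setup...");
    }
}
```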



 

Spring Web Design patterns to remember for Interview

  1. Front Controller
  2. Prototype
  3. Dependency Injection
  4. Factory
  5. Template Method
  6. Abstract Controller
  7. MVC (Model-View-Controller)
  8. DispatcherServlet: routes each request to the mapped controller. Flow: DispatcherServlet -> controller -> model -> ViewResolver -> DispatcherServlet -> browser.
  9. ViewResolver: maps the logical name of a view to a concrete view, such as a JSP.

 

Spring annotations and Basic in 5 mins - Spring interview refresh

Spring annotations  and Basics

  • @Component is the most generic stereotype; @Repository, @Service, and @Controller are meant to be used in their specific layers.
  • @Repository: when you tag a component with @Repository, Spring automatically adds exception translation for JDBC exceptions, so a JDBC exception is translated into the corresponding Spring data-access exception.
  • Spring beans are not thread-safe.
  • Scopes: singleton (the default), prototype, request, and session.
  • request and session scopes are applicable only in a web application context.
  • Gang of Four (classic) singleton: one instance per class loader.
  • The Spring singleton is one instance per container, i.e., one instance per application context. If you have multiple application contexts, you will have multiple objects.
  • Constructor injection: for immutable beans and mandatory dependencies.
  • Setter injection: for optional dependencies that are not required at creation time (for example, internationalization settings).
  • @Autowired: if there is only one implementation, injection is resolved by type; otherwise it falls back to the name of the variable.
  • NoSuchBeanDefinitionException: the bean was not found, typically because it was not component-scanned.
  • NoUniqueBeanDefinitionException: multiple implementations exist for an interface and the injection point cannot be disambiguated.
  • @Autowired ClassA classAImpl; resolution order: search by type (ClassA); if multiple beans have the same type, match by the variable name ("classAImpl"); if there is still no unique match, use @Primary or a @Qualifier.
  • @Primary: placed on a class to mark it as the default implementation for injection when multiple candidates exist (resolves NoUniqueBeanDefinitionException).
  • @Qualifier("name"): used on the bean class and at the injection point to identify a specific bean. A sketch of @Primary and @Qualifier follows this list.
  • BOM: a bill-of-materials dependency used to keep Spring versions consistent; all versions are taken from its dependencyManagement section.
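A minimal sketch of @Primary and @Qualifier in action (the gateway interface and classes are hypothetical):

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Primary;
import org.springframework.stereotype.Component;
import org.springframework.stereotype.Service;

interface PaymentGateway {
    void pay(long amountCents);
}

@Component
@Primary // default candidate when multiple beans match by type
class StripeGateway implements PaymentGateway {
    public void pay(long amountCents) { /* ... */ }
}

@Component
@Qualifier("paypal")
class PaypalGateway implements PaymentGateway {
    public void pay(long amountCents) { /* ... */ }
}

@Service
class CheckoutService {
    private final PaymentGateway defaultGateway;
    private final PaymentGateway paypalGateway;

    @Autowired
    CheckoutService(PaymentGateway defaultGateway,
                    @Qualifier("paypal") PaymentGateway paypalGateway) {
        this.defaultGateway = defaultGateway; // StripeGateway, via @Primary
        this.paypalGateway = paypalGateway;   // PaypalGateway, via @Qualifier
    }
}
```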

 

Spark submit on YARN - Execution flow


When a Spark job is submitted to YARN, the following steps occur:

1. The client submits the job to the YARN ResourceManager (RM) via the YARN client.

2. The RM allocates a container for the ApplicationMaster (AM) on one of the cluster nodes. The AM is responsible for managing the lifecycle of the Spark application on YARN.

3. The AM starts on the allocated container and registers itself with the RM.

4. The AM negotiates resources with the RM to allocate containers for Spark executors.

5. The RM assigns containers to the AM, which in turn launches the Spark executors on these containers.

6. The executors register themselves with the AM.

7. The AM submits tasks to the executors and monitors their progress.

8. The executors execute the tasks and report the results back to the AM.

9. The AM aggregates the results and updates the job status.

10. Once the job is completed or terminated, the AM releases the resources and notifies the RM.
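For reference, a typical submission that starts this flow looks like the following; the class name, jar path, and sizing flags are placeholders:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  /path/to/my-app.jar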


The ApplicationMaster is responsible for coordinating the execution of the Spark application on YARN. It negotiates with the RM to allocate resources (containers) for the Spark executors and manages their lifecycle. It also monitors the progress of the executors and handles failures by requesting the RM to re-allocate containers if needed.


The ApplicationsManager is a component within the RM that receives application submissions from clients. It is responsible for accepting submissions, negotiating the first container for the ApplicationMaster, and tracking application status. The ApplicationsManager works closely with the ApplicationMaster to manage the execution of Spark applications on YARN.


The relationship between the ApplicationMaster and the executors is that the AM is responsible for launching and managing the lifecycle of the executors. It negotiates with the RM to allocate resources (containers) for the executors and monitors their progress. The executors register themselves with the AM and execute the tasks assigned by the AM. The AM communicates with the executors to collect the results and update the job status.

 

HDFS - Edit logs and FsImage

What is DistCp? A tool used for copying very large amounts of data to and from Hadoop file systems in parallel.

Default replication factor? By default, the replication factor is 3. No two copies are placed on the same DataNode. Usually, two copies end up on one rack and the third copy on a different rack. It is advised to keep the replication factor at least three so that one copy is always safe, even if something happens to an entire rack.

The NameNode stores its metadata in two places: on disk and in RAM.

Edit logs: a record of all transactions (blocks added, blocks deleted, replication changes) since the last FsImage file was created. These updates are periodically merged into the FsImage.

FsImage: a file stored on the local (OS) file system that contains the complete directory structure of HDFS. The NameNode reads it into memory at startup.


Spark execution optimisers - Catalyst and Tungsten


Spark uses two engines to optimize and run queries, Catalyst and Tungsten, in that order. Catalyst generates an optimized physical query plan from the logical query plan by applying a series of transformations such as predicate pushdown, column pruning, and constant folding to the logical plan.

This optimized query plan is then used by Tungsten to generate code that resembles hand-written code, using the Whole-Stage Codegen functionality introduced in Spark 2.0. This functionality improved Spark's efficiency by a huge margin over Spark 1.6, which used the traditional Volcano iterator model.
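You can inspect the output of both engines for any query with Dataset.explain(true), which prints the parsed, analyzed, optimized (Catalyst) and physical (Tungsten, with WholeStageCodegen markers) plans. A minimal sketch:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExplainPlanExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ExplainPlanExample")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> df = spark.range(1000).toDF("id")
                .filter("id > 500")
                .selectExpr("id * 2 AS doubled");

        // Prints the logical, optimized (Catalyst), and physical (Tungsten) plans.
        df.explain(true);

        spark.stop();
    }
}
```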