Iterative Data Processing

USER INFLUENCE ANALYSIS

The technology boom has sharply increased the number of online applications. However, online marketing has made it harder for new and existing applications alike to target the right users and capture the market. While the majority illusion can account for the success of some recent applications, starting off by targeting influential users remains one of the primary marketing strategies. Iterative data processing with Spark is one methodology that can be used to find influential users in a rapidly growing social graph.

As part of this project, I worked primarily on a 10.4 GB Twitter social graph dataset that contained Twitter user IDs and the IDs of the users they followed. First, I used Spark RDDs and Spark DataFrames to compute the number of followers of each user and to count the edges and vertices of the social graph. Second, I implemented the PageRank algorithm to rank each user by influence.
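
The two steps above can be illustrated on a single machine. The following is a minimal plain-Java sketch, not the Spark code used in the project: the class name InfluenceSketch, the four-edge sample graph, the damping factor of 0.85, and the even redistribution of rank from users who follow nobody are all assumptions made for illustration. Each edge is a {follower, followee} pair, mirroring the dataset layout described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InfluenceSketch {

    /** Counts followers: each {follower, followee} edge adds one to the followee. */
    public static Map<Long, Integer> followerCounts(long[][] edges) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long[] e : edges) {
            counts.merge(e[1], 1, Integer::sum);
        }
        return counts;
    }

    /** Builds follower -> followees adjacency lists; every vertex gets an entry. */
    public static Map<Long, List<Long>> adjacency(long[][] edges) {
        Map<Long, List<Long>> out = new TreeMap<>();
        for (long[] e : edges) {
            out.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            out.computeIfAbsent(e[1], k -> new ArrayList<>());
        }
        return out;
    }

    /** Runs a fixed number of PageRank iterations (assumed damping, e.g. 0.85). */
    public static Map<Long, Double> pageRank(long[][] edges, int iterations, double damping) {
        Map<Long, List<Long>> out = adjacency(edges);
        int n = out.size();
        Map<Long, Double> rank = new TreeMap<>();
        for (Long v : out.keySet()) {
            rank.put(v, 1.0 / n); // start from a uniform distribution
        }
        for (int i = 0; i < iterations; i++) {
            Map<Long, Double> next = new TreeMap<>();
            for (Long v : out.keySet()) {
                next.put(v, 0.0);
            }
            double dangling = 0.0; // rank held by users who follow nobody
            for (Map.Entry<Long, List<Long>> entry : out.entrySet()) {
                double r = rank.get(entry.getKey());
                List<Long> followees = entry.getValue();
                if (followees.isEmpty()) {
                    dangling += r;
                } else {
                    double share = r / followees.size();
                    for (Long f : followees) {
                        next.merge(f, share, Double::sum);
                    }
                }
            }
            // Assumption: dangling rank is spread evenly over all vertices.
            for (Map.Entry<Long, Double> entry : next.entrySet()) {
                entry.setValue((1 - damping) / n + damping * (entry.getValue() + dangling / n));
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        long[][] edges = { {1, 2}, {3, 2}, {2, 4}, {4, 1} }; // hypothetical sample graph
        System.out.println("edges     = " + edges.length);
        System.out.println("vertices  = " + adjacency(edges).size());
        System.out.println("followers = " + followerCounts(edges));
        System.out.println("pagerank  = " + pageRank(edges, 50, 0.85));
    }
}
```

On a dataset of this size the same logic would be distributed, with the per-edge contributions grouped by followee on each iteration; the sketch only shows the arithmetic of one iteration and how it repeats.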

After calculating user influence, I used the Spark GraphX library to calculate second-degree influence scores for the nodes in the social graph, so that the ranking also includes nodes that might not be connected with the rest of the nodes in the graph.
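
The text does not state how the second-degree score is defined, so the sketch below assumes one plausible definition purely for illustration: a user's score is the average of their own rank and the highest rank among users exactly two follow hops away. The class name SecondDegreeSketch, the 0.5/0.5 weighting, and the sample ranks are hypothetical, and the code is plain Java rather than GraphX. In GraphX, a two-hop propagation of this kind would typically be written as two rounds of message passing along the edges.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class SecondDegreeSketch {

    /**
     * Assumed definition: score = 0.5 * own rank + 0.5 * best rank found
     * two follow hops away (0 if no such user exists).
     */
    public static Map<Long, Double> secondDegreeScores(long[][] edges, Map<Long, Double> rank) {
        // follower -> set of followees; every vertex gets an entry
        Map<Long, Set<Long>> followees = new TreeMap<>();
        for (long[] e : edges) {
            followees.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
            followees.computeIfAbsent(e[1], k -> new TreeSet<>());
        }
        Map<Long, Double> scores = new TreeMap<>();
        for (Long user : followees.keySet()) {
            double best = 0.0;
            for (Long first : followees.get(user)) {          // first hop
                for (Long second : followees.get(first)) {    // second hop
                    best = Math.max(best, rank.getOrDefault(second, 0.0));
                }
            }
            double own = rank.getOrDefault(user, 0.0);
            scores.put(user, 0.5 * own + 0.5 * best);
        }
        return scores;
    }

    public static void main(String[] args) {
        long[][] edges = { {1, 2}, {2, 3}, {4, 1} }; // hypothetical sample graph
        Map<Long, Double> rank = new TreeMap<>();    // hypothetical first-pass ranks
        rank.put(1L, 0.1);
        rank.put(2L, 0.2);
        rank.put(3L, 0.6);
        rank.put(4L, 0.1);
        System.out.println(secondDegreeScores(edges, rank));
    }
}
```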

Technologies Used:

AWS EMR CLUSTERS
APACHE SPARK 2.2.0
APACHE SPARK SHELL
APACHE SPARK GRAPHX
APACHE SPARK ACCUMULATORS
MAVEN
JAVA

Role: Developer
Event: Coursework
Location: Carnegie Mellon University, Pittsburgh
Year: Fall 2017

Subhadeep Bhattacharyya

SOFTWARE ENGINEER

Source: https://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce