Sam Entries

Facts about compiling a java application

Java is an object-oriented programming language, a programming environment and an operating environment (the Java Virtual Machine). It is also used on the web, although it should not be confused with JavaScript, which is a separate language. There are a few terms that we should understand to gain insight into how Java works.

Emulator

Emulating means imitating. In computing, emulation refers to one computer system (the host) behaving like another computer system (the guest). Emulators are the hardware or software that enables this imitation. By emulating older programs we can sometimes get better results than the original platform gave, such as higher-quality graphics in games.

Bit-Store Analytics Platform (15) – System Decomposition details

The query rewrite engine

[Figure: Query rewrite engine]

Bit-Store Analytics Platform (15) – System Architecture

The basic purpose of the system is to improve query performance through effective use of indexing techniques. Bit-sliced indexes are used for aggregate queries over numerical data, whereas projection indexes are used for low-cardinality data. The type of the data also determines which index type is applied. Moreover, the same table may carry multiple indexes, so the most effective index can be chosen according to the query at hand.
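To illustrate why a bit-sliced index suits aggregate queries, here is a minimal sketch of a SUM computed from bit slices alone. The class name `BitSlicedIndex` and its methods are hypothetical and are not taken from the Bit-Store code base; the sketch assumes non-negative integer values and an in-memory `java.util.BitSet` per bit position.

```java
import java.util.BitSet;

// Illustrative bit-sliced index over non-negative int values (not the Bit-Store implementation).
final class BitSlicedIndex {
    // slices[b] has bit `row` set when the value stored in `row` has bit b set.
    private final BitSet[] slices = new BitSet[31];

    BitSlicedIndex(int[] values) {
        for (int b = 0; b < slices.length; b++) {
            slices[b] = new BitSet(values.length);
        }
        for (int row = 0; row < values.length; row++) {
            if (values[row] < 0) {
                throw new IllegalArgumentException("negative values are not supported in this sketch");
            }
            for (int b = 0; b < slices.length; b++) {
                if (((values[row] >>> b) & 1) == 1) {
                    slices[b].set(row);
                }
            }
        }
    }

    // SUM over the rows selected by `filter`, computed without reading the base table:
    // each slice contributes (number of selected rows with that bit set) * 2^b.
    long sum(BitSet filter) {
        long total = 0;
        for (int b = 0; b < slices.length; b++) {
            BitSet hit = (BitSet) slices[b].clone();
            hit.and(filter);
            total += (long) hit.cardinality() << b;
        }
        return total;
    }

    public static void main(String[] args) {
        BitSlicedIndex index = new BitSlicedIndex(new int[] {5, 12, 7, 30});
        BitSet filter = new BitSet();
        filter.set(0);
        filter.set(2);
        filter.set(3); // select rows 0, 2 and 3
        System.out.println(index.sum(filter)); // 5 + 7 + 30
    }
}
```

The aggregate needs only one bitmap AND and one population count per bit position, regardless of how many rows match the filter.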

The newly introduced indexing techniques are bit-sliced indexing, projection indexing, RID-Bit indexing and Fast-Bit indexing. The design of the system is adapted to accommodate these index types. The query rewrite engine and the index builder are the parts of the existing system affected by the new design. The query rewrite engine first decides whether the query can be run solely on bitmap indexes. If so, the query is modified to run on the indexes instead of the base table. IndexDictator is an abstract rule-based system that comes into use when multiple indexes are implemented on the same base table. The IndexReadDictator, which is an implementation of the index plumper, chooses the optimum indexing technique for the query.

Bit-Store Analytics Platform (14) – Hive indexes ; Create, Store and Use

Creating indexes

For base tables which are partitioned, indexes will also be partitioned. Moreover, indexes are created only on tables that support getPos(), seek() or equivalent methods.

An index is also a normal Hive table with the following format.

Key columns : col_1, ..., col_k – base table columns on which the index is defined.
List<offset> : a list of rows with the given column values.

Offset = file_name + byte_offset

Here file_name is the relative path to the file, within the particular base table or partition, that contains the row. Byte_offset refers to the position of that row within the file.
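The index layout described above can be pictured as a map from key column values to a list of offsets. The following is a minimal in-memory sketch of that shape only; `IndexSketch`, `Offset`, `add` and `lookup` are hypothetical names, and Hive itself stores the index as a table rather than as a Java map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative shape of an index: key column values -> list of row offsets.
final class IndexSketch {
    // Offset = file_name + byte_offset, as defined in the text.
    static final class Offset {
        final String fileName;
        final long byteOffset;

        Offset(String fileName, long byteOffset) {
            this.fileName = fileName;
            this.byteOffset = byteOffset;
        }
    }

    private final Map<List<String>, List<Offset>> entries = new HashMap<>();

    // Record that a row with the given key column values starts at byteOffset in fileName.
    void add(List<String> keyColumns, String fileName, long byteOffset) {
        entries.computeIfAbsent(keyColumns, k -> new ArrayList<>())
               .add(new Offset(fileName, byteOffset));
    }

    // All row offsets for the given key column values, or an empty list if there are none.
    List<Offset> lookup(List<String> keyColumns) {
        return entries.getOrDefault(keyColumns, Collections.emptyList());
    }
}
```

A reader that supports seek() can then jump straight to each byte offset instead of scanning the whole file, which is why the base table format must support getPos(), seek() or equivalents.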

Bit-Store Analytics Platform (13) – Life of a map task

Here, a single application is referred to as a “job”. The user inputs for a map-reduce job include a configuration, a jar that contains a map function, a combiner function and a reducer function, and the directories on HDFS for input and output.

A file in the input directory is usually allocated to one map split, and a map task is allocated for each such map split. If the input file is too big (bigger than an HDFS block), two or more map splits are allocated for the same file. The resource manager allocates a container for each map task. These containers are allocated by exploiting the principle of locality as follows.

If there is a container available on the same node manager as the map split,

Then allocate that container.

Else if there is a container available on a node manager within the same rack,

Then allocate that container.

Else allocate a container on any other node manager of the cluster.
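The three rules above can be sketched as a simple preference order. This is only an illustration of the decision logic; `LocalityAllocator`, `Container` and `choose` are hypothetical names and do not correspond to YARN's scheduler API.

```java
import java.util.List;
import java.util.Optional;

// Illustrative locality preference: node-local, then rack-local, then anywhere.
final class LocalityAllocator {
    // A free container, identified by the node manager and rack it lives on.
    static final class Container {
        final String node;
        final String rack;

        Container(String node, String rack) {
            this.node = node;
            this.rack = rack;
        }
    }

    // Picks a free container for a map split stored on splitNode in splitRack.
    static Optional<Container> choose(List<Container> free, String splitNode, String splitRack) {
        Container rackLocal = null;
        Container anywhere = null;
        for (Container c : free) {
            if (c.node.equals(splitNode)) {
                return Optional.of(c); // rule 1: same node manager as the split
            }
            if (rackLocal == null && c.rack.equals(splitRack)) {
                rackLocal = c; // rule 2 candidate: same rack
            }
            if (anywhere == null) {
                anywhere = c; // rule 3 candidate: any node manager in the cluster
            }
        }
        return Optional.ofNullable(rackLocal != null ? rackLocal : anywhere);
    }
}
```

The result is empty only when no container is free at all, in which case the map task would have to wait.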

When a container is assigned, the MapTask is launched.

The Senate Bus problem

This problem was originally based on the Senate bus at Wellesley College. Riders come to a bus stop and wait for a bus. When the bus arrives, all the waiting riders invoke boardBus, but anyone who arrives while the bus is boarding has to wait for the next bus. The capacity of the bus is 50 people; if there are more than 50 people waiting, some will have to wait for the next bus. When all the waiting riders have boarded, the bus can invoke depart. If the bus arrives when there are no riders, it should depart immediately.

The buses and riders will continue to arrive throughout the day. Assume the inter-arrival times of buses and riders are exponentially distributed with means of 20 min and 30 sec, respectively.
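One well-known way to synchronize the riders and the bus uses three semaphores. The sketch below follows that approach and is not necessarily the solution linked from this post; `BusStop`, `simulate` and `nextInterArrival` are names chosen here for illustration. For a deterministic demo, `simulate` lets every rider arrive before the first bus instead of drawing random arrival times.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.Semaphore;

final class BusStop {
    static final int CAPACITY = 50;

    private final Semaphore mutex = new Semaphore(1);   // guards `waiting`; held by a bus while it boards
    private final Semaphore bus = new Semaphore(0);     // bus -> rider: you may board
    private final Semaphore boarded = new Semaphore(0); // rider -> bus: I have boarded
    private int waiting = 0;

    // Called by a rider thread; returns once the rider has boarded a bus.
    void riderArrives() throws InterruptedException {
        mutex.acquire(); // blocks during boarding, so late riders wait for the next bus
        waiting++;
        mutex.release();
        bus.acquire();
        // boardBus() would be invoked here
        boarded.release();
    }

    // Called by a bus thread; returns how many riders it took before departing.
    int busArrives() throws InterruptedException {
        mutex.acquire();
        int n = Math.min(waiting, CAPACITY);
        for (int i = 0; i < n; i++) {
            bus.release();
            boarded.acquire();
        }
        waiting -= n;
        mutex.release();
        // depart() would be invoked here
        return n;
    }

    int waitingCount() throws InterruptedException {
        mutex.acquire();
        try {
            return waiting;
        } finally {
            mutex.release();
        }
    }

    // Exponentially distributed inter-arrival time with the given mean (inverse transform sampling).
    static double nextInterArrival(Random rng, double mean) {
        return -mean * Math.log(1.0 - rng.nextDouble());
    }

    // Demo: all riders arrive first, then the buses arrive one after another.
    static int[] simulate(int riderCount, int busCount) {
        BusStop stop = new BusStop();
        int[] taken = new int[busCount];
        try {
            for (int i = 0; i < riderCount; i++) {
                Thread rider = new Thread(() -> {
                    try {
                        stop.riderArrives();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                rider.setDaemon(true);
                rider.start();
            }
            while (stop.waitingCount() < riderCount) {
                Thread.sleep(1); // wait until every rider is queued at the stop
            }
            for (int b = 0; b < busCount; b++) {
                taken[b] = stop.busArrives();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return taken;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(simulate(60, 3))); // prints [50, 10, 0]
    }
}
```

Because the bus holds `mutex` for the whole boarding phase, a rider who arrives mid-boarding cannot increment `waiting` until the bus has left. A full simulation would sleep for `nextInterArrival(rng, 1200)` seconds between buses and `nextInterArrival(rng, 30)` seconds between riders.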

Please find the solution to this problem implemented in Java here.

Shelter Animal Outcomes (6) – Submissions, Results and Discussion

Submissions

Approach	Score
XGBoost algorithm	0.71494
Multilayer perceptron	12.3
J48 Tree	12.3
Naive Bayesian classifier	12.76

Results


Shelter Animal Outcomes (5) – Naïve Bayes Classifier in Weka Learner

About the approach

The Naïve Bayesian classifier is a statistical classifier that predicts the probability of an instance belonging to a particular class. The method is based on Bayes' theorem, with the assumption that the attributes are conditionally independent given the class.

Bayesian theorem

Let X be the set of training data and P(H|X) be the posterior probability of a hypothesis H. Bayes' theorem can be stated by the following equation.
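In the notation above, the standard form of the theorem is given below, where P(H) is the prior probability of the hypothesis, P(X|H) is the likelihood of the data under H, and P(X) is the probability of the data.

```latex
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}
```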


Shelter Animal Outcomes (4) – J48 Classifier in Weka Learner

About the J48 classifier

J48 tree implements the C4.5 algorithm which was originally developed by Ross Quinlan. C4.5 is an improvement over the ID3 algorithm.

ID3 → C4.5 → J48 (Weka)

C4.5

C4.5 uses information entropy to improve on the results of the ID3 algorithm. It addresses a supervised learning problem in which the training set consists of already labeled data. Each sample in the training data set is treated as a vector with a separate dimension for each attribute[1]. For example, a sample in the shelter animal outcome prediction problem will contain the following dimensions in its vector.
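Returning to the entropy criterion mentioned above, the sketch below shows how entropy and information gain can be computed from class counts. It illustrates the measure ID3 uses to pick a split; C4.5 further normalizes the gain into a gain ratio, which is not shown. `EntropySketch` and its method names are hypothetical and are not Weka's API.

```java
// Illustrative entropy and information gain from class counts (not Weka's J48 code).
final class EntropySketch {
    // Shannon entropy, in bits, of a class distribution given as per-class counts.
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Information gain of a split: parent entropy minus the size-weighted entropy of the children.
    static double informationGain(int[] parent, int[][] children) {
        int total = 0;
        for (int c : parent) {
            total += c;
        }
        double remainder = 0.0;
        for (int[] child : children) {
            int size = 0;
            for (int c : child) {
                size += c;
            }
            remainder += (double) size / total * entropy(child);
        }
        return entropy(parent) - remainder;
    }

    public static void main(String[] args) {
        int[] parent = {5, 5};               // two classes, evenly mixed
        int[][] split = {{5, 0}, {0, 5}};    // an attribute that separates them perfectly
        System.out.println(informationGain(parent, split));
    }
}
```

An attribute that separates the classes perfectly recovers the full parent entropy as gain, while an attribute that leaves each child as mixed as the parent gains nothing.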

Shelter Animal Outcomes (3) – Multilayer perceptron

About the approach

Multilayer perceptron is a feed-forward artificial neural network model and an improvement over the linear perceptron. The input is first encoded using a nonlinear transformation. The encoded input is then projected so that it becomes linearly separable. There can be one or more such hidden layers. Weka lets the user define the number of hidden layers, and it also provides the following predefined values.

‘a’ = (attribs + classes) / 2
‘i’ = attribs
‘o’ = classes
‘t’ = attribs + classes
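As a worked example of these values, the sketch below resolves each letter to a hidden layer size for a given number of attributes and classes. `HiddenLayerSketch.resolve` is a hypothetical helper, not a Weka method, and the use of integer division for ‘a’ is an assumption.

```java
// Illustrative resolution of the predefined hidden layer values listed above.
final class HiddenLayerSketch {
    // Returns the layer size for 'a', 'i', 'o' or 't', or parses spec as an explicit number.
    static int resolve(String spec, int attribs, int classes) {
        switch (spec) {
            case "a":
                return (attribs + classes) / 2; // integer division is assumed here
            case "i":
                return attribs;
            case "o":
                return classes;
            case "t":
                return attribs + classes;
            default:
                return Integer.parseInt(spec);
        }
    }

    public static void main(String[] args) {
        // Example with 10 attributes and 5 classes (numbers chosen only for illustration).
        for (String spec : new String[] {"a", "i", "o", "t"}) {
            System.out.println(spec + " = " + resolve(spec, 10, 5));
        }
    }
}
```

With 10 attributes and 5 classes, the four values resolve to 7, 10, 5 and 15 nodes respectively.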
