Adoption of Agile Methodology in Data Projects

Posted by Sanket Shrestha on January 12, 2012
Before discussing agile methodology in data projects, let us first briefly look at the nature of data projects. Almost all data projects involve three main steps, popularly known as ETL: Extraction (E), Transformation (T), and Loading (L). At Deerwalk, the three main steps of data projects are data import, data mapping, and application processing, and an analogy can be drawn between these steps and ETL.

(Role Playing by: AshayT, SanketS, KanchanP, BimanS & NeerajS; Photos by: ManishS; Design by: NimeshD; Concept by: PramodR)

Adoption of Agile Methodology in Data Projects

First, a client's raw data is imported into our import tables. The imported data is then cleaned, mapped, and transferred to Deerwalk's standard scrub tables with the necessary business logic implemented. For mapping, a Data Standardization Document (DSD) is prepared, which specifies the required business logic and how each field of the source tables maps to each field of the scrub tables. For data projects, the DSD is like the detailed design used in software development projects. On the basis of the DSD, scripts are written to convert client-specific data into Deerwalk's standard format. The data is then processed for presentation in our applications.
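As a rough illustration of how DSD-driven mapping works, each DSD entry can be thought of as a rule pairing a source field with a scrub-table field, plus any transformation logic. The field names and transforms below are hypothetical examples, not Deerwalk's actual schema:

```python
# Hypothetical sketch of a DSD as mapping rules: each scrub-table field is
# produced from one import-table field, with optional transformation logic.
DSD_RULES = {
    "scrub.first_name": ("import.fname", str.strip),
    "scrub.gender":     ("import.sex", lambda v: {"M": "Male", "F": "Female"}.get(v, "Unknown")),
    "scrub.dob":        ("import.birth_date", None),  # direct copy, no transform
}

def apply_dsd(import_row):
    """Convert one imported row into standard scrub format per the DSD rules."""
    scrub_row = {}
    for scrub_field, (source_field, transform) in DSD_RULES.items():
        value = import_row.get(source_field)
        scrub_row[scrub_field] = transform(value) if transform and value is not None else value
    return scrub_row

row = {"import.fname": "  Ram ", "import.sex": "M", "import.birth_date": "1980-01-12"}
print(apply_dsd(row))
```

In practice the scrub scripts implement far richer business logic, but keeping the mapping itself declarative is what lets a new DSD revision flow straight into a re-scrub.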

The above steps are complemented by Data Import Review (imported data is compared with control totals), DSD Review (the DSD is reviewed to verify logic and mapping), and Scrub Review (unit testing performed by developers). In addition, Data Scrub QC is performed by independent QC resources to identify any defects in this phase. Some issues found in these steps require us to go back to the client for feedback and may result in a re-import and re-scrub of data. Until all issues are resolved, one can expect many iterations in a data project.
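A Data Import Review of the kind described above can be sketched as a comparison of imported records against the client's control totals. This is a minimal sketch; the field names (`record_count`, `total_paid`, `paid_amount`) are assumptions for illustration:

```python
def import_review(imported_rows, control_totals):
    """Compare imported data against the client's control totals.

    Returns a list of discrepancies; an empty list means the import
    reconciles and scrubbing can proceed.
    """
    issues = []
    # Check 1: row count must match the client's stated record count.
    if len(imported_rows) != control_totals["record_count"]:
        issues.append("record count mismatch: got %d, expected %d"
                      % (len(imported_rows), control_totals["record_count"]))
    # Check 2: summed amounts must match the client's stated total.
    paid = round(sum(r["paid_amount"] for r in imported_rows), 2)
    if paid != control_totals["total_paid"]:
        issues.append("paid amount mismatch: got %s, expected %s"
                      % (paid, control_totals["total_paid"]))
    return issues
```

Any non-empty result here is exactly the kind of finding that triggers a round trip to the client and, potentially, a re-import.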

Data projects require significant participation from clients throughout the project. This is the main reason for adopting agile methodology in data projects. In the inception phase, several iterations take place with continuous communication among stakeholders inside and outside the company. This continuous communication also helps manage changes in requirements more effectively.

The agile software development methodology used in data projects can be explained through the following flow chart:

Data Process Flow Chart


The agile process in data projects can be divided into the following levels of planning:

a) Implementation Planning
To implement any client, the project manager needs information such as the number of data feeds and employer groups associated with that client, the data carriers, and the data files associated with the different feeds. This information is necessary to define the timeline and resources required to implement the client.

b) Sprint Planning
The implementation of a client is broken down into sprints. Generally, the number of sprints depends on the number of data feeds. The duration of a sprint is fixed, and it usually takes four weeks to implement a particular data feed. Each sprint is divided into a number of iterative sub-phases.

c) Iteration Planning
The nature of data projects is such that the PM needs to plan for each iteration. From the flow chart, we can see that iterations occur mainly in the Data Import, DSD Preparation, and Data Scrubbing phases. Iterations in the Data Scrubbing phase are mainly due to defects; to minimize re-scrubbing, DSD Review and Scrub Review should be done thoroughly.

d) Daily Planning
Every morning, the team meets for a quick stand-up meeting to discuss the previous day's progress, the tasks to be worked on that day, and any impediments.

As in every other discipline of project management, the focus of data project managers is on finishing the project within the specified time and making the best use of available resources. In data projects, the primary goal is cleaning and standardizing the client's data. The data sent by clients depends on their data carriers, so there is no hard and fast rule for standardizing data across all clients; we need to adapt the logic to each client's data. The most common example is memberid: we need to make memberid unique for each member, but there is no fixed formula we can follow to make it unique. Similarly, the level of aggregation differs from client to client.
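One common way to handle the memberid problem is to derive a stable surrogate ID by hashing whichever combination of fields happens to identify a member uniquely in a given client's data. This is only a sketch of that general technique, not Deerwalk's actual formula (the post's point is precisely that no single formula works for every client):

```python
import hashlib

def make_member_id(*identifying_fields):
    """Derive a stable surrogate member ID from client-specific fields.

    Which fields go in differs per client/carrier: for one client an SSN
    alone may suffice; for another it may take name + DOB + group number.
    Normalizing before hashing keeps the ID stable across formatting noise.
    """
    key = "|".join(str(f).strip().upper() for f in identifying_fields)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]

# Same member, different formatting in two feeds -- same derived ID:
id_a = make_member_id("John", "Doe", "1980-01-12", "GRP42")
id_b = make_member_id("john", " doe ", "1980-01-12", "GRP42")
assert id_a == id_b
```

The deterministic hash means a re-import or re-scrub regenerates the same IDs, which matters when iterations force data to be reprocessed.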

So, continuous interaction with clients is of utmost importance for the successful implementation of data projects. Non-iterative methodologies, such as the waterfall model, are unsuitable for data projects, and hence we have adopted agile methodology in data projects at Deerwalk.