Important points
- The entire process is atomic so that
if there are any problems, the transactions are rolled back and the database is not corrupted.
- Detailed log file
- Separating the processes and
creating intermediate CSV files gives institutions tremendous
flexibility. It means that the ETL process can be started and
stopped at any point in the process. For example, institutions
can write their own Extract program and then call the Estudias ETL
process to complete the transformations and loading.
Institutions also have a lot of power to stage things at different
times (i.e. the extraction process could happen at one time, the
transformation at another time, and loading at still another).
|
Extraction -- The Estudias extraction engine automatically
downloads information at a scheduled time (via SSL, SSH,
ODBC, etc). The extraction tool supports partitioned extracts and
round-robin extract windows (i.e. if a download can't finish by a
specified cutoff time, the extraction tool will abort and restart the
next day where it left off). Institutions have used this engine to
automatically download other information (even if it's not related to
Estudias).
Extract from multiple
databases. Specifying the tables, fields
Extracted CSV Files: As part of ZogoTech's Open Architecture,
information is extracted into comma separated value files, so that power
users (IR, IT) can access the raw data directly. These files can
easily be imported into Excel, SPSS, or SAS. Third party applications can be added to further
manipulate the downloaded .CSV files or use them as input.
Alternatively, these files can be
encrypted for security or written to an encrypted virtual disk. In this case, the data remains encrypted
throughout the ETL process and is never written to disk unencrypted.
Pre-Load Transformations: Estudias supports pre- and
post-load transformations. It is often easier to do simple text
transformations here (i.e. date conversion) as opposed to in SQL Server
(step 6). In the pre-load transformations, the CSV files are manipulated
and metrics are automatically computed to improve performance and
scalability. At this point other pre-defined data sources
(i.e. National Student Clearinghouse) are combined with institutional
data to compute transfer-outs, transfer-ins, etc.
Transformed CSV Files: Transformed Files are available for
power users which, at this point, have metrics not available in files in step 2
(i.e. cohorts, stop-ins, stop-outs, transfer outs). Again, these
files can be easily manipulated using Excel, SPSS, SAS. As in Step
2,
institutions may write their own programs to customize these files,
although it is usually easier to use SQL as in step 5.
Database Preparation: Database is automatically
created or upgraded if necessary. In case of an upgrade, the
schema is migrated using ZogoTech's Seamless Schema Migration (SSM) algorithms,
automatically creating or changing missing schema elements. The
database is backed up before any changes are made. Users are
disconnected to protect the integrity of the data and prevent
disruptions to the load process. Permissions are applied.
Security is sever-based and integrated with Active Directory / LDAP so
end users can access the data warehouse via any tool.
Load and Post-Load Transformations:
The .CSV files are bulk loaded into the database. Remaining transformations are done on
the database using standard SQL statements for easy extensibility.
Once information is on the server, users can access it via any ODBC
connection (i.e. Excel, SPSS, SAS).
Because many transformations are quite
complex, Estudias supports arbitrarily deep hierarchical
transformations. ZogoTech's Automated Dependency
Calculation (ADC) Engine ensures that dependencies between the transformations are
automatically computed and processed in the correct order (i.e.
calculation of term indices cannot be calculated until starting cohorts
are calculated which in turn cannot be calculated until term enrollment
and FTIC is calculated).
Data Quality Checks:
Warnings are created if various conditions are met: i.e. all rows
in a table field are empty or the number of rows in a table has
significantly increased or decreased since the last run. In one
institution, Estudias caught a subtle problem in their ERP upgrade by
noticing that a field had a significant number of empty values compared
to before the run before the ERP upgrade.
OLAP Cube Design and
Processing: Fact and dimension tables (star and snowflake
schema) are created and OLAP cubes are created (if they don't exist) and processed.
Users can access these cubes from any OLAP-compliant tool (i.e.
MapPoint, Excel, SPSS, SAS).
|