
Generating realistic dates using SQL Data Generator and Python


Guest Blogger:

Phil Factor

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

When you’re generating test data, you have to fill in quite a few date fields. By default, SQL Data Generator (SDG) will generate random values for these date columns using a datetime generator, and allow you to specify the date range within upper and lower limits.

This is fine, generally, but occasionally you need something more. What if the date in a column has to be greater than a date in another one, by some varying interval? What if a date must be the same as a previous date, if the date in a third column is NOT NULL ? There are plenty of examples where what SDG provides just isn’t quite enough.

This article will show how to exert more control over the test data in your date columns, using SDG’s Python generator, where a Python expression or Python program provides the value that SDG uses for the column.

If you’re unfamiliar with SDG, I recommend you read the following pieces as well:

How to start producing realistic test data with SQL Data Generator gives a basic tour of the tool, introduces the Customers database that we use here, and shows what data SDG generates for each of the columns, by default.

Generating test data with localized addresses using SQL Data Generator tackles the problem of producing realistic, localized addresses using regular expressions.

SDG date generation out-of-the-box

Firstly, download and run the script to create the Customer database. If you’ve already created a previous version of the database, while working through any of the above articles, it’s probably easiest to drop it and recreate it using the referenced script, point SDG at it and generate fresh data.

If you have a previously saved SDG project, it will still work, assuming the new database has the same name and connection details. If not, you can right-click the project file, open it in EditPad (or similar) and edit the XML to point at your new database.

Figure 1 shows the auto-generated data for the NotePerson table, a simple table that connects a customer with a note. Customers can have many notes, and notes can apply to many customers.



Figure 1

The problem is in the default values for the date columns. The InsertionDate needs to start within the last five years, but that is very easy to fix using the supplied Min and Max date ranges.



Figure 2

That’s better, but now the ModifiedDate column is wrong; each of the date values in this column needs to be some random interval after its respective insertion date.

One option is to choose Offset from column from the Range: dropdown, then specify an offset from the InsertionDate column, of between, say, 1 and 1,000 days. That would work well enough in this simple case, but you’ll see that while the allocation of dates after the start date is random, its distribution is uniform: there is an equal chance of any particular value being returned.

This often isn’t realistic; you’re more likely to see a normal (‘bell shaped’) distribution in real production data. Let’s see what is possible using custom Python scripts, as these allow us to create more realistic distributions of data that conform more closely with the way production data behaves.
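To make the difference concrete, here is a minimal sketch in ordinary Python (run outside SDG, using only the standard library) that compares uniform offsets with bell-shaped ones over the same 1 to 1,000 day range. The function names, the fixed seed and the choice of spread (one sixth of the range) are assumptions made for illustration; none of them come from SDG.

```python
import random

def uniform_offset(rng, low=1, high=1000):
    """Every whole-day offset in [low, high] is equally likely."""
    return rng.randint(low, high)

def bell_offset(rng, low=1, high=1000):
    """Normally distributed offset centred on the middle of the range.

    The spread (one sixth of the range) is an arbitrary choice for this
    sketch; draws that land outside the range are simply retried.
    """
    mean = (low + high) / 2.0
    sigma = (high - low) / 6.0
    while True:
        days = int(round(rng.gauss(mean, sigma)))
        if low <= days <= high:
            return days

def share_in_band(offsets, low, high):
    """Fraction of offsets that fall within [low, high]."""
    return sum(low <= d <= high for d in offsets) / float(len(offsets))

rng = random.Random(42)  # fixed seed so the comparison is repeatable
uniform = [uniform_offset(rng) for _ in range(10000)]
bell = [bell_offset(rng) for _ in range(10000)]

# The middle fifth of the range holds about a fifth of the uniform
# offsets, but a much larger share of the bell-shaped ones.
print("uniform:", share_in_band(uniform, 401, 600))
print("bell:   ", share_in_band(bell, 401, 600))
```

With uniform offsets, a gap of 3 days is exactly as likely as a gap of 500 days; with the bell-shaped version, most gaps cluster around the middle of the range and the extremes become rare.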

Working with the SDG Python generator

I’m not really a Python programmer, and I don’t know many DBAs who use Python either, but I had little trouble getting all this to work because there are loads of Python examples on the Internet that can be adapted for use.

SDG was designed for a previous version of Python (2.7), but my Python install is up-to-date (v3.5.2 at time of writing). This means that some of the examples provided for SDG won’t work unless you have installed only the older Python libraries (2.7), because Python made breaking changes between the two versions.

As such, in the following scripts I stuck to what’s possible with Python when using only the .NET libraries. This is necessary anyway when manipulating SQL data.

Installing or Upgrading Python

If you don’t have Python, or need to upgrade to the latest version, I recommend using Chocolatey. You can install it from PowerShell (running as an administrator). After that, using Chocolatey to install Python (or anything else) is as simple as issuing commands such as ‘choco install python’, or ‘choco upgrade python’.

You may see a few ‘python script timed out for row’ warnings in the SDG UI, on the columns that used these Python 3.5.2 scripts (I did). They tend to pop up on reopening an SDG project, when SDG regenerates the preview data. Generally, just clicking on or in the offending columns removes the warning after a few seconds.

Using the Python generator for dates

To fix the ModifiedDate column in the NotePerson table, I go to my collection of Python templates and choose the one that seems most similar to what I want. Here, I wanted the modification date to be normally distributed, with a mean of 720 days after the entry was made and a standard deviation of 200 days. Real data is often normally distributed.

I’ve chosen a simple way of doing this but there are plenty of faster and better ones around such as the Box-Muller transform. I’ve added the twist that if the result is in the future, or before the record was created, I use a NULL instead.

from System import Random, DateTime

random = Random()

def main(config):
    StandardDeviation = 200  # days
    Mean = 720  # days
    # Summing three uniform draws gives a roughly bell-shaped spread
    DaysAfterwards = (random.Next(-StandardDeviation, StandardDeviation)
                      + random.Next(-StandardDeviation, StandardDeviation)
                      + random.Next(-StandardDeviation, StandardDeviation)
                      + Mean)
    EndDate = InsertionDate.AddDays(DaysAfterwards)
    # Use a NULL if the date is in the future or before the record was created
    if DateTime.Now < EndDate or EndDate < InsertionDate:
        EndDate = None
    return EndDate

Listing 1: Python Script for ModifiedDate column in NotePerson table
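For anyone who wants to try the logic of Listing 1 outside SDG, the following is a rough standalone sketch in ordinary Python, using only the standard library rather than the .NET classes. It also includes a Box-Muller version for comparison. The function names, the seed and the two sample dates are hypothetical and chosen only for illustration; this is not the script that SDG runs.

```python
import math
import random
from datetime import datetime, timedelta

def days_after_sum(rng, mean=720, sd=200):
    """Sum of three uniform draws, mirroring Listing 1.

    randrange(-sd, sd) excludes the upper bound, like .NET's Random.Next.
    Each draw has a variance of about (2 * sd) ** 2 / 12, so three of
    them together have a standard deviation of about sd.
    """
    return sum(rng.randrange(-sd, sd) for _ in range(3)) + mean

def days_after_box_muller(rng, mean=720, sd=200):
    """Box-Muller transform: two uniform draws give one normal draw."""
    u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is always defined
    u2 = rng.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return int(round(mean + sd * z))

def modified_date(insertion_date, days_afterwards, now):
    """Return the modification date, or None where Listing 1 uses a NULL."""
    end_date = insertion_date + timedelta(days=days_afterwards)
    if end_date > now or end_date < insertion_date:
        return None
    return end_date

rng = random.Random(7)              # fixed seed, for repeatable output
now = datetime(2017, 1, 1)          # hypothetical "current" date
inserted = datetime(2013, 6, 15)    # hypothetical InsertionDate value
for _ in range(5):
    print(modified_date(inserted, days_after_sum(rng), now))
```

One difference is worth noting. The sum of three uniform draws can never stray more than about 600 days from the mean, so with these settings it never produces a date before the insertion date. The Box-Muller version has no such limit, so the check for dates that precede the insertion date does real work there.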

I paste it in and away we go; we now have dates for modification of the record that are
