Ian London bio photo

Ian London

Data scientist at Metis, NYC.

Email Twitter Facebook LinkedIn Github

Some data is inherently cyclical. Time is a rich example of this: minutes, hours, seconds, day of week, week of month, month, season, and so on all follow cycles. Ecological features like tide, astrological features like position in orbit, spatial features like rotation or longitude, visual features like color wheels are all naturally cyclical.

Our problem is: how can we let our machine learning model know that a feature is cyclical? Let’s explore a simple 24-hour time dataset. The time might be connected to temperature, or exits through a subway turnstile, or anything. But we want to convey its cyclical nature to our model.

First, we’ll generate some fake times. Since we’re only looking at where the time appears on a 24-hour clock, we can represent the times as seconds past midnight.

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
def rand_times(n):
    """Generate n rows of random 24-hour times (seconds past midnight)"""
    rand_seconds = np.random.randint(0, 24*60*60, n)
    return pd.DataFrame(data=dict(seconds=rand_seconds))

n_rows = 1000

df = rand_times(n_rows)
# sort for the sake of graphing
df = df.sort_values('seconds').reset_index(drop=True)
df.head()
seconds
0 192
1 212
2 299
3 300
4 353

Seconds past midnight alone conveys no closeness between data that crosses the “split”. Here, the split is at midnight.

df.seconds.plot();

png

Notice that the distance between a point as 5 minutes before and 5 minutes after the split is very large. This is undesirable: we want our machine learning model to see that 23:55 and 00:05 are 10 minutes apart, but as it stands, those times will appear to be 23 hours and 50 minutes apart!

Transformation into 2 dimensions

Here’s the trick: we will create two new features, deriving a sine transform and cosine transform of the seconds-past-midnight feature. We can forget the raw “seconds” column from now on.

seconds_in_day = 24*60*60

df['sin_time'] = np.sin(2*np.pi*df.seconds/seconds_in_day)
df['cos_time'] = np.cos(2*np.pi*df.seconds/seconds_in_day)

df.drop('seconds', axis=1, inplace=True)

df.head()
sin_time cos_time
0 0.013962 0.999903
1 0.015416 0.999881
2 0.021742 0.999764
3 0.021815 0.999762
4 0.025668 0.999671
df.sin_time.plot();

png

Notice that now, 5 minutes before midnight and 5 minutes after is 10 minutes apart, just as we wanted.

However, with just this sine transformation, you get a weird side-effect. Notice that every horizontal line you draw across the graph touches two points. So from this feature alone, it appears that midnight==noon, 1:15am==10:45am, and so on. There is nothing to break the symmetry across the period. We really need two dimensions for a cyclical feature. Cosine to the rescue!

df.cos_time.plot();

png

With an additional out-of-phase feature (cos), the symmetry is broken. Using the two features together, all times can be distinguished from each other.

An intuitive way to show what we just did is to plot the two-feature transformation in 2D as a 24-hour clock. The distance between two points corresponds to the difference in time as we expect from a 24-hour cycle. (I’m just plotting a subset of the data so we can see the individual points).

df.sample(50).plot.scatter('sin_time','cos_time').set_aspect('equal');

png

Voila! We can feed the sin_time and cos_time features into our machine learning model, and the cyclical nature of 24-hour time will carry over.

Further reading: