tiny thoughts: March 2009

Tuesday, March 31, 2009

plan 3.31

1. Finish the migration.

2. Try JUnit today.

3. Do some preparation for the research project. Thinking more broadly will give you some hints.

4. Read the relevant chapters of Introduction to the Theory of Computation.

5. Last but most important: GO TO THE GYM AND PLAY SOME SPORTS!

Saturday, March 28, 2009

The plan for the new quarter!

In the next quarter, I want to do the following things:
1. Improve my English, both writing and speaking. I cannot afford to be kept in 104.5 a third time. Ways to achieve this include, but are not limited to:
a. speak with people in English
b. listen to the CNN & NPR news every day. Watch the English news, not the Chinese!
c. attend 108 and practice academic writing
d. try to speak more fluently

2. Read some classic books in computer science. Reserve the time before you go to bed for reading and thinking.

3. The research project

4. Get familiar with the Unix/Linux system. At least pretend to be professional.

plan 3.28

1. Return the books that are not needed and borrow new ones.

2. Play with Python and JUnit.

3. Continue migrating the code to Unix; learn gdb and vi. Get familiar with Unix.

4. Start using LaTeX.

5. Continue reading the interesting SIGMOD paper.

Python substitution

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
... r'static PyObject*\npy_\1(void)\n{',
... 'def myfunc():')
The answer is: 'static PyObject*\npy_myfunc(void)\n{'

re.sub(pattern, repl, string):
The substitution has two steps:
a. Use the pattern to match the string, finding the leftmost non-overlapping occurrences of the pattern. If no substring matches the pattern, sub returns the original string without modification.

b. Replace the matched substring with the repl. The \number is replaced accordingly.
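
A minimal sketch of the two steps. The pattern and the sample strings here are made up for illustration, not taken from any real project:

```python
import re

# Hypothetical pattern: a word followed by "()"; group 1 captures the word.
pattern = r'(\w+)\(\)'

# Every non-overlapping match is replaced; \1 inserts the captured word.
replaced = re.sub(pattern, r'call_\1', 'foo() and bar()')
print(replaced)   # call_foo and call_bar

# No match at all: the original string comes back unchanged.
unchanged = re.sub(pattern, r'call_\1', 'nothing here')
print(unchanged)  # nothing here
```

If only the first occurrence should be replaced, re.sub also accepts a count argument (count=1).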

Use re.VERBOSE to make a regular expression easier to read.
Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.
That means that the two following regular expression objects that match a decimal number are functionally equal:
a = re.compile(r"""\d + # the integral part
\. # the decimal point
\d * # some fractional digits""", re.X)

b = re.compile(r"\d+\.\d*")
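
A quick way to convince yourself that the two forms behave the same is to run both against a few inputs. The sample strings below are my own, chosen only for illustration:

```python
import re

verbose = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
compact = re.compile(r"\d+\.\d*")

samples = ["3.14", "42.", "abc", "7"]  # made-up inputs

# For each sample, record whether each pattern matches at the start.
results = [(bool(verbose.match(s)), bool(compact.match(s))) for s in samples]
print(results)
```

Both patterns accept "3.14" and "42." and reject "abc" and "7" (no decimal point).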

More about regular expression: http://docs.activestate.com/komodo/4.4/regex-intro.html

Friday, March 27, 2009

plan 3.27

1. Migrate the code to the Unix platform.

2. Play badminton.

3. Finish reading the SIGMOD paper.

4. Try JUnit and Python regular expressions.

Thursday, March 26, 2009

plan 3.26

1. Discuss the global tree project with Brian and start coding. (4 hours)

2. Read the paper about RAID. (2h)

3. About the SSD project. What about modifying the data structure to manage PB-scale data? Read the paper in SIGMOD08. (2h)

4. Play with Python. See whether you can parse the Reuters dataset! (1.5h)

Wednesday, March 25, 2009

Python again!

\s matches any whitespace character. It is the same as [ \t\n\r\f\v].
re.sub(pattern, replace_string, original_string)

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
... r'static PyObject*\npy_\1(void)\n{',
... 'def myfunc():')

The result is 'static PyObject*\npy_myfunc(void)\n{'

Using a raw string simplifies the problem:
'\\\\section' = r'\\section'
'\\section' = r'\section'

The matching for the phone numbers:

>>> phonePattern = re.compile(r'''
# don't match beginning of string, number can start anywhere
(\d{3}) # area code is 3 digits (e.g. '800')
\D* # optional separator is any number of non-digits
(\d{3}) # trunk is 3 digits (e.g. '555')
\D* # optional separator
(\d{4}) # rest of number is 4 digits (e.g. '1212')
\D* # optional separator
(\d*) # extension is optional and can be any number of digits
$ # end of string
''', re.VERBOSE)
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')

>>> po = re.compile(r'\w*?e') #nongreedy matching
>>> mo = re.search(po, r'the the')
>>> re.findall(po, r'the the')
result: ['the', 'the']
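
On 'the the' the greedy and non-greedy forms happen to give the same answer, so here is a small sketch with an extra word of my own ('there') where they differ:

```python
import re

text = 'the the'
lazy = re.findall(r'\w*?e', text)    # stops at the first 'e' it can reach
greedy = re.findall(r'\w*e', text)   # runs to the last 'e' in each word
print(lazy, greedy)

# The difference shows up when a word contains more than one 'e'.
lazy_there = re.findall(r'\w*?e', 'there')
greedy_there = re.findall(r'\w*e', 'there')
print(lazy_there)    # ['the', 're']
print(greedy_there)  # ['there']
```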

assignments for 3.25

Today's jobs are rather straightforward.
1. Read the code about regular expressions in Python. (continuously immersed in it)
2. Code in C for the global trees
3. Summarize the paper "Migrating enterprise storage to SSDs: analysis of tradeoffs"

schedule for April

This is a monthly plan for April. The work listed here should be finished in a month.

1. Here I want you to get familiar with two tools. The first is LaTeX, the second.. er.. is the JUnit feature in Eclipse.

2. Join the group, and try to build relationships with the senior students. You almost messed one thing up! Remember not to talk too much.

3. In the following month, I hope you can finish the following work about the SSDs.
a) Learn how to use the disksim simulator.
b) Know how to set the parameters for the SSD.
c) Read some papers on how to design algorithms for SSDs.

4. Another research project
Implement the code for the global trees! Start today and finish it in two weeks.

5. Read the classic paper about the RAID. It is a stimulating paper.

The most interesting thing I read today was about the meaning of a Ph.D.
It is people having dreams.

New Resolutions

New Resolution: (Let me see how long you can stick to your goals. I guess a month.)

1. Get up early every day. At around 7:50am.

2. Develop some good habits when you have time. One is asking others for help and building relationships with them.

3. Work on the SSD project. The purpose is to implement data-intensive algorithms on SSDs and analyze their performance.

4. Classic papers about databases and RAID. Try to read them!

5. Read the articles about how to conduct research!

Friday, March 20, 2009

about python regular expression

The repetition operators:
*: zero or more repetitions. "a*b" can match b, ab, aab, aaaab
+: one or more repetitions. "a+b" can match ab and aab, but cannot match b.
?: zero or one occurrence
{m,n}: at least m and at most n occurrences
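
A minimal sketch to check each operator. The helper name matches() is my own, and note that in Python the bounded form is written with braces, {m,n}:

```python
import re

def matches(pattern, s):
    """Return True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(matches(r'a*b', 'b'), matches(r'a*b', 'aaab'))          # True True
print(matches(r'a+b', 'b'), matches(r'a+b', 'ab'))            # False True
print(matches(r'a?b', 'ab'), matches(r'a?b', 'aab'))          # True False
print(matches(r'a{2,3}b', 'ab'), matches(r'a{2,3}b', 'aaab')) # False True
```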

Using Python regular expressions.
import re; re is the module that supports regular expressions.

Backslash '\'
In order to match the string "\section", the regular expression should be "\\section". Written as an ordinary string literal, the pattern we pass to compile() is "\\\\section".
A better way is to use a raw string. We can pass r"\\section" to the compile function.
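
A small sketch to check the backslash counting. The sample text is a made-up LaTeX fragment:

```python
import re

# Both literals spell the same 9-character pattern: two backslashes + "section".
assert '\\\\section' == r'\\section'

text = r'\section{Intro}'            # hypothetical input containing one backslash
m = re.search(r'\\section', text)    # the regex \\section matches a literal \section
matched = m.group(0)
print(matched)                       # \section
```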

matching and searching
re.match(pattern, string, flags)
re.search(pattern, string, flags)
The difference is: match only tries to match at the beginning of the string, while search can find a match anywhere in the string.
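
A tiny sketch of the difference, using a made-up string:

```python
import re

text = 'say hello'

at_start = re.match(r'hello', text)    # None: 'hello' is not at position 0
anywhere = re.search(r'hello', text)   # finds 'hello' later in the string

print(at_start)          # None
print(anywhere.span())   # (4, 9)
```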

Some flags: re.I (ignore case) and re.DOTALL (the dot matches any character, including a newline).
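
A short sketch of these two flags, with sample strings of my own. Flags can be combined with |:

```python
import re

# re.I: the pattern matches regardless of case.
ci = re.search(r'python', 'I like PYTHON', re.I)
print(ci.group(0))   # PYTHON

# Without re.DOTALL the dot does not match a newline.
no_dotall = re.search(r'a.b', 'a\nb')
dotall = re.search(r'a.b', 'a\nb', re.DOTALL)
print(no_dotall, dotall is not None)   # None True

# Both flags at once.
combined = re.search(r'A.B', 'a\nb', re.I | re.DOTALL)
print(combined is not None)   # True
```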

re.split() splits a string at every match of the pattern, which helps to extract the pieces you need.
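
For instance, a minimal sketch with a made-up list of items separated by commas or semicolons and stray spaces:

```python
import re

# Split on a comma or semicolon, swallowing any whitespace around it.
fields = re.split(r'\s*[,;]\s*', 'apple, banana;cherry ;  date')
print(fields)   # ['apple', 'banana', 'cherry', 'date']
```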

(?P<name>...) defines a named group, which can be used to match the contents and reference them by name.

To extract the strings we need, use group().
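
A minimal sketch of named groups together with group(). The input string and the group names key and value are hypothetical, picked only for illustration:

```python
import re

# Two named groups: one for the word before '=', one for the digits after it.
m = re.search(r'(?P<key>\w+)=(?P<value>\d+)', 'width=640')

print(m.group('key'), m.group('value'))   # width 640
print(m.group(0))                         # whole match: width=640
print(m.groups())                         # ('width', '640')
print(m.groupdict())                      # {'key': 'width', 'value': '640'}
```

group(1) and group(2) still work for the same groups, so the names are an extra, not a replacement.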

Python regular expressions are so much fun! I will play with them in some applications in the follow-up posts.

An example:
How to write a regular expression to extract the id and topic info from a string like this: "". The regular expression could be "[a-zA-Z]*)\"[0-9a-zA-Z=.\'\"\-\n ]*newid=\"(?P[0-9]+)\">". You can get the id and topic info using the group function.
